Assignment with both fillna() and loc() apparently not working - python

I've searched around for an answer, but I cannot find one.
My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill it.
My code looks like this: NOTE - THIS FIRST PART IS NOT IMPORTANT, IT IS JUST TO GIVE CONTEXT
train_df = df[df['my_column'].notna()] #I need to train the model without using the missing data
train_x = train_df[['lat','long']] #Lat and Long are the inputs
train_y = train_df[['my_column']] #My_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x,train_y) #clf is the classifier, here we train it
df_x = df[['lat','long']] #I need this part to do the prediction
prediction = clf.predict(df_x) #clf.predict() returns an array
series_pred = pd.Series(prediction) #now the array is a series
print(series_pred.shape) #RETURNS (2381,)
print(series_pred.isna().sum()) #RETURN 0
So far, so good. I have my 2381 predictions (I need only a few of them) and there is no NaN value inside (why would there be a NaN value in the predictions? I just wanted to be sure, as I don't understand my error)
Here I try to assign the predictions to my Dataframe:
#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred #I assign the predictions using .loc()
#test_2
df['my_column'] = df['my_column'].fillna(series_pred) #Double check: I assign the predictions using .fillna()
print(df['my_column'].shape) #RETURNS (2381,)
print(df['my_column'].isna().sum()) #RETURNS 6
As you can see, it didn't work: there are still 6 missing values. I randomly tried a slightly different approach:
#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred) #Will it work?
print(df[['my_column']].shape) #RETURNS (2381, 1)
print(df[['my_column']].isna().sum()) #RETURNS 6
Did not work. I decided to try one last thing: check the fillna result even before assigning the results to the original df:
In[42]:
print(df['my_column'].fillna(series_pred).isna().sum()) #extreme test
Out[42]:
6
So... where is my very very stupid mistake? Thanks a lot
EDIT 1
To show a little bit of the data,
In[1]:
df.head()
Out[1]:
my_column lat long
id
9df Wil 51 5
4f3 Fabio 47 9
x32 Fabio 47 8
z6f Fabio 47 9
a6f Giovanni 47 7
Also, I've added info at the beginning of the question

@Ben.T or @Dan should post their own answers, they deserve to be accepted as the correct one.
Following their hints, I would say that there are two solutions:
Solution 1 (Best): Use loc()
The problem
The problem with the original attempt is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds a prediction for every row, both the rows with missing values and the rows without.
The solution
pred_df = df[df['my_column'].isna()] #For the prediction, use a Dataframe with only the missing values. Problem solved
df_x = pred_df[['lat','long']]
prediction = clf.predict(df_x)
df.loc[df['my_column'].isna(), 'my_column'] = prediction
Solution 2: Use fillna()
The problem
The problem with the original attempt is that df['my_column'].fillna(series_pred) requires the index of my df to be the same as that of series_pred, which is impossible in this situation unless your df has a simple index like [0, 1, 2, 3, 4...]
The solution
Resetting the index of the df at the very beginning of the code.
Why is this not the best
The cleanest way is to do the prediction only where you need it. This is easy to do with loc(); I do not know how you would do it with fillna(), because you would need to preserve the index through the classification.
Edit: you can give the predictions the df's index with series_pred.index = df['my_column'].isna().index. Thanks @Dan
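The index-alignment behaviour described in Solution 2 can be reproduced on a toy frame. This is a minimal sketch, not the asker's data: the four rows and the prediction array are made up, and prediction stands in for the output of clf.predict().

```python
import numpy as np
import pandas as pd

# Toy frame with a non-numeric index, mirroring the layout in the question.
df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio", np.nan],
     "lat": [51, 47, 47, 47],
     "long": [5, 9, 8, 7]},
    index=["9df", "4f3", "x32", "z6f"],
)

# Hypothetical stand-in for clf.predict(df[['lat', 'long']]): one value per row.
prediction = np.array(["Wil", "Fabio", "Fabio", "Giovanni"])

# A Series built from a bare array gets the index 0, 1, 2, ...
# None of those labels exist in df.index, so nothing is filled.
misaligned = df["my_column"].fillna(pd.Series(prediction))
still_missing = int(misaligned.isna().sum())

# Giving the predictions the frame's own index lets fillna align by label.
aligned = df["my_column"].fillna(pd.Series(prediction, index=df.index))
missing_after_fix = int(aligned.isna().sum())

print(still_missing, missing_after_fix)
```

Passing index=df.index when building the Series avoids having to reset the index of the original frame.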

Related

How do I get one column of an array into a new Array whilst applying a fancy indexing filter on it?

So basically I have an array that consists of 14 columns and 426 rows. Every column represents one property of a dog and every row represents one dog. Now I want to know the average heart frequency of an ill dog. The 14th column indicates whether the dog is ill or not [0 = healthy, 1 = ill], and the 8th column is the heart frequency. My problem is that I don't know how to get the 8th column out of the whole array and use the boolean filter on it.
I am pretty new to Python. As I mentioned above I think that I know what I have to do [Use a fancy indexing filter] but I don't know how I can do this. I tried doing it while still being in the original Array but that didn't work out, so I thought I need to get the Infos into another one and use the Boolean filter on that one.
EDIT: Ok, so here is the code that I got right now:
import numpy as np
def average_heart_rate_for_pathologic_group(D):
    a=np.array(D[:, 13]) #gets information, whether the dogs are sick or not
    b=np.array(D[:, 7]) #gets the heart frequency
    R=(a >= 0) #gets all the values that are from sick dogs
    amhr = np.mean(R) #calculates the average heart frequency
    return amhr
I think boolean indexing is the way forward.
The shortcut for this works like:
#Your data:
data = [[0,1,2,3,4,5,6,7,8...],[..]...]
#This indexing selects the rows whose 14th column (index 13) equals 1, i.e. the ill dogs,
#and then takes their 8th column (index 7), the heart frequency.
#data must be a NumPy array for this to work; any analysis can be done on the new variable.
heart_frequency_ill = data[data[:,13] == 1, 7]
Probably you'll have to actually copy the data from the original array into a new one with the selected data.
Could you please share a sample with let's say 3 or 4 rows of your data?
I will give it a try, though.
Let me build data with 4 columns here (but you could use 14 as in your problem)
data = [['c1a','c2a','c3a','c4a'], ['c1b','c2b','c3b','c4b']]
You could use numpy.array to get its nth column.
See how one can get the 2nd column:
import numpy as np
a = np.array(data)
a[:,1] # index 1 is the 2nd column, because indexing starts at 0
If you want to get the 8th column for all the dogs that are healthy, you can do the following:
# we use 7 for the column because the index starts at 0
# we use boolean filtering and fancy indexing to get the rows where the condition is true
# we use np.argwhere to get the indices where the condition is true
A[np.argwhere([A[:,13] == 0])[:,1],7]
If you also want to compute the mean:
A[np.argwhere([A[:,13] == 0])[:,1],7].mean()
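For the ill dogs that the question actually asks about, a plain boolean mask is enough. The following is a minimal sketch on made-up data (5 dogs instead of 426, with invented heart rates); only the column positions 7 and 13 come from the question.

```python
import numpy as np

# Hypothetical data: 5 dogs, 14 columns. Column index 7 holds the heart
# rate and column index 13 holds the flag (0 = healthy, 1 = ill).
D = np.zeros((5, 14))
D[:, 7] = [80.0, 120.0, 90.0, 110.0, 100.0]
D[:, 13] = [0, 1, 0, 1, 1]

def average_heart_rate_for_pathologic_group(D):
    ill = D[:, 13] == 1        # boolean mask, one entry per dog
    return D[ill, 7].mean()    # mean heart rate over the ill dogs only

result = average_heart_rate_for_pathologic_group(D)
print(result)
```

The mask picks the rows and the integer 7 picks the column in a single indexing step, so no intermediate copy of the array is needed.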

Python Sklearn MinMaxScaler ValueError: Input contains infinity or a value too large for dtype('float64') [duplicate]

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have run
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
I tried using
mat[np.isfinite(mat) == True] = 0
to remove the infinite values but this did not work either.
What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?
I am using anaconda and python 2.7.9.
This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one which depends e.g. on your matrix being positive definite, and your matrix might not fulfil that criterion.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. Right would be:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...
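The difference between the two forms is easy to see on a small matrix that contains both a NaN and an infinity. This is an illustrative sketch; the 2x2 matrix is made up.

```python
import numpy as np

mat = np.array([[1.0, np.nan], [np.inf, 4.0]])

# mat.any() and mat.all() collapse the matrix to a single boolean first,
# so these checks only inspect that one value and miss the bad entries.
wrong_nan_check = np.isnan(mat.any())
wrong_finite_check = np.isfinite(mat.all())

# Test element-wise first, then reduce.
has_nan = np.any(np.isnan(mat))
all_finite = np.all(np.isfinite(mat))

print(wrong_nan_check, wrong_finite_check, has_nan, all_finite)
```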
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries in my df, such as
df = df[df.label=='desired_one']
This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):
import pandas as pd
import numpy as np
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
In most cases, getting rid of infinite and null values solves this problem.
Get rid of infinite values:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
Get rid of null values the way you like: a specific value such as 999, the mean, or your own function to impute missing values:
df.fillna(999, inplace=True)
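The two steps can be chained without inplace=True. The sketch below uses a made-up frame and fills with the column mean, which is just one of the imputation options mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing NaN, +inf and -inf.
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [np.nan, 2.0, -np.inf]})

# Turn +/-inf into NaN so that a single fillna call handles both cases.
cleaned = df.replace([np.inf, -np.inf], np.nan)

# Here each gap is filled with its column mean; any other strategy works too.
cleaned = cleaned.fillna(cleaned.mean())

all_finite = bool(np.isfinite(cleaned.to_numpy()).all())
print(cleaned)
```

Working on a new variable instead of modifying df in place keeps the original data available for comparison.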
This is the check on which it fails:
https://github.com/scikit-learn/scikit-learn/blob/0.17.X/sklearn/utils/validation.py#L51
Which says
def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)
So make sure that there are no NaN values in your input and that all values are actually floats. None of the values should be Inf either.
The Dimensions of my input array were skewed, as my input csv had empty spaces.
With this version of python 3:
/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)
Looking at the details of the error, I found the lines of codes causing the failure:
/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError("Input contains NaN, infinity"
---> 58 " or a value too large for %r." % X.dtype)
59
60
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
From this, I was able to extract the correct way to test what was going on with my data using the same test which fails given by the error message: np.isfinite(X)
Then with a quick and dirty loop, I was able to find that my data indeed contains nans:
print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index +=1
(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...
Now all I have to do is remove the values at these indexes.
None of the answers here worked for me. This was what worked.
Test_y = np.nan_to_num(Test_y)
It replaces the infinity values with very large finite values and the NaN values with zero.
I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:
X = X.values.astype(np.float64) # the np.float alias was removed in NumPy 1.24
y = y.values.astype(np.float64)
Edit: The originally suggested X.as_matrix() is Deprecated
Problem seems to occur in DecisionTreeClassifier input check, Try
X_train = X_train.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
I had the error after trying to select a subset of rows:
df = df.reindex(index=my_index)
Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.
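The effect is easy to reproduce. This is a small sketch with a made-up frame and index list; the last line shows one possible way to keep only labels that really exist, which is an addition and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
my_index = ["a", "c", "d"]   # "d" does not exist in df.index

# reindex silently inserts a row for "d" and fills it with NaN.
subset = df.reindex(index=my_index)
n_new_nan = int(subset["x"].isna().sum())

# Keeping only the labels that are present avoids the inserted rows.
safe_subset = df.loc[df.index.intersection(my_index)]

print(n_new_nan, len(safe_subset))
```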
Remove all infinite values:
(and replace with min or max for that column)
import numpy as np
# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[ inf, inf, inf, inf, inf],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[ -inf, -inf, -inf, -inf, -inf],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]
# go through matrix one column at a time and replace + and -infinity
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]
>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
I found that after calling pct_change on a new column, NaN existed in one of the rows. I removed the NaN row with the following code
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.reset_index()
I got the same error. It worked after calling df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.
I would like to propose a solution for numpy that worked well for me. The lines
import numpy as np
inputArray[inputArray == np.inf] = np.finfo(np.float64).max
substitute all positive infinite values of a numpy array with the maximum float64 number.
Puff !! In my case the problem was about NaN values...
You can list your columns that had NaN with this function
your_data.isnull().sum()
and then you can fill these NAN values in your dataset file.
Here is the code for how to "Replace NaN with zero and infinity with large finite numbers."
your_data[:] = np.nan_to_num(your_data)
(quoted from the numpy.nan_to_num documentation)
In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.
dataset = dataset.dropna(axis=0, how='any') # drop every row that contains at least one NaN
This worked for me
I had the same issue, in my case the answer was simply that I had a cell in my CSV with no value ("x,y,z,,"). Putting a default value in fixed it for me.
Using isneginf may help.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.isneginf.html#numpy.isneginf
x[numpy.isneginf(x)] = 0 #0 is the value you want to replace with
Note: This solution only applies if you consciously want to keep NaN entries in your dataset.
This error happened to me when I was using some of the scikit-learn functionality (in my case: GridSearchCV). Under the hood I was using an xgboost XGBClassifier, which handles NaN data gracefully. However, GridSearchCV was using the sklearn.utils.validation module, which enforced the absence of missing data in the input by calling the _assert_all_finite function. This was ultimately causing an error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
Sidenote: _assert_all_finite accepts an allow_nan argument, which, if set to True, would not be causing issues. However, scikit-learn API does not allow us to have control over this argument.
Solution
My solution was to use unittest.mock.patch to silence the _assert_all_finite function so that it does not raise ValueError. Here is a snippet
from unittest import mock
import sklearn
with mock.patch("sklearn.utils.validation._assert_all_finite"):
    pass  # replace with your code that raises ValueError
this replaces _assert_all_finite with a dummy mock function, so the real check is not executed.
Please note that patching is not a recommended practice and might result in unpredictable behaviour!
EDIT:
This Pull Request should resolve the issue (though the fix has not been released as of Jan 2022)
If you're running an estimator, it could be that your learning rate is too high. I passed in the wrong array to a grid search by accident and ended up training with a learning rate of 500, which I could see causing issues with the training process.
Basically it's not necessarily only your inputs that have to all be valid, but the intermediate data as well.
After a long time of dealing with this problem, I realized that it happens because, in some splits of the training and testing sets, there are columns whose value is the same for every row. Some calculations in some algorithms may then produce infinite results. If neighbouring rows in your data are likely to be similar, shuffling the data can help. This is a bug with scikit. I'm using version 0.23.2.
If you happen to use the "kc_house_data.csv" dataset (which some commenters and many data-science newcomers seem to use, because it's presented in lots of popular course material), the data is faulty and the true source for the error.
To fix it, as of 2022:
Delete the last (empty) line in the csv file
There are two lines that contain one empty data value "x,x,,x,x" - to fix it, don't delete the comma, instead add a random integer value like 2000, so it looks like this "x,x,2000,x,x"
Don't forget to save and reload in your project.
All the other answers are helpful and correct, but not in this case:
If you use kc_house_data.csv you need to fix the data in the file, nothing else will help, the empty data field will shift the other data around randomly and generate weird bugs that are hard to track to the source!
In my case the algorithm required the data to lie strictly inside the open interval (0, 1). My rather brutal solution was to add a small random number to all values and then cap anything that reached 1:
y_train = pd.DataFrame(y_train).applymap(lambda x: x + np.random.rand()/100000.0)["col_name"]
y_train[y_train >= 1] = 0.999999
This assumes y_train starts out in the closed range [0, 1].
This is definitely not suitable for all cases, as you are messing with your input data but can be a solution if you have sparse data and only need a quick forecast
try
mat.sum()
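This is the quick check scikit-learn runs first. If the sum is finite, the input passes. If the sum is not finite (for example because it overflows the maximum float value, about 3.4e38 for float32 and about 1.8e308 for float64), the element-wise check shown below decides whether the error is raised.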
see the _assert_all_finite function in validation.py from the scikit source code:
if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))
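The two-stage logic in the quoted snippet can be traced by hand on an array whose sum overflows even though every element is finite. This is a simplified re-statement for the allow_nan=False case, not the scikit-learn function itself; the variable would_raise is introduced here purely for illustration.

```python
import numpy as np

# Two finite float64 values whose sum exceeds the float64 maximum.
X = np.array([1e308, 1e308])

with np.errstate(over="ignore"):       # silence the overflow warning
    sum_is_finite = bool(np.isfinite(X.sum()))
all_elements_finite = bool(np.isfinite(X).all())

# In the quoted logic (with allow_nan=False) the error is raised only when
# the sum check fails AND some element is itself NaN or infinite.
would_raise = (not sum_is_finite) and (not all_elements_finite)

print(sum_is_finite, all_elements_finite, would_raise)
```

So mat.sum() is a useful first diagnostic, but an infinite sum on its own only triggers the slower element-wise check.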

How to convert dataframe to 1D array ?

First of all apologies. I am very new to pandas, scikit learn and python. So I am sure I am doing something silly. Let me give a little background.
I am trying to run KNeighborsClassifier from scikit learn (python)
Following is my strategy
#Reading the Training set
data = pd.read_csv('Path_TO_File\\Train_Set.csv', sep=',') # reading CSV File
X = data[['Attribute 1','Attribute 2']]
y = data['Target_Column'] # the output is a Dataframe of single column with many rows
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X,y)
Next I try to read Test data
test = pd.read_csv('PATH_TO_FILE\\Test.csv', sep=',')
t = test[['Attribute 1','Attribute 2']]
pred = neigh.predict(t)
actual = test['Target_Column']
Next I try to check the accuracy by following function which is throwing error.
accuracy=neigh.score(actual,pred)
ERROR: ValueError: could not convert string to float: N
I checked actual and pred both and they are of following data type and content
actual
Out[161]:
Target_Column
0 Y
1 N
:
[614 rows x 1 columns]
pred
Out[162]:
array(['Y', 'N', .....'N'], dtype=object)
N.B : pred has 614 values.
I tried to convert "actual" variable to 1D array I might be able to execute the function however, I am not successful.
I think I need to do following two things, however, was not able to do so (after googling it)
1) Convert actual into 1Dimen array
2) Making a transpose of the 1Dimen array since the pred has 614 columns.
Please let me know how to correct the function.
Thanks in advance !
Raj
Thanks Vivek and Thornhale
Indeed I was doing two wrong things.
1) As pointed out by you guys, I should have been using 1, 0 instead of Y, N.
2) I was giving the wrong parameters to the function score. It should be accuracy=neigh.score(t, actual), where t is the test feature set and actual is the test label information.
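The corrected call can be sketched end to end on made-up data. The arrays below are hypothetical stand-ins for the two CSV files. The string labels are kept as they are here, since KNeighborsClassifier accepts them; encoding them as 0/1 works equally well.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-ins for the training and test CSV files.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y_train = np.array(["N", "N", "N", "Y", "Y", "Y"])
X_test = np.array([[0.05, 0.05], [5.05, 5.05]])
y_test = np.array(["N", "Y"])

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)

# score() takes the test FEATURES and the true labels; it calls
# predict() internally and returns the mean accuracy.
accuracy = neigh.score(X_test, y_test)
print(accuracy)
```

Passing the labels as the first argument, as in neigh.score(actual, pred), makes the classifier treat 'Y' and 'N' as features, which is where the string-to-float conversion error comes from.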
You could convert your series which is what you get when you do "test[COLUMN_NAME]" into an array like so:
actual = np.array(test['Target_Column'])
To then reshape an np array, you would employ this command (reshape returns a new array, so assign the result):
actual = actual.reshape(1, 614) # <- Could be the other way around as well.
Your main issue though is that your Series needs to be boolean (as in 0,1).

Duplicate the samples in a dataset?

I use the code to check my dataset 'df' and see a serious imbalance in column 'Has_Arrears'. I would like to expand my dataset by duplicating the samples with Has_Arrears = 1 35 times, i.e. sample each observation with Has_Arrears = 1 35 times. How can I achieve this? Cheers
If I would like to use stratify sampling, how can I code for this?
If I understand you correctly, this may be what you're looking for:
new = df['Has_Arrears'] == 1
a = df[new]
df = pd.concat([df] + [a]*35, ignore_index=True) # DataFrame.append was removed in pandas 2.0
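The effect of the duplication can be checked with value_counts(). This is a minimal sketch on a made-up four-row frame; it uses pd.concat, which is available in all current pandas versions, and the column amount is invented for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data: three rows with 0 and one row with 1.
df = pd.DataFrame({"Has_Arrears": [0, 0, 0, 1], "amount": [10, 20, 30, 40]})

minority = df[df["Has_Arrears"] == 1]

# Append 35 extra copies of every minority row to the original frame.
oversampled = pd.concat([df] + [minority] * 35, ignore_index=True)

counts = oversampled["Has_Arrears"].value_counts()
print(counts)
```

Each minority row now appears 36 times in total: the original plus 35 copies.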

Output of sklearn.ensemble.RandomForestClassifier includes NaN values

I am using sklearn.ensemble.RandomForestClassifier to analyze data and I was puzzled to see NaN values in the prediction without any NaN in the training set or in testing set.
print preds_y[preds_y.isnull().any(axis=1)].shape
print train_y[train_y.isnull().any(axis=1)].shape
print train_features[train_features.isnull().any(axis=1)].shape
print test_features[train_features.isnull().any(axis=1)].shape
> (4830, 1)
> (0, 1)
> (0, 22)
> (0, 22)
These NaN values are causing the call to sklearn.metrics.classification_report to fail with the following error:
> ValueError: Mix of label input types (string and number)
Right now I'm mostly interested in understanding why the random forest is spitting out NaNs. As soon as I figure that out, I can filter the results accordingly and see how well the method is performing.
Thanks in advance for your input.
(I'm sorry if this has been asked before. I searched for it but all the results I found concerned NaNs in the training data, which is not my issue at all.)
EDIT 1: Just to be clear, there are many valid predictions in the output:
print preds_y[~preds_y.isnull().any(axis=1)].shape
print train_y[~train_y.isnull().any(axis=1)].shape
> (11760, 1)
> (39749, 1)
EDIT 2:
As I wrote in a comment below, the original data has numeric and categorical columns. All the categorical columns are converted to numeric using pandas.get_dummies() before calling fit(). I convert the results back to a pandas.DataFrame and reconstruct the original categorical columns for readability. The two pandas.Series -- predicted and actual values -- I am feeding classification_report() have only one type (category).
It seems that the NaNs in the predictions arise if the random forest predicts 0 for every dummy binary column corresponding to the original categorical column. I was not expecting this to happen so often -- it seems that 30% of my entries go unclassified -- but I'm not sure there is anything further to add on this issue.
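The all-zero case described in EDIT 2 can be detected explicitly before the labels are reconstructed. The sketch below is an assumption about how such a reconstruction might look, not the asker's code; the column names cat_a and cat_b are made up.

```python
import pandas as pd

# Hypothetical predicted dummy columns for one categorical target.
pred_dummies = pd.DataFrame({"cat_a": [1, 0, 0], "cat_b": [0, 0, 1]})

# Rows where every dummy is 0 carry no category at all.
unclassified = pred_dummies.sum(axis=1) == 0

# Reconstruct the label from the column holding the 1, and leave the
# all-zero rows as missing instead of silently picking the first column.
labels = pred_dummies.idxmax(axis=1).where(~unclassified)

print(labels)
print("share unclassified:", unclassified.mean())
```

Counting the unclassified rows up front makes it possible to filter them out before calling classification_report().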
You can first remove all NaN values by replacing them with zeros.
Maybe use df.fillna(0), then you should be fine I suppose.
