I am using sklearn.ensemble.RandomForestClassifier to analyze data, and I was puzzled to see NaN values in the predictions even though there are no NaNs in the training set or the test set.
print preds_y[preds_y.isnull().any(axis=1)].shape
print train_y[train_y.isnull().any(axis=1)].shape
print train_features[train_features.isnull().any(axis=1)].shape
print test_features[test_features.isnull().any(axis=1)].shape
> (4830, 1)
> (0, 1)
> (0, 22)
> (0, 22)
These NaN values are causing the call to sklearn.metrics.classification_report to fail with the following error:
> ValueError: Mix of label input types (string and number)
Right now I'm mostly interested in understanding why the random forest is spitting out NaNs. As soon as I figure that out, I can filter the results accordingly and see how well the method is performing.
Thanks in advance for your input.
(I'm sorry if this has been asked before. I searched for it but all the results I found concerned NaNs in the training data, which is not my issue at all.)
EDIT 1: Just to be clear, there are many valid predictions in the output:
print preds_y[~preds_y.isnull().any(axis=1)].shape
print train_y[~train_y.isnull().any(axis=1)].shape
> (11760, 1)
> (39749, 1)
EDIT 2:
As I wrote in a comment below, the original data has numeric and categorical columns. All the categorical columns are converted to numeric using pandas.get_dummies() before calling fit(). I convert the results back to a pandas.DataFrame and reconstruct the original categorical columns for readability. The two pandas.Series -- predicted and actual values -- I am feeding classification_report() have only one type (category).
It seems that the NaNs in the predictions arise if the random forest predicts 0 for every dummy binary column corresponding to the original categorical column. I was not expecting this to happen so often -- it seems that 30% of my entries go unclassified -- but I'm not sure there is anything further to add on this issue.
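To illustrate the mechanism, here is a minimal sketch with hypothetical column names (not the exact columns from the data): when predicted dummy columns are converted back to a single categorical column, a row whose dummies are all 0 has no category to map to and becomes NaN.
import pandas as pd
# hypothetical dummy predictions for a categorical feature with three levels
preds = pd.DataFrame({'color_red':   [1, 0, 0],
                      'color_blue':  [0, 1, 0],
                      'color_green': [0, 0, 0]})
# map each row back to a label; rows where every dummy is 0 become NaN
labels = preds.idxmax(axis=1).where(preds.sum(axis=1) > 0)
print(labels)  # the last row is NaN because no dummy was predicted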
You can first remove all the NaNs by replacing them with zeros.
See this link.
Maybe use df.fillna(0); then you should be fine, I suppose.
Related
I have a CSV file with 2 columns. One column has string toxic comments, and the other column has float toxicity values from 0 to 1 (comments are more toxic when the toxicity value is close to 1).
I want to do linear regression to correctly predict the toxicity values.
For that, I first converted the "comment" (string) column to integers like this:
train['comment']= pd.to_numeric(train['comment'], errors='coerce').fillna(0).astype(np.int64)
Then, I wrote this code for the linear regression:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
linX = train.iloc[:, 0].values.reshape(-1,1)  # first column used as the input
linY = train.iloc[:, 1].values.reshape(-1,1)  # second column used as the target
lr = LinearRegression()
lr.fit(linX, linY)
Y_pred = lr.predict(linX)
plt.scatter(linX, linY)
plt.plot(linX, Y_pred, color='red')
That worked, but I don't think I did it right, because the resulting regression plot didn't seem right to me.
I couldn't solve the problem. My questions are:
Is my code for linear regression for this problem right?
Should I separate the 0 values from the "toxicity" column?
I'm not sure if turning strings into numeric values with the code below will return the results you are looking for.
pd.to_numeric(train['comment'], errors='coerce')
This code only changes the variable type of the column; the string comments cannot actually be converted into integers. The errors='coerce' parameter turns every string into NaN, and fillna then converts those NaNs into zeros.
To solve text problems like this with machine learning, you will need to preprocess the data using techniques such as TF-IDF.
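For example, a rough sketch (assuming the DataFrame is called train, the text lives in 'comment', and the target column is named 'toxicity'; the target column name is an assumption):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
vectorizer = TfidfVectorizer(max_features=5000)   # cap the vocabulary size
X = vectorizer.fit_transform(train['comment'])    # sparse TF-IDF matrix
y = train['toxicity']                             # assumed name of the target column
lr = LinearRegression()
lr.fit(X, y)
print(lr.predict(X[:5]))   # predicted toxicity for the first five comments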
I've searched around for an answer, but I cannot find one.
My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill it.
My code looks like this: NOTE - THIS FIRST PART IS NOT IMPORTANT, IT IS JUST TO GIVE CONTEXT
import pandas as pd
from sklearn import neighbors
train_df = df[df['my_column'].notna()]  # train the model without the missing data
train_x = train_df[['lat','long']]      # lat and long are the inputs
train_y = train_df[['my_column']]       # my_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x, train_y)               # clf is the classifier; here we train it
df_x = df[['lat','long']]               # I need this part to do the prediction
prediction = clf.predict(df_x)          # clf.predict() returns an array
series_pred = pd.Series(prediction)     # now the array is a Series
print(series_pred.shape)                # RETURNS (2381,)
print(series_pred.isna().sum())         # RETURNS 0
So far, so good. I have my 2381 predictions (I need only a few of them) and there is no NaN value inside (why would there be a NaN value in the predictions? I just wanted to be sure, as I don't understand my error)
Here I try to assign the predictions to my Dataframe:
#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred  # assign the predictions using .loc
#test_2
df['my_column'] = df['my_column'].fillna(series_pred)      # double check: assign the predictions using .fillna()
print(df['my_column'].shape)        # RETURNS (2381,)
print(df['my_column'].isna().sum()) # RETURNS 6
As you can see, it didn't work: the missing values are still 6. I randomly tried a slightly different approach:
#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred)  # will it work?
print(df[['my_column']].shape)        # RETURNS (2381, 1)
print(df[['my_column']].isna().sum()) # RETURNS 6
Did not work. I decided to try one last thing: check the fillna result even before assigning the results to the original df:
In[42]:
print(df['my_column'].fillna(series_pred).isna().sum()) # extreme test
Out[42]:
6
So... where is my very very stupid mistake? Thanks a lot
EDIT 1
To show a little bit of the data,
In[1]:
df.head()
Out[1]:
my_column lat long
id
9df Wil 51 5
4f3 Fabio 47 9
x32 Fabio 47 8
z6f Fabio 47 9
a6f Giovanni 47 7
Also, I've added info at the beginning of the question
@Ben.T or @Dan should post their own answers; they deserve to be accepted as the correct one.
Following their hints, I would say that there are two solutions:
Solution 1 (Best): Use loc()
The problem
The problem with the current solution is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds predictions for both the missing values and the non-missing values.
The solution
pred_df = df[df['my_column'].isna()] #For the prediction, use a Dataframe with only the missing values. Problem solved
df_x = pred_df[['lat','long']]
prediction = clf.predict(df_x)
df.loc[df['my_column'].isna(), 'my_column'] = prediction
Solution 2: Use fillna()
The problem
The problem with the current solution is that df['my_column'].fillna(series_pred) requires the index of the df to be the same as that of series_pred, which is impossible in this situation unless your df has a simple index like [0, 1, 2, 3, 4...].
The solution
Resetting the index of the df at the very beginning of the code.
Why is this not the best
The cleanest way is to do the prediction only where you need it. This is easy to obtain with loc(), and I do not know how you would obtain it with fillna(), because you would need to preserve the index through the classification.
Edit: series_pred.index = df['my_column'].isna().index (thanks @Dan).
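For completeness, a minimal sketch of the fillna() route under that index assumption (the prediction wrapped in a Series whose index matches df):
series_pred = pd.Series(prediction, index=df.index)    # align to df's index
df['my_column'] = df['my_column'].fillna(series_pred)  # only the NaN rows are filled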
I have a pandas DataFrame with feature values that are really, really small, on the order of 1e-322. I am trying to standardize the features but I am getting
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
A few values from the dataframe are as follows:
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
I am assuming that I am dealing with a value underflow problem. How can I deal with this problem?
This is for python 3.6 and pandas dataframe.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The values in the DataFrame should be standardized as needed, but I am getting an error due to value underflow.
Multiply them.
You're right: your values are too small for Pandas to handle as floats. The smallest positive normal np.float64 value is ~2.22e-308; below that, numbers become subnormal and lose precision. You can handle somewhat smaller values by using more obscure types like np.longdouble, but these have their limits too and can be system-dependent.
As some of the comments point out, most plausible use cases don't require values this small. But if yours does, one simple way to get around the float boundaries is to multiply all of your values by a consistent integer that brings them within the acceptable float range (perhaps by 10^320). You're not losing any information, just dropping a long sequence of zeroes.
Note: this only works if you're not simultaneously storing numbers too huge to multiply without breaking the float limits in the other direction. But this seems unlikely.
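A minimal sketch of that idea, using the sample values from the question (note that 10**320 itself overflows float64, so the factor is applied in two steps):
import pandas as pd
df = pd.DataFrame({'x': [3.962406e-321, 3.310240e-322]})
df_scaled = (df * 1e160) * 1e160   # total factor of 1e320, applied in two representable steps
print(df_scaled)                   # values are now ~0.396 and ~0.033, safely in range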
Store the log of the numbers and reverse with exp when needed later. If you then need to shift them, the shift is additive (instead of multiplicative). Working in log-space helps avoid machine zero, though you'll still have issues to deal with when operating on the log values, i.e. log-of-sum != sum-of-logs.
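A small sketch of the log-space idea, assuming the values are positive and stored in a numpy array:
import numpy as np
x = np.array([3.962406e-321, 3.310240e-322])
log_x = np.log(x)            # roughly -737.8 and -740.3, comfortably representable
shifted = log_x + 700        # a shift in log-space is additive
recovered = np.exp(log_x)    # back to the original (subnormal) values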
You should try normalizing your data to bring it within a reasonable scale of values.
Here is some sample code:
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
You are receiving NaN because the numbers went beyond the scale float64 can handle.
EDIT 1:
Your error says that your dataset contains NaN values and cannot be converted to float64. Are you sure there are no empty values? If so, try to drop those rows, for example with DataFrame.dropna().
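For instance, a minimal sketch (assuming the numeric features are in a DataFrame df):
import numpy as np
df = df.dropna()                      # drop every row that still contains NaN
df = df[np.isfinite(df).all(axis=1)]  # optionally drop rows containing inf as well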
I'm trying to fit and transform some data to use later in a classifier, but it keeps giving me an error and I don't understand why.
Please, can somebody help me?
##stores the function Pipeline with parameters decided above
inputPipe = getPreProcPipe(normIn=normIn, pca=pca, pcaN=pcaN, whiten=whiten)
print inputPipe
print
#print devData[classTrainFeatures].values.astype('float32')
print devData[classTrainFeatures].shape
print type(devData[classTrainFeatures].values)
##fit pipeline to inputs features and types
inputPipe.fit(devData[classTrainFeatures].values.astype('float32'))
##transform inputs X
X_class = inputPipe.transform(devData[classTrainFeatures].values.astype('float32'))
## Output Y, i.e, 0 or 1 as it is the target
Y_class = devData['gen_target'].values.astype('int')
#print Y_class
Output:
Pipeline(memory=None,
steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('normPCA', StandardScaler(copy=True, with_mean=True, with_std=True))])
(32583, 2)
<type 'numpy.ndarray'>
Error at the end of the code:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
You have to check the data you use (not the code) for NaN (not-a-number) values; numpy has the function np.isnan() for this (https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html); see also "How to get the indices list of all NaN value in numpy array?".
Also check for infinite values with np.isinf().
This kaggle kernel has example code for filling NaNs and Infs in datasets that are then used in classifiers: https://www.kaggle.com/mknorps/titanic-with-decision-trees ; also see https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas?rq=1 for interpolate().
Dropping rows that contain NaNs is done by
import numpy as np
indx = devData[classTrainFeatures].index[devData[classTrainFeatures].apply(np.isnan).any(axis=1)]
devData = devData.drop(indx).copy()
devData = devData.reset_index(drop=True)
(get the index of the rows containing NaN, drop those rows using that index, reset the index of the dataframe)
I see 3 possibilities for this kind of error:
You may have Infs in your data. In that case you may need to remove those samples. To find the Infs, try df.index[np.isinf(df).any(1)].
You may have NaNs in your data. Check it using df.index[np.isnan(df).any(1)]. In that case you may replace the NaNs with the mean value of the column via df.fillna(df.mean()).dropna(axis=1, how='all').
Finally, and most probably, you may have a constant or almost-constant feature that, once it gets normalized and divided by its standard deviation, gives you NaNs or Infs. In that case you should drop that feature using VarianceThreshold, for example:
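A minimal sketch of that last check, assuming the features are in an array or DataFrame X:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.0)   # removes features with zero variance (constant columns)
X_reduced = selector.fit_transform(X)
print(X.shape, X_reduced.shape)               # fewer columns if constant features were present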
Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', 'color=yellow', and 'color=unknown'. I begin with data in a pandas data-frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me.
Pandas and get_dummies on the categorical columns of the data-frame. This method seems excellent as long as the original data-frame contains all the data available, that is, if you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer values for a given variable. For example, it can happen that whereas the training set contains the values red, blue, yellow, and unknown for the variable color, the test set only contains red and blue. So the test set would end up having fewer columns than the training set. (I also don't know how the new columns are sorted; even if the sets ended up with the same columns, they could be in a different order in each set.)
Sklearn and DictVectorizer. This solves the previous issue, as we can make sure that we apply the very same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas data-frame. If we want to recover the output as a pandas data-frame, we need to (or at least this is the way I do it): 1) pandas.DataFrame(data=outcome of the DictVectorizer transformation, index=index of the original pandas data-frame, columns=DictVectorizer().get_feature_names()) and 2) join the resulting data-frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.
Is there a better way to do a binary one-hot encoding within a pandas data-frame if we have our data split in training and test set?
If your columns are in the same order, you can concatenate the dfs, use get_dummies, and then split them back again, e.g.,
encoded = pd.get_dummies(pd.concat([train,test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]
If your columns are not in the same order, then you'll have challenges regardless of what method you try.
You can set your data type to categorical:
In [5]: df_train = pd.DataFrame({"car": pd.Categorical(["seat","bmw"], categories=["seat","bmw","mercedes"]), "color": ["red","green"]})
In [6]: df_train
Out[6]:
car color
0 seat red
1 bmw green
In [7]: pd.get_dummies(df_train )
Out[7]:
car_seat car_bmw car_mercedes color_green color_red
0 1 0 0 0 1
1 0 1 0 1 0
See this issue of Pandas.
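Along the same lines, a small sketch with a hypothetical df_test: because the categories are fixed, the test frame yields the same car_* dummy columns even though it contains only one of the values.
df_test = pd.DataFrame({"car": pd.Categorical(["mercedes"], categories=["seat","bmw","mercedes"]),
                        "color": ["red"]})
print(pd.get_dummies(df_test))   # car_seat, car_bmw and car_mercedes all appear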