SciKitLearn tree returning error - python

I'm trying to build a decision tree with SciKitLearn, and it tells me:
Input contains NaN, infinity or a value too large for dtype('float64').
Running .isnull().any() on the input data returns False for every column.
There are four input columns of type float64; the data in them is properly formatted to two decimal places, no crazy values.
What might the culprit be and how can I fix it?
y = df["CutoffValue"]
X = df_new
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)

Fixed it! In this case, "input" in the error refers to LABELED data, the y! Dropped nulls for the column, and all is ok.
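For anyone who hits the same thing, here is a minimal sketch of that fix, assuming (as in the snippet above) that the labels live in df["CutoffValue"], the features in df_new, and the two share the same index:
from sklearn import tree

# scikit-learn validates y as well as X, so NaN labels raise the same error
mask = df["CutoffValue"].notnull()   # rows with a usable label
X = df_new[mask]                     # keep features and labels aligned
y = df["CutoffValue"][mask]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)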

Related

How to deal with really small (order of -322) floating values in pandas dataframe?

I have a pandas DataFrame with feature values that are really, really small, on the order of 1e-322. I am trying to standardize the features but I am getting
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
A few values from the dataframe are as follows:
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
I am assuming that I am dealing with a value-underflow problem. How can I deal with it?
This is for Python 3.6 and a pandas DataFrame.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The values in the DataFrame should simply be standardized, but the call fails, presumably because of value underflow.
Multiply them.
You're right: your values are too small to handle safely as ordinary floats. The smallest normal positive np.float64 value is about 2.22e-308; below that you are into subnormal territory. You can handle somewhat smaller values by using more obscure types like np.longdouble, but these have their limits too and can be system-dependent.
As some of the comments point out, most plausible use cases don't require values this small. But if yours does, one simple way to get around the float boundaries is to multiply all of your values by a consistent integer that brings them within the acceptable float range (perhaps by 10^320). You're not losing any information, just dropping a long sequence of zeroes.
Note: this only works if you are not also storing numbers so large that multiplying them would overflow in the other direction, but that seems unlikely here.
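A rough sketch of that workaround, assuming X_train and X_test are the float64 arrays from the snippet above. Note that the multiplier itself must be representable as a float64, so the sketch uses 1e300 (which already brings ~1e-322 up to ~1e-22, inside the normal range) rather than 10^320 in a single step:
import numpy as np
from sklearn.preprocessing import StandardScaler

SCALE = 1e300   # example constant; pick one that lands your data inside the normal float64 range

X_train = np.asarray(X_train, dtype=np.float64) * SCALE
X_test = np.asarray(X_test, dtype=np.float64) * SCALE

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Since StandardScaler subtracts the mean and divides by the standard deviation, a constant factor cancels out of the standardized result (up to floating-point rounding), so downstream models see essentially the same features.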
Alternatively, store the log of each number and reverse it with exp when needed later. A rescale then becomes an additive shift instead of a multiplicative one. Working in log-space helps you avoid machine zero, though you will still have issues to deal with when operating on the log values, e.g. the log of a sum is not the sum of the logs.
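A tiny illustration of the log-space idea, using a couple of the values quoted above and assuming they are all strictly positive:
import numpy as np

values = np.array([3.962406e-321, 3.310240e-322])
log_values = np.log(values)           # roughly -738 and -740: comfortably representable
shifted = log_values + np.log(1e300)  # a multiplicative rescale becomes an additive shift
recovered = np.exp(shifted)           # ~4e-21 and ~3e-22, back in the normal float64 range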
You should try normalizing your data to bring it within a reasonable scale of values.
Here is some sample code:
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
You are receiving NaN because the numbers went beyond the scale that float64 can handle.
EDIT1:
Your error says that your dataset contains NaN values that cannot be converted to float64. Are you sure there are no empty values? If so, try dropping those rows with the DataFrame method below:
DataFrame.dropna()
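For example, a minimal sketch, assuming df is the DataFrame being scaled:
import numpy as np

# drop rows containing NaN (and, if any, +/-inf) before fitting the scaler
df = df.replace([np.inf, -np.inf], np.nan).dropna()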

ValueError: could not convert string to float <ipython-input-5-1a15d1ec0505> in <module>()

I am trying to implement KNN, but when I transform X_train and X_test it gives an error. I am new to this, so any help would be appreciated.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
It gives the error "could not convert string to float".
What should I do?
The first thing I do when I see an error like this is to check my inputs. If you're getting a "could not convert string to float" error, you probably have a non-numeric string somewhere in your inputs (because it looks like this KNN function only takes numbers as input).
I'm assuming X_train and X_test are dataframes--try running the following for each column in your dataframe:
for i in range(len(X_train.columns)):
    try:
        # will raise ValueError if any entry in the column is a non-numeric string
        [float(j) for j in X_train[X_train.columns[i]]]
        print(X_train.columns[i], ' is all-numeric')
    except ValueError:
        pass
Whichever column doesn't print when you run this is the one you need to look at and see if you can clean up the non-numeric entries in that column.
Edit: if you have a column of only non-numeric strings (for example, "Iris-setosa","Iris-versicolor", etc), you will have to convert them into numbers or Dummy Variable columns for the purposes of the KNN function.
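As a hedged sketch of that conversion (the column name 'species' is made up for illustration; substitute your own string column):
import pandas as pd

# turn the string column into dummy/indicator columns so KNN only sees numbers
X_train = pd.get_dummies(X_train, columns=['species'])
X_test = pd.get_dummies(X_test, columns=['species'])

# align test with train in case a category is missing from one of the splits
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)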
Edit 2: Whoooops. I wrote bad code. Fixed it.

Scikit Learn - Combining output of TfidfVectorizer and OneHotEncoder - dimensionality

I am currently developing a machine learning algorithm for ticket classification that combines a Title, Description and Customer name together to predict what team a ticket should be assigned to but have been stuck for the past few days.
Title and description are both free text and so I am passing them through TfidfVectorizer. Customer name is a category, for this I am using OneHotEncoder. I want these to work within a pipeline so have them being joined with a column transformer where I can pass in an entire dataframe and have it be processed.
file = "train_data.csv"
train_data= pd.read_csv(train_file)
string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer()))
categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder()))
preprocessor = ColumnTransformer(transformers = [('str', string_transformer, string_features), ('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),('clf', SGDClassifier())]
X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)
However I get an error: all the input array dimensions except for the concatenation axis must match exactly.
After looking into it, print(OneHotEncoder().fit_transform(X_train['Customer'])) on its own returns an error: Expected 2d array got 1d array instead.
I believe OneHotEncoder is failing because it expects an array of arrays (a pandas DataFrame), each of length one and containing the customer name, but instead it is just getting a pandas Series. By converting the Series to a DataFrame with .to_frame(), the printed output now seems to match what the TfidfVectorizer outputs, and the dimensions should match.
Is there a way I can modify OneHotEncoder in the pipeline so that it accepts the input as it is in 1 dimension? Or is there something I can add to the pipeline that will convert it before it's passed into OneHotEncoder? Am I right in that this is the reason for the error?
Thanks.
I believe the problem lies in the fact that you're giving two columns to the TfidfVectorizer (which is thus handed a DataFrame). This will not work: TfidfVectorizer expects an iterable of strings. So an immediate solution (and therefore a check of whether this is in fact the source of the problem) is changing this line to: string_features = 'Description'. Note this is not a list, it is just a string, so the Series is passed to the TfidfVectorizer and not the DataFrame.
If you would like to combine both string columns, you could either
concatenate the strings, so you keep one column (which is the easiest), or
fit two different TfidfVectorizers (a sketch follows after this answer), which is more complex but might perform better. See for instance Computing separate tfidf scores for two different columns using sklearn
Should this not solve your problem, I would advise you to share some sample data so we can actually test what is happening.
I believe the difference between your perceived error and the actual pipeline lies in the fact that you're giving it X_train['Customer'] (again a Series), but in the actual pipeline you're giving it X_train[['Customer']] (a DataFrame).
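For what it's worth, here is an untested sketch of the two-vectorizer variant, using the column names from your snippet. The key detail is that ColumnTransformer hands a transformer a 1-D Series when you give it a single column name as a string, and a 2-D DataFrame when you give it a list, which is exactly what TfidfVectorizer and OneHotEncoder respectively expect:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import SGDClassifier

preprocessor = ColumnTransformer(transformers=[
    ('title_tfidf', TfidfVectorizer(), 'Title'),         # string -> 1-D Series for the vectorizer
    ('desc_tfidf', TfidfVectorizer(), 'Description'),
    ('customer_ohe', OneHotEncoder(handle_unknown='ignore'), ['Customer']),  # list -> 2-D input
])

clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', SGDClassifier())])
clf.fit(X_train, y_train)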

Python SKLearn fit Value Error Input

I'm trying to fit and transform some data to use later in a classifier, but it keeps giving me an error and I don't understand why.
Please, can somebody help me?
##stores the function Pipeline with parameters decided above
inputPipe = getPreProcPipe(normIn=normIn, pca=pca, pcaN=pcaN, whiten=whiten)
print inputPipe
print
#print devData[classTrainFeatures].values.astype('float32')
print devData[classTrainFeatures].shape
print type(devData[classTrainFeatures].values)
##fit pipeline to inputs features and types
inputPipe.fit(devData[classTrainFeatures].values.astype('float32'))
##transform inputs X
X_class = inputPipe.transform(devData[classTrainFeatures].values.astype('float32'))
## Output Y, i.e, 0 or 1 as it is the target
Y_class = devData['gen_target'].values.astype('int')
#print Y_class
Output:
Pipeline(memory=None,
steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('normPCA', StandardScaler(copy=True, with_mean=True, with_std=True))])
(32583, 2)
<type 'numpy.ndarray'>
Error at the end of the code:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
You have to check the data you use (not the code) for NaN (not-a-number) values; NumPy has the function np.isnan() for this (https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html), see also How to get the indices list of all NaN value in numpy array?
Also check for infinite values with np.isinf().
This Kaggle kernel has example code for filling NaNs and Infs in datasets that are then used in classifiers: https://www.kaggle.com/mknorps/titanic-with-decision-trees. Also see https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas?rq=1 for interpolate().
Dropping rows that contain NaNs and Infs is done by
indx = devData.index[~np.isfinite(devData[classTrainFeatures]).all(axis=1)]
devData = devData.drop(indx).copy()
devData = devData.reset_index(drop=True)
(get the index of rows containing NaN or Inf, drop all those rows using that index, reset the index of the DataFrame)
I see 3 possibilities for this kind of error:
You may have Infs in your data. In that case you may need to remove those samples; to find the Infs, try df.index[np.isinf(df).any(axis=1)].
You may have NaNs in your data. Check it using df.index[np.isnan(df).any(axis=1)]; in that case you may replace the NaNs with the mean value of the column by doing df.fillna(df.mean()).dropna(axis=1, how='all').
Finally, and most probably, you have a constant or almost-constant feature that, once the mean is subtracted and the result divided by its (near-zero) standard deviation, gives you NaNs or Infs. In that case you should drop that feature using VarianceThreshold (a combined sketch of all three checks follows below).
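A combined, hedged sketch of those three checks, assuming df is the all-numeric feature DataFrame being fed into the scaler:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# 1. rows containing +/-inf
print(df.index[np.isinf(df).any(axis=1)])

# 2. rows containing NaN, imputed with the column mean
print(df.index[np.isnan(df).any(axis=1)])
df = df.fillna(df.mean()).dropna(axis=1, how='all')

# 3. drop zero-variance (constant) features before standardizing
X = VarianceThreshold(threshold=0.0).fit_transform(df)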

Output of sklearn.ensemble.RandomForestClassifier includes NaN values

I am using sklearn.ensemble.RandomForestClassifier to analyze data and I was puzzled to see NaN values in the predictions without any NaN in the training or testing sets.
print preds_y[preds_y.isnull().any(axis=1)].shape
print train_y[train_y.isnull().any(axis=1)].shape
print train_features[train_features.isnull().any(axis=1)].shape
print test_features[train_features.isnull().any(axis=1)].shape
> (4830, 1)
> (0, 1)
> (0, 22)
> (0, 22)
These NaN values are causing the call to sklearn.metrics.classification_report to fail with the following error:
> ValueError: Mix of label input types (string and number)
Right now I'm mostly interested in understanding why the random forest is spitting out NaNs. As soon as I figure that out, I can filter the results accordingly and see how well the method is performing.
Thanks in advance for your input.
(I'm sorry if this has been asked before. I searched for it but all the results I found concerned NaNs in the training data, which is not my issue at all.)
EDIT 1: Just to be clear, there are many valid predictions in the output:
print preds_y[~preds_y.isnull().any(axis=1)].shape
print train_y[~train_y.isnull().any(axis=1)].shape
> (11760, 1)
> (39749, 1)
EDIT 2:
As I wrote in a comment below, the original data has numeric and categorical columns. All the categorical columns are converted to numeric using pandas.get_dummies() before calling fit(). I convert the results back to a pandas.DataFrame and reconstruct the original categorical columns for readability. The two pandas.Series -- predicted and actual values -- I am feeding classification_report() have only one type (category).
It seems that the NaNs in the predictions arise if the random forest predicts 0 for every dummy binary column corresponding to the original categorical column. I was not expecting this to happen so often -- it seems that 30% of my entries go unclassified -- but I'm not sure there is anything further to add on this issue.
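For reference, a toy reconstruction of that effect (the column names are invented for illustration; this is not necessarily the exact mapping code used above):
import numpy as np
import pandas as pd

dummies = pd.DataFrame({'cat_A': [1, 0, 0], 'cat_B': [0, 1, 0]})  # third row: all zeros

# map each row back to its original category; rows where no dummy fired become NaN
recovered = dummies.idxmax(axis=1).where(dummies.sum(axis=1) > 0, np.nan)
print(recovered)   # -> cat_A, cat_B, NaN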
You can first remove all NaNs by replacing them with zeros.
See this link.
Maybe use df.fillna(0); then you should be fine, I suppose.
