ValueError: Input contains NaN in Python

My code:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

mnb = MultinomialNB()
svm = SGDClassifier(max_iter=1000, tol=0.2)

mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)
and it raises this error:
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
58 elif X.dtype == np.dtype('object') and not allow_nan:
59 if _object_dtype_isnan(X).any():
---> 60 raise ValueError("Input contains NaN")
61
62
ValueError: Input contains NaN
What makes my program raise this error? Is the error in the dataset or in the function?

All feature and label values must be finite. If bow_train_features, train_labels, bow_test_features, and test_labels are DataFrames or NumPy arrays, you can filter for only the fully finite observations in the train/test sets using the code below:
import numpy as np

# Build one boolean per row: True only when every feature value and the label are finite
train_finite_filter = np.isfinite(bow_train_features).all(axis=1) & np.isfinite(train_labels)
test_finite_filter = np.isfinite(bow_test_features).all(axis=1) & np.isfinite(test_labels)

# Keep only the fully finite training observations
bow_train_features_finite = bow_train_features[train_finite_filter]
train_labels_finite = train_labels[train_finite_filter]

# Keep only the fully finite test observations
bow_test_features_finite = bow_test_features[test_finite_filter]
test_labels_finite = test_labels[test_finite_filter]
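If you want to see where the NaNs come from before dropping anything, a small pandas sketch may help (the DataFrame here is a made-up stand-in for bow_train_features, not data from the question):
import numpy as np
import pandas as pd

# Made-up stand-in for bow_train_features from the question
bow_train_features = pd.DataFrame({"word_a": [1.0, np.nan, 3.0],
                                   "word_b": [0.0, 2.0, np.nan]})

# Count missing values per column to see which features are affected
print(bow_train_features.isna().sum())

# Show the offending rows so you can decide whether to drop or impute them
print(bow_train_features[bow_train_features.isna().any(axis=1)])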

Related

Do we One Hot Encode (create Dummy Variables) before or after Train/Test Split?

I've seen quite a lot of conflicting views on whether one-hot encoding (dummy variable creation) should be done before or after the train/test split.
Some responses state that one-hot encoding before the split leads to "data leakage".
This example states it's industry norm to do one-hot encoding on the entire data before the train/test split:
Industry Example
This example from Kaggle states that it should be done after the train/test split to avoid data leakage:
kaggle response - after split
My questions are the following:
Do we perform one-hot encoding before or after the train/test split?
Where is the data leakage occurring in the following example?
In the following example we have two columns - web_views and website (a non-ordinal categorical feature) - and we one-hot encode the entire website column without dropping any dummies.
Our dataframe:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
df = pd.DataFrame({'web_views': [100, 200, 300, 400],
                   'website': ['Youtube', 'Facebook', 'Instagram', 'Google']})
Scenario 1: One-Hot Encoding/Dummy Variables before splitting into Train/Test:
np.random.seed(123)
df_before_split = pd.concat([df.drop('website', axis = 1), pd.get_dummies(df['website'])], axis=1)
# create your X and y dataframes
X_before_split = df_before_split.drop('web_views', axis = 1)
y_before_split = df_before_split['web_views']
# perform train test split
X_train_before_split, X_test_before_split, y_train_before_split, y_test_before_split = train_test_split(X_before_split, y_before_split, test_size = 0.20)
Now viewing the dataframes we have:
# view X train dataset (this is encoding before split)
X_train_before_split
and then for test
# View X test dataset dataset (this is encoding before split)
X_test_before_split
Scenario 2: One-Hot Encoding/Dummy Variables AFTER splitting into Train/Test:
# Perform One Hot encoding after the train/test split instead
X = df.drop('web_views', axis = 1)
y = df['web_views']
# perform data split:
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# perform one hot encoding on the train and test dataset datasets:
X_train = pd.concat([X_train.drop('website', axis = 1), pd.get_dummies(X_train['website'])], axis=1)
X_test = pd.concat([X_test.drop('website', axis = 1), pd.get_dummies(X_test['website'])], axis=1)
Viewing the X_train and X_test dataframes:
# encode after train/test split - train dataframe
X_train
# encode after train/test split - test dataframe
X_test
Performing Linear Regression Modelling
Now that we have split our data, we will create a simple linear model to demonstrate:
from sklearn.linear_model import LinearRegression
Before split linear model
regressor_before_split = LinearRegression()
regressor_before_split.fit(X_train_before_split, y_train_before_split)
y_pred_before_split = regressor_before_split.predict(X_test_before_split)
y_pred_before_split
y_pred_before_split returns a predicted value, as we would expect.
After split linear model
regressor_after_split = LinearRegression()
regressor_after_split.fit(X_train, y_train)
y_pred_after_split = regressor_after_split.predict(X_test)
y_pred_after_split
Error message from Scenario 2:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-92-c63978a198c8> in <module>()
2 regressor_after_split.fit(X_train, y_train)
3
----> 4 y_pred_after_split = regressor_after_split.predict(X_test)
5 y_pred_after_split
C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
C:\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (1,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
My thoughts:
Encoding with dummies before splitting ensures that the test data we pass in (e.g. X_test) to make predictions has the same shape as the training data the model was trained on, so the model knows how to predict values when it encounters these features. With encoding after splitting, X_test ends up with only one feature to predict with, whereas X_train has 3 features.
Maybe I've introduced data leakage?
I'd be happy for someone to correct me if I've got things wrong or misinterpreted anything, but I'm stuck scratching my head over whether to encode before or after splitting!
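One common way to avoid the shape mismatch seen in Scenario 2 is to fit a single encoder on the training split and reuse it on the test split. A minimal sketch with sklearn's OneHotEncoder (the handle_unknown='ignore' choice is an assumption, not something used in the examples above):
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'web_views': [100, 200, 300, 400],
                   'website': ['Youtube', 'Facebook', 'Instagram', 'Google']})
X = df[['website']]
y = df['web_views']

np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Fit the encoder on the training split only and reuse it on the test split,
# so both splits get exactly the same dummy columns
enc = OneHotEncoder(handle_unknown='ignore')
X_train_enc = enc.fit_transform(X_train)   # sparse matrix with one column per training category
X_test_enc = enc.transform(X_test)         # same columns; unseen categories become all-zero rows
This keeps the test split from influencing the encoding while still giving both splits identical columns.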

"The least populated class in y has only 1 ... groups for any class cannot be less than 2." Without train_test_split()

I am trying to run this code, using a dataset on the relation of Corona cases to Corona deaths.
I have not found any reason why the error should appear from the way I handle the split into X and y dataframes, but I do not fully understand the error either.
Does someone know what is wrong here?
import pandas as pd
import numpy as np
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn import preprocessing
#import csv
X_test = pd.read_csv("test.csv")
y_output = pd.read_csv("sample_submission.csv")
data_train = pd.read_csv("train.csv")
X_train = data_train.drop(columns=["Next Week's Deaths"])
y_train = data_train["Next Week's Deaths"]
#prepare for fit (transform Location strings into classes)
Location = data_train["Location"]
le = preprocessing.LabelEncoder()
le.fit(Location)
LocationToInt = le.transform(Location)
LocationDict = dict(zip(Location, LocationToInt))
X_train["Location"] = X_train["Location"].replace(LocationDict)
#train and run
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)
Traceback:
Input In [89], in <cell line: 29>()
27 #train and run
28 model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
---> 29 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py:348, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
343 # Save the state of the RNG for the training and validation split.
344 # This is needed in order to have the same split when using
345 # warm starting.
347 if sample_weight is None:
--> 348 X_train, X_val, y_train, y_val = train_test_split(
349 X,
350 y,
351 test_size=self.validation_fraction,
352 stratify=stratify,
353 random_state=self._random_seed,
354 )
355 sample_weight_train = sample_weight_val = None
356 else:
357 # TODO: incorporate sample_weight in sampling here, as well as
358 # stratify
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:2454, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2450 CVClass = ShuffleSplit
2452 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2454 train, test = next(cv.split(X=arrays[0], y=stratify))
2456 return list(
2457 chain.from_iterable(
2458 (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
2459 )
2460 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1613, in BaseShuffleSplit.split(self, X, y, groups)
1583 """Generate indices to split data into training and test set.
1584
1585 Parameters
(...)
1610 to an integer.
1611 """
1612 X, y, groups = indexable(X, y, groups)
-> 1613 for train, test in self._iter_indices(X, y, groups):
1614 yield train, test
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1953, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
1951 class_counts = np.bincount(y_indices)
1952 if np.min(class_counts) < 2:
-> 1953 raise ValueError(
1954 "The least populated class in y has only 1"
1955 " member, which is too few. The minimum"
1956 " number of groups for any class cannot"
1957 " be less than 2."
1958 )
1960 if n_train < n_classes:
1961 raise ValueError(
1962 "The train_size = %d should be greater or "
1963 "equal to the number of classes = %d" % (n_train, n_classes)
1964 )
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
For Text Highlighting: Picture of Traceback
The HistGradientBoostingClassifier internally splits your dataset into train and validation. The default is 10% for validation (check out the validation_fraction param in the docs).
In your case, there is a class with a single element in it, so if it goes to the train split the classifier can't validate this class, or vice versa. The point is: you need at least two examples in each class.
How to solve it? Well, first you need a proper diagnosis: run the following code to see which class is the problem:
import numpy as np

unq, cnt = np.unique(y_train, return_counts=True)
for u, c in zip(unq, cnt):
    print(f"class {u} contains {c} examples")
What to do now? Well, first make sure those results make sense to you and that there is no earlier error (maybe reading your CSV incorrectly or losing data a few steps before).
Then, if the problem persists, your options are the following:
Collect more data. Not always possible, but this is the best option.
Add synthetic data. imblearn, for instance, is a sklearn-like library for working on imbalanced problems like yours. It provides several well-known oversampling methods. You can also create your own synthetic data, since you know what it is.
Remove classes with a single example. This implies re-framing your problem a little, but it may work. Just drop the row. You can also re-label it to one of the closest labels; for instance, if you have the classes positive, negative and neutral, and a single example of the neutral class, maybe you can re-label it as negative.
Group classes with low cardinality. This is useful when you have multiple classes, say 10, and some of them, say 3, have really few examples. You can merge those low-cardinality classes into a single class "other" and convert your problem into a similar one with fewer but more populated classes - in the example, you would end up with 8 classes instead of 10 (see the sketch after this answer).
What is the best alternative? It really depends on your problem.
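As an illustration of the grouping option above, here is a minimal pandas sketch (the threshold of 2 members and the "other" label are assumptions, not part of the original question):
import pandas as pd

# Hypothetical target standing in for y_train from the question
y_train = pd.Series(["a", "a", "a", "b", "c"])

# Count examples per class and collect the classes with fewer than 2 members
counts = y_train.value_counts()
rare_classes = counts[counts < 2].index

# Merge every rare class into a single "other" bucket so each remaining class
# has enough members for the internal stratified validation split
y_train_grouped = y_train.where(~y_train.isin(rare_classes), "other")
print(y_train_grouped.value_counts())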

RandomOverSampler doesn't seem to accept log transform as my y target variable

I am trying to do random oversampling over a small dataset for linear regression. However, it seems the scikit-learn sampling API doesn't work with float values as its target variable. Is there any way to solve this?
This is a sample of my y_train values, which are log transformed.
3.688879
3.828641
3.401197
3.091042
4.624973
from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_sample(X_train,y_train)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
1 from imblearn.over_sampling import RandomOverSampler
~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
73 The corresponding label of `X_resampled`.
74 """
---> 75 check_classification_targets(y)
76 arrays_transformer = ArraysTransformer(X, y)
77 X, y, binarize_y = self._check_X_y(X, y)
~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174
ValueError: Unknown label type: 'continuous'
Re-sampling strategies are not meant for regression problems, hence the RandomOverSampler will not accept float-type targets. There are approaches to re-sample data with continuous targets, though. One example is the reg_resampler package, which can be used as follows:
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression
from reg_resampler import resampler
import numpy as np

# Create some dummy data for demonstration
X, y = make_regression(n_features=10)
df = np.append(X, y.reshape(100, 1), axis=1)

# Initialize the resampler object and generate pseudo-classes
rs = resampler()
y_classes = rs.fit(df, target=10)

# Now resample
X_res, y_res = rs.resample(
    sampler_obj=RandomOverSampler(random_state=27),
    trainX=df,
    trainY=y_classes
)
The resampler object will generate pseudo-classes based on your target values and then use a classic re-sampling object from the imblearn package to re-sample your data. Note that the data you pass to the resampler object should contain all data, including the targets.
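If you prefer to stay with imblearn alone, here is a rough sketch of the same idea done by hand - binning the continuous target into pseudo-classes before oversampling (the number of bins is an arbitrary choice, not from the question):
import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression

# Dummy regression data; in the question this would be X_train and the log-transformed y_train
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Bin the continuous target into a handful of pseudo-classes
y_bins = pd.cut(y, bins=5, labels=False)

# Oversample rows according to the pseudo-class, keeping the original continuous target
ros = RandomOverSampler(random_state=42)
Xy = np.column_stack([X, y])
Xy_res, _ = ros.fit_resample(Xy, y_bins)
X_res, y_res = Xy_res[:, :-1], Xy_res[:, -1]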

Why do I get 1D array instead of 2D array Index error for Machine Learning

I am trying to predict income (70000+) based on specific categorical fields (Sex and Highest Cert, dip, deg) using the Python code below.
I created ranges for the average income and then specified the specific income range (70000+) I want to predict from
(Sex and Highest Cert, dip, deg).
I have the following code. However, I get an error when I get to the one-hot encoding part of the code. I am using Python in Visual Studio. I have tried changing the categorical field to "Age", but it does not work. The code is below. How can I fix it? Thank you.
# %% read dataframe from part1
import pandas as pd
df = pd.read_pickle("data.pkl")
#%%
import numpy as np
bins = [0, 30000, 50000, 70000, 100000, np.inf]
names = ['<30000', '30000-50000', '50000-70000', '70000-100000', '100000+']
df['Avg Emp Income Range'] = pd.cut(df['Avg Emp Income'], bins, labels=names)
#%% OHE of Avg empl income
for val in df["Avg Emp Income Range"].unique():
    df[f"Avg Emp Income Range_{val}"] = df["Avg Emp Income Range"] == val
#%% selecting data
x= ["Sex","Highest Cert,dip,deg"]
#%%
success = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
y = success
# %% split into training / testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)
#%%
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
enc = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer(
    [
        ("ohe", enc, ["Sex", "Highest Cert,dip,deg"])
    ],
    remainder="passthrough",
)
x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)
I get this error
Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\maria\Documents\Project Capstone 2\Z NO\machine L.py in
     42 )
     43
---> 44 x_train = ct.fit_transform(x_train)
     45 x_test = ct.transform(x_test)
c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    522     else:
    523         self._feature_names_in = None
--> 524     X = _check_X(X)
    525     # set n_features_in attribute
    526     self._check_n_features(X, reset=True)
c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\compose\_column_transformer.py in _check_X(X)
    649     if hasattr(X, '__array__') or sparse.issparse(X):
    650         return X
--> 651     return check_array(X, force_all_finite='allow-nan', dtype=np.object)
    652
    653
c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70         FutureWarning)
     71     kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72     return f(**kwargs)
     73     return inner_f
     74
c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    621     "Reshape your data either using array.reshape(-1, 1) if "
    622     "your data has a single feature or array.reshape(1, -1) "
--> 623     "if it contains a single sample.".format(array))
    624
    625     # in the future np.flexible dtypes will be handled like object dtypes
ValueError: Expected 2D array, got 1D array instead:
array=['Sex'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You say that your training data is
x = ["Sex","Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)
but what you are splitting and later encoding are just the lists of column names, not the data itself:
ct = ColumnTransformer([("ohe", enc, ["Sex","Highest Cert,dip,deg"])])
Now, when you call ct.fit_transform(x_train), the transformer ct expects a 2D data matrix (samples by columns), but x_train is only a 1D list containing the string 'Sex', which raises the exception.
However, I assume that you rather wanted to use x and y as keys for the data matrix df:
x = ["Sex","Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(df[x], df[y], random_state=123)
It is helpful to use the debugger or to execute the code step by step in IPython so that you can keep track of the array sizes and check that the code actually does what you think it should.
Your x and y data are not set correctly: you are just using the column headers as lists instead of the DataFrame's values. Try setting:
x = df[["Sex","Highest Cert,dip,deg"]]
y = df[["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]]
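Putting both suggestions together, a rough sketch of the corrected flow (it assumes df is the DataFrame from the question, with the income-range dummy columns already added by the earlier loop):
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Select actual data from the DataFrame, not just the header strings
x = df[["Sex", "Highest Cert,dip,deg"]]
y = df[["Avg Emp Income Range_70000-100000", "Avg Emp Income Range_100000+"]]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

# Fit the one-hot encoder on the training data only, then reuse it on the test data
enc = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer([("ohe", enc, ["Sex", "Highest Cert,dip,deg"])], remainder="passthrough")
x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)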

Predict movie reviews with scikit-learn

I'm using scikit-learn's MultinomialNB and CountVectorizer to build a model that predicts whether a review is good or bad.
After training on the labelled data, how do I use it to predict new reviews (or existing review)? I'm getting the error message below.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
X = vectorizer.fit_transform(df.quote)
X = X.tocsc()
Y = (df.fresh == 'fresh').values.astype(np.int)
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
clf = MultinomialNB().fit(xtrain, ytrain)
new_review = ['this is a new review, movie was awesome']
new_review = vectorizer.fit_transform(new_review)
print df.quote[15]
print(clf.predict(df.quote[10])) #predict existing review in dataframe
print(clf.predict(new_review)) #predict new review
Technically, Toy Story is nearly flawless.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-91-27a0698bbd1f> in <module>()
15
16 print df.quote[15]
---> 17 print(clf.predict(df.quote[10])) #predict existing quote in dataframe
18 print(clf.predict(new_review)) #predict new review
//anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.pyc in predict(self, X)
60 Predicted target values for X
61 """
---> 62 jll = self._joint_log_likelihood(X)
63 return self.classes_[np.argmax(jll, axis=1)]
64
//anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
439 """Calculate the posterior log probability of the samples X"""
440 X = atleast2d_or_csr(X)
--> 441 return (safe_sparse_dot(X, self.feature_log_prob_.T)
442 + self.class_log_prior_)
443
//anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
178 return ret
179 else:
--> 180 return fast_dot(a, b)
181
182
TypeError: Cannot cast array data from dtype('float64') to dtype('S32') according to the rule 'safe'
You need to pass a bag-of-words representation to predict, not the text directly. You are doing it almost correctly with new_review; just change new_review = vectorizer.fit_transform(new_review) to new_review = vectorizer.transform(new_review) (see Stergios' comment). For the existing review, pass the already-vectorized row instead of the raw text. Try this:
print(clf.predict(X[10, :]))
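For the new review, the same idea applies (this assumes vectorizer is the CountVectorizer already fitted on df.quote and clf is the trained classifier from the question):
# Reuse the vocabulary learned during training; calling fit_transform again would
# build a new vocabulary whose columns no longer line up with what the classifier expects
new_review = ['this is a new review, movie was awesome']
new_review_bow = vectorizer.transform(new_review)
print(clf.predict(new_review_bow))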
