Does SimpleImputer remove features? - python

I have a dataset of 284 features that I am trying to impute using scikit-learn; however, I get an error saying the number of features has changed to 283:
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(data.iloc[:,0:284])
df[:,0:284] = imputer.transform(df[:,0:284])
X = MinMaxScaler().fit_transform(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-150-849be5be8fcb> in <module>
1 imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
2 imputer = imputer.fit(data.iloc[:,0:284])
----> 3 df[:,0:284] = imputer.transform(df[:,0:284])
4 X = MinMaxScaler().fit_transform(df)
~\Anaconda3\envs\environment\lib\site-packages\sklearn\impute\_base.py in transform(self, X)
411 if X.shape[1] != statistics.shape[0]:
412 raise ValueError("X has %d features per sample, expected %d"
--> 413 % (X.shape[1], self.statistics_.shape[0]))
414
415 # Delete the invalid columns if strategy is not constant
ValueError: X has 283 features per sample, expected 284
I don't understand how this gets down to 283 features. I assume that during fitting it finds features that are all 0s or all missing and decides to drop them, but I can't find documentation that tells me how to make sure those features are kept. I am not a programmer, so I'm not sure whether I'm missing something obvious or whether I'd be better off looking into another method.

This can happen if you have a feature with no observed values at all. From https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html:
'Columns which only contained missing values at fit are discarded upon transform if strategy is not “constant”'.
You can tell if this is indeed the problem by using a high 'verbose' value when constructing the imputer:
sklearn.impute.SimpleImputer(..., verbose=100,...)
It will print something like:
UserWarning: Deleting features without observed values: [ ... ]
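For reference, here is a minimal sketch (toy data, not the asker's 284-feature dataset) reproducing that behaviour: a column with no observed values at fit time is dropped by transform, so the output has one column fewer.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame: column "b" contains only NaN, so it has no observed values
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [np.nan, np.nan, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer.fit_transform(toy).shape)   # (3, 1): column "b" was discarded

# to list which of your own columns are entirely missing:
print(toy.columns[toy.isna().all()])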

I was dealing with the same situation and solved it by adding this transformation before the SimpleImputer mean strategy:
imputer = SimpleImputer(strategy = 'constant', fill_value = 0)
df_prepared_to_mean_or_anything_else = imputer.fit_transform(previous_df)
What does it do? It fills every missing value with the value specified in the fill_value parameter.
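A hedged sketch of why this keeps the column count intact (toy data, names illustrative): the constant imputer fills the all-missing column with 0 first, so a subsequent mean imputer no longer sees a column without observed values and drops nothing.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [np.nan, np.nan, np.nan]})

filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(toy)
# nothing is missing any more, so the mean imputer keeps both columns
print(SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(filled).shape)   # (3, 2)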

Related

How to run a random classifier in the following case

I am experimenting with a sentiment analysis case and I am trying to run a random classifier on the following data:
|Topic             |value|label|
|------------------|-----|-----|
|Apples are great  |-0.99|0    |
|Balloon is red    |-0.98|1    |
|cars are running  |-0.93|0    |
|dear diary        |0.8  |1    |
|elephant is huge  |0.91 |1    |
|facebook is great |0.97 |0    |
After splitting it into train and test sets using the sklearn library, I do the following to the Topic column so that the count vectorizer can work on it:
x = train.iloc[:,0:2]
#except for alphabets removing all punctuations
x.replace("[^a-zA-Z]"," ",regex=True, inplace=True)
#convert to lower case
x = x.apply(lambda a: a.astype(str).str.lower())
x.head(2)
After that I apply CountVectorizer to the Topic column, combine it with the value column, and apply the random forest classifier.
## Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
## implement BAG OF WORDS
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
train_set = pd.concat([x['compound'], pd.DataFrame(traindataset)], axis=1)
# implement RandomForest Classifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,train['label'])
But I receive an error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-41-7a1f9b292921> in <module>()
1 # implement RandomForest Classifier
2 randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
----> 3 randomclassifier.fit(train_set,train['label'])
4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: setting an array element with a sequence.
My idea is:
The values come from applying VADER sentiment analysis, and I want to feed them to my random classifier as well, to see the impact of the VADER scores on the output.
Maybe there is a way to multiply or otherwise combine the data in the value column with the sparse matrix traindataset that was generated?
Can anyone please tell me how to do that in this case?
The issue is concatenating another column to a sparse matrix (the output of countvector.fit_transform). For simplicity's sake, let's say your training data is:
x = pd.DataFrame({'Topics':['Apples are great','Balloon is red','cars are running',
'dear diary','elephant is huge','facebook is great'],
'value':[-0.99,-0.98,-0.93,0.8,0.91,0.97,],
'label':[0,1,0,1,1,0]})
You can see this gives you something weird:
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
train_set = pd.concat([x['value'], pd.DataFrame(traindataset)], axis=1)
train_set.head(2)
value 0
0 -0.99 (0, 0)\t1\n (0, 1)\t1
1 -0.98 (0, 3)\t1\n (0, 10)\t1
It is possible to convert your sparse matrix to a dense NumPy array, and then your pandas DataFrame approach will work; however, if your dataset is huge this is extremely costly. To keep it sparse, you can do:
from scipy import sparse
train_set = sparse.hstack([sparse.csr_matrix(x['value']).reshape(-1, 1), traindataset])
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,x['label'])
Also check out the help page for scipy.sparse.
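At prediction time the same pattern would apply; here is a hedged sketch (x_test is a hypothetical frame with the same Topics/value columns, not something from the question) that reuses the already-fitted vectorizer instead of refitting it:
# transform (not fit_transform) the new text with the vectorizer fitted above
testdataset = countvector.transform(x_test['Topics'])
# hstack the numeric column the same way as for training, then predict
test_set = sparse.hstack([sparse.csr_matrix(x_test['value']).reshape(-1, 1), testdataset])
print(randomclassifier.predict(test_set))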

Python -- mismatch of columns between training data and prediction data

In my model I have figured out that I made a mistake by not dropping a column from the prediction dataset.
This column, yclass, is not present in the training dataset, but it is present in my prediction dataset.
I wasn't aware of that mistake, but now I am confused: why is it still giving me predictions with that extra column? Shouldn't it raise some kind of error because of it? I have seen the examples of one-hot-encoding-related training/test data inconsistency and the solution to that problem, but this is a new case that I have no idea about. Here is the final part of my code; maybe I am making a mistake with the pipelines etc.
lgbr = LGBMRegressor(learning_rate= 0.1, max_depth= 18, n_estimators= 50, num_leaves= 11)
lgbc = LGBMClassifier(learning_rate = 0.1, max_depth = 18, n_estimators = 100, num_leaves = 51)
numeric_pipe = make_pipeline(MinMaxScaler(feature_range = (-1,1)))
categoric_pipe = make_pipeline(OneHotEncoder(sparse = False, handle_unknown='ignore'))
preprocessor = ColumnTransformer(transformers = [('num',numeric_pipe, num_cols), ('cat',categoric_pipe,cat_cols)])
regr_pipe_final = make_pipeline(preprocessor, lgbr)
regr_pipe_final.fit(df_x_regr, df_y_regr.values.ravel())
class_pipe_final = make_pipeline(preprocessor, lgbc)
class_pipe_final.fit(df_x, df_y_class.values.ravel())
pred_final = pd.DataFrame()
for key in list(mi.unique_everseen(pred_set['from'] + pred_set['to'])):
    pred_val_list = []
    pred_subset = pred_set[(pred_set['from'] + pred_set['to']) == key]
    lag = 0
    for i in range(0, predmonths):
        pred = pred_subset.iloc[[i], :]
        class_val = class_pipe_final.predict(pred)
        regr_val = regr_pipe_final.predict(pred)
I am making row-wise predictions to generate a moving-forecast effect; that's why I use for loops for the predictions.
As a final summary, the problem is that "pred" contains one extra column, named "yclass". How is my pipeline accepting that column as an input? Or is it just ignoring it?
If I understand your question and your code correctly, and assuming yclass is one of the one-hot-encoded columns, you have a parameter there:
handle_unknown='ignore'
which tells the encoder to ignore values it has not seen in the data.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
handle_unknown : ‘error’ or ‘ignore’, default=’error’.
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
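For what it's worth, here is a minimal sketch (a toy column, not the asker's data) of what handle_unknown='ignore' does during transform: a category that was not seen at fit time is encoded as all zeros instead of raising an error.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
enc.fit(pd.DataFrame({'colour': ['red', 'blue']}))
print(enc.transform(pd.DataFrame({'colour': ['red', 'green']})))
# [[0. 1.]   'red' maps to its own one-hot column
#  [0. 0.]]  'green' was never seen during fit, so every column is zero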

ValueError: X has 29 features per sample; expecting 84

I am working on a script using the Lending Club API to predict whether a loan will "pay in full" or "charge off". To do this I am using scikit-learn to build the model, which is persisted using joblib. I run into a ValueError due to a difference between the number of columns in the persisted model and the number of columns in the new raw data. The ValueError is caused by creating dummy variables for the categorical variables. The number of columns used in the model is 84, and in this example the number of columns using the new data is 29.
The number of columns needs to be 84 for the new data when making dummy variables, but I am not sure how to proceed, since only a subset of all possible values of the categorical variables 'homeOwnership', 'addrState', and 'purpose' is present when obtaining new data from the API.
Here's the code I am testing at the moment starting at the point where the categorical variables are transformed into dummy variables and stopping at model implementation.
#......continued
df['mthsSinceLastDelinq'].notnull().astype('int')
df['mthsSinceLastRecord'].notnull().astype('int')
df['grade_num'] = df['grade'].map({'A':0,'B':1,'C':2,'D':3})
df['emp_length_num'] = df['empLength']
df = pd.get_dummies(df,columns=['homeOwnership','addrState','purpose'])
# df = pd.get_dummies(df,columns=['home_ownership','addr_state','verification_status','purpose'])
# step 3.5 transform data before making predictions
df.drop(['id','grade','empLength','isIncV'],axis=1,inplace=True)
dfbcd = df[df['grade_num'] != 0]
scaler = StandardScaler()
x_scbcd = scaler.fit_transform(dfbcd)
# step 4 predicting
lrbcd_test = load('lrbcd_test.joblib')
ypredbcdfinal = lrbcd_test.predict(x_scbcd)
Here's the error message
ValueError Traceback (most recent call last)
<ipython-input-239-c99611b2e48a> in <module>
11 # change name of model and file name
12 lrbcd_test = load('lrbcd_test.joblib')
---> 13 ypredbcdfinal = lrbcd_test.predict(x_scbcd)
14
15 #add model
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
287 Predicted class label per sample.
288 """
--> 289 scores = self.decision_function(X)
290 if len(scores.shape) == 1:
291 indices = (scores > 0).astype(np.int)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
268 if X.shape[1] != n_features:
269 raise ValueError("X has %d features per sample; expecting %d"
--> 270 % (X.shape[1], n_features))
271
272 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 29 features per sample; expecting 84
Your new data should have the exact same columns as the data that you used to train and persist your original model. And if the number of unique values of the categorical variables is smaller in the newer data, manually add columns for those variables after doing pd.get_dummies() and set them to zero for all the data points.
The model will work only when it gets the required number of columns. If pd.get_dummies fails to create all those columns on the newer data, you should do it manually.
Edit
If you want to automatically insert the missing columns after the pd.get_dummies() step, you can use the following approach.
Assuming that df_newdata is the dataframe you get after applying pd.get_dummies() to the new dataset, and df_olddata is the dataframe you got when you applied pd.get_dummies() to the older dataset (which was used for training), you can simply do this:
df_newdata = df_newdata.reindex(labels=df_olddata.columns, axis=1)
This will automatically create the missing columns in df_newdata (in comparison to df_olddata) and set the values of these columns to NaN for all the rows. So now your new dataframe has the same exact columns as the original dataframe had.
Hope this helps
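If you would rather have those newly created columns filled with zeros (matching the "set them to zero" advice above) instead of NaN, reindex also accepts a fill_value; a small hedged variant of the same line:
df_newdata = df_newdata.reindex(columns=df_olddata.columns, fill_value=0)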
Just use transform instead of fit_transform. This should do the trick. Hope it helps.
x_scbcd = scaler.transform(dfbcd)
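For transform to reproduce the training-time scaling, the scaler that was fitted on the training data has to be available at prediction time. One hedged way to do that, mirroring how the model itself is persisted with joblib in the question (the scaler file name is purely illustrative):
from joblib import dump, load

# at training time, after fitting the scaler on the training features
dump(scaler, 'scaler_bcd.joblib')        # hypothetical file name

# at prediction time, alongside load('lrbcd_test.joblib')
scaler = load('scaler_bcd.joblib')
x_scbcd = scaler.transform(dfbcd)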
Could you try using the transform method of the StandardScaler object on your testing data before passing it to lrbcd_test.predict? This will create a feature representation of your testing data:
ypredbcdfinal = lrbcd_test.predict(scaler.transform(dfbcd))
If predict gives an error there, we get it error-free with:
pred_1 = Model_1.predict(tfidf_train)
cr1 = accuracy_score(y_train, pred_1)

NotFittedError: Estimator not fitted, call `fit` before exploiting the model

I am running Python 3.5.2 on a MacBook with OS X 10.12.1 (Sierra).
While attempting to run some code for the Titanic Dataset from Kaggle, I keep getting the following error:
NotFittedError Traceback (most recent call
last) in ()
6
7 # Make your prediction using the test set and print them.
----> 8 my_prediction = my_tree_one.predict(test_features)
9 print(my_prediction)
10
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/tree/tree.py
in predict(self, X, check_input)
429 """
430
--> 431 X = self._validate_X_predict(X, check_input)
432 proba = self.tree_.predict(X)
433 n_samples = X.shape[0]
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/tree/tree.py
in _validate_X_predict(self, X, check_input)
386 """Validate X whenever one tries to predict, apply, predict_proba"""
387 if self.tree_ is None:
--> 388 raise NotFittedError("Estimator not fitted, "
389 "call fit before exploiting the model.")
390
NotFittedError: Estimator not fitted, call fit before exploiting the
model.
The offending code seems to be this:
# Impute the missing value with the median
test.Fare[152] = test.Fare.median()
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
And here is a link to the rest of the code.
Since I already have called the 'fit' function, I cannot understand this error message. Where am I going wrong? Thanks for your time.
Edit:
Turns out that the problem is inherited from the previous block of code.
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
With the line:
my_tree_one = my_tree_one.fit(features_one, target)
generating the error:
ValueError: Input contains NaN, infinity or a value too large for
dtype('float32').
The error is self-explanatory: either the features_one or the target array contains NaNs or infinite values, so the estimator fails to fit, and therefore you cannot use it for prediction later.
Check those arrays and treat the NaN values accordingly before fitting.
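A minimal hedged sketch of that check, assuming features_one is a numeric NumPy array built like test_features above, and that the training DataFrame is named train (an assumption, since that part of the code is not shown here):
import numpy as np

print(np.isnan(features_one).any(axis=0))    # which feature columns contain NaN
print(np.isnan(target.astype(float)).any())  # whether the target contains NaN

# one possible treatment, mirroring the median fill used for test.Fare above;
# "Age" is named here as an assumption about the offending column
train["Age"] = train["Age"].fillna(train["Age"].median())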

Python, ValueError, Broadcast Error with SKLearn Preprocessing

I am trying to run the SKLearn preprocessing StandardScaler and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test datasets are different lengths, so how should I transform the data?
Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects any two datasets to have the same number of columns. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc) is the source of your problem. Whatever transformations you make to the t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59.
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you remove any columns from the test set prior to attempting to use StandardScaler?
What do you mean by "train and test data set are different lengths"? How did you obtain your training data?
If your testing data has more features than your training data, then in order to efficiently reduce the dimensionality of your testing data you should know how your training data was formulated, for example using a dimensionality reduction technique (PCA, SVD, etc.) or something like that. If that is the case, you have to multiply each testing vector by the same matrix that was used to reduce the dimensionality of your training data.
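A hedged sketch of that last point using PCA (the shapes are illustrative, not the asker's data): the projection is fitted on the training data once, and the very same fitted object, not a refit, is applied to the test data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 119)
X_test = rng.rand(200, 119)

pca = PCA(n_components=59).fit(X_train)   # learn the projection on the training data only
print(pca.transform(X_train).shape)       # (1000, 59)
print(pca.transform(X_test).shape)        # (200, 59) -- the same 59 components for both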
The time series was in a format with time as the columns and the data in the rows. I did the following before the originally posted code:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck', for some reason...
t_train's shape is (x, 119), whereas t_test's shape is (40000, 59).
If you want to use the same scaler object for the transformation, then your data must always have the same number of columns.
Since you fit the scaler on t_train, that is why you get the error when you try to transform t_test.
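Tying the comments above together, a hedged sketch of the fix (note that transpose() returns a new object rather than modifying the data in place, which may be why the change did not seem to 'stick'): assign the transposed data back, check that both sets end up with the same number of columns, and only then fit and reuse the scaler.
# transpose() does not work in place, so assign the result back
t_train = t_train.transpose()
t_test = t_test.transpose()

# the same scaler can only be reused if the column counts match
assert t_train.shape[1] == t_test.shape[1], (t_train.shape, t_test.shape)

scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)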
