I am working on a script using the Lending Club API to predict whether a loan will "pay in full" or "charge off". I built the model with scikit-learn and persisted it with joblib. I run into a ValueError because the number of columns the persisted model expects differs from the number of columns in the new raw data. The mismatch comes from creating dummy variables for the categorical variables. The model was trained on 84 columns, and in this example the new data has 29.
The new data needs to have 84 columns after making dummy variables, but I am not sure how to proceed, since only a subset of all possible values for the categorical variables 'homeOwnership', 'addrState', and 'purpose' is present when obtaining new data from the API.
Here's the code I am testing at the moment, starting at the point where the categorical variables are transformed into dummy variables and stopping at model implementation.
#......continued
df['mthsSinceLastDelinq'].notnull().astype('int')
df['mthsSinceLastRecord'].notnull().astype('int')
df['grade_num'] = df['grade'].map({'A':0,'B':1,'C':2,'D':3})
df['emp_length_num'] = df['empLength']
df = pd.get_dummies(df,columns=['homeOwnership','addrState','purpose'])
# df = pd.get_dummies(df,columns=['home_ownership','addr_state','verification_status','purpose'])
# step 3.5 transform data before making predictions
df.drop(['id','grade','empLength','isIncV'],axis=1,inplace=True)
dfbcd = df[df['grade_num'] != 0]
scaler = StandardScaler()
x_scbcd = scaler.fit_transform(dfbcd)
# step 4 predicting
lrbcd_test = load('lrbcd_test.joblib')
ypredbcdfinal = lrbcd_test.predict(x_scbcd)
Here's the error message
ValueError Traceback (most recent call last)
<ipython-input-239-c99611b2e48a> in <module>
11 # change name of model and file name
12 lrbcd_test = load('lrbcd_test.joblib')
---> 13 ypredbcdfinal = lrbcd_test.predict(x_scbcd)
14
15 #add model
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
287 Predicted class label per sample.
288 """
--> 289 scores = self.decision_function(X)
290 if len(scores.shape) == 1:
291 indices = (scores > 0).astype(np.int)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
268 if X.shape[1] != n_features:
269 raise ValueError("X has %d features per sample; expecting %d"
--> 270 % (X.shape[1], n_features))
271
272 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 29 features per sample; expecting 84
Your new data should have exactly the same columns as the data that you used to train and persist your original model. If the new data contains fewer unique values of the categorical variables, manually add the missing dummy columns after calling pd.get_dummies() and set them to zero for all the data points.
The model will work only when it gets the required number of columns. If pd.get_dummies fails to create all those columns on the newer data, you should do it manually.
Edit
If you want to automatically insert the missing columns after the pd.get_dummies() step, you can use the following approach.
Assuming that df_newdata is the dataframe after applying pd.get_dummies() to the new dataset and df_olddata is the dataframe that you got when you applied pd.get_dummies() to the older dataset (which was used for training), you can simply do this:
df_newdata = df_newdata.reindex(columns=df_olddata.columns, fill_value=0)
This will automatically create the missing columns in df_newdata (in comparison to df_olddata). By default reindex fills these columns with NaN for all rows; passing fill_value=0 sets them to zero instead, as described above. Any column that exists only in df_newdata is dropped, so the new dataframe now has exactly the same columns, in the same order, as the original dataframe.
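As a minimal sketch of the reindex approach, assuming the training column list was kept (or persisted) when the model was built. The tiny dataframes and the column name loanAmnt are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical training data: 'purpose' takes three different values.
train = pd.DataFrame({
    "loanAmnt": [1000, 2000, 1500],
    "purpose": ["car", "house", "medical"],
})
# Hypothetical new data from the API: only one 'purpose' value is present.
new = pd.DataFrame({"loanAmnt": [1200], "purpose": ["car"]})

train_dummies = pd.get_dummies(train, columns=["purpose"])
# Keep this list next to the persisted model so it is available at prediction time.
training_columns = train_dummies.columns

new_dummies = pd.get_dummies(new, columns=["purpose"])
# Add the missing dummy columns filled with 0 and match the training column order.
new_aligned = new_dummies.reindex(columns=training_columns, fill_value=0)

print(list(new_aligned.columns))
```

After this step new_aligned has one column per training column, so the model receives the number of features it expects.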
Hope this helps
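Use transform instead of fit_transform; this should do the trick. Note that the scaler must already be fitted on the training data, since calling transform on a freshly constructed StandardScaler raises a NotFittedError. Hope it helps.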
x_scbcd = scaler.transform(dfbcd)
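One way to make sure the same fitted scaler is available at prediction time is to persist it together with the model. The following is only a sketch under that assumption: the synthetic data, the pipeline layout, and the file name pipeline_sketch.joblib are illustrative, not the asker's actual setup.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))          # synthetic training features
y_train = (X_train[:, 0] > 0).astype(int)   # synthetic binary target
X_new = rng.normal(size=(5, 3))             # synthetic "new" data, same 3 columns

# The scaler is fitted once, on the training data, as part of the pipeline.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# Persist scaler and model as one object (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "pipeline_sketch.joblib")
dump(pipeline, path)

# At prediction time: load and predict; the stored scaler only transforms.
restored = load(path)
predictions = restored.predict(X_new)
print(predictions)
```

This does not fix a column-count mismatch by itself; the new data still needs the same columns as the training data.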
Could you try calling the transform method of scaler (the StandardScaler object) on your new data, dfbcd, and passing the result to the model lrbcd_test? This will create a scaled feature representation of your testing data.
ypredbcdfinal = lrbcd_test.predict(scaler.transform(dfbcd))
We got the error on the predict call.
It runs without error with:
pred_1 = Model_1.predict(tfidf_train)
Cr1 = accuracy_score(y_train, pred_1)
Related
I have a dataset of 284 features I am trying to impute using scikit-learn, however I get an error where the number of features changes to 283:
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(data.iloc[:,0:284])
df[:,0:284] = imputer.transform(df[:,0:284])
X = MinMaxScaler().fit_transform(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-150-849be5be8fcb> in <module>
1 imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
2 imputer = imputer.fit(data.iloc[:,0:284])
----> 3 df[:,0:284] = imputer.transform(df[:,0:284])
4 X = MinMaxScaler().fit_transform(df)
~\Anaconda3\envs\environment\lib\site-packages\sklearn\impute\_base.py in transform(self, X)
411 if X.shape[1] != statistics.shape[0]:
412 raise ValueError("X has %d features per sample, expected %d"
--> 413 % (X.shape[1], self.statistics_.shape[0]))
414
415 # Delete the invalid columns if strategy is not constant
ValueError: X has 283 features per sample, expected 284
I don't understand how this is reaching 283 features, I assume on fitting it's finding features that have all 0s or something and deciding to drop that, but I can't find documentation which tells me how to make sure those features are still kept. I am not a programmer so not sure if I am missing something else that's obvious or if I'm better looking into another method?
This could happen if you have a feature without any values, from https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html:
'Columns which only contained missing values at fit are discarded upon transform if strategy is not “constant”'.
You can tell if this is indeed the problem by using a high 'verbose' value when constructing the imputer (this parameter exists only in older scikit-learn releases; it was deprecated in version 1.1):
sklearn.impute.SimpleImputer(..., verbose=100,...)
It will print something like:
UserWarning: Deleting features without observed values: [ ... ]
I was dealing with the same situation and I solved it by adding this transformation before the SimpleImputer with the mean strategy:
imputer = SimpleImputer(strategy = 'constant', fill_value = 0)
df_prepared_to_mean_or_anything_else = imputer.fit_transform(previous_df)
What does it do? It fills every missing value with the value specified by the fill_value parameter.
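To check whether an all-missing column is what reduces the feature count, a small sketch along these lines may help. The three-column dataframe is invented for illustration, and filling the empty column with 0.0 is just one possible choice:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: column "b" has no observed values at all.
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

# Columns that contain only missing values.
empty_columns = list(data.columns[data.isna().all()])
print(empty_columns)

# With the mean strategy the empty column is discarded: 3 columns in, 2 out.
imputed = SimpleImputer(strategy="mean").fit_transform(data)
print(imputed.shape)

# One option: give the empty columns an explicit value first, then impute the rest.
filled = data.copy()
filled[empty_columns] = 0.0
imputed_all = SimpleImputer(strategy="mean").fit_transform(filled)
print(imputed_all.shape)
```

Recent scikit-learn releases also offer a keep_empty_features parameter on SimpleImputer; whether it is available depends on the installed version.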
First, my setup:
X is my feature table. It has 150 000 features and 96 samples, so 150 000 columns and 96 rows.
y is my target table. It has 4 labels and of course 96 samples, so 4x96 (columns x rows).
After splitting into train and test data I'm using MLPRegressor. According to the scikit-learn documentation it is a native multi-output regressor, so I can use it to predict my four desired output values for a new sample of 150 000 features.
My code:
mlp = MLPRegressor(hidden_layer_sizes=(2000, 2000), solver= 'lbfgs', max_iter=100)
mlp.fit(X_train,y_train)
And then I'm using cross validation.
cross_validation.cross_val_score(mlp, X, y, scoring='r2')
The output is a list with 3 entries (parameter cv=3).
I don't really understand how my 4 labels are represented by these 3 values.
I expected something in a format like this:
label 1: 3 entries, label 2: 3 entries, and the same for labels 3 and 4.
That is, I would get the R^2 value for each of my labels three times, once for each split into test and train data.
Am I missing something? Do I need to use MultiOutputRegressor?
(See doc here)
And here is the documentation of cross validation.
Thanks.
First, if you are actually using cross_validation.cross_val_score(), you should replace it with model_selection.cross_val_score(). The cross_validation module was deprecated and has been removed from recent versions of scikit-learn.
Now, the reason you are getting a single score per fold rather than one entry per output is the default behaviour of the scorer.
You have used the scoring 'r2', which is documented here. The underlying r2_score function has a parameter that changes the result when the input is multi-output (as in your case):
multioutput :
Defines aggregating of multiple output scores. Array-like value
defines weights used to average scores. Default is “uniform_average”.
‘raw_values’ : Returns a full set of scores in case of multioutput
input.
‘uniform_average’ : Scores of all outputs are averaged with uniform
weight.
‘variance_weighted’ : Scores of all outputs are averaged, weighted by
the variances of each individual output.
You see that the default value is 'uniform_average', which just averages all the outputs to get a single value, which is what you are getting.
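If per-output scores are wanted for each fold, one possible sketch is to loop over the folds manually and call r2_score with multioutput='raw_values'. The synthetic data and the use of LinearRegression as a fast stand-in for MLPRegressor are assumptions made only to keep the example small; per_output_r2 is a hypothetical helper name.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data: 96 samples, 10 features, 4 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(96, 10))
W = rng.normal(size=(10, 4))
y = X @ W + 0.1 * rng.normal(size=(96, 4))


def per_output_r2(model, X, y, n_splits=3):
    """Return an array of shape (n_splits, n_outputs) with one R^2 per fold and output."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(X):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        predicted = fitted.predict(X[test_idx])
        # 'raw_values' keeps one score per output instead of averaging them.
        scores.append(r2_score(y[test_idx], predicted, multioutput="raw_values"))
    return np.vstack(scores)


fold_scores = per_output_r2(LinearRegression(), X, y)
print(fold_scores.shape)         # rows are folds, columns are outputs
print(fold_scores.mean(axis=1))  # the uniform average per fold
```

Averaging each row reproduces the 'uniform_average' behaviour, which is why cross_val_score returns only one number per fold.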
I am running Python 3.5.2 on a Macbook OSX 10.2.1 (Sierra).
While attempting to run some code for the Titanic Dataset from Kaggle, I keep getting the following error:
NotFittedError Traceback (most recent call
last) in ()
6
7 # Make your prediction using the test set and print them.
----> 8 my_prediction = my_tree_one.predict(test_features)
9 print(my_prediction)
10
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/tree/tree.py
in predict(self, X, check_input)
429 """
430
--> 431 X = self._validate_X_predict(X, check_input)
432 proba = self.tree_.predict(X)
433 n_samples = X.shape[0]
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/tree/tree.py
in _validate_X_predict(self, X, check_input)
386 """Validate X whenever one tries to predict, apply, predict_proba"""
387 if self.tree_ is None:
--> 388 raise NotFittedError("Estimator not fitted, "
389 "call fit before exploiting the model.")
390
NotFittedError: Estimator not fitted, call fit before exploiting the
model.
The offending code seems to be this:
# Impute the missing value with the median
test.loc[152, "Fare"] = test["Fare"].median()
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
And here is a link to the rest of the code.
Since I already have called the 'fit' function, I cannot understand this error message. Where am I going wrong? Thanks for your time.
Edit:
Turns out that the problem is inherited from the previous block of code.
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
With the line:
my_tree_one = my_tree_one.fit(features_one, target)
generating the error:
ValueError: Input contains NaN, infinity or a value too large for
dtype('float32').
The error is self-explanatory: either the features_one or the target array contains NaNs or infinite values, so the estimator fails to fit and therefore you cannot use it for prediction later.
Check those arrays and treat NaN values accordingly before fitting.
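A minimal sketch of such a check, assuming a small made-up training frame with the Titanic-style column names and median imputation as one possible treatment:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows in the style of the Titanic training set, with two missing values.
train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [71.3, 7.9, np.nan, 8.1],
    "Survived": [1, 0, 1, 0],
})
feature_names = ["Pclass", "Age", "Fare"]

# Count missing values per feature before fitting.
nan_counts = train[feature_names].isna().sum()
print(nan_counts)

# One possible treatment: replace missing values with the column median.
train[feature_names] = train[feature_names].fillna(train[feature_names].median())

features_one = train[feature_names].values
target = train["Survived"].values

# Fit only once the inputs are confirmed to be finite.
if not np.isfinite(features_one).all():
    raise ValueError("features_one still contains NaN or infinite values")
my_tree_one = DecisionTreeClassifier(random_state=0).fit(features_one, target)
print(my_tree_one.predict(features_one))
```

The same imputation has to be applied to the test features before calling predict.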
I'm wondering how the following case is possible:
def fit(self, train, target):
    xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
    self.model = xgb.train(self.params, xgtrain, self.num_rounds)
I passed the train dataset as a csr_matrix with 5233 columns, and after converting it to a DMatrix I got 5322 features.
Later, at the predict step, I got an error caused by this mismatch :(
def predict(self, test):
    if not self.model:
        return -1
    xgtest = xgb.DMatrix(test)
    return self.model.predict(xgtest)
Error: ... training data did not have the following fields: f5232
How can I guarantee that my train/test datasets are converted to DMatrix correctly?
Is there any way to do something in Python similar to this R code?
# get same columns for test/train sparse matrixes
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
My approach doesn't work, unfortunately:
def _normalize_columns(self):
    columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
        (set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
    for item in columns:
        if item in self.xgtest.feature_names:
            self.xgtest.feature_names.remove(item)
        else:
            # seems, it's immutable structure and can not add any new item!!!
            self.xgtest.feature_names.append(item)
Another possibility is that one feature level appears exclusively in the training data and not in the testing data. This mostly happens after one-hot encoding, which produces a large matrix with one column per level of each categorical feature. In your case it looks like "f5232" is exclusive to either the training or the test data. In either case, model scoring is likely to throw an error (in most implementations of ML packages) because:
If exclusive to training: the model object will reference this feature in the model equation. While scoring, it will throw an error saying that it cannot find this column.
If exclusive to test (less likely, as test data is usually smaller than training data): the model object will NOT reference this feature in the model equation. While scoring, it will throw an error saying that it received this column but the model equation does not have it. This is also less likely because most implementations are cognizant of this case.
Solutions:
The best "automated" solution is to keep only those columns that are common to both training and test data after one-hot encoding.
For ad hoc analysis, if you cannot afford to drop a level of a feature because of its importance, use stratified sampling to ensure that every level of the feature is distributed across both training and test data.
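The "keep only the common columns" solution can be sketched in pandas as a rough analogue of the R snippet in the question. It assumes the encoded matrices are held as DataFrames with named columns; the column names f0 to f3 and the values are made up:

```python
import pandas as pd

# Made-up one-hot encoded frames: f1 exists only in train, f3 only in test.
X_train = pd.DataFrame({"f0": [1, 0], "f1": [0, 1], "f2": [1, 1]})
X_test = pd.DataFrame({"f0": [1], "f2": [0], "f3": [1]})

# Columns present in both frames, kept in the training order.
col_order = [column for column in X_train.columns if column in X_test.columns]

X_train_common = X_train[col_order]
X_test_common = X_test[col_order]

print(col_order)
```

This has to happen before training, so that the model is fitted on X_train_common and later scored on X_test_common.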
This situation can happen after one-hot encoding. For example,
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

ar = np.array([
    [1, 2],
    [1, 0]
])
enc = OneHotEncoder().fit(ar)
ar2 = enc.transform(ar)
b = np.array([[1, 0]])
b2 = enc.transform(b)
xgb_ar = xgb.DMatrix(ar2)
xgb_b = xgb.DMatrix(b2)
print(b2.shape) # (1, 3)
print(xgb_b.num_col()) # 2
So, when a sparse matrix has an all-zero column, DMatrix drops this column (I think because such a column is useless for XGBoost).
Usually, I add a fake row to the matrix that contains 1 in all columns.
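The fake-row workaround could look roughly like this with scipy.sparse. The matrix b2 below is a hand-written stand-in for the encoder output in the example, and the DMatrix construction itself is left out:

```python
import numpy as np
from scipy import sparse

# Stand-in for the encoded test row: the last column holds no non-zero value.
b2 = sparse.csr_matrix(np.array([[1.0, 1.0, 0.0]]))

# Fake row with a 1 in every column, appended at the bottom.
ones_row = sparse.csr_matrix(np.ones((1, b2.shape[1])))
b2_padded = sparse.vstack([b2, ones_row], format="csr")

# Number of stored non-zero entries per column after padding.
nonzero_per_column = np.diff(b2_padded.tocsc().indptr)

print(b2_padded.shape)
print(nonzero_per_column)
```

The prediction produced for the appended row is meaningless and should be discarded afterwards.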
Such an issue occurred for me when RandomUnderSampler (RUS) method returned a np.array rather than a Pandas DataFrame with column names.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X_train, y_train)
I resolved the issue with this:
X_rus = pd.DataFrame(X_rus, columns = X_train.columns)
Basically, this takes the output of the RUS method and creates a Pandas DataFrame from it, using the column names of the original X_train data that was the input to the RUS method.
This can be generalized to any similar problem where XGBoost expects to read column names but cannot. Just create a Pandas DataFrame and assign the column names accordingly.
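As a small illustration of that general pattern without any resampling library, the sketch below uses a plain row selection as a stand-in for whatever step returns a bare array. The data and the sampled_rows indices are invented:

```python
import numpy as np
import pandas as pd

# Made-up training frame with named columns.
X_train = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [30.0, 52.5, 41.0, 77.2],
})

# Stand-in for a step that returns a bare NumPy array and loses the column names.
sampled_rows = [0, 2]  # hypothetical indices chosen by the sampler
X_resampled_array = X_train.to_numpy()[sampled_rows]

# Restore the column names from the original frame.
X_resampled = pd.DataFrame(X_resampled_array, columns=X_train.columns)

print(X_resampled.columns.tolist())
```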
I am trying to run the scikit-learn preprocessing StandardScaler and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test data sets have different lengths, so how should I transform the data?
Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects any two datasets to have the same number of columns. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc) is the source of your problem. Whatever transformations you make to the t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59.
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you remove any columns from the test set (or add any to the training set) prior to attempting to use StandardScaler?
What do you mean by "train and test data set are different lengths"?? How did you obtain your training data?
If your testing data have more features than your training data, then in order to reduce the dimensionality of your testing data you need to know how your training data were produced, for example by a dimensionality reduction technique (PCA, SVD, etc.). If that is the case, you have to multiply each testing vector by the same matrix that was used to reduce the dimensionality of your training data.
The time series was in the format with time as the columns and data in the rows. I did the following before the original posted code:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck' for some reason...
t_train's shape is (x, 119), whereas t_test's shape is (40000, 59).
If you want to use the same scaler object for the transformation, your data must always have the same number of columns.
Since you fit the scaler on t_train, you get this error when you try to transform t_test.
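A short sketch of checking the column counts before scaling, assuming NumPy arrays with made-up sizes. Note that transpose() returns a new object rather than changing the array in place, so its result has to be assigned:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
t_train = rng.normal(size=(200, 119))    # made-up training data, 119 columns
t_test_raw = rng.normal(size=(119, 50))  # made-up test data with features in rows

# transpose() does not modify t_test_raw; keep the returned array.
t_test = t_test_raw.transpose()

# Both sets must have the same number of columns before using one scaler.
if t_train.shape[1] != t_test.shape[1]:
    raise ValueError(
        "column mismatch: %d vs %d" % (t_train.shape[1], t_test.shape[1])
    )

scaler = StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)

print(t_train_scale.shape, t_test_scale.shape)
```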