Feature Names Mismatch when Passing X_test to .predict() Function (Again, Still) - python

Ok I'm still having this issue and I'm at a loss as to where I'm going wrong. I thought I had a working solution, but I was wrong.
After finding a regression pipeline through TPOT, I go to use the .predict(X_test) function and I get the following error message:
ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118
I read somewhere on GitHub that XGBoost prefers the X features passed to it as a NumPy array rather than a Pandas DataFrame. So I did that, and now I receive this error message whenever a RandomForestRegressor ends up in my pipeline.
So I investigate:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed, shuffle=False)
# Here is where I convert the features to NumPy arrays
X_train = X_train.values
X_test = X_test.values
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
[INFO] Printing the shapes of the training/testing feature/label sets...
(1366, 117)
(456, 117)
(1366,)
(456,)
# Notice 117 columns for the X feature sets...
# Now print the X_test shape just before the predict function...
print(X_test.shape)
(456, 117)
# Still 117 columns, so call predict:
predictions = best_model.predict(X_test)
ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118
WHY!!!!!!?????
Now the tricky thing is that I'm using a custom tpot_config which only allows the regressors XGBRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, DecisionTreeRegressor, and RandomForestRegressor. So I need a way to train and predict that works the same way for all of them, so that no matter what pipeline TPOT comes up with, I won't hit this issue every time I run my code!
Similar questions have been asked on SO, but I don't understand why my model is not predicting when I AM passing it the same number of (X) features as was used to train the model! Where am I going wrong here?
EDIT
I should also mention that leaving the features as DataFrames and not converting them to NumPy arrays sometimes gives me a "feature names mismatch" error when XGBRegressor is in the pipeline as well. So I'm at a loss as to how to handle both the tree regressors (which like DataFrames) and XGBoost (which likes NumPy arrays). I have also tried re-arranging the columns to make sure that the X_train and X_test DataFrames are in the same order, as some have suggested, but that didn't do anything.
I have posted my full code in a Google Colab notebook here, where you can leave comments on it. How can I pass the testing data to the .predict() function no matter what pipeline TPOT comes up with?

Thanks to weixuanfu at GitHub, I may have found a solution: moving the feature_importance code section down to the bottom of my code, and yes, using NumPy arrays for the features. If I run into this issue again, I will post it below:
https://github.com/EpistasisLab/tpot/issues/738
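For anyone who lands here later, this is roughly what the working version looks like. It's a minimal sketch using the variable names from my code above (best_model is the fitted pipeline returned by TPOT, and the exact pipeline will vary from run to run):

from sklearn.model_selection import train_test_split

# split first, then convert everything to NumPy arrays once, so every
# regressor TPOT might pick (XGBoost or the sklearn trees) sees the same input
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed, shuffle=False)
X_train, X_test = X_train.values, X_test.values
Y_train, Y_test = Y_train.values, Y_test.values

# fit and predict with that same array representation
best_model.fit(X_train, Y_train)
predictions = best_model.predict(X_test)

# any feature_importances_ inspection now happens only after this point,
# so it can't add or reorder columns before .predict() is called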

Related

Multiple Linear Regression Using Scikit-learn Error

I'm relatively new to Python and I am trying to build a multiple linear regression model with two predictor variables and one dependent variable. While researching this, I found that scikit-learn provides a class to do it. I tried to fit a model for my variables and I got the following message:
Shape of passed values is (3, 1), indices imply (2, 1)
The code I've used is:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

data = pd.read_csv('data.csv', delimiter=',', header=0)
SEED_VALUE = 12356789
np.random.seed(SEED_VALUE)
data_train, data_test = train_test_split(data, test_size=0.3, random_state=SEED_VALUE)
print('Train size: {}'.format(data_train.shape[0]))
print('Test size: {}'.format(data_test.shape[0]))
data_train_X = data_train.values[:,0:2] #predictor variables
data_train_Y = data_train.values[:,2].astype('float') # dependant
model = linear_model.LinearRegression()
np.random.seed(SEED_VALUE)
model.fit(data_train_X, data_train_Y)
coef = pd.DataFrame([model.intercept_, *model.coef_], ['(Intercept)', *data_train_X.columns], columns=['Coefficients'])
coef
I get the error on the model.fit(data_train_X, data_train_Y) line. I have searched online for different ways to use the scikit-learn method and found that other people with the same code had no error, so I don't know where my mistake could be.
Thank you all so much
The data file is like this:
"retardation","distrust","degree"
2.80,6.1,44
3.10,5.1,25
2.59,6.0,10
3.36,6.9,28
2.80,7.0,25
3.35,5.6,72
2.99,6.3,45
2.99,7.2,25
2.92,6.9,12
3.23,6.5,24
3.37,6.8,46
2.72,6.6, 8
It would be great if you could also provide the input data, but even without it, the problem is most likely that you are using an index column from your file. You should remove it and it will be fine.
If you can give an example of the data you use (columns), I will be able to check it further.
I reran your code with the data you've provided and it does not show any errors on the model.fit(data_train_X, data_train_Y) line :)
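For what it's worth, here is a minimal sketch of how I would set it up with the sample data you posted (the column names 'retardation', 'distrust' and 'degree' are assumed from your file). Keeping the predictors as a DataFrame also keeps the column names available for the coefficient table:

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

SEED_VALUE = 12356789
data = pd.read_csv('data.csv', delimiter=',', header=0)
data_train, data_test = train_test_split(data, test_size=0.3, random_state=SEED_VALUE)

X_cols = ['retardation', 'distrust']                 # predictor columns (assumed)
data_train_X = data_train[X_cols]                    # stays a DataFrame, so .columns still exists
data_train_Y = data_train['degree'].astype('float')  # dependent variable

model = linear_model.LinearRegression()
model.fit(data_train_X, data_train_Y)

coef = pd.DataFrame([model.intercept_, *model.coef_],
                    index=['(Intercept)', *data_train_X.columns],
                    columns=['Coefficients'])
print(coef)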

UserWarning: X does not have valid feature names, but Linear Regression was fitted with feature names [duplicate]

I'm getting the following warning after upgrading to version 1.0 of scikit-learn:
UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
I cannot find in the docs what a "valid feature name" is. How do I deal with this warning?
I got the same warning message with another sklearn model. I realized it was showing up because I fitted the model with data in a DataFrame, and then used only the values to predict. Once I fixed that, the warning disappeared.
Here is an example:
model_reg.fit(scaled_x_train, y_train[vp].values)
data_pred = model_reg.predict(scaled_x_test.values)
This first snippet raised the warning because scaled_x_train is a DataFrame with feature names, while scaled_x_test.values contains only the values, without feature names. Then I changed it to this:
model_reg.fit(scaled_x_train.values, y_train[vp].values)
data_pred = model_reg.predict(scaled_x_test.values)
And now there are no more warnings in my code.
I was getting a very similar error, but with DecisionTreeClassifier, on fit and predict.
Initially I was passing a DataFrame with headers as input to fit, and I got the error.
When I trimmed it to remove the headers and passed only the values, the error disappeared.
Sample code before and after changes.
Code with Warning:
model = DecisionTreeClassifier()
model.fit(x,y) #Here x includes the dataframe with headers
predictions = model.predict([
[20,1], [20,0]
])
print(predictions)
Code without Warning:
model = DecisionTreeClassifier()
model.fit(x.values,y) #Here x.values will have only values without headers
predictions = model.predict([
[20,1], [20,0]
])
print(predictions)
I also had the same problem. It was due to the fact that I fitted the model with the X train data as a DataFrame (model.fit(X, Y)) but made the prediction with X test as an array (model.predict([[20, 0]])). To solve it, I converted the X train DataFrame into an array, as illustrated below.
BEFORE
model = DecisionTreeClassifier()
model.fit(X,Y) # X train here is a dataFrame
predictions = model.predict([[20, 0]])  # generates the warning
AFTER
model = DecisionTreeClassifier()
X = X.values # conversion of X into array
model.fit(X,Y)
model.predict([[20, 0]])  # now OK, no warning
The other answers so far recommend (re)training with a NumPy array instead of a DataFrame for the training data. The warning is a sort of safety feature, to ensure you're passing the data you meant to, so I would suggest instead passing a DataFrame (with the correct column labels!) to the predict function.
Also, note that it's just a warning, not an error. You can ignore the warning and proceed with the rest of your code without problem; just be sure that the data is in the same order as it was trained with!
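A minimal sketch of what that looks like, reusing the x/y names from the snippets above (here x is assumed to be a DataFrame with two feature columns):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x, y)  # x is a DataFrame, so the feature names are recorded at fit time

# wrap the new samples in a DataFrame that reuses the training column names,
# so the names seen at predict time match the names seen at fit time
new_samples = pd.DataFrame([[20, 1], [20, 0]], columns=x.columns)
predictions = model.predict(new_samples)
print(predictions)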
I got the same error while using DataFrames, but it went away once I passed only the values.
Use
reg = reg.fit(x[['data']].values, y)
It shows the error because our DataFrame has feature names, but we should fit the data as a 2-D array (or matrix) of values when training or testing the dataset.

My pipeline not imputing values correctly?

I'm new to Python and have been learning about pipelines from DataCamp. I have been experimenting with some FIFA data that has missing (NaN) values. I tried to create a pipeline whose steps impute any missing data (replacing it with the mean) and then fit a logistic regression. I don't get any errors in the output. However, when I print things such as print(x_train) and print(y_pred), the output still shows NaN values. Does that indicate that my pipeline is not working and the data was not correctly imputed, since surely I should be seeing the mean values rather than NaN? I would appreciate it if someone could answer in layman's terms, as I am new to the topic.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

fif_data = pd.read_csv("fifa_draft_1.csv")
df_Foot_Dummy=pd.get_dummies(fif_data, drop_first=True)
imp=SimpleImputer(missing_values=np.nan, strategy="mean")
logreg=LogisticRegression()
x=df_Foot_Dummy["passing"].values.reshape(-1,1)
y=df_Foot_Dummy["preferred_foot_Right"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)
steps=[("imputation", imp),("logistic_regression",logreg)]
pipe=Pipeline(steps)
pipe.fit(x_train,y_train)
y_pred=pipe.predict(x_test)
print(x_train)
print(y_pred)
Pipelines do not change data in place; at each step, the data is transformed and passed along, but the intermediate results are not saved (with a partial exception when the memory parameter is set for caching).
That the logistic regression doesn't complain indicates that the imputation has in fact happened.
y_pred shouldn't have any missing values; if it does, please let us know and provide an example dataset.
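If you want to see the imputed values yourself, you can pull the fitted imputation step out of the pipeline and apply it explicitly; a small sketch reusing pipe and x_train from your code:

import numpy as np

# the pipeline never modifies x_train in place; applying the already-fitted
# imputer shows what the logistic regression actually received
imputed_train = pipe.named_steps["imputation"].transform(x_train)

print(np.isnan(x_train).any())        # True  -> the raw array still contains NaNs
print(np.isnan(imputed_train).any())  # False -> NaNs were replaced by the column mean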

Using sklearn's roc_auc_score for OneVsOne Multi-Classification?

So I am working on a model that uses RandomForest to classify samples into 1 of 7 classes. I'm able to build and train the model, but when it comes to evaluating it with the roc_auc_score function, I can use 'ovr' (one-vs-rest), but 'ovo' is giving me some trouble.
roc_auc_score(y_test, rf_probs, multi_class = 'ovr', average = 'weighted')
The above works wonderfully, I get my output, however, when I switch multi_class to 'ovo' which I understand might be better with class imbalances, I get the following error:
roc_auc_score(y_test, rf_probs, multi_class = 'ovo')
IndexError: too many indices for array
(I pasted the whole traceback below!)
Currently my data is set up as follows:
y_test (61,1)
y_probs (61, 7)
Do I need to reshape my data in a special way to use 'ovo'?
In the documentation, https://thomasjpfan.github.io/scikit-learn-website/modules/generated/sklearn.metrics.roc_auc_score.html, it says "binary y_true, y_score is supposed to be the score of the class with greater label. The multiclass case expects shape = [n_samples, n_classes] where the scores correspond to probability estimates."
Additionally, the whole traceback seems to hint at maybe using a more binary array (hopefully that's the right term! I'm new to this!).
Very, very thankful for any ideas/thoughts!
Tirth Patel provided the right answer: I needed to reshape my test set using one-hot encoding. Thank you!

Python: Is it OK to use "as_matrix" with dataframes as input to scikit models

Hi, I have seen some examples of machine learning implementations that use as_matrix with DataFrames as inputs to machine learning algorithms. I wonder if it is OK to use tuples, which are the output of .as_matrix, as inputs to machine learning algorithms, as below. Thanks
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

trainArr_All = df.as_matrix(cols_attr)  # training array
trainRes_All = df.as_matrix(col_class)  # training results
trainArr, x_test, trainRes, y_test = train_test_split(trainArr_All, trainRes_All, test_size=0.20, random_state=42)
rf = RandomForestClassifier(n_estimators=20, criterion='gini', random_state=42)  # 20 decision trees
y_score = rf.fit(trainArr, trainRes.ravel()).predict(x_test)
y_score = y_score.tolist()
Pandas as_matrix converts the DataFrame to a numpy.array (documentation), NOT a tuple! sklearn assumes that the inputs are NumPy arrays; if they are not, it converts them internally to dtype=np.float32 or to a sparse csc_matrix. Although using a pandas DataFrame as input is usually fine with a stable version of sklearn (the conversion happens internally), you may run into occasional problems due to data-type incompatibility. It is usually safer to use as_matrix and convert the DataFrame to a numpy.array yourself before passing it to sklearn.
Here is an example of someone having a problem with a pandas DataFrame:
Using slices in Python
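One caveat worth adding: as_matrix was deprecated in later pandas releases and eventually removed, so on current pandas the equivalent conversion is .to_numpy() (or .values). A quick sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# df.as_matrix(...) no longer exists in recent pandas;
# .to_numpy() returns the same kind of NumPy array for sklearn to consume
X = df[['a', 'b']].to_numpy()
print(type(X))  # <class 'numpy.ndarray'>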
