Python -- mismatch of columns between training data and prediction data

Python -- mismatch of columns between training data and prediction data - python

in my model I have figured out i made a mistake by not dropping a column from prediction dataset.
This column; yclass; is not available in training dataset. But in my prediction dataset it is available.
I wasn't aware of that mistake, but now I am confused? Why it is still giving me predictions with that abundant column? Shouldn't it give me some kind of error because of that? I have seen the examples of onehotencoding related training-test data inconsistency and solution to that problem. But it is a new case which I do not have an idea? Here is the final part of my code; may be I am making a mistake with pipelines etc.
lgbr = LGBMRegressor(learning_rate= 0.1, max_depth= 18, n_estimators= 50, num_leaves= 11)
lgbc = LGBMClassifier(learning_rate = 0.1, max_depth = 18, n_estimators = 100, num_leaves = 51)
numeric_pipe = make_pipeline(MinMaxScaler(feature_range = (-1,1)))
categoric_pipe = make_pipeline(OneHotEncoder(sparse = False, handle_unknown='ignore'))
preprocessor = ColumnTransformer(transformers = [('num',numeric_pipe, num_cols), ('cat',categoric_pipe,cat_cols)])
regr_pipe_final = make_pipeline(preprocessor, lgbr)
regr_pipe_final.fit(df_x_regr, df_y_regr.values.ravel())
class_pipe_final = make_pipeline(preprocessor, lgbc)
class_pipe_final.fit(df_x, df_y_class.values.ravel())
pred_final = pd.DataFrame()
for key in list(mi.unique_everseen(pred_set['from'] + pred_set['to'])):
pred_val_list = []
pred_subset = pred_set[(pred_set['from'] + pred_set['to']) == key]
lag = 0
for i in range(0,predmonths):
pred = pred_subset.iloc[[i],:]
class_val = class_pipe_final.predict(pred)
regr_val = regr_pipe_final.predict(pred)
I am making rowwise predictions to generate a moving forecast effect, thats why I use for loops for predictions.
Here, as a final summary, the problem is "pred" has 1 abundant column inside, named as "yclass". How is my pipeline accepting that column as an input? Or just ignoring it?

If I understand your question correctly and your code, assuming yclass is one of the One-Hot columns, you have a parameter there:
handle_unknown='ignore'
Which tells the encoder to ignore column it has not seen in data.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
handle_unknown : ‘error’ or ‘ignore’, default=’error’.
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

Related

Linear Regression TypeError: float() argument must be a string or a number, not 'NaTType'

I am currently working on a linear regression model to try and predict prices of certain shoes. After loading in the data frames for the training data and the data to test and dropping the unwanted columns, I am beginning to work with independent and dependent variable for the regression coefficients. Here was the set of code I ran and was able to get working:
SourceData_train_independent = Training_Data.drop(['Sale Price'], axis = 1) #
Drop depedent variable from training dataset
SourceData_train_dependent = Training_Data['Sale Price'].copy() # New dataframe
with only Dependent variable value for training dataset
SourceData_test_independent = Test_Data.drop(['Sale Price'], axis = 1)
SourceData_test_dependent = Test_Data['Sale Price'].copy()
When I ran the next block of code to scale the data:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(SourceData_train_independent.values) #scale the
independent variables
y_train = SourceData_train_dependent # scaling is not required for dependent
variable
X_test = sc_X.transform(SourceData_test_independent)
y_test = SourceData_test_dependent
I get the following error:
TypeError: float() argument must be a string or a number, not 'NaTType'
on the line:
X_train = sc_X.fit_transform(SourceData_train_independent.values) #scale the
independent variables
Currently, the sale price variable is a float64, do I just need to change the datatype to a number value or is it a more complicated fix? I have already run the code:
Training_Data["Sale Price"] = Training_Data["Sale
Price"].astype(str).str.strip().replace("",0).astype(float)
Test_Data["Sale Price"] = Training_Data["Sale
Price"].astype(str).str.strip().replace("",0).astype(float)
Hopefully someone can help as I'm struggling with this project. I used the code from: https://towardsdatascience.com/linear-regression-in-python-sklearn-vs-excel-6790187dc9ca
Thank you so much!

Why is sklearn.metrics support value changing every time?

I'm working on training a supervised learning keras model to categorize data into one of 3 categories. After training, I run this:
dataset = pandas.read_csv(filename, header=[0], encoding='utf-8-sig', sep=',')
# split X and Y (last column)
array = dataset.values
columns = array.shape[1] - 1
np.random.shuffle(array)
x_orig = array[:, 1:columns]
testy = array[:, columns]
columns -= 1
# normalize data
scaler = StandardScaler()
testx= scaler.fit_transform(x_orig)
#onehot
testy = to_categorical(testy)
# load weights
save_path = "[filepath]"
model = tf.keras.models.load_model(save_path)
# gets class breakdown
y_pred = model.predict(testx, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)
y_true = np.argmax(testy, axis=1)
print(sklearn.metrics.precision_recall_fscore_support(y_true, y_pred))
sklearn.metrics.precision_recall_fscore_support prints, among other metrics, the support for each class. Per this link, support is the number of occurrences of each class in y_true, which is the true labels.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
My problem: each run, support is different. I'm using the same data, and support for each class always adds up the same (but different than the total in the file – which I also don’t understand), but the number per class differs.
As an example, one run might say [16870, 16299, 7807] and the next might say [17169, 15923, 7884]. They add up the same, but each class differs.
Since my data isn't changing between runs, I'd expect support to be identical every time. Am I wrong? If not, what's going on? I've tried googling, but didn't get any useful results.
Potentially useful information: when I run sklearn.metrics.classification_report, I have the same issue, and the numbers from that match the numbers from precision_recall_fscore_support.
Sidenote: unrelated to above question, but I couldn't google-fu an answer to this one either, I hope that's ok to include here. When I run model.evaluate, part of the printout is e.g. 74us/sample. What does us/sample mean?

Add:
np.random.seed(42)
before you shuffle the array at
np.random.shuffle(array)
The reason for this is without seeding np.shuffle will create a different result each time. Thus when you feed the array into the model it will return a different result. Seeding allows you to shuffle it the same each time, thus creating reproducible results.
Or you can not shuffle and get the same array each time to feed into the model. Either or both methods will ensure reproducibility within the model.

Confusion Matrix on H2O

Final Edit: this problem ended up occurring because the target array were integers that were supposed to represent categories so it was doing a regression. Once I converted them into factors using .asfactor(), then the confusion matrix method detailed in the answer below worked
I am trying to run a confusion matrix on my Random Forest Model (my_model), but the documentation has been less than helpful. From here it says the command is h2o.confusionMatrix(my_model) but there is no such thing in 3.0.
Here are the steps to fit the model:
from h2o.estimators.random_forest import H2ORandomForestEstimator
data_h = h2o.H2OFrame(data)
train, valid = data_h.split_frame(ratios=[.7], seed = 1234)
my_model = H2ORandomForestEstimator(model_id = "rf_h", ntrees = 400,
max_depth = 30, nfolds = 8, seed = 25)
my_model.train(x = features, y = target, training_frame=train)
pred = rf_h.predict(valid)
I have tried the following:
my_model.confusion_matrix()
AttributeError: type object 'H2ORandomForestEstimator' has no attribute
'confusion_matrix'
Gotten from this example.
I have attempted to use tab completion to find out what it might be and have tried:
h2o.model.confusion_matrix(my_model)
TypeError: 'module' object is not callable
and
h2o.model.ConfusionMatrix(my_model)
which outputs simply all the model diagnostics and then the error:
H2OTypeError: Argument `cm` should be a list, got H2ORandomForestEstimator
Finally,
h2o.model.ConfusionMatrix(pred)
Which gives the same error as above.
Not sure what to do here, how can I view the results of the confusion matrix of the model?
Edit: Added more code to the beginning of the question for Context

please see the documentation for the full parameter list. For your convenience here is the list confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False).
Here is a working example of how to use the method:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the binomial_double_trees (boolean parameter):
# Initialize and train a DRF
cars_drf = H2ORandomForestEstimator(binomial_double_trees = False, seed = 1234)
cars_drf.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
cars_drf.confusion_matrix()
# or specify the validation frame
cars_drf.confusion_matrix(valid=True)

Sklearn svr give wrong results when the training data obvious show a pattern

I have the following training data:
x = [
[0.914728682,5.217,5,0.217,3.150362319,33.36,35,-1.64,4.220113852],
[0.885057471,7.793,8,-0.207,3.380911063,46.84,48,-1.16,4.448243115],
[0.871345029,7.152,7,0.152,3.976205037,44.98,47,-2.02,5.421236592],
[0.821428571,8.04,8,0.04,2.909880565,52.02,54.5,-2.48,2.824104235],
[0.931372549,8.01,8,0.01,4.616714697,48.04,48,0.04,9.650462033],
[0.66367713,5.424,5.5,-0.076,1.37804878,32.6,35.5,-2.9,1.189781022],
[0.78,8.66,9,-0.34,2.272965879,48.47,55,-6.53,2.564550265],
[0.227272727,19.55,21,-1.45,1.860133206,128.23,147,-18.77,1.896893491],
[0.47826087,10.09,8,2.09,1.155519927,74.43,64,10.43,1.169547454],
[0.652694611,6.775,4,2.775,1.05529595,43.1,30,13.1,1.062885327],
[0.798561151,3.986,2,1.986,0.656563993,25.38,13,12.38,0.652442159],
[0.666666667,5.419,3,2.419,1.057985162,34.37,16,18.37,0.981719509],
[0.5625,7.719,2,5.719,0.6421797,46.91,12,34.91,0.665673336]
]
and the following labels(scores):
y = [0.237113402,0.168831169,0.104166667,0.086419753,0.063147368,0.016042781,
0.014814815,0,0,-0.0794,-0.14,-0.1832,-0.2385]
It seems clear that the larger the values in column 5 and column 9 are, the higher the scores.
I write the following code that make use of SVR on the training data provided:
rb = RobustScaler()
xScaled = rb.fit_transform(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled,y)
But no matter which of the following I use for prediction, it is not giving a score that looks right.
1 score = model.predict(rb.fit_transform(testData))
2 score = model.predict(testData)
If I do something like the following during training:
xScaled = preprocessing.scale(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled,y)
then:
score = svmModel.predict(testData)
I get back something close to the origin y.
But I pick a row in x, put it in a 2d array with one row called testData, and do:
score = svmModel.predict(testData)
I get a wrong score. In fact, no matter which row in x I use for creating the 2d array with one row, I get the same score.
What have I done wrong? I would be extremely grateful if someone can help.

1) score = model.predict(rb.fit_transform(testData))
When you do the above, you are re-fitting the RobustScaler to the new data. That means that it will be scaled to new data and will not match the scales of the training data. So the results will not be good.
2) score = model.predict(testData)
In the above, you are not scaling the test data, so its different that what the SVC has learnt. Hence the results will be bad here also.
What you need to do:-
score = model.predict(rb.transform(testData))
Calling transform() will scale the supplied data based on training data scales, and hence the SVC can better predict the output.

XGBoost difference in train and test features after converting to DMatrix

Just wondering how is possible next case:
def fit(self, train, target):
xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
self.model = xgb.train(self.params, xgtrain, self.num_rounds)
I passed the train dataset as csr_matrix with 5233 columns, and after converting to DMatrix I got 5322 features.
Later on predict step, I got an error as cause of above bug :(
def predict(self, test):
if not self.model:
return -1
xgtest = xgb.DMatrix(test)
return self.model.predict(xgtest)
Error: ... training data did not have the following fields: f5232
How can I guarantee correct converting my train/test datasets to DMatrix?
Are there any chance to use in Python something similar to R?
# get same columns for test/train sparse matrixes
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
My approach doesn't work, unfortunately:
def _normalize_columns(self):
columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
(set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
for item in columns:
if item in self.xgtest.feature_names:
self.xgtest.feature_names.remove(item)
else:
# seems, it's immutable structure and can not add any new item!!!
self.xgtest.feature_names.append(item)

One another possibility is to have one feature level exclusively in training data not in testing data. This situation happens mostly while post one hot encoding whose resultant is big matrix have level for each level of categorical features. In your case it looks like "f5232" is either exclusive in training or test data. If either case model scoring likely to throw error (in most implementations of ML packages) because:
If exclusive to training: Model object will have reference of this feature in model equation. While scoring it will throw error saying I am not able to find this column.
If exclusive to test (lesser likely as test data is usually smaller than training data): Model object will NOT have reference of this feature in model equation. While scoring it will throw error saying I got this column but model equation don't have this column. This is also lesser likely because most implementations are cognizant of this case.
Solutions:
The best "automated" solution is to keep only those columns, which are common to both training and test post one hot encoding.
For adhoc analysis if you can not afford to drop the level of feature because of its importance then do stratified sampling to ensure that all level of feature gets distributed to training and test data.

This situation can happen after one-hot encoding. For example,
ar = np.array([
[1, 2],
[1, 0]
])
enc = OneHotEncoder().fit(ar)
ar2 = enc.transform(ar)
b = np.array([[1, 0]])
b2 = enc.transform(b)
xgb_ar = xgb.DMatrix(ar2)
xgb_b = xgb.DMatrix(b2)
print(b2.shape) # (1, 3)
print(xgb_b.num_col()) # 2
So, when you have all zero column in sparse matrix, DMatrix drop this column (I think, because this column is useless for XGBoost)
Usually, I add a fake row to matrix which contents 1 in all columns.

Such an issue occurred for me when RandomUnderSampler (RUS) method returned a np.array rather than a Pandas DataFrame with column names.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X_train, y_train)
I resolved the issue with this:
X_rus = pd.DataFrame(X_rus, columns = X_train.columns)
Basically taking the output of RUS method and creating a Pandas DataFrame out of it with column names from the original X_train data which was the input of RUS method.
This can be generalized to any similar problem where XGBoost expected to read column names but could not. Just create a Pandas DataFrame and assign the column names accordingly.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python -- mismatch of columns between training data and prediction data - python

Related

Linear Regression TypeError: float() argument must be a string or a number, not 'NaTType'

Why is sklearn.metrics support value changing every time?

Confusion Matrix on H2O

Sklearn svr give wrong results when the training data obvious show a pattern

XGBoost difference in train and test features after converting to DMatrix

Categories

Resources