I used one-hot encoding to encode the categorical features in my dataset. After encoding, I found that the training data results in more features than the testing data.
Example
A feature called week-day has the values 1, 2, 3, 4, 5, and one-hot encoding results in 5 new features, as expected. However, in the testing data the same feature only has the values 1 and 2, resulting in just 2 one-hot encoded features.
My question
The training input has more features than the test data, which has only two. How does this affect the model? Does the model still work at all? With fewer inputs in the test data, how will it handle them?
Any help or suggestion is highly appreciated.
This is not a problem per se, although a standard assumption in most ML settings is that the data distribution of the test set is identical to that of the training set.
You should just re-encode your test data using the same features as you encoded your training data. That is, any feature value from test that also occurs in train should get the same index it got in train, while any feature values occurring in test that do not occur in train should be discarded (because it is unclear what you should do with them anyway).
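A minimal sketch of this with scikit-learn's OneHotEncoder (train_df and test_df are hypothetical frames mirroring the week-day example above):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: train has all five weekdays, test only two
train_df = pd.DataFrame({"week_day": [1, 2, 3, 4, 5]})
test_df = pd.DataFrame({"week_day": [1, 2]})

# Fit on train only; handle_unknown="ignore" zeroes out any category
# that appears in test but was never seen in train
enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train_df)   # shape (5, 5)
X_test = enc.transform(test_df)         # shape (2, 5): same columns as train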
Related
I have many categorical variables which exist in my test set but not in my train set. They are important, so I can't drop them. Should I combine the train and test sets, or is there some other solution?
You have some options in this case: you can use a technique other than holdout to split your data, such as k-fold cross-validation or leave-one-out.
When using holdout, it is necessary to stratify your data so that all classes appear in each of the train/test/validation subsets, and NEVER use your test or validation dataset to fit your model; if you do, the model will learn that data and you will probably end up overfitting.
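As an illustration, stratified k-fold cross-validation with scikit-learn could look like this (a sketch; X and y are assumed to be your feature and label arrays):

from sklearn.model_selection import StratifiedKFold

# Each fold preserves the class proportions of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]  # assumes numpy arrays
    y_train, y_test = y[train_idx], y[test_idx]
    # fit on (X_train, y_train), evaluate on (X_test, y_test) only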
How did you end up in this situation? Normally you take a dataset and divide it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
But it is clear that, because they both originate from the same original dataset, they have the same categorical variables.
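A minimal sketch of that split, assuming a DataFrame df with a hypothetical "label" target column:

from sklearn.model_selection import train_test_split

# Split one dataset into the two subsets described above;
# both necessarily share the same categorical variables
X = df.drop(columns=["label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)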
I'm using OneHotEncoding to generate dummies for a classification problem. When used on the training data, I get ~300 dummy columns, which is fine. However, when I input new data (which has fewer rows), the OneHotEncoding only generates ~250 dummies, which isn't surprising given the smaller dataset, but then I can't use the new data with the model because the features don't align.
Is there a way to retain the OneHotEncoding schema to use on new incoming data?
I think you are using fit_transform on both the training and test datasets, which is not the right approach: the encoding schema has to be consistent across both datasets for the model to understand the information in the features.
The correct way is to do
fit_transform on the training data
transform on the test data
By doing it this way, you will get a consistent number of columns.
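With scikit-learn's OneHotEncoder this looks like the following (a sketch; X_train and X_test are assumed to hold the raw categorical columns):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
X_train_enc = enc.fit_transform(X_train)  # learns the encoding schema from train
X_test_enc = enc.transform(X_test)        # reuses that schema, so columns align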
I have encountered a unique problem. My model was trained in a DNN framework and the model parameters were saved, which I'm now using to score the data. Since my data is quite large, I'm scoring it in batches. I'm not one-hot encoding the categorical variables before creating the batches, as the OneHotEncoder function runs into a memory error when applied to the complete dataset. This led me to explore the option of one-hot encoding within each batch; however, this fails because not all batches contain all levels of the categorical variables. If anyone has faced a similar issue, could you please recommend or suggest a workaround?
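One workaround that should apply here (an assumption on my part, not something stated in the question): pass the full set of category levels to the encoder up front, so every batch is encoded against the same schema no matter which levels it happens to contain. A sketch with scikit-learn:

from sklearn.preprocessing import OneHotEncoder

# Hypothetical: the full list of levels per categorical column,
# e.g. collected once from the training data
all_levels = [[1, 2, 3, 4, 5]]
enc = OneHotEncoder(categories=all_levels, handle_unknown="ignore")
enc.fit(batches[0])       # columns come from `all_levels`, not from the batch
for batch in batches:     # `batches` and `model` are hypothetical names;
    scores = model.predict(enc.transform(batch))  # consistent columns per batch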
I am new to Machine Learning. I am currently solving a classification problem which has strings as its target. I have split the training and test sets, dealt with the string attributes by converting them with OneHotEncoder, and I am using StandardScaler to scale the numerical features of the training set.
My question is: for the test set, do I need to convert the targets, which are still strings, with the OneHotEncoder as I did for the training set's targets, or do I leave the test set as it is and the classifier will do the job itself? Similarly, for the numerical attributes, do I have to use StandardScaler to scale them in the test set, or will the classifier handle this itself once it has been trained on the training set?
For the first question, I would say you don't need to convert it, but doing so would make evaluation on the test set easier.
Your classifier will output one-hot encoded values, which you can convert back to strings and evaluate; however, I think having the test targets as 0-1s would help.
For the second one, you need to fit the StandardScaler on the train set and use it (transform) on the test set.
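For example (a sketch; the scaler's statistics come from the training set only):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train
X_test_s = scaler.transform(X_test)        # apply the same mean/std to test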
So, currently my training and testing sets start with 669 features, many of which are categorical and will need to be one-hot encoded.
After one-hot encoding both sets, I found that the training set has additional features.
I'm not quite sure how to handle this, but I feel like I have three options:
Remove these features from the training set so both match up.
Add these features to the test set and produce synthetic data for them.
Before I train my model, use a dimensionality reduction technique (e.g. PCA) and keep the same number of components for training and testing.
Any feedback would be much appreciated.
I think you did one-hot encoding on the train and test data separately. Either combine them and then apply the encoding, or try pandas.get_dummies on them separately and use set operations on the columns to find which training columns are missing from the test set, then assign 0 for those missing columns:
# Columns that appear in train but are missing from test
missing_cols = set(train.columns) - set(test.columns)
for c in missing_cols:
    test[c] = 0  # add each missing dummy column, filled with zeros
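The "combine first" alternative mentioned above could look like this (a sketch; the keys let you split the frame back apart after encoding):

import pandas as pd

# Encode train and test together so both get identical dummy columns
combined = pd.concat([train, test], keys=["train", "test"])
combined = pd.get_dummies(combined)
train_enc = combined.xs("train")
test_enc = combined.xs("test")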
Create dummies for the training and testing sets separately. Exclude features that are in the testing set but not in the training set, as they were not used in training. Include features that are in the training set but not in the testing set, and fill the data corresponding to them using the mean, median, or mode of the training set, or with zeros.
Something like this can be done:
import pandas as pd

# One-hot encode train and hold-out/test separately
Xdummies = pd.get_dummies(X, drop_first=True)
features = Xdummies.columns
xtest = pd.get_dummies(Xhold, drop_first=True)
test_feat = xtest.columns

# Drop test-only columns; add train-only columns filled with zeros
exclude = list(set(test_feat) - set(features))
include = list(set(features) - set(test_feat))
zeros = pd.DataFrame(0, index=range(xtest.shape[0]), columns=include)
Xtest = pd.concat([xtest.drop(exclude, axis=1).reset_index(drop=True),
                   zeros.reset_index(drop=True)], axis=1)
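If you prefer, DataFrame.reindex collapses the exclude/include steps into one call with the same result (assuming features holds the training columns as above):

# Drop test-only columns and add train-only columns as zeros in one step
Xtest = xtest.reindex(columns=features, fill_value=0)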