How to handle the feature differences between my training and test data - python

So, currently my training and testing sets start with 669 features, many of which are categorical and will need to be one-hot encoded.
After one-hot encoding both sets, I found that the training set has additional features.
I'm not quite sure how to handle this but I feel like I have three options:
Remove these features from the training set so both sets match up
Add these features to the test set and produce synthetic data.
Before I train my model, use some dimensionality reduction technique (PCA) and use the same number of components for training and testing.
Any feedback would be much appreciated.

I think you did the one-hot encoding on the train and test data separately. Either combine them and then apply the encoding, or keep pandas.get_dummies separate and use a set difference on the column names to find the columns missing from the test set, then assign 0 to those columns.
missing_cols = set(train.columns) - set(test.columns)  # dummy columns present in train but not in test
for c in missing_cols:
    test[c] = 0  # add the missing columns to the test set, filled with zeros
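For the first option (encoding after combining), a rough sketch could look like this, assuming train and test are the raw, not-yet-encoded DataFrames:

import pandas as pd

# Assumption: train and test are the raw, un-encoded DataFrames.
# Encode once on the concatenated frame so both halves end up with the
# same dummy columns, then split back using the original row counts.
combined = pd.get_dummies(pd.concat([train, test], axis=0))
train_encoded = combined.iloc[:len(train)]
test_encoded = combined.iloc[len(train):]

Note that encoding on the combined frame means the training columns also reflect categories that appear only in the test set, which some people consider a mild form of leakage.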

Create dummies for the training and testing sets separately. Exclude features that are in the testing set but not in the training set, as they were not used in training. Include features that are in the training set but not in the testing set, and fill their values with the "mean", "median", or "mode" of the training set, or simply zero.
Something like this can be done:
Xdummies = pd.get_dummies(X, drop_first=True)     # encode the training features
features = Xdummies.columns
xtest = pd.get_dummies(Xhold, drop_first=True)    # encode the test features
test_feat = xtest.columns

exclude = list(set(test_feat) - set(features))    # in test but not in train: drop
include = list(set(features) - set(test_feat))    # in train but not in test: add as 0

Xtest = pd.concat(
    [xtest.drop(exclude, axis=1).reset_index(drop=True),
     pd.DataFrame(0, index=range(xtest.shape[0]), columns=include).reset_index(drop=True)],
    axis=1,
)
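If a pure zero-fill is acceptable (no mean/median imputation), pandas' reindex gives a shorter equivalent; this is just a sketch using the same variable names as above:

# Drop test-only columns and add train-only columns filled with 0,
# while also matching the training column order.
Xtest = xtest.reindex(columns=Xdummies.columns, fill_value=0)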

Related

Encoding categorical features?

I have used one-hot-encoding to encode the categorical features in my dataset. After encoding, I found out that the training data results in more features than the testing data.
Example
A feature called week-day has the values 1, 2, 3, 4, 5, and one-hot encoding turns it into 5 new features, as expected. However, in the testing data the same feature only has the values 1 and 2, so one-hot encoding produces only 2 new features.
My question
The training input has more features than the test data, which has only two of them. How does this affect the model? Does the model still work? With fewer inputs in the test data, how will it handle them?
Any help or suggestion is highly appreciated.
This is not a problem per se, although a standard assumption in most ML settings is that the label distribution of the test set is identical to that of the training set.
You should just re-encode your test data using the same features you used to encode your training data. That is, any feature from the test set that also occurs in the training set should get the same index it got in the training set, while any feature occurring in the test set that does not occur in the training set should be discarded (because it is unclear what you should do with it anyway).
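With scikit-learn, a OneHotEncoder fitted on the training data does essentially this: the output columns are fixed by the training categories, and unseen test categories can be ignored. A minimal sketch, assuming train and test are DataFrames that contain the week-day column:

from sklearn.preprocessing import OneHotEncoder

# Fit on the training data only, so the output columns are fixed by the
# categories seen in training; categories that appear only in the test
# data are encoded as all-zero rows instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
X_train_enc = enc.fit_transform(train[["week-day"]])
X_test_enc = enc.transform(test[["week-day"]])  # same number of columns as train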

How to solve mismatch in train and test set after categorical encoding?

I have many categorical variables which exist in my test set but not in my train set. They are important, so I can't drop them. Should I combine the train and test sets, or what other solution should I use?
You have some options in this case: you can use a technique other than holdout to separate your data, such as K-fold cross-validation or leave-one-out.
When using a holdout split, it is necessary to stratify your data so that the train/test/validation subsets all contain all classes, and NEVER use your test or validation dataset to fit your model; if you do, the model will learn from that data and you will probably end up overfitting.
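As a sketch of the stratified holdout and K-fold ideas, assuming X is a DataFrame of features and y the labels (scikit-learn assumed):

from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified holdout: the class proportions in y are preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified K-fold cross-validation instead of a single holdout split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_tr, X_val = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[test_idx]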
How did you end up in this situation? Normally you take a dataset and divide it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
But it is clear that, because they originate from the same original dataset, they both have the same categorical variables.

Is one hot encoding required for this data set?

Below is the data set from the UCI data repository. I want to build a regression model taking platelets count as the dependent variable(y) and the rest as features/inputs.
However, there are a few categorical variables in the data set, such as anemia, sex, smoking, and DEATH_EVENT, which are already in numeric form.
My questions are:
Should I perform 'one-hot encoding' on these variables before building a regression model?
Also, I observe the values are in various ranges, so should I even scale the data set before applying the regression model?
1. Should I perform 'one-hot encoding' on these variables before building a regression model?
Yes, you should one-hot encode the categorical variables. You can use something like the code below:
columns_to_category = ['sex', 'smoking','DEATH_EVENT']
df[columns_to_category] = df[columns_to_category].astype('category') # change dtypes to category
df = pd.get_dummies(df, columns=columns_to_category) # One hot encoding the categories
2. If so, is one-hot encoding sufficient on its own, or should I also perform label encoding?
One-hot encoding should be sufficient, I guess.
3. Also, I observe the values are in various ranges, so should I also scale the data set before applying the regression model?
Yes, you can use either StandardScaler() or MinMaxScaler() to get better results, and then inverse-scale the predictions. Also, make sure you fit the scaler on the training set only and then apply it to the test set rather than fitting on the combined data: in real life your test data is not available at training time, so you need to scale accordingly to avoid such errors.
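For instance, a sketch assuming X_train and X_test are the feature DataFrames and y_train is the platelets target:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then reuse the fitted
# scaler on the test features (no refitting on test data).
x_scaler = StandardScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)

# If the target is scaled too, invert the scaling on the predictions.
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
# predictions = y_scaler.inverse_transform(model.predict(X_test_scaled).reshape(-1, 1))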
If those are truly binary categories, you don't have to one hot encode. They are already encoded.
You don't have to use one-hot encoding, as those columns already have numerical values. However, if those values are actually stored as strings rather than int or float, then you should one-hot encode them. As for scaling the data: the variation is considerable, so you should scale it to avoid your regression model being biased towards high values.

Training a categorical classification example

I am new to Machine Learning. I am currently solving a classification problem which has strings as its target. I have split the test and training sets, I have dealt with the string attributes by converting them with OneHotEncoder, and I am using StandardScaler to scale the numerical features of the training set.
My question is: for the test set, do I need to convert the test set targets, which are still in string format, with the OneHotEncoder as I did with the training set's string targets, or do I leave the test set alone as it is and the classifier will do the job itself? Similarly, for the numerical attributes, do I have to use StandardScaler to scale them in the test set, or will the classifier do this itself once training is done on the training set?
For the first question, I would say you don't need to convert it, but converting would make the evaluation on the test set easier.
Your classifier will output one-hot encoded values, which you can convert back to strings and evaluate, but I think having the test targets as 0-1s would help.
For the second one, you need to fit the StandardScaler on the train set and use (transform) that fitted scaler on the test set.
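A minimal sketch of both points, using LabelEncoder for the string targets (rather than OneHotEncoder, which is also possible) and assuming y_train/y_test are the string labels and X_train_num/X_test_num are the numerical columns:

from sklearn.preprocessing import LabelEncoder, StandardScaler

# One shared label-to-integer mapping, fitted on the training labels.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)   # makes evaluating test predictions easier

# Fit the scaler on the training features only, then apply it to the test features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_num)
X_test_scaled = scaler.transform(X_test_num)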

When to use train_test_split of scikit learn

I have a dataset with 19 features. I need to do missing value imputation, then encode the categorical variables using scikit-learn's OneHotEncoder, and then run a machine learning algorithm.
My question is: should I do all of the above on the whole dataset and then split it with scikit-learn's train_test_split method, or should I first split into train and test and then do the missing value imputation and encoding on each set separately?
My concern is that if I split first and then do the missing value imputation and encoding on the two resulting sets, the test set may be missing some values of a variable, resulting in fewer dummies. For example, if the original data had 3 levels for a categorical variable, and I know we are doing random sampling, is there a chance that the test set might not have all three levels present for that variable, thereby resulting in only two dummies instead of three?
What's the right approach: splitting first and then doing all of the above on train and test, or doing the missing value imputation and encoding on the whole dataset first and then splitting?
I would first split the data into a training and testing set. Your missing value imputation strategy should be fitted on the training data and applied both on the training and testing data.
For instance, suppose you intend to replace missing values with the most frequent value or the median. This knowledge (median, most frequent value) must be obtained without having seen the testing set; otherwise, your missing value imputation will be biased. If some values of a feature are unseen in the training data, you can, for instance, increase your overall number of samples or use a missing value imputation strategy that is robust to outliers.
Here is an example of how to perform missing value imputation using a scikit-learn pipeline and imputer:
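A minimal sketch, with the column lists and the final estimator as placeholders:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_features = ["age", "income"]        # placeholder column names
categorical_features = ["city", "gender"]   # placeholder column names

# Median imputation for numeric columns; most-frequent imputation plus
# one-hot encoding for categorical columns. Both are fitted on the
# training data only when the pipeline itself is fitted.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)            # imputer and encoder learn from the train split only
predictions = model.predict(X_test)    # the same fitted transformers are applied to the test split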
