I have a dataset with 19 features. I need to do missing value imputation, then encode the categorical variables using scikit-learn's OneHotEncoder, and then run a machine learning algorithm.
My question is: should I do all of the above on the whole dataset and then split it with scikit-learn's train_test_split method, or should I first split into train and test and then do the missing value imputation and encoding on each set separately?
My concern with splitting first is this: when encoding a categorical variable in the test set, the test set might be missing some levels of that variable, resulting in fewer dummies. For example, if the original data has 3 levels for a categorical variable, I know the split is a random sample, but is there a chance that the test set does not contain all three levels, so it ends up with only two dummies instead of three?
What's the right approach: split first and then do all of the above on train and test separately, or do the missing value imputation and encoding on the whole dataset first and then split?
I would first split the data into a training and a testing set. Your missing value imputation strategy should be fitted on the training data and then applied to both the training and testing data.
For instance, suppose you intend to replace missing values with the most frequent value or the median. That knowledge (the median, the most frequent value) must be obtained without having seen the testing set; otherwise your missing value imputation will be biased. If some feature values are unseen in the training data, you can, for instance, increase your overall number of samples or use an imputation strategy that is robust to outliers.
Here is an example of how to perform missing value imputation using a scikit-learn pipeline and imputer:
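A minimal sketch (the column names and toy data are illustrative; all imputer and encoder statistics are learned from the training split only):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative toy data; replace with your own 19-feature dataset
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red", "green", "blue"],
    "size": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "target": [0, 1, 0, 1, 0, 1],
})
X, y = df[["color", "size"]], df["target"]

# Split FIRST, so the imputer and encoder are fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["color"]),
    ("num", SimpleImputer(strategy="median"), ["size"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)   # statistics learned from the train split only
score = model.score(X_test, y_test)  # same fitted transforms reused on test
```

Because everything lives in one Pipeline, calling `fit` on the training split and `score`/`predict` on the test split guarantees the test data is transformed with training statistics only.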
Related
I have used one-hot encoding to encode the categorical features in my dataset. After encoding, I found that the training data ends up with more features than the testing data.
Example
A feature called week-day has the values 1, 2, 3, 4, 5, and one-hot encoding turns it into 5 new features, as expected. However, in the testing data the same feature only has the values 1 and 2, so one-hot encoding produces only 2 new features.
My question
The training input has more features than the test input, which has only two. How does this affect the model? Does the model still work at all? With fewer inputs in the test data, how will it handle them?
Any help or suggestion is highly appreciated.
This is not a problem per se, although a standard assumption in most ML settings is that the label distribution of the test set is identical to that of the training set.
You should just re-encode your test data using the same features you used to encode your training data. That is, any feature value from the test set that also occurs in the training set should get the same index that value got in the training set, while any value occurring in the test set that does not occur in the training set should be discarded (because it is unclear what you should do with it anyway).
I have a tabular PyTorch model that takes in cities and zip codes as categorical embeddings. However, I can't stratify effectively on those columns.
How can I get PyTorch to run if a categorical value in the test set is missing from the train set, or the holdout set has a categorical value that was not in the train/test set?
You can try to use one-hot encoding instead.
PS: this is a suggestion, not an answer.
I used one dataset to train a Random Forest Regressor, and now I have another dataset with a smaller number of features (a subset of the previous set).
Is there a function which allows to get the list of names of columns used during the training of the Random Forest Regressor model?
If not, is there a function that would fill the missing columns with nulls?
Is there a function which allows to get the list of names of columns used during the training of the Random Forest Regressor model?
RF uses all the features from your dataset. Each split in each tree may only consider sqrt(num_of_features) or log2(num_of_features) (or whatever max_features you set) candidate columns, but these columns are selected at random, so the forest as a whole usually covers all columns in your dataset.
There may be an edge case where you use a small number of estimators and some features are never considered. In that case, RandomForestRegressor.feature_importances_ (a zero or NaN value may be an indicator here) or diving into each tree in RandomForestRegressor.estimators_ may help.
If not, then is there a function which for the missing columns would assign Nulls?
RF does not accept missing values. Either you encode the missing value as a separate class (and use it during training too), or something like XGBoost, which handles missing values natively, is your choice.
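The "separate class" option can be as simple as filling missing values with a sentinel label before creating dummies (a sketch; the column name and sentinel are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["paris", np.nan, "rome", np.nan]})

# Treat "missing" as its own category so the forest can learn from it
df["city"] = df["city"].fillna("MISSING")
dummies = pd.get_dummies(df["city"])
# dummies.columns -> ['MISSING', 'paris', 'rome']
```

The same fillna must of course be applied to training and test data alike, so both see the sentinel category.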
I am new to machine learning. I am currently solving a classification problem whose targets are strings. I have split the training and test sets, dealt with the string attributes by converting them with OneHotEncoder, and I am using StandardScaler to scale the numerical features of the training set.
My question is: for the test set, do I need to convert the test-set targets, which are still strings, with the OneHotEncoder as I did for the training set's targets, or do I leave the test set alone and the classifier will do the job itself? Similarly, for the numerical attributes, do I have to use StandardScaler to scale them in the test set, or will the classifier do this itself once it has been trained on the training set?
For the first question, I would say you don't need to convert it, but doing so would make the evaluation on the test set easier.
Your classifier will output one-hot encoded values, which you can convert back to strings and evaluate, but I think having the test targets as 0-1s would help.
For the second one, you need to fit the StandardScaler on the train set and then use it (transform) on the test set.
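A minimal sketch of that fit-on-train, transform-both pattern (toy numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [10.0]])

scaler = StandardScaler().fit(X_train)   # mean/std learned from train only
X_train_s = scaler.transform(X_train)    # standardized training data
X_test_s = scaler.transform(X_test)      # SAME train mean/std reused on test
```

Calling `fit` (or `fit_transform`) again on the test set would leak test statistics into the preprocessing and make the evaluation inconsistent with training.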
So, currently my training and testing sets start with 669 features, many of which are categorical and will need to be one-hot encoded.
After one-hot encoding both sets, I found that the training set has additional features.
I'm not quite sure how to handle this but I feel like I have three options:
Remove these features from the training set so both match up.
Add these features to the test set and produce synthetic data for them.
Before I train my model, use some dimensionality reduction technique (e.g. PCA) and use the same number of components for training and testing.
Any feedback would be much appreciated.
I think you did the one-hot encoding on the train and test data separately. You could combine them and then apply the encoding, or apply pandas.get_dummies to them separately, use set differences to find the columns present in one set but not the other, and then assign 0 for the missing columns:
# columns present in train but absent from test: add them to test as all-zero
missing_cols = set(train.columns) - set(test.columns)
for c in missing_cols:
    test[c] = 0
# drop columns that appear only in test, and match the training column order
test = test[train.columns]
Create dummies for the training and testing sets separately. Exclude features that are in the testing set but not in the training set, as they were not used in training. Include features that are in the training set but not in the testing set, and fill their values using the "mean", "median", or "mode" of the training set, or zero.
Something like this can be done:
Xdummies = pd.get_dummies(X, drop_first=True)
features = Xdummies.columns

xtest = pd.get_dummies(Xhold, drop_first=True)
test_feat = xtest.columns

exclude = list(set(test_feat) - set(features))  # in test only: drop
include = list(set(features) - set(test_feat))  # in train only: add, filled with 0

Xtest = pd.concat(
    [
        xtest.drop(exclude, axis=1).reset_index(drop=True),
        pd.DataFrame(0, index=range(xtest.shape[0]), columns=include).reset_index(drop=True),
    ],
    axis=1,
)