I am new to machine learning. I am currently solving a classification problem whose targets are strings. I have split the data into training and test sets, converted the string attributes with OneHotEncoder, and used StandardScaler to scale the numerical features of the training set.
My question is about the test set: do I need to convert the test set targets, which are still strings, with OneHotEncoder as I did for the training set's string targets, or do I leave the test set as it is and let the classifier handle it? Similarly, for the numerical attributes, do I have to use StandardScaler to scale them in the test set, or will the classifier do this itself once training on the training set is done?
For the first question, I would say you don't need to convert the test targets, but doing so makes evaluation on the test set easier.
Your classifier will output one-hot-encoded values, which you can convert back to strings and evaluate against; however, having the test targets already encoded as 0-1 values makes that comparison easier.
For the second one, you need to fit the StandardScaler on the training set and then use it to transform the test set.
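A minimal sketch of that workflow with scikit-learn (X_train, X_test, y_train and y_test are placeholders for your own split; I use LabelEncoder here for the string targets, which is the usual scikit-learn choice for classification labels, rather than OneHotEncoder):

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encode the string targets; fit only on the training labels.
label_enc = LabelEncoder()
y_train_enc = label_enc.fit_transform(y_train)
y_test_enc = label_enc.transform(y_test)        # reuse the same string-to-int mapping

# Scale the numerical features; fit only on the training set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)        # transform only, never refit on the test set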
Related
I have used one-hot encoding to encode the categorical features in my dataset. After encoding, I found that the training data ends up with more features than the testing data.
Example
A feature called week-day has the values 1, 2, 3, 4, 5, and one-hot encoding turns it into 5 new features, as expected. However, in the testing data the same feature only takes the values 1 and 2, so encoding it produces only 2 new features.
My question
The training input has more features than the test data, which ends up with only two. How does this affect the model? Does the model work at all? With fewer features in the test data, how will it handle them?
Any help or suggestion is highly appreciated.
This is not a problem per se, although a standard assumption in most ML settings is that the label distribution of the test set is identical to that of the training set.
You should simply re-encode your test data using the same features as your training data. That is, any feature value from the test set that also occurs in the training set should get the same index it got in the training set, while any value occurring in the test set but not in the training set should be discarded (because it is unclear what you should do with it anyway).
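In scikit-learn, one way to get this behaviour is to fit the encoder on the training data only and set handle_unknown="ignore", so the test set is encoded into exactly the same columns as the training set (a sketch with made-up data; sparse_output is called sparse in older scikit-learn versions):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: training sees five weekday values, testing only two.
X_train = pd.DataFrame({"week-day": [1, 2, 3, 4, 5]})
X_test = pd.DataFrame({"week-day": [1, 2]})

enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
train_encoded = enc.fit_transform(X_train)   # shape (5, 5)
test_encoded = enc.transform(X_test)         # shape (2, 5): same 5 columns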
I have many categorical variables which exist in my test set but not in my train set. They are important, so I can't drop them. Should I combine the train and test sets, or is there some other solution?
You have some options in this case: instead of a simple holdout split, you can use another technique to separate your data, such as K-Fold cross-validation or leave-one-out.
When using a holdout split, it is necessary to stratify your data so that every class appears in the train, test, and validation subsets, and you should NEVER use your test or validation data to fit your model; if you do, the model will learn from that data and you will probably end up overfitting.
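A sketch of both options with scikit-learn (X and y are placeholders for your features and labels, assumed here to be NumPy arrays):

from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified holdout: class proportions are preserved in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified K-Fold: every fold keeps the class proportions of y.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # fit the model on (X_tr, y_tr) and evaluate on (X_te, y_te)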
How did you end up in this situation? Normally you take a dataset and divide it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
But it is clear that, because they originate from the same original dataset, they both have the same categorical variables.
from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test? That doesn't make sense, since the data is already filled in. If it is filled in, then what are we predicting now? I need help!
First, why do we split the data?
We split the data to isolate a portion of it for validation purposes. We use the non-isolated portion to fit the algorithm, and then test the fitted algorithm against the isolated portion.
Where do we get the testing data from?
The testing data is actually part of your original dataset.
Do we just chop the data in half
Not exactly in half; we usually hold out around 20-40% of the data for testing.
If it is filled in, then what are we predicting now?
You are actually not trying to predict the result directly. You are training the algorithm to fit the training set and using the testing set to see how accurate the algorithm is.
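Putting that together, a minimal runnable sketch (the dataset, classifier and 80/20 split are just examples):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the labelled data purely for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)                # learn from the training portion

y_pred = clf.predict(X_test)             # predict on the held-out portion
print(accuracy_score(y_test, y_pred))    # compare predictions to the known labels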
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% could be used for testing. The data is split so that there is data for the model to be evaluated on to see how well the model performs on unseen data.
Training: This data is used to build your model. E.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real-world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree), the training data should itself be split into training and validation data. K-Fold cross-validation can be used here. The model is trained on the training portion and evaluated on the validation portion, and this is performed multiple times across the folds.
The results (e.g. MSE, F1, etc.) can then be evaluated on each fold and used to tune the hyperparameters. Using cross-validation for tuning ensures that the model is not overfitting to the test data.
Once the model is tuned, it can then be applied to the test data.
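A sketch of that tuning loop (the dataset and the max_depth grid are arbitrary): cross-validation runs on the training data only, and the test data is used exactly once at the end.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training data to pick max_depth.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 10, None]},
    cv=5,
)
search.fit(X_train, y_train)

# The held-out test data is touched only once, after tuning is finished.
print(search.best_params_, search.score(X_test, y_test))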
In chapter seven of the book "TensorFlow Machine Learning Cookbook", when pre-processing data the author uses scikit-learn's fit_transform function to get the tf-idf features of the text for training. The author passes all of the text data to the function before separating it into train and test sets. Is this the correct approach, or should we separate the data first and then perform fit_transform on the training set and transform on the test set?
According to the documentation of scikit-learn, fit() is used in order to
Learn vocabulary and idf from training set.
On the other hand, fit_transform() is used in order to
Learn vocabulary and idf, return term-document matrix.
while transform()
Transforms documents to document-term matrix.
On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially joins both operations); however, on the testing set you only need to transform() the testing instances (i.e. the documents).
Remember that the training set is used for learning purposes (learning is achieved through fit()), while the testing set is used to evaluate whether the trained model generalises well to new, unseen data points.
For more details you can refer to the article fit() vs transform() vs fit_transform()
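In code the distinction looks like this (a sketch with made-up documents; TfidfVectorizer bundles the vocabulary learning, idf computation and transformation described above):

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "dogs chase cats"]   # hypothetical training texts
test_docs = ["the dog sat"]                                  # hypothetical test texts

vectorizer = TfidfVectorizer()

# fit_transform: learn vocabulary and idf from the training documents,
# then return their term-document matrix.
X_train = vectorizer.fit_transform(train_docs)

# transform only: reuse the learned vocabulary and idf on the test documents;
# words never seen during training are simply ignored.
X_test = vectorizer.transform(test_docs)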
The author gives all text data to the function before separating it into train and test. Is this the correct approach, or should we separate the data first and then perform tf-idf fit_transform on train and transform on test?
I would consider this as leaking some information about the test set into the training set.
I always follow the rule that, before any pre-processing, the first thing to do is to separate the data and create a hold-out set.
Since we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set: when we deploy a model in real life it will encounter words it has never seen before, so validation on the test set has to reflect that.
We have to make sure that the new words in the test set are not a part of the vocabulary of the model.
Hence we have to use fit_transform on the training data and transform on the test data.
If you think about doing cross validation, then you can use this logic across all the folds.
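One way to apply that logic across the folds is to wrap the vectorizer and the model in a Pipeline, so the fit_transform/transform split happens automatically inside each fold (a sketch; the tiny dataset and the classifier choice are arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

docs = ["good movie", "bad movie", "great film", "terrible film"]   # hypothetical
labels = [1, 0, 1, 0]

# Inside each fold, the pipeline fits the vectorizer on that fold's training
# portion only and merely transforms the held-out portion.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
print(cross_val_score(pipe, docs, labels, cv=2))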
Is there a way to retrieve the list of feature names used for training a classifier, once it has been trained with the fit method? I would like to get this information before applying the model to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.
I have a solution which works but is not very elegant. This is an old post with no existing solutions, so I suppose there aren't any.
Create and fit your model. For example
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(**params)   # params: your chosen hyperparameters
model.fit(X_train, y_train)
Then you can add a 'feature_names' attribute, since you know the feature names at training time:
model.feature_names = list(X_train.columns.values)
I typically then put the model into a binary file to pass it around but you can ignore this
import joblib

joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])
Based on the documentation and previous experience, there is no way to get a list of the features considered in at least one of the splits.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In this case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant. Then train a new model with only the relevant features and use those features for prediction as well.
You don't need to know which features were selected during training. Just make sure that, during the prediction step, you give the fitted classifier the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.
If the shape of your test data is not the same as the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since Random Forests make a random selection of features for each decision tree (called estimators in sklearn), all of the features are likely to be used at least once.
However, if you want to know the features used, you can just call the attributes n_features_ and feature_importances_ on your classifier once fitted.
You can look here to see how you can retrieve the names of the most important features you used.
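For example, if the model was trained on a pandas DataFrame, the importances can be paired with the column names like this (a sketch, assuming a fitted RandomForestClassifier called clf and the training DataFrame X_train):

import pandas as pd

# Pair each column name with its importance and sort, most important first.
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))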
You can extract feature names from a trained XGBOOST model as follows:
model.get_booster().feature_names