SkLearn - Why LabelEncoder().fit only to training data - python

I may be missing something, but after following for quite some time the suggestion (of some senior data scientists) to call LabelEncoder().fit only on the training data and not on the test data, I have started to wonder why this is really necessary.
Specifically, in scikit-learn, if I call LabelEncoder().fit only on the training data, there are two possible scenarios:
The test set has some new labels in relation to the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (e.g. Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set with this LabelEncoder(), precisely because it encounters a new label.
The test set has the same labels as the training set. For example, both the training and the test set have the labels ['USA', 'UK', 'France']. However, fitting LabelEncoder() only on the training data is then essentially redundant, since the test set has the same known values as the training set.
Hence, what is the point of calling LabelEncoder().fit only on the training data and then LabelEncoder().transform on both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?
Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen fit LabelEncoder() only to the training data justified this by saying that the test set should be entirely new even to the simplest model, such as an encoder, and should not be mixed into any fitting with the training data. They did not mention anything about production or out-of-vocabulary purposes.

The main reason to do so is that at inference/production time (not testing) you might encounter labels that you have never seen before (and you won't be able to call fit() even if you wanted to).
In scenario 2, where you are guaranteed to always have the same labels across folds and in production, it is indeed redundant. But are you still guaranteed to see the same labels in production?
In scenario 1 you need a way to handle unknown labels. One popular approach is to map every unknown label to an "unknown" token. In natural language processing this is called the "out-of-vocabulary" problem, and the above approach is often used.
To do so and still use LabelEncoder(), you can pre-process your data and perform the mapping yourself.
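For example, here is a minimal sketch of that pre-processing step; the 'UNK' token name is just an illustrative choice:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_labels = pd.Series(['USA', 'UK', 'USA'])
test_labels = pd.Series(['USA', 'UK', 'France'])  # 'France' was never seen in training

# Replace anything not seen during training with a catch-all 'UNK' token
known = set(train_labels)
test_mapped = test_labels.where(test_labels.isin(known), 'UNK')

le = LabelEncoder()
le.fit(pd.concat([train_labels, pd.Series(['UNK'])]))  # reserve a slot for 'UNK'
train_enc = le.transform(train_labels)
test_enc = le.transform(test_mapped)  # no ValueError: 'France' was mapped to 'UNK'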

It's hard to guess why the senior data scientists gave you that advice without context, but I can think of at least one reason they may have had in mind.
If you are in the first scenario, where the training set does not contain the full set of labels, then it is often helpful to know this and so the error message is useful information.
Random sampling can often miss rare labels and so taking a fully random sample of all of your data is not always the best way to generate a training set. If France does not appear in your training set, then your algorithm will not be learning from it, so you may want to use a randomisation method that ensures your training set is representative of minority cases. On the other hand, using a different randomisation method may introduce new biases.
Once you have this information, it will depend on your data and the problem to be solved what the best approach is, but there are cases where it is important to have all labels present. A good example would be identifying the presence of a very rare illness: if your training data doesn't include the label indicating that the illness is present, then you had better re-sample.
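One concrete option in scikit-learn is a stratified split, which preserves the label proportions so that a rare label (with at least a few examples) still lands in the training set; a small sketch with toy data:
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a rare class: only 4 'France' rows out of 100
y = np.array(['USA'] * 60 + ['UK'] * 36 + ['France'] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
print(sorted(set(y_train)))  # ['France', 'UK', 'USA']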

Related

Using class encoding for prediction?

I was wondering if you can use class encoding, specifically OneHotEncoder in Python, for prediction, if you do not know all the future feature values?
To give more context: I am predicting whether or not a fine will be paid in the future, based upon the location, issuing office and amount (and potentially other features if I can get it to work). When I do one-hot encoding on my training set it works great (for 100k rows of data my test accuracy is around 92%, using a 75/25 split).
However, when I then introduce the new data, there are locations and 'offices' the encoder never saw, and therefore no features had been created for them. This means that in my training set I had 2302 columns when I built my model (random forest), while when predicting on the real data I have 3330 columns, so the model I built is no longer valid. (Note: I am also looking at other models, as the data is so sparse.)
How do you handle such a problem when class encoding? Can you only class encode if you have tight control on your future feature values?
Any help would be much appreciated. Apologies if my terminology is wrong, I am new to this and this is my first post on stackoverflow.
I can add code if it helps however I think it is more the theory which is relevant here.
There are two things to keep in mind when using one-hot encoding.
First, the number of classes in a column is important. If a class is missing from the test set but present in the train set, it won't be a problem. But if a class is missing from the train set and present in the test set, the encoder will not be able to recognize the new class. This seems to be the problem in your case.
Secondly, you should use the same encoder to encode the train and test splits. This way the number of columns in the train and test splits will be the same (rather than 2302 vs. 3330), and for any additional classes in the test set you can specify how the encoder should handle unknown categories. Have a look at the documentation.
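For illustration, a minimal sketch of fitting a single encoder on the training split only and reusing it on new data, with handle_unknown='ignore' so unseen categories become all-zero rows instead of raising an error (the column names here are made up):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'office': ['A', 'B', 'A'], 'location': ['NY', 'LA', 'NY']})
new = pd.DataFrame({'office': ['A', 'C'], 'location': ['NY', 'SF']})  # 'C' and 'SF' are unseen

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)                  # fit on the training split only
X_train = enc.transform(train)  # 4 columns: 2 offices + 2 locations
X_new = enc.transform(new)      # same 4 columns; unseen values encode as all zeros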
A possible way to deal with your issue would be to do the one-hot encoding on the entire dataset and then split the data 75/25. This will only work as long as you won't have any new data later.

What does it mean if I can not get 0 error on very small training dataset?

In order to validate whether the network can learn at all, people often try to overfit it on a small dataset.
I cannot reach 0 error with my dataset, but the output looks as if the network memorizes the training set (MAPE ~1%).
Is it absolutely necessary to get 0 error in order to prove that my network potentially works on my dataset?
Short answer: No
Reason:
It may be that a small number of examples are mislabeled. In the case of classification, try to identify which examples the network is unable to correctly classify (a small sketch follows this list). This will tell you whether your network has learnt all it can.
It can also happen if your data has no pattern that can be learnt - if the data is essentially random.
If the data is noisy, sometimes the noise will mask the features that are required for prediction.
The dataset may be chaotic, in the sense that the features vary quickly and dramatically between (and among) labels - that is, your data follows a very complex (non-smooth) function.
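For the classification case mentioned above, here is a small sketch of that inspection step, using a deliberately weak scikit-learn model in place of your network:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# A deliberately shallow model so that some training examples are missed
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
pred = clf.predict(X)

misclassified = np.where(pred != y)[0]
print(len(misclassified), "training examples were not fitted")
print(y[misclassified][:10])  # inspect these rows for label noise or ambiguity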
Hope this helps!

How does DAI handle new (unseen in training) categorical values within a production environment?

I would like confirmation that DAI deals with categorical values it didn't encounter during training in a similar way to this answer: h2o DRF unseen categorical values handling. I could not find this explicitly in the H2O Driverless AI documentation.
Please also state whether parts of that link are outdated (as mentioned in the answer) and, if the handling differs, how it is actually done. Please also note the version of H2O DAI. Thank you!
EDIT this information is now detailed in the documentation here
Below is a description of what happens when you try to predict on a categorical level not seen during training. Depending on the version of DAI you use, you may not have access to a certain algorithm, but given an algorithm, the details should apply to your version of DAI.
XGBoost, LightGBM, RuleFit, TensorFlow, GLM
Driverless AI's feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it's a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For target encoding, the global mean of the target value will be used. Etc.
and
FTRL
FTRL model doesn't distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it will hash all the data, row by row, to numeric and then make predictions. Since you can think of FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable "overlap", in terms of unique values, with the ones used to make predictions.
Since DAI uses different algorithms than H2O-3 (except for XGBoost), it's best to consider these as separate products with potentially different handling of unseen levels or missing values - though in some cases there are similarities.
As mentioned in the comment, the DRF documentation for H2O-3 should be up to date now.
Hope this explanation helps!
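To make the frequency-encoding and target-encoding fallbacks concrete, here is a small pandas sketch of the behaviour described above; it is only an illustration, not DAI's actual implementation:
import pandas as pd

train = pd.DataFrame({'city': ['NY', 'NY', 'LA'], 'target': [1, 0, 1]})
test = pd.DataFrame({'city': ['NY', 'SF']})  # 'SF' was never seen in training

# Frequency encoding: unseen levels fall back to 0
freq = train['city'].value_counts()
test['city_freq'] = test['city'].map(freq).fillna(0)

# Target encoding: unseen levels fall back to the global target mean
means = train.groupby('city')['target'].mean()
test['city_target'] = test['city'].map(means).fillna(train['target'].mean())
print(test)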

Dealing with missing values

I have two data sets, training and test set.
If I have NA values in the training set but not in the test set, I usually drop the rows (if they are few) in the training set and that's all.
But now I have a lot of NA values in both sets, so I have dropped the features that consist mostly of NA values, and I am wondering what to do next.
Should I just drop the same features in the test set and impute the rest missing values?
Is there any other technique I could use to preprocess the data?
Can machine learning algorithms like Logistic Regression, Decision Trees or Neural Networks handle missing values?
The data sets come from a Kaggle competition, so I can't do the preprocessing before splitting the data.
Thanks in advance
This question is not so easy to answer, because it depends on the type of NA values.
Are the NA values missing for some random reason, or is there a reason they are missing (no matching multiple-choice answer in a survey, or perhaps something people would rather not answer)?
For the first case, it would be fine to use a simple imputation strategy so that you can fit your model on the data. By that I mean something like mean imputation, sampling from an estimated probability distribution, or even sampling values at random. Note that if you simply take the mean of the existing values, you change the statistics of the dataset, i.e. you reduce the standard deviation. You should keep that in mind when choosing your model.
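A minimal sketch of mean imputation with scikit-learn, fitting the imputer on the training set only and reusing it on the test set:
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 5.0], [4.0, np.nan]])

imputer = SimpleImputer(strategy='mean')      # learns column means from the training set
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)        # test NAs are filled with the training means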
For the second case, you will have to apply your domain knowledge to find good fill values.
Regarding your last question: if you want to fill the values with a machine learning model, you can use the other features of the dataset and implicitly assume a dependency between the missing feature and the other features. Depending on the model you later use for prediction, you may not benefit from this intermediate estimation.
I hope this helps, but the correct answer really depends on the data.
In general, machine learning algorithms do not cope well with missing values (for mostly good reasons, as it is not known why they are missing or what it means to be missing, which could even be different for different observations).
Good practice would be to do the preprocessing before the split between training and test sets (are your training and test data truly random subsets of the data, as they should be?) and ensure that both sets are treated identically.
There is a plethora of ways to deal with your missing data and it depends strongly on the data, as well as on your goals, which are the better ways. Feel free to get in contact if you need more specific advice.
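As one illustration of "treated identically" (a sketch with made-up toy data): decide which columns to drop using the training set only, then apply the same column selection and the same imputation statistics to the test set:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, np.nan], 'c': [5.0, 6.0, 7.0]})
test = pd.DataFrame({'a': [np.nan, 2.0], 'b': [1.0, np.nan], 'c': [np.nan, 8.0]})

# Keep only columns that are less than 50% missing in the *training* set
na_frac = train.isna().mean()
cols_to_keep = na_frac[na_frac < 0.5].index
train_kept, test_kept = train[cols_to_keep], test[cols_to_keep]

# Impute the remaining gaps with statistics learned from the training set
imputer = SimpleImputer(strategy='median')
train_imp = imputer.fit_transform(train_kept)
test_imp = imputer.transform(test_kept)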

Classification tree in sklearn giving inconsistent answers

I am using a classification tree from sklearn, and when I train the model twice using the same data and predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and it worked as expected. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?
EDIT: After looking into the documentation I see that DecisionTreeClassifier has a random_state input which controls the starting point. By setting this value to a constant I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended method for doing this? Try some seeds at random? Or are all results expected to be about the same?
The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.
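For reference, a quick sketch of choosing the splitter (using the iris data purely as a stand-in):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# splitter='best' is the default; splitter='random' draws random thresholds per candidate feature
best_tree = DecisionTreeClassifier(splitter='best', random_state=0).fit(iris.data, iris.target)
rand_tree = DecisionTreeClassifier(splitter='random', random_state=0).fit(iris.data, iris.target)
print(best_tree.get_depth(), rand_tree.get_depth())  # the two trees typically differ in structure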
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
See also: Source code for the splitter
The answer provided by Matt Krause does not answer the question entirely correctly.
The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.
When using the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features should be considered at each split. At each node, the classifier randomly samples max_features without replacement (!).
Thus, when using max_features=n_features, all features are considered at each split. However, the implementation will still sample them at random from the list of features (even though this means all features will be sampled, in this case). Thus, the order in which the features are considered is pseudo-random. If two possible splits are tied, the first one encountered will be used as the best split.
This is exactly the reason why your decision tree is yielding different results each time you call it: the order of features considered is randomized at each node, and when two possible splits are then tied, the split to use will depend on which one was considered first.
As has been said before, the seed used for the randomization can be specified using the random_state parameter.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
Source: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier#Notes
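To make this concrete, a small sketch (reusing the question's iris example) showing that fixing random_state makes repeated fits reproducible:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# With a fixed random_state the feature permutation is reproducible,
# so two fits on the same data yield identical trees and probabilities
clf_a = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
clf_b = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(np.array_equal(clf_a.predict_proba(iris.data), clf_b.predict_proba(iris.data)))  # True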
I don't know anything about sklearn but...
I guess DecisionTreeClassifier has some internal state, created by fit, which only gets updated/extended.
Perhaps you should create a new one?
