I was wondering if you can use class encoding, specifically OneHotEncoder in Python, for prediction, if you do not know all the future feature values?
To give more context: I am predicting whether or not a fine will be paid in the future based upon the location, issuing office and amount (and potentially other features if I can get them to work). When I do one-hot encoding on my training set it works great (for 100k rows of data my test accuracy is around 92%, using a 75/25 split).
However, when I then introduce the new data, there are locations and 'offices' the encoder never saw, so the corresponding features had not been created during training. This means that I had 2302 columns when I built my model (a random forest) on the training set, while when predicting on the real data I have 3330 columns, so the model I built is no longer valid. (Note: I am also looking at other models, as the data is so sparse.)
How do you handle such a problem when class encoding? Can you only class encode if you have tight control on your future feature values?
Any help would be much appreciated. Apologies if my terminology is wrong, I am new to this and this is my first post on stackoverflow.
I can add code if it helps however I think it is more the theory which is relevant here.
There are two things to keep in mind when using OneHotEncoding.
The number of classes in a column is important. If a class is missing from the test set but present in the train set, it won't be a problem. But if a class is missing from the train set and present in the test set, the encoding will not be able to recognize the new class. This seems to be the problem in your case.
Secondly, you should use the same encoder to encode the train and test splits. This way the number of columns in the two splits will be the same (rather than 2302 vs. 3330), and for any additional classes in the test set you can specify how the encoder should deal with unknown categories (see the handle_unknown parameter in the documentation).
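As a minimal sketch of that second point, assuming hypothetical DataFrames train_df and new_df and made-up column names, scikit-learn's OneHotEncoder can be told to ignore categories it did not see during fitting:

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column and target names, for illustration only
categorical_cols = ['location', 'issuing_office']

# Fit on the training data only; handle_unknown='ignore' encodes unseen
# categories as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(train_df[categorical_cols])
X_new = encoder.transform(new_df[categorical_cols])   # same column count as X_train

model = RandomForestClassifier().fit(X_train, train_df['paid'])
predictions = model.predict(X_new)

With this, a row whose location was never seen in training simply gets zeros in all of the location columns, so the prediction input keeps the same 2302-column shape as the training set.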
A possible way to deal with your issue would be to do the one-hot encoding on the entire dataset and then split the data 75/25. This will work provided you won't be getting any new data later.
I am working on wastewater data. The data is collected every 5 minutes. This is the sample data.
The thresholds for the individual parameters are provided. My question is: what kind of models should I go for to classify the water as usable or not usable, and also output the anomaly because of which it is unusable (if possible, since it is a combination of the variables)? The yes/no column has not been created yet and will be provided to me.
The other question I have is: how do I keep the model running, given that the data is collected every 5 minutes?
Your data and use case seem a good fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited for structured data with low dimensionality and no missing values. They also work well without normalizing your variables.
Scikit learn is super mature and easy to use, so you should be able to get something working without too much trouble.
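As a small sketch of what that could look like (the file name, feature columns and label below are placeholders, not your actual data):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv('wastewater_samples.csv')        # hypothetical file
X = df[['ph', 'turbidity', 'dissolved_oxygen']]   # hypothetical parameters
y = df['usable']                                  # the yes/no column you will receive

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=4)   # a shallow tree stays easy to interpret
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# The learned rules show which parameter thresholds drove each decision
print(export_text(clf, feature_names=list(X.columns)))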
As regards timing, I'm not sure how you or your employee will be taking samples, so I can't say. If you will be getting and reading samples at that rate, using your model to label the data should not be a problem, but I'm not sure I've understood your situation.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix this?", and not so much at general questions such as this one. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!
I would like confirmation that DAI follows a similar approach for dealing with categorical values it didn't encounter during training as the one described in this answer: h2o DRF unseen categorical values handling. I could not find it stated explicitly in the H2O Driverless AI documentation.
Please also state whether parts of that link are outdated (as mentioned in the answer), and how unseen values are processed if this is handled differently. Please note the version of H2O DAI. Thank you!
EDIT: this information is now detailed in the documentation here.
Below is a description of what happens when you try to predict on a categorical level not seen during training. Depending on the version of DAI you use, you may not have access to a certain algorithm, but given an algorithm, the details should apply to your version of DAI.
XGBoost, LightGBM, RuleFit, TensorFlow, GLM
Driverless AI's feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it's a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For target encoding, the global mean of the target value will be used. Etc.
and
FTRL
The FTRL model doesn't distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it will hash all the data, row by row, to numeric values and then make predictions. Since you can think of FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable "overlap", in terms of unique values, with the data used to make predictions.
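To make the quoted behaviour for the tree and linear models concrete, here is a rough pandas illustration of the two encodings mentioned above for unseen levels; this is only a sketch of the idea, not DAI's actual implementation:

import pandas as pd

train = pd.DataFrame({'level': ['A', 'A', 'B'], 'target': [1, 0, 1]})
new = pd.DataFrame({'level': ['A', 'C']})   # 'C' was never seen during training

# Frequency encoding: unseen levels fall back to 0
freq = train['level'].value_counts()
new['level_freq'] = new['level'].map(freq).fillna(0)

# Target encoding: unseen levels fall back to the global mean of the target
target_means = train.groupby('level')['target'].mean()
new['level_target'] = new['level'].map(target_means).fillna(train['target'].mean())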
Since DAI uses different algorithms than H2O-3 (except for XGBoost), it's best to consider these as separate products with potentially different handling of unseen levels or missing values - though in some cases there are similarities.
As mentioned in the comment, the DRF documentation for H2O-3 should be up to date now.
Hope this explanation helps!
I have two data sets, training and test set.
If I have NA values in the training set but not in the test set, I usually drop the rows (if they are few) in the training set and that's all.
But now I have a lot of NA values in both sets, so I have dropped the features in which most of the values were NA, and I am wondering what to do next.
Should I just drop the same features in the test set and impute the rest missing values?
Is there any other technique I could use to preprocess the data?
Can machine learning algorithms like Logistic Regression, Decision Trees or Neural Networks handle missing values?
The data sets come from a Kaggle competition, so I can't do the preprocessing before splitting the data.
Thanks in advance
This question is not so easy to answer, because it depends on the type of NA values.
Are the NA values missing for some random reason? Or is there a reason they are missing (no matching multiple-choice answer in a survey, or maybe something people would not like to answer)?
For the first, it would be fine to use a simple imputation strategy so that you can fit your model on the data. By that, I mean something like mean imputation, sampling from an estimated probability distribution, or even sampling values at random. Note that if you simply fill in the mean of the existing values, you change the statistics of the dataset, i.e. you reduce the standard deviation. You should keep that in mind when choosing your model.
For the second, you will have to apply your domain knowledge to find good fill values.
Regarding your last question: if you want to fill the values with a machine learning model, you may use the other features of the dataset and implicitly assume a dependency between the missing feature and the other features. Depending on the model you will later use for prediction, you may not benefit from the intermediate estimation.
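For the simple imputation strategy mentioned above, a minimal sketch with scikit-learn's SimpleImputer could look like this (assuming train and test are DataFrames; the column names are placeholders):

from sklearn.impute import SimpleImputer

num_cols = ['age', 'income']               # hypothetical columns containing NAs

imputer = SimpleImputer(strategy='mean')   # or 'median' / 'most_frequent'
train[num_cols] = imputer.fit_transform(train[num_cols])   # learn the means on the training set
test[num_cols] = imputer.transform(test[num_cols])         # reuse the training statistics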
I hope this helps, but the correct answer really depends on the data.
In general, machine learning algorithms do not cope well with missing values (for mostly good reasons, as it is not known why they are missing or what it means to be missing, which could even be different for different observations).
Good practice would be to do the preprocessing before the split between training and test sets (are your training and test data truly random subsets of the data, as they should be?) and ensure that both sets are treated identically.
There is a plethora of ways to deal with your missing data and it depends strongly on the data, as well as on your goals, which are the better ways. Feel free to get in contact if you need more specific advice.
I may be missing something but after following for quite a long time now the suggestion (of some senior data scientists) to LabelEncoder().fit only to training data and not also to test data then I start to think why is this really necessary.
Specifically, at SkLearn if I want to LabelEncoder().fit only to training data then there are two different scenarios:
The test set has some new labels in relation to the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (e.g. Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set according to this LabelEncoder(), precisely because it encounters a new label.
The test set has the same labels as the training set. For example, both the training and the test set have the labels ['USA', 'UK', 'France']. However, then LabelEncoder().fit only to training data is essentially redundant, since the test set has the same known values as the training set.
Hence, what is the point of LabelEncoder().fit only to training data and then LabelEncoder().transform both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?
Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen fit LabelEncoder() only to training data justified this by saying that the test set should be entirely new even to the simplest model, like an encoder, and should not be mixed into any fitting with the training data. They did not mention anything about production or out-of-vocabulary purposes.
The main reason to do so is because in inference/production time (not testing) you might encounter labels that you have never seen before (and you won't be able to call fit() even if you wanted to).
In scenario 2, where you are guaranteed to always have the same labels across folds and in production, it is indeed redundant. But are you still guaranteed to see the same labels in production?
In scenario 1 you need to find a solution to handle unknown labels. One popular approach is to map every unknown label to an "unknown" token. In natural language processing this is called the "out of vocabulary" problem, and the above approach is often used.
To do so and still use LabelEncoder() you can pre-process your data and perform the mapping yourself.
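A small sketch of that pre-processing (the column name and the 'UNK' token are placeholders):

from sklearn.preprocessing import LabelEncoder

# Replace labels the training set never saw with a shared 'unknown' token
known = set(train['country'])
test['country'] = test['country'].where(test['country'].isin(known), 'UNK')

le = LabelEncoder()
le.fit(list(train['country']) + ['UNK'])   # make sure the unknown token is in the vocabulary
train_enc = le.transform(train['country'])
test_enc = le.transform(test['country'])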
It's hard to guess why the senior data scientists gave you that advice without context, but I can think of at least one reason they may have had in mind.
If you are in the first scenario, where the training set does not contain the full set of labels, then it is often helpful to know this and so the error message is useful information.
Random sampling can often miss rare labels and so taking a fully random sample of all of your data is not always the best way to generate a training set. If France does not appear in your training set, then your algorithm will not be learning from it, so you may want to use a randomisation method that ensures your training set is representative of minority cases. On the other hand, using a different randomisation method may introduce new biases.
Once you have this information, it will depend on your data and problem to be solved as to what the best approach to solve it will be, but there are cases where it is important to have all labels present. A good example would be identifying the presence of a very rare illness. If your training data doesn't include the label indicating that the illness is present, then you better re-sample.
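One concrete randomisation method of that kind (just an illustration, not necessarily what those data scientists had in mind) is a stratified split, which keeps the label proportions the same in both sets:

from sklearn.model_selection import train_test_split

# stratify=y keeps rare labels represented in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)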
Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:
I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in shape, being (1460,81) and (1459,81).
However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with scikit-learn's LinearRegression module, it builds a model for 306 variables and then tries to predict with only 294 of them. This then, naturally, leads to the following error:
ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)
How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?
Thanks and I hope I've made myself clear :)
This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.
One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.
The second approach is what you mention in your question, so I'll go through it with pandas.
By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:
# Encode the test data, then align its columns to the training columns,
# filling any columns missing from the test encoding with 0
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns,
                                              fill_value=0)
This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created by encoding the test data but were created during the training process.
You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
For anyone interested: I ended up merging the train and test set, then generating the dummies, and then splitting the data again at exactly the same fraction. That way there wasn't any issue with different shapes anymore, as it generated exactly the same dummy data.
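A rough sketch of that merge-encode-split approach (the DataFrame and column names are placeholders; a target column that exists only in the training set should be handled separately):

import pandas as pd

# Concatenate with keys so the original split can be recovered exactly
combined = pd.concat([train_df, test_df], keys=['train', 'test'])
combined = pd.get_dummies(combined, columns=['neighborhood', 'house_style'])

train_encoded = combined.xs('train')
test_encoded = combined.xs('test')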
This works for me:
Initially, I was getting this error message:
shapes (15754,3) and (4, ) not aligned
I found out that I was creating a model using 3 variables in my training data. But when I add a constant with X_train = sm.add_constant(X_train), the constant variable automatically gets created, so in total there are now 4 variables.
And when you test this model, the test data by default still has only 3 variables, so the error pops up because of the dimension mismatch.
So, I used the same trick and added the constant to X_test as well:
`X_test = sm.add_constant(X_test)`
Though this looks like a useless variable, it solves the whole issue.
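For anyone following along, a minimal sketch of the idea with statsmodels (the variable names are illustrative, and OLS is just an example model):

import statsmodels.api as sm

# Add the intercept column to both design matrices so their shapes match
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test, has_constant='add')

model = sm.OLS(y_train, X_train_const).fit()
predictions = model.predict(X_test_const)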