Dealing with missing values - python

I have two data sets, training and test set.
If I have NA values in the training set but not in the test set, I usually drop the rows (if they are few) in the training set and that's all.
But now I have a lot of NA values in both sets, so I dropped the features that had the most NA values, and I was wondering what to do now.
Should I just drop the same features in the test set and impute the rest missing values?
Is there any other technique I could use to preprocess the data?
Can Machine Learning algorithms like Logistic Regression, Decision Trees or Neural Networks handle missing values?
The data sets come from a Kaggle competition, so I can't do the preprocessing before splitting the data.
Thanks in advance

This question is not so easy to answer, because it depends on the type of NA values.
Are the NA values missing for some random reason? Or is there a reason they are missing (no matching multiple-choice answer in a survey, or maybe something people would not like to answer)?
For the first, it would be fine to use a simple imputation strategy so that you can fit your model on the data. By that, I mean something like mean imputation, sampling from an estimated probability distribution, or even sampling values at random. Note that if you simply take the mean of the existing values, you change the statistics of the dataset, i.e. you reduce the standard deviation. You should keep that in mind when choosing your model.
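For illustration, here is a minimal sketch of that first case, assuming scikit-learn and a single made-up numeric feature; it also shows the standard deviation shrinking after mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature with a few missing entries.
x = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [6.0]])

imputer = SimpleImputer(strategy="mean")   # "median" or "most_frequent" also work
x_filled = imputer.fit_transform(x)

print(np.nanstd(x))       # spread of the observed values only
print(np.std(x_filled))   # smaller: mean imputation shrinks the standard deviation
```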
For the second, you will have to apply your domain knowledge to find good fill values.
Regarding your last question: if you want to fill the values with a machine learning model, you may use the other features of the dataset and implicitly assume a dependency between the missing feature and the other features. Depending on the model you will later use for prediction, you may not benefit from the intermediate estimation.
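If you do want to go the model-based route in scikit-learn, one option is the (still experimental) IterativeImputer, which predicts each missing value from the other features; a rough sketch on made-up data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to enable the estimator)
from sklearn.impute import IterativeImputer

# Hypothetical numeric feature matrices with NaNs.
X_train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])
X_test = np.array([[2.0, np.nan], [np.nan, 4.0]])

imputer = IterativeImputer(random_state=0)
X_train_filled = imputer.fit_transform(X_train)  # fit only on the training data
X_test_filled = imputer.transform(X_test)        # reuse the fitted imputer on the test data
```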
I hope this helps, but the correct answer really depends on the data.

In general, machine learning algorithms do not cope well with missing values (for mostly good reasons, as it is not known why they are missing or what it means to be missing, which could even be different for different observations).
Good practice would be to do the preprocessing before the split between training and test sets (are your training and test data truly random subsets of the data, as they should be?) and ensure that both sets are treated identically.
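Whichever strategy you choose, the crucial part is applying exactly the same steps to both sets; a minimal sketch with pandas and scikit-learn (the toy frames and column names are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for your Kaggle train/test sets.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, np.nan], "c": [7.0, 8.0, np.nan]})
test = pd.DataFrame({"a": [np.nan, 2.0], "b": [np.nan, 1.0], "c": [np.nan, 9.0]})

cols_to_drop = ["b"]                        # columns you decided are mostly NA
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)      # drop the very same columns from the test set

imputer = SimpleImputer(strategy="median")
train[:] = imputer.fit_transform(train)     # learn the fill values once...
test[:] = imputer.transform(test)           # ...and apply exactly the same ones to the test set
print(test)
```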
There is a plethora of ways to deal with your missing data, and which ones are better depends strongly on the data as well as on your goals. Feel free to get in contact if you need more specific advice.


For a binary classification model based on multiple continuous variables, what model should be used?

I am working with waste water data. The data is collected every 5 minutes. This is the sample data.
The thresholds for the individual parameters are provided. My question is what kind of models I should go for to classify the water as usable or not usable, and also output the anomaly because of which it is unusable (if possible, since it is a combination of the variables). The yes/no column has not been created yet and will be provided to me.
The other question I have is how do I keep it running since the data is collected every 5 minutes?
Your data and use case seem fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited for structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit learn is super mature and easy to use, so you should be able to get something working without too much trouble.
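A minimal sketch of that workflow with scikit-learn, using made-up sensor columns and a yes/no label since your real schema isn't shown:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical 5-minute sensor readings with a usable (1) / not usable (0) label.
df = pd.DataFrame({
    "ph":        [7.1, 6.4, 8.9, 7.0, 5.2, 7.3],
    "turbidity": [1.2, 3.8, 0.9, 1.1, 6.5, 1.0],
    "usable":    [1, 0, 1, 1, 0, 1],
})

X, y = df[["ph", "turbidity"]], df["usable"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))                             # held-out accuracy
print(export_text(clf, feature_names=["ph", "turbidity"]))   # the learned rules, i.e. why a sample was flagged
```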
As regards timing, I'm not sure how you or your employee will be taking samples, so I can't say for certain. If you will be getting and reading samples at that rate, using your model to label the data should not be a problem, but I'm not sure I understood your situation.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix this?", and not so much at general questions such as this. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!

Using class encoding for prediction?

I was wondering: can you use class encoding, specifically OneHotEncoder in Python, for prediction if you do not know all the future feature values?
To give more context: I am predicting whether or not a fine will be paid in the future based upon the location, issuing office & amount (and potentially other features if I can get it to work). When I do one-hot encoding on my training set it works great (for 100k rows of data my test accuracy is around 92%, using a 75/25 split).
However, when I then introduce the new data, there are locations and 'offices' the encoder never saw, so no features had been created for them. This means that in my training set I had 2302 columns when I built my model (random forest), while when predicting on the real data I have 3330 columns, so the model I built is no longer valid. (Note, I am also looking at other models as the data is so sparse.)
How do you handle such a problem when class encoding? Can you only class encode if you have tight control on your future feature values?
Any help would be much appreciated. Apologies if my terminology is wrong, I am new to this and this is my first post on stackoverflow.
I can add code if it helps however I think it is more the theory which is relevant here.
There are two things to keep in mind when using OneHotEncoding.
First, the number of classes in a column matters. If a class is missing from the test set but present in the train set, it won't be a problem. But if a class is missing from the train set and present in the test set, the encoder will not be able to recognize the new class. This seems to be the problem in your case.
Secondly, you should use the same encoder to encode the train and test splits. This way the number of columns in the train and test splits will be the same (rather than 2302 vs. 3330 columns), and in the case of any additional classes in the test set, you can specify how the encoder should deal with unknown categories. Have a look at the documentation.
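Concretely, scikit-learn's OneHotEncoder can be fitted once on the training data and told to ignore categories it has never seen; a small sketch with made-up location/office values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"location": ["NY", "LA"], "office": ["A", "B"]})
test = pd.DataFrame({"location": ["NY", "SF"], "office": ["A", "C"]})   # SF and C were never seen in training

enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train)   # fit only on the training data
X_test = enc.transform(test)         # same columns; unseen categories encode as all zeros

print(X_train.shape, X_test.shape)   # (2, 4) and (2, 4): identical column count
```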
A possible way to deal with your issue would be to do the OneHotEncoding on the entire dataset and then split the data 75/25. This will work provided you won't have any new data later.

How does DAI handle new (unseen in training) categorical values within a production environment?

I would like confirmation that DAI follows a similar approach for dealing with categorical variables it didn't encounter during training as in this answer: h2o DRF unseen categorical values handling. I could not find it explicitly within the H2O Driverless AI documentation.
Please also state if parts of that link are outdated (as mentioned in the answer) and, if this is handled differently, how it is being processed. Please note the version of H2O DAI. Thank you!
EDIT: this information is now detailed in the documentation here.
Below is a description of what happens when you try to predict on a categorical level not seen during training. Depending on the version of DAI you use, you may not have access to a certain algorithm, but given an algorithm, the details should apply to your version of DAI.
XGBoost, LightGBM, RuleFit, TensorFlow, GLM
Driverless AI's feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it's a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For target encoding, the global mean of the target value will be used. Etc.
and
FTRL
The FTRL model doesn't distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it will hash all the data, row by row, to numeric values and then make predictions. Since you can think of FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable "overlap", in terms of unique values, with the data used to make predictions.
Since DAI uses different algorithms than H2O-3 (except for XGBoost), it's best to consider these as separate products with potentially different handling of unseen levels or missing values - though in some cases there are similarities.
As mentioned in the comment, the DRF documentation for H2O-3 should be up to date now.
Hope this explanation helps!

SkLearn - Why LabelEncoder().fit only to training data

I may be missing something, but after following for quite a long time now the suggestion (of some senior data scientists) to LabelEncoder().fit only to the training data and not also to the test data, I am starting to wonder why this is really necessary.
Specifically, in scikit-learn, if I want to LabelEncoder().fit only to the training data, then there are two different scenarios:
The test set has some new labels in relation to the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (e.g. Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set according to this LabelEncoder(), precisely because it encounters a new label.
The test set has the same labels as the training set. For example, both the training and the test set have the labels ['USA', 'UK', 'France']. However, then LabelEncoder().fit only to the training data is essentially redundant, since the test set has the same known values as the training set.
Hence, what is the point of LabelEncoder().fit only to the training data and then LabelEncoder().transform both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?
Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen LabelEncoder().fit only to the training data justified this by saying that the test set should be entirely new even to the simplest model, like an encoder, and should not be mixed into any fitting with the training data. They did not mention anything about any production or out-of-vocabulary purposes.
The main reason to do so is because in inference/production time (not testing) you might encounter labels that you have never seen before (and you won't be able to call fit() even if you wanted to).
In scenario 2, where you are guaranteed to always have the same labels across folds and in production, it is indeed redundant. But are you really guaranteed to see the same labels in production?
In scenario 1 you need to find a solution to handle unknown labels. One popular approach is to map every unknown label to an unknown token. In natural language processing this is called the "out of vocabulary" problem and the above approach is often used.
To do so and still use LabelEncoder() you can pre-process your data and perform the mapping yourself.
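A rough sketch of that mapping, with a made-up "UNK" token and toy labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_labels = pd.Series(["USA", "UK", "USA"])
test_labels = pd.Series(["USA", "France"])        # "France" never appears in training

le = LabelEncoder()
le.fit(list(train_labels) + ["UNK"])              # reserve an explicit unknown token

known = set(train_labels)
test_mapped = test_labels.where(test_labels.isin(known), "UNK")  # unseen labels -> "UNK"

y_train = le.transform(train_labels)
y_test = le.transform(test_mapped)
print(y_test)
```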
It's hard to guess why the senior data scientists gave you that advice without context, but I can think of at least one reason they may have had in mind.
If you are in the first scenario, where the training set does not contain the full set of labels, then it is often helpful to know this and so the error message is useful information.
Random sampling can often miss rare labels and so taking a fully random sample of all of your data is not always the best way to generate a training set. If France does not appear in your training set, then your algorithm will not be learning from it, so you may want to use a randomisation method that ensures your training set is representative of minority cases. On the other hand, using a different randomisation method may introduce new biases.
Once you have this information, it will depend on your data and problem to be solved as to what the best approach to solve it will be, but there are cases where it is important to have all labels present. A good example would be identifying the presence of a very rare illness. If your training data doesn't include the label indicating that the illness is present, then you better re-sample.
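If you do want rare labels represented proportionally in both splits, scikit-learn's train_test_split accepts a stratify argument; a small sketch with made-up labels:

```python
from sklearn.model_selection import train_test_split

X = list(range(20))
y = ["common"] * 16 + ["rare"] * 4     # a minority class that a plain random split might miss

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.count("rare"), y_test.count("rare"))  # 3 and 1: the rare class appears in both splits
```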

How feature columns work in tensorflow?

I read that feature columns in TensorFlow are used to define our data, but how and why? How do feature columns work, and why do they even exist if we can make a custom estimator without them?
And if they are necessary, why don't libraries like Keras use them?
Broadly Speaking
This may be too general to answer. You may want to watch some videos or do more reading on machine learning, because this is a broad topic.
I will try to explain what features of data are used for.
A "feature" of the data is a meaningful variable that should separate two classes from each other. For example, if we choose the feature "weight", we can tell elephants apart from squirrels. They have very different weights, and our machine learning algorithm can learn to "understand" that an animal with a heavy weight is more likely to be an elephant than it is to be a squirrel. In a real scenario you would generally have more than one feature.
I'm not sure why you would say that Keras does not use features. They are a fundamental aspect of many classification problems. Some datasets may contain labelled data or labelled features, like this one: https://keras.io/datasets/#cifar100-small-image-classification
When we "don't use features", I think a more accurate way to state that would be that the data is unlabelled. In this case, a machine learning algorithm can still find relationships in the data, but without human labels applied to the data.
If you Ctrl+F for the word "features" on this page you will see places where Keras accepts them as an argument: https://keras.io/layers/core/
I am not a machine learning expert so if anyone is able to correct my answer, I would appreciate that too.
In Tensorflow
My understanding of TensorFlow's feature column implementation in particular is that it allows you to cast raw data into a typed column that lets the algorithm better distinguish what type of data you are passing. For example, Latitude and Longitude could be passed as two numerical columns, but as the docs say here, using a Crossed Column for Latitude X Longitude may allow the model to train on the data in a more meaningful/effective way. After all, what "Latitude" and "Longitude" really mean is "Location." As for why Keras does not have this functionality, I am not sure; hopefully someone else can offer insight on this topic.
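A small sketch of that Latitude X Longitude cross with the tf.feature_column API (the bucket boundaries and sizes are made up, and this API has since been deprecated in favour of Keras preprocessing layers, so it assumes a TF version that still ships tf.feature_column and tf.estimator):

```python
import tensorflow as tf

# Continuous inputs are bucketized first so they become categorical.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")
lat_buckets = tf.feature_column.bucketized_column(latitude, boundaries=[-60.0, -30.0, 0.0, 30.0, 60.0])
lon_buckets = tf.feature_column.bucketized_column(longitude, boundaries=[-120.0, -60.0, 0.0, 60.0, 120.0])

# Crossing the two bucketized columns gives the model a single "location cell" feature.
lat_x_lon = tf.feature_column.crossed_column([lat_buckets, lon_buckets], hash_bucket_size=1000)

# A linear estimator can consume the crossed column directly.
model = tf.estimator.LinearClassifier(
    feature_columns=[lat_buckets, lon_buckets, lat_x_lon]
)
```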
