I am learning machine learning using Python and understand that I cannot run categorical data through the model, and must first get dummies. Some of my categorical data has nulls (only 2 features, and only a very small fraction of their values). When I convert to dummies and then check for missing values, it always shows none. Should I impute beforehand? Or does one impute categorical data at all? For instance, if the category was male/female, I wouldn't want to replace nulls with the most_frequent value. I see how this would make sense if the feature was income and I was going to impute missing values. Income is income, whereas a male is not a female.
So does it make sense to impute categorical data? Am I way off? I am sorry this is more applied theory than actual Python programming but was not sure where to post this type of question.
I think the answer depends on the properties of your features.
Fill in missing data with expectation maximization (EM)
Say you have two features, one is gender (has missing data) and the other one is wage (no missing data). If there is a relationship between the two features, you could use information contained in the wage to fill in missing values in gender.
To put it a little bit more formally - if you have a missing value in gender column but you have a value for wage, EM tells you P(gender=Male | wage=w0, theta), i.e. the probability of the gender being male given wage=w0 and theta which is a parameter obtained with maximum likelihood estimation.
In simpler terms, this could be achieved by running regression of gender on wage (use logistic regression since the y-variable is categorical) to give you the probability described above.
Visually:
(these are totally ad-hoc values, but they convey the idea that the wage distribution for males is generally above that for females)
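A minimal sketch of this idea with scikit-learn (the wage values and column names here are invented purely for illustration): fit a logistic regression of gender on wage using the rows where gender is known, then predict the most probable gender for the rows where it is missing.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: wages for males sit generally above those for females
df = pd.DataFrame({
    "wage":   [30, 35, 55, 60, 32, 58, 40, 50],
    "gender": ["F", "F", "M", "M", "F", "M", None, None],
})

# Fit only on rows where gender is observed
known = df["gender"].notna()
model = LogisticRegression()
model.fit(df.loc[known, ["wage"]], df.loc[known, "gender"])

# Fill the missing genders with the most probable class given the wage
df.loc[~known, "gender"] = model.predict(df.loc[~known, ["wage"]])
print(df)
```

If you want the probabilities themselves (P(gender=Male | wage)), use `model.predict_proba` instead of `model.predict`.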
Fill in missing values #2
You can probably fill in the missing values with the most frequent observation if you believe the data is missing at random, even if there is no relationship between the two features. I would be cautious, though.
Don't impute
If there is no relationship between the two features and you believe the missing data might not be missing at random, it is safer not to impute at all.
Related
I am working on a project to model the change in a person's happiness depending on many variables.
Most of the explanatory variables are daily (how much food they ate, daily exercise, sleep etc…) but some of them are weekly - and they're supposed to be weekly, and have an effect on the predicted variable once a week.
For instance, one of the weekly variables is a person's change of weight when they weigh themselves on the same day each week.
This data is only available once a week and has an effect on the person's happiness on that day.
In that case, can someone please advise how I can handle missing data in Python on the days when there is no data available for the weekly variables?
It would be wrong to extrapolate data on missing days since the person's happiness isn't affected at all by those weekly variables on days when they aren't available.
I have created a dummy with 1 when the weekly data is available and 0 if not, but I don't know what to do for the missing data. I can't leave NaNs otherwise python won't run the regression but I can't put 0 since sometimes the actual variable value (ex: change in weight) on the day when the data is available can be 0.
Scikit-learn provides imputer classes (e.g. SimpleImputer) that deal with missing values by following a user-defined strategy (using a constant default value, using the mean of the column, ...). If you do not want to skew training, I'd suggest you use a statistic rather than some arbitrary default value.
Additionally, you can store information about which values have been imputed vs. which values are organic using a MissingIndicator.
You can find out more about the different imputers, with some example code, in the scikit-learn documentation.
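A small sketch of both classes together (the data here is made up; in practice you would fit on your training set only):

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Boolean mask marking which entries were imputed vs. organic
indicator = MissingIndicator()
mask = indicator.fit_transform(X)

print(X_filled)  # NaNs replaced by column means (4.0 and 2.5)
print(mask)      # True exactly where X had a NaN
```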
One way to solve this issue:
Fill in the NaN with the last value (in this case measured weight)
Add a boolean variable "value available today" (which was already done as described in the question)
Add one more variable: (last available value / previous value) * "value available today".
Caveat: modeling a product might prove a little difficult for linear regression algorithms.
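The three steps above could be sketched in pandas like this (the column names and the 7-day toy series are invented for illustration):

```python
import pandas as pd

# Hypothetical daily data with a weight measured only once a week
df = pd.DataFrame({
    "day":    range(1, 8),
    "weight": [70.0, None, None, None, None, None, 69.5],
})

# Step 2: dummy marking the days when the weekly value is available
df["available"] = df["weight"].notna().astype(int)

# Step 1: carry the last measured value forward over the gap days
df["weight_filled"] = df["weight"].ffill()

# Step 3: (last available value / previous value) * availability dummy,
# zeroed out on days without a measurement
prev = df["weight"].ffill().shift(1)
df["ratio"] = ((df["weight"] / prev) * df["available"]).fillna(0)
print(df)
```

On the measurement day the ratio is 69.5 / 70.0; on every other day it is 0, so the regression input contains no NaNs.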
I am facing a dilemma with a project of mine. A few of the variables barely have any data: almost 99% of the observations are missing.
I am thinking of a couple of options:
Impute missing value with mean/knn imputation
Impute missing value with 0.
I couldn't think of anything in this direction. If someone can help that would be great.
P.S. I am not comfortable using mean imputation when 99% of the data is missing. Does someone have a reasoning for or against that? Kindly let me know.
Data has 397576 Observations out of which below are the missing values
99% of the data is missing!!!???
Well, if your dataset has less than 100,000 examples, then you may want to remove those columns instead of imputing through any methods.
If you have a larger dataset, then mean imputation or knn imputation would be... OK. But these methods don't capture the statistics of your data and can eat up memory. Instead, use Bayesian machine learning methods, like fitting a Gaussian process to your data, or a variational auto-encoder for those sparse columns.
1.) Here are a few links to learn about and use Gaussian processes to sample missing values from the dataset:
What is a Random Process?
How to handle missing values with GP?
2.) You can also use a VAE to impute the missing values!!!
Try reading this paper
I hope this helps!
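To make the Gaussian-process suggestion above concrete, here is a minimal sketch with scikit-learn (the data is synthetic, and treating the row index as the GP input is an illustrative simplification, not a recipe for your specific dataset): fit the GP on the few observed entries of a sparse column, then use the posterior mean to fill the gaps, with the posterior standard deviation quantifying uncertainty.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic column with an underlying smooth signal plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, size=x.size)

# Simulate a very sparse column: keep only 5% of the rows
observed = np.zeros(x.size, dtype=bool)
observed[::20] = True

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1)
gp.fit(x[observed].reshape(-1, 1), y[observed])

# Posterior mean fills in the missing entries; std gives uncertainty
y_filled, y_std = gp.predict(x.reshape(-1, 1), return_std=True)
```

The uncertainty estimate is the real advantage over mean/knn imputation here: you can see exactly where the filled-in values are unreliable.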
My first question to give a good answer would be:
What are you actually trying to achieve with the completed data?
People impute data for different reasons, and the use case makes a big difference. For example, you could use imputation as:
Preprocessing step for training a machine learning model
Solution to have a nice Graphic/Plot that does not have gaps
Statistical inference tool to evaluate scientific or medical studies
99% of missing data is a lot - in most cases you can expect that nothing meaningful will come out of this.
For some variables it still might make sense and produce at least something meaningful - but you have to handle this with care and think a lot about your solution.
In general, imputation does not create entries out of thin air. A pattern must be present in the existing data, which is then applied to the missing data.
You probably will have to decide on a variable basis what makes sense.
Take your variable email as an example:
Depending on how your data is structured, it might be that each row represents a different customer with a specific email address, so every row is supposed to be a unique mail address. In this case imputation won't have any benefit: how should the algorithm guess the email? But if the data is structured differently and customers appear in multiple rows, then an algorithm can still fill in some meaningful data - e.g. seeing that customer number 4 always has the same mail address, and filling it in for rows where only customer number 4 is given and the mail is missing.
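The "customers appear in multiple rows" case can be handled with a per-group fill in pandas (the column names and addresses here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": [4, 4, 4, 7, 7],
    "email":    ["a@x.com", None, "a@x.com", None, "b@y.com"],
})

# Within each customer group, fill missing emails from that
# customer's known address (forward then backward fill)
df["email"] = df.groupby("customer")["email"].transform(
    lambda s: s.ffill().bfill()
)
print(df)
```

If a customer has no known address at all, the value simply stays missing - consistent with the point that imputation cannot invent entries out of thin air.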
I'm building a credit score model and some numerical attributes like age of most recent inquiry have missing value. They are missing because these people didn't have an inquiry before so this attribute is not applicable for them.
I do want to include this attribute in the model. Here's the dilemma: if I keep this attribute numerical, those clients with missing value would be excluded but I want them to be considered in the model. If I bin this attribute, those different levels would be treated as categories and I wouldn't be able to keep the monotonicity of age.
Is there a way to keep null records for a numerical variable when doing logistic regression in Python? Missing data imputation is not applicable in this case since these data should be kept as null.
If I use bins, is it possible to keep the monotonicity between bins? For example, an age of most recent inquiry in the >12 bin should have a more severe effect on the credit score than one in the [1,2] bin.
I use One Hot Encoding when binning. How can I treat all the levels of a variable as a group and decide its importance rather than treating all the levels as independent predictors?
Hi,
Attached is the data. Can you please help me handle the missing data in the "Outlet_Size" column, so that I can use the complete data for building data science models?
Thanks,
This is one of the major challenges of data mining (or machine learning) problems. YOU decide what to do with the missing data, based on PURE EXPERIENCE. You mustn't look at data science as a black box that follows a series of steps to be successful at it!
Some guidelines about missing data.
A. If more than 40% of the data in a column is missing, drop it! (Again, the 40% depends on what type of problem you're working with - whether the data is super crucial, or so trivial that you can ignore it.)
B. Check if there is someway you can impute the missing data from the internet. You're looking at item weight! If there is anyway you could know which product you're dealing with instead of hashed coded Item_Identifier, then you can always literally Google it and figure it out.
C. Missing data can be classified into two types:
MCAR: missing completely at random. This is the desirable scenario in case of missing data.
MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample exceeds 5%, then you probably should leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function.
D. As posted in the comments, you can simply drop the rows using df.dropna() or fill them with infinity, or fill them with mean using df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
This groups the column value from dataframe df by category name, finds the mean in each category and fills the missing value in value with the corresponding mean of that category!
E. Apart from just either dropping missing values, replacing with mean or median, there are other advanced regression techniques you can use that has a way to predict missing values and fill it, E.G (mice: Multivariate Imputation by Chained Equations), you should browse and read more about where advanced imputation technique will be helpful.
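Scikit-learn ships a MICE-style implementation, IterativeImputer, which models each feature with missing values as a function of the other features and iterates. A minimal sketch on made-up data (note the `enable_iterative_imputer` import is required because the estimator is marked experimental):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])

# Each column is regressed on the others, round-robin, until convergence
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```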
The accepted answer is really nice.
In your specific case I'd say either drop the column or assign a new value called "Missing". Since it is a categorical variable, there's a good chance it ends up going into a one-hot or target encoder (or being understood by the model as a category directly). Also, the fact that the value is NaN is information in itself; it can come from multiple factors (from bad data to technical difficulties getting an answer, etc.). Be careful and watch that this doesn't introduce bias or information you shouldn't have (example: the products have NaN due to not being in a certain base, something that will never happen in a real situation, which would make your result non-representative of the true situation).
The column "Outlet_Size" contains the categorical data, so instead of dropping the data use measures to fill data.
Since it is categorical data use Measures of Central Tendency, Mode.
Use mode to find which category occurs more or frequently and fill the column with the corresponding value.
Code:
mode_value = Dataframe['Outlet_Size'].mode()[0]  # .mode() returns a Series, so take the first entry
Dataframe['Outlet_Size'].fillna(mode_value, inplace=True)
I want to do linear regression analysis. I have multiple features. Some features have unassigned (null) values for some items in the data, because those specific feature values were missing in the data source. To be clearer, I provide an example:
As you can see, some items are missing values for some features. For now, I have just assigned them 'Null', but how should I handle these values when doing linear regression analysis of the data? I do not want these unassigned values to incorrectly affect the regression model. Unfortunately I cannot get rid of the items where unassigned feature values are present. I plan to use Python for the regression.
You need to either ignore those rows -- you've already said you can't, and it's not a good idea with the quantity of missing values -- or use an algorithm that proactively discounts those items, or impute (that's the technical term for filling in an educated guess) the missing data.
There's a limited amount of help we can give, because you haven't given us the semantics you want for missing data. You can impute some of the missing values by using your favourite "closest match" algorithm against the data you do have. For instance, you may well be able to infer a good guess for area from the other data.
For your non-linear, discrete items (i.e. District), you may well want to keep NULL as a separate district. If you have few enough missing entries, you'll be able to get a decent model anyway.
A simple imputation is to replace each NULL with the mean value for the feature, but this works only for those with a proper mean (i.e. not District).
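Combining the two suggestions above in pandas might look like this (the "Area" and "District" column names and values are made up to match the style of the question's example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Area":     [50.0, np.nan, 70.0, 60.0],
    "District": ["North", None, "South", None],
})

# Numerical feature: replace NULLs with the column mean
df["Area"] = df["Area"].fillna(df["Area"].mean())

# Discrete feature: keep NULL as its own category, then one-hot encode
df["District"] = df["District"].fillna("NULL")
dummies = pd.get_dummies(df["District"])
print(df)
print(dummies)
```

The regression then sees a complete numeric matrix, and the "NULL" dummy column lets the model learn whatever effect missingness itself carries.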
Overall, I suggest that you search for appropriate references on "impute missing data". Since we're not sure of your needs, we can't help much with this, and doing so is outside the scope of SO.