I'm trying to predict a score that user gives to a restaurant.
The data I have can be grouped into two dataframes
data about user (taste, personal traits, family, ...)
data about restaurant(open hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried basic prediction with the user dataframe (predict one column with few others using RandomForest) and it was pretty straightforward. These dataframes are logically different and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is what is the best way to handle categorical data (cuisine f.e.)?
I know I can create a mapping function and convert each value to index, or I can use Categorical from pandas (and probably few other methods). Is there any prefered way to do this?
1) The second dataset is essentially characteristics of the restaurant which might influence the first dataset. Example-opening timings or location are strong factors that a customer could consider. You can use them, merging them at a restaurant level. It could help you to understand how people treat location, timings as a reflection in their score for the restaurant- note here you could even apply clustering and see different customers have different sensitivities to these variables.
For e.g. for frequent occurring customers(who mostly eats out) may be more mindful of location/ timing etc if its a part of their daily routine.
You should apply modelling techniques and do multiple simulations to get variable importance box plots and see if variables like location/ timings have a high variance in their importance scores when calculated on different subsets of data - it would be indicative of different customer sensitivities.
2) You can look at label enconding or one hot enconding or even use the variable as it is? It will helpful here to explain how many levels are there in the data. You can look at pd.get_dummies kind of functions
Hope this helps.
Related
I am a beginner in Machine Learning, my point is..how should i encode the column "OECDSTInterbkRate"? I don't know how to replace the missing values and especially with what. Should I just delete them? Or replace them with the mean / median of the values?
There are many approaches to this issue.
The simplest: if you have huge amount of data - drop NaNs.
Replace the NaNs with mean/median/etc of the whole non-NaN dataset or the dataset grouped by one or several columns. E.g. for you dataset you can fill the Australia NaNs with a mean for Australian non-NaNs. And the same for other countries.
A common approach is to create another indicator column after the imputation of NaNs which keeps the indices where the missing data was replaced with a value. This column then is taken as yet another input to your ML algorithm.
Take a look at the docs (assuming you work with Pandas) - the developers of the library have already created some tools for the missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
There's no specific answer to your question, it's a general problem in statistics which is called "imputation". Depending on the application the answer could be many things.
There are few alternatives that comes to mind first to solve your problem, but don't forget that "no data" is almost always better than "bad/wrong data". If you have more than enough rows without the rows with NaNs, you may simply drop them. Otherwise you can consider the following:
Can you mathematically calculate the column that you need by the other columns that you already have in your dataset? If so, you have your answer.
Check the correlation of the particular column by using it's non-missing valued rows with the other columns and see if they are highly correlated. If so, you might just as well try dropping the whole column(might not be always a good idea but it's generally a good idea).
Can you create an estimator(such as a regression model) to predict the missing values by learning the pattern using the values that you already have and by using the other columns with a really good accuracy? Well you might have an answer (need benchmarking with the following). Please keep in mind that this is a very risky operation that could give bad estimations and decrease the performance of your overall model. Try this only if your estimations are really good!
Is it a regression problem? Using the statistical mean could be a good idea.
Is it a classification problem? Using median could be a good idea.
In some cases using mode might also be a good idea depending on the distribution.
I suggest that you try all the things out and see which one works better because there's really not a concrete answer to your problem. You can create a machine learning model without using the column and use it's performance as a baseline, and carry out a performance(accuracy) benchmarking for all the steps compared to the baseline.
Note: I am just a graduate student with some insights, please comment out if anything I said is not correct!
I'm using a dataset to predict the effects on the economy because of covid-19. The dataset contains 9k rows and around 1k rows in each column is empty. Do I need to fill them manually by looking at other datasets online or can I fill the average or should I leave the dataset as it is?
Generally, I'd say that combining datasets from multiple sources without being really clear about your rational can raise pretty big questions about the reliability of your data.
Otherwise, either assuming averages or leaving null are both valid options depending on what you're trying to do. If you're using scikit learn (eg) you'll probably find that nulls throw up errors, so filling with assumed averages is a relatively common thing to do. Although you might want to watch out given you've got more that 10% nulls!
From experience, I'd say think about what you're trying to do, and what will help you get there best. Then be really clear about presenting your methodology with your findings.
I am facing a dilemma with a project of mine. Few of the variables don't have enough data that means almost 99% data observations are missing.
I am thinking of couple of options -
Impute missing value with mean/knn imputation
Impute missing value with 0.
I couldn't think of anything in this direction. If someone can help that would be great.
P.S. I am not feeling comfortable using mean imputation when 99% of the data is missing. Does someone have a reasoning for that? kindly let me know.
Data has 397576 Observations out of which below are the missing values
enter image description here
99% of the data is missing!!!???
Well, if your dataset has less than 100,000 examples, then you may want to remove those columns instead of imputing through any methods.
If you have a larger dataset then using mean imputing or knn imputing would be ...OK. These methods don't catch the statistics of your data and can eat up memory. Instead use Bayesian methods of Machine Learning like fitting a Gaussian Process through your data or a Variational Auto-Encoder to those sparse columns.
1.) Here are a few links to learn and use gaussian processes to samples missing values from the dataset:
What is a Random Process?
How to handle missing values with GP?
2.) You can also use a VAE to impute the missing values!!!
Try reading this paper
I hope this helps!
My first question to give a good answer would be:
What you are actually trying to archive with the completed data?
.
People impute data for different reasons and the use case makes a big difference for example you could use imputation as:
Preprocessing step for training a machine learning model
Solution to have a nice Graphic/Plot that does not have gaps
Statistical inference tool to evaluate scientific or medical studies
99% of missing data is a lot - in most cases you can expect, that nothing meaningful will come out of this.
For some variables it still might make sense and produce at least something meaningful - but you have to handle this with care and think a lot about your solution.
In general you can say, imputation does not create entries out of thin air. A pattern must be present in the existing data - which then is applied to the missing data.
You probably will have to decide on a variable basis what makes sense.
Take your variable email as an example:
Depending how your data - it might be that each row represents a different customer that has a specific email address. So that every row is supposed to be a unique mail address. In this case imputation won't have any benefits - how should the algorithm guess the email. But if the data is structured differently and customers appear in multiple rows - then an algorithm can still fill in some meaningful data. Seeing that Customer number 4 always has the same mail address and filling it for rows where only customer number 4 is given and the mail is missing.
For instance, column x has 50 values and all of these values are the same.
Is it a good idea to delete variables like these for building machine learning models? If so, how can I spot these variables in a large data set?
I guess a formula/function might be required to do so. I am thinking of using nunique that can take account of the whole dataset.
You should be deleting such columns because it will provide no extra information about how each data point is different from another. It's fine to leave the column for some machine learning models (due to the nature of how the algorithms work), like random forest, because this column will actually not be selected to split the data.
To spot those, especially for categorical or nominal variables (with fixed number of possible values), you can count the occurrence of each unique value, and if the mode is larger than a certain threshold (say 95%), then you delete that column from your model.
I personally will go through variables one by one if there aren't any so that I can fully understand each variable in the model, but the above systematic way is possible if the feature size is too large.
[]
Hi ,
Attached is the data, can you please help me to handle the missing data in the "Outlet_Size" column.
So that i can use this complete data for preparing the datascience models.
Thanks,
These are one of the major challenges of Data Mining problems (or Machine Learning). YOU decide what to do with the missing data based on PURE EXPERIENCE. You mustn't look at Data Science as a blackbox that follows a series of steps to be successful at it!
Some guidelines about missing data.
A. If more than 40% of the data is missing from a column, drop it! (Again, the 40% depends on what type of problem you're working with! If the data is super crucial or its very trivial that you can ignore it).
B. Check if there is someway you can impute the missing data from the internet. You're looking at item weight! If there is anyway you could know which product you're dealing with instead of hashed coded Item_Identifier, then you can always literally Google it and figure it out.
C. Missing data can be classified into two types:
MCAR: missing completely at random. This is the desirable scenario in case of missing data.
MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample is more than 5% then you probably should leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing using a simple function
D. As posted in the comments, you can simply drop the rows using df.dropna() or fill them with infinity, or fill them with mean using df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
This groups the column value from dataframe df by category name, finds the mean in each category and fills the missing value in value with the corresponding mean of that category!
E. Apart from just either dropping missing values, replacing with mean or median, there are other advanced regression techniques you can use that has a way to predict missing values and fill it, E.G (mice: Multivariate Imputation by Chained Equations), you should browse and read more about where advanced imputation technique will be helpful.
The accepted answer is really nice.
In your specific case I'd say either drop the column or assign a new value called Missing. Since that's a Categorical variable, there's a good chance it ends up going into a OneHot or Target Encoder (or being understandable by the model as a category directly). Also, the fact the value is NaN is an info itself, it can come from multiple factors (from bad data to technical difficulties getting an answer, etc). Be careful and watch this doesn't brings bias or some information you shouldn't know (example : the products have NaN due to not being into a certain base, thing that will never happen in a real situation, which will make your result non-representative of a true situation)
The column "Outlet_Size" contains the categorical data, so instead of dropping the data use measures to fill data.
Since it is categorical data use Measures of Central Tendency, Mode.
Use mode to find which category occurs more or frequently and fill the column with the corresponding value.
Code:
Dataframe['Outlet_Size'].mode()
Datarame['Outlet_Size'].fillna(Dataframe['Outlet_Size'].mode(), inplace=True)