I stumbled upon the following problem:
I'm working on a beginner's project in data science. I have my train and test splits, and right now I'm analyzing every feature, then adding it to either a dataframe for discretised continuous variables or a dataframe for continuous variables.
Doing so, I encountered a feature with big outliers. If I were to delete them, the other features I have already added to my sub-dataframes would have more entries than this one.
Should I just find a strategy to overwrite the outliers with "better" values, or should I reconsider my strategy of splitting the train data into the two types of variables in the first place? I don't think that getting rid of the outlier rows in the real train_data would be useful, though...
There are many ways to deal with outliers.
In my data science course we used "data imputation":
But before you start to replace or remove data, it's important to analyse what difference the outlier makes and, of course, whether the outlier is valid.
If the outlier is invalid, you can delete the outlier and use data imputation as explained below.
If your outlier is valid, check the difference in outcome with and without the outlier. If the difference is very small, there isn't a problem. If the difference is significant, you can use standardization or normalization.
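If you go the scaling route, here is a minimal sketch of what those two transforms compute, on a made-up NumPy array (the numbers are purely illustrative):

import numpy as np

x = np.array([35.0, 42.0, 39.0, 41.0, 1200.0])  # made-up feature with one large outlier

# Standardization (z-scores): centre on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std()

# Min-max normalization: squeeze all values into [0, 1]
norm = (x - x.min()) / (x.max() - x.min())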
You can replace the outlier with:
a random value (not recommended)
a value based on heuristic logic
a value based on its neighbours
the median, mean or mode
a value based on interpolation, or a prediction from an ML model
I recommend using the strategy with the best outcome.
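For instance, a minimal sketch of the median option from the list above in pandas, using the 1.5 × IQR rule to flag outliers (the column name and numbers are made up):

import pandas as pd

# Hypothetical feature with one extreme value
train = pd.DataFrame({"income": [35_000, 42_000, 39_000, 41_000, 1_200_000]})

# Flag outliers with the 1.5 * IQR rule
q1, q3 = train["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (train["income"] < q1 - 1.5 * iqr) | (train["income"] > q3 + 1.5 * iqr)

# Replace (rather than drop) the outliers with the median of the remaining values,
# so the row count stays in sync with the other feature dataframes
train.loc[is_outlier, "income"] = train.loc[~is_outlier, "income"].median()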
StatQuest explains data science and machine learning concepts in a very easy and understandable way, so refer to him if you encounter more theoretical questions: https://www.youtube.com/user/joshstarmer
I am a beginner in machine learning. My question is: how should I encode the column "OECDSTInterbkRate"? I don't know how to replace the missing values, and especially with what. Should I just delete them, or replace them with the mean / median of the values?
There are many approaches to this issue.
The simplest: if you have a huge amount of data, drop the rows with NaNs.
Replace the NaNs with the mean/median/etc. of the whole non-NaN dataset, or of the dataset grouped by one or several columns. E.g. for your dataset you can fill the Australia NaNs with the mean of the Australian non-NaNs, and do the same for the other countries.
A common approach is to also create an indicator column alongside the imputation, recording where the missing data was replaced with a value. This column is then fed as yet another input to your ML algorithm.
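A minimal sketch of both ideas in pandas, assuming a Country column exists alongside the rate column (the grouping column name and the values below are a guess about your data):

import numpy as np
import pandas as pd

# Toy frame standing in for your dataset
df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Canada", "Canada"],
    "OECDSTInterbkRate": [1.5, np.nan, 0.9, np.nan],
})

# Indicator column: remember which rows were imputed, before filling them
df["OECDSTInterbkRate_was_missing"] = df["OECDSTInterbkRate"].isna().astype(int)

# Fill each country's NaNs with that country's own mean
df["OECDSTInterbkRate"] = (
    df.groupby("Country")["OECDSTInterbkRate"]
      .transform(lambda s: s.fillna(s.mean()))
)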
Take a look at the docs (assuming you work with Pandas) - the developers of the library have already created some tools for the missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
There's no specific answer to your question, it's a general problem in statistics which is called "imputation". Depending on the application the answer could be many things.
There are a few alternatives that come to mind first, but don't forget that "no data" is almost always better than "bad/wrong data". If you have more than enough rows even without the rows containing NaNs, you may simply drop them. Otherwise you can consider the following:
Can you mathematically calculate the column that you need from the other columns that you already have in your dataset? If so, you have your answer.
Check the correlation of the particular column with the other columns, using its non-missing rows, and see if they are highly correlated. If so, you might as well try dropping the whole column (not always a good idea, but often a reasonable one).
Can you create an estimator (such as a regression model) that predicts the missing values with really good accuracy, learning the pattern from the values you already have and the other columns? If so, you might have an answer (it still needs benchmarking against the options below). Please keep in mind that this is a risky operation that could give bad estimations and decrease the performance of your overall model; try it only if your estimations are really good. A minimal sketch appears after this list.
Is it a regression problem? Using the statistical mean could be a good idea.
Is it a classification problem? Using median could be a good idea.
In some cases using mode might also be a good idea depending on the distribution.
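Here is the sketch mentioned in the estimator point above: a model is fit on the rows where the column is present and used to predict it where it is missing. All column names and numbers below are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame: "rate" has gaps, "gdp" and "inflation" are made-up predictor columns
df = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "inflation": [2.1, 1.9, 2.5, 3.0, 2.8],
    "rate": [0.5, 0.7, np.nan, 1.1, np.nan],
})

known = df[df["rate"].notna()]
missing = df[df["rate"].isna()]

# Fit on the rows where the target column is present...
model = RandomForestRegressor(random_state=0)
model.fit(known[["gdp", "inflation"]], known["rate"])

# ...and predict it for the rows where it is missing
df.loc[df["rate"].isna(), "rate"] = model.predict(missing[["gdp", "inflation"]])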
I suggest you try all of these and see which one works better, because there's really no concrete answer to your problem. You can create a machine learning model without using the column and use its performance as a baseline, then benchmark the performance (accuracy) of each option against that baseline.
Note: I am just a graduate student with some insights, so please comment if anything I said is not correct!
I have a database with 13 columns (both categorical and numerical). The 13th column is a categorical variable SalStat which classifies whether the person is below 50k or above 50k. I am using logistic regression for this case and want to know which columns (numerical and categorical) are redundant, that is, don't affect SalStat, so that I can remove them. What function should I use for this purpose?
In my opinion, you can study the correlation between your variables and remove the ones that are highly correlated with others, since they give your model roughly the same information.
You can start with something like DataFrame.corr(), then draw a heatmap with seaborn.heatmap() for better visualization, or a simpler one with plt.imshow(data.corr()); plt.colorbar().
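A minimal sketch of that approach (the toy data below just stands in for your dataframe):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy numeric frame standing in for your data
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 4)),
                    columns=["age", "hours_per_week", "capital_gain", "education_num"])

# Pairwise correlations of the numeric columns
corr = data.corr()

# Heatmap for spotting highly correlated pairs at a glance
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()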
I'm trying to predict a score that user gives to a restaurant.
The data I have can be grouped into two dataframes
data about user (taste, personal traits, family, ...)
data about restaurants (open hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried basic prediction with the user dataframe (predicting one column from a few others using RandomForest) and it was pretty straightforward. These dataframes are logically different and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is: what is the best way to handle categorical data (e.g. cuisine)?
I know I can create a mapping function and convert each value to an index, or I can use Categorical from pandas (and probably a few other methods). Is there any preferred way to do this?
1) The second dataset is essentially characteristics of the restaurant which might influence the first dataset. For example, opening times or location are strong factors that a customer could consider. You can use them by merging the two at the restaurant level (see the sketch below). It could help you understand how people's treatment of location and timings is reflected in their score for the restaurant. Note that you could even apply clustering and see that different customers have different sensitivities to these variables.
For example, frequent customers (who mostly eat out) may be more mindful of location/timing etc. if it's part of their daily routine.
You should apply modelling techniques and run multiple simulations to get variable-importance box plots, and see whether variables like location/timings have a high variance in their importance scores when calculated on different subsets of data; that would be indicative of different customer sensitivities.
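A minimal sketch of the merge, with made-up column names, so that each rating row carries both user and restaurant features:

import pandas as pd

# Toy frames; column names are illustrative
users = pd.DataFrame({"user_id": [1, 2], "likes_spicy": [True, False]})
restaurants = pd.DataFrame({"restaurant_id": [10, 11], "cuisine": ["thai", "italian"]})
ratings = pd.DataFrame({"user_id": [1, 2, 1],
                        "restaurant_id": [10, 11, 11],
                        "score": [5, 3, 4]})

# One row per rating, with both sets of features attached
train = ratings.merge(users, on="user_id").merge(restaurants, on="restaurant_id")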
2) You can look at label encoding or one-hot encoding, or even use the variable as it is. It would be helpful here to know how many levels the variable has in the data. You can look at functions like pd.get_dummies.
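For instance, a minimal encoding sketch with a made-up cuisine column:

import pandas as pd

# Toy restaurant frame; the cuisine levels are made up
restaurants = pd.DataFrame({"cuisine": ["italian", "thai", "mexican", "thai"]})

# One-hot encoding: one 0/1 column per level
one_hot = pd.get_dummies(restaurants["cuisine"], prefix="cuisine")

# Label encoding: a single integer code per level (the ordering is arbitrary)
restaurants["cuisine_code"] = restaurants["cuisine"].astype("category").cat.codes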
Hope this helps.
I have a dataset with m observations and p categorical (nominal) variables; each variable X1, X2, ..., Xp has several different classes (possible values). Ultimately I am looking for a way to find anomalies, i.e. to identify rows for which the combination of values seems incorrect with respect to the data I have seen so far. So far I have been thinking about building a model to predict the value of each column and then building some metric to evaluate how different the actual row is from the predicted row. I would greatly appreciate any help!
Take a look at the nearest neighbors method and cluster analysis. The metric can be simple (like squared error) or even custom (with predefined weights for each category).
Nearest neighbors will answer the question 'how different is the current row from the other rows', and cluster analysis will answer the question 'is it an outlier or not'. Some visualization may also help (t-SNE).
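A minimal sketch of the nearest-neighbour idea with scikit-learn, on made-up categorical data (one-hot encoding and Euclidean distance stand in for whatever metric you settle on):

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy categorical data; column names and levels are made up
df = pd.DataFrame({
    "X1": ["a", "a", "b", "a", "a"],
    "X2": ["x", "x", "x", "x", "y"],
})

# One-hot encode so a plain Euclidean distance can be applied
X = pd.get_dummies(df).astype(float).to_numpy()

# Average distance to the k nearest neighbours as a simple anomaly score:
# rows with rare value combinations end up far from everything else
k = 2
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each row is its own nearest neighbour
dist, _ = nn.kneighbors(X)
anomaly_score = dist[:, 1:].mean(axis=1)
print(anomaly_score)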
Hi,
Attached is the data. Can you please help me handle the missing data in the "Outlet_Size" column, so that I can use the complete data for preparing data science models?
Thanks,
This is one of the major challenges of data mining (or machine learning) problems. YOU decide what to do with the missing data, based on PURE EXPERIENCE. You mustn't look at data science as a black box where following a series of steps guarantees success!
Some guidelines about missing data.
A. If more than 40% of the data is missing from a column, drop it! (Again, the 40% depends on the type of problem you're working with and on whether the data is super crucial or trivial enough to ignore.)
B. Check whether there is some way you can impute the missing data from the internet. You're looking at item weight! If there is any way you could know which product you're dealing with instead of the hash-coded Item_Identifier, then you can always literally Google it and figure it out.
C. Missing data can be classified into two types:
MCAR: missing completely at random. This is the desirable scenario in case of missing data.
MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
Assuming the data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If the missing data for a certain feature or sample exceeds 5%, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function.
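That "simple function" is not shown here; a minimal version, assuming a pandas DataFrame, might look like this:

import pandas as pd

def missing_report(df: pd.DataFrame, threshold: float = 0.05):
    """Return the columns and rows whose fraction of missing values exceeds the threshold."""
    col_frac = df.isna().mean()          # fraction missing per feature (column)
    row_frac = df.isna().mean(axis=1)    # fraction missing per sample (row)
    return col_frac[col_frac > threshold], row_frac[row_frac > threshold]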
D. As posted in the comments, you can simply drop the rows using df.dropna(), fill them with a sentinel value such as infinity, or fill them with the group mean using df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
This groups the column value of dataframe df by the category column name, finds the mean within each category, and fills each missing entry in value with the corresponding mean of its category!
E. Apart from dropping missing values or replacing them with the mean or median, there are more advanced regression-based techniques that predict the missing values and fill them in, e.g. MICE (Multivariate Imputation by Chained Equations). You should browse and read more about where such advanced imputation techniques are helpful.
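If you want to try this in code, scikit-learn's IterativeImputer is inspired by MICE; here is a minimal sketch on a made-up numeric matrix (the classic R mice package is another option):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the class)
from sklearn.impute import IterativeImputer

# Toy numeric matrix with missing entries
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Each feature with missing values is modelled from the other features, round-robin
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)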
The accepted answer is really nice.
In your specific case I'd say either drop the column or assign a new value called Missing. Since that's a categorical variable, there's a good chance it ends up going into a one-hot or target encoder (or being understood by the model as a category directly). Also, the fact that the value is NaN is information in itself; it can come from multiple factors (from bad data to technical difficulties getting an answer, etc.). Be careful and watch that this doesn't bring in bias or information you shouldn't have (for example: the products have NaN because they are not in a certain database, something that would never happen in a real situation, which would make your result non-representative of the true situation).
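A minimal sketch of the "Missing" category idea on a toy Outlet_Size column:

import numpy as np
import pandas as pd

# Toy column with missing entries
df = pd.DataFrame({"Outlet_Size": ["Medium", np.nan, "Small", np.nan]})

# Treat missingness as its own category instead of guessing a value
df["Outlet_Size"] = df["Outlet_Size"].fillna("Missing")

# One-hot encode, so "Missing" becomes its own indicator column
encoded = pd.get_dummies(df["Outlet_Size"], prefix="Outlet_Size")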
The column "Outlet_Size" contains the categorical data, so instead of dropping the data use measures to fill data.
Since it is categorical data use Measures of Central Tendency, Mode.
Use mode to find which category occurs more or frequently and fill the column with the corresponding value.
Code:
# The most frequent category (.mode() returns a Series, hence the [0])
Dataframe['Outlet_Size'].mode()[0]
# Fill the missing entries with it
Dataframe['Outlet_Size'] = Dataframe['Outlet_Size'].fillna(Dataframe['Outlet_Size'].mode()[0])