find anomalies in records of categorical data - python

I have a dataset with m observations and p categorical variables (nominal), each variable X1,X2...Xp has several different classes (possible values). Ultimately I am looking for a way to find anomalies i.e to identify rows for which the combination of values seems incorrect with respect to the data I saw so far. So far I was thinking about building a model to predict the value for each column and then build some metric to evaluate how different the actual row is from the predicted row. I would greatly appreciate any help!

Take a look on nearest neighborhoods method and cluster analysis. Metric can be simple (like squared error) or even custom (with predefined weights for each category).
Nearest neighborhoods will answer the question 'how different is the current row from the other row' and cluster analysis will answer the question 'is it outlier or not'. Also some visualization may help (T-SNE).

Related

How and when to deal with outliers in your dataset (general strategy)

I stumbled about the following problem:
I'm working on a beginners project in data science. I got my test and train data splits and right now I'm analyzing every feature, then adding it to either a dataframe for discretised continuous variables or a dataframe for continuous variables.
Doing so I encountered a feature with big outliers. If I would to delete them, other features I already added to my sub dataframes would have more column entries than this one.
Should I just find a strategy to overwrite the outliers with "better" values or should I reconsider my strategy to split the train data for both types of variables in the beginning? I don't think that
getting rid of the outlier rows in the real train_data would be useful though...
There are many ways to deal with outliers.
In my cours for datascience we used "data imputation":
But before you start to replace or remove data, its important to analyse what difference the outlier makes and if the outlier is valid ofcours.
If the outlier is invalid, you can delete the outlier and use data imputation as explained below.
If your outlier is valid, check the differnce in outcome with and without the outlier. If the difference is very small then there ain't a problem. If the differnce is significant you can use standardization and normalization.
You can replace the outlier with:
a random value (not recommended)
a value based on hueristic logic
a value based on its neighbours
the median, mean or modus.
a value based on interpolation (making a prediction with a certain ml model)
I recommend using the strategy with the best outcome.
Statquest explains datascience and machinelearning concepts in a very easy and understandable way, so refer to him if you encounter more theoritical questions: https://www.youtube.com/user/joshstarmer

Identification of redundant columns/variables in a classification case study

I have a Database with 13 columns(both categorical and numerical). The 13th column is a categorical variable SalStat which classifies weather the person is below 50k or above 50k. I am using Logical Regression for this case and want to know which columns (numerical and categorical) are redundant that is, dont affect SalStat, so that I can remove them. What function should I use for this purpose?
In my opinion you can study the correlation between your variables and remove the ones that have high correlation since they in a way give the same amount of information to your model
you can start with something like DataFrame.corr() then draw a heatmap using seaborn for better visualization seaborn.heatmap() or a more simple one with plt.imshow(data.corr()) plt.colorbar();

KMeans: Extracting the parameters/rules that fill up the clusters

I have created a 4-cluster k-means customer segmentation in scikit learn (Python). The idea is that every month, the business gets an overview of the shifts in size of our customers in each cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may slightly shift, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the paramaters that decides which case goes to their respective cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.

Prediction based on more dataframes

I'm trying to predict a score that user gives to a restaurant.
The data I have can be grouped into two dataframes
data about user (taste, personal traits, family, ...)
data about restaurant(open hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried basic prediction with the user dataframe (predict one column with few others using RandomForest) and it was pretty straightforward. These dataframes are logically different and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is what is the best way to handle categorical data (cuisine f.e.)?
I know I can create a mapping function and convert each value to index, or I can use Categorical from pandas (and probably few other methods). Is there any prefered way to do this?
1) The second dataset is essentially characteristics of the restaurant which might influence the first dataset. Example-opening timings or location are strong factors that a customer could consider. You can use them, merging them at a restaurant level. It could help you to understand how people treat location, timings as a reflection in their score for the restaurant- note here you could even apply clustering and see different customers have different sensitivities to these variables.
For e.g. for frequent occurring customers(who mostly eats out) may be more mindful of location/ timing etc if its a part of their daily routine.
You should apply modelling techniques and do multiple simulations to get variable importance box plots and see if variables like location/ timings have a high variance in their importance scores when calculated on different subsets of data - it would be indicative of different customer sensitivities.
2) You can look at label enconding or one hot enconding or even use the variable as it is? It will helpful here to explain how many levels are there in the data. You can look at pd.get_dummies kind of functions
Hope this helps.

Python Pandas Regression

[enter image description here][1]I am struggling to figure out if regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:
I have a Pandas Dataframe that is 195 rows x 25 columns
All data (except for index and headers) are integers
I have one specific column (Column B) that I would like compared to all other columns
Attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
An example of the results I would like to calculate in Python is something similar to: Column B is above 3.5 when data in Column D is between 10.20 - 16.4
The examples I've been reading online with Regression in Python appear to produce charts and statistics that I don't need (or maybe I am interpreting incorrectly). I believe the proper wording to describe what I am asking, is to identify specific values or a range of values that are linear between two columns in a Pandas dataframe.
Can anyone help point me in the right direction?
Thank you all in advance!
Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and any other column using pandas.Series.corr (which really is the same as bivariate regression), which you could list:
other_cols = [col for col in df1.columns if col !='B']
corr_B = [{other: df.loc[:, 'B'].corr(df.loc[:, other])} for other in other_col]
To get a handle on specific ranges, I would recommend looking at:
the cut and qcut functionality to bin your data as you like and either plot or correlate subsets accordingly: see docs here and here.
To visualize bivariate and simple multivariate relationships, I would recommend
the seaborn package because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See for instance the examples for univariate and bivariate distributions here, linear relationship plots here, and categorical data plots here.
The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could return to the scikit-learn or statsmodels packages best suited for this in python IMHO. Hope this helps to get you started.

Categories

Resources