Identification of redundant columns/variables in a classification case study - python

I have a dataset with 13 columns (both categorical and numerical). The 13th column is a categorical variable SalStat which classifies whether the person earns below 50k or above 50k. I am using Logistic Regression for this case and want to know which columns (numerical and categorical) are redundant, that is, don't affect SalStat, so that I can remove them. What function should I use for this purpose?

In my opinion you can study the correlation between your variables and remove the ones that have high correlation, since they give your model roughly the same information.
You can start with something like DataFrame.corr(), then draw a heatmap with seaborn for better visualization (seaborn.heatmap()), or a simpler one with plt.imshow(data.corr()) followed by plt.colorbar().
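A minimal sketch of that approach, assuming the 13 columns are already loaded into a pandas DataFrame (the file name below is just a placeholder):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("income_data.csv")  # hypothetical file name

# corr() only applies to numeric columns; encode or drop categoricals first.
corr = df.select_dtypes(include="number").corr()

# Heatmap of the pairwise correlations with seaborn.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation matrix")
plt.show()

# Or the simpler matplotlib-only version.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar()
plt.show()
```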

Related

How and when to deal with outliers in your dataset (general strategy)

I stumbled upon the following problem:
I'm working on a beginner's project in data science. I have my train and test splits, and right now I'm analyzing every feature, then adding it to either a dataframe for discretised continuous variables or a dataframe for continuous variables.
Doing so, I encountered a feature with big outliers. If I were to delete them, the other features I already added to my sub-dataframes would have more entries than this one.
Should I just find a strategy to overwrite the outliers with "better" values, or should I reconsider my strategy of splitting the train data into both types of variables in the beginning? I don't think that getting rid of the outlier rows in the real train_data would be useful, though...
There are many ways to deal with outliers.
In my data science course we used "data imputation".
But before you start to replace or remove data, it's important to analyse what difference the outlier makes and, of course, whether the outlier is valid.
If the outlier is invalid, you can delete it and use data imputation as explained below.
If your outlier is valid, check the difference in outcome with and without the outlier. If the difference is very small, there isn't a problem. If the difference is significant, you can use standardization and normalization.
You can replace the outlier with:
a random value (not recommended)
a value based on heuristic logic
a value based on its neighbours
the median, mean, or mode
a value based on interpolation (making a prediction with a certain ML model)
I recommend using the strategy with the best outcome; a minimal sketch of the median-replacement option is shown below.
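As an illustration only (the column name and the 1.5 * IQR rule are assumptions for the example, not part of the answer above):

```python
import pandas as pd

# Hypothetical example: one numeric feature with an extreme value.
df = pd.DataFrame({"income": [32, 35, 31, 40, 38, 900, 36, 33]})

# Flag outliers with the common 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# Replace flagged values with the median of the remaining values instead of
# dropping the rows, so all feature frames keep the same number of entries.
df.loc[is_outlier, "income"] = df.loc[~is_outlier, "income"].median()
print(df)
```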
StatQuest explains data science and machine learning concepts in a very easy and understandable way, so refer to him if you encounter more theoretical questions: https://www.youtube.com/user/joshstarmer

Is there a way to cluster transactions (journals) data using python if a transaction is represented by two or more rows?

In accounting, the data set representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items. E.g. transaction (Journal Number) 1 has two lines: the receipt of cash and the income. Companies could also have transactions (journals) which consist of 3 line items or even more.
Will I first need to cleanse the data so that there is only one line item for each journal? I.e. cleanse the above 8 rows into 4.
Are there any Python machine learning algorithms which will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transaction data. I do not know what the anomalies look like, so this would need to be unsupervised learning.
Use Gaussians on each dimension of the data to determine what is an anomaly. The mean and variance are estimated per dimension, and if the density of a new data point on that dimension falls below a threshold, it is considered an outlier. This gives one Gaussian per dimension. You can use some feature engineering here, rather than just fitting Gaussians on the raw data.
If the features don't look Gaussian (plot their histograms), use data transformations like log(x) or sqrt(x) until they look better.
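A sketch of the per-dimension version on hypothetical data (the log transform and the 1% quantile cut-off are assumptions for the example, not part of the answer):

```python
import numpy as np
from scipy import stats

# Hypothetical numeric feature matrix X (rows = line items, columns = features).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=0.5, size=(500, 2))  # skewed, like amounts

# If a feature looks skewed in its histogram, transform it first.
X_t = np.log(X)

# Fit one Gaussian per dimension (mean and standard deviation per column).
mu = X_t.mean(axis=0)
sigma = X_t.std(axis=0)

# Density of each point = product of the per-dimension densities
# (this assumes the dimensions are independent).
densities = stats.norm.pdf(X_t, loc=mu, scale=sigma).prod(axis=1)

# Flag the lowest-density points as anomalies; the threshold is a free choice.
threshold = np.quantile(densities, 0.01)
print("flagged rows:", np.where(densities < threshold)[0])
```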
Use anomaly detection if supervised learning is not available, or if you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously, rather than classifying whether someone is male or female).
Error analysis: what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to expose the anomaly. You could create this dimension by combining some of the others.
To fit the Gaussian more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary those parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
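A corresponding multivariate sketch, reusing the same hypothetical log-transformed data as above (the quantile threshold is again an arbitrary choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Same hypothetical log-transformed feature matrix as in the sketch above.
rng = np.random.default_rng(0)
X_t = np.log(rng.lognormal(mean=3.0, sigma=0.5, size=(500, 2)))

# One multivariate Gaussian: the covariance matrix captures correlations
# between the features and changes the shape of the fitted density.
mvn = multivariate_normal(mean=X_t.mean(axis=0), cov=np.cov(X_t, rowvar=False))
densities = mvn.pdf(X_t)

# Points with very low density under the fitted model are flagged as anomalies.
threshold = np.quantile(densities, 0.01)
print("flagged rows:", np.where(densities < threshold)[0])
```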

Prediction based on more dataframes

I'm trying to predict a score that user gives to a restaurant.
The data I have can be grouped into two dataframes
data about user (taste, personal traits, family, ...)
data about the restaurant (open hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried a basic prediction with the user dataframe (predicting one column from a few others using RandomForest) and it was pretty straightforward. These dataframes are logically different and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is: what is the best way to handle categorical data (cuisine, for example)?
I know I can create a mapping function and convert each value to an index, or I can use Categorical from pandas (and probably a few other methods). Is there any preferred way to do this?
1) The second dataset is essentially characteristics of the restaurant which might influence the first dataset. For example, opening times or location are strong factors that a customer could consider. You can use them by merging them in at the restaurant level. That could help you understand how people reflect location and timings in their score for the restaurant; note that you could even apply clustering and see that different customers have different sensitivities to these variables.
For example, frequent customers (who mostly eat out) may be more mindful of location, timing, etc. if it's part of their daily routine.
You should apply modelling techniques and run multiple simulations to get variable-importance box plots, and see whether variables like location and timings have a high variance in their importance scores when calculated on different subsets of the data; that would be indicative of different customer sensitivities.
2) You can look at label encoding or one-hot encoding, or even use the variable as it is. It would be helpful here to know how many levels there are in the data. You can look at functions like pd.get_dummies; a minimal sketch of the merge and encoding is below.
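All the column names and values here are made up for illustration only:

```python
import pandas as pd

# Hypothetical frames: the observed scores plus the two attribute tables.
ratings = pd.DataFrame({"user_id": [1, 2], "restaurant_id": [10, 11], "score": [4, 2]})
users = pd.DataFrame({"user_id": [1, 2], "family_size": [3, 1]})
restaurants = pd.DataFrame({"restaurant_id": [10, 11], "cuisine": ["italian", "thai"]})

# Merge both attribute tables onto the score rows, keyed on the ids.
data = ratings.merge(users, on="user_id").merge(restaurants, on="restaurant_id")

# One-hot encode the categorical columns (cuisine here) with get_dummies.
data = pd.get_dummies(data, columns=["cuisine"])
print(data)
```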
Hope this helps.

find anomalies in records of categorical data

I have a dataset with m observations and p categorical (nominal) variables; each variable X1, X2, ..., Xp has several different classes (possible values). Ultimately I am looking for a way to find anomalies, i.e. to identify rows for which the combination of values seems incorrect with respect to the data I have seen so far. So far I was thinking about building a model to predict the value of each column and then building some metric to evaluate how different the actual row is from the predicted row. I would greatly appreciate any help!
Take a look at the nearest-neighbours method and cluster analysis. The metric can be simple (like squared error) or even custom (with predefined weights for each category).
Nearest neighbours will answer the question 'how different is the current row from the other rows', and cluster analysis will answer the question 'is it an outlier or not'. Some visualization may also help (t-SNE).
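A rough sketch of the nearest-neighbours scoring on one-hot encoded categorical rows (the column names, the value of k, and the plain Euclidean metric are all just assumptions for the example):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical frame of p categorical columns.
df = pd.DataFrame({
    "X1": ["a", "a", "b", "a", "a", "b", "a", "a"],
    "X2": ["x", "x", "x", "y", "x", "x", "x", "y"],
})

# One-hot encode so that a distance metric can be applied.
encoded = pd.get_dummies(df)

# Average distance to the k nearest neighbours: rows that are far from
# everything else get a high score and are candidate anomalies.
nn = NearestNeighbors(n_neighbors=4).fit(encoded)
distances, _ = nn.kneighbors(encoded)
score = distances[:, 1:].mean(axis=1)  # column 0 is the distance to itself
print(score)
```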

Python Machine Learning - Imputing categorical data?

I am learning machine learning using Python and understand that I cannot run categorical data through the model and must first get dummies. Some of my categorical data has nulls (a very small fraction, in only 2 features). When I convert to dummies and then check for missing values, it always shows none. Should I impute beforehand? Or do I impute categorical data at all? For instance, if the category was male/female, I wouldn't want to replace nulls with the most_frequent value. I see how this would make sense if the feature was income and I was going to impute missing values: income is income, whereas a male is not a female.
So does it make sense to impute categorical data? Am I way off? I am sorry this is more applied theory than actual Python programming, but I was not sure where to post this type of question.
I think the answer depends on the properties of your features.
Fill in missing data with expectation maximization (EM)
Say you have two features: one is gender (has missing data) and the other is wage (no missing data). If there is a relationship between the two features, you could use the information contained in wage to fill in the missing values in gender.
To put it a little more formally: if you have a missing value in the gender column but you have a value for wage, EM gives you P(gender = Male | wage = w0, theta), i.e. the probability of the gender being male given wage = w0 and theta, a parameter obtained with maximum likelihood estimation.
In simpler terms, this could be achieved by running a regression of gender on wage (use logistic regression, since the y-variable is categorical) to give you the probability described above.
Visually: [plot of the wage distributions for males and females; the values are totally ad hoc, but they convey the idea that the wage distribution for males is generally above that for females.]
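In code, a rough approximation of that idea with logistic regression (the column names, the values, and the use of a hard predicted class rather than the probability are assumptions for the sketch):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: gender has missing values, wage is complete.
df = pd.DataFrame({
    "wage":   [25, 30, 52, 61, 48, 33, 70, 41],
    "gender": ["F", "F", "M", "M", None, "F", "M", None],
})

known = df["gender"].notna()

# Fit a logistic regression of gender on wage using the complete rows...
clf = LogisticRegression()
clf.fit(df.loc[known, ["wage"]], df.loc[known, "gender"])

# ...and fill the missing genders with the predicted class.
# clf.predict_proba would give the probability P(gender | wage) itself.
df.loc[~known, "gender"] = clf.predict(df.loc[~known, ["wage"]])
print(df)
```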
Fill in missing values #2
You can probably fill in the missing values with the most frequent observation if you believe that the data is missing at random, even if there is no relationship between the two features. I would be cautious, though; a minimal sketch follows.
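For completeness, a most-frequent imputation sketch with scikit-learn's SimpleImputer (the column name is hypothetical); this would run before get_dummies, so the rows with missing values don't silently become all-zero dummy rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with missing entries.
df = pd.DataFrame({"gender": ["F", "M", np.nan, "F", np.nan]})

# Replace the missing entries with the most frequent category.
imputer = SimpleImputer(strategy="most_frequent")
df["gender"] = imputer.fit_transform(df[["gender"]]).ravel()
print(df)
```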
Don't impute
If there is no relationship between the two features and you believe that the missing data might not be missing at random.
