How do I remove skewness from a distribution? - python

I am working with the well-known Credit Card Fraud Detection dataset, which includes 28 PCA-transformed columns. I'm dealing with the most skewed feature of all, which, after running the following snippet of code, turns out to be V28:
abs_skew_values = pca.skew().abs().sort_values(ascending=False)
selected_feature = abs_skew_values.index[0] # index[0]: most skewed feature
selected_feature # 'V28'
pca is the Pandas DataFrame containing the entire dataset with the PCA columns (V1, V2, V3, etc.).
Now, I wanted to test two things:
How much does the original distribution resemble a normal distribution?
How much skewness (left or right) is there in the original distribution?
The first thing I did was plot the histogram of the feature V28.
There are a lot of data points far from 0; these right-skew the distribution, giving a skew score of 11.192. There are also plenty of outliers beyond the boxplot fences.
I fixed this by applying a log transformation, sign(x) * log(|x|), rather than plain log(x), because there are negative values in the distribution.
It significantly reduced the skew score to 0.184, and you can see fewer outliers in the distribution.
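For reference, here is a minimal sketch of that signed-log transform (assuming pca is the DataFrame from above; np.log1p is used here so the transform stays finite at exactly 0 — drop the "1p" if your column has no zeros and you want the exact sign(x) * log(|x|) form):
import numpy as np
import pandas as pd

def signed_log(x: pd.Series) -> pd.Series:
    # sign(x) * log(1 + |x|): keeps the sign, compresses the tails
    return np.sign(x) * np.log1p(np.abs(x))

transformed = signed_log(pca['V28'])
print('skew before:', pca['V28'].skew())
print('skew after: ', transformed.skew())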
Running some normality tests also gives insight into how this is clearly not coming from a normal distribution.
Anderson-Darling test
---------------------
15.000: 0.576, data does not look normal (reject H0)
10.000: 0.656, data does not look normal (reject H0)
5.000: 0.787, data does not look normal (reject H0)
2.500: 0.918, data does not look normal (reject H0)
1.000: 1.092, data does not look normal (reject H0)
D'Agostino K^2 test
-------------------
statistic=96189.836, pvalue=0.000
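Output of this shape can be reproduced with scipy, roughly as in the following sketch (transformed is assumed to be the transformed V28 column from above):
from scipy import stats

# Anderson-Darling: reject H0 (normality) when the statistic exceeds the critical value
result = stats.anderson(transformed, dist='norm')
for sl, cv in zip(result.significance_level, result.critical_values):
    verdict = 'looks normal (fail to reject H0)' if result.statistic < cv else 'does not look normal (reject H0)'
    print('{:.3f}: {:.3f}, data {}'.format(sl, cv, verdict))

# D'Agostino-Pearson K^2 test
stat, p = stats.normaltest(transformed)
print('statistic={:.3f}, pvalue={:.3f}'.format(stat, p))
Note that a p-value displayed as 0.000 here is usually just a value far smaller than the three decimal places shown (or so small it underflows to 0.0), not literally zero.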
It turns out that, after the log transformation, only 26 outliers remain, and they may (or may not) also be outliers in other features, so I don't think I can outright remove them from the original dataset.
So, my question is, am I right in assuming that the transformation I applied is enough to correct the skewness that originally came from the given distribution?
Bonus points: why is the p-value in D'Agostino's test exactly 0? Shouldn't it be a small number?

Related

Logistic Regression vs predicting probability by splitting data into bins

So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one predictor: distance from the goal. When doing some data exploration I decided to investigate the relationship between distance and the result of a shot. I did this graphically by splitting the data into equal-size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin vs the probability of scoring. I did this in Python:
# use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(11, 5))

# first we want to create bins to calculate our probability;
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)

# now we want to find the mean of the Goal column (our empirical probability)
# and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins', as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins', as_index=False)['Distance'].mean()['Distance']

dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")
[Plot: Probability of Scoring Based on Distance]
So my question is: why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that predicts a probability for a shot at distance x?
I guess the problem would be that we are reducing, say, 40,000 data points into 50, but I'm not entirely sure why this would be a problem for predicting future shots. Could we increase the number of bins, or would that just add variability? Is this a case of the bias-variance trade-off? I'm just a little confused about why this would not be as good as a logistic model.
The binning method is a bit more finicky than the logistic regression since you need to try different types of plots to fit the curve (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities from each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is a more useful predictor than "Angle". Logistic regression handles that for you, because it can adjust the weights.
Regarding your binning method: with fewer bins you might see more bias, since the data may be more complicated than you think, though that seems unlikely here because your data looks quite simple at first glance. Using more bins would not significantly increase variance, as long as you fit the curve without varying its order; if you also change the order of the curve you fit, then yes, variance will increase. Still, your data seems amenable to a very simple fit if you go with this method.
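If it helps to see the two side by side, here is a rough sketch (not from the original post) that fits sklearn's LogisticRegression on the raw shots and overlays its curve on the binned means, reusing the df, axes, dist_mean and dist_prob objects from the question's snippet:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df[['Distance']].to_numpy()
y = df['Goal'].to_numpy()

model = LogisticRegression()
model.fit(X, y)  # fits P(Goal = 1 | Distance) as a sigmoid in Distance

grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
p_goal = model.predict_proba(grid)[:, 1]

axes[1].scatter(dist_mean, dist_prob, s=10, label='binned means')
axes[1].plot(grid.ravel(), p_goal, color='red', label='logistic fit')
axes[1].set(xlabel='Distance', ylabel='P(Goal)', title='Binned means vs. logistic fit')
axes[1].legend()
The logistic curve is fitted on all the raw shots rather than on 50 bin summaries, which is essentially the bias-variance point raised above.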

Correct way of normalizing and scaling the MNIST dataset

I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
from torchvision.datasets import MNIST

trainset = MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value: {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value: {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value: 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean Pixel Value: 0.13062754273414612
Scaled Pixel Values Std: 0.30810779333114624
However, clearly this does neither of the above! The resulting data (1) will not be in [0, 1] and (2) will not have mean 0 or std 1. In fact, this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.
import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64)  # batch size is arbitrary here
Usually, transforms.ToTensor() turns input data in the range [0, 255] into a 3-dimensional (C, H, W) tensor and automatically scales it to the range [0, 1]. (This takes care of the scaling to [0, 1].)
Therefore, it makes sense that the mean and std passed to transforms.Normalize(...) are 0.1307 and 0.3081, the statistics of the already-scaled data. (This takes care of the zero-mean, unit-standard-deviation normalization.)
Please refer to the link below for a better explanation.
https://pytorch.org/vision/stable/transforms.html
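As a quick sanity check (a sketch, reusing the './data' download from the question), you can verify that this ToTensor + Normalize pipeline really does give roughly zero mean and unit standard deviation over the training images:
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.ToTensor(),                       # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))   # then (x - 0.1307) / 0.3081
])
trainset = datasets.MNIST(root='./data', train=True, download=True, transform=tfm)

all_pixels = torch.stack([img for img, _ in trainset])
print(all_pixels.mean().item(), all_pixels.std().item())  # ~0.0 and ~1.0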
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
Does that clarify it for you, or have I entirely missed your point of confusion?
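A tiny numpy illustration of that point (my own sketch, not part of the original answer): apply the two rescalings to the same values and you get one property or the other, never both.
import numpy as np

x = np.array([0., 50., 100., 200., 255.])

minmax = (x - x.min()) / (x.max() - x.min())     # values in [0, 1] ...
print(minmax, minmax.mean())                     # ... but the mean is not 0

standard = (x - x.mean()) / x.std()              # mean 0, std 1 ...
print(standard, standard.min(), standard.max())  # ... but values fall outside [0, 1]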
Purpose
Two of the most important reasons for feature scaling are:
You scale features to make them all of the same magnitude (i.e. comparable importance or weight).
Example:
A dataset has two features: Age and Weight. The ages are in years and the weights in grams! A person in their twenties who weighs only 60 kg translates to the vector [20, 60000], and so on for the whole dataset. The Weight attribute will then dominate during the training process. How much it dominates depends on the type of algorithm you are using; some are more sensitive than others. For example, in a neural network the learning rate for gradient descent is affected by the magnitude of the network's weights (thetas), and those in turn vary with the magnitude of the inputs (i.e. the features) during training; feature scaling also improves convergence. Another example is the k-means clustering algorithm, which requires features of comparable magnitude since it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all the matrix multiplications and parameter summations are faster with small numbers than with very large ones (or with the very large numbers produced by multiplying features by other parameters, etc.).
Types
The most popular types of Feature Scalers can be summarized as follows:
StandardScaler: usually your first option; it's very commonly used. It works by standardizing the data (i.e. centering it), bringing it to std = 1 and mean = 0. It is affected by outliers, and should only be used if your data have a roughly Gaussian distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers, simply because it uses the range.
RobustScaler: it's "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
MaxAbsScaler: mainly used for sparse data.
Unit Normalization: it scales the vector of each sample to have unit norm, independently of the distribution of the samples.
Which One & How Many
You need to get to know your dataset first. As mentioned above, there are things to look at beforehand, such as the distribution of the data, the existence of outliers, and the algorithm being used.
In any case, you need one scaler per dataset, unless there is a specific requirement, for instance an algorithm that only works if the data are within a certain range and have zero mean and unit standard deviation, all at once. Nevertheless, I have never come across such a case.
Key Takeaways
There are different types of feature scalers, chosen based on the rules of thumb mentioned above.
You pick a scaler based on the requirements, not randomly.
You scale data for a purpose; for example, with the Random Forest algorithm you usually do NOT need to scale.
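To make the Age/Weight example above concrete, here is a small sketch (with made-up numbers) showing how a few of the scalers listed above are applied with scikit-learn; after scaling, both columns end up on a comparable scale:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Age in years, Weight in grams: Weight dominates purely by magnitude
X = np.array([[20, 60000],
              [35, 82000],
              [50, 74000],
              [28, 91000]], dtype=float)

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))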
Well, the data gets scaled to [0, 1] using torchvision.transforms.ToTensor(), and then the normalization (0.1307, 0.3081) is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.

Detecting and Replacing Outliers

In my mind, there are multiple ways to treat dataset outliers:
- Delete the data
- Transform using log or binning
- Replace with the mean/median
- Treat them separately
I have a dataset of around 50,000 observations, and each observation has quite a few outlier values (some variables have a small number of outliers, some have 100-200), so excluding data is not what I'm looking for, as it causes me to lose a huge chunk of data.
I read somewhere that using the mean and median is for artificial outliers, but in my case I think the outliers are natural.
I was actually about to use the median to get rid of the outliers and then use the mean to fill in missing values. It doesn't seem quite right, but I used it nevertheless with this code:
import numpy as np

median = X.median()
std = X.std()
outliers = (X - median).abs() > std   # flag values more than one std away from the median
X[outliers] = np.nan                  # boolean indexing; X.outliers = np.nan would not do this
X.fillna(median, inplace=True)
It did lower the overfitting of just one model (logistic regression), but Random Forest still gives 100%, and the shape of the distribution changed noticeably (before/after plots not shown here).
So I'm really confused about which technique to use. I also tried capping the data at the 5th and 95th percentiles, but that didn't work either. Should I bin the data in each column into 1-10? Also, should I normalize or standardize my data before applying any model? Any guidance will be appreciated.
Check robust statistics.
I would suggest looking at Huber's method / winsorization, which are also available in Python.
For hypothesis testing you have the Wilcoxon signed-rank test and, I think, the Mann-Whitney U test.
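For what it's worth, a minimal sketch of winsorization with scipy (assuming X is the DataFrame from the question and its columns are numeric; the 5% limits are arbitrary here): the lowest and highest 5% of each column are clipped to the 5th/95th percentile instead of being dropped.
import numpy as np
from scipy.stats.mstats import winsorize

X_wins = X.copy()
for col in X_wins.columns:
    # clip the bottom and top 5% of this column instead of deleting rows
    X_wins[col] = np.asarray(winsorize(X_wins[col].to_numpy(), limits=(0.05, 0.05)))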

Normalizations in sklearn and their differences

I have read many articles suggesting this formula for normalization:
N = (x - min(x)) / (max(x) - min(x))
but when I dug into sklearn's Normalizer I found that it uses this formula instead:
x / np.linalg.norm(x)
where the latter uses the L2 norm by default. Which one should I use, and why is there a difference between them?
There are different normalization techniques and sklearn provides many of them. Please note that we are looking at 1-D arrays here; for a matrix these operations are applied to each column (have a look at this post for an in-depth example: Scaling features for machine learning). Let's go through some of them:
Scikit-learn's MinMaxScaler performs (x - min(x))/(max(x) - min(x)). This scales your array in such a way that you only have values between 0 and 1. It can be useful if you want to apply some transformation afterwards where no negative values are allowed (e.g. a log transform, or scaling pixel values as done in some MNIST examples).
Scikit-learn's StandardScaler performs (x - x.mean())/x.std(), which centers the array around zero and scales by the standard deviation of the feature. This is a standard transformation and is applicable in many situations, but keep in mind that you will get negative values. It is especially useful when you have Gaussian-sampled data that is not centered around 0 and/or does not have unit variance.
Scikit-learn's Normalizer performs x / np.linalg.norm(x). This sets the length of your array/vector to 1. It might come in handy if you want to do some linear algebra, for instance to implement the Gram-Schmidt algorithm.
Scikit-learn's RobustScaler can be used to scale data with outliers. The mean and standard deviation are not robust to outliers, so this scaler uses the median and scales the data by a quantile range instead.
There are other, non-linear transformations, like QuantileTransformer, which scales based on quantile information, and PowerTransformer, which maps any distribution to a distribution similar to a Gaussian.
There are many other normalizations used in machine learning, and their sheer number can be confusing. The idea behind normalizing data in ML is usually that you don't want your model to treat one feature differently than others simply because it has a higher mean or a larger variance. For most standard cases I use MinMaxScaler or StandardScaler, depending on whether scaling according to the variance seems important to me.
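A small sketch contrasting the two approaches from the question (note that scikit-learn's MinMaxScaler works per column, while Normalizer works per row/sample):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1., 2.],
              [0., -1.]])

# MinMaxScaler: each column is mapped to [0, 1] using its own min and max
print(MinMaxScaler().fit_transform(X))

# Normalizer: each row is divided by its own L2 norm, so every row has length 1
print(Normalizer(norm='l2').fit_transform(X))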
np.linalg.norm (with its default settings, applied to a 2-D array) computes the Frobenius norm:
np.linalg.norm(x) = sqrt(sum_ij |x_ij|^2)
so let's assume you have:
X = (1  2
     0 -1)
then with this you would have:
np.linalg.norm(X) = sqrt(1 + 4 + 0 + 1) = sqrt(6) ≈ 2.449
X / np.linalg.norm(X) ≈ (0.41  0.82
                         0.00 -0.41)
with the other approach you would have:
min(x)= -1
max(x)= 2
max(x)-min(x)=3
X = (0.66 1
0.33 0)
So the (x - min(x)) / (max(x) - min(x)) approach is what MinMaxScaler implements (scikit-learn applies it per column rather than to the whole matrix as in the example above); there, all values always end up between 0 and 1. The norm-based approach also rescales your values, but you can still get negative values. Depending on your next steps, you need to decide which one to use.
Based on the API description
Scikit-learn normalizer scales input vectors individually to a unit norm (vector length).
That is why it uses the L2 norm by default (you can also use the L1 norm, as explained in the API).
I think, from your description, that you are looking for a scaler rather than a normalizer. Please find the Min-Max scaler in this link.
Also, you can consider a standard scaler, which standardizes a value by removing the mean and scaling to unit standard deviation.

What's a good metric to analyze the quality of the output of a clustering algorithm?

I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
That is, I have the expected labels for the data points that are clustered by kmeans. Now, once I get the clusters that have been generated, how do I evaluate their quality with respect to the expected labels?
I was doing this very thing at the time with Spark's KMeans. I am using:
- The sum of squared distances of points to their nearest center (implemented in computeCost()).
- The unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the smaller, the better).
K-means attempts to minimise the sum of squared distances to cluster centers. I would compare this sum for the clusters k-means produces with the same sum computed on the clusters you get by grouping points according to their expected labels.
There are two possibilities. If the k-means sum of squares is larger than that of the expected-label clustering, then your k-means implementation is buggy or did not start from a good set of initial cluster assignments, and you could think about increasing the number of random starts you use, or about debugging it. If the k-means sum of squares is smaller than the expected-label sum of squares, yet the k-means clusters are not very similar to the expected-label clustering (that is, two points chosen at random are frequently in the same expected-label cluster but not in the same k-means cluster, or vice versa), then the sum of squares from cluster centers is not a good way of splitting your points into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
In your case, since you do have the samples' true labels, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive from it all the relevant measures: true positives, false negatives, false positives and true negatives. From those you can compute precision, recall, miss rate, etc.
Make sure you understand the meaning of all of the above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
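For example, external validation against the true labels might look like this sketch (true_labels and pred_labels are assumed to be your expected labels and the k-means assignments):
from sklearn import metrics

print(metrics.confusion_matrix(true_labels, pred_labels))
# ARI and AMI are invariant to how the cluster labels are numbered,
# which the raw confusion matrix is not
print(metrics.adjusted_rand_score(true_labels, pred_labels))         # 1.0 means a perfect match
print(metrics.adjusted_mutual_info_score(true_labels, pred_labels))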
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: An extensive comparative study of cluster validity indices, by Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona.
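A few of the internal measures listed above are also available directly in sklearn; a sketch, assuming X is your feature matrix and pred_labels the cluster assignments:
from sklearn import metrics

print(metrics.silhouette_score(X, pred_labels))         # in [-1, 1], higher is better
print(metrics.calinski_harabasz_score(X, pred_labels))  # higher is better
print(metrics.davies_bouldin_score(X, pred_labels))     # DB index, lower is better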
