Difference between StandardScaler and MinMaxScaler - python

What is the difference between MinMaxScaler() and StandardScaler()?
mms = MinMaxScaler(feature_range = (0, 1)) (used in one machine learning model)
sc = StandardScaler() (in another machine learning model they used StandardScaler and not MinMaxScaler)

MinMaxScaler(feature_range=(0, 1)) will transform each value in the column proportionally into the range [0, 1]. Use this as the first scaler choice to transform a feature, as it preserves the shape of the dataset (no distortion).
StandardScaler() will transform each value in the column so that the column has mean 0 and standard deviation 1, i.e. each value is normalised by subtracting the column mean and dividing by the column standard deviation. Use StandardScaler if you know the data distribution is normal.
If there are outliers, use RobustScaler(). Alternatively, you could remove the outliers and use either of the above two scalers (the choice depends on whether the data is normally distributed).
Additional note: if the scaler is fitted before train_test_split, data leakage will happen. Fit the scaler after train_test_split, on the training set only, and then transform both splits.
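A minimal sketch of that last point, on placeholder data (make_classification is only there to have something to split):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics, so no leakage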

From the scikit-learn site:
StandardScaler removes the mean and scales the data to unit variance.
However, the outliers have an influence when computing the empirical
mean and standard deviation which shrink the range of the feature
values as shown in the left figure below. Note in particular that
because the outliers on each feature have different magnitudes, the
spread of the transformed data on each feature is very different: most
of the data lie in the [-2, 4] range for the transformed median income
feature while the same data is squeezed in the smaller [-0.2, 0.2]
range for the transformed number of households.
StandardScaler therefore cannot guarantee balanced feature scales in
the presence of outliers.
MinMaxScaler rescales the data set such that all feature values are in
the range [0, 1] as shown in the right panel below. However, this
scaling compresses all inliers into the narrow range [0, 0.005] for the
transformed number of households.

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.
Scaling the data helps to bring it within a particular range.
Using MinMaxScaler is also known as normalization; it transforms all values into the range [0, 1]. The
formula is x_scaled = (x - min) / (max - min)
StandardScaler performs standardization; for roughly normal data most of its output falls between about -3 and +3. The
formula is z = (x - mean) / std
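A minimal sketch (on a made-up single-column feature, for illustration) showing both formulas applied by hand and matching the sklearn scalers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])  # one feature, as a column

manual_minmax = (x - x.min()) / (x.max() - x.min())
manual_z = (x - x.mean()) / x.std()                  # sklearn uses the population std (ddof=0)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))  # True
print(np.allclose(manual_z, StandardScaler().fit_transform(x)))     # True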

Before using MinMaxScaler or StandardScaler you should know about the distribution of your dataset.
StandardScaler rescales a dataset to have a mean of 0 and a standard deviation of 1. Standardization is very useful when the data has varying scales and the algorithm assumes the data has a Gaussian distribution.
Normalization, or MinMaxScaler, rescales a dataset so that each value falls between 0 and 1. It is useful when the data has varying scales and the algorithm makes no assumptions about the distribution. It is a good technique when we do not know the distribution of the data, or when we know the distribution is not Gaussian.

Related

What is the acceptable offset for mean and standard deviation after StandardScaler transform?

I am using sklearn's StandardScaler to transform/normalize data, as follows:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = scaler.fit_transform(data)
I am expecting a mean of 0 and a standard deviation of 1. However, the values I get are a bit different.
from random import randrange

rnd = randrange(0, data.shape[1])  # pick a random column
print(data[:,rnd].std())
print(data[:,rnd].mean())
1.0282903146389404
-0.06686584736835668
Values this close to 0 and 1 seem like they should be acceptable; however, I am not sure what offset is acceptable. For instance, is +/- 1e-2, as I'm getting, close enough, or should I be concerned?
You are using fit_transform on your data variable. This means that all of your data is now normalized to mean 0 and standard deviation 1.
What you are doing next is taking, at random, some samples of your data variable. Since the sample is chosen randomly, its mean and standard deviation will be almost, but not exactly, the same as those of the full data.
To make a comparison, imagine that we have the mean and std of human height. If we now take a small sample, say the heights in your country, the mean and std won't be exactly the same, but they will be close. That's the point.
If you check the mean and std of your whole data variable, you will certainly obtain mean 0 and std 1.
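As a quick check of that last claim, a minimal sketch on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.random.RandomState(0).normal(loc=5.0, scale=2.0, size=(1000, 10))
data = StandardScaler().fit_transform(data)

print(round(data.mean(), 10))  # ~0.0
print(round(data.std(), 10))   # ~1.0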

Correct way of normalizing and scaling the MNIST dataset

I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
from torchvision.datasets import MNIST

trainset = MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean: 0.13062754273414612
Scaled Std: 0.30810779333114624
However, clearly this does neither of the above! The resulting data (1) will not be in [0, 1], and (2) will not have mean 0 or std 1. In fact, this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.
import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)
Usually, transforms.ToTensor() is used to turn input data in the range [0, 255] into a 3-dimensional tensor. This function automatically scales the input data to the range [0, 1] (this corresponds to your first step, scaling the data to [0, 1]).
Therefore, it makes sense that the mean and std used in transforms.Normalize(...) are 0.1307 and 0.3081, respectively, because Normalize is applied after ToTensor has already scaled the data (this corresponds to your second step, normalizing to zero mean and unit standard deviation).
Please refer to the link below for better explanation.
https://pytorch.org/vision/stable/transforms.html
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
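For concreteness, a minimal sketch on made-up pixel values showing that the two properties cannot hold at the same time:

import numpy as np

data = np.random.RandomState(0).randint(0, 256, size=1000).astype(float)  # fake pixel values

scaled = data / 255.0                              # values in [0, 1], but the mean is not 0
standardized = (data - data.mean()) / data.std()   # mean 0 and std 1, but values outside [0, 1]

print(scaled.min(), scaled.max(), scaled.mean())           # ~0, ~1, positive mean
print(round(standardized.mean(), 10), standardized.std())  # ~0.0, ~1.0
print(standardized.min(), standardized.max())              # negative minimum, maximum can exceed 1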
Does that clarify it for you, or have I entirely missed your point of confusion?
Purpose
Two of the most important reasons for feature scaling are:
You scale features to make them all of the same magnitude (i.e. importance or weight).
Example:
Consider a dataset with two features: Age and Weight, where the ages are in years and the weights are in grams. A person in their twenties who weighs only 60 kg would translate to the vector [20 yrs, 60000 g], and so on for the whole dataset. The Weight attribute will dominate during the training process. How much it dominates depends on the type of algorithm you are using; some are more sensitive than others. For example, in a Neural Network the learning rate of Gradient Descent is affected by the magnitude of the network weights (i.e. thetas), which vary in correlation with the inputs (i.e. the features) during training; feature scaling also improves convergence. Another example is the K-Means clustering algorithm, which requires features of the same magnitude since it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all the matrix multiplications and parameter summations run faster with small numbers than with very large numbers (or the very large numbers produced by multiplying features by some other parameters, etc.).
Types
The most popular types of feature scalers can be summarized as follows (a short comparison sketch follows the list):
StandardScaler: usually your first option; it is very commonly used. It works by standardizing the data (i.e. centering it), bringing it to std = 1 and mean = 0. It is affected by outliers and should only be used if your data have a Gaussian-like distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers, simply because it uses the range.
RobustScaler: it is "robust" against outliers because it scales the data according to the quantile range (by default the interquartile range). However, you should know that the outliers will still exist in the scaled data.
MaxAbsScaler: mainly used for sparse data.
Unit normalization (Normalizer): it scales the vector of each sample to have unit norm, independently of the distribution of the samples.
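A minimal sketch (on a single made-up feature with one outlier, for illustration) of how differently these scalers behave:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one feature with an outlier

print(StandardScaler().fit_transform(X).ravel())  # the outlier inflates the mean and std
print(MinMaxScaler().fit_transform(X).ravel())    # inliers squeezed near 0 because the range is huge
print(RobustScaler().fit_transform(X).ravel())    # median/IQR based, inliers keep a usable spread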
Which One & How Many
You need to get to know your dataset first. As mentioned above, there are things you need to look at beforehand, such as the distribution of the data, the existence of outliers, and the algorithm being used.
In any case, you need one scaler per dataset, unless there is a specific requirement, such as an algorithm that works only if the data are within a certain range and have a mean of zero and a standard deviation of 1, all at once. Nevertheless, I have never come across such a case.
Key Takeaways
There are different types of feature scalers, used according to the rules of thumb mentioned above.
You pick a scaler based on the requirements, not randomly.
You scale data for a purpose; for example, with the Random Forest algorithm you usually do NOT need to scale.
Well, the data gets scaled to [0, 1] using torchvision.transforms.ToTensor(), and then the normalization (0.1307, 0.3081) is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.

Normalizations in sklearn and their differences

I have read many articles that suggest this formula
N = (x - min(x)) / (max(x) - min(x))
for normalization,
but when I dug into sklearn's Normalizer I found that it uses this formula:
x / np.linalg.norm(x)
The latter uses the L2 norm by default. Which one should I use? Why is there a difference between the two?
There are different normalization techniques and sklearn provides many of them. Please note that we are looking at 1-D arrays here; for a matrix these operations are applied to each column (have a look at this post for an in-depth example: Scaling features for machine learning). Let's go through some of them:
Scikit-learn's MinMaxScaler performs (x - min(x)) / (max(x) - min(x)). This scales your array so that you only have values between 0 and 1. It can be useful if you want to apply some transformation afterwards where no negative values are allowed (e.g. a log-transform, or scaling RGB pixels as done in some MNIST examples).
Scikit-learn's StandardScaler performs (x - x.mean()) / x.std(), which centers the array around zero and scales by the variance of the features. This is a standard transformation and is applicable in many situations, but keep in mind that you will get negative values. It is especially useful when you have Gaussian-sampled data which is not centered around 0 and/or does not have unit variance.
Scikit-learn's Normalizer performs x / np.linalg.norm(x). This sets the length of your array/vector to 1. It might come in handy if you want to do some linear algebra, for example if you want to implement the Gram-Schmidt algorithm.
Scikit-learn's RobustScaler can be used to scale data with outliers. The mean and standard deviation are not robust to outliers, so this scaler uses the median and scales the data by quantile ranges.
There are other, non-linear transformations, such as QuantileTransformer, which transforms the data based on its quantiles, and PowerTransformer, which maps any distribution to a distribution similar to a Gaussian.
There are many other normalizations used in machine learning, and their sheer number can be confusing. The idea behind normalizing data in ML is usually that you don't want your model to treat one feature differently from another simply because it has a higher mean or a larger variance. For most standard cases I use MinMaxScaler or StandardScaler, depending on whether scaling according to the variance seems important to me.
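A minimal sketch (on a tiny made-up matrix, for illustration) of the practical difference: MinMaxScaler works column-wise, while Normalizer rescales each row (sample) to unit norm:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 2.0],
              [0.0, -1.0]])

print(MinMaxScaler().fit_transform(X))         # each column mapped to [0, 1]
print(Normalizer(norm='l2').fit_transform(X))  # each row divided by its L2 norm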
np.linalg.norm(X) (with the default ord, i.e. the Frobenius norm for a 2-D array) is given by:
np.linalg.norm(X) = sqrt(sum_ij |x_ij|^2)
so let's assume you have:
X = (1  2
     0 -1)
then with this you would have:
np.linalg.norm(X) = sqrt(1^2 + 2^2 + 0^2 + (-1)^2) = sqrt(6) ≈ 2.449
X / np.linalg.norm(X) = (0.41  0.82
                         0    -0.41)
with the other approach you would have:
min(x)= -1
max(x)= 2
max(x)-min(x)=3
X = (0.66 1
0.33 0)
So the (x - min(x)) / (max(x) - min(x)) approach is what MinMaxScaler does; there, all the values are always between 0 and 1. The other approach normalizes your values, but you can still have negative values. Depending on your next steps, you need to decide which one to use.
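A quick numerical check of the numbers above (a minimal sketch using numpy):

import numpy as np

X = np.array([[1.0, 2.0],
              [0.0, -1.0]])

print(np.linalg.norm(X))                    # sqrt(6) ≈ 2.449 (Frobenius norm)
print(X / np.linalg.norm(X))                # [[ 0.408  0.816] [ 0.    -0.408]]
print((X - X.min()) / (X.max() - X.min()))  # [[0.667 1.   ] [0.333 0.   ]]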
Based on the API description
The scikit-learn Normalizer scales input vectors individually to unit norm (vector length).
That is why it uses the L2 norm by default (you can also use the L1 norm, as explained in the API).
I think you are looking for a scaler rather than a normalizer, based on your description. Please find the Min-Max scaler in this link.
Also, you can consider a standard scaler, which standardizes a value by removing its mean and scaling to its standard deviation.

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using scikit-learn's FeatureAgglomeration to apply a hierarchical clustering procedure to features rather than to observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15); after using feature agglomeration, df_reduced now has the shape (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each of the clusters are combined during transform is determined by the way you perform the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (which are n_samples-dimensional vectors). For certain applications you might want to compute the centers manually using a different definition of the cluster center (e.g. the median instead of the mean, to avoid the influence of outliers, etc.):
import numpy as np

n_features = 15
n_clusters = 5
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# mean over the columns in each group, one value per sample (mirrors agglo.transform with mean pooling)
new_features = [df.iloc[:, group].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's StandardScaler). Otherwise you will be grouping the scales of the quantities rather than clustering similar behaviour.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ contains an array that tells you to which cluster in the reduced dataset each feature of the original dataset belongs.
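For instance, a minimal sketch (assuming the agglo and df objects from the question) that lists which original columns ended up in each cluster:

import numpy as np

# group the original column names by the cluster label assigned to each feature
for cluster_id in range(agglo.n_clusters):
    members = df.columns[np.asarray(agglo.labels_) == cluster_id]
    print('Cluster {}: {}'.format(cluster_id, list(members)))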

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new with shape n x nf. Is it possible to know which features have been discarded and which ones have been retained?
Thanks
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.
Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection
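For example, a minimal sketch of variance-based selection with sklearn.feature_selection, which keeps actual columns so you can ask which ones survived (the threshold and the placeholder X are assumptions for illustration):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.RandomState(0).rand(50, 10)     # placeholder for your data (samples on the rows)

selector = VarianceThreshold(threshold=0.05)  # hypothetical cutoff, tune it for your data
X_sel = selector.fit_transform(X)

kept = selector.get_support(indices=True)     # indices of the retained feature columns
dropped = np.setdiff1d(np.arange(X.shape[1]), kept)
print('kept:', kept, 'dropped:', dropped)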
The features projected onto the principal components retain the important information (the axes with maximum variance) and drop the axes with small variance. This behaviour is like compression, not discarding.
X_proj would be a better name than X_new, because it is the projection of X onto the principal components.
You can reconstruct X_rec as
X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new
Here, X_rec is close to X, but the less important information has been dropped by PCA, so we can say that X_rec is denoised.
In my opinion, what is discarded is the noise.
The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted, so it cannot be used to identify the important features.
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
