How does StandardScaler not disrupt data integrity? - python

Since sklearn's StandardScaler normalizes the initial data, isn't it problematic that the data are no longer the same?
Example:
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1,1],[2,0]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(data)
[[1 1]
[2 0]]
print(scaled_data)
[[-1. 1.]
[ 1. -1.]]
As you can see, the data are not the same after normalization. How does that change not affect the results of later processing, given that the data are different, and in which scenarios is it suitable to perform normalization (basically we do that for data which have negative values, but I mean in which processes is it appropriate)?

Let's go to official docs for the function:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
From that, we can see this formula:
The standard score of a sample x is calculated as:
z = (x - u) / s
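Here u is the mean and s is the standard deviation of the training samples, both computed per column.
This is the same standard score (z-score) used with the normal distribution, but applying it does not require the data to be normally distributed.
Geometrically, we subtract the same value from every entry of a field/column and then divide every entry by another single value.
We are just shifting and rescaling the data, so the relative positions of the samples within each column are preserved and data integrity is not lost.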
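One way to check this is to reproduce the formula by hand and then undo it. The sketch below assumes the same 2x2 array as in the question; the names manual and restored are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 1], [2, 0]], dtype=float)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# z = (x - u) / s, computed per column; np.std uses the population
# standard deviation (ddof=0) by default, as StandardScaler does
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(scaled_data, manual))

# the transform is invertible, so the original values can be recovered
restored = scaler.inverse_transform(scaled_data)
print(np.allclose(restored, data))
```

Because the fitted scaler stores the mean and scale of each column, inverse_transform can map the scaled values back to the original ones.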

Another point to keep in mind is that, by default, the StandardScaler class in sklearn does not modify the data in place:
"copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned."
So above, since you've assigned the results to the scaled_data name, the object that is referenced by data is unchanged as long as you don't change the copy=True default parameter in StandardScaler.
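A quick way to confirm this for the default setting is to keep a copy of the input and compare it after scaling. This is only a sketch for the copy=True case; the name original is introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 1], [2, 0]])
original = data.copy()

scaled_data = StandardScaler().fit_transform(data)  # copy=True by default

# the input array still holds the raw values
print(np.array_equal(data, original))
# the scaled result lives in a separate array
print(np.shares_memory(data, scaled_data))
```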

Related

The calculated RobustScaler in sklearn seems not right

I tried the RobustScaler in sklearn, and found that the results are not the same as the formula.
The formula of the RobustScaler in sklearn is:
I have a matrix shown as below:
I test the first data in feature one (row one and column one). The scaled value should be (1-3)/(5.5-1.5) = -0.5. However, the result from the sklearn is -0.67. Does anyone know where the calculation is not correct?
The code using sklearn is as below:
import numpy as np
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
From the RobustScaler documentation (emphasis added):
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.
i.e. the median and IQR quantities are calculated per column, and not for the whole array.
Having clarified that, let's calculate the scaled values for your first column manually:
import numpy as np
x1 = np.array([1, 4, 7, 2]) # your 1st column here
q75, q25 = np.percentile(x1, [75 ,25])
iqr = q75 - q25
x1_med = np.median(x1)
x1_scaled = (x1-x1_med)/iqr
x1_scaled
# array([-0.66666667, 0.33333333, 1.33333333, -0.33333333])
which is the same as the first column of your own x_new, as calculated by scikit-learn:
# your code verbatim:
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
# result
[[-0.66666667 -0.375 -0.35294118 -0.33333333]
[ 0.33333333 0.375 0.35294118 0.33333333]
[ 1.33333333 1.125 1.05882353 1. ]
[-0.33333333 -0.625 -0.82352941 -1. ]]
np.all(x1_scaled == x_new[:,0])
# True
Similarly for the rest of the columns (features) - you need to calculate separately the median and IQR values for each one of them before scaling them.
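The per-column calculation can also be done for all features at once by passing axis=0. The sketch below assumes the same matrix as in the question; medians and manual are illustrative names, and center_ and scale_ are the fitted attributes of RobustScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [2, 1, 1, 1]], dtype=float)

scaler = RobustScaler(quantile_range=(25.0, 75.0), with_centering=True)
x_new = scaler.fit_transform(x)

# axis=0 computes one median and one IQR per column (feature)
medians = np.median(x, axis=0)
q75, q25 = np.percentile(x, [75, 25], axis=0)
manual = (x - medians) / (q75 - q25)

print(np.allclose(manual, x_new))
print(np.allclose(scaler.center_, medians), np.allclose(scaler.scale_, q75 - q25))
```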
UPDATE (after comment):
As pointed out in the Wikipedia entry on quartiles:
For discrete distributions, there is no universal agreement on selecting the quartile values
See also the relevant reference, Sample quantiles in statistical packages:
There are a large number of different definitions used for sample quantiles in statistical computer packages
Digging into the documentation of np.percentile used here, you'll see that there are no fewer than five (5) different methods of interpolation (selected with the interpolation argument, which was renamed to method in NumPy 1.22), and not all of them produce identical results (see also the 4 different methods demonstrated in the Wikipedia entry linked just above); here is a quick demonstration of these methods and their results on the x1 data defined above:
np.percentile(x1, [75 ,25]) # interpolation='linear' by default
# array([4.75, 1.75])
np.percentile(x1, [75 ,25], interpolation='lower')
# array([4, 1])
np.percentile(x1, [75 ,25], interpolation='higher')
# array([7, 2])
np.percentile(x1, [75 ,25], interpolation='midpoint')
# array([5.5, 1.5])
np.percentile(x1, [75 ,25], interpolation='nearest')
# array([4, 2])
Apart from the fact that no two of these methods produce identical results here, it should also be apparent that the definition you are using in your own calculations corresponds to interpolation='midpoint', while the default Numpy method is interpolation='linear'. And as Ben Reiniger correctly points out in the comments below, what is actually used in the source code of RobustScaler is np.nanpercentile (a variation of the np.percentile I have used here that is able to handle nan values) with the default interpolation='linear' setting.

scaling data to specific range in python

I would like to scale an array of size [192,4000] to a specific range. I would like each row (1:192) to be rescaled to a specific range e.g. (-840,840). I run a very simple code:
import numpy as np
from sklearn import preprocessing as sp
sample_mat = np.random.randint(-840,840, size=(192, 4000))
scaler = sp.MinMaxScaler(feature_range=(-840,840))
scaler = scaler.fit(sample_mat)
scaled_mat= scaler.transform(sample_mat)
This messes up my matrix range, even when max and min of my original matrix is exactly the same. I can't figure out what is wrong, any idea?
You can do this manually.
It is a linear transformation of the minmax normalized data.
interval_min = -840
interval_max = 840
scaled_mat = (sample_mat - np.min(sample_mat)) / (np.max(sample_mat) - np.min(sample_mat)) * (interval_max - interval_min) + interval_min
MinMaxScaler supports a feature_range argument on initialization that produces output in a given range, scaling each column (feature) separately.
scaler = MinMaxScaler(feature_range=(1, 2)) will yield output in the (1,2) range
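Since MinMaxScaler scales each column independently, rescaling each row instead can be sketched by transposing before and after the transform. This is an illustration under the assumption that per-row scaling is what the question wants; the small random matrix and the name scaled_rows are stand-ins, not the original data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# small stand-in for the (192, 4000) array in the question
sample_mat = rng.integers(-840, 840, size=(5, 50)).astype(float)

scaler = MinMaxScaler(feature_range=(-840, 840))

# MinMaxScaler works column by column, so transpose to make each
# original row a column, scale, then transpose back
scaled_rows = scaler.fit_transform(sample_mat.T).T

print(scaled_rows.min(axis=1))
print(scaled_rows.max(axis=1))
```

After this, every row has its own minimum mapped to -840 and its own maximum mapped to 840.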

Handling missing (nan) values on sklearn.preprocessing

I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want - by which I mean they normalize the existing values while keeping the nans - while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found - how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non
zero component is rescaled independently of other samples so that its
norm (l1 or l2) equals one (source here)
That means that each row must end up with unit norm. How should a missing value be handled? Ideally, it seems you don't want it to count toward the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents this by raising an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
1. first create a mask in order to record which elements were missing in your array
2. create a response array filled with missing values
3. apply the Normalizer to your array after selecting only the valid entries
4. record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# assign the normalized valid entries back to their original positions
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
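For a 2-D array where each row is a sample, the same idea can be sketched without Normalizer by computing the row norms with NaN-aware NumPy functions. The helper name normalize_rows_ignoring_nan is hypothetical and not part of scikit-learn; it assumes that missing values should simply be left out of the norm.

```python
import numpy as np

def normalize_rows_ignoring_nan(X, norm="l2"):
    """Hypothetical helper: scale each row to unit norm, skipping NaNs."""
    X = np.asarray(X, dtype=float)
    if norm == "l1":
        norms = np.nansum(np.abs(X), axis=1, keepdims=True)
    elif norm == "l2":
        norms = np.sqrt(np.nansum(X ** 2, axis=1, keepdims=True))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms  # NaN entries stay NaN

X = np.array([[3.0, 4.0, np.nan],
              [0.0, 0.0, 0.0],
              [1.0, np.nan, 1.0]])
print(normalize_rows_ignoring_nan(X, norm="l2"))
print(normalize_rows_ignoring_nan(X, norm="l1"))
```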

What is the purpose of keras utils normalize?

I'd like to normalize my training set before passing it to my NN so instead of doing it manually (subtract mean and divide by std), I tried keras.utils.normalize() and I am amazed about the results I got.
Running this:
import numpy as np
from keras.utils import normalize
r = np.random.rand(3000) * 1000
nr = normalize(r)
print(np.mean(r))
print(np.mean(nr))
print(np.std(r))
print(np.std(nr))
print(np.min(r))
print(np.min(nr))
print(np.max(r))
print(np.max(nr))
Results in this:
495.60440066771866
0.015737914577213984
291.4440194021
0.009254802974329002
0.20755517410064872
6.590913227674956e-06
999.7631481267636
0.03174747238214018
Unfortunately, the docs don't explain what's happening under the hood. Can you please explain what it does and if I should use keras.utils.normalize instead of what I would have done manually?
It is not the kind of normalization you expect. Actually, it uses np.linalg.norm() under the hood to normalize the given data using Lp-norms:
def normalize(x, axis=-1, order=2):
    """Normalizes a Numpy array.
    # Arguments
        x: Numpy array to normalize.
        axis: axis along which to normalize.
        order: Normalization order (e.g. 2 for L2 norm).
    # Returns
        A normalized copy of the array.
    """
    l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
    l2[l2 == 0] = 1
    return x / np.expand_dims(l2, axis)
For example, in the default case, it normalizes the data using L2 normalization (i.e. the sum of the squared elements along the given axis equals one).
You can either use this function, or if you don't want to do mean and std normalization manually, you can use StandardScaler() from sklearn or even MinMaxScaler().
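The difference between the two kinds of normalization can be sketched with plain NumPy. This example does not call keras.utils.normalize itself; it only mirrors the L2 case on a 1-D vector and compares it with mean/std standardization, using illustrative variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(3000) * 1000

# L2 normalization: divide by the Euclidean norm of the whole vector,
# so the sum of squares becomes one
l2_normalized = r / np.linalg.norm(r)
print(np.sum(l2_normalized ** 2))

# standardization: subtract the mean and divide by the standard deviation,
# so the result has mean zero and standard deviation one
standardized = (r - r.mean()) / r.std()
print(standardized.mean(), standardized.std())
```

With 3000 positive values, each L2-normalized element ends up small, which is consistent with the tiny mean and maximum reported in the question.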

Why does sklearn's train/test split plus PCA make my labelling incorrect?

I am exploring PCA in Scikit-learn (0.20 on Python 3) using Pandas for structuring my data. When I apply a test/train split (and only when), my input labels seem to no longer match up with the PCA output.
import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn
def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df
feature_names, bc_data = load_bc_as_dataframe()
from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']
seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)
pyplot.show()
This looks reasonable, and that's borne out by accurate classification results. If I replace the bc_train = bc_data with a train_test_split() call (even with test_size=0), my labels seem to no longer correspond to the original ones.
I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which frame.
How can I correctly relabel my PCA output?
The issue has three parts:
1. The shuffling in train_test_split() causes the indices in bc_train to be in a random order (compared to the row location).
2. PCA operates on numerical matrices, and effectively strips the indices from the input. Creating a new DataFrame recreates the indices to be sequential (compared to the row location).
3. Now we have random indices in bc_train and sequential indices in bc_pca. When I do bc_pca['diagnosis'] = bc_train['diagnosis'], bc_train['diagnosis'] is reindexed with bc_pca's indices. This reorders the bc_train data so that its indices match bc_pca's.
To put it another way, Pandas does a left-join on the indices when I assign with bc_pca['diagnosis'] (i.e. __setitem__()), not a row-by-row copy (similar to update()).
I don't find this intuitive, and I couldn't find documentation on __setitem__()'s behaviour beyond the source code, but I expect it makes sense to a more experienced Pandas user, and maybe it's documented at a higher level somewhere I haven't seen.
There are a number of ways to avoid this. I can reset the index of the training/test data:
bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_train.reset_index(drop=True)
Alternatively I could assign from the values member:
bc_pca['diagnosis'] = bc_train['diagnosis'].values
I could also do a similar thing before constructing the DataFrame (arguably more sensible, since PCA is effectively operating on bc_train[feature_names].values).
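The alignment behaviour can be reproduced on a tiny example. The frames below are made up for illustration (three labels and placeholder numbers standing in for PCA output); only the assignment pattern mirrors the code above.

```python
import pandas as pd

labels = pd.DataFrame({"diagnosis": ["a", "b", "c"]})
shuffled = labels.iloc[[2, 0, 1]]  # rows reordered, index is now 2, 0, 1

# fresh sequential index 0, 1, 2, as after wrapping a NumPy array
projected = pd.DataFrame({"PCA 1": [0.2, 0.0, 0.1]})

aligned = projected.copy()
aligned["diagnosis"] = shuffled["diagnosis"]  # matched by index label

positional = projected.copy()
positional["diagnosis"] = shuffled["diagnosis"].values  # matched by row position

print(aligned["diagnosis"].tolist())
print(positional["diagnosis"].tolist())
```

The first assignment silently restores the unshuffled order because it matches on index labels, while the second keeps the shuffled row order that the projected values actually follow.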
