Handling missing (nan) values on sklearn.preprocessing - python

I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want - by which I mean normalize the existing values while keeping the nans - while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found an answer: how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))  # works: scales the valid values, keeps the nan
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))  # raises ValueError because of the nan

The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one (source here)
That means that each row has to be rescaled to unit norm. How do you deal with a missing value? Ideally, you don't want it to count in the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents that by raising an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array filled with missing values
result = np.full(data.shape, np.nan)
# assign the normalized values back to their original positions
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
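If you also want to try the l1/l2 options without going through scikit-learn, here is a minimal numpy sketch (an assumption on my part: the whole array is treated as a single sample and the nans are simply ignored when computing the norm; nan_normalize is a hypothetical helper, not a library function):
import numpy as np

def nan_normalize(x, norm='l2'):
    # compute the norm over the valid (non-nan) entries only
    if norm == 'l1':
        denom = np.nansum(np.abs(x))
    elif norm == 'l2':
        denom = np.sqrt(np.nansum(x ** 2))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if denom == 0:
        denom = 1.0  # mirror sklearn's handling of all-zero samples
    return x / denom  # nans stay nan

data = np.array([0, 1, 2, np.nan, 3, 4])
print(nan_normalize(data, norm='l2'))
print(nan_normalize(data, norm='l1'))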

Related

Python Dataframe Filter data using linear relation

I have a data frame with input and output columns. They have a linear relation. So, I want to remove data that does not fit this relation. My actual df is big and has many samples. Here, I am giving an example.
My code:
import pandas as pd
xdf = pd.DataFrame({'ip':[10,20,30,40],'op':[105,195,500,410]})
I don't have any idea how to proceed.
You can do a linear fit first, then filter out the data that falls outside a certain threshold.
Sample code below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ip':[10,20,30,40],'op':[105,195,500,410]})
# do a linear fit on ip and op
f = np.polyfit(df.ip, df.op, 1)
fl = np.poly1d(f)
# you will have to determine this threshold in some way
threshold = 100
# keep only the rows whose residual from the fit is within the threshold
output = df[(df.op - fl(df.ip)).abs() < threshold]
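If you'd rather not pick the threshold by hand, one option (my own suggestion, not part of the original answer) is to derive it from the spread of the residuals, e.g. keep points within k standard deviations of the fitted line:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})
f = np.polyfit(df.ip, df.op, 1)
fl = np.poly1d(f)

residuals = df.op - fl(df.ip)
k = 1.0  # hypothetical tolerance; tune for your data
threshold = k * residuals.std()
output = df[residuals.abs() < threshold]  # drops the (30, 500) outlier here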
Another way:
You can create a boolean mask that checks whether the ratio op/ip is less than its mean value:
import matplotlib.pyplot as plt
# keep only rows where op/ip is below the mean ratio
m = xdf.eval("op/ip").lt(xdf.eval("op/ip").mean())
Finally:
out = xdf[m]
plt.scatter(x=out['ip'], y=out['op'])
plt.show()

How can I replace missing boolean values using python?

In my dataset, one of the columns is a boolean value, and there are missing values within the dataset. The missing values in the other, continuous columns are successfully replaced with their mean, but a mean value can't be used for a missing boolean. So how can I replace those values?
Note that the boolean is 1 or 0 in my dataset.
Below is the code for replacing continuous missing values:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x)
x = imputer.transform(x)
Thank You
There are several methods to attack this issue:
if you can afford it (i.e. you have enough data), exclude those rows
replace those rows with the majority value (the analogue of replacing with the mean for a continuous value); see the sketch after this list
for time series: replace the cell with the mean of the x cells before and after it, then set a threshold above which the mean becomes 1, otherwise 0
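For the majority-value option, a minimal pandas sketch (fillna and mode are standard pandas methods; the column name bool_col is just a placeholder):
import numpy as np
import pandas as pd

df = pd.DataFrame({'bool_col': [1.0, 0.0, 1.0, np.nan, 1.0, np.nan]})
# fill missing booleans with the most frequent (majority) value
majority = df['bool_col'].mode()[0]
df['bool_col'] = df['bool_col'].fillna(majority)
print(df)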
You can treat this boolean variable as a categorical feature and then use a SimpleImputer with the most_frequent strategy instead of mean.
You can do it as follows:
from sklearn.impute import SimpleImputer
import numpy as np
# create one boolean (0/1) feature with nans: 100 samples, every 4th missing
X = np.random.randint(2, size=100).reshape(-1, 1).astype(float)
X[::4, 0] = np.nan
SimpleImputer(strategy="most_frequent").fit_transform(X)

How does StandardScaler not disrupt data integrity?

Since sklearn's StandardScaler normalizes the initial data, isn't it problematic that the initial data are not the same anymore?
Example:
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1,1],[2,0]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(data)
[[1 1]
[2 0]]
print(scaled_data)
[[-1. 1.]
[ 1. -1.]]
As you can see, the data are not the same after normalization. How does that change not affect the results of later processing, given that the data are different, and in what scenarios is it appropriate to perform normalization (basically we do it for data that have negative values, but in which processes is it appropriate)?
Let's go to the official docs for the function:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
From that, we can see this formula:
The standard score of a sample x is calculated as:
z = (x - u) / s
Here u is the mean and s is the standard deviation.
This is just the standard score (z-score) from statistics, and it can be applied to any data.
Geometrically, we are subtracting the same value from every entry of a column and then dividing every entry by another constant value.
We are just shifting and rescaling the data, so data integrity is not lost.
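A quick way to convince yourself that no information is lost: StandardScaler.inverse_transform undoes the scaling exactly (a short sketch reusing the example data from the question):
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1, 1], [2, 0]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# inverse_transform maps the scaled values back to the originals
recovered = scaler.inverse_transform(scaled_data)
print(np.allclose(recovered, data))  # True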
Another point to keep in mind is that StandardScaler does not modify the data in place by default:
"copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned."
So above, since you've assigned the results to the scaled_data name, the object that is referenced by data is unchanged as long as you don't change the copy=True default parameter in StandardScaler.

What is the purpose of keras utils normalize?

I'd like to normalize my training set before passing it to my NN, so instead of doing it manually (subtracting the mean and dividing by the std), I tried keras.utils.normalize() and I am amazed by the results I got.
Running this:
import numpy as np
from keras.utils import normalize

r = np.random.rand(3000) * 1000
nr = normalize(r)
print(np.mean(r))
print(np.mean(nr))
print(np.std(r))
print(np.std(nr))
print(np.min(r))
print(np.min(nr))
print(np.max(r))
print(np.max(nr))
Results in:
495.60440066771866
0.015737914577213984
291.4440194021
0.009254802974329002
0.20755517410064872
6.590913227674956e-06
999.7631481267636
0.03174747238214018
Unfortunately, the docs don't explain what's happening under the hood. Can you please explain what it does and if I should use keras.utils.normalize instead of what I would have done manually?
It is not the kind of normalization you expect. Actually, it uses np.linalg.norm() under the hood to normalize the given data using Lp-norms:
def normalize(x, axis=-1, order=2):
    """Normalizes a Numpy array.

    # Arguments
        x: Numpy array to normalize.
        axis: axis along which to normalize.
        order: Normalization order (e.g. 2 for L2 norm).

    # Returns
        A normalized copy of the array.
    """
    l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
    l2[l2 == 0] = 1
    return x / np.expand_dims(l2, axis)
For example, in the default case, it would normalize the data using L2-normalization (i.e. the sum of the squared elements would be equal to one).
You can either use this function, or if you don't want to do mean and std normalization manually, you can use StandardScaler() from sklearn or even MinMaxScaler().
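Here is a short check of both points (a sketch based on the 1-D array from the question; keras treats it as a single sample along the last axis, so the whole vector ends up with unit L2 norm):
import numpy as np
from keras.utils import normalize
from sklearn.preprocessing import StandardScaler

r = np.random.rand(3000) * 1000

# keras: L2-normalization, the whole vector gets unit L2 norm
nr = normalize(r)
print(np.linalg.norm(nr))           # ~1.0

# manual mean/std standardization is equivalent to StandardScaler
manual = (r - r.mean()) / r.std()
scaled = StandardScaler().fit_transform(r.reshape(-1, 1)).ravel()
print(np.allclose(manual, scaled))  # True
print(manual.mean(), manual.std())  # ~0.0, ~1.0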

Why does sklearn's train/test split plus PCA make my labelling incorrect?

I am exploring PCA in Scikit-learn (0.20 on Python 3) using Pandas for structuring my data. When I apply a test/train split (and only when), my input labels seem to no longer match up with the PCA output.
import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn
def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df
feature_names, bc_data = load_bc_as_dataframe()
from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']
seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)
pyplot.show()
This looks reasonable, and that's borne out by accurate classification results. If I replace the bc_train = bc_data with a train_test_split() call (even with test_size=0), my labels seem to no longer correspond to the original ones.
I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which frame.
How can I correctly relabel my PCA output?
The issue has three parts:
The shuffling in train_test_split() causes the indices in bc_train to be in a random order (compared to the row location).
PCA operates on numerical matrices, and effectively strips the indices from the input. Creating a new DataFrame recreates the indices to be sequential (compared to the row location).
Now we have random indices in bc_train and sequential indices in bc_pca. When I do bc_pca['diagnosis'] = bc_train['diagnosis'], bc_train is reindexed with bc_pca's indices. This reorders the bc_train data so that its indices match bc_pca's.
To put it another way, Pandas does a left join on the indices when I assign with bc_pca['diagnosis'] (i.e. __setitem__()), not a row-by-row copy (similar to update()).
I don't find this intuitive, and I couldn't find documentation on __setitem__()'s behaviour beyond the source code, but I expect it makes sense to a more experienced Pandas user, and maybe it's documented at a higher level somewhere I haven't seen.
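A tiny demonstration of that alignment behaviour (a sketch with made-up data, not taken from the question):
import pandas as pd

# left has shuffled indices (like bc_train after train_test_split),
# right has fresh sequential indices (like bc_pca)
left = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])
right = pd.DataFrame({'x': [10, 20, 30]})   # index 0, 1, 2

# column assignment aligns on the index, not on row position
right['label'] = left
print(right['label'].tolist())          # ['b', 'c', 'a']

# assigning from .values copies row by row instead
right['label_values'] = left.values
print(right['label_values'].tolist())   # ['a', 'b', 'c']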
There are a number of ways to avoid this. I can reset the index of the training/test data:
bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train.reset_index(inplace=True)
Alternatively I could assign from the values member:
bc_pca['diagnosis'] = bc_train['diagnosis'].values
I could also do a similar thing before constructing the DataFrame (arguably more sensible, since PCA is effectively operating on bc_train[feature_names].values).
