I am new to machine learning and scikit-learn. I was going through the documentation and tried OneHotEncoder() with some sample data. Can someone please explain what encoder.feature_indices_ means and how I get the Encoded_Vector output of [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]? Any help is appreciated. Thanks!
>>> from sklearn import preprocessing
>>> encoder = preprocessing.OneHotEncoder()
>>> encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4,3]])
OneHotEncoder(categorical_features='all', dtype=<type 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> encoder.n_values_
array([ 3, 4, 6, 13])
>>> encoder.feature_indices_
array([ 0, 3, 7, 13, 26])
>>> vector_encoded = encoder.transform([[2,3,5,3]]).toarray()
>>> print "\nEncoded_Vector =",vector_encoded
Encoded_Vector = [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
>>>
My understanding so far is
Input
0 2 1 12
1 3 5 3
2 3 2 12
1 2 4 3
This is 4 columns/features and 4 rows. Each column has a different number of unique values. If I run:
encoder.n_values_
It gives: array([ 3, 4, 6, 13])
So categories for each feature are:
feature 1 can take 3 values : 0 1 2
feature 2 can take 4 values : 0 1 2 3
feature 3 can take 6 values : 0 1 2 3 4 5
feature 4 can take 13 values : 0 1 2 3 4 5 6 7 8 9 10 11 12
Even though n_values_ says that your features can take 3, 4, 6 and 13 values respectively, the sample data you provided ([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]) does not actually contain all of those values.
Your example basically says that:
feature 1 can take 3 values (0,1,2)
feature 2 can take 2 values (2,3)
feature 3 can take 4 values (1,2,4,5)
feature 4 can take 2 values (3,12)
This ends up with a total of 11 values. Thus, the output from the OneHotEncoding ([[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]) has 11 values, and it can be split into 4 sections:
[0. 0. 1.] is the encoding for feature 1
[0. 1.] is the encoding for feature 2
[0. 0. 0. 1.] is the encoding for feature 3
[1. 0.] is the encoding for feature 4
The position of the "1." in each section tells you the value of the corresponding feature (try matching the input [2, 3, 5, 3] against the sections above).
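As a small addition for the feature_indices_ part of the question, here is a sketch (assuming the older, pre-0.20 OneHotEncoder API shown above, which exposes n_values_, feature_indices_ and active_features_) of where those numbers come from:
import numpy as np
from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
# feature_indices_ is simply the cumulative sum of n_values_ with a leading 0;
# feature i owns columns feature_indices_[i] to feature_indices_[i+1] of the
# "full" (un-pruned) encoding.
print(np.concatenate(([0], np.cumsum(encoder.n_values_))))  # [ 0  3  7 13 26]
# active_features_ lists which of those 26 columns were actually seen during
# fit -- there are 11 of them, which is why the transformed vector has length 11.
print(encoder.active_features_)
print(len(encoder.active_features_))  # 11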
Related
I am trying to transform my dictionary with StandardScaler(), but it gives me only zeros.
How can I fix it?
from sklearn.preprocessing import StandardScaler
import pandas as pd
param = {
    "user_id": 22058,
    "signup_day": 24,
    "signup_month": 2,
    "signup_year": 2015,
    "purchase_day": 18,
    "purchase_month": 4,
    "purchase_year": 2015,
    "purchase_value": 34,
    "age": 39,
    "source_Ads": 0,
    "source_Direct": 0,
    "source_SEO": 1,
    "browser_Chrome": 1,
    "browser_FireFox": 0,
    "browser_IE": 0,
    "browser_Opera": 0,
    "browser_Safari": 0,
    "sex_F": 0,
    "sex_M": 1
}
new = (pd.Series(param, index=['user_id', 'signup_day', 'signup_month', 'signup_year', 'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value', 'age', 'source_Ads', 'source_Direct', 'source_SEO', 'browser_Chrome', 'browser_FireFox','browser_IE', 'browser_Opera', 'browser_Safari', 'sex_F', 'sex_M'])).values.reshape(1,-1)
print(new)
scaler = StandardScaler()
X_new = scaler.fit_transform(new)
print(X_new)
Results:
new = [[22058 24 2 2015 18 4 2015 34 39 0 0 1 1 0 0 0 0 0 1]]
X_new =[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
StandardScaler is meant to scale the columns of your data; here you have only one value per column, so each of them is scaled to 0. Use it with multiple values per column and you will get the expected result!
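For instance, a minimal sketch with two hypothetical rows instead of one, so each column has something to standardise:
import numpy as np
from sklearn.preprocessing import StandardScaler
# Two made-up samples: with more than one value per column, the scaler can
# compute a per-column mean and standard deviation.
X = np.array([[22058, 24, 34, 39],
              [11031, 12, 50, 27]], dtype=float)
print(StandardScaler().fit_transform(X))
# Each column becomes [ 1., -1.] or [-1., 1.]: (value - column mean) / column std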
I am doing normalization for my datasets, but the data contains a lot of zeros because of padding.
I can mask them during model training, but these zeros are also affected when I apply normalization.
I am currently using the scikit-learn library to do the normalization:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
For example, given a 3D array with dimensions (4, 3, 5) as (batch, step, features).
The amount of zero-padding varies from batch to batch, because these are features I extracted from audio files of varying length using a fixed window size:
[[[0 0 0 0 0],
  [0 0 0 0 0],
  [0 0 0 0 0]],
 [[1 2 3 4 5],
  [4 5 6 7 8],
  [9 10 11 12 13]],
 [[14 15 16 17 18],
  [0 0 0 0 0],
  [24 25 26 27 28]],
 [[0 0 0 0 0],
  [423 2 230 60 70],
  [0 0 0 0 0]]]
I wish to perform normalization by column, so:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train.reshape(-1,X_train.shape[-1])).reshape(X_train.shape)
X_test = scaler.transform(X_test.reshape(-1,X_test.shape[-1])).reshape(X_test.shape)
However, in this case, zeros are treated as effective values. For example, the minimum value of the first column should be 1 instead of 0.
Further, the zero values are also changed after applying the scalers, but I wish to keep them as zeros so that I can mask them during training with model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2]))).
Is there any way to mask them during normalization so only the 2nd step and 3rd step in this example are used in normalization?
In addition, the actual array for my project is bigger, with shape (2000, 50, 68), and the values of the 68 features can differ greatly in magnitude. I tried to normalize them by dividing each element by the largest element in its row to avoid the impact of the zeros, but this did not work out well.
Masking for MinMaxScaler() alone can be handled with the code below.
Every other operation needs its own way of handling; if you mention all the operations that need masking, we can solve them one by one and I'll extend my answer. E.g. Keras layers can be masked with a tf.keras.layers.Masking() layer, as you mentioned.
The following code min/max-scales only the time steps that are not all zeros; the padded steps remain zeros.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([
    [[0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]],
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [9, 10, 11, 12, 13]],
    [[14, 15, 16, 17, 18],
     [0, 0, 0, 0, 0],
     [24, 25, 26, 27, 28]],
    [[0, 0, 0, 0, 0],
     [423, 2, 230, 60, 70],
     [0, 0, 0, 0, 0]]
], dtype=np.float64)
# (batch, step) mask: True where a time step has at least one non-zero feature
nz = np.any(X, -1)
# Fit and transform only the non-padded time steps; padded steps stay zero
X[nz] = MinMaxScaler().fit_transform(X[nz])
print(X)
Output:
[[[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]]
[[0. 0. 0. 0. 0. ]
[0.007109 0.13043478 0.01321586 0.05357143 0.04615385]
[0.01895735 0.34782609 0.03524229 0.14285714 0.12307692]]
[[0.03080569 0.56521739 0.05726872 0.23214286 0.2 ]
[0. 0. 0. 0. 0. ]
[0.05450237 1. 0.10132159 0.41071429 0.35384615]]
[[0. 0. 0. 0. 0. ]
[1. 0. 1. 1. 1. ]
[0. 0. 0. 0. 0. ]]]
If you need to fit MinMaxScaler() on one dataset (X) and apply it later to others (Y below), you can do the following:
scaler = MinMaxScaler().fit(X[np.any(X, -1)])
X[np.any(X, -1)] = scaler.transform(X[np.any(X, -1)])
Y[np.any(Y, -1)] = scaler.transform(Y[np.any(Y, -1)])
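As an aside (not part of the original answer), here is a tiny sketch of why np.any(X, -1) works as the padding mask:
import numpy as np
# np.any(..., -1) marks the (batch, step) positions that have at least one
# non-zero feature, i.e. the rows that are NOT all-zero padding.
X = np.array([[[0, 0, 0],
               [1, 2, 3]],
              [[4, 5, 6],
               [0, 0, 0]]], dtype=float)
nz = np.any(X, -1)
print(nz)
# [[False  True]
#  [ True False]]
# X[nz] is a 2-D array containing only the real (non-padded) time steps,
# which is exactly what the scaler above is fitted on.
print(X[nz])
# [[1. 2. 3.]
#  [4. 5. 6.]]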
Say I have a tensor A = [a1, a2, ...]. I want to repeat the elements of this tensor to form a new tensor, where the number of repetitions of each element is given by another tensor B. For example, if B = [1, 3, 2, 2, ...], the result should be [a1, a2, a2, a2, a3, a3, a4, a4, ...]. Is there an efficient way to do this in TensorFlow without using a loop?
This is copied from this issue. I am walking through it because it is interesting and it avoids loops; I don't know whether it is efficient or not.
If we assume that we want to repeat [1, 2, 3] with [3, 4, 5] repetitions, the gist of it is that they create a sparse tensor like this:
[1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0.]
and then fill it up cleverly using tf.cumsum:
[1. 1. 1. 2. 2. 2. 2. 3. 3. 3. 3. 3.]
I am interpreting the procedure as much as possible here.
tf.cumsum([3, 4, 5]) gives [ 3 7 12].
tf.cumsum([3, 4, 5][:-1]) (the cumulative sum with the last repeat dropped) gives [3 7].
tf.concat([tf.constant([0], dtype=tf.int32), tf.cumsum([3,4,5][:-1])], axis=0) gives [0 3 7],
which are the indices where we place a 1:
[1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0.]
And the 1s to place there are obtained using [1,2,3] - tf.concat([tf.constant([0], dtype=tf.float64), [1,2,3][:-1]], axis=0),
which is equivalent to the subtraction [1, 2, 3] - [0, 1, 2]. This gives [1 1 1] as the values at the sparse indices.
The output_shape of 12 is the total number of slots required in the output, which is the reduced sum of our repeat tensor [3, 4, 5].
Taking tf.cumsum of the resulting dense vector gives the final output:
print(sess.run(
    tf.cumsum(
        tf.sparse_to_dense(
            sparse_indices=[0, 3, 7],
            output_shape=(12,),
            sparse_values=[1, 1, 1]))))
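For intuition, here is a minimal NumPy sketch of the same scatter-then-cumsum trick (not the original TensorFlow code, just an illustration of the idea):
import numpy as np
values = np.array([1, 2, 3])
repeats = np.array([3, 4, 5])
# Positions in the output where each value starts: [0 3 7]
starts = np.concatenate(([0], np.cumsum(repeats)[:-1]))
# Jumps between consecutive values: [1 1 1]
jumps = values - np.concatenate(([0], values[:-1]))
# Scatter the jumps into a zero vector of length sum(repeats), then take cumsum.
out = np.zeros(repeats.sum(), dtype=values.dtype)
out[starts] = jumps
print(np.cumsum(out))  # [1 1 1 2 2 2 2 3 3 3 3 3]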
I've been looking for a way to (efficiently) compute a distance matrix from a target value and an input matrix.
If you consider an input array as:
[0 0 1 2 5 2 1]
[0 0 2 3 5 2 1]
[0 1 1 2 5 4 1]
[1 1 1 2 5 4 0]
How do you compute the spatial distance matrix associated with the target value 0?
i.e. what is the distance from each pixel to the closest 0 value?
Thanks in advance
You are looking for scipy.ndimage.morphology.distance_transform_edt. It operates on a binary array and computes the Euclidean distance from each True position to the nearest False (background) position. In our case we want distances from the nearest 0s, so the background is 0. Under the hood it converts the input to a binary array treating 0 as the background, so we can just use it with the default parameters. Hence, it is as simple as:
In [179]: a
Out[179]:
array([[0, 0, 1, 2, 5, 2, 1],
[0, 0, 2, 3, 5, 2, 1],
[0, 1, 1, 2, 5, 4, 1],
[1, 1, 1, 2, 5, 4, 0]])
In [180]: from scipy import ndimage
In [181]: ndimage.distance_transform_edt(a)
Out[181]:
array([[0. , 0. , 1. , 2. , 3. , 3.16, 3. ],
[0. , 0. , 1. , 2. , 2.83, 2.24, 2. ],
[0. , 1. , 1.41, 2.24, 2.24, 1.41, 1. ],
[1. , 1.41, 2.24, 2.83, 2. , 1. , 0. ]])
Solving the generic case
Now, let's say we want distances from the nearest 1s instead; then it would be:
In [183]: background = 1 # element from which distances are to be computed
# compare this with original array, a to verify
In [184]: ndimage.distance_transform_edt(a!=background)
Out[184]:
array([[2. , 1. , 0. , 1. , 2. , 1. , 0. ],
[1.41, 1. , 1. , 1.41, 2. , 1. , 0. ],
[1. , 0. , 0. , 1. , 2. , 1. , 0. ],
[0. , 0. , 0. , 1. , 2. , 1.41, 1. ]])
I've been trying to create a watershed algorithm, and as all the examples seem to be in Python, I've run into a bit of a wall. I've been trying to find in the numpy documentation what this line means:
matrixVariable[A==255] = 0
but have had no luck. Could anyone explain what that operation does?
For context, the line in action: label [lbl == -1] = 0
The expression A == 255 creates a boolean array which is True wherever the corresponding element of A equals 255 and False otherwise.
The assignment matrixVariable[A == 255] = 0 then sets every element of matrixVariable at a position where that boolean array is True to 0.
EG:
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.zeros([3, 3])
print('before:')
print(B)
B[A>5] = 5
print('after:')
print(B)
OUT:
before:
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]
after:
[[ 0. 0. 0.]
[ 0. 0. 5.]
[ 5. 5. 5.]]
I assumed that matrixVariable and A are numpy arrays. If that assumption is correct, the expression matrixVariable[A==255] = 0 first finds the indices of A where the values equal 255, then takes the entries of matrixVariable at those indices and sets them to 0.
Example:
import numpy as np
matrixVariable = np.array([(1, 3),
                           (2, 2),
                           (3, 1)])
A = np.array([255, 1, 255])
So A[0] and A[2] are equal to 255
matrixVariable[A == 255] = 0  # sets matrixVariable[0] and matrixVariable[2] to zero
print(matrixVariable)  # this prints:
[[0 0]
[2 2]
[0 0]]
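As a side note (not part of the original answers), the boolean-mask assignment is equivalent to indexing with the positions returned by np.where; a quick sketch:
import numpy as np
A = np.array([255, 1, 255])
M = np.array([(1, 3), (2, 2), (3, 1)])
mask = A == 255              # boolean mask: [ True False  True]
idx = np.where(A == 255)[0]  # the same positions as integer indices: [0 2]
M_masked = M.copy()
M_masked[mask] = 0
M_indexed = M.copy()
M_indexed[idx] = 0
print(np.array_equal(M_masked, M_indexed))  # True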