I am normalizing my datasets, but the data contains a lot of 0s because of padding.
I can mask them during model training, but these zeros are still affected when I apply normalization.
I am currently using the scikit-learn scalers to do the normalization:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
For example, given a 3D array with shape (4, 3, 5) as (batch, step, features).
The amount of zero-padding varies from batch to batch, because these are features extracted with a fixed window size from audio files of varying lengths:
[[[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],
 [[1, 2, 3, 4, 5],
  [4, 5, 6, 7, 8],
  [9, 10, 11, 12, 13]],
 [[14, 15, 16, 17, 18],
  [0, 0, 0, 0, 0],
  [24, 25, 26, 27, 28]],
 [[0, 0, 0, 0, 0],
  [423, 2, 230, 60, 70],
  [0, 0, 0, 0, 0]]]
I wish to perform the normalization by column (per feature), so:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train.reshape(-1,X_train.shape[-1])).reshape(X_train.shape)
X_test = scaler.transform(X_test.reshape(-1,X_test.shape[-1])).reshape(X_test.shape)
However, in this case the zeros are treated as real values. For example, the minimum of the first column should be 1 instead of 0.
Further, the zero values are also changed after applying the scaler, but I wish to keep them as 0 so I can mask them during training with:
model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
Is there any way to mask them during normalization so only the 2nd step and 3rd step in this example are used in normalization?
In addition, the actual array for my project is much bigger, with shape (2000, 50, 68), and the values of the 68 features can differ greatly in scale. I tried normalizing by dividing each element by the largest element in its row to avoid the impact of the 0s, but this did not work out well.
Masking for just MinMaxScaler() can be solved with the code below.
Every other operation needs its own way of handling; if you mention all the operations that need masking, we can solve them one by one and I'll extend my answer. E.g. Keras layers can be masked with the tf.keras.layers.Masking() layer, as you mentioned.
The code below min/max-scales only the non-zero steps; the all-zero (padded) steps remain zeros.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([
    [[0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]],
    [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [9, 10, 11, 12, 13]],
    [[14, 15, 16, 17, 18],
     [0, 0, 0, 0, 0],
     [24, 25, 26, 27, 28]],
    [[0, 0, 0, 0, 0],
     [423, 2, 230, 60, 70],
     [0, 0, 0, 0, 0]]
], dtype=np.float64)
nz = np.any(X, -1)  # (batch, step) boolean mask: True for steps that are not all-zero padding
X[nz] = MinMaxScaler().fit_transform(X[nz])  # scale only the non-padded steps, column by column
print(X)
Output:
[[[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]]
[[0. 0. 0. 0. 0. ]
[0.007109 0.13043478 0.01321586 0.05357143 0.04615385]
[0.01895735 0.34782609 0.03524229 0.14285714 0.12307692]]
[[0.03080569 0.56521739 0.05726872 0.23214286 0.2 ]
[0. 0. 0. 0. 0. ]
[0.05450237 1. 0.10132159 0.41071429 0.35384615]]
[[0. 0. 0. 0. 0. ]
[1. 0. 1. 1. 1. ]
[0. 0. 0. 0. 0. ]]]
If you need to fit MinMaxScaler() on one dataset and apply it later to others, you can do the following:
scaler = MinMaxScaler().fit(X[np.any(X, -1)])
X[np.any(X, -1)] = scaler.transform(X[np.any(X, -1)])
Y[np.any(Y, -1)] = scaler.transform(Y[np.any(Y, -1)])
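Tying this back to the shapes in the question, here is a minimal end-to-end sketch, assuming X_train and X_test are float arrays of shape (2000, 50, 68) whose padded steps are all-zero rows (the model definition is only illustrative):
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

# Boolean masks of shape (batch, step): True where a step carries real data.
train_steps = np.any(X_train, axis=-1)
test_steps = np.any(X_test, axis=-1)

# Fit on the real training steps only, then transform both sets in place.
scaler = MinMaxScaler().fit(X_train[train_steps])
X_train[train_steps] = scaler.transform(X_train[train_steps])
X_test[test_steps] = scaler.transform(X_test[test_steps])

# Padded steps are still exactly 0.0, so the Masking layer from the question still applies.
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0,
                            input_shape=(X_train.shape[1], X_train.shape[2])),
    # ... rest of the model
])
One caveat: a real step would only collide with the mask if every one of its 68 scaled features happened to be exactly 0.0 (i.e. each equal to its column minimum), which is unlikely but worth keeping in mind.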
I am a complete beginner with NumPy and I am trying to generate the following matrix pattern. Below is my code. What I cannot figure out is what I am doing wrong to get this result. Thanks in advance for any help.
import numpy as np

def matrix(n):
    final = []
    for i in range(n):
        final.append(list(np.tile([0, 1], int(n/2))) if i % 2 == 0 else list(np.tile([1, 0], int(n/2))))
    print(np.array(final))

size = 8
matrix(size)
When using NumPy you should avoid building and editing matrices element by element with Python lists and for loops, because for large matrices this is very slow.
Try to examine this code:
import math
import numpy as np

def zero_borders(mat: np.ndarray) -> None:
    """Makes the borders of the array zero."""
    mat[:, 0] = 0   # left border
    mat[:, -1] = 0  # right border
    mat[0, :] = 0   # upper border
    mat[-1, :] = 0  # bottom border

def zero_center_square(mat: np.ndarray) -> None:
    """Makes a small square of zeros in the center of the array."""
    size = mat.shape[0]
    i_low = size // 2 - 1
    i_high = math.ceil(size / 2)
    mat[i_low, i_low:i_high + 1] = 0   # upper edge of the square
    mat[i_high, i_low:i_high + 1] = 0  # bottom edge of the square
    mat[i_low:i_high + 1, i_low] = 0   # left edge of the square
    mat[i_low:i_high + 1, i_high] = 0  # right edge of the square

def matrix(n: int) -> np.ndarray:
    """Creates a square matrix with the special pattern."""
    mat = np.ones((n, n))
    zero_borders(mat)
    zero_center_square(mat)
    return mat

def main():
    print("Even size:")
    print(matrix(8))
    print("")
    print("Odd size:")
    print(matrix(9))

if __name__ == "__main__":
    main()
The output:
Even size:
[[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 1. 1. 1. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 0.]
[0. 1. 1. 0. 0. 1. 1. 0.]
[0. 1. 1. 0. 0. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]]
Odd size:
[[0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 0.]
[0. 1. 1. 0. 0. 0. 1. 1. 0.]
[0. 1. 1. 0. 1. 0. 1. 1. 0.]
[0. 1. 1. 0. 0. 0. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0.]]
You can use numpy's ix_(), which builds an open mesh from the index lists so that x[p1] addresses every (row, column) combination of them, like this:
>>> x = np.zeros((9,9), dtype=int)
>>> p1 = np.ix_([1,2,6,7],[1,2,3,4,5,6,7])
>>> x[p1]=1
>>> p2 = np.ix_([3,4,5],[1,2,6,7])
>>> x[p2]=1
>>> x
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 0, 0, 0, 1, 1, 0],
[0, 1, 1, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 0, 0, 0, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])
You have not mentioned any particular pattern for an l×l matrix, so I will just write code to generate the matrix in the given image.
You can use NumPy (particularly numpy.pad()) to create that matrix easily as:
import numpy as np
# Create required matrix
matrix = np.pad(np.pad(np.pad(np.array([[1]]), (1, 1)), (2, 2), constant_values = 1), (1, 1))
# If you want that as list instead of NumPy array
list_matrix = list(list(i) for i in matrix)
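If you need the same pattern for other (odd) sizes, the pad widths can be computed from n. Below is a small sketch generalizing the nested call above; ring_matrix is a hypothetical helper name and the ring widths are inferred from the 9x9 example, so treat it as an assumption:
import numpy as np

def ring_matrix(n: int) -> np.ndarray:
    # Assumed pattern for odd n >= 7: a single 1 in the centre, a ring of zeros,
    # a band of ones, and a zero border (matches the 9x9 matrix above for n = 9).
    ones_width = (n - 5) // 2
    m = np.pad(np.array([[1]]), 1)                 # 3x3: centre 1 surrounded by zeros
    m = np.pad(m, ones_width, constant_values=1)   # band of ones
    return np.pad(m, 1)                            # outer zero border

print(ring_matrix(9))  # reproduces the 9x9 matrix shown in the answers above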
I am new to NumPy, and I am having trouble with simple management of NumPy arrays.
I am doing a task in which loops have to be avoided as much as possible, and I need to edit the values of an array through another array of indexes.
indexes # [3, 16]
y # [0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
y[indexes] = 2 # [0. 1. 1. 2. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 2. 0. 1. 1.]
But I don't need to simply set the value to 2; I need to make a conditional change. This is what I have got, but I would need something like:
y[indexes] = 1 if y[indexes] == 0 else 0
>>> [0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1.]
The line above shows the result I need.
This is the loop-based answer, but I need a NumPy way if one exists:
for index in indexes:
y[index] = 1 if y[index] == 0 else 0
Thanks in advance.
I don't know if I understood your question, but I hope this helps you.
tip 01
import numpy as np
indexes = [1, 5, 7] # index list
y = np.array([9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]) #array example
y[indexes][2]  # 3rd item (index 2) of y[indexes], i.e. the element of y at index 7
In this case it is y[7], which equals 16.
tip 02
This can also be useful.
y = np.array([0,1,1,0,3,0,1,0,1,0])
y
array([0, 1, 1, 0, 3, 0, 1, 0, 1, 0])
y = np.where(y != 1, y, 0)  # np.where returns a new array, so assign it back to y
y
array([0, 0, 0, 0, 3, 0, 0, 0, 0, 0])
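For the exact operation in the question, i.e. flipping 0 and 1 only at the given indexes without a loop, here is a small sketch that combines fancy indexing with np.where (assuming the selected positions hold only 0s and 1s):
import numpy as np

indexes = np.array([3, 16])
y = np.array([0., 1., 1., 1., 0., 1., 0., 0., 0., 0.,
              1., 1., 1., 0., 1., 0., 0., 0., 1., 1.])

# Toggle 0 <-> 1 only at the selected positions; everything else stays untouched.
y[indexes] = np.where(y[indexes] == 0, 1, 0)
print(y)
# [0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1.]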
Sorry for the long post.
I'm using Python 3.6 on Windows 10. I have a pandas data frame that contains around 100,000 rows. From this data frame I need to generate four NumPy arrays. The first 5 relevant rows of my data frame look like this:
A B x UB1 LB1 UB2 LB2
0.2134 0.7866 0.2237 0.1567 0.0133 1.0499 0.127
0.24735 0.75265 0.0881 0.5905 0.422 1.4715 0.5185
0.0125 0.9875 0.1501 1.3721 0.5007 2.0866 2.0617
0.8365 0.1635 0.0948 1.9463 1.0854 2.4655 1.9644
0.1234 0.8766 0.0415 2.7903 2.2602 3.5192 3.2828
Column B is (1 - column A). Column B is not actually in my data frame; I have added it here to explain my problem.
From this data frame I need to generate the following arrays.
My array c looks like
array([-0.2134, -0.7866, -0.24735, -0.75265, -0.0125, -0.9875, -0.8365, -0.1635, -0.1234, -0.8766], dtype=float32)
The first element is the first row of column A with a negative sign added; similarly, the 2nd element is taken from the 1st row of column B, the third element from the 2nd row of column A, the fourth element from the 2nd row of column B, and so on.
My second array, UB, looks like
array([0.2237, 0.0881, 0.1501, 0.0948, 0.0415], dtype=float32)
where the elements are the rows of column x.
My third array,bounds, looks like
array([[0.0133 , 0.1567],
[0.127 , 1.0499],
[0.422 , 0.5905],
[0.5185 , 1.4715],
[0.5007 , 1.3721],
[2.0617 , 2.0866],
[1.0854 , 1.9463],
[1.9644 , 2.4655],
[2.2602 , 2.7903],
[3.2828 , 3.5192]])
Here bounds[0][0] is the first row of LB1 and bounds[0][1] is the first row of UB1; bounds[1][0] is the first row of LB2 and bounds[1][1] is the first row of UB2. Then bounds[2][0] is the 2nd row of LB1, and so on.
My fourth array looks like
array([[-1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, -1, 1, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, -1, 1, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, -1, 1, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, 1]])
It has the same number of rows as the data frame and twice as many columns as rows.
Can you please tell me what the efficient way to generate these arrays is for 100,000 rows of records?
This should be rather straightforward:
from io import StringIO
import pandas as pd
import numpy as np
data = """A B x UB1 LB1 UB2 LB2
0.2134 0.7866 0.2237 0.1567 0.0133 1.0499 0.127
0.24735 0.75265 0.0881 0.5905 0.422 1.4715 0.5185
0.0125 0.9875 0.1501 1.3721 0.5007 2.0866 2.0617
0.8365 0.1635 0.0948 1.9463 1.0854 2.4655 1.9644
0.1234 0.8766 0.0415 2.7903 2.2602 3.5192 3.2828"""
df = pd.read_csv(StringIO(data), sep='\\s+', header=0)
c = -np.stack([df['A'], 1 - df['A']], axis=1).ravel()
print(c)
# [-0.2134 -0.7866 -0.24735 -0.75265 -0.0125 -0.9875 -0.8365 -0.1635
# -0.1234 -0.8766 ]
ub = df['x'].values
print(ub)
# [0.2237 0.0881 0.1501 0.0948 0.0415]
bounds = np.stack([df['LB1'], df['UB1'], df['LB2'], df['UB2']], axis=1).reshape((-1, 2))
print(bounds)
# [[0.0133 0.1567]
# [0.127 1.0499]
# [0.422 0.5905]
# [0.5185 1.4715]
# [0.5007 1.3721]
# [2.0617 2.0866]
# [1.0854 1.9463]
# [1.9644 2.4655]
# [2.2602 2.7903]
# [3.2828 3.5192]]
n = len(df)
fourth = np.zeros((n, 2 * n))
idx = np.arange(n)
fourth[idx, 2 * idx] = -1      # -1 in column 2*i of row i
fourth[idx, 2 * idx + 1] = 1   # +1 in column 2*i + 1 of row i
print(fourth)
# [[-1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
# [ 0. 0. -1. 1. 0. 0. 0. 0. 0. 0.]
# [ 0. 0. 0. 0. -1. 1. 0. 0. 0. 0.]
# [ 0. 0. 0. 0. 0. 0. -1. 1. 0. 0.]
# [ 0. 0. 0. 0. 0. 0. 0. 0. -1. 1.]]
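One caveat that goes beyond the answer above: for 100,000 rows the dense fourth array would be 100,000 x 200,000 float64 values, roughly 160 GB, so if whatever consumes it can accept sparse matrices, a SciPy sparse construction may be the safer choice. A sketch, assuming SciPy is available:
import numpy as np
from scipy import sparse

n = 100_000
rows = np.repeat(np.arange(n), 2)     # row i appears twice ...
cols = np.arange(2 * n)               # ... in columns 2*i and 2*i + 1
vals = np.tile([-1.0, 1.0], n)        # with values -1 and +1
fourth_sparse = sparse.coo_matrix((vals, (rows, cols)), shape=(n, 2 * n)).tocsr()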
I am new to machine learning and scikit-learn. I was going through the documentation and tried OneHotEncoder() with some sample data. Can someone please explain what is happening with encoder.feature_indices_ and how I get the output Encoded_Vector = [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]? Any help is appreciated. Thanks!
>>> from sklearn import preprocessing
>>> encoder = preprocessing.OneHotEncoder()
>>> encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4,3]])
OneHotEncoder(categorical_features='all', dtype=<type 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> encoder.n_values_
array([ 3, 4, 6, 13])
>>> encoder.feature_indices_
array([ 0, 3, 7, 13, 26])
>>> vector_encoded = encoder.transform([[2,3,5,3]]).toarray()
>>> print "\nEncoded_Vector =",vector_encoded
Encoded_Vector = [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
>>>
My understanding so far is
Input
0 2 1 12
1 3 5 3
2 3 2 12
1 2 4 3
This is 4 columns/features and 4 rows. Each column has a different number of unique values. If I run:
enc.n_values_
It gives: array([ 3, 4, 6, 13])
So categories for each feature are:
feature 1 can take 3 values : 0 1 2
feature 2 can take 4 values : 0 1 2 3
feature 3 can take 6 values : 0 1 2 3 4 5
feature 4 can take 13 values : 0 1 2 3 4 5 6 7 8 9 10 11 12
Even though n_values_ says that your features can take 3, 4, 6 and 13 values respectively, the training data you provided ([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]) does not cover that complete range of values, and with n_values='auto' only the values actually seen during fit become active (encoded) features.
Your training data basically says that:
feature 1 can take 3 values (0,1,2)
feature 2 can take 2 values (2,3)
feature 3 can take 4 values (1,2,4,5)
feature 4 can take 2 values (3,12)
This ends up with a total of 11 observed values. Thus, the output of the OneHotEncoder ([[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]) has 11 entries and can be split into 4 sections:
[0. 0. 1.] is the encoding for feature 1
[0. 1.] is the encoding for feature 2
[0. 0. 0. 1.] is the encoding for feature 3
[1. 0.] is the encoding for feature 4
The position of the 1. in each section tells you the value of the corresponding feature (try to match the example before and after encoding).
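As a side note, n_values_ and feature_indices_ come from an old scikit-learn release and have since been removed. Assuming a recent scikit-learn, the same behaviour (only the category values seen during fit are encoded) can be checked through categories_; a sketch:
from sklearn.preprocessing import OneHotEncoder

X = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
encoder.fit(X)

print(encoder.categories_)  # learned per-feature categories: [0 1 2], [2 3], [1 2 4 5], [3 12]
print(encoder.transform([[2, 3, 5, 3]]))
# [[0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]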