I am normalizing datasets, but the data contains a lot of zeros because of padding.
I can mask them during model training, but these zeros are affected when I apply normalization.
I am currently using the scikit-learn library to do the normalization:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
For example, consider a 3D array with shape (4,3,5), laid out as (batch, step, features).
The amount of zero-padding varies from batch to batch, because these are features I extracted with a fixed window size from audio files of varying lengths.
[[[0 0 0 0 0],
[0 0 0 0 0],
[0 0 0 0 0]],
[[1 2 3 4 5],
[4 5 6 7 8],
[9 10 11 12 13]],
[[14 15 16 17 18],
[0 0 0 0 0],
[24 25 26 27 28]],
[[0 0 0 0 0],
[423 2 230 60 70],
[0 0 0 0 0]]
]
I wish to perform normalization by column, so I do:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train.reshape(-1,X_train.shape[-1])).reshape(X_train.shape)
X_test = scaler.transform(X_test.reshape(-1,X_test.shape[-1])).reshape(X_test.shape)
However, in this case the zeros are treated as real values. For example, the minimum value of the first column should be 1 instead of 0.
Furthermore, the zeros are changed by the scaler, but I wish to keep them as 0 so I can mask them during training:
model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
Is there any way to mask them during normalization, so that only the non-padded steps in this example are used in the normalization?
In addition, the actual array in my project is larger, with shape (2000,50,68).
The 68 features can differ greatly in scale. I tried to normalize them by dividing each element by the largest element in its row to avoid the impact of the zeros, but this did not work well.
Masking for MinMaxScaler() alone can be done with the code below.
Every other operation needs its own way of handling masking. If you list all the operations that need masking, we can solve them one by one and I'll extend my answer. E.g. Keras layers can be masked with the tf.keras.layers.Masking() layer, as you mentioned.
The code below min/max-scales only the rows (time steps) that contain at least one non-zero feature; all-zero rows remain zeros.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[1, 2, 3, 4, 5],
[4, 5, 6, 7, 8],
[9, 10, 11, 12, 13]],
[[14, 15, 16, 17, 18],
[0, 0, 0, 0, 0],
[24, 25, 26, 27, 28]],
[[0, 0, 0, 0, 0],
[423, 2, 230, 60, 70],
[0, 0, 0, 0, 0]]
], dtype = np.float64)
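# Boolean mask of shape (batch, step): True where a step has any non-zero feature
nz = np.any(X, -1)
# Fit and apply the scaler on the non-padded steps only; padded steps stay 0
X[nz] = MinMaxScaler().fit_transform(X[nz])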
print(X)
Output:
[[[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]]
[[0. 0. 0. 0. 0. ]
[0.007109 0.13043478 0.01321586 0.05357143 0.04615385]
[0.01895735 0.34782609 0.03524229 0.14285714 0.12307692]]
[[0.03080569 0.56521739 0.05726872 0.23214286 0.2 ]
[0. 0. 0. 0. 0. ]
[0.05450237 1. 0.10132159 0.41071429 0.35384615]]
[[0. 0. 0. 0. 0. ]
[1. 0. 1. 1. 1. ]
[0. 0. 0. 0. 0. ]]]
If you need to fit MinMaxScaler() on one dataset and apply it later to others (here Y stands for another array with the same layout as X), you can do the following:
scaler = MinMaxScaler().fit(X[np.any(X, -1)])
X[np.any(X, -1)] = scaler.transform(X[np.any(X, -1)])
Y[np.any(Y, -1)] = scaler.transform(Y[np.any(Y, -1)])
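As a hedged sketch of how the two snippets above could be wrapped for reuse: the helper below (the name scale_nonpadded and the toy arrays are made up for illustration, they are not part of scikit-learn) fits on the non-padded steps of the training set only and reuses the fitted scaler on the test set. It also returns the boolean mask, because a real step whose features all equal the column minima is scaled to all zeros (as happens to [1, 2, 3, 4, 5] in the output above) and would then look like padding.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def scale_nonpadded(X, scaler, fit=False):
    """Scale only the non-padded steps of a (batch, step, features) array.

    Hypothetical helper: a step counts as padding when all its features are 0.
    """
    X = np.asarray(X, dtype=np.float64)
    mask = np.any(X != 0, axis=-1)   # (batch, step), True for real steps
    out = np.zeros_like(X)           # padded steps stay exactly 0
    if fit:
        scaler.fit(X[mask])          # statistics come from real steps only
    out[mask] = scaler.transform(X[mask])
    return out, mask


# Toy data for illustration: (batch, step, features) = (3, 2, 3) and (1, 2, 3)
X_train = np.array([
    [[0, 0, 0], [1, 2, 3]],
    [[4, 8, 6], [0, 0, 0]],
    [[7, 5, 9], [10, 11, 12]],
], dtype=np.float64)
X_test = np.array([[[0, 0, 0], [4, 5, 6]]], dtype=np.float64)

scaler = MinMaxScaler()
X_train_scaled, train_mask = scale_nonpadded(X_train, scaler, fit=True)
X_test_scaled, test_mask = scale_nonpadded(X_test, scaler)

print(X_train_scaled)
print(X_test_scaled)
```

Keeping the mask around means the padding positions can still be identified after scaling, even if a real step happens to scale to all zeros.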
Related
I have a numpy array of shape (100, 100, 20) (in python 3)
I want to find for each 'pixel' the 15 channels with minimum values, and make them zeros (meaning: make the array sparse, keep only the 5 highest values).
Example:
input: array = [[1,2,3], [7,6,9], [12,71,3]], num_channels_to_zero = 2
output: [[0,0,3], [0,0,9], [0,71,0]]
How can I do it?
What I have so far:
array = numpy.random.rand(100, 100, 20)
inds = numpy.argsort(array, axis=-1) # also shape (100, 100, 20)
I want to do something like
array[..., inds[..., :15]] = 0
but it doesn't give me what I want
np.argsort outputs indices suitable for the [...]_along_axis functions of numpy. This includes np.put_along_axis:
import numpy as np
array = np.random.rand(100, 100, 20)
print(array[0,0])
#[0.44116124 0.94656705 0.20833932 0.29239585 0.33001399 0.82396784
# 0.35841905 0.20670957 0.41473762 0.01568006 0.1435386 0.75231818
# 0.5532527 0.69366173 0.17247832 0.28939985 0.95098187 0.63648877
# 0.90629116 0.35841627]
inds = np.argsort(array, axis=-1)
np.put_along_axis(array, inds[..., :15], 0, axis=-1)
print(array[0,0])
#[0. 0.94656705 0. 0. 0. 0.82396784
# 0. 0. 0. 0. 0. 0.75231818
# 0. 0. 0. 0. 0.95098187 0.
# 0.90629116 0. ]
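To check the approach against the small example from the question, here is a minimal sketch (the variable names are illustrative). It also hints at why array[..., inds[..., :15]] = 0 misbehaves: that expression applies every index in inds to the last axis of every pixel, instead of pairing each pixel with its own indices.

```python
import numpy as np

array = np.array([[1, 2, 3], [7, 6, 9], [12, 71, 3]])
num_channels_to_zero = 2

# Indices that would sort each row in ascending order
inds = np.argsort(array, axis=-1)
# Zero the smallest entries of each row, in place
np.put_along_axis(array, inds[..., :num_channels_to_zero], 0, axis=-1)

print(array)
# [[ 0  0  3]
#  [ 0  0  9]
#  [ 0 71  0]]
```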
As mentioned in the NumPy documentation:
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> x[[0, 1, 2], [0, 1, 0]]
array([1, 4, 5])
So, for your example, the following zeroes the maximum of each row:
a = np.array([[1,2,3], [7,6,9], [12,71,3]])
amax = a.argmax(axis=-1)
a[np.arange(a.shape[0]), amax] = 0
a
array([[ 1, 2, 0],
[ 7, 6, 0],
[12, 0, 3]])
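The same advanced-indexing idea can be extended from the single maximum to the k smallest values per row, which is what the question asks for. This is only a sketch; k, rows and cols are names chosen here for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3], [7, 6, 9], [12, 71, 3]])
k = 2  # number of smallest values to zero in each row

# Column indices of the k smallest values in each row, shape (3, k)
cols = np.argsort(a, axis=-1)[:, :k]
# Row indices as a column vector, shape (3, 1), so they broadcast against cols
rows = np.arange(a.shape[0])[:, None]
a[rows, cols] = 0

print(a)
# [[ 0  0  3]
#  [ 0  0  9]
#  [ 0 71  0]]
```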
I am trying to transform my dictionary with StandardScaler(), but it gives me only zeros.
How to fix it?
from sklearn.preprocessing import StandardScaler
import pandas as pd
param ={
"user_id": 22058,
"signup_day": 24,
"signup_month": 2,
"signup_year": 2015,
"purchase_day": 18,
"purchase_month": 4,
"purchase_year": 2015,
"purchase_value": 34,
"age": 39,
"source_Ads": 0,
"source_Direct": 0,
"source_SEO": 1,
"browser_Chrome": 1,
"browser_FireFox": 0,
"browser_IE": 0,
"browser_Opera": 0,
"browser_Safari": 0,
"sex_F": 0,
"sex_M": 1
}
new = (pd.Series(param, index=['user_id', 'signup_day', 'signup_month', 'signup_year', 'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value', 'age', 'source_Ads', 'source_Direct', 'source_SEO', 'browser_Chrome', 'browser_FireFox','browser_IE', 'browser_Opera', 'browser_Safari', 'sex_F', 'sex_M'])).values.reshape(1,-1)
print(new)
scaler = StandardScaler()
X_new = scaler.fit_transform(new)
print(X_new)
Results:
new = [[22058 24 2 2015 18 4 2015 34 39 0 0 1 1 0 0 0 0 0 1]]
X_new =[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
StandardScaler standardizes each column using that column's mean and standard deviation. Here you have only one sample, so each column's mean equals its single value and every entry becomes 0 once the mean is subtracted. Fit it on data with multiple rows and you should get the expected result!
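A minimal sketch of the usual workflow, assuming you have a training set with several rows: fit the scaler on that set, then transform the single new sample with the fitted scaler. The two-column arrays below are made-up illustration values, not the data from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 3 samples, 2 features
X_train = np.array([[1.0, 10.0],
                    [3.0, 30.0],
                    [5.0, 50.0]])
new = np.array([[3.0, 50.0]])  # a single new sample

# Fitting on one row: each column's mean equals its only value, so x - mean = 0
single = StandardScaler().fit_transform(new)
print(single)  # all zeros

# Fitting on the training set, then transforming the new sample
scaler = StandardScaler().fit(X_train)
X_new = scaler.transform(new)
print(X_new)   # non-zero wherever the sample differs from the training mean
```

In practice the scaler fitted on the training data would be stored and reused for every new sample.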
I have a 3-d matrix as shown below and would like to keep the max value along axis 1 and set all non-max values to zero.
A = np.random.rand(3,3,2)
[[[0.34444547, 0.50260393],
[0.93374423, 0.39021899],
[0.94485653, 0.9264881 ]],
[[0.95446736, 0.335068 ],
[0.35971558, 0.11732342],
[0.72065402, 0.36436023]],
[[0.56911013, 0.04456443],
[0.17239996, 0.96278067],
[0.26004909, 0.06767436]]]
Desired result:
[[[0 , 0 ],
[0 , 0 ],
[0.94485653, 0.9264881]],
[[0.95446736, 0 ],
[0 , 0 ],
[0 , 0.36436023]],
[[0.56911013, 0 ],
[0 , 0.96278067],
[0 , 0 ]]]
I have tried:
B = np.zeros_like(A) #return matrix of zero with same shape as A
max_idx = np.argmax(A, axis=1) #index along axis 1 with max value
array([[2, 2],
[0, 2],
[0, 1]])
C = np.max(A, axis=1, keepdims = True) #gives a (3,1,2) matrix of max value along axis 1
array([[[0.94485653, 0.9264881 ]],
[[0.95446736, 0.36436023]],
[[0.56911013, 0.96278067]]])
But I can't figure out how to combine these ideas together to get my desired output. Please help!!
You can get the three-dimensional indices of your max values from max_idx. The values in max_idx are the indices along axis 1 of the max values. There are six values because the other two axes have lengths 3 and 2 (3 x 2 = 6). You just need to know the order in which NumPy flattens them to build the matching index for each of the other axes: the last axis varies fastest.
d0, d1, d2 = A.shape
a0 = [i for i in range(d0) for _ in range(d2)] # [0, 0, 1, 1, 2, 2]
a1 = max_idx.flatten() # [2, 2, 0, 2, 0, 1]
a2 = [k for _ in range(d0) for k in range(d2)] # [0, 1, 0, 1, 0, 1]
B[a0, a1, a2] = A[a0, a1, a2]
Output:
array([[[0. , 0. ],
[0. , 0. ],
[0.94485653, 0.9264881 ]],
[[0.95446736, 0. ],
[0. , 0. ],
[0. , 0.36436023]],
[[0.56911013, 0. ],
[0. , 0.96278067],
[0. , 0. ]]])
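An alternative sketch that combines the two attempts from the question directly: compare A with its maximum along axis 1 (kept broadcastable with keepdims=True) and keep only the matching entries. Note the assumption that each maximum is unique; with ties, every tied entry would be kept.

```python
import numpy as np

A = np.array([
    [[0.34444547, 0.50260393],
     [0.93374423, 0.39021899],
     [0.94485653, 0.9264881]],
    [[0.95446736, 0.335068],
     [0.35971558, 0.11732342],
     [0.72065402, 0.36436023]],
    [[0.56911013, 0.04456443],
     [0.17239996, 0.96278067],
     [0.26004909, 0.06767436]],
])

# Maximum along axis 1, shape (3, 1, 2) so it broadcasts against A
C = np.max(A, axis=1, keepdims=True)
# Keep entries equal to their maximum, zero everything else
B = np.where(A == C, A, 0.0)

print(B)
```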
I am trying to interpolate a 2D numpy matrix with the dimensions (5, 3) to a matrix with the dimensions (7, 3), interpolating within each column (i.e. along axis 0). Obviously, the wrong approach would be to insert rows at arbitrary positions in the original matrix, see the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Target (terrible interpolation -> not wanted!):
[[0, 1, 1]
[0, 1.5, 0.5]
[0, 2, 0]
[0, 3, 1]
[0, 3.5, 0.5]
[0, 4, 0]
[0, 5, 1]]
The correct approach would be to take every row into account and interpolate between all of them to expand the source matrix to a (7, 3) matrix. I am aware of the scipy.interpolate.interp1d and scipy.interpolate.interp2d methods, but could not get them to work by following other Stack Overflow posts or websites. Any tips would be appreciated.
Update #1: The expected values should be equally spaced.
Update #2:
What I want to do is basically use the separate columns of the original matrix, expand the length of the column to 7 and interpolate between the values of the original column. See the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Split into 3 separate Columns:
[0 [1 [1
0 2 0
0 3 1
0 4 0
0] 5] 1]
Expand length to 7 and interpolate between them, example for second column:
[1
1.66
2.33
3
3.66
4.33
5]
It seems like each column can be treated completely independently, but for each column you need to define essentially an "x" coordinate so that you can fit some function "f(x)" from which you generate your output matrix.
Unless the rows in your matrix are associated with some other data structure (e.g. a vector of timestamps), an obvious set of x values is just the row number:
x = numpy.arange(0, Source.shape[0])
You can then construct an interpolating function:
fit = scipy.interpolate.interp1d(x, Source, axis=0)
and use that to construct your output matrix:
Target = fit(numpy.linspace(0, Source.shape[0]-1, 7))
which produces:
array([[ 0. , 1. , 1. ],
[ 0. , 1.66666667, 0.33333333],
[ 0. , 2.33333333, 0.33333333],
[ 0. , 3. , 1. ],
[ 0. , 3.66666667, 0.33333333],
[ 0. , 4.33333333, 0.33333333],
[ 0. , 5. , 1. ]])
By default, scipy.interpolate.interp1d uses piecewise-linear interpolation. There are many more exotic options within scipy.interpolate, based on higher-order polynomials, etc. Interpolation is a big topic in itself, and unless the rows of your matrix have some particular properties (e.g. being regular samples of a signal with a known frequency range), there may be no "truly correct" way of interpolating. So the choice of interpolation scheme will be somewhat arbitrary.
You can do this as follows:
from scipy.interpolate import interp1d
import numpy as np
a = np.array([[0, 1, 1],
[0, 2, 0],
[0, 3, 1],
[0, 4, 0],
[0, 5, 1]])
x = np.arange(a.shape[0])  # row numbers 0..4 serve as the x coordinate
# define new x range, we need 7 equally spaced values
xnew = np.linspace(x.min(), x.max(), 7)
# build one interpolator that handles every column at once (axis=0)
f = interp1d(x, a, axis=0)
# get final result
print(f(xnew))
This will print
[[ 0. 1. 1. ]
[ 0. 1.66666667 0.33333333]
[ 0. 2.33333333 0.33333333]
[ 0. 3. 1. ]
[ 0. 3.66666667 0.33333333]
[ 0. 4.33333333 0.33333333]
[ 0. 5. 1. ]]
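If SciPy is not available, a NumPy-only sketch gives the same piecewise-linear result by calling np.interp once per column (np.interp itself only handles 1-D data). The name target is chosen here for illustration:

```python
import numpy as np

a = np.array([[0, 1, 1],
              [0, 2, 0],
              [0, 3, 1],
              [0, 4, 0],
              [0, 5, 1]])

x = np.arange(a.shape[0])            # original row positions 0..4
xnew = np.linspace(x[0], x[-1], 7)   # 7 equally spaced positions in [0, 4]

# Interpolate each column separately, then stack the columns side by side
target = np.column_stack(
    [np.interp(xnew, x, a[:, j]) for j in range(a.shape[1])]
)

print(target.shape)  # (7, 3)
print(target)
```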
I've been trying to create a watershed algorithm and as all the examples seem to be in Python I've run into a bit of a wall. I've been trying to find in numpy documentation what this line means:
matrixVariable[A==255] = 0
but have had no luck. Could anyone explain what that operation does?
For context the line in action: label [lbl == -1] = 0
The expression A == 255 creates a boolean array which is True where the corresponding element of A equals 255 and False otherwise.
The expression matrixVariable[A==255] = 0 sets to 0 every element of matrixVariable at a position where A == 255 is True.
E.g.:
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.zeros([3, 3])
print('before:')
print(B)
B[A>5] = 5
print('after:')
print(B)
OUT:
before:
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]
after:
[[ 0. 0. 0.]
[ 0. 0. 5.]
[ 5. 5. 5.]]
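Applied to the line from the question, a small sketch could look like this. The contents of lbl are invented for illustration (in OpenCV's watershed output, -1 marks boundary pixels), and label is simply a copy of it:

```python
import numpy as np

# Hypothetical label image: two regions (1 and 2) separated by -1 boundaries
lbl = np.array([[ 1,  1, -1],
                [ 1, -1,  2],
                [-1,  2,  2]])
label = lbl.copy()

mask = lbl == -1      # boolean array with the same shape as lbl
print(mask)

label[mask] = 0       # equivalent to: label[lbl == -1] = 0
print(label)
# [[1 1 0]
#  [1 0 2]
#  [0 2 2]]
```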
I assume that matrixVariable and A are NumPy arrays. If so, the expression "matrixVariable[A==255] = 0" first finds the indices of A where the value equals 255, then sets the entries of matrixVariable at those indices to 0.
Example:
import numpy as np
matrixVariable = np.array([(1, 3),
(2, 2),
(3,1)])
A = np.array([255, 1,255])
# A[0] and A[2] are equal to 255
matrixVariable[A==255]=0 # sets the rows matrixVariable[0] and matrixVariable[2] to zero
print(matrixVariable) # this would print
[[0 0]
[2 2]
[0 0]]