How to vectorize one hot encoding loop in numpy - python

Is there a way to vectorize the loop in this code?
import numpy as np

def get_onehot(y):
    categories = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    arr = np.zeros((y.shape[0], len(categories)))
    for i in range(y.shape[0]):
        n = y[i]
        arr[i][n] = 1
    return arr

>>> get_onehot(np.array([0, 2, 5]))
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])
I assume this is not the most efficient way of doing it, and I am wondering whether there is a more efficient, vectorized alternative.

If you don't want to use scikit-learn, here is a NumPy way:
import numpy as np

def get_onehot(y, n=10):
    return np.eye(n)[y]

get_onehot(np.array([0, 2, 5]))
# array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])
See the np.eye documentation for details.
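If you prefer not to build an identity matrix, an equivalent vectorized sketch (assuming y holds integer labels in range(n)) fills the hot positions with a single fancy-indexing assignment:
import numpy as np

def get_onehot(y, n=10):
    # One row per label; set column y[i] of row i to 1 in a single
    # vectorized assignment (assumes 0 <= y[i] < n).
    arr = np.zeros((y.shape[0], n))
    arr[np.arange(y.shape[0]), y] = 1
    return arr

get_onehot(np.array([0, 2, 5]))
# array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])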

Related

Is there a way of reshaping a multidimensional array into a 1-D Vector in Python?

So I'm trying to turn this array
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
into this array
array([0, 3, 9, 6, 0])
So each row of the original array is replaced with a single value equal to the position of the "1" in that row.
You can use numpy.argmax:
a = np.array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
              [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
              [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
              [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
print(np.argmax(a, axis=1))
Prints:
[0 3 9 6 0]
Try this:
array = np.array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
                  [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
                  [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
                  [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
                  [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
print(np.where(array == 1)[1])
Output:
[0 3 9 6 0]
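A caveat worth noting (an aside, not part of either answer): np.argmax reports 0 for a row that contains no 1 at all, while np.where(array == 1)[1] silently drops such rows, so the two approaches only agree when every row contains exactly one 1:
import numpy as np

# Hypothetical input whose last row contains no 1.
a = np.array([[1., 0., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])

print(np.argmax(a, axis=1))   # [0 2 0] -- the all-zero row is reported as 0
print(np.where(a == 1)[1])    # [0 2]   -- the all-zero row is dropped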

`np.add.at` to 2-dimensional array

I'm looking for a 2-dimensional version of np.add.at().
The expected behavior is as follows.
augend = np.zeros((10, 10))
indices_for_dim0 = np.array([1, 5, 2])
indices_for_dim1 = np.array([5, 3, 1])
addend = np.array([1, 2, 3])
### some procedure substituting np.add.at ###
assert augend[1, 5] == 1
assert augend[5, 3] == 2
assert augend[2, 1] == 3
Any advice will help!
You can use np.add.at with multidimensional arrays as it is. The documentation's description of the indices argument says:
... If first operand has multiple dimensions, indices can be a tuple of array like index objects or slice
So:
augend = np.zeros((10, 10))
indices_for_dim0 = np.array([1, 5, 2])
indices_for_dim1 = np.array([5, 3, 1])
addend = np.array([1, 2, 3])
np.add.at(augend, (indices_for_dim0, indices_for_dim1), addend)
More simply, since each index pair occurs only once here:
augend[indices_for_dim0, indices_for_dim1] += addend
Note that the += form applies only one addition per unique index pair; np.add.at is the unbuffered version that accumulates repeated indices (see the sketch at the end of this thread).
If you're really worried about the multidimensional aspect and your augend is a vanilla contiguous C order array, you can use ravel and ravel_multi_index to perform the operation on a 1D view:
indices = np.ravel_multi_index((indices_for_dim0, indices_for_dim1), augend.shape)
raveled = augend.ravel()
np.add.at(raveled, indices, addend)
Oneliner:
np.add.at(augend, (indices_for_dim0, indices_for_dim1), addend)
augend
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 3., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 2., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
assert augend[1, 5] == 1
assert augend[5, 3] == 2
assert augend[2, 1] == 3
# No AssertionError
When using np.add.at with a 2-D array, indices must be a tuple whose first element holds all the first-axis coordinates and whose second element holds all the second-axis coordinates.
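Here is the sketch mentioned above on repeated index pairs (my own example, not from the question, whose pairs are unique): with duplicates, the buffered += applies only one addition per unique pair, while np.add.at accumulates all of them.
import numpy as np

rows = np.array([1, 1, 2])          # the pair (1, 5) appears twice
cols = np.array([5, 5, 1])
vals = np.array([1.0, 2.0, 3.0])

a = np.zeros((10, 10))
a[rows, cols] += vals               # buffered: only one of the two additions to (1, 5) survives
print(a[1, 5])                      # 2.0

b = np.zeros((10, 10))
np.add.at(b, (rows, cols), vals)    # unbuffered: both additions accumulate
print(b[1, 5])                      # 3.0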

Finding the start/stop positions and length of the longest and shortest sequence of 1s or 0s in a numpy matrix

I have a numpy matrix that looks like:
matrix = [[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
           1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
How would I get the length of the longest sequence of 1s or 0s? Also how would I get their start and stop positions?
Is there an easier numpy-way to get this done?
Output format is flexible as long as it denotes the inner list index, the run length, and the run's start and end indices.
Example:
LONGEST ONES: 1, 16, 2, 17 (inner list index, length, start index of the longest 1s run, end index of the longest 1s run).
or [1, 16, 2, 17] / (1, 16, 2, 17)
LONGEST ZEROS: 2, 45, 0, 44
Not a duplicate of these questions, as this concerns a matrix:
find the start position of the longest sequence of 1's
The result (longest) should be considered across all inner lists.
A run does not continue across the end of one inner list into the next.
Starting from Divakar's answer to the linked question, you can adapt it with np.vectorize, setting the signature argument and doing a little arithmetic to get what you're looking for.
Take, for instance,
m = np.array(matrix)

def get_longest_ones_matrix(b):
    # Pad with False on both sides so np.diff flags every run boundary,
    # then pair up the start/stop indices of each run of 1s.
    idx_pairs = np.where(np.diff(np.hstack(([False], b == 1, [False]))))[0].reshape(-1, 2)
    if not idx_pairs.size:
        return np.array([0, 0, 0])
    d = np.diff(idx_pairs, axis=1).argmax()
    start_longest_seq = idx_pairs[d, 0]
    end_longest_seq = idx_pairs[d, 1]
    l = end_longest_seq - start_longest_seq   # run length
    p = start_longest_seq % 45                # start index (% 45 is a no-op here, since each call gets one 45-column row)
    e = end_longest_seq - 1                   # inclusive end index
    return np.array([l, p, e])

v = np.vectorize(get_longest_ones_matrix, signature='(n)->(m)')
x = v(m)
Which yields
[[ 3 26 28]
 [16  2 17]
 [ 0  0  0]]
Then,
a = x[:,0].argmax()
print(a,x[a])
1 [16 2 17]
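The question also asks about the longest zero runs; those can be obtained by reusing the same helper on the complement of the matrix (a small sketch building on the answer above, using its v and m):
xz = v((m == 0).astype(float))   # runs of 0s in m are runs of 1s in (m == 0)
az = xz[:, 0].argmax()
print(az, xz[az])
# 2 [45  0 44]  (row 2 is all zeros; the end index is inclusive, as above)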

Comparing Arrays for Accuracy

I have 2 arrays:
np.array(y_pred_list).shape
# returns (5, 47151, 10)
np.array(y_val_lst).shape
# returns (5, 47151, 10)
np.array(y_pred_list)[:, 2, :]
# returns
array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
np.array(y_val_lst)[:, 2, :]
# returns
array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
I would like to go through all 47151 examples and calculate the accuracy, i.e. the number of examples in y_pred_list that match y_val_lst, divided by 47151. What's the right comparison function for this?
You can find a lot of useful classification scores in sklearn.metrics, in particular accuracy_score(). See the documentation; you would use it as:
from sklearn.metrics import accuracy_score

acc = accuracy_score(np.array(y_val_lst)[:, 2, :],
                     np.array(y_pred_list)[:, 2, :])
Sounds like you want something like this:
accuracy = (np.asarray(y_pred_list) == np.asarray(y_val_lst)).all(axis=(0, 2)).mean()
...though since your arrays are clearly floating-point arrays, you might want to allow for numerical-precision error rather than insisting on exact equality:
accuracy = (np.abs(np.asarray(y_pred_list) - np.asarray(y_val_lst)) < tolerance).all(axis=(0, 2)).mean()
(where, for example, tolerance = 1e-10)
The .all(axis=(0, 2)) call records the cases in which everything in its input is True (i.e. everything matches) along dimension 0 (the one of extent 5) and dimension 2 (the one of extent 10). It outputs a one-dimensional array of length 47151. The .mean() call then gives you the proportion of matches in that sequence, which is my best guess as to what you mean by "over 47151".
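As a minimal self-contained check of that expression (a sketch with toy shapes standing in for (5, 47151, 10); the names are illustrative):
import numpy as np

# Two (folds, examples, classes) arrays; only example 1 has a mismatch.
y_val = np.zeros((2, 3, 4))
y_pred = y_val.copy()
y_pred[0, 1, 2] = 1.0

accuracy = (y_pred == y_val).all(axis=(0, 2)).mean()
print(accuracy)   # 0.666..., i.e. 2 of the 3 examples match across all folds and classes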

Replace values in bigger numpy array with smaller array

I have 2 numpy arrays: the bigger one is a 10 x 10 array and the smaller one is a 2 x 2 array.
I would like to substitute the values in the bigger array with those from the smaller array at a user-specified location, e.g. replace the 4 values around the center of the 10 x 10 array with the 2 x 2 array.
Right now I am doing this with a nested for loop, figuring out which positions in the bigger array overlap those of the smaller array. Is there a more pythonic way to do it?
In [1]: import numpy as np
In [2]: a = np.zeros(100).reshape(10,10)
In [3]: b = np.ones(4).reshape(2,2)
In [4]: a[4:6, 4:6] = b
In [5]: a
Out[5]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
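For the "user specified location" part of the question, a small helper along these lines may be convenient (a sketch; the function name and row/col arguments are illustrative, and it assumes the small block fits inside the big array):
import numpy as np

def paste(big, small, row, col):
    # Overwrite a block of `big` with `small`, placing its top-left
    # corner at (row, col); assumes the block fits inside `big`.
    r, c = small.shape
    big[row:row + r, col:col + c] = small
    return big

a = np.zeros((10, 10))
b = np.ones((2, 2))
paste(a, b, 4, 4)   # same result as a[4:6, 4:6] = b above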
