I am wondering what the most concise and pythonic way is to keep only the maximum element in each row of a 2D numpy array while setting all other elements to zero. Example:
given the following numpy array:
a = [[1, 8, 3, 6],
     [5, 5, 60, 1],
     [63, 9, 9, 23]]
I want the answer to be:
b = [[0, 8, 0, 0],
     [0, 0, 60, 0],
     [63, 0, 0, 0]]
I can think of several ways to solve this, but what interests me is whether there are python functions to do it quickly.
Thank you in advance
You can use np.max to take the maximum along one axis, then use np.where to zero out the non-maximal elements:
np.where(a == a.max(axis=1, keepdims=True), a, 0)
The keepdims=True argument keeps the singleton dimension after taking the max (i.e. so that a.max(1, keepdims=True).shape == (3, 1)), which simplifies broadcasting it against a.
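For reference, here is a quick end-to-end run on the sample array (a minimal sketch; note that if a row contains several copies of its maximum, all of them are kept):

import numpy as np

a = np.array([[1, 8, 3, 6],
              [5, 5, 60, 1],
              [63, 9, 9, 23]])

# Compare each element against its row maximum; keep matches, zero out the rest.
b = np.where(a == a.max(axis=1, keepdims=True), a, 0)
print(b)
# [[ 0  8  0  0]
#  [ 0  0 60  0]
#  [63  0  0  0]]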
I don't know exactly what counts as pythonic, so I assume the way with the most Python-specific grammar is pythonic. This one uses a nested list comprehension, which is a Python feature, though it might not be that concise:
b = [[y if y == max(x) else 0 for y in x] for x in a]
Related
I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order-preserving) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
The output I would like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear an expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because the unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to remap the items so that they start from 0, but I feel there is a simple numpy trick that would let me avoid adding a sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution has already been posted on stackoverflow and correctly returns a stable unique array. But I do not know how to convert the inverse vector (returned when return_inverse=True) so that it represents the values in a stable order starting from 0.
3. Pandas's get_dummies function: But it returns a one-hot encoding (a matrix of indicator values), whereas I would like an indicator vector. It is possible to convert the one-hot encoding into an indicator vector with a few lines of code and data manipulation, but again that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is its own inverse to fix the order. Note that idx.argsort() places unq back into first-occurrence order. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course, as a byproduct,
unq = unq[idx.argsort()]
Of course, nothing about these operations is specific to 2D.
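Putting the pieces together on the sample arr, a quick sketch to verify the recipe:

import numpy as np

arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)

# Relabel codes to follow first-occurrence order instead of sorted order.
indicator = idx.argsort().argsort()[inv]
print(indicator)           # [0 1 2 0 3 2]
print(unq[idx.argsort()])  # unique rows in order of first appearance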
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
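A tiny check of that claim, using the same x as above:

import numpy as np

x = np.array([7, 3, 0, 1, 4])
i = x.argsort()                               # [2 3 1 4 0]
print(np.sort(x)[i.argsort()])                # [7 3 0 1 4] -- original order restored
print(x[x.argsort()][x.argsort().argsort()])  # [7 3 0 1 4] -- same identity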
I have a 2D array, where each row has a label that is stored in a separate array (not necessarily unique). For each label, I want to extract the rows from my 2D array that have this label. A basic working example of what I want would be this:
import numpy as np
data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])

# very simple approach
label_values = np.unique(label)
res = []
for la in label_values:
    data_of_this_label_val = data[label == la]
    res += [data_of_this_label_val]
print(res)
The result (res) can have any format, as long as it is easily accessible. In the above example, it would be
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
Note that I can easily associate each element in my list to one of the unique labels in label_values (that is, by index).
While this works, using a for loop can take quite a lot of time, especially if my label vector is large. Can this be sped up or coded more elegantly?
You can argsort the labels (which is what unique does under the hood, I believe).
If your labels are small nonnegative integers as in the example, you can get it a bit cheaper; see https://stackoverflow.com/a/53002966/7207392 and the sketch after the code below.
>>> import numpy as np
>>>
>>> data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
>>> label=np.array([1,1,1,0,1])
>>>
>>> idx = label.argsort()
# use kind='mergesort' if you require a stable sort, i.e. one that
# preserves the order of equal labels
>>> ls = label[idx]
>>> split = 1 + np.where(ls[1:] != ls[:-1])[0]
>>> np.split(data[idx], split)
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
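As mentioned above, if the labels are small nonnegative integers, the split points can also be computed from label counts rather than by inspecting the sorted labels. A minimal sketch of that idea (it assumes every label value between 0 and label.max() actually occurs; absent values would produce empty groups):

import numpy as np

data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])

order = np.argsort(label, kind='stable')    # stable grouping permutation
split = np.cumsum(np.bincount(label))[:-1]  # boundaries from per-label counts
print(np.split(data[order], split))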
Unfortunately, there isn't a built-in groupby function in numpy, though you could write alternatives. However, your problem could be solved more succinctly using pandas, if that's available to you:
import pandas as pd
res = pd.DataFrame(data).groupby(label).apply(lambda x: x.values).tolist()
# or, if performance is important, the following will be faster on large arrays,
# but less readable IMO:
res = [data[i] for i in pd.DataFrame(data).groupby(label).groups.values()]
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that iteration is usually outright wrong due to how slow it is by comparison, and writing vectorized operations is best whenever possible. Though the style is not as obvious or Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in one case, from swapping out row iteration with an .apply(lambda).
@MSeifert's answer provides this much better and will be significantly more performant on a dataset of any real size.
A more general answer by @cs95 covers and compares alternatives to iteration in Pandas.
Original Answer
You can iterate through the values in your array with numpy.ndenumerate, which yields the index of each value along with the value itself. Following the documentation:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for index, value in np.ndenumerate(A):
    print(index, value)  # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
Another possible solution:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for _, val in np.ndenumerate(A):
    ind = np.argwhere(A == val)
    print(val, ind)
In this case you will obtain an array of all the indices at which the value appears, which is useful if a value occurs in the array more than once.
I am trying to find the numpy matrix operations that get the same result as the following for loop code. I believe it would be much faster, but I am missing some python skills to do it.
It works row by row: each value from a row of x is multiplied by each value of the same row in e, and everything is summed.
The first item of result would be (2*0+2*1+2*4+2*2+2*3)+(0*0+...)+...+(1*0+1*1+1*4+1*2+1*3)=30
Any idea would be much appreciated :).
# dtype=object because the rows have different lengths (ragged arrays)
e = np.array([[0, 1, 4, 2, 3], [2, 0, 2, 3, 0, 1]], dtype=object)
x = np.array([[2, 0, 0, 0, 1], [0, 3, 0, 0, 4, 0]], dtype=object)
result = np.zeros(len(x))
for key, j in enumerate(x):
    for jj in j:
        for i in e[key]:
            result[key] += jj * i
>>> result
array([ 30.,  56.])
Those are ragged arrays, as the rows have different lengths, so a fully vectorized approach, even if possible, won't be straightforward. Here's one using np.einsum in a list comprehension -
[np.einsum('i,j->',x[n],e[n]) for n in range(len(x))]
Sample run -
In [381]: x
Out[381]: array([[2, 0, 0, 0, 1], [0, 3, 0, 0, 4, 0]], dtype=object)
In [382]: e
Out[382]: array([[0, 1, 4, 2, 3], [2, 0, 2, 3, 0, 1]], dtype=object)
In [383]: [np.einsum('i,j->',x[n],e[n]) for n in range(len(x))]
Out[383]: [30, 56]
If you still feel persistent about a fully vectorized approach, you could make a regular array by filling the shorter lists with zeros. Here's a post that lists a NumPy based approach to do the filling.
Once we have regular shaped arrays x and e, the final result would simply be -
np.einsum('ik,il->i',x,e)
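For completeness, here is a sketch of that idea with a hypothetical pad helper that right-pads the shorter rows with zeros; since zero padding leaves the row sums unchanged, the result is identical:

import numpy as np

def pad(rows):
    # Right-pad each row with zeros to the length of the longest row.
    width = max(len(r) for r in rows)
    out = np.zeros((len(rows), width), dtype=int)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

x = pad([[2, 0, 0, 0, 1], [0, 3, 0, 0, 4, 0]])
e = pad([[0, 1, 4, 2, 3], [2, 0, 2, 3, 0, 1]])
print(np.einsum('ik,il->i', x, e))  # [30 56]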
Is this close to what you are looking for?
https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html
It seems like you are trying to get the dot product of matrices.
I have a 3D array of int32 values. I would like to transform each item in the array into its corresponding bit value at position n. My current approach is to loop through the whole array, but I think it can be done much more efficiently:
for z in range(0, dim[2]):
    for y in range(0, dim[1]):
        for x in range(0, dim[0]):
            bits = '{0:032b}'.format(array[z][y][x])
            array[z][y][x] = int(bits[31 - n])  # nth bit, counting from the least significant
Looking forward to your answers.
If you are dealing with large arrays, you are better off using numpy. Applying bitwise operations to a numpy array is much faster than applying them to python lists.
import numpy as np

a = np.random.randint(1, 65, (2, 2, 2))

a
Out[12]:
array([[[37, 46],
        [47, 34]],

       [[ 3, 15],
        [44, 57]]])

(a >> 1) & 1
Out[16]:
array([[[0, 1],
        [1, 1]],

       [[1, 1],
        [0, 0]]])
Unless there is an intrinsic relation between the different points, you have no choice but to loop over them to discover their current values. So the best you can do will always be O(n³).
What I don't get, however, is why you go through the hassle of converting a number to a 32-bit string and then back to an int.
If you want to check if the nth bit of a number is set, you would do the following:
power_n = 1 << (n - 1)
for z in range(0, dim[2]):
    for y in range(0, dim[1]):
        for x in range(0, dim[0]):
            array[z][y][x] = 0 if array[z][y][x] & power_n == 0 else 1
Note that in this example, I'm assuming that n is 1-indexed (the first bit is at n = 1).
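As a quick sanity check of the two styles side by side (assuming n = 3, 1-indexed):

v, n = 37, 3                # 37 == 0b100101; its 3rd bit (1-indexed) is 1
power_n = 1 << (n - 1)
print(0 if v & power_n == 0 else 1)       # 1, via the mask
print(int('{0:032b}'.format(v)[32 - n]))  # 1, via the 32-bit string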