Currently, I have a 3D Python list in jagged array format.
A = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0], [0], [0]]]
Is there any way I can convert this list to a NumPy array, in order to use certain NumPy array operators such as adding a number to each element?
A + 4 would give [[[4, 4, 4], [4, 4, 4], [4, 4, 4]], [[4], [4], [4]]].
Assigning B = numpy.array(A) and then attempting B + 4 throws a TypeError:
TypeError: can only concatenate list (not "float") to list
Is a conversion from a jagged Python list to a NumPy array possible while retaining the structure (I will need to convert it back later), or is looping through the list and adding the required value the better solution in this case?
The answers by @SonderingNarcissit and @MadPhysicist are already quite nice.
Here is a quick way of adding a number to each element in your list while keeping the structure. You can replace the function return_number with anything you like if you want to do something other than adding a number:
def return_number(my_number):
    return my_number + 4

def add_number(my_list):
    if isinstance(my_list, (int, float)):
        return return_number(my_list)
    else:
        return [add_number(xi) for xi in my_list]
A = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0], [0], [0]]]
Then
print(add_number(A))
gives you the desired output:
[[[4, 4, 4], [4, 4, 4], [4, 4, 4]], [[4], [4], [4]]]
So what it does is look recursively through your list of lists, and every time it finds a number it adds the value 4; this should work for arbitrarily deeply nested lists. It currently only handles numbers and lists; if your lists also contain e.g. dictionaries, you would have to add another if-clause, as sketched below.
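For instance, a minimal sketch of how such an extra clause might look (the dict handling here is my illustration, not part of the original recipe):
import copy

def add_number(my_list):
    if isinstance(my_list, (int, float)):
        return return_number(my_list)
    elif isinstance(my_list, dict):
        # Illustrative extension: transform each value, keep the keys.
        return {k: add_number(v) for k, v in my_list.items()}
    else:
        return [add_number(xi) for xi in my_list]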
Since numpy can only work with regular-shaped arrays, it checks whether all the elements of a nested iterable have the same length for a given dimension. If they do not, it still creates an array, but of type np.object instead of np.int as you would expect:
>>> B = np.array(A)
>>> B
array([[[0, 0, 0], [0, 0, 0], [0, 0, 0]],
[[0], [0], [0]]], dtype=object)
In this case, the "objects" are lists. Addition is defined for lists, but only in terms of other lists that extend the original, hence your error. [0, 0] + 4 is an error, while [0, 0] + [4] is [0, 0, 4]. Neither is what you want.
It may be interesting that numpy nests the object portion of your array as low as possible. The array you created is actually a 2D numpy array containing lists, not a 1D array containing nested lists:
>>> B[0, 0]
[0, 0, 0]
>>> B[0, 0, 0]
Traceback (most recent call last):
File "<ipython-input-438-464a9bfa40bf>", line 1, in <module>
B[0, 0, 0]
IndexError: too many indices for array
As you pointed out, you have two options when it comes to ragged arrays. The first is to pad the array so it is non-ragged, convert it to numpy, and only use the elements you care about. This does not seem very convenient in your case.
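For completeness, here is a minimal sketch of the padding idea, assuming a fill value of 0 and that only the innermost lists are ragged (both my assumptions):
import numpy as np

A = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0], [0], [0]]]

# Pad every innermost list to the maximum length with the fill value.
max_len = max(len(inner) for outer in A for inner in outer)
padded = [[inner + [0] * (max_len - len(inner)) for inner in outer] for outer in A]

B = np.array(padded)   # now a regular (2, 3, 3) integer array
print(B + 4)           # arithmetic works, but the padded slots are included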
The other method is to apply functions to your nested list directly. Luckily for you, I wrote a snippet/recipe in response to this question, which does exactly what you need, down to supporting arbitrary levels of nesting and your choice of operators. I have upgraded it here to accept non-iterable elements anywhere along the nesting (including at the original input) and to do a primitive form of broadcasting:
from itertools import repeat

def elementwiseApply(op, *iters):
    def isIterable(x):
        """
        This check is also available in numpy as `numpy.iterable`.
        """
        try:
            iter(x)
        except TypeError:
            return False
        return True

    def apply(op, *items):
        """
        Applies the operator to the given arguments. If any of the
        arguments are iterable, the non-iterables are broadcast by
        `itertools.repeat` and the function is applied recursively
        on each element of the zipped result.
        """
        elements = []
        count = 0
        for item in items:
            if isIterable(item):
                elements.append(item)
                count += 1
            else:
                elements.append(repeat(item))
        if count == 0:
            return op(*items)
        return [apply(op, *items) for items in zip(*elements)]

    return apply(op, *iters)
This is a pretty general solution that will work with just about any kind of input. Here are a couple of sample runs showing how it is relevant to your question:
>>> from operator import add
>>> elementwiseApply(add, 4, 4)
8
>>> elementwiseApply(add, [4, 0], 4)
[8, 4]
>>> elementwiseApply(add, [(4,), [0, (1, 3, [1, 1, 1])]], 4)
[[8], [4, [5, 7, [5, 5, 5]]]]
>>> elementwiseApply(add, [[0, 0, 0], [0, 0], 0], [[4, 4, 4], [4, 4], 4])
[[4, 4, 4], [4, 4], 4]
>>> elementwiseApply(add, [(4,), [0, (1, 3, [1, 1, 1])]], [1, 1, 1])
[[5], [1, [2, 4, [2, 2, 2]]]]
The result is always a new list or scalar, depending on the types of the inputs. The number of inputs must be the number accepted by the operator. operator.add always takes two inputs, for example.
Looping and adding is likely better, since you want to preserve the structure of the original. Plus, the error you mentioned indicates that you would need to flatten the numpy array and then add to each element. Although numpy operations tend to be faster than list operations, converting, flattening, and reverting is cumbersome and will probably offset any gains.
If we turn your list into an array, we get a 2D array of objects:
In [1941]: A = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0], [0], [0]]]
In [1942]: A = np.array(A)
In [1943]: A.shape
Out[1943]: (2, 3)
In [1944]: A
Out[1944]:
array([[[0, 0, 0], [0, 0, 0], [0, 0, 0]],
[[0], [0], [0]]], dtype=object)
When I try A+1, it iterates over the elements of A and tries to do +1 for each. In the case of a numeric array it can do that in fast compiled code. With an object array it has to invoke the + operation for each element.
In [1945]: A+1
...
TypeError: can only concatenate list (not "int") to list
Let's try that again with a flat iteration over A:
In [1946]: for a in A.flat:
      ...:     print(a+1)
      ...:
TypeError: can only concatenate list (not "int") to list
The elements of A are lists; + for a list is concatenation:
In [1947]: for a in A.flat:
...: print(a+[1])
...:
[0, 0, 0, 1]
[0, 0, 0, 1]
[0, 0, 0, 1]
[0, 1]
[0, 1]
[0, 1]
If the elements of A were themselves arrays, I think the +1 would work.
In [1956]: for i, a in np.ndenumerate(A):
...: A[i]=np.array(a)
...:
In [1957]: A
Out[1957]:
array([[array([0, 0, 0]), array([0, 0, 0]), array([0, 0, 0])],
[array([0]), array([0]), array([0])]], dtype=object)
In [1958]: A+1
Out[1958]:
array([[array([1, 1, 1]), array([1, 1, 1]), array([1, 1, 1])],
[array([1]), array([1]), array([1])]], dtype=object)
And to get back to the pure list form, we have to apply tolist both to the elements of the object array and to the array itself:
In [1960]: A1=A+1
In [1961]: for i, a in np.ndenumerate(A1):
...: A1[i]=a.tolist()
In [1962]: A1
Out[1962]:
array([[[1, 1, 1], [1, 1, 1], [1, 1, 1]],
[[1], [1], [1]]], dtype=object)
In [1963]: A1.tolist()
Out[1963]: [[[1, 1, 1], [1, 1, 1], [1, 1, 1]], [[1], [1], [1]]]
This is a rather roundabout way of adding a value to all elements of nested lists. I could have done it with one iteration:
In [1964]: for i,a in np.ndenumerate(A):
...: A[i]=[x+1 for x in a]
...:
In [1965]: A
Out[1965]:
array([[[1, 1, 1], [1, 1, 1], [1, 1, 1]],
[[1], [1], [1]]], dtype=object)
So doing math on object arrays is hit and miss. Some operations do propagate to the elements, but even those depend on how the elements behave.
It is unfortunate that the input structure is a jagged list. If one could adjust the method used to generate the list, assigning "no data" sentinel values instead of omitting entries, then there is much more one can do. I made this comment on the initial post, but I will demonstrate how the design of the original could be altered to facilitate obtaining more data while still enabling the return of a list.
I have done this as a function so I could comment the inputs and outputs for further reference.
import numpy as np
from textwrap import dedent

def num_46():
    """(num_46)... Masked array from an ill-formed list
    : http://stackoverflow.com/questions/40289943/
    : converting-a-3d-list-to-a-3d-numpy-array
    : A = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    :      [[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0], [0], [0]]]
    """
    frmt = """
    :Input list...
    {}\n
    :Masked array data
    {}\n
    :Sample calculations:
    : a.count(axis=0) ... a.count(axis=1) ... a.count(axis=2)
    {}\n
    {}\n
    {}\n
    : and finally: a * 2
    {}\n
    :Return it to a list...
    {}
    """
    a_list = [[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
              [[9, 10, 11], [12, 13, 14], [15, 16, 17]],
              [[18, -1, -1], [21, -1, -1], [24, -1, -1]]]
    mask_val = -1
    a = np.ma.masked_equal(a_list, mask_val)   # mask the sentinel values
    a.set_fill_value(mask_val)
    final = a.tolist(mask_val)                 # back to a regular list
    args = [a_list, a,
            a.count(axis=0), a.count(axis=1), a.count(axis=2),
            a * 2, final]
    print(dedent(frmt).format(*args))
    return a_list, a, final

# ----------------------
if __name__ == "__main__":
    """Main section... """
    A, a, c = num_46()
Some results that show that the use of masked arrays may be preferable to a jagged/malformed list structure.
:Input list...
[[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
[[9, 10, 11], [12, 13, 14], [15, 16, 17]],
[[18, -1, -1], [21, -1, -1], [24, -1, -1]]]
:Masked array data
[[[0 1 2]
[3 4 5]
[6 7 8]]
[[9 10 11]
[12 13 14]
[15 16 17]]
[[18 -- --]
[21 -- --]
[24 -- --]]]
:Sample calculations:
: a.count(axis=0) ... a.count(axis=1) ... a.count(axis=2)
[[3 2 2]
[3 2 2]
[3 2 2]]
[[3 3 3]
[3 3 3]
[3 0 0]]
[[3 3 3]
[3 3 3]
[1 1 1]]
: and finally: a * 2
[[[0 2 4]
[6 8 10]
[12 14 16]]
[[18 20 22]
[24 26 28]
[30 32 34]]
[[36 -- --]
[42 -- --]
[48 -- --]]]
:Return it to a list...
[[[0, 1, 2], [3, 4, 5], [6, 7, 8]], [[9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, -1, -1], [21, -1, -1], [24, -1, -1]]]
Hope this helps someone.
Related
Is it possible to extract the upper-triangular values from a whole 3D array?
A simple example of a 3D array is below:
import numpy as np
a = np.array([[[7, 4, 2], [5, 0, 4], [0, 0, 5]],
[[7, 6, 1], [3, 9, 5], [0, 8, 7]],
[[8, 10, 3], [1, 2, 15], [9, 0, 1]]])
You can use numpy's matrix-building functions like numpy.triu (triangle-upper) or numpy.tril (triangle-lower) to return a copy of a matrix with the elements above or below the k-th diagonal zeroed.
If, on the other hand, you are only interested in the values above or below the diagonal (without having a copy of the matrix), you can simply use numpy.triu_indices and numpy.tril_indices, as follows:
upper_index = np.triu_indices(n=3, k=1)
where n is the size of the arrays for which the returned indices will be valid, and k the diagonal offset.
This returns the indices for the triangle. The returned tuple contains two arrays, each with the indices along one dimension of the array:
(array([0, 0, 1], dtype=int64), array([1, 2, 2], dtype=int64))
Now you can use the obtained indices to index your array:
a[upper_index]
which gives:
array([[5, 0, 4],
[0, 0, 5],
[0, 8, 7]])
Similarly, you can find the part under the diagonal using numpy.tril_indices.
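For example, a quick check with the same a and the strictly-lower triangle:
lower_index = np.tril_indices(n=3, k=-1)   # strictly below the diagonal
print(a[lower_index])
# [[ 7  6  1]
#  [ 8 10  3]
#  [ 1  2 15]]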
IIUC, you could use triu_indices:
result = a[np.triu_indices(3)]
print(result)
Output
[[7 4 2]
[5 0 4]
[0 0 5]
[3 9 5]
[0 8 7]
[9 0 1]]
If you want those strictly above the diagonal, you can pass an offset value:
result = a[np.triu_indices(3, 1)]
print(result)
Output
[[5 0 4]
[0 0 5]
[0 8 7]]
I want to insert a list into a numpy-based matrix at a specific index. For instance, the following code (Python 2.7) is supposed to insert the list [5,6,7] into M in the second place:
M = [[0, 0], [0, 1], [1, 0], [1, 1]]
M = np.asarray(M)
X = np.insert(M, 1, [5,6,7])
print(X)
This, however, does not output what I would like. It messes up the matrix M by merging all lists into one single list. How can I insert any list at any position of a numpy-based matrix?
Thank you
In [80]: M = [[0, 0], [0, 1], [1, 0], [1, 1]]
...: M1 = np.asarray(M)
...:
List insert:
In [81]: M[1:2] = [[5,6,7]]
In [82]: M
Out[82]: [[0, 0], [5, 6, 7], [1, 0], [1, 1]]
Contrast the array made from the original M and the modified one:
In [83]: M1
Out[83]:
array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
In [84]: np.array(M)
Out[84]:
array([list([0, 0]), list([5, 6, 7]), list([1, 0]), list([1, 1])],
dtype=object)
The second one is not a 2d array.
np.insert without an axis ravels things (check the docs)
In [85]: np.insert(M1,1,[5,6,7])
Out[85]: array([0, 5, 6, 7, 0, 0, 1, 1, 0, 1, 1])
If I specify an axis it complains about a mismatch in shapes:
In [86]: np.insert(M1,1,[5,6,7],axis=0)
...
5071 new[slobj] = arr[slobj]
5072 slobj[axis] = slice(index, index+numnew)
-> 5073 new[slobj] = values
5074 slobj[axis] = slice(index+numnew, None)
5075 slobj2 = [slice(None)] * ndim
ValueError: could not broadcast input array from shape (1,3) into shape (1,2)
It creates a (1,2) shape slot to receive the new value, but [5,6,7] won't fit.
In [87]: np.insert(M1,1,[5,6],axis=0)
Out[87]:
array([[0, 0],
[5, 6],
[0, 1],
[1, 0],
[1, 1]])
import numpy

# The first token of the first input line is the row count;
# each subsequent line supplies one row of the array.
arr = numpy.array([input().split() for i in range(int(input().split()[0]))])
print(arr)
INPUT:
2
1 2 3 4
5 6 7 8
OUTPUT:
[['1' '2' '3' '4']
 ['5' '6' '7' '8']]
I have an array X:
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
And I wish to find the index of the row of several values in this array:
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
For this example I would like a result like:
[0,3,4]
I have code doing this, but I think it is overly complicated:
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
result = []
for s in searched_values:
idx = np.argwhere([np.all((X-s)==0, axis=1)])[0][1]
result.append(idx)
print(result)
I found this answer for a similar question but it works only for 1d arrays.
Is there a way to do what I want in a simpler way?
Approach #1
One approach would be to use NumPy broadcasting, like so -
np.where((X==searched_values[:,None]).all(-1))[1]
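A quick check with the sample arrays from the question:
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# Broadcast-compare every searched row against every row of X,
# then keep the X-indices where the full row matches.
print(np.where((X == searched_values[:, None]).all(-1))[1])
# [0 3 4]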
Approach #2
A memory-efficient approach would be to convert each row to its linear index equivalent and then use np.in1d, like so -
dims = X.max(0)+1
out = np.where(np.in1d(np.ravel_multi_index(X.T,dims),\
np.ravel_multi_index(searched_values.T,dims)))[0]
Approach #3
Another memory-efficient approach uses np.searchsorted with that same philosophy of converting to linear index equivalents, like so -
dims = X.max(0)+1
X1D = np.ravel_multi_index(X.T,dims)
searched_valuesID = np.ravel_multi_index(searched_values.T,dims)
sidx = X1D.argsort()
out = sidx[np.searchsorted(X1D,searched_valuesID,sorter=sidx)]
Please note that this np.searchsorted method assumes there is a match for each row from searched_values in X.
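If that assumption may not hold, one hedged way to guard the result (my sketch, not part of the original approach) is to validate the candidates afterwards:
# Validate the searchsorted candidates, marking misses with -1.
sidx = X1D.argsort()
pos = np.searchsorted(X1D, searched_valuesID, sorter=sidx)
pos[pos == len(X1D)] = 0                   # clamp out-of-range positions
out = sidx[pos]
out[X1D[out] != searched_valuesID] = -1    # -1 where no exact match exists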
How does np.ravel_multi_index work?
This function gives us linear index equivalents. It accepts a 2D array of n-dimensional indices (set as columns) and the shape of the n-dimensional grid onto which those indices are to be mapped, and computes the equivalent linear indices.
Let's use the inputs we have for the problem at hand. Take the case of input X and note its first row. Since we are trying to convert each row of X into its linear index equivalent, and since np.ravel_multi_index treats each column as one indexing tuple, we need to transpose X before feeding it into the function. Since the number of elements per row of X in this case is 2, the n-dimensional grid to be mapped onto would be 2D. With 3 elements per row in X, it would have been a 3D grid for mapping, and so on.
To see how this function would compute linear indices, consider the first row of X -
In [77]: X
Out[77]:
array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
We have the shape of the n-dimensional grid as dims -
In [78]: dims
Out[78]: array([10, 7])
Let's create the 2-dimensional grid to see how that mapping works and linear indices get computed with np.ravel_multi_index -
In [79]: out = np.zeros(dims,dtype=int)
In [80]: out
Out[80]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
Let's set the first indexing tuple from X, i.e. the first row from X into the grid -
In [81]: out[4,2] = 1
In [82]: out
Out[82]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
Now, to see the linear index equivalent of the element just set, let's flatten and use np.where to detect that 1.
In [83]: np.where(out.ravel())[0]
Out[83]: array([30])
This could also be computed by hand if row-major ordering is taken into account.
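Concretely, with row-major (C) ordering, the linear index of (4, 2) on a (10, 7) grid is:
# row-major: linear index = row * n_cols + col
print(4 * 7 + 2)   # 30, matching np.where(out.ravel())[0]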
Let's use np.ravel_multi_index and verify those linear indices -
In [84]: np.ravel_multi_index(X.T,dims)
Out[84]: array([30, 66, 61, 24, 41])
Thus, we would have linear indices corresponding to each indexing tuple from X, i.e. each row from X.
Choosing dimensions for np.ravel_multi_index to form unique linear indices
Now, the idea behind considering each row of X as an indexing tuple of an n-dimensional grid, and converting each such tuple to a scalar, is to have unique scalars corresponding to unique tuples, i.e. unique rows in X.
Let's take another look at X -
In [77]: X
Out[77]:
array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
Now, as discussed in the previous section, we are considering each row as an indexing tuple. Within each such indexing tuple, the first element represents the first axis of the n-dim grid, the second element the second axis of the grid, and so on until the last element of each row in X. In essence, each column represents one dimension or axis of the grid. If we are to map all elements from X onto the same n-dim grid, we need to consider the maximum stretch of each axis of such a proposed n-dim grid. Assuming we are dealing with non-negative numbers in X, such a stretch would be the maximum of each column in X + 1. That + 1 is because Python follows 0-based indexing. So, for example, X[1,0] == 9 would map to the 10th row of the proposed grid. Similarly, X[4,1] == 6 would go to the 7th column of that grid.
So, for our sample case, we had -
In [7]: dims = X.max(axis=0) + 1 # Or simply X.max(0) + 1
In [8]: dims
Out[8]: array([10, 7])
Thus, we would need a grid of at least shape (10,7) for our sample case. Longer dimensions won't hurt and would give us unique linear indices too.
Concluding remarks: one important thing to note here is that if we have negative numbers in X, we need to add proper offsets along each column of X to make those indexing tuples non-negative before using np.ravel_multi_index, as sketched below.
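A minimal sketch of such an offset, assuming integer data (this snippet is mine, not part of the original answer):
# Shift both arrays so every column starts at 0 before raveling.
offset = np.minimum(X.min(axis=0), searched_values.min(axis=0))
dims = np.maximum(X.max(axis=0), searched_values.max(axis=0)) - offset + 1
X1D = np.ravel_multi_index((X - offset).T, dims)
sv1D = np.ravel_multi_index((searched_values - offset).T, dims)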
Another alternative is to use asvoid (below) to view each row as a single
value of void dtype. This reduces a 2D array to a 1D array, thus allowing you to use np.in1d as usual:
import numpy as np
def asvoid(arr):
    """
    Based on http://stackoverflow.com/a/16973510/190597 (Jaime, 2013-06)
    View the array as dtype np.void (bytes). The items along the last axis are
    viewed as one value. This allows comparisons to be performed which treat
    entire rows as one value.
    """
    arr = np.ascontiguousarray(arr)
    if np.issubdtype(arr.dtype, np.floating):
        # Care needs to be taken here since
        # np.array([-0.]).view(np.void) != np.array([0.]).view(np.void)
        # Adding 0. converts -0. to 0.
        arr += 0.
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
idx = np.flatnonzero(np.in1d(asvoid(X), asvoid(searched_values)))
print(idx)
# [0 3 4]
The numpy_indexed package (disclaimer: I am its author) contains functionality for performing such operations efficiently (also uses searchsorted under the hood). In terms of functionality, it acts as a vectorized equivalent of list.index:
import numpy_indexed as npi
result = npi.indices(X, searched_values)
Note that using the 'missing' kwarg, you have full control over the behavior of missing items, and it works for nd-arrays (for instance, stacks of images) as well.
Update: using the same shapes as @Rik, X = [520000, 28, 28] and searched_values = [20000, 28, 28], it runs in 0.8064 secs, using missing=-1 to detect and denote entries not present in X.
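Following the answer's description of that kwarg, usage would look like this (a sketch based on the text above):
import numpy_indexed as npi

# Rows of searched_values absent from X come back as -1 instead of raising.
result = npi.indices(X, searched_values, missing=-1)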
Here is a pretty fast solution that scales up well, using numpy and hashlib. It can handle large matrices or images in seconds. I used it on a 520000 x (28 x 28) array and a 20000 x (28 x 28) array in 2 seconds on my CPU.
Code:
import numpy as np
import hashlib
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6]])
# hashing with sha1 appears to be efficient
xhash = [hashlib.sha1(row).digest() for row in X]
yhash = [hashlib.sha1(row).digest() for row in searched_values]
z = np.in1d(xhash, yhash)
# Use np.unique to get unique indices into the in1d results
_, unique = np.unique(np.array(xhash)[z], return_index=True)
# Compute unique indices by indexing an array of indices
idx = np.arange(len(xhash))
unique_idx = idx[z][unique]
print('unique_idx=',unique_idx)
print('X[unique_idx]=',X[unique_idx])
Output:
unique_idx= [4 3 0]
X[unique_idx]= [[5 6]
[3 3]
[4 2]]
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
S = np.array([[4, 2],
[3, 3],
[5, 6]])
result = [[i for i,row in enumerate(X) if (s==row).all()] for s in S]
or
result = [i for s in S for i,row in enumerate(X) if (s==row).all()]
if you want a flat list (assuming there is exactly one match per searched value).
Another way is to use the cdist function from scipy.spatial.distance, like this:
np.nonzero(cdist(X, searched_values) == 0)[0]
Basically, we get the row numbers of X which have distance zero to a row in searched_values, meaning they are equal. This makes sense if you look at rows as coordinates.
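A self-contained version of that idea (note that exact zero distance is only reliable for integer data; with floats, rounding error could break the equality):
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# Zero pairwise distance between two rows means the rows are identical.
print(np.nonzero(cdist(X, searched_values) == 0)[0])
# [0 3 4]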
I had a similar requirement and the following worked for me:
np.argwhere(np.isin(X, searched_values).all(axis=1))
Here's what worked out for me:
def find_points(orig: np.ndarray, search: np.ndarray) -> np.ndarray:
    # One boolean row per searched point, marking equal rows of orig.
    equals = [np.equal(orig, p).all(1) for p in search]
    exists = np.max(equals, axis=1)      # does the point occur at all?
    indices = np.argmax(equals, axis=1)  # index of its first occurrence
    indices[~exists] = -1                # -1 marks missing points
    return indices
test:
X = np.array([[4, 2],
[9, 3],
[8, 5],
[3, 3],
[5, 6]])
searched_values = np.array([[4, 2],
[3, 3],
[5, 6],
[0, 0]])
find_points(X, searched_values)
output:
[0,3,4,-1]
I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy
uniq_rows = numpy.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print row, numpy.where((test_rows == row).all(axis=1))[0]
This prints...
[0, 1, 0] [ 1 4 10]
[1, 1, 0] [ 3 8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically, for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always contain every unique row, or the same number of rows.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones; they could be anywhere from 0 to 32000. The size of uniq_rows could be between 4 and 128 rows, and the size of test_rows could be in the hundreds of thousands.
Numpy
From version 1.13 of numpy you can use numpy.unique like np.unique(test_rows, return_counts=True, return_index=True, axis=0) (note axis=0 so that whole rows are compared; return_inverse=True additionally maps each test row back to its unique row).
Pandas
import pandas as pd

df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
0 1 2
0 0 1 0
1 1 1 0
2 1 1 1
3 0 1 1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
0 1 2
0 0 1 1
1 0 1 0
2 0 0 0
3 1 1 0
4 1 1 1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1, 4, 10], dtype=int64),
1: array([ 3, 8, 12], dtype=int64),
2: array([7, 9], dtype=int64),
3: array([0, 5, 6], dtype=int64)}
There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of unique-row indices and the corresponding test-row indices:
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
array([ 1, 4, 10, 3, 8, 12, 7, 9, 0, 5, 6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
This uses array broadcasting to compare each row of uniq_rows with each row of test_rows, resulting in a 4x13x3 boolean array. all is used to determine which rows are fully equal (all comparisons returned true). Finally, where returns the indices of these rows.
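If you want the matches split into one group per unique row, a small follow-up sketch (my addition, using the same uniq_rows and test_rows, and assuming every unique row occurs at least once):
import numpy as np

u_idx, t_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))

# u_idx comes back sorted, so split t_idx at each change of unique-row index.
groups = np.split(t_idx, np.flatnonzero(np.diff(u_idx)) + 1)
print(groups)
# [array([ 1,  4, 10]), array([ 3,  8, 12]), array([7, 9]), array([0, 5, 6])]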
With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247, imported here as aset):
In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]:
(array([[0, 0, 0],
[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]]),
array([2, 1, 0, 3, 7], dtype=int32),
array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
...: dd[v].append(i)
...:
In [167]: dd
Out[167]:
defaultdict(list,
{0: [2, 11],
1: [1, 4, 10],
2: [0, 5, 6],
3: [3, 8, 12],
4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuples):
In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
...: dd[tuple(a[v])].append(i)
...:
In [172]: dd
Out[172]:
defaultdict(list,
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]})
This will do the job:
import numpy as np
uniq_rows = np.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = np.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
diffs = np.abs(np.repeat(uniq_rows, len(test_rows), axis=0)
               - np.tile(test_rows, (len(uniq_rows), 1)))
indices = np.where(diffs.sum(axis=1) == 0)[0]
loc = indices // len(test_rows)
indices = indices - loc * len(test_rows)
res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work for all cases, including those in which not all the rows in uniq_rows are present in test_rows. However, if somehow you know ahead of time that all of them are present, you could replace the part
res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
with just the line:
res = np.split(indices, np.where(np.diff(loc) > 0)[0] + 1)
Thus avoiding loops entirely.
Not very 'numpythonic', but for a bit of an upfront cost, we can make a dict with the keys as a tuple of your row, and a list of indices:
test_rowsdict = {}
for i, j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j), []).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]}
Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:
out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i), [])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.
Approach #1
Here's one approach, not sure about the level of "NumPythonic-ness" though to such a tricky problem -
def get1Ds(a, b):  # Get 1D views of each row from the two inputs
    # check that casting to void will create equal-size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute the void dtype covering one full row
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A, B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1]) + 1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B, A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i, e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scopes for improvement (on performance):
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1, 4, 10]),array([0, 5, 6]),array([ 3, 8, 12]),array([7, 9])]
Approach #2
Another approach, to beat the setup overhead of the previous one while making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:
subset_indices = npi.indices(unique_test_rows, uniq_rows)
As a sidenote; it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, and not the endgoal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them in the following encoding might boost efficiency a lot further still:
def encode(rows):
    # Pack each binary row into one small integer (column i contributes 2**i).
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
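A hedged usage sketch of that encoding (my addition; it assumes at most 8 binary columns so the codes fit in uint8):
import numpy as np

codes_test = encode(test_rows)   # one small integer per row
codes_uniq = encode(uniq_rows)

# The row-matching problem is now a fast 1D integer comparison.
for u, code in zip(uniq_rows, codes_uniq):
    print(u, np.flatnonzero(codes_test == code))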
return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)