I have an array/set with unique positive integers, i.e.
>>> unique = np.unique(np.random.choice(100, 4, replace=False))
And an array containing multiple elements sampled from this previous array, such as
>>> A = np.random.choice(unique, 100)
I want to map the values of the array A to the positions at which those values occur in unique.
So far the best solution I found is through a mapping array:
>>> table = np.zeros(unique.max()+1, unique.dtype)
>>> table[unique] = np.arange(unique.size)
The above assigns to each element its index in the array, and can thus be used later to map A through advanced indexing:
>>> table[A]
array([2, 2, 3, 3, 3, 3, 1, 1, 1, 0, 2, 0, 1, 0, 2, 1, 0, 0, 2, 3, 0, 0, 0,
       0, 3, 3, 2, 1, 0, 0, 0, 2, 1, 0, 3, 0, 1, 3, 0, 1, 2, 3, 3, 3, 3, 1,
       3, 0, 1, 2, 0, 0, 2, 3, 1, 0, 3, 2, 3, 3, 3, 1, 1, 2, 0, 0, 2, 0, 2,
       3, 1, 1, 3, 3, 2, 1, 2, 0, 2, 1, 0, 1, 2, 0, 2, 0, 1, 3, 0, 2, 0, 1,
       3, 2, 2, 1, 3, 0, 3, 3], dtype=int32)
Which already gives me the proper solution. However, if the unique numbers in unique are very sparse and large, this approach implies creating a very large table array just to store a few numbers for later mapping.
Is there any better solution?
NOTE: both A and unique are sample arrays, not real arrays. So the question is not how to generate positional indexes; it is how to efficiently map elements of A to indexes in unique. The pseudocode of what I'd like to speed up in numpy is as follows,
B = np.zeros_like(A)
for i in range(A.size):
    B[i] = unique.index(A[i])
(assuming unique is a list in the above pseudocode).
The table approach described in your question is the best option when unique is pretty dense, but unique.searchsorted(A) should produce the same result and doesn't require unique to be dense. searchsorted is great with ints; if anyone is trying to do this kind of thing with floats, which have precision limitations, a tolerance-based comparison is needed rather than exact matching.
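A minimal sketch of the searchsorted route (the values in unique here are made up for illustration; np.unique already returns a sorted array, which searchsorted requires):
import numpy as np

unique = np.array([5, 30, 57, 98])   # sorted, as np.unique guarantees
A = np.random.choice(unique, 100)

# For each element of A, find the index at which it occurs in unique.
# This works because every element of A is known to be present in unique.
B = unique.searchsorted(A)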
You can use a standard Python dict with np.vectorize:
inds = {e:i for i, e in enumerate(unique)}
B = np.vectorize(inds.get)(A)
The numpy_indexed package (disclaimer: I am its author) contains a vectorized equivalent of list.index, which does not require memory proportional to the max element, but only proportional to the input itself:
import numpy_indexed as npi
npi.indices(unique, A)
Note that it also works for arbitrary dtypes and dimensions. Also, the array being queried does not need to be unique; the first index encountered will be returned, the same as for list.
Suppose I have a numpy array (or pandas Series if it makes it any easier), which looks like this:
foo = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
I want to transform it into an array
bar = np.array([0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3])
where the entry is how many steps you need to walk to the left to find a 1 in foo.
Now, obviously one can write a loop to compute bar from foo, but this will be bog slow. Is there anything more clever one can do?
UPDATE The pd.Series solution is around 7 times slower than the pure numpy solution. The stupid loop solution is very slow (no surprise), but when jit compiled with numba is as fast as the numpy solution.
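For reference, a minimal numba sketch of that loop (the function name is illustrative, assuming numba is installed and that foo starts with a 1, as in the example):
import numpy as np
from numba import njit

@njit
def steps_since_one(foo):
    # For each position, count the steps back to the most recent 1.
    out = np.empty(foo.size, dtype=np.int64)
    count = 0
    for i in range(foo.size):
        if foo[i] == 1:
            count = 0          # reset the counter at every 1
        out[i] = count
        count += 1
    return out

bar = steps_since_one(foo)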
You could use groupby + cumcount with pandas:
import pandas as pd

s = pd.Series(foo)
bar = s.groupby(s.cumsum()).cumcount().to_numpy()
bar
array([0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3], dtype=int64)
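The trick is that s.cumsum() increments at every 1, so each run starting at a 1 gets its own group label, and cumcount() then numbers the positions within each run from 0. For reference, the intermediate grouping key is:
s.cumsum().to_numpy()
array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3], dtype=int64)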
One option, specifically for the shared example, with numpy: build an array of ones, whose cumulative sum acts as a running counter, and write negative jumps at the positions of the 1s so that the counter resets to zero there:
# get positions where value is 1
pos = foo.nonzero()[0]
# need this when computing the cumsum
values = np.diff(pos) - 1
arr = np.ones(foo.size, dtype=int)
arr[0] = 0
arr[pos[1:]] = -values
arr.cumsum()
array([0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3])
Assume multiple matrices with different numbers of rows, contained in a list. How can I fill the rows of the smaller matrices so that they have the same size as the biggest matrix?
list_of_matrices = []
list_of_matrices.append(np.array([[3,3],[4,4]]))
list_of_matrices.append(np.array([[1,1,3],[2,2,5]]))
list_of_matrices.append(np.array([[1,1,3,7],[2,2,5,9]]))
From list_of_matrices I want to create a 3D numpy array of e.g. shape 3x4x2 where the missing values (because the first two 2D matrices are too small) are filled with a scalar value (more specifically, the mean of each matrix along axis 1). I want to do that in a performant way (no for loops).
I tried various ways and concluded this should be readable and quite efficient:
z = np.zeros((3, 2, 4), dtype=int)
for i, n in enumerate(list_of_matrices):
    z[i, :n.shape[0], :n.shape[1]] = n
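If, as the question asks, the padding should be the mean along axis 1 rather than zero, a small variation of the same loop works (a sketch; note the float dtype, since the means are generally not integers):
z = np.empty((3, 2, 4))
for i, n in enumerate(list_of_matrices):
    # pre-fill each row of the slice with that row's mean (mean along axis 1)
    z[i] = n.mean(axis=1, keepdims=True)
    # then overwrite the top-left block with the actual values
    z[i, :n.shape[0], :n.shape[1]] = n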
While trying to find a method other than looping, I concluded that each matrix in list_of_matrices contributes a different number of cells to be assigned to z, so even the cleverest approaches require nothing better than concatenation of these index groups, which is slow in numpy.
This is an example of concatenation of index groups:
a = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
b = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
c = [0, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3]
z[a, b, c] = np.concatenate([n.ravel() for n in list_of_matrices])
It also requires np.concatenate and a list comprehension, so efficiency is lost here as well. But if you really need to optimise further, you can try replacing the concatenation like so:
# Use list of lists instead because arrays are slow while iterating
list_of_matrices = [[[3,3],[4,4]], [[1,1,3],[2,2,5]], [[1,1,3,7],[2,2,5,9]]]
from itertools import chain
concatenation = list(chain(*list(chain(*list_of_matrices))))
and create the abovementioned index sequences a, b, and c by applying tricks with np.repeat plus repetition of specific blocks, as sketched below.
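A hedged sketch of building a, b, and c this way for the shapes in the example, (2,2), (2,3) and (2,4) (variable names are illustrative):
import numpy as np

rows = 2                                    # every matrix has 2 rows
cols = np.array([2, 3, 4])                  # column count of each matrix
sizes = rows * cols                         # cells per matrix: [4, 6, 8]

a = np.repeat(np.arange(len(cols)), sizes)  # matrix index, once per cell
b = np.repeat(np.tile(np.arange(rows), len(cols)),
              np.repeat(cols, rows))        # row index within each matrix
c = np.concatenate([np.tile(np.arange(w), rows)
                    for w in cols])         # column index within each row; a list
                                            # comprehension still sneaks in here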
I have the following numpy array named histarr with shape (13,):
array([0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=uint32)
I want to get an array that gives me positions where 1's are, hence I used np.where
where_are_ones_arr = np.where(histarr == 1)
The output is:
(array([1, 2, 4, 5, 6], dtype=int32),)
I was confused for a while, but then I checked the type and realized that where_are_ones_arr is not an array but actually a tuple, so to get an array I used:
where_are_ones_arr[0]
Result:
array([1, 2, 4, 5, 6], dtype=int32)
Now that is all fine but I found it unbelievable that I cannot get that in one line, so I looked around and tried:
where_are_ones_give_me_only_array = histarr[np.where(histarr == 1)]
But it spits out:
array([1, 1, 1, 1, 1], dtype=uint32)
which is not what I want and which I cannot explain.
What is it that I do not get?
You should be able to do it in one line:
np.where(histarr == 1)[0]
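Equivalently, np.flatnonzero returns the flat indices directly as a single array, with no tuple to unpack:
np.flatnonzero(histarr == 1)
array([1, 2, 4, 5, 6], dtype=int32)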
If I want to create a matrix, I simply call
m = np.matrix([[x00, x01],
               [x10, x11]])
, where x00, x01, x10 and x11 are numbers. However, I would like to vectorize this process. For example, if the x's are one-dimensional arrays with length l, then I would like m to become an array of matrices, or an lx2x2-dimensional array. Unfortunately,
zeros = np.zeros(10)
ones = np.ones(10)
m = np.matrix([[zeros, ones],
               [zeros, ones]])
raises an error ("matrix must be 2-dimensional") and
m = np.array([[zeros, ones],
              [zeros, ones]])
gives an 2x2xl-dimensional array instead. In order to solve this, I could call np.moveaxis(m, 2, 0), but I am looking for a direct solution that doesn't need to change the order of axes of a (potentially huge) array. This also only sets the axis-order right if I'm passing one-dimensional arrays as values for my matrix, not if they're higher dimensional.
Is there a general and efficient way of vectorizing the creation of matrices?
Let's try a 2d (4d after joining) case:
In [374]: ones = np.ones((3,4),int)
In [375]: arr = np.array([[ones*0, ones],[ones*2, ones*3]])
In [376]: arr
Out[376]:
array([[[[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]],

        [[1, 1, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]],


       [[[2, 2, 2, 2],
         [2, 2, 2, 2],
         [2, 2, 2, 2]],

        [[3, 3, 3, 3],
         [3, 3, 3, 3],
         [3, 3, 3, 3]]]])
In [377]: arr.shape
Out[377]: (2, 2, 3, 4)
Notice that the original array elements are 'together'. arr has its own databuffer, with copies of the original arrays, but it was made with relatively efficient block copies.
We can easily transpose axes:
In [378]: arr.transpose(2,3,0,1)
Out[378]:
array([[[[0, 1],
         [2, 3]],

        [[0, 1],
         [2, 3]],
        ...
        [[0, 1],
         [2, 3]]]])
Now it's 12 (2,2) arrays. It is a view, using arr's databuffer. It just has a different shape and strides. Doing this transpose is quite efficient, and isn't any slower when arr is very big. And a lot of math on the transposed array will be nearly as efficient as on the original arr (because of strided iteration). If there are differences in speed it will be because of caching at a deep level.
But some actions will require a copy. For example the transposed array can't be raveled without a copy. The original 0s,1s etc are no longer together.
In [379]: arr.transpose(2,3,0,1).ravel()
Out[379]:
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
       2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
       0, 1, 2, 3])
I could construct the same 1d array with
In [380]: tarr = np.empty((3,4,2,2), int)
In [381]: tarr[...,0,0] = ones*0
In [382]: tarr[...,0,1] = ones*1
In [383]: tarr[...,1,0] = ones*2
In [384]: tarr[...,1,1] = ones*3
In [385]: tarr.ravel()
Out[385]:
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
       2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
       0, 1, 2, 3])
This tarr is effectively what you are trying to produce 'directly'.
Another way to look at this construction, is to assign the values to the array's .flat with strides - insert 0s at every 4th slot, 1s at the adjacent ones, etc.:
In [386]: tarr.flat[0::4] = ones*0
In [387]: tarr.flat[1::4] = ones*1
In [388]: tarr.flat[2::4] = ones*2
In [389]: tarr.flat[3::4] = ones*3
Here's another 'direct' way - use np.stack (a version of concatenate) to create a (3,4,4) array, which can then be reshaped:
np.stack((ones*0,ones*1,ones*2,ones*3),2).reshape(3,4,2,2)
That stack is, in essence:
In [397]: ones1 = ones[...,None]
In [398]: np.concatenate((ones1*0, ones1*1, ones1*2, ones1*3),axis=2)
Notice that this target (3,4,2,2) could be reshaped to (12,4) (and vice versa) at no cost. So the original problem becomes: is it easier to construct a (4,12) and transpose, or construct the (12,4) first? It's really a 2d problem, not an (m+n)d one.
np.matrix must be a 2D array. From the numpy documentation of np.matrix:
Returns a matrix from an array-like object, or from a string of data.
A matrix is a specialized 2-D array that retains its 2-D nature
through operations. It has certain special operators, such as *
(matrix multiplication) and ** (matrix power).
Note
It is no longer recommended to use this class, even for linear
algebra. Instead use regular arrays. The class may be removed in the
future.
Is there any reason you want np.matrix? Most numpy operations should be doable with the plain array object, as the matrix class is quasi-deprecated.
From your example I tried using the transpose (.T) method:
zeros = np.zeros(10)
ones = np.ones(10)
twos = np.ones(10) * 2
threes = np.ones(10) * 3
m = np.array([[zeros, ones], [twos, threes]]).T
>> array([[0,2],[1,3]],...)
or
m = np.transpose(np.array([[zeros, ones], [twos, threes]]), (2,0,1))
>> array([[0,1],[2,3]],...)
Either of these yields a (10, 2, 2) array.
I have a 3d numpy array and I obtain the indices that meet a certain condition, for example:
a = np.tile([[1,2],[3,4]],(2,2,2))
indices = np.where(a == 2)
To these indices I need to apply an offset, for example (0, 0, 1), and check whether the shifted positions meet another condition.
Something like this:
offset = [0, 0, 1]
indices_shift = indices + offset
count = 0
for i in indices_shift:
    if a[i] == 3:
        count += 1
In this example, with the offset of (0,0,1), the indices look like:
indices = (array([0, 0, 0, 0, 1, 1, 1, 1], dtype=int64), array([0, 0, 2, 2, 0, 0, 2, 2], dtype=int64), array([1, 3, 1, 3, 1, 3, 1, 3], dtype=int64))
and I think that after adding the offset the result should be something like:
indices_shift = (array([0, 0, 0, 0, 1, 1, 1, 1], dtype=int64), array([0, 0, 2, 2, 0, 0, 2, 2], dtype=int64), array([2, 4, 2, 4, 2, 4, 2, 4], dtype=int64))
Is there any easy way to do that?
Thanks.
Here's one approach -
idx = np.argwhere(a == 2)+[0,0,1]
valid_mask = (idx < a.shape).all(1)
valid_idx = idx[valid_mask]
count = np.count_nonzero(a[tuple(valid_idx.T)] == 3)
Steps :
Get the indices for matches against 2. Use np.argwhere here to get those in a nice 2D array with each column representing an axis. Another benefit is that this makes it generic enough to handle arrays with any number of dimensions. Then, add the offset in a broadcasted manner. This is idx.
Among the indices in idx, there could be a few invalid ones that go beyond the array shape. So, get a validity mask valid_mask and hence the valid indices valid_idx among them.
Finally, index into the input array with those, compare against 3, and count the number of matches.
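One caveat, as a hedged extension: for offsets with a negative component, the validity mask should also check the lower bound, e.g.:
offset = np.array([0, 0, -1])                       # hypothetical offset with a negative component
idx = np.argwhere(a == 2) + offset
valid_mask = ((idx >= 0) & (idx < a.shape)).all(1)  # check both lower and upper bounds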