Numpy efficient indexing with varied size arrays - python

Take a look at this piece of code:
import numpy as np
a = np.random.random(10)
indicies = [
    np.array([1, 4, 3]),
    np.array([2, 5, 8, 7, 3]),
    np.array([1, 2]),
    np.array([3, 2, 1]),
]
result = np.zeros(2)
result[0] = a[indicies[0]].sum()
result[1] = a[indicies[2]].sum()
Is there any way to get result more efficiently? In my case a is a very large array.
In other words, I want to select elements from a with several index arrays of varying size and then sum over each selection in one operation, resulting in a single array.

With your a and indicies list:
In [280]: [a[i].sum() for i in indicies]
Out[280]:
[1.3986792680307709,
2.6354365193743732,
0.83324677494990895,
1.8195179021311731]
Which of course could be wrapped in np.array().
For a subset of the indicies items use:
In [281]: [a[indicies[i]].sum() for i in [0,2]]
Out[281]: [1.3986792680307709, 0.83324677494990895]
A comment suggests indicies comes from an adjacency matrix, possibly sparse.
I could recreate such an array with:
In [289]: A=np.zeros((4,10),int)
In [290]: for i in range(4): A[i,indicies[i]]=1
In [291]: A
Out[291]:
array([[0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
and use a matrix product (np.dot) to do the selection and sum:
In [292]: A.dot(a)
Out[292]: array([ 1.39867927, 2.63543652, 0.83324677, 1.8195179 ])
A[[0,2],:].dot(a) would use a subset of rows.
A sparse matrix version (using scipy.sparse) stores that list of row indices:
In [294]: Al=sparse.lil_matrix(A)
In [295]: Al.rows
Out[295]: array([[1, 3, 4], [2, 3, 5, 7, 8], [1, 2], [1, 2, 3]], dtype=object)
And a matrix product with that gives the same numbers:
In [296]: Al*a
Out[296]: array([ 1.39867927, 2.63543652, 0.83324677, 1.8195179 ])
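Another way to do the whole job in one vectorized pass is to concatenate the index arrays, gather from a once, and sum each group with np.add.reduceat. A minimal sketch (assuming every index array is non-empty):
import numpy as np

a = np.random.random(10)
indicies = [np.array([1, 4, 3]),
            np.array([2, 5, 8, 7, 3]),
            np.array([1, 2]),
            np.array([3, 2, 1])]

# concatenate all index arrays, gather from a once,
# then sum each contiguous group of gathered values
flat = np.concatenate(indicies)
starts = np.cumsum([0] + [len(idx) for idx in indicies[:-1]])
result = np.add.reduceat(a[flat], starts)
# result[i] == a[indicies[i]].sum() for every i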

If your array a is very large, you might run into memory issues when looping if your list of index arrays contains many arrays with many indices each.
To avoid this issue, use an iterator instead of a list:
indices = iter(indices)
and then loop through your iterator.

Related

Python Numpy - Create 2d array where length is based on 1D array

Sorry for the confusing title, but I'm not sure how to make it more concise. Here are my requirements:
arr1 = np.array([3,5,9,1])
arr2 = ?(arr1)
arr2 would then be:
[
  [0, 1, 2, 0, 0, 0, 0, 0, 0],
  [0, 1, 2, 3, 4, 0, 0, 0, 0],
  [0, 1, 2, 3, 4, 5, 6, 7, 8],
  [0, 0, 0, 0, 0, 0, 0, 0, 0]
]
It doesn't need to vary based on the max; the shape is known in advance. So to start, I've been able to get a shape of zeros:
arr2 = np.zeros((len(arr1),max_len))
And then of course I could do a for loop over arr1 like this:
for i, element in enumerate(arr1):
    arr2[i, 0:element] = np.arange(element)
but that would likely take a long time, and both dimensions here are rather large (arr1 has a few million rows, max_len is around 500). Is there a clean, optimized way to do this in numpy?
Building on a 'padding' idea posted by @Divakar some years ago:
In [161]: res = np.arange(9)[None,:].repeat(4,0)
In [162]: res[res>=arr1[:,None]] = 0
In [163]: res
Out[163]:
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
Try this with itertools.zip_longest -
import numpy as np
import itertools
l = map(range, arr1)
arr2 = np.column_stack((itertools.zip_longest(*l, fillvalue=0)))
print(arr2)
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
I am adding a slight variation on @hpaulj's answer because you mentioned that max_len is around 500 and you have millions of rows. In this case, you can precompute a 500 by 500 matrix containing all possible rows and index into it using arr1:
import numpy as np
np.random.seed(0)
max_len = 500
arr = np.random.randint(0, max_len, size=10**5)
# generate all unique rows first, then index
# can be faster if max_len << len(arr)
# 53 ms
template = np.tril(np.arange(max_len)[None,:].repeat(max_len,0), k=-1)
res = template[arr,:]
# 173 ms
res1 = np.arange(max_len)[None,:].repeat(arr.size,0)
res1[res1>=arr[:,None]] = 0
assert (res == res1).all()

How to get "2d indices" of 2d ndarray ordered by value of elements

I have a 2d ndarray of ndarrays, which looks like:
array([[array([0]), array([0, 1]), array([0, 1]), None, None, array([0, 1])],
       [array([0, 1]), None, array([0, 1]), array([0, 1]), array([0, 1]), None],
       [None, None, array([0, 1]), None, None, None],
       [array([0, 1]), array([0, 1]), None, array([0, 1]), array([0, 1]), array([0, 1])],
       [array([0, 1]), None, None, None, array([0, 1]), None],
       [array([0, 1]), None, array([0, 1]), array([0, 1]), array([0, 1]), None]], dtype=object)
My aim is to get the indices of the elements, sorted by len(element), skipping the elements which are None. Like:
array([[0,0], --> len= 1
[0,1], --> len=2
[0,2], --> len=2
[0,5], ...
[1,0],
[1,2],
[1,3],
[1,4],
... ])
I have tried first converting the elements to their lengths, which gives something like:
array([[1, 2, 2, 0, 0, 2],
       [2, 0, 2, 2, 2, 0],
       [0, 0, 2, 0, 0, 0],
       [2, 2, 0, 2, 2, 2],
       [2, 0, 0, 0, 2, 0],
       [2, 0, 2, 2, 2, 0]], dtype=object)
However, I couldn't find an efficient way to generate the list of indices (an ndarray would also do).
Please help me with this; I'd appreciate anyone who can solve this problem or give me a clue.
Edit:
I have found a close but not perfect solution:
Due to the data constraints, the elements of lenArray can only take 3 values: 1, 2, inf.
So I can take advantage of this to do:
ones = np.column_stack(np.where(lenArray==1))
twos = np.column_stack(np.where(lenArray==2))
infs = np.column_stack(np.where(lenArray==inf))
sort_lenidx = np.concatenate((ones,twos,infs))
sort_lenidx then matches my needs.
However, this is neither very general (if the number of possible values is large, this approach is useless) nor an elegant way to solve my problem. I still hope someone can suggest a better way.
I would appreciate help in any form.
Let's call the array containing lengths lenArray.
The right way of doing this would be to create another 2d array, rowNcol, that contains the row and column indices of lenArray as elements, then implement a sorting algorithm on lenArray and perform identical operations on rowNcol to finally obtain the desired array of indices.
That being said, you could exploit the fact that we know the type (int) and range of the elements inside lenArray beforehand, and simply iterate through the possible values in the following manner:
from numpy import array, amax

lenArray = array([[1, 2, 2, 0, 0, 2],
                  [2, 0, 2, 2, 2, 0],
                  [0, 0, 2, 0, 0, 0],
                  [2, 2, 0, 2, 2, 2],
                  [2, 0, 0, 0, 2, 0],
                  [2, 0, 2, 2, 2, 0]])
rows, cols = lenArray.shape
lenMax = amax(lenArray)
# start at 1 so the length-0 (None) cells are skipped, and include lenMax itself
for lenVal in range(1, lenMax + 1):
    for i in range(rows):
        for j in range(cols):
            if lenArray[i, j] == lenVal:
                print(str(i) + ',' + str(j))
This is, however, extremely inefficient if lenArray is very large, since you scan it over and over.
Edit: I later came across numpy.argsort which appears to do exactly what you want.
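For example, a minimal sketch of that argsort route (assuming lenArray already holds the lengths, with 0 standing in for the None cells):
import numpy as np

lenArray = np.array([[1, 2, 2, 0, 0, 2],
                     [2, 0, 2, 2, 2, 0],
                     [0, 0, 2, 0, 0, 0],
                     [2, 2, 0, 2, 2, 2],
                     [2, 0, 0, 0, 2, 0],
                     [2, 0, 2, 2, 2, 0]])

# argsort the flattened lengths, then map flat positions back to (row, col)
order = np.argsort(lenArray, axis=None, kind='stable')
rows, cols = np.unravel_index(order, lenArray.shape)
idx = np.column_stack((rows, cols))

# drop the cells whose length is 0 (the original None entries)
sort_lenidx = idx[lenArray[rows, cols] > 0]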

convert numpy array of arrays to 2d array

I have a pandas series features that has the following values (features.values)
array([array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), ...,
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0])], dtype=object)
Now I really want this to be recognized as a matrix, but if I do
>>> features.values.shape
(10000,)
rather than (10000, 3000), which is what I would expect.
How can I get this to be recognized as 2d rather than as a 1d array with arrays as values? Also, why does it not automatically detect it as a 2d array?
In response to your comment question, let's compare 2 ways of creating an array.
First make an array from a list of arrays (all same length):
In [302]: arr = np.array([np.arange(3), np.arange(1,4), np.arange(10,13)])
In [303]: arr
Out[303]:
array([[ 0,  1,  2],
       [ 1,  2,  3],
       [10, 11, 12]])
The result is a 2d array of numbers.
If instead we make an object dtype array, and fill it with arrays:
In [304]: arr = np.empty(3,object)
In [305]: arr[:] = [np.arange(3), np.arange(1,4), np.arange(10,13)]
In [306]: arr
Out[306]:
array([array([0, 1, 2]), array([1, 2, 3]), array([10, 11, 12])],
      dtype=object)
Notice that this display is like yours. This is, by design, a 1d array. Like a list, it contains pointers to arrays elsewhere in memory. Notice that it requires an extra construction step. The default behavior of np.array is to create a multidimensional array where it can.
It takes extra effort to get around that. Likewise, it takes some extra effort to undo that - to create the 2d numeric array.
Simply calling np.array on it does not change the structure.
In [307]: np.array(arr)
Out[307]:
array([array([0, 1, 2]), array([1, 2, 3]), array([10, 11, 12])],
      dtype=object)
stack does change it to 2d. stack treats it as a list of arrays, which it joins on a new axis.
In [308]: np.stack(arr)
Out[308]:
array([[ 0,  1,  2],
       [ 1,  2,  3],
       [10, 11, 12]])
Shortening @hpaulj's answer:
your_2d_array = np.stack(arr_of_arr_object)
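A small self-contained illustration of what that one-liner does (with a made-up Series of equal-length arrays, since the original features data isn't shown):
import numpy as np
import pandas as pd

# a Series whose .values is a 1d object array holding equal-length arrays
features = pd.Series([np.zeros(3000, dtype=int) for _ in range(5)])
print(features.values.shape)      # (5,)        -- 1d object array

features_2d = np.stack(features.values)
print(features_2d.shape)          # (5, 3000)   -- proper 2d numeric array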

Get array of indices of first zero in every row of numpy array

I have a numpy array of 1650 rows and 1275 columns containing 0s and 255s.
I want to get the index of the first zero in every row and store those indices in an array.
I used a for loop to achieve that. Here is the example code:
# new_arr is a numpy array and k is an empty list
for i in range(new_arr.shape[0]):
    if not np.all(new_arr[i, :] == 255):
        x = np.where(new_arr[i, :] == 0)[0][0]
        k.append(x)
    else:
        k.append(-1)
It takes around 1.3 seconds for 1650 rows. Is there any other way or function to get the indices array in a much faster way?
One approach would be to get a mask of matches with ==0 and then take argmax along each row, i.e. argmax(axis=1), which gives us the first matching index for each row -
(arr==0).argmax(axis=1)
Sample run -
In [443]: arr
Out[443]:
array([[0, 1, 0, 2, 2, 1, 2, 2],
       [1, 1, 2, 2, 2, 1, 0, 1],
       [2, 1, 0, 1, 0, 0, 2, 0],
       [2, 2, 1, 0, 1, 2, 1, 0]])
In [444]: (arr==0).argmax(axis=1)
Out[444]: array([0, 6, 2, 3])
Catching non-zero rows (if we can!)
To account for rows that don't have any zero, we need one more step of work, with some masking -
In [445]: arr[2] = 9
In [446]: arr
Out[446]:
array([[0, 1, 0, 2, 2, 1, 2, 2],
       [1, 1, 2, 2, 2, 1, 0, 1],
       [9, 9, 9, 9, 9, 9, 9, 9],
       [2, 2, 1, 0, 1, 2, 1, 0]])
In [447]: mask = arr==0
In [448]: np.where(mask.any(1), mask.argmax(1), -1)
Out[448]: array([ 0, 6, -1, 3])
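As a usage sketch at the scale mentioned in the question (with randomly generated 0/255 stand-in data, since the real new_arr isn't shown):
import numpy as np

np.random.seed(0)
# stand-in for the real data: 1650 x 1275, mostly 255s with a few 0s
new_arr = np.where(np.random.random((1650, 1275)) < 0.01, 0, 255)

mask = new_arr == 0
# first zero per row, or -1 for rows that contain no zero at all
k = np.where(mask.any(axis=1), mask.argmax(axis=1), -1)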

NumPy: sort matrix rows by number of non-zero entries

import numpy as np

def calc_size(matrix, index):
    return np.nonzero(matrix[index, :])[1].size

def swap_rows(matrix, frm, to):
    matrix[[frm, to], :] = matrix[[to, frm], :]
NumPy, Python 2.7.
How can I sort the matrix's rows by the number of nonzero entries? I already wrote these two methods for doing the work, but how do I hand them to a sorting routine? The fullest rows should be at the beginning!
If you have an array arr:
array([[0, 0, 0, 0, 0],
       [1, 0, 1, 1, 1],
       [0, 1, 0, 1, 1],
       [1, 1, 1, 1, 1]])
You could sort the array's rows according to the number of zero entries by writing:
>>> arr[(arr == 0).sum(axis=1).argsort()]
array([[1, 1, 1, 1, 1],
       [1, 0, 1, 1, 1],
       [0, 1, 0, 1, 1],
       [0, 0, 0, 0, 0]])
This first counts the number of zero entries in each row with (arr == 0).sum(axis=1): this produces the array [5, 1, 2, 0].
Next, argsort sorts the indices of this array by their corresponding value, giving [3, 1, 2, 0].
Lastly, this argsorted array is used to rearrange the rows of arr.
P.S. If you have a matrix m (and not an array), you may need to ravel before using argsort:
m[(m == 0).sum(axis=1).ravel().argsort()]
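A closely related sketch that counts the nonzero entries directly and sorts them in descending order, so the fullest rows come first:
import numpy as np

arr = np.array([[0, 0, 0, 0, 0],
                [1, 0, 1, 1, 1],
                [0, 1, 0, 1, 1],
                [1, 1, 1, 1, 1]])

# count nonzeros per row; negate so argsort places the fullest rows first
order = np.argsort(-np.count_nonzero(arr, axis=1), kind='stable')
print(arr[order])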
