Efficiently index 2d numpy array using two 1d arrays - python

I have a large 2d numpy array and two 1d arrays that represent x/y indexes within the 2d array. I want to use these 1d arrays to perform an operation on the 2d array.
I can do this with a for loop, but it's very slow when working on a large array. Is there a faster way? I tried using the 1d arrays simply as indexes but that didn't work. See this example:
import numpy as np
# Two example 2d arrays
cnt_a = np.zeros((4,4))
cnt_b = np.zeros((4,4))
# 1d arrays holding x and y indices
xpos = [0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3]
ypos = [3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0]
# This method works, but is very slow for a large array
for i in range(0,len(xpos)):
    cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
# This method is fast, but gives incorrect answer
cnt_b[xpos,ypos] = cnt_b[xpos,ypos]+1
# Print the results
print('Good:')
print(cnt_a)
print('')
print('Bad:')
print(cnt_b)
The output from this is:
Good:
[[ 2. 1. 2. 1.]
[ 0. 3. 1. 2.]
[ 1. 1. 1. 1.]
[ 1. 0. 0. 0.]]
Bad:
[[ 1. 1. 1. 1.]
[ 0. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 0. 0. 0.]]
For the cnt_b array numpy is obviously not summing correctly, but I'm unsure how to fix this without resorting to the (v. inefficient) for loop used to calculate cnt_a.

Another approach, using 1D indexing (suggested by @Shai) and extended to answer the actual question:
>>> out = np.zeros((4, 4))
>>> idx = np.ravel_multi_index((xpos, ypos), out.shape) # extract 1D indexes
>>> x = np.bincount(idx, minlength=out.size)
>>> out.flat += x
np.bincount counts how many times each flat index occurs among the (xpos, ypos) pairs and stores the counts in x.
Or, as suggested by @Divakar:
>>> out.flat += np.bincount(idx, minlength=out.size)
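For reference, here is a self-contained sketch of the bincount approach, checked against the slow loop from the question. The names counts and expected are only illustrative, and the 4x4 shape is taken from the question's example.

```python
import numpy as np

xpos = [0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
ypos = [3, 2, 1, 1, 3, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 2, 0]

out = np.zeros((4, 4))
# Flat (row-major) position of every (x, y) pair in a 4x4 array.
idx = np.ravel_multi_index((xpos, ypos), out.shape)
# Count how often each flat position occurs; minlength pads up to out.size.
counts = np.bincount(idx, minlength=out.size)
out += counts.reshape(out.shape)

# Reference result from the slow per-element loop.
expected = np.zeros((4, 4))
for x, y in zip(xpos, ypos):
    expected[x, y] += 1

print(np.array_equal(out, expected))
```

Note that bincount needs minlength=out.size here, otherwise the count vector stops at the largest index that actually occurs and the reshape fails.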

We could compute the linear indices, then accumulate into zeros-initialized output array with np.add.at. Thus, with xpos and ypos as arrays, here's one implementation -
m,n = xpos.max()+1, ypos.max()+1
out = np.zeros((m,n),dtype=int)
np.add.at(out.ravel(), xpos*n+ypos, 1)
Sample run -
In [95]: # 1d arrays holding x and y indices
...: xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
...: ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])
...:
In [96]: cnt_a = np.zeros((4,4))
In [97]: # This method works, but is very slow for a large array
...: for i in range(0,len(xpos)):
...:     cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
...:
In [98]: m,n = xpos.max()+1, ypos.max()+1
...: out = np.zeros((m,n),dtype=int)
...: np.add.at(out.ravel(), xpos*n+ypos, 1)
...:
In [99]: cnt_a
Out[99]:
array([[ 2., 1., 2., 1.],
[ 0., 3., 1., 2.],
[ 1., 1., 1., 1.],
[ 1., 0., 0., 0.]])
In [100]: out
Out[100]:
array([[2, 1, 2, 1],
[0, 3, 1, 2],
[1, 1, 1, 1],
[1, 0, 0, 0]])
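As a possible simplification (a sketch, not part of the answer above): np.add.at also accepts a tuple of index arrays, so the linear indices are not strictly needed. The example also shows why the plain fancy-index update from the question undercounts. The name buffered is only illustrative.

```python
import numpy as np

xpos = np.array([0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
ypos = np.array([3, 2, 1, 1, 3, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 2, 0])

# Buffered fancy-index update: a repeated (x, y) pair is only written once,
# so no cell can end up larger than 1.
buffered = np.zeros((4, 4), dtype=int)
buffered[xpos, ypos] += 1

# Unbuffered accumulation: every occurrence of a pair is added.
out = np.zeros((4, 4), dtype=int)
np.add.at(out, (xpos, ypos), 1)

print(buffered.max())
print(out)
```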

You can iterate over both lists at once and increment the cell for each pair (if you are not used to it, zip combines the lists element by element):
for x, y in zip(xpos, ypos):
    cnt_b[x][y] += 1
But this will be about the same speed as your solution A.
If your lists xpos/ypos are of length n, I don't see how you can update your matrix in less than O(n), since you'll have to look at each pair one way or another.
Other solution: you could count the identical index pairs (e.g. (0, 3)), possibly with collections.Counter, and update the matrix with the count values. But I doubt it would be much faster, since the time gained on updating the matrix would be lost on counting the repeated occurrences.
Maybe I am totally wrong though, in which case I'd be curious to see an answer that is not O(n).
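The Counter idea mentioned above could look roughly like this. It is only a sketch: the names cnt and pair_counts are made up for the illustration, and it is still a Python-level loop, just over the distinct pairs instead of all pairs.

```python
from collections import Counter

import numpy as np

xpos = [0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
ypos = [3, 2, 1, 1, 3, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 2, 0]

cnt = np.zeros((4, 4))
# Count identical (x, y) pairs first...
pair_counts = Counter(zip(xpos, ypos))
# ...then write each distinct pair into the matrix once.
for (x, y), k in pair_counts.items():
    cnt[x, y] += k

print(cnt)
```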

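I think you are looking for the ravel_multi_index function:
lidx = np.ravel_multi_index((xpos, ypos), cnt_a.shape)
This converts the (x, y) pairs to "flattened" 1D indices into cnt_a and cnt_b. Because these index the flattened array, accumulate through a flat view:
np.add.at(cnt_b.ravel(), lidx, 1)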
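To make it concrete what ravel_multi_index returns, here is a small sketch with a shortened, made-up index list. For a C-ordered array the flat index is simply row * n_columns + column. The names manual and cnt are only illustrative.

```python
import numpy as np

xpos = np.array([0, 1, 1, 3])
ypos = np.array([3, 1, 1, 0])
shape = (4, 4)

lidx = np.ravel_multi_index((xpos, ypos), shape)
# Same thing by hand for a row-major (C-ordered) array.
manual = xpos * shape[1] + ypos

cnt = np.zeros(shape)
# ravel() returns a view for a contiguous array, so the update reaches cnt.
np.add.at(cnt.ravel(), lidx, 1)

print(lidx)
print(cnt)
```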

Related

Combining two arrays of unequal sizes and storing in a third array as such

I want to combine two arrays of different sizes into a third array(which is an array of arrays).
I have tried to use np.concatenate function as well as the np.append function but am not getting the desired results.
A=[1. 1. 0.]
B=[0. 1.]
y2=np.concatenate((yl, yr))
Expected Result [[0 1],[1 1 0]]
What are you trying to do with the result?
Just use a list if you want to iterate over the array elements.
import numpy as np
A=np.array([1., 1., 0.])
B=np.array([0., 1.])
y2=[A,B]
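You can use np.array here. Pass dtype=object, since the inner arrays have different lengths and recent NumPy versions raise a ValueError for ragged input otherwise.
import numpy as np
A = np.array([1., 1., 0.])
B = np.array([0., 1.])
y2 = np.array([A, B], dtype=object)
print(repr(y2))
#output:- array([array([1., 1., 0.]), array([0., 1.])], dtype=object)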
And for your expected output, you need to convert A and B to list:-
v = np.vectorize(int)
y2 = [v(B).tolist(), v(A).tolist()]
print(y2)
#output:- [[0, 1], [1, 1, 0]]

numpy rearranging when mixing integer and list indices [duplicate]

Original question
I am getting a very odd error message when I try to assign some of the elements of an array. I am using a combination of a slice and a set of indices. See the following simple example.
import scipy as sp
a = sp.zeros((3, 4, 5))
b = sp.ones((4, 5))
I = sp.array([0, 1, 3])
b[:, I] = a[0, :, I]
This code raises the following ValueError:
ValueError: shape mismatch: value array of shape (3,4) could not be broadcast to indexing result of shape (3,4)
--
Follow up
Be careful when using a combination of a slice and seq. of integers. As pointed out on github:
x = rand(3, 5, 7)
print(x[0, :, [0,1]].shape)
# (2, 5)
print(x[0][:, [0,1]].shape)
# (5, 2)
This is how numpy is designed to work, but it is nevertheless a bit confusing that x[0][:, I] is not the same as x[0, :, I]. Since this is the behavior I want I choose to use x[0][:, I] in my code.
Looks like there are some errors in copying your code to question.
But I suspect there's a known problem with indexing:
In [73]: a=np.zeros((2,3,4)); b=np.ones((3,4)); I=np.array([0,1])
Make I 2 elements. Indexing b gives the expected (3,2) shape. 3 rows from the slice, 2 columns from I indexing
In [74]: b[:,I].shape
Out[74]: (3, 2)
But with 3d a we get the transpose.
In [75]: a[0,:,I].shape
Out[75]: (2, 3)
and assignment would produce an error
In [76]: b[:,I]=a[0,:,I]
...
ValueError: array is not broadcastable to correct shape
It's putting the 2 element dimension defined by I first, and the 3 element from : second. It's a case of mixed advanced indexing that has been discussed earlier - and there's a bug issue as well. (I'll have to look those up).
You are probably using a newer numpy (or scipy) and getting a different error message.
It's documented that indexing with two arrays or lists, and slice in the middle, puts the slice at the end, e.g.
In [86]: a[[[0],[0],[1],[1]],:,[0,1]].shape
Out[86]: (4, 2, 3)
The same thing is happening with a[0,:,[0,1]]. But there's a good argument that it shouldn't be this way.
As to a fix, you could transpose a value, or change the indexing
In [88]: b[:,I]=a[0:1,:,I]
In [90]: b[:,I]=a[0,:,I].T
In [91]: b
Out[91]:
array([[ 0., 0., 1., 1.],
[ 0., 0., 1., 1.],
[ 0., 0., 1., 1.]])
In [92]: b[:,I]=a[0][:,I]
https://github.com/numpy/numpy/issues/7030
https://github.com/numpy/numpy/pull/6256
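A small self-contained sketch of the shapes discussed above, using made-up data (an arange-filled a instead of zeros, so the copied values are visible). The names shape_mixed and shape_two_step are only illustrative.

```python
import numpy as np

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)
b = np.ones((3, 4), dtype=a.dtype)
I = np.array([0, 1])

# The integer 0 and the array I are separated by a slice, so the
# dimension coming from I is moved to the front of the result.
shape_mixed = a[0, :, I].shape
# Indexing in two steps keeps the slice dimension first.
shape_two_step = a[0][:, I].shape

# Either form of the fix assigns the same values.
b[:, I] = a[0][:, I]

print(shape_mixed, shape_two_step)
print(b)
```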
First of all it looks like you're missing a comma on the line 6:
I = sp.array([0,1,4])
Secondly, I would expect the value 4 in the array I to raise an IndexError, since both a and b have a max dimension of 4. I suspect you might want:
I = sp.array([0,1,3])
With these changes the program ran for me, and I got b as:
[[ 0. 0. 1. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 1. 0.]]
Which I suspect is what you want.
Here I get this error with indices [0,1,4]:
IndexError: index 4 is out of bounds for axis 2 with size 4
It's suggesting the value 4 is being used as an index, while the SIZE 4 implies the max index would be 3.
EDIT: now that you changed it to [0, 1, 3], it's running fine here.
EDIT: with your current code, I get the same error, but when I print the arrays themselves, their shapes are transposed relative to each other:
print(b[:, I])
print(a[0, :, I])
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]

numpy classification comparison with 3d array

I'm trying to do some basic classification of numpy arrays...
I want to compare a 2d array against a 3d array, along the 3rd dimension, and make a classification based on the corresponding z-axis values.
so given 3 arrays that are stacked into a 3d array:
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
and another 2d array
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
I want to be able to compare a2d against a3d, and return a 2d array of which level of a3d is closest. (or I suppose any custom function that can compare each value along the z-axis, and return a value base on that comparison.)
EDIT
I modified my arrays to more closely match my data. a1 would be the minimum values, a2 the average values, and a3 the maximum values. So I want to output if each a2d value is closer to a1 (classed "1") a2 (classed "2") or a3 (classed "3"). I'm doing as a 3d array because in the real data, it won't be a simple 3-array choice, but for SO purposes, it helps to keep it simple. We can assume that in the case of a tie, we'll take the lower, so 2 would be classed as level "1", 4 as level "2".
You can use the following list comprehension :
>>> [sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,a2d) for i in a3d]]
[30.0, 22.5, 30.0]
In the preceding code I create the following list with zip, which pairs each sub-array of your 3d array with the 2d array. Then all you need is to sum the absolute differences of the elements of each pair, and sum those again:
>>> [zip(i,a2d) for i in a3d]
[[(array([ 1., 3., 1.]), array([1, 2, 1])), (array([ 2., 2., 1.]), array([5, 5, 4])), (array([ 3., 1., 1.]), array([9, 8, 8]))], [(array([ 4., 6., 4.]), array([1, 2, 1])), (array([ 5. , 6.5, 4. ]), array([5, 5, 4])), (array([ 6., 4., 4.]), array([9, 8, 8]))], [(array([ 7., 9., 7.]), array([1, 2, 1])), (array([ 8., 8., 7.]), array([5, 5, 4])), (array([ 9., 7., 7.]), array([9, 8, 8]))]]
Then for all of your sub-arrays you'll have the following list:
[30.0, 22.5, 30.0]
Each entry shows how far that sub-array is from the 2d array. With that list stored as l, you can get the closest sub-array from a3d like this:
>>> a3d[l.index(min(l))]
array([[ 4. , 6. , 4. ],
[ 5. , 6.5, 4. ],
[ 6. , 4. , 4. ]])
Also you can put it in a function:
>>> def find_nearest(sub,main):
...     l=[sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,sub) for i in main]]
...     return main[l.index(min(l))]
...
>>> find_nearest(a2d,a3d)
array([[ 4. , 6. , 4. ],
[ 5. , 6.5, 4. ],
[ 6. , 4. , 4. ]])
You might consider a different approach using numpy.vectorize, which lets you conveniently apply a Python function to each element of your array (it is a convenience wrapper that still loops in Python, so don't expect a speed-up from it).
In this case, your python function could just classify each pixel with whatever breaks you define:
import numpy as np
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
def classify(x):
    if x >= 4:
        return 3
    elif x >= 2:
        return 2
    elif x > 0:
        return 1
    else:
        return 0
vclassify = np.vectorize(classify)
result = vclassify(a2d)
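If the breaks are fixed thresholds like these, the same classification can probably be done without a Python-level function at all. This is a sketch using np.select, a swapped-in technique rather than the method from the answer above. The conditions are checked in order, like the if/elif chain.

```python
import numpy as np

a2d = np.array([[1, 2, 4], [5, 5, 2], [2, 3, 3]])

# First matching condition wins, mirroring the if/elif order of classify().
conditions = [a2d >= 4, a2d >= 2, a2d > 0]
choices = [3, 2, 1]
result = np.select(conditions, choices, default=0)

print(result)
```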
Thanks to @perrygeo and @Kasra - they got me thinking in a good direction.
Since I want a classification of the closest 3d array's z value, I couldn't do simple math - I needed the (z)index of the closest value.
I did it by enumerating both axes of the 2d array, and doing a proximity compare against the corresponding (z)index of the 3d array.
There might be a way to do this without iterating the 2d array, but at least I'm avoiding iterating the 3d.
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
classOut = np.empty_like(a2d)
def find_nearest_idx(array,value):
    idx = (np.abs(array-value)).argmin()
    return idx
# enumerate to get indices
for i,a in enumerate(a2d):
    for ii,v in enumerate(a):
        valStack = a3d[i,ii]
        nearest = find_nearest_idx(valStack,v)
        classOut[i,ii] = nearest
print(classOut)
which gets me
[[0 0 1]
[2 2 0]
[0 1 1]]
This tells me that (for example) a2d[0,0] is closest to the 0-index of a3d[0,0], which in my case means it is closest to the min value for that 2d position. a2d[1,1] is closest to the 2-index, which in my case means closer to the max value for that 2d position.
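The double loop can likely be avoided with broadcasting. A minimal sketch, assuming the same arrays as above (the name class_out is only illustrative): adding a trailing axis to a2d lets it be compared against every level of a3d at once, and argmin along the last axis picks the closest level. Ties go to the lower index, as requested in the question.

```python
import numpy as np

a1 = np.full((3, 3), 1)
a2 = np.full((3, 3), 3)
a3 = np.full((3, 3), 5)
a3d = np.dstack((a1, a2, a3))          # shape (3, 3, 3), levels on the last axis
a2d = np.array([[1, 2, 4], [5, 5, 2], [2, 3, 3]])

# a2d[:, :, np.newaxis] has shape (3, 3, 1) and broadcasts against a3d.
# argmin returns the first minimum, so a tie resolves to the lower level.
class_out = np.abs(a3d - a2d[:, :, np.newaxis]).argmin(axis=2)

print(class_out)
```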

Scipy Sparse Matrix special subtraction

I'm doing a project and I'm doing a lot of matrix computation in it.
I'm looking for a smart way to speed up my code. In my project, I'm dealing with a sparse matrix of size 100Mx1M with around 10M non-zero values. The example below is just to illustrate my point.
Let's say I have:
A vector v of size (2)
A vector c of size (3)
A sparse matrix X of size (2,3)
import numpy as np
import scipy.sparse
from scipy.sparse import coo_matrix
v = np.asarray([10, 20])
c = np.asarray([ 2, 3, 4])
data = np.array([1, 1, 1, 1])
row = np.array([0, 0, 1, 1])
col = np.array([1, 2, 0, 2])
X = coo_matrix((data,(row,col)), shape=(2,3))
X.todense()
# matrix([[0, 1, 1],
# [1, 0, 1]])
Currently I'm doing:
result = np.zeros_like(v)
d = scipy.sparse.lil_matrix((v.shape[0], v.shape[0]))
d.setdiag(v)
tmp = d * X
print(tmp.todense())
#matrix([[ 0., 10., 10.],
# [ 20., 0., 20.]])
# At this point tmp is csr sparse matrix
for i in range(tmp.shape[0]):
    x_i = tmp.getrow(i)
    result += x_i.data * ( c[x_i.indices] - x_i.data)
    # I only want to do the subtraction on non-zero elements
print(result)
# array([-430, -380])
And my problem is the for loop and especially the subtraction.
I would like to find a way to vectorize this operation by subtracting only on the non-zero elements.
Something to get directly the sparse matrix on the subtraction:
matrix([[ 0., -7., -6.],
[ -18., 0., -16.]])
Is there a way to do this smartly ?
You don't need to loop over the rows to do what you are already doing. And you can use a similar trick to perform the multiplication of the rows by the first vector:
import scipy.sparse as sps
# number of nonzero entries per row of X
nnz_per_row = np.diff(X.indptr)
# multiply every row by the corresponding entry of v
# You could do this in-place as:
# X.data *= np.repeat(v, nnz_per_row)
Y = sps.csr_matrix((X.data * np.repeat(v, nnz_per_row), X.indices, X.indptr),
                   shape=X.shape)
# subtract from the non-zero entries the corresponding column value in c...
Y.data -= np.take(c, Y.indices)
# ...and multiply by -1 to get the value you are after
Y.data *= -1
To see that it works, set up some dummy data
rows, cols = 3, 5
v = np.random.rand(rows)
c = np.random.rand(cols)
X = sps.rand(rows, cols, density=0.5, format='csr')
and after running the code above:
>>> x = X.toarray()
>>> mask = x == 0
>>> x *= v[:, np.newaxis]
>>> x = c - x
>>> x[mask] = 0
>>> x
array([[ 0.79935123, 0. , 0. , -0.0097763 , 0.59901243],
[ 0.7522559 , 0. , 0.67510109, 0. , 0.36240006],
[ 0. , 0. , 0.72370725, 0. , 0. ]])
>>> Y.toarray()
array([[ 0.79935123, 0. , 0. , -0.0097763 , 0.59901243],
[ 0.7522559 , 0. , 0.67510109, 0. , 0.36240006],
[ 0. , 0. , 0.72370725, 0. , 0. ]])
The way you are accumulating your result requires that there are the same number of non-zero entries in every row, which seems a pretty weird thing to do. Are you sure that is what you are after? If that's really what you want you could get that value with something like:
result = np.sum(Y.data.reshape(Y.shape[0], -1), axis=0)
but I have trouble believing that is really what you are after...
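For completeness, here is a sketch of the same idea applied to the small example from the question, which should reproduce the matrix the asker was hoping for. It assumes X is in CSR format (the question builds a COO matrix, so it would need X.tocsr() first). The names scaled and diff are only illustrative.

```python
import numpy as np
import scipy.sparse as sps

v = np.array([10, 20])
c = np.array([2, 3, 4])
X = sps.csr_matrix(np.array([[0, 1, 1],
                             [1, 0, 1]]))

# Number of stored entries in each row of X.
nnz_per_row = np.diff(X.indptr)
# Scale every stored entry by the v value of its row.
scaled = X.data * np.repeat(v, nnz_per_row)
# Subtract only at the stored (non-zero) positions: c[column] - scaled value.
diff = c[X.indices] - scaled
Y = sps.csr_matrix((diff, X.indices, X.indptr), shape=X.shape)

print(Y.toarray())
```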

NumPy array indexing a 2D matrix

I have a small issue while working on some big data. But for now, let's assume I've got a NumPy array filled with zeros
>>> x = np.zeros((3,3))
>>> x
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Now I want to change some of these zeros with specific values. I've given the index of the cells I want to change.
>>> y = np.array([[0,0],[1,1],[2,2]])
>>> y
array([[0, 0],
[1, 1],
[2, 2]])
And I've got an array with the desired (for now random) numbers, as follow
>>> z = np.array(np.random.rand(3))
>>> z
array([ 0.04988558, 0.87512891, 0.4288157 ])
So now I thought I can do the following:
>>> x[y] = z
But then it's filling the whole array like this
>>> x
array([[ 0.04988558, 0.87512891, 0.4288157 ],
[ 0.04988558, 0.87512891, 0.4288157 ],
[ 0.04988558, 0.87512891, 0.4288157 ]])
But I was hoping to get
>>> x
array([[ 0.04988558, 0, 0 ],
[ 0, 0.87512891, 0 ],
[ 0, 0, 0.4288157 ]])
EDIT
So far I've used a diagonal index, but what if my index is not just the diagonal? I was hoping the following works:
>>> y = np.array([[0,1],[1,2],[2,0]])
>>> x[y] = z
>>> x
array([[ 0, 0.04988558, 0 ],
[ 0, 0, 0.87512891 ],
[ 0.4288157, 0, 0 ]])
But it's filling the whole array just like above
Array indexing works a bit differently on multidimensional arrays.
If you have a vector, you can access the first three elements by using
x[np.array([0,1,2])]
but when you're using this on a matrix, it will return the first three rows. At first sight, using
x[np.array([[0,0],[1,1],[2,2]])]
sounds reasonable. However, NumPy array indexing works differently: every integer in the index array is still treated as a row index, and the result has the shape of your index array with each entry replaced by the selected row.
To properly access 2D matrices you have to split both components into two separate arrays:
x[np.array([0,1,2]), np.array([0,1,2])]
This will fetch all elements on the main diagonal of your matrix. Assignment using this method is possible, too:
x[np.array([0,1,2]), np.array([0,1,2])] = 1
So to access the elements you've mentioned in your edit, you have to do the following:
x[np.array([0,1,2]), np.array([1,2,0])]
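If the indices already live in an (N, 2) array like y from the question, the two components can be taken from its columns. A minimal sketch, with fixed values in z instead of random ones so the result is predictable:

```python
import numpy as np

x = np.zeros((3, 3))
y = np.array([[0, 1], [1, 2], [2, 0]])   # one (row, column) pair per line
z = np.array([0.1, 0.2, 0.3])

# Column 0 of y holds the row indices, column 1 the column indices.
x[y[:, 0], y[:, 1]] = z

print(x)
```

Each value in z lands at the position given by the matching pair in y, and all other cells stay zero.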
