SciPy Sparse Array: Get index for a data point

I am creating a csr sparse array (because I have a lot of empty elements/cells) that I need to use forwards and backwards. That is, I need to input two indices and get the element that corresponds to it ( matrix[0][9]=34) but I also need to be able to get the indices upon just knowing the value is 34. The elements in my array will all be unique. I have looked all over for an answer regarding this, but have not found one, or may have not understood it was what I was looking for if I did! I'm quite new to python, so if you could make sure to let me know what the functions you find do and the steps to retrieve the indices for the element, I would very much appreciate it!
Thanks in advance!

Here's a way of finding a specific value that works for both NumPy arrays and sparse matrices:
In [118]: import numpy as np; from scipy import sparse
In [119]: A=sparse.csr_matrix(np.arange(12).reshape(3,4))
In [120]: A==10
Out[120]:
<3x4 sparse matrix of type '<class 'numpy.bool_'>'
with 1 stored elements in Compressed Sparse Row format>
In [121]: (A==10).nonzero()
Out[121]: (array([2], dtype=int32), array([2], dtype=int32))
In [122]: (A.toarray()==10).nonzero()
Out[122]: (array([2], dtype=int32), array([2], dtype=int32))

You can use the nonzero method:
In [44]: from scipy.sparse import csr_matrix
In [45]: a = np.arange(50).reshape(5, 10)
In [46]: a
Out[46]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])
In [47]: s = csr_matrix(a)
In [48]: s
Out[48]:
<5x10 sparse matrix of type '<type 'numpy.int64'>'
with 49 stored elements in Compressed Sparse Row format>
In [49]: (s == 36).nonzero()
Out[49]: (array([3], dtype=int32), array([6], dtype=int32))
In general, it often works to try a method which worked on a numpy array. This does not always work, but at least here it just did (and I learned something new today).
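For the two-way lookup the question asks about, here is a minimal sketch. The helper name find_value and the dictionary index_of are illustrative and not part of SciPy. It assumes, as the question states, that the stored values are unique, and it rejects 0 because zeros are implicit in a sparse matrix and have no stored position. Note that a sparse matrix is indexed as s[row, col] rather than s[row][col].

```python
import numpy as np
from scipy.sparse import csr_matrix


def find_value(s, value):
    """Return (rows, cols) of the stored entries of s equal to value.

    Hypothetical helper, not a SciPy function.
    """
    if value == 0:
        # Zeros are implicit in a sparse matrix, so they have no stored position.
        raise ValueError("cannot look up 0 in a sparse matrix")
    coo = s.tocoo()                  # COO exposes parallel row, col, data arrays
    hit = coo.data == value
    return coo.row[hit], coo.col[hit]


a = np.arange(50).reshape(5, 10)
s = csr_matrix(a)

rows, cols = find_value(s, 36)       # reverse lookup: value -> indices
forward = s[3, 6]                    # forward lookup: indices -> value

# For many reverse lookups, build a value -> (row, col) dictionary once.
coo = s.tocoo()
index_of = dict(zip(coo.data.tolist(),
                    zip(coo.row.tolist(), coo.col.tolist())))
```

The dictionary costs extra memory but makes each reverse lookup constant time, whereas (s == value).nonzero() scans the stored data on every call.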

Related

Numpy indexing by range of arrays

Say I have an array myarr such that myarr.shape = (2,64,64,2). Now if I define myarr2 = myarr[[0,1,0,0,1],...], then the following is true
myarr2.shape #(5,64,64,2)
myarr2[0,...] == myarr[0,...] # = True
myarr2[1,...] == myarr[1,...] # = True
myarr2[2,...] == myarr[0,...] # = True
...
Can this be generalized so the slices are arrays? That is, is there a way to make the following hypothetical code work?
myarr2 = myarr[...,[20,30,40]:[30,40,50],[15,25,35]:[25,35,45],...]
myarr2[0,] == myarr[...,20:30,15:25,...] # = True
myarr2[1,] == myarr[...,30:40,25:35,...] # = True
myarr2[2,] == myarr[...,40:50,35:45,...] # = True

You can feed the coordinates of the subarrays to a loop that cuts them out of myarray. I don't know how you store the indices of the subarrays, so I put them into a nested list idx_list:
idx_list = [[[20,30,40],[30,40,50]],[[15,25,35],[25,35,45]]] # [row starts, row stops], [col starts, col stops]; assuming 2D cutouts
idx_array = np.array([k for i in idx_list for j in i for k in j]) # unpack into a flat array
idx_array = idx_array.reshape(4,-1).T # one (a, b, c, d) row per cutout
myarray2 = np.array([myarray[a:b,c:d] for a,b,c,d in idx_array]) # cut and combine

Let's simplify the problem a bit; first by removing the two outer dimensions that don't affect the core indexing issue; and by reducing the size so we can see and understand the results.
The setup
In [540]: arr = np.arange(7*7).reshape(7,7)
In [541]: arr
Out[541]:
array([[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27],
[28, 29, 30, 31, 32, 33, 34],
[35, 36, 37, 38, 39, 40, 41],
[42, 43, 44, 45, 46, 47, 48]])
In [542]: idx =np.array([[0,2,4,6],[1,3,5,7]])
Now a straightforward iteration approach:
In [543]: alist = []
     ...: for i in range(idx.shape[1]-1):
     ...:     j,k = idx[:,i]
     ...:     sub = arr[j:j+2, k:k+2]
     ...:     alist.append(sub)
     ...:
In [544]: np.array(alist)
Out[544]:
array([[[ 1, 2],
[ 8, 9]],
[[17, 18],
[24, 25]],
[[33, 34],
[40, 41]]])
In [545]: _.shape
Out[545]: (3, 2, 2)
I simplified the iteration from:
     ...: for i in range(idx.shape[1]-1):
     ...:     sub = arr[idx[0,i]:idx[0,i+1],idx[1,i]:idx[1,i+1]]
     ...:     alist.append(sub)
to highlight the fact that we are generating ranges of a consistent size, and make the next transformation more obvious.
So I start with a (7,7) array, and create 3 (2,2) slices.
As I demonstrated in Slicing a different range at each index of a multidimensional numpy array, we can use linspace to expand a set of slices, or ranges.
In [567]: ranges = np.linspace(idx[:,:3],idx[:,:3]+1,2).astype(int)
In [568]: ranges
Out[568]:
array([[[0, 2, 4],
[1, 3, 5]],
[[1, 3, 5],
[2, 4, 6]]])
So ranges[0] expands on the idx[0] slices, etc. But if I simply index with these I get 'diagonal' values from Out[544]:
In [569]: arr[ranges[0], ranges[1]]
Out[569]:
array([[ 1, 17, 33],
[ 9, 25, 41]])
to get blocks I have to add a dimension to the first indices:
In [570]: arr[ranges[0,:,None], ranges[1]]
Out[570]:
array([[[ 1, 17, 33],
[ 2, 18, 34]],
[[ 8, 24, 40],
[ 9, 25, 41]]])
These are the same values as in Out[544], but need to be transposed:
In [571]: _.transpose(2,0,1)
Out[571]:
array([[[ 1, 2],
[ 8, 9]],
[[17, 18],
[24, 25]],
[[33, 34],
[40, 41]]])
The code's a bit clunky and needs to be generalized, but it gives the general idea of how one can substitute a single indexing operation for the iterative one, provided the slices are regular enough (here, all the same size). For this small example it probably isn't faster, but it will probably come out ahead as the problem size gets larger.
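One way the idea might be generalized is sketched below. The function name extract_blocks and its parameters are illustrative, and it assumes every block has the same height and width and lies fully inside the array. It builds the row and column index ranges with broadcasting, so no transpose is needed afterwards.

```python
import numpy as np


def extract_blocks(arr, row_starts, col_starts, height, width):
    """Stack equally sized 2-D blocks of arr into one (n, height, width) array.

    Hypothetical helper: block i is arr[r:r+height, c:c+width] for the
    i-th pair (r, c) of row_starts and col_starts.
    """
    row_starts = np.asarray(row_starts)
    col_starts = np.asarray(col_starts)
    rows = row_starts[:, None] + np.arange(height)    # shape (n, height)
    cols = col_starts[:, None] + np.arange(width)     # shape (n, width)
    # Broadcast (n, height, 1) against (n, 1, width) -> (n, height, width)
    return arr[rows[:, :, None], cols[:, None, :]]


arr = np.arange(7 * 7).reshape(7, 7)
blocks = extract_blocks(arr, [0, 2, 4], [1, 3, 5], height=2, width=2)
```

Here blocks[i] holds the same values as the slice arr[r:r+2, c:c+2] for the i-th start pair, matching the result of the loop.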

Numpy Dot Product of two 2-d arrays in numpy to get 3-d array

Sorry for the badly explained title. I am trying to parallelise a part of my code and got stuck on a dot product. I am looking for an efficient way of doing what the code below does. I'm sure there is a simple linear algebra solution, but I'm very stuck:
import numpy as np
puy = np.arange(8).reshape(2,4)
puy2 = np.arange(12).reshape(3,4)
print(puy, '\n')
print(puy2.T)
zz = np.zeros([4,2,3])
for i in range(4):
    # outer product of column i of puy with column i of puy2
    zz[i,:,:] = np.dot(np.array([puy[:,i]]).T,
                       np.array([puy2.T[i,:]]))

One way would be to use np.einsum, which allows you to specify what you want to happen to the indices:
>>> np.einsum('ik,jk->kij', puy, puy2)
array([[[ 0, 0, 0],
[ 0, 16, 32]],
[[ 1, 5, 9],
[ 5, 25, 45]],
[[ 4, 12, 20],
[12, 36, 60]],
[[ 9, 21, 33],
[21, 49, 77]]])
>>> np.allclose(np.einsum('ik,jk->kij', puy, puy2), zz)
True

Here's another way with broadcasting -
(puy[None,...]*puy2[:,None,:]).T
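As a quick check, the sketch below rebuilds the loop result with np.outer and compares it against both vectorized versions. The variable names via_einsum and via_broadcast are illustrative, and the broadcasting line is a slightly different spelling of the same idea, transposing the inputs first instead of the product.

```python
import numpy as np

puy = np.arange(8).reshape(2, 4)
puy2 = np.arange(12).reshape(3, 4)

# Reference: one outer product per shared column index k
zz = np.zeros((4, 2, 3))
for k in range(4):
    zz[k] = np.outer(puy[:, k], puy2[:, k])

# einsum: keep k as the leading axis, with no summation over any index
via_einsum = np.einsum('ik,jk->kij', puy, puy2)

# broadcasting: (4, 2, 1) * (4, 1, 3) -> (4, 2, 3)
via_broadcast = puy.T[:, :, None] * puy2.T[:, None, :]

same = np.array_equal(via_einsum, zz) and np.array_equal(via_broadcast, zz)
```

Each output element is puy[i, k] * puy2[j, k], so nothing is actually summed. That is why plain broadcasting is enough here.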

Numpy, how to get a sub matrix with boolean slicing

I have a question: how to get a sub matrix like a sub array by boolean slicing?
For example:
a2 = np.array(np.arange(30).reshape(5, 6))
a2[a2[:, 1] > 10]
will give me:
array([[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]])
but:
m2 = np.mat(np.arange(30).reshape(5, 6))
m2[m2[:, 1] > 10]
will give me:
matrix([[12, 18, 24]])
Why is the output different, and how can I get the same result from the matrix as from the array?
Thank you!

The issue you're experiencing comes down to the fact that operations on a matrix always return a 2-dimensional result.
When you build the mask on the first array, you get:
In [24]: a2[:,1] > 10
Out[24]: array([False, False, True, True, True], dtype=bool)
which, as you can see, is a 1-dimensional array.
When you do the same thing with the matrix, you get:
In [25]: m2[:,1] > 10
Out[25]:
matrix([[False],
[False],
[ True],
[ True],
[ True]], dtype=bool)
In other words, you have an n×1 matrix, not an array of length n.
Indexing in numpy operates differently depending on whether you're indexing with a one-dimensional or an n-dimensional array.
In your first case, numpy will treat the array of length n as row indices, so you'll get the expected result:
In [28]: a2[a2[:,1] > 10]
Out[28]:
array([[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]])
In the second case, because you have a 2-dimensional index array, numpy has enough information to extract both the row and the column, and so it only grabs things from the matching column (the first one):
In [29]: m2[m2[:,1] > 10]
Out[29]: matrix([[12, 18, 24]])
To answer your question: you can get this behaviour by converting your masks to an array and grabbing the first column, to extract your initial array of length n:
In [32]: m2[np.array(m2[:,1] > 10)[:,0]]
Out[32]:
matrix([[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]])
Alternatively, you could do the conversion first, getting the same result as before:
In [34]: np.array(m2)[:,1] > 10
Out[34]: array([False, False, True, True, True], dtype=bool)
Now, both of those fixes require conversions between matrices and arrays, which can be pretty ugly.
The question I'd ask is why you wish to use a matrix, and yet expect the behaviour of an array.
It could be that the right tool for your job is actually an array, not a matrix.

If you flatten the boolean mask like:
m2[np.asarray(m2[:,1]>10).flatten()]
you get the same result, but I would recommend using np.array instead of np.matrix for the reasons given in this answer.
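Both suggestions can be put side by side in a short sketch. The variable names mask, sub_matrix and sub_array are illustrative. It uses np.matrix directly, on the assumption that the matrix class is still available in your NumPy version.

```python
import numpy as np

m2 = np.matrix(np.arange(30).reshape(5, 6))

# Option 1: keep the matrix, but flatten the (5, 1) boolean mask to length 5
mask = np.asarray(m2[:, 1] > 10).ravel()
sub_matrix = m2[mask]                 # still an np.matrix, shape (3, 6)

# Option 2: work with a plain ndarray from the start
a2 = np.asarray(m2)
sub_array = a2[a2[:, 1] > 10]         # ndarray, shape (3, 6)
```

Both options select rows 2 to 4. The second avoids the conversion step entirely, which is one practical argument for plain arrays.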

New array of smaller size excluding one value from each column

In Python 2.7, using numpy or any other means, if I had an array of any size and wanted to exclude certain values and output the new array, how would I do that? Here is what I would like:
[(1,2,3),
 (4,5,6),
 (7,8,9)]
then exclude [4,2,9] to make the array
[(1,5,3),
 (7,8,6)]
I would always be excluding data of the same length as the row length, and always only one entry per column. [(1,5,3)] would be another example of data I would want to exclude. So every time I loop the function, it reduces the array row count by one. I imagine I have to use a masked array, or convert my mask to a masked array and subtract the two, then maybe condense the output, but I have no idea how. Thanks for your time.

You can do it very efficiently if you transform your 2-D array into an unraveled 1-D array. Then you repeat the array of elements to be excluded, called e, in order to do an element-wise comparison:
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
e = [1, 5, 3]
ar = a.T.ravel()
er = np.repeat(e, a.shape[0])
ans = ar[er != ar].reshape(a.shape[1], a.shape[0]-1).T
But it will only work if each element of e matches exactly one entry in its column of a.
EDIT:
As suggested by @Jaime, you can avoid the ravel() and get the same result directly:
ans = a.T[(a != e).T].reshape(a.shape[1], a.shape[0]-1).T

To exclude vector e from matrix a:
import numpy as np
a = np.array([(1,2,3), (4,5,6), (7,8,9)])
e = [4,2,9]
print(np.array([ [ i for i in a.transpose()[j] if i != e[j] ]
                for j in range(len(e)) ]).transpose())

This would take some work to generalize, but here's something that can handle 2-d cases of the kind you describe. If passed unexpected input, this won't notice and will generate strange results, but it's at least a starting point:
import numpy

def columnwise_compress(a, values):
    rows, cols = a.shape
    a_trans_flat = a.transpose().reshape(-1)
    compressed = a_trans_flat[~numpy.in1d(a_trans_flat, values)]
    # one value is dropped per column, leaving rows - 1 entries in each
    return compressed.reshape(cols, rows - 1).transpose()
Tested:
>>> columnwise_compress(numpy.arange(9).reshape(3, 3) + 1, [4, 2, 9])
array([[1, 5, 3],
[7, 8, 6]])
>>> columnwise_compress(numpy.arange(9).reshape(3, 3) + 1, [1, 5, 3])
array([[4, 2, 6],
[7, 8, 9]])
The difficulty is that you're asking for "compression" of a kind that numpy.compress doesn't do (removing different values for each column or row) and you're asking for compression along columns instead of rows. Compressing along rows is easier because it moves along the natural order of the values in memory; you might consider working with transposed arrays for that reason. If you want to do that, things become a bit simpler:
>>> a = numpy. array([[1, 4, 7],
... [2, 5, 8],
... [3, 6, 9]])
>>> a[~numpy.in1d(a, [4, 2, 9]).reshape(3, 3)].reshape(3, 2)
array([[1, 7],
[5, 8],
[3, 6]])
You'll still need to handle shape parameters intelligently if you do it this way, but it will still be simpler. Also, this assumes there are no duplicates in the original array; if there are, this could generate wrong results. Saullo's excellent answer partially avoids the problem, but any value-based approach isn't guaranteed to work unless you're certain that there aren't duplicate values in the columns.

In the spirit of @SaulloCastro's answer, but handling multiple occurrences of items, you can remove the first occurrence in each column by doing the following:
import numpy as np

def delete_skew_row(a, b):
    rows, cols = a.shape
    # row of the first match in each column
    row_to_remove = np.argmax(a == b, axis=0)
    items_to_remove = np.ravel_multi_index((row_to_remove,
                                            np.arange(cols)),
                                           a.shape, order='F')
    ret = np.delete(a.T, items_to_remove)
    return np.ascontiguousarray(ret.reshape(cols, rows-1).T)
rows, cols = 5, 10
a = np.random.randint(100, size=(rows, cols))
b = np.random.randint(rows, size=(cols,))
b = a[b, np.arange(cols)]
>>> a
array([[50, 46, 85, 82, 27, 41, 45, 27, 17, 26],
[92, 35, 14, 34, 48, 27, 63, 58, 14, 18],
[90, 91, 39, 19, 90, 29, 67, 52, 68, 69],
[10, 99, 33, 58, 46, 71, 43, 23, 58, 49],
[92, 81, 64, 77, 61, 99, 40, 49, 49, 87]])
>>> b
array([92, 81, 14, 82, 46, 29, 67, 58, 14, 69])
>>> delete_skew_row(a, b)
array([[50, 46, 85, 34, 27, 41, 45, 27, 17, 26],
[90, 35, 39, 19, 48, 27, 63, 52, 68, 18],
[10, 91, 33, 58, 90, 71, 43, 23, 58, 49],
[92, 99, 64, 77, 61, 99, 40, 49, 49, 87]])
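The same first-occurrence removal can also be sketched with a boolean mask instead of np.delete. The function name delete_first_match_per_column is illustrative. Like the version above, it assumes every b[j] actually occurs in column j; if it does not, np.argmax returns 0 and row 0 of that column is silently removed.

```python
import numpy as np


def delete_first_match_per_column(a, b):
    """Drop the first entry equal to b[j] from each column j of a.

    Hypothetical helper; assumes b[j] is present in column j.
    """
    rows, cols = a.shape
    row_to_remove = np.argmax(a == b, axis=0)            # first match per column
    keep = np.arange(rows)[:, None] != row_to_remove     # (rows, cols) mask
    # Select column by column so each column keeps its own rows - 1 entries
    return a.T[keep.T].reshape(cols, rows - 1).T


a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
result = delete_first_match_per_column(a, np.array([4, 2, 9]))

# Duplicates within a column: only the first occurrence is removed
dup = np.array([[1, 1],
                [1, 2],
                [3, 1]])
dup_result = delete_first_match_per_column(dup, np.array([1, 1]))
```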

Python, neighbors on a regular grid

Let's suppose I have a set of 2D coordinates that represent the centers of cells of a 2D regular mesh. I would like to find, for each cell in the grid, the two closest neighbors in each direction.
The problem is quite straightforward if one assigns to each cell an index defined as follows:
idx_cell = idx+N*idy
where N is the total number of cells in the grid, idx=x/dx and idy=y/dx, with x and y being the x-coordinate and the y-coordinate of a cell and dx its size.
For example, the neighboring cells for a cell with idx_cell=5 are the cells with idx_cell equal to 4,6 (for the x-axis) and 5+N,5-N (for the y-axis).
The problem that I have is that my implementation of the algorithm is quite slow for large (N>1e6) data sets.
For instance, to get the neighbors of the x-axis I do
[x[(idx_cell==idx_cell[i]-1)|(idx_cell==idx_cell[i]+1)] for i in cells]
Do you think there's a faster way to implement this algorithm?

You are basically reinventing the indexing scheme of a multidimensional array. It is relatively easy to code, but you can use the two functions unravel_index and ravel_multi_index to your advantage here.
If your grid is of M rows and N columns, to get the idx and idy of a single item you could do:
>>> M, N = 12, 10
>>> np.unravel_index(4, (M, N))
(0, 4)
This also works if, instead of a single index, you provide an array of indices:
>>> np.unravel_index([15, 28, 32, 97], (M, N))
(array([1, 2, 3, 9], dtype=int64), array([5, 8, 2, 7], dtype=int64))
So if cells has the indices of several cells you want to find neighbors to:
>>> cells = np.array([15, 28, 32, 44, 87])
You can get their neighbors as:
>>> idy, idx = np.unravel_index(cells, (M, N))
>>> neigh_idx = np.vstack((idx-1, idx+1, idx, idx))
>>> neigh_idy = np.vstack((idy, idy, idy-1, idy+1))
>>> np.ravel_multi_index((neigh_idy, neigh_idx), dims=(M,N))
array([[14, 27, 31, 43, 86],
[16, 29, 33, 45, 88],
[ 5, 18, 22, 34, 77],
[25, 38, 42, 54, 97]], dtype=int64)
Or, if you prefer it like that:
>>> np.ravel_multi_index((neigh_idy, neigh_idx), dims=(M,N)).T
array([[14, 16, 5, 25],
[27, 29, 18, 38],
[31, 33, 22, 42],
[43, 45, 34, 54],
[86, 88, 77, 97]], dtype=int64)
The nicest thing about going this way is that ravel_multi_index has a mode keyword argument you can use to handle items on the edges of your lattice, see the docs.
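As a small illustration of the mode argument, the sketch below looks up the left and right neighbors of three edge cells on the same 12 by 10 grid. The cell choice and variable names are illustrative. With mode='wrap' an out-of-range index wraps around to the other side of the grid (periodic boundaries). With mode='clip' it is clamped to the nearest valid index, so an edge cell is returned as its own neighbor.

```python
import numpy as np

M, N = 12, 10
# Top-left corner, top-right corner, bottom-right corner
cells = np.array([0, 9, 119])
idy, idx = np.unravel_index(cells, (M, N))

# Periodic boundaries: column -1 wraps to N - 1, column N wraps to 0
left_wrap = np.ravel_multi_index((idy, idx - 1), (M, N), mode='wrap')
right_wrap = np.ravel_multi_index((idy, idx + 1), (M, N), mode='wrap')

# Clamped boundaries: out-of-range columns are clipped to 0 or N - 1
left_clip = np.ravel_multi_index((idy, idx - 1), (M, N), mode='clip')
right_clip = np.ravel_multi_index((idy, idx + 1), (M, N), mode='clip')
```

With the default mode='raise', the same calls would fail with a ValueError for these edge cells, which is often the safest choice while debugging.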
