Updating a coo matrix on the fly with scipy - python

I have not found a solution to this problem after searching the site. It's quite simple: I would like to update an already existing coo sparse matrix. So let's say I have initialized a coo matrix:
from scipy.sparse import coo_matrix
import numpy as np
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
a=coo_matrix((data, (row, col)), shape=(4, 4)).toarray()
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])
Fine, but what if I just want an empty sparse matrix, initialized with only the shape, and then update the values many times? The only way I have succeeded is to add a new coo matrix to my old one:
a=coo_matrix((4, 4), dtype=np.int8)
a=a+coo_matrix((data, (row, col)), shape=(4, 4))
a.toarray()
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])
I would like to update this sparse matrix many times, but this takes quite a while since I am calling the coo_matrix constructor for each update. There has to be a better way, but I feel like the documentation is a little light (at least what I have read), or I am just not seeing it.
Thanks very much

When you make a coo matrix this way, it uses your input arrays as the attributes of the matrix (provided they are the correct type):
In [923]: row = np.array([0, 3, 1, 0])
...: col = np.array([0, 3, 1, 2])
...: data = np.array([4, 5, 7, 9])
...: A=sparse.coo_matrix((data, (row, col)), shape=(4, 4))
In [924]: A
Out[924]:
<4x4 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in COOrdinate format>
In [925]: A.row
Out[925]: array([0, 3, 1, 0])
In [926]: id(A.row)
Out[926]: 3071951160
In [927]: id(row)
Out[927]: 3071951160
Similarly for A.col, and A.data.
For display and calculations the matrix will probably be converted to csr format, since many of those operations are not defined for a coo format.
And as you've no doubt seen, coo format does not implement indexing, either for fetching or setting.
lil format is designed for easier incremental changes. Indexed changes to csr are also possible, but changing the sparsity structure that way issues an efficiency warning.
But coo is often used for building new matrices. For example, in the bmat function, the coo attributes of the component matrices are combined into new arrays, which are then used to construct a new coo matrix.
A good way of building a coo incrementally is to keep concatenating new values to your row, col, and data arrays, and then periodically build a new coo from those.
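The concatenation idea above could be sketched roughly as follows. This is only a minimal illustration: the accumulator lists and the add_entries helper are hypothetical names, not part of scipy. It relies on the fact that coordinates repeated in the input are summed when the coo matrix is converted to a dense array (or to csr).

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical accumulators; plain Python lists are cheap to extend.
rows, cols, vals = [], [], []

def add_entries(r, c, v):
    """Queue a batch of (row, col, value) triplets without building a matrix."""
    rows.extend(r)
    cols.extend(c)
    vals.extend(v)

add_entries([0, 3], [0, 3], [4, 5])
add_entries([1, 0], [1, 2], [7, 9])
add_entries([0], [0], [1])  # repeats coordinate (0, 0)

# Build the matrix once, after all batches have been collected.
A = coo_matrix((vals, (rows, cols)), shape=(4, 4))

# Repeated coordinates are summed on conversion, so (0, 0) becomes 4 + 1.
dense = A.toarray()
```

The constructor is called once per rebuild rather than once per update, which is where the time saving would come from.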
On updating a dok format:
How to incrementally create an sparse matrix on python?
putting column into empty sparse matrix
creating a scipy.lil_matrix using a python generator efficiently

I first thought that the coo_matrix is immutable, because it doesn't support any indexing, nor indexed assignment. Turns out you can directly mutate the underlying structure of your empty sparse matrix:
from scipy.sparse import coo_matrix
import numpy as np
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
a = coo_matrix((4, 4), dtype=np.int8)
print(a.toarray())
a.row = row
a.col = col
a.data = data
print(a.toarray())
That being said, there might be other sparse formats that are more suitable for this approach.
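As one possible alternative, here is a small sketch using lil_matrix, which does support indexed assignment. The values reuse the question's example; the extra update to (0, 0) is an assumption added only to show an in-place change.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Start from an empty matrix defined only by shape and dtype.
a = lil_matrix((4, 4), dtype=np.int64)

# Indexed assignment is supported by the lil format.
a[0, 0] = 4
a[3, 3] = 5
a[1, 1] = 7
a[0, 2] = 9

# Read-modify-write on a single element also works.
a[0, 0] += 1

# Convert to coo (or csr) once the updates are done.
result = a.tocoo()
dense = result.toarray()
```

Whether this is faster than collecting triplets depends on the update pattern, so it is worth timing both on real data.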

Related

How can I retrieve indptr, indices from scipy csr_matrix?

According to the documentation of scipy.sparse.csr_matrix, there are many ways to create a csr_matrix. One of the ways is to give data, indptr, and indices as inputs. I wonder, if there is a way to retrieve them in the opposite direction, i.e., assume that I have my_csr_matrix created as below:
>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> my_csr_matrix = csr_matrix((data, indices, indptr), shape=(3, 3))
>>> my_csr_matrix.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
The question is, how can I retrieve indptr and indices from my_csr_matrix without knowing the information about them or data before-hand explicitly?
Literally just my_csr_matrix.indptr and my_csr_matrix.indices. You can also get the data array with my_csr_matrix.data.
These attributes are documented further down the page, under the "Attributes" heading.
Note that these are the actual underlying arrays used by the sparse matrix representation. Modifying these arrays will modify the sparse matrix, unless you do something that causes the CSR matrix to allocate new underlying arrays. (For example, my_csr_matrix[1, 1] = 7 would change the sparsity structure of the matrix and require allocating new arrays.)
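A short sketch of both points, using the matrix from the question: reading one row's entries through indptr and indices, and changing stored values in place through data. The variable names other than the three attributes are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
m = csr_matrix((data, indices, indptr), shape=(3, 3))

# The stored entries of row i live in the slice indptr[i]:indptr[i + 1].
i = 2
start, end = m.indptr[i], m.indptr[i + 1]
row_cols = m.indices[start:end].copy()  # column indices of row 2
row_vals = m.data[start:end].copy()     # stored values of row 2

# Modifying the data array in place changes the matrix itself,
# because the sparsity structure stays the same.
m.data *= 10
dense = m.toarray()
```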

Fill numpy array with other numpy array

I have following numpy arrays:
whole = np.array(
[1, 0, 3, 0, 6]
)
sparse = np.array(
[9, 8]
)
Now I want to replace every zero in the whole array, in order of appearance, with the items in the sparse array. In the example my desired array would look like:
merged = np.array(
[1, 9, 3, 8, 6]
)
I could write a small algorithm myself to do this, but if someone knows a time-efficient way to solve this I would be very grateful for your help!
Do you assume that sparse has the same length as the number of zeros in whole?
If so, you can do:
import numpy as np
from copy import copy
whole = np.array([1, 0, 3, 0, 6])
sparse = np.array([9, 8])
merge = copy(whole)
merge[whole == 0] = sparse
If the lengths mismatch, you have to restrict to the correct length using len(...) and slicing.
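For the mismatched case, one way to do the slicing is sketched below. It assumes that any surplus zeros should simply stay zero and any surplus replacement values should be ignored; the example arrays are made up so that the lengths differ.

```python
import numpy as np

whole = np.array([1, 0, 3, 0, 6, 0])  # three zeros
sparse = np.array([9, 8])             # only two replacement values

merged = whole.copy()

# Positions of the zeros, in order of appearance.
zero_pos = np.flatnonzero(whole == 0)

# Fill only as many zeros as there are replacement values (and vice versa).
n = min(len(zero_pos), len(sparse))
merged[zero_pos[:n]] = sparse[:n]
```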

form numpy array from possible numpy array

EDIT
I realized that I did not check my MWE very well and as such asked somewhat the wrong question. The main problem is when the numpy array is passed in as a 2d array instead of 1d (or even when a python list is passed in as 2d instead of 1d). So if we have
x = np.array([[1], [2], [3]])
then obviously if you try to index this you will get arrays out (if you use item you do not). The same thing also applies to standard python lists.
Sorry about the confusion.
Original
I am trying to form a new numpy array from something that may be a numpy array or may be a standard python list.
for example
import numpy as np
x = [2, 3, 1]
y = np.array([[0, -x[2], x[1]], [x[2], 0, -x[0]], [-x[1], x[0], 0]])
Now I would like to form a function such that I can make y easily.
def skew(vector):
    """
    this function returns a numpy array with the skew symmetric cross product matrix for vector.
    the skew symmetric cross product matrix is defined such that
    np.cross(a, b) = np.dot(skew(a), b)
    :param vector: An array like vector to create the skew symmetric cross product matrix for
    :return: A numpy array of the skew symmetric cross product vector
    """
    return np.array([[0, -vector[2], vector[1]],
                     [vector[2], 0, -vector[0]],
                     [-vector[1], vector[0], 0]])
This works great and I can now write (assuming the above function is included)
import numpy as np
x=[2, 3, 1]
y = skew(x)
However, I would also like to be able to call skew on existing 1d or 2d numpy arrays. For instance
import numpy as np
x = np.array([2, 3, 1])
y = skew(x)
Unfortunately, doing this returns a numpy array where the elements are also numpy arrays, not python floats as I would like them to be.
Is there an easy way to form a new numpy array like I have done from something that is either a python list or a numpy array and have the result be just a standard numpy array with floats in each element?
Now obviously one solution is to check to see if the input is a numpy array or not:
def skew(vector):
    """
    this function returns a numpy array with the skew symmetric cross product matrix for vector.
    the skew symmetric cross product matrix is defined such that
    np.cross(a, b) = np.dot(skew(a), b)
    :param vector: An array like vector to create the skew symmetric cross product matrix for
    :return: A numpy array of the skew symmetric cross product vector
    """
    if isinstance(vector, np.ndarray):
        return np.array([[0, -vector.item(2), vector.item(1)],
                         [vector.item(2), 0, -vector.item(0)],
                         [-vector.item(1), vector.item(0), 0]])
    else:
        return np.array([[0, -vector[2], vector[1]],
                         [vector[2], 0, -vector[0]],
                         [-vector[1], vector[0], 0]])
However, it gets very tedious having to write these instance checks all over the place.
Another solution would be to cast everything to an array first and then just use the array call
def skew(vector):
    """
    this function returns a numpy array with the skew symmetric cross product matrix for vector.
    the skew symmetric cross product matrix is defined such that
    np.cross(a, b) = np.dot(skew(a), b)
    :param vector: An array like vector to create the skew symmetric cross product matrix for
    :return: A numpy array of the skew symmetric cross product vector
    """
    vector = np.array(vector)
    return np.array([[0, -vector.item(2), vector.item(1)],
                     [vector.item(2), 0, -vector.item(0)],
                     [-vector.item(1), vector.item(0), 0]])
but I feel like this is inefficient as it requires creating a new copy of vector (in this case not a big deal since vector is small but this is just a simple example).
My question is, is there a different way to do this outside of what I've discussed or am I stuck using one of these methods?
Arrays are iterable. You can write in your skew function:
def skew(x):
    return np.array([[0, -x[2], x[1]],
                     [x[2], 0, -x[0]],
                     [-x[1], x[0], 0]])
x = [1,2,3]
y = np.array([1,2,3])
>>> skew(y)
array([[ 0, -3, 2],
[ 3, 0, -1],
[-2, 1, 0]])
>>> skew(x)
array([[ 0, -3, 2],
[ 3, 0, -1],
[-2, 1, 0]])
In any case, your methods ended up with first-dimension elements that are numpy arrays containing floats. You will need an index into the second dimension to get at the floats inside.
Regarding what you told me in the comments, you may add an if condition for 2d arrays:
def skew(x):
    # For 2d input such as np.array([[1], [2], [3]]), index into the second dimension as well.
    if isinstance(x, np.ndarray) and len(x.shape) >= 2:
        return np.array([[0, -x[2][0], x[1][0]],
                         [x[2][0], 0, -x[0][0]],
                         [-x[1][0], x[0][0], 0]])
    else:
        return np.array([[0, -x[2], x[1]],
                         [x[2], 0, -x[0]],
                         [-x[1], x[0], 0]])
You can implement the last idea efficiently using numpy.asarray():
vector = np.asarray(vector)
Then, if vector is already a NumPy array, no copying occurs.
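A minimal sketch of how the asarray idea could cover lists, 1d arrays, and 2d column arrays in one code path. Flattening with ravel() and the size check are assumptions added here, not part of the original suggestion.

```python
import numpy as np

def skew(vector):
    """Return the 3x3 skew-symmetric cross-product matrix of a 3-element input."""
    # asarray avoids a copy when the input is already an ndarray;
    # ravel flattens shapes such as (3, 1) or (1, 3) down to (3,).
    v = np.asarray(vector).ravel()
    if v.size != 3:
        raise ValueError("skew expects exactly 3 elements")
    x, y, z = v
    return np.array([[0, -z, y],
                     [z, 0, -x],
                     [-y, x, 0]])

from_list = skew([2, 3, 1])
from_column = skew(np.array([[2], [3], [1]]))

# Check the defining property np.cross(a, b) == skew(a) @ b.
a = np.array([2.0, 3.0, 1.0])
b = np.array([1.0, -4.0, 2.0])
cross_matches = np.allclose(np.cross(a, b), skew(a) @ b)
```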
You can keep the first version of your function and convert the numpy array to list:
def skew(vector):
    if isinstance(vector, np.ndarray):
        vector = vector.tolist()
    return np.array([[0, -vector[2], vector[1]],
                     [vector[2], 0, -vector[0]],
                     [-vector[1], vector[0], 0]])
In [58]: skew([2, 3, 1])
Out[58]:
array([[ 0, -1, 3],
[ 1, 0, -2],
[-3, 2, 0]])
In [59]: skew(np.array([2, 3, 1]))
Out[59]:
array([[ 0, -1, 3],
[ 1, 0, -2],
[-3, 2, 0]])
This is not an optimal solution but is a very easy one.
You can just convert the vector into list by default.
def skew(vector):
    vector = list(vector)
    return np.array([[0, -vector[2], vector[1]],
                     [vector[2], 0, -vector[0]],
                     [-vector[1], vector[0], 0]])

numpy.square returns incorrect result for sparse matrices

numpy.square seems to give incorrect output when scipy.sparse matrices are passed to it:
import numpy as np
import scipy.sparse as S
a = np.array([np.arange(5), np.arange(5), np.arange(5), np.arange(5), np.arange(5)])
a
# array([[0, 1, 2, 3, 4],
# [0, 1, 2, 3, 4],
# [0, 1, 2, 3, 4],
# [0, 1, 2, 3, 4],
# [0, 1, 2, 3, 4]])
np.square(a)
# array([[ 0, 1, 4, 9, 16],
# [ 0, 1, 4, 9, 16],
# [ 0, 1, 4, 9, 16],
# [ 0, 1, 4, 9, 16],
# [ 0, 1, 4, 9, 16]])
b = S.lil_matrix(a)
c = np.square(b)
c
# <5x5 sparse matrix of type '<class 'numpy.int64'>'
# with 20 stored elements in Compressed Sparse Row format>
c[2,2]
# 20
# Expected output is 4, as in np.square(a) output above.
Is this a bug?
In general, passing scipy.sparse matrices into numpy functions that take arrays ("array_like") as input results in undefined or unintended behavior.
There is no automatic sparse -> dense cast.
Numpy does not know anything about Scipy's sparse matrices.
Sparse matrices are not "array_like" in the sense understood by Numpy.
What the numpy functions then do is treat the sparse matrices as Python objects of an unknown type, which in general results in putting them into 1-element object arrays and working on from there. For functions returning scalar results, the temporary object array is discarded and just the object contained inside it is returned, so it's easy to miss that something strange was actually done.
Object arrays have some fallbacks for performing arithmetic and similar operations on their elements (unknown Python objects), including calling operator.mul of the element if * needs to be performed, and so on. Combined with the above, this results in the behavior you see.
Update: As pointed out by hpaulj, the reason is probably a bit more involved. np.square is able to detect np.matrix and is able to square the elements. However, it falters on scipy.sparse.*matrix.
This is not a bug; this is the subtle difference between how numpy and scipy implement the __mul__ operator. By default, * for numpy.ndarray performs element-wise multiplication whereas for numpy.matrix (and consequently, for scipy.sparse.*matrix), it performs matrix multiplication (from PEP 465):
numpy provides two different types with different __mul__ methods. For
numpy.ndarray objects, * performs elementwise multiplication, and
matrix multiplication must use a function call (numpy.dot). For
numpy.matrix objects, * performs matrix multiplication, and
elementwise multiplication requires function syntax.
Internally, numpy.square uses the provided argument's __mul__ method, which is different for ndarrays and matrices.
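If an element-wise square of a sparse matrix is what you actually want, a sketch along these lines should work. It uses the sparse matrix's own multiply and power methods rather than np.square, and converts to csr first as an assumption for illustration.

```python
import numpy as np
import scipy.sparse as S

a = np.tile(np.arange(5), (5, 1))  # same 5x5 array as in the question
b = S.csr_matrix(a)

# Element-wise square using the sparse matrix's own methods.
squared = b.multiply(b)
squared_alt = b.power(2)

# For comparison, the matrix product, which is what produced the 20.
matprod = b @ b

elementwise_value = squared[2, 2]
matprod_value = matprod[2, 2]
```

Here elementwise_value matches np.square(a)[2, 2], while matprod_value is the row-times-column sum 2 * (0 + 1 + 2 + 3 + 4).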

Locality Sensitive Hashing of sparse numpy arrays

I have a large sparse numpy/scipy matrix where each row corresponds to a point in high-dimensional space. I want to make queries of the following kind:
Given a point P (a row in the matrix) and a distance epsilon, find all points with distance at most epsilon from P.
The distance metric I am using is Jaccard-similarity, so it should be possible to use Locality Sensitive Hashing tricks such as MinHash.
Is there an implementation of MinHash for sparse numpy arrays somewhere (I can't seem to find one) or is there an easy way to do this?
The reason I am not just pulling something built for non-sparse arrays off of Github is that the sparse data structures in scipy might cause explosions in time complexity.
If you have very large sparse datasets that are too large to be held in memory in a non-sparse format, I'd try out this LSH implementation that is built around the assumption of Scipy's CSR Sparse Matrices:
https://github.com/brandonrobertz/SparseLSH
It also has support for disk-based key-value stores like LevelDB if you can't fit the tables in memory. From the docs:
from sparselsh import LSH
from scipy.sparse import csr_matrix
X = csr_matrix( [
[ 3, 0, 0, 0, 0, 0, -1],
[ 0, 1, 0, 0, 0, 0, 1],
[ 1, 1, 1, 1, 1, 1, 1] ])
# One class number for each input point
y = [ 0, 3, 10]
X_sim = csr_matrix( [ [ 1, 1, 1, 1, 1, 1, 0]])
lsh = LSH(4,
          X.shape[1],
          num_hashtables=1,
          storage_config={"dict": None})
# Index each row together with its class label (range replaces Python 2's xrange).
for ix in range(X.shape[0]):
    x = X.getrow(ix)
    c = y[ix]
    lsh.index(x, extra_data=c)
# find points similar to X_sim
lsh.query(X_sim, num_results=1)
If you definitely only want to use MinHash, you could try out https://github.com/go2starr/lshhdc, but I haven't personally tested that one out for compatibility with sparse matrices.
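For checking the results of any LSH library, an exact brute-force baseline can be handy. The sketch below is not MinHash and not LSH: it computes the exact Jaccard distance from one row to every other row of a binary CSR matrix with a single sparse product. The function name jaccard_neighbors and the convention that two empty rows have distance 0 are assumptions made here.

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_neighbors(X, p_index, epsilon):
    """Indices of rows of binary matrix X within Jaccard distance epsilon of row p_index."""
    X = csr_matrix(X, dtype=np.int64)
    p = X[p_index]
    # Intersection sizes: dot product of every row with the query row.
    inter = (X @ p.T).toarray().ravel()
    # Set sizes: number of ones per row.
    sizes = np.asarray(X.sum(axis=1)).ravel()
    union = sizes + sizes[p_index] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        # Assumed convention: two empty rows count as identical.
        sim = np.where(union > 0, inter / union, 1.0)
    return np.flatnonzero(1.0 - sim <= epsilon)

X = csr_matrix([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 0, 1, 1],
                [1, 1, 0, 0]])
near = jaccard_neighbors(X, 0, 0.4)
very_near = jaccard_neighbors(X, 0, 0.1)
```

This costs one pass over the nonzeros per query, so it only scales to moderately sized data, but it gives a ground truth to compare approximate answers against.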
