Scipy Handling of Large COO matrix

I have a large sparse matrix in the form of a scipy coo_matrix (size of 5GB). I have to make use of the non-zero entries of the matrix and do some further processing.
What would be the best way to access the elements of the matrix? Should I convert the matrix to other formats or use it as it is? Also, could you please tell me the exact syntax for accessing an element of a coo_matrix? I got a bit confused since it doesn't allow slicing.

First let's build a random COO matrix:
import numpy as np
from scipy import sparse
x = sparse.rand(10000, 10000, format='coo')
The stored values are found in the .data attribute of the matrix, and you can get the row/column indices of the non-zero entries using x.nonzero() (a COO matrix also exposes its stored indices directly as x.row and x.col):
v = x.data
r, c = x.nonzero()
print(np.all(x.toarray()[r, c] == v))
# True
With a COO matrix it's possible to extract a single row or column (as a sparse vector) using the getrow()/getcol() methods. If you want to do slicing or fancy indexing of particular elements then you need to convert it to another format, such as lil_matrix using the .tolil() method, or CSR/CSC using .tocsr()/.tocsc().
You should really read the scipy.sparse docs for more information about the features of the different sparse array formats - the appropriate choice of format really depends on what you plan on doing with your array.
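As a rough sketch of both approaches: the snippet below first processes the stored entries directly from the COO arrays, then converts to CSR for element access and slicing. The matrix size, density and seed are arbitrary stand-ins for the real 5GB matrix, and the row-sum computation is only a hypothetical example of "further processing".

```python
import numpy as np
from scipy import sparse

# Small random COO matrix standing in for the large one (size/density are arbitrary).
x = sparse.random(1000, 1000, density=0.01, format='coo', random_state=0)

# A COO matrix stores its entries as three parallel arrays.
rows, cols, vals = x.row, x.col, x.data

# Vectorised processing straight from those arrays, e.g. per-row sums.
row_sums = np.bincount(rows, weights=vals, minlength=x.shape[0])

# For element access and slicing, convert once to CSR.
x_csr = x.tocsr()
i, j = rows[0], cols[0]
value = x_csr[i, j]       # single element lookup
block = x_csr[:10, :10]   # slicing returns another sparse matrix

print(value == vals[0], block.shape)
```

Working on .row/.col/.data avoids any conversion, which matters at this size; the CSR copy is only worth making if you really need indexing.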

Related

transpose Keyword not working as I expected [duplicate]

My goal is to turn a row vector into a column vector and vice versa. The documentation for numpy.ndarray.transpose says:
For a 1-D array, this has no effect. (To change between column and row vectors, first cast the 1-D array into a matrix object.)
However, when I try this:
my_array = np.array([1,2,3])
my_array_T = np.transpose(np.matrix(my_array))
I do get the wanted result, albeit in matrix form (matrix([[1],[2],[3]])), but I also get this warning:
PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
my_array_T = np.transpose(np.matrix(my_array))
How can I properly transpose an ndarray then?

Transposing a 1D array returns the same array unchanged, unlike in Matlab, where 1D arrays don't exist and every array is at least 2D.
What you want is to reshape it:
my_array.reshape(-1, 1)
Or:
my_array.reshape(1, -1)
Depending on what kind of vector you want (column or row vector).
The -1 tells NumPy to infer the length of that dimension from the total number of elements, and the 1 creates the second required dimension.
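A short sketch of what the two reshapes produce, using the array from the question. The np.newaxis line is an equivalent alternative that the answer does not mention.

```python
import numpy as np

my_array = np.array([1, 2, 3])

col = my_array.reshape(-1, 1)     # column vector, shape (3, 1)
row = my_array.reshape(1, -1)     # row vector, shape (1, 3)
col_alt = my_array[:, np.newaxis]  # same as reshape(-1, 1)

print(my_array.T.shape)            # transposing the 1D array changes nothing
print(col.shape, row.shape)
print(np.array_equal(col.T, row))  # 2D arrays transpose as expected
```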

If your array is my_array and you want to convert it to a column vector you can do:
my_array.reshape(-1, 1)
For a row vector you can use
my_array.reshape(1, -1)
Both of these can also be transposed and that would work as expected.

IIUC, use reshape
my_array.reshape(my_array.size, -1)

Defining empty numpy array when we do not know the size

From my understanding, when we want to define a numpy array, we have to define its size.
However, in my case, I want to define a numpy array and then extend it based on my values in the for loop. The shape of values might differ in each run. So I cannot define the numpy array shape in advance.
Is there any way to overcome this?
I would like to avoid using lists.
Thanks

import numpy as np
myArrayShape = 2
myArray = np.empty(shape=myArrayShape)
Note that np.empty does not initialize the array, so each element holds an arbitrary value until you assign to it.

I think a numpy array is just like an array in C or C++: when you create one, memory is allocated according to your request (size and dtype).
So it is better to create the array once its size has been determined.
Or you can try numpy.append:
https://numpy.org/doc/stable/reference/generated/numpy.append.html
But I don't think that is the preferable way, because it creates a new array on every call.

From the Octave (free-MATLAB) docs, https://octave.org/doc/v6.3.0/Advanced-Indexing.html
In cases where a loop cannot be avoided, or a number of values must be combined to form a larger matrix, it is generally faster to set the size of the matrix first (pre-allocate storage), and then insert elements using indexing commands. For example, given a matrix a,
[nr, nc] = size (a);
x = zeros (nr, n * nc);
for i = 1:n
x(:,(i-1)*nc+1:i*nc) = a;
endfor
is considerably faster than
x = a;
for i = 1:n-1
x = [x, a];
endfor
because Octave does not have to repeatedly resize the intermediate result.
The same idea applies in numpy. While you can start with a (0,n) shaped array, and grow by concatenating (1,n) arrays, that is a lot slower than starting with a (m,n) array, and assigning values.
There's a deleted answer that illustrates how to build the array by appending to a list. That approach is highly recommended.
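A minimal sketch of both strategies, assuming each loop iteration yields a 1D chunk of unknown length. The random chunk sizes and the upper bound max_len are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strategy 1: collect chunks in a list, concatenate once at the end.
pieces = []
for step in range(5):
    n = int(rng.integers(1, 4))   # hypothetical: chunk length varies per iteration
    pieces.append(rng.random(n))
result = np.concatenate(pieces)

# Strategy 2: pre-allocate when an upper bound on the size is known,
# fill by slice assignment, and trim the unused tail afterwards.
max_len = 5 * 3                    # assumed upper bound
buf = np.empty(max_len)
used = 0
for p in pieces:
    buf[used:used + len(p)] = p
    used += len(p)
trimmed = buf[:used]

print(result.shape, np.array_equal(result, trimmed))
```

Both avoid the repeated reallocation that calling np.append or np.concatenate inside the loop would cause.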

For a given sparse matrix, how can I multiply it with a given vector of binary values

I have a sparse matrix and a vector of binary values, and I want to multiply them so that wherever an entry of the vector is zero, the entire corresponding column of the sparse matrix is zeroed.
How can I achieve that?

You didn't mention how the array and matrix are defined, so I'll assume they are a numpy array and a matrix...
Do you mean something like the following?
import numpy as np
from scipy.sparse import csr_matrix
A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
v = np.array([1, 0, 1])
print(A.dot(v))
If so, take a look here:
https://docs.scipy.org/doc/scipy/reference/sparse.html

The main problem is the size of your problem and the fact you're using Python which is on the order of 10-100x slower for matrix multiplication than some other languages. Unless you use something like Cython I don't see you getting an improvement.

If you don't like the speed of matrix multiplication, then you have to consider modification of the matrix attributes directly. But depending on the format that may be slower.
To zero-out columns of a csr, you can find the relevant nonzero elements, and set the data values to zero. Then run the eliminate_zeros method to remove those elements from the sparsity structure.
Setting columns of a csc format may be simpler - find the relevant value in the indptr. At least the elements that you want to remove will be clustered together. I won't go into the details.
Zeroing rows of a lil format should be fairly easy - replace the relevant lists with [].
Anyway, with some familiarity with the formats it should be possible to work out alternatives to matrix multiplication. But without actually implementing them and doing some timings, I couldn't say which ones are faster.
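Here is a sketch of two of these options on a tiny example: multiplying by a sparse diagonal matrix built from the vector, and editing the CSR data directly as described above. The matrix and vector values are illustrative only, and I haven't timed either variant on a large problem.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 2, 0], [0, 0, 3], [4, 0, 5]]))
v = np.array([1, 0, 1])   # 0 means "zero out this column"

# Option 1: scale column j by v[j] with a sparse diagonal matrix.
by_diag = (A @ sparse.diags(v)).tocsr()

# Option 2: work on the CSR attributes. `indices` holds the column of each
# stored value, so v[indices] == 0 marks the values to remove.
by_mask = A.copy()
by_mask.data[v[by_mask.indices] == 0] = 0
by_mask.eliminate_zeros()   # drop the zeroed entries from the sparsity structure

print(by_diag.toarray())
print(by_mask.toarray())
```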

Python: Cosine Similarity m * n matrices

I have two M X N matrices which I construct after extracting data from images. Both have long first rows, and after the 3rd row each row contains only a first-column value.
For example, a raw vector looks like this:
1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:
Both vectors have a similar pattern where the first three rows are long and then thin out as they progress. To do cosine similarity I was thinking of using a padding technique to add zeros and make these two vectors N X N. I looked at Python options for cosine similarity, but some examples were using a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.

If both arrays have the same dimensions, I would flatten them using NumPy. NumPy and SciPy are powerful scientific computing tools that make matrix manipulations much easier.
Here is an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0).
Note that I've used here the dtype=object because that's the only way I know to be able to store different shapes into an array in NumPy. That's why later I used hstack() in order to flatten the array (instead of using the more common flatten() function).
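If the two jagged inputs do not have identical row lengths, the zero-padding idea from the question can be sketched as follows. The helper names pad_ragged and cosine_similarity are my own, and raw_b is a made-up second sample; only NumPy is used.

```python
import numpy as np

def pad_ragged(rows, shape):
    """Copy variable-length rows into a zero-filled 2D array of the given shape."""
    out = np.zeros(shape)
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out

def cosine_similarity(a, b):
    """Cosine similarity of two equally shaped arrays, treated as flat vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_a = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
# Hypothetical second sample with slightly different row lengths.
raw_b = [[1, 20, 2, 5, 6, 2, 2, 6], [12, 4, 5], [1, 2, 4, 7], [1], [2], [3]]

# Pad both to a common rectangular shape so positions line up.
n_rows = max(len(raw_a), len(raw_b))
n_cols = max(len(r) for r in raw_a + raw_b)
A = pad_ragged(raw_a, (n_rows, n_cols))
B = pad_ragged(raw_b, (n_rows, n_cols))

print(cosine_similarity(A, A))
print(cosine_similarity(A, B))
```

Unlike plain flattening with hstack, padding keeps each value at its original (row, column) position, so rows of different lengths do not shift the later values out of alignment.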

I would make them into a scipy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then compute pairwise cosine distances (1 - cosine similarity) with the scikit-learn module.
from scipy import sparse
from sklearn.metrics import pairwise_distances
# your_np_array must be a rectangular 2D array (e.g. the zero-padded data)
sparse_matrix = sparse.csr_matrix(your_np_array)
distance_matrix = pairwise_distances(sparse_matrix, metric="cosine")

Why can't you just run a nested loop over both jagged lists (presumably), summing a Euclidean/vector dot product over each row and using the result as a similarity measure? This assumes that the jagged dimensions are identical.
That said, I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of MxN form), or how the jagged array of arrays above is meant to represent an MxN matrix of image data, and therefore how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/col information annotated with the values.

Load sparse scipy matrix into existing numpy dense matrix

Say I have a huge numpy matrix A taking up tens of gigabytes. It takes a non-negligible amount of time to allocate this memory.
Let's say I also have a collection of scipy sparse matrices with the same dimensions as the numpy matrix. Sometimes I want to convert one of these sparse matrices into a dense matrix to perform some vectorized operations.
Can I load one of these sparse matrices into A rather than re-allocate space each time I want to convert a sparse matrix into a dense matrix? The .toarray() method which is available on scipy sparse matrices does not seem to take an optional dense array argument, but maybe there is some other way to do this.

If the sparse matrix is in the COO format:
def assign_coo_to_dense(sparse, dense):
    dense[sparse.row, sparse.col] = sparse.data
If it is in the CSR format:
def assign_csr_to_dense(sparse, dense):
    # Expand indptr into one row index per stored value (needs numpy imported as np).
    rows = np.repeat(np.arange(sparse.shape[0]), np.diff(sparse.indptr))
    dense[rows, sparse.indices] = sparse.data
To be safe, you might want to add the following lines to the beginning of each of the functions above:
assert sparse.shape == dense.shape
dense[:] = 0
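Putting those pieces together, a self-contained sketch might look like this. The function name load_into_dense is my own, and I use np.add.at so that repeated (row, col) entries in a COO matrix are summed, which is how SciPy itself treats duplicates. As far as I know, toarray() in current SciPy versions also accepts an out= buffer (it must match the sparse matrix in shape and dtype and be contiguous), so that is shown as an alternative; check the docs for your version.

```python
import numpy as np
from scipy import sparse

def load_into_dense(sp, dense):
    """Overwrite `dense` in place with the contents of the sparse matrix `sp`."""
    if sp.shape != dense.shape:
        raise ValueError("shape mismatch")
    dense[...] = 0                       # clear whatever was there before
    coo = sp.tocoo()
    # Accumulate rather than assign so duplicate coordinates are summed.
    np.add.at(dense, (coo.row, coo.col), coo.data)
    return dense

# Tiny example; the entry at (0, 1) is deliberately stored twice.
S = sparse.coo_matrix(([1.0, 2.0, 3.0], ([0, 0, 2], [1, 1, 0])), shape=(3, 3))

buf = np.full((3, 3), 7.0)               # pre-allocated buffer with stale contents
load_into_dense(S, buf)

# Alternative: let SciPy fill an existing buffer via the `out` argument.
buf2 = np.full((3, 3), 7.0)
S.tocsr().toarray(out=buf2)

print(buf)
print(np.array_equal(buf, buf2))
```

np.add.at is noticeably slower than plain fancy-index assignment, so if you know there are no duplicate entries, the direct assignment from the functions above is the better choice.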

It does seem like there should be a better way to do this (and I haven't scoured the documentation), but you could always loop over the elements of the sparse array and assign to the dense array (probably zeroing out the dense array first). If this ends up too slow, that seems like an easy C extension to write....
