Given a Scipy CSC Sparse matrix "sm" with dimensions (170k x 170k) with 440 million non-null points and a sparse CSC vector "v" (170k x 1) with a few non-null points, is there anything that can be done to improve the performance of the operation:
result = sm.dot(v)
?
Currently it's taking roughly 1 second. Initializing the matrices as CSR increased the time to roughly 3 seconds, so CSC performed better.
sm is a matrix of similarities between products, and v is the vector that represents which products the user bought or clicked on. So for every user, sm is the same.
I'm using Ubuntu 13.04, Intel i3 @ 3.4GHz, 4 cores.
Researching on SO I read about the ATLAS package. I typed into the terminal:
~$ ldd /usr/lib/python2.7/dist-packages/numpy/core/_dotblas.so
Which resulted in:
linux-vdso.so.1 => (0x00007fff56a88000)
libblas.so.3 => /usr/lib/libblas.so.3 (0x00007f888137f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8880fb7000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8880cb1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f888183c000)
From what I understood, this means that I'm already using a high-performance BLAS package. I'm still not sure, though, whether this package implements parallel computing, but it looks like it doesn't.
Could multi-core processing help to boost performance? If so, is there any library that could be helpful in python?
I was also considering the idea of implementing this in Cython but I don't know if this would lead to good results.
Thanks in advance.
The sparse matrix multiplication routines are directly coded in C++ and, as far as a quick look at the source reveals, there doesn't seem to be any hook to any optimized library. Furthermore, it doesn't seem to take advantage of the fact that the second matrix is a vector to minimize calculations. So you can probably speed things up quite a bit by accessing the guts of the sparse matrix and customizing the multiplication algorithm. The following code does so in pure Python/NumPy, and if the vector really has "a few non-null points" it matches the speed of scipy's C++ code; if you implemented it in Cython, the speed increase should be noticeable:
import numpy as np
import scipy.sparse as sps

def sparse_col_vec_dot(csc_mat, csc_vec):
    # row numbers of vector non-zero entries
    v_rows = csc_vec.indices
    v_data = csc_vec.data
    # matrix description arrays
    m_dat = csc_mat.data
    m_ind = csc_mat.indices
    m_ptr = csc_mat.indptr
    # sizes of the matrix columns selected by the vector's non-zeros,
    # turned into offsets into the output arrays
    sizes = m_ptr.take(v_rows + 1) - m_ptr.take(v_rows)
    sizes = np.concatenate(([0], np.cumsum(sizes)))
    # output arrays
    data = np.empty((sizes[-1],), dtype=csc_mat.dtype)
    indices = np.empty((sizes[-1],), dtype=np.intp)
    indptr = np.zeros((2,), dtype=np.intp)

    for j in range(len(sizes) - 1):
        # slice out the matrix column selected by the j-th non-zero
        # entry of the vector, and scale it by that entry
        slice_ = slice(*m_ptr[[v_rows[j], v_rows[j] + 1]])
        np.multiply(m_dat[slice_], v_data[j], out=data[sizes[j]:sizes[j+1]])
        indices[sizes[j]:sizes[j+1]] = m_ind[slice_]

    indptr[-1] = len(data)
    ret = sps.csc_matrix((data, indices, indptr),
                         shape=csc_vec.shape)
    ret.sum_duplicates()
    return ret
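To sanity-check the function against scipy's built-in routine (the dimensions and densities here are made up for illustration):

import numpy as np
import scipy.sparse as sps

sm = sps.rand(1000, 1000, density=0.1, format='csc')  # random similarity matrix
v = sps.rand(1000, 1, density=0.005, format='csc')    # vector with a few non-nulls

print(np.allclose(sm.dot(v).toarray(),
                  sparse_col_vec_dot(sm, v).toarray()))  # True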
A quick explanation of what is going on: a CSC matrix is defined in three linear arrays:
data contains the non-zero entries, stored in column major order.
indices contains the rows of the non-zero entries.
indptr has one entry more than the number of columns of the matrix, and items in column j are found in data[indptr[j]:indptr[j+1]] and are in rows indices[indptr[j]:indptr[j+1]].
So to multiply by a sparse column vector, you can iterate over data and indices of the column vector, and for each (d, r) pair, extract the corresponding column of the matrix and multiply it by d, i.e. data[indptr[r]:indptr[r+1]] * d and indices[indptr[r]:indptr[r+1]].
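As a concrete illustration, here are the three arrays for a small made-up matrix:

import numpy as np
import scipy.sparse as sps

m = sps.csc_matrix(np.array([[1, 0, 2],
                             [0, 0, 3],
                             [4, 0, 0]]))

print(m.data)     # [1 4 2 3] -- non-zero entries, column by column
print(m.indices)  # [0 2 0 1] -- the rows those entries live in
print(m.indptr)   # [0 2 2 4] -- column j spans data[indptr[j]:indptr[j+1]]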
Recently I had the same issue. I solved it like this:
def sparse_col_vec_dot(csc_mat, csc_vec):
    # CSR-times-CSC is the fast layout for this matrix-vector product
    curr_mat = csc_mat.tocsr()
    return curr_mat * csc_vec
The trick here is to keep one operand in row representation (CSR) and the other in column representation (CSC).
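Note that tocsr() itself costs time on every call. Since the question says sm is the same for every user, a reasonable refinement (a sketch assuming a per-user loop) is to convert once up front:

sm_csr = sm.tocsr()         # convert once, outside the per-user loop

for v in user_vectors:      # hypothetical iterable of per-user CSC vectors
    result = sm_csr * v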
I have a matrix of 63695 row vectors of dim 384.
I would like to compute the cosine similarity model for this matrix.
I was thinking of vectorizing it.
How would one proceed to that objective?
If you look in scikit-learn's source code you will see that X and Y are first normalized and then X_norm @ Y_norm.T (a dot product) is returned. Or, if as in your case no Y exists, it is X_norm @ X_norm.T.
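A minimal sketch of that approach using sklearn's normalize (with a smaller made-up matrix, since the full 63695 x 63695 result would need ~16 GB, as discussed below):

import numpy as np
from sklearn.preprocessing import normalize

X = np.random.rand(1000, 384).astype(np.float32)

X_norm = normalize(X, axis=1)     # scale each row to unit length
similarities = X_norm @ X_norm.T  # (1000, 1000) matrix of cosine similarities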
Normalizing and transposing can be disregarded when looking at the runtime, but the matrix multiplication of a (63695 x 384) matrix with its transpose produces 63695*63695 elements in the result matrix, and each element takes 384 element-wise multiplications (plus 383 additions), so something like 63695*63695*384 = ~1.56 * 10^12 multiplications, and about the same number of additions.
And as you already mentioned it requires 4 (Bytes for float32) * 63695 * 63695 = ~16.2 GB of memory to handle that result matrix.
Do you really need that enormous matrix? What type of data are you handling, and what are you trying to do? If we are talking about e.g. vector representations of text data, then you should look at removing duplicates, processing it in chunks, or reducing the dimensionality before analysing similarity. If you are looking for something like ranking these cosine similarities and finding the k most similar ones, you'd be much better off using algorithms for finding similar data points instead of doing it all by hand yourself.
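For the ranking case, a sketch using scikit-learn's NearestNeighbors (the data and parameters here are made up):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(63695, 384).astype(np.float32)

# cosine distance = 1 - cosine similarity, so nearest means most similar
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
nn.fit(X)
distances, indices = nn.kneighbors(X[:5])  # top 10 matches for the first 5 rows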
I researched a lot on this but couldn't find a practical solution to this problem. I am using scipy to create a CSR sparse matrix and want to subtract this matrix from an equivalent matrix of all ones. In scipy and numpy notation, if the matrix is not sparse, we can do so by simply writing 1 - MatrixVariable. However, this operation is not implemented if the matrix is sparse. I could only think of the following obvious solution:
Iterate through the entire sparse matrix, set all zero elements to 1 and all non-zero elements to 0.
But this would create a matrix where most elements are 1 and only a few are 0, which is no longer sparse and, due to its huge size, could not be stored as a dense matrix.
What could be an alternative and effective way of doing this?
Thanks.
Your new matrix will not be sparse, because it will have 1s everywhere, so you will need a dense array to hold it:
new_mat = np.ones(sps_mat.shape, sps_mat.dtype) - sps_mat.todense()
This requires that your matrix fits in memory. It actually requires that it fits in memory 3 times: once for the array of ones, once for the dense copy of your matrix, and once for the result. If that is an issue, you can make it more memory-efficient by doing something like:
new_mat = sps_mat.todense()
new_mat *= -1
new_mat += 1
You can access the data from your sparse matrix as a 1D array so that:
ss.data *= -1
ss.data += 1
will work like 1 - ss, for all non-zero elements in your sparse matrix.
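A small demonstration of what that does (the matrix here is made up):

import numpy as np
import scipy.sparse as sps

ss = sps.csr_matrix(np.array([[0.0, 2.0],
                              [3.0, 0.0]]))
ss.data *= -1
ss.data += 1

# the non-zero entries became 1 - x; the stored zeros are untouched
print(ss.toarray())
# [[ 0. -1.]
#  [-2.  0.]]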
I have a loop that in each iteration gives me a column c of a sparse matrix N.
To assemble/grow/accumulate N column by column I thought of using
N = scipy.sparse.hstack([N, c])
To do this it would be nice to initialize the matrix with rows of length 0. However,
N = scipy.sparse.csc_matrix((4,0))
raises a ValueError: invalid shape.
Any suggestions, how to do this right?
You can't. Sparse matrices are restricted compared to NumPy arrays and in particular don't allow 0 for any axis. All sparse matrix constructors check for this, so if and when you do manage to build such a matrix, you're exploiting a SciPy bug and your script is likely to break when you upgrade SciPy.
That being said, I don't see why you'd need an n × 0 sparse matrix since an n × 0 NumPy array is allowed and takes practically no storage space.
Turns out sparse.hstack cannot handle a NumPy array with a zero axis, so disregard my previous comment. However, what I think you should do is collect all the columns in a list, then hstack them in one call. That's better than your loop since append'ing to a list takes amortized constant time, while hstack takes linear time. So your proposed algorithm takes quadratic time while it could be linear.
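A sketch of that pattern (compute_column and n_iterations are hypothetical stand-ins for whatever your loop actually does):

import scipy.sparse as sps

cols = []
for i in range(n_iterations):   # your existing loop
    c = compute_column(i)       # hypothetical: yields one sparse (4, 1) column
    cols.append(c)              # amortized O(1) per column

N = sps.hstack(cols)            # a single linear-time hstack at the end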
You must use at least 1 in your shape.
N = scipy.sparse.csc_matrix((4,1))
Which you can stack:
print scipy.sparse.hstack((N, N))
#<4x2 sparse matrix of type '<type 'numpy.float64'>'
# with 0 stored elements in COOrdinate format>
In the Python Computer Graphics Kit, there is a vec3 type for the representation of three-component vectors, but how can I do the following multiplication:
Multiplying a three-component vector by its transpose results in a 3*3 matrix, as in the following example:
a = vec3(1,1,1)
matrix_m = a * a.transpose()
Does anyone know of such a library, one that can multiply a matrix of dimension 3*1 by another of dimension 1*3 and produce a 3*3 matrix?
Sorry, I have to clarify a bit more about this. I am talking about matrix math.
It is like:
[a0]                   [a0*a0  a0*a1  a0*a2]
[a1] * [a0  a1  a2]  = [a1*a0  a1*a1  a1*a2]
[a2]                   [a2*a0  a2*a1  a2*a2]
Maybe I can just write a function myself; it is straightforward enough...
Some vector math software, such as MATLAB, happily keep track of column vectors and row vectors as separate types of things. Python's Numpy doesn't, but does offer numpy.outer(A,B). Unfortunately, the Graphics Kit (I assume you refer to http://cgkit.sourceforge.net/) doesn't track rows vs columns, use numpy (which would be huge overkill), or provide a vector x vector --> matrix outer product. It looks like you'll have to write your own function to do that.
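For reference, here is what the two options look like: NumPy's built-in outer product, and a minimal hand-written version (a sketch that assumes the vector type supports indexing):

import numpy as np

a = np.array([1, 1, 1])
print(np.outer(a, a))  # the 3x3 outer product a * a^T
# [[1 1 1]
#  [1 1 1]
#  [1 1 1]]

def outer3(u, v):
    # hand-written outer product for any two indexable 3-vectors
    return [[u[i] * v[j] for j in range(3)] for i in range(3)]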
I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:
Compute vector of inner products between doc idx and all other documents
Sort in descending order
Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np
def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner product
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]
    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.
Can this be done more efficiently?
Can this be done without converting the sparse matrix using toarray()?
I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).
The double sort can be skipped by doing
v = v.toarray().ravel()   # flatten the (n_docs, 1) column to a 1-D array
vi = np.argsort(v)[::-1]  # indices, highest score first
vs = v[vi]                # scores reordered to match
Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is
v = tfidf * tfidf[idx, :].transpose()
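Putting both fixes together, a revised get_related might look like this (a sketch, not tested against your data):

import numpy as np

def get_related(tfidf, idx):
    ''' return the top documents for document idx '''
    # sparse matrix-vector product instead of np.inner
    v = (tfidf * tfidf[idx, :].transpose()).toarray().ravel()
    # single argsort, highest score first
    vi = np.argsort(v)[::-1]
    # drop the first entry, which is the document itself
    return v[vi][1:], vi[1:]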