I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:
Compute vector of inner products between doc idx and all other documents
Sort in descending order
Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np
def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner product
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]
    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.
Can this be done more efficiently?
Can this be done without converting the sparse matrix using toarray()?
I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).
The double sort can be skipped by doing
v = v.toarray().ravel()
vi = np.argsort(v)[::-1]
vs = v[vi]
By the way, your use of np.inner on sparse matrices is not going to work with recent versions of NumPy; the safe way of taking an inner product of two sparse matrices is
v = tfidf * tfidf[idx, :].transpose()
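Putting those two points together, a minimal sketch of the revised function could look like this (my own rewrite, not the only way to do it; it assumes tfidf is a scipy.sparse matrix and, as in the question, that the top score belongs to the query document itself):
import numpy as np

def get_related(tfidf, idx):
    ''' return scores and indices of the other documents, most similar first '''
    # sparse matrix-vector product -> (n_docs, 1) column, then flatten to 1-D
    v = (tfidf * tfidf[idx, :].transpose()).toarray().ravel()
    # one argsort, descending, reused for both indices and scores
    order = np.argsort(v)[::-1]
    idxs = order[1:]          # drop the first entry, assumed to be idx itself
    scores = v[idxs]
    return (scores, idxs)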
I have a matrix of 63695 row vectors of dim 384.
I would like to compute the cosine similarity model for this matrix.
I was thinking of vectorizing it.
How would one proceed to that objective?
If you look in scikit-learn's source code you will see that X and Y are first normalized and then X_norm @ Y_norm.T (a dot product) is returned. Or, if as in your case no Y exists, it is X_norm @ X_norm.T.
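In plain NumPy the same idea boils down to something like the following sketch (the shapes and random data are just stand-ins):
import numpy as np

X = np.random.randn(1000, 384)                         # smaller stand-in for the 63695 x 384 matrix
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each row to unit length
cosine_sim = X_norm @ X_norm.T                         # (1000, 1000) matrix of cosine similarities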
Normalizing and transposing can be disregarded when looking at the runtime, but the product of a (63695 x 384) matrix with its transpose has 63695 * 63695 elements in the result, and each element is a dot product of two length-384 vectors, so the whole thing needs roughly 63695 * 63695 * 384 ≈ 1.56e12 multiplications (and about the same number of additions).
And as you already mentioned it requires 4 (Bytes for float32) * 63695 * 63695 = ~16.2 GB of memory to handle that result matrix.
Do you really need that enormous matrix? What type of data are you handling, and what are you trying to do? If we are talking about e.g. vector representations of text data, then you should look at removing duplicates, processing the data in chunks, or reducing the dimensionality before analysing similarity. If you are looking to rank these cosine similarities and find the k most similar items, you'd be much better off using algorithms for finding similar data points instead of doing it all by hand yourself.
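If ranking is the actual goal, a chunked top-k sketch along the following lines never materialises the full 63695 x 63695 matrix; topk_cosine is a made-up name, and the chunk size and k are arbitrary choices. Libraries such as scikit-learn's NearestNeighbors with metric='cosine', or an approximate nearest-neighbour package, would do the same job with less code.
import numpy as np

def topk_cosine(X, k=10, chunk=1024):
    """Return (indices, similarities) of the k most similar other rows for every row of X."""
    # normalise rows once so that a plain dot product equals cosine similarity
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = Xn.shape[0]
    idx_out = np.empty((n, k), dtype=np.int64)
    sim_out = np.empty((n, k))
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        sims = Xn[start:stop] @ Xn.T                                     # one (chunk, n) slab of the full matrix
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf  # mask each row's self-similarity
        top = np.argpartition(sims, -k, axis=1)[:, -k:]                  # unordered top-k columns per row
        top_sims = np.take_along_axis(sims, top, axis=1)
        order = np.argsort(top_sims, axis=1)[:, ::-1]                    # sort the k candidates, descending
        idx_out[start:stop] = np.take_along_axis(top, order, axis=1)
        sim_out[start:stop] = np.take_along_axis(top_sims, order, axis=1)
    return idx_out, sim_out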
My goal here is to build the sparse CSR matrix very fast. It is currently the main bottleneck in my process, and I've already optimized it by constructing the coo matrix relatively fast, and then using tocsr().
However, I would imagine that constructing the csr matrix directly must be faster?
I have a very specific format of a sparse matrix that is also large (i.e. on the order of 100000 x 50000). I've looked online at these other answers, but most do not address the question I have.
Efficiently construct FEM/FVM matrix
Looks at constructing a very specifically formatted sparse matrix vs. using COO, which led to a SciPy merge improving the speed of tocsr().
Sparse Matrix Structure:
The sparse matrix, H, is composed of W lists of size N, or equivalently is built from an initial array of size N x W; let's call it A. Along the diagonal are lists of size N, repeated N times. So the first N rows of H contain A[:,0] repeated, sliding along N steps for each row.
Comparison to COO.tocsr()
When I scale up N and W, build the COO matrix first and then run tocsr(), it is actually faster than just building the CSR matrix directly. I'm not sure why this would be the case. I am wondering if perhaps I can take advantage of the structure of my sparse matrix H in some way, since there are many repeating elements in there.
Code Sample
Here is a code sample to visualize what is going on for a small sample size:
from scipy.sparse import linalg, dok_matrix, coo_matrix, csr_matrix
import numpy as np
import matplotlib.pyplot as plt
def test_csr(testdata):
    indices = [x for _ in range(W-1) for x in range(N**2)]
    ptrs = [N*(i) for i in range(N*(W-1))]
    ptrs.append(len(indices))
    data = []
    # loop along the first axis
    for i in range(W-1):
        vec = testdata[:,i].squeeze()
        # repeat vector N times
        for _ in range(N):
            data.extend(vec)
    Hshape = ((N*(W-1), N**2))
    H = csr_matrix((data, indices, ptrs), shape=Hshape)
    return H
N = 4
W = 8
testdata = np.random.randn(N,W)
print(testdata.shape)
H = test_csr(testdata)
plt.imshow(H.toarray(), cmap='jet')
plt.show()
It looks like your output uses only the first W-1 columns of testdata. I'm not sure if this is intentional or not. My solution assumes you want to use all of testdata.
When you construct the COO matrix are you also constructing the indices and data in a similar way?
One thing which might speed up constructing the csr_matrix is to use built-in NumPy functions to generate the data for the csr_matrix rather than Python loops and lists. I would expect this to improve the speed of generating the indices significantly. You can adjust the dtype to a different integer type depending on the size of your matrix.
import numpy as np
from scipy.sparse import csr_matrix

N = 4
W = 8
testdata = np.random.randn(N,W)

ptrs = N*np.arange(W*N+1, dtype='int')
indices = np.tile(np.arange(N*N, dtype='int'), W)
data = np.tile(testdata, N).flatten()

Hshape = ((N*W, N**2))
H = csr_matrix((data, indices, ptrs), shape=Hshape)
Another possibility is to first construct the full sparse matrix and then fill each of the N vertical column blocks at once. This means that you don't need to make a ton of copies of the original data before you put it into the sparse matrix. However, converting the matrix type afterwards may be slow.
import numpy as np
import scipy.sparse as sp

N = 4
W = 8
testdata = np.random.randn(N,W)

Hshape = ((N*W, N**2))
H = sp.lil_matrix(Hshape)
for j in range(N):
    H[N*np.arange(W)+j, N*j:N*(j+1)] = testdata.T
H = H.tocsc()
I have a sparse matrix where I'm currently enumerating over each row and performing some calculations based on the information from each row. Each row is completely independent of the others. However, for large matrices, this code is extremely slow (takes about 2 hours) and I can't convert the matrix to a dense one either (limited to 8GB RAM).
import scipy.sparse
import numpy as np

def process_row(a, b):
    """
    a - contains the row indices for a sparse matrix
    b - contains the column indices for a sparse matrix
    Returns a new vector of length(a)
    """
    return

def assess(mat):
    """
    """
    mat_csr = mat.tocsr()
    nrows, ncols = mat_csr.shape
    a = np.arange(ncols, dtype=np.int32)
    b = np.empty(ncols, dtype=np.int32)
    result = []
    for i, row in enumerate(mat_csr):
        # Process one row at a time
        b.fill(i)
        result.append(process_row(b, a))
    return result

if __name__ == '__main__':
    row = np.array([8,2,7,4])
    col = np.array([1,3,2,1])
    data = np.array([1,1,1,1])
    mat = scipy.sparse.coo_matrix((data, (row, col)))
    print(assess(mat))
I am looking to see if there's any way to design this better so that it performs much faster. Essentially, the process_row function takes (row, col) index pairs (from a, b) and does some math using another sparse matrix and returns a result. I don't have the option to change this function but it can actually process different row/col pairs and is not restricted to processing everything from the same row.
Your problem looks similar to this other recent SO question:
Calculate the euclidean distance in scipy csr matrix
In my answer I sketched a way of iterating over the rows of a sparse matrix. I think it is faster to convert the array to lil, and construct the dense rows directly from its sublists. This avoids the overhead of creating a new sparse matrix for each row. But I haven't done time tests.
https://stackoverflow.com/a/36559702/901925
Maybe this applies to your case.
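As a rough illustration of that idea only (assess_lil is a made-up name, and process_row's signature is adapted here to take a row number plus a dense row), iterating over the lil sublists could look like this:
import numpy as np

def assess_lil(mat, process_row):
    """Apply process_row to each row without building a sparse matrix per row."""
    mat_lil = mat.tolil()
    nrows, ncols = mat_lil.shape
    dense_row = np.zeros(ncols, dtype=mat_lil.dtype)
    results = []
    for i in range(nrows):
        cols = mat_lil.rows[i]      # column indices of the non-zeros in row i
        vals = mat_lil.data[i]      # the corresponding values
        dense_row[:] = 0
        dense_row[cols] = vals
        results.append(process_row(i, dense_row))
    return results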
Given a Scipy CSC Sparse matrix "sm" with dimensions (170k x 170k) with 440 million non-null points and a sparse CSC vector "v" (170k x 1) with a few non-null points, is there anything that can be done to improve the performance of the operation:
resul = sm.dot(v)
?
Currently it's taking roughly 1 second. Initializing the matrices as CSR increased the time to 3 seconds, so CSC performed better.
SM is a matrix of similarities between products and V is the vector that represents which products the user bought or clicked on. So for every user sm is the same.
I'm using Ubuntu 13.04, Intel i3 @ 3.4 GHz, 4 cores.
Researching on SO I read about the BLAS/ATLAS libraries. I typed into the terminal:
~$ ldd /usr/lib/python2.7/dist-packages/numpy/core/_dotblas.so
Which resulted in:
linux-vdso.so.1 => (0x00007fff56a88000)
libblas.so.3 => /usr/lib/libblas.so.3 (0x00007f888137f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8880fb7000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8880cb1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f888183c000)
And from what I understood, this means that I'm already using a high-performance BLAS package. I'm still not sure, though, whether this library already implements parallel computing, but it looks like it doesn't.
Could multi-core processing help to boost performance? If so, is there any library that could be helpful in python?
I was also considering the idea of implementing this in Cython but I don't know if this would lead to good results.
Thanks in advance.
The sparse matrix multiplication routines are directly coded in C++ and, as far as a quick look at the source reveals, there doesn't seem to be any hook to an optimized library. Furthermore, it doesn't seem to take advantage of the fact that the second matrix is a vector to minimize calculations. So you can probably speed things up quite a bit by accessing the guts of the sparse matrix and customizing the multiplication algorithm. The following code does so in pure Python/NumPy, and if the vector really has "a few non-null points" it matches the speed of scipy's C++ code; if you implemented it in Cython, the speed increase should be noticeable:
import numpy as np
import scipy.sparse as sps

def sparse_col_vec_dot(csc_mat, csc_vec):
    # row numbers of vector non-zero entries
    v_rows = csc_vec.indices
    v_data = csc_vec.data
    # matrix description arrays
    m_dat = csc_mat.data
    m_ind = csc_mat.indices
    m_ptr = csc_mat.indptr
    # output arrays
    sizes = m_ptr.take(v_rows+1) - m_ptr.take(v_rows)
    sizes = np.concatenate(([0], np.cumsum(sizes)))
    data = np.empty((sizes[-1],), dtype=csc_mat.dtype)
    indices = np.empty((sizes[-1],), dtype=np.intp)
    indptr = np.zeros((2,), dtype=np.intp)

    for j in range(len(sizes)-1):
        slice_ = slice(*m_ptr[[v_rows[j], v_rows[j]+1]])
        np.multiply(m_dat[slice_], v_data[j], out=data[sizes[j]:sizes[j+1]])
        indices[sizes[j]:sizes[j+1]] = m_ind[slice_]

    indptr[-1] = len(data)
    ret = sps.csc_matrix((data, indices, indptr),
                         shape=csc_vec.shape)
    ret.sum_duplicates()

    return ret
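A quick way to sanity-check the function against scipy's own multiplication; the sizes and densities here are arbitrary and much smaller than the 170k case:
import numpy as np
import scipy.sparse as sps

sm = sps.random(10000, 10000, density=0.01, format='csc')
v = sps.random(10000, 1, density=0.0005, format='csc')
print(np.allclose(sm.dot(v).toarray(),
                  sparse_col_vec_dot(sm, v).toarray()))   # expect True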
A quick explanation of what is going on: a CSC matrix is defined in three linear arrays:
data contains the non-zero entries, stored in column major order.
indices contains the rows of the non-zero entries.
indptr has one entry more than the number of columns of the matrix, and items in column j are found in data[indptr[j]:indptr[j+1]] and are in rows indices[indptr[j]:indptr[j+1]].
So to multiply by a sparse column vector, you can iterate over data and indices of the column vector, and for each (d, r) pair, extract the corresponding column of the matrix and multiply it by d, i.e. data[indptr[r]:indptr[r+1]] * d and indices[indptr[r]:indptr[r+1]].
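For a concrete picture of those three arrays, here is a tiny example; the printed values follow directly from the definition above:
import numpy as np
import scipy.sparse as sps

M = sps.csc_matrix(np.array([[1, 0, 4],
                             [0, 2, 5],
                             [3, 0, 0]]))
print(M.data)     # [1 3 2 4 5]  non-zero entries in column-major order
print(M.indices)  # [0 2 1 0 1]  their row numbers
print(M.indptr)   # [0 2 3 5]    column j lives in data[indptr[j]:indptr[j+1]]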
Recently I had the same issue. I solved it like this:
def sparse_col_vec_dot(csc_mat, csc_vec):
    # convert the matrix to CSR, then multiply by the CSC column vector
    curr_mat = csc_mat.tocsr()
    return curr_mat * csc_vec
The trick here is to keep one version of the matrix in row (CSR) representation and the other in column (CSC) representation.
I have two M x N matrices which I construct after extracting data from images. Both have long first rows, and after the 3rd row the rows thin out to a single column.
For example, a raw vector looks like this:
1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:
Both vectors have a similar pattern where the first three rows are long and then thin out as they progress. To do cosine similarity I was thinking of using a padding technique to add zeros and make these two vectors N x N. I looked at Python options for cosine similarity, but some examples were using a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.
If both arrays have the same dimension, I would flatten them using NumPy. NumPy (and SciPy) is a powerful scientific computational tool that makes matrix manipulations way easier.
Here is an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0).
Note that I've used dtype=object here because that's the only way I know to store rows of different lengths in a NumPy array. That's also why I later used hstack() to flatten the array (instead of the more common flatten() function).
I would make them into a scipy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then compute cosine similarity with the scikit-learn module.
from scipy import sparse
from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise_distances

sparse_matrix = sparse.csr_matrix(your_np_array)
distance_matrix = pairwise_distances(sparse_matrix, metric="cosine")
Why can't you just run a nested loop over both jagged lists (presumably), summing each row with a Euclidean/vector dot product and using the result as a similarity measure? This assumes that the jagged dimensions are identical.
Although I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of M x N form) or how the jagged array of arrays above is meant to represent an M x N matrix/image, and therefore how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/col information annotated with the values.
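A minimal sketch of that row-by-row idea (rowwise_cosine is a made-up name), assuming the two jagged lists really do have identical row lengths; swap the cosine for a bare dot product if that is the preferred measure:
import numpy as np

def rowwise_cosine(A, B):
    """Cosine similarity computed row by row over two jagged lists of equal shape."""
    sims = []
    for a_row, b_row in zip(A, B):
        a = np.asarray(a_row, dtype=float)
        b = np.asarray(b_row, dtype=float)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sims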