I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:
from scipy.sparse import lil_matrix
x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2
c = 0
for i in x:
    print c, i
    c = c+1
the output is
0
1
2
3
4
5
6
7
8
9
10
11
12
13 (0, 0) 1.0
14
15 (0, 0) 2.0
16
17
18
19
so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html
and searched around a bit, but I can't seem to find a solution that works.
Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. Current fastest is using_tocoo_izip:
import scipy.sparse
import random
import itertools
def using_nonzero(x):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        ((row,col), x[row,col])
def using_coo(x):
    cx = scipy.sparse.coo_matrix(x)
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)
def using_tocoo(x):
    cx = x.tocoo()
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)
def using_tocoo_izip(x):
    cx = x.tocoo()
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        (i,j,v)
N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)
yields these timeit results:
% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop
The fastest way should be by converting to a coo_matrix:
cx = scipy.sparse.coo_matrix(x)
for i,j,v in zip(cx.row, cx.col, cx.data):
    print "(%d, %d), %s" % (i,j,v)
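On Python 3, where zip is already lazy and itertools.izip no longer exists, the same idea can be sketched as follows. This is only a small illustration reusing the matrix from the question; the name nonzeros is made up for the example.

```python
from scipy.sparse import lil_matrix

# Same 20x1 matrix as in the question, with two stored entries.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

# Convert once to COO; row, col and data are parallel arrays
# holding only the stored entries.
cx = x.tocoo()
nonzeros = [(int(i), int(j), float(v))
            for i, j, v in zip(cx.row, cx.col, cx.data)]
print(nonzeros)
```

Only the two stored entries are visited, instead of all 20 rows.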
To loop over several of the sparse matrix formats in scipy.sparse, I would use this small wrapper function (note that on Python 2 you are encouraged to use xrange and izip for better performance on large matrices):
from scipy.sparse import *
def iter_spmatrix(matrix):
    """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix``
    This will always return:
    >>> (row, column, matrix-element)
    Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.
    Parameters
    ----------
    matrix : ``scipy.sparse.sp_matrix``
        the sparse matrix to iterate non-zero elements
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m
    elif isspmatrix_csc(matrix):
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]
    elif isspmatrix_csr(matrix):
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]
    elif isspmatrix_lil(matrix):
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d
    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
tocoo() materializes the entire matrix in a second structure, which is not the preferred modus operandi in Python 3. You can also consider this iterator, which is especially useful for large matrices.
from itertools import chain, repeat
def iter_csr(matrix):
    for (row, col, val) in zip(
        chain(*(
            repeat(i, r)
            for (i, r) in enumerate(matrix.indptr[1:] - matrix.indptr[:-1])
        )),
        matrix.indices,
        matrix.data
    ):
        yield (row, col, val)
I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).
NB:
In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>
So yes, enumerate is somewhat slow(ish)
For the iterator:
In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something
So you decide whether this overhead is acceptable; in my case tocoo caused memory errors.
IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)
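The row expansion done with chain, repeat and enumerate can also be sketched with NumPy constructs. The function name iter_csr_numpy is made up for this example. Note the trade-off: it allocates one row index per stored entry, which is less than a full COO copy but not free for very large matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix

def iter_csr_numpy(matrix):
    """Return an iterator of (row, col, value) over the stored entries of a CSR matrix."""
    # np.diff(indptr) is the number of stored entries in each row.
    counts = np.diff(matrix.indptr)
    # Repeat each row index once per stored entry in that row.
    rows = np.repeat(np.arange(matrix.shape[0]), counts)
    return zip(rows, matrix.indices, matrix.data)

m = csr_matrix(np.array([[0, 1, 0],
                         [2, 0, 3],
                         [0, 0, 0]]))
entries = [(int(r), int(c), int(v)) for r, c, v in iter_csr_numpy(m)]
print(entries)
```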
I had the same problem and actually, if your concern is only speed, the fastest way (more than 1 order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()), and iterating over the nonzero elements in the dense matrix. (Though, of course, this approach requires a lot more memory)
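A minimal sketch of the dense approach, assuming the matrix fits in memory once densified. It uses toarray() and np.nonzero; the variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import lil_matrix

x = lil_matrix((4, 3))
x[1, 2] = 5
x[3, 0] = 7

# Densify once (costs rows * cols elements of memory).
dense = x.toarray()
# Vectorized lookup of the positions of all nonzero entries.
rows, cols = np.nonzero(dense)
triples = list(zip(rows.tolist(), cols.tolist(), dense[rows, cols].tolist()))
print(triples)
```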
Try filter(lambda x:x, x) instead of x.
I've written a method that takes in an integer "n" and creates a square matrix where the values of each element are dictated by their respective i,j indices.
When I build a small matrix 30x30 it works just fine, but when I try to do something larger like 1000x1000 it takes very long. Is there any way that I can speed it up with multiprocessing?
import numpy as np
def createMatrix(n):
    matrix = []
    for j in range(1,n+1):
        row = []
        for i in range(1,n+1):
            value = 1/(i+j-1)
            row.append(value)
        matrix.append(row)
    return np.array(matrix)
Parallelizing two computation-bound for loops in Python is not trivial because of the GIL. The good news is that your case is perfectly vectorizable:
def createMatrix(n):
    return 1 / (np.arange(n)[None, :] + np.arange(n)[:, None] + 1)
Explanation:
essentially, your formula for the matrix is X[row][column] = 1/(row+column-1), where rows and columns are 1-based
np.arange(n) creates a range that can be used for rows or columns
[None, :] and [:, None] turn it into a 2d array, 1 x n or n x 1
numpy then broadcasts dimensions, replicating row and column indexes to match dimensions - thus, implicitly tiling both into n x n when added
since both ranges are 0-based, using +1 instead of -1
As a rule of thumb, it is almost never a good idea to use for loops on numpy arrays. A vectorized approach (i.e. matrix form computations) is orders of magnitude faster.
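The broadcasting steps listed above can be traced on a tiny case. This is only an illustrative sketch with made-up variable names; the comparison against scipy.linalg.hilbert relies on the fact that 1/(i+j-1) with 1-based indices is the Hilbert matrix.

```python
import numpy as np
from scipy.linalg import hilbert

n = 3
row_idx = np.arange(n)[None, :]   # shape (1, n): 0-based column indexes
col_idx = np.arange(n)[:, None]   # shape (n, 1): 0-based row indexes

# Broadcasting tiles both operands to (n, n) during the addition.
index_sum = row_idx + col_idx
# +1 because both ranges are 0-based while the formula is 1-based.
matrix = 1 / (index_sum + 1)

print(row_idx.shape, col_idx.shape, matrix.shape)
print(np.allclose(matrix, hilbert(n)))
```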
It's not a good idea to use for loops to fill a list and then convert it to a matrix. The operation you have can be vectorized with numpy from the start, since M(i,j) = 1/(i+j-1) when both indices start at 1.
Here's my proposal :
def createMatrix2(n):
    arr = np.arange(1,n+1)
    xx,yy = np.meshgrid(arr,arr)
    matrix = 1/(xx+yy-1)
    return matrix
Looking at Marat's answer, I think it is better, so I tested the three methods:
EDIT: added wwii's method as createMatrix4 (correcting the errors):
import numpy as np
from time import time
def createMatrix1(n):
    matrix = []
    for j in range(1,n+1):
        row = []
        for i in range(1,n+1):
            value = 1/(i+j-1)
            row.append(value)
        matrix.append(row)
    return np.array(matrix)
def createMatrix2(n):
    arr = np.arange(1,n+1)
    xx,yy = np.meshgrid(arr,arr)
    matrix = 1/(xx+yy-1)
    return matrix
def createMatrix3(n):
    """Marat's proposed matrix"""
    return 1 / (1 + np.arange(n)[None, :] + np.arange(n)[:, None])
def createMatrix4(n):
    """ wwii method"""
    # the slice must run to n+1 so that the result is n x n
    i,j = np.ogrid[1:n+1,1:n+1]
    return 1/(i+j-1)
# test all four methods
n = 10000
t1 = time()
m1 = createMatrix1(n)
t2 = time()
m2 = createMatrix2(n)
t3 = time()
m3 = createMatrix3(n)
t4 = time()
m4 = createMatrix4(n)
t5 = time()
print(np.allclose(m1,m2))
print(np.allclose(m1,m3))
print(np.allclose(m1,m4))
print("Matrix 1 (OP): ",t2-t1)
print("Matrix 2: (mine)",t3-t2)
print("Matrix 3: (Marat)",t4-t3)
print("Matrix 4: (wwii)",t5-t4)
# the output is:
#True
#True
#True
#Matrix 1 (OP): 18.4886577129364
#Matrix 2: (mine) 1.005324363708496
#Matrix 3: (Marat) 0.43033909797668457
#Matrix 4: (wwii) 0.5138359069824219
So Marat's solution is the fastest. Some general comments:
Try to avoid for loops.
Think of your problem as operations on indices, and design operations on numpy arrays directly.
Finally, even given Marat's answer, I think my proposal is easier to read and understand, but that's just a subjective view.
Your code can be written in another style, accelerated by the numba library in parallel nopython mode:
import numpy as np
import numba as nb
@nb.njit("float64[:, ::1](int64)", parallel=True, fastmath=True)
def createMatrix(n):
    matrix = np.empty((n, n)) # np.zeros is slower than np.empty
    for j in nb.prange(1, n + 1):
        for i in range(1, n + 1):
            matrix[j - 1, i - 1] = 1 / (i + j - 1)
    return matrix
This solution is about 3 times faster than Marat's answer above.
Benchmarks: (temporary link to colab)
n = 1000
1000 loops, best of 5: 3.52 ms per loop # Marat
1000 loops, best of 5: 1.5 ms per loop # numba accelerated with np.zeros
1000 loops, best of 5: 1.05 ms per loop # numba accelerated with np.empty
n = 3000
1000 loops, best of 5: 39.5 ms per loop
1000 loops, best of 5: 19.3 ms per loop
1000 loops, best of 5: 8.91 ms per loop
n = 5000
1000 loops, best of 5: 109 ms per loop
1000 loops, best of 5: 53.5 ms per loop
1000 loops, best of 5: 24.8 ms per loop
Ordered list reduction
I need to reduce some lists where, depending on element types, the speed and implementation of the binary operation varies, i.e. large speedups can be gained by reducing some pairs with specific functions first.
For example foo(a[0], bar(a[1], a[2]))
might be a lot slower than bar(foo(a[0], a[1]), a[2]) but in this case give the same result.
I have the code that produces an optimal ordering in the form of a list of tuples (pair_index, binary_function) already. I am struggling to implement an efficient function to perform the reduction, ideally one that returns a new partial function which can then be used repeatedly on lists of the same type-ordering but varying values.
Simple and slow(?) solution
Here is my naive solution involving a for loop, deletion of elements and closure over the (pair_index, binary_function) list to return a 'precomputed' function.
def ordered_reduce(a, pair_indexes, binary_functions, precompute=False):
    """
    a: list to reduce, length n
    pair_indexes: order of pairs to reduce, length (n-1)
    binary_functions: functions to use for each reduction, length (n-1)
    """
    def ord_red_func(x):
        y = list(x) # copy so as not to eat up
        for p, f in zip(pair_indexes, binary_functions):
            b = f(y[p], y[p+1])
            # Replace pair
            del y[p]
            y[p] = b
        return y[0]
    return ord_red_func if precompute else ord_red_func(a)
>>> foos = (lambda a, b: a - b, lambda a, b: a + b, lambda a, b: a * b)
>>> ordered_reduce([1, 2, 3, 4], (2, 1, 0), foos)
1
>>> 1 * (2 + (3-4))
1
And how pre-computation works:
>>> foo = ordered_reduce(None, (0, 1, 0), foos, precompute=True)
>>> foo([1, 2, 3, 4])
-7
>>> (1 - 2) * (3 + 4)
-7
However it involves copying the whole list and is also (therefore?) slow. Is there a better/standard way to do this?
(EDIT:) Some Timings:
from operator import add
from functools import reduce
from itertools import repeat
from random import random
r = 100000
xs = [random() for _ in range(r)]
# slightly trivial choices of pairs and functions, to replicate reduce
ps = [0]*(r-1)
fs = repeat(add)
foo = ordered_reduce(None, ps, fs, precompute=True)
>>> %timeit reduce(add, xs)
100 loops, best of 3: 3.59 ms per loop
>>> %timeit foo(xs)
1 loop, best of 3: 1.44 s per loop
This is something of a worst-case scenario, and slightly cheating, as reduce does not take an iterable of functions. But a function which does (though with no ordering) is still pretty fast:
def multi_reduce(fs, xs):
    xs = iter(xs)
    x = next(xs)
    for f, nx in zip(fs, xs):
        x = f(x, nx)
    return x
>>> %timeit multi_reduce(fs, xs)
100 loops, best of 3: 8.71 ms per loop
(EDIT2): and for fun, the performance of a massively cheating 'compiled' version, which gives some idea of the total overhead occurring.
from numba import jit
@jit(nopython=True)
def numba_sum(xs):
    y = 0
    for x in xs:
        y += x
    return y
>>> %timeit numba_sum(xs)
1000 loops, best of 3: 1.46 ms per loop
When I read this problem, I immediately thought of reverse Polish notation (RPN). While it may not be the best approach, it still gives a substantial speedup in this case.
My second thought is that you may get an equivalent result if you just reorder the sequence xs appropriately to get rid of del y[p]. (Arguably the best performance would be achieved if the whole reduce procedure is written in C. But it's a different kettle of fish.)
Reverse Polish Notation
If you are not familiar with RPN, please read the short explanation in the wikipedia article. Basically, all operations can be written down without parentheses, for example (1-2)*(3+4) is 1 2 - 3 4 + * in RPN, while 1-(2*(3+4)) becomes 1 2 3 4 + * -.
Here is a simple implementation of an RPN evaluator. I separated the list of objects from the RPN sequence, so that the same sequence can be used directly for different lists.
def rpn(arr, seq):
    '''
    Reverse Polish Notation algorithm
    (this version works only for binary operators)
    arr: array of objects
    seq: rpn sequence containing indices of objects from arr and functions
    '''
    stack = []
    for x in seq:
        if isinstance(x, int):
            # it's an object: push it to stack
            stack.append(arr[x])
        else:
            # it's a function: pop two objects, apply the function, push the result to stack
            b = stack.pop()
            #a = stack.pop()
            #stack.append(x(a,b))
            ## shortcut:
            stack[-1] = x(stack[-1], b)
    return stack.pop()
Example of usage:
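# subtract, add and multiply are taken from the operator module here (an assumption)
from operator import add, mul as multiply, sub as subtract
# Say we have an array
arr = [100, 210, 42, 13]
# and want to calculate
#   (100 - 210) * (42 + 13)
# It translates to RPN:
#   100 210 - 42 13 + *
# or
#   arr[0] arr[1] - arr[2] arr[3] + *
# So we apply rpn with indices in place of the operands:
rpn(arr, [0, 1, subtract, 2, 3, add, multiply])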
To apply RPN to your case you'd need either to generate rpn sequences from scratch or to convert your (pair_indexes, binary_functions) into them. I haven't thought about a converter but it surely can be done.
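One possible converter, offered as a sketch rather than a tuned implementation: keep one RPN fragment per remaining list element and merge fragments in the given pair order. The name pairs_to_rpn is made up for this example, and rpn is repeated in compact form so the block runs on its own. The list concatenation makes the conversion quadratic in the worst case, which may be acceptable because it only runs once per type-ordering.

```python
from operator import add, mul, sub

def pairs_to_rpn(n, pair_indexes, binary_functions):
    """Convert (pair_indexes, binary_functions) for a list of length n into an RPN sequence."""
    # One fragment per element still "alive" in the list being reduced.
    fragments = [[i] for i in range(n)]
    for p, f in zip(pair_indexes, binary_functions):
        # Reducing the pair (p, p+1) with f is "left right f" in RPN.
        fragments[p:p + 2] = [fragments[p] + fragments[p + 1] + [f]]
    return fragments[0]

def rpn(arr, seq):
    """Compact copy of the evaluator above (binary operators only)."""
    stack = []
    for x in seq:
        if isinstance(x, int):
            stack.append(arr[x])
        else:
            b = stack.pop()
            stack[-1] = x(stack[-1], b)
    return stack.pop()

foos = (sub, add, mul)
seq_a = pairs_to_rpn(4, (0, 1, 0), foos)   # (1 - 2) * (3 + 4)
seq_b = pairs_to_rpn(4, (2, 1, 0), foos)   # 1 * (2 + (3 - 4))
result_a = rpn([1, 2, 3, 4], seq_a)
result_b = rpn([1, 2, 3, 4], seq_b)
print(result_a, result_b)
```

The two calls reproduce the examples from the question, so the converted sequences can be checked against ordered_reduce before relying on them.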
Tests
Your original test comes first:
r = 100000
xs = [random() for _ in range(r)]
ps = [0]*(r-1)
fs = repeat(add)
foo = ordered_reduce(None, ps, fs, precompute=True)
rpn_seq = [0] + [x for i, f in zip(range(1,r), repeat(add)) for x in (i,f)]
rpn_seq2 = list(range(r)) + list(repeat(add,r-1))
# Here rpn_seq denotes ((...(_ + _) + _) ... + _)
# and rpn_seq2 denotes (_ + (_ + (_ + ( ... )))).
# Obviously, they are not equivalent but with 'add' they yield the same result.
%timeit reduce(add, xs)
100 loops, best of 3: 7.37 ms per loop
%timeit foo(xs)
1 loops, best of 3: 1.71 s per loop
%timeit rpn(xs, rpn_seq)
10 loops, best of 3: 79.5 ms per loop
%timeit rpn(xs, rpn_seq2)
10 loops, best of 3: 73 ms per loop
# Pure numpy just out of curiosity:
%timeit np.sum(np.asarray(xs))
100 loops, best of 3: 3.84 ms per loop
xs_np = np.asarray(xs)
%timeit np.sum(xs_np)
The slowest run took 4.52 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 48.5 µs per loop
So, rpn was 10 times slower than reduce but about 20 times faster than ordered_reduce.
Now, let's try something more complicated: alternately adding and multiplying matrices. I need a special function for it to test against reduce.
add_or_dot_b = 1
def add_or_dot(x,y):
    '''calls 'add' and 'np.dot' alternately'''
    global add_or_dot_b
    if add_or_dot_b:
        out = x+y
    else:
        out = np.dot(x,y)
    add_or_dot_b = 1 - add_or_dot_b
    # normalizing out to avoid `inf` in results
    return out/np.max(out)
r = 100001 # +1 for convenience
# (we apply an even number of functions)
xs = [np.random.rand(2,2) for _ in range(r)]
ps = [0]*(r-1)
fs = repeat(add_or_dot)
foo = ordered_reduce(None, ps, fs, precompute=True)
rpn_seq = [0] + [x for i, f in zip(range(1,r), repeat(add_or_dot)) for x in (i,f)]
%timeit reduce(add_or_dot, xs)
1 loops, best of 3: 894 ms per loop
%timeit foo(xs)
1 loops, best of 3: 2.72 s per loop
%timeit rpn(xs, rpn_seq)
1 loops, best of 3: 1.17 s per loop
Here, rpn was roughly 30% slower than reduce and more than 2 times faster than ordered_reduce.
Hi, I'm writing a program for the AES mix column stage. Here I have to multiply two matrices of shape (4,4). The only difference is that wherever ordinary matrix multiplication adds, I have to take XOR instead, e.g.
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
np.dot(a,b) # this gives [[(1*5+2*7),(1*6+2*8)],[(3*5+4*7),(3*6+4*8)]]
# but I want [[((1*5)^(2*7)),((1*6)^(2*8))],[((3*5)^(4*7)),((3*6)^(4*8))]]
Here's the solution with loops
result = [[0,0,0,0],
          [0,0,0,0],
          [0,0,0,0],
          [0,0,0,0]]
# iterate through rows of X
for i in range(len(X)):
    # iterate through columns of Y
    for j in range(len(Y[0])):
        # iterate through rows of Y
        for k in range(len(Y)):
            result[i][j] = result[i][j] ^ (X[i][k] * Y[k][j])
How to achieve that without using loops?
xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
For explanation, consider a rectangular problem for easier identification:
a=np.arange(12).reshape(4,3).astype(object)
b=np.arange(12).reshape(3,4).astype(object)
The object dtype is there to provide Python's arbitrary-precision ints for AES.
Products are obtained by broadcasting:
c=a[...,None]*b # dims : (4,3,1) * ((1),3,4) -> (4,3,4) , c_ijk =a_ij*b_jk
The dot product is then obtained by:
dot_ab=c.sum(axis=1) # ->(4,4)
In [734]: (dot_ab==a.dot(b)).all()
Out[734]: True
Then change to the equivalent xor function :
xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
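As a check, the one-liner can be compared with the explicit triple loop on the 2x2 example from the question. The function names below are made up for this sketch, and plain integer arrays are used instead of the object dtype.

```python
import numpy as np

def xor_matmul(a, b):
    """Matrix product where the sum over the inner index is replaced by XOR."""
    # products[i, j, k] = a[i, j] * b[j, k], via broadcasting
    products = a[..., None] * b
    # XOR-reduce over the shared inner index j (axis 1)
    return np.bitwise_xor.reduce(products, axis=1)

def xor_matmul_loops(a, b):
    """Reference implementation with explicit loops, as in the question."""
    result = np.zeros((a.shape[0], b.shape[1]), dtype=a.dtype)
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(b.shape[0]):
                result[i, j] ^= a[i, k] * b[k, j]
    return result

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
xor_ab = xor_matmul(a, b)
print(xor_ab)
print((xor_ab == xor_matmul_loops(a, b)).all())
```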
As an alternative, you can compile your loops with numba (0.23):
import numpy as np
from numba import jit
@jit(nopython=True)
def xor(X,Y):
    result=np.zeros((4,4),np.uint64)
    for i in range(len(X)):
        # iterate through columns of Y
        for j in range(Y.shape[1]):
            # iterate through rows of Y
            for k in range(len(Y)):
                result[i,j] = result[i,j] ^ (X[i,k] * Y[k,j])
    return result
for an impressive efficiency gain, due to optimal memory usage.
But you are limited to 32 bits for a and b:
In [790]: %timeit xor(a,b)
1000000 loops, best of 3: 580 ns per loop
In [791]: %timeit xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
100000 loops, best of 3: 13.2 µs per loop
In [792]: (xor(a,b)==np.bitwise_xor.reduce(a[...,None]*b,axis=1)).all()
Out[792]: True
For performance reasons,
I'm curious if there is a way to multiply a stack of stacks of matrices. I have a 4-D array (500, 201, 2, 2). It's basically a 500-long stack of (201,2,2) matrices where, across the 500, I want to multiply the adjacent matrices using einsum and get a single (201,2,2) matrix.
I am only doing matrix multiplication on the [2x2] matrices at the end. Since my explanation is already heading off the rails, I'll just show what I'm doing now, and also the 'reduce' equivalent and why it's not helpful (because it's the same speed computationally). Preferably this would be a numpy one-liner, but I don't know what that is, or even if it's possible.
Code:
Arr = rand(500,201,2,2)
def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1,len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult
def myeinsum(A1, A2):
    return np.einsum('fij,fjk->fik', A1, A2)
A1 = loopMult(Arr)
A2 = reduce(myeinsum, Arr)
print np.all(A1 == A2)
print shape(A1); print shape(A2)
%timeit loopMult(Arr)
%timeit reduce(myeinsum, Arr)
Returns:
True
(201, 2, 2)
(201, 2, 2)
10 loops, best of 3: 34.8 ms per loop
10 loops, best of 3: 35.2 ms per loop
Any help would be appreciated. Things are functional, but when I have to iterate this over a large series of parameters, the code tends to take a long time, and I'm wondering if there's a way to avoid the 500 iterations through a loop.
I don't think it's possible to do this efficiently using numpy (the cumprod solution was elegant, though). This is the sort of situation where I would use f2py. It's the simplest way of calling a faster language that I know of and only requires a single extra file.
fortran.f90:
subroutine multimul(a, b)
  implicit none
  real(8), intent(in) :: a(:,:,:,:)
  real(8), intent(out) :: b(size(a,1),size(a,2),size(a,3))
  real(8) :: work(size(a,1),size(a,2))
  integer i, j, k, l, m
  !$omp parallel do private(work,i,j)
  do i = 1, size(b,3)
    b(:,:,i) = a(:,:,i,size(a,4))
    do j = size(a,4)-1, 1, -1
      work = matmul(b(:,:,i),a(:,:,i,j))
      b(:,:,i) = work
    end do
  end do
end subroutine
Compile with f2py -c -m fmuls fortran.f90 (or F90FLAGS="-fopenmp" f2py -c -m fmuls fortran.f90 -lgomp to enable OpenMP acceleration). Then you would use it in your script as
import numpy as np, fmuls
Arr = np.random.standard_normal([500,201,2,2])
def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1,len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult
def myeinsum(A1, A2):
    return np.einsum('fij,fjk->fik', A1, A2)
A1 = loopMult(Arr)
A2 = reduce(myeinsum, Arr)
A3 = fmuls.multimul(Arr.T).T
print np.allclose(A1,A2)
print np.allclose(A1,A3)
%timeit loopMult(Arr)
%timeit reduce(myeinsum, Arr)
%timeit fmuls.multimul(Arr.T).T
Which outputs
True
True
10 loops, best of 3: 48.4 ms per loop
10 loops, best of 3: 48.8 ms per loop
100 loops, best of 3: 5.82 ms per loop
So that's a factor 8 speedup. The reason for all the transposes is that f2py implicitly transposes all the arrays, and we need to transpose them manually to tell it that our Fortran code expects things to be transposed. This avoids a copy operation. The cost is that each of our 2x2 matrices is transposed, so to avoid performing the wrong operation we have to loop in reverse.
Greater speedups than 8 should be possible - I didn't spend any time trying to optimize this.
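A pure-numpy alternative worth timing is a pairwise (tree) reduction: because matrix multiplication is associative, neighbouring matrices can be multiplied in batches with np.matmul, so the Python-level loop runs about log2(500) times instead of 500. This is a sketch under that assumption, with a made-up function name; whether it beats the einsum loop or the Fortran version on a given machine has not been measured here.

```python
import numpy as np

def chain_matmul_pairwise(arr):
    """Compute arr[0] @ arr[1] @ ... @ arr[-1] over axis 0 by repeated pairwise products."""
    while arr.shape[0] > 1:
        if arr.shape[0] % 2:
            # Odd count: set the last matrix stack aside and re-append it afterwards.
            tail, arr = arr[-1:], arr[:-1]
        else:
            tail = arr[:0]
        # Multiply neighbours (0,1), (2,3), ... in one batched call; order is preserved.
        arr = np.concatenate([np.matmul(arr[0::2], arr[1::2]), tail])
    return arr[0]

rng = np.random.default_rng(0)
Arr = rng.random((7, 5, 2, 2))

# Reference: the sequential loop from the question.
expected = Arr[0]
for i in range(1, len(Arr)):
    expected = np.einsum('fij,fjk->fik', expected, Arr[i])

result = chain_matmul_pairwise(Arr)
print(np.allclose(result, expected))
```

The results agree only up to floating-point rounding, since the products are grouped differently.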
I have n documents in MongoDB containing a scipy sparse vector, stored as a pickle object and initially created with scipy.sparse.lil. The vectors are all of the same size, say p x 1.
What I need to do is to put all these vectors into a sparse n x p matrix back in python. I am using mongoengine and thus defined a property to load each pickle vector:
class MyClass(Document):
    vector_text = StringField()
    @property
    def vector(self):
        return cPickle.loads(self.vector_text)
Here's what I'm doing now, with n = 4700 and p = 67:
items = MyClass.objects()
M = items[0].vector
for item in items[1:]:
    to_add = item.vector
    M = scipy.sparse.hstack((M, to_add))
The loading part (i.e. calling the property n times) takes about 1.3s. The stacking part takes about 2.7s. Since n is going to increase seriously in the future (possibly to more than a few hundred thousand), I sense that this is not optimal :)
Any idea to speed the whole thing up? If you know how to speed up only the "loading" or the "stacking", I'm happy to hear it. For instance, maybe the solution is to store the entire matrix in MongoDB? Thanks!
First, what you describe you want to do would require you using vstack, not hstack. In any case, your choice of sparse format is part of your performance problem. Try the following:
import numpy as np
import scipy.sparse as sps
n, p = 4700, 67
csr_vecs = [sps.rand(1, p, density=0.5, format='csr') for j in xrange(n)]
lil_vecs = [vec.tolil() for vec in csr_vecs]
%timeit sps.vstack(csr_vecs, format='csr')
1 loops, best of 3: 722 ms per loop
%timeit sps.vstack(lil_vecs, format='lil')
1 loops, best of 3: 1.34 s per loop
So there's already a 2x improvement simply from switching to CSR. Furthermore, the stacking functions of scipy.sparse do not seem to be very optimized, definitely not for sparse vectors. The following two functions stack a list of CSR or LIL vectors, returning a CSR sparse matrix:
def csr_stack(vectors):
    data = np.concatenate([vec.data for vec in vectors])
    indices = np.concatenate([vec.indices for vec in vectors])
    indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
    return sps.csr_matrix((data, indices, indptr), shape=(len(vectors),
                                                          vectors[0].shape[1]))
import itertools as it
def lil_stack(vectors):
    indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
    data = np.fromiter(it.chain(*(vec.data[0] for vec in vectors)),
                       dtype=vectors[0].dtype, count=indptr[-1])
    indices = np.fromiter(it.chain(*(vec.rows[0] for vec in vectors)),
                          dtype=np.intp, count=indptr[-1])
    return sps.csr_matrix((data, indices, indptr), shape=(len(vectors),
                                                          vectors[0].shape[1]))
It works:
>>> np.allclose(sps.vstack(csr_vecs).A, csr_stack(csr_vecs).A)
True
>>> np.allclose(csr_stack(csr_vecs).A, lil_stack(lil_vecs).A)
True
And is substantially faster:
%timeit csr_stack(csr_vecs)
100 loops, best of 3: 11.7 ms per loop
%timeit lil_stack(lil_vecs)
10 loops, best of 3: 37.6 ms per loop
%timeit lil_stack(lil_vecs).tolil()
10 loops, best of 3: 53.6 ms per loop
So, by switching to CSR, you can improve performance by over 100x. If you stick with LIL, your performance improvement will be only around 30x, more if you can live with CSR in the combined matrix, less if you insist on LIL.
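If the stored vectors really are p x 1 columns, as in the question, one way to avoid the repeated stacking in the loop is a single hstack call followed by a transpose. This is a minimal sketch with small made-up data, not a benchmark.

```python
import numpy as np
import scipy.sparse as sps

p = 5
# Three p x 1 LIL column vectors standing in for the unpickled documents.
columns = []
for k in range(3):
    v = sps.lil_matrix((p, 1))
    v[k, 0] = k + 1.0
    columns.append(v)

# Stack all columns at once into p x n, then transpose to n x p and
# convert back to CSR (the transpose of a CSR matrix is CSC).
M = sps.hstack(columns, format='csr').T.tocsr()
print(M.shape)
print(M.toarray())
```

Stacking once avoids rebuilding the growing matrix on every iteration, which is what makes the loop in the question scale poorly.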
I think you should try using ListField, which is essentially a Python list representation of a BSON array, to store your vectors. That way, you won't need to unpickle them every time.
class MyClass(Document):
    vector = ListField()
items = MyClass.objects()
M = items[0].vector
The only problem I can see with that solution is that you have to convert the Python lists to a scipy sparse vector type, but I believe that should be faster.
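The conversion step could look like the sketch below. It assumes a hypothetical storage layout in which each document holds two lists, the nonzero column indices and the matching values; neither that layout nor the name rows_to_csr comes from the original posts.

```python
import numpy as np
import scipy.sparse as sps

def rows_to_csr(rows, p):
    """Build an n x p CSR matrix from a list of (indices, values) pairs, one pair per row."""
    # indptr[k] is the offset of row k in the flattened index and value arrays.
    indptr = np.cumsum([0] + [len(idx) for idx, _ in rows])
    total = int(indptr[-1])
    indices = np.fromiter((i for idx, _ in rows for i in idx),
                          dtype=np.intp, count=total)
    data = np.fromiter((v for _, vals in rows for v in vals),
                       dtype=float, count=total)
    return sps.csr_matrix((data, indices, indptr), shape=(len(rows), p))

# Hypothetical documents: (column indices, values) for each vector.
docs = [([0, 3], [1.0, 2.0]), ([], []), ([2], [5.0])]
M = rows_to_csr(docs, p=4)
print(M.toarray())
```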