how to speed up the computation? - python

I need to calculate a 1 million × 1 million set of computations to fill a sparse matrix. But when I use loops to fill the matrix line by line, I find it takes 6 minutes to do just a 100×100 block of computations, so the full task will never finish. Is there some way to speed up the process?
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
tp = pd.read_csv('F:\\SogouDownload\\train.csv', iterator=True, chunksize=1000)
data = pd.concat(tp, ignore_index=True)
matrix=lil_matrix((1862220,1862220))
for i in range(1, 1862220):
    for j in range(1, 1862220):
        matrix[i-1, j-1] = np.sum(data[data['source_node']==i].destination_node.isin(data[data['source_node']==j].destination_node))

While not the fastest way of constructing a sparse matrix, this isn't horribly slow either, at least not the lil assignment step:
In [204]: N=100
In [205]: M=sparse.lil_matrix((N,N))
In [206]: for i in range(N):
     ...:     for j in range(N):
     ...:         M[i,j] = (i==j)
In [207]: M
Out[207]:
<100x100 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in LInked List format>
It saved just the nonzero values to M. I barely saw the delay during the loop.
So my guess is that most of the time is spent in the pandas indexing expression:
np.sum(data[data['source_node']==i].destination_node.isin(data[data['source_node']==j].destination_node))
Converting data, often text, into co-occurrence-count sparse matrices comes up often; they are used in machine learning code, pattern searches, etc. scikit-learn is often used for this, as is tensorflow.
For N=1000
In [212]: %%timeit
     ...: M = sparse.lil_matrix((N,N))
     ...: for i in range(N):
     ...:     for j in range(N):
     ...:         M[i,j] = (i==j)
     ...:
1 loop, best of 3: 7.31 s per loop
Iteratively assigning these values to a dense array is faster, even if we include the conversion to sparse at the end.
In [213]: %%timeit
     ...: M = np.zeros((N,N))
     ...: for i in range(N):
     ...:     for j in range(N):
     ...:         M[i,j] = (i==j)
     ...:
1 loop, best of 3: 353 ms per loop
In [214]: %%timeit
     ...: M = np.zeros((N,N))
     ...: for i in range(N):
     ...:     for j in range(N):
     ...:         M[i,j] = (i==j)
     ...: M = sparse.lil_matrix(M)
     ...:
1 loop, best of 3: 353 ms per loop
But for the very large case, creating that intermediate dense array might hit memory problems.
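If the pandas indexing really is the bottleneck, another thing worth trying before changing the matrix construction is to hoist the per-node filtering out of the double loop. This is only a sketch (it assumes data as loaded in the question, and the rows/columns follow the sorted node labels rather than the raw 1-based labels), and the pair loop is still quadratic, so it only removes the repeated pandas lookup cost:
from scipy.sparse import lil_matrix

# group the destination nodes once per source node instead of filtering
# the whole DataFrame inside the i,j loop
groups = {k: set(v) for k, v in data.groupby('source_node')['destination_node']}
nodes = sorted(groups)

matrix = lil_matrix((len(nodes), len(nodes)))
for a, i in enumerate(nodes):
    for b, j in enumerate(nodes):
        common = len(groups[i] & groups[j])
        if common:
            matrix[a, b] = common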

The technique to use here is sparse matrix multiplication. But for that technique you first need a binary matrix mapping source nodes to destination nodes (the node labels will be the indices of the nonzero entries).
from scipy.sparse import csr_matrix
I = data['source_node'] - 1
J = data['destination_node'] - 1
values = np.ones(len(data), int)
shape = (np.max(I) + 1, np.max(J) + 1)
mapping = csr_matrix((values, (I, J)), shape)
The technique itself is simply a matrix multiplication of this matrix with its transpose (see also this question).
cooccurrence = mapping.dot(mapping.T)
The only potential problem is that the resulting matrix may not be sparse and could consume all your RAM.
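To make the mapping idea concrete, here is a tiny worked example on a hypothetical three-edge DataFrame (not the real train.csv):
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# hypothetical edge list: source 1 -> {1, 2}, source 2 -> {2}
data = pd.DataFrame({'source_node':      [1, 1, 2],
                     'destination_node': [1, 2, 2]})

I = data['source_node'] - 1
J = data['destination_node'] - 1
values = np.ones(len(data), int)
mapping = csr_matrix((values, (I, J)), shape=(np.max(I) + 1, np.max(J) + 1))

cooccurrence = mapping.dot(mapping.T)
print(cooccurrence.toarray())
# [[2 1]
#  [1 1]]   -> sources 1 and 2 share exactly one destination (node 2)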

Related

Dask Distributed: Reducing Multiple Dimensions into a Distance Matrix

I want to calculate a large distance matrix based on higher-dimensional vectors. For instance, I have 1000 instances, each represented by 20 vectors of length 10. The distance between two instances is given by the mean distance between the 20 vectors associated with each instance. So I want to go from a 1000 by 20 by 10 array to a 1000 by 1000 (lower-triangular) matrix. Because these calculations can get slow, I want to use Dask distributed to block the algorithm and spread it over several CPUs. Below is how far I've gotten:
Preamble
import itertools
import random
import numpy as np
import dask.array
from dask.distributed import Client
The distance function is defined by
def distance(u, v):
    result = np.empty([int((len(u)*(len(u)+1))/2)], dtype=float)
    for i, j in itertools.product(range(len(u)), range(len(v))):
        if j <= i:
            differences = []
            k = int(((i*(i+1))/2 + j - 1) + 1)
            for x, y in itertools.product(u[i], v[j]):
                difference = np.abs(np.array(x) - np.array(y)).sum(axis=1)
                differences.append(difference)
            result[k] = np.mean(differences)
    return result
and returns an array of length n*(n+1)/2 to describe the lower triangular matrix for this block of the distance matrix.
def distance_matrix(X):
    X = np.asarray(X, dtype=object)
    X = dask.array.from_array(X, (100, 20, 10)).astype(float)
    print("chunksize: ", X.chunksize)
    resulting_length = [int((X.chunksize[0]*(X.chunksize[0])+1)/2)]
    result = dask.array.map_blocks(distance, X, X, chunks=(resulting_length), drop_axis=[1,2], dtype=float)
    return result.compute()
I split up the input array in chunks and use dask.array.map_blocks to apply the distance calculation to all the blocks.
if __name__ == '__main__':
    workers = 6
    X = np.array([[[random.random() for _ in range(10)] for _ in range(20)] for _ in range(1000)])
    client = Client(n_workers=workers)
    results = distance_matrix(X)
    client.close()
    print(results)
Unfortunately, this approach returns the wrong length of array at the end of the process. Would somebody help me out here? I don't have much experience with distributed computing.
I'm a big fan of dask, but this problem is way too small to need it. The runtime issue you're seeing is because you are looping through each element in python rather than using vectorized operations in numpy.
As with many packages in Python, numpy relies on highly efficient compiled code written in other, faster languages such as C to carry out array operations. When you do something like an array operation A + B, numpy calls these fast routines once, and the array operation is carried out within a highly optimized C routine. There is overhead involved in making calls to other languages, but this is overwhelmed by the performance gain from the single call to a very fast routine. If instead you loop over every element, adding cell-wise, you have a (slow) Python loop, and each element pays that call overhead into the C code. Because of this, you would actually be better off not using numpy at all if you're going to operate one element at a time.
To implement this in a vectorized manner, you can exploit numpy's broadcasting rules to ensure the first dimensions of your two arrays expand to a new dimension. I don't totally understand what's going on in your distance function, but you could extend this simple version to do whatever you want:
In [1]: import numpy as np
In [2]: A = np.random.random((1000, 20))
...: B = np.random.random((1000, 20))
In [3]: distance = np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)
In [4]: distance
Out[4]:
array([[7.22985776, 7.76185666, 5.61824886, ..., 7.62092039, 6.35189562,
7.06365986],
[5.73359499, 5.8422105 , 7.2644021 , ..., 5.72230353, 6.79390303,
5.03074007],
[7.27871151, 8.6856818 , 5.97489449, ..., 8.86620029, 7.49875638,
6.57389575],
...,
[7.67783107, 7.24419076, 4.17941596, ..., 8.68674754, 6.65078093,
5.67279811],
[7.1550136 , 6.10590227, 5.75417987, ..., 7.05953998, 5.8306628 ,
6.55112672],
[5.81748615, 6.79246838, 6.95053088, ..., 7.63994705, 6.77720511,
7.5663236 ]])
In [5]: distance.shape
Out[5]: (1000, 1000)
The performance difference can be seen clearly against a looped implementation:
In [6]: %%timeit
...: np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)
...:
...:
45 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %%timeit
...: distances = np.empty((1000, 1000))
...: for i in range(1000):
...: for j in range(1000):
...: distances[i, j] = np.abs(A[i, :] - B[j, :]).sum()
...:
2.42 s ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The looped version takes more than 50x as long!
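If you need the full (1000, 20, 10) case rather than the simplified one above, the same broadcasting trick extends to the mean pairwise vector distance. This is only a sketch, assuming the instance distance is the mean of the L1 distances over all 20x20 vector pairs (as the product loop in the question suggests); the full broadcast builds an (n, n, 20, 20, 10) intermediate, so it only fits in memory for modest n, and for larger n you would go back to processing blocks of rows (which is where dask's chunking helps):
import numpy as np

n = 100                                   # small enough that the (n, n, 20, 20, 10) intermediate fits in memory
X = np.random.random((n, 20, 10))

# |x - y| summed over the length-10 axis for every pair of the 20 vectors,
# then averaged over the 20x20 vector pairs -> an (n, n) distance matrix
diff = np.abs(X[:, None, :, None, :] - X[None, :, None, :, :])   # shape (n, n, 20, 20, 10)
dist = diff.sum(axis=-1).mean(axis=(-2, -1))                     # shape (n, n)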

Fast column access over large scipy sparse matrix

I am working with scipy's csc sparse matrix and currently a major bottleneck in the code is a line similar to the following
for i in range(multiply_cols.shape[0]):
    F = F - factor*values[i]*mat.getcol(multiply_cols[i])
The matrices that I am working with are extremely large, typically larger than 10**6 x 10**6, and I don't want to convert them to dense matrices. In fact I have a restriction to always keep the matrix in csc format. My attempts show that converting to coo_matrix or lil_matrix does not pay off either.
Here is my rudimentary attempts using csc, csr and coo:
n=1000
sA = csc_matrix(np.random.rand(n,n))
F = np.random.rand(n,1)
multiply_cols = np.unique(np.random.randint(0,int(0.6*n),size=n))
values = np.random.rand(multiply_cols.shape[0])
def foo1(mat,F,values,multiply_cols):
    factor = 0.75
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])

def foo2(mat,F,values,multiply_cols):
    factor = 0.75
    mat = mat.tocsr()
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])

def foo3(mat,F,values,multiply_cols):
    factor = 0.75
    mat = mat.tocoo()
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])

def foo4(mat,F,values,multiply_cols):
    factor = 0.75
    mat = mat.tolil()
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])
and timing them I get:
In [41]: %timeit foo1(sA,F,values,multiply_cols)
10 loops, best of 3: 133 ms per loop
In [42]: %timeit foo2(sA,F,values,multiply_cols)
1 loop, best of 3: 999 ms per loop
In [43]: %timeit foo3(sA,F,values,multiply_cols)
1 loop, best of 3: 6.38 s per loop
In [44]: %timeit foo4(sA,F,values,multiply_cols)
1 loop, best of 3: 45.1 s per loop
So coo_matrix and lil_matrix are certainly not a good choice here. Does anyone know a faster way of doing this? Is it a good option to retrieve the underlying indptr, indices and data arrays and write a custom Cython solution?
I found in Sparse matrix slicing using list of int that column (or row) indexing for sparse matrices is essentially a matrix multiplication task - construct a sparse matrix with the right mix of 1s and 0s, and multiply. Also row (and column) sums are done with multiplication.
This function implements that idea. M is a 1 column sparse matrix, with values in the multiply_cols slots:
def wghtsum(sA, values, multiply_cols):
    cols = np.zeros_like(multiply_cols)
    M = sparse.csc_matrix((values, (multiply_cols, cols)), shape=(sA.shape[1], 1))
    return (sA*M).A
testing:
In [794]: F1=wghtsum(sA,values,multiply_cols)
In [800]: F2=(sA[:,multiply_cols]*values)[:,None]  # Divakar's
In [802]: np.allclose(F1,F2)
Out[802]: True
It has a modest time saving over @Divakar's solution:
In [803]: timeit F2=(sA[:,multiply_cols]*values)[:,None]
100 loops, best of 3: 18.3 ms per loop
In [804]: timeit F1=wghtsum(sA,values,multiply_cols)
100 loops, best of 3: 6.57 ms per loop
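Tying it back to the update loop in the question (just a sketch; the factor comes from foo1):
factor = 0.75
F = F - factor * wghtsum(sA, values, multiply_cols)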
=======
sA as created is dense - it's a sparse rendition of a dense random array. sparse.rand can be used to create a sparse random matrix with a defined level of sparsity.
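For example (a sketch; the density is arbitrary):
sA = sparse.rand(1000, 1000, density=0.01, format='csc')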
In testing your foo1 I had a problem with getcol:
In [818]: sA.getcol(multiply_cols[0])
...
TypeError: an integer is required
In [819]: sA.getcol(multiply_cols[0].item())
Out[819]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Column format>
In [822]: sA[:,multiply_cols[0]]
Out[822]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Column format>
I suspect that's caused by a scipy version difference.
In [821]: scipy.__version__
Out[821]: '0.17.0'
This issue went away in 0.18, but I can't find a relevant issue or pull request.
Well, you could use a vectorized approach that matrix-multiplies the sliced-out columns of the sparse matrix against values, like so -
F -= (mat[:,multiply_cols]*values*factor)[:,None]
Benchmarking
It seems foo1 is the fastest of the lot listed in the question. So, let's time the proposed approach against that one.
Function definitions -
def foo1(mat,F,values,multiply_cols):
    factor = 0.75
    outF = F.copy()
    for i in range(multiply_cols.shape[0]):
        outF -= factor*values[i]*mat.getcol(multiply_cols[i])
    return outF

def foo_vectorized(mat,F,values,multiply_cols):
    factor = 0.75
    return F - (mat[:,multiply_cols]*values*factor)[:,None]
Timings and verification on bigger set with sparseness -
In [242]: # Setup inputs
...: n = 3000
...: mat = csc_matrix(np.random.randint(0,3,(n,n))) #Sparseness with 0s
...: F = np.random.rand(n,1)
...: multiply_cols = np.unique(np.random.randint(0,int(0.6*n),size=n))
...: values = np.random.rand(multiply_cols.shape[0])
...:
In [243]: out1 = foo1(mat,F,values,multiply_cols)
In [244]: out2 = foo_vectorized(mat,F,values,multiply_cols)
In [245]: np.allclose(out1, out2)
Out[245]: True
In [246]: %timeit foo1(mat,F,values,multiply_cols)
1 loops, best of 3: 641 ms per loop
In [247]: %timeit foo_vectorized(mat,F,values,multiply_cols)
10 loops, best of 3: 40.3 ms per loop
In [248]: 641/40.3
Out[248]: 15.905707196029779
There we have a 15x+ speedup!

Csr_matrix.dot vs. Numpy.dot

I have a large (n=50000) block-diagonal csr_matrix M representing the adjacency matrices of a set of graphs. I have to multiply M by a dense numpy.array v several times. Hence I use M.dot(v).
Surprisingly, I have discovered that first converting M to numpy.array and then using numpy.dot is much faster.
Any ideas why this is the case?
I don't have enough memory to hold a 50000x50000 dense matrix and multiply it by a 50000-element vector. But here are some tests at lower dimensionality.
Setup:
import numpy as np
from scipy.sparse import csr_matrix
def make_csr(n, N):
    rows = np.random.choice(N, n)
    cols = np.random.choice(N, n)
    data = np.ones(n)
    return csr_matrix((data, (rows, cols)), shape=(N,N), dtype=np.float32)
The code above generates sparse matrices with roughly n non-zero elements in an NxN matrix (repeated indices are summed, so the actual count can be a bit lower).
Matrices:
N = 5000
# Sparse matrices
A = make_csr(10*10, N) # ~100 non-zero
B = make_csr(100*100, N) # ~10000 non-zero
C = make_csr(1000*1000, N) # ~1000000 non-zero
D = make_csr(5000*5000, N) # ~25000000 non-zero
E = csr_matrix(np.random.randn(N,N), dtype=np.float32) # non-sparse
# Numpy dense arrays
An = A.todense()
Bn = B.todense()
Cn = C.todense()
Dn = D.todense()
En = E.todense()
b = np.random.randn(N)
Timings:
>>> %timeit A.dot(b) # 9.63 µs per loop
>>> %timeit An.dot(b) # 41.6 ms per loop
>>> %timeit B.dot(b) # 41.3 µs per loop
>>> %timeit Bn.dot(b) # 41.2 ms per loop
>>> %timeit C.dot(b) # 3.2 ms per loop
>>> %timeit Cn.dot(b) # 41.2 ms per loop
>>> %timeit D.dot(b) # 35.4 ms per loop
>>> %timeit Dn.dot(b) # 43.2 ms per loop
>>> %timeit E.dot(b) # 55.5 ms per loop
>>> %timeit En.dot(b) # 43.4 ms per loop
For highly sparse matrices (A and B) it is more than 1000x faster.
For not-so-sparse matrices (C), it still gets a 10x speedup.
For an almost non-sparse matrix (D will have some zeros due to repeated indices, but not many, probabilistically speaking), it is still faster; not by much, but faster.
For a truly non-sparse matrix (E), the operation is slower, but not much slower.
Conclusion: the speedup you get depends on the sparsity of your matrix, but with N = 5000 sparse matrices are always faster (as long as they have some zero entries).
I can't try it for N = 50000 due to memory issues. You can try the above code and see what it is like for you with that N.
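As a rough sanity check on that conclusion (a sketch, not part of the benchmark above): a CSR mat-vec costs roughly one multiply-add per stored element, while the dense product always costs N**2 of them, so N**2/nnz is an upper bound on the speedup; for the very sparse cases the fixed per-call overhead dominates, which is why the observed speedups saturate well below it.
N = 5000
for name, nnz in [('A', 100), ('B', 10000), ('C', 1000000), ('D', 25000000)]:
    # density and the naive dense/sparse operation-count ratio
    print('%s: density %.1e, dense/sparse op ratio ~%.0f' % (name, nnz / float(N**2), N**2 / float(nnz)))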

Diagonal sparse matrix obtained from a sparse coo_matrix

I built some sparse matrix M in Python using the coo_matrix format. I would like to find an efficient way to compute:
A = M + M.T - D
where D is the restriction of M to its diagonal (M is potentially very large). I can't find a way to efficiently build D while keeping a coo_matrix format. Any ideas?
Could D = scipy.sparse.spdiags(coo_matrix.diagonal(M),0,M.shape[0],M.shape[0]) be a solution?
I have come up with a faster coo diagonal:
msk = M.row==M.col
D1 = sparse.coo_matrix((M.data[msk],(M.row[msk],M.col[msk])),shape=M.shape)
sparse.tril uses this method with mask = A.row + k >= A.col (sparse/extract.py)
Some timings for a (100,100) M (and M1 = M.tocsr()):
In [303]: timeit msk=M.row==M.col; D1=sparse.coo_matrix((M.data[msk],(M.row[msk],M.col[msk])),shape=M.shape)
10000 loops, best of 3: 115 µs per loop
In [305]: timeit D=sparse.diags(M.diagonal(),0)
1000 loops, best of 3: 358 µs per loop
So the coo way of getting the diagonal is fast, at least for this small and very sparse matrix (only one entry on the diagonal).
If I start with the csr form, the diags is faster. That's because .diagonal works in the csr format:
In [306]: timeit D=sparse.diags(M1.diagonal(),0)
10000 loops, best of 3: 176 µs per loop
But creating D is a small part of the overall calculation. Again, working with M1 is faster. The sum is done in csr format.
In [307]: timeit M+M.T-D
1000 loops, best of 3: 1.35 ms per loop
In [308]: timeit M1+M1.T-D
1000 loops, best of 3: 1.11 ms per loop
Another way to do the whole thing is to take advantage of the fact that coo allows duplicate i,j values, which will be summed when converted to csr format. So you could stack the row, col, data arrays for M with those for M.T (see M.transpose for how those are constructed), along with masked values for D (or the masked diagonals could be removed from M or M.T).
For example:
def MplusMT(M):
    msk = M.row != M.col
    data = np.concatenate([M.data, M.data[msk]])
    rows = np.concatenate([M.row, M.col[msk]])
    cols = np.concatenate([M.col, M.row[msk]])
    MM = sparse.coo_matrix((data, (rows, cols)), shape=M.shape)
    return MM
# alt version with a more explicit D
# msk = M.row == M.col
# data = np.concatenate([M.data, M.data, -M.data[msk]])
MplusMT as written is very fast because it is just doing array concatenation, not summation. To actually sum the duplicates we have to convert the result to a csr matrix:
MplusMT(M).tocsr()
which takes considerably longer. Still this approach is, in my limited testing, more than 2x faster than M+M.T-D. So it's a potential tool for constructing complex sparse matrices.
You probably want
from scipy.sparse import diags
D = diags(M.diagonal(), 0, format='coo')
This will still build a 1d array of length M.shape[0] as an intermediate step, but that will probably not be so bad.
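A quick way to check either construction on a small example (a sketch; scipy.sparse.random is only used here to make a test matrix): A = M + M.T - D should be symmetric and should leave M's diagonal unchanged.
import numpy as np
from scipy import sparse

M = sparse.random(5, 5, density=0.4, format='coo')
D = sparse.diags(M.diagonal(), 0, format='coo')
A = M + M.T - D

print(np.allclose(A.toarray(), A.toarray().T))   # True: symmetric
print(np.allclose(A.diagonal(), M.diagonal()))   # True: diagonal preserved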

Efficiently sample all arrays in ndarray using scipy.ndimage.map_coordinates

I have a 3D stack of masked arrays. I'd like to sample all arrays in the stack at the same fixed locations.
stack.ma_stack.shape
(1461, 390, 327)
#Indices to be sampled
x = np.array([ 117.38670304, 119.1220485 ])
y = np.array([ 209.98120554, 210.37202372])
The following is very efficient, but only works for integer indices:
x_int = np.rint(x).astype(int)
y_int = np.rint(y).astype(int)
samp = stack.ma_stack[:,y_int,x_int]
samp.shape
(1461, 2)
I'm trying to implement the scipy.ndimage.map_coordinates interpolated sampling for float indices, but I can't seem to figure out how to format the coordinates properly.
Most examples use map_coordinates to sample a single array, and the following works for a single array from the stack:
map_coord = np.array([[y,], [x,]])
samp = scipy.ndimage.map_coordinates(stack.ma_stack[0], map_coord, order=1)
samp.shape
(1, 2)
I can loop through each array in the stack, but I know there is a simple indexing trick that will sample the entire stack in a single call. I read about mgrid, and did some experimentation, but couldn't find the right solution (I'm still learning advanced indexing). I know somebody out there will know the answer right away. Thanks.
On a related note: Anybody know how to do this for masked arrays without replacing missing data with fill_value or np.nan? The ndimage interpolation doesn't play nicely with masked arrays:
https://github.com/scipy/scipy/issues/1682
There must be a way to get it to broadcast automatically... in the meantime, you can force the broadcasting with np.arange(...) to get one point from each 2d array in the stack:
map_coords = np.broadcast_arrays(np.arange(stack.ma_stack.shape[0])[:, None], y, x)
samp = ndimage.map_coordinates(stack.ma_stack, map_coords, order=1)
This is inefficient though, because the "broadcasting" is done in advance (and presumably copies all that data), but it's still quite a bit faster than the loop:
In [88]: a = np.random.rand(1461, 390, 327)
In [89]: x = np.array([ 117.38670304, 119.1220485 ])
In [90]: y = np.array([ 209.98120554, 210.37202372])
In [107]: %%timeit
.....: map_coord = np.array([[y,], [x,]])
.....: np.concatenate([ndimage.map_coordinates(ai, map_coord, order=1) for ai in a])
.....:
10 loops, best of 3: 33.1 ms per loop
In [108]: %%timeit
.....: map_coords = np.broadcast_arrays(np.arange(a.shape[0])[:, None], y, x)
.....: ndimage.map_coordinates(a, map_coords, order=1)
.....:
100 loops, best of 3: 4.67 ms per loop
In [109]: samp_OP = np.concatenate([ndimage.map_coordinates(ai, map_coord, order=1) for ai in a])
In [110]: samp_chan = ndimage.map_coordinates(a, map_coords, order=1)
In [111]: np.allclose(samp_chan, samp_OP)
Out[111]: True
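The masked-array side question isn't handled by ndimage itself (that's the linked scipy issue). A common workaround, sketched here under the assumption that stack.ma_stack carries a full boolean mask, is to interpolate the filled data and the mask weights separately, then mask any sample whose interpolation touched a masked cell:
import numpy as np
from scipy import ndimage

filled = stack.ma_stack.filled(0.0)                         # data with masked cells replaced by 0
maskw = np.ma.getmaskarray(stack.ma_stack).astype(float)    # 1.0 where masked, 0.0 elsewhere

map_coords = np.broadcast_arrays(np.arange(filled.shape[0])[:, None], y, x)

samp = ndimage.map_coordinates(filled, map_coords, order=1)
touched = ndimage.map_coordinates(maskw, map_coords, order=1) > 0   # sample touched a masked cell
samp = np.ma.masked_array(samp, mask=touched)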
