I need to iteratively construct a huge sparse matrix in numpy/scipy. The intitialization is done within a loop:
from scipy.sparse import dok_matrix, csr_matrix
def foo(*args):
dim_x = 256*256*1024
dim_y = 128*128*512
matrix = dok_matrix((dim_x, dim_y))
for i in range(dim_x):
# compute stuff in order to get j
matrix[i, j] = 1.
return matrix.tocsr()
Then i need to convert it to a csr_matrix, because of further computations like:
matrix = foo(...)
result = matrix.T.dot(x)
At the beginning this was working fine. But my matrices are getting bigger and bigger and my computer starts to crash. Is there a more elegant way in storing the matrix?
Basically i have the following requirements:
The matrix needs to store float values form 0. to 1.
I need to compute the transpose of the matrix
I need to compute the dot product with a x_dimensional vector
The matrix dimensions can be around 1*10^9 x 1*10^8
My ram-storage is exceeding. I was reading several posts on stack overflow and the rest of the internet ;) I found PyTables, which isn't really made for matrix computations... etc.. Is there a better way?
For your case I would recommend using the data type np.int8 (or np.uint8) which require only one byte per element:
matrix = dok_matrix((dim_x, dim_y), dtype=np.int8)
Directly constructing the csr_matrix will also allow you to go further with the maximum matrix size:
from scipy.sparse import csr_matrix
def foo(*args):
dim_x = 256*256*1024
dim_y = 128*128*512
row = []
col = []
for i in range(dim_x):
# compute stuff in order to get j
row.append(i)
col.append(j)
data = np.ones_like(row, dtype=np.int8)
return csr_matrix((data, (row, col)), shape=(dim_x, dim_y), dtype=np.int8)
You may have hit the limits of what Python can do for you, or you may be able to do a little more. Try setting a datatype of np.float32, if you're on a 64 bit machine, this reduced precision may reduce your memory consumption. np.float16 may help you on memory even further, but your calculations may slow down (I've seen examples where processing may take 10x the amount of time):
matrix = dok_matrix((dim_x, dim_y), dtype=np.float32)
or possibly much slower, but even less memory consumption:
matrix = dok_matrix((dim_x, dim_y), dtype=np.float16)
Another option: buy more system memory.
Finally, if you can avoid creating your matrix with dok_matrix, and can create it instead with csr_matrix (I don't know if this is possible for your calculations) you may save a little overhead on the dict that dok_matrix uses.
Related
I have a number of indices and values that make up a scipy.coo_matrix. The indices/values are generated from different subroutines and are concatenated together before handed over to the matrix constructor:
import numpy
from scipy import sparse
n = 100000
I0 = range(n)
J0 = range(n)
V0 = numpy.random.rand(n)
I1 = range(n)
J1 = range(n)
V1 = numpy.random.rand(n)
# [...]
I = numpy.concatenate([I0, I1])
J = numpy.concatenate([J0, J1])
V = numpy.concatenate([V0, V1])
matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
Now, the components of (I, J, V) can be quite large such that the concatenate operations become significant. (In the above example it takes over 20% of the runtime on my machine.) I'm reading that it's not possible to concatenate without a copy.
Is there a way for handing over indices and values without copying the input data around first?
If you look at the code for coo_matrix.__init__ you'll see that it's pretty simple. In fact if the (V, (I,J)) inputs are right it will simply assign those 3 arrays to its .data, row, col attributes. You can even check that after creation by comparing those attributes with your variables.
If they aren't 1d arrays of the right dtype, it will massage them - make the arrays, etc. So without getting into details, processing that you do before hand might save time in the coo call.
self.row = np.array(row, copy=copy, dtype=idx_dtype)
self.col = np.array(col, copy=copy, dtype=idx_dtype)
self.data = np.array(obj, copy=copy)
One way or other those attributes will have to each be a single array, not a loose list of arrays or lists of lists.
sparse.bmat makes a coo matrix from other ones. It collected their coo attributes, joins them in the fill an empty array styles, and calls coo_matrix. Look at its code.
Almost all numpy operations that return a new array do so by allocating an empty and filling it. Letting numpy do that in compiled code (with np.concatentate) should be a be a little faster, but details like the size and number of inputs will make a difference.
A non_connonical coo matrix is just the start. Many operations require a conversion to one of the other formats.
Efficiently construct FEM/FVM matrix
This is about sparse matrix constrution where there are many duplicate points that need to be summed - and using using the csr format for calculations.
You can try pre-allocating the arrays. It'll spare you the copy at least. I didn't see any speedup for the example, but you might see a change.
import numpy
from scipy import sparse
n = 100000
I = np.empty(2*n, np.double)
J = np.empty_like(I)
V = np.empty_like(I)
I[:n] = range(n)
J[:n] = range(n)
V[:n] = numpy.random.rand(n)
I[n:] = range(n)
J[n:] = range(n)
V[n:] = numpy.random.rand(n)
matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
I am trying to vectorize an operation using numpy, which I use in a python script that I have profiled, and found this operation to be the bottleneck and so needs to be optimized since I will run it many times.
The operation is on a data set of two parts. First, a large set (n) of 1D vectors of different lengths (with maximum length, Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples,Lmax) with trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have a computed and which depend on its length and the integer-value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in matlab with use of the accumarray function: by using 3 2D arrays of the same size as data, whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax
sz_pos = Lmax
sz_val = maxvalue
ind_len = repmat( 1:sz_len ,1 ,sz_samples);
ind_pos = repmat( 1:sz_pos ,sz_samples,1 );
ind_val = data
ind_Y = repmat((1:sz_Y)',1 ,Lmax );
copiedY=Y(ind_Y);
mask = data>0;
finalarr=accumarray({ind_val(mask),ind_pos(mask),ind_len(mask)},copiedY(mask), [sz_val sz_pos sz_len])/sz_val;
I was hoping to emulate this implementation with np.bincounts. However, np.bincounts differs to accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is 1D cell array of 1x3 arrays used as index tuples, while in np.bincounts it must be 1D scalars as far as I understand. I expect np.ravel may be useful but am not sure how to use it here to do what I want. I am coming to python from matlab and some things do not translate directly, e.g. the colon operator which ravels in opposite order to ravel. So my question is how might I use np.bincount or any other numpy method to achieve an efficient python implementation of this operation.
EDIT: To avoid wasting time: for these multiD index problems with complicated index manipulation, is the recommend route to just use cython to implement the loops explicity?
EDIT2: Alternative Python implementation I just came up with.
Here is a heavy ram solution:
First precalculate:
Using index units for length (i.e., length 1 =0) make a 4D bool array, size (num_samples,Lmax+1,Lmax+1,maxvalue) , holding where the conditions are satisfied for each value in Y.
ALLcond=np.zeros((num_samples,Lmax+1,Lmax+1,maxvalue+1),dtype='bool')
for l in range(Lmax+1):
for i in range(Lmax+1):
for v in range(maxvalue+!):
ALLcond[:,l,i,v]=(data[:,i]==v) & (Lvec==l)`
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[indY,ind_len,ind_pos,ind_val]=np.where(ALLcond)
Yval=np.zeros(np.shape(ALLcond),dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y,ind_len,ind_pos,ind_val]=Y[ind_Y]
Y_avg=sum(Yval)/num_samples
This gives a factor of 4 or so speed up over the direct loop implementation. I was expecting more. Perhaps, this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is Numpy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), or 1300 matrices shaped (800, 800). This is 5GB.
I pass this array into a function, whereby the function
multiplies each "matrix" in the above array by a float in a (1300,) shaped array
sums the array into one "matrix", shaped (800, 800)
and takes the inverse of the matrix
This program runs at 20.2 GB RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays, and passing them through a function. I then save the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
Multi_matrix = array1[:, None, None] * data_total # step 1, multiplies each (800,800) matrix
Sum_matrix = np.sum(Multi_matrix, axis=0) #sum matrix
mTCm = np.array([np.dot(vector.T , (np.linalg.solve(Sum_matrix , vector)) )])
return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
See below. It is actually 12 GB RAM, 20.1 GB virtual size of process.
This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential -- you only need 1 matrix loaded at a time.
Change your code to process each matrix independently
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
It sounds like this could be a type issue, i.e. you converted the values in the matrices to a different type. Perhaps you stored the original matrix with values as int16 or a single, and after multiplying it with a float, it's stored as a matrix of double values (which require 2 times more space in memory).
You can use the dtype argument to set the value type for the matrix.
Other possible reasons could be that some additional matrices are created underway. That's obviously impossible to decode unless you post the code.
A possible solution to your memory problem is to use HDF5 files, and write the matrices to disk. Then you could load the matrix one at a time. This is easy with h5py, as the matrices can be compressed, and/or sliced using numpy/scipy syntax.
I have a 2d numpy array X = (xrows, xcols) and I want to apply dot product on each row combination of the array to obtain another array which is of the shape P = (xrow, xrow).
The code looks like the following:
P = np.zeros((xrow, xrow))
for i in range(xrow):
for j in range(xrow):
P[i, j] = numpy.dot(X[i], X[j])
which works well if the array X is small but takes a lot of time for huge X. Is there any way to make it faster or do it more pythonically so that it is fast?
That is obtained by doing result = X.dot(X.T)
When the array becomes large, it can be done be blocks, but depending on your numpy backend this should already parallelize threadwise as much as possible. It seems that this is what you are looking for.
If for some reason you don't want to rely on that, and finally do resort to multiprocessing, you can try something along the lines of
import numpy as np
X = np.random.randn(1000, 100000)
block_size = 10000
from sklearn.externals.joblib import Parallel, delayed
products = Parallel(n_jobs=10)(delayed(np.dot)(X[:, pos:pos + block_size], X.T[pos:pos + block_size]) for pos in range(0, X.shape[1], block_size))
product = np.sum(products, axis=0)
I don't think this is useful for relatively small arrays. And threading can sometimes take care of this better as well.
This is 10% faster on my machine as it avoids loops:
numpy.matrix(X) * numpy.matrix(X.T)
but still there is 50% redundancy.
As part of a complex task, I need to compute matrix cofactors. I did this in a straightforward way using this nice code for computing matrix minors. Here is my code:
def matrix_cofactor(matrix):
C = np.zeros(matrix.shape)
nrows, ncols = C.shape
for row in xrange(nrows):
for col in xrange(ncols):
minor = matrix[np.array(range(row)+range(row+1,nrows))[:,np.newaxis],
np.array(range(col)+range(col+1,ncols))]
C[row, col] = (-1)**(row+col) * np.linalg.det(minor)
return C
It turns out that this matrix cofactor code is the bottleneck, and I would like to optimize the code snippet above. Any ideas as to how to do this?
If your matrix is invertible, the cofactor is related to the inverse:
def matrix_cofactor(matrix):
return np.linalg.inv(matrix).T * np.linalg.det(matrix)
This gives large speedups (~ 1000x for 50x50 matrices). The main reason is fundamental: this is an O(n^3) algorithm, whereas the minor-det-based one is O(n^5).
This probably means that also for non-invertible matrixes, there is some clever way to calculate the cofactor (i.e., not use the mathematical formula that you use above, but some other equivalent definition).
If you stick with the det-based approach, what you can do is the following:
The majority of the time seems to be spent inside det. (Check out line_profiler to find this out yourself.) You can try to speed that part up by linking Numpy with the Intel MKL, but other than that, there is not much that can be done.
You can speed up the other part of the code like this:
minor = np.zeros([nrows-1, ncols-1])
for row in xrange(nrows):
for col in xrange(ncols):
minor[:row,:col] = matrix[:row,:col]
minor[row:,:col] = matrix[row+1:,:col]
minor[:row,col:] = matrix[:row,col+1:]
minor[row:,col:] = matrix[row+1:,col+1:]
...
This gains some 10-50% total runtime depending on the size of your matrices. The original code has Python range and list manipulations, which are slower than direct slice indexing. You could try also to be more clever and copy only parts of the minor that actually change --- however, already after the above change, close to 100% of the time is spent inside numpy.linalg.det so that furher optimization of the othe parts does not make so much sense.
The calculation of np.array(range(row)+range(row+1,nrows))[:,np.newaxis] does not depended on col so you could could move that outside the inner loop and cache the value. Depending on the number of columns you have this might give a small optimization.
Instead of using the inverse and determinant, I'd suggest using the SVD
def cofactors(A):
U,sigma,Vt = np.linalg.svd(A)
N = len(sigma)
g = np.tile(sigma,N)
g[::(N+1)] = 1
G = np.diag(-(-1)**N*np.product(np.reshape(g,(N,N)),1))
return U # G # Vt
from sympy import *
A = Matrix([[1,2,0],[0,3,0],[0,7,1]])
A.adjugate().T
And the output (which is cofactor matrix) is:
Matrix([
[ 3, 0, 0],
[-2, 1, -7],
[ 0, 0, 3]])