Segmentation Fault on ndarray matrices dot product

Segmentation Fault on ndarray matrices dot product - python

I am performing dot product of a matrix with 50000 rows and 100 columns with it's transpose. The values of the matrix is in float.
A(50000, 100)
B(100, 50000)
Basically I get the matrix after performing SVD on a larger sparse matrix.
The matrix is of numpy.ndarray type.
I use dot method of numpy for multiplying the two matrices. And I get segmentation fault.
numpy.dot(A, B)
The dot product works well on matrix with 30000 rows but fails for 50000 rows.
Is there any limit on numpy's dot product?
Any problem above while using dot product?
Is there any other good python linear algebra tool which is efficient on large matrices.

As you have been told, there is a memory problem. You want to do this:
numpy.dot(A, A.T)
which requires a lot of memory for the result (not the operands). However the operation is easy to perform in pieces. You may use a loop-based approach to produce one output row at a time:
def trans_multi(A):
rows = A.shape[0]
result = numpy.empty((rows, rows), dtype=A.dtype)
for r in range(rows):
result[r,:] = numpy.dot(A, A[r,:].T)
return result
This, as such, is just a slower and equally memory consuming version (numpy.dot is well-optimized). However, what you most probably want to do is to write the result into a file, as you do not have the memory to hold the result:
def trans_multi(A, filename):
with open(filename, "wb") as f:
rows = A.shape[0]
for r in range(rows):
f.write(numpy.dot(A, A[r,:].T).tostring())
Yes, it is not exactly lightning-fast. However, it is most probably the fastest you can hope for. Sequential writes are usually well-optimized. I tried:
a=numpy.random.random((50000,100)).astype('float32')
trans_multi(a,"/tmp/large.dat")
This took approximately 60 seconds, but it really depends on your HDD performance.
Why not memmap?
I like mmap, and numpy.memmap is a great thing. However, numpy.memmap is optimized for having large tables and calculating small results from them. There is, for example memmap.dot which is optimized for taking dot products of memmapped arrays. The scenario is that the operands are memmapped, but the result is in RAM. Exactly the other way round, that is.
Memmapping is very useful when you have random access. Here the access is not random access but sequential write. Also, yf you try to use numpy.memmap to create a (50000,50000) array of float32's, it will take some time (for some reason I do not get, maybe it initializes the data even though there is no need).
However, after the file has been created, it is a very good idea to use numpy.memmap to analyze the huge table, as it gives the best random read performance and a very convenient interface.

Related

Faster Matrix-Vector-Multiplication (MVM) if the matrix elements are computed on-the-fly

I am currently working on a Project where I have to calculate the extremal Eigenvalues using the Lanczos-Algorithm. I replaced the MVM so that the Matrix-Elements are calculated on-the-fly, because I will have to calculate the Eigenvalues of real huge matrices. This slows down my Code because for-loops are slower in python thand MVM. Is there any way to improve on my code in an easy way? I tried using Cython but I had no real luck here.
for i in range(0,dim):
for j in range(0,dim):
temp=get_Matrix_Element(i,j)
ws[i]=ws[i]+temp*v[j]
This replaces:
ws = M.dot(v)
Update
The ,atrix M is sparse and can be stored for "small" systems in a sparse-matrix-format using scipy.sparse. For large systems with up to ~10^9 dimensions I need to calculate the matrix elements on-the-fly

The easiest and fast to implement solution would be to go half-way: precompute one row at a time.
In your original solution (M.dot(v)) you had to store dim x dim which grows quadratically. If you precompute one row, it scales linearly and should not cause you the troubles (since you are already storing a resultant vector ws of the same size).
The code should be along the lines of:
for i in range(0,dim):
temp=get_Matrix_Row(i)
ws[i]=temp.dot(v)
where temp is now an dim x 1 vector.
Such modification should allow more optimizations to happen during the dot product without major code modifications.

Why does it consume so much memory when I multiply two CSR matrices using scipy?

I use a Scipy CSR representation of a 800,000x350,000 Matrix, let's say its M. I want to calculate the dot product M * M[0:x].T. Now depending on the value of x the memory consumption grows. x=1 is not noticeable but if x=2000 the multiplication process takes around 8 gigabyte of RAM.
I wonder what happens when i calculate this product and why it takes so much memory in comparison to storing the sparse Matrix (around 30Mb). Is the matrix expanded for the multiplication?

By investigating the results and memory consumption over time and after each operation I found out that the reason is the result of the sparse matrix multiplication. Indeed there are many zero-values in M. But the result of M*M.T is a matrix containing only 50% zeros. Thus the result consumes a lot memory.
Example: Let's assume each row vector of M has a none-zero field at the same index but other than that is sparse. Then the result of M*M.T would not be sparse at all (no zero-values).
Nonetheless, thanks for helping.

The compile core of csr matrix multiplication is found at
https://github.com/scipy/scipy/blob/0cff7a5fe6226668729fc2551105692ce114c2b3/scipy/sparse/sparsetools/csr.h
starting around line 500, the csr_matmat... function. It includes a reference to a math paper that it is based on.
The Python code called with A*B is __mul__. Look at the version for your csr matrix to make sure that it is calling self._mul_sparse_matrix, which you will see ends up calling self.format + '_matmat_pass1' (and pass2).
Assuming it isn't resorting to the dense versions, you'll have to study the underlying algorithm to understand whether this memory use is realistic or not.

Python NUMPY HUGE Matrices multiplication

I need to multiply two big matrices and sort their columns.
import numpy
a= numpy.random.rand(1000000, 100)
b= numpy.random.rand(300000,100)
c= numpy.dot(b,a.T)
sorted = [argsort(j)[:10] for j in c.T]
This process takes a lot of time and memory. Is there a way to fasten this process? If not how can I calculate RAM needed to do this operation? I currently have an EC2 box with 4GB RAM and no swap.
I was wondering if this operation can be serialized and I dont have to store everything in the memory.

One thing that you can do to speed things up is compile numpy with an optimized BLAS library like e.g. ATLAS, GOTO blas or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are;
After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB
Calculating the size;
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well. The difference is internal overhead.
The size of your original arrays in MiB will be approximately;
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perfrom the dot operation in pieces;
load b as an ndarray
load every row from a as an ndarray in turn.
multiply the row by every column of b and write the result to a file.
del() the row and load the next row.
This wil not make it faster, but it would make it use less memory!
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.

What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option to make your data sparse, a complete game-changer.
Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity only means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable but I would guess it is.
All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.
Having experience with problems like this, this smells like a relatively common problem that is typically solved through some clever trick. When in a field like machine learning where these types of problems are classified as "simple," that's often the case.

YOu have a problem in any case. As Roland Smith shows you in his answer, the amount of data and number of calculations is enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.
Your arrays are both a collection of vectors with length 100. One of the arrays has 300 000 vectors, the other one 1 000 000 vectors. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300 000 000 000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB depending on whether you use 32 or 64-bit floats.
On my computer dot multiplying a (300,100) array with a (100,1000) array takes approximately 1 ms. Extrapolating from that, you are looking at a 1000 s calculation time (depending on the number of cores).
The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.
If you were running it on your own computer, calculating the resulting matrix could be done in the following way:
create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)
This would result in a linear file write with a largish (2.4 TB) file.
This does not require too many lines of code. However, make sure everything is transposed in a suitable way; transposing the input arrays is cheap, transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you can access elements close to each other, expensive, if you access elements far away from each other.
Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.
Now that you are not running the code in our own machine but on a cloud platform, things change a lot. Usually the cloud SSD storage is very fast with random accesses, but IO is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.
However, a lot depends on your sorting needs. Your code looks as if you are finally only looking at a few first items of each column. If this is the case, then you should only calculate the first few items and not the full output matrix. That way you can do everything in memory.
Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.
Oh, one important thing: Are your matrices dense or sparse? (Sparse means they mostly contain 0's.) If your expect your output matrix to be mostly zero, that may change the game completely.

How to assemble large sparse matrices effectively in python/scipy

I am working on an FEM project using Scipy. Now my problem is, that
the assembly of the sparse matrices is too slow. I compute the
contribution of every element in dense small matrices (one for each
element). For the assembly of the global matrices I loop over all
small dense matrices and set the matrice entries the following way:
[i,j] = someList[k][l]
Mglobal[i,j] = Mglobal[i,j] + Mlocal[k,l]
Mglobal is a lil_matrice of appropriate size, someList maps the
indexing variables.
Of course this is rather slow and consumes most of the matrice
assembly time. Is there a better way to assemble a large sparse matrix
from many small dense matrices? I tried scipy.weave but it doesn't
seem to work with sparse matrices

I posted my response to the scipy mailing list; stack overflow is a bit easier
to access so I will post it here as well, albeit a slightly improved version.
The trick is to use the IJV storage format. This is a trio of three arrays
where the first one contains row indicies, the second has column indicies, and
the third has the values of the matrix at that location. This is the best way
to build finite element matricies (or any sparse matrix in my opinion) as access
to this format is really fast (just filling an an array).
In scipy this is called coo_matrix; the class takes the three arrays as an
argument. It is really only useful for converting to another format (CSR os
CSC) for fast linear algebra.
For finite elements, you can estimate the size of the three arrays by something
like
size = number_of_elements * number_of_basis_functions**2
so if you have 2D quadratics you would do number_of_elements * 36, for example.
This approach is convenient because if you have local matricies you definitely
have the global numbers and entry values: exactly what you need for building
the three IJV arrays. Scipy is smart enough to throw out zero entries, so
overestimating is fine.

How to efficiently add sparse matrices in Python

I want to know how to efficiently add sparse matrices in Python.
I have a program that breaks a big task into subtasks and distributes them across several CPUs. Each subtask yields a result (a scipy sparse matrix formatted as: lil_matrix).
The sparse matrix dimensions are: 100000x500000 , which is quite huge, so I really need the most efficient way to sum all the resulting sparse matrices into a single sparse matrix, using some C-compiled method or something.

Have you tried timing the simplest method?
matrix_result = matrix_a + matrix_b
The documentation warns this may be slow for LIL matrices, suggesting the following may be faster:
matrix_result = (matrix_a.tocsr() + matrix_b.tocsr()).tolil()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.