I am writing code in Python that does a lot of matrix multiplications on huge matrices. For performance I am using numpy.einsum, but even though einsum is well optimized, my code still runs very slowly. That is why I am looking for a way to speed it up. I thought PyCUDA might help, but after doing some research on the Internet I figured out that I was mistaken.
Do you have any tips on how to improve the performance of math operations in Python?
For example, I have a bunch of multiplications of the following kind:
NewMatrix_krls = np.einsum('kK,krls->Krls', C, Matrx_krls)
where C is a 1000 by 1000 matrix and Matrx_krls is 1000 by 1000 by 1000 by 1000.
einsum constructs an iterator (nditer) that loops (in C) over the problem space. With the dimensions and indices in your expression, that space is 1000**5 (5 index variables, each of size 1000). That's 10**15 operations.
If the problem involved multiplying 3 or 4 arrays, you might speed things up by splitting it into a couple of einsum calls (or a mix of einsum and dot). But there's not enough information in your description to suggest that kind of partitioning.
Actually you might be able to cast this in a form that uses np.dot. Matrx_krls can be reshaped to a 2d array, combining the rls dimensions into a single one of size 1000**3.
E.g. (not tested)
np.dot(C.T, Matrx_krls.reshape(1000,-1))
For a product of 2 2d arrays, np.dot is faster than einsum.
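As a quick sanity check of that equivalence (not tested at full size; sizes reduced from 1000 to 10 so it runs instantly):

import numpy as np

n = 10  # stand-in for 1000 so the check runs instantly
C = np.random.rand(n, n)
Matrx_krls = np.random.rand(n, n, n, n)

via_einsum = np.einsum('kK,krls->Krls', C, Matrx_krls)
# collapse the r, l, s axes into one, do a plain 2d dot, then restore the shape
via_dot = np.dot(C.T, Matrx_krls.reshape(n, -1)).reshape(n, n, n, n)

print(np.allclose(via_einsum, via_dot))  # True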
(Note I changed your tags to include numpy - that will bring it to the attention of more knowledgeable posters).
'Matrx_krls is 1000 by 1000 by 1000 by 1000'
So that's a 4-terabyte array, assuming 4-byte floats... I think you are better off investing time in Hadoop than in GPGPU. Even if you stream that data to a GPU, you will see little benefit because of limited PCIe bandwidth.
That, or explaining why you think you need this kind of operation in the first place, might help at a higher level.
Related
My problem is to perform 3 matrix multiplications on a 3D numpy array A that is too large to fit on a single processor. In tensorial form I want A_ijk B_km C_jn D_ip (B, C, and D can all fit in memory). I want to know if dask is appropriate for this task (or if another tool might be better suited).
I believe the best approach is to split this operation into its individual multiplications, and make sure that each of them is local. This link has a really useful diagram that summarises what I'm talking about: http://www.2decomp.org/1d_mode.html.
In more detail: First, to do A_ijk B_km, I should distribute A over the first two axes, and perform the matrix multiplication over each pencil locally (the first step in the diagram).
Then, I need to transpose the array, making the j axis local to each processor (and splitting over the k (now m) axis), to then perform the next multiplication. (So going from the first to the second step in the diagram). This is where I wonder if dask could help.
I'm aware that this can be done in principle using mpi4py, but the steps are pretty non-trivial, whereas dask arrays have helpful rechunk and transpose methods, which feel relevant to this application.
Does this seem like something well-suited to dask?
If not, is anyone aware of any python libraries that can perform these steps? I know that fftw has routines for doing just this, but I don't know how to write the C-code necessary, or how to get it to interface with python and numpy.
Thanks for any help.
For anyone else in the future: mpi4py does have a transpose-like operation, but it's called Alltoall/Alltoallv. It's not explained in the mpi4py documentation or tutorial. I found out about it from another tutorial: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:mpi4py.
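For illustration, a minimal Alltoall call looks roughly like this (a toy sketch, not the full pencil transpose; run under mpirun, e.g. mpirun -n 4 python alltoall_demo.py):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

# Each rank sends one integer to every other rank and receives one from each;
# a distributed transpose is the same pattern applied to blocks of the array.
send = np.arange(size, dtype='i') + 100 * rank
recv = np.empty(size, dtype='i')
comm.Alltoall(send, recv)

print(rank, recv)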
Dask implements einsum, which may be what you are after, and there is, of course, matmul if you want to write out the operation. As long as your large matrix A is a Dask array with reasonable chunk sizes, Dask will parcel out the work without running out of memory.
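A minimal sketch of what that could look like (shapes and chunk sizes are made up for illustration; only A is assumed to be too big for memory):

import numpy as np
import dask.array as da

# A is chunked along its first axis so each block fits comfortably in memory
A = da.random.random((2000, 500, 500), chunks=(100, 500, 500))
B = np.random.random((500, 300))
C = np.random.random((500, 300))
D = np.random.random((2000, 300))

# A_ijk B_km C_jn D_ip, contracted over i, j and k
result = da.einsum('ijk,km,jn,ip->mnp', A, B, C, D)
out = result.compute()  # or stream to disk (e.g. da.to_zarr) if the result is large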
Are there good methods in Python to vectorize operations on matrix-like data structures/containers? And which data structures are used for this?
(I have observed and read that element-wise operations in pandas and numpy using vectorize or applymap (and likewise apply/apply along axis for rows/columns) are not much faster than for loops.
Given that, when trying to use them, you sometimes have to fiddle with the specifics of the datatypes, whereas that is usually a bit easier in for loops, what are the benefits? Readability?)
Are there ways to achieve a performance gap similar to the one between for loops and vectorized operations in Matlab?
(Note that this is not to bash numpy or pandas, which are great; whole-matrix operations are fine. It is just that when you have to do element-wise operations, things become slow.)
EDIT to explain the context:
I was only wondering because I have received more than one answer mentioning that apply and the like are effectively for loops. That is why I was asking whether there are similar functions implemented in a way that performs better. The actual problems were varied; they just had to be element-wise, not "the sum, product, or whatever of the whole matrix". I did a lot of comparisons with differential outputs, sometimes based on other matrices, so I had to use complex functions for this. But since the matrices are huge and the implementation depended on for-loop-like mechanisms, in the end I felt that my program would not cope with a larger dataset. Hence my question. I was not looking for review, only for knowledge.
You need to provide a specific example.
Normal per-element MATLAB or Python functions cannot be vectorized in general. The whole point of vectorizing, in both MATLAB and Python, is to off-load the operation onto the much faster underlying C or Fortran libraries that are designed to work on arrays of uniform data. This cannot be done on functions that operate on scalars, either in MATLAB or Python.
For functions that operate on arrays or matrices as a whole (such as mathematical operators, sum, square, etc), MATLAB and Python behave the same. In fact they use most of the same underlying C and Fortran libraries to do their calculations.
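To make the distinction concrete, here is a small comparison (np.vectorize only hides the Python-level loop, whereas the whole-array expression runs in compiled code):

import numpy as np

x = np.random.rand(1_000_000)

# per-element Python function wrapped with np.vectorize: still a Python-level loop
per_element = np.vectorize(lambda v: 3.0 * v ** 2 + 1.0)(x)

# the same computation written as a whole-array expression: the loop runs in C
whole_array = 3.0 * x ** 2 + 1.0

print(np.allclose(per_element, whole_array))  # True, but the second is far faster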
So you need to show the actual operation you want to do, and then we can see if there is a way to vectorize it.
If it is working code and you just want to improve its performance, then Code Review stack exchange site is probably a better choice.
I'm working on a project that heavily relies on reasonably small 2D Numpy arrays to represent our data (where small is around 100 x 4, at worst). The problem with this representation is that it is extremely cumbersome and unclear given the context of the problem, terrible to maintain, and very bug-prone. For example, my current task, in the context of these arrays, requires adding and removing multiple rows, and allocating a large number of them. I decided that an object-oriented, graph-based approach would be far clearer to use on my end and more generic for future development. After sitting down and designing most of the functionality on paper, there will probably be 5 or 6 larger classes with internal dictionaries (representing adjacency lists for graphs) that remove the need for Numpy arrays and are much clearer to interpret.
It is well known that Numpy arrays blow Python data structures out of the water in terms of indexing speed and most other functionality. However, I also know that Python dictionaries are highly optimized and fairly efficient in their own right. My dilemma is whether reallocating many Numpy arrays and copying values back and forth is more or less efficient than a few persistent classes that can be edited easily without reallocation but will have slower indexing than a Numpy array.
It's worth noting that I'm not using much of Numpy's functionality other than np.where(), which can be worked around if necessary. Also, the algorithm scales not with the size of the arrays but with the number of arrays; a harder problem will have more arrays, but not necessarily larger ones.
Thanks!
I need to multiply two big matrices and sort their columns.
import numpy

a = numpy.random.rand(1000000, 100)
b = numpy.random.rand(300000, 100)
c = numpy.dot(b, a.T)
# indices of the 10 smallest entries in each column of c
top10 = [numpy.argsort(j)[:10] for j in c.T]
This process takes a lot of time and memory. Is there a way to speed it up? If not, how can I calculate the RAM needed for this operation? I currently have an EC2 box with 4 GB of RAM and no swap.
I was also wondering whether this operation can be serialized, so that I don't have to store everything in memory.
One thing you can do to speed things up is compile numpy against an optimized BLAS library, e.g. ATLAS, GotoBLAS, or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are:
After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB
Calculating the size:
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well. The difference is internal overhead.
The size of your original arrays in MiB will be approximately:
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perform the dot operation in pieces:
load b as an ndarray
load every row from a as an ndarray in turn.
multiply the row by every column of b and write the result to a file.
del() the row and load the next row.
This will not make it faster, but it will make it use less memory!
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.
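A rough sketch of that piecewise approach (block size and file names are arbitrary; in practice the arrays would be loaded from disk rather than generated):

import numpy as np

a = np.random.rand(1000000, 100)   # in practice: load from disk
b = np.random.rand(300000, 100)

block = 1000  # rows of a processed per iteration; tune to the RAM available
with open('c_transposed.dat', 'wb') as f:
    for start in range(0, a.shape[0], block):
        # each block of rows of a gives a block of columns of c = b.dot(a.T),
        # written here row-wise as a chunk of c.T
        piece = a[start:start + block] @ b.T        # shape (block, 300000)
        piece.tofile(f)

# later, the result can be read chunk by chunk without loading it all:
# c_T = np.memmap('c_transposed.dat', dtype=np.float64, mode='r', shape=(1000000, 300000))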
What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option: making your data sparse, which would be a complete game-changer.
Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity just means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable, but I would guess it is applicable here.
All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.
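If the inputs really were mostly zeros (or could be transformed into a representation that is), the computation itself is straightforward with scipy.sparse; the density here is a made-up illustration:

import scipy.sparse as sp

# hypothetical 0.1% density; the cost now scales with the non-zero entries
a = sp.random(1000000, 100, density=0.001, format='csr')
b = sp.random(300000, 100, density=0.001, format='csr')

c = b @ a.T   # the product stays sparse, nowhere near the 2.4 TB dense size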
From experience with problems like this, it smells like a relatively common problem that is typically solved through some clever trick. In a field like machine learning, where these types of problems are classified as "simple", that is often the case.
You have a problem in any case. As Roland Smith shows in his answer, the amount of data and the number of calculations are enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.
Your arrays are both collections of vectors of length 100. One of the arrays has 300,000 vectors, the other 1,000,000. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300,000,000,000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB, depending on whether you use 32-bit or 64-bit floats.
On my computer, dot-multiplying a (300, 100) array with a (100, 1000) array takes approximately 1 ms. The full problem is a million times larger, so extrapolating from that, you are looking at roughly 1000 s of calculation time (depending on the number of cores).
The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.
If you were running it on your own computer, calculating the resulting matrix could be done in the following way:
create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)
This would result in a linear file write with a largish (2.4 TB) file.
This does not require too many lines of code. However, make sure everything is transposed in a suitable way: transposing the input arrays is cheap, but transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you access elements close to each other and expensive if you access elements far apart.
Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.
Now that you are not running the code on your own machine but on a cloud platform, things change a lot. Usually cloud SSD storage is very fast for random access, but I/O is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.
However, a lot depends on your sorting needs. Your code looks as if you are ultimately only looking at the first few items of each column. If that is the case, then you should only calculate those first few items and not the full output matrix. That way you can do everything in memory.
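A hedged sketch of that idea, keeping only the indices of the 10 smallest values per column of c = b.dot(a.T) and never materializing c (the block size is arbitrary, and the inputs would in practice come from disk):

import numpy as np

a = np.random.rand(1000000, 100)   # in practice: load from disk
b = np.random.rand(300000, 100)

k, block = 10, 1000
top_indices = np.empty((a.shape[0], k), dtype=np.int64)

for start in range(0, a.shape[0], block):
    piece = b @ a[start:start + block].T                 # (300000, block) slice of c
    # argpartition finds the k smallest per column without a full sort
    idx = np.argpartition(piece, k, axis=0)[:k]          # (k, block), unordered
    vals = np.take_along_axis(piece, idx, axis=0)
    order = vals.argsort(axis=0)                         # order those k values
    top_indices[start:start + block] = np.take_along_axis(idx, order, axis=0).T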
Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.
Oh, one important thing: are your matrices dense or sparse? (Sparse means they are mostly zeros.) If you expect your output matrix to be mostly zero, that may change the game completely.
Hi there, I am writing data acquisition and analysis software for a physical measurement setup in Python. In the process I gather massive amounts of data points (easily on the order of 1,000,000 or more) which I subsequently analyze. So far I am using arrays of floats, which in principle do the job.
However, I am getting strange effects in the acquired data as I use more and more data points per measurement, which makes me wonder whether the handling of the arrays is so inefficient that writing into them causes a significant time delay in the data acquisition loop.
Is that a possibility? Do you have any suggestions on how to improve the handling time of the writes (it is a matter of microseconds), or is that not a possible influence and I need to look somewhere else?
Thanks in advance!
Do you mean lists? You can use NumPy to handle numerical arrays efficiently.
From the NumPy website:
First of all, they are great for performing calculation relying heavily on mathematical and numerical operations. They can work natively with matrices and arrays, perform operations on them, find eigenvectors, compute integrals, solve differential equations.
NumPy's array class (which is used to implement the matrix class) is implemented with speed in mind, so accessing NumPy arrays is faster than accessing Python lists. Further, NumPy implements an array language, so that most loops are not needed.
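If the issue is repeatedly growing a list or array inside the acquisition loop, one common pattern is to preallocate a NumPy array and fill it by index. A minimal sketch (read_sample() is a hypothetical stand-in for your real acquisition call):

import numpy as np

def read_sample():
    # hypothetical placeholder for the actual data acquisition call
    return 0.0

n_points = 1_000_000
data = np.empty(n_points, dtype=np.float64)   # allocated once, up front

for i in range(n_points):
    data[i] = read_sample()   # each write is a constant-time store, no reallocation

# analysis then runs on the whole array at once
print(data.mean(), data.std())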