sympy compiling functions with large matrices - python

I have been using sympy to work with systems of differential equations. I write the equations symbolically, use autowrap to compile them through Cython, and then pass the resulting function to the scipy ODE solver. One of the major benefits of doing this is that I can compute the jacobian symbolically using the sympy jacobian function, compile it, and pass it to the ODE solver as well.
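For concreteness, the workflow looks roughly like this (a toy two-variable system standing in for the real model; autowrap's cython backend needs Cython and a C compiler available):

```python
import sympy as sp
from sympy.utilities.autowrap import autowrap
from scipy.integrate import solve_ivp

# toy 2-variable system standing in for the real 150-variable model
x, y = sp.symbols('x y')
rhs = sp.Matrix([x*y - x, x**2 - y])      # right-hand side of dX/dt
jac = rhs.jacobian([x, y])                # 2x2 symbolic jacobian

# compile both to fast callables via Cython
rhs_c = autowrap(rhs, args=[x, y], backend='cython')
jac_c = autowrap(jac, args=[x, y], backend='cython')

# scipy expects f(t, X) and jac(t, X)
f = lambda t, X: rhs_c(*X).ravel()
j = lambda t, X: jac_c(*X)

sol = solve_ivp(f, (0.0, 10.0), [1.0, 0.5], jac=j, method='BDF')
```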
This has been working great for systems of about 30 variables. Recently I tried it with 150 variables, and I ran out of memory while compiling the jacobian function. This is on Windows with Anaconda and the Microsoft Visual C++ 14 tools for Python. During compilation of the jacobian, which is now a 22000-element vector, memory usage in the linking step climbed to about 7 GB (on my 8 GB laptop) before the build finally crashed.
Does anyone have suggestions before I go and try this on a machine with more memory? Are other operating systems or other C compilers likely to improve the situation?
I know lots of people do this type of work, so if there's an answer, it will be beneficial to a good chunk of the community.
Edit: response to some of Jonathan's comments:
Yes, I'm fully aware that this is an N^2 problem. The jacobian is a matrix of all partial derivatives, so it will have size N^2. There is no real way around this scaling. However, a 22000-element array is not nearly at the level that would create a memory problem during runtime -- I only have the problem during compilation.
Basically there are three levels that we can address this at.
1) solve the ODE problem without the jacobian, or somehow split up the jacobian so there is no single 150x150 matrix. That would address the root of the problem, but it certainly limits what I can do, and I'm not yet convinced that compiling the jacobian function is impossible
2) change something about the way sympy automatically generates the C code (e.g. splitting it into multiple chunks or using more functions for intermediate expressions) to somehow make the .c file smaller. People with more sympy experience might have some ideas on this.
3) change something about the way the C is compiled, so that less memory is needed.
I thought that by posting a separate question more oriented around #3 (literal referencing of large array -- compiler out of memory), I would get a different audience answering. That is in fact exactly what happened. Perhaps the answer to #3 is "you can't", but that's also useful information.

Following a lot of the examples posted at http://www.sympy.org/scipy-2017-codegen-tutorial/ I was able to get this to compile.
The key things were
1) instead of using autowrap, write the C code directly, with more control over it. Among other things, this allows passing the argument list as a vector instead of expanding it into individual scalars. It took some effort to get working (setting up the compiler flags through distutils, etc.), but in the end it worked well. Having the repo from the course linked above as an example helped a lot.
2) using common subexpression elimination (sympy.cse) to dramatically reduce the size of the expressions for the jacobian elements (a rough sketch follows below).
(1) by itself didn't do that much to help in this case (although I was able to use it to vastly improve the performance of smaller models). The generated code was still 200 MB instead of the original 300 MB. But combining it with (2), the cse, I was able to get it down to a meager 1.7 MB (despite some 14000 temporary variables).
The cse takes about 20-30 minutes on my laptop. After that, it compiles quickly.
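For reference, the cse step from point (2) looks roughly like this; the tiny 3-variable system is just a stand-in for the real 150-variable one, and in the real code the generated lines are written into the .c file rather than printed:

```python
import sympy as sp

# tiny system standing in for the real 150-variable one
x, y, z = sp.symbols('x y z')
rhs = sp.Matrix([x*y*z - x, sp.sin(x*y) + z, x**2 - y*z])
jac = rhs.jacobian([x, y, z])

# cse returns (temporaries, reduced): temporaries is a list of
# (symbol, expression) pairs, and reduced holds the jacobian entries
# rewritten in terms of those temporaries
temps, reduced = sp.cse(list(jac))

lines = [f'double {sym} = {sp.ccode(expr)};' for sym, expr in temps]
lines += [f'out[{i}] = {sp.ccode(expr)};' for i, expr in enumerate(reduced)]
print('\n'.join(lines))
```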

Related

Understanding an order-of-magnitude performance gap between Python and C++ for a CPU-heavy application

**Summary:** I observe a ~1000x performance gap between a Python code and a C++ code doing the same job, despite the use of parallelization, vectorization, just-in-time compilation and machine-code conversion with Numba, in the context of a scientific calculation. The CPUs are not fully used, and I don't understand why.
Hello everybody,
I just started in a laboratory doing simulation of various materials, including simulation of the growth of biological-like tissues. To do that we create a 3D version of said tissue (a collection of vertices stored in a numpy array) and we apply different functions to it to mimic the physics/biology.
We have a C++ code doing just that, which takes approximately 10 seconds to run. Someone converted that code to Python, but this version takes about 2.5 hours to run. We tried every trick in the book we knew to accelerate the code: we used numba to accelerate numpy where appropriate, parallelized the code as much as we could, and vectorized what could be vectorized, but the gap remains. In fact, an earlier version of the code took days to run.
When the code executes, multiple cores are properly used, as monitored with the built-in system monitor. However, they are not fully loaded, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. It makes me think of a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a cost to calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller ones, and since then I see a small performance degradation compared to the earlier version.
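(To give an idea, the numba-accelerated kernels look roughly like the sketch below; the function and array shapes are illustrative, not the actual lab code.)

```python
import numpy as np
import numba

@numba.njit(parallel=True, fastmath=True)
def displace(vertices, forces, dt):
    # numba.prange spreads the outer loop over the vertices across cores
    for i in numba.prange(vertices.shape[0]):
        for j in range(3):
            vertices[i, j] += dt * forces[i, j]

vertices = np.random.rand(100000, 3)
forces = np.random.rand(100000, 3)
displace(vertices, forces, 1e-3)   # first call includes JIT compilation time
```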
I must say I am really wondering where my bottleneck is and how it could be tested/improved. Any ideas would be very welcome.
I am aware my question is kind of a complicated one, so let me know if you need additional information; I would be happy to provide it.

Expokit implementation in Python

I am looking for a Python implementation of Expokit, a software package that provides matrix exponential routines for small dense or very large sparse matrices, real or complex, i.e. it computes
w(t) = exp(t*A)*v
The package has been implemented in Fortran and MATLAB and can be found here: https://www.maths.uq.edu.au/expokit/
I have found a Python wrapper, expokitpy (https://github.com/weinbe58/expokitpy), and a Krylov subspace methods package, KryPy (https://github.com/andrenarchy/krypy). Both seem to be relevant, however neither of them comes with good enough documentation (for me) to do time evolution.
Does somebody have a working solution with the packages mentioned above or similar?
In case this is still useful to someone: it looks like there was an effort to incorporate Expokit within scipy which has now stalled and is looking for somebody to finish it. There are, though, some instructions here to compile the Fortran and then run it via Python, with good results.
It also seems to have been adopted by slepc4py, which is in turn used by quimb, which looks useful if you need it for quantum information (or just use its expm and expm_multiply methods).
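For the plain w(t) = exp(t*A)*v case, scipy.sparse.linalg.expm_multiply may already be enough on its own (it implements the Al-Mohy/Higham scheme rather than Expokit's Krylov methods, but it never forms exp(t*A) explicitly). A minimal sketch with a random sparse matrix standing in for A:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

# random sparse matrix standing in for the real operator A
A = sp.random(2000, 2000, density=1e-3, format='csr')
v = np.random.rand(2000)

# w(t) = exp(t*A) @ v on a grid of t values, without ever building exp(t*A)
w = expm_multiply(A, v, start=0.0, stop=1.0, num=11, endpoint=True)
print(w.shape)   # (11, 2000): one row per requested time point
```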

Python FFTN slow in comparison to MATLAB

Dear stackoverflow community!
In a previous stackoverflow question, I mentioned that Python's np.fft.fftn() routine seems rather slow compared to MATLAB when the data cubes are big (grids of dimension 512x512x1921, datatype float; see Comparatively slow python numpy 3D Fourier Transformation). I think MATLAB uses the FFTW algorithm and could therefore be faster (~5 s compared to ~185 s, measured with time.time()), so it was suggested that I try pyFFTW to reduce the runtime.
The problem now is that at my workplace Python packages are deployed via Anaconda on a large number of computers, and the pyFFTW package cannot easily be integrated into that setup. There is also a problem that long data types are not recognized, so compilation does not work at all, and pyFFTW conflicts with the internal FFTW installation. Even if it were somehow installed, it would be overridden by the next system update.
I'm not sure, however, whether the different algorithm alone explains the difference in computation time. As already written in the previous question, I really need these FFTs for my work.
Another issue concerns the striding of the output array of np.fft.fftn(), which is automatically switched to Fortran order (the opposite of the numpy default). This causes low performance when operating on the output in combination with C-strided grids (see Python numpy.fft changes strides).
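(A quick way to see the stride switch on a scaled-down cube:)

```python
import numpy as np

a = np.zeros((128, 128, 64), dtype=np.float32)   # C-contiguous input cube
out = np.fft.fftn(a)

print(a.strides, a.flags['C_CONTIGUOUS'])
print(out.strides, out.flags['C_CONTIGUOUS'], out.flags['F_CONTIGUOUS'])

# forcing C order again costs a full copy of the complex cube
out_c = np.ascontiguousarray(out)
```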
So as a follow-up to my original questions, I want to ask you:
(MAIN) What other reasons might there be for Python to be so much slower? What can be done about it? I'd like to stay with Python if possible and not switch to MATLAB just because of such things...
(SIDE) Is there any keyword to preserve the striding? Using scipy is not a good option, and copying the array into a new one to get the strides right also seems like an unnecessarily complicated step requiring additional computation time.
Thanks for the help!

Efficient Matrix-Vector Multiplication: Multithreading directly in Python vs. using ctypes to bind a multithreaded C function

I have a simple problem: multiply a matrix by a vector. However, the implementation of the multiplication is complicated because the matrix is 18 GB (3000^2 by 500).
Some info:
The matrix is stored in HDF5 format. It's MATLAB output. It's dense, so no sparsity savings there.
I have to do this matrix multiplication roughly 2000 times over the course of my algorithm (MCMC Bayesian Inversion)
My program is a combination of Python and C, where the Python code handles most of the MCMC procedure: keeping track of the random walk, generating perturbations, checking the MH criteria, saving accepted proposals, monitoring the burn-in, etc. The C code is simply compiled into a separate executable and called when I need to solve the forward (acoustic wave) problem. All communication between the Python and C is done via the file system. All this is to say I don't already have ctypes stuff going on.
The C program is already parallelized using MPI, but I don't think that's an appropriate solution for this MV multiplication problem.
Our program runs mainly on Linux, but occasionally on OS X and Windows. Cross-platform capability without too much headache is a must.
Right now I have a single-threaded implementation where the Python code reads in the matrix a few thousand rows at a time and performs the multiplication. However, this is a significant bottleneck for my program since it takes so darn long. I'd like to multithread it to speed it up a bit.
I'm trying to get an idea of whether it would be faster (computation-time-wise, not implementation time) for python to handle the multithreading and to continue to use numpy operations to do the multiplication, or to code an MV multiplication function with multithreading in C and bind it with ctypes.
I will likely do both and time them since shaving time off of an extremely long running program is important. I was wondering if anyone had encountered this situation before, though, and had any insight (or perhaps other suggestions?)
As a side question, I can only find algorithmic improvements for nxn matrices for m-v multiplication. Does anyone know of one that can be used on an mxn matrix?
Hardware
As Sven Marnach wrote in the comments, your problem is most likely I/O bound since disk access is orders of magnitude slower than RAM access.
So the fastest way is probably to have a machine with enough memory to keep the whole matrix and the result in RAM. It would save lots of time if you only had to read the matrix from disk once.
Replacing the hard disk with an SSD would also help, because an SSD can read and write a lot faster.
Software
Barring that, for speeding up reads from disk, you could use the mmap module. This should help, especially once the OS figures out you're reading pieces of the same file over and over and starts to keep it in the cache.
Since the calculation can be done row by row, you might benefit from using numpy in combination with a multiprocessing.Pool for that calculation. But that only really helps if a single process cannot use all the available disk read bandwidth.
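A minimal sketch of the row-block idea, assuming the matrix can be read with h5py (file and dataset names are placeholders; a MATLAB -v7.3 file may store the transpose, so check the orientation first). Each block could also be handed to a multiprocessing.Pool worker if a single process cannot saturate the disk:

```python
import h5py
import numpy as np

def chunked_matvec(h5_path, dataset, v, block_rows=20000):
    """Compute A @ v by streaming row blocks of A from an HDF5 file."""
    with h5py.File(h5_path, 'r') as f:
        A = f[dataset]                                # not loaded into RAM yet
        out = np.empty(A.shape[0], dtype=np.float64)
        for start in range(0, A.shape[0], block_rows):
            stop = min(start + block_rows, A.shape[0])
            out[start:stop] = A[start:stop, :] @ v    # only this block is read
        return out

# w = chunked_matvec('forward_model.h5', 'G', v)   # placeholder names
```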

When Does It Make Sense To Rewrite A Python Module in C?

In a game that I am writing, I use a 2D vector class which I have written to handle the speeds of the objects. This is called a large number of times every frame as there are a lot of objects on the screen, so any increase I can make in its speed will be useful.
It is pretty simple, consisting mostly of wrappers to the related math functions. It would be quite trivial to rewrite in C, but I am not sure whether doing so will make any significant difference as all it really does is call the underlying math functions, add, multiply or divide.
So, my question is under what circumstances does it make sense to rewrite in C? Where will you see a significant speed boost, and where can you see a reasonable speed boost without rewriting an extensive amount of the program?
If you're vector-munging, give numpy a try first. Chances are you will get speeds not far from C if you utilize numpy's vector manipulation functions wisely.
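To make that concrete, here is a sketch of what the numpy version of a 2D vector class tends to look like: all positions and velocities live in (N, 2) arrays and every object is updated in one shot (the shapes and dt are made up):

```python
import numpy as np

n_objects = 10000
pos = np.random.rand(n_objects, 2)     # one row per object
vel = np.random.rand(n_objects, 2)
dt = 1.0 / 60.0

pos += vel * dt                        # replaces a per-object add/multiply loop

speeds = np.linalg.norm(vel, axis=1)   # all magnitudes at once
headings = vel / speeds[:, None]       # all unit vectors at once
```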
Other than that, your question is very heuristic. If your code is too slow:
Profile it - chances are you'll be able to improve it in Python
Use the correct optimized C-based libraries (numpy in your case)
Try psyco
Try rewriting parts with cython
If all else fails, rewrite in C
First measure, then optimize
You should never optimize anything, be it in C or any other language, without timing your code before and after your optimization:
your clever optimization could in fact induce a slowdown
optimizing something that takes 1% of the total execution time will never give you more than a 1% performance improvement
The common approach is:
profile your code
identify a hotspot
time this hotspot
optimize it
time the hotspot again and see if it's faster. If it's not, go back to step 3.
If you can't find hotspots it could mean that your app is already optimized, or that you are not using the right algorithm for your problem. In either case, profiling helps you understand what your code does.
For profiling python code under Linux, you can use pyprof2calltree which works in conjunction with kcachegrind, and is totally awesome.
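A minimal profiling round-trip with cProfile plus pyprof2calltree (run_frames is just a stand-in for the game's frame loop):

```python
import cProfile
import pstats

def run_frames(n):
    # stand-in for the game's per-frame update loop
    total = 0.0
    for _ in range(n):
        total += sum(i * i for i in range(10000))
    return total

cProfile.run('run_frames(300)', 'game.prof')

stats = pstats.Stats('game.prof')
stats.sort_stats('cumulative').print_stats(15)

# to browse the same data in kcachegrind:
#   pyprof2calltree -i game.prof -k
```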
Common wisdom is "profile", "measure", etc. Well - maybe. Just get in the debugger and take 10 stackshots. If more than one of them terminates in your wrapper code, then it is costing more than 10% roughly, so you should consider re-doing it in C, to save that time. Chances are you will find other things also that are costing more than that.
A nice profiler I use on Linux is pycallgraph; however, as your program gets bigger it starts to create much larger images which are harder to trace. I'm pretty sure you can exclude modules, though.
