Dear stackoverflow community,
I am currently facing the problem of having to solve a large number of linear systems sequentially, i.e.
solve "Ax = b" with known A (matrix, 45000 x 45000, ~6 nonzeros per row) and b (vector, 45000 rows) for x (vector, 45000 rows).
A is complex symmetric non-Hermitian.
Since there are data dependencies on A between iterations of my algorithm, the linear systems have to be solved one after another, each in the least time possible. The vector b is the same in every iteration.
The main code is written in Python 3.7. Using scipy.sparse.linalg.qmr I end up at 1.5s per solve.
MAGMA's iterative BiCGStab is able to solve the system within ~0.4s, including the overhead of copying the data to the GPU's (RTX 2080) memory. To access the C++ library I use pybind11.
My question now is: Do you have ideas on how to speed up the calculation? I have the feeling that a direct matrix solver rather than an iterative one might be faster. Do you have recommendations for libraries implementing direct solvers which might use the GPU? Is that even possible?
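For reference, here is a minimal sketch of what I'm currently doing with SciPy, plus the kind of direct solve I have in mind (the matrix below is a reduced-size random stand-in, not my actual system):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Random stand-in for my 45000 x 45000 complex symmetric system (size reduced)
n = 1000
A = sp.random(n, n, density=6 / n, format='csr')
A = (A + A.T).astype(np.complex128)          # symmetric, complex
A = A + 10 * sp.identity(n, format='csr')    # shift to keep it well conditioned
b = np.ones(n, dtype=np.complex128)

# Current iterative approach (~1.5s per solve on the real system)
x_iter, info = spla.qmr(A, b)

# Direct approach I'm asking about: factorize, then back-substitute
lu = spla.splu(A.tocsc())                    # SuperLU; fill-in may be large
x_direct = lu.solve(b)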
Thank you very much for your help.
P.S.: I've mentioned possible solutions to my problem but still have several points of confusion about them, so please advise on those as well. Also, if this question is not a good fit for this site, please point me to the correct site and I'll move the question there. Thanks in advance.
I need to repeatedly run some graph theory and complex network algorithms to analyze approx. 2000 undirected simple graphs with no self-loops for some research work. Each graph has approx. 40,000 nodes and approx. 600,000 edges (essentially making them sparse graphs).
Currently, I am using NetworkX for my analysis, running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G) for 500 such graphs; the code has been running for 3 days and has reached only the halfway point. This makes me fearful that my full analysis will take a huge and unpredictable amount of time.
Before elaborating on my problem and the probable solutions I've thought of, let me mention my computer's configuration as it may help you in suggesting the best approach. I am running Windows 10 on an Intel i7-9700K processor with 32GB RAM and one Zotac GeForce GTX 1050 Ti OC Edition ZT-P10510B-10L 4GB PCI Express Graphics Card.
Explaining my possible solutions and my confusions regarding them:
A) Using GPU with Adjacency Matrix as Graph Data Structure: I can put an adjacency matrix on the GPU and implement my analysis by hand with PyCuda or Numba, using loops only, since the GPU cannot handle recursion. The nearest thing I was able to find is this question on Stack Overflow, but it has no good solution.
My Expectations: I hope to speed up algorithms such as All-Pairs Shortest Paths, All Possible Paths between two nodes, Average Clustering, Average Shortest Path Length, Small-World Properties, etc. If it gives a significant speedup per graph, my results can be obtained very quickly.
My Confusions:
Can these graph algorithms be coded efficiently for the GPU?
Which will be better to use? PyCuda or Numba?
Is there a more efficient way to store graphs on the GPU, given that my graphs are sparse?
I am an average Python programmer with no experience of GPU programming, so I will have to learn GPU programming with PyCuda/Numba. Which one is easier to learn?
B) Parallelizing Programs on the CPU Itself: I can use Joblib or any other library to run the program in parallel on my CPU. I can also arrange 2-3 more computers, on which I can run small independent portions of the program or run 500 graphs per computer (a minimal sketch of this is shown after my confusions below).
My Expectations: I hope to speed up the algorithms by parallelizing them and dividing tasks among computers. If the GPU solution does not work, I may still have some hope with this method.
My Confusions:
Which other libraries are available as good alternatives for Joblib?
Should I allot all CPU cores (8 cores in i7) for my programs or use fewer cores?
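A minimal sketch of what I have in mind for option B, assuming one edge-list file per graph (the path pattern and the n_jobs value are placeholders):

import glob
import networkx as nx
from joblib import Parallel, delayed

def analyze(path):
    # read one graph and compute the two metrics I need
    G = nx.read_edgelist(path)
    return path, nx.average_clustering(G), nx.average_shortest_path_length(G)

# n_jobs=6 leaves some cores free; -1 would use all 8
results = Parallel(n_jobs=6)(
    delayed(analyze)(p) for p in glob.glob("graphs/*.edgelist")
)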
C) Apart from my probable solutions, do you have any other suggestions for me? If a better and faster solution is available in any language other than C/C++, please suggest it as well, as I am already considering C++ as a fallback plan if nothing works.
Work In Progress Updates
Based on suggestions in the comments on this question and discussions in my community, these are the directions I've been advised to explore.
GraphBLAS
Boost.Graph + extensions with Python wrappers
graph-tool
Spark/Dask
PyCuda/Numba
Linear algebra methods using PyTorch
I tried to run 100 graphs on my CPU using Joblib (with n_jobs=-1); the CPU was continuously hitting a temperature of 100°C and the processor tripped after running for 3 hours. As a workaround, I am now using 75% of the available cores on multiple computers (so with 8 available cores I use 6), and the program is running fine. The speedup is also good.
This is a broad but interesting question. Let me try to answer it.
2000 undirected simple graphs [...] Each graph has approx 40,000 nodes and approx 600,000 edges
Currently, I am using NetworkX for my analysis and currently running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G)
NetworkX uses plain Python implementations and is not optimized for performance. It's great for prototyping, but if you run into performance issues, it's best to rewrite your code using another library.
Other than NetworkX, the two most popular graph processing libraries are igraph and SNAP. Both are written in C/C++ and have Python APIs, so you get both good single-threaded performance and ease of use. Their parallelism is very limited, but this is not a problem in your use case: you have many graphs, which makes your problem embarrassingly parallel. Therefore, as you remarked in the updated question, you can run 6-8 jobs in parallel using e.g. Joblib or even xargs. If you need parallel processing, look into graph-tool, which also has a Python API.
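For illustration, this is roughly what your two NetworkX calls would look like in igraph (the file name and edge-list format are assumptions on my part):

import igraph as ig

# Load one graph from an edge-list file (placeholder path/format)
g = ig.Graph.Read_Edgelist("graph_000.edgelist", directed=False)

avg_clust = g.transitivity_avglocal_undirected()   # ~ nx.average_clustering
avg_path = g.average_path_length()                 # ~ nx.average_shortest_path_length
print(avg_clust, avg_path)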
Regarding your NetworkX algorithms, I'd expect average_shortest_path_length to be reasonably well optimized in all libraries. The average_clustering algorithm is trickier, as it relies on node-wise triangle counting: a naive implementation takes O(|E|^2) time while an optimized implementation will do it in O(|E|^1.5). Your graphs are large enough that this difference translates to running the algorithm on a graph in a few seconds versus running it for hours.
The "all-pairs shortest paths" (APSP) problem is very time-consuming, with most libraries using the Floyd–Warshall algorithm that has a runtime of O(|V|^3). I'm unsure what output you're looking for with the "All Possible Paths between two nodes" algorithm – enumerating all paths leads to an exponential amount of results and is unfeasible at this scale.
I would not start using the GPU for this task: an Intel i7-9700K should be up for this job. GPU-based graph processing libraries are challenging to set up and currently do not provide that significant of a speedup – the gains by using a GPU instead of a CPU are nowhere near as significant for graph processing as for machine learning algorithms. The only problem where you might be able to get a big speedup is APSP but it depends on which algorithms your chosen library uses.
If you are interested in GPU-based libraries, there are promising directions on the topic such as Gunrock, GraphBLAST, and a work-in-progress SuiteSparse:GraphBLAS extension that supports CUDA. However, my estimate is that you should be able to run most of your algorithms (barring APSP) in a few hours using a single computer and its CPU.
I have been using sympy to work with systems of differential equations. I write the equations symbolically, use autowrap to compile them through Cython, and then pass the resulting function to the scipy ODE solver. One of the major benefits of doing this is that I can compute the Jacobian symbolically using the sympy jacobian function, compile it, and pass it to the ODE solver as well.
This has been working great for systems of about 30 variables. Recently I tried doing it with 150 variables, and I ran out of memory while compiling the Jacobian function. This is on Windows with Anaconda and the Microsoft Visual C++ 14 tools for Python. During compilation of the Jacobian, which is now a 22000-element vector, memory usage during the linking step went up to about 7GB (on my 8GB laptop) before finally crashing out.
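For reference, a toy three-variable sketch of the workflow (the real system has 150 variables and a 150x150 Jacobian; the expressions here are just illustrative):

import sympy as sp
from sympy.utilities.autowrap import autowrap

# Toy stand-in for the real 150-variable system
y = sp.Matrix(sp.symbols('y0:3'))
rhs = sp.Matrix([y[1] * y[2], -y[0] * y[2], sp.Rational(1, 10) * y[0] * y[1]])

jac = rhs.jacobian(y)                 # symbolic N x N Jacobian

# Compile both through Cython; with N = 150 it is this step that runs out of memory
rhs_func = autowrap(rhs, backend='cython', args=list(y))
jac_func = autowrap(jac, backend='cython', args=list(y))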
Does someone have some suggestions before I go and try on a machine with more memory? Are other operating systems or other C compilers likely to improve the situation?
I know lots of people do this type of work, so if there's an answer, it will be beneficial to a good chunk of the community.
Edit: response to some of Jonathan's comments:
Yes, I'm fully aware that this is an N^2 problem. The jacobian is a matrix of all partial derivatives, so it will have size N^2. There is no real way around this scaling. However, a 22000-element array is not nearly at the level that would create a memory problem during runtime -- I only have the problem during compilation.
Basically, there are three levels at which we can address this.
1) solve the ODE problem without the Jacobian, or somehow split up the Jacobian so there is no 150x150 matrix. That would address the root of the problem, but it certainly limits what I can do, and I'm not yet convinced that compiling the Jacobian function is impossible.
2) change something about the way sympy automatically generates C code: split it into multiple chunks, use more functions for intermediate expressions, and somehow make the .c file smaller. People with more sympy experience might have some ideas on this.
3) change something about the way the C is compiled, so that less memory is needed.
I thought that by posting a separate question more oriented around #3 (literal referencing of large array -- compiler out of memory), I would get a different audience answering. That is in fact exactly what happened. Perhaps the answer to #3 is "you can't", but that's also useful information.
Following a lot of the examples posted at http://www.sympy.org/scipy-2017-codegen-tutorial/ I was able to get this to compile.
The key things were
1) instead of using autowrap, write the C code directly, with more control over it. Among other things, this allows passing the argument list as a vector instead of expanding it. This took some effort to get working (setting up the compiler flags through distutils, etc.), but in the end it worked well. Having the repo from the course linked above as an example helped a lot.
2) using common subexpression elimination (sympy.cse) to dramatically reduce the size of the expressions for the jacobian elements.
(1) by itself didn't do that much to help in this case (although I was able to use it to vastly improve performance of smaller models). The code was still 200 MB instead of the original 300 MB. But combining it with (2) (cse) I was able to get it down to a meager 1.7 MB (despite 14000 temporary variables).
The cse takes about 20-30 minutes on my laptop. After that, it compiles quickly.
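For anyone trying the same thing, point (2) boils down to something like this (toy expressions instead of my real Jacobian):

import sympy as sp

x, y = sp.symbols('x y')
jac = sp.Matrix([[sp.sin(x * y) + x * y, sp.cos(x * y)],
                 [x * y * sp.exp(x * y), sp.sin(x * y)**2]])

# cse factors out repeated subexpressions (x*y, sin(x*y), ...) into temporaries,
# which is what shrinks the generated C code so dramatically
replacements, reduced = sp.cse(jac)
for sym, expr in replacements:
    print(sym, '=', expr)
print(reduced[0])   # the Jacobian rewritten in terms of the temporaries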
Say I have to solve a large system of equations where
A_i = f(B_i)
B_i = g(A_i)
for many different i. Now, this is a system of equations which are only pairwise dependent. The lm (Levenberg-Marquardt) algorithm has proven the most stable for solving this.
Now, I could either solve these independently (i.e. loop over i and call scipy.optimize.root for each pair) or stack them all together and solve them at the same time. I'm unsure which will be faster, and it's difficult to know in general. I see the following arguments for and against:
The algorithm initially approximates the Jacobian numerically at the provided guess; increasing the dimensionality rapidly increases the time it takes to build the Jacobian (this speaks against stacking).
Once the Jacobian is found, most of the updating is linear matrix algebra, and therefore should be faster if stacked.
Does that make sense? My conclusion would in that case be "if solving takes a long time (bad guess or irregular function), stack them; if it's quick, do not stack".
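To make the "independent" option concrete, here's a sketch of what I mean (f and g are placeholders for my actual functions):

import numpy as np
from scipy.optimize import root

# Placeholder functions; the real f and g come from my model
def f(B):
    return np.tanh(B)

def g(A):
    return 0.5 * A + 1.0

def residual(z):
    # z = [A_i, B_i]; the two residuals of the i-th pair
    A, B = z
    return [A - f(B), B - g(A)]

# Option 1: loop over i and solve each 2x2 system on its own
solutions = [root(residual, x0=[0.0, 0.0], method='lm').x for i in range(1000)]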
I am not sure I understand correctly; when you say that they are pairwise dependent, do you mean that the full system can be decomposed into a collection of small 2x2 systems? If so, you should definitely opt for solving the smaller systems. If not, can you provide some equations?
I have a simple problem: multiply a matrix by a vector. However, the implementation of the multiplication is complicated because the matrix is 18 GB (3000^2 by 500, i.e. 9,000,000 x 500).
Some info:
The matrix is stored in HDF5 format. It's MATLAB output. It's dense, so there are no sparsity savings to be had.
I have to do this matrix multiplication roughly 2000 times over the course of my algorithm (MCMC Bayesian Inversion)
My program is a combination of Python and C, where the Python code handles most of the MCMC procedure: keeping track of the random walk, generating perturbations, checking the MH criterion, saving accepted proposals, monitoring the burn-in, etc. The C code is simply compiled into a separate executable and called when I need to solve the forward (acoustic wave) problem. All communication between the Python and C is done via the file system. All this is to say I don't already have any ctypes machinery going on.
The C program is already parallelized using MPI, but I don't think that's an appropriate solution for this MV multiplication problem.
Our program is run mainly on Linux, but occasionally on OS X and Windows. Cross-platform capability without too much headache is a must.
Right now I have a single-thread implementation where the python code reads in the matrix a few thousand lines at a time and performs the multiplication. However, this is a significant bottleneck for my program since it takes so darn long. I'd like to multithread it to speed it up a bit.
I'm trying to get an idea of whether it would be faster (computation-time-wise, not implementation time) for python to handle the multithreading and to continue to use numpy operations to do the multiplication, or to code an MV multiplication function with multithreading in C and bind it with ctypes.
I will likely do both and time them since shaving time off of an extremely long running program is important. I was wondering if anyone had encountered this situation before, though, and had any insight (or perhaps other suggestions?)
As a side question, I can only find algorithmic improvements for nxn matrices for m-v multiplication. Does anyone know of one that can be used on an mxn matrix?
Hardware
As Sven Marnach wrote in the comments, your problem is most likely I/O bound since disk access is orders of magnitude slower than RAM access.
So the fastest way is probably to get a machine with enough memory to keep the whole matrix and the result in RAM. It would save a lot of time if you only had to read the matrix once.
Replacing the hard disk with an SSD would also help, because an SSD can read and write a lot faster.
Software
Barring that, for speeding up reads from disk, you could use the mmap module. This should help, especially once the OS figures out you're reading pieces of the same file over and over and starts to keep it in the cache.
Since the calculation can be done row by row, you might benefit from using numpy in combination with a multiprocessing.Pool for that calculation. But that only really helps if a single process cannot use all the available disk read bandwidth.
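A rough sketch of that idea with h5py and a process pool (the file name, dataset name, and chunk size are placeholders):

import h5py
import numpy as np
from multiprocessing import Pool

H5_PATH = "matrix.h5"   # placeholder: the MATLAB-exported HDF5 file
DSET = "A"              # placeholder: dataset name inside the file
CHUNK = 5000            # rows handled per task

def partial_product(task):
    start, stop, x = task
    with h5py.File(H5_PATH, "r") as f:
        block = f[DSET][start:stop, :]   # read one block of rows from disk
    return start, block @ x              # multiply that block by the vector

if __name__ == "__main__":
    with h5py.File(H5_PATH, "r") as f:
        n_rows, n_cols = f[DSET].shape
    x = np.random.rand(n_cols)           # placeholder vector
    tasks = [(i, min(i + CHUNK, n_rows), x) for i in range(0, n_rows, CHUNK)]
    y = np.empty(n_rows)
    with Pool(processes=4) as pool:
        for start, part in pool.imap_unordered(partial_product, tasks):
            y[start:start + len(part)] = part

Whether this beats a single process depends on whether the disk, rather than the CPU, is the bottleneck.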
I want to solve a linear program in python. The number of variables (I will call it N from now on) is very large (~50000) and in order to formulate the problem in the way scipy.optimize.linprog requires it, I have to construct two N x N matrices (A and B below). The LP can be written as
minimize: c.x
subject to:
A.x <= a
B.x = b
x_i >= 0 for all i in {1, ..., N}
whereby . denotes the dot product and a, b, and c are vectors with length N.
My experience is that constructing such large matrices (A and B each have approx. 50000 x 50000 = 2.5*10^9 entries) comes with some issues: if the hardware is not very strong, NumPy may refuse to construct such big matrices at all (see for example Very large matrices using Python and NumPy), and even if NumPy creates the matrices without problems, there is a huge performance issue. This is to be expected given the huge amount of data NumPy has to deal with.
However, even though my linear program has N variables, the matrices I work with are very sparse. One of them has entries only in its very first row; the other has entries only in its first M rows, with M < N/2. Of course I would like to exploit this fact.
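For illustration, this is how I could build the constraint matrices in sparse form (the sizes and nonzero positions are placeholders matching the structure described above):

import numpy as np
import scipy.sparse as sp

N, M = 50000, 20000                       # illustrative sizes, M < N/2
rng = np.random.default_rng(0)

# A has nonzeros only in its very first row
A = sp.coo_matrix((np.ones(N), (np.zeros(N, dtype=int), np.arange(N))),
                  shape=(N, N))

# B has nonzeros only in its first M rows (a few per row here)
nnz = 5 * M
rows = rng.integers(0, M, size=nnz)
cols = rng.integers(0, N, size=nnz)
B = sp.coo_matrix((rng.random(nnz), (rows, cols)), shape=(N, N))

A, B = A.tocsr(), B.tocsr()               # never materializes the 2.5*10^9 zeros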
As far as I have read (e.g. Trying to solve a Scipy optimization problem using sparse matrices and failing), scipy.optimize.linprog does not work with sparse matrices. Therefore, I have the following questions:
Is it actually true that SciPy does not offer any possibility to solve a linear program with sparse matrices? (If not, how can I do it?)
Do you know any alternative library that will solve the problem more effectively than SciPy with non-sparse matrices? (The library suggested in the thread above seems to be not flexible enough for my purposes - as far as I understand its documentation)
Can it be expected that a new implementation of the simplex algorithm (using plain Python, no C) that exploits the sparsity of the matrices will be more efficient than SciPy with non-sparse matrices?
I would say forming a dense matrix (or two) to solve a large sparse LP is probably not the right thing to do. When solving a large sparse LP it is important to use a solver that has facilities to handle such problems and also to generate the model in a way that does not explicitly create any of these zero elements.
Writing a stable, fast, sparse Simplex LP solver in Python as a replacement for the SciPy dense solver is not a trivial exercise. Moreover a solver written in pure Python may not perform as well.
For the size you indicate, although not very, very large ("large medium-sized model" would perhaps be a good classification), you may want to consider a commercial solver like CPLEX, Gurobi, or Mosek. These solvers are very fast and very reliable (they solve basically any LP problem you throw at them). They all have Python APIs. The solvers are free or very cheap for academics.
If you want to use an Open Source solver, you may want to look at the COIN CLP solver. It also has a Python interface.
If your model is more complex, then you may also want to consider using a Python modeling tool such as PuLP or Pyomo (Gurobi also has good modeling support in Python).
I can't believe nobody has pointed you in the direction of PuLP! You will be able to create your problem efficiently, like so:
import pulp

# maximize sum((i+1) * x_i) subject to sum(x_i) <= 3 and x_i >= 0
prob = pulp.LpProblem("test_problem", pulp.LpMaximize)
x = pulp.LpVariable.dicts('x', range(5), lowBound=0.0)
prob += pulp.lpSum([(ix + 1) * x[ix] for ix in range(5)]), "objective"
prob += pulp.lpSum(x.values()) <= 3, "capacity"
prob.solve()
for k, v in prob.variablesDict().items():
    print(k, v.value())
PuLP is fantastic, comes with a very decent solver (CBC) and can be hooked up to open source and commercial solvers. I am currently using it in production for a forestry company and exploring Dippy for the hardest (integer) problems we have. Best of luck!