Is anyone aware of an implemented version (perhaps using scipy/numpy) of parallel exact matrix diagonalization (equivalently, finding the eigensystem)? If it helps, my matrices are symmetric and sparse. I would hate to spend a day reinventing the wheel.
EDIT:
My matrices are at least 10,000x10,000 (but, preferably, at least 20 times larger). For now, I only have access to a 4-core Intel machine (with hyperthreading, so 2 hardware threads per core), ~3.0 GHz each, with 12 GB of RAM. I may later have access to a 128-core node at ~3.6 GHz/core with 256 GB of RAM, so single machine/multiple cores should do it (for my other parallel tasks, I have been using multiprocessing). I would prefer the algorithms to scale well.
I do need exact diagonalization, so the scipy.sparse routines are no good for me (I tried them; they didn't work well). I have been using numpy.linalg.eigh (and I see only a single core doing all the computations).
Alternatively (to the original question): is there an online resource where I can find out more about compiling SciPy so as to ensure parallel execution?
For finding eigenvalues/eigenvectors of a symmetric sparse matrix, you may use scipy.sparse.linalg.eigsh. It uses ARPACK behind the scenes, and there are parallel ARPACK implementations. AFAIK, SciPy can be compiled against one of these if your installation currently uses the serial version.
However, this is not a good answer if you need all eigenvalues and eigenvectors of the matrix, since the sparse version uses the Lanczos algorithm, which is designed to compute only a few extreme eigenpairs.
If your matrix is not overwhelmingly large, then just use numpy.linalg.eigh. It uses LAPACK or BLAS and may use parallel code internally.
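For illustration, here is a minimal sketch of both routes (my own toy example with a random symmetric matrix, not your data):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

n = 2000
M = sparse_random(n, n, density=1e-3, format="csr", random_state=0)
M = (M + M.T) / 2                      # symmetrize the test matrix

# ARPACK/Lanczos route: only the k largest-magnitude eigenpairs.
vals, vecs = eigsh(M, k=6)

# Full ("exact") diagonalization needs the dense routine; whether it
# uses several cores depends on the BLAS/LAPACK your numpy links against.
w, v = np.linalg.eigh(M.toarray())
```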
If you end up rolling your own, please note that SciPy/NumPy does all the heavy lifting through various highly optimized linear algebra packages, not in pure Python. Because of this, the performance and the degree of parallelism depend heavily on the libraries your SciPy/NumPy installation is compiled against.
(Your question does not reveal if you just want to have parallel code running on several processors, or on several computers. Also, the size of your matrix has a big impact on the best method. So, this answer may be completely off-the-mark.)
Related
I need to efficiently solve large nonsymmetric generalized eigenvalue/eigenvector problems.
A x = lambda B x
A, B - general real matrices
A - dense
B - mostly sparse
x - the eigenvector
lambda - the eigenvalue
Could someone help me by:
Informing me whether the nonsymmetric generalized eigenvalue/eigenvector problem is known to be parallelizable (and what some good algorithms and libraries implementing them are, if any);
Telling me whether ScaLAPACK is an alternative for dense nonsymmetric eigenproblems;
Suggesting some good computational alternatives to test the use of both sparse matrices and linear-algebra algorithms;
Suggesting an alternative linear algebra construction that I could use (if there are no simple routine calls, perhaps there is a good solution that is not so simple).
I tested code efficiency using MATLAB, Python, and C. MATLAB is said to have built-in LAPACK functionality. I used the Intel-provided Python, with SciPy and NumPy linked against the Intel MKL LAPACK and BLAS libraries. I also used C code linked against the Intel MKL LAPACK and BLAS libraries.
I was able to check that for non-generalized eigenvalue problems, the code ran in parallel: I had as many threads as physical cores in my machine. That told me that LAPACK uses parallel code in certain routines (either LAPACK itself or the optimized versions shipped with MATLAB and the Intel MKL oneAPI libraries).
When I started to run generalized eigenvalue routines, I observed that the code ran with only one thread. I tested this in MATLAB and in Intel's Python distribution.
I'd like to investigate this further, but first I need to know if it's possible even in theory to run generalized nonsymmetric eigen decompositions in parallel.
I've seen that SciPy has routines for the reduction of a pair of general matrices to a pair of Hessenberg/upper-triangular matrices. It seems that, starting from Hessenberg form, eigenvalue/eigenvector problems are computationally easier.
The Hessenberg reduction of a single matrix runs in parallel, but the Hessenberg reduction of a pair of matrices runs only sequentially, on one thread (tested in Python/SciPy). And again I hit a wall, which raises the question: is this problem parallelizable at all?
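For concreteness, a minimal sketch of the SciPy calls in question (my own random test matrices, not the actual problem data):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))     # dense general matrix
B = rng.standard_normal((n, n))     # stand-in for the mostly sparse B

# Two-argument eig solves A x = lambda B x via LAPACK's QZ-based ?ggev.
w, v = linalg.eig(A, B)

# The generalized Schur (QZ) factorization is also exposed directly;
# this is the Hessenberg/triangular pair reduction mentioned above.
AA, BB, Q, Z = linalg.qz(A, B)
```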
Another source of optimization for my problem is that one of the matrices is dense and the other is mostly sparse. I'm still not sure how to exploit this. Are there good implementations of sparse matrices and state-of-the-art linear algebra algorithms that work well together?
Thank you very much for any help supplied! Including books and scientific papers.
For MKL-provided routines, you can run in parallel mode by using the /Qmkl compiler option (Intel compilers) or in sequential mode by using the /Qmkl:sequential option. For more details on the linking options, refer to the MKL developer reference manual: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/linking-your-application-with-onemkl.html
Here is the link which shows the MKL routines that use sparse matrices https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/sparse-solver-routines.html
P.S.: I've mentioned possible solutions to my problem but am still quite confused about them, so please give me your suggestions on them. Also, if this question is not a good fit for this site, please point me to the correct site and I'll move the question there. Thanks in advance.
I need to perform some repetitive graph theory and complex network algorithms to analyze approx 2000 undirected simple graphs with no self-loops for some research work. Each graph has approx 40,000 nodes and approx 600,000 edges (essentially making them sparse graphs).
Currently, I am using NetworkX for my analysis, running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G) for 500 such graphs; the code has been running for 3 days and has reached only the halfway mark. This makes me fear that the full analysis will take an unexpectedly huge amount of time.
Before elaborating on my problem and the probable solutions I've thought of, let me mention my computer's configuration as it may help you in suggesting the best approach. I am running Windows 10 on an Intel i7-9700K processor with 32GB RAM and one Zotac GeForce GTX 1050 Ti OC Edition ZT-P10510B-10L 4GB PCI Express Graphics Card.
Explaining my possible solutions and my confusions regarding them:
A) Using GPU with Adjacency Matrix as Graph Data Structure: I can put an adjacency matrix on the GPU and perform my analysis by coding the algorithms manually with PyCUDA or Numba, using loops only, as recursion cannot be handled by the GPU. The nearest thing I was able to find is this question on Stack Overflow, but it has no good solution.
My Expectations: I hope to speed up algorithms such as All Pair Shortest Path, All Possible Paths between Two Nodes, Average Clustering, Average Shortest Path Length, and Small World Properties, etc. If it gives a significant speedup per graph, my results can be achieved very fast.
My Confusions:
Can these graph algorithms be coded efficiently on a GPU?
Which will be better to use, PyCUDA or Numba?
Is there any other way to store graphs on the GPU that could be more efficient, given that my graphs are sparse?
I am an average Python programmer with no GPU-programming experience, so I will have to understand and learn GPU programming with PyCUDA/Numba. Which one is easier to learn?
B) Parallelizing Programs on the CPU Itself: I can use Joblib or some other library to run the program in parallel on my CPU. I can also arrange 2-3 more computers, on which I can run small independent portions of the program, or run 500 graphs per computer.
My Expectations: I hope to speed up the algorithms by parallelizing and dividing tasks among computers. If the GPU solution does not work, I may still have some hope with this method.
My Confusions:
Which other libraries are available as good alternatives for Joblib?
Should I allot all CPU cores (8 cores in i7) for my programs or use fewer cores?
C) Apart from my probable solutions, do you have any other suggestions for me? If a better and faster solution is available in a language other than C/C++, you can suggest it as well, as I am already considering C++ as a fallback plan if nothing works.
Work In Progress Updates
Based on suggestions from comments on this question and discussion in my community, these are the points I've been advised to explore:
GraphBLAS
boost.graph + extensions with python-wrappers
graph-tool
Spark/Dask
PyCUDA/Numba
Linear Algebra methods using PyTorch
I tried to run 100 graphs on my CPU (with n_jobs=-1) using Joblib; the CPU continuously hit a temperature of 100°C, and the processor tripped after running for 3 hours. As a solution, I am now using 75% of the available cores on multiple computers (so if 8 cores are available, I use 6), and the program runs fine. The speedup is also good.
This is a broad but interesting question. Let me try to answer it.
2000 undirected simple graphs [...] Each graph has approx 40,000 nodes and approx 600,000 edges
Currently, I am using NetworkX for my analysis and currently running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G)
NetworkX uses plain Python implementations and is not optimized for performance. It's great for prototyping, but if you run into performance issues, it's best to rewrite your code using another library.
Other than NetworkX, the two most popular graph processing libraries are igraph and SNAP. Both are written in C and have Python APIs so you get both good single-threaded performance and ease of use. Their parallelism is very limited but this is not a problem in your use case as you have many graphs, rendering your problem embarrassingly parallel. Therefore, as you remarked in the updated question, you can run 6-8 jobs in parallel using e.g. Joblib or even xargs. If you need parallel processing, look into graph-tool, which also has a Python API.
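For example, a hedged sketch of the per-graph parallelism with igraph and Joblib (the file names and format are placeholders, not from your question):

```python
import igraph as ig
from joblib import Parallel, delayed

def analyze(path):
    g = ig.Graph.Read_GraphML(path)                  # or your actual format
    return (g.transitivity_avglocal_undirected(),    # average local clustering
            g.average_path_length())                 # average shortest path

paths = [f"graph_{i}.graphml" for i in range(2000)]  # placeholder file names
results = Parallel(n_jobs=6)(delayed(analyze)(p) for p in paths)
```

Since each graph is processed independently, this scales almost linearly with the number of cores you allot.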
Regarding your NetworkX algorithms, I'd expect average_shortest_path_length to be reasonably well-optimized in all libraries. The average_clustering algorithm is tricky, as it relies on node-wise triangle counting: a naive implementation takes O(|E|^2) time, while an optimized implementation does it in O(|E|^1.5). Your graphs are large enough that the difference between these two costs is finishing in a few seconds versus running for hours.
The "all-pairs shortest paths" (APSP) problem is very time-consuming, with most libraries using the Floyd–Warshall algorithm that has a runtime of O(|V|^3). I'm unsure what output you're looking for with the "All Possible Paths between two nodes" algorithm – enumerating all paths leads to an exponential amount of results and is unfeasible at this scale.
I would not start using the GPU for this task: an Intel i7-9700K should be up for this job. GPU-based graph processing libraries are challenging to set up and currently do not provide that significant of a speedup – the gains by using a GPU instead of a CPU are nowhere near as significant for graph processing as for machine learning algorithms. The only problem where you might be able to get a big speedup is APSP but it depends on which algorithms your chosen library uses.
If you are interested in GPU-based libraries, there are promising directions on the topic such as Gunrock, GraphBLAST, and a work-in-progress SuiteSparse:GraphBLAS extension that supports CUDA. However, my estimate is that you should be able to run most of your algorithms (barring APSP) in a few hours using a single computer and its CPU.
I've been using numpy for quite some time now and am fond of just how much faster it is for simple operations on vectors and matrices, compared to e.g. looping over elements of the same array.
My understanding is that it uses SIMD CPU extensions, but according to some, at least some of its functionality makes use of multithreading (via OpenMP?). On the other hand, there are lots of questions here on SO (example) about speeding up operations on numpy arrays by using multiprocessing.
I have not seen numpy definitively use multiple cores at once, although it sometimes looks as if two cores (on an 8-core machine) are in use. But I may have been using the "wrong" functions for that, or using them in the wrong way, or maybe my matrices are too small to make it worthwhile?
The question therefore:
Are there some numpy functions which can use multiple cores on a shared-memory machine, either via OpenMP or some other means?
If yes, is there some place in the numpy documentation with a definite list of those functions?
And in that case, is there some documentation on what a user of numpy would have to do to make sure they use all available CPU cores, or some specific predetermined number of cores?
I'm aware that there are libraries which permit splitting numpy arrays and such up across multiple machines or compute nodes, but I suspect the use case for that is either with being able to handle more data than fits into local RAM, or speeding processing up more than what a single multi-core machine can achieve. This is however not what this question is about.
Update
Given the comment by @talonmies (who states that by default there's no such functionality in numpy, and that it depends on LAPACK and BLAS):
What's the easiest way to obtain a suitably-compiled numpy version which makes use of multiple CPU cores (and hopefully also SIMD extensions)?
Or is the reason why numpy doesn't usually multiprocess that most people for whom that is important have already switched to using Multiprocessing or things like dask to handle multiple cores explicitly rather than having only the numpy bits accelerated implicitly?
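For reference, the quick check I've been using (a sketch; the environment-variable names assume an OpenBLAS or MKL build):

```python
import numpy as np

np.show_config()   # prints the BLAS/LAPACK libraries numpy was built against

# Thread counts are typically controlled by environment variables set
# before the import, e.g. OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS.
a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
c = a @ b          # a BLAS dgemm: watch a process monitor while this runs
```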
I was wondering if there is a way to calculate the first few eigenvectors of a very large sparse matrix in TensorFlow, hoping that it might be faster than SciPy's implementation of ARPACK, which doesn't seem to support parallel computing, at least as far as I've noticed.
I believe you should rather look into petsc4py or slepc4py.
They are Python bindings of PETSc (the Portable, Extensible Toolkit for Scientific Computation) and SLEPc (the Scalable Library for Eigenvalue Problem Computations).
PETSc and SLEPc support MPI, and therefore petsc4py and slepc4py do too.
I believe you will find useful examples in the slepc4py demos.
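Here is a minimal sketch of the slepc4py workflow (my own toy tridiagonal matrix, not your data; run it under mpiexec to use several processes):

```python
import sys
import slepc4py
slepc4py.init(sys.argv)
from petsc4py import PETSc
from slepc4py import SLEPc

n = 10000
A = PETSc.Mat().createAIJ([n, n], nnz=3)   # sparse (AIJ/CSR) matrix
start, end = A.getOwnershipRange()         # rows owned by this MPI rank
for i in range(start, end):                # toy tridiagonal test matrix
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemble()

eps = SLEPc.EPS().create()
eps.setOperators(A)
eps.setProblemType(SLEPc.EPS.ProblemType.HEP)  # Hermitian/symmetric problem
eps.setDimensions(nev=5)                       # the first few eigenpairs
eps.solve()
for k in range(eps.getConverged()):
    print(eps.getEigenvalue(k).real)
```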
I have a large (1000x1000x5000) 3D numpy array upon which I need to perform many 3D rotations and then compute an asymmetric distance transform. The distance transform is trivially parallelizable, but I need a way to also perform the rotation itself using a computing cluster (which doesn't have so much [e.g. 2GB] memory/core). What's a good strategy to efficiently exploit the computing cluster? (it does not have any GPUs or other specialized hardware for that matter).
And yes, I need the rotated volume - meaning that I cannot just simply relabel the coordinates as the asymmetric distance transform will overwrite the dataset several times.
The software I'm using on the cluster: python3.4.2 with scipy, numpy and mpi4py.
Thanks!
If you want to do matrix operations (e.g. a rotation that you could express as a matrix multiplication) in parallel on a cluster, here is what I would do:
Compile numpy with a multi-threaded BLAS (e.g. OpenBLAS) so the matrix multiplication is multi-threaded on a node. The advantage is that you know this has been extensively tested and optimized, and you don't need to worry about parallel scaling.
Assuming that the machine has, say, 32 cores per node (i.e. 2*32 = 64 GB of RAM in total), I would run ~4 MPI tasks per node with 8 threads per MPI task (so the available RAM per task is 16 GB, removing the low-RAM constraint).
Do a domain decomposition of your array among MPI tasks. For instance, this code (see _mprotate function) does a computation of rotations with scipy.ndimage using multiprocessing, you could do something similar but with mpi4py.
The problem, though, is that, unless I'm mistaken, scipy.ndimage.interpolation.rotate does not use matrix operations with BLAS; it is a pure C implementation that in the end calls the NI_GeometricTransform function. So, unless you use a different algorithm, the above approach won't help. You will then have to run as many MPI tasks as you have cores and do the domain decomposition among them (see the mpi4py tutorials).
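A hedged sketch of that slab decomposition (it only works this simply for a rotation about the z-axis, where the z-slices stay independent; a general 3D rotation mixes slabs and would need communication; load_slab is a hypothetical loader for your data):

```python
from mpi4py import MPI
import numpy as np
from scipy.ndimage import rotate

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nz = 5000                                    # slices along the last axis
counts = [nz // size + (r < nz % size) for r in range(size)]
offsets = np.cumsum([0] + counts[:-1])

# Each rank loads only its own z-slab instead of the full 1000x1000x5000 array.
local = load_slab(offsets[rank], counts[rank])   # hypothetical loader

# A rotation in the xy-plane treats the z-slices independently, so each
# slab can be rotated without any halo exchange.
local_rot = rotate(local, angle=30.0, axes=(0, 1), reshape=False)
```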
This does not fully answer your question but hope it helps.