I am using scipy.optimize.minimize for nonlinear constrained optimization.
I tested two methods (trust-constr, SLSQP).
On a machine (Ubuntu 20.04.1 LTS) where nproc reports 32 cores,
scipy.optimize.minimize(..., method='trust-constr', ...) uses multiple cores (around 1600% CPU), while
scipy.optimize.minimize(..., method='SLSQP', ...) uses only one core.
According to another post (scipy optimise minimize -- parallelisation options), this is not a Python problem but rather a BLAS/LAPACK/MKL one.
However, if it were purely a BLAS problem, I would expect all methods to behave the same way with respect to core usage, which is not what I see.
In the post, someone replied that SLSQP uses multiple cores.
Does the parallelization support of scipy.optimize.minimize depend on the chosen method?
How can I make SLSQP use multiple cores?
One observation I made by looking into
anaconda3/envs/[env_name]/lib/python3.8/site-packages/scipy/optimize:
trust-constr is implemented in Python (the _trustregion_constr directory), while
SLSQP lives in a compiled extension (the _slsqp.cpython-38-x86_64-linux-gnu.so file, which wraps the original Fortran implementation).
On parsing the _slsqp.py source file, you may notice that scipy's SLSQP does not use MPI or multiprocessing (or any other form of parallel processing).
Adding some sort of multiprocessing/MPI support is not trivial, because you would have to do surgery on the backend to add the MPI barriers/synchronization holds (and make sure that all processes/threads run in sync, while the main "optimizer" runs only on a single core).
If you're heading down this path, it's relevant to mention: SLSQP as implemented in SciPy has an inefficient order of operations. When it computes derivatives, it perturbs all design variables and finds the gradient of the objective function first (a wrapper function is created at runtime to do this), and then SLSQP's Python wrapper computes gradients for the constraint functions by perturbing each design variable again.
If speeding up SLSQP is critical, fixing the order of operations in the backend (where gradients of objectives and of constraints get different treatment) is important for the many problems in which objectives and constraints share a lot of common computation. I'd say both backend updates belong in this category: something for the dev forums to ponder.
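One workaround that avoids backend surgery, at least for the objective: SLSQP only does its own (serial) finite differencing when you don't pass jac, so you can supply a gradient callable that parallelizes the perturbations itself. A minimal sketch with multiprocessing; the toy objective f, the step size h, and the helper names are illustrative assumptions, not anything from SciPy:

import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def f(x):
    # toy objective; stands in for an expensive simulation
    return np.sum(x ** 2) + np.prod(np.cos(x))

def _central_diff(args):
    # one central-difference component of the gradient
    x, i, h = args
    e = np.zeros_like(x)
    e[i] = h
    return (f(x + e) - f(x - e)) / (2.0 * h)

def parallel_grad(x, h=1e-6):
    # perturb each design variable in a separate worker process
    with Pool() as pool:
        return np.array(pool.map(_central_diff,
                                 [(x, i, h) for i in range(x.size)]))

if __name__ == '__main__':
    res = minimize(f, np.ones(8), method='SLSQP', jac=parallel_grad)
    print(res.x, res.fun)

Spawning a Pool on every gradient call has real overhead, so this only pays off when a single evaluation of f is expensive; a persistent pool would be the next refinement.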
Related
I have an optimization model written in Pyomo. When I run it using Gurobi, it outputs the answer very quickly, mostly because of Gurobi's efficient presolver. Is there a way to run a presolve on the Pyomo model before calling the actual solver, so I can test my model using non-commercial packages like Couenne or CBC?
As @gmavrom mentions, it's important to know what you are trying to accomplish with a presolve, as many different techniques may be considered "presolve" operations. The commercial solvers put a lot of engineering effort into tuning their respective presolve operations.
As @Erwin points out, commercial AMLs like AMPL also sometimes provide presolve capabilities.
Within Pyomo, you can implement various "presolve" techniques by operating directly on the optimization modeling objects. See the feasibility-based bounds tightening implemented in pyomo.contrib.fbbt as an example: https://github.com/Pyomo/pyomo/blob/master/pyomo/contrib/fbbt/fbbt.py
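For instance, a minimal sketch of running FBBT on a toy model before handing it to a free solver (the model itself is made up for illustration):

import pyomo.environ as pyo
from pyomo.contrib.fbbt.fbbt import fbbt

m = pyo.ConcreteModel()
m.x = pyo.Var(bounds=(-10, 10))
m.y = pyo.Var(bounds=(0, 1))
m.c = pyo.Constraint(expr=m.x + m.y == 1)
m.obj = pyo.Objective(expr=m.x ** 2 + m.y ** 2)

fbbt(m)  # propagates constraint information to tighten variable bounds in place

print(m.x.bounds)  # x is now bounded by [0, 1] instead of [-10, 10]
# the tightened model can then go to a non-commercial solver, e.g.:
# pyo.SolverFactory('couenne').solve(m)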
I'm using scipy.optimize.brute(), but I noticed that it only uses one of my cores. One big advantage of a grid search is that all iterations of the algorithm are independent of each other.
Given that that's the case - why is brute() not implemented to run on multiple cores? If there is no good reason - is there a quick way to extend it / make it work, or does it make more sense to write the whole routine from scratch?
scipy.optimize.brute takes an arbitrary Python function. There is no guarantee that this function is thread-safe. Even if it is, Python's global interpreter lock (GIL) means that unless the function releases the GIL in C, it can't run on more than one core within a single process anyway.
If you want to parallelize your brute-force search, you should write it yourself. You may have to write some Cython or C to get around the GIL.
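One update worth adding, though: newer SciPy releases (1.3+, if I remember correctly) gave brute a workers= keyword that parallelizes the grid evaluation with multiprocessing, so check your version before rolling your own. A small sketch with a made-up objective:

import numpy as np
from scipy.optimize import brute

def f(z):
    # the objective must be picklable (module-level) for workers to apply
    x, y = z
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + np.sin(3.0 * x)

if __name__ == '__main__':
    # workers=-1 spreads the grid points over all available cores
    xmin = brute(f, ranges=((-5, 5), (-5, 5)), Ns=50, workers=-1)
    print(xmin)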
Do you have scikit-learn installed? With a bit of refactoring you could use sklearn.grid_search.GridSearchCV, which supports multiprocessing via joblib.
You would need to wrap your local optimization function as an object that exposes the generic scikit-learn estimator interface, including a .score(...) method (or you could pass in a separate scoring function to the GridSearchCV constructor via the scoring= kwarg).
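A rough sketch of that wrapping idea, using the modern module path sklearn.model_selection (sklearn.grid_search is its old name); the toy objective, the LocalOptimizer name, and the dummy data are made up for illustration:

import numpy as np
from scipy.optimize import minimize
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV

class LocalOptimizer(BaseEstimator):
    # wraps one local optimization run so GridSearchCV can grid its start point
    def __init__(self, x0=0.0):
        self.x0 = x0

    def fit(self, X, y=None):
        self.result_ = minimize(lambda v: (v[0] - 3.0) ** 2 + np.sin(5.0 * v[0]),
                                [self.x0])
        return self

    def score(self, X, y=None):
        return -self.result_.fun  # GridSearchCV maximizes the score

if __name__ == '__main__':
    grid = {'x0': np.linspace(-5, 5, 21).tolist()}
    search = GridSearchCV(LocalOptimizer(), grid, cv=2, n_jobs=-1)
    search.fit(np.zeros((4, 1)))  # dummy data; fit() ignores it
    print(search.best_params_, search.best_score_)

It's n_jobs=-1 that actually buys the multi-core execution, via joblib.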
Is anyone aware of an implemented version (perhaps using scipy/numpy) of parallel exact matrix diagonalization (equivalently, finding the eigensystem)? If it helps, my matrices are symmetric and sparse. I would hate to spend a day reinventing the wheel.
EDIT:
My matrices are at least 10,000x10,000 (but, preferably, at least 20 times larger). For now, I only have access to a 4-core Intel machine (with hyperthreading, so 2 hardware threads per core), ~3.0 GHz each, with 12GB of RAM. I may later have access to a 128-core node, ~3.6 GHz/core with 256GB of RAM, so single machine/multiple cores should do it (for my other parallel tasks, I have been using multiprocessing). I would prefer algorithms that scale well.
I do need exact diagonalization, so the scipy.sparse routines are not good for me (I tried them; they didn't work well). I have been using numpy.linalg.eigh (and I see only a single core doing all the computations).
Alternatively (to the original question): is there an online resource where I can find out more about compiling SciPy so as to ensure parallel execution?
For symmetric sparse matrix eigenvalue/eigenvector finding, you may use scipy.sparse.linalg.eigsh. It uses ARPACK behind the scenes, and there are parallel ARPACK implementations; AFAIK, SciPy can be compiled against one of those if your installation currently uses the serial version.
However, this is not a good answer if you need all eigenvalues and eigenvectors of the matrix, as the sparse version uses the Lanczos algorithm, which is designed to find only a few extreme eigenvalues.
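For the few-eigenpairs case that eigsh does handle, usage looks like this (the 1-D Laplacian here is just a stand-in for your matrix):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 10000
# toy symmetric sparse matrix: the 1-D Laplacian
diagonals = [2.0 * np.ones(n), -1.0 * np.ones(n - 1), -1.0 * np.ones(n - 1)]
A = sp.diags(diagonals, offsets=[0, -1, 1], format='csr')

# six largest-magnitude eigenpairs via ARPACK's Lanczos iteration
vals, vecs = eigsh(A, k=6, which='LM')
print(vals)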
If your matrix is not overwhelmingly large, then just use numpy.linalg.eigh. It uses LAPACK or BLAS and may use parallel code internally.
If you end up rolling your own, please note that SciPy/NumPy does all the heavy lifting through various highly optimized linear algebra packages, not in pure Python. Because of this, the performance and degree of parallelism depend heavily on the libraries your SciPy/NumPy installation was compiled against.
(Your question does not reveal if you just want to have parallel code running on several processors, or on several computers. Also, the size of your matrix has a big impact on the best method. So, this answer may be completely off-the-mark.)
I'm thinking about using Clyther for a high performance task. It is exciting to write OpenCL kernels using only python, but I'm wondering about the performance gap.
What are tasks that Clyther is good at? Bad at? Are Clyther-generated kernels good or not?
Is it possible to find some benchmarks?
As the documentation states, the main entry points for CLyther are its clyther.task and clyther.kernel decorators; once a function is decorated with one of these, it will be compiled to OpenCL when called.
CLyther is a compiler for a subset of the Python language. It compiles your Python-subset code into OpenCL, so the actual run time of the kernel will not (or should not) differ much between interfaces to OpenCL. The actual overhead of CLyther (as with all Python interfaces) comes from calling the OpenCL functions and from moving data between CLyther/Python and OpenCL.
Benchmarks showing CLyther's performance are available in the documentation. The source tarball contains C++ and FORTRAN editions of the benchmarked program, a Laplace equation solver, so you can use them to reproduce the benchmark results yourself.
Personally, I believe that you can use CLyther effectively on the majority of problems in need of OpenCL computation.
I am trying to figure out explicitly which of the functions in SciPy/NumPy run on multiple processors. I can, e.g., read in the SciPy reference manual that SciPy uses this, but I am more interested in exactly which functions run parallel computations, because not all of them do. The dream scenario would of course be if it were included when you type help(scipy.foo), but this does not seem to be the case.
Any help will be much appreciated.
Best,
Matias
I think the question is better addressed to the BLAS/LAPACK libraries you use rather than to SciPy/NumPy.
Some BLAS/LAPACK libraries, such as MKL, use multiple cores natively where other implementations might not.
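A quick way to check which BLAS/LAPACK your own installation was built against:

import numpy, scipy

numpy.show_config()  # prints the BLAS/LAPACK libraries NumPy links to
scipy.show_config()  # same for SciPy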
To take scipy.linalg.solve as an example, here's its source code (with some error handling code omitted for clarity):
def solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0,
          debug=0):
    # a1 and b1 are the validated input arrays (error handling omitted)
    if sym_pos:
        posv, = get_lapack_funcs(('posv',), (a1, b1))
        c, x, info = posv(a1, b1, lower=lower,
                          overwrite_a=overwrite_a,
                          overwrite_b=overwrite_b)
    else:
        gesv, = get_lapack_funcs(('gesv',), (a1, b1))
        lu, piv, x, info = gesv(a1, b1,
                                overwrite_a=overwrite_a,
                                overwrite_b=overwrite_b)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv'
                     % -info)
As you can see, it's just a thin wrapper around two families of LAPACK functions (exemplified by DPOSV and DGESV).
There is no parallelism going on at the SciPy level, yet you observe the function using multiple cores on your system. The only possible explanation is that your LAPACK library is capable of using multiple cores, without NumPy/SciPy doing anything to make this happen.
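If you want to verify or control this at runtime, the third-party threadpoolctl package can inspect and limit the thread pool of the loaded BLAS (MKL, OpenBLAS, etc.); a small sketch:

import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

print(threadpool_info())  # lists the loaded BLAS/OpenMP libraries and their thread counts

a = np.random.rand(2000, 2000)
b = np.random.rand(2000)

x = np.linalg.solve(a, b)          # may use all cores, depending on your BLAS
with threadpool_limits(limits=1):  # temporarily pin the BLAS to one thread
    x = np.linalg.solve(a, b)      # now runs on a single core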