When I have to parallelize an algorithm in Python I usually use the multiprocessing map function.
In scikit-learn's RandomizedLasso it seems that they are using something different.
I am not very experienced with parallel computing in Python and I hope I can learn something new from this.
Can anyone explain to me what they are using?
In their situation I would have used multiprocessing. Why did they choose something different?
n_jobs is fed to joblib, which is used for all parallel processing in scikit-learn. As you can see on the joblib website, it's much easier to use than multiprocessing; it's also more feature-rich, as it can use either processes or threads (faster when executing C code) and has shared-memory support for NumPy arrays.
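A minimal sketch of the joblib API the answer describes (the function and inputs here are placeholders, not anything taken from scikit-learn):

```python
from math import sqrt
from joblib import Parallel, delayed

# Run sqrt over the inputs using 2 workers; results come back in input order.
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))

# In recent joblib versions, prefer="threads" switches to a thread pool,
# which is faster when the work releases the GIL (e.g. many NumPy/C routines).
threaded = Parallel(n_jobs=2, prefer="threads")(delayed(sqrt)(i) for i in range(10))
```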
Related
I've been using numpy for quite some time now and am fond of just how much faster it is for simple operations on vectors and matrices, compared to e.g. looping over elements of the same array.
My understanding is that it uses SIMD CPU extensions, but according to some, at least some of its functionality makes use of multiprocessing (via OpenMP?). On the other hand, there are lots of questions here on SO (example) about speeding up operations on numpy arrays by using multiprocessing.
I have not seen numpy definitely use multiple cores at once, although it looks as if sometimes two cores (on an 8-core machine) are in use. But I may have been using the "wrong" functions for that, or using them in the wrong way, or maybe my matrices are too small to make it worth it?
The question therefore:
Are there some numpy functions which can use multiple processes on a shared-memory machine, either via OpenMP or some other means?
If yes, is there some place in the numpy documentation with a definite list of those functions?
And in that case, is there some documentation on what a user of numpy would have to do to make sure they use all available CPU cores, or some specific predetermined number of cores?
I'm aware that there are libraries which permit splitting numpy arrays and such up across multiple machines or compute nodes, but I suspect the use case for that is either with being able to handle more data than fits into local RAM, or speeding processing up more than what a single multi-core machine can achieve. This is however not what this question is about.
Update
Given the comment by @talonmies (who states that by default there's no such functionality in numpy, and that it would depend on LAPACK and BLAS):
What's the easiest way to obtain a suitably-compiled numpy version which makes use of multiple CPU cores (and hopefully also SIMD extensions)?
Or is the reason why numpy doesn't usually multiprocess that most people for whom that is important have already switched to using multiprocessing, or things like Dask, to handle multiple cores explicitly rather than having only the numpy bits accelerated implicitly?
I'm working on an algorithm, and I've made no attempt to parallelize it other than just by using numpy/scipy. Looking at htop, sometimes the code uses all of my cores and sometimes just one. I'm considering adding parallelism to the single-threaded portions using multiprocessing or something similar.
Assuming that I have all of the parallel BLAS/MKL libraries, is there some rule of thumb that I can follow to guess whether a numpy/scipy ufunc is going to be multithreaded or not? Even better, is there some place where this is documented?
To try to figure this out, I've looked at: https://scipy.github.io/old-wiki/pages/ParallelProgramming, Python: How do you stop numpy from multithreading?, multithreaded blas in python/numpy.
You may try the IDP package (Intel® Distribution for Python), which contains versions of NumPy*, SciPy*, and scikit-learn* with the Intel® Math Kernel Library integrated.
This would give you threading of all LAPACK routines automatically, wherever it makes sense to do so.
Here you can find the list of threaded MKL functions:
https://software.intel.com/en-us/mkl-linux-developer-guide-openmp-threaded-functions-and-problems
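A quick way to check which BLAS/LAPACK backend your NumPy build actually links against (and hence whether MKL threading applies at all) is NumPy's own build-info helper; a minimal sketch:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was linked against;
# look for "mkl" or "openblas" in the output.
np.show_config()

# Thread counts for MKL/OpenBLAS builds are controlled by environment
# variables (MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, OMP_NUM_THREADS),
# which the BLAS library reads once, so set them before importing numpy.
```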
The routines intrinsic to numpy and scipy run on a single thread by default. You can change that if you so choose.
# encoding: utf-8
# module numpy.core.multiarray
# from /path/to/anaconda/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-darwin.so
# by generator 1.145
# no doc
# no imports
# Variables with simple values
ALLOW_THREADS = 1
When compiling numpy, you can control threading by changing NPY_ALLOW_THREADS:
./core/include/numpy/ufuncobject.h:#if NPY_ALLOW_THREADS
./core/include/numpy/ndarraytypes.h: #define NPY_ALLOW_THREADS 1
As for the external libraries, I've mostly found numpy and scipy to wrap around legacy Fortran code (QUADPACK, LAPACK, FITPACK ... so on). All the subroutines in these libraries compute on single threads.
As for the MKL dependencies, the SO posts you link to sufficiently answer the question.
Please try setting the environment variable OMP_NUM_THREADS. It works for my scipy and numpy. The functions I use are
np.linalg.inv() and A.dot(B)
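For example, the variable can be set from inside Python as long as it happens before NumPy is first imported (a sketch; the cap of 2 threads and the matrix size are arbitrary):

```python
import os

# Must be set before the first `import numpy`, because the BLAS
# library reads the variable once at load time.
os.environ["OMP_NUM_THREADS"] = "2"

import numpy as np

A = np.random.rand(500, 500)
B = np.random.rand(500, 500)
C = A.dot(B)             # large matmul, now capped at 2 BLAS threads
Ainv = np.linalg.inv(A)  # the same cap applies to LAPACK routines
```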
I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in python 3.6 so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using that.
The differences from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once, we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent, multiprocessing is fine also. It's a similar problem, it's not obvious how to use that inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out for you.
Maybe joblib's documentation is not explicit enough on this point, showing only examples with for loops, but since you want to use a generator I should point out that it indeed works with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed

def my_long_running_job(x):
    # do something with x
    ...

# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and you would still benefit from true parallelism.
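If you do stay with processes, one stdlib route around the pickling cost for large arrays (on Python 3.8+, so newer than the 3.6 mentioned in the question) is multiprocessing.shared_memory; a sketch:

```python
import numpy as np
from multiprocessing import shared_memory

# Producer side: allocate a shared block and copy the array in once.
arr = np.random.rand(1000, 1000)  # ~8 MB of float64
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared[:] = arr

# Consumer side (normally in another process): attach by name.
# Only the short name string needs to cross the process boundary.
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm2.buf)
match = np.array_equal(view, arr)

# Release the numpy views before closing, then free the block.
del shared, view
shm2.close()
shm.close()
shm.unlink()
```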
I'm using scipy.optimize.brute(), but I noticed that it's only using one of my cores. One big advantage of a grid search is that all iterations of the solution algorithm are independent of each other.
Given that that's the case - why is brute() not implemented to run on multiple cores? If there is no good reason - is there a quick way to extend it / make it work, or does it make more sense to write the whole routine from scratch?
scipy.optimize.brute takes an arbitrary Python function. There is no guarantee this function is threadsafe. Even if it is, Python's global interpreter lock means that unless the function bypasses the GIL in C, it can't be run on more than one core anyway.
If you want to parallelize your brute-force search, you should write it yourself. You may have to write some Cython or C to get around the GIL.
Do you have scikit-learn installed? With a bit of refactoring you could use sklearn.grid_search.GridSearchCV, which supports multiprocessing via joblib.
You would need to wrap your local optimization function as an object that exposes the generic scikit-learn estimator interface, including a .score(...) method (or you could pass in a separate scoring function to the GridSearchCV constructor via the scoring= kwarg).
I am a little bit new to Python and I have a large code base written in Python 3.3.2 (32 bit). It uses numpy 1.7.1 and takes a very long time to run because of computationally intensive calculations.
I need to parallelize the code to increase the performance. I am thinking about using PyPy but am unsure how to use it with existing code.
I have searched Google but couldn't find an appropriate or satisfactory answer. I have also read about using Cython but I am unsure how to use that as well.
Could anyone provide pointers on increasing the performance of my code?
Since you're new to Python, I highly recommend taking the time to survey all of the possibilities before jumping into something like PyPy, which may not be appropriate for your needs. There are lots of ways to speed up NumPy code, and the best way really depends on exactly what you're doing.
A great starting point is Ian Ozsvald's High Performance Python tutorial. Don't just watch it: follow along and try out the examples!
From there, you should think about whether parallelization will help. There are several options for parallelization, like the stdlib's multiprocessing module, but a lot of people in the scientific space are using IPython's parallelization capabilities. For learning about this, check out Min Ragan-Kelley's IPython Parallel tutorial (pt 1, pt 2, pt 3).
Once you have some sense of what Python is capable of, pick a method for speeding up your code and try it out. When you run into more specific problems, StackOverflow will be able to provide more concrete answers than just some links to tutorials ;)