If I have a series of CPU-intensive operations, will multi-threading my program necessarily decrease its runtime? What are the trade-offs of doing so? In this case, I'm trying to compute the nullspace of a very large matrix. I'm using Python and, specifically, the numpy package:
from numpy import shape, transpose, linalg, append, zeros, compress

def nullspace(A, eps=1e-15):
    """Computes the null space of the real matrix A."""
    n, m = shape(A)
    if n > m:
        return nullspace(transpose(A), eps)
    _, s, vh = linalg.svd(A)
    s = append(s, zeros(m))[0:m]          # pad the singular values out to length m
    null_mask = (s <= eps)                # treat singular values below eps as zero
    null_space = compress(null_mask, vh, axis=0)
    return null_space.tolist()
Also, I would be interested to know just how one would go about multi-threading such a function. Thanks in advance.
Python has the Global Interpreter Lock (GIL), which only allows one thread to interact with the interpreter at a time -- effectively, this means that only one thread can execute Python bytecode at any moment. This is a severe disadvantage for CPU-bound code that tries to use multiple threads.
However, numpy is built on top of a heavily-optimised library for numerical linear algebra called LAPACK. If you install the right version of LAPACK for your system, it will parallelise its computations for you. You can then install numpy on top of your LAPACK, and the Python computations will be parallelised.
This also means that many numpy operations release the GIL while they run, so you can fire off a long numpy computation in a Python thread and simultaneously execute other Python code. Thanks @JFSebastian.
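For instance, here is a minimal sketch of that pattern: a large SVD runs in a background thread while the main thread stays free to run other Python code. It assumes your numpy build releases the GIL inside the LAPACK call (recent builds generally do); the matrix size and the dict used to hand back the result are just illustrative choices.

import threading
import numpy as np

def background_svd(a, out):
    # np.linalg.svd spends most of its time inside LAPACK, where the GIL
    # is released, so other Python threads keep running meanwhile
    out['svd'] = np.linalg.svd(a)

a = np.random.rand(2000, 2000)   # illustrative size
result = {}
t = threading.Thread(target=background_svd, args=(a, result))
t.start()
# ... the main thread can do other Python work here ...
t.join()
u, s, vh = result['svd']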
No. For one thing, CPU-bound programs rarely benefit at all from threading in Python because of the Global Interpreter Lock.
Also, on a single-core machine, threading won't reduce runtime at all.
Usually the GIL is an impediment to getting any benefit from multithreading, except when your calculations run outside the Python interpreter (for example, in C implementations). I'm not sure whether this applies to numpy.
If you're not running that many threads, you should have a look at the multiprocessing module: you get a separate system process instead of a Python thread.
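For the nullspace question above, that could look roughly like the sketch below, assuming you have several independent matrices to process (a single SVD cannot be split up this way from Python); the matrix sizes and worker count are made up for illustration, and nullspace is the function from the question.

from multiprocessing import Pool
import numpy as np

if __name__ == '__main__':
    # each matrix is an independent task handled by one worker process
    matrices = [np.random.rand(500, 600) for _ in range(8)]
    pool = Pool(processes=4)
    null_spaces = pool.map(nullspace, matrices)   # nullspace as defined in the question
    pool.close()
    pool.join()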
Related
I have a problem where I need to solve thousands of independent nonnegative least squares problems using nnls in scipy. All problems are small, about 100x100 matrices. To speed it up I've tried to use the multiprocessing module in Python with the Pool class. I get about a factor-2 improvement if I set the number of threads in numpy to 1 and use multiprocessing, versus using multithreaded numpy and no multiprocessing. But the performance is very unpredictable. For instance, if I move sections of code into a separate function (to make it easier to read) or call the pool.map function in a class method, performance can decrease by 50%. So it seems like the multiprocessing module is too unreliable to be used.
Does anyone know what can cause this behaviour or know of a better alternative to multiprocessing?
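For reference, here is a minimal sketch of the setup described above: pin the BLAS used by numpy to one thread per worker, then map the independent NNLS problems over a process pool. The environment variable and problem sizes are illustrative assumptions; depending on the BLAS, MKL_NUM_THREADS or OPENBLAS_NUM_THREADS may be needed instead of OMP_NUM_THREADS.

import os
os.environ['OMP_NUM_THREADS'] = '1'   # must be set before numpy is imported

import numpy as np
from scipy.optimize import nnls
from multiprocessing import Pool

def solve_one(problem):
    A, b = problem
    x, rnorm = nnls(A, b)   # one small, independent NNLS problem
    return x

if __name__ == '__main__':
    # stand-in for the thousands of ~100x100 problems in the question
    problems = [(np.random.rand(100, 100), np.random.rand(100))
                for _ in range(1000)]
    pool = Pool(processes=4)
    solutions = pool.map(solve_one, problems)
    pool.close()
    pool.join()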
I want to parallelize an iteration in which many Cython class instances are evaluated and the results are stored in a global numpy array:
for cythonInstance in myCythonInstances:
    success = cythonInstance.evaluate(someConstantGlobalVariables,) # very CPU intense
    if success == False:
        break
    globalNumpyArray[instanceSpecificLocation] = cythonInstance.resultVector[:]
The results of the instance evaluations are independent of each other. There is no kind of interaction between the instances, except that the results are written to the same global array, but at fixed, pre-determined and independent locations. If one evaluation fails, the iteration must be stopped.
As far as I understand, there are two possibilities:
1) using the multiprocessing package
2) making a cython function and using prange/openmp.
I have no experience with parallelization at all. Which solution is preferable, or are there also better alternatives? Thank You!
Use Cython if you can:
The prange syntax is pretty similar to range. It lets you take the easy development route of write a Python loop -> convert it to Cython -> convert it to a parallel loop. Hopefully the changes needed each time are small. In contrast, multiprocessing requires you to pull the inside of your loop out into a function and then set up pools, so it's less immediately familiar.
OpenMP/Cython threading has pretty low overhead. In contrast, the multiprocessing module has relatively high overhead ("processes" are generally slower than "threads").
Multiprocessing is quite restricted in Windows (everything has to be pickleable). This often turns out to be quite a hassle.
There are a few specific circumstances when you should use multiprocessing:
You find you need to acquire the GIL a lot - multiprocessing doesn't share a GIL, so it isn't slowed down. If you only need the GIL occasionally, though, small with gil: blocks in Cython often don't slow you down too much, so try this first.
You need to do a bunch of quite different operations at once (i.e. something that doesn't lend itself to a prange loop because each thread is genuinely running separate code). If this is the case the Cython prange syntax doesn't really fit.
The caveats from looking at your code are that you should avoid using Cython classes if you can. If you can refactor it into a call to a cdef function that would be better (Cython classes will still need the GIL at times). Something similar to the following would work well:
from cython.parallel import prange   # needed for the parallel loop
import numpy as np

cdef int f(double[:] arr, double loop_specific_parameter, int constant_parameter) nogil:
    # modify arr
    # return your boolean to stop the iteration
    return result

# then elsewhere
cdef int i
cdef int result   # typed as a C int; assignments inside prange make it thread-private
cdef double[:,:] output = np.zeros(shape)
for i in prange(len(parameters_to_try), nogil=True):
    result = f(output[i,:], parameters_to_try[i], constant_parameter)
    if result:
        break
The reason I don't really recommend using Cython classes is that 1) you can't create them or index a list of them without the GIL (for reference counting reasons) and 2) Python objects, including Cython classes, don't seem to be allowed to be thread local. See Cython parallel prange - thread locality? for an example of the issues. (Originally I wasn't aware of the restriction on being thread local.)
The with gil overhead involved isn't necessarily huge, so if this design makes the most sense then try it. Looking at your CPU usage will tell you how well it's parallelizing.
N.B. Most of the pros/cons in this set of answers still apply, even though you're using Cython rather than the Python threading module. The difference is that you can often avoid the GIL in Cython (so some of the disadvantages of using threads are less significant).
I would suggest using joblib with the threading backend. Joblib is a very good tool to parallelize for loops.
Threading is preferred over multiprocessing here because multiprocessing has a lot of overhead, which would be inappropriate when there are a lot of parallel calculations to be done.
The results are stored in a list, which you can then convert back to a numpy array.
from joblib import Parallel, delayed

def sim(x):
    return x**2

if __name__ == "__main__":
    result = Parallel(n_jobs=-1, backend="threading", verbose=5) \
        (delayed(sim)(x) for x in range(10))
    print result

The content of result:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
I would like some help understanding exactly what I have done/ why my code isn't running as I would expect.
I have started to use joblib to try and speed up my code by running a (large) loop in parallel.
I am using it like so:
from joblib import Parallel, delayed

def frame(indeces, image_pad, m):
    XY_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1]:indeces[1]+m, indeces[2]])
    XZ_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1], indeces[2]:indeces[2]+m])
    YZ_Patches = np.float32(image_pad[indeces[0], indeces[1]:indeces[1]+m, indeces[2]:indeces[2]+m])
    return XY_Patches, XZ_Patches, YZ_Patches

def Patch_triplanar_para(image_path, patch_size):
    Image, Label, indeces = Sampling(image_path)
    n = (patch_size - 1)/2
    m = patch_size
    image_pad = np.pad(Image, pad_width=n, mode='constant', constant_values=0)
    A = Parallel(n_jobs=1)(delayed(frame)(i, image_pad, m) for i in indeces)
    A = np.array(A)
    Label = np.float32(Label.reshape(len(Label), 1))
    R, T, Y = np.hsplit(A, 3)
    return R, T, Y, Label
I have been experimenting with "n_jobs", expecting that increasing it would speed up my function. However, as I increase n_jobs, things slow down quite significantly. Running this code without "Parallel" is slower than running it with n_jobs=1, but as soon as I increase the number of jobs beyond 1, things slow down again.
Why is this the case? I understood that the more jobs I run, the faster the script would be. Am I using this wrong?
Thanks!
Maybe your problem is caused by image_pad being a large array. In your code, you are using the default multiprocessing backend of joblib. This backend creates a pool of workers, each of which is a Python process. The input data to the function is then copied n_jobs times and sent to each worker in the pool, which can lead to serious overhead. Quoting from joblib's docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
Note: The following only applies with the default "multiprocessing" backend. If your code can release the GIL, then using backend="threading" is even more efficient.
So if this is your case, you should either switch to the threading backend (if you are able to release the global interpreter lock when calling frame) or switch to the shared-memory approach of joblib.
The docs say that joblib provides an automated memmap conversion that could be useful.
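As a rough sketch of the two options (reusing frame, image_pad, m and indeces from the question; the n_jobs and max_nbytes values are illustrative, not tuned):

from joblib import Parallel, delayed

# 1) threading backend: image_pad is not copied, but this only helps if
#    frame spends its time in GIL-releasing numpy code
A = Parallel(n_jobs=4, backend="threading")(
        delayed(frame)(i, image_pad, m) for i in indeces)

# 2) default process backend, but let joblib memory-map large arrays so
#    image_pad is shared via a file instead of being pickled to every
#    worker (max_nbytes is the size threshold above which arrays are memmapped)
A = Parallel(n_jobs=4, max_nbytes='1M')(
        delayed(frame)(i, image_pad, m) for i in indeces)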
It's quite possible that the problem you are running up against is fundamental to the nature of the Python interpreter.
If you read "https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en", you can see from a professional who specialises in optimising and parallelising Python code that iterating through large loops is an inherently slow operation for a Python thread to perform. Therefore, spawning more processes that loop through arrays is only going to slow things down.
However - there are things that can be done.
The Cython and Numba compilers are both designed to optimise code that is written in a C/C++-like style (i.e. your case) - in particular, Numba's @vectorize decorator allows a scalar function to be compiled into one that applies its operation across large arrays in parallel (target='parallel').
I don't understand your code enough to give an example of an implementation, but try this! These compilers, used in the correct ways, have brought me speed increases of 3,000,000% for parallel processes in the past!
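To illustrate the Numba route (this is not the questioner's actual computation, just a sketch of the decorator): a scalar function is compiled into a ufunc and applied element-wise across large arrays using all cores.

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'], target='parallel')
def combine(x, y):
    # stand-in scalar operation; numba applies it element-wise in parallel
    return x * x + y

a = np.random.rand(10**7)
b = np.random.rand(10**7)
result = combine(a, b)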
I have a strong background in numeric computation using Fortran and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to Python since it is much more fun (at least for me) to develop with, but parallelizing numeric tasks seems much more tedious than with OpenMP. I'm often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel while keeping only a single copy of the data in main memory (shared data). I started to use the Python module multiprocessing for this and came up with this generic example:
#test cases
#python parallel_python_example.py 1000 1000
#python parallel_python_example.py 10000 50

import sys
import numpy as np
import time
import multiprocessing
import operator

n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])

#class which contains large dataset and computationally heavy routine
class compute:
    def __init__(self, n_dim, n_vec):
        self.large_matrix = np.random.rand(n_dim, n_dim) #define large random matrix
        self.many_vectors = np.random.rand(n_vec, n_dim) #define many random vectors which are organized in a matrix
    def dot(self, a, b): #dont use numpy to run on single core only!!
        return sum(p*q for p, q in zip(a, b))
    def __call__(self, ii): # use __call__ as computation such that it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix, self.many_vectors[ii, :]) #compute product of one of the vectors and the matrix
        return self.dot(vector, vector) # return "length" of the result vector

#initialize data
comp = compute(n_dim, n_vec)

#single core
tt = time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time() - tt
print "Time:", time_single

#multi core
for prc in [1, 2, 4, 10]: #the larger case is there to check that the large_matrix is only once in the main memory
    tt = time.time()
    pool = multiprocessing.Pool(processes=prc)
    result = pool.map(comp, range(n_vec))
    pool.terminate()
    time_multi = time.time() - tt
    print "Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc, time_multi, time_single/time_multi)
I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:
andre@lot:python>python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using 1 processes. Time: 15.75869, Speedup: 0.65785
Time using 2 processes. Time: 11.62338, Speedup: 0.89189
Time using 4 processes. Time: 15.13109, Speedup: 0.68513
Time using 10 processes. Time: 31.31193, Speedup: 0.33108
andre@lot:python>python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using 1 processes. Time: 5.14456, Speedup: 0.95954
Time using 2 processes. Time: 2.81755, Speedup: 1.75201
Time using 4 processes. Time: 1.64475, Speedup: 3.00131
Time using 10 processes. Time: 1.60147, Speedup: 3.08242
My question is, am I misusing the multiprocessing module here? Or is this just the way it goes with Python (i.e. don't parallelize within Python but rely entirely on numpy's optimizations)?
While there is no general answer to your question (in the title), I think it is valid to say that multiprocessing alone is not the key for great number-crunching performance in Python.
In principle, however, Python (plus third-party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure, you will get better performance while writing (much!) less code than you achieved before by doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:
You can compile numpy and scipy yourself using Intel MKL and OpenMP (or maybe a sys admin in your facility already did so). This way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably underestimated so far. Get your hands on a properly compiled numpy and scipy!
multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be explicitly programmed and happens mainly through pipes. Processes talking a lot to each other spend most of their time talking and not number crunching. Hence, multiprocessing is best used in cases where the transmission time for input and output data is small compared to the computing time. There are also tricks: you can, for instance, make use of Linux's fork() behavior and share large amounts of memory (read-only!) among multiple multiprocessing processes without having to pass this data around through pipes (a minimal sketch of this trick follows after this list). You might want to have a look at https://stackoverflow.com/a/17786444/145400.
Cython has already been mentioned; you can use it in special situations to replace performance-critical code parts in your Python program with compiled code.
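As mentioned in the multiprocessing item above, here is a minimal sketch of the fork() trick (it relies on the fork start method, i.e. Linux by default; the array size and the column-norm task are illustrative): a module-level array created before the pool is started is inherited by every worker without being pickled, as long as the workers only read it.

import numpy as np
from multiprocessing import Pool

shared_data = np.random.rand(5000, 5000)   # created once, in the parent process

def column_norm(j):
    # workers read the inherited (copy-on-write) array; only the column
    # index and the result travel through a pipe
    return np.linalg.norm(shared_data[:, j])

if __name__ == '__main__':
    pool = Pool(processes=4)
    norms = pool.map(column_norm, range(shared_data.shape[1]))
    pool.close()
    pool.join()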
I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP 8 when writing Python code :-)) and (b) I think that, especially for number crunching, the right solution depends on the problem. You have already observed in your benchmark what I outlined above: in the context of multiprocessing, it is especially important to keep an eye on the communication overhead.
Generally speaking, you should always try to find a way from within Python to control compiled code to do the heavy work for you. Numpy and SciPy provide great interfaces for that.
Number crunching with Python... You probably should learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization using OpenMP as the backend.
From the test results you supplied, it appears that you ran your tests on a two core machine. I have one of those and ran your test code getting similar results. What these results show is that there is little benefit to running more processes than you have cores for numerical applications that lend themselves to parallel computation.
On my two core machine, approximately 20% of the CPU is absorbed simply in keeping my environment going, so when I see a 1.8 improvement running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work the more cores the better as this raises the percentage of the computer that is available to do your work.
The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.
I have a loop of intensive calculations and I want them to be accelerated using the multicore processor, as they are independent: they can all be performed in parallel. What is the easiest way to do that in Python?
Let's imagine that those calculations have to be summed at the end. How can I easily add them to a list or a float variable?
Thanks for all your pedagogic answers that use Python libraries ;o)
From my experience, multi-threading is probably not going to be a viable option for speeding things up (due to the Global Interpreter Lock).
A good alternative is the multiprocessing module. This may or may not work well, depending on how much data you end up having to pass around from one process to another.
Another good alternative would be to consider using numpy for your computations (if you aren't already). If you can vectorize your code, you should be able to achieve significant speedups even on a single core. Depending on what exactly you're doing and on your build of numpy, it might even be able to transparently distribute the computations across multiple cores.
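As a tiny illustration of the vectorization idea for the summed-at-the-end case in the question (the squaring here is just a stand-in for the real calculation):

import numpy as np

values = np.arange(1000000, dtype=np.float64)

# pure-Python loop: every iteration goes through the interpreter
total_loop = sum(v * v for v in values)

# vectorized: one array expression plus one reduction, all in compiled numpy code
total_vec = np.sum(values * values)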
Edit: Here is a complete example of using the multiprocessing module to perform a simple computation. It uses four worker processes to compute the squares of the numbers from zero to nine.
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    inputs = range(10)
    result = pool.map(f, inputs)
    print result
This is meant as a simple illustration. Given the trivial nature of f(), this parallel version will almost certainly be slower than computing the same thing serially.
Multicore processing is a bit difficult to do in CPython (thanks to the GIL). However, there is the multiprocessing module, which allows you to use subprocesses (not threads) to split your work across multiple cores.
The module is relatively straightforward to use as long as your code can really be split into multiple parts and doesn't depend on shared objects. The linked documentation should be a good starting point.