Python threading unexpectedly slower

Python threading unexpectedly slower - python

I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I cant figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum
from random import random
import threading
def ox():
print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
r = threading.Thread(target=ox)
r.start()
ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of eachother. Why should this be slower?
I suspect ox() is being parallelized automatically, because I if look at the Windows task manager performance tab, and call ox() in my python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?

Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions or in development branches - which at the very least should make multi-threaded CPU bound code as fast as single threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)

The problem is in function random()
If you remove random from you code.
Both cores try to access to shared state of the random function.
Cores work consequentially and spent a lot of time on caches synchronization.
Such behavior is known as false sharing.
Read this article False Sharing

As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the python multiprocessing module to fix that or if you are willing to use other open source libraries, Ray is also a great option to get around the GIL problem and is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray
ray.init()
#ray.remote
def ox():
print(max([random() for x in range(20000000)]))
%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.

Related

Why python does (not) use more CPUs? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Python Multiprocessing Troubleshooting

Recently I wanted to speed up some of my code using parallel processing, as I have a Quad Core i7 and it seemed like a waste. I learned about python's (I'm using v 3.3.2 if it maters) GIL and how it can be overcome using the multiprocessing module, so I wrote this simple test program:
from multiprocessing import Process, Queue
def sum(a,b):
su=0
for i in range(a,b):
su+=i
q.put(su)
q= Queue()
p1=Process(target=sum, args=(1,25*10**7))
p2=Process(target=sum, args=(25*10**7,5*10**8))
p3=Process(target=sum, args=(5*10**8,75*10**7))
p4=Process(target=sum, args=(75*10**7,10**9))
p1.run()
p2.run()
p3.run()
p4.run()
r1=q.get()
r2=q.get()
r3=q.get()
r4=q.get()
print(r1+r2+r3+r4)
The code runs in about 48 seconds measured using cProfile, however the single process code
def sum(a,b):
su=0
for i in range(a,b):
su+=i
print(su)
sum(1,10**9)
runs in about 50 seconds. I understand that the method has overheads but i expected the improvements to be more drastic. The error with fork() doesn't apply to my as I'm running the code on a Mac.

The problem is that you're calling run rather than start.
If you read the docs, run is the "Method representing the process's activity", while start is the function that starts the process's activity on the background process. (This is the same as with threading.Thread.)
So, what you're doing is running the sum function on the main process, and never doing anything on the background processes.
From timing tests on my laptop, this cuts the time to about 37% of the original. Not quite the 25% you'd hope for, and I'm not sure why, but… good enough to prove that it's really multi-processing. (That, and the fact that I get four extra Python processes each using 60-100% CPU…)

If you really want to write fast computations using python it is not the way to go. Use numpy, or cython. Your computations will be hundred times faster than plain python.
On the other hand if you just want to launch bunch of parralel jobs use proper tools for it, for example
from multiprocessing import Pool
def mysum(a,b):
su=0
for i in range(a,b):
su+=i
return su
with Pool() as pool:
print(sum(pool.starmap(mysum, ((1,25*10**7),
(25*10**7,5*10**8),
(5*10**7,75*10**7),
(75*10**7,10**9)))))

Does Python support multithreading? Can it speed up execution time?

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

How to program to have all processors on your machine used?

I am running a single-threaded python program that performs massive data processing on my windows box. My machine has 8 processors. When I monitor the CPU usage in performance tab under Windows Task Manager, it shows that I am using only a very small fraction of the processing power available to me. Only one processor is being used to the fullest and all the rest are almost idle. What should I do to ensure that all my processors are used? Is multithreading a solution?

multithreading cannot make use of extra processors or cores.
You should spawn new processes instead of new threads.
This tool is by far the simplest among all that I have come across:
parallel python
Overview:
PP is a python module which provides mechanism for parallel
execution of python code on SMP
(systems with multiple processors or
cores) and clusters (computers
connected via network).
It is light, easy to install and integrate with other python software.
PP is an open source and cross-platform module written in pure
python

Multithreading is required for a single process, but it is not necessarily a solution; processor affinity can restrict it to a subset of available cores even if you have more than enough threads to use all.

As an addition to what Jon said, if you're using the standard Python interpreter you should understand the limitations with respect to multi-threading. If your threads are pure-python and aren't making system calls, they can't run concurrently on multiple processors due to the Global Interpreter Lock so the benefits to multi-threading are minimal. In this case, perhaps the recommendation would be to go with multiple processes instead or to switch to another Python implementation such as JPython or IronPython, which do not have a Global Interpreter Lock.

you can get that if your program is of the type that would benefit using python's multiprocessing module
multiprocessing uses multiple python process which avoids problems with the GIL so it's possible to use all of those cores with python code it has a easy threaded map and the basis for more complex schemes
it is similar to parallel python but is limited to the local machine and is included with python 2.6 and higher and is metaphorically similar to python's threading

Assuming your task is parallelizable, then yes, threading is certainly a solution. In particular, if you have a lot of data items to process but they can all be handled independently then it should be relatively straightforward to parallelize.
Using multiple processes instead of multiple threads might be another solution - you haven't told us enough about the problem to say, really.

Do this.
Break your task in to steps or stages. Each step reads something, does part of the overall calculation and writes something.
"""Some Step."""
import json
for some_line in sys.stdin:
object= json.loads( some_line )
# process the object
json.dump( result, sys.stdout )
Something like that ought to do fine.
If you have multiple objects that must be communicated, make a simple dictionary of the objects.
results = { 'a': a, 'b': b }
Connect them in a pipeline, like this.
python step1.py | python step2.py | python step3.py >output_file.dat
If you can break things into 8 or more steps, you will use 8 or more cores. And, BTW, this will be blazingly fast for very little real work.

multiprocess or threading in python?

I have a python application that grabs a collection of data and for each piece of data in that collection it performs a task. The task takes some time to complete as there is a delay involved. Because of this delay, I don't want each piece of data to perform the task subsequently, I want them to all happen in parallel. Should I be using multiprocess? or threading for this operation?
I attempted to use threading but had some trouble, often some of the tasks would never actually fire.

If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread safe storage (like the Queue) to hand data to your threads. Or else hand them a single piece of data that is unique to them when they are spawned.
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea. Stackless has portability issues as indicated above. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It is taking an approach different to PyPy, which may yield better (or just different) speedups.

Tasks runs like sequentially but you have the illusion that are run in parallel. Tasks are good when you use for file or connection I/O and because are lightweights.
Multiprocess with Pool may be the right solution for you because processes runs in parallel so are very good with intensive computing because each process run in one CPU (or core).
Setup multiprocess may be very easy:
from multiprocessing import Pool
def worker(input_item):
output = do_some_work()
return output
pool = Pool() # it make one process for each CPU (or core) of your PC. Use "Pool(4)" to force to use 4 processes, for example.
list_of_results = pool.map(worker, input_list) # Launch all automatically

For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get it's piece of data from stdin or from command-line arguments, do it's processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.

You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (saying yield to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.

Using CPython's threading model will not give you any performance improvement, because the threads are not actually executed in parallel, due to the way garbage collection is handled. Multiprocess would allow parallel execution. Obviously in this case you have to have multiple cores available to farm out your parallel jobs to.
There is much more information available in this related question.

If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally, and feed them to several processes of your program. (i.e. several processes instead of threads)

IronPython has real multithreading, unlike CPython and it's GIL. So depending on what you're doing it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends stackless python, I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (still runs in one physical thread, so cannot scale to multiple cores.) It's merely an alternative way to structure asynchronous (but still single-threaded, non-parallel) application.

You may want to look at Twisted. It is designed for asynchronous network tasks.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.