Python 2.7 concurrent.futures.ThreadPoolExecutor does not parallelize

I am running the following code on an Intel i3-based machine with 4 virtual cores (2 hyperthreads/physical core, 64-bit) and Ubuntu 14.04 installed:
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

n = multiprocessing.cpu_count()
executor = ThreadPoolExecutor(n)
tuple_mapper = lambda i: (i, func(i))  # func is a CPU-bound function defined elsewhere
results = dict(executor.map(tuple_mapper, range(10)))
The code does not seem to be executed in parallel, since the CPU is constantly utilized at only 25%. On the utilization graph, only one of the 4 virtual cores is used at 100% at a time, and the core in use alternates every 10 seconds or so.
But parallelization works well on a server machine with the same software setup. I don't know the exact number of cores or the exact processor type, but I know for sure that it has several cores, that utilization reaches 100%, and that the calculations speed up considerably (about 10 times faster with parallelization, based on some experiments I made).
I would expect parallelization to work on my machine too, not only on the server.
Why does it not work? Does it have something to do with my operating system settings? Do I have to change them?
Thanks in advance!
Update:
For background information, see the accepted answer below. For the sake of completeness, here is sample code that solved the problem:
import multiprocessing
import concurrent.futures

def tuple_mapper(i):  # must be a top-level def (not a lambda) to be picklable
    return (i, func(i))

n = multiprocessing.cpu_count()
with concurrent.futures.ProcessPoolExecutor(n) as executor:
    results = dict(executor.map(tuple_mapper, range(10)))
Before you reuse this, take care that all functions you use are defined at the top level of a module, as described here:
Python multiprocessing pickling error

It sounds like you're seeing the effects of Python's Global Interpreter Lock (a.k.a. the GIL).
In CPython, the global interpreter lock, or GIL, is a mutex that
prevents multiple native threads from executing Python bytecodes at
once.
As all your threads are running pure Python code, only one of them can actually run at a time. That should cause only one CPU to be active, which matches your description of the problem.
You can get around it by using multiple processes with ProcessPoolExecutor from the same module. Other solutions include switching to Jython or IronPython, which don't have a GIL.
The ProcessPoolExecutor class is an Executor subclass that uses a pool
of processes to execute calls asynchronously. ProcessPoolExecutor uses
the multiprocessing module, which allows it to side-step the Global
Interpreter Lock but also means that only picklable objects can be
executed and returned.
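To make the switch concrete, here is a minimal sketch (my illustration, not part of the original answer) that uses ProcessPoolExecutor for a CPU-bound function; cpu_heavy is a hypothetical stand-in for the asker's func:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(i):  # hypothetical stand-in for the asker's func
    return sum(j * j for j in range(10 ** 6)) + i

def tuple_mapper(i):  # top-level function, so it can be pickled for the workers
    return (i, cpu_heavy(i))

if __name__ == "__main__":  # guard required where worker processes are spawned
    n = multiprocessing.cpu_count()
    with ProcessPoolExecutor(n) as executor:
        results = dict(executor.map(tuple_mapper, range(10)))
    print(results)

With this version, the utilization graph should show all cores busy, since each worker process has its own interpreter and its own GIL.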

Related

How can a multithreaded Python program run on different CPU cores simultaneously despite the GIL?

In this video, he shows how multithreading runs on physical (Intel or AMD) processor cores.
https://youtu.be/ecKWiaHCEKs
and
is python capable of running on multiple cores?
All these links basically say:
Python threads cannot take advantage of multiple physical cores. This is due to an internal implementation detail called the GIL (global interpreter lock); if we want to utilize multiple physical cores of the CPU,
we must use the truly parallel multiprocessing module.
But when I ran the code below on my laptop,
import threading
import math

def worker(argument):
    for i in range(200000):
        print(math.sqrt(i))
    return

for i in range(3):
    t = threading.Thread(target=worker, args=[i])
    t.start()
I got this result: the work was spread across all of my physical CPU cores.
Questions:
1. Why did the code run on all of my physical CPU cores instead of using just one out of the four? If threads can do that, what is the point of the multiprocessing module?
2. The second time, I changed the above code to create only one thread, and that also used all 4/4 physical cores. Why is that?
https://docs.python.org/3/library/math.html
The math module consists mostly of thin wrappers around the platform C math library functions.
While Python itself can execute only one thread's bytecode at a time, a low-level C function that is called by Python does not have this limitation.
So it's not Python that is using multiple cores but your system's well-optimized math library, which is wrapped by Python's math module.
That basically answers both your questions.
Regarding the usefulness of multiprocessing: it is still useful in those cases where you're trying to parallelize pure Python code, or code that does not call libraries which already use multiple cores.
However, it comes with inter-process communication (IPC) overhead that may or may not be larger than the performance gain from using multiple cores. Tuning IPC is therefore often crucial for multiprocessing in Python.
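As an illustration of such tuning (my example, not the answerer's): multiprocessing.Pool.map accepts a chunksize argument that batches many small work items into a single IPC message, which can matter more than the number of workers when the per-item work is tiny:

import multiprocessing

def square(i):  # a deliberately tiny task, so IPC overhead dominates
    return i * i

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # chunksize=1 would pickle and ship every item separately;
        # a large chunk amortizes the IPC cost over many items.
        results = pool.map(square, range(1000000), chunksize=10000)
    print(results[:5])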
First, to address one misconception/error in your code:
import threading
import math

def worker(argument):
    for i in range(200000):
        print(math.sqrt(i))
    return

for i in range(3):
    t = threading.Thread(target=worker, args=[i])
    t.start()
You are NOT using multiprocessing, you are using threading. They are similar, BUT NOT THE SAME. Threading IS affected by the GIL; multiprocessing IS NOT.
So when you run 4 threads, a CPU-intensive worker/job like sqrt() will NOT benefit from the threads, because CPU-bound tasks ARE hindered by the GIL (when run in a Thread() rather than a Process()).
To have all four cores work well without the GIL, change your code to:
import multiprocessing
import math

def worker(argument):
    for i in range(200000):
        print(math.sqrt(i))
    return

if __name__ == "__main__":  # guard needed on platforms that spawn processes
    for i in range(3):
        t = multiprocessing.Process(target=worker, args=[i])
        t.start()

Multiprocessing on Pypy is threaded but only on 1 core

I was testing my code on PyPy, and it seems to be faster than on CPython, but if I call it with multiprocessing like I was doing on CPython, I see threads being created, but all on a single core, which defeats the purpose of using PyPy.
I am running a high number of independent simulations (no communication, no waiting between them). I have a bunch of strategies and create tuples of two strategies:
strat_combinations = tuple(map(tuple, itertools.combinations(
    tuple(sim_params["pool_of_strats"].values()), 2)))
that are then passed to a pool:
pool = mp.Pool(processes=sim_params["processes_to_use"])  # mp is multiprocessing
pool.map(run_multi_simulation, strat_combinations)
This executes the main script, controlling the number of processes with processes_to_use.
On CPython, I get the processes distributed across the cores as desired; on PyPy, when 8 processes are requested, 8 threads are created, but they all run on the same core.
I am calling the script using:
pypy Multi_Simulation_LinuxV5_Totalsims_pypy.py -withmod-_multiprocessing
I have also tried:
pypy Multi_Simulation_LinuxV5_Totalsims_pypy.py -withmod-thread withmod-_multiprocessing
This is run on Ubuntu 18.04, using PyPy 3.6.
Is it a problem with multiprocessing, or should I be doing it in another way?
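One diagnostic worth trying (my suggestion, not from the thread): on Linux, a parent process with a restricted CPU affinity mask passes that mask on to its children, which would pin all workers to a single core regardless of interpreter. You can inspect and, if needed, widen the mask with os.sched_getaffinity and os.sched_setaffinity:

import os
import multiprocessing as mp

def report(i):
    # Each worker prints the set of cores it is allowed to run on
    print(i, os.sched_getaffinity(0))

if __name__ == "__main__":
    print("parent affinity:", os.sched_getaffinity(0))
    # If the mask contains only one core, widen it (Linux only):
    # os.sched_setaffinity(0, range(os.cpu_count()))
    with mp.Pool(4) as pool:
        pool.map(report, range(4))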

Can multiple chained generators be parallelized on multiple CPU cores

With standard CPython, it's not possible to truly parallelize program execution on multiple CPU cores using threading. This is due to the Global Interpreter Lock (GIL).
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Source: CPython documentation
Another solution is to run multiple interpreters in parallel with Python's multiprocessing. This spawns multiple processes, each with its own interpreter instance and thus its own independent GIL.
In my use case I have multiple chained generators. Each generator produces a linked list of objects. This list is the input to the next generator, which again generates a linked list of objects.
While this algorithm is quite fast, I'm asking myself whether it could be parallelized with Python's multiprocessing, so that each generator runs on one CPU core. I think some kind of buffer/FIFO would be needed between two generators (producer/consumer) to decouple their execution speeds.
My questions:
Is such an implementation possible?
What would a minimal example look like?
tokenStream = Token.GetGenerator(fileContent) # producer
blockStream = Block.TranslateTokenToBlocks(tokenStream) # consumer / producer
groupStream = Group.TranslateBlockToGroup(blockStream) # consumer / producer
CodeDOM = CodeDOM.FromGroupStream(groupStream) # consumer
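Such an implementation is possible in principle. Below is a minimal sketch (my construction; the stage functions are hypothetical stand-ins for the Token/Block/Group translators) that chains worker processes through bounded multiprocessing.Queue buffers, using None as an end-of-stream sentinel. Keep in mind that every item crossing a queue is pickled, so this only pays off when each stage does substantial work per item:

import multiprocessing as mp

SENTINEL = None  # end-of-stream marker

def stage(worker, in_q, out_q):
    # Generic pipeline stage: apply `worker` to each item until the sentinel.
    for item in iter(in_q.get, SENTINEL):
        out_q.put(worker(item))
    out_q.put(SENTINEL)

def tokenize(line):  # hypothetical stand-in for Token.GetGenerator
    return line.split()

def group(tokens):  # hypothetical stand-in for the Block/Group stages
    return (len(tokens), tokens)

if __name__ == "__main__":
    q1, q2, q3 = (mp.Queue(maxsize=100) for _ in range(3))  # bounded FIFOs
    p1 = mp.Process(target=stage, args=(tokenize, q1, q2))
    p2 = mp.Process(target=stage, args=(group, q2, q3))
    p1.start(); p2.start()

    for line in ["a b c", "d e", "f"]:  # producer feeds the first queue
        q1.put(line)
    q1.put(SENTINEL)

    for result in iter(q3.get, SENTINEL):  # final consumer
        print(result)
    p1.join(); p2.join()

The maxsize on each queue is the buffer the question asks about: a fast producer blocks once its output queue is full, so the stages stay loosely coupled without unbounded memory growth.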

If `threading` module doesn't really let multiple threads run at once due to GIL, why is ParallelPython using all my cores?

AFAIK, threading.Thread instances can't actually run in parallel, due to the Global Interpreter Lock, which allows only one thread to run at any time (except when blocking on I/O operations).
ParallelPython uses the threading module.
If I submit multiple local jobs to it, however, it DOES execute them in parallel, or at least so it would seem.
I have 8 cores, and if I start 8 jobs that simply run empty loops, they each take up 12-13% of the CPU (meaning each gets executed on its own core, and I can see this in my task manager).
Does anyone know how this can happen?
As the linked page says,
PP module overcomes this limitation and provides a simple way to write parallel python applications. Internally ppsmp uses processes and IPC (Inter Process Communications) to organize parallel computations
So the actual parallelism must be due to invoking multiple processes, as one would expect.

Python threading unexpectedly slower

I decided to learn how multi-threading is done in Python, and did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum:
from random import random
import threading

def ox():
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions or in development branches - which at the very least should make multi-threaded CPU bound code as fast as single threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
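To make the multiprocessing suggestion concrete, here is a minimal sketch (mine, not the answerer's) of the same test using two worker processes; on a dual-core machine the two invocations should overlap instead of contending for one GIL (written for Python 3, where xrange is gone and print is a function):

from random import random
from multiprocessing import Process

def ox():
    # Same CPU-bound test as above
    print(max([random() for x in range(20000000)]))

if __name__ == "__main__":
    procs = [Process(target=ox) for _ in range(2)]
    for p in procs:
        p.start()  # each process has its own interpreter and its own GIL
    for p in procs:
        p.join()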
The problem is in the function random(): if you remove random from your code, the slowdown disappears. Both cores try to access the shared state of the random function, so they effectively work sequentially and spend a lot of time on cache synchronization. Such behavior is known as false sharing.
Read this article: False Sharing
As Yann correctly pointed out, the Python GIL prevents parallelization in this example. You can use the Python multiprocessing module to fix that, or, if you are willing to use another open-source library, Ray is also a great option for getting around the GIL problem; it is easier to use and has more features than the multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray

ray.init()

@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))

# In IPython:
%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single-threaded ox() code you posted takes 1.84s, and the two invocations with Ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.
