Python multiprocessing slower than single thread

I have been playing around with a multiprocessing problem and noticed my algorithm is slower when I parallelize it than when it runs in a single thread.
In my code I don't share memory.
And I'm pretty sure my algorithm (see code), which is just nested loops, is CPU-bound.
However, no matter what I do, the parallel code runs 10-20% slower on all my computers.
I also ran this on a 20-CPU virtual machine and the single-threaded version beat the multithreaded one every time (it was actually even slower there than on my computer).
from multiprocessing.dummy import Pool as ThreadPool
from multi import chunks
from random import random
import logging
import time

## Produce two sets of stuff we can iterate over
S = []
for x in range(100000):
    S.append({'value': x*random()})

H = []
for x in range(255):
    H.append({'value': x*random()})

# the function for each thread:
# just nested iteration
def doStuff(HH):
    R = []
    for k in HH['S']:
        for h in HH['H']:
            R.append(k['value'] * h['value'])
    return R

# we will split the work between the worker threads
# and give each of them 5 items to iterate over the big list
HChunks = chunks(H, 5)
XChunks = []

# turn them into dictionaries, so I can pass in both
# the S and H lists
# Note: I do this because I'm not sure whether using the global
# S would spend too much time on cache synchronization or not;
# the idea is that I don't want each thread to share anything.
for x in HChunks:
    XChunks.append({'H': x, 'S': S})

print("Process")
t0 = time.time()
pool = ThreadPool(4)
R = pool.map(doStuff, XChunks)
pool.close()
pool.join()
t1 = time.time()

# the measured time for 4 threads is slower
# than when this code just calls
# doStuff(..) in a non-parallel way
# Why!?
total = t1 - t0
print("Took", total, "secs")
There are many related questions open, but many are geared toward code being structured incorrectly - each worker being IO-bound and such.

You are using multithreading, not multiprocessing. While many languages allow threads to run in parallel, Python does not. A thread is just a separate state of control, i.e. it holds its own stack, current function, etc. The Python interpreter just switches between executing each stack every now and then.
Basically, all threads are running on a single core. They will only speed up your program when you are not CPU bound.
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
Multithreading is usually slower than single threading if you are CPU bound. This is because the work and processing resources stay the same, but you add overhead for managing the threads, e.g. switching between them.
How to fix this: instead of using from multiprocessing.dummy import Pool as ThreadPool, use from multiprocessing import Pool as ThreadPool.
You might want to read up on the GIL, the Global Interpreter Lock. It's what prevents threads from running in parallel (that and implications on single threaded performance). Python interpreters other than CPython may not have the GIL and be able to run multithreaded on several cores.
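For concreteness, here is a minimal sketch of that one-line fix applied to the code above (the if __name__ == '__main__': guard is my addition; multiprocessing may start fresh interpreter processes that re-import the module, so the pool setup should not run on import):
from multiprocessing import Pool as ThreadPool  # real processes, despite the alias

if __name__ == '__main__':
    pool = ThreadPool(4)
    R = pool.map(doStuff, XChunks)  # doStuff and XChunks as defined above
    pool.close()
    pool.join()
With real processes, each worker gets its own interpreter and its own GIL, so the nested loops can actually run on separate cores.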

Related

Why isn't my threaded python script much faster?

I wanted to speed up a python script I have that iterates over 300 records. So I figured I'd try to use threading. My non-thread version takes just under 1 minute to execute. My threaded version does only 1 second better. Here are the pertinent parts of my thread version of the script:
... other imports ...
import threading
import concurrent.futures

# global vars
threads = []
check_records = []
default_max_problems = 5
problems_found = 0
lock = threading.Lock()

... some functions ...

def check_host(rec):
    with lock:
        global problems_found
        global max_problems
        if problems_found >= max_problems:
            # I'd prefer to stop all threads and stop new ones from starting,
            # but I don't know how to do that.
            return
        ... bunch of function calls that do network stuff ...
        check_records.append(rec)
        if not (reachable and dns_ready):
            problems_found += 1
        logging.debug(f"check_host problems_found is {problems_found}.")

if __name__ == '__main__':
    ... handle command line args ...
    try:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for ip in get_ips():
                req_rec = find_dns_req_record(ip, dns_record_reqs)
                executor.submit(check_host, req_rec)
Why is the performance of my threaded script almost the same as my non-threaded version?
The kind of work you are performing is important to answer the question. If you are performing many IO-bound tasks (network calls, disk reads, etc.), then using Python's multi-threading should provide a good speed increase, since you can now have multiple threads waiting for multiple IO calls.
However, if you are performing raw computation, then multi-threading won't help you, because of Python's GIL (global interpreter lock), which basically only allows one thread to run at a time. To speed up non IO-bound computation, you will need to use the multiprocessing module and spin up multiple Python processes. One of the disadvantages of multiple processes vs multiple threads is that it is harder to share data/memory between processes (because they have separate address spaces) vs threads (threads share memory because they are part of the same process).
Another thing that is important to consider is how you are using locks. If you put too much code under a lock, then threads won't be able to concurrently execute that code. You should try to have the smallest amount of code possible under any given lock, and only in places where shared data is accessed. If your entire thread function body is under a lock then you eliminate the potential for speed improvement via multi-threading.
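To illustrate that last point, here is a sketch of check_host with the lock narrowed to just the shared state (do_network_checks is a hypothetical stand-in for the elided network calls):
def check_host(rec):
    global problems_found
    with lock:
        # check the shared counter quickly, then release the lock
        if problems_found >= max_problems:
            return
    # the network calls run outside the lock, so threads can overlap on IO
    reachable, dns_ready = do_network_checks(rec)  # hypothetical helper
    with lock:
        # reacquire only long enough to mutate shared state
        check_records.append(rec)
        if not (reachable and dns_ready):
            problems_found += 1
    logging.debug(f"check_host problems_found is {problems_found}.")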

Pandas DataFrame Multithreading No Performance Gain

I have a dictionary (in memory) data that has ~10,000 keys, where each key represents a stock ticker and the value stores the pandas dataframe representation of time series data for the daily stock price. I am trying to calculate the pairwise Pearson correlation.
The code takes a long time, ~3 hours, to fully iterate through all the combinations, O(n^2) ~ C(10000, 2). I tried to use the multiprocessing.dummy package but saw no performance gain AT ALL (actually slower as the number of workers increases).
from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    # pseudo code here; tuple parameters in the signature are
    # Python 2 only, so unpack the pair explicitly
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []
for idx, t1 in enumerate(list(data.keys())):
    for t2 in list(data.keys())[idx:]:  # only the top triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()
All the data has been loaded into memory, so it should not be IO intensive. Is there any reason why there is no performance gain at all?
When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost by using multi-threading. You should use multi-processing instead to parallelize your code in Python. So, if you change your code from
from multiprocessing.dummy import Pool
to
from multiprocessing import Pool
This should substantially improve your performance.
The above will fix your problem, but if you want to know why this happened, please continue reading:
Multi-threading in Python has the Global Interpreter Lock (GIL), which prevents two threads in the same process from running at the same time. If you had a lot of disk IO happening, multi-threading would have helped, because threads release the GIL while they wait on IO. Or, if you had a separate application used by your Python code that can handle locks, multi-threading would have helped. Multi-processing, on the other hand, will use all the cores of your CPU as separate processes, as opposed to multi-threading. In a CPU-bound Python application such as yours, if you use multi-processing instead of multi-threading, your application will run on multiple processes on several cores in parallel, which will boost its performance.
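As a sketch of what the process-based version might look like (the initializer trick is my addition, not part of the original answer; it hands the large data dict to each worker once at startup instead of pickling it with every task):
from multiprocessing import Pool

def init_worker(shared_data):
    # runs once in each worker process, giving it its own copy of the data
    global data
    data = shared_data

if __name__ == '__main__':
    with Pool(4, initializer=init_worker, initargs=(data,)) as pool:
        results = pool.map(calculate_correlation, todos)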

multiprocessing does not work

I am working on Ubuntu 12 with 8 CPUs as reported by the System Monitor.
The testing code is:
import multiprocessing as mp

def square(x):
    return x**2

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    pool.map(square, range(100000000))
    pool.close()

    # for i in range(100000000):
    #     square(i)
The problem is:
1) All workload seems to be scheduled to just one core, which gets close to 100% utilization, despite the fact that several processes are started. Occasionally all workload migrates to another core but the workload is never distributed among them.
2) Without multiprocessing, it is faster:
for i in range(100000000):
    square(i)
I have read similar questions on Stack Overflow, like:
Python multiprocessing utilizes only one core
but still have no working result.
The function you are using is way too short (i.e. doesn't take enough time to compute), so you spend all your time in the synchronization between processes, which has to be done serially (so why not on a single processor?). Try this:
import multiprocessing as mp

def square(x):
    for i in range(10000):
        j = i**2
    return x**2

if __name__ == '__main__':
    # pool = mp.Pool(processes=4)
    # pool.map(square, range(1000))
    # pool.close()
    for i in range(1000):
        square(i)
You will see that suddenly the multiprocessing works well: it takes ~2.5 seconds to accomplish, while it will take 10s without it.
Note: If using Python 2, you might want to replace all the range calls with xrange.
Edit: I replaced time.sleep by a CPU-intensive but useless calculation
Addendum: In general, for multi-CPU applications, you should try to make each CPU do as much work as possible without returning to the same process. In a case like yours, this means splitting the range into almost-equally sized lists, one per CPU, and sending them to the various CPUs, as in the sketch below.
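A sketch of that idea, assuming we batch the range into one contiguous chunk per CPU so that each worker does substantial work per task:
import multiprocessing as mp

def square_chunk(chunk):
    # square a whole chunk per task, amortizing the per-task overhead
    return [x**2 for x in chunk]

def split(seq, n):
    # split seq into n almost-equal contiguous pieces
    k, m = divmod(len(seq), n)
    return [seq[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n)]

if __name__ == '__main__':
    n_cpus = mp.cpu_count()
    chunks = split(range(1000000), n_cpus)
    with mp.Pool(n_cpus) as pool:
        results = pool.map(square_chunk, chunks)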
When you do:
pool.map(square, range(100000000))
Before invoking the map function, it has to create a list with 100000000 elements, and this is done by a single process. That's why you see a single core working.
Use a generator instead, so each core can pop a number out of it and you should see the speedup:
pool.map(square, xrange(100000000))
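One caveat worth noting (my addition, not from the answer): Pool.map converts its iterable to a list internally, so a generator alone may not avoid the up-front cost. imap_unordered with a chunksize genuinely streams work to the workers in batches:
import multiprocessing as mp

def square(x):
    return x**2

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        # items are handed out lazily, in batches of 10000,
        # which cuts the per-task communication overhead
        total = 0
        for result in pool.imap_unordered(square, range(10**6), chunksize=10000):
            total += result
    print(total)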
It isn't sufficient simply to import the multiprocessing library to make use of multiple processes to schedule your work. You actually have to create processes too!
Your work is currently scheduled to a single core because you haven't done so, and so your program is a single process with a single thread.
Naturally, when you start a new process to simply square a number, you are going to get slower performance. The overhead of process creation makes sure of that. So your process pool will very likely take longer than a single-process run.

Very simple concurrent programming in Python

I have a simple Python script that uses two much more complicated Python scripts, and does something with the results.
I have two modules, Foo and Bar, and my code is like the following:
import Foo
import Bar
output = []
a = Foo.get_something()
b = Bar.get_something_else()
output.append(a)
output.append(b)
Both methods take a long time to run, and neither depends on the other, so the obvious solution is to run them in parallel. How can I achieve this, but make sure that the order is maintained: Whichever one finishes first must wait for the other one to finish before the script can continue.
Let me know if I haven't made myself clear enough, I've tried to make the example code as simple as possible.
In general, you'd use threading to do this.
First, create a thread for each thing you want to run in parallel:
import threading
import Foo
import Bar

results = {}

def get_a():
    results['a'] = Foo.get_something()

a_thread = threading.Thread(target=get_a)
a_thread.start()

def get_b():
    results['b'] = Bar.get_something_else()

b_thread = threading.Thread(target=get_b)
b_thread.start()
Then to require both of them to have finished, use .join() on both:
a_thread.join()
b_thread.join()
at which point your results will be in results['a'] and results['b'], so if you wanted an ordered list:
output = [results['a'], results['b']]
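For what it's worth, the same pattern can be written more compactly with concurrent.futures (Python 3.2+); this is an alternative sketch, not part of the original answer:
import concurrent.futures
import Foo
import Bar

with concurrent.futures.ThreadPoolExecutor() as executor:
    future_a = executor.submit(Foo.get_something)
    future_b = executor.submit(Bar.get_something_else)
    # .result() blocks, so both tasks have finished before we continue
    output = [future_a.result(), future_b.result()]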
Note: if both tasks are inherently CPU-intensive, you might want to consider multiprocessing instead - due to Python's GIL, a given Python process will only ever use one CPU core, whereas multiprocessing can distribute the tasks to separate cores. However, it has a slightly higher overhead than threading, and thus if the tasks are less CPU-intensive, it might not be as efficient.
import multiprocessing
import Foo
import Bar

def get_a(results):
    results['a'] = Foo.get_something()

def get_b(results):
    results['b'] = Bar.get_something_else()

if __name__ == '__main__':
    # a plain dict would not propagate changes made in the child
    # processes back to the parent, so use a Manager dict instead
    manager = multiprocessing.Manager()
    results = manager.dict()
    process_a = multiprocessing.Process(target=get_a, args=(results,))
    process_b = multiprocessing.Process(target=get_b, args=(results,))
    process_a.start()
    process_b.start()
    process_a.join()
    process_b.join()
Here is the process version of your program.
NOTE: with threading there are shared data structures, so you have to worry about locking, which avoids wrong manipulation of data. As Amber mentioned above, threading also has the GIL (Global Interpreter Lock) problem, and since both of your tasks are CPU-intensive, this means it will take more time because of the calls notifying the threads of lock acquisition and release. If, however, your tasks were I/O-intensive, it would not have as much effect.
Now, since there are no shared data structures in a process, there is no worrying about locks, and since processes work irrespective of the GIL, you actually enjoy the real power of multiple processors.
A simple note to remember: a process is the same as a thread, just without shared data structures (everything works in isolation and is focused on messaging; see the sketch below).
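To illustrate the messaging model, here is a minimal sketch of my own using multiprocessing.Queue:
import multiprocessing

def worker(q):
    # the child sends its result back over the queue
    # instead of mutating shared state
    q.put(sum(i * i for i in range(10**6)))

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # blocks until the child has put a result
    p.join()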
Check out dabeaz.com; he gave a good presentation on concurrent programming once.

Multiprocessing vs Threading Python [duplicate]

This question already has answers here:
What are the differences between the threading and multiprocessing modules?
(6 answers)
Closed 3 years ago.
I am trying to understand the advantages of multiprocessing over threading. I know that multiprocessing gets around the Global Interpreter Lock, but what other advantages are there, and can threading not do the same thing?
Here are some pros/cons I came up with.
Multiprocessing
Pros
Separate memory space
Code is usually straightforward
Takes advantage of multiple CPUs & cores
Avoids GIL limitations for CPython
Eliminates most needs for synchronization primitives unless you use shared memory (instead, it's more of a communication model for IPC)
Child processes are interruptible/killable
Python multiprocessing module includes useful abstractions with an interface much like threading.Thread
A must with CPython for CPU-bound processing
Cons
IPC a little more complicated with more overhead (communication model vs. shared memory/objects)
Larger memory footprint
Threading
Pros
Lightweight - low memory footprint
Shared memory - makes access to state from another context easier
Allows you to easily make responsive UIs
CPython C extension modules that properly release the GIL will run in parallel
Great option for I/O-bound applications
Cons
CPython - subject to the GIL
Not interruptible/killable
If not following a command queue/message pump model (using the Queue module), then manual use of synchronization primitives becomes a necessity (decisions are needed for the granularity of locking)
Code is usually harder to understand and to get right - the potential for race conditions increases dramatically
The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.
Spawning processes is a bit slower than spawning threads.
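A quick sketch of my own that makes the memory difference visible: a thread's write to a global is seen afterwards by the main thread, while a child process's write is not:
import threading
import multiprocessing

counter = 0

def bump():
    global counter
    counter += 1

if __name__ == '__main__':
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    print(counter)  # 1 - the thread shares this process's memory

    counter = 0
    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print(counter)  # still 0 - the child incremented its own copy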
Threading's job is to enable applications to be responsive. Suppose you have a database connection and you need to respond to user input. Without threading, if the database connection is busy the application will not be able to respond to the user. By splitting off the database connection into a separate thread you can make the application more responsive. Also because both threads are in the same process, they can access the same data structures - good performance, plus a flexible software design.
Note that due to the GIL the app isn't actually doing two things at once, but what we've done is put the resource lock on the database into a separate thread so that CPU time can be switched between it and the user interaction. CPU time gets rationed out between the threads.
Multiprocessing is for times when you really do want more than one thing to be done at any given time. Suppose your application needs to connect to 6 databases and perform a complex matrix transformation on each dataset. Putting each job in a separate thread might help a little because when one connection is idle another one could get some CPU time, but the processing would not be done in parallel because the GIL means that you're only ever using the resources of one CPU. By putting each job in a Multiprocessing process, each can run on its own CPU and run at full efficiency.
Python documentation quotes
The canonical version of this answer is now at the duplicate question: What are the differences between the threading and multiprocessing modules?
I've highlighted the key Python documentation quotes about Process vs Threads and the GIL at: What is the global interpreter lock (GIL) in CPython?
Process vs thread experiments
I did a bit of benchmarking in order to show the difference more concretely.
In the benchmark, I timed CPU and IO bound work for various numbers of threads on an 8 hyperthread CPU. The work supplied per thread is always the same, such that more threads means more total work supplied.
The results were as follows (the plot and plot data are available in the original post):
Conclusions:
for CPU bound work, multiprocessing is always faster, presumably due to the GIL
for IO bound work, both are exactly the same speed
threads only scale up to about 4x instead of the expected 8x since I'm on an 8 hyperthread machine.
Contrast that with a C POSIX CPU-bound work which reaches the expected 8x speedup: What do 'real', 'user' and 'sys' mean in the output of time(1)?
TODO: I don't know the reason for this; there must be other Python inefficiencies coming into play.
Test code:
#!/usr/bin/env python3

import multiprocessing
import threading
import time
import sys

def cpu_func(result, niters):
    '''
    A useless CPU bound function.
    '''
    for i in range(niters):
        result = (result * result * i + 2 * result * i * i + 3) % 10000000
    return result

class CpuThread(threading.Thread):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class CpuProcess(multiprocessing.Process):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class IoThread(threading.Thread):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

class IoProcess(multiprocessing.Process):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

if __name__ == '__main__':
    cpu_n_iters = int(sys.argv[1])
    sleep = 1
    cpu_count = multiprocessing.cpu_count()
    input_params = [
        (CpuThread, cpu_n_iters),
        (CpuProcess, cpu_n_iters),
        (IoThread, sleep),
        (IoProcess, sleep),
    ]
    header = ['nthreads']
    for thread_class, _ in input_params:
        header.append(thread_class.__name__)
    print(' '.join(header))
    for nthreads in range(1, 2 * cpu_count):
        results = [nthreads]
        for thread_class, work_size in input_params:
            start_time = time.time()
            threads = []
            for i in range(nthreads):
                thread = thread_class(work_size)
                threads.append(thread)
                thread.start()
            for i, thread in enumerate(threads):
                thread.join()
            results.append(time.time() - start_time)
        print(' '.join('{:.6e}'.format(result) for result in results))
GitHub upstream + plotting code in the same directory.
Tested on Ubuntu 18.10, Python 3.6.7, in a Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), SSD: Samsung MZVLB512HAJQ-000L7 (3,000 MB/s).
Visualize which threads are running at a given time
This post https://rohanvarma.me/GIL/ taught me that you can run a callback whenever a thread is scheduled with the target= argument of threading.Thread and the same for multiprocessing.Process.
This allows us to view exactly which thread runs at each time. When this is done, we would see something like (I made this particular graph up):
            +--------------------------------------+
            + Active threads / processes           +
+-----------+--------------------------------------+
|Thread   1 |********     ************             |
|         2 |        *****            *************|
+-----------+--------------------------------------+
|Process  1 |***  ************** ******  ****      |
|         2 |** **** ****** ** ********* **********|
+-----------+--------------------------------------+
            + Time -->                             +
            +--------------------------------------+
which would show that:
threads are fully serialized by the GIL
processes can run in parallel
The key advantage is isolation. A crashing process won't bring down other processes, whereas a crashing thread will probably wreak havoc with other threads.
As mentioned in the question, Multiprocessing in Python is the only real way to achieve true parallelism. Multithreading cannot achieve this because the GIL prevents threads from running in parallel.
As a consequence, threading may not always be useful in Python, and in fact, may even result in worse performance depending on what you are trying to achieve. For example, if you are performing a CPU-bound task such as decompressing gzip files or 3D-rendering (anything CPU intensive) then threading may actually hinder your performance rather than help. In such a case, you would want to use Multiprocessing as only this method actually runs in parallel and will help distribute the weight of the task at hand. There could be some overhead to this since Multiprocessing involves copying the memory of a script into each subprocess which may cause issues for larger-sized applications.
However, Multithreading becomes useful when your task is IO-bound. For example, if most of your task involves waiting on API-calls, you would use Multithreading because why not start up another request in another thread while you wait, rather than have your CPU sit idly by.
TL;DR
Multithreading is concurrent and is used for IO-bound tasks
Multiprocessing achieves true parallelism and is used for CPU-bound tasks
Another thing not mentioned is that, where speed is concerned, it depends on what OS you are using. On Windows, processes are costly, so threads would be better there; on Unix, processes are faster than their Windows variants, so using processes on Unix is much safer and quick to spawn.
Other answers have focused more on the multithreading vs multiprocessing aspect, but in Python the Global Interpreter Lock (GIL) has to be taken into account. When more threads (say k) are created, they generally will not increase performance by k times, as the program will still be running as a single-threaded application. The GIL is a global lock which locks everything out and allows only single-thread execution, utilizing only a single core. Performance does increase in places where C extensions like numpy are used, or during network and file I/O, since a lot of the work is done in the background there and the GIL is released. So even though threading creates real operating-system threads, the interpreter only lets one of them execute Python bytecode at a time, so they essentially run as a single process, with preemption taking place between the threads. If the CPU runs at maximum capacity, you may want to switch to multiprocessing.
Now, for self-contained instances of execution, you can opt for a Pool. But in cases of overlapping data, where you may want processes communicating, you should use multiprocessing.Process (see the sketch below).
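A minimal sketch of the communicating-processes case, using multiprocessing.Process with a Pipe (my own illustration):
from multiprocessing import Process, Pipe

def worker(conn):
    # receive the overlapping data from the parent, send a result back
    data = conn.recv()
    conn.send(sum(data))
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3])
    print(parent_conn.recv())  # prints 6
    p.join()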
MULTIPROCESSING
Multiprocessing adds CPUs to increase computing power.
Multiple processes are executed concurrently.
Creation of a process is time-consuming and resource intensive.
Multiprocessing can be symmetric or asymmetric.
The multiprocessing library in Python uses separate memory space and multiple CPU cores, bypasses GIL limitations in CPython, and its child processes are killable; it is also much easier to use.
Some caveats of the module are a larger memory footprint, and IPC is a little more complicated, with more overhead.
MULTITHREADING
Multithreading creates multiple threads of a single process to increase computing power.
Multiple threads of a single process are executed concurrently.
Creation of a thread is economical in terms of both time and resources.
The multithreading library is lightweight, shares memory, responsible for responsive UI and is used well for I/O bound applications.
Threads aren't killable, and the module is subject to the GIL.
Multiple threads live in the same process in the same space; each thread does a specific task, has its own code, its own stack memory and instruction pointer, and shares heap memory.
If a thread has a memory leak it can damage the other threads and parent process.
Example of Multi-threading and Multiprocessing using Python
Python 3 has the facility of launching parallel tasks. This makes our work easier.
It has concurrent.futures for thread pooling and process pooling.
The following gives an insight:
ThreadPoolExecutor Example
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
ProcessPoolExecutor Example
import concurrent.futures
import math

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print('%d is prime: %s' % (number, prime))

if __name__ == '__main__':
    main()
Threads share the same memory space. To guarantee that two threads don't write to the same memory location at the same time, special precautions must be taken; the CPython interpreter handles this using a mechanism called the GIL, or the Global Interpreter Lock.
What is the GIL? (Just to clarify, even though it is covered above.)
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.
For the main question, we can compare using use cases. How?
1. Use cases for threading: in GUI programs, threading can be used to make the application responsive. For example, in a text editing program, one thread can take care of recording the user inputs, another can be responsible for displaying the text, a third can do spell-checking, and so on. Here, the program has to wait for user interaction, which is the biggest bottleneck. Another use case for threading is programs that are IO-bound or network-bound, such as web scrapers.
2. Use cases for multiprocessing: multiprocessing outshines threading in cases where the program is CPU-intensive and doesn't have to do any IO or user interaction.
A process may have multiple threads. These threads may share memory, and they are the units of execution within a process.
Processes run on the CPU, and threads reside under each process. Processes are individual entities which run independently. If you want to share data or state between processes, you may use a memory-storage tool such as a cache (Redis, memcached), files, or a database.
As I learned in university, most of the answers above are right. In practice, on different platforms (always using Python), spawning multiple threads ends up like spawning one process, whereas with multiple processes the cores share the load instead of only one core processing everything at 100%. So if you spawn, for example, 10 threads on a 4-core PC, you will end up getting only 25% of the CPU's power! And if you spawn 10 processes, the CPU will end up processing at 100% (if you don't have other limitations). I'm not an expert in all the new technologies; I'm answering from my own real-world experience.
