Why use multithreading when multiprocessing is available in python? [duplicate]

Why use multithreading when multiprocessing is available in python? [duplicate] - python

I found that in Python 3.4, there are few different libraries for multiprocessing/threading: multiprocessing vs threading vs asyncio.
But I don't know which one to use or is the "recommended one". Do they do the same thing, or are different? If so, which one is used for what? I want to write a program that uses multicores in my computer. But I don't know which library I should learn.

TL;DR
Making the Right Choice:
We have walked through the most popular forms of concurrency. But the question remains - when should choose which one? It really depends on the use cases. From my experience (and reading), I tend to follow this pseudo code:
if io_bound:
if io_very_slow:
print("Use Asyncio")
else:
print("Use Threads")
else:
print("Multi Processing")
CPU Bound => Multi Processing
I/O Bound, Fast I/O, Limited Number of Connections => Multi Threading
I/O Bound, Slow I/O, Many connections => Asyncio
Reference
[NOTE]:
If you have a long call method (e.g. a method containing a sleep time or lazy I/O), the best choice is asyncio, Twisted or Tornado approach (coroutine methods), that works with a single thread as concurrency.
asyncio works on Python3.4 and later.
Tornado and Twisted are ready since Python2.7
uvloop is ultra fast asyncio event loop (uvloop makes asyncio 2-4x faster).
[UPDATE (2019)]:
Japranto (GitHub) is a very fast pipelining HTTP server based on uvloop.

They are intended for (slightly) different purposes and/or requirements. CPython (a typical, mainline Python implementation) still has the global interpreter lock so a multi-threaded application (a standard way to implement parallel processing nowadays) is suboptimal. That's why multiprocessing may be preferred over threading. But not every problem may be effectively split into [almost independent] pieces, so there may be a need in heavy interprocess communications. That's why multiprocessing may not be preferred over threading in general.
asyncio (this technique is available not only in Python, other languages and/or frameworks also have it, e.g. Boost.ASIO) is a method to effectively handle a lot of I/O operations from many simultaneous sources w/o need of parallel code execution. So it's just a solution (a good one indeed!) for a particular task, not for parallel processing in general.

In multiprocessing you leverage multiple CPUs to distribute your calculations. Since each of the CPUs runs in parallel, you're effectively able to run multiple tasks simultaneously. You would want to use multiprocessing for CPU-bound tasks. An example would be trying to calculate a sum of all elements of a huge list. If your machine has 8 cores, you can "cut" the list into 8 smaller lists and calculate the sum of each of those lists separately on separate core and then just add up those numbers. You'll get a ~8x speedup by doing that.
In (multi)threading you don't need multiple CPUs. Imagine a program that sends lots of HTTP requests to the web. If you used a single-threaded program, it would stop the execution (block) at each request, wait for a response, and then continue once received a response. The problem here is that your CPU isn't really doing work while waiting for some external server to do the job; it could have actually done some useful work in the meantime! The fix is to use threads - you can create many of them, each responsible for requesting some content from the web. The nice thing about threads is that, even if they run on one CPU, the CPU from time to time "freezes" the execution of one thread and jumps to executing the other one (it's called context switching and it happens constantly at non-deterministic intervals). So if your task is I/O bound - use threading.
asyncio is essentially threading where not the CPU but you, as a programmer (or actually your application), decide where and when does the context switch happen. In Python you use an await keyword to suspend the execution of your coroutine (defined using async keyword).

This is the basic idea:
Is it IO-BOUND ? -----------> USE asyncio
IS IT CPU-HEAVY ? ---------> USE multiprocessing
ELSE ? ----------------------> USE threading
So basically stick to threading unless you have IO/CPU problems.

Many of the answers suggest how to choose only 1 option, but why not be able to use all 3? In this answer I explain how you can use asyncio to manage combining all 3 forms of concurrency instead as well as easily swap between them later if need be.
The short answer
Many developers that are first-timers to concurrency in Python will end up using processing.Process and threading.Thread. However, these are the low-level APIs which have been merged together by the high-level API provided by the concurrent.futures module. Furthermore, spawning processes and threads has overhead, such as requiring more memory, a problem which plagued one of the examples I showed below. To an extent, concurrent.futures manages this for you so that you cannot as easily do something like spawn a thousand processes and crash your computer by only spawning a few processes and then just re-using those processes each time one finishes.
These high-level APIs are provided through concurrent.futures.Executor, which are then implemented by concurrent.futures.ProcessPoolExecutor and concurrent.futures.ThreadPoolExecutor. In most cases, you should use these over the multiprocessing.Process and threading.Thread, because it's easier to change from one to the other in the future when you use concurrent.futures and you don't have to learn the detailed differences of each.
Since these share a unified interfaces, you'll also find that code using multiprocessing or threading will often use concurrent.futures. asyncio is no exception to this, and provides a way to use it via the following code:
import asyncio
from concurrent.futures import Executor
from functools import partial
from typing import Any, Callable, Optional, TypeVar
T = TypeVar("T")
async def run_in_executor(
executor: Optional[Executor],
func: Callable[..., T],
/,
*args: Any,
**kwargs: Any,
) -> T:
"""
Run `func(*args, **kwargs)` asynchronously, using an executor.
If the executor is None, use the default ThreadPoolExecutor.
"""
return await asyncio.get_running_loop().run_in_executor(
executor,
partial(func, *args, **kwargs),
)
# Example usage for running `print` in a thread.
async def main():
await run_in_executor(None, print, "O" * 100_000)
asyncio.run(main())
In fact it turns out that using threading with asyncio was so common that in Python 3.9 they added asyncio.to_thread(func, *args, **kwargs) to shorten it for the default ThreadPoolExecutor.
The long answer
Are there any disadvantages to this approach?
Yes. With asyncio, the biggest disadvantage is that asynchronous functions aren't the same as synchronous functions. This can trip up new users of asyncio a lot and cause a lot of rework to be done if you didn't start programming with asyncio in mind from the beginning.
Another disadvantage is that users of your code will also become forced to use asyncio. All of this necessary rework will often leave first-time asyncio users with a really sour taste in their mouth.
Are there any non-performance advantages to this?
Yes. Similar to how using concurrent.futures is advantageous over threading.Thread and multiprocessing.Process for its unified interface, this approach can be considered a further abstraction from an Executor to an asynchronous function. You can start off using asyncio, and if later you find a part of it you need threading or multiprocessing, you can use asyncio.to_thread or run_in_executor. Likewise, you may later discover that an asynchronous version of what you're trying to run with threading already exists, so you can easily step back from using threading and switch to asyncio instead.
Are there any performance advantages to this?
Yes... and no. Ultimately it depends on the task. In some cases, it may not help (though it likely does not hurt), while in other cases it may help a lot. The rest of this answer provides some explanations as to why using asyncio to run an Executor may be advantageous.
- Combining multiple executors and other asynchronous code
asyncio essentially provides significantly more control over concurrency at the cost of you need to take control of the concurrency more. If you want to simultaneously run some code using a ThreadPoolExecutor along side some other code using a ProcessPoolExecutor, it is not so easy managing this using synchronous code, but it is very easy with asyncio.
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
async def with_processing():
with ProcessPoolExecutor() as executor:
tasks = [...]
for task in asyncio.as_completed(tasks):
result = await task
...
async def with_threading():
with ThreadPoolExecutor() as executor:
tasks = [...]
for task in asyncio.as_completed(tasks):
result = await task
...
async def main():
await asyncio.gather(with_processing(), with_threading())
asyncio.run(main())
How does this work? Essentially asyncio asks the executors to run their functions. Then, while an executor is running, asyncio will go run other code. For example, the ProcessPoolExecutor starts a bunch of processes, and then while waiting for those processes to finish, the ThreadPoolExecutor starts a bunch of threads. asyncio will then check in on these executors and collect their results when they are done. Furthermore, if you have other code using asyncio, you can run them while waiting for the processes and threads to finish.
- Narrowing in on what sections of code needs executors
It is not common that you will have many executors in your code, but what is a common problem that I have seen when people use threads/processes is that they will shove the entirety of their code into a thread/process, expecting it to work. For example, I once saw the following code (approximately):
from concurrent.futures import ThreadPoolExecutor
import requests
def get_data(url):
return requests.get(url).json()["data"]
urls = [...]
with ThreadPoolExecutor() as executor:
for data in executor.map(get_data, urls):
print(data)
The funny thing about this piece of code is that it was slower with concurrency than without. Why? Because the resulting json was large, and having many threads consume a huge amount of memory was disastrous. Luckily the solution was simple:
from concurrent.futures import ThreadPoolExecutor
import requests
urls = [...]
with ThreadPoolExecutor() as executor:
for response in executor.map(requests.get, urls):
print(response.json()["data"])
Now only one json is unloaded into memory at a time, and everything is fine.
The lesson here?
You shouldn't try to just slap all of your code into threads/processes, you should instead focus in on what part of the code actually needs concurrency.
But what if get_data was not a function as simple as this case? What if we had to apply the executor somewhere deep in the middle of the function? This is where asyncio comes in:
import asyncio
import requests
async def get_data(url):
# A lot of code.
...
# The specific part that needs threading.
response = await asyncio.to_thread(requests.get, url, some_other_params)
# A lot of code.
...
return data
urls = [...]
async def main():
tasks = [get_data(url) for url in urls]
for task in asyncio.as_completed(tasks):
data = await task
print(data)
asyncio.run(main())
Attempting the same with concurrent.futures is by no means pretty. You could use things such as callbacks, queues, etc., but it would be significantly harder to manage than basic asyncio code.

Already a lot of good answers. Can't elaborate more on the when to use each one. This is more an interesting combination of two. Multiprocessing + asyncio: https://pypi.org/project/aiomultiprocess/.
The use case for which it was designed was highio, but still utilizing as many of the cores available. Facebook used this library to write some kind of python based File server. Asyncio allowing for IO bound traffic, but multiprocessing allowing multiple event loops and threads on multiple cores.
Ex code from the repo:
import asyncio
from aiohttp import request
from aiomultiprocess import Pool
async def get(url):
async with request("GET", url) as response:
return await response.text("utf-8")
async def main():
urls = ["https://jreese.sh", ...]
async with Pool() as pool:
async for result in pool.map(get, urls):
... # process result
if __name__ == '__main__':
# Python 3.7
asyncio.run(main())
# Python 3.6
# loop = asyncio.get_event_loop()
# loop.run_until_complete(main())
Just and addition here, would not working in say jupyter notebook very well, as the notebook already has a asyncio loop running. Just a little note for you to not pull your hair out.

I’m not a professional Python user, but as a student in computer architecture I think I can share some of my considerations when choosing between multi processing and multi threading. Besides, some of the other answers (even among those with higher votes) are misusing technical terminology, so I thinks it’s also necessary to make some clarification on those as well, and I’ll do it first.
The fundamental difference between multiprocessing and multithreading is whether they share the same memory space. Threads share access to the same virtual memory space, so it is efficient and easy for threads to exchange their computation results (zero copy, and totally user-space execution).
Processes on the other hand have separate virtual memory spaces. They cannot directly read or write the other process’ memory space, just like a person cannot read or alter the mind of another person without talking to him. (Allowing so would be a violation of memory protection and defeat the purpose of using virtual memory. ) To exchange data between processes, they have to rely on the operating system’s facility (e.g. message passing), and for more than one reasons this is more costly to do than the “shared memory” scheme used by threads. One reason is that invoking the OS’ message passing mechanism requires making a system call which will switch the code execution from user mode to kernel mode, which is time consuming; another reason is likely that OS message passing scheme will have to copy the data bytes from the senders’ memory space to the receivers’ memory space, so non-zero copy cost.
It is incorrect to say a multithread program can only use one CPU. The reason why many people say so is due to an artifact of the CPython implementation: global interpreter lock (GIL). Because of the GIL, threads in a CPython process are serialized. As a result, it appears that the multithreaded python program only uses one CPU.
But multi thread computer programs in general are not restricted to one core, and for Python, implementations that do not use the GIL can indeed run many threads in parallel, that is, run on more than one CPU at the same time. (See https://wiki.python.org/moin/GlobalInterpreterLock).
Given that CPython is the predominant implementation of Python, it’s understandable why multithreaded python programs are commonly equated to being bound to a single core.
With Python with GIL, the only way to unleash the power of multicores is to use multiprocessing (there are exceptions to this as mentioned below). But your problem better be easily partition-able into parallel sub-problems that have minimal intercommunication, otherwise a lot of inter-process communication will have to take place and as explained above, the overhead of using the OS’ message passing mechanism will be costly, sometimes so costly the benefits of parallel processing are totally offset. If the nature of your problem requires intense communication between concurrent routines, multithreading is the natural way to go. Unfortunately with CPython, true, effectively parallel multithreading is not possible due to the GIL. In this case you should realize Python is not the optimal tool for your project and consider using another language.
There’s one alternative solution, that is to implement the concurrent processing routines in an external library written in C (or other languages), and import that module to Python. The CPython GIL will not bother to block the threads spawned by that external library.
So, with the burdens of GIL, is multithreading in CPython any good? It still offers benefits though, as other answers have mentioned, if you’re doing IO or network communication. In these cases the relevant computation is not done by your CPU but done by other devices (in the case of IO, the disk controller and DMA (direct memory access) controller will transfer the data with minimal CPU participation; in the case of networking, the NIC (network interface card) and DMA will take care of much of the task without CPU’s participation), so once a thread delegates such task to the NIC or disk controller, the OS can put that thread to a sleeping state and switch to other threads of the same program to do useful work.
In my understanding, the asyncio module is essentially a specific case of multithreading for IO operations.
So:
CPU-intensive programs, that can easily be partitioned to run on multiple processes with limited communication: Use multithreading if GIL does not exist (eg Jython), or use multiprocess if GIL is present (eg CPython).
CPU-intensive programs, that requires intensive communication between concurrent routines: Use multithreading if GIL does not exist, or use another programming language.
Lot’s of IO: asyncio

Multiprocessing can be run parallelly.
Multithreading and asyncio cannot be run parallelly.
With Intel(R) Core(TM) i7-8700K CPU # 3.70GHz and 32.0 GB RAM, I timed how many prime numbers are between 2 and 100000 with 2 processes, 2 threads and 2 asyncio tasks as shown below. *This is CPU bound calculation:
Multiprocessing
Multithreading
asyncio
23.87 seconds
45.24 seconds
44.77 seconds
Because multiprocessing can be run parallelly so multiprocessing is double more faster than multithreading and asyncio as shown above.
I used 3 sets of code below:
Multiprocessing:
# "process_test.py"
from multiprocessing import Process
import time
start_time = time.time()
def test():
num = 100000
primes = 0
for i in range(2, num + 1):
for j in range(2, i):
if i % j == 0:
break
else:
primes += 1
print(primes)
if __name__ == "__main__": # This is needed to run processes on Windows
process_list = []
for _ in range(0, 2): # 2 processes
process = Process(target=test)
process_list.append(process)
for process in process_list:
process.start()
for process in process_list:
process.join()
print(round((time.time() - start_time), 2), "seconds") # 23.87 seconds
Result:
...
9592
9592
23.87 seconds
Multithreading:
# "thread_test.py"
from threading import Thread
import time
start_time = time.time()
def test():
num = 100000
primes = 0
for i in range(2, num + 1):
for j in range(2, i):
if i % j == 0:
break
else:
primes += 1
print(primes)
thread_list = []
for _ in range(0, 2): # 2 threads
thread = Thread(target=test)
thread_list.append(thread)
for thread in thread_list:
thread.start()
for thread in thread_list:
thread.join()
print(round((time.time() - start_time), 2), "seconds") # 45.24 seconds
Result:
...
9592
9592
45.24 seconds
Asyncio:
# "asyncio_test.py"
import asyncio
import time
start_time = time.time()
async def test():
num = 100000
primes = 0
for i in range(2, num + 1):
for j in range(2, i):
if i % j == 0:
break
else:
primes += 1
print(primes)
async def call_tests():
tasks = []
for _ in range(0, 2): # 2 asyncio tasks
tasks.append(test())
await asyncio.gather(*tasks)
asyncio.run(call_tests())
print(round((time.time() - start_time), 2), "seconds") # 44.77 seconds
Result:
...
9592
9592
44.77 seconds

Multiprocessing
Each process has its own Python interpreter and can run on a separate core of a processor. Python multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers true parallelism, effectively side-stepping the Global Interpreter Lock by using sub processes instead of threads.
Use multiprocessing when you have CPU intensive tasks.
Multithreading
Python multithreading allows you to spawn multiple threads within the process. These threads can share the same memory and resources of the process. In CPython due to Global interpreter lock at any given time only a single thread can run, hence you cannot utilize multiple cores. Multithreading in Python does not offer true parallelism due to GIL limitation.
Asyncio
Asyncio works on co-operative multitasking concepts. Asyncio tasks run on the same thread so there is no parallelism, but it provides better control to the developer instead of the OS which is the case in multithreading.
There is a nice discussion on this link regarding the advantages of asyncio over threads.
There is a nice blog by Lei Mao on Python concurrency here
Multiprocessing VS Threading VS AsyncIO in Python Summary

Related

Why isn't my threaded python script not much faster?

I wanted to speed up a python script I have that iterates over 300 records. So I figured I'd try to use threading. My non-thread version takes just under 1 minute to execute. My threaded version does 1 seconds better. Here are the pertinent parts of my thread version of the script:
... other imports ...
import threading
import concurrent.futures
# global vars
threads = []
check_records = []
default_max_problems = 5
problems_found = 0
lock = threading.Lock()
... some functions ...
def check_host(rec):
with lock:
global problems_found
global max_problems
if problems_found >= max_problems:
# I'd prefer to stop all threads and stop new ones from starting,
# but I don't know how to do that.
return
... bunch of function calls that do network stuff ...
check_records.append(rec)
if not(reachable and dns_ready):
problems_found += 1
logging.debug(f"check_host problems_found is {problems_found}.")
if __name__ == '__main__':
... handle command line args ...
try:
with concurrent.futures.ThreadPoolExecutor() as executor:
for ip in get_ips():
req_rec = find_dns_req_record(ip, dns_record_reqs)
executor.submit(check_host, req_rec)
Why is performance of my threaded script almost the same my non-thread version?

The kind of work you are performing is important to answer the question. If you are performing many IO-bound tasks (network calls, disk reads, etc.), then using Python's multi-threading should provide a good speed increase, since you can now have multiple threads waiting for multiple IO calls.
However, if you are performing raw computation, then multi-threading wont help you, because of Python's GIL (global interpreter lock), which basically only allows one thread to run at a time. To speed up non IO-bound computation, you will need to use the multiprocessing module, and spin up multiple Python processes. One of the disadvantages of multiple processes vs multiple threads is that it is harder to share data/memory between processes (because they have separate address spaces) vs threads (threads share memory because they are part of the same process).
Another thing that is important to consider is how you are using locks. If you put too much code under a lock, then threads won't be able to concurrently execute that code. You should try to have the smallest amount of code possible under any given lock, and only in places where shared data is accessed. If your entire thread function body is under a lock then you eliminate the potential for speed improvement via multi-threading.

Python Asyncio/Trio for Asynchronous Computing/Fetching

I am looking for a way to efficiently fetch a chunk of values from disk, and then perform computation/calculations on the chunk. My thought was a for loop that would run the disk fetching task first, then run the computation on the fetched data. I want to have my program fetch the next batch as it is running the computation so I don't have to wait for another data fetch every time a computation completes. I expect the computation will take longer than the fetching of the data from disk, and likely cannot be done truly in parallel due to a single computation task already pinning the cpu usage at near 100%.
I have provided some code below in python using trio (but could alternatively be used with asyncio to the same effect) to illustrate my best attempt at performing this operation with async programming:
import trio
import numpy as np
from datetime import datetime as dt
import time
testiters=10
dim = 6000
def generateMat(arrlen):
for _ in range(30):
retval= np.random.rand(arrlen, arrlen)
# print("matrix generated")
return retval
def computeOpertion(matrix):
return np.linalg.inv(matrix)
def runSync():
for _ in range(testiters):
mat=generateMat(dim)
result=computeOpertion(mat)
return result
async def matGenerator_Async(count):
for _ in range(count):
yield generateMat(dim)
async def computeOpertion_Async(matrix):
return computeOpertion(matrix)
async def runAsync():
async with trio.open_nursery() as nursery:
async for value in matGenerator_Async(testiters):
nursery.start_soon(computeOpertion_Async,value)
#await computeOpertion_Async(value)
print("Sync:")
start=dt.now()
runSync()
print(dt.now()-start)
print("Async:")
start=dt.now()
trio.run(runAsync)
print(dt.now()-start)
This code will simulate getting data from disk by generating 30 random matrices, which uses a small amount of cpu. It will then perform matrix inversion on the generated matrix, which uses 100% cpu (with openblas/mkl configuration in numpy). I compare the time taken to run the tasks by timing the synchronous and asynchronous operations.
From what I can tell, both jobs take exactly the same amount of time to finish, meaning the async operation did not speed up the execution. Observing the behavior of each computation, the sequential operation runs the fetch and computation in order and the async operation runs all the fetches first, then all the computations afterwards.
Is there a way to use asynchronously fetch and compute? Perhaps with futures or something like gather()? Asyncio has these functions, and trio has them in a seperate package trio_future. I am also open to solutions via other methods (threads and multiprocessing).
I believe that there likely exists a solution with multiprocessing that can make the disk reading operation run in a separate process. However, inter-process communication and blocking then becomes a hassle, as I would need some sort of semaphore to control how many blocks could be generated at a time due to memory constraints, and multiprocessing tends to be quite heavy and slow.
EDIT
Thank you VPfB for your answer. I am not able to sleep(0) in the operation, but I think even if I did, it would necessarily block the computation in favor of performing disk operations. I think this may be a hard limitation of python threading and asyncio, that it can only execute 1 thread at a time. Running two different processes simultaneously is impossible if both require anything but waiting for some external resource to respond from your CPU.
Perhaps there is a way with an executor for a multiprocessing pool. I have added the following code below:
import asyncio
import concurrent.futures
async def asynciorunAsync():
loop = asyncio.get_running_loop()
with concurrent.futures.ProcessPoolExecutor() as pool:
async for value in matGenerator_Async(testiters):
result = await loop.run_in_executor(pool, computeOpertion,value)
print("Async with PoolExecutor:")
start=dt.now()
asyncio.run(asynciorunAsync())
print(dt.now()-start)
Although timing this, it still takes the same amount of time as the synchronous example. I think I will have to go with a more involved solution as it seems that async and await are too crude of a tool to properly do this type of task switching.

I don't work with trio, my answer it asyncio based.
Under these circumstances the only way to improve the asyncio performance I see is to break the computation into smaller pieces and insert await sleep(0) between them. This would allow the data fetching task to run.
Asyncio uses cooperative scheduling. A synchronous CPU bound routine does not cooperate, it blocks everything else while it is running.
sleep() always suspends the current task, allowing other tasks to run.
Setting the delay to 0 provides an optimized path to allow other tasks
to run. This can be used by long-running functions to avoid blocking
the event loop for the full duration of the function call.
(quoted from: asyncio.sleep)
If that is not possible, try to run the computation in an executor. This adds some multi-threading capabilities to otherwise pure asyncio code.

The point of async I/O is to make it easy to write programs where there is lots of network I/O but very little actual computation (or disk I/O). That applies to any async library (Trio or asyncio) or even different languages (e.g. ASIO in C++). So your program is ideally unsuited to async I/O! You will need to use multiple threads (or processes). Although, in fairness, async I/O including Trio can be useful for coordinating work on threads, and that might work well in your case.
As VPfB's answer says, if you're using asyncio then you can use executors, specifically a ThreadPoolExecutor passed to loop.run_in_executor(). For Trio, the equivalent would be trio.to_thread.run_sync() (see also Threads (if you must) in the Trio docs), which is even easier to use. In both cases, you can await the result, so the function is running in a separate thread while the main Trio thread can continue running your async code. Your code would end up looking something like this:
async def matGenerator_Async(count):
for _ in range(count):
yield await trio.to_thread.run_sync(generateMat, dim)
async def my_trio_main()
async with trio.open_nursery() as nursery:
async for matrix in matGenerator_Async(testiters):
nursery.start_soon(trio.to_thread.run_sync, computeOperation, matrix)
trio.run(my_trio_main)
There's no need for the computation functions (generateMat and computeOperation) to be async. In fact, it's problematic if they are because you could no longer run them in a separate thread. In general, only make a function async if it needs to await something or use async with or async for.
You can see from the above example how to pass data to the functions running in the other thread: just pass them as parameters to trio.to_thread.run_sync(), and they will be passed along as parameters to the function. Getting the result back from generateMat() is also straightforward - the return value of the function called in the other thread is returned from await trio.to_thread.run_sync(). Getting the result of computeOperation() is trickier, because it's called in the nursery, so its return value is thrown away. You'll need to pass a mutable parameter to it (like a dict) and stash the result in there. But be careful about thread safety; the easiest way to do that is to pass a new object to each coroutine, and only inspect them all after the nursery has finished.
A few final footnotes that you can probably ignore:
Just to be clear, yield await in the code above isn't some sort of special syntax. It's just await foo(), which returns a value once foo() has finished, followed by yield of that value.
You can change the number of threads Trio uses for calls to to_thread.run_sync() by passing a CapacityLimiter object, or by finding the default one and setting the count on that. It looks like the default is currently 40, so you might want to turn that down a bit, but it's probably not too important.
There is a common myth that Python doesn't support threads, or at least can't do computation in multiple threads simultaneously, because it has a single global lock (the global interpreter lock, or GIL). That would mean that you need to use multiple processes, rather than threads, for your program to really compute thing in parallel. It's true there is a GIL in Python, but so long as you're doing your computation using something like numpy, which you are, then it doesn't stop multithreading from working effectively.
Trio actually has great support for async file I/O. But I don't think it would be helpful in your case.

To supplement my other answer (which uses Trio like you asked), here's how to do it use it just using threads without any async library. The easiest way to do this with Future objects and a ThreadPoolExecutor.
futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
for matrix in matGenerator(testiters):
futures.append(executor.submit(computeOperation, matrix))
results = [f.result() for f in futures]
The code is actually pretty similar to the async code, but if anything it's simpler. If you don't need to do network I/O, you're better off with this method.

I think the main issue with using multiprocessing and not seeing any improvement is the 100% utilization of the CPU. It essentially leaves you with an async-like behavior where resources are occasionally being freed up and used for the I/O process. You could set a limit to the number of workers for your ProcessPoolExecutor and that might allow the I/O the room it needs to be more at the ready.
Disclaimer: I'm still new to multiprocessing and threading.

asyncio with multiple processors [duplicate]

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently setup a 2MB cap on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?

You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor
def blocking_func(x):
time.sleep(x) # Pretend this is expensive calculations
return x * 5
#asyncio.coroutine
def main():
#pool = multiprocessing.Pool()
#out = pool.apply(blocking_func, args=(10,)) # This blocks the event loop.
executor = ProcessPoolExecutor()
out = yield from loop.run_in_executor(executor, blocking_func, 10) # This does not
print(out)
if __name__ == "__main__":
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
For the majority of cases, this is function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it), that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing
def func(queue, event, lock, items):
with lock:
event.set()
for item in items:
time.sleep(3)
queue.put(item+5)
queue.close()
#asyncio.coroutine
def example(queue, event, lock):
l = [1,2,3,4,5]
p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
p.start()
while True:
result = yield from queue.coro_get()
if result is None:
break
print("Got result {}".format(result))
yield from p.coro_join()
#asyncio.coroutine
def example2(queue, event, lock):
yield from event.coro_wait()
with (yield from lock):
yield from queue.coro_put(78)
yield from queue.coro_put(None) # Shut down the worker
if __name__ == "__main__":
loop = asyncio.get_event_loop()
queue = aioprocessing.AioQueue()
lock = aioprocessing.AioLock()
event = aioprocessing.AioEvent()
tasks = [
asyncio.async(example(queue, event, lock)),
asyncio.async(example2(queue, event, lock)),
]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()

Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest to examine different approaches. For instance you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use less resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) can use this worker model for instance. It might end up easier and more robust depending on your use case. Specifically you can make your workers die after serving a configured number of requests and be respawned by a master process thereby eliminating much of the negative effects of memory leaks.

Based on #dano's answer above I wrote this function to replace places where I used to use multiprocess pool + map.
def asyncio_friendly_multiproc_map(fn: Callable, l: list):
"""
This is designed to replace the use of this pattern:
with multiprocessing.Pool(5) as p:
results = p.map(analyze_day, list_of_days)
By letting caller drop in replace:
asyncio_friendly_multiproc_map(analyze_day, list_of_days)
"""
tasks = []
with ProcessPoolExecutor(5) as executor:
for e in l:
tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
return res

See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This documents clearly the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures, I suggest you also have a look there.

Multiprocessing useless with urllib2?

I recently tried to speed up a little tool (which uses urllib2 to send a request to the (unofficial)twitter-button-count-url (> 2000 urls) and parses it´s results) with the multiprocessing module (and it´s worker pools). I read several discussion here about multithreading (which slowed the whole thing down compared to a standard, non-threaded version) and multiprocessing, but i could´t find an answer to a (probably very simple) question:
Can you speed up url-calls with multiprocessing or ain´t the bottleneck something like the network-adapter? I don´t see which part of, for example, the urllib2-open-method could be parallelized and how that should work...
EDIT: THis is the request i want to speed up and the current multiprocessing-setup:
urls=["www.foo.bar", "www.bar.foo",...]
tw_url='http://urls.api.twitter.com/1/urls/count.json?url=%s'
def getTweets(self,urls):
for i in urls:
try:
self.tw_que=urllib2.urlopen(tw_url %(i))
self.jsons=json.loads(self.tw_que.read())
self.tweets.append({'url':i,'date':today,'tweets':self.jsons['count']})
except ValueError:
print ....
continue
return self.tweets
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
result = [pool.apply_async(getTweets(i,)) for i in urls]
[i.get() for i in result]

Ah here comes yet another discussion about the GIL. Well here's the thing. Fetching content with urllib2 is going to be mostly IO-bound. Native threading AND multiprocessing will both have the same performance when the task is IO-bound (threading only becomes a problem when it's CPU-bound). Yes you can speed it up, I've done it myself using python threads and something like 10 downloader threads.
Basically you use a producer-consumer model with one thread (or process) producing urls to download, and N threads (or processes) consuming from that queue and making requests to the server.
Here's some pseudo-code:
# Make sure that the queue is thread-safe!!
def producer(self):
# Only need one producer, although you could have multiple
with fh = open('urllist.txt', 'r'):
for line in fh:
self.queue.enqueue(line.strip())
def consumer(self):
# Fire up N of these babies for some speed
while True:
url = self.queue.dequeue()
dh = urllib2.urlopen(url)
with fh = open('/dev/null', 'w'): # gotta put it somewhere
fh.write(dh.read())
Now if you're downloading very large chunks of data (hundreds of MB) and a single request completely saturates the bandwidth, then yes running multiple downloads is pointless. The reason you run multiple downloads (generally) is because requests are small and have a relatively high latency / overhead.

Take a look at a look at gevent and specifically at this example: concurrent_download.py. It will be reasonably faster than multiprocessing and multithreading + it can handle thousands of connections easily.

It depends! Are you contacting different servers, are the transferred files small or big, do you loose much of the time waiting for the server to reply or by transferring data,...
Generally, multiprocessing involves some overhead and as such you want to be sure that the speedup gained by parallelizing the work is larger than the overhead itself.
Another point: network and thus I/O bound applications work – and scale – better with asynchronous I/O and an event driven architecture instead of threading or multiprocessing, as in such applications much of the time is spent waiting on I/O and not doing any computation.
For your specific problem, I would try to implement a solution by using Twisted, gevent, Tornado or any other networking framework which does not use threads to parallelize connections.

What you do when you split web requests over several processes is to parallelize the network latencies (i.e. the waiting for responses). So you should normally get a good speedup, since most of the processes should sleep most of the time, waiting for an event.
Or use Twisted. ;)

Nothing is useful if your code is broken: f() (with parentheses) calls a function in Python immediately, you should pass just f (no parentheses) to be executed in the pool instead. Your code from the question:
#XXX BROKEN, DO NOT USE
result = [pool.apply_async(getTweets(i,)) for i in urls]
[i.get() for i in result]
notice parentheses after getTweets that means that all the code is executed in the main thread serially.
Delegate the call to the pool instead:
all_tweets = pool.map(getTweets, urls)
Also, you don't need separate processes here unless json.loads() is expensive (CPU-wise) in your case. You could use threads: replace multiprocessing.Pool with multiprocessing.pool.ThreadPool -- the rest is identical. GIL is released during IO in CPython and therefore threads should speed up your code if most of the time is spent in urlopen().read().
Here's a complete code example.

Multiprocessing vs Threading Python [duplicate]

This question already has answers here:
What are the differences between the threading and multiprocessing modules?
(6 answers)
Closed 3 years ago.
I am trying to understand the advantages of multiprocessing over threading. I know that multiprocessing gets around the Global Interpreter Lock, but what other advantages are there, and can threading not do the same thing?

Here are some pros/cons I came up with.
Multiprocessing
Pros
Separate memory space
Code is usually straightforward
Takes advantage of multiple CPUs & cores
Avoids GIL limitations for cPython
Eliminates most needs for synchronization primitives unless if you use shared memory (instead, it's more of a communication model for IPC)
Child processes are interruptible/killable
Python multiprocessing module includes useful abstractions with an interface much like threading.Thread
A must with cPython for CPU-bound processing
Cons
IPC a little more complicated with more overhead (communication model vs. shared memory/objects)
Larger memory footprint
Threading
Pros
Lightweight - low memory footprint
Shared memory - makes access to state from another context easier
Allows you to easily make responsive UIs
cPython C extension modules that properly release the GIL will run in parallel
Great option for I/O-bound applications
Cons
cPython - subject to the GIL
Not interruptible/killable
If not following a command queue/message pump model (using the Queue module), then manual use of synchronization primitives become a necessity (decisions are needed for the granularity of locking)
Code is usually harder to understand and to get right - the potential for race conditions increases dramatically

The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.
Spawning processes is a bit slower than spawning threads.

Threading's job is to enable applications to be responsive. Suppose you have a database connection and you need to respond to user input. Without threading, if the database connection is busy the application will not be able to respond to the user. By splitting off the database connection into a separate thread you can make the application more responsive. Also because both threads are in the same process, they can access the same data structures - good performance, plus a flexible software design.
Note that due to the GIL the app isn't actually doing two things at once, but what we've done is put the resource lock on the database into a separate thread so that CPU time can be switched between it and the user interaction. CPU time gets rationed out between the threads.
Multiprocessing is for times when you really do want more than one thing to be done at any given time. Suppose your application needs to connect to 6 databases and perform a complex matrix transformation on each dataset. Putting each job in a separate thread might help a little because when one connection is idle another one could get some CPU time, but the processing would not be done in parallel because the GIL means that you're only ever using the resources of one CPU. By putting each job in a Multiprocessing process, each can run on its own CPU and run at full efficiency.

Python documentation quotes
The canonical version of this answer is now at the dupliquee question: What are the differences between the threading and multiprocessing modules?
I've highlighted the key Python documentation quotes about Process vs Threads and the GIL at: What is the global interpreter lock (GIL) in CPython?
Process vs thread experiments
I did a bit of benchmarking in order to show the difference more concretely.
In the benchmark, I timed CPU and IO bound work for various numbers of threads on an 8 hyperthread CPU. The work supplied per thread is always the same, such that more threads means more total work supplied.
The results were:
Plot data.
Conclusions:
for CPU bound work, multiprocessing is always faster, presumably due to the GIL
for IO bound work. both are exactly the same speed
threads only scale up to about 4x instead of the expected 8x since I'm on an 8 hyperthread machine.
Contrast that with a C POSIX CPU-bound work which reaches the expected 8x speedup: What do 'real', 'user' and 'sys' mean in the output of time(1)?
TODO: I don't know the reason for this, there must be other Python inefficiencies coming into play.
Test code:
#!/usr/bin/env python3
import multiprocessing
import threading
import time
import sys
def cpu_func(result, niters):
'''
A useless CPU bound function.
'''
for i in range(niters):
result = (result * result * i + 2 * result * i * i + 3) % 10000000
return result
class CpuThread(threading.Thread):
def __init__(self, niters):
super().__init__()
self.niters = niters
self.result = 1
def run(self):
self.result = cpu_func(self.result, self.niters)
class CpuProcess(multiprocessing.Process):
def __init__(self, niters):
super().__init__()
self.niters = niters
self.result = 1
def run(self):
self.result = cpu_func(self.result, self.niters)
class IoThread(threading.Thread):
def __init__(self, sleep):
super().__init__()
self.sleep = sleep
self.result = self.sleep
def run(self):
time.sleep(self.sleep)
class IoProcess(multiprocessing.Process):
def __init__(self, sleep):
super().__init__()
self.sleep = sleep
self.result = self.sleep
def run(self):
time.sleep(self.sleep)
if __name__ == '__main__':
cpu_n_iters = int(sys.argv[1])
sleep = 1
cpu_count = multiprocessing.cpu_count()
input_params = [
(CpuThread, cpu_n_iters),
(CpuProcess, cpu_n_iters),
(IoThread, sleep),
(IoProcess, sleep),
]
header = ['nthreads']
for thread_class, _ in input_params:
header.append(thread_class.__name__)
print(' '.join(header))
for nthreads in range(1, 2 * cpu_count):
results = [nthreads]
for thread_class, work_size in input_params:
start_time = time.time()
threads = []
for i in range(nthreads):
thread = thread_class(work_size)
threads.append(thread)
thread.start()
for i, thread in enumerate(threads):
thread.join()
results.append(time.time() - start_time)
print(' '.join('{:.6e}'.format(result) for result in results))
GitHub upstream + plotting code on same directory.
Tested on Ubuntu 18.10, Python 3.6.7, in a Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), SSD: Samsung MZVLB512HAJQ-000L7 (3,000 MB/s).
Visualize which threads are running at a given time
This post https://rohanvarma.me/GIL/ taught me that you can run a callback whenever a thread is scheduled with the target= argument of threading.Thread and the same for multiprocessing.Process.
This allows us to view exactly which thread runs at each time. When this is done, we would see something like (I made this particular graph up):
+--------------------------------------+
+ Active threads / processes +
+-----------+--------------------------------------+
|Thread 1 |******** ************ |
| 2 | ***** *************|
+-----------+--------------------------------------+
|Process 1 |*** ************** ****** **** |
| 2 |** **** ****** ** ********* **********|
+-----------+--------------------------------------+
+ Time --> +
+--------------------------------------+
which would show that:
threads are fully serialized by the GIL
processes can run in parallel

The key advantage is isolation. A crashing process won't bring down other processes, whereas a crashing thread will probably wreak havoc with other threads.

As mentioned in the question, Multiprocessing in Python is the only real way to achieve true parallelism. Multithreading cannot achieve this because the GIL prevents threads from running in parallel.
As a consequence, threading may not always be useful in Python, and in fact, may even result in worse performance depending on what you are trying to achieve. For example, if you are performing a CPU-bound task such as decompressing gzip files or 3D-rendering (anything CPU intensive) then threading may actually hinder your performance rather than help. In such a case, you would want to use Multiprocessing as only this method actually runs in parallel and will help distribute the weight of the task at hand. There could be some overhead to this since Multiprocessing involves copying the memory of a script into each subprocess which may cause issues for larger-sized applications.
However, Multithreading becomes useful when your task is IO-bound. For example, if most of your task involves waiting on API-calls, you would use Multithreading because why not start up another request in another thread while you wait, rather than have your CPU sit idly by.
TL;DR
Multithreading is concurrent and is used for IO-bound tasks
Multiprocessing achieves true parallelism and is used for CPU-bound tasks

Another thing not mentioned is that it depends on what OS you are using where speed is concerned. In Windows processes are costly so threads would be better in windows but in unix processes are faster than their windows variants so using processes in unix is much safer plus quick to spawn.

Other answers have focused more on the multithreading vs multiprocessing aspect, but in python Global Interpreter Lock (GIL) has to be taken into account. When more number (say k) of threads are created, generally they will not increase the performance by k times, as it will still be running as a single threaded application. GIL is a global lock which locks everything out and allows only single thread execution utilizing only a single core. The performance does increase in places where C extensions like numpy, Network, I/O are being used, where a lot of background work is done and GIL is released. So when threading is used, there is only a single operating system level thread while python creates pseudo-threads which are completely managed by threading itself but are essentially running as a single process. Preemption takes place between these pseudo threads. If the CPU runs at maximum capacity, you may want to switch to multiprocessing.
Now in case of self-contained instances of execution, you can instead opt for pool. But in case of overlapping data, where you may want processes communicating you should use multiprocessing.Process.

MULTIPROCESSING
Multiprocessing adds CPUs to increase computing power.
Multiple processes are executed concurrently.
Creation of a process is time-consuming and resource intensive.
Multiprocessing can be symmetric or asymmetric.
The multiprocessing library in Python uses separate memory space, multiple CPU cores, bypasses GIL limitations in CPython, child processes are killable (ex. function calls in program) and is much easier to use.
Some caveats of the module are a larger memory footprint and IPC’s a little more complicated with more overhead.
MULTITHREADING
Multithreading creates multiple threads of a single process to increase computing power.
Multiple threads of a single process are executed concurrently.
Creation of a thread is economical in both sense time and resource.
The multithreading library is lightweight, shares memory, responsible for responsive UI and is used well for I/O bound applications.
The module isn’t killable and is subject to the GIL.
Multiple threads live in the same process in the same space, each thread will do a specific task, have its own code, own stack memory, instruction pointer, and share heap memory.
If a thread has a memory leak it can damage the other threads and parent process.
Example of Multi-threading and Multiprocessing using Python
Python 3 has the facility of Launching parallel tasks. This makes our work easier.
It has for thread pooling and Process pooling.
The following gives an insight:
ThreadPoolExecutor Example
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
with urllib.request.urlopen(url, timeout=timeout) as conn:
return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
ProcessPoolExecutor
import concurrent.futures
import math
PRIMES = [
112272535095293,
112582705942171,
112272535095293,
115280095190773,
115797848077099,
1099726899285419]
def is_prime(n):
if n % 2 == 0:
return False
sqrt_n = int(math.floor(math.sqrt(n)))
for i in range(3, sqrt_n + 1, 2):
if n % i == 0:
return False
return True
def main():
with concurrent.futures.ProcessPoolExecutor() as executor:
for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
print('%d is prime: %s' % (number, prime))
if __name__ == '__main__':
main()

Threads share the same memory space to guarantee that two threads don't share the same memory location so special precautions must be taken the CPython interpreter handles this using a mechanism called GIL, or the Global Interpreter Lock
what is GIL(Just I want to Clarify GIL it's repeated above)?
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.
For the main question, we can compare using Use Cases, How?
1-Use Cases for Threading: in case of GUI programs threading can be used to make the application responsive For example, in a text editing program, one thread can take care of recording the user inputs, another can be responsible for displaying the text, a third can do spell-checking, and so on. Here, the program has to wait for user interaction. which is the biggest bottleneck. Another use case for threading is programs that are IO bound or network bound, such as web-scrapers.
2-Use Cases for Multiprocessing: Multiprocessing outshines threading in cases where the program is CPU intensive and doesn’t have to do any IO or user interaction.
For More Details visit this link and link or you need in-depth knowledge for threading visit here for Multiprocessing visit here

Process may have multiple threads. These threads may share memory and are the units of execution within a process.
Processes run on the CPU, so threads are residing under each process. Processes are individual entities which run independently. If you want to share data or state between each process, you may use a memory-storage tool such as Cache(redis, memcache), Files, or a Database.

As I learnd in university most of the answers above are right. In PRACTISE on different platforms (always using python) spawning multiple threads ends up like spawning one process. The difference is the multiple cores share the load instead of only 1 core processing everything at 100%. So if you spawn for example 10 threads on a 4 core pc, you will end up getting only the 25% of the cpus power!! And if u spawn 10 processes u will end up with the cpu processing at 100% (if u dont have other limitations). Im not a expert in all the new technologies. Im answering with own real experience background

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.