How to start threads at the same time in Python
I already read this topic, but when I run the code below I still get a small delta between the timestamps:
from threading import Thread
from cryptography.fernet import Fernet  # key generation is what I ultimately want to test
import time
from multiprocessing import Process

def create_key1():
    print(time.time())

def create_key2():
    print(time.time())

if __name__ == '__main__':
    # Pass the function itself as the target (create_key1, not create_key1())
    Process(target=create_key1).start()
    Process(target=create_key2).start()
    Thread(target=create_key1).start()
    Thread(target=create_key2).start()
If we comment out the Process lines and run the code, we see this result:
1501843580.508508
1501843580.5089302
If we comment out the Thread lines and run the code, we see this result:
1501843680.4178944
1501843680.420028
We get a delta in both situations. My question is: how can I run threads at exactly the same time? I want to check key generation in the Python cryptography library: if I try to generate two keys at the same moment, will they be the same or not?
Parallel processing of two functions, as in your code, does not guarantee that the functions will run at exactly the same time. As you have seen there is a slight discrepancy in the time that the methods reach the time.time() call, and this is to be expected.
In particular due to the way that the threading module is designed it isn't possible for the methods to run at exactly the same time. Similarly, while the multiprocessing module could theoretically run two functions at the exact same time there is no guarantee of this, and it is likely to be a rare occurrence.
In the end this is butting up against the low level constraints of an operating system, where two pieces of code can't physically be run at the same time on the same processing core.
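You can, however, make the threads start their work as close together as possible with a synchronization primitive. A minimal sketch using threading.Barrier (this narrows the gap, but cannot eliminate it):

import threading
import time

# Both threads block at the barrier and are released together, so the
# time.time() calls happen as close together as the OS allows.
barrier = threading.Barrier(2)

def create_key(name):
    barrier.wait()  # wait until both threads are ready
    print(name, time.time())

t1 = threading.Thread(target=create_key, args=("key1",))
t2 = threading.Thread(target=create_key, args=("key2",))
t1.start()
t2.start()
t1.join()
t2.join()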
To answer your question on how this will affect the keys produced by your code: it depends on how sensitive your algorithm is to the current time. If your algorithm bases a key on the current time to the nearest second, or tenth of a second, then the keys produced will likely be identical (but are not guaranteed to be). However, if the keys are based on the exact time that the function call is reached, then they are unlikely to ever match, as there is no guarantee of when the two functions will reach that call.
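As an aside on the cryptography library specifically: Fernet keys are generated from the operating system's randomness source, not from the clock, so two keys generated at the same instant will still differ. A quick check (assuming the cryptography package is installed):

from cryptography.fernet import Fernet

# Fernet.generate_key() draws on os.urandom, not the current time,
# so keys generated simultaneously are still distinct.
key1 = Fernet.generate_key()
key2 = Fernet.generate_key()
print(key1 == key2)  # almost certainly False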
For more information on the differences between the threading and multiprocessing modules see this.
The GIL is an interpreter-level lock. This lock prevents the execution of multiple threads at once in the Python interpreter. Each thread that wants to run must wait for the GIL to be released by the other thread, which means your multi-threaded Python application is essentially single-threaded.
Another approach is to use the multiprocessing module where each process runs in its own OS process with its own Python runtime. You can take full advantage of multiple cores with this approach, and it's usually safer because you don't have to worry about synchronising access to shared memory.
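To make the difference concrete, here is a minimal sketch (the workload and numbers of workers are illustrative, not from the original answer) that times a CPU-bound function run in two threads versus two processes:

import time
from threading import Thread
from multiprocessing import Process

def burn_cpu():
    # Pure-Python busy work; the thread holds the GIL the whole time.
    total = 0
    for i in range(10_000_000):
        total += i

def run_all(worker_cls):
    workers = [worker_cls(target=burn_cpu) for _ in range(2)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == '__main__':
    print(f"threads:   {run_all(Thread):.2f}s")    # roughly serial, due to the GIL
    print(f"processes: {run_all(Process):.2f}s")   # can use two cores in parallel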
For more information, see the documentation on the GIL.
Related
I have to write a job that performs different types of analysis on a given document. I know I can do it sequentially, i.e., call each parser one by one.
A very high level script structure is given below
def summarize(doc):
    pass

def LengthCount(doc):
    pass

def LanguageFinder(doc):
    pass

def ProfanityFinder(doc):
    pass

if __name__ == '__main__':
    doc = "Some document"
    smry = summarize(doc)
    length = LengthCount(doc)
    lang = LanguageFinder(doc)
    profanity = ProfanityFinder(doc)
    # Save summary, length, language, profanity information in database
But for performance improvement, I think these tasks could run in parallel. How can I do that? What are the possible ways to do this in Python, especially the 3.x versions? It is quite possible that one parser (module) takes more time than another, but overall, if they can run in parallel, performance should improve. Lastly, if this is not possible in Python, any other language is also welcome.
In Python you have a few options for concurrency/parallelism. There is the threading module which allows you to execute code in multiple logical threads and the multiprocessing module which allows you to spawn multiple processes. There is also the concurrent.futures module that provides an API into both of these mechanisms.
If your process is CPU-bound (i.e. you are running at 100% of the CPU available to Python throughout; note this is not 100% CPU if you have a multi-core or hyper-threading machine), you are unlikely to see much benefit from threading, as it doesn't actually run multiple CPU threads in parallel; it just allows one to take over from another whilst the first is waiting for IO. Multiprocessing is likely to be more useful for you, as it allows you to run on multiple CPU cores. You can start each of your functions in its own process using the Process class:
import multiprocessing

# function defs here

if __name__ == '__main__':
    p = multiprocessing.Process(target=LengthCount, args=(doc,))
    p.start()
    # repeat for the other parsers, then join them all
You will need to tweak your code to have the functions write their results to a shared variable (or straight to your database) rather than return them directly, so that you can access the results once each process is complete.
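A minimal sketch of one way to wire this up, using concurrent.futures from the standard library so the return values come back without manual shared state (the parser bodies here are stand-in stubs for the real ones):

import concurrent.futures

# Stand-in stubs for the real parsers from the question.
def summarize(doc): return "summary"
def LengthCount(doc): return len(doc)
def LanguageFinder(doc): return "en"
def ProfanityFinder(doc): return []

if __name__ == '__main__':
    doc = "Some document"
    parsers = [summarize, LengthCount, LanguageFinder, ProfanityFinder]
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futures = [pool.submit(parser, doc) for parser in parsers]
        # .result() blocks until that parser finishes.
        smry, length, lang, profanity = (f.result() for f in futures)
    # Save summary, length, language, profanity information in database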
I'm slightly confused about whether multithreading works in Python or not.
I know there have been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience, and have seen others post their own answers and examples here on Stack Overflow, that multithreading is indeed possible in Python. So why does everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say threads are still useful because they work simultaneously and thus get the combined workload done faster. I mean, why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some I/O tasks, while CPU-bound tasks run one at a time, regardless of how many cores there are.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.
The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: any task that tries to get a speed boost from parallel execution of pure Python code will not see a speed-up, as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations), then that C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code; stick to the multiprocessing module for such tasks, or delegate to a dedicated external library.
Yes. :)
You have the low-level thread module and the higher-level threading module. But if you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Threading is allowed in Python; the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically, if you want to multi-thread the code to speed up calculation, it won't speed it up, as just one thread is executed at a time; but if you use it to interact with a database, for example, it will.
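A minimal sketch of that I/O-bound case, using time.sleep as a stand-in for a blocking database or network call (the GIL is released while a thread waits on I/O):

import time
from threading import Thread

def fake_io_call():
    time.sleep(1)  # stands in for a blocking database/network call

start = time.perf_counter()
threads = [Thread(target=fake_io_call) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The four 1-second waits overlap: total is about 1s, not 4s.
print(f"{time.perf_counter() - start:.2f}s")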
I feel for the poster because the answer is invariably "it depends what you want to do". However, parallel speed-up in Python has always been terrible in my experience, even for multiprocessing.
For example, check out this tutorial (the second-from-top result on Google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2, 4, 8, 16) for the pool.map function, and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
import time
import numpy as np
import multiprocessing as mp

# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2, 4, 8, 16]
for task in tasks:
    tic = time.perf_counter()
    pool = mp.Pool(task)
    results = pool.map(howmany_within_range_rowonly, [row for row in data])
    pool.close()
    toc = time.perf_counter()
    time1 = toc - tic
    print(f"parallel {time1} tasks {task}")
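One hedged observation on these numbers: pool.map here pickles and ships every row to a worker individually, so inter-process overhead can swamp the actual computation. pool.map accepts a chunksize argument that batches rows per task; something like the following may close much of the gap (the value 10000 is a guess to tune, not a measurement):

# Batch many rows per task to amortise pickling/IPC overhead.
results = pool.map(howmany_within_range_rowonly,
                   [row for row in data],
                   chunksize=10000)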
I have a simple Python script that uses two much more complicated Python scripts, and does something with the results.
I have two modules, Foo and Bar, and my code is like the following:
import Foo
import Bar
output = []
a = Foo.get_something()
b = Bar.get_something_else()
output.append(a)
output.append(b)
Both methods take a long time to run, and neither depends on the other, so the obvious solution is to run them in parallel. How can I achieve this, but make sure that the order is maintained: Whichever one finishes first must wait for the other one to finish before the script can continue.
Let me know if I haven't made myself clear enough, I've tried to make the example code as simple as possible.
In general, you'd use threading to do this.
First, create a thread for each thing you want to run in parallel:
import threading
import Foo
import Bar
results = {}

def get_a():
    results['a'] = Foo.get_something()

a_thread = threading.Thread(target=get_a)
a_thread.start()

def get_b():
    results['b'] = Bar.get_something_else()

b_thread = threading.Thread(target=get_b)
b_thread.start()
Then to require both of them to have finished, use .join() on both:
a_thread.join()
b_thread.join()
at which point your results will be in results['a'] and results['b'], so if you wanted an ordered list:
output = [results['a'], results['b']]
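An alternative sketch using concurrent.futures from the standard library, which returns the values directly and keeps the order without a shared dict (Foo and Bar are the modules from the question):

from concurrent.futures import ThreadPoolExecutor

import Foo
import Bar

with ThreadPoolExecutor() as executor:
    future_a = executor.submit(Foo.get_something)
    future_b = executor.submit(Bar.get_something_else)
    # .result() blocks until that future finishes, so both are done here.
    output = [future_a.result(), future_b.result()]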
Note: if both tasks are inherently CPU-intensive, you might want to consider multiprocessing instead - due to Python's GIL, a given Python process will only ever use one CPU core, whereas multiprocessing can distribute the tasks to separate cores. However, it has a slightly higher overhead than threading, and thus if the tasks are less CPU-intensive, it might not be as efficient.
import multiprocessing
import Foo
import Bar

def get_a(results):
    results['a'] = Foo.get_something()

def get_b(results):
    results['b'] = Bar.get_something_else()

if __name__ == '__main__':
    # A plain dict is not shared between processes; a Manager dict makes
    # the children's writes visible to the parent.
    manager = multiprocessing.Manager()
    results = manager.dict()
    process_a = multiprocessing.Process(target=get_a, args=(results,))
    process_b = multiprocessing.Process(target=get_b, args=(results,))
    process_a.start()
    process_b.start()
    process_a.join()  # note the parentheses: .join without () does nothing
    process_b.join()
    output = [results['a'], results['b']]
Here is the process version of your program.
NOTE: in threading there are shared data structures, so you have to worry about locking to avoid corrupting them. Also, as Amber mentioned above, threading has the GIL (Global Interpreter Lock) problem, and since both of your tasks are CPU-intensive, they will take more time because of the calls notifying the threads of lock acquisition and release. If your tasks were I/O-intensive, however, it would not matter so much.
Now, since a process has no shared data structures, there is no worrying about locks, and since processes work irrespective of the GIL, you actually enjoy the real power of multiple processors.
A simple note to remember: a process is the same as a thread, just without shared data structures (everything works in isolation and is focused on messaging).
Check out dabeaz.com; he gave a good presentation on concurrent programming once.
How can multiple calculations be launched in parallel, while stopping them all when the first one returns?
The application I have in mind is the following: there are multiple ways of calculating a certain value; each method takes a different amount of time depending on the function parameters; by launching calculations in parallel, the fastest calculation would automatically be "selected" each time, and the other calculations would be stopped.
Now, there are some "details" that make this question more difficult:
The parameters of the function to be calculated include functions (that are calculated from data points; they are not top-level module functions). In fact, the calculation is the convolution of two functions. I'm not sure how such function parameters could be passed to a subprocess (they are not pickleable).
I do not have access to all calculation codes: some calculations are done internally by Scipy (probably via Fortran or C code). I'm not sure whether threads offer something similar to the termination signals that can be sent to processes.
Is this something that Python can do relatively easily?
I would look at the multiprocessing module if you haven't already. It offers a way of offloading tasks to separate processes whilst providing you with a simple, threading-like interface.
It provides the same kinds of primitives as you get in the threading module (for example, worker pools and queues for passing messages between your tasks), but it allows you to sidestep the issue of the GIL, since your tasks actually run in separate processes.
The actual semantics of what you want are quite specific so I don't think there is a routine that fits the bill out-of-the-box, but you can surely knock one up.
Note: if you want to pass functions around, they cannot be bound functions since these are not pickleable, which is a requirement for sharing data between your tasks.
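A minimal sketch of the "first result wins" pattern along these lines, using a multiprocessing.Queue; the two workers here are illustrative stand-ins for the real calculations:

import multiprocessing
import time

# Stand-ins for two competing implementations of the same calculation.
def slow_method(queue):
    time.sleep(5)
    queue.put(('slow', 42))

def fast_method(queue):
    time.sleep(1)
    queue.put(('fast', 42))

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=f, args=(queue,))
             for f in (slow_method, fast_method)]
    for p in procs:
        p.start()
    name, result = queue.get()  # blocks until the first worker finishes
    for p in procs:
        p.terminate()           # stop the losers (a no-op for finished processes)
    print(name, result)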
Because of the global interpreter lock you would be hard pressed to get any speedup this way. In reality even multithreaded programs in Python only run on one core. Thus, you would just be doing N processes at 1/N times the speed. Even if one finished in half the time of the others you would still lose time in the big picture.
Processes can be started and killed trivially.
You can do this.
import subprocess

watch = []
for s in ("process1.py", "process2.py", "process3.py"):
    # assumes the scripts should be run with the Python interpreter
    sp = subprocess.Popen(["python", s])
    watch.append(sp)
Now you're simply waiting for one of those to finish. When one finishes, kill the others.
import time

winner = None
while winner is None:
    time.sleep(10)
    for w in watch:
        if w.poll() is not None:
            winner = w
            break

for w in watch:
    if w.poll() is None:
        w.kill()
These are processes -- not threads. No GIL considerations. Make the operating system schedule them; that's what it does best.
Further, each process is simply a script that simply solves the problem using one of your alternative algorithms. They're completely independent and stand-alone. Simple to design, build and test.