I have a loop with a highly time-consuming process in each iteration. Instead of waiting for each process to complete before moving to the next iteration, is it possible to start the process and move on to the next iteration without waiting for it to finish?
Example: Given a text, the script should try to find matching links from the Web and matching files on the local disk. Both searches simply return a list of links or paths.
for proc in (web_search, file_search):
    results = proc(text)
    yield from results
The solution I have in mind is to use a timer while doing the job: if the time exceeds the waiting threshold, the process should be moved to a 'tray' and keep working from there while I move on to the next iteration and repeat the same. After my loop is over, I will collect the results from the processes moved to the tray.
For simple cases, where the objective is just to let each task run concurrently, we can use Thread from the threading module.
We can tackle the issue like this: we run each task in a Thread and ask it to put its results into a list or some other collection. The code is given below:
from threading import Thread

results = []

def add_to_collection(proc, args, collection):
    '''proc is the function, args are the arguments to pass to it.
    collection is our container (here it is the list results) for
    collecting results.'''
    result = proc(*args)
    collection.append(result)
    print("Completed:", proc)

# Now we do our time consuming tasks
for proc in (web_search, file_search):
    # We assume proc takes no arguments, and pass results as the collection
    t = Thread(target=add_to_collection, args=(proc, (), results))
    t.start()
For complex tasks, as mentioned in the comments, it's better to go with multiprocessing.pool.Pool.
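For instance, a rough sketch of how that Pool-based version might look (web_search and file_search below are made-up stand-ins, since the real searchers are not shown in the question):

from multiprocessing.pool import Pool

def web_search(text):
    # hypothetical stand-in for the real web search
    return ["http://example.com/search?q=" + text]

def file_search(text):
    # hypothetical stand-in for the real local-disk search
    return ["/home/user/notes/" + text + ".txt"]

def collect_results(text):
    # run both searches in parallel worker processes and
    # yield each list of results once it is ready
    with Pool(processes=2) as pool:
        async_results = [pool.apply_async(proc, (text,))
                         for proc in (web_search, file_search)]
        for async_result in async_results:
            yield from async_result.get()

if __name__ == '__main__':
    for item in collect_results("python"):
        print(item)

apply_async returns immediately, so both searches run at the same time and the loop only blocks when it needs a result that is still being computed.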
Related
I have a lot of data (more than one million values) and need to do a calculation on it that finds a certain value out of the millions. This is time-consuming. To speed up the process I am trying to use all cores of the CPU. To do this, I am using a multiprocessing pool and I am calling the worker process with starmap_async (I need to hand over multiple arguments). Basically, it works so far, with the limitation that I have to wait until all values of the list are processed and all processes are finished before I can continue. Is there a possibility to end the starmap process once one of the processes finds a correct value?
I have already tried several different things, such as terminating from the worker process and changing the whole structure to a for loop, but it seems that the starmap process needs to run to its end and cannot be stopped. The only way seems to be to extract each list value individually and feed it to a separate Process, which again creates a big overhead and slows things down significantly.
Does anyone have an idea?
The solution described in Terminate a Python multiprocessing program once one of its workers meets a certain condition looks like the same problem, but it is not. I have tried it, but it doesn't work. The difference seems to be that in the described issue none of the arguments is the iterable. I have played with this and I couldn't get it to end the processes before the starmap call finished completely. In the recommended solution the processes are started and run independently until one finds a solution. In my case starmap seems to keep feeding the processes without checking any termination condition.
import multiprocessing

def worker(x, arg1, arg2):
    # some calculation with all arguments
    # here I need a possibility to cancel all processes and return the current x value
    ...

if __name__ == '__main__':
    arg1 = something
    arg2 = something_else
    value_list = (a, b, c, d, e, .......)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    p = pool.starmap_async(worker, [(x, arg1, arg2) for x in value_list])
    pool.close()
    pool.join()
    for y in p.get():
        if y == correct_value:
            print(correct_value)
            break
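To illustrate what I am after, here is a rough sketch of the behaviour I would like, approximated with imap_unordered instead of starmap_async so the parent can break out early (worker, arg1, arg2 and the matching value are placeholders):

import multiprocessing
from functools import partial

def worker(x, arg1, arg2):
    # placeholder calculation: pretend arg2 is the value we are hunting for
    return x if x == arg2 else None

if __name__ == '__main__':
    arg1, arg2 = "something", 123456      # hypothetical extra arguments
    value_list = range(1000000)

    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        func = partial(worker, arg1=arg1, arg2=arg2)
        # imap_unordered yields each result as soon as its task finishes,
        # so the parent can stop on the first hit and terminate the rest
        for result in pool.imap_unordered(func, value_list):
            if result is not None:
                print(result)
                pool.terminate()          # stop the remaining workers
                break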
So I have Python code running with one very expensive function that gets executed on demand at times, but its result is not needed straight away (it can be delayed by a few cycles).
def heavy_function(arguments):
    return calc_obtained_from_arguments

def main():
    a = None
    if some_condition:
        a = heavy_function(x)
    else:
        do_something_with(a)
The thing is that whenever I calculate heavy_function, the rest of the program hangs. However, I need it to keep running with an empty a value, or better, to know that a is being processed separately and thus should not be accessed yet. How can I move heavy_function to a separate process and keep calling the main function all the time until heavy_function is done executing, then read the obtained a value and use it in the main function?
You could use a simple queue.
Put your heavy_function inside a separate process that idles as long as there is no input in the input queue. Use Queue.get(block=True) to do so. Put the result of the computation in another queue.
Run your normal process with the empty a value and check from time to time whether the output queue is empty. Maybe use while Queue.empty(): here.
If an item becomes available, because your heavy_function has finished, switch to a calculation with the value a taken from your output queue.
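A minimal sketch of that setup could look something like this (the particular computation, the number of main-loop cycles and the None shutdown sentinel are illustrative choices, not a definitive implementation):

import multiprocessing as mp
import time

def heavy_function(n):
    # stand-in for the expensive computation from the question
    return sum(x * x for x in range(n))

def heavy_worker(in_queue, out_queue):
    while True:
        args = in_queue.get(block=True)   # idle here until work (or a sentinel) arrives
        if args is None:                  # None signals the worker to shut down
            break
        out_queue.put(heavy_function(args))

def main():
    in_queue, out_queue = mp.Queue(), mp.Queue()
    worker = mp.Process(target=heavy_worker, args=(in_queue, out_queue))
    worker.start()

    a = None
    for cycle in range(20):
        if cycle == 0:                     # stand-in for some_condition
            in_queue.put(2000000)          # request the heavy result
        if a is None and not out_queue.empty():
            a = out_queue.get()            # result has arrived; a may be used from now on
        time.sleep(0.1)                    # pretend the main loop is doing other work

    in_queue.put(None)                     # shut the worker down
    worker.join()
    print(a)

if __name__ == '__main__':
    main()

The main loop never blocks on the heavy computation; it only polls out_queue.empty() once per cycle and keeps treating a as missing until the result shows up.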
I want to learn how to run a function in multiple threads using Python. In other words, I have a long list of arguments that I want to send to a function that might take some time to finish. I want my program to loop over the arguments and call the function in parallel (no need to wait until the function finishes for the first argument).
I found this example code from here:
import Queue
import threading
import urllib2

# called by each thread
def get_url(q, url):
    q.put(urllib2.urlopen(url).read())

theurls = ["http://google.com", "http://yahoo.com"]
q = Queue.Queue()

for u in theurls:
    t = threading.Thread(target=get_url, args=(q, u))
    t.daemon = True
    t.start()

s = q.get()
print s
My questions are:
1) I normally understand that I have to specify the number of threads that I want my program to run in parallel. There is no specific number of threads in the code above.
2) The number of threads is something that varies from device to device (it depends on the processor, memory, etc.). Since this code does not specify any number of threads, how does the program know the right number of threads to run concurrently?
The threads are created in the for loop. The loop body executes twice, since there are two elements in theurls, so you end up with two worker threads in the program, plus the main thread, for a total of three.
This also answers your second question: the code does not work out a "right" number of threads for the device; it simply starts one thread per URL.
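If you do want an explicit cap on how many threads run at once (say you had hundreds of URLs rather than two), one common option is a thread pool with a max_workers limit. A minimal sketch, written for Python 3, so urllib.request stands in for the urllib2 module of the quoted example:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

theurls = ["http://google.com", "http://yahoo.com"]

def get_url(url):
    return urllib.request.urlopen(url).read()

# max_workers caps how many threads run at the same time;
# any remaining URLs simply wait until a thread becomes free
with ThreadPoolExecutor(max_workers=4) as executor:
    for page in executor.map(get_url, theurls):
        print(len(page))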
I am a novice user of python multithreading/multiprocessing, so please bear with me.
I would like to solve the following problem and I need some help/suggestions in this regard.
Let me describe in brief:
I would like to start a python script which does something in the
beginning sequentially.
After the sequential part is over, I would like to start some jobs
in parallel.
Assume that there are four parallel jobs I want to start.
I would like to also start these jobs on some other machines using "lsf" on the computing cluster. My initial script is also running on an "lsf" machine.
The four jobs which I start on the four machines will perform two logical steps, A and B, one after the other.
When a job starts, it begins with logical step A and finishes it.
After every job (all 4 jobs) has finished step A, they should notify the first job which started them. In other words, the main job which started them is waiting for confirmation from these four jobs.
Once the main job receives confirmation from these four jobs, it should notify all four jobs to do logical step B.
Logical step B will automatically terminate the jobs after finishing the task.
The main job waits for all the jobs to finish and should later continue with the sequential part.
An example scenario would be:
Python script running on an “lsf” machine in the cluster starts four "tcl shells" on four “lsf” machines.
In each tcl shell, a script is sourced to do the logical step A.
Once the step A is done, somehow they should inform the python script which is waiting for the acknowledgement.
Once the acknowledgement is received from all four, the python script informs them to do logical step B.
Logical step B is also a script which is sourced in their tcl shell; this script will also close the tcl shell at the end.
Meanwhile, the python script is waiting for all four jobs to finish.
After all four jobs are finished, it should continue with the sequential part again and finish later on.
Here are my questions:
I am confused about whether I should use multithreading or multiprocessing. Which one suits this better?
In fact, what is the difference between these two? I read about them but wasn't able to reach a conclusion.
What is the Python GIL? I also read somewhere that at any one point in time only one thread will execute.
I need some explanation here. It gives me the impression that I can't use threads.
Any suggestions on how I could solve my problem systematically and in a more pythonic way?
I am looking for some verbal step by step explanation and some pointers to read on each step.
Once the concepts are clear, I would like to code it myself.
Thanks in advance.
In addition to roganjosh's answer, I would include some signaling to start the step B after A has finished:
import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue, proceed):
    print "Process {} has been created".format(process_number)
    print "Process {} has ended step A".format(process_number)
    sys.stdout.flush()
    queue.put((process_number, "done"))
    proceed.wait()  # wait for the signal to do the second part
    print "Process {} has ended step B".format(process_number)
    sys.stdout.flush()

def multiproc_master():
    queue = mp.Queue()
    proceed = mp.Event()
    # proceed must be passed to each process so they can wait on it
    processes = [mp.Process(target=func_A, args=(x, queue, proceed)) for x in range(4)]
    for p in processes:
        p.start()
    # block=True waits until there is something available
    results = [queue.get(block=True) for p in processes]
    proceed.set()  # set continue-flag
    for p in processes:  # wait for all to finish (also on Windows)
        p.join()
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs
1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.
2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time. In reality, both threads are sharing the same CPU resource and are constantly passing an object backwards and forwards (the GIL) to grant them individual access to the CPU. This is hopeless if you want to properly process things in parallel. (Note: In reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
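For what it's worth, a minimal sketch of that prompt-plus-timer idea, written for Python 3 (the maths question, the one-second tick and the Event used for shutdown are made up for illustration):

import threading
import time

elapsed = 0
stop_event = threading.Event()

def tick():
    # increments the counter once a second until the main thread signals it to stop
    global elapsed
    while not stop_event.is_set():
        time.sleep(1)
        elapsed += 1

timer = threading.Thread(target=tick, daemon=True)
timer.start()

answer = input("What is 7 * 8? ")
stop_event.set()
print("You answered {!r} after roughly {} seconds".format(answer, elapsed))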
3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that compute the sum of squares for a randomly chosen range. These processes do not have a shared GIL and therefore execute independently unlike multithreading. In this example, you can see that all processes start and end at slightly different times, but we can aggregate the results of the processes into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included in the code).
import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue):
    start = time.time()
    print "Process {} has started at {}".format(process_number, start)
    sys.stdout.flush()
    my_calc = sum([x**2 for x in xrange(random.randint(1000000, 3000000))])
    end = time.time()
    print "Process {} has ended at {}".format(process_number, end)
    sys.stdout.flush()
    queue.put((process_number, my_calc))

def multiproc_master():
    queue = mp.Queue()
    processes = [mp.Process(target=func_A, args=(x, queue)) for x in xrange(4)]
    for p in processes:
        p.start()
    # Unhash the below if you run on Linux (Windows and Linux treat multiprocessing
    # differently as Windows lacks os.fork())
    #for p in processes:
    #    p.join()
    results = [queue.get() for p in processes]
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs
I have a problem running multiple processes in python3.
My program does the following:
1. Takes entries from an sqlite database and passes them to an input_queue.
2. Creates multiple processes that take items off the input_queue, run them through a function and put the result on the output_queue.
3. Creates a thread that takes items off the output_queue and prints them (this thread is obviously started before the first 2 steps).
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code. I used a list of numbers as a substitute for the database entries, as it still behaves the same way. I have 300 items in the list and I would like it to process all 300, but at the moment it just processes 10 (the number of processes I have assigned).
#!/usr/bin/python3
from multiprocessing import Process, Queue
import multiprocessing
from threading import Thread

## This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self, out_queue):
        self.out_queue = out_queue

    def __call__(self, in_queue):
        data_entry = in_queue.get()
        result = data_entry * 2
        self.out_queue.put(result)

# Performs the multiprocessing
def perform_distributed_processing(dbList, threads, processor_factory, output_queue):
    input_queue = Queue()

    # Create the data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target=processor,
                            args=(input_queue,))
        data_proc.start()

    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)

    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)

    data_proc.join()
    output_queue.put(None)

if __name__ == '__main__':
    output_results = Queue()

    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)

    # Establish results collecting thread.
    results_process = Thread(target=output_results_reader, args=(output_results,))
    results_process.start()

    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]

    # Perform multi processing
    perform_distributed_processing(dbList, 10, Processor, output_results)

    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simple Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor

def Processor(data_entry):
    return data_entry*2

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)

if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import ProcessPoolExecutor, Future, as_completed

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = (executor.submit(processor_factory, db) for db in dbList)
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because they weren't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of the multiprocessing.Pool methods depending on your needs - if this is a batch job you can even use the synchronous multiprocessing.Pool.map() - only instead of pushing to an input queue, you need to write a generator that yields the input to the workers.
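A rough sketch of that Pool.map() variant might look like this (the module-level processor function stands in for your Processor class, since the pool needs a plain picklable callable, and the pool size of 8 is arbitrary):

from multiprocessing import Pool

def processor(data_entry):
    # same doubling as the Processor class in the question
    return data_entry * 2

if __name__ == '__main__':
    # substitute for the database entries, as in the question
    dbList = [i for i in range(300)]

    with Pool(processes=8) as pool:
        # map() splits dbList across the workers and blocks until
        # every entry has been processed, returning results in order
        results = pool.map(processor, dbList)

    for result in results:
        print(result)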