I'm trying to find out how multiprocessing works in Python.
The following example is what I made:
import requests
from multiprocessing import Process
import time

def f(name):
    print('hello', name)
    time.sleep(15)
    print('ended', name)

if __name__ == '__main__':
    urls = [
        'http://python-requests.org',
        'http://httpbin.org',
        'http://python-guide.org'
    ]
    for url in urls:
        p = Process(target=f, args=(url,))
        p.start()
        p.join()
    print("finished")
What I tried to simulate in f is a request to a URL that has a timeout of 15 seconds. What I expected was that all the requests would start at almost the same time and finish at the same time. But what actually happens is that they start one after the other, each waiting until the previous one is finished. So the result is:
hello http://python-requests.org
ended http://python-requests.org
hello http://httpbin.org
ended http://httpbin.org
hello http://python-guide.org
ended http://python-guide.org
So what actually happens? Why would one use the code above instead of just doing:
for url in urls:
    f(url)
The problem is your loop:
for url in urls:
    p = Process(target=f, args=(url,))
    p.start()
    p.join()
You start a process, then wait for it to complete, then start the next one ...
Instead, create your process list, start them, and wait for them:
pl = [Process(target=f, args=(url,)) for url in urls]
for p in pl:
    p.start()
for p in pl:
    p.join()
Note that in this case, using Process is probably overkill, since threads would do the job very well (there is no heavy Python computation involved, only system calls and networking).
To switch to threads, just use multiprocessing.dummy instead so your program structure remains the same.
import multiprocessing.dummy as multiprocessing
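For example, a minimal sketch of the threaded variant, reusing the question's f and URL list (only the import changes, since multiprocessing.dummy mirrors the multiprocessing API with threads):

import multiprocessing.dummy as multiprocessing  # thread-backed, same API
import time

def f(name):
    print('hello', name)
    time.sleep(15)
    print('ended', name)

if __name__ == '__main__':
    urls = [
        'http://python-requests.org',
        'http://httpbin.org',
        'http://python-guide.org'
    ]
    # create all workers first, then start them, then wait for them
    pl = [multiprocessing.Process(target=f, args=(url,)) for url in urls]
    for p in pl:
        p.start()
    for p in pl:
        p.join()
    print("finished")

All three 'hello' lines should appear almost immediately, and the 'ended' lines about 15 seconds later.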
You only spawn one process at a time. Thus the process (a single worker) takes the first input, runs f, sleeps for 15 seconds, quits f, and only then takes the second input. Cf. the docs.
You could try to map your function f over the inputs. In the example below, you spawn 2 processes (2 workers).
import multiprocessing as mp

if __name__ == '__main__':
    with mp.Pool(processes=2) as p:
        p.map(f, urls)
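For reference, a complete runnable sketch of this approach, assuming f and urls are the sleep-based function and URL list from the question:

import multiprocessing as mp
import time

def f(name):
    print('hello', name)
    time.sleep(15)
    print('ended', name)

if __name__ == '__main__':
    urls = [
        'http://python-requests.org',
        'http://httpbin.org',
        'http://python-guide.org'
    ]
    # two workers handle the three URLs; the third starts as soon as a worker is free
    with mp.Pool(processes=2) as p:
        p.map(f, urls)

With two workers, the first two URLs print 'hello' right away and the third starts once a worker frees up, roughly 15 seconds later.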
Related
How do I add tqdm to the multiprocessing for loop here? Namely, I want to wrap urls in tqdm():
jobs = []
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
for url in urls:
    job = pool.apply_async(worker, (url, q))
    jobs.append(job)
for job in jobs:
    job.get()
pool.close()
pool.join()
The suggested solution on GitHub is this:
pbar = tqdm(total=100)
def update(*a):
    pbar.update()
    # tqdm.write(str(a))

for i in range(pbar.total):
    pool.apply_async(myfunc, args=(i,), callback=update)
pool.close()
pool.join()
But my iterable is a list of URLs as opposed to a range like in the above. How do I translate the above solution to my for loop?
The easiest solution that is compatible with your current code is to just specify the callback argument to apply_async (and if there is a possibility of an exception in worker, then specify the error_callback argument too).
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    # So that the progress bar does not advance too quickly
    # for demo purposes:
    import time
    time.sleep(1)

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    def my_callback(result):
        pbar.update()

    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    with tqdm(total=len(urls)) as pbar:
        pool = Pool()
        jobs = [
            pool.apply_async(worker, (url,), callback=my_callback, error_callback=my_callback)
            for url in urls
        ]
        # You can delete the next two statements if you don't need
        # the value of job.get(), since the calls to
        # pool.close() and pool.join() will wait for all submitted
        # tasks to complete:
        for job in jobs:
            job.get()
        pool.close()
        pool.join()
Or instead of using apply_async, use imap (or imap_unordered if you do not care either about the results or the order of the results):
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # so that the progress bar does not advance too quickly
    return url

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    pool = Pool()
    results = list(tqdm(pool.imap(worker, urls), total=len(urls)))
    print(results)
    pool.close()
    pool.join()
Note
If you won't or can't use apply_async with a callback, then imap_unordered is to be preferred over imap, assuming you don't need the results returned in task-submission order, which imap is obliged to do. The potential problem with imap is that if for some reason the first task submitted to the pool were the last to complete, no results could be returned until that first task finished. By then all the other submitted tasks will have already completed, so your progress bar will not move at all and then suddenly jump from 0% to 100% as quickly as you can iterate the results.
Admittedly the above scenario is an extreme case not likely to occur often, but you would still like the progress bar to advance as tasks complete, regardless of the order of completion. For that, and for getting results back in task-submission order, apply_async with a callback is probably best. The only drawback of apply_async is that if you have a very large number of tasks to submit, they cannot be "chunked up" (see the chunksize argument to imap and imap_unordered) without your writing your own chunking logic.
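For completeness, a minimal sketch of the imap_unordered variant with an explicit chunksize, using the same dummy worker as the demos above; the progress bar advances in completion order rather than submission order:

from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # simulate some work for the demo
    return url

if __name__ == '__main__':
    urls = list('abcdefghijklmnopqrstuvwxyz')
    with Pool() as pool:
        # chunksize batches submissions; results are yielded as tasks finish
        results = list(tqdm(pool.imap_unordered(worker, urls, chunksize=4),
                            total=len(urls)))
    print(results)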
You can use Parallel and delayed from joblib together with tqdm in the following manner:
from multiprocessing import cpu_count
from joblib import Parallel, delayed
from tqdm import tqdm

def process_urls(urls, i):
    # define your function here
    pass
Call the function using:
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
Parallel(n_jobs=cpu_count(), prefer='processes')(delayed(process_urls)(urls, i) for i in tqdm(range(len(urls))))
I'd like to save time and use multiprocessing to make 10 get-requests. So far I have this:
import multiprocessing
import time
import requests

# get one text from the url
def get_one_request_text(url, multiprocessing_queue):
    response = requests.get(url)
    assert response.status_code == 200
    multiprocessing_queue.put(response.text)

# urls is a list of links
def get_many_request_texts(urls):
    q = multiprocessing.Queue()
    jobs = []
    result = []
    for url in urls:
        p = multiprocessing.Process(target=get_one_request_text, args=(url, q))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    for _ in jobs:
        result.append(q.get())
    return result

if __name__ == '__main__':
    # for testing purposes I use the same link
    url = 'https://www.imdb.com/name/nm0000138'
    urls = [url] * 10  # any number freezes my code, even 1

    t1 = time.perf_counter()
    texts = get_many_request_texts(urls)
    t2 = time.perf_counter()
    print(f"Soups: {len(texts)} Execution time: = {round(t2 - t1, 2)} {texts}")
I expect my script to produce ten response.text values in a list, but for some reason my program just freezes and I don't get anything. Even when I try to get one response.text, it freezes.
What am I doing wrong, and how can I get my response texts using multiprocessing to save time?
First, this is probably a job better suited for multithreading rather than multiprocessing. Second, regardless of whether you are using multithreading or multiprocessing, this would be more easily accomplished if you used a thread or process pool.
That said, given you are doing what you are doing, the problem is that you must never attempt to read from the queue that your processes have written to after you have joined those processes. That is, those processes must still be running for your main process to retrieve those messages. So you need to reverse the order of operations:
def get_many_request_texts(urls):
    q = multiprocessing.Queue()
    jobs = []
    result = []
    for url in urls:
        p = multiprocessing.Process(target=get_one_request_text, args=(url, q))
        jobs.append(p)
        p.start()
    for _ in jobs:
        result.append(q.get())
    for p in jobs:
        p.join()
    return result
See multiprocessing.Queue documentation:
Warning As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
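A minimal sketch of the managed-queue variant (the same structure as above, but with the queue coming from a Manager, so draining it after join() no longer risks a deadlock):

import multiprocessing
import requests

def get_one_request_text(url, q):
    response = requests.get(url)
    assert response.status_code == 200
    q.put(response.text)

def get_many_request_texts(urls):
    with multiprocessing.Manager() as manager:
        q = manager.Queue()  # proxy queue living in the manager process
        jobs = []
        for url in urls:
            p = multiprocessing.Process(target=get_one_request_text, args=(url, q))
            jobs.append(p)
            p.start()
        for p in jobs:
            p.join()
        # safe: the items are held by the manager, not buffered in a pipe
        return [q.get() for _ in jobs]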
I've written the following code, which runs a function that performs a stochastic simulation of a series of chemical reactions:
v = range(1, 51)

def parallelfunc(*v):
    gillespie_tau_leaping(start_state, LHS, stoch_rate, state_change_array)

def info(title):
    print(title)
    print('module name:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())

if __name__ == '__main__':
    info('main line')
    start = datetime.utcnow()
    p = Process(target=parallelfunc, args=(v))
    p.start()
    p.join()
    end = datetime.utcnow()
    sim_time = end - start
    print(f"Simulation utc time:\n{sim_time}")
I'm using the Process class from the multiprocessing library and am trying to run gillespie_tau_leaping 50 times.
Only I'm not sure if it's working. gillespie_tau_leaping prints out a number of values to the terminal, but these values are only printed once; I'd expect them to be printed 50 times.
I tried using the getpid etc. calls and they return the following to the terminal:
main line
module name: __main__
parent process: 6188
process id: 27920
How can I tell if my code has worked, and how can I get it to print the values from gillespie_tau_leaping 50 times to the terminal?
Cheers
Your code is running just one process: the call to Process spawns a new process, but you are doing it only once (not in a loop).
I would suggest you use a multiprocessing Pool.
Your code can be something like this:
from multiprocessing import Pool

def parallelfunc(*args):
    do_something()

def main():
    # create a list of lists of args for the function invocations
    func_args = [['arg1call1', 'arg2call1', 'arg3call1'], ['arg1call2', 'arg2call2', 'arg3call2']]
    with Pool() as p:
        results = p.map(parallelfunc, func_args)
        # do something with results, which is a list of results

if __name__ == '__main__':
    main()
A multiprocessing Pool by default creates the same number of processes as you have CPU cores and manages the pool of workers until the end of processing, taking care of all the inter-process communication.
This is really handy because synchronizing processes can be hard.
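Applied to your case, a sketch could look like this (run_once here is a hypothetical wrapper; replace its body with your call to gillespie_tau_leaping and its arguments):

from multiprocessing import Pool

def run_once(i):
    # placeholder for gillespie_tau_leaping(start_state, LHS, stoch_rate, state_change_array)
    return i * i

if __name__ == '__main__':
    with Pool() as pool:                         # one worker per CPU core by default
        results = pool.map(run_once, range(50))  # 50 independent runs
    print(len(results), 'runs completed')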
Hope this helps
I have problems with Python multiprocessing.
Python version 3.6.6, using the Spyder IDE on Windows 7.
1.
The queue is not being populated: every time I try to read it, it's empty. Somewhere I read that I have to get() it before the process join(), but that did not solve it.
from multiprocessing import Process, Queue

# define an example function
def fnc(i, output):
    output.put(i)

if __name__ == '__main__':
    # Define an output queue
    output = Queue()

    # Setup a list of processes that we want to run
    processes = [Process(target=fnc, args=(i, output)) for i in range(4)]
    print('created')

    # Run processes
    for p in processes:
        p.start()
    print('started')

    # Exit the completed processes
    for p in processes:
        p.join()

    print(output.empty())
    print('finished')
>>>created
>>>started
>>>True
>>>finished
I would expect output to not be empty.
If I change it from .join() to
for p in processes:
    print(output.get())
    #p.join()
it freezes.
2.
The next problem I have is with pool.map(): it freezes, and it has no chance of exceeding the memory limit. I don't even know how to debug such a simple piece of code.
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    print('Pool created')
    # print "[0, 1, 4,..., 81]"
    print(pool.map(f, range(10)))  # it freezes here
Hope it's not a big deal to have two questions in one topic.
Apparently the problem is Spyder's IPython console. When I run both from cmd, they execute properly.
Solution
For debugging in Spyder, add .dummy to the multiprocessing import:
from multiprocessing.dummy import Process,Queue
It will not be executed by multiple processors, but you will get results and can actually see the output. When debugging is done, simply delete .dummy, place the code in another file, import it, and call it, for example, as a function:
multiprocessing_my.py
from multiprocessing import Process, Queue

# define an example function
def fnc(i, output):
    output.put(i)
    print(i)

def test():
    # Define an output queue
    output = Queue()

    # Setup a list of processes that we want to run
    processes = [Process(target=fnc, args=(i, output)) for i in range(4)]
    print('created')

    # Run processes
    for p in processes:
        p.start()
    print('started')

    # Exit the completed processes
    for p in processes:
        p.join()

    print(output.empty())
    print('finished')

    # Get process results from the output queue
    results = [output.get() for p in processes]
    print('get results')
    print(results)
test_mp.py
executed by selecting code and pressing ctrl+Enter
import multiprocessing_my
multiprocessing_my.test()
...
In[9]: test()
created
0
1
2
3
started
False
finished
get results
[0, 1, 2, 3]
I'm trying to implement multiprocessing and struggling to get where I need to be.
So, some background: I have previously done multiprocessing with Celery, so I am used to being able to send jobs to a worker, poll when they're done, and get the results of a job even while other jobs are running. I'm trying to relate this to multiprocessing. Here is what I have so far, dug up from various sites I have found...
import urllib2
import time
from multiprocessing.dummy import Pool as ThreadPool
import random
def openurl(url):
    time.sleep(random.randrange(1,10))
    print url
    return urllib2.urlopen(url)
urls = [
'http://www.python.org',
'http://www.python.org/about/',
'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
'http://www.python.org/doc/',
'http://www.python.org/download/',
'http://www.python.org/getit/',
'http://www.python.org/community/',
'https://wiki.python.org/moin/',
'http://planet.python.org/',
'https://wiki.python.org/moin/LocalUserGroups',
'http://www.python.org/psf/',
'http://docs.python.org/devguide/',
'http://www.python.org/community/awards/'
# etc..
]
pool = ThreadPool(20)
results = pool.map(openurl, urls)
pool.close()
pool.join()
print 'now what'
So I am kicking off the openurl function on my urls, but if I break at "print 'now what'", it does not get there until all my jobs are complete.
How can I 'poll' my threads and return the results as they come in?
Thanks!
pool.map distributes the iterable's elements over a pool of workers and aggregates the results once they are all ready.
Moreover, pool.close and pool.join instruct the Pool to wait until all the tasks are done.
If you want to handle the results as they come in, you have to use pool.apply_async with a callback. Alternatively, you can collect the AsyncResult objects returned by pool.apply_async and iterate over them to see when each one is ready, but the whole logic would be quite cumbersome.
from multiprocessing.pool import ThreadPool

pool = ThreadPool(20)
tasks = []

def callback(result):
    # handle the result of your function here
    print result

for url in urls:
    pool.apply_async(openurl, args=[url], callback=callback)

pool.close()
pool.join()
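If you prefer not to wire up callbacks, imap_unordered yields each result as soon as its task completes, which gives you the "results as they come in" behaviour with very little code; a minimal sketch with a self-contained stand-in for openurl:

import random
import time
from multiprocessing.pool import ThreadPool

def openurl(url):
    # stand-in for the real fetch: sleep a random amount, then hand back the URL
    time.sleep(random.randrange(1, 10))
    return url

urls = ['http://www.python.org', 'http://www.python.org/about/', 'http://www.python.org/doc/']

pool = ThreadPool(20)
# results arrive in completion order, as soon as each task is done
for result in pool.imap_unordered(openurl, urls):
    print(result)
pool.close()
pool.join()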