I'm only a beginner, so maybe the answer is obvious to you. I'm working on a simple program and I would like to use multiprocessing in it. I have a function (let's call it f) which makes requests to some URL. I would like to give the user the opportunity to choose the number of 'threads' (the number of copies of the function running simultaneously?).
I have tried making a for loop with a _ variable over a range the user can adjust with input(), but I don't know how to create the processes (I know how to do it 'manually' but not 'automatically'). Could you please help me?
if __name__ == '__main__':
    p1 = Process(target=f1)
You can use Pool to decide how many processes run at the same time, and you don't have to build a loop for this.
from multiprocessing import Pool

import requests

def f(url):
    print('url:', url)
    data = requests.get(url).json()
    result = data['args']
    return result

if __name__ == '__main__':
    urls = [
        'https://httpbin.org/get?x=1',
        'https://httpbin.org/get?x=2',
        'https://httpbin.org/get?x=3',
        'https://httpbin.org/get?x=4',
        'https://httpbin.org/get?x=5',
    ]

    number_of_processes = 2
    with Pool(number_of_processes) as p:
        results = p.map(f, urls)

    print(results)
It will start 2 processes with the first two urls. When one of them finishes its job, the pool hands that process the next url.
You can see a similar Pool example in the documentation: multiprocessing
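To connect this back to letting the user choose the number of 'threads': a minimal sketch (the prompt string is my own placeholder) could read the count with input() and hand it straight to Pool:

from multiprocessing import Pool

import requests

def f(url):
    # same worker as above: fetch the URL and return the echoed query args
    return requests.get(url).json()['args']

if __name__ == '__main__':
    urls = [
        'https://httpbin.org/get?x=1',
        'https://httpbin.org/get?x=2',
        'https://httpbin.org/get?x=3',
        'https://httpbin.org/get?x=4',
        'https://httpbin.org/get?x=5',
    ]

    # let the user pick how many worker processes run at the same time
    number_of_processes = int(input('Number of processes: '))

    with Pool(number_of_processes) as p:
        results = p.map(f, urls)

    print(results)

Pool takes care of distributing the urls over however many worker processes the user asked for, so there is no need to create Process objects in a loop yourself.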
I want to input text to Python and process it in parallel. For that purpose I use multiprocessing.Pool. The problem is that sometimes, but not always, I have to input text multiple times before anything is processed.
This is a minimal version of my code to reproduce the problem:
import multiprocessing as mp
import time

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

if __name__ == '__main__':
    p = None
    while True:
        message = input('In: ')
        if not p:
            p = mp.Pool()
        p.apply_async(do_something, (message,))
What happens is that I have to input text multiple times before I get a result, no matter how long I wait after I have inputted something the first time. (As stated above, that does not happen every time.)
python3 test.py
In: a
In: a
In: a
In: Out: a
Out: a
Out: a
If I create the pool before the while loop or if I add time.sleep(1) after creating the pool, it seems to work every time. Note: I do not want to create the pool before I get an input.
Does anyone have an explanation for this behavior?
I'm running Windows 10 with Python 3.4.2.
EDIT: Same behavior with Python 3.5.1
EDIT:
An even simpler example with Pool and also ProcessPoolExecutor. I think the problem is the call to input() right after applying/submitting, which only seems to be a problem the first time something is applied/submitted.
import concurrent.futures
import multiprocessing as mp
import time

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

# ProcessPoolExecutor
# if __name__ == '__main__':
#     with concurrent.futures.ProcessPoolExecutor() as executor:
#         executor.submit(do_something, 'a')
#         input('In:')
#     print('done')

# Pool
if __name__ == '__main__':
    p = mp.Pool()
    p.apply_async(do_something, ('a',))
    input('In:')
    p.close()
    p.join()
    print('done')
Your code works when I tried it on my Mac.
In Python 3, it might help to explicitly declare how many processes will be in your pool (i.e. the number of simultaneous worker processes).
Try using p = mp.Pool(1):
import multiprocessing as mp
import time

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

if __name__ == '__main__':
    p = None
    while True:
        message = input('In: ')
        if not p:
            p = mp.Pool(1)
        p.apply_async(do_something, (message,))
I could not reproduce it on Windows 7, but there are a few long shots worth mentioning for your issue.
Your AV might be interfering with the newly spawned processes; try temporarily disabling it and see if the issue is still present.
Windows 10 might have a different IO caching algorithm; try inputting larger strings. If that works, it means the OS tries to be smart and only sends data once a certain amount has piled up.
As Windows has no fork() primitive, you might be seeing the delay caused by the spawn start method.
Python 3 added a new pool of workers called ProcessPoolExecutor; I'd recommend you use it regardless of which issue you are suffering from.
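For illustration only, here is a rough sketch of how the question's input loop might look with ProcessPoolExecutor instead of Pool (this is my own adaptation, not tested on Windows 10):

import concurrent.futures

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

if __name__ == '__main__':
    executor = None
    while True:
        message = input('In: ')
        # create the executor lazily, mirroring the original code's intent
        if executor is None:
            executor = concurrent.futures.ProcessPoolExecutor()
        executor.submit(do_something, message)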
So, I'm trying to speed up one routine by using the multiprocessing module in Python. I want to be able to read several .csv files by splitting the job among several cores; for that I have:
import numpy as np

def csvreader(string):
    time, signal = np.genfromtxt(string, delimiter=',', unpack=True)
    return time, signal
Then I call this function by saying:
import multiprocessing

if __name__ == '__main__':
    for i in range(0, 2):
        p = multiprocessing.Process(target=CSVReader.csvreader, args=(string_array[i],))
        p.start()
The thing is that this doesn't store any output. I have read all the forums online and seen that there might be a way with multiprocessing.Queue, but I don't understand it quite well.
Is there any simple and straightforward method?
Your best bets are multiprocessing.Queue or multiprocessing.Pipe, which are designed exactly for this problem. They allow you to send data between processes in a safe and easy way.
If you'd like to return the output of your csvreader function, you should pass another argument to it: the multiprocessing.Queue through which the data will be sent back to the main process. Instead of returning the values, place them on the queue, and the main process will retrieve them at some point later. If they're not ready when the process tries to get them, by default it will just block (wait) until they are available.
Your function would now look like this:
def csvreader(string, q):
    q.put(np.genfromtxt(string, delimiter=',', unpack=True))
The main routine would be:
if __name__ == '__main__':
    q = multiprocessing.Queue()

    for i in range(2):
        p = multiprocessing.Process(target=csvreader, args=(string_array[i], q))
        p.start()

    # Do anything else you need in here

    time = np.empty(2, dtype='object')
    signal = np.empty(2, dtype='object')
    for i in range(2):
        time[i], signal[i] = q.get()  # Returns output or blocks until ready

    # Process my output
Note that you have to call Queue.get() for each item you want to return.
Have a look at the documentation on the multiprocessing module for more examples and information.
Using the example from the introduction to the documentation:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(2)
    results = pool.map(CSVReader.csvreader, string_array[:2])
    print(results)
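If you then want the time and signal arrays stored separately rather than as a list of (time, signal) pairs, a possible follow-up to that example (still reusing the question's CSVReader and string_array names) is:

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(2)
    # each element of results is the (time, signal) pair returned by csvreader
    results = pool.map(CSVReader.csvreader, string_array[:2])
    # split the pairs into a tuple of time arrays and a tuple of signal arrays
    times, signals = zip(*results)
    print(times, signals)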
I have a problem running multiple processes in Python 3.
My program does the following:
1. Takes entries from an SQLite database and passes them to an input_queue
2. Creates multiple processes that take items off the input_queue, run them through a function and put the results on the output_queue.
3. Creates a thread that takes items off the output_queue and prints them (this thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code. I used a list of numbers as a substitute for the database entries, as it still behaves the same way. I have 300 items in the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned).
#!/usr/bin/python3
from multiprocessing import Process, Queue
import multiprocessing
from threading import Thread

## This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self, out_queue):
        self.out_queue = out_queue

    def __call__(self, in_queue):
        data_entry = in_queue.get()
        result = data_entry * 2
        self.out_queue.put(result)

# Performs the multiprocessing
def perform_distributed_processing(dbList, threads, processor_factory, output_queue):
    input_queue = Queue()

    # Create the Data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target=processor,
                            args=(input_queue,))
        data_proc.start()

    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)

    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)

    data_proc.join()

    output_queue.put(None)

if __name__ == '__main__':
    output_results = Queue()

    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)

    # Establish results collecting thread.
    results_process = Thread(target=output_results_reader, args=(output_results,))
    results_process.start()

    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]

    # Perform multi processing
    perform_distributed_processing(dbList, 10, Processor, output_results)

    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simple Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor

def Processor(data_entry):
    return data_entry * 2

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)
if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]

    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import Future, ProcessPoolExecutor, as_completed

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = [executor.submit(processor_factory, db) for db in dbList]
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because it wasn't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of the multiprocessing.Pool methods depending on your needs - if this is a batch job you can even use the synchronous multiprocessing.Pool.map() - only instead of pushing to an input queue, you need to write a generator that yields the input to the workers.
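As a rough sketch of that last suggestion (the generator and the doubling function are placeholders of mine, standing in for the database reader and the real per-entry work):

import multiprocessing

def process_entry(data_entry):
    # stand-in for the real per-entry work
    return data_entry * 2

def read_entries():
    # stand-in generator for the database rows; yields the same
    # substitute numbers as the example in the question
    for i in range(300):
        yield i

if __name__ == '__main__':
    with multiprocessing.Pool(processes=8) as pool:
        # imap pulls items from the generator lazily, replacing the input queue
        for result in pool.imap(process_entry, read_entries()):
            print(result)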
A question from me again, as I'm having some issues again. I hope to find someone who is a lot smarter and knows this. :D
I'm now having an issue with threading: when opening urls in threads over a range of (1, 1000), I would love to see all the different urls. But when I run the code I get a lot of duplicate values (probably because the crawls go that fast). Anyway, this is my code: I try to see which thread it is, but I get duplicates.
import threading
import urllib2
import time
import collections

results2 = []

def crawl():
    var_Number = thread.getName().split("-")[1]
    try:
        data = urllib2.urlopen("http://www.waarmaarraar.nl").read()
        results2.append(var_Number)
    except:
        crawl()

threads = []
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)

# to wait until all three functions are finished
print "Waiting..."

for thread in threads:
    thread.join()

print "Complete."

# print results (All numbers, should be 1/1000)
results2.sort()
print results2

# print doubles (should be [])
print [x for x, y in collections.Counter(results2).items() if y > 1]
However, if I add time.sleep(0.1) directly under the xrange line, those doubles will not occur. Although this does slow my programm down a lot. Anyone knows a better way to fix this?
There is a recursive call to crawl() in the exception handler, so the same thread runs the function several times if there is an error. Thus results2 may contain the same var_Number several times. If you add time.sleep(.1) (a pause), your script consumes fewer resources (e.g., open file descriptors and running threads) and the requests to the remote server are more likely to succeed.
Also, default thread names may repeat. If a thread has exited, another thread may get the same name (e.g., if the implementation uses the .ident attribute to generate the name).
Notes:
use PEP 8 naming conventions. You could use the pep8, pyflakes and epylint command-line tools to check your code automatically
you don't need 1000 threads to fetch 1000 urls (see my comment to your previous question and the sketch after these notes)
it is not polite to send requests to the same site without a pause.
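To illustrate the second note, here is a rough sketch that fetches many urls with a small, fixed number of threads; it uses Python 3's concurrent.futures rather than the urllib2 code from the question, and the worker count and url list are made up for the example:

from concurrent.futures import ThreadPoolExecutor

import urllib.request

def fetch(url):
    # download the page and report how many bytes came back
    with urllib.request.urlopen(url) as response:
        return url, len(response.read())

if __name__ == '__main__':
    urls = ['http://www.waarmaarraar.nl'] * 20  # placeholder list of urls

    # 5 worker threads service all 20 urls; no need for one thread per url
    with ThreadPoolExecutor(max_workers=5) as executor:
        for url, size in executor.map(fetch, urls):
            print(url, size)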
According to the documentation on Thread.getName(), this is correct behavior.
If you want a unique name for each of your threads, you have to set it using the name attribute.
Based on what you expect in the end, replacing
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)
with
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.name = n
    thread.start()
    threads.append(thread)
and var_Number = thread.getName().split("-")[1] with var_Number = thread.name should help you.
EDIT
After some testing, a user-set custom name can also be reused by another thread, so the only way to pass n reliably is to use the args or kwargs of threading.Thread().
This behavior makes sense: if we need to use some piece of data in a Thread, pass it properly; don't try to put it where it doesn't belong.
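A minimal sketch of passing n through args instead of reading it back from the thread name (adapted from the question's loop; the changed crawl signature is mine, and the urllib2 fetch is left out to keep it short):

import threading

results2 = []

def crawl(number):
    # the thread's number now arrives as a plain argument
    results2.append(number)

threads = []
for n in range(1, 1000):
    t = threading.Thread(target=crawl, args=(n,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

print(sorted(results2))  # every n from 1 to 999, each exactly once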
I'm sorry if this is too simple for some people, but I still don't get the trick with python's multiprocessing. I've read
http://docs.python.org/dev/library/multiprocessing
http://pymotw.com/2/multiprocessing/basics.html
and many other tutorials and examples that google gives me... many of them from here too.
Well, my situation is that I have to compute many numpy matrices and I need to store them in a single numpy matrix afterwards. Let's say I want to use 20 cores (or that I can use 20 cores), but I haven't managed to successfully use the Pool resource, since it keeps the processes alive until the pool "dies". So I thought of doing something like this:
from multiprocessing import Process, Queue
import numpy as np

def f(q, i):
    q.put(np.zeros((4, 4)))

if __name__ == '__main__':
    q = Queue()

    for i in range(30):
        p = Process(target=f, args=(q, i))
        p.start()
        p.join()

    result = q.get()
    while q.empty() == False:
        result += q.get()

    print result
But then it looks like the processes don't run in parallel; they run sequentially (please correct me if I'm wrong), and I don't know whether they die after finishing their computation (so that, for more than 20 processes, the ones that did their part would leave the core free for another process). Plus, for a very large number of matrices (let's say 100,000), which may themselves be really big, storing all of them in a queue would use a lot of memory, rendering the code useless. The idea is to add every result into the final result on each iteration, as if using a lock (and its acquire() and release() methods), but if this code isn't really parallel, the lock is useless too...
I hope somebody may help me.
Thanks in advance!
You are correct, they are executing sequentially in your example.
p.join() causes the current thread to block until that process has finished executing. You'll either want to join your processes individually outside of your for loop (e.g., by storing them in a list and then iterating over it, as in the sketch below) or use something like multiprocessing.Pool and apply_async with a callback. That will also let you add each value to your result directly rather than keeping the objects around.
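For the first option, a minimal sketch (adapted from the question's code; the only real change is starting all processes before joining any of them) could be:

from multiprocessing import Process, Queue
import numpy as np

def f(q, i):
    q.put(np.zeros((4, 4)))

if __name__ == '__main__':
    q = Queue()
    processes = []

    # start all workers first so they actually run in parallel
    for i in range(30):
        p = Process(target=f, args=(q, i))
        p.start()
        processes.append(p)

    # drain the queue before joining, then wait for the workers to exit
    result = np.zeros((4, 4))
    for _ in range(30):
        result += q.get()
    for p in processes:
        p.join()

    print(result)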
For the Pool approach with apply_async and a callback, for example:
from multiprocessing import Pool
import numpy as np

def f(i):
    return i * np.identity(4)

if __name__ == '__main__':
    p = Pool(5)
    result = np.zeros((4, 4))

    def adder(value):
        global result
        result += value

    for i in range(30):
        p.apply_async(f, args=(i,), callback=adder)
    p.close()
    p.join()

    print result
Closing and then joining the pool at the end ensures that the pool's processes have completed and the result object is finished being computed. You could also investigate using Pool.imap as a solution to your problem. That particular solution would look something like this:
if __name__ == '__main__':
    p = Pool(5)
    result = np.zeros((4, 4))

    im = p.imap_unordered(f, range(30), chunksize=5)
    for x in im:
        result += x

    print result
This is cleaner for your specific situation, but may not be for whatever you are ultimately trying to do.
As to storing all of your varied results: if I understand your question, you can just add them into a result in the callback method (as above) or item-at-a-time using imap/imap_unordered (which still stores the results, but you clear them as the total builds). Then nothing needs to be stored for longer than it takes to add it to the result.