Python's multiprocessing.Queue get() method

I am writing my first Python 2.7 multiprocessing program (woohoo).
I am using multiprocessing Queues to retrieve data from my subprocesses. My question is about the Queue's .get() method. Is there any guarantee that I will get the full object (no matter how large it is) when I call the method? If not, how will it be split?
The doc says: “Remove and return an item from the queue. […]”. But I am not sure if this means that I might end up getting chunks of an object or if it is rebuilt by the method's internals.
Here is some sample code: (stats might get pretty large)
p = Process(target=process_analyze_db, args=(db_names[j], j, queue_stats))
processes.append(p)
p.start()

while 1:
    running = any(p.is_alive() for p in processes)
    while not queue_stats.empty():  # Is this loop necessary?
        data = queue_stats.get_nowait()
        results[data[0]] = data[1]
    if not running:
        break

# In the process
def process_analyze_db(db_name, profile_nr, queue_stats):
    # Do lots of stuff
    queue_stats.put([profile_nr, stats])
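For what it's worth, put() pickles the whole object and get() hands it back reassembled; items are never split across calls, so the empty()/get_nowait() polling loop above is not strictly necessary. A minimal sketch of a simpler drain pattern, reusing the names from the snippet (the analysis itself is replaced by a placeholder):

from multiprocessing import Process, Queue

def process_analyze_db(db_name, profile_nr, queue_stats):
    stats = {"db": db_name}                       # placeholder for the real analysis
    queue_stats.put([profile_nr, stats])          # the whole list is pickled and sent as one item

if __name__ == '__main__':
    db_names = ['db_a', 'db_b', 'db_c']           # hypothetical input
    queue_stats = Queue()
    processes = []
    for j, name in enumerate(db_names):
        p = Process(target=process_analyze_db, args=(name, j, queue_stats))
        processes.append(p)
        p.start()

    results = {}
    for _ in processes:                           # one get() per process; each call blocks
        profile_nr, stats = queue_stats.get()     # until a complete item is available
        results[profile_nr] = stats

    for p in processes:                           # join after draining the queue, so the
        p.join()                                  # queue's feeder threads never block
    print(results)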

Related

Running a list of custom funcs in parallel, outside of main

I have the following line being called from within a particular area of my script:
data = [extract_func(domain, response) for domain, response, extract_func in responses]
Basically I collected a bunch of webpage responses asynchronously using aiohttp in the variable responses which is nice and fast and so we've already got that. Problem is that the parsing of those responses (using Beautiful Soup) is not asynchronous so I have to parallelize that some other way.
(Each extract_func is technically one of many different various extraction functions that were pre-packaged with the response data so that I call the right Beautiful Soup parsing code for each page. The domain is passed in too for other packaging purposes.)
Anyways I don't know how I'd run all these extraction functions at the same time and then collect the results. I tried looking into multiprocessing but it doesn't seem to apply here / requires that you launch it directly from main, whereas this collection process of mine is taking place from within another function.
I tried this for example (where each extract_function, at the end, adds the returned result to some global list - then here I try):
global extract_shared
extract_shared = []
proc = []
for domain, response, extract_func in responses:
    p = Process(target=extract_func, args=(domain, response))
    p.start()
    proc.append(p)
for p in proc:
    p.join()
data = extract_shared
However this still seems to move along super slowly, and I end up with no data anyway so my code is still wrong.
Is there a better way I should be going about this?
Is this correct?
pool = multiprocessing.Pool(multiprocessing.cpu_count())
result_objects = [pool.apply_async(extract_func, args=(domain, response)) for domain, response, extract_func in responses]
data = [r.get() for r in result_objects]
pool.close()
pool.join()
return data
The problem is that the extract_shared list, as you have defined it, exists as a separate instance in each process's address space. You need a shared-memory implementation of extract_shared so that every process appends to the same list. If I knew what type of data was being appended, I might be able to recommend which flavor of multiprocessing.Array to use. Alternatively, a managed list created by a multiprocessing.SyncManager, which functions just like a regular list, might be simpler to use, although it carries a bit more overhead.
Using a multiprocessing pool is the way to go. If your worker functions are not returning meaningful results, there is no need to save the AsyncResult instances returned by apply_async. Simply calling pool.close() followed by pool.join() is sufficient to wait for all outstanding submitted tasks to complete.
import multiprocessing

def init_pool(the_list):
    global extract_shared
    extract_shared = the_list

# required for Windows:
if __name__ == '__main__':
    # compute the pool size required, but no greater than the number of CPU cores we have:
    n_processes = min(len(responses), multiprocessing.cpu_count())
    # create a managed list:
    extract_shared = multiprocessing.Manager().list()
    # Initialize each pool process's global variable extract_shared with our managed list
    # (the managed list could also be passed as another argument to the worker function instead)
    pool = multiprocessing.Pool(n_processes, initializer=init_pool, initargs=(extract_shared,))
    for domain, response, extract_func in responses:
        pool.apply_async(extract_func, args=(domain, response))
    # wait for tasks to complete
    pool.close()
    pool.join()
    # results are in extract_shared
    print(extract_shared)
Update
It is easier just to have the worker functions return their results and let the main process do the appending. You had essentially the correct code for that, except that I would cap the pool size at the number of tasks being submitted when that is smaller than the number of CPU cores you have.
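A minimal sketch of that variant, reusing the names from the question (responses, domain, response and extract_func are assumed to exist as described there):

import multiprocessing

if __name__ == '__main__':
    # cap the pool size at the number of tasks if that is smaller than the CPU count
    n_processes = min(len(responses), multiprocessing.cpu_count())
    with multiprocessing.Pool(n_processes) as pool:
        result_objects = [pool.apply_async(extract_func, args=(domain, response))
                          for domain, response, extract_func in responses]
        # get() blocks until each task's return value is available
        data = [r.get() for r in result_objects]
    print(data)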

Python multithreading but using object instance

I hope you can help me.
I have a msgList, containing msg objects, each one having the pos and content attributes.
Then I have a function posClassify that creates a SentimentClassifier object, iterates through this msgList, and does msgList[i].pos = clf.predict(msgList[i].content), where clf is an instance of SentimentClassifier.
def posClassify(msgList):
    clf = SentimentClassifier()
    for i in tqdm(range(len(msgList))):
        if msgList[i].content.find("omitted") == -1:
            msgList[i].pos = clf.predict(msgList[i].content)
And what I wanted is to compute this using multiprocessing. I have read that you create a pool and call a function with a list of the arguments you want to pass to it, and that's it. I imagine that such a function must be something like saving an image or working on separate memory spaces, unlike mine, where I want to modify the same msg objects and also have to use that SentimentClassifier object (which takes about 10 seconds or so to initialize).
My thought was to create cpu_cores-1 processes, each one using its own instance of SentimentClassifier, and have each process consume the msg list with its own classifier, but I can't work out how to approach this. I also thought of creating threads with binary semaphores, each one calling its own classifier and then waiting on the semaphore to update the pos value in the msg object, but I still can't figure it out.
You can use ProcessPoolExecutor from the concurrent.futures module in Python.
The ProcessPoolExecutor is
An Executor subclass that executes calls asynchronously using a pool
of at most max_workers processes. If max_workers is None or not given,
it will default to the number of processors on the machine
You can find more at Python docs.
Here is the sample code for achieving the concurrency, assuming that each msgList[i] is independent of msgList[j] when i != j:
from concurrent import futures

def posClassify(msg, idx, clf):
    return idx, clf.predict(msg.content)

def classify(msgList):
    clf = SentimentClassifier()
    calls = []
    executor = futures.ProcessPoolExecutor(max_workers=4)
    for i in tqdm(range(len(msgList))):
        if msgList[i].content.find("omitted") == -1:
            call = executor.submit(posClassify, msgList[i], i, clf)
            calls.append(call)
    # wait for all processes to finish
    executor.shutdown()
    # assign the result of individual calls to msgList[i].pos
    for call in calls:
        result = call.result()
        msgList[result[0]].pos = result[1]
In order to execute the code, just call the classify(msgList) function.
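Since the question notes that SentimentClassifier takes roughly ten seconds to construct, a variant worth considering builds one classifier per worker process via an initializer instead of pickling clf with every task. A rough sketch under that assumption (requires Python 3.7+ for the initializer argument; SentimentClassifier and msgList come from the question):

from concurrent import futures

_clf = None  # one classifier per worker process

def _init_worker():
    global _clf
    _clf = SentimentClassifier()   # built once per process, not once per task

def _classify_one(idx, content):
    return idx, _clf.predict(content)

def classify(msgList):
    with futures.ProcessPoolExecutor(max_workers=4, initializer=_init_worker) as executor:
        calls = [executor.submit(_classify_one, i, m.content)
                 for i, m in enumerate(msgList) if "omitted" not in m.content]
        for call in futures.as_completed(calls):
            idx, pos = call.result()
            msgList[idx].pos = pos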

How do I retrieve output from Multiprocessing in Python?

So, I'm trying to speed up one routine by using the Multiprocessing module in Python. I want to be able to read several .csv files by splitting the job among several cores, for that I have:
def csvreader(string):
    import numpy as np
    time, signal = np.genfromtxt(string, delimiter=',', unpack=True)
    return time, signal
Then I call this function by saying:
if __name__ == '__main__':
    for i in range(0, 2):
        p = multiprocessing.Process(target=CSVReader.csvreader, args=(string_array[i],))
        p.start()
The thing is that this doesn't store any output. I have read all the forums online and seen that there might be a way with multiprocessing.Queue, but I don't understand it quite well.
Is there any simple and straightforward method?
Your best bets are multiprocessing.Queue or multiprocessing.Pipe, which are designed exactly for this problem. They allow you to send data between processes in a safe and easy way.
If you'd like to return the output of your csvreader function, you should pass another argument to it: the multiprocessing.Queue through which the data will be sent back to the main process. Instead of returning the values, place them on the queue, and the main process will retrieve them at some point later. If they're not ready when the process tries to get them, by default it will just block (wait) until they are available.
Your function would now look like this:
def csvreader(string, q):
    q.put(np.genfromtxt(string, delimiter=',', unpack=True))
The main routine would be:
if __name__ == '__main__':
    q = multiprocessing.Queue()
    for i in range(2):
        p = multiprocessing.Process(target=csvreader, args=(string_array[i], q))
        p.start()
    # Do anything else you need in here

    time = np.empty(2, dtype='object')
    signal = np.empty(2, dtype='object')
    for i in range(2):
        time[i], signal[i] = q.get()  # Returns output or blocks until ready
    # Process my output
Note that you have to call Queue.get() for each item you want to return.
Have a look at the documentation on the multiprocessing module for more examples and information.
Using the example from the introduction to the documentation:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(2)
    results = pool.map(CSVReader.csvreader, string_array[:2])
    print(results)

Multiprocessing with python3 only runs once

I have a problem running multiple processes in Python 3.
My program does the following:
1. Takes entries from an SQLite database and passes them to an input_queue
2. Create multiple processes that take items off the input_queue, run it through a function and output the result to the output queue.
3. Create a thread that takes items off the output_queue and prints them (This thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code, I used a list of numbers as a substitute for the database entries as it still performs the same way. I have 300 items on the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned)
#!/usr/bin/python3
from multiprocessing import Process, Queue
import multiprocessing
from threading import Thread

## This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self, out_queue):
        self.out_queue = out_queue

    def __call__(self, in_queue):
        data_entry = in_queue.get()
        result = data_entry * 2
        self.out_queue.put(result)

# Performs the multiprocessing
def perform_distributed_processing(dbList, threads, processor_factory, output_queue):
    input_queue = Queue()
    # Create the Data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target=processor,
                            args=(input_queue,))
        data_proc.start()
    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)
    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)
    data_proc.join()
    output_queue.put(None)

if __name__ == '__main__':
    output_results = Queue()

    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)

    # Establish results collecting thread.
    results_process = Thread(target=output_results_reader, args=(output_results,))
    results_process.start()
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    # Perform multi processing
    perform_distributed_processing(dbList, 10, Processor, output_results)
    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simple Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor

def Processor(data_entry):
    return data_entry * 2

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)

if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import ProcessPoolExecutor, Future, as_completed

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = (executor.submit(processor_factory, db) for db in dbList)
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because it wasn't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of the multiprocessing.Pool methods, depending on your needs; if this is a batch job you can even use the synchronous multiprocessing.Pool.map(). Only instead of pushing to an input queue, you pass an iterable (or write a generator) that yields the input to the pool.
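A minimal sketch of that suggestion, with the doubling step standing in for the real per-entry work, as in the example above:

from multiprocessing import Pool

def process_entry(data_entry):
    return data_entry * 2                  # placeholder for the real per-entry work

if __name__ == '__main__':
    dbList = [i for i in range(300)]       # substitute for the database entries
    with Pool(processes=10) as pool:
        # map() blocks until every entry has been processed and preserves input order
        for result in pool.map(process_entry, dbList):
            print(result)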

No increase in speed when multithreading python hdf5 parsing function

I have a function that:
1) reads in an HDF5 dataset as integer ASCII codes
2) converts the ASCII integers to characters with the chr() function
3) joins the characters into a single string with the join() function
Upon profiling, I found that the vast majority of the calculation is spent on step #2, the conversion of the ASCII integers to characters. I have somewhat optimized this call by using:
''.join([chr(x) for x in file[dataSetName].value])
As my parsing function seems to be CPU bound (the conversion of integers to characters) and not I/O bound, I expected to obtain a more or less linear speedup with the number of cores devoted to parsing. To parse one file serially takes ~15 seconds... to parse 10 files (on my 12 core machine) takes ~150 seconds while using 10 threads. That is, there seems to be no enhancement at all.
I have used the following code to launch my threads:
threads = []
timer = []
threadNumber = 10
for i, d in enumerate(sortedDirSet):
    timer.append(time.time())
    # self.loadFile(d,i)
    threads.append(Thread(target=self.loadFile, args=(d, i)))
    threads[-1].start()
    if i % threadNumber == 0:
        for i2, t in enumerate(threads):
            t.join()
            print(time.time() - timer[i2])
        timer = []
        threads = []
for t in threads:
    t.join()
Any help would be greatly appreciated.
Python cannot use multiple cores (due to the GIL) unless you spawn subprocesses (with multiprocessing, for example). Thus you won't get any performance boost from spawning threads for CPU-bound tasks.
Here's an example of a script using multiprocessing and queue:
from Queue import Empty  # <-- only needed to catch the exception
from multiprocessing import Process, Queue, cpu_count

def loadFile(d, i, queue):
    # some other stuff
    queue.put(result)

if __name__ == "__main__":
    queue = Queue()
    no = cpu_count()
    processes = []
    for i, d in enumerate(sortedDirSet):
        p = Process(target=loadFile, args=(d, i, queue))
        p.start()
        processes.append(p)
        if i % no == 0:
            for p in processes:
                p.join()
            processes = []
    for p in processes:
        p.join()

    results = []
    while True:
        try:
            # False means "don't wait when Empty, throw an exception instead"
            data = queue.get(False)
            results.append(data)
        except Empty:
            break
    # You have all the data, do something with it
The other (more complicated) way would be to use a pipe instead of a queue.
It would also be more efficient to spawn the processes once, create a job queue, and send the jobs to the subprocesses (via the pipe), so you don't have to create a new process for each file. But this would be even more complicated, so let's leave it at that.
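For completeness, a rough sketch of that pipe-based approach, with loadFile's real work replaced by a placeholder and sortedDirSet by a dummy job list (both are assumptions):

from multiprocessing import Process, Pipe, cpu_count

def worker(jobs, conn):
    # Process one chunk of jobs and stream the results back through the pipe.
    for d, i in jobs:
        result = (i, len(d))        # placeholder for the real loadFile(d, i) work
        conn.send(result)
    conn.close()                    # tells the parent there are no more results

if __name__ == "__main__":
    jobs = [("dir%d" % i, i) for i in range(20)]   # stand-in for sortedDirSet
    n = min(cpu_count(), len(jobs))
    procs, readers = [], []
    for k in range(n):
        read_end, write_end = Pipe(duplex=False)
        p = Process(target=worker, args=(jobs[k::n], write_end))
        p.start()
        write_end.close()           # the parent must drop its copy of the write end
        procs.append(p)
        readers.append(read_end)

    results = []
    for conn in readers:
        while True:
            try:
                results.append(conn.recv())
            except EOFError:        # the worker closed its end; this chunk is done
                break
    for p in procs:
        p.join()
    print(len(results))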
Freakish is correct: it is the GIL thwarting your efforts.
If you were to use Python 3, you could do this very nicely using concurrent.futures. I believe PyPy has also backported this feature.
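A rough sketch of how that could look with concurrent.futures, where h5py, the dataset name and the file list are assumptions filled in for illustration:

from concurrent.futures import ProcessPoolExecutor
import h5py

def parse_file(path, dataSetName="data"):            # dataset name is a guess
    # The chr() conversion is CPU bound, so each file is parsed in its own process.
    with h5py.File(path, "r") as f:
        return ''.join(map(chr, f[dataSetName][()]))  # [()] reads the whole dataset

if __name__ == "__main__":
    paths = ["file%d.h5" % i for i in range(10)]      # stand-in for sortedDirSet
    with ProcessPoolExecutor(max_workers=10) as executor:
        strings = list(executor.map(parse_file, paths))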
Also, you could eke a little bit more speed out of your code by replacing your list comprehension:
''.join([chr(x) for x in file[dataSetName].value])
With a map:
''.join(map(chr, file[dataSetName].value))
My tests (on a massive random list) using the above code showed 15.73 s with the list comprehension and 12.44 s with map.
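For reference, a comparison along those lines can be reproduced with timeit on synthetic data (the list size and repeat count below are illustrative choices, not the ones used above):

import random
import timeit

data = [random.randint(32, 126) for _ in range(10000000)]  # stand-in for the HDF5 values

t_listcomp = timeit.timeit(lambda: ''.join([chr(x) for x in data]), number=3)
t_map = timeit.timeit(lambda: ''.join(map(chr, data)), number=3)
print("list comprehension: %.2f s, map: %.2f s" % (t_listcomp, t_map))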
