Python & multiprocessing: breaking a set generation down into sub-processes

I need to generate a set of strings based on calculations over other strings. This takes quite a while, and I'm working on a multiprocessor/multicore server, so I figured I could break the work down into chunks and pass them off to different processes.
First I break the original list of strings into chunks of 10000 each, send each chunk to a process that builds a new set, then try to obtain a lock and report the results back to the master process. However, my master process's set is empty!
Here's some code:
def build_feature_labels(self, strings, return_obj, l):
    feature_labels = set()
    for s in strings:
        feature_labels = feature_labels.union(s.get_feature_labels())
    print "method: ", len(feature_labels)
    l.acquire()
    return_obj.return_feature_labels(feature_labels)
    l.release()
    print "Thread Done"

def return_feature_labels(self, labs):
    self.feature_labels = self.feature_labels.union(labs)
    print "length self", len(self.feature_labels)
    print "length labs", len(labs)
current_pos = 0
lock = multiprocessing.Lock()
while current_pos < len(orig_strings):
    while len(multiprocessing.active_children()) > threads:
        print "WHILE: cpu count", str(multiprocessing.cpu_count())
        T.sleep(30)
    print "number of processes", str(len(multiprocessing.active_children()))
    proc = multiprocessing.Process(target=self.build_feature_labels,
                                   args=(orig_strings[current_pos:current_pos + self.MAX_ITEMS], self, lock))
    proc.start()
    current_pos = current_pos + self.MAX_ITEMS

while len(multiprocessing.active_children()) > 0:
    T.sleep(3)

print len(self.feature_labels)
What is strange is that self.feature_labels is empty on the master process, yet when return_feature_labels is called from each sub-process it has items. I think I'm taking the wrong approach here (it's how I used to do it in Java!). Is there a better approach?
Thanks in advance.

Consider using a pool of workers: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers. This does a lot of the work for you in a map-reduce style and returns the assembled results.
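A minimal sketch of that (not from the original post; it reuses orig_strings, MAX_ITEMS and the string objects from the question, and assumes they are picklable):

import multiprocessing

def build_feature_labels(strings):
    # assumed worker: build the labels for one chunk and return them
    labels = set()
    for s in strings:
        labels |= s.get_feature_labels()
    return labels

if __name__ == '__main__':
    chunks = [orig_strings[i:i + MAX_ITEMS]
              for i in range(0, len(orig_strings), MAX_ITEMS)]
    pool = multiprocessing.Pool()
    # map() pickles each worker's returned set back to the parent automatically
    feature_labels = set().union(*pool.map(build_feature_labels, chunks))
    pool.close()
    pool.join()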

Use a multiprocessing.Pipe, or Queue (or other such object) to pass data between processes. Use a Pipe to pass data between two processes, and a Queue to allow multiple producers and consumers.
Along with the official documentation, there are nice examples to be found in Doug Hellmann's multiprocessing tutorial. In particular, it has an example of how to use multiprocessing.Pool to implement a map-reduce-type operation. It might suit your purposes very well.
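As a rough sketch of the Queue approach (again adapted to the question's names, not the original poster's code): each worker puts its partial set on the queue and the master unions them, so no lock is needed; note that the queue is drained before the processes are joined.

import multiprocessing

def build_feature_labels(strings, result_q):
    labels = set()
    for s in strings:
        labels |= s.get_feature_labels()
    result_q.put(labels)  # pickled and sent back to the parent

if __name__ == '__main__':
    result_q = multiprocessing.Queue()
    procs = []
    for i in range(0, len(orig_strings), MAX_ITEMS):
        p = multiprocessing.Process(target=build_feature_labels,
                                    args=(orig_strings[i:i + MAX_ITEMS], result_q))
        p.start()
        procs.append(p)

    feature_labels = set()
    for _ in procs:
        feature_labels |= result_q.get()  # drain before joining
    for p in procs:
        p.join()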

Why it didn't work: multiprocessing uses processes, and process memory isn't shared. Multiprocessing can set up shared memory or pipes for IPC, but it must be done explicitly. This is how the various suggestions send data back to the master.

Related

Python's multiprocessing.Queue's get() method

I am writing my first Python 2.7 multiprocessing program (woohoo).
I am using multiprocessing Queues to retrieve data from my sub-processes. My question is about the Queue's .get() method. Is there any guarantee that I will get the full object (no matter how large it is) when I call the method? If not, how will it be split?
The doc says: “Remove and return an item from the queue. […]”. But I am not sure whether this means that I might end up getting chunks of an object or whether it is rebuilt by the method's internals.
Here is some sample code: (stats might get pretty large)
#In the parent (inside a loop over db_names)
p = Process(target=process_analyze_db, args=(db_names[j], j, queue_stats))
processes.append(p)
p.start()

while 1:
    running = any(p.is_alive() for p in processes)
    while not queue_stats.empty():  #Is this loop necessary?
        data = queue_stats.get_nowait()
        results[data[0]] = data[1]
    if not running:
        break

#In the process
def process_analyze_db(db_name, profile_nr, queue_stats):
    #Do lots of stuff
    queue_stats.put([profile_nr, stats])
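For what it's worth, get() hands back one complete item exactly as it was put: the object is pickled as a whole in the child and reassembled in the parent, so you never receive partial chunks of it. A polling-free sketch (a rough variant with names assumed from the code above, not the original program) calls get() once per expected result:

from multiprocessing import Process, Queue

def process_analyze_db(db_name, profile_nr, queue_stats):
    stats = {}  # stand-in for the real analysis result
    queue_stats.put([profile_nr, stats])

if __name__ == '__main__':
    db_names = ['db0', 'db1', 'db2']  # assumed
    queue_stats = Queue()
    processes = [Process(target=process_analyze_db, args=(name, j, queue_stats))
                 for j, name in enumerate(db_names)]
    for p in processes:
        p.start()

    results = {}
    for _ in processes:  # one get() per expected result
        profile_nr, stats = queue_stats.get()
        results[profile_nr] = stats

    for p in processes:  # join after draining the queue
        p.join()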

Separate computation from socket work in Python

I'm serializing column data and then sending it over a socket connection.
Something like:
import array, struct, socket

## Socket setup
s = socket.create_connection((ip, addr))

## Data container setup
ordered_col_list = ('col1', 'col2')
columns = dict.fromkeys(ordered_col_list)

for i in range(num_of_chunks):
    ## Binarize data
    columns['col1'] = array.array('i', range(10000))
    columns['col2'] = array.array('f', [float(num) for num in range(10000)])
    .
    .
    .
    ## Send away
    chunk = b''.join(columns[col_name] for col_name in ordered_col_list)
    s.sendall(chunk)
    s.recv(1000)  #get confirmation
I wish to separate the computation from the sending, put them on separate threads or processes, so I can keep doing computations while data is sent away.
I've put the binarizing part as a generator function, then sent the generator to a separate thread, which then yielded binary chunks via a queue.
I collected the data from the main thread and sent it away. Something like:
import array, struct, socket
from time import sleep

try:
    import thread
    from Queue import Queue
except:
    import _thread as thread
    from queue import Queue

## Socket and queue setup
s = socket.create_connection((ip, addr))
chunk_queue = Queue()

def binarize(num_of_chunks):
    ''' Generator function that yields chunks of binary data. In reality it wouldn't be the same data'''
    ordered_col_list = ('col1', 'col2')
    columns = dict.fromkeys(ordered_col_list)
    for i in range(num_of_chunks):
        columns['col1'] = array.array('i', range(10000)).tostring()
        columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
        .
        .
        yield b''.join((columns[col_name] for col_name in ordered_col_list))

def chunk_yielder(queue):
    ''' Generate binary chunks and put them on a queue. To be used from a thread '''
    while True:
        try:
            data_gen = queue.get_nowait()
        except:
            sleep(0.1)
            continue
        else:
            for chunk in data_gen:
                queue.put(chunk)

## Setup thread and data generator
thread.start_new_thread(chunk_yielder, (chunk_queue,))
num_of_chunks = 100
data_gen = binarize(num_of_chunks)
queue.put(data_gen)

## Get data back and send away
while True:
    try:
        binary_chunk = queue.get_nowait()
    except:
        sleep(0.1)
        continue
    else:
        socket.sendall(binary_chunk)
        socket.recv(1000)  #Get confirmation
However, I did not see any performance improvement - it did not run faster.
I don't understand threads/processes very well, and my question is whether it is possible (at all, and in Python) to gain from this type of separation, and what would be a good way to go about it, either with threads or processes (or any other way - async, etc.).
EDIT:
As far as I've come to understand -
Multiprocessing requires serializing any sent data, so I would effectively be sending every computed chunk twice.
Sending via socket.send() should release the GIL.
Therefore I think (please correct me if I am mistaken) that a threading solution is the right way. However, I'm not sure how to do it correctly.
I know Cython can release the GIL in threads, but since one of the threads just does socket.send/recv, my understanding is that it shouldn't be necessary.
You have two options for running things in parallel in Python: either use the multiprocessing library (docs), or write the parallel code in Cython and release the GIL. The latter is significantly more work and less generally applicable.
Python threads are limited by the Global Interpreter Lock (GIL); I won't go into detail here, as you will find more than enough information about it online. In short, the GIL, as the name suggests, is a global lock within the CPython interpreter that ensures multiple threads do not modify objects inside the interpreter simultaneously. This is why, for instance, Cython programs can run code in parallel: they can exist outside the GIL.
As for your code, one problem is that you're running both the number crunching (binarize) and the socket.send inside the GIL, which will run them strictly serially. The queue is also wired up very strangely, and there is a NameError, but let's leave those aside.
With the caveats already pointed out by Jeremy Friesner in mind, I suggest you restructure the code in the following manner: you have two processes (not threads), one for binarising the data and the other for sending it. In addition to those, there is also the parent process that started both children, and a queue connecting child 1 to child 2:
Subprocess-1 does number crunching and produces crunched data into a queue
Subprocess-2 consumes data from a queue and does socket.send
In code, the setup would look something like this:
from multiprocessing import Process, Queue
work_queue = Queue()
p1 = Process(target=binarize, args=(100, work_queue))
p2 = Process(target=send_data, args=(ip, port, work_queue))
p1.start()
p2.start()
p1.join()
p2.join()
binarize can remain as it is in your code, except that instead of yielding at the end, you put elements into the queue:
def binarize(num_of_chunks, q):
    ''' Produce chunks of binary data and put them on the queue. In reality it wouldn't be the same data. '''
    ordered_col_list = ('col1', 'col2')
    columns = dict.fromkeys(ordered_col_list)
    for i in range(num_of_chunks):
        columns['col1'] = array.array('i', range(10000)).tostring()
        columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
        data = b''.join((columns[col_name] for col_name in ordered_col_list))
        q.put(data)
send_data should just be the while loop from the bottom of your code, plus the connection open/close functionality:
def send_data(ip, addr, q):
    s = socket.create_connection((ip, addr))
    while True:
        try:
            binary_chunk = q.get(False)
        except:
            sleep(0.1)
            continue
        else:
            s.sendall(binary_chunk)
            s.recv(1000)  # Get confirmation
    # maybe remember to close the socket before killing the process
Now you have two (three, actually, if you count the parent) processes that are processing data independently. You can force the two children to operate in lockstep by setting the maxsize of the queue to a single element. The operation of these two separate processes is also easy to monitor from the process manager on your computer: top (Linux), Activity Monitor (OS X), Task Manager (Windows).
Finally, Python 3 comes with the option of using co-routines, which are neither processes nor threads but something else entirely. Co-routines are pretty cool from a CS point of view, but a bit of a head-scratcher at first. There are plenty of resources to learn from, though, like this post on Medium and this talk by David Beazley.
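As a very rough illustration of that route (a sketch only, not from the original answer; host, port and the binarize_chunk helper are made up), asyncio can keep the socket work on the event loop while the CPU-bound step runs in a process pool, one chunk ahead of the send:

import array
import asyncio
from concurrent.futures import ProcessPoolExecutor

def binarize_chunk(n):
    # hypothetical stand-in for the CPU-bound binarizing step
    return array.array('i', range(n)).tobytes()

async def main(host='127.0.0.1', port=9000, num_of_chunks=100):
    reader, writer = await asyncio.open_connection(host, port)
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        next_chunk = loop.run_in_executor(pool, binarize_chunk, 10000)
        for i in range(num_of_chunks):
            chunk = await next_chunk
            if i + 1 < num_of_chunks:
                # start crunching the next chunk while this one goes out
                next_chunk = loop.run_in_executor(pool, binarize_chunk, 10000)
            writer.write(chunk)
            await writer.drain()
            await reader.read(1000)  # confirmation
    writer.close()
    await writer.wait_closed()

if __name__ == '__main__':
    asyncio.run(main())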
Even more generally, you might want to look into the producer/consumer pattern, if you are not already familiar with it.
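A bare-bones producer/consumer skeleton (a sketch with made-up make_item/handle_item helpers) that also shows a clean shutdown via a None sentinel:

from multiprocessing import Process, Queue

def make_item(i):
    # hypothetical expensive computation
    return i * i

def handle_item(item):
    # hypothetical send / IO step
    print(item)

def producer(q, n_items):
    for i in range(n_items):
        q.put(make_item(i))
    q.put(None)  # sentinel: tells the consumer there is no more work

def consumer(q):
    for item in iter(q.get, None):
        handle_item(item)

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=producer, args=(q, 100))
    p2 = Process(target=consumer, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()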
If you are trying to use concurrency to improve performance in CPython, I would strongly recommend using the multiprocessing library instead of multithreading. This is because of the GIL (Global Interpreter Lock), which can have a huge impact on execution speed (in some cases, it may cause your code to run slower than the single-threaded version). If you would like to learn more about this topic, I recommend reading this presentation by David Beazley. Multiprocessing bypasses the problem by spawning a new Python interpreter instance for each process, allowing you to take full advantage of a multi-core architecture.

Multiprocessing has cutoff at 992 integers being joined as result

I am following this book, http://doughellmann.com/pages/python-standard-library-by-example.html, along with some online references. I have an algorithm set up for multiprocessing where I have a large array of dictionaries and do some calculation on it. I use multiprocessing to divide up the indexes on which the calculations are done. To make the question more general, I replaced the algorithm with just an array of return values. From information found online and in other SO posts, I think the problem has to do with the join method.
The structure is like so:
Generate some fake data, call the manager function for multiprocessing, create a Queue, and divide the data over the number of indexes. Loop through the number of processes to use and send each process function the correct index range. Lastly, join the processes and print out the results.
What I have figured out is that if the function used by the processes tries to return a range(0,992), it works quickly, but with range(0,993) it hangs. I tried on two different computers with different specs.
The code is here:
import multiprocessing

def main():
    data = []
    for i in range(0,10):
        data.append(i)
    CalcManager(data,start=0,end=50)

def CalcManager(myData,start,end):
    print 'in calc manager'

    #Multi processing
    #Set the number of processes to use.
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Setup an empty array to store our processes
    procs = []

    #Divide up the data for the set number of processes
    interval = (end-start)/nprocs
    new_start = start

    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        print 'starting processes'
        new_end = new_start + interval
        #Make sure we dont go past the size of the data
        if new_end > end:
            new_end = end
        #Generate a new process and pass it the arguments
        data = myData[new_start:new_end]
        #Create the processes and pass the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(data,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print result

#MultiProcess Handling
def multiProcess(data,start,end,result_q,proc_num):
    print 'started process'
    results = range(0,(992))
    result_q.put(results)
    return

if __name__ == '__main__':
    main()
Is there something about these numbers specifically or am I just missing something basic that has nothing to do with these numbers?
From my searches, it seems this is some memory issue with the join method, but the book does not really explain how to solve it using this setup. Is it possible to use this structure (I understand it mostly, so it would be nice if I could continue to use it) and also pass back large results? I know there are other methods to share data between processes, but that's not what I need; I just want to return the values and join them into one array once completed.
I can't reproduce this on my machine, but it sounds like the items put into the queue haven't been flushed to the underlying pipe. This will cause a deadlock if you try to terminate the process, according to the docs:
As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe. This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
If you're in this situation, your p.join() calls will hang forever, because there's still buffered data in the queue. You can avoid it by consuming from the queue before you join the processes:
#Print out the results
for i in range(nprocs):
    result = result_q.get()
    print result

#Join the processes to wait for all data/processes to be finished
for p in procs:
    p.join()
This doesn't affect the way the code works; each result_q.get() call will block until a result is placed on the queue, which has the same effect as calling join on all the processes prior to calling get. The only difference is that you avoid the deadlock.
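Alternatively (a sketch, not part of the original answer), a multiprocessing.Pool sidesteps the manual queue and join bookkeeping entirely, since map() collects the results, however large, for you:

from multiprocessing import Pool

def multiProcess(args):
    data, start, end, proc_num = args
    return range(0, 992)  # stand-in for the real calculation

def CalcManager(myData, start, end, nprocs=3):
    interval = (end - start) // nprocs
    chunks = [(myData[s:s + interval], s, s + interval, i)
              for i, s in enumerate(range(start, end, interval))]
    pool = Pool(nprocs)
    results = pool.map(multiProcess, chunks)  # one result list per chunk
    pool.close()
    pool.join()
    return results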

No increase in speed when multithreading python hdf5 parsing function

I have a function that:
1) reads in an HDF5 dataset as integer ASCII codes
2) converts the ASCII integers to characters with the chr() function
3) joins the characters into a single string with the join() function
Upon profiling, I found that the vast majority of the time is spent on step #2, the conversion of the ASCII integers to characters. I have somewhat optimized this call by using:
''.join([chr(x) for x in file[dataSetName].value])
As my parsing function seems to be CPU bound (the conversion of integers to characters) and not I/O bound, I expected a more or less linear speedup with the number of cores devoted to parsing. Parsing one file serially takes ~15 seconds; parsing 10 files (on my 12-core machine) using 10 threads takes ~150 seconds. That is, there seems to be no improvement at all.
I have used the following code to launch my threads:
threads=[]
timer=[]
threadNumber=10
for i,d in enumerate(sortedDirSet):
    timer.append(time.time())
    # self.loadFile(d,i)
    threads.append(Thread(target=self.loadFile, args=(d,i)))
    threads[-1].start()
    if(i%threadNumber==0):
        for i2,t in enumerate(threads):
            t.join()
            print(time.time()-timer[i2])
        timer=[]
        threads=[]

for t in threads:
    t.join()
Any help would be greatly appreciated.
Python cannot use multiple cores (due to the GIL) unless you spawn subprocesses (with multiprocessing, for example). Thus you won't get any performance boost from spawning threads for CPU-bound tasks.
Here's an example of a script using multiprocessing and a queue:
from Queue import Empty # <-- only needed to catch the Exception
from multiprocessing import Process, Queue, cpu_count

def loadFile(d, i, queue):
    # some other stuff
    queue.put(result)

if __name__ == "__main__":
    queue = Queue()
    no = cpu_count()
    processes = []

    for i,d in enumerate(sortedDirSet):
        p = Process(target=loadFile, args=(d, i, queue))
        p.start()
        processes.append(p)
        if i % no == 0:
            for p in processes:
                p.join()
            processes = []

    for p in processes:
        p.join()

    results = []
    while True:
        try:
            # False means "don't wait when Empty, throw an exception instead"
            data = queue.get(False)
            results.append(data)
        except Empty:
            break

    # You have all the data, do something with it
The other (more complicated) way would be to use a pipe instead of a queue.
It would also be more efficient to spawn the processes once, then create a job queue and send jobs (via a pipe) to the subprocesses (so you won't have to create a process each time). But this would be even more complicated, so let's leave it at that.
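For reference, a long-lived worker fed over a Pipe could look roughly like this (a sketch only, not from the original answer; load_file is a hypothetical stand-in for the per-file parsing):

from multiprocessing import Process, Pipe

def load_file(d):
    # hypothetical stand-in: parse one HDF5 file and return its string
    return d

def worker(conn):
    # keep receiving file names until the parent sends None
    for d in iter(conn.recv, None):
        conn.send(load_file(d))
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()

    results = []
    for d in ["file1.hdf5", "file2.hdf5"]:  # assumed file list
        parent_conn.send(d)
        results.append(parent_conn.recv())

    parent_conn.send(None)  # shut the worker down
    p.join()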
Freakish is correct with his answer: it will be the GIL thwarting your efforts.
If you were to use Python 3, you could do this very nicely using concurrent.futures. I believe PyPy has also backported this feature.
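A minimal sketch of that approach (assuming a module-level load_file stand-in and a hard-coded file list in place of the question's sortedDirSet):

from concurrent.futures import ProcessPoolExecutor

def load_file(d):
    # hypothetical stand-in: parse one HDF5 file and return its decoded string
    return d

if __name__ == "__main__":
    sortedDirSet = ["file1.hdf5", "file2.hdf5"]  # assumed
    with ProcessPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(load_file, sortedDirSet))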
Also, you could eke a little bit more speed out of your code by replacing your list comprehension:
''.join([chr(x) for x in file[dataSetName].value])
With a map:
''.join(map(chr, file[dataSetName].value))
My tests (on a massive random list) using the above code showed 15.73s for the list comprehension and 12.44s for map.

Sharing data between processes without physically moving it

I have a job where a lot of separate tasks come through. For each task I need to download some data, process it and then upload it again.
I'm using a multiprocessing pool for the processing.
I have a couple of issues I'm unsure of though.
Firstly, the data can be up to roughly 20MB. I ideally want to get it to the child worker process without physically copying it in memory, and to get the resulting data back to the parent process without copying it either. As I'm not sure how some tools work under the hood, I don't know whether I can just pass the data as an argument to the pool's apply_async (from my understanding it serialises the objects and they're created again once they reach the subprocess?), or whether I should use a multiprocessing Queue, or mmap maybe, or something else?
I looked at ctypes objects, but from what I understand only objects that are defined when the pool is created, when the process forks, can be shared. That is no good for me, as I'll continuously have new data coming in which I need to share.
One thing I shouldn't need to worry about is any concurrent access to the data, so I shouldn't need any type of locking. This is because the processing will only start after the data has been downloaded, and the upload will also only start after the output data has been generated.
Another issue is that the incoming tasks sometimes spike, and as a result I'm downloading data for the tasks faster than the child processes can handle it. Since I'm downloading data faster than I can finish the tasks and dispose of the data, Python ends up dying from running out of memory. What would be a good way to hold up the tasks at the downloading stage when memory is almost full or too much data is in the job pipeline?
I was thinking of some type of "ref" count using the number of data bytes, so I can limit the amount of data between download and upload and only download when the number is below some threshold. Although I'd be worried that a child might sometimes fail and I'd never get to take the data it had off the count. Is there a good way to achieve this kind of thing?
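One simple way to get that kind of back-pressure (a sketch with made-up download_task/process_task helpers, not taken from the answers below) is to feed the workers through a bounded queue: put() blocks once maxsize items are in flight, which throttles the download stage automatically.

from multiprocessing import Process, Queue

MAX_IN_FLIGHT = 4  # assumed cap on downloaded-but-unprocessed payloads

def download_task(task):
    # hypothetical download step; up to ~20MB each in the real job
    return b'x' * 1024

def process_task(payload):
    # hypothetical process-and-upload step
    return len(payload)

def downloader(tasks, work_q):
    for task in tasks:
        work_q.put(download_task(task))  # blocks once MAX_IN_FLIGHT is reached
    work_q.put(None)  # sentinel: no more work

def worker(work_q):
    for payload in iter(work_q.get, None):
        process_task(payload)

if __name__ == '__main__':
    work_q = Queue(maxsize=MAX_IN_FLIGHT)
    d = Process(target=downloader, args=(range(100), work_q))
    w = Process(target=worker, args=(work_q,))
    d.start()
    w.start()
    d.join()
    w.join()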
(This is an outcome of the discussion of my previous answer)
Have you tried POSH?
This example shows that one can append elements to a mutable list, which is probably what you want (copied from the documentation):
import posh

l = posh.share(range(3))
if posh.fork():
    # parent process
    l.append(3)
    posh.waitall()
else:
    # child process
    l.append(4)
    posh.exit(0)
print l

-- Output --
[0, 1, 2, 3, 4]
-- OR --
[0, 1, 2, 4, 3]
Here is the canonical example from the multiprocessing documentation:
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]
Note that num and arr are shared objects. Is it what you are looking for?
I cobbled this together since I needed to figure it out for myself anyway. I'm by no means very accomplished when it comes to multiprocessing or threading, but at least it works. Maybe it can be done in a smarter way; I couldn't figure out how to use the lock that comes with the non-raw Array type. Maybe someone will suggest improvements in the comments.
from multiprocessing import Process, Event
from multiprocessing.sharedctypes import RawArray

def modify(s, task_event, result_event):
    for i in range(4):
        print "Worker: waiting for task"
        task_event.wait()
        task_event.clear()
        print "Worker: got task"
        s.value = s.value.upper()
        result_event.set()

if __name__ == '__main__':
    data_list = ("Data", "More data", "oh look, data!", "Captain Pickard")
    task_event = Event()
    result_event = Event()
    s = RawArray('c', "X" * max(map(len, data_list)))
    p = Process(target=modify, args=(s, task_event, result_event))
    p.start()

    for data in data_list:
        s.value = data
        task_event.set()
        print "Sent new task. Waiting for results."
        result_event.wait()
        result_event.clear()
        print "Got result: {0}".format(s.value)

    p.join()
In this example, data_list is defined beforehand, but it need not be. The only information I needed from that list was the length of the longest string. As long as you have some practical upper bound for the length, it's no problem.
Here's the output of the program:
Sent new task. Waiting for results.
Worker: waiting for task
Worker: got task
Worker: waiting for task
Got result: DATA
Sent new task. Waiting for results.
Worker: got task
Worker: waiting for task
Got result: MORE DATA
Sent new task. Waiting for results.
Worker: got task
Worker: waiting for task
Got result: OH LOOK, DATA!
Sent new task. Waiting for results.
Worker: got task
Got result: CAPTAIN PICKARD
As you can see, btel did in fact provide the solution, but the problem lay in keeping the two processes in lockstep with each other, so that the worker only starts working on a new task when the task is ready, and so that the main process doesn't read the result before it's complete.
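On the open question about the lock that comes with the non-raw Array type: a possible sketch (my assumption, Python 3 syntax, not from the original answer) swaps RawArray for Array and wraps accesses in its built-in recursive lock via get_lock():

from multiprocessing import Process, Array

def modify(s):
    with s.get_lock():  # Array (unlike RawArray) ships with a recursive lock
        s.value = s.value.upper()

if __name__ == '__main__':
    s = Array('c', b"X" * 32)  # 'c' arrays hold bytes in Python 3
    s.value = b"Captain Pickard"
    p = Process(target=modify, args=(s,))
    p.start()
    p.join()
    with s.get_lock():
        print(s.value)  # b'CAPTAIN PICKARD'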
