Better way to share memory for multiprocessing in Python?

Better way to share memory for multiprocessing in Python? - python

I have been tackling this problem for a week now and it's been getting pretty frustrating because every time I implement a simpler but similar scale example of what I need to do, it turns out multiprocessing will fudge it up. The way it handles shared memory baffles me because it is so limited, it can become useless quite rapidly.
So the basic description of my problem is that I need to create a process that gets passed in some parameters to open an image and create about 20K patches of size 60x40. These patches are saved into a list 2 at a time and need to be returned to the main thread to then be processed again by 2 other concurrent processes that run on the GPU.
The process and the workflow and all that are mostly taken care of, what I need now is the part that was supposed to be the easiest is turning out to be the most difficult. I have not been able to save and get the list with 20K patches back to the main thread.
First problem was because I was saving these patches as PIL images. I then found out all data added to a Queue object has to be pickled.
Second problem was I then converted the patches to an array of 60x40 each and saved them to a list. And now that still doesn't work? Apparently Queues have a limited amount of data they can save otherwise when you call queue_obj.get() the program hangs.
I have tried many other things, and every new thing I try does not work, so I would like to know if anyone has other recommendations of a library I can use to share objects without all the fuzz?
Here is a sample implementation of kind of what I'm looking at. Keep in mind this works perfectly fine, but the full implementation doesn't. And I do have the code print informational messages to see that the data being saved has the exact same shape and everything, but for some reason it doesn't work. In the full implementation the independent process completes successfully but freezes at q.get().
from PIL import Image
from multiprocessing import Queue, Process
import StringIO
import numpy
img = Image.open("/path/to/image.jpg")
q = Queue()
q2 = Queue()
#
#
# MAX Individual Queue limit for 60x40 images in BW is 31,466.
# Multiple individual Queues can be filled to the max limit of 31,466.
# A single Queue can only take up to 31,466, even if split up in different puts.
def rz(patch, qn1, qn2):
totalPatchCount = 20000
channels = 1
patch = patch.resize((60,40), Image.ANTIALIAS)
patch = patch.convert('L')
# ImgArray = numpy.asarray(im, dtype=numpy.float32)
list_im_arr = []
# ----Create a 4D Array
# returnImageArray = numpy.zeros(shape=(totalPatchCount, channels, 40, 60))
imgArray = numpy.asarray(patch, dtype=numpy.float32)
imgArray = imgArray[numpy.newaxis, ...]
# ----End 4D array
# list_im_arr2 = []
for i in xrange(totalPatchCount):
# returnImageArray[i] = imgArray
list_im_arr.append(imgArray)
qn1.put(list_im_arr)
qn1.cancel_join_thread()
# qn2.cancel_join_thread()
print "PROGRAM Done"
# rz(img,q,q2)
# l = q.get()
#
p = Process(target=rz,args=(img, q, q2,))
p.start()
p.join()
#
# # l = []
# # for i in xrange(1000): l.append(q.get())
#
imdata = q.get()

Queue is for communication between processes. In your case, you don't really have this kind of communication. You can simply let the process return result, and use the .get() method to collect them. (Remember to add if __name__ == "main":, see programming guideline)
from PIL import Image
from multiprocessing import Pool, Lock
import numpy
img = Image.open("/path/to/image.jpg")
def rz():
totalPatchCount = 20000
imgArray = numpy.asarray(patch, dtype=numpy.float32)
list_im_arr = [imgArray] * totalPatchCount # A more elegant way than a for loop
return list_im_arr
if __name__ == '__main__':
# patch = img.... Your code to get generate patch here
patch = patch.resize((60,40), Image.ANTIALIAS)
patch = patch.convert('L')
pool = Pool(2)
imdata = [pool.apply_async(rz).get() for x in range(2)]
pool.close()
pool.join()
Now, according to first answer of this post, multiprocessing only pass objects that's picklable. Pickling is probably unavoidable in multiprocessing because processes don't share memory. They simply don't live in the same universe. (They do inherit memory when they're first spawned, but they can not reach out of their own universe). PIL image object itself is not picklable. You can make it picklable by extracting only the image data stored in it, like this post suggested.
Since your problem is mostly I/O bound, you can also try multi-threading. It might be even faster for your purpose. Threads share everything so no pickling is required. If you're using python 3, ThreadPoolExecutor is a wonderful tool. For Python 2, you can use ThreadPool. To achieve higher efficiency, you'll have to rearrange how you do things, you want to break-up the process and let different threads do the job.
from PIL import Image
from multiprocessing.pool import ThreadPool
from multiprocessing import Lock
import numpy
img = Image.open("/path/to/image.jpg")
lock = Lock():
totalPatchCount = 20000
def rz(x):
patch = ...
return patch
pool = ThreadPool(8)
imdata = [pool.map(rz, range(totalPatchCount)) for i in range(2)]
pool.close()
pool.join()

You say "Apparently Queues have a limited amount of data they can save otherwise when you call queue_obj.get() the program hangs."
You're right and wrong there. There is a limited amount of information the Queue will hold without being drained. The problem is that when you do:
qn1.put(list_im_arr)
qn1.cancel_join_thread()
it schedules the communication to the underlying pipe (handled by a thread). The qn1.cancel_join_thread() then says "but it's cool if we exit without the scheduled put completing", and of course, a few microseconds later, the worker function exits and the Process exits (without waiting for the thread that is populating the pipe to actually do so; at best it might have sent the initial bytes of the object, but anything that doesn't fit in PIPE_BUF almost certainly gets dropped; you'd need some amazing race conditions to occur to get anything at all, let alone the whole of a large object). So later, when you do:
imdata = q.get()
nothing has actually been sent by the (now exited) Process. When you call q.get() it's waiting for data that never actually got transmitted.
The other answer is correct that in the case of computing and conveying a single value, Queues are overkill. But if you're going to use them, you need to use them properly. The fix would be to:
Remove the call to qn1.cancel_join_thread() so the Process doesn't exit until the data has been transmitted across the pipe.
Rearrange your calls to avoid deadlock
Rearranging is just this:
p = Process(target=rz,args=(img, q, q2,))
p.start()
imdata = q.get()
p.join()
moving p.join() after q.get(); if you try to join first, your main process will be waiting for the child to exit, and the child will be waiting for the queue to be consumed before it will exit (this might actually work if the Queue's pipe is drained by a thread in the main process, but it's best not to count on implementation details like that; this form is correct regardless of implementation details, as long as puts and gets are matched).

Related

Separate computation from socket work in Python

I'm serializing column data and then sending it over a socket connection.
Something like:
import array, struct, socket
## Socket setup
s = socket.create_connection((ip, addr))
## Data container setup
ordered_col_list = ('col1', 'col2')
columns = dict.fromkeys(ordered_col_list)
for i in range(num_of_chunks):
## Binarize data
columns['col1'] = array.array('i', range(10000))
columns['col2'] = array.array('f', [float(num) for num in range(10000)])
.
.
.
## Send away
chunk = b''.join(columns[col_name] for col_name in ordered_col_list]
s.sendall(chunk)
s.recv(1000) #get confirmation
I wish to separate the computation from the sending, put them on separate threads or processes, so I can keep doing computations while data is sent away.
I've put the binarizing part as a generator function, then sent the generator to a separate thread, which then yielded binary chunks via a queue.
I collected the data from the main thread and sent it away. Something like:
import array, struct, socket
from time import sleep
try:
import thread
from Queue import Queue
except:
import _thread as thread
from queue import Queue
## Socket and queue setup
s = socket.create_connection((ip, addr))
chunk_queue = Queue()
def binarize(num_of_chunks):
''' Generator function that yields chunks of binary data. In reality it wouldn't be the same data'''
ordered_col_list = ('col1', 'col2')
columns = dict.fromkeys(ordered_col_list)
for i in range(num_of_chunks):
columns['col1'] = array.array('i', range(10000)).tostring()
columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
.
.
yield b''.join((columns[col_name] for col_name in ordered_col_list))
def chunk_yielder(queue):
''' Generate binary chunks and put them on a queue. To be used from a thread '''
while True:
try:
data_gen = queue.get_nowait()
except:
sleep(0.1)
continue
else:
for chunk in data_gen:
queue.put(chunk)
## Setup thread and data generator
thread.start_new_thread(chunk_yielder, (chunk_queue,))
num_of_chunks = 100
data_gen = binarize(num_of_chunks)
queue.put(data_gen)
## Get data back and send away
while True:
try:
binary_chunk = queue.get_nowait()
except:
sleep(0.1)
continue
else:
socket.sendall(binary_chunk)
socket.recv(1000) #Get confirmation
However, I did not see and performance imporovement - it did not work faster.
I don't understand threads/processes too well, and my question is whether it is possible (at all and in Python) to gain from this type of separation, and what would be a good way to go about it, either with threads or processess (or any other way - async etc).
EDIT:
As far as I've come to understand -
Multirpocessing requires serializing any sent data, so I'm double-sending every computed data.
Sending via socket.send() should release the GIL
Therefore I think (please correct me if I am mistaken) that a threading solution is the right way. However I'm not sure how to do it correctly.
I know cython can release the GIL off of threads, but since one of them is just socket.send/recv, my understanding is that it shouldn't be necessary.

You have two options for running things in parallel in Python, either use the multiprocessing (docs) library , or write the parallel code in cython and release the GIL. The latter is significantly more work and less applicable generally speaking.
Python threads are limited by the Global Interpreter Lock (GIL), I won't go into detail here as you will find more than enough information online on it. In short, the GIL, as the name suggests, is a global lock within the CPython interpreter that ensures multiple threads do not modify objects, that are within the confines of said interpreter, simultaneously. This is why, for instance, cython programs can run code in parallel because they can exist outside the GIL.
As to your code, one problem is that you're running both the number crunching (binarize) and the socket.send inside the GIL, this will run them strictly serially. The queue is also connected very strangely, and there is a NameError but let's leave those aside.
With the caveats already pointed out by Jeremy Friesner in mind, I suggest you re-structure the code in the following manner: you have two processes (not threads) one for binarising the data and the other for sending data. In addition to those, there is also the parent process that started both children, and a queue connecting child 1 to child 2.
Subprocess-1 does number crunching and produces crunched data into a queue
Subprocess-2 consumes data from a queue and does socket.send
in code the setup would look something like
from multiprocessing import Process, Queue
work_queue = Queue()
p1 = Process(target=binarize, args=(100, work_queue))
p2 = Process(target=send_data, args=(ip, port, work_queue))
p1.start()
p2.start()
p1.join()
p2.join()
binarize can remain as it is in your code, with the exception that instead of a yield at the end, you add elements into the queue
def binarize(num_of_chunks, q):
''' Generator function that yields chunks of binary data. In reality it wouldn't be the same data'''
ordered_col_list = ('col1', 'col2')
columns = dict.fromkeys(ordered_col_list)
for i in range(num_of_chunks):
columns['col1'] = array.array('i', range(10000)).tostring()
columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
data = b''.join((columns[col_name] for col_name in ordered_col_list))
q.put(data)
send_data should just be the while loop from the bottom of your code, with the connection open/close functionality
def send_data(ip, addr, q):
s = socket.create_connection((ip, addr))
while True:
try:
binary_chunk = q.get(False)
except:
sleep(0.1)
continue
else:
socket.sendall(binary_chunk)
socket.recv(1000) # Get confirmation
# maybe remember to close the socket before killing the process
Now you have two (three actually if you count the parent) processes that are processing data independently. You can force the two processes to synchronise their operations by setting the max_size of the queue to a single element. The operation of these two separate processes is also easy to monitor from the process manager on your computer top (Linux), Activity Monitor (OsX), don't remember what it's called under Windows.
Finally, Python 3 comes with the option of using co-routines which are neither processes nor threads, but something else entirely. Co-routines are pretty cool from a CS point of view, but a bit of a head scratcher at first. There is plenty of resources to learn from though, like this post on Medium and this talk by David Beazley.
Even more generally, you might want to look into the producer/consumer pattern, if you are not already familiar with it.

If you are trying to use concurrency to improve performance in CPython I would strongly recommend using multiprocessing library instead of multithreading. It is because of GIL (Global Interpreter Lock), which can have a huge impact on execution speed (in some cases, it may cause your code to run slower than single threaded version). Also, if you would like to learn more about this topic, I recommend reading this presentation by David Beazley. Multiprocessing bypasses this problem by spawning a new Python interpreter instance for each process, thus allowing you to take full advantage of multi core architecture.

Creating a Queue delay in a Python pool without blocking

I have a large program (specifically, a function) that I'm attempting to parallelize using a JoinableQueue and the multiprocessing map_async method. The function that I'm working with does several operations on multidimensional arrays, so I break up each array into sections, and each section evaluates independently; however I need to stitch together one of the arrays early on, but the "stitch" happens before the "evaluate" and I need to introduce some kind of delay in the JoinableQueue. I've searched all over for a workable solution but I'm very new to multiprocessing and most of it goes over my head.
This phrasing may be confusing- apologies. Here's an outline of my code (I can't put all of it because it's very long, but I can provide additional detail if needed)
import numpy as np
import multiprocessing as mp
from multiprocessing import Pool, Pipe, JoinableQueue
def main_function(section_number):
#define section sizes
array_this_section = array[:,start:end+1,:]
histogram_this_section = np.zeros((3, dataset_size, dataset_size))
#start and end are defined according to the size of the array
#dataset_size is to show that the histogram is a different size than the array
for m in range(1,num_iterations+1):
#do several operations- each section of the array
#corresponds to a section on the histogram
hist_queue.put(histogram_this_section)
#each process sends their own part of the histogram outside of the pool
#to be combined with every other part- later operations
#in this function must use the full histogram
hist_queue.join()
full_histogram = full_hist_queue.get()
full_hist_queue.task_done()
#do many more operations
hist_queue = JoinableQueue()
full_hist_queue = JoinableQueue()
if __name__ == '__main__':
pool = mp.Pool(num_sections)
args = np.arange(num_sections)
pool.map_async(main_function, args, chunksize=1)
#I need the map_async because the program is designed to display an output at the
#end of each iteration, and each output must be a compilation of all processes
#a few variable definitions go here
for m in range(1,num_iterations+1):
for i in range(num_sections):
temp_hist = hist_queue.get() #the code hangs here because the queue
#is attempting to get before anything
#has been put
hist_full += temp_hist
for i in range(num_sections):
hist_queue.task_done()
for i in range(num_sections):
full_hist_queue.put(hist_full) #the full histogram is sent back into
#the pool
full_hist_queue.join()
#etc etc
pool.close()
pool.join()

I'm pretty sure that your issue is how you're creating the Queues and trying to share them with the child processes. If you just have them as global variables, they may be recreated in the child processes instead of inherited (the exact details depend on what OS and/or context you're using for multiprocessing).
A better way to go about solving this issue is to avoid using multiprocessing.Pool to spawn your processes and instead explicitly create Process instances for your workers yourself. This way you can pass the Queue instances to the processes that need them without any difficulty (it's technically possible to pass the queues to the Pool workers, but it's awkward).
I'd try something like this:
def worker_function(section_number, hist_queue, full_hist_queue): # take queues as arguments
# ... the rest of the function can work as before
# note, I renamed this from "main_function" since it's not running in the main process
if __name__ == '__main__':
hist_queue = JoinableQueue() # create the queues only in the main process
full_hist_queue = JoinableQueue() # the workers don't need to access them as globals
processes = [Process(target=worker_function, args=(i, hist_queue, full_hist_queue)
for i in range(num_sections)]
for p in processes:
p.start()
# ...
If the different stages of your worker function are more or less independent of one another (that is, the "do many more operations" step doesn't depend directly on the "do several operations" step above it, just on full_histogram), you might be able to keep the Pool and instead split up the different steps into separate functions, which the main process could call via several calls to map on the pool. You don't need to use your own Queues in this approach, just the ones built in to the Pool. This might be best especially if the number of "sections" you're splitting the work up into doesn't correspond closely with the number of processor cores on your computer. You can let the Pool match the number of cores, and have each one work on several sections of the data in turn.
A rough sketch of this would be something like:
def worker_make_hist(section_number):
# do several operations, get a partial histogram
return histogram_this_section
def worker_do_more_ops(section_number, full_histogram):
# whatever...
return some_result
if __name__ == "__main__":
pool = multiprocessing.Pool() # by default the size will be equal to the number of cores
for temp_hist in pool.imap_unordered(worker_make_hist, range(number_of_sections)):
hist_full += temp_hist
some_results = pool.starmap(worker_do_more_ops, zip(range(number_of_sections),
itertools.repeat(hist_full)))

Access python program data while running

I have a python program that's been running for a while, and because of an unanticipated event, I'm now unsure that it will complete within a reasonable amount of time. The data it's collected so far, however, is valuable, and I would like to recover it if possible.
Here is the relevant code
from multiprocessing.dummy import Pool as ThreadPool
def pull_details(url):
#accesses a given URL
#returns some data which gets appended to the results list
pool = ThreadPool(25)
results = pool.map(pull_details, urls)
pool.close()
pool.join()
So I either need to access the data that is currently in results or somehow change the source of the code (or somehow manually change the program's control) to kill the loop so it continues to the later part of the program in which the data is exported (not sure if the second way is possible).
It seems as though the first option is also quite tricky, but luckily the IDE (Spyder) I'm using indicates the value of what I assume is the location of the list in the machine's memory (0xB73EDECCL).
Is it possible to create a C program (or another python program) to access this location in memory and read what's there?

Can't you use some sort of mechanism to exchange data between the two processes, like queues or pipes.
something like below:
from multiprocessing import Queue
from multiprocessing.dummy import Pool as ThreadPool
def pull_details(args=None):
q.put([my useful data])
q = Queue()
pool = ThreadPool(25)
results = pool.map(pull_details(args=q), urls)
while not done:
results = q.get()
pool.close()
pool.join()

Staggered data loading with multiprocessing.Queue sometimes leads to items being consumed out of order

I'm writing a script which animates image data. I have a number of large image cubes (3D arrays). For each of these, I step through the frames in each cube, and once I get near the end of it, I load the next cube and continue. Due to the large size of each cube, there is a significant load time (~5s). I'd like the animation to transition between cubes seamlessly (while also conserving memory), so I'm staggering the load processes. I've made some progress towards a solution, but some problems persist.
The code below loads each data cube, splits it into frames and puts these into a multiprocessing.Queue. Once the number of frames in the queue falls below a certain threshold, the next load process is triggered which loads another cube and unpacks it into the queue.
Check out the code below:
import numpy as np
import multiprocessing as mp
import logging
logger = mp.log_to_stderr(logging.INFO)
import time
def data_loader(event, queue, **kw):
'''loads data from 3D image cube'''
event.wait() #wait for trigger before loading
logger.info( 'Loading data' )
time.sleep(3) #pretend to take long to load the data
n = 100
data = np.ones((n,20,20))*np.arange(n)[:,None,None] #imaginary 3D image cube (increasing numbers so that we can track the data ordering)
logger.info( 'Adding data to queue' )
for d in data:
queue.put(d)
logger.info( 'Done adding to queue!' )
def queue_monitor(queue, triggers, threshold=50, interval=5):
'''
Triggers the load events once the number of data in the queue falls below
threshold, then doesn't trigger again until the interval has passed.
Note: interval should be larger than data load time.
'''
while len(triggers):
if queue.qsize() < threshold:
logger.info( 'Triggering next load' )
triggers.pop(0).set()
time.sleep(interval)
if __name__ == '__main__':
logger.info( "Starting" )
out_queue = mp.Queue()
#Initialise the load processes
nprocs, procs = 3, []
triggers = [mp.Event() for _ in range(nprocs)]
triggers[0].set() #set the first process to trigger immediately
for i, trigger in enumerate(triggers):
p = mp.Process( name='data_loader %d'%i, target=data_loader,
args=(trigger, out_queue) )
procs.append( p )
for p in procs:
p.start()
#Monitoring process
qm = mp.Process( name='queue_monitor', target=queue_monitor,
args=(out_queue, triggers) )
qm.start()
#consume data
while out_queue.empty():
pass
else:
for d in iter( out_queue.get, None ):
time.sleep(0.2) #pretend to take some time to process/animate the data
logger.info( 'data: %i' %d[0,0] ) #just to keep track of data ordering
This works brilliantly in some cases, but sometimes the order of the data gets jumbled after a new load process is triggered. I can't figure out why this should happen - mp.Queue is supposed to be FIFO right?! For eg. Running the code above as is won't preserve the correct order in the output queue, however, changing the threshold to a lower value eg. 30 fixes this. *so confused...
So question: How do I correctly implement this staggered loading strategy with multiprocessing in python?

This looks like a buffering problem. Internally, multiprocessing.Queue uses a buffer to temporarily store items you've enqueued, and eventually flushes them to a Pipe in a background thread. It's only after the flushing happening that the items are actually sent to other processes. Because you're putting large objects onto the Queue, there is a lot of buffering going on. This is causing the loading processes to actually overlap, even though your logging shows that one process is done before the other starts. The docs actually have a warning about this scenario:
When an object is put on a queue, the object is pickled and a
background thread later flushes the pickled data to an underlying
pipe. This has some consequences which are a little surprising, but
should not cause any practical difficulties – if they really bother
you then you can instead use a queue created with a manager.
After putting an object on an empty queue there may be an infinitesimal delay before the queue’s empty() method returns False
and get_nowait() can return without raising Queue.Empty.
If multiple processes are enqueuing objects, it is possible for the objects to be received at the other end out-of-order. However,
objects enqueued by the same process will always be in the expected
order with respect to each other.
I would recommend doing as the docs state, and use a multiprocessing.Manager to create your queue:
m = mp.Manager()
out_queue = m.Queue()
Which will let you avoid the issue altogether.
Another option would be to use just one process to do all the data loading, and have it run in a loop, with the event.wait() call at the top of the loop.

Python multiprocessing for parallel processes

I'm sorry if this is too simple for some people, but I still don't get the trick with python's multiprocessing. I've read
http://docs.python.org/dev/library/multiprocessing
http://pymotw.com/2/multiprocessing/basics.html
and many other tutorials and examples that google gives me... many of them from here too.
Well, my situation is that I have to compute many numpy matrices and I need to store them in a single numpy matrix afterwards. Let's say I want to use 20 cores (or that I can use 20 cores) but I haven't managed to successfully use the pool resource since it keeps the processes alive till the pool "dies". So I thought on doing something like this:
from multiprocessing import Process, Queue
import numpy as np
def f(q,i):
q.put( np.zeros( (4,4) ) )
if __name__ == '__main__':
q = Queue()
for i in range(30):
p = Process(target=f, args=(q,))
p.start()
p.join()
result = q.get()
while q.empty() == False:
result += q.get()
print result
but then it looks like the processes don't run in parallel but they run sequentially (please correct me if I'm wrong) and I don't know if they die after they do their computation (so for more than 20 processes the ones that did their part leave the core free for another process). Plus, for a very large number (let's say 100.000), storing all those matrices (which may be really big too) in a queue will use a lot of memory, rendering the code useless since the idea is to put every result on each iteration in the final result, like using a lock (and its acquire() and release() methods), but if this code isn't for parallel processing, the lock is useless too...
I hope somebody may help me.
Thanks in advance!

You are correct, they are executing sequentially in your example.
p.join() causes the current thread to block until it is finished executing. You'll either want to join your processes individually outside of your for loop (e.g., by storing them in a list and then iterating over it) or use something like numpy.Pool and apply_async with a callback. That will also let you add it to your results directly rather than keeping the objects around.
For example:
def f(i):
return i*np.identity(4)
if __name__ == '__main__':
p=Pool(5)
result = np.zeros((4,4))
def adder(value):
global result
result += value
for i in range(30):
p.apply_async(f, args=(i,), callback=adder)
p.close()
p.join()
print result
Closing and then joining the pool at the end ensures that the pool's processes have completed and the result object is finished being computed. You could also investigate using Pool.imap as a solution to your problem. That particular solution would look something like this:
if __name__ == '__main__':
p=Pool(5)
result = np.zeros((4,4))
im = p.imap_unordered(f, range(30), chunksize=5)
for x in im:
result += x
print result
This is cleaner for your specific situation, but may not be for whatever you are ultimately trying to do.
As to storing all of your varied results, if I understand your question, you can just add it off into a result in the callback method (as above) or item-at-a-time using imap/imap_unordered (which still stores the results, but you'll clear it as it builds). Then it doesn't need to be stored for longer than it takes to add to the result.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Better way to share memory for multiprocessing in Python? - python

Related

Separate computation from socket work in Python

Creating a Queue delay in a Python pool without blocking

Access python program data while running

Staggered data loading with multiprocessing.Queue sometimes leads to items being consumed out of order

Python multiprocessing for parallel processes

Categories

Resources