Put large ndarrays fast to multiprocessing.Queue - python

When trying to put a large ndarray to a Queue in a Process, I encounter the following problem:
First, here is the code:
import numpy
import multiprocessing
from ctypes import c_bool
import time

def run(acquisition_running, data_queue):
    while acquisition_running.value:
        length = 65536
        data = numpy.ndarray(length, dtype='float')
        data_queue.put(data)
        time.sleep(0.1)

if __name__ == '__main__':
    acquisition_running = multiprocessing.Value(c_bool)
    data_queue = multiprocessing.Queue()
    process = multiprocessing.Process(
        target=run, args=(acquisition_running, data_queue))
    acquisition_running.value = True
    process.start()
    time.sleep(1)
    acquisition_running.value = False
    process.join()
    print('Finished')
    number_items = 0
    while not data_queue.empty():
        data_item = data_queue.get()
        number_items += 1
    print(number_items)
If I use length=10 or so, everything works fine. I get 9 items transmitted through the Queue.
If I increase to length=1000, the process.join() call blocks on my computer, although the function run() has already finished. I can comment out the line with process.join() and see that only 2 items were put into the Queue, so apparently putting data into the Queue became very slow.
My plan is actually to transport 4 ndarrays, each with length 65536. Using a Thread, this worked very fast (<1 ms). Is there a way to improve the speed of transmitting data between processes?
I used Python 3.4 on a Windows machine, but I get the same behavior with Python 3.4 on Linux.

"Is there a way to improve speed of transmitting data for processes?"
Surely, given the right problem to solve. Currently, you are just filling a buffer without emptying it simultaneously. Congratulations, you have just built yourself a so-called deadlock. The corresponding quote from the documentation is:
Bear in mind that a process that has put items in a queue will wait
before terminating until all the buffered items are fed by the
“feeder” thread to the underlying pipe.
But, let's approach this slowly. First of all, "speed" is not your problem! I understand that you are just experimenting with Python's multiprocessing. The most important insight when reading your code is that the flow of communication between parent and child and especially the event handling does not really make sense. If you have a real-world problem that you are trying to solve, you definitely cannot solve it this way. If you do not have a real-world problem, then you first need to come up with a good one before you should start writing code ;-). Eventually, you will need to understand the communication primitives an operating system provides for inter-process communication.
Explanation for what you are observing:
Your child process generates about 10 * length * size(float) bytes of data (considering that the child can perform about 10 iterations while your parent sleeps for about 1 s before setting acquisition_running to False). While your parent process sleeps, the child puts that amount of data into a queue. You need to appreciate that a queue is a complex construct. You do not need to understand every bit of it. But one thing really is important: a queue for inter-process communication uses some kind of buffer* that sits between parent and child. Buffers usually have a limited size. You are writing to this buffer from within the child without simultaneously reading from it in the parent. That is, the buffer contents steadily grow while the parent is just sleeping. By increasing length you run into the situation where the queue buffer is full and the child process cannot write to it anymore. However, the child process cannot terminate before it has written all its data. At the same time, the parent process waits for the child to terminate.
You see? One entity waits for the other. The parent waits for the child to terminate and the child waits for the parent to make some space. Such a situation is called deadlock. It cannot resolve itself.
Regarding the details, the buffer situation is a little more complex than described above. Your child process has spawned an additional thread that tries to push the buffered data through a pipe to the parent. The buffer of this pipe is actually the limiting entity. It is defined by the operating system and, at least on Linux, is usually not larger than 65536 bytes.
The essential part is, in other words: the parent does not read from the pipe before the child finishes attempting to write to it. In every meaningful scenario where pipes are used, reading and writing happen in a rather simultaneous fashion so that one process can quickly react to input provided by another. You are doing the exact opposite: you put your parent to sleep and thereby render it unresponsive to input from the child, resulting in a deadlock situation.
(*) "When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe", from https://docs.python.org/2/library/multiprocessing.html
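One way to avoid the deadlock described above is to have the parent drain the queue *before* joining the child, instead of sleeping. A minimal sketch reusing the names from the question (the timeout value is an assumption; it just has to be longer than the producer's sleep between puts):

```python
import multiprocessing
import queue
import time
from ctypes import c_bool

import numpy


def run(acquisition_running, data_queue):
    # Producer: keep putting arrays while the flag is set.
    while acquisition_running.value:
        data_queue.put(numpy.zeros(65536, dtype='float'))
        time.sleep(0.1)


def main():
    acquisition_running = multiprocessing.Value(c_bool, True)
    data_queue = multiprocessing.Queue()
    process = multiprocessing.Process(
        target=run, args=(acquisition_running, data_queue))
    process.start()

    time.sleep(1)
    acquisition_running.value = False

    # Drain the queue BEFORE joining.  The get() timeout must exceed the
    # producer's sleep, so an Empty really means "producer is done".
    number_items = 0
    while True:
        try:
            data_queue.get(timeout=0.5)
            number_items += 1
        except queue.Empty:
            break

    process.join()  # safe now: the feeder thread has flushed everything
    return number_items


if __name__ == '__main__':
    print(main())
```

Because the parent empties the queue, the child's feeder thread can always flush its buffer through the pipe, and join() returns promptly.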

If you have really big arrays, you might want to only pass their pickled state -- or a better alternative might be to use multiprocessing.Array or multiprocessing.sharedctypes.RawArray to make a shared-memory array (for the latter, see http://briansimulator.org/sharing-numpy-arrays-between-processes/). You have to worry about conflicts, as you'll have an array that's not bound by the GIL and needs locks. However, you then only need to send array indices to access the shared array data.
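A minimal sketch of the RawArray approach (the function names and fill pattern are invented for illustration). Both processes wrap the same shared buffer in an ndarray, so no data is pickled or copied through a queue; note that RawArray has no lock of its own, so add a multiprocessing.Lock if there can be concurrent writers:

```python
import multiprocessing
from multiprocessing import sharedctypes

import numpy


def fill(raw, length):
    # Re-wrap the shared buffer as an ndarray inside the child; no copy,
    # no pickling of the data itself -- both processes see the same memory.
    arr = numpy.frombuffer(raw, dtype=numpy.float64, count=length)
    arr[:] = numpy.arange(length)


def main(length=65536):
    raw = sharedctypes.RawArray('d', length)  # 'd' = C double, zero-filled
    p = multiprocessing.Process(target=fill, args=(raw, length))
    p.start()
    p.join()
    # The parent sees the child's writes without any queue traffic.
    return numpy.frombuffer(raw, dtype=numpy.float64, count=length)


if __name__ == '__main__':
    print(main()[:5])
```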

One thing you could do to resolve the issue, in tandem with the excellent answer from JPG, is to unload your Queue before joining the process.
So do this instead:
process.start()
data_item = data_queue.get()
process.join()
While this does not fully replicate the behavior of the original code (the item counting), you get the idea ;)

Convert the array/list to a string:
q.put(str(your_array))

Related

Why does multiprocessing.Queue have a small delay while (apparently) multiprocessing.Pipe does not?

Documentation for multiprocessing.Queue points out that there's a bit of a delay from when an item is enqueued until its pickled representation is flushed to the underlying Pipe. Apparently, though, you can enqueue an item straight into a Pipe (it doesn't say otherwise and implies that's the case).
Why doesn't the Pipe need - or have - the same background thread to do the pickling? And is that the same reason that there's no similar delay when talking to a multiprocessing.managers.SyncManager.Queue?
(Bonus question: What does the documentation mean when it says "After putting an object on an empty queue there may be an infinitesimal delay ..."? I've taken calculus; I know what infinitesimal means, and that meaning doesn't seem to fit here. So what is it talking about?)
If you write to a Pipe, the current thread blocks until the write completes. There is therefore no delay (or rather, the calling thread can’t observe any), but it’s possible to deadlock; Pipe is a lower-level tool than Queue.
The situation with SyncManager.Queue is simply that all requests to the manager are synchronized such that the process that pushes an object cannot then observe it to still be empty (absent a pop).
Meanwhile, the “infinitesimal” delay merely means a thread scheduling delay rather than the (possibly much larger) time to write the entire object: it’s enough for it to have started so as to establish that the Queue is not empty. The pushing thread can nonetheless win the race and observe it still lacking the object “already pushed”.
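To illustrate the difference: Pipe.send() pickles and writes in the calling thread itself (so it can block on a full pipe buffer), whereas Queue.put() only hands the object to the feeder thread and returns. A minimal round-trip sketch with invented names:

```python
import multiprocessing


def child(conn):
    # recv() blocks until the parent's send() has pushed the bytes through.
    data = conn.recv()
    conn.send(sum(data))
    conn.close()


def main():
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=child, args=(child_conn,))
    p.start()
    # send() pickles and writes in *this* thread -- no feeder thread, so
    # for payloads larger than the OS pipe buffer it blocks until the
    # child drains the pipe.
    parent_conn.send(list(range(1000)))
    result = parent_conn.recv()
    p.join()
    return result


if __name__ == '__main__':
    print(main())
```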

Python multiprocessing prevent switching off to other processes

While using the multiprocessing module in Python, is there a way to prevent the process from switching to another process for a certain time?
I have ~50 different child processes spawned to retrieve data from a database (each process = one table in the DB), and after querying and filtering the data, I try to write the output to an Excel file.
Since all the processes have similar steps, they all reach the writing step at similar times, and of course since I am writing to a single file, I have a lock that prevents multiple processes from writing to the file.
The problem, though, is that the writing seems to take very long compared to when I wrote the same amount of data in a single process (at least 10x slower).
I am guessing one of the reasons could be that while writing, the CPU is constantly switching to other processes, which are all stuck at the mutex lock, only to come back to the one process that is actually active. I am guessing the context switching is a significant waste of time since there are a lot of processes to switch back and forth between.
I was wondering if there was a way to lock a process such that, for a certain part of the code, no context switching between processes happens.
Or any other suggestions to speed up this process?
Don't use locking and don't write from multiple processes; let the child processes return the output to the parent (e.g. via standard output), and have it wait for the processes to join to read it. I'm not 100% on the multiprocessing API but you could just have the parent process sleep and wait for a SIGCHLD and only then read data from an exited child's standard output, and write it to your output file.
This way only one process is writing to the file and you don't need any busy looping or whatever. It will be much simpler and much more efficient.
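The same "children return results, only the parent writes" structure can be had with less plumbing than SIGCHLD handling by using multiprocessing.Pool. A hedged sketch (the table names and the worker body are placeholders for the real query/filter code):

```python
import multiprocessing


def query_table(table_name):
    # Hypothetical worker: fetch + filter one table's data and RETURN it
    # instead of writing to a shared file under a lock.
    return f"rows for {table_name}"


def main(tables):
    with multiprocessing.Pool() as pool:
        # Each child returns its result; no lock, no write contention.
        results = pool.map(query_table, tables)
    # Only the parent assembles the output (a real version would write
    # these lines to the Excel file here).
    return [f"{table}: {rows}" for table, rows in zip(tables, results)]


if __name__ == '__main__':
    print(main(["users", "orders"]))
```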
You can raise the priority of your process (go to Task Manager, right-click on the process and raise its priority). However, the OS will context-switch no matter what; your process has no better claim than other processes to the OS.

Is there a cost to calling python multiprocessing .join() method

My question is inspired by a comment on the solving embarrassingly parallel problems with multiprocessing post.
I am asking about the general case where python multiprocessing is used to (1) read data from file, (2) manipulate data, (3) write results to file. In the case I describe, data that is read from file is passed to a queue A in (1) and fetched from this queue A in (2). (2) also passes results to a separate queue B and (3) fetches results from this queue B to write them to file.
When (1) is done, it passes a STOP signal* to queue A so (2) knows queue A is empty. (2) then terminates and passes a STOP signal to queue B so (3) knows queue B is empty and terminates when it has used up the results queue.
So is there any need to call the multiprocessing .join() method on (1) and (2)? I would have thought that (2) will not finish until (1) finishes and sends a STOP signal? For (3) it makes sense to wait as any subsequent instructions might else proceed without (3).
But maybe calling the .join() method costs nothing and can be used just to avoid having to think about it?
*actually, the STOP signal consists of a sequence of N stop signals where N is equivalent to the number of processes running in (2).
According to the docs, it is safe to call join multiple times - this suggests that if p has already stopped, p.join() will return immediately. This means that if you expect p to have already stopped by this time, the cost of joining it should be negligible. If p hasn't stopped (as you say you expect the writer process might not have), there is a potential cost to joining it depending on what your main process needs to do. If it does any user interaction, it will appear hung. If that is a problem, you might consider this type of pattern:
while p.is_alive():
    iterate_mainloop()
    p.join(small_timeout)
But if that process doesn't do user interaction, joining the others should be fine. That seems to be the most likely situation here - if you can afford to be blocked waiting for a disk read, you should also be fine waiting for another process to complete (modulo any defensive timeouts in case it misbehaves).
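For reference, the STOP-sentinel pipeline the question describes might be sketched like this (all names are invented, stage (2) is collapsed to a single process, and None serves as the STOP signal):

```python
import multiprocessing

STOP = None  # sentinel marking the end of a queue's stream


def stage2(queue_a, queue_b):
    # (2): consume queue A until the sentinel, forward results to
    # queue B, then pass STOP along so (3) knows when to stop.
    while True:
        item = queue_a.get()
        if item is STOP:
            queue_b.put(STOP)
            break
        queue_b.put(item * 2)  # stand-in for "manipulate data"


def main(items):
    queue_a = multiprocessing.Queue()
    queue_b = multiprocessing.Queue()
    p2 = multiprocessing.Process(target=stage2, args=(queue_a, queue_b))
    p2.start()
    for item in items:          # (1): "read data", here just a list
        queue_a.put(item)
    queue_a.put(STOP)
    results = []                # (3): collect until STOP arrives
    while True:
        item = queue_b.get()
        if item is STOP:
            break
        results.append(item)
    p2.join()  # cheap here: stage2 has already seen STOP and exited
    return results


if __name__ == '__main__':
    print(main([1, 2, 3]))
```

Because the queues are fully drained before join() is called, the join is effectively free, matching the answer's point.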

Is multiprocessing the right tool for me?

I need to write a very specific data processing daemon.
Here is how I thought it could work with multiprocessing :
Process #1: One process to fetch some vital meta data, they can be fetched every second, but those data must be available in process #2. Process #1 writes the data, and Process #2 reads them.
Process #2: Two processes which will fetch the real data based on what has been received in process #1. Fetched data will be stored into a (big) queue to be processed "later"
Process #3: Two (or more) processes which poll the queue created in Process #2 and process those data. Once done, a new queue is filled up to be used in Process #4
Process #4 : Two processes which will read the queue filled by Process(es) #3 and send the result back over HTTP.
The idea behind all these different processes is to specialize them as much as possible and to make them as independent as possible.
All those processes will be wrapped into a main daemon, which is implemented here:
http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/
I am wondering if what I have imagined is relevant/stupid/overkill/etc., especially since I would run daemonic multiprocessing.Process(es) within a main parent process which will itself be daemonized.
Furthermore, I am a bit concerned about potential locking problems. In theory, the processes that read and write data use different variables/structures, so that should avoid a few problems, but I am still concerned.
Maybe using multiprocessing for my context is not the right thing to do. I would love to get your feedback about this.
Notes :
I can not use Redis as a data structure server
I thought about using ZeroMQ for IPC but I would avoid using another extra library if multiprocessing can do the job as well.
Thanks in advance for your feedback.
Generally, your division in different workers with different tasks as well as your plan to let them communicate already looks good. However, one thing you should be aware of is whenever a processing step is I/O or CPU bound. If you are I/O bound, I'd go for the threading module whenever you can: the memory footprint of your application will be smaller and the communication between threads can be more efficient, as shared memory is allowed. Only if you need additional CPU power, go for multiprocessing. In your system, you can use both (it looks like process 3 (or more) will do some heavy computing, while the other workers will predominantly be I/O bound).
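The split suggested above can be sketched with concurrent.futures: threads for an I/O-bound fetch stage, processes for a CPU-bound crunch stage. fetch and crunch are invented stand-ins for the real workers:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(url):
    # I/O-bound stand-in: a real version would do network or disk I/O,
    # which releases the GIL, so threads work well and share memory.
    return f"payload:{url}"


def crunch(payload):
    # CPU-bound stand-in: real computation needs separate processes
    # to use more than one core under CPython.
    return sum(ord(c) for c in payload)


def main(urls):
    with ThreadPoolExecutor(max_workers=4) as tpool:
        payloads = list(tpool.map(fetch, urls))
    with ProcessPoolExecutor(max_workers=2) as ppool:
        return list(ppool.map(crunch, payloads))


if __name__ == '__main__':
    print(main(["a", "b"]))
```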

How do I limit the number of active threads in python?

I'm new to Python and making some headway with threading - I'm doing some music file conversion and want to be able to utilize the multiple cores on my machine (one active conversion thread per core).
class EncodeThread(threading.Thread):
    # this is hacked together a bit, but should give you an idea
    def run(self):
        decode = subprocess.Popen(["flac", "--decode", "--stdout", self.src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", self.dest],
                                  stdin=decode.stdout)
        encode.communicate()

# some other code puts these threads with various src/dest pairs in a list
for proc in threads:  # `threads` is my list of `threading.Thread` objects
    proc.start()
Everything works, all the files get encoded, bravo! ... however, all the processes spawn immediately, yet I only want to run two at a time (one for each core). As soon as one is finished, I want it to move on to the next on the list until it is finished, then continue with the program.
How do I do this?
(I've looked at the thread pool and queue functions but I can't find a simple answer.)
Edit: maybe I should add that each of my threads is using subprocess.Popen to run a separate command line decoder (flac) piped to stdout which is fed into a command line encoder (lame/mp3).
If you want to limit the number of parallel threads, use a semaphore:
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

class EncodeThread(threading.Thread):
    def run(self):
        threadLimiter.acquire()
        try:
            <your code here>
        finally:
            threadLimiter.release()
Start all threads at once. All but maximumNumberOfThreads will wait in threadLimiter.acquire() and a waiting thread will only continue once another thread goes through threadLimiter.release().
"Each of my threads is using subprocess.Popen to run a separate command line [process]".
Why have a bunch of threads manage a bunch of processes? That's exactly what an OS does for you. Why micro-manage what the OS already manages?
Rather than fool around with threads overseeing processes, just fork off processes. Your process table probably can't handle 2000 processes, but it can handle a few dozen (maybe a few hundred) pretty easily.
You want to have more work queued up than your CPUs can possibly handle. The real question is one of memory -- not processes or threads. If the sum of all the active data for all the processes exceeds physical memory, then data has to be swapped out, and that will slow you down.
If your processes have a fairly small memory footprint, you can have lots and lots running. If your processes have a large memory footprint, you can't have very many running.
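A sketch of the thread-free approach described here, assuming a POSIX system: spawn children directly up to a limit, poll them non-blockingly, and replenish as they finish. The function name and limit are made up for illustration:

```python
import subprocess
import time


def run_all(commands, max_procs=2):
    """Run each command, keeping at most max_procs children alive."""
    pending = list(commands)
    running = []
    codes = []
    while pending or running:
        # Replenish the pool of children up to the limit.
        while pending and len(running) < max_procs:
            running.append(subprocess.Popen(pending.pop(0)))
        # Reap any that have finished; poll() is non-blocking.
        still_running = []
        for p in running:
            if p.poll() is None:
                still_running.append(p)
            else:
                codes.append(p.returncode)
        running = still_running
        time.sleep(0.05)  # don't busy-spin while children work
    return codes


if __name__ == '__main__':
    print(run_all([["echo", "one"], ["echo", "two"], ["echo", "three"]]))
```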
If you're using the default "cpython" version then this won't help you, because only one thread can execute at a time; look up Global Interpreter Lock. Instead, I'd suggest looking at the multiprocessing module in Python 2.6 -- it makes parallel programming a cinch. You can create a Pool object with 2*num_threads processes, and give it a bunch of tasks to do. It will execute up to 2*num_threads tasks at a time, until all are done.
At work I have recently migrated a bunch of Python XML tools (a differ, xpath grepper, and bulk xslt transformer) to use this, and have had very nice results with two processes per processor.
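The Pool suggestion above might look like this for the encoding use case; encode_one is a placeholder for the real flac/lame subprocess pipeline, and a worker count of 2 stands in for "one per core":

```python
import multiprocessing


def encode_one(pair):
    # Stand-in for one decode-then-encode job; a real version would
    # launch the flac/lame subprocesses for (src, dest) here.
    src, dest = pair
    return f"{src} -> {dest}"


def main(pairs, workers=2):
    # The pool runs at most `workers` jobs at a time until all are done.
    with multiprocessing.Pool(workers) as pool:
        return pool.map(encode_one, pairs)


if __name__ == '__main__':
    print(main([("a.flac", "a.mp3"), ("b.flac", "b.mp3")]))
```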
It looks to me that what you want is a pool of some sort, and in that pool you would like to have n threads where n == the number of processors on your system. You would then have another thread whose only job was to feed jobs into a queue which the worker threads could pick up and process as they became free (so for a dual-core machine, you'd have three threads but the main thread would be doing very little).
As you are new to Python, though, I'll assume you don't know about the GIL and its side-effects with regard to threading. If you read the article I linked you will soon understand why traditional multithreading solutions are not always the best in the Python world. Instead you should consider using the multiprocessing module (new in Python 2.6; in 2.5 you can use this backport) to achieve the same effect. It side-steps the issue of the GIL by using multiple processes as if they were threads within the same application. There are some restrictions about how you share data (you are working in different memory spaces) but actually this is no bad thing: it just encourages good practice such as minimising the contact points between threads (or processes in this case).
In your case you are probably interested in using a pool as specified here.
Short answer: don't use threads.
For a working example, you can look at something I've recently tossed together at work. It's a little wrapper around ssh which runs a configurable number of Popen() subprocesses. I've posted it at: Bitbucket: classh (Cluster Admin's ssh Wrapper).
As noted, I don't use threads; I just spawn off the children, loop over them calling their .poll() methods, check for timeouts (also configurable), and replenish the pool as I gather the results. I've played with different sleep() values, and in the past (before the subprocess module was added to Python) I wrote a version that used the signal module (SIGCHLD and SIGALRM) and the os.fork() and os.execve() functions, with my own pipe and file descriptor plumbing, etc.
In my case I'm incrementally printing results as I gather them ... and remembering all of them to summarize at the end (when all the jobs have completed or been killed for exceeding the timeout).
I ran that, as posted, on a list of 25,000 internal hosts (many of which are down, retired, located internationally, not accessible to my test account, etc.). It completed the job in just over two hours and had no issues. (There were about 60 that timed out due to systems in degenerate/thrashing states -- proving that my timeout handling works correctly.)
So I know this model works reliably. Running 100 concurrent ssh processes with this code doesn't seem to cause any noticeable impact (it's a moderately old FreeBSD box). I used to run the old (pre-subprocess) version with 100 concurrent processes on my old 512 MB laptop without problems, too.
(BTW: I plan to clean this up and add features to it; feel free to contribute or to clone off your own branch of it; that's what Bitbucket.org is for).
I am not an expert in this, but I have read something about Locks. This article might help you out.
Hope this helps
I would like to add something, just as a reference for others looking to do something similar, but who might have coded things different from the OP. This question was the first one I came across when searching and the chosen answer pointed me in the right direction. Just trying to give something back.
import threading
import time

maximumNumberOfThreads = 2
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

def simulateThread(a, b):
    threadLimiter.acquire()
    try:
        # do some stuff
        c = a + b
        print('a + b = ', c)
        time.sleep(3)
    except NameError:  # or some other type of error
        print('some error')
    finally:
        # released exactly once, whether or not an error occurred
        # (releasing in both the except and finally blocks would
        # over-release the BoundedSemaphore and raise ValueError)
        threadLimiter.release()

threads = []
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(len(sample)):
    thread = threading.Thread(target=simulateThread, args=(sample[i], 2))
    thread.daemon = True
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
This basically follows what you will find on this site:
https://www.kite.com/python/docs/threading.BoundedSemaphore
