I need to write a very specific data processing daemon.
Here is how I thought it could work with multiprocessing:
Process #1: one process to fetch some vital metadata. It can fetch every second, but the data must be available to Process #2: Process #1 writes the data, and Process #2 reads it.
Process #2: two processes which fetch the real data based on what was received by Process #1. Fetched data is stored in a (big) queue to be processed "later".
Process #3: two (or more) processes which poll the queue filled by Process #2 and process that data. Once done, they fill a new queue for Process #4.
Process #4: two processes which read the queue filled by Process(es) #3 and send the results back over HTTP.
The idea behind all these different processes is to specialize them as much as possible and to make them as independent as possible.
All those processes will be wrapped in a main daemon, implemented as described here:
http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/
I am wondering whether what I have imagined is sensible/stupid/overkill/etc., especially since I would be running daemonic multiprocessing.Process(es) within a main parent process that is itself daemonized.
Furthermore, I am a bit concerned about potential locking problems. In theory, processes that read and write data use different variables/structures, which should avoid a few problems, but I am still concerned.
Maybe using multiprocessing in my context is not the right thing to do. I would love to get your feedback about this.
Notes:
I cannot use Redis as a data-structure server.
I thought about using ZeroMQ for IPC, but I would rather avoid an extra library if multiprocessing can do the job as well.
Thanks in advance for your feedback.
Generally, your division into different workers with different tasks, as well as your plan to let them communicate, already looks good. However, one thing you should be aware of is whether a processing step is I/O bound or CPU bound. If you are I/O bound, I'd go for the threading module whenever you can: the memory footprint of your application will be smaller, and communication between threads can be more efficient, since they share memory. Go for multiprocessing only if you need additional CPU power. In your system you can mix both (it looks like Process #3 (or more) will do some heavy computing, while the other workers will be predominantly I/O bound).
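For illustration, here is a minimal sketch of how two adjacent stages of such a pipeline can be wired together with multiprocessing.Queue. The fetch and process bodies are stand-ins for the real work, and a None sentinel signals shutdown:

from multiprocessing import Process, Queue

def fetch(raw_q):
    # a stage like Process #2: put fetched items on the raw queue
    for i in range(10):
        raw_q.put('raw %d' % i)      # stand-in for the real fetch
    raw_q.put(None)                  # sentinel: no more data

def process(raw_q, done_q):
    # a stage like Process #3: consume raw items, emit processed ones
    while True:
        item = raw_q.get()
        if item is None:
            done_q.put(None)
            break
        done_q.put(item.upper())     # stand-in for the real processing

if __name__ == '__main__':
    raw_q, done_q = Queue(), Queue()
    stages = [Process(target=fetch, args=(raw_q,)),
              Process(target=process, args=(raw_q, done_q))]
    for p in stages:
        p.start()
    while True:                      # a stage like Process #4
        item = done_q.get()
        if item is None:
            break
        print(item)                  # stand-in for the HTTP send
    for p in stages:
        p.join()

With two workers per stage you would need one sentinel per consumer, but the wiring stays the same.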
I'm pretty new to multiprocessing in Python and I've done a lot of digging around, but can't seem to find exactly what I'm looking for. I have a bit of a consumer/producer problem where I have a simple server with an endpoint that consumes from a queue and a function that produces onto the queue. The queue can be full, so the producer doesn't always need to be running.
While the queue isn't full, I want the producer task to run but I don't want it to block the server from receiving or servicing requests. I tried using multithreading but this producing process is very slow and the GIL slows it down too much. I want the server to be running all the time, and whenever the queue is no longer full (something has been consumed), I want to kick off this producer task as a separate process and I want it to run until the queue is full again. What is the best way to share the queue so that the producer process can access the queue used by the main process?
What is the best way to share the queue so that the producer process can access the queue used by the main process?
If this is the important part of your question (which seems like it's actually several questions), then multiprocessing.Queue seems to be exactly what you need. I've used this in several projects to have multiple processes feed a queue for consumption by a separate process, so if that's what you're looking for, this should work.
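As a minimal sketch of that pattern (the queue size and item format here are made up), a bounded multiprocessing.Queue passed to the child at creation time gives you the "only produce while not full" behaviour for free, because put() blocks in the producer whenever the queue is full:

from multiprocessing import Process, Queue

def producer(q):
    # hypothetical slow producer; q.put() blocks while the queue is full
    n = 0
    while True:
        q.put('item %d' % n)
        n += 1

if __name__ == '__main__':
    q = Queue(maxsize=100)            # bounded, so "full" means something
    p = Process(target=producer, args=(q,))
    p.daemon = True                   # dies with the main/server process
    p.start()
    for _ in range(10):               # the server side consumes at its own pace
        print(q.get())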
In Python, what is the cost of creating another process? Is it sufficiently high that it's not worth it as a way of handling events?
Context of question: I'm using radio modules to transmit data from sensors to a raspberry pi. I have a python script running on the pi, catching the data and handling it - putting it in a MySQL database and occasionally triggering other things.
My dilemma is that if I handle everything in a single script there's a risk that some data packet might be ignored, because it's taking too long to run the processing. I could avoid this by spawning a separate process to handle the event and then die - but if the cost of creating a process is high it might be worth me focusing on more efficient code than creating a process.
Thoughts people?
Edit to add:
Sensors are pushing data at intervals of 8 seconds and up
No buffering easily available
If processing takes longer than the time until the next reading, that reading would be ignored and lost. (The transmission system guarantees delivery - I need to guarantee the Pi is in a position to receive it.)
I think you're trying to address two problems at the same time, and it is getting confusing.
Polling frequency: here the question is, how fast you need to poll data so that you don't risk losing some
Concurrency and i/o locking: what happens if processing takes longer than the frequency interval
The first problem depends entirely on your underlying architecture: are your sensors pushing data to your Raspberry Pi, or is the Pi polling them? Is any buffering involved? What happens if your polling frequency is faster than the rate at which data arrives?
My recommendation is to enforce the KISS principle and basically write two tools: one that is entirely in charge of storing data as fast as you need, and another that takes care of doing something with the data.
For example, the storing could be done by a memcached instance, or even a simple shell pipe if you're at the prototyping level (a sketch follows the list below). The second utility that manipulates the data then does not have to worry about polling frequency, I/O errors (what if the SQL database fails?), and so on.
As a bonus, de-coupling data retrieval and manipulation allows you to:
Test more easily (you can store some data as a sample, and then replay it to the manipulation routine to validate behaviour)
Isolate problems more easily
Scale much faster (you could have as many "manipulators" as you need)
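As a sketch of the shell-pipe prototype mentioned above, the two tools could be as small as this (radio_readings and store_reading are hypothetical placeholders for the radio-module and MySQL code):

# receiver.py - its only job is to get readings out as fast as possible
import sys
for reading in radio_readings():      # hypothetical radio-module generator
    sys.stdout.write(reading + '\n')
    sys.stdout.flush()                # hand each reading off immediately

# processor.py - consumes at its own pace; the pipe buffers between the two
import sys
for line in sys.stdin:
    store_reading(line.strip())       # hypothetical: MySQL insert, triggers, etc.

Run as python receiver.py | python processor.py: the pipe's buffer absorbs bursts, so a slow database insert no longer risks losing a packet.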
The cost of spawning new threads depends on what you do with them.
In terms of memory, make sure your threads aren't each loading everything; threading shares memory across the whole application, so variables keep their scope.
In terms of processing, be sure you don't overload your system.
I'm doing something quite similar at work: I'm scanning a folder (where files are constantly being dropped), and I do some work on every file.
I use my main thread to initialize the application and spawn the child threads.
One child thread is used for logging.
The other children do the actual work.
My main loop looks like this:
import os
import threading
import time

# spawn logging thread (omitted)
while 1:
    for dirpath, dirnames, filenames in os.walk('/gw'):
        for name in filenames:
            while threading.active_count() > 200:   # cap at 200 workers
                time.sleep(0.1)
            # spawn a worker thread (process_file stands in for the real work)
            threading.Thread(target=process_file,
                             args=(os.path.join(dirpath, name),)).start()
    time.sleep(1)
This basically means that my application won't use more than 201 threads (200 + main thread).
So then it was just a matter of playing with the application, using htop to monitor its resource consumption, and limiting the app to a sensible maximum number of threads.
While using the multiprocessing module in Python, is there a way to prevent a process from being switched out in favour of another process for a certain time?
I have ~50 different child processes spawned to retrieve data from a database (each process = each table in DB) and after querying and filtering the data, I try to write the output to an excel file.
Since all the processes have similar steps, they all reach the writing step at similar times, and of course, since I am writing to a single file, I have a lock that prevents multiple processes from writing to the file at once.
The problem, though, is that writing seems to take very long compared to when the same amount of data was written by a single process (slower by at least 10x).
I am guessing one of the reasons is that while writing, the CPU constantly switches to other processes, all of which are stuck at the mutex, only to come back to the single process that can make progress. I am guessing the context switching is a significant waste of time, since there are a lot of processes to switch back and forth between.
I was wondering if there is a way to lock a process such that, for a certain part of the code, no context switching between processes happens.
Or any other suggestions to speed up this process?
Don't use locking, and don't write from multiple processes; let the child processes return the output to the parent (e.g. via standard output), and have it wait for the processes to join to read it. I'm not 100% sure on the multiprocessing API, but you could just have the parent process sleep, wait for a SIGCHLD, and only then read data from an exited child's standard output and write it to your output file.
This way only one process is writing to the file and you don't need any busy looping or whatever. It will be much simpler and much more efficient.
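I don't know the poster's exact layout, but with multiprocessing the same idea can be sketched like this (the table names and worker body are invented): each worker returns its chunk instead of writing it, and only the parent touches the file, so no lock is needed:

from multiprocessing import Pool

def fetch_and_filter(table):
    # stand-in for the real work: query the table, filter the rows,
    # and return the finished chunk instead of writing it here
    return 'rows for %s\n' % table

if __name__ == '__main__':
    tables = ['table_%d' % i for i in range(50)]   # hypothetical table names
    pool = Pool()
    with open('output.txt', 'w') as out:           # only the parent writes
        for chunk in pool.imap(fetch_and_filter, tables):
            out.write(chunk)
    pool.close()
    pool.join()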
You can raise the priority of your process (go to Task Manager, right-click on the process, and raise its priority). However, the OS will context-switch no matter what; your process has no better claim on the OS than other processes.
I have a 3-step producer/consumer setup.
Client creates JSON-encoded dictionaries and sends them to PipeServer via a named pipe
Here are my threading.Thread subclasses:
PipeServer creates a named pipe and places messages into a queue, unprocessed_messages
Processor gets items from unprocessed_messages, processes them (via a lambda function argument), and puts them into a queue, processed_messages
Printer gets items from processed_messages, acquires a lock, prints the message, and releases the lock.
In the test script, I have one PipeServer, one Processor, and 4 Printers:
pipe_name = '\\\\.\\pipe\\testpipe'
pipe_server = pipetools.PipeServer(pipe_name, unprocessed_messages)
json_loader = lambda x: json.loads(x.decode('utf-8'))
processor = threadedtools.Processor(unprocessed_messages,
                                    processed_messages,
                                    json_loader)
print_servers = []
for i in range(4):
    print_servers.append(threadedtools.Printer(processed_messages,
                                               output_lock,
                                               'PRINTER {0}'.format(i)))
pipe_server.start()
processor.start()
for print_server in print_servers:
    print_server.start()
Question: in this kind of multi-step setup, how do I think through optimizing the number of Printer vs. Processor threads I should have? For example, how do I know if 4 is the optimal number of Printer threads to have? Should I have more processors?
I read through the Python Profilers docs, but didn't see anything that would help me think through these kinds of tradeoffs.
Generally speaking, you want to optimize for the maximum throughput of your slowest component. In this case, it sounds like that's either the Client or the Printers. If it's the Client, you want just enough Printers and Processors to keep up with new messages (maybe that's just one of each!). Otherwise you'll be wasting resources on threads you don't need.
If it's Printers, then you need to optimize for the IO that's occurring. A few variables to take into account:
How many locks can you have simultaneously?
Do you have to maintain the lock for the length of a printing transaction?
How long does a printing operation take?
If you can only have one lock, then you should only have one thread, so on and so forth.
You then want to test with real-world operation (it's difficult to predict what combination of RAM, disk, and network activity will slow you down). Instrument your code so you can see how many threads are idle at any given time. Then create a test case that feeds data into the system at maximum throughput. Start with an arbitrary number of threads for each component. If Client, Processor, or Printer threads are always busy, add more threads. If some threads are always idle, take some away.
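One simple way to do that instrumentation (a sketch; instrumented and the counter names are my own, not from any library) is to count busy threads around each unit of work and sample the counter from a monitor thread while the test load runs:

import threading

busy_count = 0
busy_lock = threading.Lock()

def instrumented(work_fn):
    # wrap a unit of work so a monitor can see how many threads are busy
    def wrapper(*args, **kwargs):
        global busy_count
        with busy_lock:
            busy_count += 1
        try:
            return work_fn(*args, **kwargs)
        finally:
            with busy_lock:
                busy_count -= 1
    return wrapper

If busy_count for a pool stays pinned at the pool size, add threads; if it usually sits below it, take some away.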
You may need to retune if you move the code to a different hardware environment - different number of processors, more memory, different disk can all have an effect.
I'm new to Python and making some headway with threading. I'm doing some music file conversion and want to be able to utilize the multiple cores on my machine (one active conversion thread per core).
import subprocess
import threading

class EncodeThread(threading.Thread):
    # this is hacked together a bit, but should give you an idea
    def run(self):
        decode = subprocess.Popen(["flac", "--decode", "--stdout", self.src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", self.dest],
                                  stdin=decode.stdout)
        encode.communicate()

# some other code puts these threads with various src/dest pairs in a list

for proc in threads:  # `threads` is my list of `threading.Thread` objects
    proc.start()
Everything works, all the files get encoded, bravo! ... however, all the processes spawn immediately, yet I only want to run two at a time (one for each core). As soon as one is finished, I want it to move on to the next in the list until the list is finished, then continue with the rest of the program.
How do I do this?
(I've looked at the thread pool and queue functions but I can't find a simple answer.)
Edit: maybe I should add that each of my threads is using subprocess.Popen to run a separate command line decoder (flac) piped to stdout which is fed into a command line encoder (lame/mp3).
If you want to limit the number of parallel threads, use a semaphore:
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

class EncodeThread(threading.Thread):
    def run(self):
        threadLimiter.acquire()
        try:
            <your code here>
        finally:
            threadLimiter.release()
Start all threads at once. All but maximumNumberOfThreads will wait in threadLimiter.acquire() and a waiting thread will only continue once another thread goes through threadLimiter.release().
"Each of my threads is using subprocess.Popen to run a separate command line [process]".
Why have a bunch of threads manage a bunch of processes? That's exactly what an OS does for you. Why micro-manage what the OS already manages?
Rather than fool around with threads overseeing processes, just fork off processes. Your process table probably can't handle 2000 processes, but it can handle a few dozen (maybe a few hundred) pretty easily.
You want to have more work queued up than your CPUs can possibly handle. The real question is one of memory -- not processes or threads. If the sum of all the active data for all the processes exceeds physical memory, then data has to be swapped, and that will slow you down.
If your processes have a fairly small memory footprint, you can have lots and lots running. If your processes have a large memory footprint, you can't have very many running.
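A rough sketch of that approach for the encoding job in this question (the file pairs and the cap are invented; poll() plus a short sleep replaces the per-thread communicate() calls):

import subprocess
import time

jobs = [('a.flac', 'a.mp3'), ('b.flac', 'b.mp3')]   # hypothetical src/dest pairs
MAX_RUNNING = 2                                     # e.g. one per core
running = []

while jobs or running:
    # reap finished pipelines
    still_running = []
    for decode, encode in running:
        if encode.poll() is None:
            still_running.append((decode, encode))
        else:
            decode.wait()                           # reap the decoder too
    running = still_running
    # top the pool back up
    while jobs and len(running) < MAX_RUNNING:
        src, dest = jobs.pop(0)
        decode = subprocess.Popen(['flac', '--decode', '--stdout', src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(['lame', '--quiet', '-', dest],
                                  stdin=decode.stdout)
        decode.stdout.close()                       # let lame own the pipe
        running.append((decode, encode))
    time.sleep(0.1)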
If you're using the default "CPython" implementation, then threads won't help you here, because only one thread can execute Python bytecode at a time; look up the Global Interpreter Lock. Instead, I'd suggest looking at the multiprocessing module in Python 2.6 -- it makes parallel programming a cinch. You can create a Pool object with 2 * num_cores processes and give it a bunch of tasks to do. It will execute up to 2 * num_cores tasks at a time, until all are done.
At work I have recently migrated a bunch of Python XML tools (a differ, xpath grepper, and bulk xslt transformer) to use this, and have had very nice results with two processes per processor.
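Applied to the encoding task in this question, a Pool version could look roughly like this (a sketch, not the poster's actual code; the file pairs are invented):

from multiprocessing import Pool
import subprocess

def encode_one(pair):
    src, dest = pair
    decode = subprocess.Popen(['flac', '--decode', '--stdout', src],
                              stdout=subprocess.PIPE)
    encode = subprocess.Popen(['lame', '--quiet', '-', dest],
                              stdin=decode.stdout)
    decode.stdout.close()            # let lame own the read end of the pipe
    encode.communicate()             # block until this file is done
    return dest

if __name__ == '__main__':
    pairs = [('a.flac', 'a.mp3'), ('b.flac', 'b.mp3')]  # hypothetical pairs
    pool = Pool(processes=4)         # e.g. 2 * number of cores, as above
    for finished in pool.imap_unordered(encode_one, pairs):
        print(finished)
    pool.close()
    pool.join()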
It looks to me that what you want is a pool of some sort, and in that pool you would like to have n threads, where n == the number of processors on your system. You would then have another thread whose only job is to feed jobs into a queue which the worker threads pick up and process as they become free (so for a dual-core machine, you'd have three threads, but the main thread would be doing very little).
As you are new to Python, though, I'll assume you don't know about the GIL and its side effects with regard to threading. If you read the article I linked, you will soon understand why traditional multithreading solutions are not always the best in the Python world. Instead, you should consider using the multiprocessing module (new in Python 2.6; in 2.5 you can use this backport) to achieve the same effect. It sidesteps the issue of the GIL by using multiple processes as if they were threads within the same application. There are some restrictions about how you share data (you are working in different memory spaces), but this is no bad thing: the restrictions just encourage good practice, such as minimising the contact points between threads (or processes, in this case).
In your case you are probably interested in using a pool, as specified here.
Short answer: don't use threads.
For a working example, you can look at something I've recently tossed together at work. It's a little wrapper around ssh which runs a configurable number of Popen() subprocesses. I've posted it at: Bitbucket: classh (Cluster Admin's ssh Wrapper).
As noted, I don't use threads; I just spawn off the children, loop over them calling their .poll() methods, check for timeouts (also configurable), and replenish the pool as I gather the results. I've played with different sleep() values, and in the past I wrote a version (before the subprocess module was added to Python) that used the signal module (SIGCHLD and SIGALRM) and the os.fork() and os.execve() functions, with my own pipe and file descriptor plumbing, etc.
In my case I'm incrementally printing results as I gather them ... and remembering all of them to summarize at the end (when all the jobs have completed or been killed for exceeding the timeout).
I ran that, as posted, on a list of 25,000 internal hosts (many of which are down, retired, located internationally, not accessible to my test account, etc.). It completed the job in just over two hours with no issues. (About 60 of them were timeouts due to systems in degenerate/thrashing states -- proving that my timeout handling works correctly.)
So I know this model works reliably. Running 100 concurrent ssh processes with this code doesn't seem to cause any noticeable impact (it's a moderately old FreeBSD box). I used to run the old (pre-subprocess) version with 100 concurrent processes on my old 512MB laptop without problems, too.
(BTW: I plan to clean this up and add features to it; feel free to contribute or to clone off your own branch of it; that's what Bitbucket.org is for).
I am not an expert on this, but I have read something about locks. This article might help you out.
Hope this helps
I would like to add something, just as a reference for others looking to do something similar but who might have coded things differently from the OP. This question was the first one I came across when searching, and the chosen answer pointed me in the right direction. Just trying to give something back.
import threading
import time

maximumNumberOfThreads = 2
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

def simulateThread(a, b):
    threadLimiter.acquire()
    try:
        # do some stuff
        c = a + b
        print('a + b = ', c)
        time.sleep(3)
    except NameError:  # or some other type of error
        print('some error')
    finally:
        # release exactly once, whether or not an error occurred;
        # releasing in both except and finally would over-release the
        # BoundedSemaphore and raise ValueError
        threadLimiter.release()

threads = []
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(len(sample)):
    thread = threading.Thread(target=simulateThread, args=(sample[i], 2))
    thread.daemon = True
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
This basically follows what you will find on this site:
https://www.kite.com/python/docs/threading.BoundedSemaphore