In Python, what is the cost of creating another process - is it sufficiently high that it's not worth it as a way of handling events?
Context of question: I'm using radio modules to transmit data from sensors to a raspberry pi. I have a python script running on the pi, catching the data and handling it - putting it in a MySQL database and occasionally triggering other things.
My dilemma is that if I handle everything in a single script there's a risk that some data packet might be ignored because the processing is taking too long to run. I could avoid this by spawning a separate process to handle each event and then exit - but if the cost of creating a process is high, it might be better to focus on writing more efficient code than on spawning processes.
Thoughts people?
Edit to add:
Sensors push data at intervals of 8 seconds and up
No buffering is easily available
If processing takes longer than the time until the next reading, that reading is ignored and lost. (The transmission system guarantees delivery - I need to guarantee the Pi is in a position to receive it.)
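(For what it's worth, the spawn cost is easy to measure empirically on the Pi itself; a minimal sketch, with the loop count chosen arbitrarily:)

import time
import multiprocessing

def noop():
    pass

if __name__ == '__main__':
    start = time.monotonic()
    for _ in range(100):
        p = multiprocessing.Process(target=noop)
        p.start()
        p.join()
    print('average spawn+join: %.1f ms' % ((time.monotonic() - start) * 10))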
I think you're trying to address two problems at the same time, and it is getting confusing.
Polling frequency: here the question is, how fast you need to poll data so that you don't risk losing some
Concurrency and i/o locking: what happens if processing takes longer than the frequency interval
The first problem depends entirely on your underlying architecture: are your sensors pushing data to your Raspberry Pi, or is the Pi polling them? Is any buffering involved? What happens if your polling frequency is faster than the rate of arrival of data?
My recommendation is to enforce the KISS principle and basically write two tools: one that is entirely in charge of storing data as fast as you need; the other that takes care of doing something with the data.
For example the storing could be done by a memcached instance, or even a simple shell pipe if you're at the prototyping level. The second utility that manipulates data then does not have to worry about polling frequency, I/O errors (what if the SQL database errors?), and so on.
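To make the split concrete, here is a minimal sketch of the two-tool idea using a plain shell pipe (read_packet and the message format are assumptions, not your actual radio code):

# collector.py - only job: get readings out as fast as possible
import json, time

def read_packet():
    time.sleep(1)                     # placeholder for the radio-module read
    return {'sensor': 1, 'value': 42}

while True:
    print(json.dumps(read_packet()), flush=True)

# processor.py - consumes at its own pace from stdin
import sys, json

for line in sys.stdin:
    reading = json.loads(line)
    # insert into MySQL, trigger other things, etc.

Run them as python collector.py | python processor.py; the OS pipe buffer gives you a small amount of "free" buffering between the two.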
As a bonus, de-coupling data retrieval and manipulation allows you to:
Test more easily (you can store some data as a sample, and then replay it to the manipulation routine to validate behaviour)
Isolate problems more easily
Scale much faster (you could have as many "manipulators" as you need)
The cost of spawning new threads depends on what you do with them.
In terms of memory, make sure your threads aren't each loading everything: threading shares memory across the whole application, so variables keep their scope.
In terms of processing, be sure you don't overload your system.
I'm doing something quite similar at work: I'm scanning a folder (where files are constantly being dropped), and I do stuff with every file.
I use my main thread to initialize the application and spawn the child threads.
One child thread is used for logging.
The other children do the actual work.
My main loop looks like this:

import os, threading, time

# spawn logging thread here
while True:
    for dirpath, dirnames, filenames in os.walk('/gw'):
        # throttle: wait while 200 threads are already running
        while threading.active_count() > 200:
            time.sleep(0.1)
        # spawn a new worker thread, sending it the filepath(s);
        # process_files is the worker function doing the actual work
        threading.Thread(target=process_files,
                         args=(dirpath, filenames)).start()
    time.sleep(1)
This basically means that my application won't use more than 201 threads (200 + main thread).
So then it was just a matter of playing with the application, using htop to monitor its resource consumption, and limiting the app to a sensible maximum number of threads.
Related
Problem
We run several calculations on geographical data from user input (called a "system"). Sometimes one system needs 10 locations to do calculations for, sometimes 1000+. One location takes approximately 1 second to calculate, hopefully we can speed this up in the future. We currently do this by using a multiprocessing Pool (from billiard) from within a Celery worker. This works in that it utilises all cores 100%, but there are two problems:
There are lingering connections (pipes, probably to the child procs) that cause the worker to hang when reaching the max open file limit (investigated, but haven't found a solution after more than a day of work)
We can't spread the calculations over multiple machines.
To solve these problems, I could run each calculation as a separate Celery task. However, we also want to schedule these calculations "fairly" for our users, so that:
Users working on small systems (say <50 locations) don't have to wait until a large system (>1000 locations) is finished. The larger the system, the less the increased waiting time matters to the user (they are doing something else anyway, and can get a notification). So this would be something akin to Weighted Fair Queueing.
I have not been able to find a distributed task runner that implements this possibility of prioritisation. Did I miss one? I looked at Celery, RQ, Huey, MRQ, Pulsar Queue and some more, as well as into data processing pipelines like Luigi and Pinball, but none seem to easily enable this.
Most of these suggest creating priority by adding more workers for higher-priority queues. However, that wouldn't work, as the workers would start fighting for CPU time. (RQ does it differently: it empties the entire first queue passed in before moving on to the next.)
Proposed architecture
What I imagine would work is running a multiprocessing program, with a process per CPU, that fetches, in a WFQ fashion, from multiple Redis lists, each being a certain queue.
Would this be the right approach? Of course there is quite some work to be done on making the queue configuration be dynamic (for example also storing it in Redis, and reloading it upon each couple of processed tasks), and getting event monitoring to be able to get insight.
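A rough sketch of that architecture (queue names, weights, and run_calculation are assumptions; production code would probably block on BLPOP rather than poll):

import multiprocessing
import time
import redis

QUEUES = [('small_systems', 4), ('large_systems', 1)]  # (redis list, weight)

def run_calculation(task):
    pass  # placeholder for the ~1 s per-location job

def worker():
    r = redis.Redis()
    while True:
        got_work = False
        for name, weight in QUEUES:
            for _ in range(weight):          # take up to `weight` per pass
                task = r.lpop(name)
                if task is None:
                    break
                got_work = True
                run_calculation(task)
        if not got_work:
            time.sleep(0.1)                  # all queues empty; back off

if __name__ == '__main__':
    for _ in range(multiprocessing.cpu_count()):
        multiprocessing.Process(target=worker).start()

Weighted round-robin like this approximates WFQ well when tasks are of similar size, which seems to hold here since each location takes about a second.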
Additional thoughts:
Each task needs around 3 MB of data from Postgres, which is the same for each location in the system (or at least per couple of hundred locations). With the current approach, this resides in shared memory, and each process can access it quickly. I'll probably have to set up a local Redis instance on each machine to cache this data, so that not every process fetches it over and over again (a cache-aside sketch follows this list).
I keep running into ZeroMQ, and it has a lot of enticing possibilities, but besides maybe the monitoring, it doesn't seem to be a good fit. Or am I wrong?
What would make more sense: running each worker as a separate program, and managing it with something like supervisor, or starting a single program, that forks a child for each CPU (no CPU count config necessary), and maybe also monitors its children for stuck processes?
We already run both RabbitMQ and Redis, so I could also use RMQ for the queues. It seems to me the only thing gained by using RMQ is the possibility of not losing tasks on worker crash by using acknowledgements, at the cost of using a more difficult library/complicated protocol.
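On the caching point above, a minimal cache-aside sketch, assuming a local Redis instance and a hypothetical fetch_from_postgres helper:

import pickle
import redis

cache = redis.Redis()                # local instance on each machine

def fetch_from_postgres(system_id):
    pass                             # placeholder for the real query

def get_system_data(system_id):
    key = 'systemdata:%s' % system_id
    blob = cache.get(key)
    if blob is not None:
        return pickle.loads(blob)
    data = fetch_from_postgres(system_id)        # the ~3 MB payload
    cache.setex(key, 3600, pickle.dumps(data))   # keep for an hour
    return data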
Any other advice?
My wx GUI shows thumbnails, but they're slow to generate, so:
The program should remain usable while the thumbnails are generating.
Switching to a new folder should stop generating thumbnails for the old folder.
If possible, thumbnail generation should make use of multiple processors.
What is the best way to do this?
Putting the thumbnail generation in a background thread with threading.Thread will solve your first problem, making the program usable.
If you want a way to interrupt it, the usual way is to add a "stop" variable which the background thread checks every so often (e.g., once per thumbnail), and the GUI thread sets when it wants to stop it. Ideally you should protect this with a threading.Condition. (The condition isn't actually necessary in most cases—the same GIL that prevents your code from parallelizing well also protects you from certain kinds of race conditions. But you shouldn't rely on that.)
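A minimal sketch of that stop flag, using threading.Event (a simpler primitive than the Condition mentioned above; make_thumbnail is a placeholder):

import threading

stop_requested = threading.Event()

def make_thumbnail(path):
    pass                              # placeholder for the real work

def generate_thumbnails(paths):
    for path in paths:
        if stop_requested.is_set():   # checked once per thumbnail
            return
        make_thumbnail(path)

worker = threading.Thread(target=generate_thumbnails,
                          args=(['a.png', 'b.png'],))
worker.start()
# In the GUI thread, when the user switches folders:
#   stop_requested.set(); worker.join(); stop_requested.clear()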
For the third problem, the first question is: Is thumbnail generation actually CPU-bound? If you're spending more time reading and writing images from disk, it probably isn't, so there's no point trying to parallelize it. But, let's assume that it is.
First, if you have N cores, you want a pool of N threads, or N-1 if the main thread has a lot of work to do too, or maybe something like 2N or 2N-1 to trade off a bit of best-case performance for a bit of worst-case performance.
However, if that CPU work is done in Python, or in a C extension that nevertheless holds the Python GIL, this won't help, because most of the time, only one of those threads will actually be running.
One solution to this is to switch from threads to processes, ideally using the standard multiprocessing module. It has built-in APIs to create a pool of processes, and to submit jobs to the pool with simple load-balancing.
The problem with using processes is that you no longer get automatic sharing of data, so that "stop flag" won't work. You need to explicitly create a flag in shared memory, or use a pipe or some other mechanism for communication instead. The multiprocessing docs explain the various ways to do this.
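For example, a sketch of a shared stop flag with a process pool (the initializer trick is needed because worker processes don't inherit ordinary globals on all platforms; make_thumbnail is again a placeholder):

import multiprocessing

def init_worker(flag):
    global stop_flag
    stop_flag = flag                  # make the shared flag visible here

def make_thumbnail(path):
    if stop_flag.value:               # set from the GUI process to cancel
        return
    # ... actual thumbnail work ...

if __name__ == '__main__':
    flag = multiprocessing.Value('b', 0)
    pool = multiprocessing.Pool(initializer=init_worker, initargs=(flag,))
    pool.map_async(make_thumbnail, ['a.png', 'b.png'])
    # to cancel the remaining jobs: flag.value = 1
    pool.close()
    pool.join()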
You can actually just kill the subprocesses. However, you may not want to. First, unless you've written your code carefully, it may leave your thumbnail cache in an inconsistent state that will confuse the rest of your code. Also, creating subprocesses takes some time, especially on Windows (not "30 minutes" or anything, but enough to affect the perceived responsiveness of your code if you recreate the pool every time a user clicks a new folder), so you probably want to create the pool before you need it and keep it alive for the entire life of the program.
Other than that, all you have to get right is the job size. Hopefully creating one thumbnail isn't too big of a job—but if it's too small of a job, you can batch multiple thumbnails up into a single job—or, more simply, look at the multiprocessing API and change the way it batches jobs when load-balancing.
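With a pool like the one sketched above, batching is just the chunksize argument; something like:

# each worker grabs 16 thumbnails per round trip instead of one
for result in pool.imap_unordered(make_thumbnail, paths, chunksize=16):
    pass  # update the GUI as results arrive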
Meanwhile, if you go with a pool solution (whether threads or processes), if your jobs are small enough, you may not really need to cancel. Just drain the job queue—each worker will finish whichever job it's working on now, but then sleep until you feed in more jobs. Remember to also drain the queue (and then maybe join the pool) when it's time to quit.
One last thing to keep in mind is that if you successfully generate thumbnails as fast as your computer is capable of generating them, you may actually cause the whole computer—and therefore your GUI—to become sluggish and unresponsive. This usually comes up when your code is actually I/O bound and you're using most of the disk bandwidth, or when you use lots of memory and trigger swap thrash, but if your code really is CPU-bound, and you're having problems because you're using all the CPU, you may want to either use 1 fewer core, or look into setting thread/process priorities.
I will have 4 hardware data acquisition units connected to a single control PC over a hard-wired Ethernet LAN. The coding for this application will reside on the PC and is entirely Python-based. Each data acquisition unit is identically configured and will be polled from the PC in identical fashion. The test boxes they are connected to provide the variable output we seek to do our testing.
These tests are long-term (8-16 months or better), with relatively low data acquisition rates (less than 500 samples per minute, likely closer to 200). The general process flow is simple, too. I'll loop over each data acquisition device and:
Read the data off the device;
Do some calc on the data;
If the calcs say one thing, turn on a heater;
If they say anything else, do nothing;
Write the data to a file on disk for subsequent processing
I'll wait some amount of time, then repeat the process all over again. Here are my questions:
I plan to use a while True: loop to start the execution of the sequence I outlined above, and to allow the loop to be exited via exceptions, but I'd welcome any advice on the specific exceptions I should check for -- or even, is this the best approach to take at all? Another approach might be this: once inside the while loop, I could use a try: - except: - finally: construct to exit the loop.
The process I've outlined above is for the main data acquisition stuff, but given the length of the collection period, I need to be able to do other things as well: check that the hardware units are running OK, take test stands on- and offline as required, etc. These 'management' functions are distinct from the main loop, so I'd like to keep them distinct. Should I set this activity up in separate threads within the same script, or are there better approaches?
Thanks in advance, folks. All feedback is welcome!
I'm thinking it would be good for you to use a client-server model.
It would be nicely separated, so one script would not affect the other - status checking / data collecting.
Basically, you would run a server for data collection on the main machine, with some terminal input for maintenance (logging, graceful exit, etc.). The data-collecting PCs would act as clients with a while True loop (which can run indefinitely unless killed), and each data-collecting PC would also run a server/client (depends on the point of view) for status checks, sending data to the MAIN PC, where you would decide what to do.
Also, if you use Unix/Linux, or maybe even Windows, you can do status checks by just SSHing to the machine (manually or via a script from the main machine)... depends on your specific needs...
Enjoy
You may need more than one loop. If the instruments are TCP servers, you may want to catch a 'disconnected' exception in an inside loop and try to reconnect, rather than terminating the instrument thread permanently.
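A sketch of that two-loop shape (connect, read, and process are stand-ins for your actual instrument code):

import time

def instrument_loop(address):
    while True:                       # outer loop: (re)connect
        try:
            conn = connect(address)   # placeholder for your socket setup
            while True:               # inner loop: normal polling
                process(conn.read())  # placeholder for calc/heater/logging
        except ConnectionError:
            time.sleep(5)             # back off, then try to reconnect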
Not sure about Python. On C++, C#, Delphi, I would probably generate the wait by waiting on a producer-consumer queue with a timeout. If nothing gets posted, the sequence you outlined would be repeated as you wish. If some of that other, occasional, stuff needs to happen, you can queue up a message that instructs the thread to issue the necessary commands to the instruments, or disconnect and set an internal 'don't poll, wait until instructed to reconnect and poll again' flag, or whatever needs to be done.
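In Python the same pattern falls out of queue.Queue: block on get with a timeout, poll on timeout, act on a command otherwise (poll_instruments and handle_command are made-up names):

import queue

commands = queue.Queue()

def acquisition_loop(poll_interval=10):
    while True:
        try:
            cmd = commands.get(timeout=poll_interval)
        except queue.Empty:
            poll_instruments()        # normal cycle: read, calc, heater, log
        else:
            handle_command(cmd)       # occasional management work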
This sort of approach is going to be cleaner than stopping the thread and connecting from some other thread just to do the occasional stuff. Stopping/terminating/recreating threads is just best avoided in any language.
I have a 3-step producer/consumer setup.
Client creates JSON-encoded dictionaries and sends them to PipeServer via a named pipe
Here are my threading.Thread subclasses:
PipeServer creates a named pipe and places incoming messages into a queue, unprocessed_messages
Processor gets items from unprocessed_messages, processes them (via a lambda function argument), and puts them into a queue, processed_messages
Printer gets items from processed_messages, acquires a lock, prints the message, and releases the lock.
In the test script, I have one PipeServer, one Processor, and 4 Printers:
import json
import queue
import threading

import pipetools      # user modules containing PipeServer, Processor, Printer
import threadedtools

# queue and lock setup (not shown in the original snippet)
unprocessed_messages = queue.Queue()
processed_messages = queue.Queue()
output_lock = threading.Lock()

pipe_name = '\\\\.\\pipe\\testpipe'
pipe_server = pipetools.PipeServer(pipe_name, unprocessed_messages)

json_loader = lambda x: json.loads(x.decode('utf-8'))
processor = threadedtools.Processor(unprocessed_messages,
                                    processed_messages,
                                    json_loader)

print_servers = []
for i in range(4):
    print_servers.append(threadedtools.Printer(processed_messages,
                                               output_lock,
                                               'PRINTER {0}'.format(i)))

pipe_server.start()
processor.start()
for print_server in print_servers:
    print_server.start()
Question: in this kind of multi-step setup, how do I think through optimizing the number of Printer vs. Processor threads I should have? For example, how do I know if 4 is the optimal number of Printer threads to have? Should I have more processors?
I read through the Python Profilers docs, but didn't see anything that would help me think through these kinds of tradeoffs.
Generally speaking, you want to optimize for the maximum throughput of your slowest component. In this case, it sounds like either Client or Printer. If it's the Client, you want just enough Printers and Processors to be able to keep up with new messages (maybe that's just one!). Otherwise you'll be wasting resources on threads you don't need.
If it's Printers, then you need to optimize for the IO that's occurring. A few variables to take into account:
How many locks can you have simultaneously?
Do you have to maintain the lock for the length of a printing transaction?
How long does a printing operation take?
If you can only have one lock, then you should only have one thread, so on and so forth.
You then want to test with real world operation (it's difficult to predict what combination of RAM, disk and network activity will slow you down). Instrument your code so you can see how many threads are idle at any given time. Then create a test case that processes data into the system at maximum throughput. Start with an arbitrary number of threads for each component. If Client, Processor, or Printer threads are always busy, add more threads. If some threads are always idle, take some away.
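One crude but effective way to instrument that idleness (a hypothetical wrapper around the blocking get; the shared dict updates aren't atomic, but that's fine for ballpark numbers):

import time
from collections import defaultdict

wait_time = defaultdict(float)        # seconds spent idle, per thread name

def timed_get(q, name):
    start = time.monotonic()
    item = q.get()
    wait_time[name] += time.monotonic() - start
    return item

If a thread's accumulated wait time dominates its lifetime, you have too many threads of that kind; if it's near zero, add more.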
You may need to retune if you move the code to a different hardware environment - different number of processors, more memory, different disk can all have an effect.
My program is ICAPServer (similar to an HTTP server); its main job is to receive data from clients and save the data to a DB.
There are two main steps and two threads:
ICAPServer receives data from clients and puts it in a queue (50 KB, <1 ms);
another thread pops data from the queue and writes it to the DB. So if the second step is too slow, the queue will fill up memory with that data.
Wondering if anyone has any suggestions...
It is hard to say for sure, but perhaps using two processes instead of threads will help in this situation. Since Python has the Global Interpreter Lock (GIL), it has the effect of only allowing any one thread to execute Python instructions at any time.
Having a system designed around processes might have the following advantages:
Higher concurrency, especially on multiprocessor machines
Greater throughput, since you can probably spawn multiple queue consumers / DB writer processes to spread out the work. Although, the impact of this might be minimal if it is really the DB that is the bottleneck and not the process writing to the DB.
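A minimal shape for that multi-consumer design (write_to_db is a placeholder, and the writer count is arbitrary):

import multiprocessing

def write_to_db(item):
    pass                       # placeholder for the real DB insert

def db_writer(q):
    while True:
        item = q.get()
        if item is None:       # sentinel tells the writer to exit
            return
        write_to_db(item)

if __name__ == '__main__':
    q = multiprocessing.Queue(maxsize=10000)   # bounded, see below
    writers = [multiprocessing.Process(target=db_writer, args=(q,))
               for _ in range(4)]
    for w in writers:
        w.start()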
One note: before going for optimizations, it is very important to get some good measurement, and profiling.
That said, I would bet the slow part in the second step is database communication; you could try to analyze the SQL statements and their execution plans, and then optimize them (logging the generated SQL is one of the features of SQLAlchemy); if it is still too slow, look into database optimizations.
Of course, it is possible the bottleneck would be in a completely different place; in this case, you still have chances to optimize using C code, dedicated network, or more threads - just to give three possible example of completely different kind of optimizations.
Another point: as I/O operations usually release the GIL, you could also try to improve performance just by adding another reader thread - and I think this could be a much cheaper solution.
Put an upper limit on the amount of data in the queue?
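In the standard library that's a one-line change, if blocking the receiving thread when the writer falls behind is acceptable:

import queue

q = queue.Queue(maxsize=1000)   # put() blocks once 1000 items are waiting
# or use q.put_nowait(item) to fail fast and shed load instead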