I'm pretty new to multiprocessing in Python and I've done a lot of digging around, but can't seem to find exactly what I'm looking for. I have a bit of a consumer/producer problem where I have a simple server with an endpoint that consumes from a queue and a function that produces onto the queue. The queue can be full, so the producer doesn't always need to be running.
While the queue isn't full, I want the producer task to run, but I don't want it to block the server from receiving or servicing requests. I tried using multithreading, but the producing work is very slow and the GIL slows it down too much. I want the server to be running all the time, and whenever the queue is no longer full (something has been consumed), I want to kick off this producer task as a separate process and let it run until the queue is full again. What is the best way to share the queue so that the producer process can access the queue used by the main process?
What is the best way to share the queue so that the producer process can access the queue used by the main process?
If this is the important part of your question (which seems like it's actually several questions), then multiprocessing.Queue seems to be exactly what you need. I've used this in several projects to have multiple processes feed a queue for consumption by a separate process, so if that's what you're looking for, this should work.
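For what it's worth, a minimal sketch of that setup could look like the following; the names (produce, expensive_work) and the maxsize of 10 are made up, and the blocking put() on a bounded multiprocessing.Queue stands in for "run until the queue is full again":

import multiprocessing as mp

def expensive_work():
    # placeholder for the slow producing step
    return "item"

def produce(queue):
    # runs in a separate process; put() blocks once the queue is full,
    # so the producer naturally pauses until something has been consumed
    while True:
        queue.put(expensive_work())

if __name__ == "__main__":
    queue = mp.Queue(maxsize=10)                 # shared, bounded queue
    mp.Process(target=produce, args=(queue,), daemon=True).start()
    # in the server's endpoint: item = queue.get()

Passing the queue as an argument to Process is all the sharing you need; multiprocessing handles the underlying pipe and locking for you.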
Related
There is a large number of field devices (100,000, each with its own IP address) from which I have to collect data.
I want to do this with a Python-based scheduler combined with a readily available executable written in C/C++, which handles the communication with and readout of the devices. The idea is to talk to up to ~100 devices in parallel, so the first 100 devices could be read out by starting the executable via subprocess. I don't want to wait for all 100 tasks to complete, because some will take longer than others. Instead, I want to start the next readout immediately after any one task has finished, and so on. So, conducted by a simple "dispatcher", tasks are started continuously over time.
Question: Which Python API is the best I can use for this purpose?
I considered using the concurrent.futures API: start a ThreadPoolExecutor and submit task after task, each starting the executable in a separate thread. A ProcessPoolExecutor wouldn't be an advantage, because the executable is started as a separate process anyway...
But I think this is not intended to be used in such a way, because each submitted job will be remembered and therefore "kind of stored" in the executor forever; when a job is finished it ends up in the "finished" state and is still visible, so I would clutter up my executor with finished tasks. So I guess the Executor API is more usable when there is a fixed number of tasks to be worked through, like in
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
and not for permanently submitting tasks.
The other idea would be to start 100 worker threads in parallel, each running an endless loop and reading its next task from a Queue object. In that case I can dispatch on my own which worker gets the next task. I know this would work, because I have already implemented it. But I have the feeling that there must be a more elegant solution in Python for dispatching tasks.
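For illustration, a sketch of the concurrent.futures variant described above might look like this (the ./readout executable path and the read_device function are placeholders); keeping at most max_parallel futures alive and dropping references to finished ones avoids the "finished tasks pile up" concern:

import subprocess
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def read_device(ip):
    # each task simply blocks on the external readout process
    return subprocess.run(["./readout", ip], capture_output=True)

def dispatch(device_ips, max_parallel=100):
    pending = set()
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        for ip in device_ips:
            pending.add(pool.submit(read_device, ip))
            if len(pending) >= max_parallel:
                # as soon as any readout finishes, the loop submits the next one
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                # finished futures in `done` go out of scope and are collected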
Perhaps I'm being silly asking this question, but I need to wrap my head around the basic concepts before I do any further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it's being processed.
But what I'm trying to understand is why I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can make to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers, read up on auto-acks and message rejections, but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
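To illustrate the point outside of Celery, here is a rough sketch using the pika client directly (the "feeds" queue name and process_feed are made up): with manual acknowledgements, RabbitMQ delivers each message to exactly one consumer and only redelivers it if that consumer dies or rejects it.

import pika

def process_feed(url):
    # placeholder for the actual feed processing
    print("processing", url)

def handle(ch, method, properties, body):
    process_feed(body.decode())
    ch.basic_ack(delivery_tag=method.delivery_tag)     # ack only after success

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)                    # one unacked message per worker
channel.basic_consume(queue="feeds", on_message_callback=handle)
channel.start_consuming()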
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time.
e.g.:

@task(...)
def my_task(x):
    ...

my_task.apply_async(args=(1,))   # or my_task.delay(1); .apply() would only run it locally
The .apply_async (or .delay) call publishes a message to the message broker you are using (RabbitMQ, Redis, ...).
Then the message gets routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two messages like that (my_task.apply_async(args=(1,))) from running at the same time; this is something you need to ensure within the task itself.
You need something which all workers can access (Memcached, Redis, ...), of course, as they might be running on different machines.
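As a sketch (not the cookbook recipe verbatim, and using Redis rather than Memcached), a cross-worker lock inside the task body could look like this; the key prefix, the 600-second expiry and fetch_and_parse are all made up:

import redis

r = redis.Redis()                          # must be reachable from every worker

def fetch_and_parse(url):
    # placeholder for the real work
    pass

def process_feed(url):                     # imagine this as the body of the Celery task
    lock_key = "lock:" + url
    # the set succeeds only if the key does not exist yet (nx=True);
    # the expiry is a safety net in case a worker dies while holding the lock
    if not r.set(lock_key, "1", nx=True, ex=600):
        return                             # another worker is already handling this feed
    try:
        fetch_and_parse(url)
    finally:
        r.delete(lock_key)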
The mentioned example is typically used for a different goal: it prevents you from working with different messages that have the same meaning (not the same message). E.g., I have two processes: the first one puts some URLs onto a queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL onto the queue twice (or even more times)?
P.S. For this purpose I use Redis and its SETNX operation (which sets a key only if it does not already exist).
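A rough sketch of that idea on the producer side, assuming a "seen:" key prefix and a one-hour window (both made up), so a given URL is only ever published once:

import redis

r = redis.Redis()

def publish(url):
    # placeholder: e.g. my_task.apply_async(args=(url,)) from the answer above
    pass

def enqueue_once(url):
    # setnx sets the key only if it does not exist, and reports whether it did
    if r.setnx("seen:" + url, 1):
        r.expire("seen:" + url, 3600)      # optional: forget the URL after an hour
        publish(url)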
I need to write a very specific data processing daemon.
Here is how I thought it could work with multiprocessing:
Process #1: One process to fetch some vital metadata; it can be fetched every second, but that data must be available to Process #2. Process #1 writes the data, and Process #2 reads it.
Process #2: Two processes which fetch the real data based on what was received by Process #1. The fetched data is stored in a (big) queue to be processed "later".
Process #3: Two (or more) processes which poll the queue filled by Process #2 and process that data. Once done, the results are put into a new queue to be used by Process #4.
Process #4: Two processes which read the queue filled by Process(es) #3 and send the results back over HTTP.
The idea behind all these different processes is to specialize them as much as possible and to make them as independent as possible.
All those processes will be wrapped in a main daemon, which is implemented here:
http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/
I am wondering whether what I have imagined is reasonable/stupid/overkill/etc., especially since I would run daemonic multiprocessing.Process(es) within a main parent process that is itself daemonized.
Furthermore, I am a bit concerned about potential locking problems. In theory, the processes that read and write data use different variables/structures, which should avoid a few problems, but I am still concerned.
Maybe using multiprocessing for my context is not the right thing to do. I would love to get your feedback about this.
Notes:
I cannot use Redis as a data structure server.
I thought about using ZeroMQ for IPC, but I would rather avoid yet another library if multiprocessing can do the job just as well.
Thanks in advance for your feedback.
Generally, your division into different workers with different tasks, as well as your plan to let them communicate, already looks good. However, one thing you should be aware of is whether each processing step is I/O-bound or CPU-bound. If you are I/O-bound, I'd go for the threading module whenever you can: the memory footprint of your application will be smaller, and communication between threads can be more efficient, since shared memory is allowed. Only if you need additional CPU power should you go for multiprocessing. In your system you can use both (it looks like Process #3 and beyond will do some heavy computing, while the other workers will be predominantly I/O-bound).
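To make the shape of such a pipeline concrete, here is a rough multiprocessing sketch; the stage functions (fetch_data, transform, send_over_http), worker counts and queue sizes are placeholders for the steps described in the question, and the queues take care of their own locking, which addresses most of the locking worry:

import multiprocessing as mp

def fetch_data():
    return {"raw": 1}                      # placeholder for stage 2's real fetch

def transform(item):
    return item                            # placeholder for stage 3's processing

def send_over_http(result):
    pass                                   # placeholder for stage 4's HTTP call

def fetcher(raw_q):
    while True:
        raw_q.put(fetch_data())            # stage 2: fetch and push onto the big queue

def processor(raw_q, out_q):
    while True:
        out_q.put(transform(raw_q.get()))  # stage 3: poll, process, push the result

def sender(out_q):
    while True:
        send_over_http(out_q.get())        # stage 4: read the result and send it

if __name__ == "__main__":
    raw_q, out_q = mp.Queue(maxsize=10000), mp.Queue()
    workers = ([mp.Process(target=fetcher, args=(raw_q,)) for _ in range(2)]
               + [mp.Process(target=processor, args=(raw_q, out_q)) for _ in range(2)]
               + [mp.Process(target=sender, args=(out_q,)) for _ in range(2)])
    for w in workers:
        w.start()
    for w in workers:
        w.join()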
I need to send information to every thread that's running in my program, and every thread has to process that information.
I can't do it using a regular queue, because once one thread removes the data from the queue, all the other threads can no longer see it.
What's the best way of achieving this?
One way is to have a separate queue for each thread; the function that broadcasts the information is then responsible for inserting the message into every thread's queue.
This is similar to the way message queues work in Windows, for example. Each thread that does GUI operations has an associated message queue, independent from that of any other thread.
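A minimal sketch of that approach (the worker, broadcast and message names are illustrative):

import queue
import threading

def worker(my_q):
    while True:
        msg = my_q.get()                   # each thread reads only its own queue
        print(threading.current_thread().name, "got", msg)

def broadcast(queues, msg):
    for q in queues:                       # one copy of the message per thread
        q.put(msg)

queues = [queue.Queue() for _ in range(4)]
for q in queues:
    threading.Thread(target=worker, args=(q,), daemon=True).start()

broadcast(queues, "reload config")         # every thread sees this message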
I have never written any code that uses threads.
I have a web application that accepts a POST request, and creates an image based on the data in the body of the request.
Would I want to spin off a thread for the image creation, so as to prevent the server from hanging until the image is created? Is this an appropriate use, or merely a solution looking for a problem?
Please correct any misunderstandings I may have.
Rather than thinking about handling this via threads or even processes, consider using a distributed task manager such as Celery to manage this sort of thing.
The usual approach to handling HTTP requests synchronously is to spawn a new thread for each request as soon as it comes in (or to reuse one from a pool).
However, Python threads are not very good for this, due to the GIL and the fact that some I/O and other calls can block the whole app, including other threads.
You should look into the multiprocessing module for this use case. Spawn some worker processes, and then pass requests to them to process.
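For example (a sketch only; render_image, handle_post and the framework glue are placeholders), the request handler can hand the CPU-heavy work to a small pool of worker processes and stay responsive:

import multiprocessing as mp

def render_image(post_body):
    # the CPU-bound work runs here, in its own process, outside the server's GIL
    return b"...png bytes..."

def handle_post(pool, request_body):
    # apply_async returns immediately, so the handler is not blocked by the rendering
    async_result = pool.apply_async(render_image, (request_body,))
    return async_result.get(timeout=30)    # or poll/defer, depending on the framework

if __name__ == "__main__":
    pool = mp.Pool(processes=4)            # create once at server startup
    print(handle_post(pool, b"fake request body"))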