I have a single process A in the first pool, and several processes B1..Bk in the second pool, and I would like to put items into the queue in A and consume them in B1..Bk.
My first attempt was to just create a multiprocessing.Queue and pass it to all those processes. However, this gave me the error
RuntimeError: Queue objects should only be shared between processes through inheritance
I found advice that suggests using multiprocessing.Manager().Queue() instead. But when I do this and try to read from the queue in Bi, I get the error
TypeError: 'AutoProxy[Queue]' object is not iterable
So what is the correct way to do this?
OK, this was just silliness on my side; I misunderstood what a Queue is!
Most importantly, a queue is not iterable, so one cannot do "for el in somequeue".
(My mistake was to think that the queue proxy is not iterable because it is a proxy. However, the proxy works fine in place of the actual queue as long as put/get are used.)
Also, a (FIFO) queue cannot be closed and does not have a natural "end", which I found annoying: it means one has to send around special "end of queue" entries, but not too many of them, so as not to unintentionally block the queue.
So the bottom line is: to share the queue, I create a multiprocessing.Manager().Queue() and pass that around; then I use put/get to write/read the queue in the different processes, and I send a special entry to the readers to indicate end of job.
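A minimal sketch of this pattern, with one producer A and K consumers (the names and numbers are made up for the example):

import multiprocessing

END_OF_QUEUE = None   # special "end of job" entry
K = 3                 # number of consumers B1..Bk

def producer(queue):
    for item in range(10):
        queue.put(item)
    for _ in range(K):            # one end-of-job indicator per consumer
        queue.put(END_OF_QUEUE)

def consumer(queue):
    while True:                   # a queue is not iterable: loop and get()
        item = queue.get()
        if item is END_OF_QUEUE:
            break
        print('consumed', item)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    queue = manager.Queue()       # the proxy works fine with put/get
    a = multiprocessing.Process(target=producer, args=(queue,))
    bs = [multiprocessing.Process(target=consumer, args=(queue,)) for _ in range(K)]
    a.start()
    for b in bs:
        b.start()
    a.join()
    for b in bs:
        b.join()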
That a queue cannot be closed, giving the consumers an "end of queue" condition, is really annoying though, especially in error situations: if a queue is consumed by k consumers, the writer has to know k and send k end-of-job indicators, and all k consumers have to be well behaved enough to retrieve those and shut down. If there is any error, none of this can be guaranteed any more, and e.g. a consumer may lock up or time out waiting for an end-of-job indicator that will never arrive.
Related
I am quite experienced in single-threaded Python as well as embarrassingly parallel multiprocessing, but this is the first time I attempt processing something with a producer thread and a consumer thread via a shared queue.
The producer thread is going to download data items from URLs and put them in a queue. Simultaneously, a consumer thread is going to process the data items as they arrive on the queue.
Eventually, there will be no more data items to download and the program should terminate. I wish for the consumer thread to be able to distinguish whether it should keep waiting at an empty queue, because more items may be coming in, or it should terminate, because the producer thread is done.
I am considering signaling the latter situation by placing a special object on the queue in the producer thread when there are no more data items to download. When the consumer thread sees this object, it then stops waiting at the queue and terminates.
Is this a sensible approach?
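A minimal sketch of the approach I am considering (the download step is faked; SENTINEL is the special object):

import threading
import queue

SENTINEL = object()   # special object meaning "producer is done"

def download(url):                # stand-in for the real download
    return 'data from ' + url

def producer(q, urls):
    for url in urls:
        q.put(download(url))
    q.put(SENTINEL)               # no more data items will arrive

def consumer(q):
    while True:
        item = q.get()            # blocks while the queue is empty
        if item is SENTINEL:
            break                 # producer is done: terminate
        print('processing', item)

q = queue.Queue()
urls = ['http://example.com/a', 'http://example.com/b']
t_prod = threading.Thread(target=producer, args=(q, urls))
t_cons = threading.Thread(target=consumer, args=(q,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()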
I use a list of processes, each with its own queue. Another thread fills these queues one after the other, and the processes fetch the data from them. The problem is that after a while the queues raise an Empty exception from within the processes, but the thread gets a Full exception. When I check the queue sizes, they are consistent with the exceptions.
To make it worse, this behavior can only be reproduced as part of a large code base; I can't generate a small program that reproduces it.
Has anyone had similar issues with multiprocessing queues not being consistent across different processes?
Edit
To add more to the description of the pipeline: I have multiple worker objects; each worker has an input queue (multiprocessing.Queue), a worker queue (multiprocessing.Queue), an output queue (threading.Queue), a worker process (multiprocessing.Process) and a manager thread (threading.Thread).
Serving all these workers, I have a single feeder thread (threading.Thread) that adds sample identifiers to the input queues of all workers, one by one. The sample identifiers are very small in size (paths of files), so the feeder thread can keep up with the processes.
The worker gets the sample identifiers from the input queue, reads these samples, processes them and puts them into the worker queue one by one. The manager thread reads the data in the worker queues and puts it into the output queue, because multiprocessing.Queue is slower on read.
All .get() and .put() calls have timeouts and I keep track of time it takes to get new data from this pipeline. I also have mechanisms for closing it and reopening it, by joining all processes and threads (even for queues) and then recreating all of them from scratch. When everything is working, the main process goes over the workers and reads the data off of their output queue one by one. It also takes a few ms to read new data most of the time.
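To make the structure concrete, here is a stripped-down sketch of a single worker unit (the function names are made up, and the Empty/Full handling and shutdown logic of the real code are omitted):

import json
import multiprocessing
import threading
import queue

def read_and_process(sample_id):          # placeholder for the real work
    with open(sample_id) as f:
        return json.loads(f.read())

def worker_loop(input_queue, worker_queue):
    # body of the worker process (multiprocessing.Process)
    while True:
        sample_id = input_queue.get(timeout=10)    # a file path
        worker_queue.put(read_and_process(sample_id), timeout=10)

def manager_loop(worker_queue, output_queue):
    # body of the manager thread (threading.Thread): drains the
    # slower-to-read multiprocessing.Queue into a threading queue
    while True:
        output_queue.put(worker_queue.get(timeout=10))

if __name__ == '__main__':
    input_queue = multiprocessing.Queue()
    worker_queue = multiprocessing.Queue()
    output_queue = queue.Queue()
    multiprocessing.Process(target=worker_loop,
                            args=(input_queue, worker_queue), daemon=True).start()
    threading.Thread(target=manager_loop,
                     args=(worker_queue, output_queue), daemon=True).start()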
This whole pipeline exists twice in my code (used for machine learning with TensorFlow). One instance is used for training and is created close to the beginning of the program; the other is used for testing. The second instance is created after a while of training; it goes over all of my dataset and then resets. When the second instance is run for the second time, it gets stuck after 1000 samples or so. When it is stuck and I break in debug mode in the main process, I see that the input queue is full and the worker and output queues are empty. When I then break inside one of the worker processes, I see that its input queue is empty. It seems like for some reason the worker process sees a different input queue than it should. Note that this is not some race issue, because this result is stable.
Edit 2
I zeroed in on the point where the program hangs. It seems to be the json.loads() call on the data read from a file. This means the problem is different from what I originally described. The processes hang and don't see an empty queue.
code for opening the file:
with open(json_path, 'r') as f:
    data = f.read()
json_data = json.loads(data)  # <== program hangs at this line
I tried using signal.alarm to pinpoint where in json.loads() the program hangs, but it doesn't raise the exception. The problem is reproduced with a single multiprocessing.Process as well, but not when all the processing is done in the main process.
Does this ring a bell for anyone?
I have read lots about Python threading and the various means to 'talk' across thread boundaries. My case seems a little different, so I would like to get advice on the best option:
Instead of having many identical worker threads waiting for items in a shared queue, I have a handful of mostly autonomous, non-daemonic threads with unique identifiers going about their business. These threads do not block and normally do not care about each other. They sleep most of the time and wake up periodically. Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action - meaningful to the receiving thread.

There are many different combinations of actions and recipients, so using Events for every combination seems unwieldy. The queue object seems to be the recommended way to achieve this. However, if I have a shared queue and post an item on it that has just one recipient thread, then every other thread needs to monitor the queue, pull every item, check whether it is addressed to it, and put it back in the queue if it was addressed to another thread. That seems like a lot of getting and putting items for nothing.

Alternatively, I could employ a 'router' thread: one shared-by-all queue, plus one queue for every 'normal' thread, shared with the router thread. Normal threads only ever put items in the shared queue; the router pulls every item, inspects it and puts it on the addressee's queue. Still, a lot of putting and getting items from queues...
Are there any other ways to achieve what I need to do? It seems a pub-sub class is the right approach, but there is no such thread-safe module in standard Python, at least to my knowledge.
Many thanks for your suggestions.
Instead of having many identical worker threads waiting for items in a shared queue
I think this is the right approach. Just remove identical and shared from the above statement, i.e.
having many worker threads waiting for items in queues
So I would suggest using Celery for this approach.
Occasionally, based on certain conditions, one thread needs to 'tell'
another thread to do something specific - an action, meaningful to the receiving thread.
This can be done by calling another Celery task from within the calling task. All the tasks can have separate queues.
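A rough sketch of what I mean; the broker URL, task names and queue names are placeholders:

from celery import Celery

app = Celery('actions', broker='redis://localhost:6379/0')   # placeholder broker

@app.task
def thread_b_action(payload):
    # the specific action meaningful to the receiving side
    print('B acting on', payload)

@app.task
def thread_a_work():
    # ... A's own periodic business ...
    # occasionally tell B to do something, via B's own queue
    thread_b_action.apply_async(args=['wake up'], queue='thread_b')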
Thanks for the response. After some thought, I have decided to use the approach of many queues and a router thread (hub-and-spoke). Every 'normal' thread has its private queue to the router, enabling separate send and receive queues or 'channels'. The router's queue is shared by all threads (as a property) and is used by 'normal' threads as a send-only channel, i.e. they only post items to this queue, and only the router listens to it, i.e. pulls items. Additionally, each 'normal' thread uses its own queue as a receive-only channel, on which it listens and which is shared only with the router. Threads register themselves with the router on the router queue/channel; the router maintains a list of registered threads, including their queues, so it can send an item to a specific thread after its registration.
This means that peer to peer communication is not possible, all communication is sent via the router.
There are several reasons I did it this way:
1. There is no logic in the threads for checking whether an item is addressed to 'me', which makes the code simpler, and there is no constant pulling, checking and re-putting of items on one shared queue. Threads only listen on their own queue; when a message arrives, the thread can be sure that the message is addressed to it. This includes the router itself.
2. The router can act as a message bus, do vocabulary translation, and has the possibility to address messages to external programs or hosts.
3. Threads don't need to know anything about other threads' capabilities, i.e. they just speak the language of the router. In a peer-to-peer world, all peers must be able to understand each other, and since my threads are of many different classes, I would have to teach each class all the other classes' vocabulary.
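A stripped-down sketch of this hub-and-spoke wiring (the message format and names are my own invention; shutdown and vocabulary translation are omitted):

import threading
import time
import queue

router_queue = queue.Queue()              # shared send-only channel

def router():
    registered = {}                       # thread name -> private queue
    while True:
        msg = router_queue.get()
        if msg['action'] == 'register':
            registered[msg['sender']] = msg['queue']
        else:
            registered[msg['to']].put(msg)   # route to the addressee

def normal_thread(name):
    my_queue = queue.Queue()              # private receive-only channel
    router_queue.put({'action': 'register', 'sender': name, 'queue': my_queue})
    while True:
        msg = my_queue.get()              # guaranteed to be addressed to me
        print(name, 'received', msg['action'], 'from', msg['sender'])

threading.Thread(target=router, daemon=True).start()
for name in ('worker-1', 'worker-2'):
    threading.Thread(target=normal_thread, args=(name,), daemon=True).start()

time.sleep(0.1)                           # demo only: let everyone register
router_queue.put({'action': 'ping', 'sender': 'worker-1', 'to': 'worker-2'})
time.sleep(0.1)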
Hope this helps someone some day when faced with a similar challenge.
Perhaps I'm being silly asking the question, but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it is being processed.
But what I'm trying to understand is why I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can make to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers (read up on auto-acks and message rejections), but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
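For illustration, a minimal consumer sketch with manual acks, assuming the pika client (the queue name and processing function are placeholders):

import pika

def process_feed(url):                    # stand-in for the real processing
    print('processing', url)

def on_message(channel, method, properties, body):
    try:
        process_feed(body.decode())
        channel.basic_ack(delivery_tag=method.delivery_tag)    # done: remove it
    except Exception:
        # worker failed: put the message back for another consumer
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='feeds', durable=True)
channel.basic_qos(prefetch_count=1)       # at most one unacked message per worker
channel.basic_consume(queue='feeds', on_message_callback=on_message)
channel.start_consuming()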
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time.
E.g.:
@task
def my_task(x):
    ...

my_task.delay(1)
The .delay call (a shortcut for .apply_async) publishes a message to the message broker you are using (rabbit, redis...).
The message will then get routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two messages like that (my_task.delay(1)) from running at the same time; this is something you need to ensure within the task itself.
You need something which you can access from all workers, of course (memcached, redis...), as they might be running on different machines.
The mentioned example is typically used for a different goal: it prevents you from working with different messages that have the same meaning (not the same message). E.g., I have two processes: the first one puts some URLs into a queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL into the queue twice (or even more times)?
P.S. I use Redis storage and its setnx operation (which can set a key only once) for this purpose.
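A small sketch of that setnx-style guard, assuming the redis-py client (the key prefix and expiry are arbitrary):

import redis

r = redis.Redis()

def enqueue_once(queue_put, url):
    # SET ... NX succeeds only if the key does not exist yet, so the
    # same URL gets enqueued at most once per hour
    if r.set('seen:' + url, 1, nx=True, ex=3600):
        queue_put(url)        # e.g. my_task.delay(url) from the answer above
    # else: this URL was already enqueued; skip it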
I have a queue, e.g.:
online_queue = self._channel.queue_declare(
    durable=True,
    queue='online'
)
At the moment, I need to flush all the content of this queue.
But at the same moment, another process may be publishing to this queue.
If I use channel.queue_purge(queue='online'), what will happen to messages published while queue_purge is still working?
Depending on your ultimate goal, you might be able to solve this issue by using a temporary queue.
To make things clearer, let's give things some names. Call your current queue (the one you want to purge) Queue A, and assume it is 1-1 bound to Exchange A.
If you create a new queue (Queue B) and bind it to Exchange A in the same way that Queue A is bound, Queue B will now get all of the messages (from the time of binding) that Queue A gets.
You can now safely purge Queue A without losing any of the messages that were sent in after Queue B was bound.
Re-bind Queue A to Exchange A and you are back up and running.
You can then deal with the "interim" messages in Queue B however you might need to.
This has the advantage of having very well-defined behavior, and it doesn't get you into any race conditions, because you can completely blow Queue A away and re-create it instead of purging.
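In rough pika terms (the exchange, queue and routing-key names are made up):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# bind Queue B exactly the way Queue A is bound, so it starts
# receiving everything Queue A receives
channel.queue_declare(queue='queue_b', durable=True)
channel.queue_bind(queue='queue_b', exchange='exchange_a', routing_key='key_a')

# Queue A can now be blown away and re-created without losing
# anything published after the binding above
channel.queue_delete(queue='queue_a')
channel.queue_declare(queue='queue_a', durable=True)
channel.queue_bind(queue='queue_a', exchange='exchange_a', routing_key='key_a')

# deal with the "interim" messages in Queue B, then drop it
channel.queue_unbind(queue='queue_b', exchange='exchange_a', routing_key='key_a')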
You're describing a race condition. Some might remain in the queue and some others might get purged. Or all of them will get purged. Or none of them will get purged.
There's just no way to tell, because it's a time-dependent situation. You should re-examine your need to purge a queue which is still active, or build a more robust consumer that can live with the fact that there might be messages in the queue it is connecting to (which is basically what consumers have to live with, anyway).