I am new to working with message exchange and met problem finding proper manual for the task.
I need to organize pool of queues so, that:
Producer create some random empty queue and write there all the pack of messages (100 messages usually).
Consumer find non-empty and non-locked queue and read from it till
it's empty and then delete it and look for next one.
So my task is to work with messages as packs, I understand how to produce and consume using same key in one queue, but can't find how to work with the pool of queues.
We can have several producers and consumers run in parallel, but there is no matter which of them send to whom. We don't need and ever can't link particular producers with particular consumer.
General task: we have lot of clients to receive push-notifications, we group pushes by some parameters to process later as group, so such group should be in one queue in RabbitMQ to be produced and consumed as a group, but each group is independent from other groups.
Big thanks to Hannu for the help: key idea of his easy and robust solution that we can have one persistant queue with known name where producer will write names of created queues and consumer will read these names from there.
To make his solution more readable and easy work with in future in my personal task, I have divided publish_data() in producer into two function - one make random queue and write it to control_queue another receive this random_queue and fill it with messages. Similar idea is good for consumer - one function to process queue, another will be called for process message itself.
I have done something like this but with Pika. I had to clean and kombufy an old code snippet for the examples. It is probably not very kombuish (this is my absolutely first code snippet written using it) but this is how I would solve it. Basically I would set up a control queue with a known name.
Publishers will create a random queue name for a pack of messages, dump N messages to it (in my case numbers 1-42) and then post the queue name to the control queue. A consumer then receives this queue name, binds to it, reads messages until queue is empty and then deletes the queue.
This keeps things relatively simple, as publishers do not need to figure out where they are allowed to publish their groups of data (every queue is new with a random name). Receivers do not need to worry about timeouts or "all done" -messages, as a receiver would receive a queue name only when a group of data has been written to the queue and every message is there waiting.
There is also no need to tinker with locks or signalling or anything else that would complicate things. You can have as many consumers and producers as you want. And of course using exchanges and routing keys there could be different sets of consumers for different tasks etc.
Publisher
from kombu import Connection
import uuid
from time import sleep
def publish_data(conn):
random_name= "q" + str(uuid.uuid4()).replace("-", "")
random_queue = conn.SimpleQueue(random_name)
for i in xrange(0, 42):
random_queue.put(i)
random_queue.close()
return random_name
with Connection('amqp://guest:guest#localhost:5672//') as conn:
control_queue = conn.SimpleQueue('control_queue')
_a = 0
while True:
y_name = publish_data(conn)
message = y_name
control_queue.put(message)
print('Sent: {0}'.format(message))
_a += 1
sleep(0.3)
if _a > 20:
break
control_queue.close()
Consumer
from Queue import Empty
from kombu import Connection, Queue
def process_msg(foo):
print str(foo)
with Connection("amqp://guest:guest#localhost:5672//") as _conn:
sub_queue = _conn.SimpleQueue(str(foo))
while True:
try:
_msg = sub_queue.get(block=False)
print _msg.payload
_msg.ack()
except Empty:
break
sub_queue.close()
chan = _conn.channel()
dq = Queue(name=str(foo), exchange="")
bdq = dq(chan)
bdq.delete()
with Connection('amqp://guest:guest#localhost:5672//') as conn:
rec = conn.SimpleQueue('control_queue')
while True:
msg = rec.get(block=True)
entry = msg.payload
msg.ack()
process_msg(entry)
Related
I have an issue thinking of an architecture that'll solve the following problem:
I have a web application (producer) that receives some data on request. I also have a number of processes (consumers) that should process this data. 1 request generates 1 batch of data and should be processes by only 1 consumer.
My current solution consists of receiving the data, cache-ing it in memory with Redis, sending a message through a message channel that data has been written while the consumers are listening on the same channel, and then the data is processed by the consumers. The issue here is that I need to stop multiple consumers from working on the same data. So how can I inform the other consumers that I have started working on this task?
Producer code (flask endpoint):
data = request.get_json()
db = redis.Redis(connection_pool=pool)
db.set(data["externalId"], data)
# Subscribe to the batches channel and publish the id
db.pubsub()
db.publish('batches', request_key)
results = None
result_key = str(data["externalId"])
# Wait till the batch is processed
while results is None:
results = db.get(result_key)
if results is not None:
results = results.decode('utf8')
db.delete(data["externalId"])
db.delete(result_key)
Consumer:
db = redis.Redis(connection_pool = pool)
channel = db.pubsub()
channel.subscribe('batches')
while True:
try:
message = channel.get_message()
message_data = bytes(message['data']).decode('utf8')
external_id = message_data.split('-')[-1]
data = json.loads(db.get(external_id).decode('utf8'))
result = DataProcessor.process(data)
db.set(str(external_id), result)
except Exception:
pass
PUBSUB is often problematic for task queuing for exactly this reason. From the docs (https://redis.io/topics/pubsub):
SUBSCRIBE, UNSUBSCRIBE and PUBLISH implement the Publish/Subscribe messaging paradigm where (citing Wikipedia) senders (publishers) are not programmed to send their messages to specific receivers (subscribers). Rather, published messages are characterized into channels, without knowledge of what (if any) subscribers there may be.
A popular alternative to consider would be to implement "publish" by pushing an element to the end of a Redis list, and "subscribe" by having your worker poll that list at some interval (exponential backoff is often an appropriate choice). In order to avoid cases where multiple workers get the same job, use lpop to get and remove an element from the list. Redis is single-threaded, so you're guaranteed only one worker will receive each element.
So, on the publish side, aim for something like this:
db = redis.Redis(connection_pool=pool)
db.rpush("my_queue", task_payload)
And on the subscribe side, you can safely run a loop like this in parallel as many times as you need:
while True:
db = redis.Redis(connection_pool=pool)
payload = db.lpop("my_queue")
if not payload:
continue
< deserialize and process payload here >
Note this is a last-in-first-out queue (LIFO) since we're pushing onto the right side with rpush and popping off the left with lpop. You can implement the FIFO version trivially by combining lpush/lpop.
Application represents stream splitter - producer process is receiving data stream and multiple consumer processes (client connections) send (pass) received data to connected client.
I've found this sample code for Condition Variable (it's using multithreading but it should work for multiprocessing also) and refactored it so it doesn't pop item in consumer process. That's how I expected that other consumer process will be able to reuse it and resend same data to connected clients. Once all consumer processes finish sending I'd remove item[0] from buffer array. But this is not working since processes are not executing in predictable order.
1. Receive new data - Producer process
2. Send received data - Consumer process [1]
3. Send received data - Consumer process [2]
...
n. Send received data - Consumer process [n]
Loop everything.
Usually happens that producer process removes item[0] before all Consumer processes get to retrieve item[0] and send it.
I guess one possible solution would be to use separate Queue() for each consumer process and in producer process to populate those separate queues.
Is it possible to use Event() to notify consumer process that new data arrived and then pass that data independently from other consumer processes without queue?
If using queue is best solution is it possible to use only one queue and keep new data until all consumer processes finish sending it?
I'm open to any suggestions since I'm not sure what's is the best way to do this.
import threading
import time
# A list of items that are being produced. Note: it is actually
# more efficient to use a collections.deque() object for this.
items = []
# A condition variable for items
items_cv = threading.Condition()
# A producer thread
def producer():
print "I'm the producer"
for i in range(30):
with items_cv: # Always must acquire the lock first
items.append(i) # Add an item to the list
items_cv.notify() # Send a notification signal
time.sleep(1)
items.pop(0) # Pop item remove it from the "buffer"
# A consumer thread
def consumer():
print "I'm a consumer", threading.currentThread().name
while True:
with items_cv: # Must always acquire the lock
while not items: # Check if there are any items
items_cv.wait() # If not, we have to sleep
# x = items.pop(0) # Pop an item off
x = items[0]
print threading.currentThread().name,"got", x
time.sleep(5)
# Launch a bunch of consumers
cons = [threading.Thread(target=consumer)
for i in range(10)]
for c in cons:
c.setDaemon(True)
c.start()
# Run the producer
producer()
Easiest way to solve this problem is to have Queue per each client. In my acceptor (listener) function I have this peace of code which creates buffer/queue for each incoming connection.
buffer = multiprocessing.Queue()
self.client_buffers.append(buffer)
process = multiprocessing.Process(name=procces_name,
target=self.stream_handler,
args(conn, address, buffer))
process.daemon = True
process.start()
In the main (producer) thread each queue is populated as soon as new data arrives.
while True:
data = sock.recv(2048)
if not data: break
for buffer in self.client_buffers:
buffer.put(data)
And each consumer process sends data independently
if not buffer.empty():
connection.sendall(buffer.get())
I would like to define a pool of n workers and have each execute tasks held in a rabbitmq queue. When this task finished (fails or succeeds) I want the worker execute another task from the queue.
I can see in docs how to spawn a pool of workers and have them all wait for their siblings to complete. I would something like different though: I would like to have a buffer of n tasks where when one worker finishes it adds another tasks to the buffer (so no more than n tasks are in the bugger). Im having difficulty searching for this in docs.
For context, my non-multithreading code is this:
while True:
message = get_frame_from_queue() # get message from rabbit mq
do_task(message.body) #body defines urls to download file
acknowledge_complete(message) # tell rabbitmq the message is acknowledged
At this stage my "multithreading" implementation will look like this:
#recieves('ask_for_a_job')
def get_a_task():
# this function is executed when `ask_for_a_job` signal is fired
message = get_frame_from_queue()
do_task(message)
def do_tasks(task_info):
try:
# do stuff
finally:
# once the "worker" has finished start another.
fire_fignal('ask_for_a_job')
# start the "workers"
for i in range(5):
fire_fignal('ask_for_a_job')
I don't want to reinvent the wheel. Is there a more built in way to achieve this?
Note get_frame_from_queue is not thread safe.
You should be able to have each subprocess/thread consume directly from the queue, and then within each thread, simply process from the queue exactly as you would synchronously.
from threading import Thread
def do_task(msg):
# Do stuff here
def consume():
while True:
message = get_frame_from_queue()
do_task(message.body)
acknowledge_complete(message)
if __name __ == "__main__":
threads = []
for i in range(5):
t = Thread(target=consume)
t.start()
threads.append(t)
This way, you'll always have N messages from the queue being processed simultaneously, without any need for signaling to occur between threads.
The only "gotcha" here is the thread-safety of the rabbitmq library you're using. Depending on how it's implemented, you may need a separate connection per thread, or possibly one connection with a channel per thread, etc.
One solution is to leverage the multiprocessing.Pool object. Use an outer loop to get N items from RabbitMQ. Feed the items to the Pool, waiting until the entire batch is done. Then loop through the batch, acknowledging each message. Lastly continue the outer loop.
source
import multiprocessing
def worker(word):
return bool(word=='whiskey')
messages = ['syrup', 'whiskey', 'bitters']
BATCHSIZE = 2
pool = multiprocessing.Pool(BATCHSIZE)
while messages:
# take first few messages, one per worker
batch,messages = messages[:BATCHSIZE],messages[BATCHSIZE:]
print 'BATCH:',
for res in pool.imap_unordered(worker, batch):
print res,
print
# TODO: acknowledge msgs in 'batch'
output
BATCH: False True
BATCH: False
I am using an external service (Service) to process some particular type of objects. The Service works faster if I send objects in batches of 10. My current architecture is as follows. A producer broadcasts objects one-by-one, and a bunch of consumers pull them (one-by-one) from a queue and send them to The Service. This is obviously suboptimal.
I don't want to modify producer code as it can be used in different cases. I can modify consumer code but only with the cost of additional complexity. I'm also aware of the prefetch_count option but I think it only works on the network level -- the client library (pika) does not allow fetching multiple messages at once in the consumer callback.
So, can RabbitMQ create batches of messages before sending them to consumers? I'm looking for an option like "consume n messages at a time".
You cannot batch messages in the consumer callback, but you could use a thread safe library and use multiple threads to consume data. The advantage here is that you can fetch five messages on five different threads and combine the data if needed.
As an example you can take a look on how I would implement this using my AMQP library.
https://github.com/eandersson/amqpstorm/blob/master/examples/scalable_consumer.py
The below code will make use of channel.consume to start consuming messages. We break out/stop when the desired number of messages is reached.
I have set a batch_size to prevent pulling of huge number of messages at once. You can always change the batch_size to fit your needs.
def consume_messages(queue_name: str):
msgs = list([])
batch_size = 500
q = channel.queue_declare(queue_name, durable=True, exclusive=False, auto_delete=False)
q_length = q.method.message_count
if not q_length:
return msgs
msgs_limit = batch_size if q_length > batch_size else q_length
try:
# Get messages and break out
for method_frame, properties, body in channel.consume(queue_name):
# Append the message
try:
msgs.append(json.loads(bytes.decode(body)))
except:
logger.info(f"Rabbit Consumer : Received message in wrong format {str(body)}")
# Acknowledge the message
channel.basic_ack(method_frame.delivery_tag)
# Escape out of the loop when desired msgs are fetched
if method_frame.delivery_tag == msgs_limit:
# Cancel the consumer and return any pending messages
requeued_messages = channel.cancel()
print('Requeued %i messages' % requeued_messages)
break
except (ChannelWrongStateError, StreamLostError, AMQPConnectionError) as e:
logger.info(f'Connection Interrupted: {str(e)}')
finally:
# Close the channel and the connection
channel.stop_consuming()
channel.close()
return msgs
What's the best way to wait (without spinning) until something is available in either one of two (multiprocessing) Queues, where both reside on the same system?
Actually you can use multiprocessing.Queue objects in select.select. i.e.
que = multiprocessing.Queue()
(input,[],[]) = select.select([que._reader],[],[])
would select que only if it is ready to be read from.
No documentation about it though. I was reading the source code of the multiprocessing.queue library (at linux it's usually sth like /usr/lib/python2.6/multiprocessing/queue.py) to find it out.
With Queue.Queue I didn't have found any smart way to do this (and I would really love to).
It doesn't look like there's an official way to handle this yet. Or at least, not based on this:
http://bugs.python.org/issue3831
You could try something like what this post is doing -- accessing the underlying pipe filehandles:
http://haltcondition.net/?p=2319
and then use select.
Not sure how well the select on a multiprocessing queue works on windows. As select on windows listens for sockets and not file handles, I suspect there could be problems.
My answer is to make a thread to listen to each queue in a blocking fashion, and to put the results all into a single queue listened to by the main thread, essentially multiplexing the individual queues into a single one.
My code for doing this is:
"""
Allow multiple queues to be waited upon.
queue,value = multiq.select(list_of_queues)
"""
import queue
import threading
class queue_reader(threading.Thread):
def __init__(self,inq,sharedq):
threading.Thread.__init__(self)
self.inq = inq
self.sharedq = sharedq
def run(self):
while True:
data = self.inq.get()
print ("thread reads data=",data)
result = (self.inq,data)
self.sharedq.put(result)
class multi_queue(queue.Queue):
def __init__(self,list_of_queues):
queue.Queue.__init__(self)
for q in list_of_queues:
qr = queue_reader(q,self)
qr.start()
def select(list_of_queues):
outq = queue.Queue()
for q in list_of_queues:
qr = queue_reader(q,outq)
qr.start()
return outq.get()
The following test routine shows how to use it:
import multiq
import queue
q1 = queue.Queue()
q2 = queue.Queue()
q3 = multiq.multi_queue([q1,q2])
q1.put(1)
q2.put(2)
q1.put(3)
q1.put(4)
res=0
while not res==4:
while not q3.empty():
res = q3.get()[1]
print ("returning result =",res)
Hope this helps.
Tony Wallace
Seems like using threads which forward incoming items to a single Queue which you then wait on is a practical choice when using multiprocessing in a platform independent manner.
Avoiding the threads requires either handling low-level pipes/FDs which is both platform specific and not easy to handle consistently with the higher-level API.
Or you would need Queues with the ability to set callbacks which i think are the proper higher level interface to go for. I.e. you would write something like:
singlequeue = Queue()
incoming_queue1.setcallback(singlequeue.put)
incoming_queue2.setcallback(singlequeue.put)
...
singlequeue.get()
Maybe the multiprocessing package could grow this API but it's not there yet. The concept works well with py.execnet which uses the term "channel" instead of "queues", see here http://tinyurl.com/nmtr4w
As of Python 3.3 you can use multiprocessing.connection.wait to wait on multiple Queue._reader objects at once.
You could use something like the Observer pattern, wherein Queue subscribers are notified of state changes.
In this case, you could have your worker thread designated as a listener on each queue, and whenever it receives a ready signal, it can work on the new item, otherwise sleep.
New version of above code...
Not sure how well the select on a multiprocessing queue works on windows. As select on windows listens for sockets and not file handles, I suspect there could be problems.
My answer is to make a thread to listen to each queue in a blocking fashion, and to put the results all into a single queue listened to by the main thread, essentially multiplexing the individual queues into a single one.
My code for doing this is:
"""
Allow multiple queues to be waited upon.
An EndOfQueueMarker marks a queue as
"all data sent on this queue".
When this marker has been accessed on
all input threads, this marker is returned
by the multi_queue.
"""
import queue
import threading
class EndOfQueueMarker:
def __str___(self):
return "End of data marker"
pass
class queue_reader(threading.Thread):
def __init__(self,inq,sharedq):
threading.Thread.__init__(self)
self.inq = inq
self.sharedq = sharedq
def run(self):
q_run = True
while q_run:
data = self.inq.get()
result = (self.inq,data)
self.sharedq.put(result)
if data is EndOfQueueMarker:
q_run = False
class multi_queue(queue.Queue):
def __init__(self,list_of_queues):
queue.Queue.__init__(self)
self.qList = list_of_queues
self.qrList = []
for q in list_of_queues:
qr = queue_reader(q,self)
qr.start()
self.qrList.append(qr)
def get(self,blocking=True,timeout=None):
res = []
while len(res)==0:
if len(self.qList)==0:
res = (self,EndOfQueueMarker)
else:
res = queue.Queue.get(self,blocking,timeout)
if res[1] is EndOfQueueMarker:
self.qList.remove(res[0])
res = []
return res
def join(self):
for qr in self.qrList:
qr.join()
def select(list_of_queues):
outq = queue.Queue()
for q in list_of_queues:
qr = queue_reader(q,outq)
qr.start()
return outq.get()
The follow code is my test routine to show how it works:
import multiq
import queue
q1 = queue.Queue()
q2 = queue.Queue()
q3 = multiq.multi_queue([q1,q2])
q1.put(1)
q2.put(2)
q1.put(3)
q1.put(4)
q1.put(multiq.EndOfQueueMarker)
q2.put(multiq.EndOfQueueMarker)
res=0
have_data = True
while have_data:
res = q3.get()[1]
print ("returning result =",res)
have_data = not(res==multiq.EndOfQueueMarker)
The one situation where I'm usually tempted to multiplex multiple queues is when each queue corresponds to a different type of message that requires a different handler. You can't just pull from one queue because if it isn't the type of message you want, you need to put it back.
However, in this case, each handler is essentially a separate consumer, which makes it an a multi-producer, multi-consumer problem. Fortunately, even in this case you still don't need to block on multiple queues. You can create different thread/process for each handler, with each handler having its own queue. Basically, you can just break it into multiple instances of a multi-producer, single-consumer problem.
The only situation I can think of where you would have to wait on multiple queues is if you were forced to put multiple handlers in the same thread/process. In that case, I would restructure it by creating a queue for my main thread, spawning a thread for each handler, and have the handlers communicate with the main thread using the main queue. Each handler could then have a separate queue for its unique type of message.
Don't do it.
Put a header on the messages and send them to a common queue. This simplifies the code and will be cleaner overall.