The application is a stream splitter: a producer process receives a data stream, and multiple consumer processes (one per client connection) pass the received data on to their connected clients.
I found this sample code for a condition variable (it uses multithreading, but it should work for multiprocessing as well) and refactored it so that the consumer does not pop the item. I expected that the other consumers would then be able to reuse the item and resend the same data to their own clients. Once all consumers had finished sending, I would remove item[0] from the buffer. This does not work, because the processes do not execute in a predictable order. The order I expected is:
1. Receive new data - Producer process
2. Send received data - Consumer process [1]
3. Send received data - Consumer process [2]
...
n. Send received data - Consumer process [n]
Loop everything.
What usually happens is that the producer removes item[0] before all consumer processes have retrieved and sent it.
I guess one possible solution would be to use a separate Queue() for each consumer process and have the producer populate all of those queues.
Is it possible to use an Event() to notify the consumer processes that new data has arrived, and then pass that data to each of them independently, without a queue?
If a queue is the best solution, is it possible to use only one queue and keep new data in it until all consumer processes have finished sending it?
I'm open to any suggestions, since I'm not sure what the best way to do this is.
import threading
import time

# A list of items that are being produced. Note: it is actually
# more efficient to use a collections.deque() object for this.
items = []

# A condition variable for items
items_cv = threading.Condition()

# A producer thread
def producer():
    print("I'm the producer")
    for i in range(30):
        with items_cv:            # Always must acquire the lock first
            items.append(i)       # Add an item to the list
            items_cv.notify()     # Send a notification signal
        time.sleep(1)
        items.pop(0)              # Pop the item to remove it from the "buffer"

# A consumer thread
def consumer():
    print("I'm a consumer", threading.currentThread().name)
    while True:
        with items_cv:            # Must always acquire the lock
            while not items:      # Check if there are any items
                items_cv.wait()   # If not, we have to sleep
            # x = items.pop(0)    # Pop an item off
            x = items[0]
        print(threading.currentThread().name, "got", x)
        time.sleep(5)

# Launch a bunch of consumers
cons = [threading.Thread(target=consumer)
        for i in range(10)]
for c in cons:
    c.setDaemon(True)
    c.start()

# Run the producer
producer()
The easiest way to solve this problem is to have one Queue per client. In my acceptor (listener) function I have this piece of code, which creates a buffer/queue for each incoming connection.
buffer = multiprocessing.Queue()
self.client_buffers.append(buffer)
process = multiprocessing.Process(name=procces_name,
                                  target=self.stream_handler,
                                  args=(conn, address, buffer))
process.daemon = True
process.start()
In the main (producer) thread, each queue is populated as soon as new data arrives:
while True:
    data = sock.recv(2048)
    if not data: break
    for buffer in self.client_buffers:
        buffer.put(data)
And each consumer process sends the data independently:
if not buffer.empty():
    connection.sendall(buffer.get())
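The per-client queue idea can be tried out in isolation. The following is a minimal sketch, not the answer's actual code: it uses threads and `queue.Queue` instead of processes and `multiprocessing.Queue` (the `put`/`get` calls look the same), and the names `fan_out`, `consumer` and the `None` end-of-stream sentinel are assumptions made for illustration. A blocking `get()` is used so that the consumer does not have to poll `empty()`.

```python
import queue
import threading


def consumer(buffer, sent):
    """Forward every chunk from this consumer's own queue until None arrives."""
    while True:
        data = buffer.get()      # blocks until the producer puts something
        if data is None:         # assumed sentinel: the stream has ended
            break
        sent.append(data)        # stand-in for connection.sendall(data)


def fan_out(chunks, n_consumers):
    """Give each consumer a private queue and copy every chunk into all of them."""
    buffers = [queue.Queue() for _ in range(n_consumers)]
    outputs = [[] for _ in range(n_consumers)]
    threads = [threading.Thread(target=consumer, args=(b, out))
               for b, out in zip(buffers, outputs)]
    for t in threads:
        t.start()
    for chunk in chunks:         # producer loop
        for b in buffers:
            b.put(chunk)
    for b in buffers:
        b.put(None)              # tell each consumer to stop
    for t in threads:
        t.join()
    return outputs


results = fan_out([b"a", b"b", b"c"], n_consumers=3)
print(results)
```

Because every consumer owns its queue, a slow client only delays itself, and the producer never has to decide when an item is safe to remove.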
I'm having trouble coming up with an architecture that solves the following problem:
I have a web application (producer) that receives some data on each request. I also have a number of processes (consumers) that should process this data. One request generates one batch of data, which should be processed by only one consumer.
My current solution consists of receiving the data, caching it in memory with Redis, and publishing a message on a channel to say that the data has been written. The consumers listen on the same channel and then process the data. The issue is that I need to stop multiple consumers from working on the same data. So how can I inform the other consumers that one of them has started working on a task?
Producer code (flask endpoint):
data = request.get_json()
db = redis.Redis(connection_pool=pool)
db.set(data["externalId"], data)
# Subscribe to the batches channel and publish the id
db.pubsub()
db.publish('batches', request_key)
results = None
result_key = str(data["externalId"])
# Wait till the batch is processed
while results is None:
    results = db.get(result_key)
    if results is not None:
        results = results.decode('utf8')
db.delete(data["externalId"])
db.delete(result_key)
Consumer:
db = redis.Redis(connection_pool = pool)
channel = db.pubsub()
channel.subscribe('batches')
while True:
    try:
        message = channel.get_message()
        message_data = bytes(message['data']).decode('utf8')
        external_id = message_data.split('-')[-1]
        data = json.loads(db.get(external_id).decode('utf8'))
        result = DataProcessor.process(data)
        db.set(str(external_id), result)
    except Exception:
        pass
PUBSUB is often problematic for task queuing for exactly this reason. From the docs (https://redis.io/topics/pubsub):
SUBSCRIBE, UNSUBSCRIBE and PUBLISH implement the Publish/Subscribe messaging paradigm where (citing Wikipedia) senders (publishers) are not programmed to send their messages to specific receivers (subscribers). Rather, published messages are characterized into channels, without knowledge of what (if any) subscribers there may be.
A popular alternative to consider would be to implement "publish" by pushing an element to the end of a Redis list, and "subscribe" by having your worker poll that list at some interval (exponential backoff is often an appropriate choice). In order to avoid cases where multiple workers get the same job, use lpop to get and remove an element from the list. Redis is single-threaded, so you're guaranteed only one worker will receive each element.
So, on the publish side, aim for something like this:
db = redis.Redis(connection_pool=pool)
db.rpush("my_queue", task_payload)
And on the subscribe side, you can safely run a loop like this in parallel as many times as you need:
while True:
    db = redis.Redis(connection_pool=pool)
    payload = db.lpop("my_queue")
    if not payload:
        # queue is empty: sleep or back off here before polling again
        continue
    # deserialize and process payload here
Note this is a first-in-first-out (FIFO) queue, since we're pushing onto the right side with rpush and popping off the left with lpop. You can implement the LIFO version trivially by combining lpush/lpop.
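To make the polling loop concrete, here is a minimal sketch with exponential backoff. It runs without a Redis server because `FakeRedis` is a hypothetical in-memory stand-in that implements only `rpush` and `lpop` with the same call shape as the redis-py client; `drain`, `max_idle_polls` and the delay values are likewise assumptions for illustration. A real worker would loop forever instead of giving up after a few empty polls.

```python
import time


class FakeRedis:
    """In-memory stand-in exposing only rpush/lpop, so the sketch runs offline."""

    def __init__(self):
        self._lists = {}

    def rpush(self, name, *values):
        self._lists.setdefault(name, []).extend(values)
        return len(self._lists[name])

    def lpop(self, name):
        items = self._lists.get(name)
        return items.pop(0) if items else None


def drain(db, queue_name, handle, max_idle_polls=3, base_delay=0.01, max_delay=0.1):
    """Pop jobs until the queue stays empty for max_idle_polls consecutive polls."""
    delay, idle = base_delay, 0
    while idle < max_idle_polls:
        payload = db.lpop(queue_name)
        if payload is None:
            idle += 1
            time.sleep(delay)
            delay = min(delay * 2, max_delay)   # exponential backoff
            continue
        idle, delay = 0, base_delay             # reset after real work
        handle(payload)


db = FakeRedis()
for job in ("job-1", "job-2", "job-3"):
    db.rpush("my_queue", job)

processed = []
drain(db, "my_queue", processed.append)
print(processed)
```

With a real server, the same `drain` function should work when given a `redis.Redis` instance, with payloads arriving as bytes by default. redis-py also provides `blpop`, which blocks on the server side and may remove the need for a polling interval altogether.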
I am new to working with message exchange and am having trouble finding a proper manual for this task.
I need to organize a pool of queues so that:
The producer creates a random empty queue and writes the whole pack of messages to it (usually 100 messages).
A consumer finds a non-empty, non-locked queue, reads from it until it is empty, then deletes it and looks for the next one.
So my task is to work with messages as packs. I understand how to produce and consume using the same key in one queue, but I can't find how to work with a pool of queues.
We can have several producers and consumers running in parallel, but it does not matter which of them sends to which. We don't need to, and in fact can't, link particular producers to particular consumers.
General task: we have a lot of clients that receive push notifications. We group the pushes by some parameters in order to process them later as a group, so each group should be in one RabbitMQ queue and be produced and consumed as a group, but each group is independent of the others.
Big thanks to Hannu for the help. The key idea of his easy and robust solution is that we can have one persistent queue with a known name, to which the producer writes the names of the queues it creates and from which the consumer reads those names.
To make his solution more readable and easier to work with in the future for my own task, I have split publish_data() in the producer into two functions: one creates a random queue and writes its name to control_queue, and the other receives this random queue and fills it with messages. A similar split works for the consumer: one function processes a queue, and another is called to process each message.
I have done something like this but with Pika. I had to clean and kombufy an old code snippet for the examples. It is probably not very kombuish (this is my absolutely first code snippet written using it) but this is how I would solve it. Basically I would set up a control queue with a known name.
Publishers create a random queue name for a pack of messages, dump N messages into it (in my case the numbers 0 to 41) and then post the queue name to the control queue. A consumer then receives this queue name, binds to it, reads messages until the queue is empty and then deletes the queue.
This keeps things relatively simple, as publishers do not need to figure out where they are allowed to publish their groups of data (every queue is new, with a random name). Receivers do not need to worry about timeouts or "all done" messages, as a receiver gets a queue name only once a whole group of data has been written to the queue and every message is there waiting.
There is also no need to tinker with locks or signalling or anything else that would complicate things. You can have as many consumers and producers as you want. And of course using exchanges and routing keys there could be different sets of consumers for different tasks etc.
Publisher
from kombu import Connection
import uuid
from time import sleep

def publish_data(conn):
    # Create a uniquely named queue and fill it with one pack of messages
    random_name = "q" + str(uuid.uuid4()).replace("-", "")
    random_queue = conn.SimpleQueue(random_name)
    for i in range(0, 42):
        random_queue.put(i)
    random_queue.close()
    return random_name

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    control_queue = conn.SimpleQueue('control_queue')
    _a = 0
    while True:
        y_name = publish_data(conn)
        message = y_name
        # Announce the filled queue only after all its messages are written
        control_queue.put(message)
        print('Sent: {0}'.format(message))
        _a += 1
        sleep(0.3)
        if _a > 20:
            break
    control_queue.close()
Consumer
from queue import Empty
from kombu import Connection, Queue

def process_msg(foo):
    print(str(foo))
    with Connection("amqp://guest:guest@localhost:5672//") as _conn:
        sub_queue = _conn.SimpleQueue(str(foo))
        # Read until the pack's queue is empty
        while True:
            try:
                _msg = sub_queue.get(block=False)
                print(_msg.payload)
                _msg.ack()
            except Empty:
                break
        sub_queue.close()
        # Delete the now-empty queue
        chan = _conn.channel()
        dq = Queue(name=str(foo), exchange="")
        bdq = dq(chan)
        bdq.delete()

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    rec = conn.SimpleQueue('control_queue')
    while True:
        msg = rec.get(block=True)
        entry = msg.payload
        msg.ack()
        process_msg(entry)
A python multi-producer & multi-consumer threading pseudocode:
def threadProducer():
    while upstreams_not_done:
        data = do_some_work()
        queue_of_data.put(data)

def threadConsumer():
    while True:
        data = queue_of_data.get()
        do_other_work()
        queue_of_data.task_done()

queue_of_data = queue.Queue()
list_of_producers = create_and_start_producers()
list_of_consumers = create_and_start_consumers()
queue_of_data.join()
# is now all work done?
Here queue_of_data.task_done() is called for each item in the queue.
When producers work slower than consumers, is it possible that queue_of_data.join() returns at a moment when the consumers have finished every queued item with task_done() but the producers have not yet generated the rest of the data?
And if Queue.join() is not reliable in this situation, how can I check whether all the work is done?
The usual way is to put a sentinel value (like None) on the queue, one for each consumer thread, once the producers are done. Each consumer is then written to exit its thread when it pulls None from the queue.
So, e.g., in the main program:
for t in list_of_producers:
    t.join()
# Now we know all producers are done.
for t in list_of_consumers:
    queue_of_data.put(None) # tell a consumer we're done
for t in list_of_consumers:
    t.join()
and consumers look like:
def threadConsumer():
    while True:
        data = queue_of_data.get()
        if data is None:
            break
        do_other_work()
Note: if producers can overwhelm consumers, create the queue with a maximum size. Then queue.put() will block when the queue reaches that size, until a consumer removes something from the queue.
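Putting the pieces together, a complete runnable version of the sentinel pattern with a bounded queue might look like the sketch below. The worker counts, the `maxsize` of 4 and the tuple payloads are arbitrary choices for the demonstration, not part of the answer above.

```python
import queue
import threading

N_PRODUCERS, N_CONSUMERS, ITEMS_EACH = 2, 3, 5
work = queue.Queue(maxsize=4)   # bounded: put() blocks when consumers lag
results = []
results_lock = threading.Lock()


def producer(pid):
    for i in range(ITEMS_EACH):
        work.put((pid, i))      # stand-in for do_some_work()


def consumer():
    while True:
        item = work.get()
        if item is None:        # sentinel: no more work will arrive
            break
        with results_lock:
            results.append(item)  # stand-in for do_other_work()


producers = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
consumers = [threading.Thread(target=consumer) for _ in range(N_CONSUMERS)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()                    # every real item has been queued
for _ in consumers:
    work.put(None)              # one sentinel per consumer
for t in consumers:
    t.join()                    # every consumer has seen its sentinel
print(len(results))
```

Because the queue is first-in-first-out and the sentinels are added only after all producers have been joined, every real item is dequeued before any consumer sees None, so no work is lost.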
I would like to define a pool of n workers and have each execute tasks held in a RabbitMQ queue. When a task finishes (whether it fails or succeeds), I want the worker to execute another task from the queue.
I can see in the docs how to spawn a pool of workers and have them all wait for their siblings to complete. I would like something different, though: a buffer of n tasks where, when one worker finishes, another task is added to the buffer (so no more than n tasks are in the buffer). I'm having difficulty finding this in the docs.
For context, my non-multithreading code is this:
while True:
    message = get_frame_from_queue()  # get message from RabbitMQ
    do_task(message.body)             # body defines the URLs of the files to download
    acknowledge_complete(message)     # tell RabbitMQ the message is acknowledged
At this stage my "multithreading" implementation will look like this:
@receives('ask_for_a_job')
def get_a_task():
    # this function is executed when the `ask_for_a_job` signal is fired
    message = get_frame_from_queue()
    do_task(message)

def do_task(task_info):
    try:
        pass  # do stuff
    finally:
        # once the "worker" has finished, start another
        fire_signal('ask_for_a_job')

# start the "workers"
for i in range(5):
    fire_signal('ask_for_a_job')
I don't want to reinvent the wheel. Is there a more built in way to achieve this?
Note get_frame_from_queue is not thread safe.
You should be able to have each subprocess/thread consume directly from the queue, and then within each thread, simply process from the queue exactly as you would synchronously.
from threading import Thread

def do_task(msg):
    pass  # Do stuff here

def consume():
    # Each thread runs the same synchronous loop
    while True:
        message = get_frame_from_queue()
        do_task(message.body)
        acknowledge_complete(message)

if __name__ == "__main__":
    threads = []
    for i in range(5):
        t = Thread(target=consume)
        t.start()
        threads.append(t)
This way, you'll always have N messages from the queue being processed simultaneously, without any need for signaling to occur between threads.
The only "gotcha" here is the thread-safety of the rabbitmq library you're using. Depending on how it's implemented, you may need a separate connection per thread, or possibly one connection with a channel per thread, etc.
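If separate connections or channels are not an option, one alternative is to serialize only the fetch behind a lock while the tasks themselves run in parallel. The sketch below illustrates that idea under stated assumptions: the `pending` deque stands in for the broker, `get_frame_from_queue` is a hypothetical non-thread-safe fetch that returns None when nothing is left, and appending to `done` stands in for acknowledging the message.

```python
import threading
from collections import deque

pending = deque(f"msg-{i}" for i in range(20))   # stand-in for the broker
fetch_lock = threading.Lock()
done = []
done_lock = threading.Lock()


def get_frame_from_queue():
    """Hypothetical non-thread-safe fetch; returns None when nothing is left."""
    return pending.popleft() if pending else None


def consume():
    while True:
        with fetch_lock:                 # only one thread fetches at a time
            message = get_frame_from_queue()
        if message is None:
            break
        result = message.upper()         # stand-in for do_task(message.body)
        with done_lock:
            done.append(result)          # stand-in for acknowledge_complete(message)


threads = [threading.Thread(target=consume) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done))
```

If acknowledging a message uses the same non-thread-safe connection as fetching, the acknowledgement would need to be guarded by the same lock as the fetch.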
One solution is to leverage the multiprocessing.Pool object. Use an outer loop to get N items from RabbitMQ. Feed the items to the Pool, waiting until the entire batch is done. Then loop through the batch, acknowledging each message. Lastly continue the outer loop.
source
import multiprocessing

def worker(word):
    return bool(word=='whiskey')

messages = ['syrup', 'whiskey', 'bitters']
BATCHSIZE = 2
pool = multiprocessing.Pool(BATCHSIZE)
while messages:
    # take first few messages, one per worker
    batch,messages = messages[:BATCHSIZE],messages[BATCHSIZE:]
    print('BATCH:', end=' ')
    for res in pool.imap_unordered(worker, batch):
        print(res, end=' ')
    print()
    # TODO: acknowledge msgs in 'batch'
output
BATCH: False True
BATCH: False
I have a python script that implement two levels of multiprocessing:
from multiprocessing import Process, Queue, Lock

if __name__ == '__main__':
    pl1s = []
    for ip1 in range(10):
        pl1 = Process(target=process_level1, args=(...))
        pl1.start()
        pl1s.append(pl1)
    # do something for a while, e.g. for 24 hours
    # time to terminate
    # ISSUE: this terminates process_level1 processes
    # but not process_level2 ones
    for pl1 in pl1s:
        pl1.terminate()

def process_level1(...):
    # subscribe to external queue
    #
    with queue.open(name_of_external_queue, 'r') as subq:
        qInternal = Queue()
        pl2s = []
        for ip2 in range(3):
            pl2 = Process(target=process_level2, args=(qInternal,))
            pl2.start()
            pl2s.append(pl2)
        # grab messages from external queue and push them to
        # process_level2 processes to process
        #
        while True:
            message = subq.read()
            qInternal.put(message)

def process_level2(qInternal):
    while True:
        message = qInternal.get()
        # do something with data from message
So in main I launch a bunch of child processes running process_level1, each of which launches a bunch of its own child processes running process_level2. Main is supposed to run for a predefined amount of time (e.g. 24 hours) and then terminate everything. The problem is that the code above terminates the first layer of subprocesses but not the second.
How should I terminate both layers at the same time?
(Maybe an important) caveat: I guess one approach would be to set up an internal queue to communicate from main to process_level1, and then send a signal to each process_level1 subprocess telling it to terminate its own subprocesses. The problem is that process_level1 runs an infinite loop reading messages from an external queue, so I am not sure where and how I would check for the terminate signal from main.
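One possible shape for the approach described in the caveat is a shared `multiprocessing.Event` that both layers check on every loop iteration. The sketch below is an illustration under assumptions, not a drop-in fix: the external queue is replaced by a stand-in that produces a message every 50 ms, and it presumes that the real external read can be given a timeout or made non-blocking so the loop regularly gets a chance to look at the flag. The function `run` and the worker counts are made up for the example.

```python
import multiprocessing as mp
import queue
import time


def process_level2(q_internal, stop):
    """Worker: handle internal messages until the stop event is set."""
    while not stop.is_set():
        try:
            message = q_internal.get(timeout=0.1)
        except queue.Empty:
            continue                 # nothing yet; re-check the stop flag
        _ = message                  # placeholder for real processing


def process_level1(stop):
    """Middle layer: start workers, feed them, then wait for them to exit."""
    q_internal = mp.Queue()
    workers = [mp.Process(target=process_level2, args=(q_internal, stop))
               for _ in range(3)]
    for w in workers:
        w.start()
    while not stop.is_set():
        # Stand-in for a read from the external queue. The read must not
        # block forever, otherwise the stop flag is never re-checked.
        q_internal.put("message")
        time.sleep(0.05)
    for w in workers:
        w.join()
    q_internal.cancel_join_thread()  # do not wait to flush undelivered messages


def run(duration):
    """Start two first-level processes, stop everything after `duration` seconds."""
    stop = mp.Event()
    level1 = [mp.Process(target=process_level1, args=(stop,)) for _ in range(2)]
    for p in level1:
        p.start()
    time.sleep(duration)             # stand-in for "run for 24 hours"
    stop.set()                       # a single flag seen by both layers
    for p in level1:
        p.join(timeout=10)
    return [p.exitcode for p in level1]


if __name__ == "__main__":
    print(run(0.5))
```

Since each first-level process joins its own workers before returning, main only has to set the event and join the first layer. Calling `terminate()` can then be kept as a fallback for processes that fail to exit within the join timeout.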