Multiple ipc publishers and one subscriber using python-zmq - python

I'm wondering if it is possible to set up multiple ipc publishers for one subscriber using zmq ipc...
Abstractly I have only one publisher like this, but I need to run multiple instances of it, each getting a different data type but publishing the same format every time.
import json
import zmq

context = zmq.Context()
publisher = context.socket(zmq.PUB)
publisher.connect("ipc://VCserver")
myjson = json.dumps(worker.data)
publisher.send(myjson)
My subscriber:
import json
import zmq

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.bind("ipc://VCserver")
subscriber.setsockopt(zmq.SUBSCRIBE, '')

while True:
    response = subscriber.recv()
    if response:
        data = json.loads(response)
        check_and_store(data)
My subscriber checks the same parameters from the data every time and stores them in a db. I don't know if this is possible, since this mode of communication uses a shared file, and maybe I should think of publisher-subscriber pairs for every instance...
EDITED: Every publisher will send approx. 2 kB, 100 times/sec.

You can definitely have multiple publishers, the only restriction is that you cannot have multiple binders on one IPC socket - each successive bind simply clobbers previous ones (as opposed to TCP, where you will get EADDRINUSE if you try to bind to an already-in-use interface). Your case should work fine.
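For illustration, a minimal sketch of that layout (the ipc://VCserver endpoint and the check_and_store() call are taken from the question, everything else is illustrative): the subscriber binds the ipc endpoint once, and any number of publisher processes connect to it.

import json
import zmq

# --- run exactly one instance of this: it binds the shared ipc endpoint ---
def run_subscriber():
    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    subscriber.bind("ipc://VCserver")
    subscriber.setsockopt_string(zmq.SUBSCRIBE, "")   # receive everything
    while True:
        data = json.loads(subscriber.recv())
        check_and_store(data)                         # assumed to exist, as in the question

# --- run as many instances of this as you like: they all connect ---
def run_publisher(worker):
    context = zmq.Context()
    publisher = context.socket(zmq.PUB)
    publisher.connect("ipc://VCserver")
    publisher.send_string(json.dumps(worker.data))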

Related

Implementing a Master/Worker pattern with PUB/SUB using ZeroMQ

I have written a toy "Master/Worker" or "task farm" using ZeroMQ.
This is what I have got so far - but I want to add PUB/SUB, so that the workers listen and respond to topics (either specific topics, or wildcard matches).
master
#!/usr/bin/env python
from __future__ import print_function
import random
import time
from multiprocessing import Pool, Process
import zmq
from zmq.devices.basedevice import ProcessDevice
REQ_ADDRESS = 'tcp://127.0.0.1:6240'
REP_ADDRESS = 'tcp://127.0.0.1:6241'
if __name__ == '__main__':
    # Start queue
    context = zmq.Context()
    sock_in = context.socket(zmq.ROUTER)
    sock_in.bind(REQ_ADDRESS)
    sock_out = context.socket(zmq.DEALER)
    sock_out.bind(REP_ADDRESS)
    zmq.device(zmq.QUEUE, sock_in, sock_out)
worker
#!/usr/bin/env python
from __future__ import print_function
import random
import time
import zmq
REP_ADDRESS = 'tcp://127.0.0.1:6241'
def receive_tasks():
    """
    Worker action: receive tasks
    """
    # ID: just to show that we're getting the right replies
    my_id = random.randint(1, 1000000)
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.connect(REP_ADDRESS)
    while True:
        # Data is received here. Note that this blocks until
        # we get a job.
        job = socket.recv_json()
        # Do work here
        time.sleep(0.5)
        # Send the result back. Pass any JSON-serializable object.
        socket.send_json([my_id, job['id'], job['task_id']])

if __name__ == '__main__':
    receive_tasks()
client
#!/usr/bin/env python
from __future__ import print_function
import random
import zmq
from zmq.core.poll import select
REQ_ADDRESS = 'tcp://127.0.0.1:6240'
def request_tasks():
"""
Client action: request tasks
"""
# ID: just to show that we're getting the right replies
my_id = random.randint(1, 1000000)
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect(REQ_ADDRESS)
for i in xrange(100):
job = {'id': my_id, 'task_id': random.randint(1, 100)}
socket.send_json(job)
# Selects the sockets that have READ, WRITE, and ERROR
# events respectively within the lists, with timeout 5.
# Same API as: http://docs.python.org/library/select.html
(rlist, wlist, xlist) = select([socket], [], [], 5)
if len(rlist) > 0:
# This receives the reply and deserializes it from JSON.
msg = socket.recv_json()
print('Client {0}, task #{1}: received work from {2} (for: {3})'.format(
my_id, i+1, msg[0], msg[1]))
else:
print('Client {0}, task#{1}: error, timeout reached.'.format(my_id,
i+1))
socket.close()
socket = context.socket(zmq.REQ)
socket.connect(REQ_ADDRESS)
if __name__ == '__main__':
request_tasks()
My question is: how can I modify the master and workers to be "TOPIC aware" - using PUB/SUB?
Note: Although my example code is in Python, and the image illustration refers to Java - I'm actually writing my real code in C++, so please (if possible) don't use any language specifics in your answer.
Q : ... how can I modify the master and workers to be "TOPIC aware" - using PUB/SUB?
Welcome. The PUB/SUB Scalable Formal Communication Pattern archetype has one silent trap (a caveat if you do not read the full details of the ZeroMQ API): the SUB-side has to actively subscribe to something, otherwise it receives nothing at all - much like a real newspaper, which never appears at your doorstep unless a subscription has been raised and paid for. Here the cost is split between the PUB- and SUB-side Context()-engines. Early versions did the TOPIC-filtering of (all) delivered messages as late as possible, i.e. after delivery to (all) SUB-sides, spending network-I/O and (distributed) CPU-load just to discard unwanted messages, so as to offload the PUB-side, which has to run at a hell-fast cadence and volume. Later versions (post v3.4+, IIRC) moved the TOPIC-management and TOPIC-filtering onto the PUB-side, which avoids the network-I/O but requires the PUB-side RAM and CPU resources to be tuned up adequately for large-scale deployments. Professional FinTech industry has been more than happy with the resulting performance and latency envelopes, so no need to panic prematurely over this.
ZeroMQ PUB-side TOPIC-filtering has always been based on the actual message payload byte-stream: given that some SUB-s have already set up an active subscription to "ABC", the PUB-side will place any message payload starting with "ABC..." into their respective delivery queue(s). TOPIC-filter subscription management is well defined in the ZeroMQ documentation; the points worth noticing are the default state, where no subscription is present at all (so nothing is received), and the ""-string subscription (receive everything), which would produce rather absurd results in a Master/Worker herd: every work-package would get processed by each and every Worker, which (except for some ultimate, yet gigantically expensive and inefficient, fault-fighting robustness scheme) makes no sense and brings no performance, latency or other benefit. A minimal illustration of the prefix-matching behaviour is sketched below.
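A small sketch of the prefix-matching (the topic strings "ABC" and "XYZ", the TCP port and the sleep are made up for illustration; the SUB socket only ever sees messages whose payload starts with the prefix it subscribed to):

import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:6300")

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:6300")
sub.setsockopt_string(zmq.SUBSCRIBE, "ABC")   # prefix-match on the payload bytes

time.sleep(0.5)                               # let the subscription propagate

pub.send_string("ABC task-1")                 # delivered: payload starts with "ABC"
pub.send_string("XYZ task-2")                 # filtered out, never reaches the SUB

print(sub.recv_string())                      # -> "ABC task-1"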
This said, there are no other limitations on designing a meta-plane network of any number of additional signalling and communication "socket"-archetypes that together fulfil the job:
Master
can use
PUB.{ bind | setsockopt | send | close }() in due order and fashion, as a cheap way (the latency + RAM + CPU remarks above still apply) to distribute job-tasks only to those Workers who have actively subscribed for them (a "herd"-management layer can handle newcomers, lost ones and N-replicated job-tasking, all just by using the TOPIC-filtering tricks),
can use
PULL.{ bind | setsockopt | poll | receive | close }() accordingly, so as to efficiently, best inside some "soft/mild" real-time driven control-loop, collect the results of the distributed work-packages, validating them against (un)authorised and/or (non)tampered controlling checkups as needed,
can also
"soft"-signal the (un)authenticated worker(s) about presence / health-status / state-of-work, if needed, either by re-using the primary PUB/SUB channel and receiving answers via the primary PUSH/PULL one, or by setting up a knowingly separate, offloading, secondary signalling PUB/SUB channel among Master/Workers, so as to keep this "soft"-signalling flow independent of the primary workloads (indeed growing a more professional SIG/COMMs meta-plane architecture for custom-defined distributed computing).
The soft-signalling channel is a typical way to create a kind of domain-specific language (with a grammar of "commands") for controlling the "herd" throughout the whole lifecycle of such a distributed system.
Cool, isn't it?
Client ( doing a set of selective work-types )
can use
SUB.{ connect | setsockopt | receive | close }() in due order and fashion, so as to adaptively set up, configure and receive the subscribed-to work-packages from the PUB, while keeping any other complexity of STATE-signalling and DATA-inter-communication with an otherwise unrestricted set of peers,
can use
PUSH.{ connect | setsockopt | send | close }() accordingly, so as to match the Master's way of collecting (+ authenticating, eventually + protecting against tampering) any and all results of the received work-packages, self-validating itself as "The Authorised" entity to deliver them and/or providing any tampering-control checkups, if and as needed,
can also
receive and respond to any "soft"-signal request, or asynchronously notify any explicit state-changes (implicit state-change detection is naturally the Master's task, e.g. after not receiving any response) related to presence / health-state / state-of-work, if needed, either by re-using the primary PUB/SUB channel and delivering the respective responses via the primary PUSH/PULL upstream, or via a knowingly separate, Master-offloading, secondary signalling PUB and/or other channels among the Workers themselves plus the Master, so as to keep any kind of "soft"-signalling flow independent of the primary workloads (the custom-defined distributed computing can indeed create any sort of "Parallel Universe", where the Master is, or is not, a part thereof ;o). A minimal sketch of this PUB/SUB + PUSH/PULL topology is shown below.
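The sketch below is illustrative only, not a drop-in replacement for the question's code: it assumes the topic string doubles as the work-type name, reuses the 6240/6241 ports from the question for the PUB and PULL legs, and relies on an assumed do_work() helper.

import json
import zmq

PUB_ADDRESS  = 'tcp://127.0.0.1:6240'   # Master PUB.bind()  -> Workers SUB.connect()
PULL_ADDRESS = 'tcp://127.0.0.1:6241'   # Master PULL.bind() <- Workers PUSH.connect()

def master(tasks):
    ctx = zmq.Context()
    pub, pull = ctx.socket(zmq.PUB), ctx.socket(zmq.PULL)
    pub.bind(PUB_ADDRESS)
    pull.bind(PULL_ADDRESS)
    for topic, payload in tasks:                       # e.g. ('resize', {...})
        pub.send_multipart([topic.encode(),            # frame 0: TOPIC, prefix-filtered
                            json.dumps(payload).encode()])
    while True:
        result = json.loads(pull.recv())               # collect results from any worker
        print('got result:', result)

def worker(topics):
    ctx = zmq.Context()
    sub, push = ctx.socket(zmq.SUB), ctx.socket(zmq.PUSH)
    sub.connect(PUB_ADDRESS)
    push.connect(PULL_ADDRESS)
    for t in topics:                                   # subscribe only to the work-types
        sub.setsockopt_string(zmq.SUBSCRIBE, t)        # this worker is willing to do
    while True:
        topic, payload = sub.recv_multipart()
        result = do_work(topic, json.loads(payload))   # do_work() is an assumed helper
        push.send(json.dumps(result).encode())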

Task delegation in Python/Redis

I have an issue thinking of an architecture that'll solve the following problem:
I have a web application (producer) that receives some data on request. I also have a number of processes (consumers) that should process this data. 1 request generates 1 batch of data and should be processed by only 1 consumer.
My current solution consists of receiving the data, caching it in memory with Redis, and sending a message through a message channel that data has been written, while the consumers listen on the same channel; the data is then processed by one of the consumers. The issue here is that I need to stop multiple consumers from working on the same data. So how can I inform the other consumers that I have started working on this task?
Producer code (flask endpoint):
data = request.get_json()
db = redis.Redis(connection_pool=pool)
db.set(data["externalId"], data)
# Subscribe to the batches channel and publish the id
db.pubsub()
db.publish('batches', request_key)
results = None
result_key = str(data["externalId"])
# Wait till the batch is processed
while results is None:
    results = db.get(result_key)
if results is not None:
    results = results.decode('utf8')
db.delete(data["externalId"])
db.delete(result_key)
Consumer:
db = redis.Redis(connection_pool = pool)
channel = db.pubsub()
channel.subscribe('batches')
while True:
    try:
        message = channel.get_message()
        message_data = bytes(message['data']).decode('utf8')
        external_id = message_data.split('-')[-1]
        data = json.loads(db.get(external_id).decode('utf8'))
        result = DataProcessor.process(data)
        db.set(str(external_id), result)
    except Exception:
        pass
PUBSUB is often problematic for task queuing for exactly this reason. From the docs (https://redis.io/topics/pubsub):
SUBSCRIBE, UNSUBSCRIBE and PUBLISH implement the Publish/Subscribe messaging paradigm where (citing Wikipedia) senders (publishers) are not programmed to send their messages to specific receivers (subscribers). Rather, published messages are characterized into channels, without knowledge of what (if any) subscribers there may be.
A popular alternative to consider would be to implement "publish" by pushing an element to the end of a Redis list, and "subscribe" by having your worker poll that list at some interval (exponential backoff is often an appropriate choice). In order to avoid cases where multiple workers get the same job, use lpop to get and remove an element from the list. Redis is single-threaded, so you're guaranteed only one worker will receive each element.
So, on the publish side, aim for something like this:
db = redis.Redis(connection_pool=pool)
db.rpush("my_queue", task_payload)
And on the subscribe side, you can safely run a loop like this in parallel as many times as you need:
while True:
    db = redis.Redis(connection_pool=pool)
    payload = db.lpop("my_queue")
    if not payload:
        # back off / sleep briefly here instead of busy-spinning (see polling note above)
        continue
    < deserialize and process payload here >
Note this is a first-in-first-out (FIFO) queue, since we're pushing onto the right side with rpush and popping off the left with lpop. You can get a LIFO (stack) version trivially by using lpush/lpop instead. A blocking variant based on blpop is sketched below.
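As a variant (not part of the original answer), Redis also provides blpop, which blocks until an element arrives, so the worker does not need to poll at all. A minimal sketch, assuming the same "my_queue" list, the same connection pool, JSON-serialized payloads and an illustrative process() helper:

import json
import redis

db = redis.Redis(connection_pool=pool)       # `pool` as defined elsewhere in the answer

while True:
    # blpop waits up to `timeout` seconds and returns (key, value), or None on timeout
    item = db.blpop("my_queue", timeout=5)
    if item is None:
        continue                             # nothing arrived yet; keep waiting
    _key, payload = item
    process(json.loads(payload))             # process() is a placeholder for your handler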

RabbitMQ + kombu: write/read to one-time use queues with random names

I am new to working with message exchanges and have had trouble finding a proper manual for the task.
I need to organize a pool of queues so that:
The producer creates some random empty queue and writes the whole pack of messages to it (usually 100 messages).
A consumer finds a non-empty and non-locked queue, reads from it until it's empty, then deletes it and looks for the next one.
So my task is to work with messages as packs. I understand how to produce and consume using the same key in one queue, but can't find how to work with a pool of queues.
We can have several producers and consumers running in parallel, and it doesn't matter which of them sends to whom. We don't need, and in fact can't, link a particular producer to a particular consumer.
General task: we have a lot of clients receiving push-notifications; we group pushes by some parameters to process them later as a group, so each group should live in one queue in RabbitMQ and be produced and consumed as a group, while each group is independent from the others.
Big thanks to Hannu for the help: the key idea of his easy and robust solution is that we can have one persistent queue with a known name, where the producer writes the names of the created queues and the consumer reads those names from there.
To make his solution more readable and easier to work with in my particular task, I divided publish_data() in the producer into two functions - one creates a random queue and writes its name to control_queue, the other receives this random_queue and fills it with messages. A similar split works for the consumer - one function processes the queue, another is called to process each message itself.
I have done something like this, but with Pika. I had to clean up and kombufy an old code snippet for the examples. It is probably not very kombuish (this is absolutely the first code snippet I have written with it), but this is how I would solve it. Basically I would set up a control queue with a known name.
Publishers will create a random queue name for a pack of messages, dump N messages to it (in my case numbers 1-42) and then post the queue name to the control queue. A consumer then receives this queue name, binds to it, reads messages until the queue is empty and then deletes the queue.
This keeps things relatively simple, as publishers do not need to figure out where they are allowed to publish their groups of data (every queue is new, with a random name). Receivers do not need to worry about timeouts or "all done" messages, as a receiver only gets a queue name once a group of data has been written to the queue and every message is already there waiting.
There is also no need to tinker with locks or signalling or anything else that would complicate things. You can have as many consumers and producers as you want. And of course using exchanges and routing keys there could be different sets of consumers for different tasks etc.
Publisher
from kombu import Connection
import uuid
from time import sleep
def publish_data(conn):
    random_name = "q" + str(uuid.uuid4()).replace("-", "")
    random_queue = conn.SimpleQueue(random_name)
    for i in xrange(0, 42):
        random_queue.put(i)
    random_queue.close()
    return random_name

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    control_queue = conn.SimpleQueue('control_queue')
    _a = 0
    while True:
        y_name = publish_data(conn)
        message = y_name
        control_queue.put(message)
        print('Sent: {0}'.format(message))
        _a += 1
        sleep(0.3)
        if _a > 20:
            break
    control_queue.close()
Consumer
from Queue import Empty
from kombu import Connection, Queue
def process_msg(foo):
    print str(foo)
    with Connection("amqp://guest:guest@localhost:5672//") as _conn:
        sub_queue = _conn.SimpleQueue(str(foo))
        while True:
            try:
                _msg = sub_queue.get(block=False)
                print _msg.payload
                _msg.ack()
            except Empty:
                break
        sub_queue.close()
        chan = _conn.channel()
        dq = Queue(name=str(foo), exchange="")
        bdq = dq(chan)
        bdq.delete()

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    rec = conn.SimpleQueue('control_queue')
    while True:
        msg = rec.get(block=True)
        entry = msg.payload
        msg.ack()
        process_msg(entry)

Consume multiple messages at a time

I am using an external service (Service) to process some particular type of objects. The Service works faster if I send objects in batches of 10. My current architecture is as follows. A producer broadcasts objects one-by-one, and a bunch of consumers pull them (one-by-one) from a queue and send them to The Service. This is obviously suboptimal.
I don't want to modify producer code as it can be used in different cases. I can modify consumer code but only with the cost of additional complexity. I'm also aware of the prefetch_count option but I think it only works on the network level -- the client library (pika) does not allow fetching multiple messages at once in the consumer callback.
So, can RabbitMQ create batches of messages before sending them to consumers? I'm looking for an option like "consume n messages at a time".
You cannot batch messages in the consumer callback, but you could use a thread safe library and use multiple threads to consume data. The advantage here is that you can fetch five messages on five different threads and combine the data if needed.
As an example, you can take a look at how I would implement this using my AMQP library.
https://github.com/eandersson/amqpstorm/blob/master/examples/scalable_consumer.py
The code below uses channel.consume to start consuming messages, and breaks out/stops when the desired number of messages has been reached.
I have set a batch_size to prevent pulling a huge number of messages at once. You can always change the batch_size to fit your needs.
import json

from pika.exceptions import AMQPConnectionError, ChannelWrongStateError, StreamLostError

def consume_messages(queue_name: str):
    # `channel` and `logger` are assumed to be set up elsewhere
    msgs = list([])
    batch_size = 500
    q = channel.queue_declare(queue_name, durable=True, exclusive=False, auto_delete=False)
    q_length = q.method.message_count
    if not q_length:
        return msgs
    msgs_limit = batch_size if q_length > batch_size else q_length
    try:
        # Get messages and break out
        for method_frame, properties, body in channel.consume(queue_name):
            # Append the message
            try:
                msgs.append(json.loads(bytes.decode(body)))
            except:
                logger.info(f"Rabbit Consumer : Received message in wrong format {str(body)}")
            # Acknowledge the message
            channel.basic_ack(method_frame.delivery_tag)
            # Escape out of the loop when desired msgs are fetched
            if method_frame.delivery_tag == msgs_limit:
                # Cancel the consumer and return any pending messages
                requeued_messages = channel.cancel()
                print('Requeued %i messages' % requeued_messages)
                break
    except (ChannelWrongStateError, StreamLostError, AMQPConnectionError) as e:
        logger.info(f'Connection Interrupted: {str(e)}')
    finally:
        # Close the channel and the connection
        channel.stop_consuming()
        channel.close()
    return msgs

ZMQ PUB Send file

I'm trying (PY)ZMQ for the first time, and wonder if it's possible to send a complete FILE (binary) using PUB/SUB? I need to send database updates to many subscribers. I see examples of short messages but not files. Is it possible?
publisher:
import zmq
import time
import os
import sys
while True:
    print 'loop'
    msg = 'C:\TEMP\personnel.db'
    # Prepare context & publisher
    context = zmq.Context()
    publisher = context.socket(zmq.PUB)
    publisher.bind("tcp://*:2002")
    time.sleep(1)
    curFile = 'C:/TEMP/personnel.db'
    size = os.stat(curFile).st_size
    print 'File size:', size
    target = open(curFile, 'rb')
    file = target.read(size)
    if file:
        publisher.send(file)
    publisher.close()
    context.term()
    target.close()
    time.sleep(10)
subscriber:
'''always listening'''
import zmq
import os
import time
import sys
while True:
    path = 'C:/TEST'
    filename = 'personnel.db'
    destfile = path + '/' + filename
    if os.path.isfile(destfile):
        os.remove(destfile)
        time.sleep(2)
    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    subscriber.connect("tcp://127.0.0.1:2002")
    subscriber.setsockopt(zmq.SUBSCRIBE, '')
    msg = subscriber.recv(313344)
    if msg:
        f = open(destfile, 'wb')
        print 'open'
        f.write(msg)
        print 'close\n'
        f.close()
    time.sleep(5)
You should be able to distribute files to many subscribers using zmq and the PUB/SUB pattern.
Your code is almost there; in other words, it will probably work in most situations, but it can be improved a bit.
Things to be aware of
Messages live in memory
The message must fit into memory when it is published (while it lives in the PUB socket), and it stays there until the last currently subscribed consumer has read it or disconnected.
The message must also fit into memory when being received. With reasonably large files (like your 313 kB) this works fine unless you are really short of RAM.
Slow consumer issue
In case you have multiple consumers and one of them reads much more slowly than the others, it will start slowing all of them down. The ZeroMQ guide explains this problem and proposes some methods to avoid it (e.g. the "suicidal snail" pattern, where a slow subscriber kills itself).
However, in most situations you will not encounter this problem.
Start your consumer first so as not to miss a message
zmq messaging is extremely fast. There is no problem if you start your consumer sooner than the publisher; zmq makes this scenario easy and the consumer will connect automatically.
However, your publisher should allow consumers to connect before it starts publishing; your code sleeps for 1 second before sending the message, which should be sufficient.
Comments to your code
Do you really have to sleep after os.remove? Probably not.
subscriber.recv - there is no need to know the message size in advance; a zmq message carries its own length, so if you call recv() without a number of bytes to receive, you will get the whole message properly. A cleaned-up sketch reflecting these two comments follows.
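For illustration only (not from the original answer), a minimal sketch of both sides with the sockets created once, outside the loop, and recv() called without a size; the file paths are the ones used in the question:

import time
import zmq

# --- publisher: bind once, then resend the file every 10 seconds ---
context = zmq.Context()
publisher = context.socket(zmq.PUB)
publisher.bind("tcp://*:2002")
time.sleep(1)                                  # give subscribers a moment to connect

while True:
    with open('C:/TEMP/personnel.db', 'rb') as f:
        publisher.send(f.read())               # one zmq message = the whole file
    time.sleep(10)

# --- subscriber (separate process): connect once, recv() returns the complete message ---
context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.connect("tcp://127.0.0.1:2002")
subscriber.setsockopt_string(zmq.SUBSCRIBE, "")

while True:
    msg = subscriber.recv()
    with open('C:/TEST/personnel.db', 'wb') as f:
        f.write(msg)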
Send large files in chunks
zmq provides a feature called multipart messages, but according to the docs a multipart message has to fit completely (all message parts) in memory before being sent out, so this is not the trick to use for really big files.
On the other hand, you can create an "application level multipart protocol", in the sense that you decide to send messages with a structure like (hasNextPart, chunkData). This way you would be sending messages of a well-controlled size, and only the last one would say hasNextPart == False.
The consumer would then read all the parts and write them to disk until the last message declares that no further part will arrive. A sketch of such a chunked protocol is shown below.
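A minimal sketch of such a chunking protocol (the chunk size, the two-frame framing and the helper names are illustrative choices, not part of the original answer; each chunk travels as one small message, so nothing large has to sit in memory at once):

import zmq

CHUNK_SIZE = 64 * 1024                             # illustrative chunk size

def send_file_in_chunks(publisher, path):
    with open(path, 'rb') as f:
        chunk = f.read(CHUNK_SIZE)
        while True:
            next_chunk = f.read(CHUNK_SIZE)
            has_next = b'1' if next_chunk else b'0'
            # frame 0: hasNextPart flag, frame 1: chunk data
            publisher.send_multipart([has_next, chunk])
            if not next_chunk:
                break
            chunk = next_chunk

def receive_file_in_chunks(subscriber, dest_path):
    with open(dest_path, 'wb') as f:
        while True:
            has_next, chunk = subscriber.recv_multipart()
            f.write(chunk)
            if has_next == b'0':                   # last part: stop writing
                break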
