Implementing a Master/Worker pattern with PUB/SUB using ZeroMQ - python

I have written a toy "Master/Worker" or "task farm" example using ZeroMQ.
This is what I have got so far - but I want to add PUB/SUB, so that the workers listen and respond to topics (either specific topics, or wildcard matches).
master
#!/usr/bin/env python
from __future__ import print_function
import random
import time
from multiprocessing import Pool, Process
import zmq
from zmq.devices.basedevice import ProcessDevice
REQ_ADDRESS = 'tcp://127.0.0.1:6240'
REP_ADDRESS = 'tcp://127.0.0.1:6241'
if __name__ == '__main__':
    # Start queue
    context = zmq.Context()
    sock_in = context.socket(zmq.ROUTER)
    sock_in.bind(REQ_ADDRESS)
    sock_out = context.socket(zmq.DEALER)
    sock_out.bind(REP_ADDRESS)
    zmq.device(zmq.QUEUE, sock_in, sock_out)
worker
#!/usr/bin/env python
from __future__ import print_function
import random
import time
import zmq
REP_ADDRESS = 'tcp://127.0.0.1:6241'
def receive_tasks():
    """
    Worker action: receive tasks and send back results
    """
    # ID: just to show that we're getting the right replies
    my_id = random.randint(1, 1000000)
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.connect(REP_ADDRESS)
    while True:
        # Data is received here. Note that this blocks until
        # we get a job.
        job = socket.recv_json()
        # Do work here
        time.sleep(0.5)
        # Send the result back. Pass any JSON-serializable object.
        socket.send_json([my_id, job['id'], job['task_id']])

if __name__ == '__main__':
    receive_tasks()
client
#!/usr/bin/env python
from __future__ import print_function
import random
import zmq
REQ_ADDRESS = 'tcp://127.0.0.1:6240'
def request_tasks():
    """
    Client action: request tasks
    """
    # ID: just to show that we're getting the right replies
    my_id = random.randint(1, 1000000)
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect(REQ_ADDRESS)
    for i in range(100):
        job = {'id': my_id, 'task_id': random.randint(1, 100)}
        socket.send_json(job)
        # zmq.select returns the sockets that have READ, WRITE, and ERROR
        # events respectively within the lists, with a 5-second timeout.
        # Same API as: http://docs.python.org/library/select.html
        (rlist, wlist, xlist) = zmq.select([socket], [], [], 5)
        if len(rlist) > 0:
            # This receives the reply and deserializes it from JSON.
            msg = socket.recv_json()
            print('Client {0}, task #{1}: received work from {2} (for: {3})'.format(
                my_id, i + 1, msg[0], msg[1]))
        else:
            print('Client {0}, task #{1}: error, timeout reached.'.format(my_id,
                                                                          i + 1))
            # The REQ socket is now stuck in its send/recv cycle, so replace it.
            socket.close()
            socket = context.socket(zmq.REQ)
            socket.connect(REQ_ADDRESS)

if __name__ == '__main__':
    request_tasks()
My question is: how can I modify the master and workers to be "TOPIC aware" - using PUB/SUB?
Note: Although my example code is in Python (and the illustration refers to Java), I'm actually writing my real code in C++, so please, if possible, avoid language-specific details in your answer.

Q : ... how can I modify the master and workers to be "TOPIC aware" - using PUB/SUB?
Welcome. The PUB/SUB Scalable Formal Communication Pattern archetype has one silent trap (a caveat, if you do not read the full details in the ZeroMQ API): the SUB side has to actively subscribe to something, otherwise it receives nothing. As with real newspapers, none will ever turn up at the doorstep unless a subscription has been both raised and paid for. Here the costs are split between the PUB-side and SUB-side Context()-engines: early versions did the TOPIC-filtering of (all) delivered messages as late as possible, i.e. after delivery to (all) SUB sides, trading network I/O and many (distributed) CPU loads for offloading the PUB side, which had to publish at very high cadence and volume. Later versions (since the v3.x line, IIRC) moved the TOPIC management and TOPIC filtering onto the PUB side, which avoids the network I/O but demands that the PUB side's RAM and CPU resources be tuned up adequately for large-scale deployments. The professional FinTech industry has been more than happy with the resulting performance and latency envelopes, so there is no need to panic prematurely about this.
ZeroMQ PUB-archetype TOPIC-filtering has always been based on the actual message-payload byte-stream: given that some SUBs have already set up an active subscription to "ABC", the PUB side will place any message whose payload starts with "ABC..." into their respective delivery queue(s). TOPIC-filter subscription management is well defined in the ZeroMQ documentation; I just wanted to point out the default state, where no subscription is present at all, and to mention the ""-string subscription (receive everything), which would produce rather absurd results in a Master/Worker herd: every work-package would get processed by each and every Worker, which (except for some ultimate, yet gigantically expensive and inefficient, fault-fighting robustness-increasing approach) makes no sense and brings no performance, latency or other benefit.
This said, there is no other limitation on designing a meta-plane network of any number of additional Signalling & Communication "socket"-archetypes that together fulfil the job:
Master
can
PUB.{ bind | setsockopt | send | close }() in due order and fashion, giving a cheap way ( ... the latency + RAM + CPU remarks above still apply ... ) to distribute job-tasks only to those who have actively subscribed to receive them ( a "herd"-management layer could handle newcomers, lost ones, or N-replicated job-tasking, all just by using the TOPIC-filtering tricks ) - a minimal sketch of this follows below
Can use
PULL.{ bind | setsockopt | poll | receive | close }() accordingly, so as to efficiently - best inside some "soft/mild" real-time driven control-loop - collect the results of the distributed work-packages, validating them against (un)authorised and/or (non)tampered controlling checkups, as needed
Can also
"soft"-signal about (un)authenticated worker(s), for presence / health-status / state-of-work, if needed, be it by re-using the primary PUB/SUB channel and receiving answers via the primary PUSH/PULL one. Yet there is no problem to setup a knowingly separate, offloading, secondary signalling PUB/SUB channels among Master/Workers, so as to keep this "soft"-signalling flow independently of the primary workloads ( indeed growing a more professional SIG/COMMs meta-plane architecture for custom-defined distributed-computing )
A soft-signalling channel is a typical way to create a kind of domain-specific language (with a grammar of "commands") for controlling the "herd" throughout the whole lifecycle of such a distributed system.
Cool, isn't it?
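To make the idea concrete, here is a minimal sketch of a topic-aware Master in Python/pyzmq, matching the style of the question's example code. The ports, topic names and JSON payload layout are illustrative assumptions, not part of the original code:
#!/usr/bin/env python
# Minimal sketch of a topic-aware Master: PUB distributes topic-tagged tasks,
# PULL collects results. Ports, topic names and payload layout are illustrative.
import json
import time
import zmq

PUB_ADDRESS  = 'tcp://127.0.0.1:6250'   # tasks  : Master PUB  -> Worker SUB
PULL_ADDRESS = 'tcp://127.0.0.1:6251'   # results: Worker PUSH -> Master PULL

if __name__ == '__main__':
    context = zmq.Context()
    task_pub = context.socket(zmq.PUB)
    task_pub.bind(PUB_ADDRESS)
    results = context.socket(zmq.PULL)
    results.bind(PULL_ADDRESS)

    # Give subscribers a moment to connect and subscribe (the "slow joiner" caveat).
    time.sleep(1.0)

    for task_id in range(10):
        # The first frame acts as the topic; SUB-side filtering is a plain byte-prefix match.
        topic = b'tasks.even' if task_id % 2 == 0 else b'tasks.odd'
        task_pub.send_multipart([topic, json.dumps({'task_id': task_id}).encode()])

    # Collect results (a real Master would poll with timeouts inside a control loop).
    for _ in range(10):
        print(results.recv_json())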
Client ( the question's Worker, doing a set of selective work-types )
can
SUB.{ connect | setsockopt | receive | close }() in due order and fashion, so as to adaptively set up, configure and receive the subscribed-to work-packages from the PUB side, while keeping any other complexity of STATE-signalling and DATA-inter-communication with an otherwise principally unrestricted set of peers ( a minimal sketch follows this list )
Can use
PUSH.{ connect | setsockopt | send | close }() accordingly, so as to match the Master's way of delivering (+ authenticating, possibly + protecting against tampering) any and all results for the received work-packages, self-validating itself as "The Authorised" entity to deliver them and/or providing any tampering-control checkups, if and as needed
Can also
receive and respond to any "soft"-signal request or asynchronously notifying any explicit state-changes (implicit state-changes detection is naturally the Master's task, after not receiving any response and alike) related to presence / health-state / state-of-work, etc, if needed, either by re-using the primary PUB/SUB channel and delivering such respective responsed via the primary PUSH/PULL upstream one. Yet there is no problem to setup a knowingly separate, Master-offloading, secondary signalling PUB and/or other channels among Workers themselves plus the Master, so as to keep any kind of "soft"-signalling flow independent of the primary workloads ( the custom-defined distributed-computing can indeed create any sort of "Parallel Universe", where Master is (or is not) a part thereof ;o) )

Related

How to make the program wait until the function returns a value?

I have code like this :
zmq = Zmq_Connector_Mod.DWX_ZeroMQ_Connector()
zmq._GET_HIST_INDICATORS_(_symbol, 'C1')
sleep(random() * 5 )
c1_path = zmq._GET_DATA_()
zmq = Zmq_Connector_Mod.DWX_ZeroMQ_Connector()
zmq._GET_HIST_INDICATORS_(_symbol, 'BASELINE')
sleep(random() * 5 )
baseline_path = zmq._GET_DATA_()
zmq = Zmq_Connector_Mod.DWX_ZeroMQ_Connector()
zmq._GET_HIST_INDICATORS_(_symbol, 'C2')
sleep(random() * 5 )
c2_path = zmq._GET_DATA_()
zmq = Zmq_Connector_Mod.DWX_ZeroMQ_Connector()
zmq._GET_HIST_INDICATORS_(_symbol, 'EXIT')
sleep(random() * 5 )
exit_path = zmq._GET_DATA_()
I have a problem when zmq._GET_DATA_() runs: it doesn't have a returned value yet, because the zmq._GET_HIST_INDICATORS_() function needs a couple of seconds to return. I already used sleep(), but that isn't efficient, because when I run this code on another device that is slower than mine, it just doesn't help. How can I make the program wait before executing zmq._GET_DATA_() until zmq._GET_HIST_INDICATORS_() has returned its value, without using sleep(), given that every device needs a different amount of time to execute the code?
Higher-level overview here: typically in asynchronous message queueing, there are a few patterns you can use so you don't have to poll over and over:
Publish-subscribe
Get with wait
Request-reply
Message listener
This is implementable in ZeroMQ, e.g. https://rillabs.org/posts/pub-sub-with-zeromq-in-python and this stackoverflow question discusses it in detail: ZeroMQ - Multiple Publishers and Listener
Get with wait is a pattern where a timeout is set on the get operation: the call blocks until a message arrives or the timeout expires, and only reports an error once the time runs out. With ZeroMQ you can get this behaviour by setting a receive timeout on the socket (the RCVTIMEO socket option) or by polling with a timeout before calling recv().
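A minimal sketch of the get-with-wait idea in pyzmq (the address and timeout below are illustrative assumptions, and the socket type can be whatever your connector uses):
# "Get with wait": block on recv() until a message arrives or the timeout expires.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)
sock.connect('tcp://127.0.0.1:5555')    # illustrative address
sock.setsockopt(zmq.RCVTIMEO, 5000)     # wait at most 5000 ms

try:
    msg = sock.recv()                   # blocks up to 5 s; no manual sleep() needed
    print('got:', msg)
except zmq.Again:
    print('timeout: no message arrived within 5 s')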
Request-reply is typically implemented where the requestor specifies a reply queue and does a get with wait operation. Using this means you'll know which returned message corresponds to each message you sent. https://zguide.zeromq.org/docs/chapter3/#Recap-of-Request-Reply-Sockets
Message listeners set up responsive objects that react to events and can be implemented in various ways. Various message queueing technologies have this built in; I couldn't find a good zmq example, but it's definitely implementable!
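For illustration, one hedged way to build such a listener on top of pyzmq is a background thread that blocks on recv() and hands each message to a callback; the function and names below are made up for the example:
# Illustrative message-listener sketch: a daemon thread blocks on recv()
# and passes every message to a user-supplied callback. All names are made up.
import threading
import zmq

def start_listener(address, on_message):
    def _loop():
        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.PULL)
        sock.connect(address)
        while True:
            on_message(sock.recv())     # blocks; no polling loop in user code
    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t

# usage:
# start_listener('tcp://127.0.0.1:5555', lambda m: print('received', m))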
Other queueing technologies have these patterns implemented more readily, e.g. ActiveMQ, IBM MQ, RabbitMQ if you wanted to explore.
It looks like you are using a message queue, so there must be a documented async way of doing this, but you may try something like the following:
exit_path = None
while exit_path is None:
    try:
        exit_path = zmq._GET_DATA_()
    except AttributeError:
        exit_path = None
        sleep(1)
This should check once every second to see if the data is available.

Why ZeroMQ fails to communicate when I use multiprocessing.Process to run?

Please see the code below:
server.py
import zmq
import time
from multiprocessing import Process
class A:
    def __init__(self):
        ctx = zmq.Context(1)
        sock = zmq.Socket(ctx, zmq.PUB)
        sock.bind('ipc://test')
        p = Process(target=A.run, args=(sock,))
        p.start()   # Process calls run, but the client can't receive messages
        p.join()
        # A.run(sock)  # this one is ok, messages get received

    @staticmethod
    def run(sock):
        while True:
            sock.send('demo'.encode('utf-8'))
            print('sent')
            time.sleep(1)

if __name__ == '__main__':
    a = A()
client.py
import zmq
ctx=zmq.Context(1)
sock = zmq.Socket(ctx, zmq.SUB)
sock.connect('ipc://test')
sock.setsockopt_string(zmq.SUBSCRIBE, '')
while True:
    print(sock.recv())
In the constructor of server.py, if I call the .run()-method directly, the client can receive the messages, but when I use multiprocessing.Process(), it fails. Can anyone explain this and provide some advice?
Q : "Why ZeroMQ fails to communicate when I use multiprocessing.Process to run?"
Well, ZeroMQ does not fail to communicate; the problem is how the Python multiprocessing module "operates".
The module is designed so that some processing may escape the central Python GIL-lock (the re-[SERIAL]-iser that is used as an ever-present principal avoider of [CONCURRENT] situations).
This means that the call to multiprocessing.Process makes one exact "mirror-copy" of the python interpreter state, "exported" into a new O/S-spawned process (the details depend on the localhost O/S).
Given that, there is zero chance that a "mirror"-ed replica could get access to resources already owned by __main__ - here the .bind()-method has already acquired the ipc://test address, so the "remote" process will never get "permission" to touch this ZeroMQ AccessPoint, unless the code gets repaired and fully re-factored.
Q : "Can anyone explain on this and provide some advice?"
Sure. The best first step is to fully understand the Pythonic culture of the monopolistic GIL-lock re-[SERIAL]-isation, where no two things ever happen at the same time, so even adding more threads does not speed up the flow of the processing, as it all gets re-aligned by the central "monopolist", the GIL-lock.
Next, the promise of a fully reflected copy of the python interpreter state, while it sounds attractive, also has some obvious drawbacks - the new processes, being "mirror"-copies, cannot introduce colliding claims on already owned resources. If they try to, not-working-as-expected behaviour is the milder of the problems in such principally ill-designed cases.
In your code, the first row in __main__ instantiates a = A(), whose .__init__() method immediately occupies the IPC resource via .bind('ipc://test'). The later step, p = Process( target = A.run, args = ( sock, ) ), "mirror"-replicates the state of the python interpreter (an as-is copy), and p.start() cannot but crash into the inability to "own" the very resource that __main__ already owns (yes, the ipc://test already acquired by the .bind('ipc://test') call). This will never fly.
Last but not least, enjoy the Zen-of-Zero, the masterpiece of Martin SUSTRIK for distributed computing, so well crafted for an ultimately scalable, almost zero-latency, very comfortable and widely ported signalling & messaging framework.
Short answer: start your subprocesses first, and create your zmq.Context- and .Socket-instances from within the Producer.run()-method inside each subprocess. Use the .bind()-method on the side whose cardinality is 1, and the .connect()-method on the side whose cardinality is greater than 1 (in this case, the producing "server" side).
My approach would be structured something like...
# server.py :
import zmq
from multiprocessing import Process

class Producer(Process):
    def __init__(self):
        super().__init__()
        ...

    def run(self):
        ctx = zmq.Context(1)
        sock = zmq.Socket(ctx, zmq.PUB)
        # Multiple producers, so connect instead of bind (the consumer binds)
        sock.connect('ipc://test')
        while True:
            ...

if __name__ == "__main__":
    producer = Producer()
    producer.start()   # Producer subclasses Process, so .start() runs .run() in a child
    producer.join()
# client.py :
import zmq

ctx = zmq.Context(1)
sock = zmq.Socket(ctx, zmq.SUB)
# Capture from multiple producers, so bind (producers must connect)
sock.bind('ipc://test')
sock.setsockopt_string(zmq.SUBSCRIBE, '')
while True:
    print(sock.recv())

Task delegation in Python/Redis

I have an issue thinking of an architecture that'll solve the following problem:
I have a web application (producer) that receives some data on request. I also have a number of processes (consumers) that should process this data. One request generates one batch of data, which should be processed by only one consumer.
My current solution consists of receiving the data, cache-ing it in memory with Redis, sending a message through a message channel that data has been written while the consumers are listening on the same channel, and then the data is processed by the consumers. The issue here is that I need to stop multiple consumers from working on the same data. So how can I inform the other consumers that I have started working on this task?
Producer code (flask endpoint):
data = request.get_json()
db = redis.Redis(connection_pool=pool)
db.set(data["externalId"], data)
# Subscribe to the batches channel and publish the id
db.pubsub()
db.publish('batches', request_key)
results = None
result_key = str(data["externalId"])
# Wait till the batch is processed
while results is None:
    results = db.get(result_key)
    if results is not None:
        results = results.decode('utf8')
        db.delete(data["externalId"])
        db.delete(result_key)
Consumer:
db = redis.Redis(connection_pool = pool)
channel = db.pubsub()
channel.subscribe('batches')
while True:
    try:
        message = channel.get_message()
        message_data = bytes(message['data']).decode('utf8')
        external_id = message_data.split('-')[-1]
        data = json.loads(db.get(external_id).decode('utf8'))
        result = DataProcessor.process(data)
        db.set(str(external_id), result)
    except Exception:
        pass
PUBSUB is often problematic for task queuing for exactly this reason. From the docs (https://redis.io/topics/pubsub):
SUBSCRIBE, UNSUBSCRIBE and PUBLISH implement the Publish/Subscribe messaging paradigm where (citing Wikipedia) senders (publishers) are not programmed to send their messages to specific receivers (subscribers). Rather, published messages are characterized into channels, without knowledge of what (if any) subscribers there may be.
A popular alternative to consider would be to implement "publish" by pushing an element to the end of a Redis list, and "subscribe" by having your worker poll that list at some interval (exponential backoff is often an appropriate choice). In order to avoid cases where multiple workers get the same job, use lpop to get and remove an element from the list. Redis is single-threaded, so you're guaranteed only one worker will receive each element.
So, on the publish side, aim for something like this:
db = redis.Redis(connection_pool=pool)
db.rpush("my_queue", task_payload)
And on the subscribe side, you can safely run a loop like this in parallel as many times as you need:
while True:
    db = redis.Redis(connection_pool=pool)
    payload = db.lpop("my_queue")
    if not payload:
        continue
    < deserialize and process payload here >
Note this is a first-in-first-out (FIFO) queue, since we're pushing onto the right side with rpush and popping off the left with lpop. You can implement a LIFO (stack) version trivially by combining lpush/lpop.

Python: How to trigger multiple process at same instant

I am trying to run a process that does an HTTP POST, which in turn sends an alert (the time taken to send an alert is on the order of nanoseconds) to a server. I am trying to test the server's capacity for handling alerts arriving within milliseconds. According to the given standard, the server is said to handle 6000 alerts/second.
I created a piece of code using the multiprocessing module which sends 6000 alerts, but I am using a for loop, and hence the time taken to execute the for loop exceeds one second. Hence all 6000 processes are not triggered at the SAME INSTANT.
Is there a way to trigger multiple (N number of) processes at the same instant?
This is my code: flowtesting.py, which is a library. It is followed by my test script after '####'.
import json
import httplib2
class flowTesting():
    def __init__(self, companyId, deviceIp):
        self.companyId = companyId
        self.deviceIp = deviceIp

    def generate_savedSearchName(self, randNum):
        self.randMsgId = randNum
        self.savedSearchName = "TEST %s risk31 more than 3" % self.randMsgId

    def def_request_body_dict(self):
        self.reqBody_dict = \
            {"Header": {"agid": "Agent1",
                        "mid": self.randMsgId,
                        "ts": 1253125001
                        },
             "mp":
                 {
                     "host": self.deviceIp,
                     "index": self.companyId,
                     "savedSearchName": self.savedSearchName,
                 }
             }
        self.req_body = json.dumps(self.reqBody_dict)

    def get_default_hdrs(self):
        self.hdrs = {'Content-type': 'application/json',
                     'Accept-Language': 'en-US,en;q=0.8'}

    def send_request(self, sIp, method="POST"):
        self.sIp = sIp
        self.url = "http://%s:8080/agent/splunk/messages" % self.sIp
        http_cli = httplib2.Http(timeout=180, disable_ssl_certificate_validation=True)
        rsp, rsp_body = http_cli.request(uri=self.url, method=method, headers=self.hdrs, body=self.req_body)
        print "rsp: %s and rsp_body: %s" % (rsp, rsp_body)
# My testScript
from flowTesting import flowTesting
import random
import multiprocessing

deviceIp = "10.31.421.35"
companyId = "CPY0000909"
noMsgToBeSent = 1000
sIp = "10.31.44.235"
uniq_msg_id_list = random.sample(xrange(1, 10000), noMsgToBeSent)

def runner(companyId, deviceIp, uniq_msg_id):
    proc = flowTesting(companyId, deviceIp)
    proc.generate_savedSearchName(uniq_msg_id)
    proc.def_request_body_dict()
    proc.get_default_hdrs()
    proc.send_request(sIp)

process_list = []
for uniq_msg_id in uniq_msg_id_list:
    savedSearchName = "TEST-1000 %s risk31 more than 3" % uniq_msg_id
    process = multiprocessing.Process(target=runner, args=(companyId, deviceIp, uniq_msg_id,))
    process.start()
    process.join()
    process_list.append(process)

print "Process list: %s" % process_list
print "Unique Message Id: %s" % uniq_msg_id_list
Making them all happen in the same instant is obviously impossible—unless you have a 6000-core machine and an OS kernel whose scheduler is able to handle them all perfectly (which you don't), you can't get 6000 pieces of code running at once.
And, even if you did, what they're all trying to do is to send a message on a socket. Even if your kernel was that insanely parallel, unless you have 6000 separate NICs, they're going to end up serialized in the NIC buffer. That's the way IP works: one packet after another. And of course there are all the routers on the path, the server's NIC, the server's OS, etc. And even if IP doesn't get in the way, bytes take time to transfer over a cable. So the only way to do this at the same instant, even in theory, would be to have 6000 NICs on each side and wire them up directly to each other with identical fiber.
However, you don't really need them in the same instant, just closer to each other than they are now. Presumably you're just starting 6000 Processes that all immediately try to send a message, which means you're including the process startup time (which can be pretty slow, especially on Windows) in the skew time.
You can reduce that by using threads instead of processes. That may seem counterintuitive, but Python is pretty good at handling I/O-bound threads, and every modern OS is very good at starting new threads.
But really, what you need is a Barrier on your threads or processes, to let all of them complete all the setup work (including process startup) before any of them try to do any work.
It still probably won't be tight enough, but it will be a lot tighter than you probably have right now.
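For illustration only (not the asker's code), a hedged sketch of the Barrier idea with Python 3's multiprocessing.Barrier; send_alert() here is just a stand-in for the real HTTP POST:
# Hedged sketch: a Barrier makes every worker finish its setup before any of them fires.
import multiprocessing

N = 16  # number of worker processes (illustrative)

def send_alert(worker_id):
    # stand-in for the real HTTP POST
    print('worker %d firing now' % worker_id)

def worker(barrier, worker_id):
    # ... do all per-process setup here (payloads, connections, ...) ...
    barrier.wait()            # block until all N workers have arrived
    send_alert(worker_id)     # then all proceed as close together as the OS allows

if __name__ == '__main__':
    barrier = multiprocessing.Barrier(N)
    procs = [multiprocessing.Process(target=worker, args=(barrier, i)) for i in range(N)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()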
The next limit you're going to face is context-switching time. Modern OSs are pretty good at scheduling, but not 6000-simultaneous-tasks good. So really, you want to reduce this to N processes, each one just spamming 6000/N connections sequentially as fast as possible. That will get them into the kernel/NIC much faster than trying to do 6000 at once and making the OS do the serialization for you. (In fact, on some platforms, depending on your hardware, you might actually be better off with one process doing 6000 in a row than N doing 6000/N. Test it both ways.)
There's still some overhead for the socket library itself. To get around that, you want to pre-craft all of the IP packets, then create a single raw socket and spam those packets. Send the first packet from each connection, then the second packet from each connection, etc.
You need to use an inter-process synchronization primitive. On Linux you would use a Sys-V semaphore, on Windows you would use a Win32 event.
Your 6000 processes would wait on this semaphore/event, and from a different process you would trigger it, thus releasing all your 6000 processes from their waiting state to a ready state, and then the OS would start executing them as quickly as possible.
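A portable Python analogue of this idea (an assumption for illustration, not part of the answer) is multiprocessing.Event: every worker waits on the event, and the parent sets it once to release them all together:
# Hedged sketch of the semaphore/event release: all workers block on the event,
# one .set() moves them all to the ready state at once.
import multiprocessing
import time

def worker(start_event, worker_id):
    start_event.wait()                 # block until the parent fires the event
    print('worker %d released' % worker_id)

if __name__ == '__main__':
    start_event = multiprocessing.Event()
    procs = [multiprocessing.Process(target=worker, args=(start_event, i)) for i in range(8)]
    for p in procs:
        p.start()
    time.sleep(1)                      # give workers time to reach wait() (illustrative only)
    start_event.set()                  # release every waiting worker at once
    for p in procs:
        p.join()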

ZMQ PUB Send file

I'm trying (PY)ZMQ for the first time, and wonder if it's possible to send a complete FILE (binary) using PUB/SUB? I need to send database updates to many subscribers. I see examples of short messages but not files. Is it possible?
publisher:
import zmq
import time
import os
import sys
while True:
    print 'loop'
    msg = 'C:\TEMP\personnel.db'
    # Prepare context & publisher
    context = zmq.Context()
    publisher = context.socket(zmq.PUB)
    publisher.bind("tcp://*:2002")
    time.sleep(1)
    curFile = 'C:/TEMP/personnel.db'
    size = os.stat(curFile).st_size
    print 'File size:', size
    target = open(curFile, 'rb')
    file = target.read(size)
    if file:
        publisher.send(file)
    publisher.close()
    context.term()
    target.close()
    time.sleep(10)
subscriber:
'''always listening'''
import zmq
import os
import time
import sys
while True:
    path = 'C:/TEST'
    filename = 'personnel.db'
    destfile = path + '/' + filename
    if os.path.isfile(destfile):
        os.remove(destfile)
        time.sleep(2)
    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    subscriber.connect("tcp://127.0.0.1:2002")
    subscriber.setsockopt(zmq.SUBSCRIBE, '')
    msg = subscriber.recv(313344)
    if msg:
        f = open(destfile, 'wb')
        print 'open'
        f.write(msg)
        print 'close\n'
        f.close()
        time.sleep(5)
You shall be able to distribute files to many subscribers using zmq and the PUB/SUB pattern.
Your code is almost there - in other words, it might work in most situations, but it could be improved a bit.
Things to be aware of
Messages live in memory
The message must fit into memory when it gets published (while it lives in the PUB socket), and it stays there until the last currently subscribed consumer has read it out or disconnected.
The message must also fit into memory when being received. But with reasonably large files (like your 313 kB) it shall work unless you are really short of RAM.
Slow consumer issue
In case you have multiple consumers, and one of them is reading much slower than the others, it will start slowing all of them down. Zmq explains this problem and also proposes some methods for avoiding it (e.g. suicide of the slow subscriber).
However, in most situations, you will not encounter this problem.
Start your consumer first so as not to miss a message
zmq messaging is extremely fast. There is no problem if you start your consumer sooner than the publisher; zmq makes this scenario easy and the consumer will connect automatically.
However, your publisher shall allow consumers to connect before it starts publishing; your code sleeps for 1 second before sending the message, which shall be sufficient.
Comments to your code
do you really have to sleep after os.remove? Probably not
subscriber.recv - there is no need to know the message size in advance; a zmq message knows its own size, so if you call recv without specifying the number of bytes to receive, you will get the whole message properly.
Send large files in chunks
zmq provides a feature called multipart messages but, according to the docs, all parts of the message have to fit completely in memory before being sent out, so this is not the trick to use.
On the other hand, you can create an "application-level multipart protocol" in which you decide to send messages with a structure like (hasNextPart, chunkData). This way you would be sending messages of well-controlled size, and only the last one would say hasNextPart == False.
The consumer would then read all the parts and write them to disk until the last message, which states that no further part will arrive.
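A hedged sketch of such an application-level chunking protocol, assuming each message carries a small JSON header frame with a hasNextPart flag followed by the chunk bytes (the frame layout and chunk size are illustrative, not taken from the question):
# Hedged sketch: "application-level multipart" file transfer over PUB/SUB.
# Each message is two frames: a JSON header with hasNextPart, then the chunk bytes.
import json
import zmq

CHUNK_SIZE = 64 * 1024   # illustrative chunk size

def publish_file(publisher, path):
    with open(path, 'rb') as f:
        chunk = f.read(CHUNK_SIZE)
        while chunk:
            next_chunk = f.read(CHUNK_SIZE)
            header = json.dumps({'hasNextPart': bool(next_chunk)}).encode()
            publisher.send_multipart([header, chunk])
            chunk = next_chunk

def receive_file(subscriber, destfile):
    with open(destfile, 'wb') as f:
        while True:
            header, chunk = subscriber.recv_multipart()
            f.write(chunk)
            if not json.loads(header.decode())['hasNextPart']:
                break
The usual PUB/SUB caveats above still apply: the subscriber must already be connected and subscribed when publishing starts, or the first chunks will simply be dropped.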
