pyzmq proxy in a strange state after subscribing multiple processes

I'm having a weird issue with the proxy in pyzmq. Here's the code of that proxy:
import zmq
context = zmq.Context.instance()
frontend_socket = context.socket(zmq.XSUB)
frontend_socket.bind("tcp://0.0.0.0:%s" % sub_port)
backend_socket = context.socket(zmq.XPUB)
backend_socket.bind("tcp://0.0.0.0:%s" % pub_port)
zmq.proxy(frontend_socket, backend_socket)
I'm using that proxy to send messages between ~50 processes that run on 6 different machines. The total number of topics is around 1,000, but since multiple processes can listen on the same topics, the total number of subscriptions is around 10,000.
In normal times this works very well: messages go through the proxy correctly as long as one process publishes them and at least one other process is subscribed to the topic. It works whether the publisher or the subscriber was started first.
But at some point, when we start a new process (let's call it X), the proxy starts behaving strangely. Everything that was already connected keeps working, but new processes that we connect can only get messages through if the publisher is connected before the subscriber. X can be any one of the processes that normally work, and it can be on any machine, and the result is the same. Once we are in this state, killing X makes everything work again, and starting it again makes it fail. If we stop other processes and then start X, it works well (so it's not related to X's code in particular).
I'm not sure if we could be reaching some limit of ZMQ? I've read examples of people that seem to have way more processes, subscriptions, etc. than us. It could be some option that we should set on the proxy, so far here are the ones we've tried without success:
Changing RCVHWM on frontend_socket
Changing SNDHWM on backend_socket
Setting XPUB_VERBOSE on backend_socket
Setting XPUB_VERBOSER on backend_socket
Here is sample code of how we publish messages to the proxy:
import json
import time
import zmq

topic = "test"
message = {"test": "test"}
context = zmq.Context.instance()
socket = context.socket(zmq.PUB)
socket.connect("tcp://1.2.3.4:1234")
while True:
    time.sleep(1)
    socket.send_multipart([topic.encode(), json.dumps(message).encode()])
Here is sample code of how we subscribe to messages from the proxy:
import zmq

topic = "test"
context = zmq.Context.instance()
socket = context.socket(zmq.SUB)
socket.connect("tcp://1.2.3.4:5678")
socket.subscribe(topic)
while True:
    multi_part = socket.recv_multipart()
    [topic, message] = multi_part
    print(topic.decode(), message.decode())
Has anyone ever seen a similar issue? Is there something we can do to avoid the proxy getting in this state?
Thanks!

Make all the publishers (the proxy and the publishing processes) XPUB (plus the verbose/verboser socket option), then read from the publisher sockets in a poll loop. The first byte of each subscription message tells you whether it is a subscribe or an unsubscribe, and the rest is the subject/topic. If you log all of this information with timestamps, it should tell you which component is at fault (it could be any of the three) and help with a fix.
The format of the subscription messages that arrive on the publisher (XPUB) will be
Subscription [0x01][topic]
Unsubscription [0x00][topic]
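As a rough illustration of the format above, here is a small sketch of a decoder for the frames read from an XPUB socket. The function name decode_subscription is made up for this example; it only assumes the one-byte prefix described above.

```python
def decode_subscription(frame):
    """Split a frame read from an XPUB socket into (action, topic).

    The first byte is 0x01 for a subscription and 0x00 for an
    unsubscription; the remaining bytes are the topic prefix.
    """
    if not frame or frame[0] not in (0, 1):
        raise ValueError("not a subscription frame: %r" % (frame,))
    action = "subscribe" if frame[0] == 1 else "unsubscribe"
    return action, frame[1:]


if __name__ == "__main__":
    # Example frames, as they would arrive from socket.recv() on an XPUB socket.
    for raw in (b"\x01test", b"\x00test"):
        print(decode_subscription(raw))
```

Logging the decoded pair together with a timestamp for every frame is probably enough to see which component drops or never forwards a subscription.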
Code needed
I usually work on C++ but this is the general idea in python
proxy
You need to create a capture socket (this acts like a network tap). You connect a ZMQ_PAIR socket to the proxy (capture) over inproc and then read the contents at the other end of the socket. As you are using XPUB/XSUB you will see the subscription messages.
zmq.proxy(frontend, backend, capture)
read the docs/examples for the python proxy.
publisher
In this case you need to read from the publishing socket in the same thread as you are sending on it. That's the reason I said a poll loop might be best.
This code is not tested at all.
import json
import zmq

topic = "test"
message = {"test": "test"}
context = zmq.Context.instance()
socket = context.socket(zmq.XPUB)
socket.connect("tcp://1.2.3.4:1234")
poller = zmq.Poller()
poller.register(socket, zmq.POLLIN)
timeout = 1000  # ms
while True:
    socks = dict(poller.poll(timeout))
    if not socks:  # poll timed out with nothing to read, so publish
        socket.send_multipart([topic.encode(), json.dumps(message).encode()])
    if socket in socks:
        sub_msg = socket.recv()
        # print out the message here.

Related

ZeroMQ REQ .recv() hangs with messages larger than ~1kB if run inside Docker

I'm working on a relatively simple Python / ZeroMQ based work distribution system, using REQ/ROUTER sockets. The system is distributed and worker nodes are geographically distributed on different continents.
The ROUTER, responsible for distributing work, .bind()-s a ROUTER socket. Workers .connect() to it over TCP using a REQ socket.
In the process of setting up a new worker node, I've noticed that while smaller messages (up to 1kB) do the trip with no issues, replies of ~2kB and up, sent by the ROUTER-end are never received by the worker into their REQ-socket - when I call recv(), the socket just hangs.
The worker code runs inside Docker containers, and I was able to work around the issue when running the same image with --net=host - it seems to not happen if Docker is using the host network.
I'm wondering if this is something in the network stack configuration on the host machine or in Docker, or maybe something that can be prevented in my code?
Here is a simplified version of my code that reproduces this issue:
Worker
import sys
import zmq
import logging
import time

READY = 'R'

def worker(connect_to):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.REQ)
    socket.connect(connect_to)
    log = logging.getLogger(__name__)
    while True:
        socket.send_string(READY)
        log.debug("Send READY message, waiting for reply")
        message = socket.recv()
        log.debug("Got reply of %d bytes", len(message))
        time.sleep(5)

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    worker(sys.argv[1])
Router
import sys
import zmq
import logging

REPLY_SIZE = 1024 * 8

def router(bind_to):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.ROUTER)
    socket.bind(bind_to)
    poller = zmq.Poller()
    poller.register(socket, zmq.POLLIN)
    log = logging.getLogger(__name__)
    while True:
        socks = dict(poller.poll(5000))
        if socks.get(socket) == zmq.POLLIN:
            message = socket.recv_multipart()
            log.debug("Received message of %d parts", len(message))
            # frames from a REQ peer: [identity, empty delimiter, payload...]
            identity, _ = message[:2]
            res = handle_message(message[2:])
            log.debug("Sending %d bytes back in response on socket", len(res))
            socket.send_multipart([identity, b'', res])

def handle_message(parts):
    log = logging.getLogger(__name__)
    log.debug("Got message: %s", parts)
    return b'A' * REPLY_SIZE

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    router(sys.argv[1])
FWIW I was able to reproduce this on Ubuntu 16.04 (both router and worker) with Docker 17.09.0-ce, libzmq 4.1.5 and PyZMQ 15.4.0.

No, sir, the socket does not hang at all.
Why?
You have instructed the socket to block indefinitely: .recv() was called without the zmq.NOBLOCK flag (the ZMQ_DONTWAIT flag in the original ZeroMQ API).
That is what, under the circumstances reported, moves the code into an infinite wait. Other issues seem to prevent the Docker container from delivering even a first message to the ZeroMQ Context() I/O engine inside the worker's container, and so to the REQ access point. Because the REQ archetype is a strict two-step finite-state automaton ( .send()->.recv()->.send()-> ... ad infinitum ), the worker cannot proceed until .recv() returns.
Reversing cause and effect here is wrong and misleading: "the socket just hangs" cannot be told apart from "Docker does not deliver a single message" (which is what would allow .recv() to return).
Next steps:
Use .poll() on the REQ side to check, without blocking, whether any message has already arrived in the worker.
If none has, focus on Docker first. After that, the ZeroMQ Context() I/O-engine performance and link-level configuration options may help.
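The .poll() suggestion above could be sketched roughly as follows. The helper name recv_or_none is hypothetical, and the code assumes sock behaves like a pyzmq socket, whose poll(timeout) waits for incoming messages by default and returns a non-zero event mask when one is ready.

```python
def recv_or_none(sock, timeout_ms=1000):
    """Return one message from sock, or None if nothing arrives in time.

    sock is expected to offer the pyzmq Socket interface:
    poll(timeout_ms) -> event mask (0 on timeout), recv() -> bytes.
    """
    if sock.poll(timeout_ms):
        return sock.recv()
    return None


# Possible use inside the worker loop from the question (not run here):
#
#     socket.send_string(READY)
#     reply = recv_or_none(socket, 5000)
#     if reply is None:
#         log.warning("no reply within 5 s")
```

Keep in mind that a REQ socket that has timed out this way is still waiting for its reply, so it cannot simply send again; closing it and opening a fresh one is one common way around that.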

How can I write a socket server in a different thread from my main program (using gevent)?

I'm developing a Flask/gevent WSGIserver webserver that needs to communicate (in the background) with a hardware device over two sockets using XML.
One socket is initiated by the client (my application) and I can send XML commands to the device. The device answers on a different port and sends back information that my application has to confirm. So my application has to listen to this second port.
Up until now I have issued a command, opened the second port as a server, waited for a response from the device and closed the second port.
The problem is that it's possible that the device sends multiple responses that I have to confirm. So my solution was to keep the port open and keep responding to incoming requests. However, in the end the device is done sending requests, and my application is still listening (I don't know when the device is done), thereby blocking everything else.
This seemed like a perfect use case for a thread, so that my application launches a listening server in a separate thread. Because I'm already using gevent as a WSGI server for Flask, I can use the greenlets.
The problem is, I have looked for a good example of such a thing, but all I can find is examples of multi-threading handlers for a single socket server. I don't need to handle a lot of connections on the socket server, but I need it launched in a separate thread so it can listen for and handle incoming messages while my main program can keep sending messages.
The second problem I'm running into is that in the server, I need to use some methods from my "main" class. Being relatively new to Python I'm unsure how to structure it in a way to make that possible.
class Device(object):
    def __init__(self, ...):
        self.clientsocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def _connect_to_device(self):
        print("OPEN CONNECTION TO DEVICE")
        try:
            self.clientsocket.connect((self.ip, 5100))
        except socket.error as e:
            pass

    def _disconnect_from_device(self):
        print("CLOSE CONNECTION TO DEVICE")
        self.clientsocket.close()

    def deviceaction1(self, ...):
        # the data that is sent is an XML document that depends on the parameters of this method.
        self._connect_to_device()
        self._send_data(XMLdoc)
        self._wait_for_response()
        return True

    def _send_data(self, data):
        print("SEND:")
        print(data)
        self.clientsocket.send(data)

    def _wait_for_response(self):
        print("WAITING FOR REQUESTS FROM DEVICE (CHANNEL 1)")
        self.serversocket.bind(('10.0.0.16', 5102))
        self.serversocket.listen(5)  # listen for answer, maximum 5 connections
        connection, address = self.serversocket.accept()
        # the data is of a specific length I can calculate
        if len(data) > 0:
            self._process_response(data)
        self.serversocket.close()

    def _process_response(self, data):
        print("RECEIVED:")
        print(data)
        # here is some code that processes the incoming data and
        # responds to the device
        # this may or may not result in more incoming data

if __name__ == '__main__':
    machine = Device(ip="10.0.0.240")
    machine.deviceaction1(...)
This is (globally, I left out sensitive information) what I'm doing now. As you can see everything is sequential.
If anyone can provide an example of a listening server in a separate thread (preferably using greenlets) and a way to communicate from the listening server back to the spawning thread, it would be of great help.
Thanks.
EDIT:
After trying several methods, I decided to use Python's standard select() to solve this problem. This worked, so my question regarding the use of threads is no longer relevant. Thanks to everyone who provided input for their time and effort.
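The select()-based code itself is not shown in the edit. A minimal sketch of the idea, assuming an already accepted connection and a made-up helper name recv_if_ready, might look like this:

```python
import select
import socket


def recv_if_ready(conn, timeout_s, bufsize=4096):
    """Return bytes read from conn, or None if nothing arrives in time.

    select() waits until conn is readable or timeout_s seconds have passed,
    so the caller is never blocked indefinitely.
    """
    readable, _, _ = select.select([conn], [], [], timeout_s)
    if not readable:
        return None
    return conn.recv(bufsize)


if __name__ == "__main__":
    # Stand-in for the device connection: a connected pair of local sockets.
    device_side, app_side = socket.socketpair()
    print(recv_if_ready(app_side, 0.1))  # nothing has been sent yet
    device_side.sendall(b"<response/>")
    print(recv_if_ready(app_side, 0.1))
    device_side.close()
    app_side.close()
```

A listener built on this can keep calling recv_if_ready in a loop and stop once it returns None (the device has gone quiet) or an empty bytes object (the device closed the connection).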

I hope this helps. In the Example class, calling tenMessageSender starts a thread without blocking the main loop, and _zmqBasedListener then listens on a separate port for as long as that thread is alive. Whatever messages tenMessageSender sends are received by the client, which responds back to _zmqBasedListener.
Server Side
import threading
import zmq
import sys

class Example:
    def __init__(self):
        self.context = zmq.Context()
        self.publisher = self.context.socket(zmq.PUB)
        self.publisher.bind('tcp://127.0.0.1:9997')
        self.subscriber = self.context.socket(zmq.SUB)
        self.thread = threading.Thread(target=self._zmqBasedListener)

    def _zmqBasedListener(self):
        self.subscriber.connect('tcp://127.0.0.1:9998')
        self.subscriber.setsockopt(zmq.SUBSCRIBE, b"some_key")
        while True:
            message = self.subscriber.recv()
            print(message)
        sys.exit()

    def tenMessageSender(self):
        self._decideListener()
        for message in range(10):
            self.publisher.send_string("testid : %d: I am a task" % message)

    def _decideListener(self):
        if not self.thread.is_alive():
            print("STARTING THREAD")
            self.thread.start()
Client
import zmq

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.connect('tcp://127.0.0.1:9997')
publisher = context.socket(zmq.PUB)
publisher.bind('tcp://127.0.0.1:9998')
subscriber.setsockopt(zmq.SUBSCRIBE, b"testid")
count = 0
print("Listener")
while True:
    message = subscriber.recv()
    print(message)
    publisher.send_string('some_key : Message received %d' % count)
    count += 1
Instead of thread you can use greenlet etc.

Responding to client disconnects using bottle and gevent.wsgi?

I have a small asynchronous server implemented using bottle and gevent.wsgi. There is a routine used to implement long poll that looks pretty much like the "Event Callbacks" example in the bottle documentation:
def worker(body):
    msg = msgbus.recv()
    body.put(msg)
    body.put(StopIteration)

@route('/poll')
def poll():
    body = gevent.queue.Queue()
    # the greenlet is not bound to the name "worker", which would shadow the function
    gevent.spawn(worker, body)
    return body
Here, msgbus is a ZMQ sub socket.
This all works fine, but if a client breaks the connection while
worker is blocked on msgbus.recv(), that greenlet task will hang
around "forever" (well, until a message is received), and will only
find out about the disconnected client when it attempts to send a
response.
I can use msgbus.poll(timeout=something) if I don't want to block
forever waiting for ipc messages, but I still can't detect a client
disconnect.
What I want to do is get something like a reference to the client
socket so that I can use it in some kind of select or poll loop,
or get some sort of asynchronous notification inside my greenlet, but
I'm not sure how to accomplish either of these things with these
frameworks (bottle and gevent).
Is there a way to get notified of client disconnects?

Aha! The wsgi.input variable, at least under gevent.wsgi, has an rfile member that is a file-like object. This doesn't appear to be required by the WSGI spec, so it might not work with other servers.
With this I was able to modify my code to look something like:
def worker(body, rfile):
    poll = zmq.Poller()
    poll.register(msgbus)
    poll.register(rfile, zmq.POLLIN)
    while True:
        events = dict(poll.poll())
        if rfile.fileno() in events:
            # client disconnect!
            break
        if msgbus in events:
            msg = msgbus.recv()
            body.put(msg)
            break
    body.put(StopIteration)

@route('/poll')
def poll():
    rfile = bottle.request.environ['wsgi.input'].rfile
    body = gevent.queue.Queue()
    # the greenlet is not bound to the name "worker", which would shadow the function
    gevent.spawn(worker, body, rfile)
    return body
And this works great...
...except on OpenShift, where you will have to use the
alternate frontend on port 8000 with websockets support.

pyzmq create a process with its own socket

I have some code that's monitoring some other changing files. What I would like to do is start that code, which uses zeromq, with a different socket each time. The way I'm doing it now seems to cause assertions to fail somewhere in libzmq, since I may be reusing the same socket. How do I ensure that when I create a new process from the Monitor class, the context will not be reused? That's what I think is going on; if you can tell there is some other stupidity on my part, please advise.
Here is some code:
import json
from multiprocessing import Process

import zmq
from zmq.eventloop import ioloop
from zmq.eventloop.zmqstream import ZMQStream

class Monitor(object):
    def __init__(self):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.DEALER)
        self.socket.connect("tcp://127.0.0.1:5055")
        self.stream = ZMQStream(self.socket)
        self.stream.on_recv(self.somefunc)

    def initialize(self, id):
        self._id = id

    def somefunc(self, something):
        """work here and send back results if any """
        jdecoded = json.loads(something)
        if self._id == jdecoded['_id']:
            # good, I'm the right monitor for you
            work = jdecoded['message']
            results = algorithm(work)
            self.socket.send(json.dumps(results))
        else:
            # let some other process deal with it, not mine
            pass

class Prefect(object):
    def __init__(self, id):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.DEALER)
        self.socket.bind("tcp://127.0.0.1:5055")
        self.stream = ZMQStream(self.socket)
        self.stream.on_recv(self.check_if)
        self._id = id
        self.monitors = []

    def check_if(self, message):
        """find out from message's id whether we have
        started a process for it previously"""
        jdecoded = json.loads(message)
        this_id = jdecoded['_id']
        if this_id in self.monitors:
            pass
        else:
            # start a new process for it; it should have its own socket
            new = Monitor()
            newp = Process(target=new.initialize, args=(this_id,))
            newp.start()
            self.monitors.append(this_id)  # ensure it's remembered
What is going on is that I want all the monitor processes and a single prefect process listening on the same port, so when the prefect sees a request that it hasn't seen, it starts a process for it. All the processes that exist should probably listen too, but ignore messages not meant for them.
As it stands, if I do this I get a crash, possibly related to concurrent access of the same zmq socket by something (I tried threading.Thread, it still crashes). I read somewhere that concurrent access of a zmq socket by different threads is not possible. How would I ensure that new processes get their own zmq sockets?
EDIT:
The main deal in my app is that a request comes in via a zmq socket, and the process(es) listening react to the message as follows:
1. If it's directed at that process, judged by the _id field, do some reading on a file and reply, since one of the monitors matches the message's _id. If none match, then:
2. If the message's _id field is not recognized, all monitors ignore it but the Prefect creates a process to handle that _id and all future messages to that id.
3. I want all the messages to be seen by the monitor processes as well as the prefect process; that seems easiest.
4. All the messages are very small, average ~4096 bytes.
5. The monitor does some non-blocking read, and on each ioloop iteration it sends what it has found out.
more-edit => The prefect process binds now, and it will receive messages and echo them so they can be seen by monitors. This is what I have in mind as the architecture, but it's not final.
All the messages arrive from remote users over a browser that lets the server know what a client wants, and the server sends the message to the backend via zmq (I did not show this, but it is not hard), so in production they might not bind/connect to localhost.
I chose DEALER since it allows async / unlimited messages in either direction (see point 5), DEALER can bind with DEALER, and the initial request/reply can arrive from either side. The other combination that can do this is possibly DEALER/ROUTER.

You are correct that you cannot keep using the same socket in a subprocess (multiprocessing usually uses fork to create subprocesses). In general, what this means is that you don't want to create the socket that will be used in the subprocess until after the subprocess starts.
Since, in your case, the socket is an attribute on the Monitor object, you probably don't want to create the Monitor in the main process at all. That would look something like this:
def start_monitor(this_id):
    monitor = Monitor()
    monitor.initialize(this_id)
    # run the eventloop, or this will return immediately and destroy the monitor

... inside Prefect.check_if():

    proc = Process(target=start_monitor, args=(this_id,))
    proc.start()
    self.monitors.append(this_id)
rather than your example, where the only thing the subprocess does is assign an ID and then kill the process, ultimately having no effect.
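To make the idea a little more concrete, here is a rough, self-contained sketch of "create the socket only after the subprocess has started". The MonitorRegistry class, the endpoint default and the print placeholder are made up for illustration and stand in for the Monitor/Prefect classes from the question.

```python
from multiprocessing import Process


def start_monitor(this_id, endpoint="tcp://127.0.0.1:5055"):
    """Child-process entry point: the context and socket are created here."""
    import zmq  # imported lazily; what matters is that Context() is created in the child

    context = zmq.Context()
    socket = context.socket(zmq.DEALER)
    socket.connect(endpoint)
    try:
        while True:
            frames = socket.recv_multipart()
            # Placeholder: handle messages addressed to this_id here.
            print(this_id, frames)
    finally:
        socket.close(linger=0)
        context.term()


class MonitorRegistry:
    """Starts at most one monitor process per id (hypothetical helper)."""

    def __init__(self, spawn=None):
        # spawn can be replaced, e.g. in tests, by any callable taking an id
        self._spawn = spawn if spawn is not None else self._spawn_process
        self._monitors = {}

    @staticmethod
    def _spawn_process(this_id):
        proc = Process(target=start_monitor, args=(this_id,), daemon=True)
        proc.start()
        return proc

    def ensure(self, this_id):
        """Return True if a new monitor was started for this_id."""
        if this_id in self._monitors:
            return False
        self._monitors[this_id] = self._spawn(this_id)
        return True
```

Only the id crosses the process boundary here, so the parent never owns a socket that the child might also touch.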

zeromq: how to prevent infinite wait?

I just got started with ZMQ. I am designing an app whose workflow is:
one of many clients (who have random PULL addresses) PUSH a request to a server at 5555
the server is forever waiting for client PUSHes. When one comes, a worker process is spawned for that particular request. Yes, worker processes can exist concurrently.
When that process completes its task, it PUSHes the result to the client.
I assume that the PUSH/PULL architecture is suited for this. Please correct me on this.
But how do I handle these scenarios?
the client_receiver.recv() will wait for an infinite time when server fails to respond.
the client may send request, but it will fail immediately after, hence a worker process will remain stuck at server_sender.send() forever.
So how do I setup something like a timeout in the PUSH/PULL model?
EDIT: Thanks to user938949's suggestions, I got a working answer and I am sharing it for posterity.

If you are using zeromq >= 3.0, then you can set the RCVTIMEO socket option:
client_receiver.RCVTIMEO = 1000 # in milliseconds
But in general, you can use pollers:
poller = zmq.Poller()
poller.register(client_receiver, zmq.POLLIN) # POLLIN for recv, POLLOUT for send
And poller.poll() takes a timeout:
evts = poller.poll(1000) # wait *up to* one second for a message to arrive.
evts will be an empty list if there is nothing to receive.
You can poll with zmq.POLLOUT, to check if a send will succeed.
Or, to handle the case of a peer that might have failed, a:
worker.send(msg, zmq.NOBLOCK)
might suffice, which will always return immediately - raising a ZMQError(zmq.EAGAIN) if the send could not complete.

This was a quick hack I made after I referred user938949's answer and http://taotetek.wordpress.com/2011/02/02/python-multiprocessing-with-zeromq/ . If you do better, please post your answer, I will recommend your answer.
For those wanting lasting solutions on reliability, refer http://zguide.zeromq.org/page:all#toc64
Version 3.0 of zeromq (beta ATM) supports timeout in ZMQ_RCVTIMEO and ZMQ_SNDTIMEO. http://api.zeromq.org/3-0:zmq-setsockopt
Server
The zmq.NOBLOCK ensures that when a client does not exist, the send() does not block.
import time
import zmq

context = zmq.Context()
ventilator_send = context.socket(zmq.PUSH)
ventilator_send.bind("tcp://127.0.0.1:5557")
i = 0
while True:
    i = i + 1
    time.sleep(0.5)
    print(">>sending message ", i)
    try:
        ventilator_send.send_string(repr(i), zmq.NOBLOCK)
        print(" succeed")
    except zmq.Again:
        # no peer was ready to take the message
        print(" failed")
Client
The poller object can listen in on many receiving sockets (see the "Python Multiprocessing with ZeroMQ" post linked above). Here I registered only work_receiver. In the infinite loop, the client polls with an interval of 1000 ms. The socks object is empty if no message has been received in that time.
import time
import zmq

context = zmq.Context()
work_receiver = context.socket(zmq.PULL)
work_receiver.connect("tcp://127.0.0.1:5557")
poller = zmq.Poller()
poller.register(work_receiver, zmq.POLLIN)
# Loop and accept messages from both channels, acting accordingly
while True:
    socks = dict(poller.poll(1000))
    if socks:
        if socks.get(work_receiver) == zmq.POLLIN:
            print("got message ", work_receiver.recv(zmq.NOBLOCK))
    else:
        print("error: message timeout")

The send won't block if you use ZMQ_NOBLOCK, but if you then try closing the socket and context, that step would block the program from exiting.
The reason is that the socket waits for a peer so that the queued outgoing messages can still be delivered. To close the socket immediately and discard the outgoing messages in the buffer, use ZMQ_LINGER and set it to 0.
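In pyzmq this can be sketched as below. The helper name close_without_linger is made up; it assumes sock and context are a pyzmq socket and context, and relies on Socket.close() accepting a linger argument in milliseconds.

```python
def close_without_linger(sock, context=None):
    """Close a pyzmq socket and drop unsent messages so exit is not delayed.

    linger=0 tells ZeroMQ to discard anything still queued for sending
    instead of waiting for a peer to show up.
    """
    sock.close(linger=0)
    if context is not None:
        context.term()


# Possible use with the server above (not run here):
#
#     close_without_linger(ventilator_send, context)
```

Setting the option up front with ventilator_send.setsockopt(zmq.LINGER, 0) should have the same effect for a later plain close().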

If you're only waiting for one socket, rather than create a Poller, you can do this:
if work_receiver.poll(1000, zmq.POLLIN):
    print("got message ", work_receiver.recv(zmq.NOBLOCK))
else:
    print("error: message timeout")
You can use this if your timeout changes depending on the situation, instead of setting work_receiver.RCVTIMEO.
