ZeroMQ - N-to-N server communication for a consensus negotiation - python

I am implementing a consensus algorithm (Raft) in which 5 servers have to communicate asynchronously to manage a replicated log.
I am using Python with the multiprocessing package to run 5 processes acting as servers, and ØMQ (pyzmq) to handle communication between them.
I drafted a first design and I am wondering about a few things:
I have 5 servers and hardcode 5 "identities" that I configure the sockets with. For each server, I am using two ROUTER sockets:
the first one binds using the server's identity (this socket is responsible for receiving the other servers' messages)
the other one connects to the other servers (this one is responsible for sending messages)
It looks strange to me to use two ROUTER sockets on the same node/server, but I cannot find anything better: is there a more elegant way to proceed?
I tried to use only one ROUTER socket per server, i.e. a socket that sets its identity, binds on its own port, and connects to the others, but this 'connect' step on each side doesn't work (see the sketch below): what is the reason for that?
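For reference, a minimal sketch of that single-socket attempt (my reconstruction, reusing the port-based identities from the example below):

# Hypothetical sketch of the single-ROUTER-per-server attempt described above.
# Each server binds one ROUTER socket and connects that same socket to its peers.
import zmq

def single_socket_server(socket_port, peer_ports):
    context = zmq.Context()
    socket = context.socket(zmq.ROUTER)
    socket.setsockopt(zmq.IDENTITY, socket_port.encode())
    socket.bind(f'tcp://*:{socket_port}')
    for peer_port in peer_ports:
        # this is the step that does not behave as expected
        socket.connect(f'tcp://localhost:{peer_port}')
    return socket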
Here is a simple example with 3 servers:
import time
from threading import Thread
from typing import List

import zmq

SEND_RULE = {
    '5555': '6666',
    '6666': '7777',
    '7777': '5555'
}


def worker(socket_port: str, peer_ports: List[str]):
    context = zmq.Context()

    # Receiving socket: binds on this server's port and carries its identity.
    receiving_socket = context.socket(zmq.ROUTER)
    receiving_socket.setsockopt(zmq.IDENTITY, socket_port.encode())
    receiving_socket.bind(f'tcp://*:{socket_port}')

    # Sending socket: connects to every peer.
    sending_socket = context.socket(zmq.ROUTER)
    sending_socket.setsockopt(zmq.IDENTITY, socket_port.encode())
    for peer_port in peer_ports:
        sending_socket.connect(f'tcp://localhost:{peer_port}')

    # Give all connections time to be established.
    time.sleep(1)

    recipient_id = SEND_RULE[socket_port].encode()
    message_for_recipient = b'coucou!'
    print(socket_port, ' sending a message to ', recipient_id.decode())
    sending_socket.send_multipart([recipient_id, message_for_recipient])

    # Receive a message from a peer.
    sender_id, sender_message = receiving_socket.recv_multipart()
    print(socket_port,
          ' server received: ',
          sender_message.decode(),
          ' from: ',
          sender_id.decode())


if __name__ == '__main__':
    socket_ports = ['5555', '6666', '7777']
    for socket_port in socket_ports:
        Thread(target=worker,
               args=(socket_port,
                     # compare against the current port, not the whole list
                     [port for port in socket_ports
                      if port != socket_port])
               ).start()
    time.sleep(3)

Related

ZeroMQ Asynchronous Client-Server using Python multiprocessing

I am trying to adopt the ZeroMQ asynchronous client-server pattern described here, using Python multiprocessing. There is a brief description in the ZeroMQ guide.
It uses DEALER/ROUTER for the client-to-server frontend communication and DEALER/DEALER for the server backend to the server workers. The server frontend and backend are connected with zmq.proxy().
Instead of using threads, I want to use multiprocessing on the server. But requests from the client do not reach the server workers: they do reach the server frontend, and also the backend, but the backend is not able to pass them on to the server workers.
How do we generally debug these issues in pyzmq? How to turn on verbose logging for the sockets?
The Python code snippets I am using:
server.py
import zmq
import time
from multiprocessing import Process


def run(context, worker_id):
    socket = context.socket(zmq.DEALER)
    socket.connect("ipc://backend.ipc")
    print(f"Worker {worker_id} started")
    try:
        while True:
            ident, msg = socket.recv_multipart()
            print("Worker received %s from %s" % (msg, ident))
            time.sleep(5)
            socket.send_multipart([ident, msg])
            print("Worker sent %s to %s" % (msg, ident))
    except:
        socket.close()


if __name__ == "__main__":
    context = zmq.Context()
    frontend = context.socket(zmq.ROUTER)
    frontend.bind("tcp://*:5570")
    backend = context.socket(zmq.DEALER)
    backend.bind("ipc://backend.ipc")

    N_WORKERS = 7
    jobs = []
    try:
        for worker_id in range(N_WORKERS):
            job = Process(target=run, args=(context, worker_id,))
            jobs.append(job)
            job.start()
        zmq.proxy(frontend, backend)
        for job in jobs:
            job.join()
    except:
        frontend.close()
        backend.close()
        context.term()
client.py
import json
import zmq
from uuid import uuid4

if __name__ == "__main__":
    context = zmq.Context()
    socket = context.socket(zmq.DEALER)
    identity = str(uuid4())
    socket.identity = identity.encode("ascii")
    socket.connect("tcp://localhost:5570")

    poll = zmq.Poller()
    poll.register(socket, zmq.POLLIN)

    request = {
        "body": "Some request body.",
    }
    socket.send_string(json.dumps(request))

    while True:
        for i in range(5):
            sockets = dict(poll.poll(10))
            if socket in sockets:
                msg = socket.recv()
                print(msg)
Q : "How to turn on verbose logging for the sockets?"
Start using the native API's socket-monitor facility ( zmq_socket_monitor() ); all relevant details are reported as events arriving from the socket instance under monitoring.
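In pyzmq this is exposed as Socket.get_monitor_socket() together with zmq.utils.monitor.recv_monitor_message(); a minimal sketch (the event-name table is only there for readable output):

import zmq
from zmq.utils.monitor import recv_monitor_message

# Map numeric event codes back to their names, purely for readable output.
EVENT_NAMES = {value: name for name, value in vars(zmq).items()
               if name.startswith('EVENT_') and isinstance(value, int)}

context = zmq.Context()
socket = context.socket(zmq.DEALER)
monitor = socket.get_monitor_socket()   # PAIR socket delivering monitor events
socket.connect("tcp://localhost:5570")

while monitor.poll(5000):               # wait up to 5 s for the next event
    event = recv_monitor_message(monitor)
    print(EVENT_NAMES.get(event['event'], event['event']), event['endpoint'])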
Q : "How do we generally debug these issues in pyzmq?"
There is no general strategy for this. Having entered the domain of distributed computing, you will almost always create your own project-specific tools for collecting and viewing/interpreting a time-ordered flow of (principally) distributed events.
Last but not least: avoid trying to share a Context()-instance, let alone "among" 8 processes.
The Zen of Zero strongly advocates avoiding any shape or form of sharing. Here, one and the same Context()-instance is referenced ("shared") via the multiprocessing.Process call signature, which does not make inter-process "sharing" work.
Instead, let each spawned process create its own Context()-instance and use it from inside its private space during its own life cycle, as sketched below.
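A minimal sketch of that change for the server above (only the worker side is shown; the endpoint name is the one from the question):

import zmq

def run(worker_id):
    # Each process builds its own Context; nothing ZeroMQ-related crosses the fork.
    context = zmq.Context()
    socket = context.socket(zmq.DEALER)
    socket.connect("ipc://backend.ipc")
    try:
        while True:
            ident, msg = socket.recv_multipart()
            socket.send_multipart([ident, msg])
    finally:
        socket.close()
        context.term()

# in __main__: Process(target=run, args=(worker_id,)) -- no context argument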
By the way, your code ignores the return codes and exceptions documented in the native API that help you handle (and, in the worse cases, debug post-mortem) what happens inside the distributed system. A try: ... except: ... finally: scaffolding also helps a lot here.
Anyway, the sooner you learn to stop using the blocking forms of the .send(), .recv() and .poll() methods, the sooner your code starts to use the actual powers of ZeroMQ.
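For illustration, a non-blocking receive looks roughly like this (a sketch, not tied to the exact code above):

import zmq

context = zmq.Context()
socket = context.socket(zmq.DEALER)
socket.connect("tcp://localhost:5570")

try:
    msg = socket.recv(zmq.NOBLOCK)   # raises zmq.Again if nothing has arrived yet
except zmq.Again:
    msg = None                       # do other useful work instead of blocking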

pyzmq proxy in a strange state after subscribing multiple processes

I'm having a weird issue with the proxy in pyzmq. Here's the code of that proxy:
import zmq
context = zmq.Context.instance()
frontend_socket = context.socket(zmq.XSUB)
frontend_socket.bind("tcp://0.0.0.0:%s" % sub_port)
backend_socket = context.socket(zmq.XPUB)
backend_socket.bind("tcp://0.0.0.0:%s" % pub_port)
zmq.proxy(frontend_socket, backend_socket)
I'm using that proxy to send messages between ~50 processes that run on 6 different machines. The total amount of topics is around 1,000, but since multiple processes can listen on the same topics, the total amount of subscriptions is around 10,000.
In normal times this works very well; messages go through the proxy correctly as long as a process publishes them and at least one other process is subscribed to the topic. It works whether the publisher or the subscriber is started first.
But at some point in time, when we start a new process (let's call it X), it starts behaving strangely. Everything that was already connected keeps working, but the new processes that we connect can only get messages to go through if the publisher is connected before the subscriber. X can be any one of the processes that normally work, and it can be from any machine, and the result is the same. When we get in this state, killing X makes everything work again, and starting it again makes it fail. If we stop other processes and then start X, it works well (so it's not related with X's code in particular).
Could we be reaching some limit of ZMQ? I've read about setups with far more processes, subscriptions, etc. than ours. It could be some option that we should set on the proxy; so far here are the ones we've tried without success (sketched after this list):
Changing RCVHWM on frontend_socket
Changing SNDHWM on backend_socket
Setting XPUB_VERBOSE on backend_socket
Setting XPUB_VERBOSER on backend_socket
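For reference, a sketch of how those options might be applied to the proxy sockets shown above (the numeric values are placeholders):

frontend_socket.setsockopt(zmq.RCVHWM, 100000)   # raise the receive high-water mark
backend_socket.setsockopt(zmq.SNDHWM, 100000)    # raise the send high-water mark
backend_socket.setsockopt(zmq.XPUB_VERBOSE, 1)   # pass duplicate subscriptions upstream
# or, on newer libzmq, also pass duplicate unsubscriptions:
# backend_socket.setsockopt(zmq.XPUB_VERBOSER, 1)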
Here is sample code of how we publish messages to the proxy:
import json
import time
import zmq

topic = "test"
message = {"test": "test"}

context = zmq.Context.instance()
socket = context.socket(zmq.PUB)
socket.connect("tcp://1.2.3.4:1234")

while True:
    time.sleep(1)
    socket.send_multipart([topic.encode(), json.dumps(message).encode()])
Here is sample code of how we subscribe to messages from the proxy:
import zmq

topic = "test"

context = zmq.Context.instance()
socket = context.socket(zmq.SUB)
socket.connect("tcp://1.2.3.4:5678")
socket.subscribe(topic)

while True:
    multi_part = socket.recv_multipart()
    [topic, message] = multi_part
    print(topic.decode(), message.decode())
Has anyone ever seen a similar issue? Is there something we can do to avoid the proxy getting in this state?
Thanks!
Make all the publishers (both the proxy and the publishing processes) XPUB (plus the verbose/verboser sockopt), then read from the publisher sockets in a poll loop. The first byte of each subscription message tells you whether it is a subscribe or an unsubscribe, followed by the subject/topic. If you log all of this information with timestamps it should tell you which component is at fault (it could be any of the three) and help with a fix.
The format of the subscription messages that arrive on the publisher (XPUB) will be
Subscription [0x01][topic]
Unsubscription [0x00][topic]
Code needed
I usually work in C++, but this is the general idea in Python.
proxy
You need to create a capture socket (this acts like a network tap). You connect a ZMQ_PAIR socket to the proxy (capture) over inproc and then read the contents at the other end of the socket. As you are using XPUB/XSUB you will see the subscription messages.
zmq.proxy(frontend, backend, capture)
read the docs/examples for the python proxy.
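A sketch of that capture wiring, assuming the proxy code from the question (the inproc endpoint name and ports are placeholders):

import threading
import zmq

context = zmq.Context.instance()

frontend_socket = context.socket(zmq.XSUB)
frontend_socket.bind("tcp://0.0.0.0:1234")
backend_socket = context.socket(zmq.XPUB)
backend_socket.bind("tcp://0.0.0.0:5678")

# Capture socket: every frame passing through the proxy is copied here.
capture = context.socket(zmq.PAIR)
capture.bind("inproc://capture")

def tap():
    reader = context.socket(zmq.PAIR)
    reader.connect("inproc://capture")
    while True:
        frames = reader.recv_multipart()
        print("captured:", frames)   # log with timestamps in real code

threading.Thread(target=tap, daemon=True).start()
zmq.proxy(frontend_socket, backend_socket, capture)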
publisher
In this case you need to read from the publishing socket in the same thread as you are sending on it. That's the reason I said a poll loop might be best.
This code is not tested at all.
import json
import zmq

topic = "test"
message = {"test": "test"}

context = zmq.Context.instance()
socket = context.socket(zmq.XPUB)
socket.connect("tcp://1.2.3.4:1234")

poller = zmq.Poller()
poller.register(socket, zmq.POLLIN)

timeout = 1000  # ms
while True:
    socks = dict(poller.poll(timeout))
    if not socks:
        # nothing to read: publish as usual
        socket.send_multipart([topic.encode(), json.dumps(message).encode()])
    if socket in socks:
        # a subscription message arrived: first byte is 0x01 (sub) or 0x00 (unsub)
        sub_msg = socket.recv()
        # print out / log the message here

ZeroMQ REQ .recv() hangs with messages larger than ~1kB if run inside Docker

I'm working on a relatively simple Python / ZeroMQ based work distribution system, using REQ/ROUTER sockets. The system is distributed and worker nodes are geographically distributed on different continents.
The ROUTER, responsible for distributing work, .bind()-s a ROUTER socket. Workers .connect() to it over TCP using a REQ socket.
In the process of setting up a new worker node, I've noticed that while smaller messages (up to 1kB) make the trip with no issues, replies of ~2kB and up, sent by the ROUTER end, are never received by the worker's REQ socket - when I call recv(), the socket just hangs.
The worker code runs inside Docker containers, and I was able to work around the issue when running the same image with --net=host - it seems to not happen if Docker is using the host network.
I'm wondering if this is something in the network stack configuration on the host machine or in Docker, or maybe something that can be prevented in my code?
Here is a simplified version of my code that reproduces this issue:
Worker
import sys
import zmq
import logging
import time

READY = 'R'


def worker(connect_to):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.REQ)
    socket.connect(connect_to)
    log = logging.getLogger(__name__)
    while True:
        socket.send_string(READY)
        log.debug("Send READY message, waiting for reply")
        message = socket.recv()
        log.debug("Got reply of %d bytes", len(message))
        time.sleep(5)


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    worker(sys.argv[1])
Router
import sys
import zmq
import logging

REPLY_SIZE = 1024 * 8


def router(bind_to):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.ROUTER)
    socket.bind(bind_to)
    poller = zmq.Poller()
    poller.register(socket, zmq.POLLIN)
    log = logging.getLogger(__name__)
    while True:
        socks = dict(poller.poll(5000))
        if socks.get(socket) == zmq.POLLIN:
            message = socket.recv_multipart()
            log.debug("Received message of %d parts", len(message))
            identity, _ = message[:2]
            res = handle_message(message[2:])
            log.debug("Sending %d bytes back in response on socket", len(res))
            # identity frame, empty delimiter, payload -- all as bytes
            socket.send_multipart([identity, b'', res])


def handle_message(parts):
    log = logging.getLogger(__name__)
    log.debug("Got message: %s", parts)
    return b'A' * REPLY_SIZE


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    router(sys.argv[1])
FWIW I was able to reproduce this on Ubuntu 16.04 (both router and worker) with Docker 17.09.0-ce, libzmq 4.1.5 and PyZMQ 15.4.0.
No, sir, the socket does not hang at all.
Why?
The issue is that you have instructed the Socket() instance to enter an infinitely blocking state by calling the .recv() method without the zmq.NOBLOCK flag (the ZMQ_DONTWAIT flag in the original ZeroMQ API).
That is why, under the circumstances reported, the code blocks forever: something else prevents the Docker container from delivering even the first message to the worker's Docker-embedded ZeroMQ Context() I/O engine and on to the REQ access point, while the REQ archetype uses a strict two-step finite-state automaton, strictly striding .send() -> .recv() -> .send() -> ... ad infinitum.
Reversing this cause and effect is wrong and misleading: the symptom "socket just hangs" cannot be distinguished, from inside the code, from the real issue that Docker does not deliver a single message (which would allow .recv() to return).
Next steps:
Use .poll() on the REQ side to check, without blocking, whether any message has already arrived at the worker (see the sketch below).
Once you confirm that none has, focus on Docker first; after that, you may benefit from ZeroMQ Context() I/O-engine performance and link-level configuration tweaks.
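A minimal sketch of such a non-blocking check on the worker's REQ socket (the 1-second timeout is arbitrary):

import zmq

poller = zmq.Poller()
poller.register(socket, zmq.POLLIN)   # 'socket' is the worker's REQ socket

if poller.poll(1000):                 # wait at most 1 s for a reply
    message = socket.recv()
else:
    # nothing arrived: the REQ socket is fine, the reply never made it through
    pass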

How can I debug a connection issue within VOLTTRON?

I am connecting to an external VOLTTRON instance. I am not getting a response from the connection. What's the issue?
I am writing a simple python script to connect to an external platform and retrieve the peers. If I get the serverkey, clientkey, and/or publickey incorrect I don't know how to determine which is the culprit, from the client side. I just get a gevent timeout. Is there a way to know?
import gevent
from volttron.platform.vip.agent import Agent

secret = "secret"
public = "public"
serverkey = "server"
tcp_address = "tcp://external:22916"

agent = Agent(address=tcp_address, serverkey=serverkey, secretkey=secret,
              publickey=public)

event = gevent.event.Event()
greenlet = gevent.spawn(agent.core.run, event)
event.wait(timeout=30)

print("My id: {}".format(agent.core.identity))
peers = agent.vip.peerlist().get(timeout=5)
for p in peers:
    print(p)

gevent.sleep(3)
greenlet.kill()
The short answer: no, the client cannot determine why its connection to the server failed. The client will attempt to connect until it times out.
Logs and debug messages on the server side can help troubleshoot a connection problem. There are three distinct messages related to key errors:
CURVE I: cannot open client HELLO -- wrong server key?
Either the client omitted the server key, the client used the wrong server key, or the server omitted its secret key.
CURVE I: cannot open client INITIATE vouch
Either the client omitted the public or secret key, or its public and secret keys don't correspond to each other.
authentication failure
The server key was correct and the secret and public keys are valid, but the server rejected the connection because the client was not authorized to connect (based on the client's public key).
The first two messages are printed by libzmq. To see the third message, VOLTTRON must be started with increased verbosity (at least -v).
Here is a simple ZMQ server-client example you can use to test some of these scenarios:
Server:
import zmq

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.curve_server = 1
socket.curve_secretkey = b"mW4i2O{kmcOXs9q>UP0(no4-Sp1r(p>vK?*NFwV$"
# The corresponding public key is "krEC0>hsx+o4Jxg2yvitCOVwr2GF85akNIsUdiH5"
socket.bind("ipc://test123")

while True:
    msg = socket.recv()
    new_msg = "I got the message: {}".format(msg)
    print(new_msg)
    socket.send_string(new_msg)
Client:
import zmq

pub, sec = zmq.curve_keypair()

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.curve_secretkey = sec
socket.curve_publickey = pub
socket.curve_serverkey = b"krEC0>hsx+o4Jxg2yvitCOVwr2GF85akNIsUdiH5"
socket.connect("ipc://test123")

socket.send(b'Hello')
msg = socket.recv()
print("From the server: {}".format(msg))

How can I write a socket server in a different thread from my main program (using gevent)?

I'm developing a Flask/gevent WSGIserver webserver that needs to communicate (in the background) with a hardware device over two sockets using XML.
One socket is initiated by the client (my application) and I can send XML commands to the device. The device answers on a different port and sends back information that my application has to confirm. So my application has to listen to this second port.
Up until now I have issued a command, opened the second port as a server, waited for a response from the device and closed the second port.
The problem is that it's possible that the device sends multiple responses that I have to confirm. So my solution was to keep the port open and keep responding to incoming requests. However, in the end the device is done sending requests, and my application is still listening (I don't know when the device is done), thereby blocking everything else.
This seemed like a perfect use case for a thread, so that my application launches a listening server in a separate thread. Because I'm already using gevent as a WSGI server for Flask, I can use the greenlets.
The problem is, I have looked for a good example of such a thing, but all I can find is examples of multi-threading handlers for a single socket server. I don't need to handle a lot of connections on the socket server, but I need it launched in a separate thread so it can listen for and handle incoming messages while my main program can keep sending messages.
The second problem I'm running into is that in the server, I need to use some methods from my "main" class. Being relatively new to Python I'm unsure how to structure it in a way to make that possible.
import socket


class Device(object):
    def __init__(self, ...):
        self.clientsocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def _connect_to_device(self):
        print("OPEN CONNECTION TO DEVICE")
        try:
            self.clientsocket.connect((self.ip, 5100))
        except socket.error as e:
            pass

    def _disconnect_from_device(self):
        print("CLOSE CONNECTION TO DEVICE")
        self.clientsocket.close()

    def deviceaction1(self, ...):
        # the data that is sent is an XML document that depends on the parameters of this method.
        self._connect_to_device()
        self._send_data(XMLdoc)
        self._wait_for_response()
        return True

    def _send_data(self, data):
        print("SEND:")
        print(data)
        self.clientsocket.send(data)

    def _wait_for_response(self):
        print("WAITING FOR REQUESTS FROM DEVICE (CHANNEL 1)")
        self.serversocket.bind(('10.0.0.16', 5102))
        self.serversocket.listen(5)  # listen for answer, maximum 5 connections
        connection, address = self.serversocket.accept()
        data = connection.recv(4096)  # the data is of a specific length I can calculate
        if len(data) > 0:
            self._process_response(data)
        self.serversocket.close()

    def _process_response(self, data):
        print("RECEIVED:")
        print(data)
        # here is some code that processes the incoming data and
        # responds to the device
        # this may or may not result in more incoming data


if __name__ == '__main__':
    machine = Device(ip="10.0.0.240")
    machine.deviceaction1(...)
This is (globally, I left out sensitive information) what I'm doing now. As you can see everything is sequential.
If anyone can provide an example of a listening server in a separate thread (preferably using greenlets) and a way to communicate from the listening server back to the spawning thread, it would be of great help.
Thanks.
EDIT:
After trying several methods, I decided to use Python's built-in select() to solve this problem. This worked, so my question regarding the use of threads is no longer relevant. Thanks to the people who provided input, for your time and effort.
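For context, a minimal sketch of what such a select()-based wait can look like (this is an assumption, not the OP's actual code; the timeout and buffer size are arbitrary):

import select
import socket

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(('10.0.0.16', 5102))
serversocket.listen(5)

while True:
    # Wait up to 10 s for the device to connect; give up if it stays silent.
    readable, _, _ = select.select([serversocket], [], [], 10.0)
    if not readable:
        break                        # the device is done sending requests
    connection, address = serversocket.accept()
    data = connection.recv(4096)
    # ... process and confirm the response here ...
    connection.close()

serversocket.close()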
Hope it can provide some help. In the example class, if we call the tenMessageSender function it fires up an asynchronous thread without blocking the main loop, and _zmqBasedListener then listens on a separate port for as long as that thread is alive. Whatever messages tenMessageSender sends are received by the client, which responds back to _zmqBasedListener.
Server Side
import sys
import threading

import zmq


class Example:
    def __init__(self):
        self.context = zmq.Context()
        self.publisher = self.context.socket(zmq.PUB)
        self.publisher.bind('tcp://127.0.0.1:9997')
        self.subscriber = self.context.socket(zmq.SUB)
        self.thread = threading.Thread(target=self._zmqBasedListener)

    def _zmqBasedListener(self):
        self.subscriber.connect('tcp://127.0.0.1:9998')
        self.subscriber.setsockopt(zmq.SUBSCRIBE, b"some_key")
        while True:
            message = self.subscriber.recv()
            print(message)
            sys.exit()  # note: this ends the listener thread after the first reply

    def tenMessageSender(self):
        self._decideListener()
        for message in range(10):
            self.publisher.send_string("testid : %d: I am a task" % message)

    def _decideListener(self):
        if not self.thread.is_alive():
            print("STARTING THREAD")
            self.thread.start()
Client
import zmq

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.connect('tcp://127.0.0.1:9997')
publisher = context.socket(zmq.PUB)
publisher.bind('tcp://127.0.0.1:9998')
subscriber.setsockopt(zmq.SUBSCRIBE, b"testid")

count = 0
print("Listener")
while True:
    message = subscriber.recv()
    print(message)
    publisher.send_string('some_key : Message received %d' % count)
    count += 1
Instead of a thread you can use a greenlet, etc. (sketched below).
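A sketch of that substitution (an assumption on my part, using pyzmq's gevent-compatible zmq.green module so the greenlet does not block the event loop on recv()):

import gevent
import zmq.green as zmq   # green sockets cooperate with gevent's event loop

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.connect('tcp://127.0.0.1:9998')
subscriber.setsockopt(zmq.SUBSCRIBE, b"some_key")

def listener():
    while True:
        print(subscriber.recv())   # yields to other greenlets instead of blocking

greenlet = gevent.spawn(listener)  # replaces threading.Thread(...).start()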
