I have to write a reliable, totally-ordered multicast system from scratch in Python. I can't use any external libraries. I'm allowed to use a central sequencer.
There seems to be two immediate approaches:
write an efficient system, attaching a unique id to each multicasted message,
having the sequencer multicast sequence numbers for the message id's it receives,
and sending back and forth ACK's and NACK's.
write an inefficient flooding system, where each multicaster simply re-sends each
message it receives once (unless it was sent by that particular multicaster.)
I'm allowed to use the second option, and am inclined to do so.
I'm currently multicasting UDP messages (which seems to be the only option,) but that means that some messages might get lost. That means I have to be able to uniquely identify each sent UDP message, so that it can be re-sent according to #2. Should I really generate unique numbers (e.g. using the sender address and a counter) and pack them into each and every UDP message sent? How would I go about doing that? And how do I receive a single UDP message in Python, and not a stream of data (i.e. socket.recv)?
The flooding approach can cause a bad situation to get worse. If messages are dropped due to high network load, having every node resend every message will only make the situation worse.
The best approach to take depends on the nature of the data you are sending. For example:
Multimedia data: no retries, a dropped packet is a dropped frame, which won't matter when the next frame gets there anyway.
Fixed period data: Recipient node keeps a timer that is reset each time an update is received. If the time expires, it requests the missing update from the master node. Retries can be unicast to the requesting node.
If neither of these situations applies (every packet has to be received by every node, and the packet timing is unpredictable, so recipients can't detect missed packets on their own), then your options include:
Explicit ACK from every node for each packet. Sender retries (unicast) any packet that is not ACKed.
TCP-based grid approach, where each node is manually repeats received packets to neighbor nodes, relying on TCP mechanisms to ensure delivery.
You could possibly rely on recipients noticing a missed packet upon reception of one with a later sequence number, but this requires the sender to keep the packet around until at least one additional packet has been sent. Requiring positive ACKs is more reliable (and provable).
The approach you take is going to depend very much on the nature of the data that you're sending, the scale of your network and the quantity of data you're sending. In particular it is going to depend on the number of targets each of your nodes is connected to.
If you're expecting this to scale to a large number of targets for each node and a large quantity of data then you may well find that the overhead of adding an ACK/NAK to every packet is sufficient to adversely limit your throughput, particularly when you add retransmissions into the mix.
As Frank Szczerba has said multimedia data has the benefit of being able to recover from lost packets. If you have any control over the data that you're sending you should try to design the payloads so that you minimise the susceptibility to dropped packets.
If the data that you're sending cannot tolerate dropped packets and you're trying to scale to high utilisation of your network then perhaps udp is not the best protocol to use. Implementing a series of tcp proxies (where each node retransmits, unicast, to all other connected nodes - similar to your flooding idea) would be a more reliable mechanism.
With all of that said, have you considered using true multicast for this application?
Just saw the "homework" tag... these suggestions might not be appropriate for a homework problem.
IMHO, you should choose an existing reliable UDP protocol. There are several you can choose, take a look at this SO question: What do you use when you need reliable UDP?
I personally like and use MoldUDP, which is the protocol used by Nasdaq's ITCH market data feed.
Related
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 3 years ago.
Improve this question
I am client receiver of UDP multicast data sent by Sender server (Stock Exchange data). I am continuously receiving udp multicast packet flow sequentially numbered 1 to approximately 35,000,000 sent uniformly over a period of 6 hours . I need to ensure all packets upto say N are received before the set of N packets is periodically processed after every say ~ 256 packets. i.e. I need reliable UDP.
Reliable UDP is mimicked using TCP retransmit. If any udp packet(s) is lost/not received, it is requested by using tcp protocol by specifying the desired missing packet range (starting number, ending number).
Sender keeps record of all the packets (stock exchange data) it has sent via UDP multicast so far. So Sender will resend by TCP only those packets numbers that the receiver specifically requests for via TCP. This is how UDP reliability is achieved by receiver. The UDP drop ratio is very small (less than 0.001%) except when starting the UDP multicast in the middle of the day, in which case all previously sent UDP packets from 1 to some N will need to be resent on TCP, while live transmission of UDP multicast data packet number N+1 onward is being received.) I can't request Sender (Stock Exchange) to change its protocol--it is fixed.
What is the efficient algorithm to implement this in terms of CPU?
The issue is speed BigOh. I can make a naive algorithm using several nested loops and methods, but it not necessarily the best.
I am thinking of maintaining a number N which confirms I have received UDP
packets 1 through N, and any packet no. M which is not the next expected packet no. N+1 will be buffered, for say 256 packets, and then TCP will be used to request the missing numbers. Then normal UDP reception resumes over from the last confirmed received number after TCP request is filled.
Example:
Suppose UDP packets received by receiver are in the following sequence {1,2,3,6,7,8,9,10 ...}
After packet No. 3, the next packet is No. 6. Packets 4 through 5 are missing.
So the missing packets {4,5} are requested using TCP request({4 through 5}), and {6,7,8,9,10} are buffered. There is enough space on the 10GBaseT LAN card for buffering 35,000,000 packets.
So: receive UDP {1,2,3}, refill by TCP request {4,5}, continue receive UDP {6,7,8,9,10, ...}
I assume since you are using multicast that there are going to be multiple receivers of this data? (Because if not, you'd probably be using unicast instead)
Therefore, if the receivers are going to have the option of requesting TCP retransmission of packets they didn't get, that means that the transmitting program will need to keep a copy of recently-sent UDP packets in memory, so that when it receives a retransmit-request, it will have the requested data available to retransmit. Assuming you're stamping each packet with a unique ID, it can store this data in a std::map or std::unordered_map or similar for quick lookup.
The real question becomes, how much of this old-packet data should the transmitter retain? ideally it would retain all of it, because you never know how much a given receiver might have missed and might want to request; but that would require infinite memory so that's not a realistic option. Probably the best you can do is decide how much RAM you're willing to tie up for this purpose, and keep a count of the total number of bytes you have in your table, and when it reaches the limit, start dropping the oldest packets from the table in order to keep its size under the limit.
I wrote an open-source library that uses essentially the technique you describe (multicast UDP + TCP-retransmit-to-recover-from-packet-loss) to synchronize databases across multiple hosts as quickly as possible; some things I learned while implementing it include:
If/when you can, pack your data-messages together into larger packets, up to the MTU of the network you are transmitting over (e.g. 1388 bytes for IPv4/Ethernet). Very small packet-sizes (like 48-bytes/packet) are inefficient, since the fixed-sized packet-headers make up a greater percentage of the total data sent/received.
Only try to send when your sending-socket indicates it is ready-for-write. (i.e. don't assume that you will never fill up the socket's outgoing-data-buffer; if your traffic is "bursty", you probably will at some point)
Minimize UDP packet loss by making your UDP sockets' send and receive buffers as large as you can get away with
Further minimize UDP packet loss by doing all the UDP receiving in a dedicated, high-priority thread (which can then route the received UDP data back to a normal-priority thread for further processing -- the main thing is to avoid allowing the receiving UDP-socket's incoming-data-buffer to overflow if possible)
For the TCP retransmission part, keep in mind that TCP streams can potentially slow down to nearly zero bytes-per-second in the worst case scenario, which makes it important to ensure that poor TCP performance to client A doesn't block the TCP communications to/from clients B, C, D, etc. This can be accomplished either via non-blocking I/O and select() (or poll() or similar), or asynchronous networking, or via multiple threads; avoid blocking I/O unless you are implementing a thread-per-socket model (and probably avoid that model as well, since a thread that is indefinitely-blocked-inside-recv() is difficult to shut down cleanly)
Think about under what circumstances (if any) it is acceptable for a client to never receive a particular packet at all; are there situations where that is okay? Or must the entire system grind to a halt until every receiver has received every packet in the group, regardless of how long that might take?
If you want to get really fancy, you can look into Forward Error Correction algorithms that encode data across packets, such that the receiver can still decode all of the data even if it never receives (up to a certain percentage of) the packets. This makes the need for a re-transmit request less likely, at the cost of making all of the packets slightly larger.
TCP flows by their own nature will grow until they fill the maximum capacity of the links used from src to dst (if all those links are empty).
Is there an easy way to limit that ? I want to be able to send TCP flows with a maximum X mbps rate.
I thought about just sending X bytes per second using the socket.send() function and then sleeping the rest of the time. However if the link gets congested and the rate gets reduced, once the link gets uncongested again it will need to recover what it could not send previously and the rate will increase.
At the TCP level, the only control you have is how many bytes you pass off to send(), and how often you call it. Once send() has handed over some bytes to the networking stack, it's entirely up to the networking stack how fast (or slow) it wants to send them.
Given the above, you can roughly limit your transmission rate by monitoring how many bytes you have sent, and how much time has elapsed since you started sending, and holding off subsequent calls to send() (and/or the number of data bytes your pass to send()) to keep the average rate from going higher than your target rate.
If you want any finer control than that, you'll need to use UDP instead of TCP. With UDP you have direct control of exactly when each packet gets sent. (Whereas with TCP it's the networking stack that decides when to send each packet, what will be in the packet, when to resend a dropped packet, etc)
I have two apps sending tcp packages, both written in python 2. When client sends tcp packets to server too fast, the packets get concatenated. Is there a way to make python recover only last sent package from socket? I will be sending files with it, so I cannot just use some character as packet terminator, because I don't know the content of the file.
TCP uses packets for transmission, but it is not exposed to the application. Instead, the TCP layer may decide how to break the data into packets, even fragments, and how to deliver them. Often, this happens because of the unterlying network topology.
From an application point of view, you should consider a TCP connection as a stream of octets, i.e. your data unit is the byte, not a packet.
If you want to transmit "packets", use a datagram-oriented protocol such as UDP (but beware, there are size limits for such packets, and with UDP you need to take care of retransmissions yourself), or wrap them manually. For example, you could always send the packet length first, then the payload, over TCP. On the other side, read the size first, then you know how many bytes need to follow (beware, you may need to read more than once to get everything, because of fragmentation). Here, TCP will take care of in-order delivery and retransmission, so this is easier.
TCP is a streaming protocol, which doesn't expose individual packets. While reading from stream and getting packets might work in some configurations, it will break with even minor changes to operating system or networking hardware involved.
To resolve the issue, use a higher-level protocol to mark file boundaries. For example, you can prefix the file with its length in octets (bytes). Or, you can switch to a protocol that already handles this kind of stuff, like http.
First you need to know if the packet is combined before it is sent or after. Use wireshark to check it the sender is sending one packet or two. If it is sending one, then your fix is to call flush() after each write. I do not know the answer if the receiver is combining packets after receiving them.
You could change what you are sending. You could send bytes sent, followed by the bytes. Then the other side would know how many bytes to read.
Normally, TCP_NODELAY prevents that. But there are very few situations where you need to switch that on. One of the few valid ones are telnet style applications.
What you need is a protocol on top of the tcp connection. Think of the TCP connection as a pipe. You put things in one end of the pipe and get them out of the other. You cannot just send a file through this without both ends being coordinated. You have recognised you don't know how big it is and where it ends. This is your problem. Protocols take care of this. You don't have a protocol and so what you're writing is never going to be robust.
You say you don't know the length. Get the length of the file and transmit that in a header, followed by the number of bytes.
For example, if the header is a 64bits which is the length, then when you receive your header at the server end, you read the 64bit number as the length and then keep reading until the end of the file which should be the length.
Of course, this is extremely simplistic but that's the basics of it.
In fact, you don't have to design your own protocol. You could go to the internet and use an existing protocol. Such as HTTP.
I am sending 20000 messages from a DEALER to a ROUTER using pyzmq.
When I pause 0.0001 seconds between each messages they all arrive but if I send them 10x faster by pausing 0.00001 per message only around half of the messages arrive.
What is causing the problem?
What is causing the problem?
A default setup of the ZMQ IO-thread - that is responsible for the mode of operations.
I would dare to call it a problem, the more if you invest your time and dive deeper into the excellent ZMQ concept and architecture.
Since early versions of the ZMQ library, there were some important parameters, that help the central masterpiece ( the IO-thread ) keep the grounds both stable and scalable and thus giving you this powerful framework.
Zero SHARING / Zero COPY / (almost) Zero LATENCY are the maxims that do not come at zero-cost.
The ZMQ.Context instance has quite a rich internal parametrisation that can be modified via API methods.
Let me quote from a marvelous and precious source -- Pieter HINTJENS' book, Code Connected, Volume 1.
( It is definitely worth spending time and step through the PDF copy. C-language code snippets do not hurt anyone's pythonic state of mind as the key messages are in the text and stories that Pieter has crafted into his 300+ thrilling pages ).
High-Water Marks
When you can send messages rapidly from process to process, you soon discover that memory is a precious resource, and one that can be trivially filled up. A few seconds of delay somewhere in a process can turn into a backlog that blows up a server unless you understand the problem and take precautions.
...
ØMQ uses the concept of HWM (high-water mark) to define the capacity of its internal pipes. Each connection out of a socket or into a socket has its own pipe, and HWM for sending, and/or receiving, depending on the socket type. Some sockets (PUB, PUSH) only have send buffers. Some (SUB, PULL, REQ, REP) only have receive buffers. Some (DEALER, ROUTER, PAIR) have both send and receive buffers.
In ØMQ v2.x, the HWM was infinite by default. This was easy but also typically fatal for high-volume publishers. In ØMQ v3.x, it’s set to 1,000 by default, which is more sensible. If you’re still using ØMQ v2.x, you should always set a HWM on your sockets, be it 1,000 to match ØMQ v3.x or another figure that takes into account your message sizes and expected subscriber performance.
When your socket reaches its HWM, it will either block or drop data depending on the socket type. PUB and ROUTER sockets will drop data if they reach their HWM, while other socket types will block. Over the inproc transport, the sender and receiver share the same buffers, so the real HWM is the sum of the HWM set by both sides.
Lastly, the HWM-s are not exact; while you may get up to 1,000 messages by default, the real buffer size may be much lower (as little as half), due to the way libzmq implements its queues.
I've set up a server reusing the code found in the documentation where I have self.data = self.request.recv(1024).strip().
But how do I go from this, deserialize it to protobuf message (Message.proto/Message_pb2.py). Right now it seems that it's receiving chunks of 1024 bytes, and that more then one at the time... making it all rubbish :D
TCP is typically just a stream of data. Just because you sent each packet as a unit, doesn't mean the receiver gets that. Large messages may be split into multiple packets; small messages may be combined into a single packet.
The only way to interpret multiple messages over TCP is with some kind of "framing". With text-based protocols, a CR/LF/CRLF/zero-byte might signify the end of each frame, but that won't work with binary protocols like protobuf. In such cases, the most common approach is to simply prepend each message with the length, for example in a fixed-size (4 bytes?) network-byte-order chunk. Then the payload. In the case of protobuf, the API for your platform may also provide a mechanism to write the length as a "varint".
Then, reading is a matter of:
read an entire length-header
read (and buffer) that many bytes
process the buffered data
rinse and repeat
But keeping in mind that you might have (in a single packet) the end of one message, 2 complete messages, and the start of another message (maybe half of the length-header, just to make it interesting). So: keeping track of exactly what you are reading at any point becomes paramount.