How can I deserialize incoming data on the TCP server?

How can I deserialize incoming data on the TCP server? - python

I've set up a server reusing the code found in the documentation where I have self.data = self.request.recv(1024).strip().
But how do I go from this, deserialize it to protobuf message (Message.proto/Message_pb2.py). Right now it seems that it's receiving chunks of 1024 bytes, and that more then one at the time... making it all rubbish :D

TCP is typically just a stream of data. Just because you sent each packet as a unit, doesn't mean the receiver gets that. Large messages may be split into multiple packets; small messages may be combined into a single packet.
The only way to interpret multiple messages over TCP is with some kind of "framing". With text-based protocols, a CR/LF/CRLF/zero-byte might signify the end of each frame, but that won't work with binary protocols like protobuf. In such cases, the most common approach is to simply prepend each message with the length, for example in a fixed-size (4 bytes?) network-byte-order chunk. Then the payload. In the case of protobuf, the API for your platform may also provide a mechanism to write the length as a "varint".
Then, reading is a matter of:
read an entire length-header
read (and buffer) that many bytes
process the buffered data
rinse and repeat
But keeping in mind that you might have (in a single packet) the end of one message, 2 complete messages, and the start of another message (maybe half of the length-header, just to make it interesting). So: keeping track of exactly what you are reading at any point becomes paramount.

Related

How to receive for sendall in python socket module

I am fairly new to using sockets, and this will probably have a simple answer that I am overlooking, but since an hour of agonizing has not yielded results so... what the heck.
How do I receive for .sendall() in the python socket module? By this I mean how do I receive data from a socket with out a buffer? is there a simple solution for this like some sort of conn.recvall() function or do I have it write out logic to do this? If I do have to write logic for it, then how should I do it? Should I just keep using .recv() with some arbitrary buffint or do I have to split the inputs into segments before sending? Which is more efficient, or better? Is there a smarter way to go about it?
Thanks

send and sendall will chop your buffer into pieces for sending over the network. It's important to remember that TCP is a streaming protocol, not a packet protocol. If you send 1,024 bytes, it might be received by the other end as 1,024 bytes, or as one of 256 and one of 768, or one of 1,000 and one of 24. The receiver need to know when the transmission is complete. Sometimes it's fixed buffer, sometimes you'll send a byte count first, sometimes you use a special termination character, sometimes you wait for a timeout. The receiver just needs to keep calling .recv until he knows its done.
Some of the higher level Python packages (like twisted (which I recommend)) can handle that for you.

Stop tcp packets from concatenating

I have two apps sending tcp packages, both written in python 2. When client sends tcp packets to server too fast, the packets get concatenated. Is there a way to make python recover only last sent package from socket? I will be sending files with it, so I cannot just use some character as packet terminator, because I don't know the content of the file.

TCP uses packets for transmission, but it is not exposed to the application. Instead, the TCP layer may decide how to break the data into packets, even fragments, and how to deliver them. Often, this happens because of the unterlying network topology.
From an application point of view, you should consider a TCP connection as a stream of octets, i.e. your data unit is the byte, not a packet.
If you want to transmit "packets", use a datagram-oriented protocol such as UDP (but beware, there are size limits for such packets, and with UDP you need to take care of retransmissions yourself), or wrap them manually. For example, you could always send the packet length first, then the payload, over TCP. On the other side, read the size first, then you know how many bytes need to follow (beware, you may need to read more than once to get everything, because of fragmentation). Here, TCP will take care of in-order delivery and retransmission, so this is easier.

TCP is a streaming protocol, which doesn't expose individual packets. While reading from stream and getting packets might work in some configurations, it will break with even minor changes to operating system or networking hardware involved.
To resolve the issue, use a higher-level protocol to mark file boundaries. For example, you can prefix the file with its length in octets (bytes). Or, you can switch to a protocol that already handles this kind of stuff, like http.

First you need to know if the packet is combined before it is sent or after. Use wireshark to check it the sender is sending one packet or two. If it is sending one, then your fix is to call flush() after each write. I do not know the answer if the receiver is combining packets after receiving them.
You could change what you are sending. You could send bytes sent, followed by the bytes. Then the other side would know how many bytes to read.

Normally, TCP_NODELAY prevents that. But there are very few situations where you need to switch that on. One of the few valid ones are telnet style applications.
What you need is a protocol on top of the tcp connection. Think of the TCP connection as a pipe. You put things in one end of the pipe and get them out of the other. You cannot just send a file through this without both ends being coordinated. You have recognised you don't know how big it is and where it ends. This is your problem. Protocols take care of this. You don't have a protocol and so what you're writing is never going to be robust.
You say you don't know the length. Get the length of the file and transmit that in a header, followed by the number of bytes.
For example, if the header is a 64bits which is the length, then when you receive your header at the server end, you read the 64bit number as the length and then keep reading until the end of the file which should be the length.
Of course, this is extremely simplistic but that's the basics of it.
In fact, you don't have to design your own protocol. You could go to the internet and use an existing protocol. Such as HTTP.

How to make an endmark in python TCP socket calls send() and recv()

I am relatively new to sockets and very new to python. How would you go about making an endmark for python send() and recv().
I have searched all over and there is no easy tutorial. I have read the man page for recv(2) a thousand times and it ironically makes less sense to me each time I read it.
I would like to use the send() function in a server to let the client calling recv() know when the end of the send() is.
Do you use the flag argument of send?
Or do you use something like "|".join(str1, str2) and use an if statement in the client to recognize the | and parse the statement?

TCP is not message oriented protocol. It does not maintain message boundaries nor it does not help in other ways to achieve it. It is up-to the application to mark boundaries. Client and server can agree upon a method in exchange of data. Common method is to put in the message length in the data you send.
[2 byte message length][Actual Data of Interest]
The end which receives packets will always look for two byte length indicator. recv as much data indicated by length bytes, process them and again go to recv length bytes and so on.
Another method is that the application can mark the start and end of the message with markers. It also needs to handle cases where the markers can also be part of actual data.
[Start Indicator][Actual Data of Interest][ End Indicator]

Sending big files in Twisted

I have a really simple code that allows me to send an image from client to server. And it works.
As simple as this:
On the client side...
def sendFile(self):
image = open(picname)
data = image.read()
self.transport.write(data)
On the server side...
def dataReceived(self, data):
print 'Received'
f = open("image.png",'wb')
f.write(data)
f.close()
Problem with this is that only works if the image is up to 4.somethingkB, as it stops working when the image is bigger (at least doesn't work when gets to 6kB). Then, is when I see that the "Received" is being printed more than one time. Which makes me think that data is being separated in smaller chunks. However, even if those chunks of data get to the server (as I'm seeing the repeated print called from the dataReceived) the image is corrupted and can't be opened.
I don't know that much about protocols, but I supposed that TCP should be reliable, so the fact that the packets got there in a different order or so, shouldn't...happen? So I was thinking that maybe Twisted is doing something there that I ignore and maybe I should use another Protocol.
So here is my question. Is there something that I could do now to make it work or I should definitely change to another Protocol? If so...any idea? My goal would be sending a bigger image, maybe the order of hundreds of kB.

This is a variant of an entry in the Twisted FAQ:
Why is protocol.dataReceived called with only part of the data I called transport.write with?
TCP is a stream-based protocol. It is delivering a stream of bytes, which may be broken up into an arbitrary number of fragments. If you write one big blob of bytes, it may be broken up into an arbitrary number of smaller chunks, depending on the characteristics of your physical network connection. When you say that TCP should be "reliable", and that the chunks should arrive in order, you are roughly correct: however, what arrives in order is the bytes, not the chunks.
What you are doing in your dataReceived method is, upon receiving each chunk, opening a file and writing the contents of just that chunk to "image.png", then closing it. If you change it to open the file in connectionMade and close the file in connectionLost you should see at least vaguely the right behavior, although this will still cause you to get corrupted / truncated images if the connection is lost unexpectedly, with no warning. You should really use a framing protocol like AMP; although if you're just sending big blobs of data around, HTTP is probably a better choice.

How to use dataReceived in Twisted?

I have implemented a server program using Twisted. I am using basic.lineReceiver with the method dataReceived to receive data from multiple clients. Also, I am using protocol.ServerFactory to keep track of connected clients. The server sends some commands to each connected client. Based on the response that the server gets from each client, it (the server) should perform some tasks. Thus, the best solution that came to my mind was to create a buffer for received messages as a python list, and each time that the functions at server side want to know the response from a client, they access the last element of the buffer list (of that client).
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this). Second, the received messages are sometimes not in their appropriate sequence. Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
Could you tell me what is the best parctice for using dataReceived or its equivalents in the above problem? thank you in advance.
EDIT 1: Answer- While I accept #Jean-Paul Calderone's answer since I certainly learned from it, I would like to add that in my own research of Twisted's documentation, I learned that in order to avoid delays in communications of the server, one should use return at the end of dataReceived() or lineReceived() functions, and this solved part of my problem. The rest, were explained in the answer.

I have implemented a server program using Twisted. I am using basic.lineReceiver with the method dataReceived to receive data from multiple clients.
This is a mistake - an unfortunately common one brought on by the mistaken use of inheritance in many of Twisted's protocol implementations as the mechanism for building up more and more sophisticated behaviors. When you use twisted.protocols.basic.LineReceiver, the dataReceived callback is not for you. LineReceiver.dataReceived is an implementation detail of LineReceiver. The callback for you is LineReceiver.lineReceived. LineReceiver.dataReceived looks like it might be for you - it doesn't start with an underscore or anything - but it's not. dataReceived is how LineReceiver receives information from its transport. It is one of the public methods of IProtocol - the interface between a transport and the protocol interpreting the data received over that transport. Yes, I just said "public method" there. The trouble is it's public for the benefit of someone else. This is confusing and perhaps not communicated as well as it could be. No doubt this is why it is a Frequently Asked Question.
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this).
Use of dataReceived is why this happens. LineReceiver already implements delimiter-based parsing for you. That's why it's called "line" receiver - it receives lines separated by a delimiter. If you override lineReceived instead of dataReceived then you'll be called which each line that is received, regardless of how TCP splits things up or smashes them together.
Second, the received messages are sometimes not in their appropriate sequence.
TCP is a reliable, ordered, stream-oriented transport. "Ordered" means that bytes arrive in the same order they are sent. Put another way, when you write("x"); write("y") it is guaranteed that the receiver will receive "x" before they receive "y" (they may receive "x" and "y" in the same call to recv() but if they do, the data will definitely be "xy" and not "yx"; or they may receive the two bytes in two calls to recv() and if they do, the first recv() will definitely by "x" and the second will definitely be "y", not the other way around).
If bytes appear to be arriving in a different order than you sent them, there's probably another bug somewhere that makes it look like this is happening - but it actually isn't. Your platform's TCP stack is very likely very close to bug free and in particular it probably doesn't have TCP data re-ordering bugs. Likewise, this area of Twisted is extremely well tested and probably works correctly. This leaves a bug in your application code or a misinterpretation of your observations. Perhaps your code doesn't always append data to a list or perhaps the data isn't being sent in the order you expected.
Another possibility is that you are talking about the order in which data arrives across multiple separate TCP connections. TCP is only ordered over a single connection. If you have two connections, there are very few (if any) guarantees about the order in which data will arrive over them.
Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
What defines "too slow"? The network is as fast as the network is. If that's not fast enough for you, find a fatter piece of copper. It sounds like what you really mean here is that your server sometimes expects data to have arrived before that data actually arrives. This doesn't mean the network is too slow, though, it means your server isn't properly event driven. If you're inspecting a buffer and not finding the information you expected, it's because you inspected it before the occurrence of the event which informs you of the arrival of that information. This is why Twisted has all these callback methods - dataReceived, lineReceived, connectionLost, etc. When lineReceived is called, this is an event notification telling you that right now something happened which resulted in a line being available (and, for convenience, lineReceived takes one argument - an object representing the line which is now available).
If you have some code that is meant to run when a line has arrived, consider putting that code inside an implementation of the lineReceived method. That way, when it runs (in response to a line being received), you can be 100% sure that you have a line to operate on. You can also be sure that it will run as soon as possible (as soon as the line arrives) but no sooner.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.