Azure Storage Queue data overhead with python

I'm using Wireshark to understand the data usage when posting messages to an Azure storage queue via the Python SDK (https://pypi.org/project/azure-storage-queue/).
The Wireshark capture filter has been set to show communication to and from the Azure queue. The following table shows the data transferred for a single post to the queue (certificate exchange has already occurred). If I post multiple messages using queue.send_message, the entire block repeats.
The message itself is posted as part of line 2 (length 429 bytes, which varies with message size as expected). Then there is a TCP Ack and the response comes back (fixed length, 753 bytes), see https://learn.microsoft.com/en-us/rest/api/storageservices/put-message.
I do not understand the first (619 bytes) or last (88 bytes) packets, which are also of fixed length even when varying message size. Any idea what these other packets are?

Related

reading both tcp and udp packets from same socket

I am trying to read packets in a router, like this in python:
import socket

# (skipping the exception handling code here)
# 0x0003 is ETH_P_ALL: capture frames of every protocol (needs root on Linux)
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
while True:
    p = s.recvfrom(2000)
    pkt = p[0]
    # process pkt here ...
Answers to a related question (36115971) say that parameters and methods for UDP vs TCP data are different (some say recv is for TCP and recvfrom is for UDP, and others say the opposite; similarly, some say 1024 as buffer size for TCP and larger for UDP, and again some say the reverse). In my case of reading in a router, I do not have different sockets for TCP and UDP, so I need to read both from the same socket, and I am a bit confused about how I should read the incoming packets.
(1) Should I use recv() or recvfrom(), if I want to read both TCP and UDP packets?
(2) Do the calls return data one packet at a time, or do they return after the buffer is filled up? eg, if I have a large buffer of 4096 bytes, and the incoming streaming 2 packets have 2400 bytes each, will the call return as soon as the 1st packet ends, or will it return after filling up the buffer from the 2nd packet also?
(2a) same question, but if I have a smaller buffer of 2000 bytes. It is clear that on the 1st call I will get the first 2000 bytes of the 1st packet. But on the next call, will I get the last 400 bytes of the 1st packet, or the first 2000 bytes of the 2nd packet?
(3) If I am delayed in making the next call, maybe because I was busy processing the 1st dataset, am I in danger of losing data, or will the OS keep its internal queue of the incoming packets to be given to me when I call the next time? If the OS keeps its internal queue, where can I find information about its size?
NOTE: Some of the given replies have been divergent, so let me put in some boundaries to my question. Hopefully these restrictions will help to give more specific answers.
(a) My objective is to sniff the incoming packets with python sockets only. So other solutions involving tcpdump or tshark etc are outside the scope.
(b) The objective is to only sniff for incoming packets. Additional details like packet reordering (for connection oriented protocols like TCP) are outside the scope, actually they are avoidable overhead.

If you're reading packets from a raw socket (as shown in your source code), then you can easily read all packets from the same socket. Be sure this is what you intend to do. A raw socket is for doing packet inspection for troubleshooting, forensic, security or educational purposes. You cannot easily communicate with another system this way.
And likewise, the receive calls will not differ here by protocol because you are not actually using TCP or UDP, you're simply receiving the raw packets that those protocols build and decode.
(1) Should I use recv() or recvfrom(), if I want to read both TCP and UDP packets?
Either one will work. recv() will return to you only the actual packet data, while recvfrom will return to you the data along with metadata about the packet, including the interface from which the data was received (and other things defined in struct sockaddr_ll from the packet(7) man page).
(2) Do the calls return data one packet at a time, or do they return after the buffer is filled up? eg, if I have a large buffer of 4096 bytes, and the incoming streaming 2 packets have 2400 bytes each, will the call return as soon as the 1st packet ends, or will it return after filling up the buffer from the 2nd packet also?
When using a raw socket like this, you get exactly one packet at a time. You will never get more than one. If the buffer you give is not large enough, then the packet will be truncated (with the ending bytes discarded).
(2a) same question, but if I have a smaller buffer of 2000 bytes. It is clear that on the 1st call I will get the first 2000 bytes of the 1st packet. But on the next call, will I get the last 400 bytes of the 1st packet, or the first 2000 bytes of the 2nd packet?
Generally speaking, packets on most networks are limited to about 1514 bytes. This is because the traditional "MTU" (Maximum Transmission Unit) that is configured on the network interface is 1500 bytes, and usually an Ethernet header containing two MAC addresses (6 bytes each) plus a two-byte EtherType is prepended to that. In a switch or router, you may also see packets that carry an additional 4-byte VLAN tag (IEEE 802.1Q). (But some networks internally use "jumbo" packets up to about 9K in size for specific purposes.) So with a 2000-byte buffer you will normally receive each packet whole.
You should also understand that, in writing an application, one can send UDP datagrams (or TCP buffers) larger than the maximum packet size. In that case, the OS breaks those up into smaller chunks for sending (and they are re-assembled on the destination side before being handed to an application). When you're receiving raw packets like this, you will see the packets in their low-level, possibly fragmented, state.
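To make the header layout described above concrete, here is a minimal sketch of how one raw frame could be classified as TCP or UDP. The function name classify_frame and the constants are illustrative, not part of any library, and the sketch assumes IPv4 over Ethernet with the VLAN tag (if any) still present in the captured bytes.

```python
import struct

ETH_P_IP = 0x0800      # EtherType for IPv4
ETH_P_8021Q = 0x8100   # EtherType announcing an 802.1Q VLAN tag
IP_PROTO_NAMES = {6: "TCP", 17: "UDP"}  # IANA protocol numbers


def classify_frame(frame: bytes) -> str:
    """Return "TCP", "UDP", or "other" for one raw Ethernet frame."""
    if len(frame) < 14:
        return "other"
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: EtherType.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    if ethertype == ETH_P_8021Q:
        # A 4-byte VLAN tag pushes the real EtherType to bytes 16-17.
        if len(frame) < 18:
            return "other"
        (ethertype,) = struct.unpack("!H", frame[16:18])
        offset = 18
    # Anything that is not IPv4 with a full 20-byte header is ignored here.
    if ethertype != ETH_P_IP or len(frame) < offset + 20:
        return "other"
    # Byte 9 of the IPv4 header holds the protocol number.
    return IP_PROTO_NAMES.get(frame[offset + 9], "other")
```

In the loop from the question, this would be called as classify_frame(pkt) on each received packet.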
(3) If I am delayed in making the next call, maybe because I was busy processing the 1st dataset, am I in danger of losing data, or will the OS keep its internal queue of the incoming packets to be given to me when I call the next time? If the OS keeps its internal queue, where can I find information about its size?
The OS will keep a queue of packets for you. The size is of course limited, since there is no way you would be able to keep up with, say, a 1 Gb/s NIC at full line rate (let alone a 10 Gb/s or faster NIC). When the queue is full, further incoming packets are dropped until you read some out. The size is configured in a system-specific way. On Linux (and probably other Unix-based systems) you can call getsockopt with SOL_SOCKET / SO_RCVBUF to get an idea of the queue space available.
On Linux, at least, the size can be set with setsockopt up to a system-imposed maximum (which itself can be configured with various sysctl settings).
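As a rough illustration of the two calls mentioned above, the following sketch reads and requests the receive-buffer size from Python. The helper names are made up for this example; the socket options themselves are the standard SOL_SOCKET / SO_RCVBUF pair.

```python
import socket


def receive_buffer_size(sock: socket.socket) -> int:
    """Return the receive-buffer size (in bytes) the kernel reports for sock."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def request_receive_buffer(sock: socket.socket, size: int) -> int:
    """Ask for a receive buffer of `size` bytes and return what was granted.

    The kernel may clamp the request to its configured maximum, so the
    returned value is not necessarily equal to `size`.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return receive_buffer_size(sock)
```

The same calls work on the raw socket s from the question, for example request_receive_buffer(s, 1 << 20). On Linux the reported value is typically double the requested one, because the kernel reserves space for its own bookkeeping.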

I think you should not do that, because TCP provides various guarantees such as reliability, ordering, flow control, and congestion control, whereas UDP does not guarantee anything.
These properties are fixed by the operating system at the moment the socket is created. That is why I think you cannot do what you are describing.
Open two different sockets, one native UDP socket and one native TCP socket.

How does the python socket.recv() method know that the end of the message has been reached?

Let's say I'm using 1024 as buffer size for my client socket:
recv(1024)
Let's assume the message the server wants to send to me consists of 2024 bytes.
Only 1024 bytes can be received by my socket. What's happening to the other 1000 bytes?
1.) Will the recv-method wait for a certain amount of time (say 2 seconds) for more data to come and stop working after this time span? (I.e., if the rest of the data arrives after 3 seconds, will it no longer be received by the socket?)
or
2.) Will the recv-method stop working immediately after having received 1024 bytes of data? (I.e., will the other 1000 bytes be discarded?)
In case 1.) is correct: is there a way for me to determine how long recv should wait before returning, or is that determined by the system? (I.e., could I tell the socket to wait for 5 seconds before it stops waiting for more data?)
UPDATE:
Assume, I have the following code:
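import socket
import sys

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((sys.argv[1], port))  # port is assumed to be defined elsewhere
s.send(b'Hello, world')  # sockets send bytes, not str, in Python 3
data = s.recv(1024)
print("received: {}".format(data))
s.close()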
Assume that the server sends data of size > 1024 bytes. Can I be sure that the variable "data" will contain all the data (including those beyond the 1024th byte)?
If I can't be sure about that, how would I have to change the code so that I can always be sure that the variable "data" will contain all the data sent (in one or many steps) from the server?

It depends on the protocol. Some protocols like UDP send messages, and exactly 1 message is returned per recv. Assuming you are talking about TCP specifically, there are several factors involved. TCP is stream oriented, and because of things like the amount of currently outstanding send/recv data, lost/reordered packets on the wire, delayed acknowledgement of data, and the Nagle algorithm (which holds back some small sends until earlier data has been acknowledged), its behavior can change subtly as a conversation between client and server progresses.
All the receiver knows is that it is getting a stream of bytes. It could get anything from 1 to the fully requested buffer size on any recv. There is no one-to-one correlation between the send call on one side and the recv call on the other.
If you need to figure out message boundaries, it's up to the higher-level protocols to do that. Take HTTP for example. It starts with a \r\n delimited header and then has a count of the remaining bytes the client should expect to receive. The client knows how to read the header because of the \r\n, then knows exactly how many bytes are coming next. Part of the charm of RESTful protocols is that they are HTTP based and somebody else already figured this stuff out!
Some protocols use NUL to delimit messages. Others may have a fixed length binary header that includes a count of any variable data to come. I like zeromq which has a robust messaging system on top of TCP.
More details on what happens with receive...
When you do recv(1024), there are 6 possibilities:
1. There is no receive data. recv will wait until there is receive data. You can change that by setting a timeout.
2. There is partial receive data. You'll get that part right away. The rest is either buffered or hasn't been sent yet and you just do another recv to get more (and the same rules apply).
3. There is more than 1024 bytes available. You'll get 1024 of that data and the rest is buffered in the kernel waiting for another receive.
4. The other side has shut down the socket. You'll get 0 bytes of data. 0 means you will never get more data on that socket. But if you keep asking for data, you'll keep getting 0 bytes.
5. The other side has reset the socket. You'll get an exception.
6. Some other strange thing has gone on and you'll get an exception for that.
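Putting the cases above together, one way to collect everything the server sends is to keep calling recv until it returns 0 bytes. This is only a sketch and rests on an assumption the question does not state: the server closes (or shuts down) its side of the connection after sending. If the connection stays open, the loop blocks and a framing scheme is needed instead. The function name recv_until_closed is made up for this example.

```python
import socket


def recv_until_closed(sock: socket.socket, bufsize: int = 1024) -> bytes:
    """Read from a connected stream socket until the peer closes its side."""
    chunks = []
    while True:
        # Each call returns between 1 and bufsize bytes, or b"" at end of stream.
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

In the code from the question, data = s.recv(1024) would become data = recv_until_closed(s). Calling s.settimeout(5.0) beforehand makes a recv that waits longer than five seconds raise an exception (socket.timeout) instead of blocking forever.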

Socket queue (Twitter streaming as a reference)

I just found out Twitter streaming endpoints support detection of slow connections somehow.
Reference: https://dev.twitter.com/docs/streaming-apis/parameters#stall_warnings (and bottom of page)
The idea is that socket send will probably process data one packet at a time. It knows when a packet has been received by the client, so it can maintain a queue and always know its size.
It's easy when client sends some confirmation packets for each of them. But that is not the case with Twitter Streaming API - it's a one-way transfer.
My question is: how did they achieve that? I can't see a way to do it without some very low level raw socket support, but I may be forgetting something here. With some low level support we could probably get ACKs for each packet. Is that even possible? Can ACKs be somehow traced?
Any other ideas how this was done?
Any way to do this e.g. in Python? Or any other language example would be appreciated.
Or maybe I am in over my head here and it simply tracks how many bytes have not yet been processed through socket.send? But isn't that a poor indication of the client's connection?

I started off thinking along the same lines as you but I think the implementation is actually much easier than we both expect.
Twitter's API docs state:-
"A client reads data too slowly. Every streaming connection is backed by a queue of messages to be sent to the client. If this queue grows too large over time, the connection will be closed." - https://dev.twitter.com/docs/streaming-apis/connecting#Disconnections
Based on the above, I imagine Twitter will have a thread that is pushing tweets onto a queue, and a long-lived HTTP connection to a client (kept open with a while loop) that pops a message off the queue and writes the data to the HTTP response during each loop iteration.
Now if you imagine what happens inside the while loop and you think in terms of buffers, Twitter will pop an item off the queue, then write the tweet data to some kind of output buffer. That buffer will get flushed and then fill up a TCP buffer for transport to the client.
If a client reads data slowly from its TCP buffer, the server's TCP send buffer fills up. When the server then flushes its output buffer, the flush blocks because the data cannot be written to the TCP buffer. As a result the while loop pops tweets off the queue less often, and the tweet queue fills up.
Now you would just need a check at the beginning of each loop iteration to see whether the tweet queue has reached some predefined threshold.
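The loop described above might look roughly like the sketch below. It is a guess at the general shape, not Twitter's actual implementation; the function name pump, the write callback and the max_backlog threshold are all hypothetical.

```python
import queue


def pump(backlog: queue.Queue, write, max_backlog: int) -> str:
    """Drain `backlog` through `write` until it is empty or has grown too long.

    Returns "stalled" if the backlog exceeded max_backlog (the caller would
    then close the connection), or "drained" once the queue is empty.
    """
    while True:
        # Check the queue length at the top of every iteration.
        if backlog.qsize() > max_backlog:
            return "stalled"
        try:
            message = backlog.get_nowait()
        except queue.Empty:
            return "drained"
        # In a real server this would be a blocking socket write. When the
        # client reads slowly, this call is what holds the loop up while
        # another thread keeps adding messages to the backlog.
        write(message)
```

A real server would wait for new messages rather than return when the queue is empty, and write would be something like a blocking sendall on the client's socket. The point is only that no ACK tracing is needed: the length of the application-level queue is the signal.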

How can I deserialize incoming data on the TCP server?

I've set up a server reusing the code found in the documentation where I have self.data = self.request.recv(1024).strip().
But how do I go from this to deserializing it into a protobuf message (Message.proto/Message_pb2.py)? Right now it seems that it's receiving chunks of 1024 bytes, and more than one at a time... making it all rubbish :D

TCP is typically just a stream of data. Just because you sent each packet as a unit, doesn't mean the receiver gets that. Large messages may be split into multiple packets; small messages may be combined into a single packet.
The only way to interpret multiple messages over TCP is with some kind of "framing". With text-based protocols, a CR/LF/CRLF/zero-byte might signify the end of each frame, but that won't work with binary protocols like protobuf. In such cases, the most common approach is to simply prepend each message with the length, for example in a fixed-size (4 bytes?) network-byte-order chunk. Then the payload. In the case of protobuf, the API for your platform may also provide a mechanism to write the length as a "varint".
Then, reading is a matter of:
1. read an entire length-header
2. read (and buffer) that many bytes
3. process the buffered data
4. rinse and repeat
But keep in mind that you might have (in a single packet) the end of one message, 2 complete messages, and the start of another message (maybe half of the length-header, just to make it interesting). So keeping track of exactly what you are reading at any point becomes paramount.
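The reading steps above can be sketched in Python as follows. This assumes the fixed 4-byte network-byte-order length prefix mentioned earlier, and that the sender uses the same convention. The names encode_frame and FrameDecoder are made up for this example and are not part of protobuf or socketserver.

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order


def encode_frame(payload: bytes) -> bytes:
    """Prefix a payload with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Accumulates received bytes and yields only complete payloads."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, data: bytes) -> list:
        """Add newly received bytes; return every payload now complete."""
        self._buffer.extend(data)
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # payload not fully received yet; wait for more data
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]  # drop the frame that was just consumed
        return messages
```

In the handler, each chunk returned by self.request.recv(1024) would be passed to feed() without calling strip() on it, since stripping can remove bytes that belong to a binary message. Each returned payload could then be handed to ParseFromString on an instance of the generated message class from Message_pb2.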

Are twisted RPCs guaranteed to arrive in order?

I'm using twisted to implement a client and a server. I've set up RPC between the client and the server. So on the client I do protocol.REQUEST_UPDATE_STATS(stats), which translates into sending a message with transport.write on the client transport that is some encoded version of ["update_stats", stats]. When the server receives this message, the dataReceived function on the server protocol is called, it decodes it, and calls a function based on the message, like CMD_UPDATE_STATS(stats) in this case.
If, on the client, I do something like:
protocol.REQUEST_UPDATE_STATS("stats1")
protocol.REQUEST_UPDATE_STATS("stats2")
...am I guaranteed that the "stats1" message arrives before the "stats2" message on the server?
UPDATE: Edited for more clarity. But now the answer seems obvious - no way.

They will arrive in the order that the request is received by the Python process. This includes the connection setup time plus the packets containing the request data. So no, this is not guaranteed to be the order that the sending processes sent the request, because of network latency, dropped packets, sender-side packet queuing, etc. "In-order" is also loosely defined for distributed systems.
But yes, in general you can count on them being delivered in order as long as they're separated by a relatively large amount of time (hundreds of milliseconds over the internet).
