Convert ASCII data to hex/binary/bytes in Python


The protocol for a device I'm working with sends a UDP packet to a server (my Python program, using Twisted for networking) in a comma-separated ASCII format. In order to acknowledge that my application received the packet correctly, I have to take a couple of values from the header of that packet, convert them to binary and send them back. I have everything set up, though I can't seem to figure out the binary part.
I've been going through a lot of stuff and I'm still confused. The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
The format of the acknowledgement requires me to send "0xFE 0x02" + an 8-byte unsigned integer (IMEI number, 15 digits) + a 2-byte unsigned integer (Sequence ID).
How would I go about converting the ASCII text values that I have into "binary"?

First:
The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
Well, it's impossible to print actual binary data in a human-readable form, so documentation will usually give a sequence of hexadecimal bytes instead. They could use a bytes literal like b'\xFE\x02' or something similar, but that's still effectively hexadecimal to the human reader, right?
So, if they say "binary", they probably mean "binary", and the hex is just how they're showing you what binary bytes you need.
So, you need to convert the ASCII representation of a number into an actual number, which you do with the int function. Then you need to pack that into 8 bytes, which you do with the struct module.
You didn't mention whether you needed big-endian or little-endian. Since this sounds like a network protocol, and it sounds like it wasn't designed by Microsoft, I would guess big-endian, but you should actually know, not guess.
So:
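import struct
imei_string = '1234567890123456789'
imei_number = int(imei_string) # 1234567890123456789
imei_bytes = struct.pack('>Q', imei_number) # b'\x11\x22\x10\xf4\x7d\xe9\x81\x15'
seq_number = 1 # placeholder; use the sequence ID from the packet header
seq_bytes = struct.pack('>H', seq_number)
buf = b'\xFE\x02' + imei_bytes + seq_bytes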
(You didn't say where you're supposed to get the sequence number from, but wherever it comes from, you'll pack it the same way, just using >H instead of >Q.)
If you actually did need a hex string rather than binary, you'd need to know exactly what format. The binascii.hexlify function gives you "bare hex", two characters per byte, no 0x or other header or footer. If you want something different, well, it depends on what exactly you want; no format is really that hard. But, again, I'm pretty sure you don't need a hex string here.
One last thing: Since you didn't specify your Python version, I wrote this in a way that's compatible with both 2.6+ and 3.0+; if you need to use 2.5 or earlier, just drop the b prefix on the literal in the buf = line.
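Putting the steps above together, here is a minimal sketch of an ACK builder for Python 3. It assumes big-endian byte order and that the IMEI and sequence ID arrive as ASCII decimal strings; build_ack and its parameter names are made up for illustration and do not come from the device documentation.

```python
import binascii
import struct

ACK_HEADER = b'\xFE\x02'  # the two fixed bytes the documentation shows as "0xFE 0x02"

def build_ack(imei_string, seq_string):
    """Build the ACK payload from ASCII decimal fields (hypothetical helper)."""
    imei_number = int(imei_string)  # ASCII digits -> integer
    seq_number = int(seq_string)
    # '>' = big-endian (an assumption), Q = 8-byte unsigned, H = 2-byte unsigned
    return ACK_HEADER + struct.pack('>QH', imei_number, seq_number)

ack = build_ack('123456789012345', '7')
print(len(ack))               # 2 header bytes + 8 IMEI bytes + 2 sequence bytes
print(binascii.hexlify(ack))  # bare hex, handy for comparing against the documentation
```

If the device turns out to be little-endian, only the '>' in the format string needs to change to '<'.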

Related

Python Websockets receive Raw Bytes

I've been using the python websockets library of 8.1 version. This has been a good tool for receiving string data, yet now I have experienced a need to receive a mixture of string and bytes data.
Let me explain.
There is a socket, which encodes its data with the algorithm, which uses character codes as numbers.
For example, at the beginning of the message it has a character c, whose ord(c) == 777. It doesn't mean, though, that this is a chr(777), as a human would read it. It represents, for example, the type of message the client got. So it's a message with the type 777, and there is an algorithm to handle this type of message. The next character would represent the length of the message. And so on.
There is a problem, though. There is a message whose type is 0, which means a NUL byte. When the client receives such a string, for some reason either Python or the websockets recv method interprets it as a space character, resulting in ord(c) == 32 instead of ord(c) == 0. This, obviously, makes the whole message incorrect. I could replace 32 with 0 and vice versa, yet it would lead to even more errors, as those characters are not interchangeable.
I suppose, if I received bytes instead of str, the problem would go away? But I cannot seem to find a method for that. Maybe, someone had such experience in the past with python and/or websockets incorrectly treating unreadable characters?

You can use the struct module to build your packets
https://docs.python.org/3/library/struct.html
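As a rough illustration of that suggestion, the sketch below packs a message type and a payload length as 16-bit unsigned integers in front of a bytes payload, so type 0 is an ordinary zero value rather than a NUL character inside text. The header layout and the function names are assumptions made for the example, not the format the socket in the question actually uses.

```python
import struct

# Hypothetical header layout: message type, then payload length, both big-endian uint16
HEADER = struct.Struct('>HH')

def encode_message(msg_type, payload):
    """Prefix a bytes payload with its type and length."""
    return HEADER.pack(msg_type, len(payload)) + payload

def decode_message(raw):
    """Split a packed message back into (type, payload)."""
    msg_type, length = HEADER.unpack_from(raw, 0)
    payload = raw[HEADER.size:HEADER.size + length]
    return msg_type, payload

raw = encode_message(0, b'hello')
print(decode_message(raw))
```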

Convert from python struct.pack (big-endian) to list of integers

*EDIT: Title is incorrect, Big-Endian should be Little-Endian. Didn't want to change due to solutions provided.
I am trying to convert a byte string (e.g. b'\x01\x00\x00\x00', a 32-bit integer) back to an integer in my C program.
Client (in Python):
import struct
example = [1,2,3]
packed = struct.pack('i'*len(example), *example) # one native-order int per list element
<Send over open socket to server>
Server (in C):
char buffer[1024];
numbytes = recv(sockfd,buffer,1023,0);
char message[numbytes];
memcpy(message,buffer,numbytes);
<If 'message' is sent back, I can unpack on client>
??? How to unpack on C then repack to send response to client ???
In C, I want to 'unpack' into a array/struct
Thanks!

Assuming the title of your question is correct, and the values are actually in big-endian order, you want the ntohl (network to host long) function. Call this function for each of the 32-bit integers to convert them into the host byte order.
Based on the value b'\x01\x00\x00\x00' it seems more likely that you're encoding the values in little-endian order, and that is in fact what the struct.pack call you showed will produce if you run it on a little-endian machine. Your client and server probably are both running on little-endian hardware (although you don't specify that, so it's impossible to be 100% certain).
In any case, whatever form you use, you need to use the same endianness on both sides. It's probably best to not make your wire protocol endianness-dependent, so you should probably ensure that both the client and the server convert bytes into and out of a common endianness. Internet standards specify big-endian as the standard for network protocols.
If you decide to standardize on big-endian, here's what you need to do:
Change your struct.pack call to select big-endian encoding of integers. You can do this by adding a '>' prefix to the struct definition.
Change your C code to read each integer one at a time (four bytes for a 32-bit value), and then pass the values through ntohl to get a uint32_t.
Re-assemble the integers into your struct on the server side.
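On the Python side, the first step and the matching decode of a reply can be sketched as follows. This assumes the payload is nothing but 4-byte signed integers back to back; the variable names are illustrative only.

```python
import struct

values = [1, 2, 3]

# '>' forces big-endian regardless of the host CPU; one 'i' (4-byte signed int) per value
packed = struct.pack('>%di' % len(values), *values)

# Decoding a reply: derive the count from the byte length, then unpack with the same order
count = len(packed) // 4
unpacked = list(struct.unpack('>%di' % count, packed))

print(packed)
print(unpacked)
```

With this format the first integer goes out as b'\x00\x00\x00\x01', which is the byte order ntohl expects on the C side.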

Can you compress bytes in python and send them?

I am writing a TCP python script, and I need the first 4 bytes to be the size of the file.
I got the size of the file by doing
SIZE_OF_FILE = os.path.getsize(infile.name)
The size is 392399 bytes.
When I do
s.send(str(SIZE_OF_FILE).encode("utf-8"))
it sends the file, and then on my server I have
fileSize = conn.recv(4).decode('utf-8')
This should read the first 4 bytes and extract the file size, but it returns 3923 instead of 392399. What happened? "392399" should be able to fit into 4 bytes.
We are supposed to be using big endian.

This is because str(SIZE_OF_FILE) typesets the number using decimal notation - that is, you get the string "392399", which is 6 characters (and 6 bytes in UTF-8). If you send only the first 4, you are sending "3923".
What you probably want to do is use something like struct.pack to create a bytestring containing the binary representation of the number.
s.send(struct.pack(format_string, SIZE_OF_FILE))

You are sending the size as a string ("392399"), which is 6 ASCII characters and therefore 6 bytes. You want to send it as a raw integer; use struct.pack to do that:
s.send(struct.pack(">i", SIZE_OF_FILE))
To receive:
fileSize = struct.unpack(">i", conn.recv(4))[0]
The > makes it big-endian. To make it little-endian, use < instead. i is the type; in this case, a 4-byte signed integer. The struct module documentation has a list of types, in case you want to use another one.
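One caveat with the receive side: on a TCP socket, recv(4) may return fewer than 4 bytes, so it is safer to loop until the whole prefix has arrived. Below is a minimal sketch of that idea. The helper names send_size, recv_exact and recv_size are made up for this example, and it uses the unsigned format '>I' on the assumption that a file size is never negative.

```python
import socket
import struct

SIZE_FORMAT = '>I'  # big-endian, 4-byte unsigned integer

def send_size(sock, size):
    """Send the size as a 4-byte binary prefix."""
    sock.sendall(struct.pack(SIZE_FORMAT, size))

def recv_exact(sock, count):
    """Keep calling recv until exactly `count` bytes have been read."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError('socket closed before %d bytes arrived' % count)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)

def recv_size(sock):
    """Read the 4-byte prefix and return it as an integer."""
    return struct.unpack(SIZE_FORMAT, recv_exact(sock, 4))[0]

# Local demonstration over a connected socket pair instead of a real client and server
left, right = socket.socketpair()
send_size(left, 392399)
received = recv_size(right)
left.close()
right.close()
print(received)
```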

string to wstring in python

I have a UDP socket which receives datagrams of different lengths.
The first byte of the datagram specifies what type of data it carries, for example 64 means bool false, 65 means bool true, 66 means sint, 67 means int and so on. Most of the datatypes have a known length, but when it comes to string and wstring, the first byte says 85-means string, next 2 bytes says string length followed by actual string. For wstring 85, next 2 bytes says wstring length, followed by actual wstring.
To parse the above kind of wstring format b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001' I used the following code
data = str(rawdata[3:]).split("\\x00")
data = "".join(data[1:])
data = "".join(data[:-1])
Is this correct or any other simple way?
Just as I receive datagrams, I also need to send them. But I do not know how to create the datagrams, as socket.sendto requires bytes. If I convert a string to UTF-16, will it become a wstring? If so, how would I add the rest of the information to the bytes?
From the above datagram information U-85 which is wstring, \x00\x07 - 7 length of the wstring data, \x00C\x00o\x00u\x00p\x00o\x00n\x001 - is the actual string Coupon1

A complete answer depends on exactly what you intend to do with the resulting data. Splitting the string with '\x00' (assuming that's what you meant to do? not sure I understand why there are two backslashes there) doesn't really make sense. The reason for using a wstring type in the first place is to be able to represent characters that aren't plain old 8-bit (really 7-bit) ascii. If you have any characters that aren't standard Roman characters, they may well have something other than a zero byte separating the characters in which case your split result will make no sense.
Caveat: Since you mentioned sendto requiring bytes, I assume you're using python3. Details will be slightly different under python2.
Anyway if I understand what it is you're meaning to do, the "utf-16-be" codec may be what you're looking for. (The "utf-16" codec puts a "byte order marker" at the beginning of the encoded string which you probably don't want; "utf-16-be" just puts the big-endian 16-bit chars into the byte string.) Decoding could be performed something like this:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
dtype = rawdata[0]  # indexing bytes yields an int in Python 3
if dtype == 85:  # wstring
    dlen = int.from_bytes(rawdata[1:3], 'big')  # 2-byte big-endian length, in 16-bit characters
    data = rawdata[3:(dlen * 2) + 3]
    dstring = data.decode('utf-16-be')
This will leave dstring as a python unicode string. In python3, all strings are unicode. So you're done.
Encoding it could be done something like this:
tosend = 'Coupon1'
snd_data = bytearray([85]) # wstring indicator
snd_data += bytearray([(len(tosend) >> 8), (len(tosend) & 0xff)])
snd_data += tosend.encode('utf-16-be')
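The two directions can also be wrapped in a pair of helpers built on struct. This is only a sketch under the layout described in the question (a tag byte of 85, a 2-byte big-endian length, then UTF-16-BE text). The function names are invented here, and the length is taken to be a count of 16-bit code units, which is an assumption that matches the 'Coupon1' example.

```python
import struct

WSTRING_TAG = 85  # ord('U'), the wstring indicator from the question

def encode_wstring(text):
    """Build tag + length + UTF-16-BE payload for sending."""
    payload = text.encode('utf-16-be')
    # Length is counted in 16-bit code units (assumption), hence // 2
    return struct.pack('>BH', WSTRING_TAG, len(payload) // 2) + payload

def decode_wstring(raw):
    """Parse a wstring datagram back into a Python str."""
    tag, length = struct.unpack_from('>BH', raw, 0)
    if tag != WSTRING_TAG:
        raise ValueError('not a wstring datagram: tag %d' % tag)
    return raw[3:3 + length * 2].decode('utf-16-be')

datagram = encode_wstring('Coupon1')
print(datagram)
print(decode_wstring(datagram))
```

The resulting bytes object can be passed straight to socket.sendto.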

Python convert mixed ASCII code to String

I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help

It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.

If you just need to trim the '\x04\x08"\n', and it's always the same (your question isn't very clear, so I'm not certain that's what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
