I'm attempting to reverse engineer an encoding algorithm to ensure backwards compatibility with other software packages. For each type of quantity to be encoded in the output file, there is a separate encoding procedure.
The given documentation only shows the end user how to parse values from the encoded file, not how to write anything back to it. However, I have been able to successfully create a corresponding write function for every documented read function in every file type, except for the read_string() below.
I am currently (and have been for a while) struggling to wrap my head around exactly what is going on in the read_string() function listed below.
I understand that this is a masking problem: the condition while partial_length & 0x80 > 0: is a simple bitwise mask that only keeps us in the loop while the byte we are examining has its high bit set (i.e. is 128 or larger). Where I lose my head is in trying to assign or extract meaning from the body of that while loop. I get the mathematical machinery behind the operations, but I can't see why they would be doing things this way.
I have included the read_byte() function for context, as it is called in the read_string() function.
import struct

def read_byte(handle):
    return struct.unpack("<B", handle.read(1))[0]

def read_string(handle):
    total_length = 0
    partial_length = read_byte(handle)
    num_bytes = 0
    while partial_length & 0x80 > 0:
        total_length += (partial_length & 0x7F) << (7 * num_bytes)
        partial_length = ord(struct.unpack("c", handle.read(1))[0])
        num_bytes += 1
    total_length += partial_length << (7 * num_bytes)
    result = handle.read(total_length)
    result = result.decode("utf-8")
    if len(result) < total_length:
        raise Exception("Failed to read complete string")
    else:
        return result
Is this indicative of an impossible task due to information loss, or am I missing an obvious way to perform the opposite of this read_string function?
I would greatly appreciate any information, insights (however obvious you may think they may be), help, or pointers possible, even if it means just a link to a page that you think might prove useful.
Cheers!
It's just reading a length, which then tells it how many bytes to read. (I don't get the check at the end, but that's a different issue.)
In order to avoid a fixed length for the length, the length is divided into seven-bit units, which are sent low-order chunk first. Each seven-bit unit is sent in a single 8-bit byte with the high-order bit set, except the last unit which is sent as is. Thus, the reader knows when it gets to the end of the length, because it reads a byte whose high-order bit is 0 (in other words, a byte less than 0x80).
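Given that, writing the inverse is mechanical: emit the byte length of the UTF-8 data as 7-bit chunks, low-order first, with the high bit set on every chunk but the last, then write the data itself. A minimal sketch, assuming a write_byte() that mirrors the read_byte() above:

import struct

def write_byte(handle, value):
    handle.write(struct.pack("<B", value))

def write_string(handle, text):
    encoded = text.encode("utf-8")
    length = len(encoded)
    # emit the length as 7-bit chunks, low-order chunk first;
    # every chunk except the last gets its high bit set
    while length >= 0x80:
        write_byte(handle, (length & 0x7F) | 0x80)
        length >>= 7
    write_byte(handle, length)
    handle.write(encoded)

For example, a 300-byte string would emit the length bytes 0xAC 0x02, which the read_string() above reconstructs as (0x2C << 0) + (0x02 << 7) == 300.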
This is... a long one. So I apologize for any inconsistency regarding code and problems. I'll be sure to try and add as much of the source code as I can to make sure the issue is as clear as possible.
This project at work is an attempt at converting Python 2 to 3, and thus far has been mildly straightforward. My coworker and I have reached a point though where no amount of googling or searching has given a straight answer, so here we are.
Alright, starting off with...
Python 2 code:
listBytes[102:104]=struct.pack('H',rotation_deg*100) # rotational position in degrees
listBytes[202:204]=struct.pack('H',rotation_deg*100) # rotational position in degrees
listBytes[302:304]=struct.pack('H',rotation_deg*100) # rotational position in degrees
# this continues on for a while in the same fashion
Where rotation_deg is a float between 0.00 and 359.99 (but for testing is almost always changing between 150-250)
For the purpose of testing, we're going to make rotation_deg be 150.00 all the time.
a = listBytes[102:104]=struct.pack('H',150.00*100)
print a
print type(a)
The printout of the above is:
�:
<type 'str'>
From what I understand, the Python 2 version of struct.pack packs the float as a short, which is then "added" to the list as a two-byte value. Python 2 sees it as a str and applies no encoding to it (we'll get to that later for Python 3). All simple and good; then, after a few more bits and bobs of dropping more stuff into the list, we get to:
return ''.join(listBytes)
Which is being sent back to a simple variable:
bytes=self.UpdatePacket(bytes,statusIndex,rotation_deg,StatusIdList,StatusValueList, Stat,offsetUTC)
To then be sent along as a string through
sock.sendto(bytes , (host, port) )
This all comes together to look like this:
A string with a bunch of bytes (I think)
This is the working version, in which we are sending the bytes along the socket, data is being retrieved, and everyone is happy. If I missed anything, please let me know, otherwise, lets move to...
Python 3
This is where the Fun Begins
There are a few changes that are required between Python 2 and 3 right off the bat.
struct.pack('H', rotation_deg * 100) requires an int to be packed, meaning all instances of packing had to be given int(rotation_deg * 100) so as not to error out the program.
sock.sendto(bytes, (host, port)) did not work anymore, as the socket needs a bytes object to send anything. No more strings that look like bytes; they had to be encoded first. So this became sock.sendto(bytes.encode(), (host, port)) to encode the "bytes" string.
As more background, the length of listBytes should always be 1206. Any more and our socket won't work properly, and the issue is that no matter what we try with this Python 3 code, the .join seems to be sending a LOT more than just byte objects, often quintupling the length of listBytes and breaking socket.sendto.
listBytes[102:104] = struct.pack('H', int(rotation_deg * 100)) # rotational position in degrees
listBytes[202:204] = struct.pack('H', int(rotation_deg * 100)) # rotational position in degrees
listBytes[302:304] = struct.pack('H', int(rotation_deg * 100)) # rotational position in degrees
# continues on in this fashion again
return ''.join(str(listBytes))
returns to:
bytes = self.UpdatePacket(bytes, statusIndex, rotation_deg, StatusIdList, StatusValueList, Stat, offsetUTC)
sock.sendto(bytes.encode(), (host, port))
Here's where things start getting weird
a = struct.pack('H', int(150.00 * 100))
returns:
b'\x98:', with its type being <class 'bytes'>, which is fine and the value we want, except we specifically need to store this variable in the list as (maybe) a string... to encode it later and send it as a bytes object over the socket.
You're starting to see the problem, yes?
The thing is, we've tried just about every technique to convert the two bytes that struct.pack returns into a string of some kind, and we've been able to convert it over, but then we run into the issue of the .join being evil.
Remember when I was talking about how listBytes had to remain a size of 1206 or else it would break? For some reason, if we .join literally anything other than the two bytes as a string, we think Python is adding a bunch of other stuff that we don't need.
So for now, we're focusing on trying to match the python 2 equivalent to python 3.
Here's what we've tried
binascii.hexlify(struct.pack('H', int(150.00 * 100))).decode() returns '983a'
str(struct.pack('H', int(150.00 * 100.00)).decode()) returns an error, 'utf-8' codec can't decode byte 0x98 in position 0: invalid start byte
str(struct.pack('H', int(150.00 * 100.00)).decode("utf-16")) returns '㪘'. Can't even begin to understand that.
return b''.join(listBytes) returns an error because there are ints at the start of the list.
return ''.join(str(listBytes)).encode('utf-8') still is adding a bunch of nonsense.
Now we get to the .join, and the first loop around it seems... fine? It has 1206 as the length of listBytes before .joining, but on the second loop around it creates a massive influx of junk, making the list 5503 in length. The third go-around it becomes 27487, and finally on the last go-around it becomes too large for the socket to handle and I get slapped with [WinError 10040] A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself
Phew, if you made it this far, thank you. Any help at all would be extremely appreciated. If you have questions or I'm missing something, let me know.
Thanks!
You’re just trying too hard. It may be helpful to think of bytes and unicode rather than the ambiguous str type (which is the former in Python 2 and the latter in Python 3). struct always produces bytes, and socket needs to send bytes, regardless of the language version. Put differently, everything outside your process is bytes, although some represent characters in some encoding chosen by the rest of the system. Characters are a data type used to manipulate text within your program, just like floats are a data type used to manipulate “real numbers”. You have no characters here, so you definitely don’t need encode.
So just accumulate your bytes objects and then join them with b''.join (if you can’t just directly use the buffer into which your slice assignments seem to be writing) to keep them as bytes.
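For instance, a minimal sketch of the "use the buffer directly" route, assuming listBytes is really meant to be a fixed 1206-byte packet (the host, port, and rotation value here are placeholders):

import socket
import struct

rotation_deg = 150.00                      # placeholder value from the question
host, port = "127.0.0.1", 9999             # hypothetical destination

packet = bytearray(1206)                   # fixed-size buffer: always exactly 1206 bytes
packet[102:104] = struct.pack('H', int(rotation_deg * 100))
packet[202:204] = struct.pack('H', int(rotation_deg * 100))
packet[302:304] = struct.pack('H', int(rotation_deg * 100))
# ... and so on for the remaining offsets ...

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(bytes(packet), (host, port))   # already bytes; no .encode() anywhere

Slice-assigning a 2-byte pack result into a 2-element slice of a bytearray keeps the buffer at exactly 1206 bytes, which is what the repeated ''.join(str(...)) was destroying.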
So I had a question about a serialization algorithm I just came up with; I wanted to know if it already exists and whether there's a better version out there.
So we know normal algorithms use a delimiter to join the words in a list, but then you have to look through every word for occurrences of the delimiter, escape them, etc., or the serialization algorithm isn't robust. I thought a more intuitive approach would be to use a higher-level language like Python, where len() is O(1), and prepend each word's length to it. See, for example, the code I attached below.
Wouldn't this be faster, because instead of going through every letter of every word we just go through every word? And during deserialization we don't have to look through every character to find the delimiter; we can skip directly to the end of each word.
The only problem I see is that double digit sizes would cause problems, but I'm sure there's a way around that I haven't found yet.
It was suggested to me that protocol buffers are similar to this idea, but I haven't understood why yet.
def serialize(list_o_words):
    return ''.join(str(len(word)) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index])
        next_index = index + word_length + 1
        deserialized_list.append(serialized_list_o_words[index+1:next_index])
        index = next_index
    return deserialized_list

serialized_list = "some,comma,separated,text".split(",")
print(serialize(serialized_list))
print(deserialize(serialize(serialized_list)) == serialized_list)
Essentially, I want to know how I can handle double digit lengths.
There are many variations on length-prefixed strings, but the key bits come down to how you store the length.
You're deserializing the lengths as a single-character ASCII number, which means you can only handle lengths from 0 to 9. (You don't actually test that on the serialize side, so you can generate garbage, but let's forget that.)
So, the obvious option is to use 2 characters instead of 1. Let's add in a bit of error handling while we're at it; the code is still pretty easy:
def _len(word):
    s = format(len(word), '02')
    if len(s) != 2:
        raise ValueError(f"Can't serialize {word!r}; it's too long")
    return s

def serialize(list_o_words):
    return ''.join(_len(word) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index + 1 < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index:index+2])
        next_index = index + word_length + 2
        deserialized_list.append(serialized_list_o_words[index+2:next_index])
        index = next_index
    return deserialized_list
But now you can't handle strings >99 characters.
Of course you can keep adding more digits for longer strings, but if you think "I'm never going to need a 100,000-character string"… you are going to need it, and then you'll have a zillion old files in the 5-digit format that aren't compatible with the new 6-digit format.
Also, this wastes a lot of bytes. If you're using 5-digit lengths, s encodes as 00001s, which is 6x as big as the original value.
You can stretch things a lot farther by using binary lengths instead of ASCII. Now, with two bytes, we can handle lengths up to 65535 instead of just 99. And if you go to four or eight bytes, that might actually be big enough for all your strings ever. Of course this only works if you're storing bytes rather than Unicode strings, but that's fine; you probably needed to encode your strings for persistence anyway. So:
import struct

def _len(word):
    # struct.pack already raises an exception for lengths > 65535
    return struct.pack('>H', len(word))

def serialize(list_o_words):
    utfs8 = (word.encode() for word in list_o_words)
    return b''.join(_len(utf8) + utf8 for utf8 in utfs8)
Of course this isn't very human-readable or -editable; you need to be comfortable in a hex editor to replace a string in a file this way.
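For completeness, the matching deserialize might look something like this (a sketch built on the same assumptions: a big-endian 2-byte length followed by that many UTF-8 bytes):

import struct

def deserialize(blob):
    index = 0
    words = []
    while index < len(blob):
        # 2-byte big-endian length, then that many bytes of UTF-8
        (length,) = struct.unpack_from('>H', blob, index)
        index += 2
        words.append(blob[index:index + length].decode())
        index += length
    return words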
Another option is to delimit the lengths. This may sound like a step backward—but it still gives us all the benefits of knowing the length in advance. Sure, you have to "read until comma", but you don't have to worry about escaped or quoted commas the way you do with CSV files, and if you're worried about performance, it's going to be much faster to read a buffer of 8K at a time and chunk through it with some kind of C loop (whether that's slicing, or str.find, barely matters by comparison) than to actually read either until comma or just two bytes.
This also has the benefit of solving the sync problem. With delimited values, if you come in mid-stream, or get out of sync because of an error, it's no big deal; just read until the next unescaped delimiter and worst-case you missed a few values. With length-prefixed values, if you're out of sync, you're reading arbitrary characters and treating them as a length, which just throws you even more out of sync. The netstring format is a minor variation on this idea, with a tiny bit more redundancy to make sync problems easier to detect/recover from.
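A sketch of that delimited-length idea, using a colon after each ASCII length (similar in spirit to netstrings, though the real format also adds a trailing comma and some validation):

def serialize(list_o_words):
    # ASCII length, a colon delimiter, then the word itself
    return ''.join(f'{len(word)}:{word}' for word in list_o_words)

def deserialize(s):
    index = 0
    words = []
    while index < len(s):
        colon = s.index(':', index)            # scan only the short length field
        length = int(s[index:colon])
        words.append(s[colon + 1:colon + 1 + length])
        index = colon + 1 + length
    return words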
Going back to binary lengths, there are all kinds of clever tricks for encoding variable-length numbers. Here's one idea, in pseudocode:
if the current byte is < hex 0x80 (128):
    that's the length
else:
    add the low 7 bits of the current byte
    plus 128 times (recursively process the next byte)
Now you can handle short strings with just 1 byte of length, but if a 5-billion-character string comes along, you can handle that too.
Of course this is even less human-readable than fixed binary lengths.
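In Python, that pseudocode translates almost literally (a sketch; read_byte is assumed to be a callable that returns the next byte of the stream as an int):

def read_length(read_byte):
    b = read_byte()
    if b < 0x80:
        return b                                   # last (or only) byte: that's the length
    return (b & 0x7F) + 128 * read_length(read_byte)

For example, the bytes 0xAC, 0x02 decode as 44 + 128 * 2 == 300.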
And finally, if you ever want to be able to store other kinds of values, not just strings, you probably want a format that uses a "type code". For example, use I for 32-bit int, f for 64-bit float, D for datetime.datetime, etc. Then you can use s for strings <256 characters with a 1-byte length, S for strings <65536 characters with a 2-byte length, z for string <4B characters with a 4-byte length, and Z for unlimited strings with a complicated variable-int length (or maybe null-terminated strings, or maybe an 8-byte length is close enough to unlimited—after all, nobody's ever going to want more than 640KB in a computer…).
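As a rough illustration of the type-code idea (a sketch covering only a few of the codes described above; the exact letters and widths are arbitrary choices, not an established format):

import struct

def pack_value(value):
    # hypothetical type codes, loosely following the scheme described above
    if isinstance(value, int):
        return b'I' + struct.pack('>i', value)               # 32-bit int
    if isinstance(value, float):
        return b'f' + struct.pack('>d', value)               # 64-bit float
    data = value.encode()
    if len(data) < 256:
        return b's' + struct.pack('>B', len(data)) + data    # 1-byte length
    return b'S' + struct.pack('>H', len(data)) + data        # 2-byte length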
I'm writing a Python program that implements Huffman compression. However, it seems that I can only read/write a binary file byte by byte instead of bit by bit. Is there any workaround for this problem? Wouldn't processing byte by byte defeat the purpose of compression, since extraneous padding would be needed? Also, it'd be great if someone could enlighten me about the application of Huffman compression with regard to this byte-by-byte problem.
A potential way to only have to read bytes is by buffering directly in the decoding routine. This combines well with table-based decoding, and does not have the overhead of ever doing bit-by-bit I/O (hiding that behind layers of abstraction doesn't make it go away, it just sweeps it under the carpet).
In the simplest case, table-based decoding needs a "window" of the bit stream that is as large as [1] the largest possible code (incidentally, this sort of thing is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't super long [2]), which can be created by shifting a buffer to the right until it has the correct size:
window = buffer >> (bitsInBuffer - maxCodeLen)   # keep only the top maxCodeLen bits
Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:
while bitsInBuffer < maxCodeLen:
    buffer = (buffer << 8) | readByte()
    bitsInBuffer += 8
Thus byte I/O is sufficient. Actually you could read slightly bigger blocks (e.g. two bytes at a time) if you wanted. By the way, there is a slight complication here: if all bytes of the file have been read and the buffer does not have enough bits in it (which is a legitimate condition that can happen for valid bitstreams), you just have to fill with "padding" (basically shift left without ORing in new bits).
Decoding itself could look like this:
# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol
This is all very easy; the hard part is constructing the table. One way to do it (not a great way, but a simple way) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decode the first symbol in it using bit-by-bit tree-walking the way you're used to. A faster way is to take every symbol/code pair and use it to fill in the right entries of the table:
# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(1 << bottomSize):   # fill every entry whose top bits match this code
    table[topBits | bottom] = (symbol, codeLen)
By the way none of this code has been tested, it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.
[1] Some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.
[2] E.g. 15 bits max for Deflate.
Layer your code. Have a bottom I/O layer that does all file reads and writes, either the entire file at once or with buffering. Have a layer above that which processes the Huffman-code bitstream by bits.
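For example, the two layers might look roughly like this (a sketch with hypothetical BitWriter/BitReader classes; MSB-first bit packing within each byte is an assumption):

class BitWriter:
    def __init__(self, handle):
        self.handle, self.acc, self.nbits = handle, 0, 0

    def write_bits(self, value, count):
        # append `count` bits, most significant bit first
        self.acc = (self.acc << count) | (value & ((1 << count) - 1))
        self.nbits += count
        while self.nbits >= 8:
            self.nbits -= 8
            self.handle.write(bytes([(self.acc >> self.nbits) & 0xFF]))
            self.acc &= (1 << self.nbits) - 1

    def flush(self):
        # pad the final partial byte with zero bits
        if self.nbits:
            self.handle.write(bytes([(self.acc << (8 - self.nbits)) & 0xFF]))
            self.acc = self.nbits = 0


class BitReader:
    def __init__(self, handle):
        self.handle, self.acc, self.nbits = handle, 0, 0

    def read_bit(self):
        if self.nbits == 0:
            self.acc = self.handle.read(1)[0]   # IndexError here means end of stream
            self.nbits = 8
        self.nbits -= 1
        return (self.acc >> self.nbits) & 1

Only the bottom layer ever touches the file; the Huffman encoder/decoder above it deals purely in bits, and the padding in flush() is the only place the byte granularity shows through.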
I have a massive string I'm trying to parse as a series of tokens in string form, and I've found a problem: because many of the strings are alike, doing string.replace() will sometimes cause previously replaced characters to be replaced again.
Say the string being replaced is 'goto'; it gets replaced by '41' (hex) and converted into ASCII ('A'). Later on, the string 'A' is also to be replaced, so the already-converted token gets replaced again, causing problems.
What would be the best way to get the strings replaced only once? Breaking each token off the original string and searching for them one at a time takes very long.
This is the code I have now. Although it more or less works, it's not very fast:
# The largest token is 8 ASCII chars long
# 'out' is the string with the final output
while len(data) != 0:
    length = 8
    while reverse_search(data[:length]) is None:  # sorry THC4k, I used your code
        # at first, but it didn't work out
        # for this and I was too lazy to change it
        length -= 1
    out += reverse_search(data[:length])
    data = data[length:]
If you're trying to substitute many strings at once, you can use a dictionary:
translation = {'PRINT': '32', 'GOTO': '41'}
code = ' '.join(translation[i] if i in translation else i for i in code.split(' '))
which is basically a single pass over the input: the split, the (average O(1)) dictionary lookups, and the join are each linear in the size of the string, so the whole thing is roughly O(|S|). Very fast, although memory usage could be quite substantial. Keeping track of substitutions would also allow you to solve the problem in linear time, but only if you exclude the cost of looking up previous substitutions. Altogether, the problem seems to be polynomial by nature.
Unless there is a function in Python to translate strings via dictionaries that I don't know about, this seems to be the simplest way of putting it.
it turns
10 PRINT HELLO
20 GOTO 10
into
10 32 HELLO
20 41 10
I hope this has something to do with your problem.
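If the tokens aren't neatly space-separated, another way to guarantee each part of the input is replaced at most once is a single pass with re.sub and a callback (a sketch reusing the same translation table; the longest-first ordering only matters when one token is a prefix of another):

import re

translation = {'PRINT': '32', 'GOTO': '41'}
# one alternation over all tokens, longest first, so longer tokens win
pattern = re.compile('|'.join(re.escape(t) for t in sorted(translation, key=len, reverse=True)))

code = '10 PRINT HELLO\n20 GOTO 10'
print(pattern.sub(lambda m: translation[m.group(0)], code))
# 10 32 HELLO
# 20 41 10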
I'm tasked with reading a poorly formatted binary file and taking in the variables. Although I need to do it in C++ (ROOT, specifically), I've decided to do it in Python because Python makes sense to me; my plan is to get it working in Python and then tackle rewriting it in C++, so leaning on easy-to-use Python modules won't get me very far later down the road.
Basically, I do this:
In [5]: some_value
Out[5]: '\x00I'
In [6]: ''.join([str(ord(i)) for i in some_value])
Out[6]: '073'
In [7]: int(''.join([str(ord(i)) for i in some_value]))
Out[7]: 73
And I know there has to be a better way. What do you think?
EDIT:
A bit of info on the binary format.
(Screenshots of the format documentation: http://grab.by/3njm, http://grab.by/3njv, http://grab.by/3nkL)
This is the endian test I am using:
# Read a uint32 for endianess
endian_test = rq1_file.read(uint32)
if endian_test == '\x04\x03\x02\x01':
print "Endian test: \\x04\\x03\\x02\\x01"
swapbits = True
elif endian_test == '\x01\x02\x03\x04':
print "Endian test: \\x01\\x02\\x03\\x04"
swapbits = False
Your int(''.join([str(ord(i)) for i in some_value])) works ONLY when all bytes except the last byte are zero.
Examples:
'\x01I' should be 1 * 256 + 73 == 329; you get 173
'\x01\x02' should be 1 * 256 + 2 == 258; you get 12
'\x01\x00' should be 1 * 256 + 0 == 256; you get 10
It also relies on an assumption that integers are stored in bigendian fashion; have you verified this assumption? Are you sure that '\x00I' represents the integer 73, and not the integer 73 * 256 + 0 == 18688 (or something else)? Please let us help you verify this assumption by telling us what brand and model of computer and what operating system were used to create the data.
How are negative integers represented?
Do you need to deal with floating-point numbers?
Is the requirement to write it in C++ immutable? What does "(ROOT, specifically)" mean?
If the only dictate is common sense, the preferred order would be:
Write it in Python using the struct module.
Write it in C++ but use C++ library routines (especially if floating-point is involved). Don't re-invent the wheel.
Roll your own conversion routines in C++. You could snarf a copy of the C source for the Python struct module.
Update
Comments after the file format details were posted:
The endianness marker is evidently optional, except at the start of a file. This is dodgy; it relies on the fact that if it is not there, the 3rd and 4th bytes of the block are the 1st 2 bytes of the header string, and neither '\x03\x04' nor '\x02\x01' can validly start a header string. The smart thing to do would be to read SIX bytes -- if first 4 are the endian marker, the next two are the header length, and your next read is for the header string; otherwise seek backwards 4 bytes then read the header string.
The above is in the nuisance category. The negative sizes are a real worry, in that they specify a MAXIMUM length, and there is no mention of how the ACTUAL length is determined. It says "The actual size of the entry is then given line by line". How? There is no documentation of what a "line of data" looks like. The description mentions "lines" many times; are these lines terminated by carriage return and/or line feed? If so, how does one tell the difference between say a line feed byte and the first byte of say a uint16 that belongs to the current "line" of data? If no linefeed or whatever, how does one know when the current line of data is finished? Is there a uintNN size in front of every variable or slice thereof?
Then it says that (2) above (negative size) also applies to the header string. The mind boggles. Do you have any examples (in documentation of the file layout, or in actual files) of "negative size" of (a) header string (b) data "line"?
Is this "decided format" publically available e.g. documentation on the web? Does the format have a searchable name? Are you sure you are the first person in the world to want to read that format?
Reading that file format, even with a full specification, is no trivial exercise, even for a binary-format-experienced person who's also experienced with Python (which BTW doesn't have a float128). How many person-hours have you been allocated for the task? What are the penalties for (a) delay (b) failure?
Your original question involved fixing your interesting way of trying to parse a uint16 -- doing much more is way outside the scope/intention of what SO questions are all about.
You're basically computing a "number-in-base-256", which is a polynomial, so, by Horner's method:
>>> v = 0
>>> for c in someval: v = v * 256 + ord(c)
More typical would be to use equivalent bit-operations rather than arithmetic -- the following's equivalent:
>>> v = 0
>>> for c in someval: v = v << 8 | ord(c)
import struct
result, = struct.unpack('>H', some_value)
The equivalent to the Python struct module is a C struct and/or union, so being afraid to use it is silly.
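If the file's byte order matters, the format character can be tied to the question's endian test (a sketch; it assumes the marker was written as the uint32 value 0x01020304, and it reuses rq1_file and some_value from the question):

import struct

endian_test = rq1_file.read(4)                 # rq1_file and some_value as in the question
fmt = '<H' if endian_test == '\x04\x03\x02\x01' else '>H'   # little- vs big-endian source
result, = struct.unpack(fmt, some_value)       # e.g. '\x00I' -> 73 when big-endian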
I'm not exactly sure what the format of the data you want to extract is, but maybe you'd be better off just writing a couple of generic utility functions to extract the different data types you need:
def int1b(data, i):
    return ord(data[i])

def int2b(data, i):
    return (int1b(data, i) << 8) + int1b(data, i+1)

def int4b(data, i):
    return (int2b(data, i) << 16) + int2b(data, i+2)
With such functions you can easily extract values from the data and they also can be translated rather easily to C.
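For example, applied to the two-byte value from the question (assuming big-endian byte order):

result = int2b('\x00I', 0)   # 0 * 256 + 73 == 73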