Python Decode OctetString 7-bit Characters

I'm currently playing around with decoded ASN.1 data and can't wrap my head around how to correctly decode the data into strings (numerical data works absolutely fine).
Example:
Hex String -> 0ddc2f93c6c7bb10
Expected Result -> MegaFon
According to the spec, the first two octets are meta info, and starting with octet 3 there should be two 7-bit chars in each octet.
I tried the solutions mentioned in decode 7-bit GSM, but I just get garbage back. I would highly appreciate any ideas.

I managed to solve the riddle in the meantime (@BoarGules, you are right, the spec is misleading from my perspective). First of all, for chars (the hex starts with d0 in this case), the nibbles must not be rotated as is done for numerical output. Then just cut off the leading d0 (two hex digits, i.e. one octet, in our case) and run the rest through the gsm7bitdecode function mentioned in the other Stack Overflow thread (linked in the question). To stay with the example: 'CD' => 11001101; cutting the first bit, or setting it to 0, gives 01001101, or 4D in hex, which is M in ASCII!
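The steps described above can be sketched in Python 3 roughly as follows. This is only an illustration, not the gsm7bitdecode function from the linked thread: the helper names swap_nibbles and unpack_gsm7 are made up here, the sketch assumes that the hex string from the question is the nibble-swapped form of the raw octets, and it reads the unpacked septets as ASCII, which matches the GSM default alphabet for plain letters like these but not for every character.

```python
def swap_nibbles(hex_string: str) -> str:
    """Swap the two hex digits of every octet, e.g. "0d" -> "d0"."""
    return "".join(hex_string[i + 1] + hex_string[i]
                   for i in range(0, len(hex_string), 2))


def unpack_gsm7(packed: bytes) -> bytes:
    """Unpack GSM 7-bit packed octets into one septet per byte."""
    septets = []
    carry = 0       # high bits left over from the previous octet
    carry_bits = 0  # how many bits are currently held in carry
    for octet in packed:
        # Low bits of this octet sit above the carried bits of the last one.
        septets.append(((octet << carry_bits) | carry) & 0x7F)
        carry = octet >> (7 - carry_bits)
        carry_bits += 1
        if carry_bits == 7:
            # Seven carried bits form a complete septet on their own.
            septets.append(carry)
            carry = 0
            carry_bits = 0
    return bytes(septets)


# Assumption: the question's string is the nibble-swapped view of the octets.
raw = bytes.fromhex(swap_nibbles("0ddc2f93c6c7bb10"))  # d0 cd f2 39 6c 7c bb 01
payload = raw[1:]                                      # drop the leading d0
septets = unpack_gsm7(payload)
# Assumption: a trailing zero septet is padding and can be stripped.
name = septets.rstrip(b"\x00").decode("ascii")
print(name)  # MegaFon
```

Seven packed octets always unpack to eight septets, so the last one here is filler; a more careful decoder would take the real character count from the metadata instead of stripping zeros.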

Related

While scanning for badchars to avoid in a buffer overflow attack, hex number "C2" keeps appearing every second character in the hexdump

I'm learning about buffer overflows because I have an exam on it tomorrow.
I've been following this guide, and I'm currently on the step where I use Immunity Debugger to look for badchars. However, a weird problem occurs: after the hex number "7F", every second number appears to be "C2" for some reason.
I'm using a script which is slightly modified from the guide since I'm using python3 instead of python2. That script looks like this:
import socket
ip = "192.168.10.136"
port = 31337
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((ip, port))
offset = 146
eip = "B" * 4
allchars = "\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
buffer = "A" * offset + eip + allchars
buffer += "\n"
s.sendall(buffer.encode('utf-8'))
I couldn't find anything online, it was a kind of hard problem to troubleshoot, especially since I'm not super experienced with python3 or immunity debugger. Any help is very appreciated, and I'll do my best to answer any questions if there's any important information I've forgotten to include.

Comment promoted to answer at OP's request.
As @Sören points out, you have encoded your original data as UTF-8. Your original data goes from byte value 01 to FF which, because it is in a string, represents the Unicode codepoints U+0001 to U+00FF.
But the values in the second half of that block, Unicode codepoints U+0080 to U+00FF, are represented in UTF-8 as two bytes. So when you encoded the original data 0x80 0x81 etc as UTF-8 you got the UTF-8 2-byte representations C2 80 C2 81 etc.
To fix: Make allchars a bytestring b"..." and use it as is without encoding.
Unicode is a 21-bit system that can accommodate a theoretical 1,114,112 different codepoints (though only 144,697 of them were assigned as of Unicode 14.0, in 2021). UTF-8 does represent some codepoints as a single byte: essentially, the 128 characters of the ASCII character set from 1963. But it follows from the number of codepoints that any representation in 8-bit units will be variable width, with some codepoints occupying two, three or even four bytes.
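The difference between the two approaches can be checked without a socket or a debugger. The following is a small sketch in Python 3; the variable names are illustrative, and for brevity the range includes 0x0a, which the script in the question leaves out.

```python
# Text string holding the code points U+0001..U+00FF,
# which is what the "\x01\x02..." str literal in the question builds.
allchars_text = "".join(chr(i) for i in range(1, 256))
as_utf8 = allchars_text.encode("utf-8")

# Bytes object holding the byte values 0x01..0xFF directly.
allchars_bytes = bytes(range(1, 256))

print(len(as_utf8), len(allchars_bytes))  # 383 255
print(as_utf8[126:131].hex(" "))          # 7f c2 80 c2 81
print(chr(0xC0).encode("utf-8").hex(" ")) # c3 80

# A payload built from bytes needs no .encode() call at all.
offset = 146
payload = b"A" * offset + b"B" * 4 + allchars_bytes + b"\n"
```

From U+00C0 upwards the lead byte becomes C3 instead of C2, so a dump of the UTF-8 version shows C2 only for the first half of the upper range.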

Python convert strings of bytes to byte array

For example, given an arbitrary string, which could be chars or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course, just typing the b followed by the string will do. But I want to do this at runtime, or from a variable containing the string of bytes.
If the given string were AAAA or some known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to just be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding

The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not help either: in Python 3, str.encode() defaults to UTF-8, so it produces the same multi-byte result as above.
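A quick way to compare the two encodings on the string from the question is sketched below (Python 3; the variable names are illustrative).

```python
text = "\xf0\x9f\xa4\xb1"  # four code points: U+00F0, U+009F, U+00A4, U+00B1

as_latin1 = text.encode("latin-1")  # one byte per code point, same value
as_utf8 = text.encode("utf-8")      # two bytes per code point above U+007F

print(as_latin1)  # b'\xf0\x9f\xa4\xb1'
print(as_utf8)    # b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'

# Latin-1 round-trips every byte value 0x00..0xFF unchanged.
all_bytes = bytes(range(256))
roundtrip = all_bytes.decode("latin-1").encode("latin-1")
print(roundtrip == all_bytes)  # True
```

Note that Latin-1 raises UnicodeEncodeError for any code point above U+00FF, so this only works when the string really holds byte values.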

I found a working solution
import struct

def convert_string_to_bytes(string):
    # Pack each character's code point (must be 0..255) as one unsigned byte.
    result = b''
    for char in string:
        result += struct.pack("B", ord(char))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'

Python hex bit flipping ascii

The following statement is from documentation I'm following.
“7c bd 9c 91” 2442968444(919cbd7c hex)usec = 2442.9sec
If you assume:
7c -> a
bd -> b
9c -> c
91 -> d
Then it's easy to see how they got 919cbd7c: simply by flipping abcd to dcba.
What I don't understand is why they aren't flipping the actual bits.
That is to say I expect 19c9dbc7 rather than 919cbd7c.
Is there a way to convert the original string to what they expect?
EG: convert 7cbd9c91 to 919cbd7c?
I know that I can split the string in twos and reverse the order. But is there a way python is aware of this and can decode it automatically?
Here is the documentation. The part in question is on the 2nd line of page 22.

I think you're trying to put too much thought into it. The hex pairs you're seeing are actually single bytes, and the order of the bits within the bytes is unambiguous. It's only the byte-order of the higher-level multi-byte integer that can go more than one way. Fortunately, byte-order swapping is very easy, since computers have to do it all the time (network byte order is big-endian, but most PCs these days are little-endian internally).
In Python, just pass the raw bytestring you're getting (which would be b"\x7c\xbd\x9c\x91" for the example data shown in the documentation) to struct.unpack with an appropriate format parameter. Since the documentation says it's a little endian 4-byte number, use "<L" as the format code to specify a "little-endian unsigned long integer":
>>> import struct
>>> bytestring = b"\x7c\xbd\x9c\x91" # from wherever
>>> struct.unpack("<L", bytestring)
(2442968444,)
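If the data arrives as a hex string rather than as raw bytes, the same conversion can be sketched like this in Python 3. The variable names are illustrative; int.from_bytes is shown only as an alternative to struct.unpack and gives the same value.

```python
import struct

raw = bytes.fromhex("7c bd 9c 91")  # hex text -> 4 raw bytes

value_struct = struct.unpack("<L", raw)[0]           # little-endian uint32
value_int = int.from_bytes(raw, byteorder="little")  # same result, no struct

# Reversing the byte order (not the bits) gives the hex form from the docs.
swapped_hex = raw[::-1].hex()

print(value_struct)  # 2442968444
print(value_int)     # 2442968444
print(swapped_hex)   # 919cbd7c
```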

network data in the form of hex conversion to ascii in python issue

I'm trying to comprehend something. I'm receiving raw network data in the form of hex, in this particular example a MAC address. I'm using the unhexlify() / hexlify() functions from the binascii library in Python 2.7, and I'm receiving, for example, the following MAC address in the form of hex:
"\xa5\xbb%\x8f\xa0\xda"
it's a six octet long mac address and I absolutely have no clue what's going on....
if I use the function
binascii.hexlify('\xa5\xbb%\x8f\xa0\xda')
it returns
a5bb258fa0da
which is indeed the correct MAC address I'm expecting to receive but this really really doesn't make sense....
"\xa5\xbb%\x8f\xa0\xda" isn't a correct form of hex: it contains a %, and somehow the binascii.hexlify() function manages to translate it to the correct MAC address...
I'm honestly failing to understand this. I'm guessing it has something to do with translating from hex to ASCII and not hex to decimal, but I don't understand how the unhexlify() / hexlify() functions work, or how I can receive data in the form of hex that contains a % and yet my hex-to-ASCII function manages to handle it...
what's going on.

Andrey is right, if characters are not preceded by the \x the standard ASCII table is used:
>>> print binascii.hexlify('012')
303132

"\xa5\xbb%\x8f\xa0\xda" is a sequence of characters that are defined by their hex codes. "\xa5" is a single character, not 4. For example, print '\x61' will produce just a.
As for the % sign, it is a printable character, which is why it is printed as is in the string. Its hex code is 0x25, and that is the value actually used. You can write it as \x25: "\xa5\xbb\x25\x8f\xa0\xda"
More here.
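The points above can be verified with a few lines. This sketch uses Python 3, where the value is a bytes literal rather than a Python 2.7 str as in the question; the variable names are illustrative.

```python
import binascii

mac = b"\xa5\xbb%\x8f\xa0\xda"

print(len(mac))     # 6 -> six bytes, each \xNN escape is one byte
print(hex(mac[2]))  # 0x25 -> the byte that is displayed as '%'

# Writing the '%' as an escape gives exactly the same bytes.
same = b"\xa5\xbb\x25\x8f\xa0\xda"
print(mac == same)  # True

hexed = binascii.hexlify(mac)  # two hex digits per byte
print(hexed)                   # b'a5bb258fa0da'

# Conventional colon-separated form of the address.
pretty = ":".join(f"{byte:02x}" for byte in mac)
print(pretty)                  # a5:bb:25:8f:a0:da
```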

Python convert mixed ASCII code to String

I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help

It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequence to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
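A first step in that direction is to look at the numeric value of each byte. The sketch below assumes Python 3, where the value would arrive as a bytes object, and a fixed four-byte prefix as in the example; what those four bytes mean is not determined here.

```python
value = b'\x04\x08"\nHello'  # the value from the question, as bytes

# Split off the four leading bytes (assumption: the prefix is always 4 bytes).
head, tail = value[:4], value[4:]

print(list(head))  # [4, 8, 34, 10] -> 34 is '"', 10 is the newline '\n'

# The remainder is plain ASCII and can be decoded to text.
text = tail.decode("ascii")
print(text)        # Hello
```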

If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
