I am opening up a binary file like so:
file = open("test/test.x", 'rb')
and reading in lines to a list. Each line looks a little like:
'\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
I am having a hard time manipulating this data. If I try to print each line, Python freezes and emits beeping noises (I think there's a binary beep code in there somewhere). How do I go about using this data safely? How can I convert each hex number to decimal?
To print it, you can do something like this:
print repr(data)
For the whole thing as hex:
print data.encode('hex')
For the decimal value of each byte:
print ' '.join([str(ord(a)) for a in data])
To unpack binary integers, etc. from the data as if they originally came from a C-style struct, look at the struct module.
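For example, if the line happened to contain little-endian unsigned 16-bit integers (an assumption made purely for illustration), a minimal sketch would be:
import struct

data = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
# '<' means little-endian, '10H' means ten unsigned 16-bit integers (20 bytes)
print struct.unpack('<10H', data)
# (190, 25800, 25848, 58376, 1838, 894, 1950, 958, 2014, 2814)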
\xhh is the character with hex value hh. Other characters, such as . and ~, are normal characters.
Iterating on a string gives you the characters in it, one at a time.
ord(c) will return an integer representing the character. E.g., ord('A') == 65.
This will print the decimal numbers for each character:
s = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
print ' '.join(str(ord(c)) for c in s)
Binary data is rarely divided into "lines" separated by '\n'. If it is, it will have an implicit or explicit escape mechanism to distinguish between '\n' as a line terminator and '\n' as part of the data. Reading such a file as lines blindly without knowledge of the escape mechanism is pointless.
To answer your specific concerns:
'\x07' is the ASCII BEL character, which was originally for ringing the bell on a teletype machine.
You can get the integer value of a byte 'b' by doing ord(b).
HOWEVER, to process binary data properly, you need to know what the layout is. You can have signed and unsigned integers (of 1, 2, 4, or 8 bytes), floating-point numbers, decimal numbers of varying lengths, fixed-length strings, variable-length strings, etc. A further complication is whether the data is recorded in big-endian or little-endian fashion. Once you know all of the above (or have very good informed guesses), you should be able to do all or most of your processing with the Python struct module; the ctypes module may also be useful.
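For instance, the same two bytes give different integers depending on the assumed byte order (an illustration only):
import struct

pair = '\xbe\x00'
print struct.unpack('<H', pair)[0]   # little-endian: 0x00be == 190
print struct.unpack('>H', pair)[0]   # big-endian:    0xbe00 == 48640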
Does the data format have a name? If so, tell us; we may be able to point you to code or docs.
You ask "How do I go about using this data safely?" which begs the question: What do you want to use it for? What manipulations do you want to do?
Like theatrus mentioned, ord and hex might help you.
If you want to try to interpret some sort of structured binary data in the file, the struct module might be helpful.
You are trying to print the data converted to ASCII characters, which will not work.
You can safely use any byte of the data. If you want to print it as hexadecimal, look at the functions ord and hex.
Are you using read() or readline()? You should be using read(n) to read n bytes; readline() will read until it hits a newline, which the binary file might not have.
In either case, though, you are returned a string of bytes, which may be printable or non-printable characters, and is probably not very useful.
What you want is ord(), which converts a one-byte string into the corresponding integer value. Read from the file one byte at a time and call ord() on the result, or iterate through the entire string.
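A minimal sketch of the byte-at-a-time approach (Python 2, reusing the path from the question):
f = open("test/test.x", "rb")
byte = f.read(1)
while byte:
    print ord(byte),   # decimal value of the byte; trailing comma keeps output on one line
    byte = f.read(1)
f.close()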
If you are willing to use NumPy and bitstream, you can do
>>> from numpy import *
>>> from bitstream import BitStream
>>> raw = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
>>> stream = BitStream(raw)
>>> stream.read(uint8, len(stream) // 8)
array([190, 0, 200, 100, 248, 100, 8, 228, 46, 7, 126, 3, 158,
7, 190, 3, 222, 7, 254, 10], dtype=uint8)
I need to do some work with bytes in Python and I've come across a byte string I don't really understand:
b"H\x00\x84\xffQ\x00\xa6\xff+\x00\x96\xff\xc2\xffI\xff\xa5\xff'\xff\x8a\xff\x19\xff\x19\xff\xf6\xfe\xb0\xfe\xc7\xfeJ\xfel\xfe\xf8\xfd+\xfe\xef\xfd:\xfe\xc3\xfd*\xfe_\xfd\xdf\xfd\n\xfd\xa3\xfd\xc6\xfcq\xfd\xbd\xfc?\xfd"
So according to what I know, bytes should be represented as \xhh, where hh are hexadecimal digits (0 to f). However, in the third segment there is \xffQ, and farther on there are other characters which shouldn't appear: I, ', *, :, ? etc.
I've used the hex() method to see what the outcome would be, and I got this:
480084ff5100a6ff2b0096ffc2ff49ffa5ff27ff8aff19ff19fff6feb0fec7fe4afe6cfef8fd2bfeeffd3afec3fd2afe5ffddffd0afda3fdc6fc71fdbdfc3ffd
As you can see, some parts of the hex are the same, but e.g. \xffQ was changed into ff51. I need to append some data to this byte string, so I'd like to know what's going on there (or how to get the same result).
Both repr and str, when given a bytes object, show printable ASCII bytes as characters where possible; the remaining bytes are shown as hexadecimal escapes of the form \xNN.
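For example, the \xffQ in your string is just the byte 0xff followed by the byte 0x51; 0x51 is the printable ASCII letter Q, so repr shows it as a letter rather than an escape:
print(bytes([0x48, 0x00, 0x84, 0xff, 0x51]))  # the first five bytes of your data
Output:
b'H\x00\x84\xffQ'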
It might help you to visualise the content if you print it as all hexadecimal as follows:
b = b"H\x00\x84\xffQ\x00\xa6\xff+\x00\x96\xff\xc2\xffI\xff\xa5\xff'\xff\x8a\xff\x19\xff\x19\xff\xf6\xfe\xb0\xfe\xc7\xfeJ\xfel\xfe\xf8\xfd+\xfe\xef\xfd:\xfe\xc3\xfd*\xfe_\xfd\xdf\xfd\n\xfd\xa3\xfd\xc6\xfcq\xfd\xbd\xfc?\xfd"
print(''.join(hex(b_) for b_ in b))
Output:
0x480x00x840xff0x510x00xa60xff0x2b0x00x960xff0xc20xff0x490xff0xa50xff0x270xff0x8a0xff0x190xff0x190xff0xf60xfe0xb00xfe0xc70xfe0x4a0xfe0x6c0xfe0xf80xfd0x2b0xfe0xef0xfd0x3a0xfe0xc30xfd0x2a0xfe0x5f0xfd0xdf0xfd0xa0xfd0xa30xfd0xc60xfc0x710xfd0xbd0xfc0x3f0xfd
Or you can use the binascii module if you want to visualize the content:
import binascii
print(binascii.hexlify(b"hello world"))
Output:
b'68656c6c6f20776f726c64'
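On Python 3.5 or newer (an assumption about your interpreter), the built-in bytes.hex() method gives the same result without an import, and from 3.8 onwards it accepts a separator:
print(b"hello world".hex())
print(b"hello world".hex(' '))  # Python 3.8+
Output:
68656c6c6f20776f726c64
68 65 6c 6c 6f 20 77 6f 72 6c 64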
This question already has answers here:
Process escape sequences in a string in Python
My problem is as follows:
I'm reading a .csv generated by some software, and I'm using Pandas to read it. Pandas reads the .csv properly, but one of the columns stores byte sequences representing vectors, and Pandas stores them as strings.
So I have data (a string) and I want to use np.frombuffer() to get the proper vector. The problem is that data is a string of literal escape sequences, so when I use .encode() to turn it into bytes, the result is not the original byte sequence.
Example: The .csv contains \x00\x00 representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string and when I try to process it something like this happens:
data = df.data[x] # With x any row.
type(data)
<class 'str'>
print(data)
\x00\x00
e_data = data.encode("latin1")
print(e_data)
b'\\x00\\x00'
v = np.frombuffer(e_data, np.uint8)
print(v)
[ 92 120  48  48  92 120  48  48]
I just want to get b'\x00\x00' from data instead of b'\\x00\\x00' which I understand is a little encoding mess I have not been able to fix yet.
Any way to do this?
Thanks!
Issue: you (apparently) have a string that contains literal backslash escape sequences, such as:
>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00
From this, you wish to create a corresponding bytes object, wherein the backslash-escape sequences are translated into the corresponding byte.
Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may be aware, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to what Unicode code points.
However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation and that the str side will have the corresponding Unicode characters:
>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'
Applying .encode to the string will reverse that process; so if you start with the backslash-escape sequence, it will re-escape the backslash:
>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]
As you can see, that is clearly not what we want.
The backslash-escape processing happens when converting from bytes to str. But we have a str to start with, so we first need to turn it into bytes; and we want bytes at the end, so we also need to convert the str that the escape processing gives us. In both cases, we need each Unicode code point from 0 to 255 inclusive to correspond to a single byte with the same value.
The encoding we need for that task is called latin-1, also known as iso-8859-1.
For example:
>>> r'\x00'.encode('latin-1')
b'\\x00'
Thus, we can reason out the overall conversion:
>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'
As desired: our str with a literal backslash, lowercase x and two zeros, is converted to a bytes object containing a single zero byte.
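Putting that together for the DataFrame in your question (the column name data and dtype np.uint8 come from your example; the DataFrame contents here are a hypothetical stand-in), a sketch might look like this:
import numpy as np
import pandas as pd

def escaped_str_to_bytes(s):
    # a str containing literal backslash escapes (e.g. r'\x00\x00') -> b'\x00\x00'
    return s.encode('latin-1').decode('unicode-escape').encode('latin-1')

df = pd.DataFrame({'data': [r'\x00\x00', r'\x01\x02']})  # stand-in for the CSV column
vectors = df['data'].apply(lambda s: np.frombuffer(escaped_str_to_bytes(s), dtype=np.uint8))
print(vectors[0])  # [0 0]
print(vectors[1])  # [1 2]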
Alternately: we can request that backslash-escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this isn't documented and isn't really meant to be used that way - it's internal stuff used to implement the unicode-escape codec and possibly some other things.
If you want to expose yourself to the risk of that breaking in the future, it looks like:
>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)
We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (but this could be specific to how Python is configured), and you can't change this; there is no actual parameter to specify the encoding, for a decode method. Like I said - not meant for general use.
Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:
>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]
I am trying to capture some data from a piece of hardware I'm developing through one of cypress' fx2lp chips. I used cypress' software to record a sample of my data stream to a file, which I am trying to read with python. However, when I read it, I'm getting some interesting output that I'm not sure how to interpret.
I am opening the file like this:
f = open("testdata_5Aug2014.dat","rb")
Then I read the data in various sized chunks, similar to this:
f.read(100)
Typically, the result of the above line (and what I want to see) is something like this:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
But I sometimes get returns that include 't's and '?'s thrown in there like this:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14K\x01?\x00\xff??\x00\xff??\x00\xff??\x00\xff?\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
This is a problem, because when I use struct.unpack to parse this out, it won't return any of those bytes with the special characters appended.
So my question is: What are those symbols? How did they get there? and How do I remove them or deal with them?
You're reading binary data from a file, but f.read returns that data as a string. When you print that string, it's interpreting those bytes as characters. However, not every byte value maps to a displayable character, so some bytes are shown as escape sequences: \x followed by two hexadecimal digits. For example, 0 shows up as \x00 and 255 shows up as \xff.
Some values do map to characters, such as 63 mapping to '?' and 116 mapping to 't'. The ord and chr functions can be used to fetch the numerical value of a character, and the character mapping for a number, respectively, so ord('t') returns 116 and chr(63) returns '?'.
Either way, no matter how it's displayed, your data should be fine, and struct.unpack should be able to work with it as usual.
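For example, assuming the stream is a sequence of little-endian 16-bit values (a guess based on the repeating two-byte patterns in your dump), a minimal sketch:
import struct

chunk = '\x05\x12t\x14?\x00\xff?'   # a few bytes taken from the sample output above
# '<' means little-endian, 'H' means unsigned 16-bit integer
print struct.unpack('<%dH' % (len(chunk) // 2), chunk)
# (4613, 5236, 63, 16383)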
In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' signifies that the result value is a bytes string. A bytes-string is quite similar to a normal string, but not quite, and is used to handle binary, non-textual data.
Some of the string methods that don't make sense for binary data are gone, but most are still there. A big difference is that when you get a specific byte from a bytes string you get an integer back, while for a normal str you get a one-character str.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
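For example (UTF-8 is chosen here purely as an illustrative encoding):
>>> raw = b'caf\xc3\xa9'      # bytes, e.g. read from a file opened with 'rb'
>>> raw.decode('utf-8')       # bytes -> str
'café'
>>> 'café'.encode('utf-8')    # str -> bytes
b'caf\xc3\xa9'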
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as in Python 2, namely byte strings. What changed, then, is that the repr() of a byte string is prefixed with b, and the print() function shows bytes objects using that repr() (whereas str values are printed as their text).
To print your binary data as Unicode text with the print() function, decode it to unicode first. But then you could perhaps have opened the file as a text file instead anyway.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
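For instance:
>>> list(b'foo')        # iterating over bytes yields ints
[102, 111, 111]
>>> bytes([102, 111, 111])
b'foo'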
Sometimes we need (needed?) to know whether a text file had single-character newlines (0A) or double character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that stripped off the added material, but it's a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
I want to encode a set of configuration options into a long string of hex digits.
The input is a mix of numbers (integers and floats) and strings. I can use binascii.b2a_hex from the standard library for the strings, bit-wise operators for the integers, and, if I go and read up on floating-point representation (sigh), I can probably handle the floats, too.
Now, my questions:
When given the list of options, (how) should I type check the value to select the correct conversion routine?
Isn't there a library function for the numbers, too? I can't seem to find it.
The serialized data is sent to an embedded device and I have limited control over the code that consumes it (meaning changes are possible, but a hassle). The specification for the serialization seems to conform to C value representation (char arrays for strings, little-endian integers, IEEE 754 floats), but it doesn't explicitly state this. So Python-specific stuff like pickle is off-limits.
You want struct.
>>> struct.pack('16sdl', 'Hello, world!', 3.141592654, 42)
'Hello, world!\x00\x00\x00PERT\xfb!\t@*\x00\x00\x00\x00\x00\x00\x00'
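The reverse direction uses the same format string; the NUL padding on the 16-byte string field is expected:
>>> packed = struct.pack('16sdl', 'Hello, world!', 3.141592654, 42)
>>> struct.unpack('16sdl', packed)
('Hello, world!\x00\x00\x00', 3.141592654, 42)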
Your easiest bet is to pickle the whole list to a string and then use binascii.b2a_hex() to convert that string to hex digits:
a = ["Hello", 42, 3.1415]
s = binascii.b2a_hex(pickle.dumps(a, 2))
print s
# 80025d710028550548656c6c6f71014b2a47400921cac083126f652e
print pickle.loads(binascii.a2b_hex(s))
# ['Hello', 42, 3.1415]
What about using the struct module to do your packing/unpacking?
import struct
s = struct.pack('5sif', "Hello", 42, 3.1415)
print s
print struct.unpack('5sif', s)
or if you really want just hex characters
import struct, binascii
s = binascii.b2a_hex(struct.pack('5sif', "Hello", 42, 3.1415))
print s
print struct.unpack('5sif', binascii.a2b_hex(s))
Of course this requires that you know the length of strings that are being sent across or you could figure it out by looking for a NULL character or something.
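One common way around that (a sketch of one convention, not necessarily what your device expects) is to length-prefix each string:
import struct

def pack_string(s):
    # unsigned 32-bit little-endian length, followed by the raw characters
    return struct.pack('<I%ds' % len(s), len(s), s)

def unpack_string(buf):
    (n,) = struct.unpack_from('<I', buf)
    (s,) = struct.unpack_from('%ds' % n, buf, 4)
    return s, buf[4 + n:]   # the string and whatever data follows it

packed = pack_string("Hello")
print packed.encode('hex')    # 0500000048656c6c6f
print unpack_string(packed)   # ('Hello', '')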