I am experimenting with binary reading and writing to and from files in Python. I am trying to teach myself a bit of programming (it's not really teaching myself, since I use the internet, but anyway...). My problem is that reading a file in Python in binary does not actually output the bits to me, but seems to process it into text already.
Example:
My system has a file "Test.txt" in the same folder as the script.
The content of this file is the following text written in notepad:
Testing Temp "Testing"
This is a small piece of the code that is giving me some confusion:
f=open("Test.txt", "rb")
print(f.read(22))
This results in the following output:
b'Testing Temp "Testing"'
However, I want bits in the form of a string (so a string of 0's and 1's) as output. How can I do this?
What you have is a sequence of bytes (note the b at the beginning).
You can access the value of every single byte using indexing. In your example, if s=f.read(22) then s[0] will be 84 which is the ASCII code for T.
If you want to obtain the binary representation of a byte you use the bin built-in:
>>> bin(84)
'0b1010100'
It also adds the 0b prefix, which is Python's prefix for binary literals:
>>> 0b1010100
84
To obtain the bit-per-bit binary representation you can simply access every byte and call bin on each value:
def to_bits(contents):
    return ''.join(bin(byte)[2:].zfill(8) for byte in contents)
which results in:
>>> to_bits(b'Testing Temp "Testing"')
'01010100011001010111001101110100011010010110111001100111001000000101010001100101011011010111000000100000001000100101010001100101011100110111010001101001011011100110011100100010'
Note that you have to call zfill(8) because bin can return a representation shorter than 8 bits:
>>> bin(1)[2:]
'1'
>>> bin(1)[2:].zfill(8)
'00000001'
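As an alternative sketch of the same idea, format() with the '08b' format spec zero-pads to eight digits and leaves out the 0b prefix in one step. The name to_bits_fmt is only for illustration, and the code assumes Python 3, where iterating over bytes yields integers:

```python
def to_bits_fmt(contents):
    # '08b' means: binary digits, zero-padded to a width of 8, no '0b' prefix.
    return ''.join(format(byte, '08b') for byte in contents)

print(to_bits_fmt(b'Te'))  # 0101010001100101
```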
I have a binary file written by Delphi. This is what I know:
Block 1: 4 bytes, representing a 32-bit integer value.
Block 2: A string value (the length is not fixed across binary files)
Block 3: 4 bytes, representing a 32-bit integer value.
Block 4: A string value (the length is not fixed across binary files)
...
BlockN
I wrote this to read the first block's value:
import struct
f = open("filename", 'rb')
value = struct.unpack('i', f.read(4))[0]  # unpack() returns a tuple
What about the string values? What would a good solution look like? Is there a way to iterate over the bytes and find the terminating "\0" of each string value, as in C?
It's a little more complex with unpack if you don't know the length. Here is a reference that should solve your problem:
packing and unpacking variable length array/string using the struct module in python
I discovered that Delphi uses a 7-bit integer compression at the beginning of a string to specify how many bytes need to be read. I found here the same algorithm implemented in Python. So I just have to pass the file to the decode7bit(bytes) function and it will tell me how many bytes I have to read next.
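A minimal sketch of that approach, assuming the length prefix is a base-128 integer stored least significant group first (7 data bits per byte, high bit set when more bytes follow) and assuming the string bytes are UTF-8. Both assumptions should be checked against the actual file. The helper names read_7bit_int and read_prefixed_string are hypothetical and are not the decode7bit function from the linked code:

```python
import io

def read_7bit_int(stream):
    """Read a 7-bit-encoded integer: low 7 bits are data, high bit means 'more'."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a 7-bit encoded integer")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:      # high bit clear: this was the last byte
            return result
        shift += 7

def read_prefixed_string(stream, encoding="utf-8"):
    """Read a length prefix, then that many bytes, and decode them (encoding assumed)."""
    length = read_7bit_int(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a string")
    return data.decode(encoding)

# In-memory stand-in for a file opened with 'rb'.
demo = io.BytesIO(b"\x05hello" + b"\xac\x02" + b"x" * 300)
print(read_prefixed_string(demo))       # hello
print(len(read_prefixed_string(demo)))  # 300
```

The same functions should work on a real file object opened in 'rb' mode, since they only call read().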
The following statement is from a documentation I'm following.
“7c bd 9c 91” 2442968444(919cbd7c hex)usec = 2442.9sec
If you assume:
7c -> a
bd -> b
9c -> c
91 -> d
Then it's easy to see how they got 919cbd7c simply by flipping abcd to dcba.
What I don't understand is why they aren't flipping the actual bits.
That is to say I expect 19c9dbc7 rather than 919cbd7c.
Is there a way to convert the original string to what they expect?
EG: convert 7cbd9c91 to 919cbd7c?
I know that I can split the string into pairs and reverse their order. But is there a way for Python to be aware of this and decode it automatically?
Here is the documentation. The part in question is on the 2nd line of page 22.
I think you're trying to put too much thought into it. The hex pairs you're seeing are actually single bytes, and the order of the bits within the bytes is unambiguous. It's only the byte-order of the higher-level multi-byte integer that can go more than one way. Fortunately, byte-order swapping is very easy, since computers have to do it all the time (network byte order is big-endian, but most PCs these days are little-endian internally).
In Python, just pass the raw bytestring you're getting (which would be b"\x7c\xbd\x9c\x91" for the example data shown in the documentation) to struct.unpack with an appropriate format parameter. Since the documentation says it's a little-endian 4-byte number, use "<L" as the format code to specify a "little-endian unsigned long integer":
>>> import struct
>>> bytestring = b"\x7c\xbd\x9c\x91" # from wherever
>>> struct.unpack("<L", bytestring)
(2442968444,)
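If the value is only ever a single integer, Python 3's int.from_bytes is another option. This is a sketch using the same four example bytes, with bytes.fromhex turning the hex text from the question into raw bytes:

```python
raw = bytes.fromhex("7cbd9c91")          # same as b"\x7c\xbd\x9c\x91"
value = int.from_bytes(raw, "little")    # least significant byte comes first
print(value)        # 2442968444
print(hex(value))   # 0x919cbd7c
```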
I'm trying to write a single-hex value, say 'F' to a file:
a = int('F', 16)
f.write(chr(a))
However, this code segment gives me the file with 0F. I just want the single hex F in my file. I know this is because a char is represented by a byte, is there a way to directly write the hex value without the pad?
Use string formatting:
f.write("{:X}".format(a))
It will write it as F:
>>> "{:X}".format(a)
'F'
What you are trying to do is not possible on most modern operating systems. The smallest data unit that a general purpose computing platform can handle is one byte.
Check this wiki article for additional details; it states:
"Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit of memory in many computer architectures. "
You can use the struct module to write raw data to a file. This will write a single byte to the file:
import struct
open('file','wb').write(struct.pack('b', 0xf))
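Since a byte is the smallest unit that can be written, one way to avoid wasting the upper four bits is to store two hex digits per byte. A rough sketch, where pack_nibbles is a hypothetical helper and an odd final digit is padded with a zero nibble (an assumption; a reader of the file would need to know the real digit count):

```python
def pack_nibbles(hex_digits):
    """Pack a string of hex digits into bytes, two digits per byte."""
    if len(hex_digits) % 2:
        hex_digits += "0"   # assumed padding for an odd number of digits
    return bytes(
        (int(hex_digits[i], 16) << 4) | int(hex_digits[i + 1], 16)
        for i in range(0, len(hex_digits), 2)
    )

print(pack_nibbles("FA"))   # b'\xfa'
print(pack_nibbles("F"))    # b'\xf0'
```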
In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' signifies that the result value is a bytes string. A bytes-string is quite similar to a normal string, but not quite, and is used to handle binary, non-textual data.
Some of the string methods that don't make sense for binary data are gone, but most are still there. A big difference is that when you get a specific byte from a bytes string you get an integer back, while for a normal str you get a one-length str.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
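A short illustration of that round trip, assuming UTF-8 as the encoding (the right codec depends on how the file was written):

```python
text = "café"
data = text.encode("utf-8")     # str -> bytes
print(data)                     # b'caf\xc3\xa9'
print(data.decode("utf-8"))     # café
print(len(text), len(data))     # 4 5
```

The lengths differ because the accented character takes two bytes in UTF-8.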
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as in Python 2, namely byte strings. What changed, then, is that the repr() of a byte string is prefixed with b, and print() shows that representation because str() of a bytes object is the same as its repr().
To print your binary data as Unicode text with the print() function, decode it to str first. But then you could perhaps have opened the file as a text file instead anyway.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
Sometimes we need (needed?) to know whether a text file had single-character newlines (0A) or double character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that stripped off the added material, but it’s a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
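For what it's worth, the newline check described above can still be done in Python 3 directly on the bytes object returned by a binary-mode read. Nothing needs to be stripped, since the b'' prefix is only part of the printed representation. A minimal sketch, with count_newline_styles as a hypothetical helper name:

```python
def count_newline_styles(data):
    """Return (crlf_count, lone_lf_count) for a bytes object."""
    crlf = data.count(b"\r\n")
    lone_lf = data.count(b"\n") - crlf   # every CRLF also contains one LF
    return crlf, lone_lf

sample = b"first\r\nsecond\nthird\r\n"
print(count_newline_styles(sample))   # (2, 1)
```

With a real file, the data would come from open(path, 'rb').read().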
I am opening up a binary file like so:
file = open("test/test.x", 'rb')
and reading in lines to a list. Each line looks a little like:
'\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
I am having a hard time manipulating this data. If I try and print each line, python freezes, and emits beeping noises (I think there's a binary beep code in there somewhere). How do I go about using this data safely? How can I convert each hex number to decimal?
To print it, you can do something like this:
print repr(data)
For the whole thing as hex:
print data.encode('hex')
For the decimal value of each byte:
print ' '.join([str(ord(a)) for a in data])
To unpack binary integers, etc. from the data as if they originally came from a C-style struct, look at the struct module.
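The snippets above are Python 2. In Python 3, where a binary-mode read returns a bytes object, roughly equivalent calls might look like this (bytes.hex() needs Python 3.5 or later; the sample bytes are the first eight from the line in the question):

```python
data = b'\xbe\x00\xc8d\xf8d\x08\xe4'

print(repr(data))                       # escaped form, safe to print
print(data.hex())                       # be00c864f86408e4
print(' '.join(str(b) for b in data))   # 190 0 200 100 248 100 8 228
```

No ord() call is needed here because iterating over bytes already yields integers.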
\xhh is the character with hex value hh. Other characters such as . and ~ are normal characters.
Iterating on a string gives you the characters in it, one at a time.
ord(c) will return an integer representing the character. E.g., ord('A') == 65.
This will print the decimal numbers for each character:
s = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
print ' '.join(str(ord(c)) for c in s)
Binary data is rarely divided into "lines" separated by '\n'. If it is, it will have an implicit or explicit escape mechanism to distinguish between '\n' as a line terminator and '\n' as part of the data. Reading such a file as lines blindly without knowledge of the escape mechanism is pointless.
To answer your specific concerns:
'\x07' is the ASCII BEL character, which was originally for ringing the bell on a teletype machine.
You can get the integer value of a byte 'b' by doing ord(b).
HOWEVER, to process binary data properly, you need to know what the layout is. You can have signed and unsigned integers (of sizes 1, 2, 4, 8 bytes), floating point numbers, decimal numbers of varying lengths, fixed length strings, variable length strings, etc. An added complication is whether the data is recorded in big-endian or little-endian fashion. Once you know all of the above (or have very good informed guesses), the Python struct module should be able to handle all or most of your processing; the ctypes module may also be useful.
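To illustrate why the layout matters, here is a sketch that interprets the same four bytes (the first four of the question's sample) under a few assumed layouts with struct. None of these is known to be the file's real format:

```python
import struct

chunk = b'\xbe\x00\xc8d'   # 4 bytes: be 00 c8 64

print(struct.unpack('<HH', chunk))   # two little-endian unsigned 16-bit: (190, 25800)
print(struct.unpack('>HH', chunk))   # two big-endian unsigned 16-bit: (48640, 51300)
print(struct.unpack('>hh', chunk))   # two big-endian signed 16-bit: (-16896, -14236)
print(struct.unpack('<I', chunk))    # one little-endian unsigned 32-bit: (1690828990,)
```

The same bytes give four different answers, which is why guessing the layout rarely works.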
Does the data format have a name? If so, tell us; we may be able to point you to code or docs.
You ask "How do I go about using this data safely?" which begs the question: What do you want to use it for? What manipulations do you want to do?
Like theatrus mentioned, ord and hex might help you.
If you want to try to interpret some sort of structured binary data in the file, the struct module might be helpful.
You are trying to print the data converted to ASCII characters, which will not work.
You can safely use any byte of the data. If you want to print it as hexadecimal, look at the functions ord and hex.
Are you using read() or readline()? You should be using read(n) to read n bytes; readline() will read until it hits a newline, which the binary file might not have.
In either case, though, you are returned a string of bytes, which may be printable or non-printable characters, and is probably not very useful.
What you want is ord(), which converts a one-byte string into the corresponding integer value. read() from the file one byte at a time and call ord() on the result, or iterate through the entire string.
If you are willing to use NumPy and bitstream, you can do:
>>> from numpy import *
>>> from bitstream import BitStream
>>> raw = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
>>> stream = BitStream(raw)
>>> stream.read(uint8, len(stream) // 8)
array([190, 0, 200, 100, 248, 100, 8, 228, 46, 7, 126, 3, 158,
7, 190, 3, 222, 7, 254, 10], dtype=uint8)