Interpreting binary files as ASCII

I have a binary file (which I've created in C) and I would like to have a look inside the file. Obviously, I won't be able to "see" anything useful as it's in binary. However, I do know that it contains a certain number of rows with numbers in double precision. I am looking for a script to just read some values and print them so I can verify that they are in the right range. In other words, it would be like doing head or tail in Linux on a text file.
Is there a way of doing it?
Right now I've got something in Python, but it does not do what I want:
CHUNKSIZE = 8192
file = open('eigenvalues.bin', 'rb')
data = list(file.read())
print data

Use the array module to read homogeneous binary-representation numbers:
from array import array
data = array('d')
CHUNKSIZE = 8192
rowcount = CHUNKSIZE // data.itemsize  # number of doubles we find in CHUNKSIZE bytes
with open('eigenvalues.bin', 'rb') as eg:
    data.fromfile(eg, rowcount)
The array.array type otherwise behaves just like a list, only the type of values it can hold is restricted (in this case to float). Note that fromfile raises EOFError if the file holds fewer than rowcount values, although the values that were read are still appended.
Depending on the input data, you may need to add a data.byteswap() call after reading to switch between little- and big-endian. Use sys.byteorder to see which byte order your platform uses, since that is the order in which the data was read. If your data was written on a little-endian platform, swap if your platform uses the other form, and vice versa:
import sys
if sys.byteorder == 'big':
    # data was written in little-endian form, so swap the bytes to match
    data.byteswap()
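
To get the head/tail-style peek the question asks for, the array approach can be wrapped in a small helper. This is only a minimal sketch: the name peek_doubles and the default n=5 are hypothetical, and it assumes the file contains nothing but native-endian 8-byte doubles.

```python
import os
from array import array

def peek_doubles(path, n=5):
    """Return (first n, last n) doubles from a file of raw native-endian doubles."""
    item = array('d').itemsize               # size of one double in bytes
    total = os.path.getsize(path) // item    # number of whole doubles in the file
    n = min(n, total)
    head, tail = array('d'), array('d')
    with open(path, 'rb') as f:
        head.fromfile(f, n)                  # like `head`
        f.seek((total - n) * item)           # jump to the last n values
        tail.fromfile(f, n)                  # like `tail`
    return head.tolist(), tail.tolist()
```

Calling peek_doubles('eigenvalues.bin') would then return two short lists that can be printed and checked against the expected range.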

You can use struct.unpack to convert binary data into a specific data type.
For example, to read the first double from the binary data (a double occupies 8 bytes, so the slice has to be inputData[0:8]):
import struct
struct.unpack("d", inputData[0:8])
http://docs.python.org/2/library/struct.html
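
A short sketch of the struct approach on an in-memory buffer. The sample values are made up, and the '<' prefix assumes the data was written on a little-endian machine; use '>' for big-endian data.

```python
import struct

# Three little-endian doubles standing in for the file contents (made-up values).
input_data = struct.pack('<3d', 0.5, 1.25, -3.0)

# struct.unpack needs exactly 8 bytes for one double.
first, = struct.unpack('<d', input_data[0:8])

# struct.unpack_from reads at a byte offset without slicing.
third, = struct.unpack_from('<d', input_data, 2 * 8)

print(first, third)
```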

You can see each byte of your file represented in unsigned decimal with this shell command:
od -t u1 eigenvalues.bin | less
Should you want to see a particular area and decode floating point numbers, you can use dd to extract them and od -F option to decode them, eg:
dd status=noxfer if=eigenvalues.bin bs=1 skip=800 count=16 | od -F
will show two double precision numbers stored at offset 800 and 808 in the binary file.
Note that, given the Linux tag on your question, I assume you are using the GNU versions of dd and od.

Related

Convert hex string to float

I am trying to read from a file a bunch of hex numbers.
lines ='4005297103CE40C040059B532A7472C440061509BB9597D7400696DBCF1E35CC4007206BB5B0A67B4007AF4B08111B87400840D4766460524008D47E0FFB4ABA400969A572EBAFE7400A0107CCFDF50E'
dummy = [lines[index][i:i+16] for i in range(0, len(lines[index]),16)]
rdummy=[]
for elem in dummy[:-1]:
    rdummy.append(int(elem,16))
These are 10 numbers of 16 hex digits each.
In particular, when reading the first one, I have:
print(dummy[0])
4005297103CE40C0
Now I would like to convert it to a float.
I have an IDL script that, when reading this number, gives 2.64523509.
The command used in IDL is
double(4613138958682833088,0)
where it appears that 0 is an offset used when converting.
Is there a way to do this in Python?

You probably want to use the struct module for this; something like this seems to work:
import struct
lines ='4005297103CE40C040059B532A7472C440061509BB9597D7400696DBCF1E35CC4007206BB5B0A67B4007AF4B08111B87400840D4766460524008D47E0FFB4ABA400969A572EBAFE7400A0107CCFDF50E'
for [value] in struct.iter_unpack('>d', bytes.fromhex(lines)):
    print(value)
This results in 2.64523509 being printed first, which seems about right.
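
For a single value, the same idea can be sketched as two small helpers. The names hex_to_double and int_to_double are hypothetical; both assume the 64 bits are an IEEE 754 double in big-endian order, as in the answer above.

```python
import struct

def hex_to_double(hex16):
    """Interpret 16 hex digits as a big-endian IEEE 754 double."""
    return struct.unpack('>d', bytes.fromhex(hex16))[0]

def int_to_double(n):
    """Reinterpret the bit pattern of a 64-bit unsigned integer as a double."""
    return struct.unpack('>d', struct.pack('>Q', n))[0]

x = hex_to_double('4005297103CE40C0')
y = int_to_double(4613138958682833088)   # the integer passed to IDL's double()
print(x, x == y)
```

The second helper mirrors the IDL call, which starts from the integer rather than from the hex string.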

Convert Python array('B') to array('H') - always little-endian?

I have a Python array('B') (containing some data read from a file), which I would like to convert to an array('H'). I am currently using code similar to the following:
a = array.array('B', f.read())
b = a[16:32]
c = array.array('H', b.tostring())
Unfortunately the conversion in the third line uses the native byte order, so will give different results on different machines.
Is there any way to make the conversion always little-endian, irrespective of the native byte order?

array.array is only useful for internal calculations, because it always uses the native byte order. There is a method byteswap to change the order. Therefore you have to check sys.byteorder to determine the system byteorder, and swap accordingly.
To have better control over byte order, use struct:
import struct
data = f.read()
c = struct.unpack_from('<8H', data, 16)
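
The two routes can be compared side by side. This is a sketch with made-up input bytes in place of f.read(); it assumes array('H') items are 2 bytes wide, which holds on common platforms.

```python
import sys
import struct
from array import array

data = bytes(range(48))      # stand-in for f.read(); made-up contents
chunk = data[16:32]

# array route: parse in native order, then swap on big-endian hosts
# so the result is little-endian everywhere.
c = array('H')
c.frombytes(chunk)
if sys.byteorder == 'big':
    c.byteswap()

# struct route: '<' forces little-endian regardless of the host.
s = struct.unpack_from('<8H', data, 16)

print(list(c) == list(s))
```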

python 32 bit float conversion

Python 2.6 on Redhat 6.3
I have a device that saves 32 bit floating point value across 2 memory registers, split into most significant word and least significant word.
I need to convert this to a float.
I have been using the following code found on SO and it is similar to code I have seen elsewhere
#!/usr/bin/env python
import sys
from ctypes import *
first = sys.argv[1]
second = sys.argv[2]
reading_1 = str(hex(int(first)).lstrip("0x"))
reading_2 = str(hex(int(second)).lstrip("0x"))
sample = reading_1 + reading_2
def convert(s):
    i = int(s, 16) # convert from hex to a Python int
    cp = pointer(c_int(i)) # make this into a c integer
    fp = cast(cp, POINTER(c_float)) # cast the int pointer to a float pointer
    return fp.contents.value # dereference the pointer, get the float
print convert(sample)
an example of the register values would be ;
register-1;16282 register-2;60597
this produces the resulting float of
1.21034872532
A perfectly cromulent number, however sometimes the memory values are something like;
register-1;16282 register-2;1147
which, using this function results in a float of;
1.46726675314e-36
which is a fantastically small number and not a number that seems to be correct. This device should be producing readings around the 1.2, 1.3 range.
What I am trying to work out is if the device is throwing bogus values or whether the values I am getting are correct but the function I am using is not properly able to convert them.
Also is there a better way to do this, like with numpy or something of that nature?
I will hold my hand up and say that I have just copied this code from examples online and I have very little understanding of how it works; however, it seemed to work in the test cases that I had available to me at the time.
Thank you.

If you have the raw bytes (e.g. read from memory, from file, over the network, ...) you can use struct for this:
>>> import struct
>>> struct.unpack('>f', '\x3f\x9a\xec\xb5')[0]
1.2103487253189087
Here, \x3f\x9a\xec\xb5 are your input registers, 16282 (hex 0x3f9a) and 60597 (hex 0xecb5), expressed as bytes in a string. The > prefix selects big-endian byte order.
So depending how you get the register values, you may be able to use this method (e.g. by converting your input integers to byte strings). You can use struct for this, too; this is your second example:
>>> raw = struct.pack('>HH', 16282, 1147) # from two unsigned shorts
>>> struct.unpack('>f', raw)[0] # to one float
1.2032617330551147

The way you're converting the two ints makes implicit assumptions about endianness that I believe are wrong.
So, let's back up a step. You know that the first argument is the most significant word, and the second is the least significant word. So, rather than try to figure out how to combine them into a hex string in the appropriate way, let's just do this:
import struct
import sys
first = sys.argv[1]
second = sys.argv[2]
sample = int(first) << 16 | int(second)
Now we can just convert like this:
def convert(i):
    s = struct.pack('=i', i)
    return struct.unpack('=f', s)[0]
And if I try it on your inputs:
$ python floatify.py 16282 60597
1.21034872532
$ python floatify.py 16282 1147
1.20326173306
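
It may help to see why the hex-string version in the question goes wrong for the second pair of registers: stripping the "0x" prefix leaves the low word without its leading zero, so the concatenated string has 7 hex digits instead of 8. The sketch below shows that, along with a struct-based conversion; the helper name registers_to_float is hypothetical.

```python
import struct

def registers_to_float(msw, lsw):
    """Combine two 16-bit register values (most significant word first) into a float."""
    raw = struct.pack('>HH', msw, lsw)    # 4 bytes, big-endian
    return struct.unpack('>f', raw)[0]

# The question's string concatenation, reproduced for the failing input:
broken = hex(16282)[2:] + hex(1147)[2:]         # '3f9a' + '47b', one digit short
fixed = '{:04x}{:04x}'.format(16282, 1147)      # zero-padded to 4 digits per word

print(broken, fixed)
print(registers_to_float(16282, 60597), registers_to_float(16282, 1147))
```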

How do I write a long integer as binary in Python?

In Python, long integers have unlimited precision. I would like to write a 16 byte (128 bit) integer to a file. struct from the standard library supports only up to 8 byte integers. array has the same limitation. Is there a way to do this without masking and shifting each integer?
Some clarification here: I'm writing to a file that's going to be read in from non-Python programs, so pickle is out. All 128 bits are used.

I think for unsigned integers (and ignoring endianness) something like
import binascii
def binify(x):
    h = hex(x)[2:].rstrip('L')
    return binascii.unhexlify('0'*(32-len(h))+h)
>>> for i in 0, 1, 2**128-1:
... print i, repr(binify(i))
...
0 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
340282366920938463463374607431768211455 '\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
might technically satisfy the requirements of having non-Python-specific output, not using an explicit mask, and (I assume) not using any non-standard modules. Not particularly elegant, though.

Two possible solutions:
Just pickle your long integer. This will write the integer in a special format which allows it to be read again, if this is all you want.
Use the second code snippet in this answer to convert the long int to a big endian string (which can be easily changed to little endian if you prefer), and write this string to your file.
The problem is that the internal representation of bigints does not directly include the binary data you ask for.

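The PyPI bitarray module in combination with the builtin bin() function seems like a good combination for a solution that is simple and flexible.
bytes = bitarray(bin(my_long)[2:]).tobytes()
The endianness can be controlled with a few more lines of code. Note that tobytes() pads an incomplete final byte with zero bits, so the bit string should be left-padded to a multiple of 8 first if the value must be preserved. You'll have to evaluate the efficiency.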

Why not use struct with the unsigned long long type twice?
import struct
some_file.write(struct.pack("QQ", var // (2**64), var % (2**64)))
That's documented here (scroll down to get the table with Q): http://docs.python.org/library/struct.html
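
A round-trip sketch of this idea, with an explicit byte order so the file reads the same on any machine. The helper names pack_u128 and unpack_u128 are hypothetical, and big-endian with the high half first is an assumption about what the consuming program expects.

```python
import struct

def pack_u128(value):
    """Pack an unsigned 128-bit integer as 16 big-endian bytes."""
    hi, lo = divmod(value, 1 << 64)      # split into two 64-bit halves
    return struct.pack('>QQ', hi, lo)

def unpack_u128(raw):
    """Inverse of pack_u128."""
    hi, lo = struct.unpack('>QQ', raw)
    return (hi << 64) | lo

v = 2**128 - 1
print(unpack_u128(pack_u128(v)) == v)
```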

This may not avoid the "mask and shift each integer" requirement. I'm not sure what avoiding mask and shift means in the context of Python long values.
The bytes are these:
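def to_byte_list( long_int ):
    byte_list = []
    while long_int != 0:
        b = long_int%256
        byte_list.insert( 0, b )
        long_int //= 256
    return byte_list
You can then left-pad this list with zeros to 16 entries and pack it using struct.pack( '16B', *byte_list ). The format code has to be the unsigned B, since the values range from 0 to 255, and the list has to be unpacked into separate arguments.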

With Python 3.2 and later, you can use int.to_bytes and int.from_bytes: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
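
A minimal sketch of that approach. The sample value is arbitrary, and the choice of byte order is an assumption about what the reading program expects.

```python
value = 0xFEDCBA9876543210FEDCBA9876543210

# Unsigned 128-bit value as exactly 16 bytes, most significant byte first.
raw = value.to_bytes(16, byteorder='big')
back = int.from_bytes(raw, byteorder='big')

# Negative values need signed=True, which stores them in two's complement.
neg = (-2).to_bytes(16, byteorder='little', signed=True)

print(len(raw), back == value)
```

to_bytes raises OverflowError if the value does not fit in the requested number of bytes, which gives a cheap sanity check before writing.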

You could pickle the object to binary, use protocol buffers (I don't know if they allow you to serialize unlimited precision integers though) or BSON if you do not want to write code.
But writing a function that dumps 16 byte integers by shifting it should not be so hard to do if it's not time critical.

This may be a little late, but I don't see why you can't use struct:
bigint = 0xFEDCBA9876543210FEDCBA9876543210L
print bigint,hex(bigint).upper()
cbi = struct.pack("!QQ",bigint&0xFFFFFFFFFFFFFFFF,(bigint>>64)&0xFFFFFFFFFFFFFFFF)
print len(cbi)
The bigint by itself is rejected, but if you mask it with &0xFFFFFFFFFFFFFFFF you can reduce it to an 8 byte int instead of 16. Then the upper part is shifted and masked as well. You may have to play with byte ordering a bit. I used the ! mark to tell it to produce network byte order. Also, the msb and lsb (upper and lower halves) may need to be reversed. I will leave that as an exercise for the user to determine. I would say saving things in network byte order would be safer so you always know what the endianness of your data is.
No, don't ask me if network endian is big or little endian...

Based on @DSM's answer, and to support negative integers and varying byte sizes, I've created the following improved snippet:
import binascii
def to_bytes(num, size):
    x = num if num >= 0 else 256**size + num
    h = hex(x)[2:].rstrip("L")
    return binascii.unhexlify("0"*((2*size)-len(h))+h)
This properly handles negative integers and lets the user set the number of bytes.

Convert binary information to regular data type without outside modules in python

I'm tasked with reading a poorly formatted binary file and taking in the variables. Although I need to do it in C++ (ROOT, specifically), I've decided to do it in Python first because Python makes sense to me. My plan is to get it working in Python and then tackle rewriting it in C++, so relying on easy-to-use Python modules won't get me too far later down the road.
Basically, I do this:
In [5]: some_value
Out[5]: '\x00I'
In [6]: ''.join([str(ord(i)) for i in some_value])
Out[6]: '073'
In [7]: int(''.join([str(ord(i)) for i in some_value]))
Out[7]: 73
And I know there has to be a better way. What do you think?
EDIT:
A bit of info on the binary format.
[Three images describing the binary format: http://grab.by/3njm, http://grab.by/3njv, http://grab.by/3nkL]
This is the endian test I am using:
# Read a uint32 for endianness
endian_test = rq1_file.read(uint32)
if endian_test == '\x04\x03\x02\x01':
    print "Endian test: \\x04\\x03\\x02\\x01"
    swapbits = True
elif endian_test == '\x01\x02\x03\x04':
    print "Endian test: \\x01\\x02\\x03\\x04"
    swapbits = False

Your int(''.join([str(ord(i)) for i in some_value])) works ONLY when all bytes except the last byte are zero.
Examples:
'\x01I' should be 1 * 256 + 73 == 329; you get 173
'\x01\x02' should be 1 * 256 + 2 == 258; you get 12
'\x01\x00' should be 1 * 256 + 0 == 256; you get 10
It also relies on an assumption that integers are stored in big-endian fashion; have you verified this assumption? Are you sure that '\x00I' represents the integer 73, and not the integer 73 * 256 + 0 == 18688 (or something else)? Please let us help you verify this assumption by telling us what brand and model of computer and what operating system were used to create the data.
How are negative integers represented?
Do you need to deal with floating-point numbers?
Is the requirement to write it in C++ immutable? What does "(ROOT, specifically)" mean?
If the only dictate is common sense, the preferred order would be:
Write it in Python using the struct module.
Write it in C++ but use C++ library routines (especially if floating-point is involved). Don't re-invent the wheel.
Roll your own conversion routines in C++. You could snarf a copy of the C source for the Python struct module.
Update
Comments after the file format details were posted:
The endianness marker is evidently optional, except at the start of a file. This is dodgy; it relies on the fact that if it is not there, the 3rd and 4th bytes of the block are the 1st 2 bytes of the header string, and neither '\x03\x04' nor '\x02\x01' can validly start a header string. The smart thing to do would be to read SIX bytes -- if first 4 are the endian marker, the next two are the header length, and your next read is for the header string; otherwise seek backwards 4 bytes then read the header string.
The above is in the nuisance category. The negative sizes are a real worry, in that they specify a MAXIMUM length, and there is no mention of how the ACTUAL length is determined. It says "The actual size of the entry is then given line by line". How? There is no documentation of what a "line of data" looks like. The description mentions "lines" many times; are these lines terminated by carriage return and/or line feed? If so, how does one tell the difference between say a line feed byte and the first byte of say a uint16 that belongs to the current "line" of data? If no linefeed or whatever, how does one know when the current line of data is finished? Is there a uintNN size in front of every variable or slice thereof?
Then it says that (2) above (negative size) also applies to the header string. The mind boggles. Do you have any examples (in documentation of the file layout, or in actual files) of "negative size" of (a) header string (b) data "line"?
Is this "decided format" publicly available, e.g. documentation on the web? Does the format have a searchable name? Are you sure you are the first person in the world to want to read that format?
Reading that file format, even with a full specification, is no trivial exercise, even for a binary-format-experienced person who's also experienced with Python (which BTW doesn't have a float128). How many person-hours have you been allocated for the task? What are the penalties for (a) delay (b) failure?
Your original question involved fixing your interesting way of trying to parse a uint16 -- doing much more is way outside the scope/intention of what SO questions are all about.

You're basically computing a "number-in-base-256", which is a polynomial, so, by Horner's method:
>>> v = 0
>>> for c in someval: v = v * 256 + ord(c)
More typical would be to use equivalent bit-operations rather than arithmetic -- the following's equivalent:
>>> v = 0
>>> for c in someval: v = v << 8 | ord(c)

import struct
result, = struct.unpack('>H', some_value)
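
The Horner-style loop and the struct call give the same result, which can be checked with a short sketch. It is written for Python 3, where iterating over a bytes object yields integers and ord() is not needed; the function name be_uint is hypothetical, and big-endian byte order is assumed.

```python
import struct

def be_uint(data):
    """Big-endian unsigned integer from a bytes object, via Horner's method."""
    v = 0
    for b in data:           # each b is already an int in Python 3
        v = (v << 8) | b
    return v

sample = b'\x01I'            # 1 * 256 + 73
print(be_uint(sample), struct.unpack('>H', sample)[0])
```

The loop translates almost line for line into C++, which matters for the asker's plan to port the parser later.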

The equivalent to the Python struct module is a C struct and/or union, so being afraid to use it is silly.

I'm not exactly sure what format the data you want to extract is in, but maybe you'd be better off writing a couple of generic utility functions to extract the different data types you need:
def int1b(data, i):
    return ord(data[i])
def int2b(data, i):
    return (int1b(data, i) << 8) + int1b(data, i+1)
def int4b(data, i):
    return (int2b(data, i) << 16) + int2b(data, i+2)
With such functions you can easily extract values from the data and they also can be translated rather easily to C.
