Read binary file in Python using IDL function as reference

I want to read a binary file using Python. So far I've used numpy.fromfile, but I haven't been able to figure out the structure of the resulting array. I have an IDL function that reads the file, and this is the only thing I have to go on; I have no knowledge of IDL at all.
The following IDL function will read the file and return lc,zgrid,fnu,efnu etc.:
openr,lun,'file.dat',/swap_if_big_endian,/get_lun
s = lonarr(4) & readu,lun,s
NFILT=s[0] & NTEMP = s[1] & NZ = s[2] & NOBJ = s[3]
tempfilt = dblarr(NFILT,NTEMP,NZ)
lc = dblarr(NFILT) ; central wavelengths
zgrid = dblarr(NZ)
fnu = dblarr(NFILT,NOBJ)
efnu = dblarr(NFILT,NOBJ)
readu,lun,tempfilt,lc,zgrid,fnu,efnu
close,/all
But I am unsure how to replicate this in Python. Any help is appreciated. Thanks.
I'm not looking for translation. I'm looking for a springboard from which I can try and solve this problem.
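As a starting point, here is a minimal numpy sketch of what the same read might look like, assuming the layout the IDL snippet implies (four little-endian 32-bit sizes, then float64 arrays in IDL's column-major order); treat it as a springboard rather than a tested translation:
import numpy as np

with open('file.dat', 'rb') as f:
    # Four 32-bit integers give the array dimensions (little-endian assumed,
    # matching /swap_if_big_endian in the IDL snippet).
    nfilt, ntemp, nz, nobj = np.fromfile(f, dtype='<i4', count=4)
    # The remaining blocks are float64; IDL arrays are column-major, so read
    # with the reversed shape and transpose to keep IDL-style indexing.
    tempfilt = np.fromfile(f, dtype='<f8', count=nfilt*ntemp*nz).reshape(nz, ntemp, nfilt).T
    lc    = np.fromfile(f, dtype='<f8', count=nfilt)          # central wavelengths
    zgrid = np.fromfile(f, dtype='<f8', count=nz)
    fnu   = np.fromfile(f, dtype='<f8', count=nfilt*nobj).reshape(nobj, nfilt).T
    efnu  = np.fromfile(f, dtype='<f8', count=nfilt*nobj).reshape(nobj, nfilt).T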

To read a binary file (assuming the values are 32 bits wide, or some width the user already knows), I would first make a method built around:
>>> a = '00011111001101110000101010101010'
>>> int(a,2)
523700906
That is, we need a method of our own that does this conversion, such as:
def binaryToAscii(string_of_binary):
    '''
    binaryToAscii takes in a string of binary and returns an ASCII character
    '''
    charVal = int(string_of_binary, 2)
    char = chr(charVal)
    return char
The next step would be to make a method that incorporates binaryToAscii in such a way that we are either concatenating some string, or writing to a new file. This should be left to the user to decide.
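For example, a minimal sketch of that next step, concatenating over 8-bit chunks (the helper name and the fixed 8-bit chunk size are assumptions for illustration):
def binaryStringToText(bits):
    # Apply binaryToAscii to each 8-bit chunk and concatenate the results.
    return ''.join(binaryToAscii(bits[i:i+8]) for i in range(0, len(bits), 8))

print(binaryStringToText('0110100001100101011011000110110001101111'))  # 'hello'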
As an aside, if you are not retrieving the binary as a string, there are built-in methods that turn Unicode characters into ASCII values by taking in their Unicode value (binary included).
Regarding reading the file itself, the usual documentation on reading and writing files applies.

Related

How to perform SHA-256 on binary values with Hashlib?

I'm using Python 2 and am attempting to perform sha256 on binary values using hashlib.
I've become a bit stuck as I'm quite new to it all but have cobbled together:
hashlib.sha256('0110100001100101011011000110110001101111'.decode('hex')).hexdigest()
I believe it interprets the string as hex, based on substituting the hex value ('68656c6c6f') into the above, it returning
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
and comparing to this answer, in which 'hello' or '68656c6c6f' is used.
I think the answer lies with the decode component, but I can't find an example for binary, only 'hex' or 'utf-8'.
Is anyone able to suggest what needs to be changed so that the function interprets as binary values instead of hex?
Here is code that does each of the data conversions you are looking for. These steps can all be combined, but are separated here so you can see each value.
import hashlib
import binascii
binstr = '0110100001100101011011000110110001101111'
hexstr = "{0:0>4X}".format(int(binstr,2)) # '68656C6C6F'
data = binascii.a2b_hex(hexstr) # 'hello'
output = hashlib.sha256(data).hexdigest()
print output
OUTPUT:
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
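For reference, the same conversion on Python 3 can be written with int.to_bytes (a sketch; the question itself targets Python 2):
import hashlib

binstr = '0110100001100101011011000110110001101111'
data = int(binstr, 2).to_bytes(len(binstr) // 8, 'big')   # b'hello'
print(hashlib.sha256(data).hexdigest())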

hex header of file, magic numbers, python

I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)
The problem is the following:
I read first 24 bytes of a file:
with open (from_folder+"/"+i, "rb") as myfile:
    header = str(myfile.read(24))
then I look for a pattern in it:
if y[1] in header:
    shutil.move(from_folder+"/"+i, to_folder+y[2]+i+y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern and = r'\x47\x40\x00'
The file has these bytes inside, as a hex view of the file shows,
but the program does NOT find this pattern (r'\x47\x40\x00') in the file header.
So I tried to print the header:
Python shows the first two bytes as 'G@' instead of '\x47\x40',
and if I search for 'G@'+r'\x00' in header, everything is OK. It finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it, not for some strange 'G@'+r'\x00'.
OR
Why does Python show the first two bytes as 'G@' and not as '\x47\x40', while it shows the rest of the header in hex? Is there a way to fix it?
What I ended up doing instead:
with open (from_folder+"/"+i, "rb") as myfile:
    header = myfile.read(24)
    header = str(binascii.hexlify(header))[2:-1]
the result I get is:
4740001b0000b00d0001c100000001efff3690e23dffffff
and I can work with it.
P.S. Anyway, if anybody can explain what the problem was with the first 2 bytes, I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
No need to convert it to a string by str.
Print will try to convert bytes to something human readable.
If you don't want that, convert your bytes to e.g. hex representations of the integer values of the bytes by:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
You can't search in aBytes for a str pattern like r'\x47\x40\x00' with the in operator, since aBytes isn't a string but a bytes object.
If you want to apply a string search on '\x00\x47\x40', use:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))
Which will give you:
b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
So there are a number of separate issues at play here:
print tries to print something human readable, which succeeds only for the two printable bytes ('G' and '@').
You can't mix a str pattern with bytes using in; either search with a bytes pattern, or convert your bytes to a string containing fixed-length hex representations as substrings, as shown.
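A minimal sketch of the first option, searching the raw bytes directly with a bytes pattern (the file name here is just a placeholder):
pattern = b'\x47\x40\x00'                  # a bytes literal, not a raw str

with open("some_cache_file", "rb") as f:   # placeholder name
    header = f.read(24)                    # read() on a binary file returns bytes

if pattern in header:                      # bytes-in-bytes substring search
    print("magic number found")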

reading char data from binary file with numpy

I have a binary file with some character data stuck in the middle of a bunch of ints and floats. I am trying to read with numpy. The farthest I have been able to get regarding the character data is:
strbits = np.fromfile(infile,dtype='int8',count=73)
(Yes, it's a 73-character string.)
Three questions: Is my data now stored without corruption or truncation in strbits? And, can I now convert strbits into a readable string? Finally, should I be doing this in some completely different way?
UPDATE:
Here's something that works, but I would think there would be a more elegant way.
strarr = np.zeros(73, dtype='c')
for n in range(73):
    strarr[n] = np.fromfile(infile, dtype='c', count=1)[0]
So now I have an array where each element is a single character from the input file.
The way you're doing it is fine. Here's how you can convert it to a string.
strbits = np.fromfile(infile, dtype=np.int8, count=73)
a_string = ''.join([chr(item) for item in strbits])
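A slightly more direct variant (a sketch, assuming the 73 bytes are plain ASCII and infile is the same open file object) reads and decodes them in one go:
import numpy as np

raw = np.fromfile(infile, dtype=np.uint8, count=73)   # 73 raw bytes
a_string = raw.tobytes().decode('ascii')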

Reading a wave file in Python

I have created a morse code generator that converts English sentences into morse code. It also converts this text-based morse code into an audio file: if the character is a dot, I append a dot.wav file to the output wave file, or a dash.wav file if the character is a dash.
I now want to open this wave file and read its content to figure out the order in which these dashes and dots are placed.
I have tried the following code:
waveFile = wave.open(r"C:\Users\Gaurav Keswani\Documents\Eclipse\Morse Code Converter\src\resources\sound\morse.wav", 'r')
x = waveFile.readframes(20)
print (struct.unpack("<40H", x))
This gives me the output as:
(65089, 65089, 3093, 3093, 11895, 11895, 18629, 18629, 25196, 25196,
29325, 29325, 31986, 31986, 32767, 32767, 31265, 31265, 27532, 27532,
22485, 22485, 15762, 15762, 7895, 7895, 103, 103, 57228, 57228, 49571,
49571, 42790, 42790, 37667, 37667, 34362, 34362, 32776, 32776)
I don't know what to make of this output. Can anyone help?
If you want a general solution to detecting Morse code, you are going to have to take a look at what it looks like as a waveform (tom10's link to this question should help here if you can install numpy and matplotlib; if not, you can use the stdlib's csv module to export a file that you can use in your favorite spreadsheet program); work out how you as a human can distinguish dots, dashes, and spaces; turn that into an algorithm (a series of steps that even a literal-minded moron can follow); then turn that algorithm into code. Or you may be able to find a library that's already done this for you.
But for your specific case, you only need to detect exact copies of the contents of dot.wav and dash.wav within your larger file. (At least assuming you're not using any lossy compression, which usually you aren't in .wav files.) So, this is really just a substring search.
Think about how you'd detect the strings 'dot' and 'dash' within a string like 'dash dash dash dash dash dot dash dot dot dot dot dot '. For such a simple problem, you could use a stupid brute-force algorithm, and it would be fine:
def find(haystack, needle, start):
    for i in range(start, len(haystack)):
        if haystack[i:i+len(needle)] == needle:
            return i
    return len(haystack)
def decode_morse(morse):
    i = 0
    while i < len(morse):
        next_dot = find(morse, 'dot', i)
        next_dash = find(morse, 'dash', i)
        if next_dot < next_dash:
            if next_dot < len(morse):
                yield '.'
            i = next_dot + len('dot')      # skip past the token we just consumed
        else:
            if next_dash < len(morse):
                yield '-'
            i = next_dash + len('dash')
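A quick check of the idea on a plain token string (the input here is just an illustration):
print(''.join(decode_morse('dash dash dash dot dot dot ')))   # prints ---...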
Now, if you're searching a list of numbers instead of a string, how does this have to change? Barely at all; you can slice a list, compare two lists, etc. just like you can with strings.
The only real problem you'll run into is that you don't have the whole list in memory at once, just 20 frames at a time. What happens if a dot starts in frame 19 and ends in frame 20? If your files aren't too big, this is easy to solve: just read all the frames into memory in one giant list, then search the whole thing. But otherwise, you have to do some buffering.
For example (ignoring error handling and dealing with the end of the file properly, and dealing only with dashes for simplicity—of course you have to do both of those properly in your real code):
buf = []
while True:
    # Keep at least two dash-lengths of frames buffered.
    while len(buf) < 2*len(dash):
        buf.extend(waveFile.readframes(20))
    next_dash = find(buf, dash, 0)
    if next_dash < len(buf):
        yield '-'
        buf = buf[next_dash + len(dash):]
    else:
        buf = buf[-len(dash):]
We're making sure we always have at least two dash lengths in our buffer. And we always keep the leftover after the first dot or dash (if one was found), or a full dash length (if not), in the buffer, and add the next chunk of frames to that. That's actually overkill; think it through and work out exactly what you need to make sure we never miss a dash that falls between two buffers. But the point is, as long as you get that right, you can't miss any dots or dashes.

Convert binary information to regular data type without outside modules in python

I'm tasked with reading a poorly formatted binary file and taking in the variables. Although I need to do it in C++ (ROOT, specifically), I've decided to do it in Python because Python makes sense to me. My plan is to get it working in Python and then tackle rewriting it in C++, so relying on easy-to-use Python modules won't get me far later down the road.
Basically, I do this:
In [5]: some_value
Out[5]: '\x00I'
In [6]: ''.join([str(ord(i)) for i in some_value])
Out[6]: '073'
In [7]: int(''.join([str(ord(i)) for i in some_value]))
Out[7]: 73
And I know there has to be a better way. What do you think?
EDIT:
A bit of info on the binary format.
(Screenshots of the binary format specification: http://grab.by/3njm, http://grab.by/3njv, http://grab.by/3nkL)
This is the endian test I am using:
# Read a uint32 for endianess
endian_test = rq1_file.read(uint32)
if endian_test == '\x04\x03\x02\x01':
print "Endian test: \\x04\\x03\\x02\\x01"
swapbits = True
elif endian_test == '\x01\x02\x03\x04':
print "Endian test: \\x01\\x02\\x03\\x04"
swapbits = False
Your int(''.join([str(ord(i)) for i in some_value])) works ONLY when all bytes except the last byte are zero.
Examples:
'\x01I' should be 1 * 256 + 73 == 329; you get 173
'\x01\x02' should be 1 * 256 + 2 == 258; you get 12
'\x01\x00' should be 1 * 256 + 0 == 256; you get 10
It also relies on an assumption that integers are stored in big-endian fashion; have you verified this assumption? Are you sure that '\x00I' represents the integer 73, and not the integer 73 * 256 + 0 == 18688 (or something else)? Please let us help you verify this assumption by telling us what brand and model of computer and what operating system were used to create the data.
How are negative integers represented?
Do you need to deal with floating-point numbers?
Is the requirement to write it in C++ immutable? What does "(ROOT, specifically)" mean?
If the only dictate is common sense, the preferred order would be:
Write it in Python using the struct module.
Write it in C++ but use C++ library routines (especially if floating-point is involved). Don't re-invent the wheel.
Roll your own conversion routines in C++. You could snarf a copy of the C source for the Python struct module.
Update
Comments after the file format details were posted:
The endianness marker is evidently optional, except at the start of a file. This is dodgy; it relies on the fact that if it is not there, the 3rd and 4th bytes of the block are the 1st 2 bytes of the header string, and neither '\x03\x04' nor '\x02\x01' can validly start a header string. The smart thing to do would be to read SIX bytes -- if the first 4 are the endian marker, the next two are the header length, and your next read is for the header string; otherwise seek backwards 4 bytes, then read the header string.
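A rough sketch of that six-byte probe (hedged: the marker byte values come from the endian-test snippet above, and the uint16 header-length field and function name are assumptions for illustration):
import struct

def read_header_string(f, byte_order='<'):
    # Probe six bytes: an optional 4-byte endian marker, then a uint16 header length.
    first6 = f.read(6)
    if first6[:4] in (b'\x04\x03\x02\x01', b'\x01\x02\x03\x04'):
        # Marker present: bytes 5-6 hold the header length.
        (hdr_len,) = struct.unpack(byte_order + 'H', first6[4:6])
    else:
        # No marker: bytes 1-2 hold the length and bytes 3-6 already belong
        # to the header string, so step back four bytes before reading it.
        (hdr_len,) = struct.unpack(byte_order + 'H', first6[:2])
        f.seek(-4, 1)
    return f.read(hdr_len)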
That marker issue is merely in the nuisance category. The negative sizes are a real worry, in that they specify a MAXIMUM length, and there is no mention of how the ACTUAL length is determined. It says "The actual size of the entry is then given line by line". How? There is no documentation of what a "line of data" looks like. The description mentions "lines" many times; are these lines terminated by carriage return and/or line feed? If so, how does one tell the difference between, say, a line feed byte and the first byte of, say, a uint16 that belongs to the current "line" of data? If there is no line feed or whatever, how does one know when the current line of data is finished? Is there a uintNN size in front of every variable or slice thereof?
Then it says that (2) above (negative size) also applies to the header string. The mind boggles. Do you have any examples (in documentation of the file layout, or in actual files) of "negative size" of (a) header string (b) data "line"?
Is this "decided format" publically available e.g. documentation on the web? Does the format have a searchable name? Are you sure you are the first person in the world to want to read that format?
Reading that file format, even with a full specification, is no trivial exercise, even for a binary-format-experienced person who's also experienced with Python (which BTW doesn't have a float128). How many person-hours have you been allocated for the task? What are the penalties for (a) delay (b) failure?
Your original question involved fixing your interesting way of trying to parse a uint16 -- doing much more is way outside the scope/intention of what SO questions are all about.
You're basically computing a "number-in-base-256", which is a polynomial, so, by Horner's method:
>>> v = 0
>>> for c in someval: v = v * 256 + ord(c)
More typical would be to use equivalent bit-operations rather than arithmetic -- the following's equivalent:
>>> v = 0
>>> for c in someval: v = v << 8 | ord(c)
import struct
result, = struct.unpack('>H', some_value)
The equivalent to the Python struct module is a C struct and/or union, so being afraid to use it is silly.
I'm not exactly sure what the format of the data you want to extract is, but maybe you would be better off just writing a couple of generic utility functions to extract the different data types you need:
def int1b(data, i):
    return ord(data[i])

def int2b(data, i):
    return (int1b(data, i) << 8) + int1b(data, i+1)

def int4b(data, i):
    return (int2b(data, i) << 16) + int2b(data, i+2)
With such functions you can easily extract values from the data, and they can also be translated rather easily to C.
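A quick usage check against the two-byte value from the question:
data = '\x00I'
print(int2b(data, 0))   # 73, i.e. 0 * 256 + ord('I')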
