I wrote a Python script to create a binary file of integers.
import struct

pos = [7623, 3015, 3231, 3829]
inh = open('test.bin', 'wb')
for e in pos:
    inh.write(struct.pack('i', e))
inh.close()
It worked well. Then I tried to read the 'test.bin' file using the code below.
import struct

inh = open('test.bin', 'rb')
for rec in inh:
    pos = struct.unpack('i', rec)
    print pos
inh.close()
But it failed with an error message:
Traceback (most recent call last):
  File "readbinary.py", line 10, in <module>
    pos = struct.unpack('i', rec)
  File "/usr/lib/python2.5/struct.py", line 87, in unpack
    return o.unpack(s)
struct.error: unpack requires a string argument of length 4
I would like to know how I can read this file using struct.unpack.
Many thanks in advance,
Vipin
for rec in inh: reads one line at a time -- not what you want for a binary file. Read 4 bytes at a time (with a while loop and inh.read(4)) instead (or read everything into memory with a single .read() call, then unpack successive 4-byte slices). The second approach is simplest and most practical as long as the amount of data involved isn't huge:
import struct

with open('test.bin', 'rb') as inh:
    data = inh.read()
for i in range(0, len(data), 4):
    pos = struct.unpack('i', data[i:i+4])
    print(pos)
If you do fear potentially huge amounts of data (which would take more memory than you have available), a simple generator offers an elegant alternative:
import struct

def by4(f):
    rec = 'x'  # placeholder for the `while`
    while rec:
        rec = f.read(4)
        if rec: yield rec

with open('test.bin', 'rb') as inh:
    for rec in by4(inh):
        pos = struct.unpack('i', rec)
        print(pos)
A key advantage to this second approach is that the by4 generator can easily be tweaked (while maintaining the specs: return a binary file's data 4 bytes at a time) to use a different implementation strategy for buffering, all the way to the first approach (read everything then parcel it out) which can be seen as "infinite buffering" and coded:
def by4(f):
    data = f.read()
    for i in range(0, len(data), 4):
        yield data[i:i+4]
while leaving the "application logic" (what to do with that stream of 4-byte chunks) intact and independent of the I/O layer (which gets encapsulated within the generator).
I think "for rec in inh" is supposed to read 'lines', not bytes. What you want is:
while True:
    rec = inh.read(4)  # or inh.read(struct.calcsize('i'))
    if len(rec) != 4:
        break
    (pos,) = struct.unpack('i', rec)
    print pos
Or, as others have mentioned, read the data into a buffer and use struct.unpack_from, which takes a buffer and an offset rather than a file object:
data = inh.read()
offset = 0
while True:
    try:
        (pos,) = struct.unpack_from('i', data, offset)
    except struct.error:
        break
    print pos
    offset += 4
Check the size of the packed integers:
>>> pos
[7623, 3015, 3231, 3829]
>>> [struct.pack('i',e) for e in pos]
['\xc7\x1d\x00\x00', '\xc7\x0b\x00\x00', '\x9f\x0c\x00\x00', '\xf5\x0e\x00\x00']
We see 4-byte strings, it means that reading should be 4 bytes at a time:
>>> inh=open('test.bin','rb')
>>> b1=inh.read(4)
>>> b1
'\xc7\x1d\x00\x00'
>>> struct.unpack('i',b1)
(7623,)
>>>
This is the original int! Extending into a reading loop is left as an exercise.
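For reference, one minimal way such a loop could look (a sketch, reading 4 bytes at a time until the file runs out):

import struct

inh = open('test.bin', 'rb')
while True:
    b = inh.read(4)
    if len(b) < 4:  # end of file (or a trailing partial record)
        break
    print struct.unpack('i', b)[0]
inh.close()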
You can probably use array as well if you want:
import array

pos = array.array('i', [7623, 3015, 3231, 3829])
inh = open('test.bin', 'wb')
pos.tofile(inh)   # tofile is the documented name; array.write is an old deprecated alias
inh.close()
Then use array.array.fromfile or fromstring to read it back.
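For example, a minimal sketch of reading it back with fromfile (deriving the item count from the file size, since fromfile needs the count up front):

import array
import os

inh = open('test.bin', 'rb')
pos = array.array('i')
n_items = os.path.getsize('test.bin') // pos.itemsize
pos.fromfile(inh, n_items)
inh.close()
print pos.tolist()  # [7623, 3015, 3231, 3829] for the file written above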
This function reads all the bytes from a file:
import array
import os

def read_binary_file(filename):
    try:
        f = open(filename, 'rb')
        n = os.path.getsize(filename)
        data = array.array('B')
        data.fromfile(f, n)   # fromfile replaces the deprecated array.read
        f.close()
        fsize = len(data)
        return (fsize, data)
    except IOError:
        return (-1, [])
# somewhere in your code
t = read_binary_file(FILENAME)
fsize = t[0]
if (fsize > 0):
    data = t[1]
    # work with data
else:
    print 'Error reading file'
Your iterator isn't reading 4 bytes at a time so I imagine it's rather confused. Like SilentGhost mentioned, it'd probably be best to use unpack_from().
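For instance, a rough sketch of the unpack_from approach (it works on a buffer plus an offset, so read the file into memory first):

import struct

inh = open('test.bin', 'rb')
buf = inh.read()  # the whole file as one buffer
inh.close()
for offset in range(0, len(buf), 4):
    (pos,) = struct.unpack_from('i', buf, offset)
    print pos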
Help me fix this code. My script sorts coordinates from a list into even and odd groups, but it only works when the list is in decimal number format; I need the code to work with a list in HEX (hexadecimal number) format.
I don't know Python well, but I think I need to add something like hex(str).
Here is the list, List.txt:
(0x52DF625,0x47A406E)
(0x3555F30,0x3323041)
(0x326A573,0x5A5E578)
(0x48F8EF7,0x98A4EF3)
(0x578FE62,0x331DF3E)
(0x3520CAD,0x1719BBB)
(0x506FC9F,0x40CF4A6)
Code:
with open('List.txt') as fin,\
        open('Save+even.txt', 'a') as foutch,\
        open('Save-odd.txt', 'a') as foutnch:
    data = [line.strip() for line in fin]
    nch = [foutnch.write(str(i) + '\n')
           for i in data if int(i[1:-1].split(',')[1]) % 2]
    ch = [foutch.write(str(i) + '\n')
          for i in data if int(i[1:-1].split(',')[1]) % 2 != 1]
This may work for you (I used StringIO instead of real files, but added a comment on how you could use it with real files):
from io import StringIO

in_file = StringIO("""(0x52DF625,0x47A406E)
(0x3555F30,0x3323041)
(0x326A573,0x5A5E578)
(0x48F8EF7,0x98A4EF3)
(0x578FE62,0x331DF3E)
(0x3520CAD,0x1719BBB)
(0x506FC9F,0x40CF4A6)
""")
even_file = StringIO()
odd_file = StringIO()

# with open("List.txt") as in_file, open("Save-even.txt", "w") as even_file, open("Save-odd.txt", "w") as odd_file:
for line in in_file:
    x_str, y_str = line.strip()[1:-1].split(",")
    x, y = int(x_str, 0), int(y_str, 0)
    if y & 1:  # y is odd
        odd_file.write(line)
    else:
        even_file.write(line)
print("odd")
print(odd_file.getvalue())
print("even")
print(even_file.getvalue())
it outputs:
odd
(0x3555F30,0x3323041)
(0x48F8EF7,0x98A4EF3)
(0x3520CAD,0x1719BBB)
even
(0x52DF625,0x47A406E)
(0x326A573,0x5A5E578)
(0x578FE62,0x331DF3E)
(0x506FC9F,0x40CF4A6)
The trick is to use base 0 when converting a hex string to int: int(x_str, 0). See this answer.
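A quick interactive check, using one of the values from List.txt:

>>> int("0x3323041", 0)   # "0x" prefix, parsed as hexadecimal
53620801
>>> int("42", 0)          # no prefix, parsed as decimal
42
>>> 53620801 % 2          # odd, so that pair belongs in the odd file
1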
My code:
big_int = 536870912
f = open('sample.txt', 'wb')
for item in range(0, 10):
    y = bytearray.fromhex('{:0192x}'.format(big_int))
    f.write("%s" % y)
f.close()
I want to convert a long int to bytes. But I am getting TypeError: 'str' does not support the buffer interface.
In addition to @Will's answer, it is better to use a with statement to open your files. Also, if you are using Python 3.2 or later, you can use int.to_bytes and int.from_bytes to reverse the process.
A sample of what I mean:
big_int = 536870912
with open('sample.txt', 'wb') as f:
    y = big_int.to_bytes((big_int.bit_length() // 8) + 1, byteorder='big')
    f.write(y)
    print(int.from_bytes(y, byteorder='big'))
The last print is to show you how to reverse it.
In Python 3, strings are implicitly Unicode and are therefore detached from a specific binary representation (which depends on the encoding that is used). Therefore, the string "%s" % y cannot be written to a file that is opened in binary mode.
Instead, you can just write y directly to the file:
y = bytearray.fromhex('{:0192x}'.format(big_int))
f.write(y)
Moreover, your code ("%s" % y) in fact creates a Unicode string containing the string representation of y (i.e., str(y)), which isn't what you think it is. For example:
>>> '%s' % bytearray()
"bytearray(b'')"
I've got some data in a binary file that I need to parse. The data is separated into chunks of 22 bytes, so I'm trying to generate a list of tuples, each tuple containing 22 values. The file isn't separated into lines though, so I'm having problems figuring out how to iterate through the file and grab the data.
If I do this it works just fine:
nextList = f.read(22)
newList = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList)
where newList contains a tuple of 22 values. However, if I try to apply similar logic to a function that iterates through, it breaks down.
def getAllData():
    listOfAll = []
    nextList = f.read(22)
    while nextList != "":
        listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
        nextList = f.read(22)
    return listOfAll
data = getAllData()
gives me this error:
Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    data = getAllData()
  File "<pyshell#26>", line 5, in getAllData
    listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
struct.error: unpack requires a bytes object of length 22
I'm fairly new to python so I'm not too sure where I'm going wrong here. I know for sure that the data in the file breaks down evenly into sections of 22 bytes, so it's not a problem there.
Since you report that it was still running when len(nextList) == 0, this is probably because nextList (which isn't actually a list) is an empty bytes object at that point, and an empty bytes object isn't equal to an empty string object:
>>> b"" == ""
False
and so the condition in your line
while nextList != "":
never becomes false, even when nextList is empty at end of file. That's why using len(nextList) != 22 as a break condition worked, and even
while nextList:
should suffice.
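So a minimally corrected version of the original function could look like this (same logic, only the loop condition changed; the file object is passed in as a parameter here, and the file is assumed, as stated, to be an exact multiple of 22 bytes):

import struct

def getAllData(f):
    listOfAll = []
    nextBytes = f.read(22)
    while nextBytes:  # an empty bytes object means end of file
        listOfAll.append(struct.unpack("B" * 22, nextBytes))
        nextBytes = f.read(22)
    return listOfAll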
read(22) isn't guaranteed to return a string of length 22. Its contract is to return a string of length anywhere between 0 and 22 (inclusive); a string of length zero indicates there is no more data to be read. In Python 3, binary file objects produce bytes objects instead of str, and str and bytes will never compare equal.
If your file is smallish then you'd be better off reading the entire file into memory and then splitting it up into chunks, e.g.
listOfAll = []
data = f.read()
for i in range(0, len(data), 22):
    t = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", data[i:i+22])
    listOfAll.append(t)
Otherwise you will need to do something more complicated with checking the amount of data you get back from the read.
def dataiter(f, chunksize=22, buffersize=4096):
    data = b''
    while True:
        newdata = f.read(buffersize)
        if not newdata:  # end of file
            if not data:
                return
            else:
                yield data
                # or raise an error, as 0 < len(data) < chunksize
                # or pad with zeros to chunksize
                return
        data += newdata
        i = 0
        while len(data) - i >= chunksize:
            yield data[i:i+chunksize]
            i += chunksize
        data = data[i:]  # keep the remainder of the unused data (empty if all of it was used)
I am running the following code on Ubuntu 11.10, Python 2.7.2+.
import urllib
import Image
import StringIO
source = '/home/cah/Downloads/evil2.gfx'
dataFile = open(source, 'rb').read()
slicedFile1 = StringIO.StringIO(dataFile[::5])
slicedFile2 = StringIO.StringIO(dataFile[1::5])
slicedFile3 = StringIO.StringIO(dataFile[2::5])
slicedFile4 = StringIO.StringIO(dataFile[3::5])
jpgimage1 = Image.open(slicedFile1)
jpgimage1.save('/home/cah/Documents/pychallenge12.1.jpg')
pngimage1 = Image.open(slicedFile2)
pngimage1.save('/home/cah/Documents/pychallenge12.2.png')
gifimage1 = Image.open(slicedFile3)
gifimage1.save('/home/cah/Documents/pychallenge12.3.gif')
pngimage2 = Image.open(slicedFile4)
pngimage2.save('/home/cah/Documents/pychallenge12.4.png')
In essence, I'm taking a .bin file that has the hex code for several image files jumbled together
like 123451234512345..., clumping them back together, and then saving. The problem is I'm getting the following error:
File "/usr/lib/python2.7/dist-packages/PIL/PngImagePlugin.py", line 96, in read
len = i32(s)
File "/usr/lib/python2.7/dist-packages/PIL/PngImagePlugin.py", line 44, in i32
return ord(c[3]) + (ord(c[2])<<8) + (ord(c[1])<<16) + (ord(c[0])<<24)
IndexError: string index out of range
I found PngImagePlugin.py and looked at what it had:

def i32(c):
    return ord(c[3]) + (ord(c[2])<<8) + (ord(c[1])<<16) + (ord(c[0])<<24)    # line 44

and, in the chunk-reading method (lines 88-96):

    "Fetch a new chunk. Returns header information."
    if self.queue:
        cid, pos, len = self.queue[-1]
        del self.queue[-1]
        self.fp.seek(pos)
    else:
        s = self.fp.read(8)
        cid = s[4:]
        pos = self.fp.tell()
        len = i32(s)
I would try tinkering, but I'm afraid I'll screw up PNG and PIL, which have been irksome to get working.
Thanks
It would appear that len(s) < 4 at this stage:
len = i32(s)
which means that
s = self.fp.read(8)
isn't returning the full 8 bytes (it's getting back fewer than 4).
Probably the data in the fp you are passing isn't making sense to the image decoder.
Double-check that you are slicing correctly, and make sure that the string you are passing in is at least of length 4.
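For example, a quick way to double-check the slicing is to print the length and the leading bytes of each slice (PNG data should start with '\x89PNG', JPEG with '\xff\xd8', GIF with 'GIF8'):

source = '/home/cah/Downloads/evil2.gfx'
dataFile = open(source, 'rb').read()
for offset in range(5):
    chunk = dataFile[offset::5]
    print "slice %d: %d bytes, starts with %r" % (offset, len(chunk), chunk[:8])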
I'm having a problem processing a largish file in Python. All I'm doing is:
import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
    counter = counter + 1
    if (counter % 1000000 == 0):
        print counter
f.close()
This takes around 10m25s just to open the file, read the lines and increment this counter.
In Perl, dealing with the same file and doing quite a bit more (some regular expression stuff), the whole process takes around 1m17s.
Perl Code:
open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
$_ =~ s/some regex here/$1,$2,$3,$4/;
push #an_array, $_
}
}
close LOG;
Can anyone advise what I can do to make the Python solution run at a similar speed to the Perl solution?
EDIT
I've tried just uncompressing the file and dealing with it using open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.
I also removed the modulo and print statements and replaced them with pass, so all that is being done now is iterating through the file.
In Python (at least <= 2.6.x), gzip format parsing is implemented in Python (on top of zlib). Moreover, it appears to be doing some strange things, namely decompressing to the end of the file in memory and then discarding everything beyond the requested read size (then doing it again for the next read). DISCLAIMER: I've only looked at gzip.read() for 3 minutes, so I could be wrong here. Regardless of whether my understanding of gzip.read() is correct or not, the gzip module appears not to be optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (e.g. see the subprocess module).
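A rough sketch of that external-process approach, streaming zcat's output line by line (this assumes a zcat binary on the PATH, just like the Perl version):

from subprocess import Popen, PIPE

counter = 0
proc = Popen(['zcat', pathToLog], stdout=PIPE)
for line in proc.stdout:  # lines are consumed as zcat produces them
    counter += 1
    if counter % 1000000 == 0:
        print counter
proc.stdout.close()
proc.wait()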
EDIT
Actually, I missed the OP's remark about plain file I/O being just as slow as compressed (thanks to ire_and_curses for pointing it out). This struck me as unlikely, so I did some measurements...
import gzip
from timeit import Timer

def w(n):
    L = "*" * 80 + "\n"
    with open("ttt", "w") as f:
        for i in xrange(n):
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n, line in enumerate(f):
            if n % 1000000 == 0:
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n, line in enumerate(f):
        if n % 1000000 == 0:
            print n
Now, running it...
>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
...and after having a tea break and discovering that it's still running, I've killed it, sorry. Then I tried 100'000 lines instead of 10'000'000:
>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895
The gzip module's time is O(file_size**2), so with a number of lines on the order of millions, gzip read time just cannot be the same as plain read time (as the experiment confirms). Anonymouslemming, please check again.
If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl, which is wicked fast. Your (first) function takes about 4.19s for me with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files. If I uncompress the log files (Apache logs) and run the two functions on them with a simple Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).
#!/usr/bin/python
import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)
    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
I spent a while on this. Hopefully this code will do the trick. It uses zlib and no external calls.
The gunzipchunks method reads the compressed gzip file in chunks which can be iterated over (generator).
The gunziplines method reads these uncompressed chunks and provides you with one line at a time which can also be iterated over (another generator).
Finally, the gunziplinescounter method gives you what you're looking for.
Cheers!
import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name, chunk_size=4096):
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    f = open(file_name, 'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name, leftovers="", line_ending='\n'):
    for chunk in gunzipchunks(file_name):
        chunk = "".join([leftovers, chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending, 1)
            yield line
            chunk = leftovers
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter, line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)
This should run a whole lot faster than using the builtin gzip module on extremely large files.
It took your computer 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:
def write():
    fout = open('log.txt', 'w')
    for i in range(5000000):
        fout.write(str(i/3.0) + "\n")
    fout.close()
Then I read it with a program much like yours:
def read():
    fin = open('log.txt', 'r')
    counter = 0
    for line in fin:
        counter += 1
        if counter % 1000000 == 0:
            print counter
    fin.close()
It took my computer about 3 seconds to read all 5 million lines.
Try using StringIO to buffer the output from the gzip module. The following code to read a gzipped pickle cut the execution time of my code by well over 90%.
Instead of...
import gzip
import cPickle

# Use gzip to open/read the pickle.
lPklFile = gzip.open("test.pkl", 'rb')
lData = cPickle.load(lPklFile)
lPklFile.close()
Use...
import cStringIO, cPickle
import gzip
import os

# Use gzip to open the pickle.
lPklFile = gzip.open("test.pkl", 'rb')

# Copy the pickle into a cStringIO.
lInternalFile = cStringIO.StringIO()
lInternalFile.write(lPklFile.read())
lPklFile.close()

# Set the seek position to the start of the StringIO, and read the
# pickled data from it.
lInternalFile.seek(0, os.SEEK_SET)
lData = cPickle.load(lInternalFile)
lInternalFile.close()