Python: iterate through binary file without lines

I've got some data in a binary file that I need to parse. The data is separated into chunks of 22 bytes, so I'm trying to generate a list of tuples, each tuple containing 22 values. The file isn't separated into lines though, so I'm having problems figuring out how to iterate through the file and grab the data.
If I do this it works just fine:
nextList = f.read(22)
newList = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList)
where newList contains a tuple of 22 values. However, if I try to apply similar logic to a function that iterates through, it breaks down.
def getAllData():
    listOfAll = []
    nextList = f.read(22)
    while nextList != "":
        listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
        nextList = f.read(22)
    return listOfAll
data = getAllData()
gives me this error:
Traceback (most recent call last):
File "<pyshell#27>", line 1, in <module>
data = getAllData()
File "<pyshell#26>", line 5, in getAllData
listOfAll.append(struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", nextList))
struct.error: unpack requires a bytes object of length 22
I'm fairly new to python so I'm not too sure where I'm going wrong here. I know for sure that the data in the file breaks down evenly into sections of 22 bytes, so it's not a problem there.

Since you reported that it was still running when len(nextList) == 0, this is probably because nextList (which isn't a list, despite the name) is an empty bytes object, which isn't equal to an empty string object:
>>> b"" == ""
False
and so the condition in your line
while nextList != "":
always stays true, even when nextList is empty, so unpack is eventually handed an empty bytes object and raises the error. That's why using len(nextList) != 22 as a break condition worked, and even
while nextList:
should suffice.
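A corrected version of the function, as a minimal sketch assuming f is already open in binary mode (note that the format string "22B" is just shorthand for twenty-two "B"s):

import struct

def getAllData():
    listOfAll = []
    nextList = f.read(22)
    while nextList:    # an empty bytes object is falsy, so this stops at EOF
        listOfAll.append(struct.unpack("22B", nextList))
        nextList = f.read(22)
    return listOfAll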

read(22) isn't guaranteed to return a string of length 22. Its contract is to return a string of length anywhere between 0 and 22 (inclusive); a string of length zero indicates there is no more data to be read. In Python 3, file objects opened in binary mode produce bytes objects instead of str, and str and bytes are never considered equal.
If your file is small-ish then you'd be better off reading the entire file into memory and then splitting it into chunks, e.g.
listOfAll = []
data = f.read()
for i in range(0, len(data), 22):
    t = struct.unpack("BBBBBBBBBBBBBBBBBBBBBB", data[i:i+22])
    listOfAll.append(t)
Otherwise you will need to do something more complicated that checks the amount of data you get back from each read.
def dataiter(f, chunksize=22, buffersize=4096):
    data = b''
    while True:
        newdata = f.read(buffersize)
        if not newdata:    # end of file
            if not data:
                return
            else:
                yield data
                # or raise an error, as 0 < len(data) < chunksize
                # or pad with zeros to chunksize
                return
        data += newdata
        i = 0
        while len(data) - i >= chunksize:
            yield data[i:i+chunksize]
            i += chunksize
        data = data[i:]    # keep remainder of unused data (b'' when everything was used)
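For example, the generator could be consumed like this (a minimal sketch; 'data.bin' is a hypothetical filename, and "22B" is shorthand for twenty-two "B"s):

import struct

with open('data.bin', 'rb') as f:
    for chunk in dataiter(f):
        values = struct.unpack("22B", chunk)    # tuple of 22 byte values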

Related

Keep Getting ValueError: not enough values to unpack (expected 2, got 1) for a text file for sentiment analysis?

I am trying to turn this text file into a dictionary using the code below:
with open("/content/corpus.txt", "r") as my_corpus:
wordpoints_dict = {}
for line in my_corpus:
key, value = line.split('')
wordpoints_dict[key] = value
print(wordpoints_dict)
It keeps returning:
ValueError Traceback (most recent call last)
<ipython-input-18-8cf5e5efd882> in <module>()
2 wordpoints_dict = {}
3 for line in my_corpus:
----> 4 key, value = line.split('-')
5 wordpoints_dict[key] = value
6 print(wordpoints_dict)
ValueError: not enough values to unpack (expected 2, got 1)
The data in the text file looks like this:
[image: sample lines of the text data]
You are trying to split a text value at '-' and unpack it into two values (key before the dash, value after the dash). However, some lines in your txt file do not contain a dash, so there are not two values to unpack. Try checking for blank lines, as this could be the cause of the issue.
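A minimal sketch of that check, assuming the intended delimiter really is '-' (splitting with maxsplit=1 also sidesteps lines that contain more than one dash):

with open("/content/corpus.txt", "r") as my_corpus:
    wordpoints_dict = {}
    for line in my_corpus:
        line = line.strip()
        if '-' not in line:               # skip blank lines and lines without a dash
            continue
        key, value = line.split('-', 1)   # split on the first dash only
        wordpoints_dict[key] = value
    print(wordpoints_dict)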
Your code doesn't match the error message. I'm going to assume that the error message is the correct one...
Just add a little logic to handle the case where there isn't a - on a line. I wouldn't be surprised if you fixed that problem and then hit the other side of that problem, where the line has more than one -. If that occurs in your file, you'll have to deal with that case as well, as you'll get a "too many values to unpack" error then. Here's your code with the added boilerplate for doing both of these things:
with open("/content/corpus.txt", "r") as my_corpus:
wordpoints_dict = {}
for line in my_corpus:
parts = line.split('-')
if len(parts) == 1:
parts = (line, '') # If no '-', use an empty second value
elif len(parts) > 2:
parts = parts[:2] # If too many items from split, use the first two
key, value = [x.strip() for x in parts] # strip leading and trailing spaces
wordpoints_dict[key] = value
print(wordpoints_dict)

Handle unwanted line breaks with read_csv in Pandas

I have a problem with data that is exported from SAP. Sometimes there is a line break in the posting text: what should be one line ends up as two, and this results in a pretty bad data frame.
The most annoying thing is that I am unable to make pandas aware of this problem; it just reads those broken lines even though their column count is smaller than the header's.
An example of a wrong data.txt:
MANDT~BUKRS~BELNR~GJAHR
030~01~0100650326
~2016
030~01~0100758751~2017
You can see, that the first line has a wrong line break after 0100650326. The 2016 belongs to the first row. The third line is as it should be.
If I import this file:
data = pd.read_csv(
    path_to_file,
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
I get this, which is pretty wrong:
MANDT BUKRS BELNR GJAHR
0 30.0 1 100650326.0 NaN
1 NaN 2016 NaN NaN
2 30.0 1 100758751.0 2017.0
Is it possible to fix the wrong line break or to tell pandas to ignore lines where column count is smaller than header?
Just to make it complete. I want to get this:
MANDT BUKRS BELNR GJAHR
0 30 1 100650326 2016
1 30 1 100758751 2017
I tried to use with open and replace '\n' (the line break) with '' (nothing), but this results in a one-line file. This is not intended.
You can do some pre-processing to get rid of the unwanted breaks. The example below is one I tested.
import fileinput

# First pass: turn every newline into a '^' marker (assumes '^' never occurs in the data)
with fileinput.FileInput('input.csv', inplace=True, backup='.orig.bak') as file:
    for line in file:
        print(line.replace('\n', '^'), end='')
# Second pass: a '^' directly before a delimiter marks an unwanted break, so drop it
with fileinput.FileInput('input.csv', inplace=True, backup='.1.bak') as file:
    for line in file:
        print(line.replace('^~', '~'), end='')
# Third pass: restore the remaining '^' markers to real newlines
with fileinput.FileInput('input.csv', inplace=True, backup='.2.bak') as file:
    for line in file:
        print(line.replace('^', '\n'), end='')
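After the three passes, the example input.csv from above ends up as:

MANDT~BUKRS~BELNR~GJAHR
030~01~0100650326~2016
030~01~0100758751~2017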
The correct way would be to fix the file at creation time. If that is not possible, you can pre-process the file or use a wrapper.
Here is a solution using a byte-level wrapper that combines lines until you have the correct number of delimiters. I use a byte-level wrapper so I can reuse the classes of the io module and add as little code of my own as possible: a RawIOBase reads lines from an underlying byte file object and combines lines until they have the expected number of delimiters (only readinto and readable are overridden).
import io

class csv_wrapper(io.RawIOBase):
    def __init__(self, base, delim):
        self.fd = base              # underlying (byte) file object
        self.nfields = None
        self.delim = ord(delim)     # code of the delimiter (passed as a character)
        self.numl = 0               # line number, for error reporting
        self._getline()             # load and process the header line
    def _nfields(self):
        # number of delimiters in the current line
        return len([c for c in self.line if c == self.delim])
    def _getline(self):
        while True:
            # load a new line into the internal buffer
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:    # store the number of delims if not yet known
                self.nfields = self._nfields()
            else:
                while self.nfields > self._nfields():    # optionally combine lines
                    self.line = self.line.rstrip() + next(self.fd)
                    self.numl += 1
                if self.nfields != self._nfields():      # too many here...
                    print("Too many fields on line {}".format(self.numl))
                    continue            # ignore the offending line and proceed
            self.index = 0              # reset line pointers
            self.linesize = len(self.line)
            break
    def readinto(self, b):
        if len(b) == 0:
            return 0
        if self.index == self.linesize:    # if the current buffer is exhausted
            try:                           # read a new one
                self._getline()
            except StopIteration:
                return 0
        n = 0                              # count of bytes copied into b
        while n < len(b) and self.index < self.linesize:
            b[n] = self.line[self.index]
            n += 1
            self.index += 1
        return n
    def readable(self):
        return True
You can then change your code to:
data = pd.read_csv(
    csv_wrapper(open(path_to_file, 'rb'), '~'),
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
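Note that error_bad_lines and warn_bad_lines have since been deprecated in newer pandas releases in favour of the single on_bad_lines parameter, so those two arguments may need adjusting depending on your pandas version.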

Converting string to int in serial connection

I am trying to read a line from serial connection and convert it to int:
print arduino.readline()
length = int(arduino.readline())
but getting this error:
ValueError: invalid literal for int() with base 10: ''
I looked up this error, and it means that it is not possible to convert an empty string to int. But the thing is, my readline is not empty, because it prints it out.
The print statement consumes the line it prints, and the next readline() call reads the next line. You should probably do:
num = arduino.readline()
length = int(num)
Since you mentioned that the Arduino is returning C style strings, you should strip the NULL character.
num = arduino.readline()
length = int(num.strip('\0'))
Every call to readline() reads a new line, so your first statement has already consumed a line; by the time you call readline() again, that data is gone.
Try this:
s = arduino.readline()
if len(s) != 0:
    print s
    length = int(s)
When you say
print arduino.readline()
you have already read the currently available line. So, the next readline might not be getting any data. You might want to store this in a variable like this
data = arduino.readline()
print data
length = int(data)
As the data seems to have a null character (\0) in it, you might want to strip that, like this:
data = arduino.readline().rstrip('\0')
The problem is that when the Arduino starts to send serial data, it begins by sending empty strings, so pyserial picks up an empty string '' which cannot be converted to an integer. You can add a delay before serial.readline(), like this:
while True:
    time.sleep(1.5)
    pos = arduino.readline().rstrip().decode()
    print(pos)
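Alternatively, a sketch that simply skips empty reads until real data arrives (assuming arduino is an already-configured pyserial object):

while True:
    raw = arduino.readline().strip()    # strip the trailing newline; raw is bytes
    if not raw:                         # skip the empty reads seen at startup
        continue
    length = int(raw)                   # int() accepts bytes of ASCII digits
    break
print(length)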

Why am I getting an IndexError: string index out of range?

I am running the following code on ubuntu 11.10, python 2.7.2+.
import urllib
import Image
import StringIO
source = '/home/cah/Downloads/evil2.gfx'
dataFile = open(source, 'rb').read()
slicedFile1 = StringIO.StringIO(dataFile[::5])
slicedFile2 = StringIO.StringIO(dataFile[1::5])
slicedFile3 = StringIO.StringIO(dataFile[2::5])
slicedFile4 = StringIO.StringIO(dataFile[3::5])
jpgimage1 = Image.open(slicedFile1)
jpgimage1.save('/home/cah/Documents/pychallenge12.1.jpg')
pngimage1 = Image.open(slicedFile2)
pngimage1.save('/home/cah/Documents/pychallenge12.2.png')
gifimage1 = Image.open(slicedFile3)
gifimage1.save('/home/cah/Documents/pychallenge12.3.gif')
pngimage2 = Image.open(slicedFile4)
pngimage2.save('/home/cah/Documents/pychallenge12.4.png')
In essence I'm taking a .gfx file that has the bytes of several image files jumbled together
like 123451234512345..., clumping them back together, and then saving. The problem is I'm getting the following error:
File "/usr/lib/python2.7/dist-packages/PIL/PngImagePlugin.py", line 96, in read
len = i32(s)
File "/usr/lib/python2.7/dist-packages/PIL/PngImagePlugin.py", line 44, in i32
return ord(c[3]) + (ord(c[2])<<8) + (ord(c[1])<<16) + (ord(c[0])<<24)
IndexError: string index out of range
I found PngImagePlugin.py and looked at what it had:
def i32(c):
    return ord(c[3]) + (ord(c[2])<<8) + (ord(c[1])<<16) + (ord(c[0])<<24)    # (line 44)

"Fetch a new chunk. Returns header information."
if self.queue:
    cid, pos, len = self.queue[-1]
    del self.queue[-1]
    self.fp.seek(pos)
else:
    s = self.fp.read(8)
    cid = s[4:]
    pos = self.fp.tell()
    len = i32(s)    # (lines 88-96)
I would try tinkering, but I'm afraid I'll screw up PNG and PIL, which have been irksome to get working.
thanks
It would appear that len(s) < 4 at this stage
len = i32(s)
Which means that
s = self.fp.read(8)
isn't returning at least the 4 bytes that i32 needs.
Probably the data in the fp you are passing isn't making sense to the image decoder.
Double-check that you are slicing correctly.
Make sure that the string you are passing in is of at least length 4.
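One way to double-check both points is to print the length and the first few bytes of each slice; each format's magic number ('\xff\xd8' for JPEG, '\x89PNG' for PNG, 'GIF8' for GIF) should show up at the start of the right slice. A sketch using the path from the question:

dataFile = open('/home/cah/Downloads/evil2.gfx', 'rb').read()
for offset in range(4):
    chunk = dataFile[offset::5]
    print("offset %d: %d bytes, header %r" % (offset, len(chunk), chunk[:8]))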

Reading binary file in python

I wrote a python script to create a binary file of integers.
import struct
pos = [7623, 3015, 3231, 3829]
inh = open('test.bin', 'wb')
for e in pos:
    inh.write(struct.pack('i', e))
inh.close()
It worked well, then I tried to read the 'test.bin' file using the below code.
import struct
inh = open('test.bin', 'rb')
for rec in inh:
    pos = struct.unpack('i', rec)
    print pos
inh.close()
But it failed with an error message:
Traceback (most recent call last):
File "readbinary.py", line 10, in <module>
pos = struct.unpack('i', rec)
File "/usr/lib/python2.5/struct.py", line 87, in unpack
return o.unpack(s)
struct.error: unpack requires a string argument of length 4
I would like to know how I can read this file using struct.unpack.
Many thanks in advance,
Vipin
for rec in inh: reads one line at a time -- not what you want for a binary file. Read 4 bytes at a time (with a while loop and inh.read(4)) instead (or read everything into memory with a single .read() call, then unpack successive 4-byte slices). The second approach is simplest and most practical as long as the amount of data involved isn't huge:
import struct
with open('test.bin', 'rb') as inh:
    data = inh.read()
for i in range(0, len(data), 4):
    pos = struct.unpack('i', data[i:i+4])
    print(pos)
If you do fear potentially huge amounts of data (which would take more memory than you have available), a simple generator offers an elegant alternative:
import struct
def by4(f):
    rec = 'x'    # placeholder for the `while`
    while rec:
        rec = f.read(4)
        if rec: yield rec

with open('test.bin', 'rb') as inh:
    for rec in by4(inh):
        pos = struct.unpack('i', rec)
        print(pos)
A key advantage to this second approach is that the by4 generator can easily be tweaked (while maintaining the specs: return a binary file's data 4 bytes at a time) to use a different implementation strategy for buffering, all the way to the first approach (read everything then parcel it out) which can be seen as "infinite buffering" and coded:
def by4(f):
    data = f.read()
    for i in range(0, len(data), 4):
        yield data[i:i+4]
while leaving the "application logic" (what to do with that stream of 4-byte chunks) intact and independent of the I/O layer (which gets encapsulated within the generator).
I think "for rec in inh" is supposed to read 'lines', not bytes. What you want is:
while True:
    rec = inh.read(4)    # or inh.read(struct.calcsize('i'))
    if len(rec) != 4:
        break
    (pos,) = struct.unpack('i', rec)
    print pos
Or as others have mentioned:
while True:
    try:
        (pos,) = struct.unpack('i', inh.read(4))    # unpack_from needs a buffer, not a file, so read first
    except struct.error:                            # raised once fewer than 4 bytes remain
        break
Check the size of the packed integers:
>>> pos
[7623, 3015, 3231, 3829]
>>> [struct.pack('i',e) for e in pos]
['\xc7\x1d\x00\x00', '\xc7\x0b\x00\x00', '\x9f\x0c\x00\x00', '\xf5\x0e\x00\x00']
We see 4-byte strings, which means that reading should be done 4 bytes at a time:
>>> inh=open('test.bin','rb')
>>> b1=inh.read(4)
>>> b1
'\xc7\x1d\x00\x00'
>>> struct.unpack('i',b1)
(7623,)
>>>
This is the original int! Extending this into a reading loop is left as an exercise.
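That exercise might look like the following sketch:

import struct

inh = open('test.bin', 'rb')
while True:
    b = inh.read(4)    # one packed 'i' is 4 bytes here
    if len(b) < 4:     # empty (EOF) or truncated record
        break
    print(struct.unpack('i', b)[0])
inh.close()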
You can probably use array as well if you want:
import array
pos = array.array('i', [7623, 3015, 3231, 3829])
inh = open('test.bin', 'wb')
pos.tofile(inh)    # tofile() is the current name; the older write() alias is deprecated
inh.close()
Then use array.array.fromfile or fromstring to read it back.
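Reading it back with fromfile, as a minimal sketch:

import array

pos = array.array('i')
with open('test.bin', 'rb') as inh:
    pos.fromfile(inh, 4)    # the 4 integers written above
print(pos.tolist())         # [7623, 3015, 3231, 3829]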
This function reads all the bytes from a file:
import os
import array

def read_binary_file(filename):
    try:
        f = open(filename, 'rb')
        n = os.path.getsize(filename)
        data = array.array('B')
        data.fromfile(f, n)    # fromfile() is the current name for the deprecated read()
        f.close()
        fsize = len(data)
        return (fsize, data)
    except IOError:
        return (-1, [])

# somewhere in your code
t = read_binary_file(FILENAME)
fsize = t[0]
if fsize > 0:
    data = t[1]
    # work with data
else:
    print('Error reading file')
Your iterator isn't reading 4 bytes at a time so I imagine it's rather confused. Like SilentGhost mentioned, it'd probably be best to use unpack_from().
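For reference, struct.unpack_from works on an in-memory buffer plus an offset rather than on a file object, so it pairs naturally with a whole-file read. A sketch:

import struct

with open('test.bin', 'rb') as inh:
    buf = inh.read()
for offset in range(0, len(buf), 4):
    (pos,) = struct.unpack_from('i', buf, offset)
    print(pos)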
