Ignore newline character in binary file with Python? - python

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data byte by byte (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n and it is used as the first byte of the following int I read. I don't want to simply use .splitlines()because some data could have an n inside and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the new line/end of line characters; why is not the case with me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)

(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.

You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
... d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
... diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")

Related

Read binary file and check with matching character in python

I would like to scan through data files from GPS receiver byte-wise (actually it will be a continuous flow, not want to test the code with offline data). If find a match, then check the next 2 bytes for the 'length' and get the next 2 bytes and shift 2 bits(not byte) to the right, etc. I didn't handle binary before, so stuck in a simple task. I could read the binary file byte-by-byte, but can not find a way to match by desired pattern (i.e. D3).
with open("COM6_200417.ubx", "rb") as f:
byte = f.read(1) # read 1-byte at a time
while byte != b"":
# Do stuff with byte.
byte = f.read(1)
print(byte)
The output file is:
b'\x82'
b'\xc2'
b'\xe3'
b'\xb8'
b'\xe0'
b'\x00'
b'#'
b'\x13'
b'\x05'
b'!'
b'\xd3'
b'\x00'
b'\x13'
....
how to check if that byte is == '\xd3'? (D3)
also would like to know how to shift bit-wise, as I need to check decimal value consisting of 6 bits
(1-byte and next byte's first 2-bits). Considering, taking 2-bytes(8-bits) and then 2-bit right-shift
to get 6-bits. Is it possible in python? Any improvement/addition/changes are very much appreciated.
ps. can I get rid of that pesky 'b' from the front? but if ignoring it does not affect then no problem though.
Thanks in advance.
'That byte' is represented with a b'' in front, indicating that it is a byte object. To get rid of it, you can convert it to an int:
thatbyte = b'\xd3'
byteint = thatbyte[0] # or
int.from_bytes(thatbyte, 'big') # 'big' or 'little' endian, which results in the same when converting a single byte
To compare, you can do:
thatbyte == b'\xd3'
Thus compare a byte object with another byte object.
The shift << operator works on int only
To convert an int back to bytes (assuming it is [0..255]) you can use:
bytes([byteint]) # note the extra brackets!
And as for improvements, I would suggest to read the whole binary file at once:
with open("COM6_200417.ubx", "rb") as f:
allbytes = f.read() # read all
for val in allbytes:
# Do stuff with val, val is int !!!
print(bytes([val]))

I'm reading into a 256 byte string. I want to skip it, if it's all binary zeros (\x00) Is there a single test?

Totally new to python. Trying to parse a file but not all records contain data. I want to skip the records that are all hex 00.
if record == ('\x00' * 256): from a sample of print("-"*80))
gave a Syntax error, hey I said I was new. :)
Thanks for the reply, I'm using 2.7 and reading like this....
with open(testfile, "rb") as f:
counter = 0
while True:
record = f.read(256)
counter += 1
Your example looks to be very close. I'm not sure about Python 2, but in Python 3 you should specify that a string is binary.
I would do something like:
empty = b'\x00' * 256
if record == empty:
print('skipped this line')
Remember that Python 2 uses print statements, so you should do print 'skipped this line' instead.

python read file utf-8 decode issue

I am running into an issue with reading a file that has UTF8 and ASCII character. The problem is I am using seek to only read some part of the data, but I have no idea if I am "read" in the "middle" of an UTF8.
osx
python 3.6.6
to simply it, my issue can demoed with following code.
# write some utf-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine. to just demo I can read the file as whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read by seek 3 by 3
data.seek(3)
data.read(1) # this works fine.
I know I can open the file in binary then read it without issue by seeking to any position, however, I need to process the string, so I will end up with same issue when decode into string.
data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.seek(3)
z.decode() # will hit same error
without using seek, I can read it correctly even just calling read(1).
data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even calling read(1)
one thing I can think is after seek to a location, try to read, on UnicodeDecodeError, position = position -1, seek(position), until I can read it correctly.
Is there a better (right) way to handle it?
As the documentation explains, when you seek on text files:
offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
In practice, what seek(1) actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:
>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 3: invalid start byte
So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:
>>> b[3:].decode()
'宠蜇\n'
If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:
def readchar(f, pos):
for i in range(pos:pos+5):
try:
f.seek(i)
return f.read(1)
except UnicodeDecodeError:
pass
raise UnicodeDecodeError('Unable to find a UTF-8 start byte')
Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:
def readchar(f, pos):
f.seek(pos)
for _ in range(5):
byte = f.read(1)
if byte in range(0, 0x80) or byte in range(0xC0, 0x100):
return byte
raise UnicodeDecodeError('Unable to find a UTF-8 start byte')
However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.
In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII—that is, '\n' encodes to b'\n'. (If you have Windows-style endings, the same is true for return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.
So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.
The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a character at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 characters at a time, but this is hopefully simpler to understand:
def getline(f, pos, maxpos):
for start in range(pos-1, -1, -1):
f.seek(start)
if f.read(1) == b'\n':
break
else:
f.seek(0)
return f.readline().decode()
Here it is in action:
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇

Test if byte is empty in python

I'd like to test the content of a variable containing a byte in a way like this:
line = []
while True:
for c in self.ser.read(): # read() from pySerial
line.append(c)
if c == binascii.unhexlify('0A').decode('utf8'):
print("Line: " + line)
line = []
break
But this does not work...
I'd like also to test, if a byte is empty:
In this case
print(self.ser.read())
prints: b'' (with two single quotes)
I do not until now succeed to test this
if self.ser.read() == b''
or what ever always shows a syntax error...
I know, very basic, but I don't get it...
Thank you for your help. The first part of the question was answerd by #sisanared:
if self.ser.read():
does the test for an empty byte
The second part of the question (the end-of-line with the hex-value 0A) stil doesn't work, but I think it is whise to close this question since the answer to the title is given.
Thank you all
If you want to verify the contents of your variable or string which you want to read from pySerial, use the repr() function, something like:
import serial
import repr as reprlib
from binascii import unhexlify
self.ser = serial.Serial(self.port_name, self.baudrate, self.bytesize, self.parity, self.stopbits, self.timeout, self.xonxoff, self.rtscts)
line = []
while 1:
for c in self.ser.read(): # read() from pySerial
line.append(c)
if if c == b'\x0A':
print("Line: " + line)
print repr(unhexlify(''.join('0A'.split())).decode('utf8'))
line = []
break

Zeroes appearing when reading file (where aren't any)

When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
values of some columns get a zero as their first characeter (there are no zeroues in input), other get a few zeroes, which are not seen when viewing file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print data[51220][8]
>>> '0378983561' #should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print data[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(data[51220][8])
>>> 478983561
If you want this as a string, you should convert it back to a string.
print repr(int(data[51220][8]))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you as in:
print int(data[51220][8])

Categories

Resources