python read file utf-8 decode issue - python

I am running into an issue with reading a file that has UTF8 and ASCII character. The problem is I am using seek to only read some part of the data, but I have no idea if I am "read" in the "middle" of an UTF8.
osx
python 3.6.6
to simply it, my issue can demoed with following code.
# write some utf-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine. to just demo I can read the file as whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read by seek 3 by 3
data.seek(3)
data.read(1) # this works fine.
I know I can open the file in binary then read it without issue by seeking to any position, however, I need to process the string, so I will end up with same issue when decode into string.
data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.seek(3)
z.decode() # will hit same error
without using seek, I can read it correctly even just calling read(1).
data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even calling read(1)
one thing I can think is after seek to a location, try to read, on UnicodeDecodeError, position = position -1, seek(position), until I can read it correctly.
Is there a better (right) way to handle it?

As the documentation explains, when you seek on text files:
offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
In practice, what seek(1) actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:
>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 3: invalid start byte
So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:
>>> b[3:].decode()
'宠蜇\n'
If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:
def readchar(f, pos):
for i in range(pos:pos+5):
try:
f.seek(i)
return f.read(1)
except UnicodeDecodeError:
pass
raise UnicodeDecodeError('Unable to find a UTF-8 start byte')
Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:
def readchar(f, pos):
f.seek(pos)
for _ in range(5):
byte = f.read(1)
if byte in range(0, 0x80) or byte in range(0xC0, 0x100):
return byte
raise UnicodeDecodeError('Unable to find a UTF-8 start byte')
However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.
In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII—that is, '\n' encodes to b'\n'. (If you have Windows-style endings, the same is true for return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.
So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.
The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a character at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 characters at a time, but this is hopefully simpler to understand:
def getline(f, pos, maxpos):
for start in range(pos-1, -1, -1):
f.seek(start)
if f.read(1) == b'\n':
break
else:
f.seek(0)
return f.readline().decode()
Here it is in action:
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇

Related

Read binary file and check with matching character in python

I would like to scan through data files from GPS receiver byte-wise (actually it will be a continuous flow, not want to test the code with offline data). If find a match, then check the next 2 bytes for the 'length' and get the next 2 bytes and shift 2 bits(not byte) to the right, etc. I didn't handle binary before, so stuck in a simple task. I could read the binary file byte-by-byte, but can not find a way to match by desired pattern (i.e. D3).
with open("COM6_200417.ubx", "rb") as f:
byte = f.read(1) # read 1-byte at a time
while byte != b"":
# Do stuff with byte.
byte = f.read(1)
print(byte)
The output file is:
b'\x82'
b'\xc2'
b'\xe3'
b'\xb8'
b'\xe0'
b'\x00'
b'#'
b'\x13'
b'\x05'
b'!'
b'\xd3'
b'\x00'
b'\x13'
....
how to check if that byte is == '\xd3'? (D3)
also would like to know how to shift bit-wise, as I need to check decimal value consisting of 6 bits
(1-byte and next byte's first 2-bits). Considering, taking 2-bytes(8-bits) and then 2-bit right-shift
to get 6-bits. Is it possible in python? Any improvement/addition/changes are very much appreciated.
ps. can I get rid of that pesky 'b' from the front? but if ignoring it does not affect then no problem though.
Thanks in advance.
'That byte' is represented with a b'' in front, indicating that it is a byte object. To get rid of it, you can convert it to an int:
thatbyte = b'\xd3'
byteint = thatbyte[0] # or
int.from_bytes(thatbyte, 'big') # 'big' or 'little' endian, which results in the same when converting a single byte
To compare, you can do:
thatbyte == b'\xd3'
Thus compare a byte object with another byte object.
The shift << operator works on int only
To convert an int back to bytes (assuming it is [0..255]) you can use:
bytes([byteint]) # note the extra brackets!
And as for improvements, I would suggest to read the whole binary file at once:
with open("COM6_200417.ubx", "rb") as f:
allbytes = f.read() # read all
for val in allbytes:
# Do stuff with val, val is int !!!
print(bytes([val]))

Encoding a file with ord function

I'm trying to encode a file and output the encode into a new file, but I got this error:
TypeError: ord() expected string of length 1, but int found
My code:
from sys import argv, exit
def encode(data):
encoded = ''
while data:
current = data[0]
count = 1
for i in data[1:]:
if i == current:
count += 1
else:
break
if count == 255:
break
encoded += '{}{}'.format(chr(ord(current) & 255), chr(count & 255)) #error occurs here.
data = data[count:]
return encoded
if __name__ == '__main__':
if len(argv) < 2:
print('Please specify input file!')
exit(0)
with open(argv[1], 'rb') as (f):
data = f.read()
with open(argv[1] + '.out', 'wb') as (f):
f.write(encode(data))
Additional question: How do I decode the encoded file?
You are reading bytes (open(..., 'rb')), so when you take one element of the byte string, you get a byte, ie. a number. This number already is the character code, so just leave out the ord. Alternatively, you could open the file without the b modifier (open(..., 'r')), which will return a string; I would advise to keep it as a byte string though (or you could run into encoding issues if you are parsing something non-ascii).
You will run into a similar problem saving your file: you cannot write a string into a file opened with the b modifier. Since you have characters outside the ascii range (>128), writing as a string is not a good idea, since python will try to encode your characters (eg. in UTF-8), and you will end up with completely different bytes. Therefore, the best solution probably is not to concat your data to a string in your loop (the part where you do '{}{}'.format(...), but to have a list (encoded = [], concat with encoded.append(current)) and convert that to a byte string using bytes(encoded) after your loop. You can then pass that to write without a problem.
As for how to decode your file, you can just open the file like you do for encoding, read two bytes b1 and b2, and append [b1]*b2 to your output (again, as a list), and convert that to a byte string with bytes().

What happens if you seek() to the middle of a multi-byte UTF-8 character and call read(1)?

In Python 3, read(size) has the following documentation:
Read and return at most size characters from the stream as a single str. If size is negative or None, reads until EOF.
But suppose that you seek() to the middle of a multi-byte UTF-8 character. What will read(1) return?
The partial unicode character can't be decoded so python will raise a UnicodeDecodeError. But you can recover from the problem. The UTF-8 encoding is built to be self-healing, meaning that the first byte of the character sequence (0x00-0x7f or 0xc0-0xfd) will not appear in any other byte, so you just need to keep seeking backwards by 1 byte until the decode works.
>>> def read_unicode(fp, position, count):
... while position >= 0:
... fp.seek(position)
... try:
... return fp.read(count)
... except UnicodeDecodeError:
... position -= 1
... raise UnicodeDecodeError("File not decodable")
...
>>> open('test.txt', 'w', encoding='utf-8').write("学"*10000)
10000
>>> f=open('test.txt', 'r', encoding='utf-8')
>>> f.seek(32)
32
>>> f.read(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 0: invalid start byte
>>> read_unicode(f, 32, 1)
'学'
Text streams in Python 3 don't support arbitrary seek offsets, you're only supposed to use offsets of 0, or values returned by tell with whence of SEEK_SET. Everything else is undefined or unsupported behavior. See the docs for TextIOBase.seek.
Sure, in practice, you might get UnicodeDecodeError, but that is not a guarantee. As soon as you violate the API contractual requirements, it can do whatever it wants.

Ignore newline character in binary file with Python?

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data byte by byte (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n and it is used as the first byte of the following int I read. I don't want to simply use .splitlines()because some data could have an n inside and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the new line/end of line characters; why is not the case with me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
... d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
... diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")

Reading UTF-8 strings from a binary file

I have some files which contains a bunch of different kinds of binary data and I'm writing a module to deal with these files.
Amongst other, it contains UTF-8 encoded strings in the following format: 2 bytes big endian stringLength (which I parse using struct.unpack()) and then the string. Since it's UTF-8, the length in bytes of the string may be greater than stringLength and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
How do I read n UTF-8 characters (distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or makes assumptions that I cannot make.
Given a file object, and a number of characters, you can use:
# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
_lead_byte_to_count.append(
1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)
def readUTF8(f, count):
"""Read `count` UTF-8 bytes from file `f`, return as unicode"""
# Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
res = []
while count:
count -= 1
lead = f.read(1)
res.append(lead)
readcount = _lead_byte_to_count[ord(lead)]
if readcount:
res.append(f.read(readcount))
return (''.join(res)).decode('utf8')
Result of a test:
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'
In Python 3, it is of course much, much easier to just wrap the file object in a io.TextIOWrapper() object and leave decoding to the native and efficient Python UTF-8 implementation.
One character in UTF-8 can be 1byte,2bytes,3byte3.
If you have to read your file byte by byte, you have to follow the UTF-8 encoding rules. http://en.wikipedia.org/wiki/UTF-8
Most the time, you can just set the encoding to utf-8, and read the input stream.
You do not need to care how much bytes you have read.

Categories

Resources