Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError? - python

I have a text file which the publisher (the US Securities Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:
def tags(filename):
"""Yield Tag instances from tag.txt."""
with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
fields = f.readline().strip().split('\t')
for line in f.readlines():
yield process_tag_record(fields, line)
I receive the following error:
Traceback (most recent call last):
File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
main()
File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
all_tags = list(tags("tag.txt"))
File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
content = f.read()
File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
return self.reader.read(size)
File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?
What have I tried
I did a hexdump of the file and found that the offending text was the text "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I decode the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context as it is the soft hyphen. But the following does not seem to work:
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x41".decode("utf-8")
'A'
>>> b"\xad".decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec cant decode byte 0xad in position 0: invalid start byte
>>> b"\xc2ad".decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc2 in position 0: invalid continuation byte
I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.
Hexdump:
0036ae40 31 09 09 09 09 53 55 50 50 4c 45 4d 45 4e 54 41 |1....SUPPLEMENTA|
0036ae50 4c 20 44 49 53 43 4c 4f 53 55 52 45 20 4f 46 20 |L DISCLOSURE OF |
0036ae60 4e 4f 4e ad 43 41 53 48 20 49 4e 56 45 53 54 49 |NON.CASH INVESTI|
0036ae70 4e 47 20 41 4e 44 20 46 49 4e 41 4e 43 49 4e 47 |NG AND FINANCING|
0036ae80 20 41 43 54 49 56 49 54 49 45 53 3a 09 0a 50 72 | ACTIVITIES:..Pr|

You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:
>>> '\u00ad'.encode('utf8')
b'\xc2\xad'
Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.
I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using error='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
Another possibility is that the SEC is really using a different encoding for the file; for example in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked), and open tags.txt, I can't decode the data as UTF-8:
>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
b'CTIVITIES:\t\nProceedsFromSaleOfIn')
There are two such non-ASCII characters in the file:
>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
b'NVESTING AND FINANCING ACTIVITIES:\t\n',
b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
b'e.\n']
Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
There are also several 0xC1 / 0xD1 pairs in the file:
>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'
I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!
There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
If we assume that the encoding is broken, we can attempt to repair. The following code would read the file and fix the quotes issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:
_map = {
# dashes
0x13: '\u2013', 0x14: '\u2014',
# single quotes
0x18: '\u2018', 0x19: '\u2019',
# double quotes
0x1c: '\u201c', 0x1d: '\u201d',
}
def repair(line, _map=_map):
"""Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
return line.translate(_map)
then apply that to lines you read:
with open(filename, 'r', encoding='latin-1') as f:
repaired = map(repair, f)
fields = next(repaired).strip().split('\t')
for line in repaired:
yield process_tag_record(fields, line)
Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:
def tags(filename):
"""Yield Tag instances from tag.txt."""
with open(filename, 'r', encoding='utf-8', errors='strict') as f:
fields = next(f).strip().split('\t')
for line in f:
yield process_tag_record(fields, line)
If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:
import csv
def tags(filename):
"""Yield Tag instances from tag.txt."""
with open(filename, 'r', encoding='utf-8', errors='strict') as f:
reader = csv.reader(f, delimiter='\t')
fields = next(reader)
for row in reader:
yield process_tag_record(fields, row)
If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:
def tags(filename):
"""Yield Tag instances from tag.txt."""
with open(filename, 'r', encoding='utf-8', errors='strict') as f:
reader = csv.DictReader(f, delimiter='\t')
# first row is used as keys for the dictionary, no need to read fields manually.
yield from reader

Related

python read binary file byte by byte and read line feed as 0a

This might seem pretty stupid, but I'm a complete newbie in python.
So, I have a binary file that holds data as
47 40 ad e8 66 29 10 87 d7 73 0a 40 10
When I tried to read it with python by
with open(absolutePathInput, "rb") as f:
for line in self.file:
for byte, nextbyte in zip(line[:], line[1:]):
if state == 'wait_for_sync_1':
if ((byte == 0x10) and (nextbyte == 0x87)):
state = 'message_id'
I get
all bytes but 0a is read as line feed(i.e. \n)
It seems that it considers as "line feed" and self.file reads only till 0a.
correct me to read 0a as hex value

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte [duplicate]

This question already has an answer here:
python read file utf-8 decode issue
(1 answer)
Closed 1 year ago.
I am trying to read and write some non-English characters with python but I am having troubles and do not know why it does not work:
a = "hečžo wĐlk" * 10
with open("neki.txt", 'w') as f:
f.write(a)
with open("neki.txt", 'r') as f:
for i in range(4):
f.seek(i*10)
print(i)
print(f.read(10))
This is the program output, it works in the first loop pass and then errors:
output:
0
hečžo wĐlk
1
error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~/githubs/wikiq/diskdic/rank_disk_gen.py in
46 f.seek(i*10)
47 print(i)
----> 48 print(f.read(10))
/usr/lib/python3.8/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte
If I try to cat neki.txt I get the correct output.
Unicode is very fun. If you comment out the f.seek line, the program runs correctly. So why does it error in the first place? The answer is that UTF-8 is a multi-byte encoding, meaning that some characters require multiple bytes to store. If we add a f.tell to see what the current position is, like this:
a = "hečžo wĐlk" * 10
with open("neki.txt", 'w') as f:
f.write(a)
with open("neki.txt", 'r') as f:
for i in range(4):
#f.seek(i*10)
print(i)
print(f.read(10))
print("curr pos",f.tell())
You can see that the byte offset each iteration is actually 13 instead of 10:
0
hečžo wĐlk
curr pos 13
1
hečžo wĐlk
curr pos 26
2
hečžo wĐlk
curr pos 39
3
hečžo wĐlk
curr pos 52
Your "10 char" string is actually 13 bytes. What this means is that f.seek(i*10) will try to seek to byte 10, which happens to be in the middle of a multi-byte char. This will cause the error that you have described.

Converting broken byte string from unicode back to corresponding bytes

The following code retrieves an iterable object of strings in rows which contains a PDF byte stream. The string row was type of str. The resulting file was a PDF format and could be opened.
with open(fname, "wb") as fd:
for row in rows:
fd.write(row)
Due to a new C-Library and changes in the Python implementation the str changes to unicode. And the corresponding content changed as well so my PDF file is broken.
Starting bytes of first row object:
old row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A ...
new row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A ...
I adjust the corresponding byte positions here so it looks like a unicode problem.
I think this is a good start but I still have a unicode string as input...
>>> "\xc3\xa2".decode('utf8') # but as input I have u"\xc3\xa2"
u'\xe2'
I already tried several calls of encode and decode so I need a more analytical way to fix this. I can't see the wood for the trees. Thank you.
When you find u"\xc3\xa2" in a Python unicode string, it often means that you have read an UTF-8 encoded file as is it was Latin1 encoded. So the best thing to do is certainly to fix the initial read.
That being said if you have to depend on broken code, the fix is still easy: you just encode the string as Latin1 and then decode it as UTF-8:
fixed_u_str = broken_u_str.encode('Latin1').decode('UTF-8')
For example:
u"\xc3\xa2\xc3\xa3".encode('Latin1').decode('utf8')
correctly gives u"\xe2\xe3" which displays as âã
This looks like you should be doing
fd.write(row.encode('utf-8'))
assuming the type of row is now unicode (this is my understanding of how you presented things).

How to read hex values at specific addresses in Python?

Say I have a file and I'm interested in reading and storing hex values at certain addresses, like the snippet below:
22660 00 50 50 04 00 56 0F 50 25 98 8A 19 54 EF 76 00
22670 75 38 D8 B9 90 34 17 75 93 19 93 19 49 71 EF 81
I want to read the value at 0x2266D, and be able to replace it with another hex value, but I can't understand how to do it. I've tried using open('filename', 'rb'), however this reads it as the ASCII representation of the values, and I don't see how to pick and choose when addresses I want to change.
Thanks!
Edit: For an example, I have
rom = open("filename", 'rb')
for i in range(5):
test = rom.next().split()
print test
rom.close()
This outputs: ['NES\x1a', 'B\x00\x00\x00\x00\x00\x00\x00\x00\x00!\x0f\x0f\x0f(\x0f!!', '!\x02\x0f\x0f', '!\x0f\x01\x08', '!:\x0f\x0f\x03!', '\x0f', '\x0f\x0f', '!', '\x0f\x0f!\x0f\x03\x0f\x12', '\x0f\x0f\x0f(\x02&%\x0f', '\x0f', '#', '!\x0f\x0f1', '!"#$\x14\x14\x14\x13\x13\x03\x04\x0f\x0f\x03\x13#!!\x00\x00\x00\x00\x00!!', '(', '\x0f"\x0f', '#\x14\x11\x12\x0f\x0f\x0f#', '\x10', "5'4\x0270&\x02\x02\x02\x02\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f126&\x13\x0f\x0f\x0f\x13&6222\x0f", '\x1c,', etc etc.
Much more than 5 bytes, and while some of it is in hex, some has been replaced with ASCII.
There's no indication that some of the bytes were replaced by their ASCII representations. Some bytes happen to be printable.
With a binary file, you can simply seek to the location offset and write the bytes in. Working with the line-iterator in the case of binary file is problematic, as there's no meaningful "lines" in the binary blob.
You can do in-place editing like follows (in fake Python):
with open("filename", "rb+") as f:
f.seek(0x2266D)
the_byte = f.read(1)
if len(the_byte) != 1:
# something's wrong; bolt out ...
else
transformed_byte = your_function(the_byte)
f.seek(-1, 1) # back one byte relative to the current position
f.write(transformed_byte)
But of course, you may want to do the edit on a copy, either in-memory (and commit later, as in the answer of #JosepValls), or on a file copy. The problem with gulping the whole file in memory is, of course, sometimes the system may choke ;) For that purpose you may want to mmap part of the file.
Given that is not a very big file (roms should fit fine in today's computer's memory), just do data = open('filename', 'rb').read(). Now you can do whatever you want to the data (if you print it, it will show ascii, but that is just data!). Unfortunately, string objects don't support item assignment, see this answer for more:
Change one character in a string in Python?
In your case:
data = data[0:0x2266C] + str(0xFF) + data[0x2266D:]

Convert UTF-16 to UTF-8 and remove BOM?

We have a data entry person who encoded in UTF-16 on Windows and would like to have utf-8 and remove the BOM. The utf-8 conversion works but BOM is still there. How would I remove this? This is what I currently have:
batch_3={'src':'/Users/jt/src','dest':'/Users/jt/dest/'}
batches=[batch_3]
for b in batches:
s_files=os.listdir(b['src'])
for file_name in s_files:
ff_name = os.path.join(b['src'], file_name)
if (os.path.isfile(ff_name) and ff_name.endswith('.json')):
print ff_name
target_file_name=os.path.join(b['dest'], file_name)
BLOCKSIZE = 1048576
with codecs.open(ff_name, "r", "utf-16-le") as source_file:
with codecs.open(target_file_name, "w+", "utf-8") as target_file:
while True:
contents = source_file.read(BLOCKSIZE)
if not contents:
break
target_file.write(contents)
If I hexdump -C I see:
Wed Jan 11$ hexdump -C svy-m-317.json
00000000 ef bb bf 7b 0d 0a 20 20 20 20 22 6e 61 6d 65 22 |...{.. "name"|
00000010 3a 22 53 61 76 6f 72 79 20 4d 61 6c 69 62 75 2d |:"Savory Malibu-|
in the resulting file. How do I remove the BOM?
thx
This is the difference between UTF-16LE and UTF-16
UTF-16LE is little endian without a BOM
UTF-16 is big or little endian with a BOM
So when you use UTF-16LE, the BOM is just part of the text. Use UTF-16 instead, so the BOM is automatically removed. The reason UTF-16LE and UTF-16BE exist is so people can carry around "properly-encoded" text without BOMs, which does not apply to you.
Note what happens when you encode using one encoding and decode using the other. (UTF-16 automatically detects UTF-16LE sometimes, not always.)
>>> u'Hello, world'.encode('UTF-16LE')
'H\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> u'Hello, world'.encode('UTF-16')
'\xff\xfeH\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00'
^^^^^^^^ (BOM)
>>> u'Hello, world'.encode('UTF-16LE').decode('UTF-16')
u'Hello, world'
>>> u'Hello, world'.encode('UTF-16').decode('UTF-16LE')
u'\ufeffHello, world'
^^^^ (BOM)
Or you can do this at the shell:
for x in * ; do iconv -f UTF-16 -t UTF-8 <"$x" | dos2unix >"$x.tmp" && mv "$x.tmp" "$x"; done
Just use str.decode and str.encode:
with open(ff_name, 'rb') as source_file:
with open(target_file_name, 'w+b') as dest_file:
contents = source_file.read()
dest_file.write(contents.decode('utf-16').encode('utf-8'))
str.decode will get rid of the BOM for you (and deduce the endianness).

Categories

Resources