Python CSV file UTF-16 to UTF-8 print error

There are a number of topics on this problem around the web, but I cannot seem to find the answer for my specific case.
I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is a full Traceback:
Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
With the help of Stack Overflow, I got it open with:
reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
Now the problem is that when I am reading the file:
def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row
I get stuck on Asian symbols:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
I have tried decode(), encode(), and unicode() on the string containing that character, but it does not seem to help.
for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row
At this point I do not really understand why. If I
>>> print u'\u5a07'
娇
or
>>> print '娇'
娇
it works. It also works in the terminal. I have checked the default encoding in both the terminal and the Python shell; it is UTF-8 everywhere, and it prints that symbol easily. I assume it has something to do with opening the file with codecs using UTF-16.
I am not sure where to go from here. Could anyone help out?

The Python 2 csv module cannot handle Unicode input. It says so specifically on its documentation page:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;
You need to convert your CSV file to UTF-8 so that the module can deal with it:
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))
Alternatively, you can use the command-line utility iconv to convert the file for you (e.g. iconv -f UTF-16 -t UTF-8 input.csv > output.csv).
Then use that re-coded file to read your data:
reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    print [c.decode('utf8') for c in row]
Note that the columns then need decoding to unicode manually.
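If you'd rather not repeat the per-column decoding everywhere, you can wrap it in a small generator, along the lines of the recipe in the Python 2 csv documentation. A minimal sketch (the function name is mine):
import csv

def unicode_csv_reader(utf8_file, **kwargs):
    # csv.reader yields rows of UTF-8 byte strings; decode each cell on the way out
    for row in csv.reader(utf8_file, **kwargs):
        yield [cell.decode('utf-8') for cell in row]

with open(file_full_path + '.utf8', 'rb') as infile:
    for row in unicode_csv_reader(infile, delimiter='\t', quotechar='"'):
        print row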

Encode errors are what you get when you try to convert unicode characters to 8-bit sequences, so your error does not happen when actually reading the file, but a bit later, when you print the rows.
You probably get this error because the Python 2 csv module expects files to be opened in binary mode, while you opened yours so that it returns unicode strings.
Change your opening to this:
reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')
And you should be fine. Or even better:
with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.
However, you can't use UTF-16 (or UTF-32) this way: the delimiter characters are two bytes wide in UTF-16, and the csv module will not handle that correctly, so you will need to convert the file to UTF-8 first.
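If you'd rather not write a temporary UTF-8 copy to disk, you can do the conversion on the fly: csv.reader accepts any iterable of lines, so a generator can re-encode each UTF-16 line to UTF-8 as it is read. A sketch, assuming Python 2 and the tab-delimited file from the question:
import csv, codecs

def utf8_lines(path):
    # Decode the file from UTF-16, then hand csv.reader plain UTF-8 byte strings
    with codecs.open(path, 'rU', 'UTF-16') as f:
        for line in f:
            yield line.encode('utf-8')

reader = csv.reader(utf8_lines(file_full_path), delimiter='\t', quotechar='"')
for row in reader:
    print [cell.decode('utf-8') for cell in row]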

Related

Reading Hong Kong Supplementary Character Set in python 3

I have an HKSCS dataset that I am trying to read in Python 3. Below is my code:
encoding = 'big5hkscs'
lines = []
num_errors = 0
for line in open('file.txt'):
    try:
        lines.append(line.decode(encoding))
    except UnicodeDecodeError as e:
        num_errors += 1
It throws the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte. It seems there is a non-UTF-8 character in the dataset that the code is not able to decode.
I tried adding errors='ignore' in this line:
lines.append(line.decode(encoding, errors='ignore'))
But that does not solve the problem.
Can anyone please suggest?
If a text file contains text encoded with an encoding that is not the default encoding, the encoding must be specified when opening the file to avoid decoding errors:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        ...  # do something with line
Alternatively, the file may be opened in binary mode, and text decoded afterwards:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'rb') as f:
    for line in f:
        decoded = line.decode(encoding)
        # do something with decoded text
In the question, the file is opened without specifying an encoding, so its contents are automatically decoded with the default encoding - apparently UTF-8 in this case.
It looks like if I do NOT add the except clause except UnicodeDecodeError as e, it works fine:
encoding = 'big5hkscs'
lines = []
path = 'file.txt'
with open(path, encoding=encoding, errors='ignore') as f:
    for line in f:
        line = '\t' + line
        lines.append(line)
snakecharmerb's answer is correct, but perhaps you need an explanation.
You didn't say so in the original question, but I assume the error occurs on the for line. On that line the file is being decoded from UTF-8 (probably the default in your environment, so presumably not Windows), and later you try to decode it again. So the error is not about decoding big5hkscs, but about opening the file as text in the first place.
As in the second part of snakecharmerb's good answer, you should open the file in binary mode, so that nothing is decoded on read, and then decode the text yourself with line.decode(encoding). (Iterating over a binary file does yield lines; they are split on b'\n' and returned as bytes.) This way you can still catch the errors, e.g. to print a message. Otherwise the normal way is to decode at open() time, but then you lose the ability to fall back and to give users a better error message (e.g. with a line number).
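Putting that together: a sketch of the binary-mode variant that keeps the try/except from the question but also reports the line number of each failure (iterating a binary file yields bytes objects split on b'\n'):
encoding = 'big5hkscs'
lines = []
num_errors = 0
with open('file.txt', 'rb') as f:        # binary mode: no implicit decoding
    for lineno, raw in enumerate(f, 1):  # raw is a bytes object
        try:
            lines.append(raw.decode(encoding))
        except UnicodeDecodeError as e:
            num_errors += 1
            print(f'line {lineno}: {e}')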

'utf-8' codec can't decode byte 0x8a in position 170: invalid start byte

I am trying to do this:
fh = request.FILES['csv']
fh = io.StringIO(fh.read().decode())
reader = csv.DictReader(fh, delimiter=";")
This always fails with the error in the title, and I have spent almost 8 hours on it.
Here is my understanding: I am using Python 3, so the file fh is in bytes. I am decoding it into a string and putting it in memory via StringIO, then trying to read it as dicts with csv.DictReader(). It fails at the decode step.
I also tried io.StringIO(fh.read().decode('utf-8')), but I get the same error.
what am I missing? :/
The error occurs because there is some non-ASCII character in the file that can't be decoded. One simple way to avoid this error is to encode/decode such strings explicitly with the encode()/decode() functions, as follows (if a is the string with the non-ASCII character):
a.decode('utf-8')
Also, you could try opening the file as:
with open('filename', 'r', encoding='utf-8') as f:
    ...  # your code, using f as the file object
use 'rb' if your file is binary.
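For the uploaded file from the question, where the real encoding is unknown, one approach is a fallback chain of candidate encodings. A sketch (the candidate list is an assumption; latin-1 comes last because it accepts every byte and therefore never raises):
import csv, io

raw = request.FILES['csv'].read()  # bytes
for candidate in ('utf-8-sig', 'utf-8', 'cp1252', 'latin-1'):
    try:
        text = raw.decode(candidate)
        break
    except UnicodeDecodeError:
        continue

reader = csv.DictReader(io.StringIO(text), delimiter=';')
for row in reader:
    print(row)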

Why does csv.reader fail when the file size is larger than 40K bytes?

I have the following code:
with open(filename, 'rt') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)
Whenever the file size is less than 40k bytes, the program works great.
When the file size crosses 40k, I get this error while trying to read the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7206: invalid start byte
The actual file content doesn't seem to be a problem, only the size of the file itself (40k bytes is really tiny).
When file size is greater than 40K bytes, the error always happens on the line that contains the 32K-th byte.
I have a feeling that Python cannot read a file of more than 40K bytes without an exception and just truncates it around the 32K-th byte, in the middle. Is that correct? Where is this limit defined?
You have invalid UTF-8 data in your file. This has nothing to do with the csv module, nor the size of the file; your larger file has invalid data in it, your smaller file does not. Simply doing:
with open(filename) as f:
    f.read()
should trigger the same error, and it's purely a matter of encountering an invalid UTF-8 byte, which indicates your file either wasn't UTF-8 to start with, or has been corrupted in some way.
If your file is actually a different encoding (e.g. latin-1, cp1252, etc.; the file command line utility might help with identification, but for many ASCII superset encodings, you just have to know), pass that as the encoding argument to open to use instead of the locale default (utf-8 in this case), so you can decode the bytes properly, e.g.:
# Also add newline='' to defer newline processing to csv module, where it's part
# of the CSV dialect
with open(filename, encoding='latin-1', newline='') as csvfile:
    csvDictReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in csvDictReader:
        print(row)
The file size is not the real problem. Look at the exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7206: invalid start byte
You should handle the encoding problem first:
with open(filename, 'rt', encoding='utf-8', errors='ignore') as csvfile:
which will simply ignore the encoding errors (note that the offending bytes are silently dropped).
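If you would rather find out the real encoding than throw bytes away, a detection library can guess it from a sample of the file. A sketch using the third-party chardet package (an assumption, not part of the question):
import csv
import chardet  # third-party: pip install chardet

with open(filename, 'rb') as f:
    guess = chardet.detect(f.read())  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

with open(filename, 'rt', encoding=guess['encoding'], newline='') as csvfile:
    for row in csv.DictReader(csvfile, delimiter=',', quotechar='"'):
        print(row)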

UnicodeDecodeError: save to file in python

I want to read a file, find something in it, and save the result, but when I try to save it I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key]+ ';'+src + alt +'\n').decode('utf-8'))
What can I do to fix it?
Thank you
You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io

with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route, you'll have to explicitly encode your columns:
import csv

with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you also decode to Unicode first; again, io.open() to read these files is probably the best option, to have the data decoded to Unicode values for you as you read.
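For example, if src and alt were read from another file, something like this keeps everything unicode from the start (a sketch, assuming the source file is UTF-8):
import io

with io.open('source.html', 'r', encoding='utf8') as f:
    src = f.read()  # a unicode string, safe to mix with the other unicode values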

Python: How do I read and parse a unicode utf-8 text file?

I am exporting UTF-8 text from Excel and I want to read and parse the incoming data using Python. I've read all the online info, so I've already tried this, for example:
txtFile = codecs.open( 'halout.txt', 'r', 'utf-8' )
for line in txtFile:
    print repr( line )
The error I am getting is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: unexpected code byte
Looking at the text file in a hex editor, the first bytes are FF FE. I've also tried:
txtFile.seek( 2 )
right after the 'open', but that just causes a different error.
That file is not UTF-8; it's UTF-16LE with a byte-order marker.
That is a BOM (byte-order mark). EDIT: from the comments, it seems to be a UTF-16 BOM, so
codecs.open('foo.txt', 'r', 'utf-16')
should work.
Expanding on Johnathan's comment, this code should read the file correctly:
import codecs
txtFile = codecs.open( 'halout.txt', 'r', 'utf-16' )
for line in txtFile:
    print repr( line )
Try to see whether the Excel file has some blank rows (and then values again); that might cause the unexpected error.
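If you receive files that are sometimes UTF-8 and sometimes UTF-16, you can sniff the BOM yourself before picking a codec. A sketch using the BOM constants from the codecs module (the helper name is mine):
import codecs

def guess_bom_encoding(path):
    # Look at the first bytes of the file for a byte-order mark
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'     # the utf-16 codec consumes the BOM itself
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'  # decodes UTF-8 and strips the BOM
    return 'utf-8'

txtFile = codecs.open('halout.txt', 'r', guess_bom_encoding('halout.txt'))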
