Getting UnicodeDecodeError while accessing csv file - python

Input file : chars.csv :
4,,x,,2,,9.012,2,,,,
6,,y,,2,,12.01,±4,,,,
7,,z,,2,,14.01,_3,,,,
When I try to parse this file, I get this error even after specifying utf-8 encoding.
>>> f=open('chars.csv',encoding='utf-8')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 36: invalid start byte
How can I correct this error?
Version: Python 3.2.3

Your input file is clearly not UTF-8 encoded, so you have at least these options:
Use f=open('chars.csv', encoding='utf-8', errors='ignore') if the file is mostly UTF-8 and you don't mind a small amount of data loss. For other values of the errors parameter, check the documentation for open.
Use the correct encoding, such as latin-1, if you know it.
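To make the trade-off of the errors parameter concrete, here is a minimal sketch that decodes bytes in memory instead of opening the file. The sample bytes are an assumption: they are the second row of chars.csv as it would look if it had been saved as Latin-1, where 0xb1 is the plus-minus sign.

```python
# Second row of chars.csv, assuming the file was saved as Latin-1
# (0xb1 is the Latin-1 byte for the plus-minus sign).
raw = b"6,,y,,2,,12.01,\xb14,,,,\n"

# errors="ignore" silently drops the byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, so the loss stays visible.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

With 'ignore' the sign disappears without a trace, which may matter for numeric columns; 'replace' at least leaves a marker you can search for afterwards.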

This file is not UTF-8 encoded. In UTF-8, ± is encoded as \xC2\xB1 and the control character U+0083 as \xC2\x83. As RobertT suggested, try Latin-1:
f=open('chars.csv',encoding='latin-1')
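As a rough end-to-end sketch of the Latin-1 suggestion, the snippet below recreates two rows of the sample in a temporary file and parses them with the csv module. Writing the sample as Latin-1 is an assumption about how the original file was produced, and the temporary path stands in for chars.csv.

```python
import csv
import os
import tempfile

# Two rows of the sample; "\u00b1" is the plus-minus sign.
sample = "4,,x,,2,,9.012,2,,,,\n6,,y,,2,,12.01,\u00b14,,,,\n"

# Assumption: the original file was written as Latin-1.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample.encode("latin-1"))
    path = tmp.name

# newline="" is what the csv module documentation recommends for file objects.
with open(path, encoding="latin-1", newline="") as f:
    rows = list(csv.reader(f))

os.remove(path)
print(rows[1][7])
```

Note that Latin-1 maps every possible byte to a character, so it never raises; a wrong guess shows up as odd characters rather than as an exception.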

Related

Python: Can't read file encoded in ASCII

I generated a bugreport in Android through ADB and extracted the large report file. But when I open and read that file, it prints:
>>> f = open('bugreport.txt')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 12788794: invalid start byte
>>> f = open('bugreport.txt', encoding='ascii')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 5455694: ordinal not in range(128)
It seems that neither UTF-8 nor ASCII codec can decode the file.
Then I checked the file encoding by two commands:
$ enca bugreport.txt
7bit ASCII characters
$ file -i bugreport.txt
bugreport.txt: text/plain; charset=us-ascii
Both report that the file is ASCII, yet I can't open it with the ascii codec.
Some other clues:
1. The interpreter above is Python 3.6.3. I tried Python 2.7.14 and it worked.
2. If the file is opened with errors='ignore' and encoding='ascii', it can be read, but all Chinese characters are lost.
So how can I open this peculiar file in Python 3? Can anyone help me?
In Python 3 you can specify the encoding when opening the file:
with open(file, encoding='utf-8') as f:
    data = f.read()
It's likely that the file is encoded as latin-1 or utf-16 (little-endian).
>>> bytes_ = [b'\xc0', b'\xef']
>>> for b in bytes_:
...     print(repr(b), b.decode('latin-1'))
...
b'\xc0' À
b'\xef' ï
>>> bytes_ = [b'\xc0\x00', b'\xef\x00']
>>> for b in bytes_:
...     print(repr(b), b.decode('utf-16le'))
...
b'\xc0\x00' À
b'\xef\x00' ï
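Before guessing an encoding, it can help to look at the bytes around the reported position. The sketch below uses the start and end attributes of UnicodeDecodeError to do that. The helper name describe_decode_failure and the GBK sample are illustrative assumptions, not something taken from the bug report.

```python
def describe_decode_failure(data, encoding="utf-8", context=8):
    """Return (offset, surrounding bytes) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return exc.start, data[lo:exc.end + context]
    return None


# Stand-in data: ASCII text with one GBK-encoded Chinese word in the middle.
sample = b"log line: " + "\u4e2d\u6587".encode("gbk") + b" end\n"

print(describe_decode_failure(sample))
print(sample.decode("gbk"))
```

For the real file you would read it in binary mode first, for example open('bugreport.txt', 'rb').read(), and pass the result to the helper. A large file can be almost entirely ASCII and still contain a few stray bytes far from the start, which tools that sample only part of the file may not see.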

UnicodeDecodeError when using specific python library (gender-detector)

I need to do gender guessing for some analysis, and after some research I found this Python library on GitHub: malev/gender-detector
I followed the instructions with one tweak: the readme says to use import gender_detector as gd, but I needed:
from gender_detector import gender_detector as gd
The library has four datasets ('us', 'uk', 'ar', 'uy'), but it only works with 'us' or 'uk'.
See the example below:
from gender_detector import gender_detector as gd
detector = gd.GenderDetector('us')
detector2 = gd.GenderDetector('ar')
detector.guess('Marcos')
Out[25]: 'male'
detector2.guess('Marcos')
Traceback (most recent call last):
File "", line 1, in
detector2.guess('Marcos')
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/gender_detector.py", line 25, in guess
initial_position = self.index(name[0])
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/index.py", line 19, in __call__
self._generate_index()
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/index.py", line 25, in _generate_index
total = file.readline() # Omit headers line
File "/home/cpneto/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1078: invalid continuation byte
I believe this is a Python 2 vs Python 3 compatibility issue, but I'm not sure and have no idea how to solve it.
Any suggestions?
The library assumes your ar file is UTF-8 encoded, but it isn't (hence the byte 0xf1 in position 1078 error). You need to either convert your file to UTF-8 or find some way to pass the actual encoding to the library.
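Converting the data file could look roughly like the sketch below. The function name convert_to_utf8, the file names, and the choice of latin-1 as the source encoding are all assumptions for illustration (0xf1 is ñ in Latin-1, which would be plausible for Spanish names, but the actual encoding of the dataset is not confirmed here).

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file to UTF-8, reading it with the given source encoding."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo with a stand-in file; the real dataset path is not known here.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "ar_latin1.csv")
dst = os.path.join(tmp_dir, "ar_utf8.csv")
with open(src, "wb") as f:
    f.write(b"name,gender\nI\xf1aki,male\n")

convert_to_utf8(src, dst, "latin-1")

with open(dst, encoding="utf-8") as f:
    converted = f.read()
print(converted)
```

Writing to a separate destination file keeps the original intact, so a wrong guess about the source encoding can be undone.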

UnicodeDecodeError, utf-8 invalid continuation byte

I'm trying to extract lines from a log file using this code:
with open('fichier.01') as f:
    content = f.readlines()
print(content)
but it always fails with this error:
Traceback (most recent call last):
File "./parsepy", line 4, in <module>
content = f.readlines()
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2213: invalid continuation byte
How can I fix it?
Try one of the following:
open('fichier.01', 'rb')
open('fichier.01', encoding='utf-8')
open('fichier.01', encoding='ISO-8859-1')
You can also use the io module, although in Python 3 io.open is simply an alias for the built-in open:
import io
io.open('fichier.01')
This is a common error when opening files in Python (or in any language, really), and one you will soon learn to recognize.
If the file is not encoded as text, open it in binary mode, e.g.:
with open('fichier.01', 'rb') as f:
    content = f.readlines()
If it's encoded as something other than UTF-8 and it can be opened in text mode then open takes an encoding argument: https://docs.python.org/3.5/library/functions.html#open
Alternatively, pass errors='ignore' to skip the bytes that cannot be decoded:
with open('fichier.01', errors='ignore') as f:
    content = f.readlines()
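When the encoding is unknown, one common approach is to try a short list of candidates in order. This is only a sketch: the helper name decode_with_fallback and the candidate list are assumptions, and the sample bytes stand in for a line of fichier.01.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Stand-in for a log line containing byte 0xe9, as in the traceback above.
sample = b"caf\xe9 ouvert\n"
text, used = decode_with_fallback(sample)
print(used, repr(text))
```

The order matters: UTF-8 is strict and fails quickly on other encodings, while latin-1 accepts any byte sequence, so it belongs last. A successful decode does not prove the guess is right, so it is worth checking a few accented words by eye.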

Decoding HappyBase data from HBase

While trying to decode values from HBase, I get the error below. Python evidently thinks the data is not UTF-8, but the Java application that put the data into HBase encoded it as UTF-8.
a = '\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'
a.decode("UTF-8")
Traceback (most recent call last):
File "", line 1, in
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 9: invalid continuation byte
Any thoughts?
That data is not valid UTF-8, so if you really retrieved it as such from the database, you should check who or what put it there.
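One way to start checking is to look at the raw bytes rather than decoding them as text. The sketch below is written for Python 3 (the question uses Python 2.7) and reuses the value from the question. The struct layout at the end is purely hypothetical and only shows how packed binary data could be inspected; the real layout depends on how the Java application serialized the value.

```python
import struct

# The value from the question, as a bytes literal.
value = b"\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y"

print(len(value), value.hex())

# Confirm where and why UTF-8 decoding fails.
failure = None
try:
    value.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)
print(failure)

# Hypothetical layout, chosen only for illustration:
# three big-endian signed 32-bit integers.
fields = struct.unpack(">3i", value)
print(fields)
```

The leading run of zero bytes is unusual for text, which suggests, without proving it, that the value may be serialized numbers rather than a string.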

UnicodeDecodeError in Python 2.7

I am trying to read a UTF-8 encoded XML file in Python and do some processing on the lines read from the file, something like this:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Here doc_content is the line read from the file and word_value is one of the strings from the same line. I get an encoding-related error for the line above whenever doc_content or word_value contains Unicode characters. So I tried decoding them as UTF-8 first (instead of relying on the default ASCII codec), as below:
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting UnicodeDecodeError as below :
Traceback (most recent call last):
File "snippetRetriver.py", line 402, in <module>
sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
File "snippetRetriver.py", line 201, in getSentenceList
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest a suitable approach to avoid these kinds of encoding errors in Python 2.7?
codecs.utf_8_decode(input.encode('utf8'))
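The traceback ends in a UnicodeEncodeError from the ascii codec, which in Python 2 typically means .decode() was called on a value that was already unicode, so Python first tried to encode it to ASCII. A common way around this is to decode only byte strings and to leave text untouched. The sketch below shows that idea; the helper name to_text and the sample strings are assumptions, and the code is written so that it should behave the same on Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; return text values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in values: one still raw bytes, one already decoded text.
doc_content = u"r\u00e9sum\u00e9 du caf\u00e9 et du caf\u00e9".encode("utf-8")
word_value = u"caf\u00e9"

# Both operands are text here, so find() works on characters, not bytes.
index = to_text(doc_content).find(to_text(word_value), 0)
print(index)
```

Decoding once, right after reading the file, and working with text everywhere else usually avoids having to sprinkle decode calls through the processing code.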
