UnicodeDecodeError, utf-8 invalid continuation byte - python

I'm trying to extract lines from a log file using this code:
with open('fichier.01') as f:
    content = f.readlines()
print(content)
but it always fails with this error:
Traceback (most recent call last):
  File "./parsepy", line 4, in <module>
    content = f.readlines()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2213: invalid continuation byte
How can I fix it?

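Try one of the following:
open('fichier.01', 'rb')
open('fichier.01', encoding='utf-8')
open('fichier.01', encoding='ISO-8859-1')
The traceback shows that 'utf-8' is already the codec in use, so the ISO-8859-1 variant is the one most likely to help here: byte 0xe9 is é in that encoding.
You can also use the io module, although in Python 3 io.open is just an alias for the built-in open:
import io
io.open('fichier.01')
This is a common error when opening files in Python (or any language, really), and one you will soon learn to catch.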

If it's not encoded as text then you will have to open it in binary mode e.g.:
with open('fichier.01', 'rb') as f:
    content = f.readlines()
If it's encoded as something other than UTF-8 and it can be opened in text mode then open takes an encoding argument: https://docs.python.org/3.5/library/functions.html#open
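If you are unsure which encoding the file uses, one option is to try a short list of candidates in order. The following is a minimal sketch, assuming the file is either UTF-8 or ISO-8859-1. The helper name read_lines_with_fallback and the candidate list are illustrative only and not part of any library:

```python
def read_lines_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Return the lines of `path`, decoded with the first encoding that works."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.readlines()
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next candidate
    if last_error is None:
        raise ValueError("no candidate encodings were given")
    raise last_error
```

ISO-8859-1 accepts every possible byte, so it should come last in the list. A successful decode with it does not prove that it is the right encoding.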

Try passing errors='ignore' to open():
with open('fichier.01', errors='ignore') as f:
    ...
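To see what the error handlers actually do, it can help to decode a small byte string by hand. The sample bytes below are made up for illustration. They contain 0xe9, the same byte as in the traceback, in a position where it is not valid UTF-8:

```python
raw = b"caf\xe9 au lait"  # 0xe9 is 'é' in ISO-8859-1, but invalid here as UTF-8

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD in its place
latin1 = raw.decode("iso-8859-1")                 # decodes the byte as 'é'

print(repr(ignored))
print(repr(replaced))
print(repr(latin1))
```

Note that errors='ignore' loses data without any warning, so it is mainly useful when the occasional missing character does not matter.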

Related

Error when printing random line from text file

I need to print a random line from the file "Long films".
My code is:
import random
with open('Long films') as f:
    lines = f.readlines()
print(random.choice(lines))
But it prints this error:
Traceback (most recent call last):
  line 3, in <module>
    lines = f.readlines()
  line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
What do I need to do in order to avoid this error?
The problem is not with printing; it is with reading. Your file seems to contain non-ASCII characters. Try opening it with a different encoding:
with open('Long films', encoding='latin-1') as f:
    ...
Also, have you changed your locale settings, or declared an encoding at the top of your script? Python 3 decodes text files with the locale's preferred encoding, which on most modern systems is UTF-8, so you typically should not be getting this error. The 'ascii' codec in your traceback suggests that your locale is set to something like C or POSIX.
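A quick way to check which encoding open() falls back to, and to compare candidate encodings, is sketched below. The sample bytes are hypothetical. They assume that the 0xe2 in the traceback starts a UTF-8 sequence such as a right single quotation mark, which is common in titles but not confirmed by the question:

```python
import locale

# The encoding open() uses when no encoding= argument is given.
print(locale.getpreferredencoding(False))

raw = b"It\xe2\x80\x99s"  # hypothetical sample: U+2019 stored as UTF-8

as_utf8 = raw.decode("utf-8")      # the intended text
as_latin1 = raw.decode("latin-1")  # no error, but yields mojibake

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # same failure as in the traceback
```

If the file really is UTF-8, encoding='latin-1' will silence the error but produce garbled characters, so encoding='utf-8' is worth trying first.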

UnicodeEncodeError when reading a file

I am trying to read from rockyou wordlist and write all words that are >= 8 chars to a new file.
Here is the code -
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file=out_file, end='')
    print("done")
if __name__ == '__main__':
    main()
Some words are not utf-8.
Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 6, in main
    print(line, file = out_file, end = '')
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>
Update
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
    print("done")
if __name__ == '__main__':
    main()
Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 3, in main
    for line in in_file:
  File "C:\Python\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
Your UnicodeEncodeError: 'charmap' error occurs while writing to out_file (in print()).
By default, open() uses locale.getpreferredencoding(), which on Windows is an ANSI code page (such as cp1252) that cannot represent every Unicode character, and '\u0e45' in particular. cp1252 is a one-byte encoding that can represent at most 256 different characters, whereas Unicode has over a million code points (1,114,112, from U+0000 to U+10FFFF). It can't represent them all.
Pass an encoding that can represent all of the data you need; encoding='utf-8' will work (as @robyschek suggested). If your code can read the data as utf-8 without errors, it can write the same data as utf-8 too.
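The difference between the two encodings can be checked directly on the character from the traceback. This is a small illustration only; the variable names are arbitrary:

```python
import sys

ch = "\u0e45"  # the character named in the UnicodeEncodeError

# Highest code point Python can represent (U+10FFFF).
print(hex(sys.maxunicode))

try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False  # cp1252 has no byte for this character

utf8_bytes = ch.encode("utf-8")  # UTF-8 can encode every code point that text normally contains
print(cp1252_ok, utf8_bytes)
```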
Your UnicodeDecodeError: 'utf-8' error occurs while reading in_file (in for line in in_file). Not all byte sequences are valid utf-8; for example, os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
If you expect the file to be encoded as utf-8, you could pass errors="ignore" to open() to skip occasional invalid byte sequences, or use another error handler that suits your application.
If the file actually uses a different character encoding, pass that encoding instead. Bytes by themselves do not carry an encoding; that metadata must come from another source (though some encodings are more likely than others, and chardet can guess). For example, if the file content is an HTTP body, see "A good way to get the charset/encoding of an HTTP response in Python".
Sometimes broken software generates mostly utf-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try the ftfy utility/library and see if it helps in your case; for example, ftfy may fix some utf-8 variations.
I was having a similar issue. In the case of the rockyou.txt wordlist, I tried a number of the encodings that Python offers and found that encoding='koi8_u' worked to read the file.
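For a wordlist whose entries come in mixed encodings, another option is the standard surrogateescape error handler, which lets invalid bytes pass through unchanged instead of dropping them or guessing an encoding. This is a sketch of that alternative, not the approach taken in the answers above. The function name filter_long_words and its parameters are made up for illustration:

```python
def filter_long_words(in_path, out_path, min_len=8):
    """Copy lines of at least `min_len` characters, preserving undecodable bytes."""
    # surrogateescape maps each invalid byte to a lone surrogate on read and
    # turns it back into the same byte on write. newline="" disables newline
    # translation, so line endings are copied unchanged.
    with open(in_path, encoding="utf-8", errors="surrogateescape",
              newline="") as in_file, \
         open(out_path, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as out_file:
        for line in in_file:
            if len(line.rstrip()) >= min_len:
                out_file.write(line)
```

One caveat: each invalid byte counts as one character toward the length, so words that are not valid UTF-8 may be measured slightly differently from how they would be in their real encoding.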

Python3 CSV reader Unicode Decode error

I have a huge CSV file in UTF-8 encoding, but some columns use an encoding that differs from the main file encoding. It looks like this:
input.txt in UTF-8 encoding:
a,b,c
d,"e?",f
g,h,"kü"
The same input.txt in win-1252:
a,b,c
d,"eü",f
g,h,"kÃ¼"
Code:
import csv
file = open("input.txt", encoding="...")
c = csv.reader(file, delimiter=';', quotechar='"')
for itm in c:
    print(itm)
The standard Python 3 csv reader raises an encoding error on such lines. I cannot simply skip these lines, because I still need the "someOther" column, which is always correctly encoded.
Is it possible to have the standard csv reader split the CSV data in some kind of "bytes mode" and then convert each element to a normal Python unicode string, or should I implement my own CSV reader?
Traceback:
Traceback (most recent call last):
  File "C:\Development\t.py", line 7, in <module>
    for itm in c:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 11: invalid start byte
How sure are you that your file is UTF-8 encoded?
For the small sample that you've posted, UTF-8 decoding fails on the ü, which is "LATIN SMALL LETTER U WITH DIAERESIS". When encoded as ISO-8859-1, ü is '\xfc'. Two other possibilities are that the CSV file is UTF-16 encoded (UTF-16 little endian is common on Windows), or even Windows-1252.
If your CSV file is encoded in one of the ISO-8859-X family of encodings, note that any of ISO 8859-1/3/4/9/10/14/15/16 encodes ü as 0xfc.
To solve this, use the correct encoding and open the file like this:
file = open("input.txt", encoding="iso-8859-1")
or, for Windows-1252:
file = open("input.txt", encoding="windows-1252")
or, for UTF-16:
file = open("input.txt", encoding="utf-16")  # or utf-16-le or utf-16-be as required
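If the file really does mix encodings from field to field, as the question describes, one possible workaround is to read it as Latin-1 (which never fails, because every byte maps to exactly one character), let the csv module split the rows, and then re-decode each field separately. This is a sketch under the assumption that fields are either UTF-8 or Windows-1252 and that the delimiter is a comma, as in the sample. The names decode_field and read_mixed_csv are hypothetical:

```python
import csv

def decode_field(text):
    """Re-decode one field that was read as latin-1: UTF-8 first, then cp1252."""
    raw = text.encode("latin-1")  # recover the original bytes exactly
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback guess; bytes that cp1252 does not define become U+FFFD.
        return raw.decode("cp1252", errors="replace")

def read_mixed_csv(path, delimiter=","):
    """Parse a CSV file whose fields may be UTF-8 or cp1252, field by field."""
    # latin-1 maps every byte to one character, so reading never raises.
    with open(path, encoding="latin-1", newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quotechar='"')
        return [[decode_field(field) for field in row] for row in reader]
```

This works because the delimiter and quote characters are ASCII, and ASCII bytes never appear inside a multi-byte UTF-8 sequence. A Windows-1252 field whose bytes happen to form valid UTF-8 would still be misread, so the result is a best guess rather than a guarantee.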

UnicodeDecodeError: invalid start byte

I have a quick question about UnicodeDecodeError: invalid start byte.
I think my text contains a non-UTF-8 character somewhere, but the error message points to the place where the file is read, so I have no idea how to fix it.
If you have any suggestions, please let me know.
Here is the error message returned by Python:
    for line in fi:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
Following is my code:
for filename in os.listdir(readDir):
    filename = os.path.join(readDir, filename)
    for keyword in keywords:
        outFileName = os.path.join(sortDir, keyword)
        outFileName = outFileName+'.csv'
        with open(filename, 'r') as fi, open(outFileName, "a") as fo:
            for line in fi:
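I had the same issue, and after searching for a while this is what I did. Note that this approach only applies to Python 2: sys.setdefaultencoding does not exist in Python 3.
import sys
reload(sys)  # Python 2 only: restores setdefaultencoding, which is removed at startup
# Set the default encoding
sys.setdefaultencoding("ISO-8859-1")
# Then convert the string to UTF-8
yourString.encode('utf-8').strip()
On Python 3, as used in the question, pass the encoding to open() instead, for example open(filename, 'r', encoding='ISO-8859-1').
I hope it will be useful to someone.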
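To find out where in the file the offending byte actually sits, you can read the file in binary mode and let the exception report the position. This is a minimal sketch; find_invalid_bytes is a made-up helper name, and the 20-byte context window is an arbitrary choice:

```python
def find_invalid_bytes(data, encoding="utf-8"):
    """Return (offset, line_number, context) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the invalid bytes within `data`.
        line_number = data.count(b"\n", 0, exc.start) + 1
        context = data[max(0, exc.start - 20):exc.end + 20]
        return exc.start, line_number, context
    return None
```

For example, with open(filename, 'rb') as fi: print(find_invalid_bytes(fi.read())) shows the byte offset, the line number, and the surrounding bytes, which often makes the real encoding easy to guess.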

Getting UnicodeDecodeError while accessing csv file

Input file : chars.csv :
4,,x,,2,,9.012,2,,,,
6,,y,,2,,12.01,±4,,,,
7,,z,,2,,14.01,_3,,,,
When I try to parse this file, I get this error even after specifying utf-8 encoding.
>>> f=open('chars.csv',encoding='utf-8')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 36: invalid start byte
How to correct this error?
Version: Python 3.2.3
Your input file is clearly not UTF-8 encoded, so you have at least these options:
- f=open('chars.csv', encoding='utf-8', errors='ignore') if the file is mostly UTF-8 and you don't mind losing a small amount of data. For other values of the errors parameter, check the manual.
- Simply use the proper encoding, such as latin-1, if you know it.
This is not UTF-8 encoding. The UTF-8 encoding of ± is \xC2\xB1 and that of U+0083 is \xC2\x83. As RobertT suggested, try Latin-1:
f=open('chars.csv',encoding='latin-1')
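The byte values mentioned above can be verified in a few lines. The sample row below is the second data line from the question, written out under the assumption that the file stores ± as the single Latin-1 byte 0xb1:

```python
utf8_bytes = "±".encode("utf-8")      # two bytes in UTF-8
latin1_bytes = "±".encode("latin-1")  # one byte, the one named in the traceback

line = b"6,,y,,2,,12.01,\xb14,,,,"  # assumed raw bytes of the second sample row
decoded = line.decode("latin-1")

try:
    line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # a lone 0xb1 is not a valid UTF-8 start byte

print(utf8_bytes, latin1_bytes, decoded, utf8_ok)
```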
