I'm currently trying to use some simple regex on a very large .txt file (a couple of million lines of text). The simplest code that reproduces the problem:
file = open("exampleFileName", "r")
for line in file:
    pass
The error message:
Traceback (most recent call last):
File "example.py", line 34, in <module>
example()
File "example.py", line 16, in example
for line in file:
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte
How can I fix this? Is utf-8 the wrong encoding? And if it is, how do I know which one is right?
Thanks and best regards!
It looks like the file is not valid UTF-8, so you should try reading it with the latin-1 encoding. Try
file = open('exampleFileName', 'r', encoding='latin-1')
It is not possible to reliably identify the encoding on the fly. So either use the method I wrote in a comment or use a construction like the following (as proposed by another answer), though this is a wild shot:
try:
    file = open("exampleFileName", "r")
    file.read()  # the error is raised while reading, not by open() itself
except UnicodeDecodeError:
    try:
        file = open("exampleFileName", "r", encoding="latin2")
        file.read()
    except UnicodeDecodeError:
        ...
And so on, until you test all the encodings from Standard Python Encodings.
So I think there's no need to bother with this nested hell; just run file -bi [filename] once, copy the encoding, and forget about it.
UPD: Actually, I've found another Stack Overflow answer which you can use if you're on Windows.
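If you do want to automate the fallback rather than nest try/except blocks by hand, a loop over candidate encodings is tidier. This is only a sketch: the candidate list and the full read() probe are my assumptions, and note that latin-1 can decode any byte sequence, so it always "succeeds" even when it is the wrong guess.

```python
import os
import tempfile

def open_with_fallback(path, candidates=("utf-8", "latin-1")):
    """Return a text-mode file object opened with the first candidate
    encoding that can decode the entire file."""
    for enc in candidates:
        try:
            with open(path, encoding=enc) as probe:
                probe.read()  # force a full decode to surface errors now
            return open(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode " + path)

# Demo: a file containing the byte 0xED, which is invalid UTF-8 here
tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
tmp.write(b"caf\xed\n")
tmp.close()

f = open_with_fallback(tmp.name)
chosen = f.encoding  # utf-8 fails, so this falls through to latin-1
f.close()
os.unlink(tmp.name)
```

Probing with a full read() is slow on a file with millions of lines, which is another reason checking once with file -bi is the more practical route.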
Related
I am very new to Python scripting, but I have a very simple task I would like to perform, and I seem to be stuck on it. All I am trying to accomplish is to read data from a .txt file and parse it.
Steps I have taken
I downloaded the PDF file from my school's website; it contains a list of courses: http://info.sjsu.edu/cgi-bin/pdfserv?ftok=soc-fall-courses
I converted the PDF file to a .txt file simply by saving it as a .txt file
Googled the error and found out that it is some sort of encoding issue
Ran the terminal command file -I [filename], which returned sjsuclassdata.txt: text/plain; charset=unknown-8bit
Tried many of the methods online to convert the file to UTF-8 encoding, but to no avail
Error Message that I got
Traceback (most recent call last):
File "/Users/edward/MyPythonScripts/sjsuClassExtractor.py", line 25, in <module>
regexMatches = lectureRegex.findall(file.read())
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 9: invalid continuation byte
So as you can see, I am really lost as to what I'm supposed to do from here. I have verified that everything works if I read a different file that contains similar data.
Assuming that the original text file is ANSI encoded (default with Acrobat Reader's 'Save As Text' option), this command will convert it to utf-8:
iconv -f "iso-8859-1" -t "utf-8" sjsuclassdata.txt -o sjsuclassdata-utf8.txt
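If you'd rather stay in Python, the same conversion is only a couple of lines. A sketch; the sample file content below is made up just to keep the example self-contained:

```python
# Demo setup: write a small ISO-8859-1 (Latin-1) encoded file
with open("sjsuclassdata.txt", "wb") as f:
    f.write("Curso de Matem\u00e1ticas\n".encode("iso-8859-1"))

# The conversion itself: read as Latin-1, write back out as UTF-8,
# the Python equivalent of the iconv command above
with open("sjsuclassdata.txt", encoding="iso-8859-1") as src, \
        open("sjsuclassdata-utf8.txt", "w", encoding="utf-8", newline="") as dst:
    dst.write(src.read())
```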
I would like to import even a simple text file into Python. For example, here's the contents of example.txt:
hello
my
friend
Very simple. However, when I try to import the file and read it:
f = open('example.txt')
f.read()
I get the following error:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
f.read()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
What's the source of this problem? Clearly there aren't any non-ASCII characters in the file.
I've tried this in IDLE, terminal (Mac OSX) and Rodeo and get similar issues in all.
I'm very new to Python and am concerned I may have screwed up something in installation. I've downloaded various versions over the years, straight from Python, Anaconda, macports, etc. and I'm wondering if the various sources are not playing nicely...
Python 3.5.1 on OSX 10.11.4.
Maybe your file is saved with a BOM (byte order mark): byte 0xff at position 0 is exactly how the UTF-16 little-endian BOM (FF FE) starts, and a UTF-8 BOM (EF BB BF) would trip the ASCII codec the same way. Try saving your file explicitly as UTF-8 without a BOM. Since the BOM bytes are not valid ASCII, decoding the file with the ASCII codec raises a UnicodeDecodeError.
Hope this helps!
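If you are not sure whether (or which) BOM is present, you can peek at the first few bytes yourself. A sketch, covering only the common UTF-8 and UTF-16 BOMs; the no-BOM default of utf-8 is an assumption:

```python
import codecs

def sniff_bom(path):
    """Guess an encoding from a leading BOM; default to utf-8."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # decodes and strips the UTF-8 BOM
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # the utf-16 codec consumes the BOM itself
    return "utf-8"           # assumption: no BOM means plain UTF-8

# Demo: byte 0xff in position 0 is exactly the UTF-16 LE BOM
with open("example.txt", "w", encoding="utf-16") as f:
    f.write("hello\nmy\nfriend\n")

enc = sniff_bom("example.txt")
with open("example.txt", encoding=enc) as f:
    text = f.read()
```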
I'm trying to write out a csv file with Unicode characters, so I'm using the unicodecsv package. Unfortunately, I'm still getting UnicodeDecodeErrors:
# -*- coding: utf-8 -*-
import codecs
import unicodecsv

raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')

with codecs.open('test.csv', 'w', 'UTF-8') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
This is the traceback:
Traceback (most recent call last):
File "unicode_test.py", line 11, in <module>
w.writerow(["1", encoded_contents])
File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)
I thought converting it to Unicode would be good enough, but that doesn't seem to be the case. I'd really like to understand what is happening so that I'm better prepared to handle these errors in other projects in the future.
From the traceback, it looks like I can reproduce the error like this:
>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>>
Up until now, I thought I had a decent working knowledge of Unicode text in Python 2.x, but this has humbled me.
You should not use codecs.open() for your file. unicodecsv wraps the csv module, which always writes a byte string to the open file object. In order to write that byte string to a Unicode-aware file object such as returned by codecs.open(), it is implicitly decoded; this is where your UnicodeDecodeError exception stems from.
Use a file in binary mode instead:
with open('test.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
The binary mode is not strictly necessary unless your data contains embedded newlines, but the csv module wants to control how newlines are written to ensure that such values are handled correctly. However, not using codecs.open() is an absolute requirement.
The same thing happens when you call .encode() on a byte string; you already have encoded data there, so Python implicitly decodes to get a Unicode value to encode.
I have a (windows) text file reported by linux as being a:
ISO-8859 text, with very long lines, with CRLF line terminators
I want to read this into numpy, except the first line which contains labels (with special characters, usually only the greek mu).
Python 2.7.6, Numpy 1.8.0, this works perfectly:
data = np.loadtxt('input_file.txt', skiprows=1)
Python 3.4.0, Numpy 1.8.0, gives an error:
>>> np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
next(fh)
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte
To me this is "buggy" behaviour for the following reasons:
I want to skip the first line so it should be ignored, regardless of its encoding
If I delete the first line from the file, loadtxt works fine in both versions of python
Shouldn't numpy.loadtxt behave the same in python2 and python3?
Questions:
How to get around this problem (using python3 of course)?
Should I file a bug report or is this expected behaviour?
It seems like a bug in loadtxt(); try using genfromtxt() instead.
Yes, it seems to be a bug in NumPy: it tries to decode even the skipped rows and fails. Better to report it.
By the way, the documentation says that loadtxt accepts a file object or a string generator as its first argument, so you can skip the header line yourself. Try this:
f = open('input_file.txt', encoding='latin-1')
f.readline()  # skip the header containing the special characters
data = np.loadtxt(f)
P.S. The error 'utf-8' codec can't decode byte 0xb5 in position 4158 doesn't seem to occur at the beginning of the file. Are you sure your file doesn't contain some weird symbol that is invisible or looks like a space but actually isn't?
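For what it's worth, newer NumPy versions (1.14 and later) accept an encoding argument on loadtxt directly, which sidesteps the problem without pre-opening the file. A sketch with a made-up data file standing in for the real one:

```python
import numpy as np

# Demo setup: a Latin-1 file whose header contains the byte 0xb5 (µ),
# which is invalid as a UTF-8 start byte
with open("input_file.txt", "wb") as f:
    f.write("time [µs]\tvalue\n".encode("latin-1"))
    f.write(b"1.0\t2.0\n3.0\t4.0\n")

# Since NumPy 1.14, loadtxt takes an encoding argument itself
data = np.loadtxt("input_file.txt", skiprows=1, encoding="latin-1")
```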
Alright, so I have tried to wade through the multiple posts about this error,
but unfortunately I am either too tired to understand them, or something is simply eluding me.
I'm trying to read a UTF-8 encoded .txt file (a backup of my WhatsApp chat history) and dump it into a variable (for now I'm just printing it), so that I can later do some splitting on its content.
However, when I run this:
protocol = open('C:/chat.txt', 'r', encoding='utf-8', errors='replace') #use the utf-8 codec, and replace chars it doesn't recognize instead of raising errors
print(protocol.read())
I get this:
Traceback (most recent call last):
File "C:\xx\src\main.py", line 8, in <module>
print(protocol.read())
File "C:\Python33\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>
I have read some answers using the codecs.open() function, but I am simply not entirely sure how to use it. So I'm sorry if this is the 100th question about this, but I just can't wrap my mind around what exactly the problem is and how to solve it.
Thanks for your patience and any answers you can provide :)
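Worth noting: the traceback points at cp1252.py and raises a UnicodeEncodeError, not a decode error, so the file is being read fine and it is print() that fails while encoding the text for a cp1252 Windows console. One possible workaround (a sketch; the sample text is made up) is to encode the output yourself with replacement characters instead of letting print() raise:

```python
import sys

text = "caf\u00e9 \u2013 \u263a \U0001f600"  # sample chat content

# Round-trip through the console's encoding with errors='replace',
# so characters the console can't show become '?' instead of raising
encoding = sys.stdout.encoding or "utf-8"
print(text.encode(encoding, errors="replace").decode(encoding))
```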