UnicodeDecodeError, Invalid continuation byte again - python

I can't figure out how to solve these problems once and for all. I first ran into them when I tried to write "è" (I'm Italian). After some research, I found that adding "#coding: utf-8" at the very beginning seemed to solve the problem... UNTIL NOW.
I edited some code I wrote 1 or 2 days ago; it worked perfectly then. Now, whenever I run the script, it never even starts, and I'm stuck with this error:
SyntaxError: 'utf-8' codec can't decode byte 0xe0 in position 32: invalid continuation byte.
The problem is... position 32? Where? What's the problematic line? I don't know exactly what I added, because I made a couple of changes. Running it in debug mode doesn't help either: when I "Step Into" at the very beginning of the script, the error shows up immediately (by the way, I'm using Wingware 101 as an IDE, and I'm on Win7). Sorry if I don't have enough information. I could post the code, but it's a mess written in Italian, so it might not be easy to understand exactly what's going on.
Thank you for any replies, and happy holidays!

#coding: utf8 is a declaration that the source code is saved in UTF-8. Make sure that is actually the encoding of the source file. For example, the following file was created in Windows Notepad and saved as "ANSI", which on US Windows is the Windows-1252 encoding:
#coding: utf8
print('hàllo')
It produces the following error on Python 2.7:
File "test.py", line 2
SyntaxError: 'utf8' codec can't decode byte 0xe0 in position 8: invalid continuation byte
As you can see, the 8th position (counting from 0) of line 2 is à, which in Windows-1252 is byte 0xe0. The wrong encoding was declared, and the error message makes that clear.
Either declare the correct encoding for your source file, or re-save the source file in UTF-8.
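For example, a minimal sketch of the second option, re-saving the file as UTF-8 programmatically (this assumes the file really is in Windows-1252; the filename test.py is just the example above):
import io

# Read with the encoding the file was actually saved in (assumed cp1252),
# then write the same text back out as UTF-8.
with io.open('test.py', 'r', encoding='cp1252') as f:
    source = f.read()
with io.open('test.py', 'w', encoding='utf-8') as f:
    f.write(source)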
Note: I don't have Python 3.4 installed, but Python 3.5 gives a less clear error message:
File "x.py", line 1
SyntaxError: encoding problem: utf8
It doesn't match your error message, but it still indicates that the file is not saved in its declared encoding.
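If you need to find the offending line in a larger file, a hedged sketch like the following can help (it assumes the file is supposed to be UTF-8; 'test.py' is a stand-in for your script):
# Scan each line as raw bytes and report anything that is not valid UTF-8.
with open('test.py', 'rb') as f:
    for lineno, line in enumerate(f, 1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as e:
            print('line %d, column %d: byte %r' % (lineno, e.start + 1, line[e.start:e.start + 1]))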

Related

Python encoding issue while reading a file

I am trying to read a file that contains the character "ë". The problem is that I cannot figure out how to read it, no matter what I try with the encoding. When I look at the file in TextEdit it is listed as an unknown 8-bit file. If I try changing it to utf-8, utf-16 or anything else, it either does not work or messes up the entire file. I tried reading the file with plain Python calls as well as with codecs, and cannot come up with anything that reads it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.
readFile = codecs.open("FileName",encoding='utf-8')
The line I am trying to read is this with nothing else in it.
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UnicodeError: UTF-16 stream does not start with BOM -- I know this one means it is not a UTF-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a codec, the word comes in as Ae?tes, which then crashes later in the program. Just to be clear, none of the suggested questions, or any others anywhere on the net, pointed to an answer. One other detail that might help: I am using OS X, not Windows.
Credit for this answer goes to RadLexus for figuring out the proper encoding, and also to Mad Physicist, who set me on the right track even though I had not considered all possible encodings.
The issue is apparently that a Mac will save the .txt file as mac_roman. If you use that encoding, it will work perfectly.
This is the line of code that I used to convert it.
readFile = codecs.open("FileName",encoding='mac_roman')
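If you're not sure which encoding a file uses, one hedged approach is to try a few candidates and keep the first that decodes cleanly (the candidate list below is an assumption; adjust it for your data):
import codecs

# Try each candidate in turn; the first clean decode wins. Note that very
# permissive codecs such as latin-1 accept any byte sequence, so a clean
# decode proves nothing by itself -- inspect the resulting text too.
for enc in ('utf-8', 'mac_roman', 'cp1252', 'latin-1'):
    try:
        with codecs.open("FileName", encoding=enc) as f:
            text = f.read()
        print "decoded OK with", enc
        break
    except UnicodeDecodeError:
        continue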

UnicodeDecodeError for Reading files in Python

pythonNotes = open('E:\\Python Notes.docx','r')
read_it_now = pythonNotes.read()
print(read_it_now.encode('utf-16'))
When I try this code, I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am running this in visual studio with python tools - starting without debugging.
I have tried putting enc='utf-8' at the top and passing it in as a parameter; I've looked at other questions and just couldn't find a solution to this simple issue.
Please assist.
This error can occur when text that is actually UTF-8 is read with an 8-bit encoding and Python tries to "decode" it to Unicode: bytes that have no meaning in the assumed encoding raise a UnicodeDecodeError. Likewise, you'll always get an error if you try to read a file as utf-8 when it is not in fact utf-8 encoded.
In your case, the problem is that a .docx file is not a regular text file; no single text encoding can meaningfully decode it. See this SO answer for directions on how to read it at a low level, or use python-docx to access the document in a way that resembles what you see in Word.
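For the python-docx route, a minimal sketch (assumes pip install python-docx; the path is the one from the question):
from docx import Document

# Open the .docx as a structured document instead of raw bytes.
doc = Document('E:\\Python Notes.docx')
for para in doc.paragraphs:
    print(para.text)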

Encoding issues with CLD

After some issues getting the Chrome Compact Language Detection library installed on Windows, I installed CLD from this easy_install.
I can now use CLD, but I'm getting some encoding issues.
Background
I'm pulling tweets into a Python script and, after stripping out the hashtags and links, passing them to CLD to detect the language.
Following is a simplified version of my code:
s = "I am a tweet from Twitter"
clean_s = s.encode('utf-8')
lan = cld.detect(clean_s, pickSummaryLanguage=True, removeWeakMatches=True)
Problem
4 out of 5 times, this works as expected (I get back a response saying what language it is).
However, I keep getting this error popping up:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 15: character maps to <undefined>
I did read that:
"You must provide CLD clean (interchange-valid) UTF-8, so any encoding
issues must be sorted out before-hand."
However, I thought I had this covered with my statement to encode to UTF8?
I assume that I need to ensure that the string I pass to CLD preserves the characters of languages such as Arabic, Asian scripts, etc.
This is my first Python project, so this is likely a rookie mistake. Can anyone point out my mistake and how to rectify it?
Let me know in the comments if I need to gather more info, and I will edit my question to provide it.
EDIT
If it helps, here is my rookie code (cut down to replicate the issue).
I am running Python 2.7 32-bit.
Running this code, after a while, I get the error below.
Let me know if I have not correctly implemented the error reporting.
Raw:
Traceback (most recent call last):
  File "LanguageTesting.py", line 71, in <module>
    parse_tweet(tweet)
  File "LanguageTesting.py", line 43, in parse_tweet
    print "Raw:", raw
  File "C:\Python27\ArcGIS10.1\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input, errors, encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-32: character maps to <undefined>
It looks like you are failing on the print statement, right? That means Python cannot encode the unicode string into what it thinks the console's stdout encoding is (check with print sys.stdout.encoding).
If Python is wrong about what your terminal expects, you can set the environment variable (export PYTHONIOENCODING=UTF-8, or set PYTHONIOENCODING=UTF-8 on Windows) and it will encode your printed strings to utf-8. Alternatively, before printing, you can encode to whatever charset your terminal expects (and you will likely have to ignore/replace errors to avoid exceptions like the one you hit)...
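A minimal sketch of that second option (Python 2; the sample string is just for illustration):
import sys

s = u'Tweet text with a curly quote: \u2019'
# Fall back to utf-8 when stdout has no detectable encoding (e.g. piped output).
enc = sys.stdout.encoding or 'utf-8'
print s.encode(enc, 'replace')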

Python Encoding\Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)
The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously worked, and it produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded)    # works!
outputfile.write(textEncodedX2)  # fails
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!
My guess is that the output file you have opened was opened with a latin1 or even utf-8 codec, so you cannot write utf-8 encoded data to it: the codec wrapper tries to re-encode whatever you write, which in Python 2 means first implicitly decoding your byte string with the ASCII codec. To a normally opened file you can write any arbitrary byte string. Here is an example recreating a similar error:
import codecs

u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)
Output:
Traceback (most recent call last):
  File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
This will work if you don't set any codec:
f = open('del.txt', 'wb')
f.write(s)
The other option is to write the unicode string directly, without encoding it yourself, provided outputFile has been opened with the correct codec, e.g.
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)
Your error message doesn't actually relate to the write call itself; the problem is the byte 0xc2, which has no equivalent in ASCII.
In Latin-1, 0xC2 represents an uppercase A with a circumflex (Â); in UTF-8 it is the lead byte of a two-byte sequence. The error appears because, in Python 2, calling .encode("utf-8") on a byte string first implicitly decodes it with the ASCII codec, which only covers ordinals 0-127, so 0xC2 is out of range. If your raw data really is Latin-1, decode it as such instead. Therefore, instead of using "allTheNTMs.append(contentRaw[s1:].encode("utf-8"))", try:
allTheNTMs.append(contentRaw[s1:].decode("latin-1"))
I'm not an expert in Python, so this may not work, but it would appear you are dealing with Latin-1 data.

UnicodeDecodeError only with cx_freeze

I get the error "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 7338: ordinal not in range(128)" when I try to run the program after freezing my script with cx_freeze. If I run the Python 3 script normally it runs fine; only after I freeze it and try to run the executable does it give me this error. I would post my code, but I don't know exactly which parts to post, so if there are any particular parts that would help, just let me know and I will post them. It seems like I had this problem once before and solved it, but it has been a while and I can't remember exactly what the problem was or how I fixed it, so any help or pointers to get me going in the right direction would be greatly appreciated. Thanks in advance.
Tell us exactly which version of Python on what platform.
Show the full traceback that you get when the error happens, and look at it yourself. What is the last line of your code that appears in it? What do you think the byte string being decoded is? Why is the ascii codec being used?
Note that automatic conversion of bytes to str with a default codec (e.g. ascii) is NOT done by Python 3.x. So either you are doing it explicitly or cx_freeze is.
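A quick illustration of that point (a sketch only; Python 3 raises instead of guessing):
# Python 3: bytes never silently become str; you must decode explicitly.
b = b'caf\xc3\xa9'
print(b.decode('utf-8'))  # explicit decode works: café
try:
    b + 'x'  # Python 2 would implicitly decode here; Python 3 refuses
except TypeError as e:
    print(e)  # can't concat str to bytes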
Update after further info in comments.
Excel does not save csv files in ASCII. It saves them in what MS calls "the ANSI codepage", which varies by locale. If you don't know what yours is, it is probably cp1252. To check, do this:
>>> import locale; print(locale.getpreferredencoding())
cp1252
If Excel did save files in ASCII, your offending '\xa0' byte would have been replaced by '?' and you would not be getting a UnicodeDecodeError.
Saving your files in UTF-8 would need you to open your files with encoding='utf8' and would have the same problem (except that you'd get a grumble about 0xc2 instead of 0xa0).
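The likely fix, then, is to open the CSV with the ANSI codepage spelled out explicitly, so the frozen executable doesn't fall back to some other default. A hedged sketch (cp1252 per the locale check above; 'data.csv' is a stand-in for your file):
# Python 3: pass the real encoding instead of relying on the platform default.
with open('data.csv', encoding='cp1252') as f:
    for line in f:
        print(line.rstrip())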
You don't need to post all four of your csv files on the web. Just run this little script (untested):
import sys
for filename in sys.argv[1:]:
    for lino, line in enumerate(open(filename), 1):
        if '\xa0' in line:
            print(ascii(filename), lino, ascii(line))
The '\xa0' is a NO-BREAK SPACE, aka NBSP; you may want to edit your files to change these to ordinary spaces.
Probably you will need to ask on the cx_freeze mailing list to get an answer to why this error is happening. They will want to know the full traceback. Get some practice -- show it here.
By the way, "offset 7338" is rather large -- do you expect lines that long in your csv file? Perhaps something is reading all of your file ...
That error itself indicates that you have a byte in a Python byte string that isn't a normal ASCII character:
>>> b'abc\xa0'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 3: ordinal not in range(128)
I certainly don't know why this would only happen when a script is frozen. You could wrap the whole script in a try/except and manually print out all or part of the string in question.
EDIT: here's how that might look
try:
    # ... your script here
except UnicodeDecodeError as e:
    # Clamp the window start so a small e.start doesn't wrap around to the end.
    print("Exception happened in string '...%s...'" % (e.object[max(0, e.start - 50):e.start + 51],))
    raise
Fix by setting the default encoding (Python 2 only; reload(sys) is needed because site.py removes setdefaultencoding at startup):
import sys

reload(sys)
sys.setdefaultencoding("utf-8")
Use the str.decode() function on those lines. You can also specify the encoding explicitly, e.g. myString.decode('cp1252').
See also: http://docs.python.org/release/3.0.1/howto/unicode.html#unicode-howto
