UnicodeDecodeError only with cx_freeze

UnicodeDecodeError only with cx_freeze - python

I get the error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 7338: ordinal not in range(128)" once I try to run the program after I freeze my script with cx_freeze. If I run the Python 3 script normally it runs fine, but only after I freeze it and try to run the executable does it give me this error. I would post my code, but I don't know exactly what parts to post so if there are any certain parts that will help just let me know and I will post them, otherwise it seems like I have had this problem once before and solved it, but it has been a while and I can't remember what exactly the problem was or how I fixed it so any help or pointers to get me going in the right direction will help greatly. Thanks in advance.

Tell us exactly which version of Python on what platform.
Show the full traceback that you get when the error happens. Look at it yourself. What is the last line of your code that appears? What do you think is the bytes string that is being decoded? Why is the ascii codec being used??
Note that automatic conversion of bytes to str with a default codec (e.g. ascii) is NOT done by Python 3.x. So either you are doing it explicitly or cx_freeze is.
Update after further info in comments.
Excel does not save csv files in ASCII. It saves them in what MS calls "the ANSI codepage", which varies by locale. If you don't know what yours is, it is probably cp1252. To check, do this:
>>> import locale; print(locale.getpreferredencoding())
cp1252
If Excel did save files in ASCII, your offending '\xa0' byte would have been replaced by '?' and you would not be getting a UnicodeDecodeError.
Saving your files in UTF-8 would need you to open your files with encoding='utf8' and would have the same problem (except that you'd get a grumble about 0xc2 instead of 0xa0).
You don't need to post all four of your csv files on the web. Just run this little script (untested):
import sys
for filename in sys.argv[1:]:
for lino, line in enumerate(open(filename), 1):
if '\xa0' in line:
print(ascii(filename), lino, ascii(line))
The '\xa0' is a NO-BREAK SPACE aka ... you may want to edit your files to change these to ordinary spaces.
Probably you will need to ask on the cx_freeze mailing list to get an answer to why this error is happening. They will want to know the full traceback. Get some practice -- show it here.
By the way, "offset 7338" is rather large -- do you expect lines that long in your csv file? Perhaps something is reading all of your file ...

That error itself indicates that you have a character in a python string that isn't a normal ASCII character:
>>> b'abc\xa0'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 3: ordinal not in range(128)
I certainly don't know why this would only happen when a script is frozen. You could wrap the whole script in a try/except and manually print out all or part of the string in question.
EDIT: here's how that might look
try:
# ... your script here
except UnicodeDecodeError as e:
print("Exception happened in string '...%s...'"%(e.object[e.start-50:e.start+51],))
raise

fix by set default coding:
reload(sys)
sys.setdefaultencoding("utf-8")

Use str.decode() function for that lines. And also you can specify encoding like myString.decode('cp1252').
Look also: http://docs.python.org/release/3.0.1/howto/unicode.html#unicode-howto

Related

Python3 using UTF-8 data from bytes [duplicate]

Why is the below item failing? Why does it succeed with "latin-1" codec?
o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")
Which results in:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py",
line 16, in decode
return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError:
'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

I had the same error when I tried to open a CSV file by pandas.read_csv
method.
The solution was change the encoding to latin-1:
pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.
If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you'd just reject ones that didn't decode.
If you can't do that, you'll need heuristics.

Because UTF-8 is multibyte and there is no char corresponding to your combination of \xe9 plus following space.
Why should it succeed in both utf-8 and latin-1?
Here how the same sentence should be in utf-8:
>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode

Use this, If it shows the error of UTF-8
pd.read_csv('File_name.csv',encoding='latin-1')

utf-8 code error usually comes when the range of numeric values exceeding 0 to 127.
the reason to raise this exception is:
1)If the code point is < 128, each byte is the same as the value of the code point.
2)If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
In order to to overcome this we have a set of encodings, the most widely used is "Latin-1, also known as ISO-8859-1"
So ISO-8859-1 Unicode points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1
when this exception occurs when you are trying to load a data set ,try using this format
df=pd.read_csv("top50.csv",encoding='ISO-8859-1')
Add encoding technique at the end of the syntax which then accepts to load the data set.

Well this type of error comes when u are taking input a particular file or data in pandas such as :-
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv)
Then the error is displaying like this :-
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
So to avoid this type of error can be removed by adding an argument
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

This happened to me also, while i was reading text containing Hebrew from a .txt file.
I clicked: file -> save as and I saved this file as a UTF-8 encoding

TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.
I got this error as I was processing a large number of zip files with additional zip files in them.
My workflow was the following:
Read zip
Read child zip
Read text from child zip
At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text lead to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.

In this case, I tried to execute a .py which active a path/file.sql.
My solution was to modify the codification of the file.sql to "UTF-8 without BOM" and it works!
You can do it with Notepad++.
i will leave a part of my code.
con = psycopg2.connect(host = sys.argv[1],
port = sys.argv[2],dbname = sys.argv[3],user = sys.argv[4], password = sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')

I encountered this problem, and it turned out that I had saved my CSV directly from a google sheets file. In other words, I was in a google sheet file. I chose, save a copy, and then when my browser downloaded it, I chose Open. Then, I DIRECTLY saved the CSV. This was the wrong move.
What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting single sheet as .csv. Then the error went away for pd.read_csv('myfile.csv')

The solution was change to "UTF-8 sin BOM"

Python 2.7: Printing out an decoded string

I have an file that is called: Abrázame.txt
I want to decode this so that python understands what this 'á' char is so that it will print me Abrázame.txt
This is the following code i have on an Scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error i get from above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the utf-16 character set and it does indeed have 'á' character in it. So why is it that this string cannot be decoded with Utf-16.
Also i know that 'latin-1' will work and produce the string im looking for however since this is for an automation project and i wanting to ensure that any filename with any registered character can be decoded and used for other things within the project for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs(Mind you i believe there is 93 codecs) to find whichever one can decode the string, the best way of getting the result I'm looking for? I figure there something far better than that solution.

You want to decode at the edge when you first read a string so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the current file system directory encoding. So,
import sys
fs_encoding = sys.getfilesystemencoding()
s = os.path.join(r'C:\Test\AutoTest',
os.listdir(r'C:\\Test\\AutoTest')[0].decode(encoding=fs_encodig, errors='strict')
print(s)
Note that once you decode you have a unicode string and you don't need to build a new unicode() object from it.
latin-1 works if that's your current code page. Its an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of their API for many years, python 2 is single-byte character based and doesn't use them.
Long live python 3.

UnicodeDecodeError for Reading files in Python

pythonNotes = open('E:\\Python Notes.docx','r')
read_it_now = pythonNotes.read()
print(read_it_now.encode('utf-16'))
When I try this code, I get:
UnicodeDecodeError: 'charmap' can't decode byte 0x8f in position 591 character maps to <undefined>
I am running this in visual studio with python tools - starting without debugging.
I have tried putting enc='utf-8' at the top, throwing it in as a parameter, I've looked at other questions and just couldn't find a solution to this simple issue.
Please assist.

This error can occur when text that is already in utf-8 format is read in as an 8-bit encoding, and python tries to "decode" it to Unicode: Bytes that have no meaning in the supposed encoding throw a UnicodeDecodeError. But you'll always get an error if you try to read a file as utf-8 that is not in the utf-8 encoding.
In your case, the problem is that a docx file is not a regular text file; no single text encoding can meaningfully import it. See this SO answer for directions on how to read it on a low level, or use python-docx to get access to the document in a way that resembles what you see in Word.

Encoding issues with CLD

After some issues getting Chrome Compact Language Detection library installed on Windows, I installed CLD from this easy_install.
I can now use CLD, but getting some encoding issues.
Background
Pulling Tweets into a python script, and after stripping out the hashtags and links, passing them to CLD to detect the language.
Following is a simplified version of my code:
s = "I am a tweet from Twitter"
clean_s = s.encode('utf-8')
lan = cld.detect(clean_s, pickSummaryLanguage=True, removeWeakMatches=True)
Problem
4 out of 5 times, this works as expected (get returned a response about what language it is).
However, I keep getting this error popping up:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019'
in position 15: character maps to undefined
I did read that:
"You must provide CLD clean (interchange-valid) UTF-8, so any encoding
issues must be sorted out before-hand."
However, I thought I had this covered with my statement to encode to UTF8?
I assume that I need to ensure that I pass a string to CLD that preserves fonts in languages such as arabic, asian, etc.
This is my first python project, so likely this is a rookie mistake. Can anyone point out my mistake and how to rectify?
Let me know in comments if I need to gather more info, and I will edit my Q to provide more info.
EDIT
If it helps, here is my rookie code (cut down to replicate issue).
I am running Python 2.7 32bit.
Running this code, after awhile, I get this error.
Let me know if I have not correctly implemented the error reporting.
Raw: Traceback (most recent call last):
File "LanguageTesting.py", line 71, in <module>
parse_tweet(tweet)
File "LanguageTesting.py", line 43, in parse_tweet
print "Raw:", raw
File "C:\Python27\ArcGIS10.1\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-32: character maps to <undefined>

It looks like you are failing on the print statement right? This means Python cannot encode the unicode string into what it thinks the console's stdout encoding is ("print sys.getdefaultencoding()").
If python is wrong about what your terminal expects, you can set the env var ("export PYTHONIOENCODING=UTF-8") and it will encode your printed strings to utf-8. Alternatively, before printing, you can encode to whatever charset your terminal expects (and will likely have to ignore/replace errors to avoid exceptions like the one you hit)...

Python Encoding\Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
kmlDescription = allTheNTMs[a]
print kmlDescription #this prints out fine
outputFile.write(kmlDescription)
The error i'm getting is "unicodedecodeerror: ascii codec can't decode byte 0xc2 in position 213:ordinal not in range (128).
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
kmlDescription = allTheNTMs[a]
kmlDescriptionDecode = kmlDescription.decode("latin-1")
outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!

My guess is that output file you have opened has been opened with latin1 or even utf-8 codec hence you are not able to write utf-8 encoded data to that because it tries to reconvert it, otherwise to a normally opened file you can write any arbitrary data string, here is an example recreating similar error
u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb',encoding='latin1')
f.write(s)
output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
this will work, if you don't set any codec
f = open('del.txt', 'wb')
f.write(s)
other option is to directly write to file without encoding the unicode strings, if outputFile has been opened with correct codec e.g.
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)

Your error message doesn't seem to appear to relate to any of your Python syntax but actually the fact you're trying to decode a Hex value which has no equivalent in UTF-8.
HEX 0xc2 appears to represent a latin character - an uppercase A with an accent on the top. Therefore, instead of using "allTheNTMs.append(contentRaw[s1:].encode("utf-8"))", try:-
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python so this may not work but it would appear you're trying to encode a latin character. Given the error message you are receiving too, it would appear that when trying to encode in UTF-8, Python only looks through the first 128 entries given that your error appears to indicate that entry "0Xc2" is out of range which indeed it is out of the first 128 entries of UTF-8.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.