Django getting encoding error only on command line - python

I have a script that reads an XML file and writes it into the Database.
When I run it through the browser (call it via a view) it works fine, but
when I created a Command for it (./manage.py importxmlfile) I get the following message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
I'm not sure why it would only happen when calling the import via command line.. any ideas?
Update
I'm trying to convert an lxml.etree._ElementUnicodeResult object to string and save it in the DB (utf8 collation) using str(result).
This produces the error mentioned above only on Command Line.

Ah, don't use str(result).
instead, do:
result.encode('utf-8')
When you call str(result), python will use the default system encoding (usually ascii) to try and encode the bytes in result. This will break if the ordinal not in range(128). Rather than using the ascii codec, just .encode() and tell python which codec to use.
Check out the Python Unicode HowTo for more information. You might also want to check out this related question or this excellent presentation on the subject.

Related

convert string in Python from ascii to utf-8 and back

I have a string variable blah that I am trying to send to Python's print and I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 26892-26894: ordinal not in range(128)
The text is coming from a windows server via WMI. So I tried this:
try:
print(blah, end='')
except:
print(blah.encode('utf-8'), end='')
This catches the error, but now my blah variable looks like this:
b'line#1\nline#2\nline#3\nline#4'
However, I need to send the output of this Python program to a daemon that expects the original format,
line#1
line#2
line#3
line#4
Which is what I really want. I tried chaining it with string.encode('utf-8').decode('utf-8','errors=ignore') but apparently my version of Python doesn't accept this. Any suggestions?

“UnicodeEncodeError: 'ascii' codec can't encode character” in Python3

I'm fetching JSON with Requests from an API (using Python 3.5) and when I'm trying to print (or use) the JSON, either by response.text, json.loads(...) or response.json(), I get an UnicodeEncodeError.
print(response.text)
UnicodeEncodeError: 'ascii' codec can't encode character '\xc5' in position 676: ordinal not in range(128)
The JSON contains an array of dictionaries with country names and some of them contain special characters, e.g.: (just one dictionary in the binary array for example)
b'[{\n "name" : "\xc3\x85land Islands"\n}]
I have no idea why there is an encoding problem and also why "ascii" is used when Requests detects an UTF-8 encoding (and even by setting it manually to UTF-8 doesn't change anything).
Edit2: The problem was Microsoft Visual Studio Code 1.4. It wasn't able to print the characters.
If your code is running within VS, then it sounds that Python can't work out the encoding of the inbuilt console, so defaults to ASCII. If you try to print any non-ASCII then Python throws an error rather printing text that won't display.
You can force Python's encoding by using the PYTHONIOENCODING environment variable. Set it within the run configuration for the script.
Depending on Visual Studio's console, you may get away with:
PYTHONIOENCODING=utf-8
or you may have to use a typical 8bit charset like:
PYTHONIOENCODING=windows-1252

UnicodeDecodeError for Reading files in Python

pythonNotes = open('E:\\Python Notes.docx','r')
read_it_now = pythonNotes.read()
print(read_it_now.encode('utf-16'))
When I try this code, I get:
UnicodeDecodeError: 'charmap' can't decode byte 0x8f in position 591 character maps to <undefined>
I am running this in visual studio with python tools - starting without debugging.
I have tried putting enc='utf-8' at the top, throwing it in as a parameter, I've looked at other questions and just couldn't find a solution to this simple issue.
Please assist.
This error can occur when text that is already in utf-8 format is read in as an 8-bit encoding, and python tries to "decode" it to Unicode: Bytes that have no meaning in the supposed encoding throw a UnicodeDecodeError. But you'll always get an error if you try to read a file as utf-8 that is not in the utf-8 encoding.
In your case, the problem is that a docx file is not a regular text file; no single text encoding can meaningfully import it. See this SO answer for directions on how to read it on a low level, or use python-docx to get access to the document in a way that resembles what you see in Word.

Encoding issues with CLD

After some issues getting Chrome Compact Language Detection library installed on Windows, I installed CLD from this easy_install.
I can now use CLD, but getting some encoding issues.
Background
Pulling Tweets into a python script, and after stripping out the hashtags and links, passing them to CLD to detect the language.
Following is a simplified version of my code:
s = "I am a tweet from Twitter"
clean_s = s.encode('utf-8')
lan = cld.detect(clean_s, pickSummaryLanguage=True, removeWeakMatches=True)
Problem
4 out of 5 times, this works as expected (get returned a response about what language it is).
However, I keep getting this error popping up:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019'
in position 15: character maps to undefined
I did read that:
"You must provide CLD clean (interchange-valid) UTF-8, so any encoding
issues must be sorted out before-hand."
However, I thought I had this covered with my statement to encode to UTF8?
I assume that I need to ensure that I pass a string to CLD that preserves fonts in languages such as arabic, asian, etc.
This is my first python project, so likely this is a rookie mistake. Can anyone point out my mistake and how to rectify?
Let me know in comments if I need to gather more info, and I will edit my Q to provide more info.
EDIT
If it helps, here is my rookie code (cut down to replicate issue).
I am running Python 2.7 32bit.
Running this code, after awhile, I get this error.
Let me know if I have not correctly implemented the error reporting.
Raw: Traceback (most recent call last):
File "LanguageTesting.py", line 71, in <module>
parse_tweet(tweet)
File "LanguageTesting.py", line 43, in parse_tweet
print "Raw:", raw
File "C:\Python27\ArcGIS10.1\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-32: character maps to <undefined>
It looks like you are failing on the print statement right? This means Python cannot encode the unicode string into what it thinks the console's stdout encoding is ("print sys.getdefaultencoding()").
If python is wrong about what your terminal expects, you can set the env var ("export PYTHONIOENCODING=UTF-8") and it will encode your printed strings to utf-8. Alternatively, before printing, you can encode to whatever charset your terminal expects (and will likely have to ignore/replace errors to avoid exceptions like the one you hit)...

UnicodeDecodeError only with cx_freeze

I get the error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 7338: ordinal not in range(128)" once I try to run the program after I freeze my script with cx_freeze. If I run the Python 3 script normally it runs fine, but only after I freeze it and try to run the executable does it give me this error. I would post my code, but I don't know exactly what parts to post so if there are any certain parts that will help just let me know and I will post them, otherwise it seems like I have had this problem once before and solved it, but it has been a while and I can't remember what exactly the problem was or how I fixed it so any help or pointers to get me going in the right direction will help greatly. Thanks in advance.
Tell us exactly which version of Python on what platform.
Show the full traceback that you get when the error happens. Look at it yourself. What is the last line of your code that appears? What do you think is the bytes string that is being decoded? Why is the ascii codec being used??
Note that automatic conversion of bytes to str with a default codec (e.g. ascii) is NOT done by Python 3.x. So either you are doing it explicitly or cx_freeze is.
Update after further info in comments.
Excel does not save csv files in ASCII. It saves them in what MS calls "the ANSI codepage", which varies by locale. If you don't know what yours is, it is probably cp1252. To check, do this:
>>> import locale; print(locale.getpreferredencoding())
cp1252
If Excel did save files in ASCII, your offending '\xa0' byte would have been replaced by '?' and you would not be getting a UnicodeDecodeError.
Saving your files in UTF-8 would need you to open your files with encoding='utf8' and would have the same problem (except that you'd get a grumble about 0xc2 instead of 0xa0).
You don't need to post all four of your csv files on the web. Just run this little script (untested):
import sys
for filename in sys.argv[1:]:
for lino, line in enumerate(open(filename), 1):
if '\xa0' in line:
print(ascii(filename), lino, ascii(line))
The '\xa0' is a NO-BREAK SPACE aka ... you may want to edit your files to change these to ordinary spaces.
Probably you will need to ask on the cx_freeze mailing list to get an answer to why this error is happening. They will want to know the full traceback. Get some practice -- show it here.
By the way, "offset 7338" is rather large -- do you expect lines that long in your csv file? Perhaps something is reading all of your file ...
That error itself indicates that you have a character in a python string that isn't a normal ASCII character:
>>> b'abc\xa0'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 3: ordinal not in range(128)
I certainly don't know why this would only happen when a script is frozen. You could wrap the whole script in a try/except and manually print out all or part of the string in question.
EDIT: here's how that might look
try:
# ... your script here
except UnicodeDecodeError as e:
print("Exception happened in string '...%s...'"%(e.object[e.start-50:e.start+51],))
raise
fix by set default coding:
reload(sys)
sys.setdefaultencoding("utf-8")
Use str.decode() function for that lines. And also you can specify encoding like myString.decode('cp1252').
Look also: http://docs.python.org/release/3.0.1/howto/unicode.html#unicode-howto

Categories

Resources