I have a small webapp that runs Python on the server side and JavaScript (jQuery) on the client side.
Now, upon a certain request, my Python script returns a Unicode string, and the client is supposed to put that string inside a div in the browser. However, I get a UnicodeEncodeError from Python.
If I run the script from the shell (bash on Debian Linux), the script runs fine and prints the Unicode string.
Any ideas?
Thanks!
EDIT
This is the print statement that causes the error:
print u'öäü°'
This is the error message I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-36: ordinal not in range(128)
However, I only get that message when calling the script via Ajax ( $('#somediv').load('myscript.py'); )
Thank you!
If the Python interpreter can't determine the encoding of sys.stdout, ASCII is used as a fallback; since the characters in the string are not part of ASCII, a UnicodeEncodeError is raised. That is why it works from the shell but not via Ajax: a web server, unlike an interactive terminal, gives the script no terminal encoding to detect.
A solution would be to encode the string yourself, using something like .encode(sys.stdout.encoding or "utf-8"). This way UTF-8 is used as the fallback instead of ASCII.
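A minimal sketch of that fallback (the sample string stands in for whatever the script prints; errors="replace" is added so the sketch never raises, even on an ASCII-only stream):

```python
import sys

text = u'öäü°'
# sys.stdout.encoding is None when Python can't detect the terminal's
# charset (e.g. when the script is run by a web server), so fall back
# to UTF-8 instead of letting Python fall back to ASCII.
encoding = sys.stdout.encoding or "utf-8"
data = text.encode(encoding, errors="replace")  # bytes, safe to write out
```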
I am a Python beginner, so I hope this will be an easy fix.
I would like to print the value of an attribute as follows:
print (follower.city)
I receive the following error message:
File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0130' in position 0: character maps to <undefined>
I think the problem is that cp850.py does not contain the relevant character in its encoding table.
What would be the solution to this problem? There is no ultimate need to display the character correctly, but the error must be avoided. Do I need to modify cp850.py?
Sorry if this question has been addressed before, but I was not able to figure it out using previous answers to this topic.
To print a string, it must first be converted from pure Unicode to the byte sequence supported by your output device. This requires encoding to the proper character set, which Python has identified as cp850, the Windows console default.
Starting with Python 3.3 you can set the Windows console to use UTF-8 with the following command issued at the command prompt:
chcp 65001
This should fix your issue, as long as you've configured the window to use a font that contains the character.
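If changing the code page isn't possible, a workaround (a sketch; the sample value is hypothetical, standing in for follower.city) is to replace whatever cp850 can't represent before printing:

```python
text = u'\u0130stanbul'  # hypothetical value standing in for follower.city
# Encode with errors='replace' so unmapped characters become '?' instead
# of raising UnicodeEncodeError, then decode back for printing.
safe = text.encode('cp850', errors='replace').decode('cp850')
print(safe)  # ?stanbul
```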
I'm having the following problem: when I use the SQSConnection.send_message method with a fixed string as a parameter (with no accented characters), it works as expected. But when I get the body of a message (using get_messages) and try to send it again to the same queue, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 38: ordinal not in range(128)
The messages were written directly from the Amazon Web Console and contain a few ";" characters and some accented ones such as "õ" and "ã". What should I do? I'm already using set_message_class(RawMessage) as suggested here:
Using python BOTO with AWS SQS, getting back nonsense characters
but that only fixed receiving the messages. I'm using Ubuntu 12.04, with python-boto installed from the repositories (I think it's version 2.22, but I don't know how to check).
Thanks!!
send_message can only handle byte strings (the str class), but what you are receiving from SQS is a Unicode string (the unicode class). You need to convert the Unicode string to a byte string by calling encode('utf-8') on it.
If you have a mix of string types coming in, you may need to encode conditionally:
if isinstance(message_body, unicode):
    # Unicode string: encode it to UTF-8 bytes before sending
    message_content = message_body.encode('utf-8')
else:
    # already a byte string; pass it through unchanged
    message_content = message_body
I have a Python script that writes some strings with UTF-8 encoding. In my script I am mainly using the str() function to cast to string. It looks like this:
mystring = "this is unicode string:" + japanesevalues[1]
# japanesevalues is a list of Unicode values; I am sure it is Unicode
print mystring
I don't use the Python terminal, just the standard Linux Red Hat x86_64 terminal. I set the terminal to output UTF-8 characters.
If I execute this:
#python myscript.py
this is unicode string: カラダーズ ソフィー
But if I do this:
#python myscript.py > output
I got the typical error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)
Why is that?
The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.
But when you redirect, you are no longer using the terminal; you are just using a Unix pipe. A Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you want, so it falls back to a default character set.
You have marked your question with "python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case sys.getdefaultencoding() is generally 'ascii', and in your case it definitely is. And of course you cannot encode Japanese characters as ASCII, so you get an error.
Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That means it will not work if your terminal uses something else, though, but you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.
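The encode-before-print workaround can be sketched like this (a minimal sketch; the literal stands in for japanesevalues[1], and errors="replace" keeps the sketch from raising on an ASCII-only stream):

```python
import sys

mystring = u"this is unicode string: \u30ab\u30e9\u30c0\u30fc\u30ba"
# When output is redirected under Python 2, sys.stdout.encoding is None,
# so fall back to UTF-8; the redirected file then ends up UTF-8 encoded.
encoding = sys.stdout.encoding or "utf-8"
output = mystring.encode(encoding, errors="replace")
```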
I am working with external data that's encoded in latin-1, so I've added a sitecustomize.py and in it called
sys.setdefaultencoding('latin_1')
Sure enough, working with latin-1 strings now works fine.
But in case I encounter something that cannot be encoded in latin-1:
s = str(u'abc\u2013')
I get: UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
What I would like is for the unencodable characters to simply be replaced, i.e. in the above example I would get s == 'abc?', and to do that without explicitly calling decode() or encode() each time, i.e. not s.encode(..., 'replace') on each call.
I tried doing different things with codecs.register_error, but to no avail.
Please help?
There is a reason scripts can't call sys.setdefaultencoding: some libraries (including standard libraries shipped with Python) expect the default to be 'ascii'. Don't do that.
Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.
Explicit decoding takes a parameter specifying behavior for undecodable bytes.
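For example (a minimal sketch; 'replace' substitutes '?' for anything the codec can't represent):

```python
# Decode external latin-1 bytes explicitly on the way in...
raw = b'abc'
text = raw.decode('latin-1')
# ...and encode explicitly on the way out, with an error policy:
s = u'abc\u2013'.encode('latin-1', 'replace')
# latin-1 cannot represent U+2013 (en dash), so it becomes b'abc?'
```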
You can define your own custom handler and use it instead to do as you please. See this example:
import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print( b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler') )
print( codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler") )
Running it, you will see:
invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
References:
https://docs.python.org/3/library/codecs.html#codecs.register_error
https://docs.python.org/3/library/exceptions.html#UnicodeError
How to ignore invalid lines in a file?
'str' object has no attribute 'decode'. Python 3 error?
How to replace invalid unicode characters in a string in Python?
UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?
I'm having a problem when trying to apply a regular expression to some strings encoded in latin-1 (ISO-8859-1).
What I'm trying to do is send some data via HTTP POST from a page encoded in ISO-8859-1 to my python application and do some parsing on the data using regular expressions in my python script.
The web page uses jQuery to send the data to the server, and I'm grabbing the text from the page using the .text() method. Once the data reaches the server, I run it through a regular expression that looks like this: re.compile(r"^[\s,]*(\d*\s*\d*\/*\d)[\s,]*"). Unfortunately the \s in my regular expression is not matching my data, and I traced the problem down to the fact that the HTML page uses &nbsp;, which gets encoded to 0xA0 (non-breaking space) and sent to the server. For some reason, it seems, my script is not interpreting that character as whitespace, so it does not match. According to the Python documentation it looks like this should work, so I must have an encoding issue here.
I then wanted to try converting the string to Unicode and passing it to the regular expression, so I tried to see what would happen when I converted it: print(unicode(data, 'iso-8859-1')).
Unfortunately I got this error:
UnicodeEncodeError at /script/
'ascii' codec can't encode character u'\xa0' in position 122: ordinal not in range(128)
I'm confused, though: I'm obviously not trying to use ASCII, so why is Python falling back to ASCII even though I'm explicitly passing another codec?
Try this instead:
print(repr(unicode(data, 'iso-8859-1')))
By printing a unicode object you're implicitly trying to convert it to the default encoding, which is ASCII. Using repr will escape it into an ASCII-safe form, and it will be easier for you to figure out what's going on while debugging.
Are you using Python 3.x or 2.x? It makes a difference. It actually looks like 2.x, but you confused me by using print(blahblah) :-)
Answer to your last question: yes, ASCII by default when you print(). On 3.x, use print(ascii(foo)) for debugging, not print(foo). On 2.x, use repr(), not ascii().
Your original problem with the no-break space should go away if (a) the data is Unicode and (b) you use the re.UNICODE flag with re.compile().
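A minimal sketch of that fix, using the pattern from the question (the sample string is hypothetical, standing in for the POSTed data):

```python
import re

# re.UNICODE makes \s match Unicode whitespace such as U+00A0
# (no-break space); it is the default for str patterns in Python 3.
pattern = re.compile(r"^[\s,]*(\d*\s*\d*\/*\d)[\s,]*", re.UNICODE)

data = u'\xa0 12/3'  # hypothetical text starting with a no-break space
match = pattern.match(data)  # matches; group(1) is '12/3'
```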