Using mysql (not my choice), everything is set to utf8, utf8_general_ci. In the normal case everything is utf8 and happy.
However, if I POST something like É’s (some latin1) and save it into the database as normal, I can't call .decode('utf-8') on the resulting model field:
>>> myinstance.myfield.decode('utf-8')
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 7: ordinal not in range(128)
I want to clean all incoming data so that it can be decoded as utf8.
Trying an approach like this just causes the UnicodeEncodeError upfront.
Edit: As Daniel's answer suggests, this question comes from a misunderstanding. latin1 is not the culprit here. Calling .decode('utf-8') on a unicode string first tries to encode it to ASCII, so it will fail for unicode like u'팩맨'.decode('utf-8'). It pains me to leave this question up, knowing what I know now. But maybe it will help someone. I think, since the data is actually coming back as unicode, what we were trying to do was equivalent to u'É’'.decode('utf-8').
Django fields are always unicode. Trying to call decode on them means that Python will try to encode first, to ASCII, before trying to decode as UTF-8. That clearly isn't what you want. I expect you actually just want to do myinstance.myfield.encode('utf-8').
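A quick round trip makes the direction of each operation concrete (a minimal sketch; the literal É here stands in for the posted data):

```python
# -*- coding: utf-8 -*-
# Sketch: encode goes from text (unicode) to bytes; decode goes from
# bytes back to text. Under Python 2, calling .decode() on a unicode
# object first encodes it with the implicit 'ascii' codec, which is
# exactly what raises the UnicodeEncodeError quoted above.

text = u'\xc9'                     # u'É', the kind of value a Django field returns
raw = text.encode('utf-8')         # unicode -> bytes: '\xc3\x89'

print(repr(raw))
print(repr(raw.decode('utf-8')))   # bytes -> the same unicode text
```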
Related
So I am using a Django tastypie resource, and I am trying to find a generic way to decode any string that may be posted to the resource.
I have, for example, a name like this:
luiçscoico2##!&&á
and I want to be able to identify the type of encoding and decode it appropriately.
I am trying to fetch the string like this:
print bundle.data.get('first_name')
When I do a json.dumps, my first-name string becomes
"lui\u00e7scoico2##!&&\u00e1"
and I get an INTERNAL SERVER ERROR... any ideas?
UPDATE:
I do get a
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in
position 3: ordinal not in range(128)
if I try to decode('utf-8') before doing the json.dumps to send to the server.
Ok, I'm gonna try to give a semi-blind answer here. Your string is already Unicode; the reason I know this is the u'\xe7', which is exactly the ç character.
This means you don't have to decode it. If you need your string as utf-8 bytes, then just do:
x.encode('utf-8')
and it will probably work :)
Hope this helps!
I am a newbie in Python.
I have a unicode string in Tamil.
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
When I use text = testString.decode("utf-8"), I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>"
When I use the
sys.getdefaultencoding() I get the
output as "Cp1252"
Two comments on that: (1) it's "cp1252", not "Cp1252". Don't type from memory. (2) Whoever caused sys.getdefaultencoding() to produce "cp1252" should be told politely that that's not a very good idea.
As for the rest, let me guess. You have a unicode object that contains some text in the Tamil language. You try, erroneously, to decode it. Decode means to convert from a str object to a unicode object. Unfortunately you don't have a str object, and even more unfortunately you get bounced by one of the very few awkish/perlish warts in Python 2: it tries to make a str object by encoding your unicode string using the system default encoding. If that's 'ascii' or 'cp1252', encoding will fail. That's why you get a Unicode*En*codeError instead of a Unicode*De*codeError.
Short answer: do text = testString.encode("utf-8"), if that's what you really want to do. Otherwise please explain what you want to do, and show us the result of print repr(testString).
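For instance (a minimal sketch; the Tamil text here is an arbitrary example, not the asker's actual data):

```python
# -*- coding: utf-8 -*-
# Sketch: a unicode object holding Tamil text is *encoded* to bytes;
# decode() is only for going from bytes back to unicode.

text = u'\u0ba4\u0bae\u0bbf\u0bb4\u0bcd'   # the word "Tamil" in Tamil script
utf8_bytes = text.encode('utf-8')          # unicode -> bytes: this is what works
assert utf8_bytes.decode('utf-8') == text  # bytes -> unicode round-trips cleanly
print(repr(utf8_bytes))
```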
add this as your 1st line of code
# -*- coding: utf-8 -*-
later in your code...
text = unicode(testString,"UTF-8")
You need to know which character encoding testString is using. If it's not utf8, an error will occur when using decode('utf8').
I have a site that displays user input by decoding it to unicode using utf-8. However, user input can include binary data, which is obviously not always able to be 'decoded' by utf-8.
I'm using Python, and I get an error saying:
'utf8' codec can't decode byte 0xbf in position 0: unexpected code byte. You passed in '\xbf\xcd...
Is there a standard efficient way to convert those undecodable characters into question marks?
It would be most helpful if the answer uses Python.
Try:
inputstring.decode("utf8", "replace")
See here for reference
I think what you are looking for is:
inputstring.decode('utf8', 'ignore')
which should drop invalid bytes rather than raising an exception.
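The two error handlers can be compared side by side (a minimal sketch using the bytes from the error message above):

```python
# Sketch: the same invalid UTF-8 input under the 'replace' and 'ignore'
# error handlers. 0xbf is a stray continuation byte, and 0xcd starts a
# two-byte sequence that is never completed.

raw = b'\xbf\xcd'
print(repr(raw.decode('utf-8', 'replace')))  # u'\ufffd\ufffd' (two replacement chars)
print(repr(raw.decode('utf-8', 'ignore')))   # u'' (invalid bytes silently dropped)
```

Note that 'replace' substitutes U+FFFD (the � replacement character) rather than a literal question mark, which is usually what you want for display.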
I'm trying to get Mako to render a string with unicode characters:
tempLook=TemplateLookup(..., default_filters=[], input_encoding='utf8',output_encoding='utf-8', encoding_errors='replace')
...
print sys.stdout.encoding
uname=cherrypy.session['userName']
print uname
kwargs['_toshow']=uname
...
return tempLook.get_template(page).render(**kwargs)
The related template file :
...${_toshow}...
And the output is :
UTF-8
Deşghfkskhü
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1: ordinal not in range(128)
I don't think there's any problem with the string itself since I can print it just fine.
Although I've played (a lot) with the input/output_encoding and default_filters parameters, it always complains about being unable to decode/encode with the ascii codec.
So I decided to try out the example found in the documentation, and the following works "best":
input_encoding='utf-8', output_encoding='utf-8'
# (note: it still raised an error without output_encoding, despite the tutorial not implying it)
With
${u"voix m’a réveillé."}
And the result being
voix mâ�a réveillé
I simply don't get why this doesn't work. "Magic encoding comment"s don't work either. All the files are encoded with UTF-8.
I've spent hours to no avail, am I missing something ?
Update :
I have a simpler question now :
Now that all the variables are unicode, how can I get Mako to render unicode strings without applying anything ? Passing a blank filter / render_unicode() doesn't help.
Yes, UTF-8 != Unicode.
UTF-8 is a specific string encoding, as are ASCII and ISO 8859-1. Try this:
For any input string, do inputstring.decode('utf-8') (or whatever input encoding you get). For any output string, do outputstring.encode('utf-8') (or whatever output encoding you want). For any internal use, work with unicode strings ('this is a normal string'.decode('utf-8') == u'this is a normal string').
'foo' is a byte string; u'foo' is a unicode string, which doesn't "have" an encoding (it can't be decoded). So any time Python wants to change the encoding of a byte string, it first tries to "decode" it, then to "encode" it. And the default encoding is "ascii", which fails more often than not :-)
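That decode-at-the-input-boundary, encode-at-the-output-boundary pattern can be sketched like this (a minimal sketch, assuming UTF-8 on both sides):

```python
# Sketch: decode incoming bytes once, work in unicode internally,
# and encode only when writing back out.

incoming = b'voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9.'  # UTF-8 bytes off the wire
text = incoming.decode('utf-8')       # bytes -> unicode at the input boundary
# ... all internal processing happens on `text`, a unicode object ...
outgoing = text.encode('utf-8')       # unicode -> bytes at the output boundary
assert outgoing == incoming           # a clean round trip
print(repr(text))
```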
I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error.
Incorrect string value: '\xEF\xBF\xBDs m...'
My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion.
s = unicode(content, "utf-8", errors="replace")
Without the above unicode conversion, the error I get is
'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
Any help is appreciated!
What is the original encoding? I'm assuming "cp1252", from pixelbeat's answer. In that case, you can do
>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most'
>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'
>>> s = uni.encode('utf8')
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
0x92 is right single curly quote in windows cp1252 encoding.
\xEF\xBF\xBD is the UTF8 encoding of the unicode replacement character
(which was inserted instead of the erroneous cp1252 character).
So it looks like your database is not accepting the valid UTF8 data?
2 options:
1. Perhaps you should be using unicode(content,"cp1252")
2. If you want to insert UTF-8 into the DB, then you'll need to config it appropriately. I'll leave that answer to others more knowledgeable
The "Fabulous..." string doesn't look like utf-8: 0x92 is above 128 and as such should be a continuation of a multi-byte character. However, in that string it appears on its own (apparently representing an apostrophe).
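Putting pixelbeat's two observations together, the cp1252-to-UTF-8 fix-up looks like this (a minimal sketch, assuming the input really is cp1252):

```python
# Sketch: transcode a cp1252 byte string to UTF-8. 0x92 is the right
# single curly quote (U+2019) in cp1252, which is why a plain
# .decode('utf-8') chokes on it.

orig = b'Fabulous home on one of Decatur\x92s most'
uni = orig.decode('cp1252')    # bytes -> unicode via the *actual* encoding
utf8 = uni.encode('utf-8')     # unicode -> UTF-8 bytes for the database
print(repr(uni))
print(repr(utf8))
```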