How to send to client dictionary which contains utf characters with simplejson? - python

I have a dictionary whose "verb" key holds a string containing non-ASCII characters (UTF-8).
I want to send that dictionary to the client (I am using Tornado with Python 2.7.2 and simplejson).
I am trying this:
result = {"verb" : "Želeći"}
self.write(simplejson.dumps(result, ensure_ascii=False)) # tried also with utf-8 encoding parameter passed
self.flush()
but I always get the error: utf8 codec can't decode byte 0x8e in position 0
How can I send a dictionary containing UTF characters to the client with simplejson?

It works for me:
>>> import simplejson
>>> result = {"verb" : "Želeći"}
>>> simplejson.dumps(result, ensure_ascii=False)
u'{"verb": "\u017dele\u0107i"}'
I'm using python 2.7.4

You have data that is not UTF-8 encoded.
JSON strings are really Unicode strings, but you are giving it a byte string instead. Decode your data manually first instead of having the json module do it for you, wrongly.
Judging by the error message, you have windows-1252 (cp1252) encoded data instead, so the following will work:
result['verb'] = result['verb'].decode('cp1252')
simplejson.dumps(result, ensure_ascii=False).encode('UTF8')
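It could also be windows-1250; both 1250 and 1252 encode the Ž character (Unicode codepoint U+017D, LATIN CAPITAL LETTER Z WITH CARON) to hex 8E. Since cp1252 has no ć, windows-1250 is the more likely of the two for this particular word; in that case decode with 'cp1250' instead.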
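As a minimal sketch of the decode-first approach, assuming the bytes really are windows-1250 and using the standard json module in place of simplejson (the byte string below is a hypothetical reconstruction of the question's input, not taken from it):

```python
import json

# Hypothetical input: the word from the question as windows-1250 bytes
# (0x8E is Z-with-caron, 0xE6 is c-with-acute in that code page).
raw = b'\x8eele\xe6i'

# Decode explicitly with the encoding the data actually uses.
text = raw.decode('cp1250')

# ensure_ascii=False keeps the characters unescaped; encode the resulting
# JSON text to UTF-8 bytes before writing it to the response.
payload = json.dumps({"verb": text}, ensure_ascii=False).encode('utf-8')

# For comparison: cp1252 maps 0xE6 to the ae-ligature, so the word comes out wrong.
wrong = raw.decode('cp1252')
```

The resulting payload is a byte string, so it can be passed to self.write() without any further implicit decoding.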

Related

UTF-8 encoding in str type Python 2

I have Python 2.7 code that retrieves a base64-encoded response from a server. This response is decoded using the base64 module (b64decode / decodestring functions, returning str). Its decoded content has the Unicode code points of the original strings.
I need to convert these Unicode code points to UTF-8.
The original string has a substring content "Não". When I decode the responded string, it shows:
>>> encoded_str = ... # server response
>>> decoded_str = base64.b64decode(encoded_str)
>>> type(decoded_str)
<type 'str'>
>>> decoded_str[x:y]
'N\xe3o'
When I try to encode to UTF-8, it leads to errors as
>>> (decoded_str[x:y]).encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2: ordinal not in range(128)
However, when this string is manually written in Unicode type, I can correctly convert it to my desired UTF-8 string.
>>> test_str = u'N\xe3o'
>>> test_str.encode('utf-8')
'N\xc3\xa3o'
I have to retrieve this response from the server and correctly generate a UTF-8 string which can be printed as "Não". How can I do this in Python 2?
You want to decode, not encode the byte string.
Think of it like this: a Unicode string was encoded into bytes, and these bytes were further encoded into base64.
To reverse this, you need to reverse both encodings, in the opposite order.
However, the sample you show most definitely isn't a valid UTF-8 byte string - 0xE3 in isolation is not a valid UTF-8 encoding. Most likely, the Unicode string was encoded using Latin-1 or a related encoding (the sample is much too small to establish this conclusively; other common candidates are the fugly Windows code page CP1252 and Latin-9).
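A minimal sketch of reversing both encodings in order, assuming the original text was encoded as Latin-1 (the answer only calls this likely). The encoded_str value here is a hypothetical stand-in for the server response:

```python
import base64

# Hypothetical stand-in for the server response: Latin-1 bytes, base64-encoded.
encoded_str = base64.b64encode(b'N\xe3o')

# Step 1: undo the base64 layer; the result is still a byte string.
decoded_bytes = base64.b64decode(encoded_str)

# Step 2: undo the text encoding (assumed Latin-1) to get a Unicode string.
text = decoded_bytes.decode('latin-1')

# Only now can the text be encoded to UTF-8.
utf8_bytes = text.encode('utf-8')
```

If the real data turns out to be CP1252 or Latin-9, only the codec name in step 2 changes.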

Parsing JSON string with \u escapes

I have a Python service with an endpoint that passes data on to another service, gets back the result and passes it to the requester. There is a field message in the form, and if I input a Unicode character, let's say 'GRINNING FACE WITH SMILING EYES' (U+1F601), I see the following in the request form object
ImmutableMultiDict([('message', u'\U0001f601'),...
When I get response from the other service, I have this
{..., u'message': u'\xf0\x9f\x98\x81',...}
This is then JSONified using json.dumps into
{..."message": "\u00f0\u009f\u0098\u0081"...}
Finally, on client, the message string gets parsed into
ð
(If I'm not mistaken, Unicode code for that character is \u00f0)
So where does it go wrong? It looks like I have a string that gets returned from an external service with utf8 hex escapes. I tried utf8-decoding that string but I get the following error
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
To handle this correctly you need to fix the process that is creating the u'\xf0\x9f\x98\x81' mojibake. As noted, those bytes are correct, but they need to be in a plain string (in Python 3 that's a bytes string) not a Unicode string. We can't give further details without seeing the relevant code.
However, you can extract the byte codes from the mojibake by encoding it as Latin 1, and then decode those bytes as UTF-8 to create proper Unicode:
d = {u'message': u'\xf0\x9f\x98\x81'}
for k, v in d.items():
    # Extract bytes from mojibake Unicode
    b = v.encode('latin1')
    # Now decode the extracted bytes as UTF-8
    s = b.decode('UTF-8')
    print k, s
output
message 😁
Or in a more compact form:
v = u'\xf0\x9f\x98\x81'
s = v.encode('latin1').decode('utf-8')
print(s)
That will work in both Python 2 & 3.
You should seriously consider migrating to Python 3, where Unicode handling is a lot saner, and you're much less likely to create these kinds of mix-ups.
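If the repair has to be applied to values that may or may not be mojibake, one possible approach is a guarded helper. This is only a sketch: fix_mojibake is a hypothetical name, and it assumes the only damage is UTF-8 bytes that were decoded as Latin-1.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 mojibake; return text unchanged if that fails."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the extracted
        # bytes are not valid UTF-8, so it was probably not mojibake.
        return text

repaired = fix_mojibake(u'\xf0\x9f\x98\x81')   # the mojibake from the question
untouched = fix_mojibake(u'N\xe3o')            # genuine Latin-1 text is left alone
```

A genuine string whose Latin-1 bytes happen to form valid UTF-8 would still be altered, so fixing the process that produces the mojibake remains the better option.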

Python - ASCII encoding string in the unicode string; how to remove that 'u'?

When I use the Python module 'pygoogle' with Chinese, I get a URL like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
It is a unicode object, but its contents look like raw bytes. I tried to encode it back to UTF-8, but that changes the bytes too.
a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
a.encode('utf-8')
>>> 'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'
I also tried:
str(a)
but I got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)
How can I encode it to remove the 'u'?
By the way, without the 'u' I get the correct result:
s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
print s
>>> http://zh.wikipedia.org/zh/汉语
You have a Mojibake; in this case those are UTF-8 bytes decoded as if they were Latin-1 bytes.
To reverse the process, encode to Latin-1 again:
>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语
The print worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding as UTF-8:
>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
The ISO-8859-1 (Latin-1) codec maps one-to-one to the first 256 Unicode codepoints (U+0000 through U+00FF), which is why the string contents look otherwise unchanged.
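The one-to-one mapping can be checked directly. A small sketch (the variable names are illustrative only):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(bytearray(range(256)))

# Latin-1 decodes each byte to the code point with the same number...
decoded = all_bytes.decode('latin-1')
one_to_one = all(ord(ch) == i for i, ch in enumerate(decoded))

# ...so encoding back yields exactly the original bytes.
round_trip = decoded.encode('latin-1') == all_bytes
```

This lossless round trip is what makes the encode('latin-1') step safe for recovering the original bytes.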
You may want to use the ftfy library for jobs like these; it handles a wide variety of text issues, including Windows codepage Mojibake where some resulting 'codepoints' are not legally encodable to the codepage. The ftfy.fix_text() function takes Unicode input and repairs it:
>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'

A script in python 2.7 urllib2 and json raises unicode error

import json
import urllib2
url='http://search.twitter.com/search.json?q=python'
open=urllib2.urlopen(url)
response=open.read().encode('utf8')
data=json.loads(response)
results=data['results']
for result in results:
    print result['from_user'] + ': ' + result['text'] + '\n'
gives the error UnicodeEncodeError: 'charmap' codec can't encode characters in position 16-24: character maps to <undefined>.
Anyone have a solution for this?
What you are looking to do is probably to decode and not encode the response.
A very short explanation of why: the HTTP server cannot send Unicode characters, only bytes. Hence it uses an encoding like UTF-8 to translate these characters into bytes.
When you receive a response from the server you receive this chunk of bytes, and if you want to translate it back into a sequence of Unicode characters (basically a unicode object in Python) you have to decode them.
What adds to the confusion is that ASCII characters (codepoints below 128) are encoded by UTF-8 as exactly the same single bytes. For those characters the encoded and decoded forms look identical, so the difference only shows up once non-ASCII text appears.
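A minimal sketch of decoding before parsing, assuming the response is UTF-8. The body value is a hypothetical stand-in for what read() returns, since the real URL is not called here:

```python
import json

# Hypothetical response body: UTF-8 bytes as they would arrive from the server.
body = b'{"results": [{"from_user": "ana", "text": "caf\xc3\xa9"}]}'

# Decode the bytes into text, then parse the JSON.
data = json.loads(body.decode('utf-8'))

# Build the output lines as Unicode strings.
lines = [r['from_user'] + u': ' + r['text'] for r in data['results']]

# ASCII-only text encodes to the same bytes in ASCII and in UTF-8.
ascii_same = u'python'.encode('utf-8') == u'python'.encode('ascii')
```

The 'charmap' error in the question is an encode error, so it most likely comes from printing to a console whose code page lacks some of the characters; that is a separate step from decoding the response.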
Hope this is helpful.

Converting to safe unicode in python

I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error.
Incorrect string value: '\xEF\xBF\xBDs m...'
My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion.
s = unicode(content, "utf-8", errors="replace")
Without the above unicode conversion, the error I get is
'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
Any help is appreciated!
What is the original encoding? I'm assuming "cp1252", from pixelbeat's answer. In that case, you can do
>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most'
>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'
>>> s = uni.encode('utf8')
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
0x92 is right single curly quote in windows cp1252 encoding.
\xEF\xBF\xBD is the UTF8 encoding of the unicode replacement character
(which was inserted instead of the erroneous cp1252 character).
So it looks like your database is not accepting the valid UTF8 data?
2 options:
1. Perhaps you should be using unicode(content,"cp1252")
2. If you want to insert UTF-8 into the DB, then you'll need to config it appropriately. I'll leave that answer to others more knowledgeable
The "Fabulous..." string doesn't look like UTF-8: 0x92 is above 0x7F, and in UTF-8 a byte in the range 0x80-0xBF can only appear as a continuation byte inside a multi-byte sequence. However, in that string it appears on its own (apparently representing an apostrophe).
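For data of unknown origin, one possible approach is to try UTF-8 first and fall back to cp1252. This is a sketch under that assumption; to_unicode is a hypothetical helper name, and cp1252 is only the guess made in the answers above:

```python
def to_unicode(raw):
    """Decode bytes as UTF-8, falling back to cp1252 (an assumed legacy encoding)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence 'replace'.
        return raw.decode('cp1252', 'replace')

orig = b'Fabulous home on one of Decatur\x92s most'
uni = to_unicode(orig)        # 0x92 becomes U+2019, the right single quote
utf8 = uni.encode('utf-8')    # valid UTF-8, ready for the database

# What errors="replace" with UTF-8 does instead: the bad byte turns into
# U+FFFD, whose UTF-8 encoding is the EF BF BD seen in the MySQL error.
lossy = orig.decode('utf-8', 'replace').encode('utf-8')
```

Trying UTF-8 first is reasonable because cp1252 text containing bytes above 0x7F is rarely valid UTF-8 by accident, while the reverse order would silently accept almost anything.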
