How to Handle JSON with escaped Unicode characters using python json module?

EDIT: The error doesn't appear in Prompt, but in the following Google App Engine environment.
I have following json
>>>dat = r"""{"name":"Something", "data":"For youth \n\nBe a hero! Donate blood!\n\u091c\u092f \u0939\u093f\u0902\u0926! \u0935\u0928\u094d\u0926\u0947 \u092e\u093e\u0924\u0930\u092e\u094d"}"""
It contains unicode escaped characters.
I want to parse this. So I did
>>>jsDat = json.loads(dat)
Then following works
>>>name = jsDat.get('name')
>>>name = name.encode('ascii') #This is because json module handles in unicode
>>>print name
Something
But trying for the field with unicode data, that is "data", an error is displayed
>>>data = jsDat.get('data')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 366-367: ordinal not in range(128)
How should I parse the data?

You can't encode unicode to ASCII if the characters exceed the ASCII character set. If you want to force the conversion, and lose data, you can do this:
data = jsDat.get('data')
data = data.encode('ascii', 'ignore')
See the doc for str.encode for more details about the ignore.
As an aside, I'm not sure why you're trying to encode to ASCII - the JSON module seems to handle that raw string just fine?

The error is coming from your 'print' line, and only because you're trying to print to a 'terminal' that doesn't understand the encoding. Doing anything else with the JSON object shouldn't produce errors.
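A minimal sketch of the point made in the two answers above, written for Python 3 as an assumption (the question itself uses Python 2). The payload is the one from the question; the names js_dat, utf8_bytes and ascii_safe are illustrative only. Parsing succeeds, and encoding is done only at the output boundary:

```python
import json

# The raw payload from the question, as a Python 3 raw string literal.
dat = r'''{"name":"Something", "data":"For youth \n\nBe a hero! Donate blood!\n\u091c\u092f \u0939\u093f\u0902\u0926! \u0935\u0928\u094d\u0926\u0947 \u092e\u093e\u0924\u0930\u092e\u094d"}'''

js_dat = json.loads(dat)
name = js_dat["name"]   # already text; no .encode('ascii') needed
data = js_dat["data"]   # text containing Devanagari characters

# Encode only at the output boundary, with an encoding that can hold the data.
utf8_bytes = data.encode("utf-8")

# For an ASCII-only sink, escape the characters instead of dropping them.
ascii_safe = data.encode("ascii", "backslashreplace").decode("ascii")
```

Whether utf8_bytes or ascii_safe is the right choice depends on what the terminal or log on the receiving side can display.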

Related

'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)

I am running into this problem where when I try to decode a string I run into one error,when I try to encode I run into another error,errors below,is there a permanent solution for this?
P.S please note that you may not be able to reproduce the encoding error with the string I provided as I couldnt copy/paste some errors
text = "sometext"
string = '\n'.join(list(set(text)))
try:
    print "decode"
    text = string.decode('UTF-8')
except Exception as e:
    print e
    text = string.encode('UTF-8')
Errors:-
error while using string.decode('UTF-8')
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8')
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

The First Error
The code you have provided will work, because the text is a bytestring (you are using Python 2). The error message, however, reports an 'ascii' encode failure during a decode call. In Python 2 that happens when the object is already a unicode string: decode() first implicitly encodes it to bytes with the ASCII codec, and that step fails on any character that has no ASCII equivalent (in your case ☂, U+2602). If the value is already unicode, there is nothing to decode and you can skip the call. If you really need an ASCII-only result and can accept losing data, encode explicitly with an error handler:
string.encode('ascii', 'ignore')
This will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
The Second Error
This error is more interesting. The message is about XML compatibility rather than about UTF-8 itself: UTF-8 can represent NULL bytes and control characters, but whatever consumes the string (an XML library, judging by the message) rejects them. Again, the code that you have actually provided works, but something in the text you are passing on violates that restriction. Note that
string.encode('UTF-8', 'ignore')
will not help here, because the 'ignore' handler only skips characters the codec cannot encode, and UTF-8 can encode control characters. You need to look into what it is in your specific text input that is causing the problem, and strip those characters before handing the string on.
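One way to find and drop the characters that the second message complains about is a small filter, sketched here for Python 3. It assumes the consumer enforces the XML 1.0 character rules; the helper name strip_xml_illegal is hypothetical, and the pattern covers only the most common offenders (C0 control characters and two noncharacters):

```python
import re

# Characters XML 1.0 does not allow: C0 controls except tab, LF and CR,
# plus the two noncharacters U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

def strip_xml_illegal(text):
    """Return text with XML-incompatible characters removed."""
    return _XML_ILLEGAL.sub("", text)

# The umbrella character survives; the NUL and vertical tab are removed.
cleaned = strip_xml_illegal("rain \u2602\x00 or\x0b shine\n")
```

Non-ASCII characters such as ☂ are left alone, since they are valid XML content.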

Unicode error in python program output

I am trying to run a bash command from my Python program that writes its result to a file. I am using os.system to execute the bash command, but I am getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 793: ordinal not in range(128)
I don't understand how to handle it. Please suggest a solution.

Have a look at this blog post:
These messages usually mean that you're either trying to mix Unicode strings with 8-bit strings, or trying to write Unicode strings to an output file or device that only handles ASCII.
To properly convert input data to Unicode, decode it first. Assuming the string referred to by value is encoded as UTF-8:
value = unicode(value, "utf-8")

You need to encode your string as:
your_string = your_string.encode('utf-8')
For example:
>>> print(u'\u201c'.encode('utf-8'))
“
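When the destination is a file rather than the terminal, the encoding can be stated explicitly when the file is opened. The following is a sketch for Python 3, not the asker's actual code; the temporary path and variable names are assumptions made for the example:

```python
import io
import os
import tempfile

text = "\u201cquoted\u201d"  # contains U+201C, the character from the error

# Write through an explicit encoding rather than relying on an ASCII default.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "w", encoding="utf-8") as fh:
    fh.write(text)

# Read it back with the same encoding to confirm nothing was lost.
with io.open(path, "r", encoding="utf-8") as fh:
    round_trip = fh.read()
```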

UnicodeEncodeError: 'ascii' codec can't encode characters due to Ã©Ã©n from database

I have a field to read from a database that contains a string with this part: Ã©Ã©n. While reading it I get this error:
"UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-15: ordinal not in range(128)"
I have searched for this error, and other people were having the issue because of Unicode escapes that look like u'\xa0', etc. In my case, I think it is due to special characters. I cannot change the database because I only have read access to it.
The code is here (it is actually a call to an external URL):
req = urllib2.Request(url)
req.add_header("Content-type", "application/json")
res = urllib2.urlopen(req,timeout = 50) #50 secs timeout
clientid = res.read()
result = json.loads(clientid)
Then I use result variable to get the above mentioned string and I get error on this line:
updateString +="name='"+str(result['product_name'])+"', "

You need to find out which encoding was used for your data before it was inserted into the database. Let's assume it's UTF-8, since that's the most common.
In that case you will want to decode as UTF-8 instead of ASCII. You didn't provide any code, so I'm assuming you have "data".decode(). Try "data".decode("utf-8"); if your data was encoded using this encoding, it will work.

So it sounds to me like the string already was unicode then. So remove str() and unicode functions on that line.

Unicode error trying to call Google search API

I need to perform google search to retrieve the number of results for a query. I found the answer here - Google Search from a Python App
However, for few queries I am getting the below error. I think the query has unicode characters.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 28: ordinal not in range(128)
I searched google and found I need to convert unicode to ascii, and found below code.
import unicodedata

def convertToAscii(text, action):
    try:
        # text is a UTF-8 byte string; action is the encode error handler
        temp = unicode(text, "utf-8")
        fixed = unicodedata.normalize('NFKD', temp).encode('ASCII', action)
        return fixed
    except Exception, errorInfo:
        print errorInfo
        print "Unable to convert the Unicode characters to xml character entities"
        raise errorInfo
If I use the action ignore, it removes those characters, but if I use other actions, I am getting exceptions.
Any idea, how to handle this?
Thanks
== Edit ==
I am using below code to encode and then perform the search and this is throwing the error.
query = urllib.urlencode({'q': searchfor})

You cannot urlencode raw Unicode strings. You need to first encode them to UTF-8 and then feed to it:
query = urllib.urlencode({'q': u"München".encode('UTF-8')})
This returns q=M%C3%BCnchen which Google happily accepts.
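For comparison, a short sketch of the same call in Python 3 (an assumption; the answer above targets Python 2), where urlencode lives in urllib.parse and percent-encodes text as UTF-8 by default:

```python
from urllib.parse import urlencode

# str values are percent-encoded as UTF-8 by default in Python 3.
query_from_text = urlencode({"q": "München"})

# Pre-encoded bytes, as in the Python 2 answer, give the same result.
query_from_bytes = urlencode({"q": "München".encode("utf-8")})
```

Both variables hold the query string quoted in the answer, q=M%C3%BCnchen.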

You can't safely convert Unicode to ASCII. Doing so involves throwing away information (specifically, it throws away non-English letters).
You should be doing the entire process in Unicode, so as not to lose any information.

Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.
html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)
But I get a UnicodeDecodeError:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
handler.get(*groups)
File "/Users/greg/clounce/main.py", line 55, in get
html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?

>>> u'aあä'.encode('ascii', 'ignore')
'a'
Decode the string you get back, using either the charset in the appropriate meta tag in the response or in the Content-Type header, then encode.
The method encode(encoding, errors) accepts an error handler name. Besides ignore, the built-in handlers include:
>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'
See https://docs.python.org/3/library/stdtypes.html#str.encode
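The handlers listed above can be compared side by side. This is a small Python 3 sketch; the dictionary name results is illustrative only:

```python
s = "a\u3042\u00e4"  # 'a', HIRAGANA LETTER A, 'ä'

# Encode the same text to ASCII once per error handler.
results = {
    handler: s.encode("ascii", handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

for handler, encoded in results.items():
    print(handler, encoded)
```

Only ignore and replace lose information; the other two keep enough to reconstruct the original characters.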

As an extension to Ignacio Vazquez-Abrams' answer
>>> u'aあä'.encode('ascii', 'ignore')
'a'
It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
You may also want to translate other characters (such as punctuation) to their nearest equivalents, for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding.
>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"
Although there are more efficient ways to accomplish this. See this question for more details Where is Python's "best ASCII for this Unicode" database?
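The two ideas in this answer, stripping accents and mapping punctuation to its nearest ASCII equivalent, can be combined in one helper. This is a Python 3 sketch under assumptions: the mapping table PUNCTUATION and the function name to_ascii are hypothetical, and the table is deliberately tiny rather than a complete "best ASCII" database:

```python
import unicodedata

# Hypothetical, deliberately small mapping; extend it for your own data.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single quotation marks
    "\u201c": '"', "\u201d": '"',   # double quotation marks
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def to_ascii(text):
    """Map common punctuation, strip accents, drop whatever is left."""
    text = text.translate(PUNCTUATION)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

example = to_ascii("It\u2019s a\u3042\u00e4")
```

Translating before encoding matters: once 'ignore' has dropped the quotation mark, there is nothing left to replace.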

2018 Update:
As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer with a gzipped response, you'll get an error like or similar to this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
In order to decode a gzipped response you need to add the following modules (in Python 3):
import gzip
import io
Note: In Python 2 you'd use StringIO instead of io
Then you can parse the content out like this:
response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource
This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the gzipped file can be read into bytes again and finally decoded to normally readable text.
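The same steps can be tried without a network connection. In this Python 3 sketch the compressed body is produced locally with gzip.compress as a stand-in for response.read(), which is an assumption made so the example is self-contained:

```python
import gzip

original = "Grüße, 世界"

# Stand-in for response.read() on a gzip-encoded response.
body = gzip.compress(original.encode("utf-8"))

# gzip data starts with the magic bytes 0x1f 0x8b; the 0x8b at position 1
# is the byte the UnicodeDecodeError above complains about.
is_gzip = body[:2] == b"\x1f\x8b"

# Decompress first, then decode with the encoding of the resource.
content = gzip.decompress(body).decode("utf-8") if is_gzip else body.decode("utf-8")
```

Checking the magic bytes is a heuristic; the Content-Encoding response header is the more direct signal when it is available.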
Original Answer from 2010:
Can we get the actual value used for link?
In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xa0'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
While:
html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
Succeeds without error. Do note that "windows-1252" is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.
Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).
As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It's either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().
Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
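A sketch of the advice above for Python 3: read the charset from the Content-Type header value, decode with it, and fall back when the declaration is missing or wrong. The function names and the UTF-8 fallback are assumptions for illustration, not part of the original answer:

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)

def decode_body(raw, header_value):
    """Decode raw bytes using the declared charset, with a lenient fallback."""
    charset = charset_from_content_type(header_value)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # The declaration was unknown or did not match the content.
        return raw.decode("utf-8", "replace")

text = decode_body(b"\xa0", "text/html; charset=windows-1252")
```

The fallback replaces undecodable bytes instead of raising, which fits the caveat that declared and actual encodings do not always agree.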

Use unidecode - it even converts weird characters to ascii instantly, and even converts Chinese to phonetic ascii.
$ pip install unidecode
then:
>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.
from django.utils import encoding
def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')
I no longer get any unicode errors after using this.

For broken consoles like cmd.exe and HTML output you can always use:
my_unicode_string.encode('ascii','xmlcharrefreplace')
This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.
WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.
And finally, if you are on windows and use cmd.exe then you can type chcp 65001 to enable utf-8 output (works with Lucida Console font). You might need to add myUnicodeString.encode('utf8').

You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""
The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front ... look for "charset".
You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte that is shown in your error message indicates that you may have a single-byte charset e.g. cp1252.
If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
Why have you tagged your question with "regex"?
Update after you replaced your whole question with a non-question:
html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.
html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

If you have a string line, you can use the .encode([encoding], [errors='strict']) method for strings to convert encoding types.
line = 'my big string'
line.encode('ascii', 'ignore')
For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html

I think the answer is there but only in bits and pieces, which makes it difficult to quickly fix the problem such as
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
Let's take an example, Suppose I have file which has some data in the following form ( containing ascii and non-ascii chars )
1/10/17, 21:36 - Land : Welcome ï¿½ï¿½
and we want to ignore and preserve only ascii characters.
This code will do:
import unicodedata
fp = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline
and type(rline) will give you
>>> type(rline)
<type 'str'>
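A rough Python 3 counterpart of the loop above, as a sketch: it takes an iterable of UTF-8 encoded byte lines instead of an open file so that it runs on its own. The function name ascii_lines and the sample data are assumptions for the example:

```python
import unicodedata

def ascii_lines(raw_lines):
    """Yield non-empty ASCII-only versions of UTF-8 encoded byte lines."""
    for raw in raw_lines:
        text = raw.decode("utf-8", "replace")
        reduced = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
        reduced = reduced.decode("ascii").strip()
        if reduced:
            yield reduced

# First line ends with an emoji; second line holds only a check mark.
sample = [b"1/10/17, 21:36 - Land : Welcome \xf0\x9f\x98\x80\n", b"\xe2\x9c\x93\n"]
cleaned = list(ascii_lines(sample))
```

Stripping after the ASCII reduction, rather than before as in the Python 2 loop, also removes the space left behind when a trailing non-ASCII character is dropped.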

unicodestring = '\xa0'
decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')
Works for me

You can use the following piece of code as an example to avoid Unicode to ASCII errors:
from anyascii import anyascii
content = "Base Rent for – CC# 2100 Acct# 8410: $41,667.00 – PO – Lines - for Feb to Dec to receive monthly"
content = anyascii(content)
print(content)

Looks like you are using python 2.x.
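Python 2.x assumes ASCII by default, both for source files and for implicit conversions between str and unicode. Hence the exception.
Paste the line below right after the shebang; it declares the source file as UTF-8 so that non-ASCII literals in the file are read correctly.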
# -*- coding: utf-8 -*-
