Unicode error trying to call Google search API

I need to run a Google search to retrieve the number of results for a query. I found the answer here - Google Search from a Python App
However, for a few queries I get the error below. I think the query contains Unicode characters.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 28: ordinal not in range(128)
I searched Google and found that I need to convert Unicode to ASCII, and found the code below.
import unicodedata

def convertToAscii(text, action):
    try:
        # Decode the UTF-8 byte string, then decompose accented characters (NFKD)
        temp = unicode(text, "utf-8")
        fixed = unicodedata.normalize('NFKD', temp).encode('ASCII', action)
        return fixed
    except Exception, errorInfo:
        print errorInfo
        print "Unable to convert the Unicode characters to xml character entities"
        raise errorInfo
If I use the action ignore, it removes those characters, but if I use other actions, I get exceptions.
Any idea how to handle this?
Thanks
== Edit ==
I am using the code below to encode the query and then perform the search, and this is the line that throws the error.
query = urllib.urlencode({'q': searchfor})

In Python 2 you cannot urlencode raw Unicode strings. You need to encode them to UTF-8 first and then pass the result to urlencode:
query = urllib.urlencode({'q': u"München".encode('UTF-8')})
This returns q=M%C3%BCnchen, which Google happily accepts.
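For comparison, here is a minimal sketch of the same call in Python 3. This assumes a Python 3 interpreter, which the question does not use; there the function lives in urllib.parse, accepts text directly, and percent-encodes it as UTF-8 by default. Passing pre-encoded bytes gives the same result.

```python
from urllib.parse import urlencode

# Python 3: str values are percent-encoded as UTF-8 by default.
query_from_text = urlencode({"q": "M\u00fcnchen"})

# Passing UTF-8 bytes explicitly, as in the Python 2 answer, gives the same string.
query_from_bytes = urlencode({"q": "M\u00fcnchen".encode("utf-8")})

print(query_from_text)   # q=M%C3%BCnchen
print(query_from_bytes)  # q=M%C3%BCnchen
```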

You can't safely convert Unicode to ASCII. Doing so involves throwing away information (specifically, it throws away every character outside the ASCII range, such as accented and non-Latin letters).
You should be doing the entire process in Unicode, so as not to lose any information.
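To make the information loss concrete, here is a small sketch in Python 3 syntax (an assumption, since the question uses Python 2). It compares the two ASCII conversions discussed in the question with a plain UTF-8 round trip; the sample word is only an illustration.

```python
import unicodedata

text = "M\u00fcnchen"  # "München" with a precomposed u-umlaut

# Dropping non-ASCII characters loses the umlaut letter entirely.
ignored = text.encode("ascii", "ignore")

# NFKD splits the letter into "u" plus a combining diaeresis;
# "ignore" then drops only the diaeresis.
stripped = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

# UTF-8 can represent every character, so the round trip is lossless.
round_trip = text.encode("utf-8").decode("utf-8")

print(ignored)             # b'Mnchen'
print(stripped)            # b'Munchen'
print(round_trip == text)  # True
```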

Related

how to encode character '\xa0' in 'ascii' codec

I am trying to fetch data using HERE's REST API from Python, but I am receiving the following error:
1132
1133 # Non-ASCII characters should have been eliminated earlier
-> 1134 self._output(request.encode('ascii'))
1135
1136 if self._http_vsn == 11:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 86: ordinal not in range(128)
My Python code is:
import json
import urllib.request

import pandas as pd

df = pd.read_csv(r"data.csv", encoding='utf8', sep=",",
                 engine="python")

def GoogPlac(auth_key, lat, lon):
    location = str(lat) + ',' + str(lon)
    MyUrl = ('https://places.ls.hereapi.com/places/v1/browse'
             '?apiKey=%s'
             '&in=%s'
             ';r=2000'
             '&cat=restaurant&pretty') % (auth_key, location)
    # grabbing the JSON result
    response = urllib.request.urlopen(MyUrl)
    jsonRaw = response.read()
    jsonData = json.loads(jsonRaw)
    return jsonData

# Function call
df['response'] = df.apply(lambda x: GoogPlac(auth_key, x['latitude'], x['longitude']), axis=1)
I want to avoid the error and continue my API fetch.

You said you want to avoid the error, but how you avoid it matters.
Your title says you want to encode something to ASCII, but the thing you want to encode is not encodable in ASCII. There is no A0 character in 7-bit ASCII. You've asked the impossible.
You can decide among a few different things:
1. Encode with a lossy encode() error handler that throws away everything that doesn't fit in ASCII. This is dangerous and probably not very smart. If you can't trust your data, then why are you using your data?
2. Use a different encoding for output. You seem to know what encoding your text was in, because you could fetch it and render it to Unicode. (Or you are using ancient Python 2, the default system encoding understands that page's encoding, and there's a silent .decode(DEFAULT_ENCODING) right before your .encode("ascii").) This is by far the best scheme. Just don't use ASCII. UTF-8 is the present and future!
3. Specifically snip out A0 with .replace() before your .encode(). Also pretty bad.
4. Get your page author to agree it should be ASCII and get them to fix it. This is best of all.
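For the URL in the question, one way to "not use ASCII" is to percent-encode the dynamic parts as UTF-8 before the request is sent. The sketch below assumes Python 3; build_browse_url and BASE_URL are hypothetical names introduced for illustration, and the key and coordinates are made up.

```python
from urllib.parse import quote

BASE_URL = "https://places.ls.hereapi.com/places/v1/browse"

def build_browse_url(auth_key, lat, lon):
    """Return an ASCII-only URL; non-ASCII characters become %XX escapes."""
    location = "{},{}".format(lat, lon)
    return "{}?apiKey={}&in={};r=2000&cat=restaurant&pretty".format(
        BASE_URL,
        quote(str(auth_key), safe=""),  # U+00A0 becomes %C2%A0
        quote(location, safe=","),      # keep the comma between lat and lon
    )

# Hypothetical key with a stray non-breaking space at the end
url = build_browse_url("my-key\u00a0", 52.52, 13.405)
url.encode("ascii")  # no UnicodeEncodeError: the URL is pure ASCII now
print(url)
```

If the stray character turns out to be an accidental non-breaking space in the key or in the CSV data, the request will now be sent, but the server may still reject the value, so it is worth checking where the character comes from.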

'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)

I am running into a problem: when I try to decode a string I get one error, and when I try to encode it I get another (errors below). Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the errors.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
    print "decode"
    text = string.decode('UTF-8')
except Exception as e:
    print e
    text = string.encode('UTF-8')
Errors:-
error while using string.decode('UTF-8')
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8')
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

The First Error
The code you have provided works as written, because text is a bytestring (you are using Python 2). The error message, however, is an encode error raised by a decode call, which suggests that in your real code string is already a unicode object. In Python 2, calling .decode('UTF-8') on a unicode object first encodes it implicitly with the ASCII codec, and that only succeeds if the string contains nothing but ASCII characters (you can see the list of ASCII characters here). In your case it encounters a Unicode character (specifically ☂, U+2602) which has no ASCII equivalent. If the input really is a bytestring that contains invalid UTF-8, you can get around the failure by using:
string.decode('UTF-8', 'ignore')
This ignores (i.e. replaces with nothing) the bytes that cannot be decoded. It does not help when string is already unicode; in that case, skip the decode step.
The Second Error
This error is more interesting. UTF-8 itself can represent NULL bytes and control characters, so the message does not come from the encoding step. It reads like a check made by the XML code that later receives the text, which rejects NULL bytes, control characters, and non-ASCII bytestrings. Again, the code that you have actually provided works, but something in the text that you pass on violates that check. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
Note that this only drops characters that cannot be encoded and leaves control characters in place, so you may still need to look into what it is in your specific text input that is causing the problem.
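As for a "permanent solution", a common pattern is to decode bytes exactly once at the boundary and to leave values that are already text alone. The sketch below uses Python 3 syntax (an assumption, since the question uses Python 2); to_text is a hypothetical helper name, not something from the question.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is actually bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# U+2602 (umbrella) is the three bytes E2 98 82 in UTF-8.
print(to_text(b"\xe2\x98\x82") == "\u2602")  # True
print(to_text("\u2602") == "\u2602")         # True
```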

UnicodeEncodeError: 'ascii' codec can't encode characters due to Ã©Ã©n from database

I have to read a field from a database that contains a string with the part Ã©Ã©n, and when reading it I get this error:
"UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-15: ordinal not in range(128)"
I have searched for this error, and other people were having this issue because of Unicode strings that start with something like u'\xa0, etc. But in my case, I think it's due to special characters. I cannot make changes in the database, as it is not under my control; I can only read from it.
The code is here (it is actually a call to an external URL):
req = urllib2.Request(url)
req.add_header("Content-type", "application/json")
res = urllib2.urlopen(req,timeout = 50) #50 secs timeout
clientid = res.read()
result = json.loads(clientid)
Then I use the result variable to get the above-mentioned string, and I get the error on this line:
updateString +="name='"+str(result['product_name'])+"', "

You need to find out which encoding was used for your data before it was inserted into the database. Let's assume it's UTF-8, since that's the most common.
In that case you will want to decode as UTF-8 instead of ASCII. You didn't provide any code, so I'm assuming you have "data".decode(). Try "data".decode("utf-8"), and if your data was encoded using this encoding, it will work.

It sounds to me like the string was already unicode. If so, remove the str() and unicode() calls on that line.
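The point about str() can be illustrated with a short sketch. In Python 2, str() on a unicode object implicitly encodes it with the ASCII codec, which is what fails here. The example below uses Python 3 syntax (an assumption) and reproduces the failure with an explicit .encode("ascii"); the response body is made up to stand in for res.read().

```python
import json

# Hypothetical response body standing in for res.read()
body = '{"product_name": "\\u00e9\\u00e9n kilo"}'
result = json.loads(body)
name = result["product_name"]  # json.loads already returns text

# Forcing the text through ASCII is what raises the error.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Concatenating the text directly needs no conversion at all.
update_string = "name='" + name + "', "

print(ascii_ok)  # False
```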

Python ascii encoding issue

I run a Python script and I receive the following error:
sql = 'insert into posts(msg_id, msg_text, msg_date) values("{0}", "{1}", "{2}")'.format(msg['id'], text.encode('utf-8'), msg['datime'])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 25-31: ordinal not in range(128)
How can I correct this error, or maybe catch it as an exception? Any ideas?

Try this:
sql = u'insert into posts(msg_id, msg_text, msg_date) values("{0}", "{1}", "{2}")'.format(msg['id'], text.decode('utf-8'), msg['datime'])
Basically, your text contains UTF-8 encoded characters, and with the encode() method you keep it as it is. But the main string (the one you're formatting) is a plain ASCII string. By adding u in front of the string (u'') you make it a unicode string. Then, whatever is in text, you want to have it decoded as UTF-8, hence .decode() instead of .encode().
And if you want to catch that kind of error, simply:
try:
    sql = …
except UnicodeEncodeError, err:
    print err
But if you really want to get rid of any UTF-8/ASCII mess, you should think about switching to Python 3.
HTH
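As a rough idea of what the Python 3 version might look like: text is Unicode by default there, so the format call needs no u prefix, and the exception is bound with "as". build_insert and the sample message below are hypothetical, introduced only for this sketch.

```python
def build_insert(msg, text):
    # Decode only if text arrived as bytes; in Python 3, str is already Unicode.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    template = 'insert into posts(msg_id, msg_text, msg_date) values("{0}", "{1}", "{2}")'
    return template.format(msg["id"], text, msg["datime"])

msg = {"id": 1, "datime": "2020-01-01"}  # hypothetical message
sql = build_insert(msg, "caf\u00e9".encode("utf-8"))

try:
    sql.encode("ascii")  # only fails if something insists on ASCII
except UnicodeEncodeError as err:  # Python 3 spelling of "except ..., err"
    print(err)
```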

How to Handle JSON with escaped Unicode characters using python json module?

EDIT: The error doesn't appear in Prompt, but in the following Google App Engine environment.
I have the following JSON:
>>> dat = r"""{"name":"Something", "data":"For youth \n\nBe a hero! Donate blood!\n\u091c\u092f \u0939\u093f\u0902\u0926! \u0935\u0928\u094d\u0926\u0947 \u092e\u093e\u0924\u0930\u092e\u094d"}"""
It contains Unicode escape sequences.
I want to parse this, so I did:
>>> jsDat = json.loads(dat)
Then the following works:
>>> name = jsDat.get('name')
>>> name = name.encode('ascii')  # needed because the json module returns unicode strings
>>> print name
Something
But for the field with Unicode data, that is "data", an error is displayed:
>>> data = jsDat.get('data')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 366-367: ordinal not in range(128)
How should I parse the data?

You can't encode Unicode to ASCII if the characters fall outside the ASCII character set. If you want to force the conversion, and lose data, you can do this:
data = jsDat.get('data')
data = data.encode('ascii', 'ignore')
See the documentation for str.encode for more details about 'ignore'.
As an aside, I'm not sure why you're trying to encode to ASCII. The json module seems to handle that raw string just fine.

The error is coming from your 'print' line, and only because you're trying to print to a 'terminal' that doesn't understand the encoding. Doing anything else with the JSON object shouldn't produce errors.
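A short sketch of the alternatives, in Python 3 syntax (an assumption) and with a shortened version of the JSON from the question: parsing works without any ASCII step, UTF-8 keeps everything, json.dumps can re-escape the text when the output has to be ASCII-safe, and 'ignore' shows how much is lost.

```python
import json

# Shortened sample based on the JSON in the question
dat = r'{"name": "Something", "data": "\u091c\u092f \u0939\u093f\u0902\u0926!"}'
js_dat = json.loads(dat)
data = js_dat["data"]  # text with the escapes already resolved

utf8_bytes = data.encode("utf-8")       # lossless, safe to write to a file or socket
ascii_safe = json.dumps(data)           # ensure_ascii is True by default: non-ASCII is re-escaped
lossy = data.encode("ascii", "ignore")  # drops every Devanagari character

print(ascii_safe)
print(lossy)  # b' !'
```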
