URLDecoding requests - python

I am trying to get the original url from requests. Here is what I have so far:
res = requests.get(...)
url = urllib.unquote(res.url).decode('utf8')
I then get an error that says:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
The original url I requested is:
https://www.microsoft.com/de-at/store/movies/american-pie-pr\xc3\xa4sentiert-nackte-tatsachen/8d6kgwzl63ql
And here is what happens when I try printing:
>>> print '111', res.url
111 https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '222', urllib.unquote( res.url )
222 https://www.microsoft.com/de-at/store/movies/american-pie-präsentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '333', urllib.unquote(res.url).decode('utf8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
Why is this occurring, and how would I fix this?

UnicodeEncodeError: 'ascii' codec can't encode characters
You are trying to decode a string that is Unicode already. On Python 3 this raises AttributeError (a Unicode string has no .decode() method there). Python 2 first tries to encode the string into bytes using sys.getdefaultencoding() ('ascii') before passing it to .decode('utf8'), which leads to UnicodeEncodeError.
In short, do not call .decode() on Unicode strings, use this instead:
print urllib.unquote(res.url.encode('ascii')).decode('utf-8')
Without the .decode() call, the code prints bytes (assuming a bytestring is passed to unquote()), which may lead to mojibake if the character encoding used by your environment is not utf-8. To avoid mojibake, always print Unicode rather than bytes, and do not hardcode the character encoding of your environment inside your script; that is why .decode() is necessary here.
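There is a bug in urllib.unquote() if you pass it a Unicode string: each percent-escape becomes a separate code point (as if the bytes were latin-1) instead of being decoded as utf-8:
>>> print urllib.unquote(u'%C3%A4')
Ã¤
>>> print urllib.unquote('%C3%A4') # utf-8 output
ä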
Pass bytestrings to unquote() on Python 2.
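As a minimal sketch of the advice above (not part of the original answer), the helper below percent-decodes to bytes first and only then decodes those bytes as utf-8. The name unquote_url and the try/except import are assumptions added for illustration so that the same code runs on Python 2 and Python 3. It also assumes the URL contains only ASCII characters, as a percent-encoded URL should.

```python
try:
    # Python 3: unquote_to_bytes() returns the raw percent-decoded bytes.
    from urllib.parse import unquote_to_bytes
except ImportError:
    # Python 2: unquote() returns bytes when it is given a bytestring.
    from urllib import unquote as unquote_to_bytes


def unquote_url(url, encoding='utf-8'):
    """Percent-decode url and return Unicode text (hypothetical helper)."""
    if not isinstance(url, bytes):
        # A percent-encoded URL is pure ASCII, so this cannot lose data;
        # it raises UnicodeEncodeError if that assumption does not hold.
        url = url.encode('ascii')
    return unquote_to_bytes(url).decode(encoding)


quoted = (u'https://www.microsoft.com/de-at/store/movies/'
          u'american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql')
decoded = unquote_url(quoted)
```

On Python 3 alone, urllib.parse.unquote(url) already decodes percent-escapes as utf-8 by default, so the helper is mainly useful for code that must behave the same on both versions.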

Related

selenium unicode encode error

When retrieving the content of a Google search result page, I get this error:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 15663: ordinal not in range(128)
I'm calling the python script from PHP like this
exec('python selenium_scrape.py');
This solves the problem, but then all unicode chars will be encoded twice
print driver.find_element_by_tag_name('body').get_attribute('innerHTML').encode('utf-8')
That's probably because you're printing to a stdout that uses ASCII (7 bit) encoding. Call Python with a locale setting that uses utf-8, or do some appropriate encoding of the (unicode) HTML content to a 7-bit character string first.
Try encoding the text before printing:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML').encode("utf-8")
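A rough sketch of the "encode on the way out" idea, so that text is encoded exactly once when it reaches the output stream. The helper name make_utf8_writer is hypothetical, and io.BytesIO stands in for a stdout whose default encoding would be ASCII; none of this comes from the answers above.

```python
import io


def make_utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is encoded as utf-8."""
    # newline='\n' disables platform-specific newline translation on write.
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')


# io.BytesIO stands in for a stdout that would otherwise default to ASCII.
fake_stdout = io.BytesIO()
out = make_utf8_writer(fake_stdout)
out.write(u'<body>\xe6</body>\n')  # u'\xe6' is the character from the error
out.flush()
written = fake_stdout.getvalue()
```

On Python 3 the binary stream behind standard output is sys.stdout.buffer. Setting the PYTHONIOENCODING environment variable to utf-8 before launching the interpreter is another option that needs no code changes.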

Python 2.7.6 + unicode_literals - UnicodeDecodeError: 'ascii' codec can't decode byte

I'm trying to print the following unicode string but I'm receiving a UnicodeDecodeError: 'ascii' codec can't decode byte error. Can you please help form this query so it can print the unicode string properly?
>>> from __future__ import unicode_literals
>>> ts='now'
>>> free_form_request='[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV'
>>> nick='me'
>>> print('{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 6: ordinal not in range(128)
Thank you very much in advance!
Here's what happens when you construct this string:
'{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick)
1. free_form_request is encoded into a byte string using utf-8 as the encoding. This works because utf-8 can represent [EXID(이엑스아이디)] 위아래 (UP&DOWN) MV.
2. However, the format string ('{ts}: free form request {free_form_request} requested from {nick}') is a unicode string (because of from __future__ import unicode_literals).
3. You can't use byte strings as format arguments for a unicode string, so Python attempts to decode the byte string created in 1. to create a unicode string (which would be valid as a format argument).
4. Python attempts the decoding using the default encoding, which is ascii, and fails, because the byte string is a utf-8 byte string that includes byte values that don't make sense in ascii.
5. Python throws a UnicodeDecodeError.
Note that while the code is clearly doing something wrong here, it would not actually throw an exception on Python 3, which would instead substitute the repr of the byte string (the repr being a unicode string).
To fix your issue, just pass unicode strings to format.
That is, don't do step 1. where you encoded free_form_request as a byte string: keep it as a unicode string by removing .encode(...):
'{ts}: free form request {free_form_request} requested from {nick}'.format(
ts=ts,
free_form_request=free_form_request,
nick=nick)
Note Padraic Cunningham's answer in the comments as well.
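The fix can be sketched as follows: keep every value as text while formatting, and encode once at the boundary only if bytes are really needed. The function name build_message and the payload variable are illustrative assumptions. The snippet is written for Python 3; on Python 2 the file would also need a coding declaration because of the non-ASCII literal.

```python
def build_message(ts, free_form_request, nick):
    """Format with text only; every argument stays a Unicode string."""
    template = (u'{ts}: free form request {free_form_request} '
                u'requested from {nick}')
    return template.format(ts=ts, free_form_request=free_form_request,
                           nick=nick)


message = build_message(u'now', u'[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV', u'me')

# Encode once, at the boundary, only when bytes are really required
# (for example when writing to a socket or a binary file).
payload = message.encode('utf-8')
```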

Python: Unicode problems

I am getting an error at this line
logger.info(u"Data: {}".format(data))
I'm getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 4: ordinal not in range(128)
Before that line, I tried adding data = data.decode('utf8') and I still get the same error.
I tried data = data.encode('utf8') and it says UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
How do I fix this? I don't know if I should encode or decode but neither works.
Use a plain (byte) string literal for the format template, and encode data first if it is unicode:
if isinstance(data, unicode):
    data = data.encode('utf8')
logger.info("Data: {}".format(data))
The logging module needs you to pass in str (byte string) values, as these values are passed on unaltered to the formatters and handlers; otherwise, writing log messages to a file means that unicode values are encoded with the default (ASCII) codec. You also need to pass in a bytestring value when formatting.
Passing in a str value into a unicode .format() template leads to decoding errors, passing in a unicode value into a str .format() template leads to encoding errors, and passing a formatted unicode value to logger.info() leads to encoding errors too.
It is better not to mix the two types; encode explicitly beforehand.
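An alternative that is not part of the answer above, shown only as a sketch: give the file handler an explicit encoding so that text can be logged as-is. logging.FileHandler accepts an encoding argument; the logger name, the helper make_utf8_logger, and the temporary file are assumptions made for this example.

```python
import io
import logging
import os
import tempfile


def make_utf8_logger(path, name='utf8_demo'):
    """Return a logger (and its handler) that writes utf-8 text to path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.FileHandler(path, encoding='utf-8')
    handler.setFormatter(logging.Formatter('%(message)s'))
    logger.addHandler(handler)
    return logger, handler


fd, log_path = tempfile.mkstemp(suffix='.log')
os.close(fd)

logger, handler = make_utf8_logger(log_path)
data = u'it\u2019s'  # contains U+2019, the character from the traceback
logger.info(u'Data: %s', data)  # text in, the handler does the encoding

# Release the file before reading it back and cleaning up.
logger.removeHandler(handler)
handler.close()
with io.open(log_path, encoding='utf-8') as f:
    logged = f.read()
os.remove(log_path)
```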
You could do something such as
data.decode('utf-8').encode("ascii",errors="ignore")
This will silently drop ("ignore") any non-ASCII characters.
edit: data.encode('ascii', errors='ignore') may be enough, but I'm not in a position to test this currently.
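For comparison, a small sketch of what the different error handlers do to a non-ASCII character when encoding to ASCII. The sample value is an assumption chosen to match the character in the traceback.

```python
data = u'it\u2019s'  # contains U+2019, the character from the traceback

ignored = data.encode('ascii', errors='ignore')            # character dropped
replaced = data.encode('ascii', errors='replace')          # becomes '?'
escaped = data.encode('ascii', errors='backslashreplace')  # kept as an escape
```

Note that 'ignore' loses information silently, so 'backslashreplace' may be the safer choice for log output.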

What happens when you call str() on a unicode string?

I'm wondering what happens internally when you call str() on a unicode string.
# coding: utf-8
s2 = str(u'hello')
Is s2 just the unicode byte representation of the str() arg?
It will try to encode it with your default encoding. On my system, that's ASCII, and if there's any non-ASCII characters, it will fail:
>>> str(u'あ')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
Note that this is the same error you'd get if you called encode('ascii') on it:
>>> u'あ'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
As you might imagine, str working on some arguments and failing on others makes it easy to write code that on first glance seems to work, but stops working once you throw some international characters in there. Python 3 avoids this by making the problem blatantly obvious: you can't convert Unicode to a byte string without an explicit encoding:
>>> bytes(u'あ')
TypeError: string argument without an encoding
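A short sketch of the explicit alternative: name the encoding yourself instead of relying on the default. The helper encode_or_none is hypothetical and only exists to show which encodings succeed.

```python
text = u'\u3042'  # the character あ from the examples above


def encode_or_none(value, encoding):
    """Return value encoded with encoding, or None if it cannot be encoded."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


as_utf8 = encode_or_none(text, 'utf-8')   # succeeds: three bytes
as_ascii = encode_or_none(text, 'ascii')  # fails, so the helper returns None
round_trip = as_utf8.decode('utf-8')      # decoding gives the text back
```

On Python 3, bytes(text, 'utf-8') is equivalent to text.encode('utf-8').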

Send a non-ASCII POST request in Python?

I'm trying to send a POST request to a web app using the mechanize module (itself a wrapper around urllib2). When I send the request, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128). I tried unicode(string), unicode(string, encoding="utf-8"), unicode(string).encode(), etc., but nothing worked: each attempt either raised the error above or TypeError: decoding Unicode is not supported
I looked at the other SO answers to similar questions, but none helped.
Thanks in advance!
EDIT: Example that produces an error:
prda = "šđćč" #valid UTF-8 characters
prda # typing in python shell
'\xc5\xa1\xc4\x91\xc4\x87\xc4\x8d'
print prda # in shell
šđćč
prda.encode("utf-8") #in shell
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
unicode(prda)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
I assume you're using Python 2.x.
Given a unicode object:
myUnicode = u'\u4f60\u597d'
encode it using utf-8:
mystr = myUnicode.encode('utf-8')
Note that you need to specify the encoding explicitly. By default it'll (usually) use ascii.
In your example, you use a non-unicode string literal containing non-ascii characters, which results in prda becoming a bytes string.
To achieve this, Python uses sys.stdin.encoding to automatically encode the string. In your case, this means the string gets encoded as "utf-8".
To convert prda to a unicode object, you need to decode it using the appropriate encoding:
>>> print prda.decode('utf-8')
šđćč
Note that, in a script or module, you cannot rely on Python to automatically guess the encoding. You need to explicitly declare the encoding at the top of the file, like this:
# -*- coding: utf-8 -*-
Whenever you encounter unicode errors in Python 2, it is very often because your code is mixing bytes strings with unicode strings. So you should always check what kind of string is causing the error, by using type(string).
If the string object is <type 'str'>, but you need unicode, decode it using the appropriate encoding. If the string object is <type 'unicode'>, but you need bytes, encode it using the appropriate encoding.
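The decode/encode rule above can be sketched as a pair of small helpers. The names to_text and to_bytes and the utf-8 default are assumptions for illustration; the check uses bytes, which is an alias of str on Python 2, so the same code runs on both versions.

```python
def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding it if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it if it is Unicode text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


raw = b'\xc5\xa1\xc4\x91\xc4\x87\xc4\x8d'  # the utf-8 bytes from the question
text = to_text(raw)
```

Calling to_text() on input as it arrives and to_bytes() on output just before it leaves keeps the two types from mixing in between.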
You don't need to wrap your chars in unicode calls, because they're already encoded :) if anything, you need to DE-code it to get a unicode object:
>>> s = '\xc5\xa1\xc4\x91\xc4\x87\xc4\x8d' # your string
>>> s.decode('utf-8')
u'\u0161\u0111\u0107\u010d'
>>> type(s.decode('utf-8'))
<type 'unicode'>
I don't know mechanize so I don't know exactly whether it handles it correctly or not, I'm afraid.
What I'd do with a regular urllib2 POST call would be to use urlencode:
>>> import urllib2
>>> from urllib import urlencode
>>> postData = urlencode({'test': s }) # note I'm NOT decoding it
>>> postData
'test=%C5%A1%C4%91%C4%87%C4%8D'
>>> urllib2.urlopen(url, postData) # etc etc etc
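For readers on Python 3, a rough equivalent of the session above might look like this. It is a sketch under assumptions: the endpoint URL is hypothetical, and the Request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

text = u'\u0161\u0111\u0107\u010d'  # the characters šđćč from the question

# urlencode() percent-encodes text as utf-8 by default; the request body
# itself must be bytes, and the encoded form is pure ASCII.
post_data = urlencode({'test': text}).encode('ascii')

# Hypothetical endpoint; building the Request does not send anything.
request = Request('http://example.com/submit', data=post_data)
```

Passing the Request to urllib.request.urlopen() would perform the POST.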
