UTF-8 encoding issue with Python 3 [duplicate] - python

This question already has answers here:
How to fetch a non-ascii url with urlopen?
(10 answers)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' - -when using urlib.request python3
(2 answers)
Closed 6 years ago.
I wrote a Wikipedia scraper in Python last week.
It scrapes French pages, so I must manage UTF-8 encoding to avoid errors. I did this with these lines at the beginning of my script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
I also encode the scraped string like this:
adresse = monuments[1].get_text().encode('utf-8')
My first script worked perfectly fine with Python 2.7, but I rewrote it for Python 3 (especially to use urllib.request) and UTF-8 doesn't work anymore.
I got these errors after scraping the first few elements:
File "scraper_monu_historiques_ge_py3.py", line 19, in <module>
url = urllib.request.urlopen(url_ville).read() # et on ouvre chacune d'entre elles
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 455, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 473, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1217, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.4/urllib/request.py", line 1174, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1090, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 975, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 58: ordinal not in range(128)
I don't understand why, because it worked fine in Python 2.7... I published a version of this WIP on GitHub.

You are passing a string which contains non-ASCII characters to urllib.request.urlopen, so it isn't a valid URI (it is a valid IRI, or Internationalized Resource Identifier, though).
You need to turn the IRI into a valid URI before passing it to urlopen. The specifics
depend on which parts of the IRI contain non-ASCII characters: the domain part must be encoded using Punycode, while the path must be percent-encoded.
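As a minimal sketch of those two encodings (the hostname and path are made-up examples, not from the question):

```python
import urllib.parse

host = "bücher.example"
path = "/catégorie/livres"

# Domain labels: Punycode, via the built-in "idna" codec
ascii_host = host.encode("idna").decode("ascii")
# Path: percent-encoding of the UTF-8 bytes
ascii_path = urllib.parse.quote(path)

print(ascii_host)  # xn--bcher-kva.example
print(ascii_path)  # /cat%C3%A9gorie/livres
```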
Since your problem is exclusively due to the path containing Unicode characters, assuming your IRI is stored in the variable iri, you can fix it using the following:
import urllib.parse
import urllib.request
split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2]) # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)
urllib.request.urlopen(url).read()
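For instance, with a hypothetical French Wikipedia IRI, the snippet yields a pure-ASCII URL:

```python
import urllib.parse

iri = "https://fr.wikipedia.org/wiki/Genève"  # hypothetical example IRI

split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2])  # percent-encode the path
url = urllib.parse.urlunsplit(split_url)

print(url)  # https://fr.wikipedia.org/wiki/Gen%C3%A8ve
```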
However, if you can avoid urllib and have the option of using the requests library instead, I would recommend doing so. The library is easier to use and has automatic IRI handling.

Related

how do I pass the √ untouched

Is it possible to pass the √ through this untouched, or am I asking too much?
import urllib.request

path = 'html'
links = 'links'
with open(links, 'r', encoding='UTF-8') as links:
    for link in links:  # for each link in the file
        print(link)
        with urllib.request.urlopen(link) as linker:  # get the html
            print(linker)
            with open(path, 'ab') as f:  # append the html to html
                f.write(linker.read())
links
https://myanimelist.net/anime/27899/Tokyo_Ghoul_√A
output
File "PYdown.py", line 7, in <module>
with urllib.request.urlopen(link) as linker:
File "/usr/lib64/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/usr/lib64/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/usr/lib64/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib64/python3.6/urllib/request.py", line 1392, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib64/python3.6/urllib/request.py", line 1349, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib64/python3.6/http/client.py", line 1254, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib64/python3.6/http/client.py", line 1265, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib64/python3.6/http/client.py", line 1132, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u221a' in position 29: ordinal not in range(128)
You need to quote the Unicode characters in the URL. You have a file containing the list of URLs to open, so for each URL: split it (using urllib.parse.urlsplit()), quote the host and every part of the path (with urllib.parse.quote(); to split the path into parts you can use pathlib.PurePosixPath.parts), and then reassemble the URL (using urllib.parse.urlunsplit()).
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit, quote, urlencode, parse_qsl

def normalize_url(url):
    splitted = urlsplit(url)  # split link into components
    path = PurePosixPath(splitted.path)
    parts = iter(path.parts)  # first element is always "/"
    quoted_path = PurePosixPath(next(parts))  # "/"
    for part in parts:
        quoted_path /= quote(part)  # quote each path segment
    return urlunsplit((
        splitted.scheme,
        splitted.netloc.encode("idna").decode(),  # Punycode-encode the host
        str(quoted_path),
        urlencode(parse_qsl(splitted.query)),  # re-encode the query string
        splitted.fragment
    ))
Usage:
links = (
    "https://myanimelist.net/anime/27899/Tokyo_Ghoul_√A",
    "https://stackoverflow.com/",
    "https://www.google.com/search?q=√2&client=firefox-b-d",
    "http://pfarmerü.com/"
)
print(*(normalize_url(link) for link in links), sep="\n")
Output:
https://myanimelist.net/anime/27899/Tokyo_Ghoul_%E2%88%9AA
https://stackoverflow.com/
https://www.google.com/search?q=%E2%88%9A2&client=firefox-b-d
http://xn--pfarmer-t2a.com/
Instead of getting Python to read √ as itself, I would have to translate the √ to %E2%88%9A in order to get Python to output √.
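That round trip can be checked directly with quote()/unquote():

```python
from urllib.parse import quote, unquote

print(quote("√"))            # %E2%88%9A
print(unquote("%E2%88%9A"))  # √
```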
Credit: @Olvin Roght

Unicode String in urllib.request [duplicate]

This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' - -when using urlib.request python3
(2 answers)
Closed 3 years ago.
The short version: I have a variable s = 'bär'. I need to convert s to ASCII so that s = 'b%C3%A4r'.
Long version:
I'm using urllib.request.urlopen() to read an mp3 pronunciation file from URL. This has worked very well, except I ran into a problem because the URLs often contain unicode characters. For example, the German "Bär". The full URL is https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär. Indeed, typing this into Chrome as a URL works, and navigates me to the mp3 file without problems. However, feeding this same URL to urllib creates a problem.
I determined this was a unicode problem because the stack-trace reads:
Traceback (most recent call last):
File "importer.py", line 145, in <module>
download_file(tuple[1], tuple[0], ".mp3")
File "importer.py", line 81, in download_file
with urllib.request.urlopen(url) as in_stream, open(to_fname+ext, 'wb') as out_file: #`with object as name:` safely __enter__() and __exit__() the runtime of object. `as` assigns `name` as referring to the object `object`.
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
response = self._open(req, data)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
'_open', req)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
result = func(*args)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1283, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
self._send_request(method, url, body, headers)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 960, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 19: ordinal not in range(128)
... and other than the obvious UnicodeEncodeError, I can see it's trying to encode() to ASCII.
Interestingly, when I copied the URL from Chrome (instead of simply typing it into the Python interpreter), it translated the bär to b%C3%A4r. When I feed this to urllib.request.urlopen(), it processes fine, because all of these characters are ASCII. So my goal is to make this conversion within my program. I tried to get my original string to the unicode equivalent, but unicodedata.normalize() in all of its variants didn't work; further, I'm not sure how to store the Unicode as ASCII, given that Python 3 stores all strings as Unicode and thus makes no attempt to convert the text.
Use urllib.parse.quote:
>>> urllib.parse.quote('bär')
'b%C3%A4r'
>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
... urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
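Note that quote() leaves "/" unescaped by default (safe='/'), so a whole path can be quoted without destroying its structure; pass safe='' if you want every reserved character escaped:

```python
from urllib.parse import quote

print(quote("/tts/de/token/bär"))  # /tts/de/token/b%C3%A4r
print(quote("bär", safe=""))       # b%C3%A4r
```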

Urllib Unicode Error, no unicode involved

EDIT: I've majorly edited the content of this post since the original to specify my problem:
I am writing a program to download webcomics, and I'm getting this weird error when downloading a page of the comic. The code I am running essentially boils down to the following line followed by the error. I do not know what is causing this error, and it is confusing me greatly.
>>> urllib.request.urlopen("http://abominable.cc/post/47699281401")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 470, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 580, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 502, in error
result = self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 685, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.4/urllib/request.py", line 464, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 482, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1183, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1172, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 1014, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-38: ordinal not in range(128)
The entirety of my program can be found here: https://github.com/nstephenh/pycomic
I was having the same problem. The root cause is that the remote server isn't playing by the rules: HTTP headers are supposed to be US-ASCII only, but apparently the leading HTTP web servers (Apache, nginx) don't care and send UTF-8 encoded strings directly.
However, in http.client the parse_header function decodes the headers as ISO-8859-1, and the default HTTPRedirectHandler in urllib doesn't quote the Location or URI header, resulting in the aforementioned error.
I was able to work around both things by overriding the default HTTPRedirectHandler, adding three lines to counter the Latin-1 decoding and percent-quote the path:
import urllib.request
from urllib.error import HTTPError
from urllib.parse import (
    urlparse, quote, urljoin, urlunparse)

class UniRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen. Do this by adding a handler-specific
    # attribute to the Request object.
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI). Use first header.
        if "location" in headers:
            newurl = headers["location"]
        elif "uri" in headers:
            newurl = headers["uri"]
        else:
            return

        # fix a possible malformed URL
        urlparts = urlparse(newurl)

        # For security reasons we don't allow redirection to anything other
        # than http, https or ftp.
        if urlparts.scheme not in ('http', 'https', 'ftp', ''):
            raise HTTPError(
                newurl, code,
                "%s - Redirection to url '%s' is not allowed" % (msg, newurl),
                headers, fp)

        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
        else:
            urlparts = list(urlparts)
            # Headers should only contain US-ASCII chars, but some servers
            # send Unicode data that must be quoted before being reused.
            # Re-encode the string as ISO-8859-1 before calling quote() to
            # cancel the effect of parse_header() in http/client.py.
            urlparts[2] = quote(urlparts[2].encode('iso-8859-1'))
        newurl = urlunparse(urlparts)

        newurl = urljoin(req.full_url, newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return

        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                    len(visited) >= self.max_redirections):
                raise HTTPError(req.full_url, code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()

        return self.parent.open(new, timeout=req.timeout)

    http_error_301 = http_error_303 = http_error_307 = http_error_302

[...]

# Change the default redirect handler in urllib; this should be done once
# at the beginning of the program.
opener = urllib.request.build_opener(UniRedirectHandler())
urllib.request.install_opener(opener)
This is Python 3 code but should be easily adapted for Python 2 if need be.
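The Latin-1 round trip at the heart of those three lines can be checked in isolation (the redirect path here is a made-up example):

```python
from urllib.parse import quote

# The server sends a UTF-8 Location header; http.client decodes it as
# ISO-8859-1, producing a mojibake str:
mojibake = "/post/–redirect".encode("utf-8").decode("iso-8859-1")

# Re-encoding as ISO-8859-1 recovers the original UTF-8 bytes, which
# quote() then percent-encodes into a pure-ASCII path:
fixed = quote(mojibake.encode("iso-8859-1"))
print(fixed)  # /post/%E2%80%93redirect
```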

Nested text encodings in suds requests

Environment: Python 2.7.4 (partly on Windows, partly on Linux, see below), suds (SVN HEAD with minor modifications)
I need to call into a web service that takes a single argument, which is an XML string (yes, I know…), i.e. the request is declared in the WSDL with the following type:
<s:complexType>
<s:sequence>
<s:element minOccurs="0" maxOccurs="1" name="actionString" type="s:string"/>
</s:sequence>
</s:complexType>
I'm using cElementTree to construct this inner XML document, then I pass it as the only parameter to the client.service.ProcessAction(request) method that suds generates.
For a while, this worked okay:
root = ET.Element(u'ActionCommand')
value = ET.SubElement(root, u'value')
value.text = saxutils.escape(complex_value)
request = u'<?xml version="1.0" encoding="utf-8"?>\n' + ET.tostring(root, encoding='utf-8')
client.service.ProcessAction(request)
I had added the saxutils.escape at some point to fix the first encoding problems, pretty much without understanding why exactly I need it and what difference it makes.
Now (possibly due to the first occurence of the pound sign), I suddenly got the following exception:
Traceback (most recent call last):
File "/app/module.py", line 135, in _process_web_service_call
request = u'<?xml version="1.0" encoding="utf-8"?>\n' + ET.tostring(root, encoding='utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 137: ordinal not in range(128)
The position 137 here corresponds to the location of the special characters inside the inner XML request. Apparently, cElementTree.tostring() returns a str, not unicode, even when an encoding is given. So Python tries to decode this str into unicode (why with 'ascii'?) so that it can concatenate it with the unicode literal. This fails, of course, because the str is actually encoded in UTF-8, not ASCII.
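The same str-versus-text split is easier to see in Python 3, where ElementTree returns bytes for an explicit codec name and str only for encoding='unicode' (a sketch for illustration, not from the original question):

```python
import xml.etree.ElementTree as ET

root = ET.Element('ActionCommand')
ET.SubElement(root, 'value').text = '£10'

encoded = ET.tostring(root, encoding='utf-8')    # bytes, includes the XML declaration
decoded = ET.tostring(root, encoding='unicode')  # str, no declaration

print(type(encoded), type(decoded))
```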
So I figured, fine, I'll decode it to unicode myself then:
root = ET.Element(u'ActionCommand')
value = ET.SubElement(root, u'value')
value.text = saxutils.escape(complex_value)
request_encoded_str = ET.tostring(root, encoding='utf-8')
request_unicode = request_encoded_str.decode('utf-8')
request = u'<?xml version="1.0" encoding="utf-8"?>\n' + request_unicode
client.service.ProcessClientAction(request)
Except that now, it blows up inside suds, which tries to decode the outer XML request for some reason:
Traceback (most recent call last):
File "/app/module.py", line 141, in _process_web_service_call
raw_response = client.service.ProcessAction(request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/client.py", line 542, in __call__
return client.invoke(args, kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/client.py", line 602, in invoke
result = self.send(soapenv)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/client.py", line 643, in send
reply = transport.send(request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/transport/https.py", line 64, in send
return HttpTransport.send(self, request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/transport/http.py", line 118, in send
return self.invoke(request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/transport/http.py", line 153, in invoke
u2response = urlopener.open(u2request, timeout=tm)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 1181, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 973, in request
self._send_request(method, url, body, headers)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 1007, in _send_request
self.endheaders(body)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 969, in endheaders
self._send_output(message_body)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 827, in _send_output
msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 565: ordinal not in range(128)
The position 565 here again corresponds with the same character as above, except this time it's the location of my inner XML request embedded into the outer XML request (SOAP) created by suds.
I'm confused. Can anyone help me out of this mess? :)
To make matters worse, all of this only happens on the server under Linux. None of these raises an exception in my development environment on Windows. (Bonus points for an explanation as to why that is, just because I'm curious. I suspect it has to do with a different default encoding.) However, none of them are accepted by the server. What does work on Windows is if I drop the saxutils.escape and hand a proper unicode object to suds. This however still results in the same UnicodeDecodeError on Linux.
Update: I started debugging this on Windows (where it works fine), and in the line 827 of httplib.py, it indeed tries to concatenate the unicode object msg (containing the HTTP headers) and the str object message_body, leading to the implicit unicode decoding with the incorrect encoding. I guess it just happens to not fail on Windows for some reason. I don't understand why suds tries to send a str object when I put a unicode object in at the top.
This turned out to be more than absurd. I'm still understanding only small parts of the whole problem and situation, but I managed to solve my problem.
So let's trace it back: my last attempt was the most sane one, I believe. So let's start there:
msg += message_body
That line in Python's httplib.py tries to concatenate a unicode and a str object, which leads to an implicit .decode('ascii') of the str, even though the str is UTF8-encoded. Why is that? Because msg is a unicode object.
msg = "\r\n".join(self._buffer)
self._buffer is a list of HTTP headers. Inspecting that, only one header in there was unicode, 'infecting' the resulting string: the action and endpoint.
And there's the problem: I'm using unicode_literals from __future__ (makes it more future-proof, right? right???) and I'm passing my own endpoint into suds.
By just doing an .encode('utf-8') on the URL, all my problems went away. Even the whole saxutils.escape was no longer needed (even though it weirdly also didn't hurt).
tl;dr: make sure you're not passing any unicode objects anywhere into httplib or suds, I guess.
root = ET.Element(u'ActionCommand')
value = ET.SubElement(root, u'value')
value.text = complex_value
request = ET.tostring(root, encoding='utf-8').decode('utf-8')
client.service.ProcessAction(request)

python appengine unicodeencodeerror on search api snippeted results

I'm crawling pages and indexing them with appengine search api (Spanish and Catalan pages, with accented characters). I'm able to perform searches and make a page of results.
The problem arises when I try to use a query object with snippeted_fields, as it always generates a UnicodeEncodeError:
File "/home/otger/python/jobs-gae/src/apps/search/handlers/results.py", line 82, in find_documents
return index.search(query_obj)
File "/opt/google_appengine_1.7.6/google/appengine/api/search/search.py", line 2707, in search
apiproxy_stub_map.MakeSyncCall('search', 'Search', request, response)
File "/opt/google_appengine_1.7.6/google/appengine/api/apiproxy_stub_map.py", line 94, in MakeSyncCall
return stubmap.MakeSyncCall(service, call, request, response)
File "/opt/google_appengine_1.7.6/google/appengine/api/apiproxy_stub_map.py", line 320, in MakeSyncCall
rpc.CheckSuccess()
File "/opt/google_appengine_1.7.6/google/appengine/api/apiproxy_rpc.py", line 156, in _WaitImpl
self.request, self.response)
File "/opt/google_appengine_1.7.6/google/appengine/ext/remote_api/remote_api_stub.py", line 200, in MakeSyncCall
self._MakeRealSyncCall(service, call, request, response)
File "/opt/google_appengine_1.7.6/google/appengine/ext/remote_api/remote_api_stub.py", line 234, in _MakeRealSyncCall
raise pickle.loads(response_pb.exception())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 52: ordinal not in range(128)
I've found a similar question on Stack Overflow: GAE Full Text Search development console UnicodeEncodeError, but it says that it was a bug fixed in 1.7.0. I get the same error using both versions 1.7.5 and 1.7.6.
When Indexing pages I add two fields: description and description_ascii. If I try to generate snippets for description_ascii it works perfectly.
Is it possible to generate snippets of non-ASCII content on dev_appserver?
I think this is a bug; I reported a new defect issue: https://code.google.com/p/googleappengine/issues/detail?id=9335.
Temporary solution for the dev server: locate the google.appengine.api.search module (search.py) and patch the function _DecodeUTF8 by adding an inline if, like this:
def _DecodeUTF8(pb_value):
    """Decodes a UTF-8 encoded string into unicode."""
    if pb_value is not None:
        return pb_value.decode('utf-8') if not isinstance(pb_value, unicode) else pb_value
    return None
Workaround - until the issue is solved, implement the snippet functionality yourself, assuming the field which is the base for the snippet is called snippet_base:
query = search.Query(query_string=query_string,
                     options=search.QueryOptions(
                         ...
                         returned_fields=[... 'snippet_base' ...]
                     ))
results = search.Index(name="<index-name>").search(query)
if results:
    for res in results.results:
        res.snippet = some_snippeting_function(res.field("snippet_base"))