I have a web crawler that gets a lot of these errors:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)
To mitigate these errors I have implemented a function that encodes them like this:
def properEncode(url):
    url = url.replace("ø", "%C3%B8")
    url = url.replace("å", "%C3%A5")
    url = url.replace("æ", "%C3%A6")
    url = url.replace("é", "%C3%A9")
    url = url.replace("Ø", "%C3%98")
    url = url.replace("Å", "%C3%85")
    url = url.replace("Æ", "%C3%86")
    url = url.replace("í", "%C3%AD")
    return url
These are based on this table: http://www.utf8-chartable.de/
The conversion I am doing seems to be percent-encoding the UTF-8 bytes in hex? Is there a Python function to do this automatically?
You are URL-encoding them. You can do so trivially with the urllib.parse.quote() function:
>>> from urllib.parse import quote
>>> quote("ø")
'%C3%B8'
or, put into a function, to fix only the URL path of a given URL (as this encoding doesn't apply to the host portion, for example):
from urllib.parse import quote, urlparse

def properEncode(url):
    parts = urlparse(url)
    path = quote(parts.path)
    return parts._replace(path=path).geturl()
This limits the encoding to just the path portion of the URL. If you need to encode the query string as well, use the quote_plus() function, since query parameters encode spaces as a plus instead of %20 (and apply it to the query portion of the URL); a combined sketch follows.
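For completeness, here is a minimal sketch of such a combined function (the properEncodeAll name and the safe='=&' argument, which keeps the key=value&... separators intact, are my additions, not part of the answer above):

from urllib.parse import quote, quote_plus, urlparse

def properEncodeAll(url):
    # Percent-encode the path and the query string separately;
    # quote_plus() encodes spaces as '+', as query strings expect.
    parts = urlparse(url)
    return parts._replace(
        path=quote(parts.path),
        query=quote_plus(parts.query, safe='=&'),
    ).geturl()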
I use the requests module in Python to fetch the result of a web page. However, I found that if the URL includes the character à, it raises a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 27: invalid continuation byte
Strangely, this only happens if I also add a space in the URL. So for example, the following does not issue an error.
requests.get("http://myurl.com/àieou")
However, the following does:
requests.get("http://myurl.com/àienah aie")
Why does it happen and how can I make the request correctly?
Use the urllib library to auto-encode the characters (in Python 3 quote_plus lives in urllib.parse):

import urllib.parse
import requests

requests.get("http://myurl.com/" + urllib.parse.quote_plus("àieou"))
Use quote_plus().
from urllib.parse import quote_plus
requests.get("http://myurl.com/" + quote_plus("àienah aie"))
You can try to URL-encode your value by hand:
requests.get("http://myurl.com/%C3%A0ieou")
The value for à is %C3%A0 once encoded.
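If you'd rather not look the sequence up by hand, quote() will compute it for you (a quick illustration):

>>> from urllib.parse import quote
>>> quote("à")
'%C3%A0'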
I'm working on a new project but I can't fix the error in the title.
Here's the code:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code)
    print(info)

start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
The error occurred because of .encode, which works on a unicode object. So we need to convert the byte string to a unicode string first, using .decode('unicode_escape'). So the code will be:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code.decode('unicode_escape'))
    print(info)

start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
Try this:
source_code = urllib.request.urlopen(url).read().decode('utf-8')
The error message is self-explanatory: there is a byte 0xf0 in an input string that is expected to be an ASCII string.
You should have given the exact error message and on what line it happened, but I can guess that it happened on info = urllib.parse.parse_qs(source_code), because parse_qs expects either a unicode string or an ASCII byte string.
The first question is why you call parse_qs on data coming from YouTube, because the doc for the Python Standard Library says:
Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name.
So you are going to split the data on the = and & characters, interpreting it as a query string of the form key1=value11&key2=value2&key1=value12, to give {'key1': ['value11', 'value12'], 'key2': ['value2']}.
If you know why you want that, you should first decode the byte string into a unicode string, using the proper encoding, or, if unsure, Latin-1, which is able to accept any byte:
def start(url):
    source_code = urllib.request.urlopen(url).read().decode('latin1')
    info = urllib.parse.parse_qs(source_code)
    print(info)
This code is rather weird indeed. You are using a query-string parser to parse the contents of a web page. So instead of parse_qs you should be using an HTML parser, something like the sketch below.
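For illustration, here is a minimal sketch with the standard library's html.parser (the original answer linked to a tool without naming it here, so the choice of parser and the title-extraction task are my assumptions):

from html.parser import HTMLParser
from urllib.request import urlopen

class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
html = urlopen('https://www.youtube.com/watch?v=YfRLJQlpMNw').read()
parser.feed(html.decode('utf-8'))
print(parser.title)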
I am trying to make a crawler in Python by following a Udacity course. I have this method get_page() which returns the content of the page.
from urllib.request import urlopen

def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''
    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')
The original method just returned data.read(), but that way I could not do operations like str.find(). After a quick search I found out I need to decode the data. But now I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I have found similar questions on SO but none of them were specifically about this. Please help.
You are trying to decode an invalid byte sequence.
In UTF-8, a single-byte character must be in the range 0x00 to 0x7F, and the first byte of a multi-byte sequence must be between 0xC2 and 0xF4.
So 0x8B, which is a continuation byte, is definitely invalid as a start byte.
From RFC3629 Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
You should post the string you are trying to decode.
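For what it's worth, the failure is easy to reproduce in isolation:

>>> b'\x8b'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 0: invalid start byte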
Maybe the page is encoded with a character encoding other than 'utf-8', so the start byte is invalid. You could do this:
import urllib.request

def get_page(self, url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    # Read once: the response stream cannot be re-read after a failed decode.
    data = response.read()
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data
Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:
Content-Type: text/html; charset=UTF-8
We can inspect the content of this header to find the encoding to use to decode the page:
from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""
    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    # Default to Latin-1, which can decode any byte sequence.
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
    print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
Using requests is similar:
response = requests.get(url)
content_type = response.headers.get('content-type', '')
Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
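With requests you can also let the library guess from the body when the header is missing; apparent_encoding runs a character-detection library over the raw bytes. A minimal sketch (the fallback policy here is my own):

import requests

url = 'http://example.com/'  # any page
response = requests.get(url)
# Fall back to detection when the header carries no charset.
if 'charset' not in response.headers.get('content-type', ''):
    response.encoding = response.apparent_encoding
html = response.text  # decoded with the chosen encoding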
Trying to retrieve some data from the web using urllib and lxml, I've got an error and have no idea how to fix it.
url='http://sum.in.ua/?swrd=автор'
page = urllib.request.urlopen(url)
The error itself:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)
I'm using Ukrainian in the URL this time, but when I use a URL without any Ukrainian letters in it, like here:
url="http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')
it gets me the data in Ukrainian and everything works just fine.
URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function can convert a dictionary (or a sequence of key-value pairs) for use in URLs:
from urllib.parse import urlencode
import urllib.request

url = 'http://sum.in.ua/'
parameters = {'swrd': 'автор'}
url = '{}?{}'.format(url, urlencode(parameters))
page = urllib.request.urlopen(url)
This will first encode the parameters to UTF-8 bytes, then convert those bytes to percent-encoding sequences:
>>> from urllib.parse import urlencode
>>> parameters = {'swrd': 'автор'}
>>> urlencode(parameters)
'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string, parse it into a dictionary, then pass it to urlencode to put it back into the URL, using urllib.parse.urlparse() and urllib.parse.parse_qs():
from urllib.parse import urlparse, parse_qs, urlencode
url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)
parameters = parse_qs(parsed_url.query)
url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
This splits the URL into its constituent parts, parses out the query string, re-encodes and re-builds the URL afterwards:
>>> from urllib.parse import urlparse, parse_qs, urlencode
>>> url = 'http://sum.in.ua/?swrd=автор'
>>> parsed_url = urlparse(url)
>>> parameters = parse_qs(parsed_url.query)
>>> parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
'http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
I believe you can do something like below:

import requests
from urllib.parse import quote

url = 'http://sum.in.ua/'
q = 'swrd=автор'
requests.get(url + "?" + quote(q, safe='='))

I think quote(q, safe='=') will transform "swrd=автор" into "swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80" (the safe='=' keeps the key=value separator unescaped),
which should be accepted just fine.
I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed by a URL containing non-ascii characters using Python?
edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
# Python 2 (the question uses urllib2)
import re, urlparse

def urlEncodeNonAscii(b):
    # Percent-encode every non-ASCII byte in a UTF-8 byte string.
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA-encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
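For reference, a Python 3 sketch of the same idea that splits the netloc properly, so only the hostname gets the IDNA treatment (this adaptation is mine, not part of the original answer, and it ignores any user:pass@ userinfo for brevity):

from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri):
    parts = urlsplit(iri)
    # IDNA-encode only the hostname; re-attach the port untouched.
    host = parts.hostname.encode('idna').decode('ascii')
    netloc = host if parts.port is None else '{}:{}'.format(host, parts.port)
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path),
        quote(parts.query, safe='=&'),
        quote(parts.fragment),
    ))

print(iri_to_uri('http://www.aıb.com/aıb'))
# -> http://www.xn--ab-hpa.com/a%C4%B1b (uppercase hex, as quote() produces)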
In Python 3, use the urllib.parse.quote function on the non-ASCII string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use urllib.parse.urlsplit to split the URL into its components, urllib.parse.quote to properly quote/escape the unicode characters, and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than the accepted answer from @bobince suggests:
netloc should be encoded using IDNA;
non-ASCII URL path should be encoded to UTF-8 and then percent-escaped;
non-ASCII query parameters should be encoded to the encoding of the page the URL was extracted from (or to the encoding the server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check if a URL escaping implementation is incorrect/incomplete is to check if it provides 'page encoding' argument or not.
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query)
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
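A quick usage check, matching the IDNA and percent-escaping behaviour shown above:

>>> iri2uri('http://bücher.ch/päth')
'http://xn--bcher-kva.ch/p%C3%A4th'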
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
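In Python 3 those two steps collapse into one call, since quote() encodes str input as UTF-8 before percent-escaping (a quick illustration):

>>> from urllib.parse import quote
>>> quote('\u0131')
'%C4%B1'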
Use the iri2uri function from httplib2. It does the same thing as @bobince's answer (is he/she the author of that?).
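A minimal usage sketch, assuming the function is importable from httplib2's iri2uri module as in current releases; it should IDNA-encode the host and percent-escape the rest:

from httplib2.iri2uri import iri2uri

print(iri2uri('http://bücher.ch/päth'))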
Another option to convert an IRI to an ASCII URI is to use the furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
Works, finally!
I could not avoid these strange characters, but in the end I came through it.
import urllib.request
import os

url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
os.system("marketingturismo.html")