Trying to retrieve some data from the web using urllib and lxml, I got an error and have no idea how to fix it.
import urllib.request
url = 'http://sum.in.ua/?swrd=автор'
page = urllib.request.urlopen(url)
The error itself:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)
I'm using Ukrainian in the URL this time, but when I use a URL without any Ukrainian letters in it, like here:
url="http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[#class="MsoNormal"]//text()')
it gets me the data in Ukrainian and everything works just fine.
URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function can convert a dictionary (or a sequence of key-value pairs) into a query string suitable for use in URLs:
import urllib.request
from urllib.parse import urlencode

url = 'http://sum.in.ua/'
parameters = {'swrd': 'автор'}
url = '{}?{}'.format(url, urlencode(parameters))
page = urllib.request.urlopen(url)
This will first encode the parameters to UTF-8 bytes, then convert those bytes to percent-encoding sequences:
>>> from urllib.parse import urlencode
>>> parameters = {'swrd': 'автор'}
>>> urlencode(parameters)
'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string, parse it into a dictionary, then pass it to urlencode() to put it back into the URL, using urllib.parse.urlparse() and urllib.parse.parse_qs():
from urllib.parse import urlparse, parse_qs, urlencode
url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)
parameters = parse_qs(parsed_url.query)
url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
This splits the URL into its constituent parts, parses out the query string, re-encodes and re-builds the URL afterwards:
>>> from urllib.parse import urlparse, parse_qs, urlencode
>>> url = 'http://sum.in.ua/?swrd=автор'
>>> parsed_url = urlparse(url)
>>> parameters = parse_qs(parsed_url.query)
>>> parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
'http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
I believe you can do something like the following:
url = 'http://sum.in.ua/'
q = 'swrd=автор'
from urllib.parse import quote
import requests
requests.get(url + "?" + quote(q, safe='='))
I think quote will transform "swrd=автор" into "swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80" (the safe='=' keeps the equals sign intact), which should be accepted just fine.
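A quick REPL check (Python 3; assumes the safe='=' argument used above):
>>> from urllib.parse import quote
>>> quote('swrd=автор', safe='=')
'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'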
Related
I have been trying to figure out how to use python-requests to send a request such that the URL looks like:
http://example.com/api/add.json?name='hello'&data[]='hello'&data[]='world'
Normally I can build a dictionary and do:
data = {'name': 'hello', 'data': 'world'}
response = requests.get('http://example.com/api/add.json', params=data)
That works fine for almost everything I do. However, I have hit the URL structure above, and I am not sure how to produce that in Python without manually building strings. I can do that, but would rather not.
Is there something in the requests library I am missing, or some Python feature I am unaware of?
Also, what do you even call that type of parameter, so I can better google it?
All you need to do is put the values in a list and make the key a list-style string:
data = {'name': 'hello', 'data[]': ['hello', 'world']}
response = requests.get('http://example.com/api/add.json', params=data)
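requests then builds the repeated, percent-encoded parameter for you. You can check the resulting URL without hitting the network by preparing the request (example.com is the placeholder endpoint from the question):
>>> from requests import Request
>>> data = {'name': 'hello', 'data[]': ['hello', 'world']}
>>> Request('GET', 'http://example.com/api/add.json', params=data).prepare().url
'http://example.com/api/add.json?name=hello&data%5B%5D=hello&data%5B%5D=world'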
What you are doing is correct. The resulting URL is the same as what you are expecting.
>>> payload = {'name': 'hello', 'data': 'hello'}
>>> r = requests.get("http://example.com/api/params", params=payload)
You can see the resulting URL:
>>> print(r.url)
http://example.com/api/params?name=hello&data=hello
According to the URL format, encoding the query string uses the following rules (a quick REPL check follows the list):
Letters (A–Z and a–z), numbers (0–9) and the characters ., -, ~ and _ are left as-is
SPACE is encoded as + or %20
All other characters are encoded as %HH hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
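Checking those rules with quote_plus, the query-string variant (~ is left as-is on Python 3.7+):
>>> from urllib.parse import quote_plus
>>> quote_plus('a b.-~_[]')
'a+b.-~_%5B%5D'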
So array[] will not pass through as-is; the brackets are automatically percent-encoded according to these rules:
If you build a URL like:
http://example.com/api/add.json?name='hello'&data[]='hello'&data[]='world'
the output will be:
>>> payload = {'name': 'hello', "data[]": 'hello','data[]':'world'}
>>> r = requests.get("http://example.com/api/params", params=payload)
>>> r.url
u'http://example.com/api/params?data%5B%5D=world&name=hello'
This is because the duplicate dictionary key is collapsed to its last value in the URL, and data[] is percent-encoded as data%5B%5D.
If data%5B%5D is not a problem (i.e. the server is able to parse it correctly), then you can go ahead with it.
If using the requests module is not compulsory, one solution is the urllib/urllib2 combination (Python 2):
import urllib
import urllib2

payload = [('name', 'hello'), ('data[]', ('hello', 'world'))]
params = urllib.urlencode(payload, doseq=True)
sampleRequest = urllib2.Request('http://example.com/api/add.json?' + params)
response = urllib2.urlopen(sampleRequest)
It's a little more verbose and uses the doseq(uence) trick to encode the URL parameters, but I used it before I knew about the requests module.
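To illustrate what doseq does (shown with the Python 3 spelling of urlencode; the behavior is the same):
>>> from urllib.parse import urlencode
>>> urlencode([('name', 'hello'), ('data[]', ('hello', 'world'))], doseq=True)
'name=hello&data%5B%5D=hello&data%5B%5D=world'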
For the requests module, the answer provided by @Tomer should work.
Some API servers expect a JSON array as the value in the URL query string. The requests params argument doesn't create a JSON array as a parameter value.
The way I fixed this on a similar problem was to use urllib.parse.urlencode to encode the query string, add it to the URL, and pass that to requests, e.g.:
from urllib.parse import urlencode

query_str = urlencode(params)
url = base_url + "?" + query_str  # base_url, params and headers defined elsewhere
response = requests.get(url, params={}, headers=headers)
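For instance, serializing a hypothetical data parameter with json.dumps before encoding sketches the idea:
>>> import json
>>> from urllib.parse import urlencode
>>> urlencode({'data': json.dumps(['hello', 'world'])})
'data=%5B%22hello%22%2C+%22world%22%5D'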
The solution is simply to use the well-known urlencode function:
>>> import urllib.parse
>>> params = {'q': 'Python URL encoding', 'as_sitesearch': 'www.urlencoder.io'}
>>> urllib.parse.urlencode(params)
'q=Python+URL+encoding&as_sitesearch=www.urlencoder.io'
I am trying to get a JSON response from the link used as a parameter to the urllib request, but it gives me an error that the URL can't contain control characters. How can I solve the issue?
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(start_url).read()
The error I get is:
URL can't contain control characters. '/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq=' (found at least ' ')
If the problem is the whitespace, replace it with %20:
url = url.replace(" ", "%20")
Spaces are not allowed in URLs. I removed them and it seems to be working now:
import urllib.request
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
url = start_url.replace(" ","")
source = urllib.request.urlopen(url).read()
Solr search strings can get pretty weird. Better to use the quote function to encode characters before making the request, keeping the URL delimiters in the safe set so the URL structure survives. See the example below:
import urllib.request
from urllib.parse import quote
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(quote(start_url, safe=":/?&=")).read()
Better late than never...
You probably already found this out by now, but let's get it written here.
There can't be any space characters in the URL, and there are two, after bundle_fq and dm_field_deadlineTo_fq.
Remove those and you're good to go.
As the error message says, there are control characters in your URL, which doesn't seem to be a valid one, by the way.
You need to encode those characters inside the URL; in particular, spaces need to be encoded as %20.
Parsing the URL first and then quoting its elements works:
import urllib.request
from urllib.parse import urlparse, quote
def make_safe_url(url: str) -> str:
    """
    Returns a parsed and quoted url.
    """
    _url = urlparse(url)
    # Quote the path; quote the query too, but keep '=' and '&' so its structure survives.
    url = _url.scheme + "://" + _url.netloc + quote(_url.path) + "?" + quote(_url.query, safe="=&")
    return url
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
start_url = make_safe_url(start_url)
source = urllib.request.urlopen(start_url).read()
The code returns the JSON document despite the double forward slash and the whitespace in the URL.
I have a web crawler that gets a lot of these errors:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)
To mitigate these errors, I have implemented a function that encodes them like this:
def properEncode(url):
    url = url.replace("ø", "%C3%B8")
    url = url.replace("å", "%C3%A5")
    url = url.replace("æ", "%C3%A6")
    url = url.replace("é", "%C3%A9")
    url = url.replace("Ø", "%C3%98")
    url = url.replace("Å", "%C3%85")
    url = url.replace("Æ", "%C3%86")
    url = url.replace("í", "%C3%AD")
    return url
These are based on this table: http://www.utf8-chartable.de/
The conversion I am doing seems to convert them to UTF-8 hex. Is there a Python function to do this automatically?
You are URL encoding them. You can do so trivially with the urllib.parse.quote() function:
>>> from urllib.parse import quote
>>> quote("ø")
'%C3%B8'
or, put into a function, to fix only the URL path of a given URL (as this encoding doesn't apply to the host portion, for example):
from urllib.parse import quote, urlparse

def properEncode(url):
    parts = urlparse(url)
    path = quote(parts.path)
    return parts._replace(path=path).geturl()
This limits the encoding to just the path portion of the URL. If you need to encode the query string, use the quote_plus() function instead, as query parameters replace spaces with a plus rather than %20 (and handle the query portion of the URL separately).
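A quick comparison of the two standard-library functions:
>>> from urllib.parse import quote, quote_plus
>>> quote('søk ord')
's%C3%B8k%20ord'
>>> quote_plus('søk ord')
's%C3%B8k+ord'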
I wrote a simple script that just takes a webpage and extracts its contents into a tokenized list. However, I'm running into an issue where, when I convert the BeautifulSoup object to a string, the UTF-8 characters for quotes, apostrophes, etc. won't convert. Instead, they remain in Unicode escape form.
I'm defining the source as UTF-8 when I create the BeautifulSoup object, and I've even tried running a Unicode conversion separately, but nothing works. Does anyone have any idea why this is happening?
from urllib2 import urlopen
from bs4 import BeautifulSoup
import nltk, re, pprint
url = "http://www.bloomberg.com/news/print/2013-07-05/softbank-s-21-6-billion-bid-for- sprint-approved-by-u-s-.html"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, fromEncoding="UTF-8")
result = soup.find_all(id="story_content")
str_result = str(result)
notag = re.sub("<.*?>", " ", str_result)
output = nltk.word_tokenize(notag)
print(output)
The characters you're having trouble with aren't " (U+0022) and ' (U+0027), they're curly quotes “ (U+201C) and ” (U+201D) and ’ (U+2019). Convert those to their straight versions first, and you should get the results you're expecting:
raw = urlopen(url).read()
original = raw.decode('utf-8')
replacement = original.replace('\u201c', '"').replace('\u201d', '"').replace('\u2019', "'")
soup = BeautifulSoup(replacement) # Don't need fromEncoding if we're passing in Unicode
That should get the quote characters into the form you're expecting.
I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed by a URL containing non-ascii characters using Python?
edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
import re, urlparse
def urlEncodeNonAscii(b):
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
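A minimal Python 3 sketch of that construction-time approach, reusing the example hostname and path from above (and assuming no user:pass@ prefix or :port suffix in the netloc):
from urllib.parse import quote

host = 'www.a\u0131b.com'.encode('idna').decode('ascii')  # IDNA applies only to the hostname
path = quote('/a\u0131b')  # UTF-8 plus percent-encoding for the path
url = 'http://{}{}'.format(host, path)
# 'http://www.xn--ab-hpa.com/a%C4%B1b'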
In python3, use the urllib.parse.quote function on the non-ascii string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use
urllib.parse.urlsplit to split the URL into its components, and
urllib.parse.quote to properly quote/escape the unicode characters
and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than the accepted @bobince's answer suggests:
netloc should be encoded using IDNA;
non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
non-ascii query parameters should be encoded to the encoding of a page URL was extracted from (or to the encoding server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check whether a URL escaping implementation is incorrect/incomplete is to check whether it provides a 'page encoding' argument or not.
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote
def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query, safe='=&')  # keep the query structure intact
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
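Usage, with the same example IRI as in the earlier answer:
>>> iri2uri('http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%C4%B1b'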
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
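In Python 3 terms, that two-step recipe looks like this (quote() accepts bytes, so the UTF-8 step can be made explicit):
>>> from urllib.parse import quote
>>> quote('автор'.encode('utf-8'))
'%D0%B0%D0%B2%D1%82%D0%BE%D1%80'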
Use the iri2uri method of httplib2. It does the same thing as @bobince's answer (is he/she the author of that?).
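A small usage sketch, assuming the iri2uri module that ships inside the httplib2 package:
>>> from httplib2.iri2uri import iri2uri
>>> iri2uri(u'http://example.org/Ñöñ-ÅŞÇİİ/')
'http://example.org/%C3%91%C3%B6%C3%B1-%C3%85%C5%9E%C3%87%C4%B0%C4%B0/'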
Another option to convert an IRI to an ASCII URI is to use furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
It works, finally!
I could not avoid these strange characters, but in the end I came through it.
import urllib.request
import os
url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
os.system("marketingturismo.html")