python url decode %E3

I got some Wikipedia URLs from a Freebase dump:
url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa
url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa
They both refer to the same page on wikipedia:
url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa
urllib.unquote works on URL 1:
url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)
url = urllib.unquote(url)
print url
The result is:
Pedro_Miguel_de_Castro_Brandão_Costa
but it does not work on URL 2.
url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url
The result is:
Pedro_Miguel_de_Castro_Brand�o_Costa
Is something wrong?

The former is doubly percent-encoded UTF-8, which prints out fine since your terminal uses UTF-8. The latter is percent-encoded Latin-1, which needs an explicit decode first.
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand�o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
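In Python 3, `urllib.parse.unquote` takes an `encoding` argument, so both forms can be handled without a separate decode step. A minimal sketch:

```python
from urllib.parse import unquote

# URL 1 is percent-encoded twice; the second unquote decodes the UTF-8 bytes.
u1 = unquote(unquote('Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'))

# URL 2 is percent-encoded Latin-1, so tell unquote which charset to use.
u2 = unquote('Pedro_Miguel_de_Castro_Brand%E3o_Costa', encoding='latin-1')

print(u1 == u2 == 'Pedro_Miguel_de_Castro_Brandão_Costa')  # True
```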

Related

I get InvalidURL: URL can't contain control characters when I try to send a request using urllib

I am trying to get a JSON response from the link used as a parameter to the urllib request, but it gives me an error saying the URL can't contain control characters. How can I solve this issue?
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(start_url).read()
The error I get is:
URL can't contain control characters. '/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq=' (found at least ' ')
If the problem is the whitespace, replace it with its percent-encoded form:
url = url.replace(" ", "%20")
Spaces are not allowed in a URL. I removed them and it seems to work now:
import urllib.request
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
url = start_url.replace(" ","")
source = urllib.request.urlopen(url).read()
Solr search strings can get pretty weird. Better to use the quote function to encode characters before making the request; keep the URL delimiters in the safe set so the scheme and query structure survive. See the example below:
import urllib.request
from urllib.parse import quote
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(quote(start_url, safe=':/?&=')).read()
Better late than never...
You probably already found out by now, but let's get it written down here.
There can't be any space characters in the URL, and there are three: after bundle_fq, after sm_vid_Sectors_fq=, and after dm_field_deadlineTo_fq.
Remove those and you're good to go.
Like the error message says, there are control characters (here, spaces) in your URL, which is not valid as it stands. You need to encode them; spaces in particular must be encoded as %20. Parsing the URL first and then quoting its components works:
import urllib.request
from urllib.parse import urlparse, quote

def make_safe_url(url: str) -> str:
    """Return the URL with its path and query percent-quoted."""
    parts = urlparse(url)
    # Keep '=' and '&' unencoded so the query-parameter structure survives.
    return (parts.scheme + "://" + parts.netloc
            + quote(parts.path) + "?" + quote(parts.query, safe="=&"))
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
start_url = make_safe_url(start_url)
source = urllib.request.urlopen(start_url).read()
The code returns the JSON document despite the double forward slash and the whitespace in the URL.

Open file as string instead of bytes in python

I am working on a project that reads a URL which contains an ICS (iCalendar) file. Instead of reading it as a string, it prints as bytes; I need some advice on this.
import requests
url = "http://ical.keele.ac.uk/index.php/ical/ical/15021113"
c = requests.get(url)
c.encoding = 'ISO-8859-1'
print(c.content)
Expected return
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//hacksw/handcal//NONSGML v1.0//EN
BEGIN:VEVENT
Actual return
b"BEGIN:VCALENDAR\rVERSION:2.0\rPRODID:-//hacksw/handcal//NONSGML v1.0//EN\rBEGIN:VEVENT\r
I have tried using the ICS file directly and it works without any problems, but when I request it from the URL it doesn't work. Thanks.
Delirious Lettuce is right, just use text:
http://docs.python-requests.org/en/master/user/quickstart/#response-content
import requests
url = "http://ical.keele.ac.uk/index.php/ical/ical/15021113"
c = requests.get(url)
#c.encoding = 'ISO-8859-1'
#print(c.content)
print(c.text[:10])
results in
BEGIN:VCAL
(3.6.1 32 bit windows)
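The underlying difference is just bytes versus str: content gives you the raw bytes, while text decodes them for you. If you only have the bytes, the manual equivalent is a plain decode call. A small sketch using a made-up payload:

```python
# Hypothetical raw payload, as requests' .content would return it (bytes).
raw = b"BEGIN:VCALENDAR\rVERSION:2.0\r"

# .text is roughly this: decode the bytes with the declared/detected charset.
text = raw.decode("ISO-8859-1")

print(type(raw).__name__, type(text).__name__)  # bytes str
```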

Python 3.4.0 -- xpath -- gets me empty list [duplicate]

Trying to retrieve some data from the web using urllib and lxml, I've got an error and have no idea how to fix it.
url='http://sum.in.ua/?swrd=автор'
page = urllib.request.urlopen(url)
The error itself:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)
I'm using Ukrainian in the URL this time, but when I use a URL without any Ukrainian letters in it:
url="http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')
it gets me the data in Ukrainian and everything works just fine.
URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function can convert a dictionary (or a sequence of key-value pairs) for use in URLs:
from urllib.parse import urlencode
url = 'http://sum.in.ua/'
parameters = {'swrd': 'автор'}
url = '{}?{}'.format(url, urlencode(parameters))
page = urllib.request.urlopen(url)
This will first encode the parameters to UTF-8 bytes, then convert those bytes to percent-encoding sequences:
>>> from urllib.parse import urlencode
>>> parameters = {'swrd': 'автор'}
>>> urlencode(parameters)
'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string, parse it into a dictionary, then pass it to urlencode to put it back into the URL, using urllib.parse.urlparse() and urllib.parse.parse_qs():
from urllib.parse import urlparse, parse_qs, urlencode
url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)
parameters = parse_qs(parsed_url.query)
url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
This splits the URL into its constituent parts, parses out the query string, re-encodes and re-builds the URL afterwards:
>>> from urllib.parse import urlparse, parse_qs, urlencode
>>> url = 'http://sum.in.ua/?swrd=автор'
>>> parsed_url = urlparse(url)
>>> parameters = parse_qs(parsed_url.query)
>>> parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
'http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
I believe you can do something like the below (note safe='=' so the key=value separator is not encoded as well):
from urllib.parse import quote
import requests

url = 'http://sum.in.ua/'
q = 'swrd=автор'
requests.get(url + "?" + quote(q, safe='='))
quote will transform "swrd=автор" into "swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80",
which should be accepted just fine.

encoding URL into a format readable by google

I would like to know how I can encode a URL so that I can pass it as a parameter to Google Maps.
https://maps.google.com/maps?q=https:%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml&hl=en&sll=32.824552,-117.108978&sspn=0.889745,1.575165&t=m&z=4
As you can see, the URL is being passed as a parameter to the maps.google.com URL.
The question is: if I start with a URL like http://www.microsoft.com/somemap.kml, how can I encode it so that I can pass it in?
The process is called URL encoding:
>>> urllib.quote('https://dl.dropbox.com/u/94943007/file.kml', '')
'https%3A%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml'
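That example is Python 2; in Python 3 the same function lives in urllib.parse. The empty safe string matters, because quote leaves / alone by default:

```python
from urllib.parse import quote

kml = 'https://dl.dropbox.com/u/94943007/file.kml'

# safe='' forces ':' and '/' to be encoded too, as a query parameter needs.
print(quote(kml, safe=''))
# https%3A%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml
```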

Url open encoding

I have the following code for urllib and BeautifulSoup:
getSite = urllib.urlopen(pageName)  # open the current site
getSitesoup = BeautifulSoup(getSite.read())  # parse the site content
print getSitesoup.originalEncoding
for value in getSitesoup.find_all('link'):  # extract all <link> tags
    defLinks.append(value.get('href'))
The result of it:
/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
"Some characters could not be decoded, and were "
And when I try to read the site I get:
�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z
The page is in UTF-8, but the server is sending it to you in a compressed format:
>>> print getSite.headers['content-encoding']
gzip
You'll need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data (by default it does not expect a gzip header), but writing the data to a file and using gzip.open() to read from it worked fine.
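For the record, zlib can decompress a gzip stream directly if you pass wbits=16 + zlib.MAX_WBITS, which switches it into gzip-header mode. A self-contained sketch (not the original response data):

```python
import gzip
import zlib

original = b"<html>hello</html>"
blob = gzip.compress(original)  # stand-in for the compressed response body

# 16 + MAX_WBITS tells zlib to expect (and skip) the gzip header.
data = zlib.decompress(blob, 16 + zlib.MAX_WBITS)
print(data == original)  # True
```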
BeautifulSoup works with Unicode internally; it'll try to decode non-Unicode responses as UTF-8 by default. It looks like the site you are trying to load uses a different encoding; for example, it could be UTF-16 instead:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('utf-16-le')
뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽
It could be mac_cyrillic too:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('mac_cyrillic')
пњљ7пњљeпњљпњљпњљпњљ0*"IяЈпњљGпњљHпњљпњљпњљпњљFпњљпњљпњљпњљпњљпњљ9-пњљпњљпњљпњљпњљпњљ;пњљпњљEпњљY√ЮBsпњљпњљпњљпњљпњљпњљпњљпњљпњљгФґ?пњљ4iпњљпњљпњљ)пњљпњљпњљпњљпњљ^Wпњљпњљпњљпњљпњљ`wпњљKeпњљпњљ%пњљпњљ*9пњљ.'OQBпњљпњљпњљVпњљпњљ#пњљпњљпњљпњљпњљ]пњљпњљпњљ(Pпњљпњљ^пњљпњљqпњљ$пњљS5пњљпњљпњљtT*пњљZ
But I have way too little information about what kind of site you are trying to load, and I can't read the output of either encoding. :-)
You'll need to decode the response body before passing it to BeautifulSoup:
getSite = urllib.urlopen(pageName).read().decode('utf-16')
Generally, the website will return what encoding was used in the headers, in the form of a Content-Type header (probably text/html; charset=utf-16 or similar).
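One stdlib way to pull the charset out of such a header value, shown here on a hypothetical header, is the email.message parser (the same machinery http.client uses for headers):

```python
from email.message import Message

# Hypothetical Content-Type header taken from a response.
msg = Message()
msg["Content-Type"] = "text/html; charset=utf-16"

print(msg.get_content_charset())  # utf-16
```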
I ran into the same problem, and as Leonard mentioned, it was due to a compressed format.
This solved it for me: add ('Accept-Encoding', 'gzip,deflate') to the request headers. For example:
opener = urllib2.build_opener()
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)
usock.close()
return data
Where the decode() function is defined by:
import gzip
import zlib
import StringIO

def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()
    return page
