encoding URL into a format readable by google - python

How can I encode a URL so that I can pass it as a parameter to Google Maps? For example:
https://maps.google.com/maps?q=https:%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml&hl=en&sll=32.824552,-117.108978&sspn=0.889745,1.575165&t=m&z=4
As you can see, the URL is passed as a parameter in the maps.google.com URL.
If I start with a URL like http://www.microsoft.com/somemap.kml, how can I encode it so that I can pass it in the same way?

The process is called URL encoding. In Python 2 (in Python 3 the function lives at urllib.parse.quote):
>>> urllib.quote('https://dl.dropbox.com/u/94943007/file.kml', '')
'https%3A%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml'
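For Python 3, a minimal sketch of the same idea might look like the following. The helper name build_maps_url is made up for illustration, and the sketch assumes Google Maps only needs the q parameter shown in the question.

```python
from urllib.parse import quote, urlencode

kml_url = 'http://www.microsoft.com/somemap.kml'

# safe='' makes quote() escape '/' too, which it leaves alone by default.
encoded = quote(kml_url, safe='')

def build_maps_url(target):
    # Hypothetical helper: urlencode() percent-encodes each parameter value.
    return 'https://maps.google.com/maps?' + urlencode({'q': target})

maps_url = build_maps_url(kml_url)
print(encoded)
print(maps_url)
```

Using urlencode for the whole query string avoids having to remember the safe argument for each value.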

Related

django ignore slash in url and take parameter

Hi, I have a URL pattern like this:
path('api/v1/store/download/<str:ix>/', DownloadVideoAPI.as_view(), name='download'),
It accepts a long string.
I want to keep everything after "download/" in the URL above as the parameter.
But when I enter a long string that contains a slash, Django says page not found. For example, "/api/v1/store/download/asdasd2asdsadas/asdasd" gives me a 404 Not Found.
How can I do that?
This is my view:
class DownloadVideoAPI(APIView):
    def get(self, request, ix):
        pre = ix.split(",")
        hash = pre[0]
        dec = pre[1]
        de_hash = decode_data(hash, dec)
Well, it's possible to capture the extra path segments in the parameter. You can use the re_path method with a pattern that also matches slashes (\w+ would stop at the first slash):
# urls.py
from django.urls import re_path
re_path(r'^api/v1/store/download/(?P<ix>.+)/$', DownloadVideoAPI.as_view(), name='download'),
ref: https://docs.djangoproject.com/en/2.0/ref/urls/#django.urls.re_path
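Whether a given pattern captures a value that contains slashes can be checked with the re module alone, before touching urls.py. The sketch below is only an illustration: the two pattern names are made up, and it assumes the path is matched without its leading slash.

```python
import re

# \w matches letters, digits and underscores only, so it stops at '/'.
word_only = re.compile(r'^api/v1/store/download/(?P<ix>\w+)/$')
# .+ matches any characters, including '/'.
any_chars = re.compile(r'^api/v1/store/download/(?P<ix>.+)/$')

request_path = 'api/v1/store/download/asdasd2asdsadas/asdasd/'

no_match = word_only.match(request_path)
captured = any_chars.match(request_path).group('ix')
print(no_match, captured)
```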
Just use
path('api/v1/store/download/<str:ix>', DownloadVideoAPI.as_view(), name='download'),
without / at the end.
/api/v1/store/download/asdasd2asdsadas/asdasd will result in a 404 page since Django cannot map the URL, /api/v1/store/download/asdasd2asdsadas/, to a route in your urls.py. To solve this, aside from using BugHunter's answer, you could URL encode your long string first before passing it to your URL.
So, given the long string, "asdasd2asdsadas/asdasd", URL encode it first to "asdasd2asdsadas%2Fasdasd". Once you have encoded it, your URL should now look like "/api/v1/store/download/asdasd2asdsadas%2Fasdasd".
To URL encode in Python 3, you can use urllib.parse.
import urllib.parse
parameter = 'asdasd2asdsadas/asdasd'
encoded_string = urllib.parse.quote(parameter, safe='')
encoded_string should now have the value "asdasd2asdsadas%2Fasdasd".

Encoding characters to utf-8 hex in Python 3

I have a web crawler that gets a lot of these errors:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)
To mitigate these errors I have implemented a function that encodes the characters like this:
def properEncode(url):
    url = url.replace("ø", "%C3%B8")
    url = url.replace("å", "%C3%A5")
    url = url.replace("æ", "%C3%A6")
    url = url.replace("é", "%C3%A9")
    url = url.replace("Ø", "%C3%98")
    url = url.replace("Å", "%C3%85")
    url = url.replace("Æ", "%C3%86")
    url = url.replace("í", "%C3%AD")
    return url
These are based on this table: http://www.utf8-chartable.de/
The conversion I do seems to be to convert them to utf-8 hex? Is there a python function to do this automatically?
You are URL encoding them. You can do so trivially with the urllib.parse.quote() function:
>>> from urllib.parse import quote
>>> quote("ø")
'%C3%B8'
Or put it into a function that fixes only the path of a given URL (this encoding doesn't apply to the host portion, for example):
from urllib.parse import quote, urlparse
def properEncode(url):
    parts = urlparse(url)
    path = quote(parts.path)
    return parts._replace(path=path).geturl()
This limits the encoding to just the path portion of the URL. If you also need to encode the query string, use the quote_plus function on the query values, because query parameters represent spaces with a plus instead of %20.
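A minimal sketch that handles both the path and the query string could look like this. The function name encode_path_and_query is made up for illustration, and the sketch assumes the query is a plain key=value&key=value string; urlencode applies quote_plus to each key and value.

```python
from urllib.parse import urlsplit, urlunsplit, quote, parse_qsl, urlencode

def encode_path_and_query(url):
    parts = urlsplit(url)
    # Keep '%' safe so escapes that are already present are not encoded twice.
    path = quote(parts.path, safe='/%')
    # parse_qsl decodes the pairs; urlencode re-encodes them with quote_plus.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(pairs)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

fixed = encode_path_and_query('http://example.com/blå bær/?q=rød grød')
print(fixed)
```

Note that the space becomes %20 in the path but + in the query, which matches the distinction described above.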

Python 3.4.0 -- xpath -- gets me empty list [duplicate]

Trying to retrieve some data from the web using urllib and lxml, I got an error and have no idea how to fix it.
url='http://sum.in.ua/?swrd=автор'
page = urllib.request.urlopen(url)
The error itself:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)
This time the URL contains Ukrainian letters. When I use a URL without any Ukrainian letters in it, as here:
url="http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')
it gets me the data in Ukrainian and everything works just fine.
URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function can convert a dictionary (or a sequence of key-value pairs) for use in URLs:
import urllib.request
from urllib.parse import urlencode
url = 'http://sum.in.ua/'
parameters = {'swrd': 'автор'}
url = '{}?{}'.format(url, urlencode(parameters))
page = urllib.request.urlopen(url)
This will first encode the parameters to UTF-8 bytes, then convert those bytes to percent-encoding sequences:
>>> from urllib.parse import urlencode
>>> parameters = {'swrd': 'автор'}
>>> urlencode(parameters)
'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string, parse it into a dictionary, then pass it to urlencode to put it back into the URL, using urllib.parse.urlparse() and urllib.parse.parse_qs():
from urllib.parse import urlparse, parse_qs, urlencode
url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)
parameters = parse_qs(parsed_url.query)
url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
This splits the URL into its constituent parts, parses out the query string, re-encodes and re-builds the URL afterwards:
>>> from urllib.parse import urlparse, parse_qs, urlencode
>>> url = 'http://sum.in.ua/?swrd=автор'
>>> parsed_url = urlparse(url)
>>> parameters = parse_qs(parsed_url.query)
>>> parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
'http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
I believe you can do something like the following:
import requests
from urllib.parse import quote
url = 'http://sum.in.ua/'
q = 'swrd=автор'
requests.get(url + "?" + quote(q, safe='='))
quote will transform "swrd=автор" into "swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80" (safe='=' keeps the key/value separator unescaped),
which should be accepted just fine.

python url decode %E3

I got some Wikipedia URLs from a Freebase dump:
url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa
url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa
They both refer to the same page on wikipedia:
url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa
urllib.unquote works on url 1
url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)
url = urllib.unquote(url)
print url
result is
Pedro_Miguel_de_Castro_Brandão_Costa
but it does not work on url 2:
url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url
result is
Pedro_Miguel_de_Castro_Brand�o_Costa
Is something wrong?
The former is UTF-8 that has been percent-encoded twice; once it is unquoted twice, it prints normally because your terminal uses UTF-8. The latter is percent-encoded Latin-1, so the unquoted bytes have to be decoded from Latin-1 first.
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand�o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
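In Python 3, unquote accepts encoding and errors arguments, so the same two cases might be handled roughly as follows. The helper name unquote_with_fallback is made up for illustration, and falling back to Latin-1 is an assumption about the data, not something the URL itself can tell you.

```python
from urllib.parse import unquote

def unquote_with_fallback(text):
    # Try strict UTF-8 first; fall back to Latin-1 if the bytes are not valid UTF-8.
    try:
        return unquote(text, encoding='utf-8', errors='strict')
    except UnicodeDecodeError:
        return unquote(text, encoding='latin-1')

# url 1 was percent-encoded twice, so it needs two passes.
url1 = unquote_with_fallback(unquote_with_fallback('Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'))
# url 2 is percent-encoded Latin-1.
url2 = unquote_with_fallback('Pedro_Miguel_de_Castro_Brand%E3o_Costa')
print(url1, url2)
```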

How to fetch a non-ascii url with urlopen?

I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
How can I access a resource pointed to by a URL containing non-ASCII characters using Python?
edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
import re, urlparse
def urlEncodeNonAscii(b):
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)
def iriToUri(iri):
    parts= urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti==1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
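A Python 3 sketch that IDNA-encodes only the hostname and leaves credentials and port alone might look like this. The function name iri_to_uri and the choice of safe characters are assumptions made for illustration; IPv6 literals and other edge cases are not handled.

```python
from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri):
    parts = urlsplit(iri)
    # IDNA-encode only the hostname, not the credentials or the port.
    netloc = (parts.hostname or '').encode('idna').decode('ascii')
    if parts.port is not None:
        netloc = '%s:%d' % (netloc, parts.port)
    if parts.username is not None:
        userinfo = quote(parts.username, safe='')
        if parts.password is not None:
            userinfo += ':' + quote(parts.password, safe='')
        netloc = userinfo + '@' + netloc
    # '%' is kept safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='/%=&?+')
    fragment = quote(parts.fragment, safe='/%?')
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

uri = iri_to_uri('http://user:pw@bücher.ch:8080/ä?q=ö')
print(uri)
```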
In Python 3, use the urllib.parse.quote function on the non-ASCII string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use
urllib.parse.urlsplit to split the URL into its components, and
urllib.parse.quote to properly quote/escape the unicode characters
and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than the accepted answer by @bobince suggests:
netloc should be encoded using IDNA;
a non-ASCII URL path should be encoded to UTF-8 and then percent-escaped;
non-ASCII query parameters should be encoded using the encoding of the page the URL was extracted from (or the encoding the server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check whether a URL escaping implementation is incorrect or incomplete is to check whether it provides a 'page encoding' argument.
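To illustrate why the page encoding matters for query parameters, the standard library's quote can be given an explicit encoding. This is only a sketch of the idea, not what w3lib does internally; the helper name encode_query_value is made up for illustration.

```python
from urllib.parse import quote

def encode_query_value(value, page_encoding='utf-8'):
    # Encode the text to bytes in the page's encoding, then percent-escape them.
    return quote(value, safe='', encoding=page_encoding)

as_utf8 = encode_query_value('ö')
as_latin1 = encode_query_value('ö', page_encoding='latin-1')
print(as_utf8, as_latin1)
```

The same character yields different escapes, so a server expecting Latin-1 would misread a query value that was escaped as UTF-8.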
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote
def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        # Keep '=' and '&' so key/value pairs stay intact.
        query = quote(query, safe='/=&')
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
Use the iri2uri function of httplib2. It does the same thing as bobince's answer (is he/she the author of that?)
Another option to convert an IRI to an ASCII URI is to use furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
It works, finally!
I could not get rid of these strange characters at first, but in the end I got past it.
import urllib.request
import os
url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as response:
    html = response.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
# Open the saved file with the default application (this form works on Windows).
os.system("marketingturismo.html")
