How to use regex & python to parse lat/long from a url? - python

I have the url: https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
I wanted to extract the lat/long from the url, so that I have 44.864505,-93.44873.
So far I have (^[maps?ll=]*$|(?<=\?).*)* which gives me ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
but this needs improvement. I have been trying to use pythex to work this out, but I am stuck.
Any suggestions? Thanks

I wouldn't use a regex here; I'd use urlparse.
For Python 2:
import urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
For Python 3 (urllib.parse):
import urllib.parse as urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
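Since parse_qs returns a list of strings for each key, the coordinates still need to be split and converted. A minimal Python 3 sketch, assuming the ll parameter is always present and holds exactly two comma-separated numbers (the helper name extract_lat_long is made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

def extract_lat_long(url):
    """Return (lat, long) as floats from the 'll' query parameter."""
    params = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values; take the first 'll' value.
    lat, lng = params["ll"][0].split(",")
    return float(lat), float(lng)

url = "https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3"
print(extract_lat_long(url))
```

A missing ll key raises KeyError and a malformed value raises ValueError, so code that handles untrusted URLs would probably want to catch both.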

Related

Redact and remove password from URL

I have a URL like this:
https://user:password@example.com/path?key=value#hash
The result should be:
https://user:???@example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL into a high-level data structure, operate on that structure, and then serialize it back to a string.
Is this possible with Python?
You can use the built-in urlparse to pull the password out of a URL. It is available in both Python 2 and 3, but in different locations.
Python 2: from urlparse import urlparse
Python 3: from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password@example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}@{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???@example.com/path?key=value#hash'
See also this question: Changing hostname in a url
from urllib.parse import urlparse
def redact_url(url: str) -> str:
    url_components = urlparse(url)
    if url_components.username or url_components.password:
        url_components = url_components._replace(
            netloc=f"{url_components.username}:???@{url_components.hostname}",
        )
    return url_components.geturl()
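Rebuilding netloc from hostname drops any port number and lowercases the host. A possible variant is sketched below, assuming the goal is to mask only the password and leave the rest of the authority untouched. The name redact_password, its mask parameter, and the port 8443 in the example are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

def redact_password(url: str, mask: str = "???") -> str:
    """Replace the password in a URL's userinfo with a mask, keeping host and port."""
    parts = urlsplit(url)
    if parts.password is None:
        # No password present (possibly no userinfo at all): nothing to redact.
        return url
    # netloc looks like 'user:password@host:port'; split at the last '@'.
    userinfo, _, hostinfo = parts.netloc.rpartition("@")
    username = userinfo.split(":", 1)[0]
    return urlunsplit(parts._replace(netloc=f"{username}:{mask}@{hostinfo}"))

print(redact_password("https://user:password@example.com:8443/path?key=value#hash"))
```

URLs without a password are returned unchanged, so a username-only URL does not gain a spurious ":???".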
pip already has an internal utility function that does exactly this (note that pip._internal is not a public API and may change between releases).
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password@example.com/path?key=value#hash")
'https://user:****@example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This will provide the expected result even if the url does not contain username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'

Extract part of a url using pattern matching in python

I want to extract part of a url using pattern matching in python from a list of links
Examples:
http://www.fairobserver.com/about/
http://www.fairobserver.com/about/interview/
This is my regex :
re.match(r'(http?|ftp)(://[a-zA-Z0-9+&/@#%?=~_|!:,.;]*)(.\b[a-z]{1,3}\b)(/about[a-zA-Z-_]*/?)', str(href), re.IGNORECASE)
I want to get only the links ending with /about or /about/,
but the regex above matches every link that contains the word "about".
I suggest parsing your URLs with an appropriate library, e.g. urlparse, instead.
E.g.
import urlparse
samples = [
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/interview/",
]
def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.endswith('/about/'):
            yield url
Yielding:
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/']
Or
def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.startswith('/about'):
            yield url
Yielding
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/', 'http://www.fairobserver.com/about/interview/']
Matching the path of exactly /about/ or /about per your comment clarification.
Below is using urlparse in python2/3.
try:
    # https://docs.python.org/3.5/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse
    # python 3
    from urllib.parse import urlparse
except ImportError:
    # https://docs.python.org/2/library/urlparse.html#urlparse.urlparse
    # python 2
    from urlparse import urlparse
urls = (
    'http://www.fairobserver.com/about/',
    'http://www.fairobserver.com/about/interview/',
    'http://www.fairobserver.com/interview/about/',
)
for url in urls:
    print("{}: path is /about? {}".format(url,
          urlparse(url.rstrip('/')).path == '/about'))
Here is the output:
http://www.fairobserver.com/about/: path is /about? True
http://www.fairobserver.com/about/interview/: path is /about? False
http://www.fairobserver.com/interview/about/: path is /about? False
The important part is urlparse(url.rstrip('/')).path == '/about', normalizing the url by stripping off the trailing / before parsing so that we don't have to use regex.
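Stripping the slash from the whole URL string stops working once a query string or fragment follows the path. A small sketch that normalizes the parsed path instead is shown here; is_about_page is an illustrative name, and the ?ref=home URL is invented for the example:

```python
from urllib.parse import urlparse

def is_about_page(url):
    """True when the URL's path is exactly /about, with or without a trailing slash."""
    # Parse first, then strip trailing slashes from the path only,
    # so a query string or fragment does not interfere.
    return urlparse(url).path.rstrip("/") == "/about"

urls = (
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/interview/",
    "http://www.fairobserver.com/interview/about/",
    "http://www.fairobserver.com/about/?ref=home",
)
for url in urls:
    print(url, is_about_page(url))
```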
If you just want links ending in either, use an HTML parser and str.endswith:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.fairobserver.com/about/")
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(r.content, "html.parser")
hrefs = (a["href"] for a in soup.find_all("a", href=True))
print([h for h in hrefs if h.endswith(("/about", "/about/"))])
You can also use a regex with BeautifulSoup:
import re
r = requests.get("http://www.fairobserver.com/about/")
print([a["href"] for a in BeautifulSoup(r.content, "html.parser").find_all(
    "a", href=re.compile(".*/about/$|.*/about$"))])

Using urlparse to remove a certain string?

I have this URL:
www.domain.com/a/b/c/d,authorised=false.html
and I want to convert it into
www.domain.com/a/b/c/d.html
Please note I am using python 2.7.
from urlparse import urlparse
url = "www.domain.com/a/b/c/d,authorised=false.html?_i_location=http%3A%2F%2Fwww.domain.com%2Fcms%2Fs%2F0%2Ff416e134-2484-11e4-ae78-00144feabdc0.html%3Fsiteedition%3Dintl&siteedition=intl&_i_referer=http%3A%2F%2Fwww.domain.com%2Fhome%2Fus"
o = urlparse(url)
url = o.hostname + o.path
print url
returns www.domain.com/a/b/c/d,authorised=false.html, but I don't know how to remove the authorised=false part from the URL.
import re
print re.sub(r',.+\.', '.', 'www.domain.com/a/b/c/d,authorised=false.html')
# www.domain.com/a/b/c/d.html
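The greedy pattern above removes everything from the first comma to the last dot, which is fine for this input. As a regex-free alternative, here is a sketch that only touches the final path segment, assuming the unwanted part always starts at the first comma of the file name (strip_comma_suffix is a hypothetical helper):

```python
import posixpath

def strip_comma_suffix(path):
    """Drop everything from the first comma in the file name up to its extension."""
    head, tail = posixpath.split(path)      # directory part, file name
    stem, ext = posixpath.splitext(tail)    # 'd,authorised=false', '.html'
    stem = stem.split(",", 1)[0]            # keep only what precedes the comma
    return posixpath.join(head, stem + ext)

print(strip_comma_suffix("www.domain.com/a/b/c/d,authorised=false.html"))
```

File names without a comma pass through unchanged, and commas in earlier directory segments are left alone.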

How to find URL in another URL?

Kinda tricky question about regexes. I have a URL like this:
http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480
How can I extract the imgurl value?
Take a look at urlparse
http://docs.python.org/2/library/urlparse.html
You can easily split your URL into parameters and then extract whatever you need.
Example:
import urlparse
url = "http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480"
urlParams = urlparse.parse_qs(urlparse.urlparse(url).query)
urlInUrl = urlParams['imgurl']
print urlInUrl
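parse_qs returns a list for every key, so the snippet above prints a one-element list rather than a plain string. A Python 3 sketch that returns the bare value, assuming the parameter appears at most once (get_query_param is an illustrative name):

```python
from urllib.parse import urlparse, parse_qs

def get_query_param(url, name):
    """Return the first value of a query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else None

url = ("http://www.domain.com/img?res=high"
       "&refurl=http://www.ahother_domain.com/page/"
       "&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480")
print(get_query_param(url, "imgurl"))
```

This works here because the nested URLs contain no & or # of their own; if they did, they would need to be percent-encoded in the outer URL first.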
This solution assumes that the imgurl param value is always followed by size params such as &w=...:
import re
re.findall('imgurl=([^&]+)&', url)

Slicing url with Python

Hi, how can I use Python to transform the URL of an article into its print URL?
article url: http://www.indianexpress.com/news/second-time-as-farce/800228/0
print url: http://www.indianexpress.com/story-print/800228/
How do I convert the article URL to the print URL?
Use urllib.parse.urlparse() to carve the path from the rest of the url, and posixpath.split() and posixpath.join() to reform the path, and urllib.parse.urlunparse() to put it all back together again.
from urllib.parse import urlparse
def transform(url):
    parsed = urlparse(url)
    return '{0}://{1}/story-print/{2}/'.format(parsed.scheme, parsed.netloc, parsed.path.split('/')[-2])
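The function above builds the result with string formatting. For comparison, here is a sketch of the posixpath and urlunparse route that the answer describes, assuming the article id is always the second-to-last path segment (to_print_url is a made-up name):

```python
import posixpath
from urllib.parse import urlparse, urlunparse

def to_print_url(url):
    """Rewrite /news/<slug>/<id>/<page> into /story-print/<id>/."""
    parsed = urlparse(url)
    # '/news/second-time-as-farce/800228/0' -> head '/news/second-time-as-farce/800228'
    head, _page = posixpath.split(parsed.path)
    article_id = posixpath.basename(head)
    new_path = posixpath.join("/story-print", article_id) + "/"
    # _replace keeps the scheme and host; urlunparse reassembles the URL.
    return urlunparse(parsed._replace(path=new_path))

print(to_print_url("http://www.indianexpress.com/news/second-time-as-farce/800228/0"))
```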
