Extract part of a url using pattern matching in python

I want to extract part of a url using pattern matching in python from a list of links
Examples:
http://www.fairobserver.com/about/
http://www.fairobserver.com/about/interview/
This is my regex:
re.match(r'(http?|ftp)(://[a-zA-Z0-9+&/##%?=~_|!:,.;]*)(.\b[a-z]{1,3}\b)(/about[a-zA-Z-_]*/?)', str(href), re.IGNORECASE)
I want to get only the links that end with /about or /about/,
but the above regex selects every link that contains the word "about".

I suggest parsing your URLs with an appropriate library, e.g. urlparse, instead of a regex.
E.g.
import urlparse
samples = [
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/interview/",
]
def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.endswith('/about/'):
            yield url
Yielding:
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/']
Or
def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.startswith('/about'):
            yield url
Yielding:
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/', 'http://www.fairobserver.com/about/interview/']

Matching the path of exactly /about/ or /about per your comment clarification.
Below is using urlparse in python2/3.
try:
    # https://docs.python.org/3.5/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse
    # python 3
    from urllib.parse import urlparse
except ImportError:
    # https://docs.python.org/2/library/urlparse.html#urlparse.urlparse
    # python 2
    from urlparse import urlparse
urls = (
    'http://www.fairobserver.com/about/',
    'http://www.fairobserver.com/about/interview/',
    'http://www.fairobserver.com/interview/about/',
)
for url in urls:
    print("{}: path is /about? {}".format(url,
          urlparse(url.rstrip('/')).path == '/about'))
Here is the output:
http://www.fairobserver.com/about/: path is /about? True
http://www.fairobserver.com/about/interview/: path is /about? False
http://www.fairobserver.com/interview/about/: path is /about? False
The important part is urlparse(url.rstrip('/')).path == '/about', normalizing the url by stripping off the trailing / before parsing so that we don't have to use regex.
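One caveat with stripping the slash from the whole URL is that it stops working once a query string or fragment follows the path. Below is a minimal Python 3 sketch that normalizes the parsed path instead. The helper name is_about_page and the sample URL with ?ref=home are illustrative assumptions, not part of the original answer:

```python
from urllib.parse import urlparse

def is_about_page(url):
    """Return True when the URL path is exactly /about or /about/."""
    # Strip the trailing slash from the path only, so a query string
    # or fragment after the path does not affect the comparison.
    return urlparse(url).path.rstrip("/") == "/about"

urls = [
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about",
    "http://www.fairobserver.com/about/?ref=home",  # hypothetical sample
    "http://www.fairobserver.com/about/interview/",
    "http://www.fairobserver.com/interview/about/",
]
about_links = [u for u in urls if is_about_page(u)]
print(about_links)
```

Only the first three URLs pass the filter, because their path component is /about with or without a trailing slash.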

If you just want links ending in either form, use an HTML parser and str.endswith:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.fairobserver.com/about/")
soup = BeautifulSoup(r.content, "html.parser")
print(list(filter(lambda x: x.endswith(("/about", "/about/")),
                  (a["href"] for a in soup.find_all("a", href=True)))))
You can also use a regex with BeautifulSoup:
import re
r = requests.get("http://www.fairobserver.com/about/")
print([a["href"] for a in BeautifulSoup(r.content, "html.parser").find_all(
    "a", href=re.compile(".*/about/$|.*/about$"))])
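If a regex is still preferred for the original list of links, the pattern in the question over-matches because nothing anchors its end: re.match only requires a match at the start of the string, so /about/interview/ satisfies it as well. Here is a small sketch with an end anchor, run against a plain list so it needs no network access. The names ABOUT_RE and hrefs, and the /aboutus/ sample, are assumptions for illustration:

```python
import re

# Match links whose last path component is exactly "about",
# with or without a trailing slash.
ABOUT_RE = re.compile(r"/about/?$", re.IGNORECASE)

hrefs = [
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about",
    "http://www.fairobserver.com/about/interview/",
    "http://www.fairobserver.com/aboutus/",  # hypothetical sample
]
matches = [h for h in hrefs if ABOUT_RE.search(h)]
print(matches)
```

The `$` anchor is what rejects /about/interview/, and requiring the slash or end of string right after "about" rejects /aboutus/.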

Related

How to compare URLs in python? (not traditional way)?

In python, I used == to check if 2 URLs are the same, but to me, the following are the same too:
https://hello.com?test=test and https://hello.com?test22=test22
https://hello.com and https://hello.com#you_can_ignore_this
Is there any built-in function for this, instead of comparing the URLs character by character?
You can use urllib to parse the URLs and only keep the initial parts you want (here keeping scheme+netloc+path):
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')
url1[:3]
# ('https', 'hello.com', '/')
url1[:3] == url2[:3]
# True
Comparing only the netloc (aka "domain"):
url1[1] == url2[1]
As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.
Using urlparse is the way to go, as suggested in another answer. However, special treatment should be used for the URLs that have an empty path or the path consisting only of the root "/", because they refer to the same document.
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')
result = ((url1.path in ("", "/") and url2.path in ("", "/") and url1[:2] == url2[:2])
          or url1[:3] == url2[:3])
It's not very clear what you mean, but you should try parsing the url first.
You could check it using urlparse().
from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")
Since the urlparse method returns a ParseResult:
ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')
You can compare these by doing
url[1] == 'hello.com' #Index 1 = netloc
https://docs.python.org/3/library/urllib.parse.html
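Putting the suggestions above together, one possible sketch compares scheme, host and path while ignoring the query string and fragment, and treats an empty path like "/". The function name same_page is a hypothetical helper, and which parts count as "the same" is an assumption based on the two examples in the question:

```python
from urllib.parse import urlsplit

def same_page(url_a, url_b):
    """Compare two URLs by scheme, host and path only."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Treat an empty path and "/" as the same document root.
    path_a = a.path or "/"
    path_b = b.path or "/"
    # Host names are case-insensitive; query and fragment are ignored.
    return (a.scheme, a.netloc.lower(), path_a) == (
        b.scheme, b.netloc.lower(), path_b)

print(same_page("https://hello.com?test=test", "https://hello.com?test22=test22"))
print(same_page("https://hello.com", "https://hello.com#you_can_ignore_this"))
print(same_page("https://hello.com/a", "https://hello.com/b"))
```

The first two calls print True and the third prints False. If some query parameters do matter for your use case, they would need to be compared separately, for example with parse_qs.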

How to use regex & python to parse lat/long from a url?

I have the url: https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
I wanted to extract the lat/long from the url, so that I have 44.864505,-93.44873.
So far I have (^[maps?ll=]*$|(?<=\?).*)* which gives me ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
but this needs improvement. I have been trying to use pythex to work this out, but I am stuck.
Any suggestions? Thanks
I wouldn't use regex, I'd use urlparse
For Python2:
import urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
For Python3 (urllib.parse) :
import urllib.parse as urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
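Since parse_qs returns a list of strings per parameter, one more step is needed to get numeric coordinates. A minimal Python 3 sketch follows; the helper name lat_lng_from_url is an assumption, and it expects the value to be exactly two comma-separated numbers (anything else raises ValueError):

```python
from urllib.parse import urlparse, parse_qs

def lat_lng_from_url(url, param="ll"):
    """Return (lat, lng) as floats, or None if the parameter is absent."""
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    # The first value looks like "44.864505,-93.44873".
    lat, lng = values[0].split(",")
    return float(lat), float(lng)

url = ("https://maps.google.com/maps?ll=44.864505,-93.44873"
       "&z=18&t=m&hl=en&gl=US&mapclient=apiv3")
print(lat_lng_from_url(url))
```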

Python, url parsing

I have a URL, e.g. "http://www.nicepage.com/nicecat/something",
and I need to parse it.
I use:
from urlparse import urlparse
url=urlparse("http://www.nicepage.com/nicecat/something")
#then I have:
#url.netloc -- www.nicepage.com
#url.path -- /nicecat/something
But I want to remove "www" and parse it a little more.
I would like to have something like this:
#path_without_www -- nicepage.com
#list_of_path -- list_of_path[0] -> "nicecat", list_of_path[1] -> "something"
How about this:
import re
from urlparse import urlparse
url = urlparse('http://www.nicepage.com/nicecat/something')
url = url._replace(netloc=re.sub(r'^www\.', '', url.netloc))
The regex strips the 'www.' from the beginning of the netloc. From there you can parse it more as you wish.
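The following removes a leading "www." and splits the remaining elements for further processing. Note that str.lstrip("www.") is not suitable here: it strips any leading run of the characters "w" and ".", so a host such as www.web.com would lose too much.
print (url.netloc[4:] if url.netloc.startswith("www.") else url.netloc).split(".")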
Giving:
['nicepage', 'com']
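Neither answer above builds the list_of_path the question asks for. A short Python 3 sketch that returns both pieces is shown here; the function name split_url is a hypothetical helper, and on Python 2 the import would be from urlparse import urlparse instead:

```python
from urllib.parse import urlparse

def split_url(url):
    """Return (host without a leading 'www.', list of path segments)."""
    parsed = urlparse(url)
    host = parsed.netloc
    # Remove the literal prefix "www." only, not a set of characters.
    if host.startswith("www."):
        host = host[len("www."):]
    # Drop empty segments produced by leading or trailing slashes.
    segments = [part for part in parsed.path.split("/") if part]
    return host, segments

path_without_www, list_of_path = split_url("http://www.nicepage.com/nicecat/something")
print(path_without_www)
print(list_of_path)
```

This prints nicepage.com and ['nicecat', 'something'], so list_of_path[0] is "nicecat" and list_of_path[1] is "something".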

Using urlparse to remove a certain string?

I have this URL:
www.domain.com/a/b/c/d,authorised=false.html
and I want to convert it into
www.domain.com/a/b/c/d.html
Please note I am using python 2.7.
from urlparse import urlparse
url = "www.domain.com/a/b/c/d,athorised=false.html_i_location=http%3A%2F%2Fwww.domain.com%2Fcms%2Fs%2F0%2Ff416e134-2484-11e4-ae78-00144feabdc0.html%3Fsiteedition%3Dintl&siteedition=intl&_i_referer=http%3A%2F%2Fwww.domain.com%2Fhome%2Fus"
o = urlparse(url)
url = o.hostname + o.path
print url
returns www.domain.com/a/b/c/d,authorised=false.html but I don't know how to remove authorised=false part from the URL
import re
print re.sub(r',.+\.', '.', 'www.domain.com/a/b/c/d,authorised=false.html')
# www.domain.com/a/b/c/d.html
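The regex above removes everything from the first comma up to the last dot, which works for the sample but could remove too much if a comma appeared earlier in the URL. A more conservative sketch that only touches the final path segment is given below, written for Python 3. The helper name strip_comma_suffix is an assumption, and it expects a URL without a query string, like the target in the question:

```python
import posixpath

def strip_comma_suffix(url):
    """Drop ',key=value' text from the last path segment, keeping the extension."""
    # Separate the final segment, e.g. "d,authorised=false.html".
    head, sep, last = url.rpartition("/")
    # Split off the extension, then cut the stem at the first comma.
    stem, ext = posixpath.splitext(last)
    stem = stem.split(",", 1)[0]
    return head + sep + stem + ext

print(strip_comma_suffix("www.domain.com/a/b/c/d,authorised=false.html"))
```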

How to find URL in another URL?

Kinda tricky question about regexes. I have url of such a pattern:
http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480
how can I extract imgurl value?
Take a look at urlparse
http://docs.python.org/2/library/urlparse.html
You can easily split your URL into parameters and then extract whatever you need.
Example:
import urlparse
url = "http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480"
urlParams = urlparse.parse_qs(urlparse.urlparse(url).query)
urlInUrl = urlParams['imgurl']
print urlInUrl
This solution assumes that the imgurl param value is always followed by size params such as &w=...:
import re
re.findall('imgurl=([^&]+)&', url)
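For Python 3, the same query-string approach can be sketched with urllib.parse, and it does not depend on imgurl being followed by another parameter. The helper name query_value is an assumption. Note that this only works cleanly when the embedded URL contains no unescaped & of its own; such values would normally be percent-encoded:

```python
from urllib.parse import urlsplit, parse_qs

def query_value(url, name):
    """Return the first value of a query parameter, or None if it is missing."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else None

url = ("http://www.domain.com/img?res=high"
       "&refurl=http://www.ahother_domain.com/page/"
       "&imgurl=http://www.one_more.com/static/images/mercedes.jpg"
       "&w=640&h=480")
print(query_value(url, "imgurl"))
```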
