Python, url parsing

I have a url, e.g.: "http://www.nicepage.com/nicecat/something"
And I need to parse it.
I use:
from urlparse import urlparse
url=urlparse("http://www.nicepage.com/nicecat/something")
#then I have:
#url.netloc -- www.nicepage.com
#url.path -- /nicecat/something
But I want to delete "www", and parse it a little more.
I would like to have something like this:
#path_without_www -- nicepage.com
#list_of_path -- list_of_path[0] -> "nicecat", list_of_path[1] -> "something"

How about this:
import re
from urlparse import urlparse
url = urlparse('http://www.nicepage.com/nicecat/something')
url = url._replace(netloc=re.sub(r'^www\.', '', url.netloc))
The regex strips the 'www.' prefix from the beginning of the netloc (note the escaped dot, so it only matches a literal '.'). From there you can parse it further as you wish.
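Putting it together, here is a sketch that produces both pieces the question asks for, using the Python 3 import path (`urllib.parse`); the variable names mirror the ones in the question:

```python
from urllib.parse import urlparse
import re

url = urlparse('http://www.nicepage.com/nicecat/something')
# Strip a leading 'www.' from the host
netloc_without_www = re.sub(r'^www\.', '', url.netloc)
# Split the path into components, dropping the empty leading element
list_of_path = url.path.strip('/').split('/')

print(netloc_without_www)  # nicepage.com
print(list_of_path)        # ['nicecat', 'something']
```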

The following removes a leading "www." and splits the remaining host into its elements for further processing. (Beware of str.lstrip("www.") here: lstrip strips a *set of characters*, not a prefix string, so it would also mangle a host like "web.example.com".)
host = url.netloc
if host.startswith("www."):
    host = host[len("www."):]
print host.split(".")
Giving:
['nicepage', 'com']
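On Python 3.9+ you can use str.removeprefix instead, which removes an exact prefix string (unlike lstrip, which strips a character set):

```python
netloc = "www.nicepage.com"
# removeprefix only removes the exact prefix; it is a no-op if the prefix is absent
print(netloc.removeprefix("www.").split("."))  # ['nicepage', 'com']
```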

Related

How to remove query string from a url?

I have the following URL:
https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1
I need URLs for example:
https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
My attempt:
url = 'https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]
I think this is a clumsy way to do it.
The very helpful library furl makes it trivial to remove both query and fragment parts:
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
'https://hi.com/'
Splitting on a substring that doesn't exist in the string just gives back a one-element list, so the checks in your code are unnecessary. Depending on your goal, you could simplify your existing code to:
url = url.split('?')[0].split('#')[0]
Not saying this is the best way (furl is a great solution), but it is a way.
In your example you're also removing the fragment (the part after a #), not just the query.
You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns and converting back to a string URL with urllib.parse.urlunsplit:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
Output:
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'
You could try
urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
        "https://stackoverflow.com/questions/7990300?fr=aladdin",
        "https://stackoverflow.com/questions/22375#6",
        "https://stackoverflow.com/questions/22375?",
        "https://stackoverflow.com/questions/22375#3_1"]
urls_without_query = [url.split('?')[0] for url in urls]
For example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split('?') returns a list that looks like ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], so if that string is url, url.split('?')[0] gives you "https://stackoverflow.com/questions/7990301".
Edit: I didn't think about # arguments. The other answers might help you more :)
You can use w3lib
from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)
Here is an answer using standard libraries, and which parses the URL properly:
from urllib.parse import urlparse
url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme,"://",parsed.netloc,parsed.path]))
expected output:
http://www.example.com/this/category
Note: this also strips params and the fragment, but is easy to modify to include those if you want.

Redact and remove password from URL

I have a URL like this:
https://user:password@example.com/path?key=value#hash
The result should be:
https://user:???@example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL into a high-level data structure, operate on that data structure, and then serialize it back to a string.
Is this possible with Python?
You can use the built-in urlparse to pull the password out of a url. It is available in both Python 2 and 3, but in different locations.
Python 2: import urlparse
Python 3: from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password@example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}@{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???@example.com/path?key=value#hash'
See also this question: Changing hostname in a url
from urllib.parse import urlparse

def redact_url(url: str) -> str:
    url_components = urlparse(url)
    if url_components.username or url_components.password:
        url_components = url_components._replace(
            netloc=f"{url_components.username}:???@{url_components.hostname}",
        )
    return url_components.geturl()
The pip module already has an internal utility function which does exactly this.
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password@example.com/path?key=value#hash")
'https://user:****@example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This returns the expected result even if the url does not contain a username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'

How can I get the base of a URL in Python?

I'm trying to determine the base of a URL, that is, everything besides the page and its parameters. I tried using split, but is there a better way than splitting it up into pieces? Is there a way I can remove everything after the last '/'?
Given this:
http://127.0.0.1/asdf/login.php
I would like:
http://127.0.0.1/asdf/
The best way to do this is to use urllib.parse.
From the docs:
The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators. It supports the following URL schemes:
file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp,
prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn,
svn+ssh, telnet, wais, ws, wss.
You'd want to do something like this using urlsplit and urlunsplit:
from urllib.parse import urlsplit, urlunsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
# You now have:
# split_url.scheme "http"
# split_url.netloc "127.0.0.1"
# split_url.path "/asdf/login.php"
# split_url.query "q=abc"
# split_url.fragment "stackoverflow"
# Use all the path except everything after the last '/'
clean_path = "".join(split_url.path.rpartition("/")[:-1])
# "/asdf/"
# urlunsplit joins a urlsplit tuple
clean_url = urlunsplit(split_url)
# "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"
# A more advanced example
advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
# You now have *in addition* to the above:
# advanced_split_url.username "foo"
# advanced_split_url.password "bar"
# advanced_split_url.hostname "127.0.0.1"
# advanced_split_url.port "5000"
Well, for one, you could just use os.path.dirname:
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'
It's not explicitly for URLs, but it happens to work on them (even on Windows); it just doesn't leave the trailing slash, which you can add back yourself.
You may also want to look at urllib.parse.urlparse for more fine-grained parsing. If the URL has a query string or fragment involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing the query and fragment info.
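That parse/trim/recombine approach might look like this (a sketch; the function name is illustrative, and it trims the last path segment while preserving query and fragment):

```python
from urllib.parse import urlparse, urlunparse

def trim_last_segment(url):
    # Parse, drop everything after the last '/' in the path, recombine
    parts = urlparse(url)
    trimmed_path = parts.path.rsplit('/', 1)[0] + '/'
    return urlunparse(parts._replace(path=trimmed_path))

print(trim_last_segment('http://127.0.0.1/asdf/login.php?q=abc#frag'))
# http://127.0.0.1/asdf/?q=abc#frag
```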
Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:
>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'
Here is the shortest solution for Python 3, using the urllib library (I don't know if it's the fastest):
from urllib.parse import urljoin
base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
# output: http://127.0.0.1/asdf/
Keep in mind that urljoin resolves the relative reference the way a browser resolves an HTML href: a url ending with '/' behaves differently from one without, as here https://stackoverflow.com/a/1793282/7750840/:
base_url = urljoin('http://127.0.0.1/asdf/', '.')
# output: http://127.0.0.1/asdf/
base_url = urljoin('http://127.0.0.1/asdf', '.')
# output: http://127.0.0.1/
This is link to urllib for python: https://pythonprogramming.net/urllib-tutorial-python-3/
I agree that the best way to do it is with urllib.parse.
Specifically, you can decompose the url with urllib.parse.urlparse and then replace every attribute other than scheme and netloc with an empty string. If you want to keep the path attribute (as in your question), you can do so with an extra string parsing step. Example function below:
import urllib.parse

def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path)
    parsed = parsed._replace(params='')
    parsed = parsed._replace(query='')
    parsed = parsed._replace(fragment='')
    return parsed.geturl()
Examples:
>>> base_url('http://127.0.0.1/asdf/login.php', with_path=True)
'http://127.0.0.1/asdf'
>>> base_url('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1'
No need to use a regex, you can just use rsplit():
>>> url = 'http://127.0.0.1/asdf/login.php'
>>> url.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'
When you use urlsplit, it returns a SplitResult object:
from urllib.parse import urlsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
print(split_url)
>>> SplitResult(scheme='http', netloc='127.0.0.1', path='/asdf/login.php', query='', fragment='')
You can make your own SplitResult() object and pass it through urlunsplit. This code should work for multiple url splits, regardless of their length, as long as you know what the last path element you want is.
from urllib.parse import urlsplit, urlunsplit, SplitResult
# splitting url:
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
# editing the variables you want to change (in this case, path):
last_element = 'asdf' # this can be any element in the path.
path_array = split_url.path.split('/')
# print(path_array)
# >>> ['', 'asdf', 'login.php']
path_array.remove('')
ind = path_array.index(last_element)
new_path = '/' + '/'.join(path_array[:ind+1]) + '/'
# making SplitResult() object with edited data:
new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')
# unsplitting:
base_url = urlunsplit(new_url)
Get the right-most occurrence of the slash, then slice the original string through that position. The +1 keeps that final slash at the end.
link = "http://127.0.0.1/asdf/login.php"
link[:link.rfind('/')+1]
If you use Python 3, you can use urlparse and urlunparse.
In :from urllib.parse import urlparse, urlunparse
In :url = "http://127.0.0.1/asdf/login.php"
In :result = urlparse(url)
In :new = list(result)
In :new[2] = new[2].replace("login.php", "")
In :urlunparse(new)
Out:'http://127.0.0.1/asdf/'

Extract part of a url using pattern matching in python

I want to extract part of a URL from a list of links using pattern matching in Python.
Examples:
http://www.fairobserver.com/about/
http://www.fairobserver.com/about/interview/
This is my regex:
re.match(r'(http?|ftp)(://[a-zA-Z0-9+&/##%?=~_|!:,.;]*)(.\b[a-z]{1,3}\b)(/about[a-zA-Z-_]*/?)', str(href), re.IGNORECASE)
I want to get links ending only with /about or /about/
but the above regex selects all links with the word "about" anywhere in them.
I suggest you parse your URLs with an appropriate library, e.g. urlparse, instead.
E.g.
import urlparse

samples = [
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/interview/",
]

def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.endswith('/about/'):
            yield url
Yielding:
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/']
Or
def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.startswith('/about'):
            yield url
Yielding
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/', 'http://www.fairobserver.com/about/interview/']
Matching the path of exactly /about/ or /about per your comment clarification.
Below is using urlparse in python2/3.
try:
    # https://docs.python.org/3.5/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse
    # python 3
    from urllib.parse import urlparse
except ImportError:
    # https://docs.python.org/2/library/urlparse.html#urlparse.urlparse
    # python 2
    from urlparse import urlparse

urls = (
    'http://www.fairobserver.com/about/',
    'http://www.fairobserver.com/about/interview/',
    'http://www.fairobserver.com/interview/about/',
)

for url in urls:
    print("{}: path is /about? {}".format(url,
          urlparse(url.rstrip('/')).path == '/about'))
Here is the output:
http://www.fairobserver.com/about/: path is /about? True
http://www.fairobserver.com/about/interview/: path is /about? False
http://www.fairobserver.com/interview/about/: path is /about? False
The important part is urlparse(url.rstrip('/')).path == '/about': normalizing the url by stripping the trailing / before parsing, so that we don't have to use a regex.
If you just want links ending in either, use an html parser and str.endswith:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.fairobserver.com/about/")
print(list(filter(lambda x: x.endswith(("/about", "/about/")),
                  (a["href"] for a in BeautifulSoup(r.content).find_all("a", href=True)))))
You can also use a regex with BeautifulSoup:
import re

r = requests.get("http://www.fairobserver.com/about/")
print([a["href"] for a in BeautifulSoup(r.content).find_all(
    "a", href=re.compile(".*/about/$|.*/about$"))])

How to find URL in another URL?

Kinda tricky question about regexes. I have url of such a pattern:
http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480
how can I extract imgurl value?
Take a look at urlparse
http://docs.python.org/2/library/urlparse.html
You can easily split your URL into parameters and then extract whatever you need.
Example:
import urlparse
url = "http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480"
urlParams = urlparse.parse_qs(urlparse.urlparse(url).query)
urlInUrl = urlParams['imgurl']  # note: parse_qs returns a list of values per key
print urlInUrl
This solution assumes that the imgurl param value is always followed by size params such as &w=...:
import re
re.findall('imgurl=([^&]+)&', url)
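For completeness, a Python 3 version of the parse_qs approach (a sketch; parse_qs returns a list of values per parameter, so take the first element):

```python
from urllib.parse import urlparse, parse_qs

url = ("http://www.domain.com/img?res=high"
       "&refurl=http://www.ahother_domain.com/page/"
       "&imgurl=http://www.one_more.com/static/images/mercedes.jpg"
       "&w=640&h=480")
# parse_qs maps each parameter name to a list of its values
params = parse_qs(urlparse(url).query)
imgurl = params['imgurl'][0]
print(imgurl)  # http://www.one_more.com/static/images/mercedes.jpg
```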
