How can I check if a URL is absolute using Python? - python

What is the preferred solution for checking if an URL is relative or absolute?

Python 2
You can use the urlparse module to parse an URL and then you can check if it's relative or absolute by checking whether it has the host name set.
>>> import urlparse
>>> def is_absolute(url):
... return bool(urlparse.urlparse(url).netloc)
...
>>> is_absolute('http://www.example.com/some/path')
True
>>> is_absolute('//www.example.com/some/path')
True
>>> is_absolute('/some/path')
False
Python 3
urlparse has been moved to urllib.parse, so use the following:
from urllib.parse import urlparse
def is_absolute(url):
return bool(urlparse(url).netloc)

If you want to know if an URL is absolute or relative in order to join it with a base URL, I usually do urllib.parse.urljoin anyway:
>>> from urllib.parse import urljoin
>>> urljoin('http://example.com/', 'http://example.com/picture.png')
'http://example.com/picture.png'
>>> urljoin('http://example1.com/', '/picture.png')
'http://example1.com/picture.png'
>>>

Can't comment accepted answer, so write this comment as new answer: IMO checking scheme in accepted answer ( bool(urlparse.urlparse(url).scheme) ) is not really good idea because of http://example.com/file.jpg, https://example.com/file.jpg and //example.com/file.jpg are absolute urls but in last case we get scheme = ''
I use this code:
is_absolute = True if '//' in my_url else False

Related

How to compare URLs in python? (not traditional way)?

In python, I used == to check if 2 URLs are the same, but to me, the following are the same too:
https://hello.com?test=test and https://hello.com?test22=test22
https://hello.com and https://hello.com#you_can_ignore_this
Is there any build-in function instead of working hard to compare every char etc...
You can use urllib to parse the URLs and only keep the initial parts you want (here keeping scheme+netloc+path):
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')
url1[:3]
# ('https', 'hello.com', '/')
url1[:3] == url2[:3]
# True
Comparing only the netloc (aka "domain"):
url1[1] == url2[1]
As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.
Using urlparse is the way to go, as suggested in another answer. However, special treatment should be used for the URLs that have an empty path or the path consisting only of the root "/", because they refer to the same document.
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')
result = (url1.path in "/" and url2.path in "/" and url1[:2] == url2[:2])\
or (url1[:3] == url2[:3])
It's not very clear what you mean, but you should try parsing the url first.
You could check it using urlparse().
from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")
Since the urlparse method returns a ParseResult:
ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')
You can compare these by doing
url[1] == 'hello.com' #Index 1 = netloc
https://docs.python.org/3/library/urllib.parse.html

Redact and remove password from URL

I have an URL like this:
https://user:password#example.com/path?key=value#hash
The result should be:
https://user:???#example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL a high level data structure, then operate on this data structure, then serializing to a string.
Is this possible with Python?
You can use the built in urlparse to query out the password from a url. It is available in both Python 2 and 3, but under different locations.
Python 2 import urlparse
Python 3 from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password#example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}#{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???#example.com/path?key=value#hash'
See also this question: Changing hostname in a url
from urllib.parse import urlparse
def redact_url(url: str) -> str:
url_components = urlparse(url)
if url_components.username or url_components.password:
url_components = url_components._replace(
netloc=f"{url_components.username}:???#{url_components.hostname}",
)
return url_components.geturl()
The pip module already have an internal utility function which does exactly this.
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password#example.com/path?key=value#hash")
'https://user:****#example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This will provide the expected result even if the url does not contain username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'

How to remove scheme from url in Python?

I am working with an application that returns urls, written with Flask. I want the URL displayed to the user to be as clean as possible so I want to remove the http:// from it. I looked and found the urlparse library, but couldn't find any examples of how to do this.
What would be the best way to go about it, and if urlparse is overkill is there a simpler way? Would simply removing the "http://" substring from the URL just using the regular string parsing tools be bad practice or cause problems?
I don't think urlparse offers a single method or function for this. This is how I'd do it:
from urlparse import urlparse
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
def strip_scheme(url):
parsed = urlparse(url)
scheme = "%s://" % parsed.scheme
return parsed.geturl().replace(scheme, '', 1)
print strip_scheme(url)
Output:
stackoverflow.com/questions/tagged/python?page=2
If you'd use (only) simple string parsing, you'd have to deal with http[s], and possibly other schemes yourself. Also, this handles weird casing of the scheme.
If you are using these programmatically rather than using a replace, I suggest having urlparse recreate the url without a scheme.
The ParseResult object is a tuple. So you can create another removing the fields you don't want.
# py2/3 compatibility
try:
from urllib.parse import urlparse, ParseResult
except ImportError:
from urlparse import urlparse, ParseResult
def strip_scheme(url):
parsed_result = urlparse(url)
return ParseResult('', *parsed_result[1:]).geturl()
You can remove any component of the parsedresult by simply replacing the input with an empty string.
It's important to note there is a functional difference between this answer and #Lukas Graf's answer. The most likely functional difference is that the '//' component of a url isn't technically the scheme, so this answer will preserve it, whereas it will remain here.
>>> Lukas_strip_scheme('https://yoman/hi?whatup')
'yoman/hi?whatup'
>>> strip_scheme('https://yoman/hi?whatup')
'//yoman/hi?whatup'
A simple regex search and replace works.
import re
def strip_scheme(url: str):
return re.sub(r'^https?:\/\/', '', url)
I've seen this done in Flask libraries and extensions. Worth noting you can do it although it does make use of a protected member (._replace) of the ParseResult/SplitResult.
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
split_url = urlsplit(url)
# >>> SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
split_url_without_scheme = split_url._replace(scheme="")
# >>> SplitResult(scheme='', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
new_url = urlunsplit(split_url_without_scheme)
Almost 9 years since the question was asked and still not much have changed :D.
This is the cleanest way I came up with to solve that issue:
def strip_scheme(url: str) -> str:
schemaless = urlparse(url)._replace(scheme='').geturl()
return schemaless[2:] if schemaless.startswith("//") else schemaless
And some unit tests:
import pytest
#pytest.mark.parametrize(
['url', 'expected_url'],
[
# Test url not changed when no scheme
('www.test-url.com', 'www.test-url.com'),
# Test https scheme stripped
('https://www.test-url.com', 'www.test-url.com'),
# Test http scheme stripped
('http://www.test-url.com', 'www.test-url.com'),
# Test only scheme stripped when url with path
('https://www.test-url.com/de/fr', 'www.test-url.com/de/fr'),
# Test only scheme stripped when url with path and params
('https://test.com/de/fr?param1=foo', 'test.com/de/fr?param1=foo'),
]
)
def test_strip_scheme(url: str, expected_url: str) -> None:
assert strip_scheme(url) == expected_url
According to documentation (https://docs.python.org/3/library/urllib.parse.html#url-parsing) the return value is a named tuple, its items can be accessed by index or as named attributes. So we can get access to certain parts of parsed url by using named attributes:
from urllib.parse import urlparse
def delete_http(link):
url = urlparse(link)
return url.netloc + url.path
user_link = input()
print(delete_http(user_link))
Input: https://stackoverflow.com/
Output: stackoverflow.com/

how to validate url and redirect to some url using flask

I want to validate url before redirect it using Flask.
My abstract code is here...
#app.before_request
def before():
if request.before_url == "http://127.0.0.0:8000":
return redirect("http://127.0.0.1:5000")
Do you have any idea?
Thanks in advance.
Use urlparse (builtin module). Then, use the builtin flask redirection methods
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
You can then check for the parsed out port and reconstruct a url (using the same library) with the correct port or path. This will keep the integrity of your urls, instead of dealing with string manipulation.
You can use urlparse from urllib to parse the url. The function below which checks scheme, netloc and path variables which comes after parsing the url. Supports both Python 2 and 3.
try:
# python 3
from urllib.parse import urlparse
except ImportError:
from urlparse import urlparse
def url_validator(url):
try:
result = urlparse(url)
return all([result.scheme, result.netloc, result.path])
except:
return False
You can do something like this (not tested):
#app.route('/<path>')
def redirection(path):
if path == '': # your condition
return redirect('redirect URL')

Python: How to resolve URLs containing '..'

I need to uniquely identify and store some URLs. The problem is that sometimes they come containing ".." like http://somedomain.com/foo/bar/../../some/url which basically is http://somedomain.com/some/url if I'm not wrong.
Is there a Python function or a tricky way to resolve this URLs ?
There’s a simple solution using urllib.parse.urljoin:
>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
'http://www.example.com/baz/bux/'
However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
This fix uses the urlparse function to extract the path, then use (the posixpath version of) os.path to normalize the components. Compensate for a mysterious issue with trailing slashes, then join the URL back together. The following is doctestable:
from urllib.parse import urlparse
import posixpath
def resolve_components(url):
"""
>>> resolve_components('http://www.example.com/foo/bar/../../baz/bux/')
'http://www.example.com/baz/bux/'
>>> resolve_components('http://www.example.com/some/path/../file.ext')
'http://www.example.com/some/file.ext'
"""
parsed = urlparse(url)
new_path = posixpath.normpath(parsed.path)
if parsed.path.endswith('/'):
# Compensate for issue1707768
new_path += '/'
cleaned = parsed._replace(path=new_path)
return cleaned.geturl()
Those are file paths. Look at os.path.normpath:
>>> import os
>>> os.path.normpath('/foo/bar/../../some/url')
'/some/url'
EDIT:
If this is on Windows, your input path will use backslashes instead of slashes. In this case, you still need os.path.normpath to get rid of the .. patterns (and // and /./ and whatever else is redundant), then convert the backslashes to forward slashes:
def fix_path_for_URL(path):
result = os.path.normpath(path)
if os.sep == '\\':
result = result.replace('\\', '/')
return result
EDIT 2:
If you want to normalize URLs, do it (before you strip off the method and such) with urlparse module, as shown in the answer to this question.
EDIT 3:
It seems that urljoin doesn't normalize the base path it's given:
>>> import urlparse
>>> urlparse.urljoin('http://somedomain.com/foo/bar/../../some/url', '')
'http://somedomain.com/foo/bar/../../some/url'
normpath by itself doesn't quite cut it either:
>>> import os
>>> os.path.normpath('http://somedomain.com/foo/bar/../../some/url')
'http:/somedomain.com/some/url'
Note the initial double slash got eaten.
So we have to make them join forces:
def fix_URL(urlstring):
parts = list(urlparse.urlparse(urlstring))
parts[2] = os.path.normpath(parts[2].replace('/', os.sep)).replace(os.sep, '/')
return urlparse.urlunparse(parts)
Usage:
>>> fix_URL('http://somedomain.com/foo/bar/../../some/url')
'http://somedomain.com/some/url'
urljoin won't work, as it only resolves dot segments if the second argument isn't absolute(!?) or empty. Not only that, it doesn't handle excessive ..s properly according to RFC 3986 (they should be removed; urljoin doesn't do so). posixpath.normpath can't be used either (much less os.path.normpath), since it resolves multiple slashes in a row to only one (e.g. ///// becomes /), which is incorrect behavior for URLs.
The following short function resolves any URL path string correctly. It shouldn't be used with relative paths, however, since additional decisions about its behavior would then need to be made (Raise an error on excessive ..s? Remove . in the beginning? Leave them both?) - instead, join URLs before resolving if you know you might handle relative paths. Without further ado:
def resolve_url_path(path):
segments = path.split('/')
segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
resolved = []
for segment in segments:
if segment in ('../', '..'):
if resolved[1:]:
resolved.pop()
elif segment not in ('./', '.'):
resolved.append(segment)
return ''.join(resolved)
This handles trailing dot segments (that is, without a trailing slash) and consecutive slashes correctly. To resolve an entire URL, you can then use the following wrapper (or just inline the path resolution function into it).
try:
# Python 3
from urllib.parse import urlsplit, urlunsplit
except ImportError:
# Python 2
from urlparse import urlsplit, urlunsplit
def resolve_url(url):
parts = list(urlsplit(url))
parts[2] = resolve_url_path(parts[2])
return urlunsplit(parts)
You can then call it like this:
>>> resolve_url('http://example.com/../thing///wrong/../multiple-slashes-yeah/.')
'http://example.com/thing///multiple-slashes-yeah/'
Correct URL resolution has more than a few pitfalls, it turns out!
I wanted to comment on the resolveComponents function in the top response.
Notice that if your path is /, the code will add another one which can be problematic.
I therefore changed the IF condition to:
if parsed.path.endswith( '/' ) and parsed.path != '/':
According to RFC 3986 this should happen as part of "relative resolution" process. So answer could be urlparse.urljoin(url, ''). But due to bug urlparse.urljoin does not remove dot segments when second argument is empty url. You can use yurl — alternative url manipulation library. It do this right:
>>> import yurl
>>> print yurl.URL('http://somedomain.com/foo/bar/../../some/url') + yurl.URL()
http://somedomain.com/some/url
import urlparse
import posixpath
parsed = list(urlparse.urlparse(url))
parsed[2] = posixpath.normpath(posixpath.join(parsed[2], rel_path))
proper_url = urlparse.urlunparse(parsed)
First, you need to the base URL, and then you can use urljoin from urllib.parse
example :
from urllib.parse import urljoin
def resolve_urls( urls, site):
for url in urls:
print(urljoin(site, url))
return
urls= ["/aboutMytest", "#", "/terms-of-use", "https://www.example.com/Admission"]
resolve_urls(urls,'https://example.com/')
output :
https://example.com/aboutMytest
https://example.com/
https://example.com/terms-of-use
https://www.example.com/Admission

Categories

Resources