I have an URL like this:
https://user:password#example.com/path?key=value#hash
The result should be:
https://user:???#example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL a high level data structure, then operate on this data structure, then serializing to a string.
Is this possible with Python?
You can use the built in urlparse to query out the password from a url. It is available in both Python 2 and 3, but under different locations.
Python 2 import urlparse
Python 3 from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password#example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}#{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???#example.com/path?key=value#hash'
See also this question: Changing hostname in a url
from urllib.parse import urlparse
def redact_url(url: str) -> str:
url_components = urlparse(url)
if url_components.username or url_components.password:
url_components = url_components._replace(
netloc=f"{url_components.username}:???#{url_components.hostname}",
)
return url_components.geturl()
The pip module already have an internal utility function which does exactly this.
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password#example.com/path?key=value#hash")
'https://user:****#example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This will provide the expected result even if the url does not contain username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'
Related
I have URLs of the form
http://example.com/example/a/b/c.html
https//www.example.com/
How do I get the path from the server root, without protocol or domain name? With the examples above, the function should return:
/example/a/b/c.html
/
(I am using Django: answers relying on this framework are accepted!)
urlparse module can solve this:
from urlparse import urlparse # for python 2
from urllib.parse import urlparse # for python 3
parsed_url = urlparse('http://example.com/abc/cde')
assert parsed_url.path == '/abc/cde'
You could use the path attribute of django HttpRequest object, in other words:
request.path
see the docs for more
I have some Dot Net code that parses and retrieves a value from a URL string.
However, I would like to perform the same function but now use python code instead.
Dot Net code snippet is below:
string queryString = string.Empty;
string application_id = string.Empty;
string currentURL = Browser.getDriver.Url;
Uri url = new Uri(currentURL);
string query_String = url.Query;
application_id = query_String.Split(new char[] { '=' }).Last();
Thanks in advance
Always best to use std lib functions if they are available. Python 3 has urllib.parse (if you are still on Py2, it's urlparse). Use the urlparse method of this module to extract the query part of the url (the stuff after the '?'). Then parse_qs will convert this query to a dict of key:list values - the values are lists to handle query strings that have repeated keys.
url = 'http://www.somesite.blah/page?id=12345&attr=good&attr=bad&attr=ugly'
try:
from urllib.parse import urlparse, parse_qs
except ImportError:
# still using Python 2? time to move up
from urlparse import urlparse, parse_qs
parts = urlparse(url)
print(parts)
query_dict = parse_qs(parts.query)
print(query_dict)
print(query_dict['id'][0])
prints:
ParseResult(scheme='http', netloc='www.somesite.blah', path='/page', params='',
query='id=12345&attr=good&attr=bad&attr=ugly', fragment='')
{'attr': ['good', 'bad', 'ugly'], 'id': ['12345']}
12345
first, last = query_String.split('=')
I am working with an application that returns urls, written with Flask. I want the URL displayed to the user to be as clean as possible so I want to remove the http:// from it. I looked and found the urlparse library, but couldn't find any examples of how to do this.
What would be the best way to go about it, and if urlparse is overkill is there a simpler way? Would simply removing the "http://" substring from the URL just using the regular string parsing tools be bad practice or cause problems?
I don't think urlparse offers a single method or function for this. This is how I'd do it:
from urlparse import urlparse
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
def strip_scheme(url):
parsed = urlparse(url)
scheme = "%s://" % parsed.scheme
return parsed.geturl().replace(scheme, '', 1)
print strip_scheme(url)
Output:
stackoverflow.com/questions/tagged/python?page=2
If you'd use (only) simple string parsing, you'd have to deal with http[s], and possibly other schemes yourself. Also, this handles weird casing of the scheme.
If you are using these programmatically rather than using a replace, I suggest having urlparse recreate the url without a scheme.
The ParseResult object is a tuple. So you can create another removing the fields you don't want.
# py2/3 compatibility
try:
from urllib.parse import urlparse, ParseResult
except ImportError:
from urlparse import urlparse, ParseResult
def strip_scheme(url):
parsed_result = urlparse(url)
return ParseResult('', *parsed_result[1:]).geturl()
You can remove any component of the parsedresult by simply replacing the input with an empty string.
It's important to note there is a functional difference between this answer and #Lukas Graf's answer. The most likely functional difference is that the '//' component of a url isn't technically the scheme, so this answer will preserve it, whereas it will remain here.
>>> Lukas_strip_scheme('https://yoman/hi?whatup')
'yoman/hi?whatup'
>>> strip_scheme('https://yoman/hi?whatup')
'//yoman/hi?whatup'
A simple regex search and replace works.
import re
def strip_scheme(url: str):
return re.sub(r'^https?:\/\/', '', url)
I've seen this done in Flask libraries and extensions. Worth noting you can do it although it does make use of a protected member (._replace) of the ParseResult/SplitResult.
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
split_url = urlsplit(url)
# >>> SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
split_url_without_scheme = split_url._replace(scheme="")
# >>> SplitResult(scheme='', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
new_url = urlunsplit(split_url_without_scheme)
Almost 9 years since the question was asked and still not much have changed :D.
This is the cleanest way I came up with to solve that issue:
def strip_scheme(url: str) -> str:
schemaless = urlparse(url)._replace(scheme='').geturl()
return schemaless[2:] if schemaless.startswith("//") else schemaless
And some unit tests:
import pytest
#pytest.mark.parametrize(
['url', 'expected_url'],
[
# Test url not changed when no scheme
('www.test-url.com', 'www.test-url.com'),
# Test https scheme stripped
('https://www.test-url.com', 'www.test-url.com'),
# Test http scheme stripped
('http://www.test-url.com', 'www.test-url.com'),
# Test only scheme stripped when url with path
('https://www.test-url.com/de/fr', 'www.test-url.com/de/fr'),
# Test only scheme stripped when url with path and params
('https://test.com/de/fr?param1=foo', 'test.com/de/fr?param1=foo'),
]
)
def test_strip_scheme(url: str, expected_url: str) -> None:
assert strip_scheme(url) == expected_url
According to documentation (https://docs.python.org/3/library/urllib.parse.html#url-parsing) the return value is a named tuple, its items can be accessed by index or as named attributes. So we can get access to certain parts of parsed url by using named attributes:
from urllib.parse import urlparse
def delete_http(link):
url = urlparse(link)
return url.netloc + url.path
user_link = input()
print(delete_http(user_link))
Input: https://stackoverflow.com/
Output: stackoverflow.com/
I want to validate url before redirect it using Flask.
My abstract code is here...
#app.before_request
def before():
if request.before_url == "http://127.0.0.0:8000":
return redirect("http://127.0.0.1:5000")
Do you have any idea?
Thanks in advance.
Use urlparse (builtin module). Then, use the builtin flask redirection methods
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
You can then check for the parsed out port and reconstruct a url (using the same library) with the correct port or path. This will keep the integrity of your urls, instead of dealing with string manipulation.
You can use urlparse from urllib to parse the url. The function below which checks scheme, netloc and path variables which comes after parsing the url. Supports both Python 2 and 3.
try:
# python 3
from urllib.parse import urlparse
except ImportError:
from urlparse import urlparse
def url_validator(url):
try:
result = urlparse(url)
return all([result.scheme, result.netloc, result.path])
except:
return False
You can do something like this (not tested):
#app.route('/<path>')
def redirection(path):
if path == '': # your condition
return redirect('redirect URL')
What is the preferred solution for checking if an URL is relative or absolute?
Python 2
You can use the urlparse module to parse an URL and then you can check if it's relative or absolute by checking whether it has the host name set.
>>> import urlparse
>>> def is_absolute(url):
... return bool(urlparse.urlparse(url).netloc)
...
>>> is_absolute('http://www.example.com/some/path')
True
>>> is_absolute('//www.example.com/some/path')
True
>>> is_absolute('/some/path')
False
Python 3
urlparse has been moved to urllib.parse, so use the following:
from urllib.parse import urlparse
def is_absolute(url):
return bool(urlparse(url).netloc)
If you want to know if an URL is absolute or relative in order to join it with a base URL, I usually do urllib.parse.urljoin anyway:
>>> from urllib.parse import urljoin
>>> urljoin('http://example.com/', 'http://example.com/picture.png')
'http://example.com/picture.png'
>>> urljoin('http://example1.com/', '/picture.png')
'http://example1.com/picture.png'
>>>
Can't comment accepted answer, so write this comment as new answer: IMO checking scheme in accepted answer ( bool(urlparse.urlparse(url).scheme) ) is not really good idea because of http://example.com/file.jpg, https://example.com/file.jpg and //example.com/file.jpg are absolute urls but in last case we get scheme = ''
I use this code:
is_absolute = True if '//' in my_url else False