How to remove scheme from url in Python?

How to remove scheme from url in Python? - python

I am working with an application that returns urls, written with Flask. I want the URL displayed to the user to be as clean as possible so I want to remove the http:// from it. I looked and found the urlparse library, but couldn't find any examples of how to do this.
What would be the best way to go about it, and if urlparse is overkill is there a simpler way? Would simply removing the "http://" substring from the URL just using the regular string parsing tools be bad practice or cause problems?

I don't think urlparse offers a single method or function for this. This is how I'd do it:
from urlparse import urlparse
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
def strip_scheme(url):
parsed = urlparse(url)
scheme = "%s://" % parsed.scheme
return parsed.geturl().replace(scheme, '', 1)
print strip_scheme(url)
Output:
stackoverflow.com/questions/tagged/python?page=2
If you'd use (only) simple string parsing, you'd have to deal with http[s], and possibly other schemes yourself. Also, this handles weird casing of the scheme.

If you are using these programmatically rather than using a replace, I suggest having urlparse recreate the url without a scheme.
The ParseResult object is a tuple. So you can create another removing the fields you don't want.
# py2/3 compatibility
try:
from urllib.parse import urlparse, ParseResult
except ImportError:
from urlparse import urlparse, ParseResult
def strip_scheme(url):
parsed_result = urlparse(url)
return ParseResult('', *parsed_result[1:]).geturl()
You can remove any component of the parsedresult by simply replacing the input with an empty string.
It's important to note there is a functional difference between this answer and #Lukas Graf's answer. The most likely functional difference is that the '//' component of a url isn't technically the scheme, so this answer will preserve it, whereas it will remain here.
>>> Lukas_strip_scheme('https://yoman/hi?whatup')
'yoman/hi?whatup'
>>> strip_scheme('https://yoman/hi?whatup')
'//yoman/hi?whatup'

A simple regex search and replace works.
import re
def strip_scheme(url: str):
return re.sub(r'^https?:\/\/', '', url)

I've seen this done in Flask libraries and extensions. Worth noting you can do it although it does make use of a protected member (._replace) of the ParseResult/SplitResult.
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
split_url = urlsplit(url)
# >>> SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
split_url_without_scheme = split_url._replace(scheme="")
# >>> SplitResult(scheme='', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
new_url = urlunsplit(split_url_without_scheme)

Almost 9 years since the question was asked and still not much have changed :D.
This is the cleanest way I came up with to solve that issue:
def strip_scheme(url: str) -> str:
schemaless = urlparse(url)._replace(scheme='').geturl()
return schemaless[2:] if schemaless.startswith("//") else schemaless
And some unit tests:
import pytest
#pytest.mark.parametrize(
['url', 'expected_url'],
[
# Test url not changed when no scheme
('www.test-url.com', 'www.test-url.com'),
# Test https scheme stripped
('https://www.test-url.com', 'www.test-url.com'),
# Test http scheme stripped
('http://www.test-url.com', 'www.test-url.com'),
# Test only scheme stripped when url with path
('https://www.test-url.com/de/fr', 'www.test-url.com/de/fr'),
# Test only scheme stripped when url with path and params
('https://test.com/de/fr?param1=foo', 'test.com/de/fr?param1=foo'),
]
)
def test_strip_scheme(url: str, expected_url: str) -> None:
assert strip_scheme(url) == expected_url

According to documentation (https://docs.python.org/3/library/urllib.parse.html#url-parsing) the return value is a named tuple, its items can be accessed by index or as named attributes. So we can get access to certain parts of parsed url by using named attributes:
from urllib.parse import urlparse
def delete_http(link):
url = urlparse(link)
return url.netloc + url.path
user_link = input()
print(delete_http(user_link))
Input: https://stackoverflow.com/
Output: stackoverflow.com/

Related

How to compare URLs in python? (not traditional way)?

In python, I used == to check if 2 URLs are the same, but to me, the following are the same too:
https://hello.com?test=test and https://hello.com?test22=test22
https://hello.com and https://hello.com#you_can_ignore_this
Is there any build-in function instead of working hard to compare every char etc...

You can use urllib to parse the URLs and only keep the initial parts you want (here keeping scheme+netloc+path):
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')
url1[:3]
# ('https', 'hello.com', '/')
url1[:3] == url2[:3]
# True
Comparing only the netloc (aka "domain"):
url1[1] == url2[1]
As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.

Using urlparse is the way to go, as suggested in another answer. However, special treatment should be used for the URLs that have an empty path or the path consisting only of the root "/", because they refer to the same document.
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')
result = (url1.path in "/" and url2.path in "/" and url1[:2] == url2[:2])\
or (url1[:3] == url2[:3])

It's not very clear what you mean, but you should try parsing the url first.
You could check it using urlparse().
from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")
Since the urlparse method returns a ParseResult:
ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')
You can compare these by doing
url[1] == 'hello.com' #Index 1 = netloc
https://docs.python.org/3/library/urllib.parse.html

How to remove query string from a url?

I have the following URL:
https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1
I need URLs for example:
https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
My attempt:
url='https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
url=url.split('?')[0]
if '#' in url:
url = url.split('#')[0]
I think this is a stupid way

The very helpful library furl makes it trivial to remove both query and fragment parts:
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
https://hi.com/

You can split on something that doesn't exist in the string, you'll just get a list of one element, so depending on your goal, you could do something like this to simplify your existing code:
url = url.split('?')[0].split('#')[0]
Not saying this is the best way (furl is a great solution), but it is a way.

In your example you're also removing the fragment (the thing after a #), not just the query.
You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns and converting back to a string URL with urllib.parse.unsplit:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
Output:
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'

You could try
urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
"https://stackoverflow.com/questions/7990300?fr=aladdin",
"https://stackoverflow.com/questions/22375#6",
"https://stackoverflow.com/questions/22375"?,
"https://stackoverflow.com/questions/22375#3_1"]
urls_without_query = [url.split('?')[0] for url in urls]
for example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split() returns a list that looks like ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], and if that string is url, url.split('?')[0] would give you "https://stackoverflow.com/questions/7990301".
Edit: I didn't think about # arguments. The other answers might help you more :)

You can use w3lib
from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)

Here is an answer using standard libraries, and which parses the URL properly:
from urllib.parse import urlparse
url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme,"://",parsed.netloc,parsed.path]))
expected output:
http://www.example.com/this/category
Note: this also strips params and the fragment, but is easy to modify to include those if you want.

How can I get the base of a URL in Python?

I'm trying to determine the base of a URL, or everything besides the page and parameters. I tried using split, but is there a better way than splitting it up into pieces? Is there a way I can remove everything from the last '/'?
Given this:
http://127.0.0.1/asdf/login.php
I would like:
http://127.0.0.1/asdf/

The best way to do this is use urllib.parse.
From the docs:
The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators. It supports the following URL schemes:
file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp,
prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn,
svn+ssh, telnet, wais, ws, wss.
You'd want to do something like this using urlsplit and urlunsplit:
from urllib.parse import urlsplit, urlunsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
# You now have:
# split_url.scheme "http"
# split_url.netloc "127.0.0.1"
# split_url.path "/asdf/login.php"
# split_url.query "q=abc"
# split_url.fragment "stackoverflow"
# Use all the path except everything after the last '/'
clean_path = "".join(split_url.path.rpartition("/")[:-1])
# "/asdf/"
# urlunsplit joins a urlsplit tuple
clean_url = urlunsplit(split_url)
# "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"
# A more advanced example
advanced_split_url = urlsplit('http://foo:bar#127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
# You now have *in addition* to the above:
# advanced_split_url.username "foo"
# advanced_split_url.password "bar"
# advanced_split_url.hostname "127.0.0.1"
# advanced_split_url.port "5000"

Well, for one, you could just use os.path.dirname:
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'
It's not explicitly for URLs, but it happens to work on them (even on Windows), it just doesn't leave the trailing slash (you can just add it back yourself).
You may also want to look at urllib.parse.urlparse for more fine-grained parsing; if the URL has a query string or hash involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing query and hash info.
Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:
>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

There is shortest solution for Python3 with use of urllib library (don't know if fastest):
from urllib.parse import urljoin
base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
# output: http://127.0.0.1/asdf/
Keep in mind that urllib library supports uri/url compatible with HTML's keyword. It means that uri/url ending with '/' means different that without like here https://stackoverflow.com/a/1793282/7750840/:
base_url = urljoin('http://127.0.0.1/asdf/', '.')
# output: http://127.0.0.1/asdf/
base_url = urljoin('http://127.0.0.1/asdf', '.')
# output: http://127.0.0.1/
This is link to urllib for python: https://pythonprogramming.net/urllib-tutorial-python-3/

Agree that best way to do it is with urllib.parse
Specifically, you can decompose the url with urllib.parse.urlparse and then replace every attribute other than scheme and netloc with an empty string. If you want to keep the path attribute (as in your question), you can do so with an extra string parsing step. Example function below:
import urllib.parse
def base_url(url, with_path=False):
parsed = urllib.parse.urlparse(url)
path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
parsed = parsed._replace(path=path)
parsed = parsed._replace(params='')
parsed = parsed._replace(query='')
parsed = parsed._replace(fragment='')
return parsed.geturl()
Examples:
>>> base_url('http://127.0.0.1/asdf/login.php', with_path=True)
'http://127.0.0.1/asdf'
>>> base_url('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1'

No need to use a regex, you can just use rsplit():
>>> url = 'http://127.0.0.1/asdf/login.php'
>>> url.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

When you use urlsplit, it returns a SplitResult object:
from urllib.parse import urlsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
print(split_url)
>>> SplitResult(scheme='http' netloc='127.0.0.1' path='/asdf/login.php' query='' fragment='')
You can make your own SplitResult() object and pass it through urlunsplit. This code should work for multiple url splits, regardless of their length, as long as you know what the last path element you want is.
from urllib.parse import urlsplit, urlunsplit, SplitResult
# splitting url:
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
# editing the variables you want to change (in this case, path):
last_element = 'asdf' # this can be any element in the path.
path_array = split_url.path.split('/')
# print(path_array)
# >>> ['', 'asdf', 'login.php']
path_array.remove('')
ind = path_array.index(last_element)
new_path = '/' + '/'.join(path_array[:ind+1]) + '/'
# making SplitResult() object with edited data:
new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')
# unsplitting:
base_url = urlunsplit(new_url)

Get the right-most occurence of slash; use the string slice through that position in the original string. The +1 gets you that final slash at the end.
link = "http://127.0.0.1/asdf/login.php"
link[:link.rfind('/')+1]

If you use python3, you can use urlparse and urlunparse.
In :from urllib.parse import urlparse, urlunparse
In :url = "http://127.0.0.1/asdf/login.php"
In :result = urlparse(url)
In :new = list(result)
In :new[2] = new[2].replace("login.php", "")
In :urlunparse(new)
Out:'http://127.0.0.1/asdf/'

Modify URL components in Python 2

Is there a cleaner way to modify some parts of a URL in Python 2?
For example
http://foo/bar -> http://foo/yah
At present, I'm doing this:
import urlparse
url = 'http://foo/bar'
# Modify path component of URL from 'bar' to 'yah'
# Use nasty convert-to-list hack due to urlparse.ParseResult being immutable
parts = list(urlparse.urlparse(url))
parts[2] = 'yah'
url = urlparse.urlunparse(parts)
Is there a cleaner solution?

Unfortunately, the documentation is out of date; the results produced by urlparse.urlparse() (and urlparse.urlsplit()) use a collections.namedtuple()-produced class as a base.
Don't turn this namedtuple into a list, but make use of the utility method provided for just this task:
parts = urlparse.urlparse(url)
parts = parts._replace(path='yah')
url = parts.geturl()
The namedtuple._replace() method lets you create a new copy with specific elements replaced. The ParseResult.geturl() method then re-joins the parts into a url for you.
Demo:
>>> import urlparse
>>> url = 'http://foo/bar'
>>> parts = urlparse.urlparse(url)
>>> parts = parts._replace(path='yah')
>>> parts.geturl()
'http://foo/yah'
mgilson filed a bug report (with patch) to address the documentation issue.

I guess the proper way to do it is this way.
As using _replace private methods or variables is not suggested.
from urlparse import urlparse, urlunparse
res = urlparse('http://www.goog.com:80/this/is/path/;param=paramval?q=val&foo=bar#hash')
l_res = list(res)
# this willhave ['http', 'www.goog.com:80', '/this/is/path/', 'param=paramval', 'q=val&foo=bar', 'hash']
l_res[2] = '/new/path'
urlunparse(l_res)
# outputs 'http://www.goog.com:80/new/path;param=paramval?q=val&foo=bar#hash'

Python: How to resolve URLs containing '..'

I need to uniquely identify and store some URLs. The problem is that sometimes they come containing ".." like http://somedomain.com/foo/bar/../../some/url which basically is http://somedomain.com/some/url if I'm not wrong.
Is there a Python function or a tricky way to resolve this URLs ?

There’s a simple solution using urllib.parse.urljoin:
>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
'http://www.example.com/baz/bux/'
However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
This fix uses the urlparse function to extract the path, then use (the posixpath version of) os.path to normalize the components. Compensate for a mysterious issue with trailing slashes, then join the URL back together. The following is doctestable:
from urllib.parse import urlparse
import posixpath
def resolve_components(url):
"""
>>> resolve_components('http://www.example.com/foo/bar/../../baz/bux/')
'http://www.example.com/baz/bux/'
>>> resolve_components('http://www.example.com/some/path/../file.ext')
'http://www.example.com/some/file.ext'
"""
parsed = urlparse(url)
new_path = posixpath.normpath(parsed.path)
if parsed.path.endswith('/'):
# Compensate for issue1707768
new_path += '/'
cleaned = parsed._replace(path=new_path)
return cleaned.geturl()

Those are file paths. Look at os.path.normpath:
>>> import os
>>> os.path.normpath('/foo/bar/../../some/url')
'/some/url'
EDIT:
If this is on Windows, your input path will use backslashes instead of slashes. In this case, you still need os.path.normpath to get rid of the .. patterns (and // and /./ and whatever else is redundant), then convert the backslashes to forward slashes:
def fix_path_for_URL(path):
result = os.path.normpath(path)
if os.sep == '\\':
result = result.replace('\\', '/')
return result
EDIT 2:
If you want to normalize URLs, do it (before you strip off the method and such) with urlparse module, as shown in the answer to this question.
EDIT 3:
It seems that urljoin doesn't normalize the base path it's given:
>>> import urlparse
>>> urlparse.urljoin('http://somedomain.com/foo/bar/../../some/url', '')
'http://somedomain.com/foo/bar/../../some/url'
normpath by itself doesn't quite cut it either:
>>> import os
>>> os.path.normpath('http://somedomain.com/foo/bar/../../some/url')
'http:/somedomain.com/some/url'
Note the initial double slash got eaten.
So we have to make them join forces:
def fix_URL(urlstring):
parts = list(urlparse.urlparse(urlstring))
parts[2] = os.path.normpath(parts[2].replace('/', os.sep)).replace(os.sep, '/')
return urlparse.urlunparse(parts)
Usage:
>>> fix_URL('http://somedomain.com/foo/bar/../../some/url')
'http://somedomain.com/some/url'

urljoin won't work, as it only resolves dot segments if the second argument isn't absolute(!?) or empty. Not only that, it doesn't handle excessive ..s properly according to RFC 3986 (they should be removed; urljoin doesn't do so). posixpath.normpath can't be used either (much less os.path.normpath), since it resolves multiple slashes in a row to only one (e.g. ///// becomes /), which is incorrect behavior for URLs.
The following short function resolves any URL path string correctly. It shouldn't be used with relative paths, however, since additional decisions about its behavior would then need to be made (Raise an error on excessive ..s? Remove . in the beginning? Leave them both?) - instead, join URLs before resolving if you know you might handle relative paths. Without further ado:
def resolve_url_path(path):
segments = path.split('/')
segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
resolved = []
for segment in segments:
if segment in ('../', '..'):
if resolved[1:]:
resolved.pop()
elif segment not in ('./', '.'):
resolved.append(segment)
return ''.join(resolved)
This handles trailing dot segments (that is, without a trailing slash) and consecutive slashes correctly. To resolve an entire URL, you can then use the following wrapper (or just inline the path resolution function into it).
try:
# Python 3
from urllib.parse import urlsplit, urlunsplit
except ImportError:
# Python 2
from urlparse import urlsplit, urlunsplit
def resolve_url(url):
parts = list(urlsplit(url))
parts[2] = resolve_url_path(parts[2])
return urlunsplit(parts)
You can then call it like this:
>>> resolve_url('http://example.com/../thing///wrong/../multiple-slashes-yeah/.')
'http://example.com/thing///multiple-slashes-yeah/'
Correct URL resolution has more than a few pitfalls, it turns out!

I wanted to comment on the resolveComponents function in the top response.
Notice that if your path is /, the code will add another one which can be problematic.
I therefore changed the IF condition to:
if parsed.path.endswith( '/' ) and parsed.path != '/':

According to RFC 3986 this should happen as part of "relative resolution" process. So answer could be urlparse.urljoin(url, ''). But due to bug urlparse.urljoin does not remove dot segments when second argument is empty url. You can use yurl — alternative url manipulation library. It do this right:
>>> import yurl
>>> print yurl.URL('http://somedomain.com/foo/bar/../../some/url') + yurl.URL()
http://somedomain.com/some/url

import urlparse
import posixpath
parsed = list(urlparse.urlparse(url))
parsed[2] = posixpath.normpath(posixpath.join(parsed[2], rel_path))
proper_url = urlparse.urlunparse(parsed)

First, you need to the base URL, and then you can use urljoin from urllib.parse
example :
from urllib.parse import urljoin
def resolve_urls( urls, site):
for url in urls:
print(urljoin(site, url))
return
urls= ["/aboutMytest", "#", "/terms-of-use", "https://www.example.com/Admission"]
resolve_urls(urls,'https://example.com/')
output :
https://example.com/aboutMytest
https://example.com/
https://example.com/terms-of-use
https://www.example.com/Admission

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to remove scheme from url in Python? - python

A simple regex search and replace works. import re def strip_scheme(url: str): return re.sub(r'^https?:\/\/', '', url)

Related

How to compare URLs in python? (not traditional way)?

How to remove query string from a url?

How can I get the base of a URL in Python?

Modify URL components in Python 2

Python: How to resolve URLs containing '..'

Categories

Resources