Python Code to Parse & Retrieve value from URL string

I have some .NET code that parses a URL string and retrieves a value from it.
However, I would now like to do the same thing in Python.
The .NET code snippet is below:
string queryString = string.Empty;
string application_id = string.Empty;
string currentURL = Browser.getDriver.Url;
Uri url = new Uri(currentURL);
string query_String = url.Query;
application_id = query_String.Split(new char[] { '=' }).Last();
Thanks in advance

Always best to use std lib functions if they are available. Python 3 has urllib.parse (if you are still on Py2, it's urlparse). Use the urlparse method of this module to extract the query part of the url (the stuff after the '?'). Then parse_qs will convert this query to a dict of key:list values - the values are lists to handle query strings that have repeated keys.
url = 'http://www.somesite.blah/page?id=12345&attr=good&attr=bad&attr=ugly'
try:
    from urllib.parse import urlparse, parse_qs
except ImportError:
    # still using Python 2? time to move up
    from urlparse import urlparse, parse_qs
parts = urlparse(url)
print(parts)
query_dict = parse_qs(parts.query)
print(query_dict)
print(query_dict['id'][0])
prints:
ParseResult(scheme='http', netloc='www.somesite.blah', path='/page', params='',
query='id=12345&attr=good&attr=bad&attr=ugly', fragment='')
{'attr': ['good', 'bad', 'ugly'], 'id': ['12345']}
12345
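To mirror the .NET snippet from the question, the steps above can be wrapped in a small helper. This is only a sketch: the function name get_query_value, the parameter name application_id, and the example URL are assumptions for illustration, not taken from the asker's application.

```python
from urllib.parse import urlparse, parse_qs


def get_query_value(url, key, default=None):
    """Return the first value of `key` in the URL's query string, or `default`."""
    query = urlparse(url).query          # the part after '?'
    values = parse_qs(query).get(key)    # list of values, or None if the key is absent
    return values[0] if values else default


# Hypothetical URL standing in for Browser.getDriver.Url
current_url = "http://www.somesite.blah/page?application_id=98765"
application_id = get_query_value(current_url, "application_id")
print(application_id)
```

Unlike splitting on '=', this keeps working if more parameters are later added to the URL.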

first, last = query_String.split('=')

Related

How to compare URLs in python? (not traditional way)?

In python, I used == to check if 2 URLs are the same, but to me, the following are the same too:
https://hello.com?test=test and https://hello.com?test22=test22
https://hello.com and https://hello.com#you_can_ignore_this
Is there any built-in function for this, instead of working hard to compare every character?

You can use urllib to parse the URLs and only keep the initial parts you want (here keeping scheme+netloc+path):
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')
url1[:3]
# ('https', 'hello.com', '/')
url1[:3] == url2[:3]
# True
Comparing only the netloc (aka "domain"):
url1[1] == url2[1]
As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.

Using urlparse is the way to go, as suggested in another answer. However, special treatment should be used for the URLs that have an empty path or the path consisting only of the root "/", because they refer to the same document.
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')
result = (url1.path in "/" and url2.path in "/" and url1[:2] == url2[:2])\
or (url1[:3] == url2[:3])
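The same idea can be put into a reusable function. This is a minimal sketch, assuming two URLs count as "the same" when scheme, netloc and path match, with an empty path treated like "/". The name same_resource is hypothetical.

```python
from urllib.parse import urlsplit


def same_resource(a, b):
    """Compare scheme, netloc and path only; query and fragment are ignored."""
    pa, pb = urlsplit(a), urlsplit(b)
    # Treat an empty path and "/" as the same path.
    path_a = pa.path or "/"
    path_b = pb.path or "/"
    return (pa.scheme, pa.netloc, path_a) == (pb.scheme, pb.netloc, path_b)


print(same_resource("https://hello.com?test=test", "https://hello.com?test22=test22"))
print(same_resource("https://hello.com", "https://hello.com/#you_can_ignore_this"))
```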

It's not very clear what you mean, but you should try parsing the url first.
You could check it using urlparse().
from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")
Since the urlparse method returns a ParseResult:
ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')
You can compare these by doing
url[1] == 'hello.com' #Index 1 = netloc
https://docs.python.org/3/library/urllib.parse.html

Redact and remove password from URL

I have an URL like this:
https://user:password@example.com/path?key=value#hash
The result should be:
https://user:???@example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL into a high-level data structure, operate on that data structure, and then serialize it back to a string.
Is this possible with Python?

You can use the built in urlparse to query out the password from a url. It is available in both Python 2 and 3, but under different locations.
Python 2: import urlparse
Python 3: from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password@example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}@{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???@example.com/path?key=value#hash'
See also this question: Changing hostname in a url

from urllib.parse import urlparse
def redact_url(url: str) -> str:
    url_components = urlparse(url)
    if url_components.username or url_components.password:
        url_components = url_components._replace(
            netloc=f"{url_components.username}:???@{url_components.hostname}",
        )
    return url_components.geturl()
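Rebuilding the netloc from username and hostname drops an explicit port if the URL has one. A possible variant that keeps the port is sketched below. The name redact_password and the mask parameter are hypothetical, and the sketch assumes the URL should be returned unchanged when it contains no password.

```python
from urllib.parse import urlsplit


def redact_password(url, mask="???"):
    """Replace the password in the URL's netloc with `mask`, keeping host and port."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to redact
    # netloc has the form userinfo "@" host[:port]; split at the last "@"
    userinfo, _, hostport = parts.netloc.rpartition("@")
    username = userinfo.partition(":")[0]
    return parts._replace(netloc=f"{username}:{mask}@{hostport}").geturl()


print(redact_password("https://user:password@example.com:8443/path?key=value#hash"))
```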

The pip module already has an internal utility function which does exactly this.
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password@example.com/path?key=value#hash")
'https://user:****@example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This will provide the expected result even if the url does not contain username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'

How can I get the base of a URL in Python?

I'm trying to determine the base of a URL, or everything besides the page and parameters. I tried using split, but is there a better way than splitting it up into pieces? Is there a way I can remove everything from the last '/'?
Given this:
http://127.0.0.1/asdf/login.php
I would like:
http://127.0.0.1/asdf/

The best way to do this is to use urllib.parse.
From the docs:
The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators. It supports the following URL schemes:
file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp,
prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn,
svn+ssh, telnet, wais, ws, wss.
You'd want to do something like this using urlsplit and urlunsplit:
from urllib.parse import urlsplit, urlunsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
# You now have:
# split_url.scheme "http"
# split_url.netloc "127.0.0.1"
# split_url.path "/asdf/login.php"
# split_url.query "q=abc"
# split_url.fragment "stackoverflow"
# Use all the path except everything after the last '/'
clean_path = "".join(split_url.path.rpartition("/")[:-1])
# "/asdf/"
# urlunsplit joins a urlsplit tuple
clean_url = urlunsplit(split_url)
# "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"
# A more advanced example
advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
# You now have *in addition* to the above:
# advanced_split_url.username "foo"
# advanced_split_url.password "bar"
# advanced_split_url.hostname "127.0.0.1"
# advanced_split_url.port 5000 (an int, not a string)
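To actually get the base URL asked for in the question, the trimmed path has to be passed back into urlunsplit. A minimal sketch, assuming the query and fragment should be dropped; the function name base_of is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def base_of(url):
    """Return the URL up to and including the last '/' of its path."""
    parts = urlsplit(url)
    # Keep everything before the last "/" and put the slash back.
    directory = parts.path.rpartition("/")[0] + "/"
    # Empty strings for query and fragment drop those components.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


print(base_of("http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"))
```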

Well, for one, you could just use os.path.dirname:
>>> import os
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'
It's not explicitly for URLs, but it happens to work on them (even on Windows), it just doesn't leave the trailing slash (you can just add it back yourself).
You may also want to look at urllib.parse.urlparse for more fine-grained parsing; if the URL has a query string or hash involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing query and hash info.
Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:
>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

Here is the shortest solution for Python 3 using the urllib library (I don't know whether it is the fastest):
from urllib.parse import urljoin
base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
# output: http://127.0.0.1/asdf/
Keep in mind that urllib resolves URLs the same way HTML links are resolved. This means a URL ending with '/' is treated differently from one without it, as explained here https://stackoverflow.com/a/1793282/7750840/:
base_url = urljoin('http://127.0.0.1/asdf/', '.')
# output: http://127.0.0.1/asdf/
base_url = urljoin('http://127.0.0.1/asdf', '.')
# output: http://127.0.0.1/
Here is a link to a urllib tutorial for Python: https://pythonprogramming.net/urllib-tutorial-python-3/

Agree that best way to do it is with urllib.parse
Specifically, you can decompose the url with urllib.parse.urlparse and then replace every attribute other than scheme and netloc with an empty string. If you want to keep the path attribute (as in your question), you can do so with an extra string parsing step. Example function below:
import urllib.parse
def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path)
    parsed = parsed._replace(params='')
    parsed = parsed._replace(query='')
    parsed = parsed._replace(fragment='')
    return parsed.geturl()
Examples:
>>> base_url('http://127.0.0.1/asdf/login.php', with_path=True)
'http://127.0.0.1/asdf'
>>> base_url('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1'

No need to use a regex, you can just use rsplit():
>>> url = 'http://127.0.0.1/asdf/login.php'
>>> url.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

When you use urlsplit, it returns a SplitResult object:
from urllib.parse import urlsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
print(split_url)
# SplitResult(scheme='http', netloc='127.0.0.1', path='/asdf/login.php', query='', fragment='')
You can make your own SplitResult() object and pass it through urlunsplit. This code should work for multiple url splits, regardless of their length, as long as you know what the last path element you want is.
from urllib.parse import urlsplit, urlunsplit, SplitResult
# splitting url:
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
# editing the variables you want to change (in this case, path):
last_element = 'asdf' # this can be any element in the path.
path_array = split_url.path.split('/')
# print(path_array)
# >>> ['', 'asdf', 'login.php']
path_array.remove('')
ind = path_array.index(last_element)
new_path = '/' + '/'.join(path_array[:ind+1]) + '/'
# making SplitResult() object with edited data:
new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')
# unsplitting:
base_url = urlunsplit(new_url)

Get the right-most occurrence of a slash, then slice the original string through that position. The +1 gets you that final slash at the end.
link = "http://127.0.0.1/asdf/login.php"
link[:link.rfind('/')+1]

If you use python3, you can use urlparse and urlunparse.
In :from urllib.parse import urlparse, urlunparse
In :url = "http://127.0.0.1/asdf/login.php"
In :result = urlparse(url)
In :new = list(result)
In :new[2] = new[2].replace("login.php", "")
In :urlunparse(new)
Out:'http://127.0.0.1/asdf/'

strip string from url in Python

input: url = http://127.0.0.1:8000/data/number/
http://127.0.0.1:8000/data/ is consistent for each page number.
output: number
Instead of slicing the url with url[-4:-1], is there a better way to do it?

You can use a combination of urlparse and split.
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url = "http://127.0.0.1:8000/data/number/"
path = urlparse(url).path
val = path.split("/")[2]
print(val)
This prints:
number
The output of urlparse for the above URL is
ParseResult(scheme='http', netloc='127.0.0.1:8000', path='/data/number/', params='', query='', fragment='')
We are utilizing the path portion of this tuple. We split it on / and take the element at index 2 (index 0 is the empty string before the leading slash).

Use the urlparse function; life will be easier:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
urlparse('http://127.0.0.1:8000/data/number/').path.split('/')[2]
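Indexing with [2] depends on the path always having exactly the form /data/number/. If only the last path segment is wanted, a slightly more tolerant sketch is shown below. The function name last_segment is hypothetical, and the sketch assumes that an empty string is an acceptable result when the URL has no path segments.

```python
from urllib.parse import urlsplit


def last_segment(url):
    """Return the last non-empty segment of the URL's path, or '' if there is none."""
    path = urlsplit(url).path
    # Drop the empty strings produced by leading, trailing or doubled slashes.
    segments = [segment for segment in path.split("/") if segment]
    return segments[-1] if segments else ""


print(last_segment("http://127.0.0.1:8000/data/number/"))
```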

How to remove scheme from url in Python?

I am working with an application that returns urls, written with Flask. I want the URL displayed to the user to be as clean as possible so I want to remove the http:// from it. I looked and found the urlparse library, but couldn't find any examples of how to do this.
What would be the best way to go about it, and if urlparse is overkill is there a simpler way? Would simply removing the "http://" substring from the URL just using the regular string parsing tools be bad practice or cause problems?

I don't think urlparse offers a single method or function for this. This is how I'd do it:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
def strip_scheme(url):
    parsed = urlparse(url)
    scheme = "%s://" % parsed.scheme
    return parsed.geturl().replace(scheme, '', 1)
print(strip_scheme(url))
Output:
stackoverflow.com/questions/tagged/python?page=2
If you'd use (only) simple string parsing, you'd have to deal with http[s], and possibly other schemes yourself. Also, this handles weird casing of the scheme.

If you are using these programmatically rather than using a replace, I suggest having urlparse recreate the url without a scheme.
The ParseResult object is a tuple. So you can create another removing the fields you don't want.
# py2/3 compatibility
try:
    from urllib.parse import urlparse, ParseResult
except ImportError:
    from urlparse import urlparse, ParseResult
def strip_scheme(url):
    parsed_result = urlparse(url)
    return ParseResult('', *parsed_result[1:]).geturl()
You can remove any component of the ParseResult by simply replacing the input with an empty string.
It's important to note there is a functional difference between this answer and @Lukas Graf's answer. The most likely functional difference is that the '//' component of a url isn't technically the scheme, so his answer strips it, whereas it remains here.
>>> Lukas_strip_scheme('https://yoman/hi?whatup')
'yoman/hi?whatup'
>>> strip_scheme('https://yoman/hi?whatup')
'//yoman/hi?whatup'

A simple regex search and replace works.
import re
def strip_scheme(url: str):
    return re.sub(r'^https?:\/\/', '', url)
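The pattern above only matches lowercase http and https. If other schemes or mixed casing may appear, a broader pattern is one option. The sketch below follows the scheme grammar of RFC 3986 (a letter followed by letters, digits, '+', '-' or '.'). The function name strip_any_scheme is hypothetical.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "-" or ".", then "://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def strip_any_scheme(url: str) -> str:
    """Remove a leading 'scheme://' prefix, whatever the scheme is."""
    return _SCHEME_RE.sub("", url, count=1)


print(strip_any_scheme("HtTp://stackoverflow.com/questions/tagged/python?page=2"))
```

Because the pattern is anchored at the start of the string, a URL that appears later, for example inside a query parameter, is left alone.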

I've seen this done in Flask libraries and extensions. Worth noting you can do it, although it relies on ._replace, which looks like a protected member of ParseResult/SplitResult (it is in fact a documented namedtuple method).
from urllib.parse import urlsplit, urlunsplit
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
split_url = urlsplit(url)
# >>> SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
split_url_without_scheme = split_url._replace(scheme="")
# >>> SplitResult(scheme='', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
new_url = urlunsplit(split_url_without_scheme)

Almost 9 years since the question was asked and still not much has changed :D.
This is the cleanest way I came up with to solve that issue:
from urllib.parse import urlparse
def strip_scheme(url: str) -> str:
    schemaless = urlparse(url)._replace(scheme='').geturl()
    return schemaless[2:] if schemaless.startswith("//") else schemaless
And some unit tests:
import pytest
@pytest.mark.parametrize(
    ['url', 'expected_url'],
    [
        # Test url not changed when no scheme
        ('www.test-url.com', 'www.test-url.com'),
        # Test https scheme stripped
        ('https://www.test-url.com', 'www.test-url.com'),
        # Test http scheme stripped
        ('http://www.test-url.com', 'www.test-url.com'),
        # Test only scheme stripped when url with path
        ('https://www.test-url.com/de/fr', 'www.test-url.com/de/fr'),
        # Test only scheme stripped when url with path and params
        ('https://test.com/de/fr?param1=foo', 'test.com/de/fr?param1=foo'),
    ]
)
def test_strip_scheme(url: str, expected_url: str) -> None:
    assert strip_scheme(url) == expected_url

According to documentation (https://docs.python.org/3/library/urllib.parse.html#url-parsing) the return value is a named tuple, its items can be accessed by index or as named attributes. So we can get access to certain parts of parsed url by using named attributes:
from urllib.parse import urlparse
def delete_http(link):
    url = urlparse(link)
    return url.netloc + url.path
user_link = input()
print(delete_http(user_link))
Input: https://stackoverflow.com/
Output: stackoverflow.com/
