strip string from url in Python

Input: url = http://127.0.0.1:8000/data/number/
The prefix http://127.0.0.1:8000/data/ is the same for every page number.
Output: number
Instead of slicing the URL with url[-4:-1], is there a better way to do it?

You can use a combination of urlparse and split.
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url = "http://127.0.0.1:8000/data/number/"
path = urlparse(url).path
val = path.split("/")[2]
print(val)
This prints:
number
The output of urlparse for the above URL is
ParseResult(scheme='http', netloc='127.0.0.1:8000', path='/data/number/', params='', query='', fragment='')
We use the path portion of this tuple. Splitting it on / gives ['', 'data', 'number', ''], and index 2 holds the value we want.

Use urlparse; it makes life easier:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
urlparse('http://127.0.0.1:8000/data/number/').path.split('/')[2]
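Indexing with [2] assumes the value always sits at the same depth in the path. As a small sketch, assuming Python 3 and assuming the wanted value is always the last non-empty path segment (the helper name last_path_segment is made up for illustration), the trailing slash can be stripped first and the path split once from the right:

```python
from urllib.parse import urlparse

def last_path_segment(url):
    """Return the last non-empty segment of the URL path ('' if there is none)."""
    path = urlparse(url).path
    # Drop any trailing slash so the final piece of the split is not empty.
    return path.rstrip("/").rsplit("/", 1)[-1]

print(last_path_segment("http://127.0.0.1:8000/data/number/"))  # number
```

This keeps working if the prefix gains or loses a path level, which a fixed index would not.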

Related

How to compare URLs in python? (not traditional way)?

In Python, I used == to check whether two URLs are the same, but to me the following pairs are the same too:
https://hello.com?test=test and https://hello.com?test22=test22
https://hello.com and https://hello.com#you_can_ignore_this
Is there a built-in function for this, instead of comparing every character by hand?

You can use urllib to parse the URLs and only keep the initial parts you want (here keeping scheme+netloc+path):
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com/?test22=test22')
url1[:3]
# ('https', 'hello.com', '/')
url1[:3] == url2[:3]
# True
Comparing only the netloc (aka "domain"):
url1[1] == url2[1]
As you can see, once you have parsed the URL you have a lot of flexibility to perform comparisons.

Using urlparse is the way to go, as suggested in another answer. However, URLs whose path is empty or consists only of the root "/" need special treatment, because they refer to the same document.
from urllib.parse import urlparse
url1 = urlparse('https://hello.com/?test=test')
url2 = urlparse('https://hello.com')
# An empty path and "/" are equivalent, so in that case compare only scheme and netloc.
result = (
    (url1.path in ("", "/") and url2.path in ("", "/") and url1[:2] == url2[:2])
    or url1[:3] == url2[:3]
)
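The two answers above can be folded into one helper. This is only a sketch: the name same_page is hypothetical, and it assumes that "the same" means equal scheme, netloc and path, with the query string and fragment ignored and an empty path treated like "/".

```python
from urllib.parse import urlsplit

def same_page(url_a, url_b):
    """Compare two URLs by scheme, netloc and path, ignoring query and fragment."""
    def key(url):
        parts = urlsplit(url)
        # Treat an empty path and "/" as the same location.
        return (parts.scheme, parts.netloc, parts.path or "/")
    return key(url_a) == key(url_b)

print(same_page("https://hello.com?test=test", "https://hello.com?test22=test22"))  # True
print(same_page("https://hello.com/a", "https://hello.com/b"))  # False
```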

It's not very clear what you mean, but you should try parsing the url first.
You could check it using urlparse().
from urllib.parse import urlparse
url = urlparse("https://hello.com?test=test")
Since the urlparse method returns a ParseResult:
ParseResult(scheme='https', netloc='hello.com', path='', params='', query='test=test', fragment='')
You can then compare individual components, for example:
url[1] == 'hello.com'  # index 1 is netloc
https://docs.python.org/3/library/urllib.parse.html

Why is .split failing to do as expected?

I'm hoping this is a quick one.
I am trying to get the second-level domain from a given URL.
Here is my code:
url = url.split(".", 1)[1]
url = url.split('//', 1)[-1]
url = url.split("/", 0)[0]
The problem is with the last line; for some reason it doesn't seem to do anything.
If I feed it url = "http://www.nba.com/sports"
I get back "nba.com/sports"
I'm trying to get just "nba.com"

Correct solution: Don't reinvent the wheel, use the existing libraries for as much as you can:
from urllib.parse import urlsplit
# On Py2, from urlparse import urlsplit
url = "http://www.nba.com/sports"
domain = urlsplit(url).hostname
# split off the last two components, then join them back together to make
# the second level domain
secondlevel = '.'.join(domain.rsplit('.', 2)[-2:])
print(secondlevel)
which gets you nba.com.

Print url after each result and you'll see what you need to do:
>>> url = "http://www.nba.com/sports"
>>> url = url.split(".", 1)[1]
>>> print(url)
nba.com/sports
After here, it's clear all we need to do is just split at the /. Don't overcomplicate it too much :)
>>> url = url.split("/")[0]
>>> print(url)
nba.com
As @Mark mentioned in the comments, you can also use urllib.parse.urlparse:
>>> from urllib.parse import urlparse
>>> url = "http://www.nba.com/sports"
>>> urlparse(url)
ParseResult(scheme='http', netloc='www.nba.com', path='/sports', params='', query='', fragment='')
>>> urlparse(url).netloc
'www.nba.com'
You can then strip everything up to the first . if necessary, but depending on what you're doing you might not need to.
Note: if you're using Python 2, the module is urlparse.
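The reason the last line in the question appears to do nothing is the second argument to split. It is maxsplit, the maximum number of splits to perform, so 0 means "do not split at all" and the whole string comes back as a single-item list. A short illustration:

```python
url = "nba.com/sports"

# maxsplit=0 performs no splits, so the list holds the original string.
print(url.split("/", 0))     # ['nba.com/sports']

# maxsplit=1 splits once, at the first "/".
print(url.split("/", 1))     # ['nba.com', 'sports']
print(url.split("/", 1)[0])  # nba.com
```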

Python Code to Parse & Retrieve value from URL string

I have some Dot Net code that parses and retrieves a value from a URL string.
However, I would like to perform the same function but now use python code instead.
Dot Net code snippet is below:
string queryString = string.Empty;
string application_id = string.Empty;
string currentURL = Browser.getDriver.Url;
Uri url = new Uri(currentURL);
string query_String = url.Query;
application_id = query_String.Split(new char[] { '=' }).Last();
Thanks in advance

Always best to use std lib functions if they are available. Python 3 has urllib.parse (if you are still on Py2, it's urlparse). Use the urlparse method of this module to extract the query part of the url (the stuff after the '?'). Then parse_qs will convert this query to a dict of key:list values - the values are lists to handle query strings that have repeated keys.
url = 'http://www.somesite.blah/page?id=12345&attr=good&attr=bad&attr=ugly'
try:
    from urllib.parse import urlparse, parse_qs
except ImportError:
    # still using Python 2? time to move up
    from urlparse import urlparse, parse_qs
parts = urlparse(url)
print(parts)
query_dict = parse_qs(parts.query)
print(query_dict)
print(query_dict['id'][0])
prints:
ParseResult(scheme='http', netloc='www.somesite.blah', path='/page', params='',
query='id=12345&attr=good&attr=bad&attr=ugly', fragment='')
{'attr': ['good', 'bad', 'ugly'], 'id': ['12345']}
12345
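The .NET snippet in the question takes whatever follows the last '=' in the query string. A rough Python counterpart, assuming the wanted value belongs to the last query parameter (the function name last_query_value and the application_id parameter in the sample URL are made up for illustration), could use parse_qsl, which returns the parameters as an ordered list of (name, value) pairs:

```python
from urllib.parse import urlsplit, parse_qsl

def last_query_value(url):
    """Return the value of the last query parameter, or None if there is none."""
    pairs = parse_qsl(urlsplit(url).query)
    return pairs[-1][1] if pairs else None

print(last_query_value("http://www.somesite.blah/page?application_id=12345"))  # 12345
```

If the parameter name is known, looking it up by key in the parse_qs dictionary as shown above is less fragile than relying on its position.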

first, last = query_String.split('=')

How can I get the base of a URL in Python?

I'm trying to determine the base of a URL, or everything besides the page and parameters. I tried using split, but is there a better way than splitting it up into pieces? Is there a way I can remove everything from the last '/'?
Given this:
http://127.0.0.1/asdf/login.php
I would like:
http://127.0.0.1/asdf/

The best way to do this is to use urllib.parse.
From the docs:
The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators. It supports the following URL schemes:
file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp,
prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn,
svn+ssh, telnet, wais, ws, wss.
You'd want to do something like this using urlsplit and urlunsplit:
from urllib.parse import urlsplit, urlunsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
# You now have:
# split_url.scheme "http"
# split_url.netloc "127.0.0.1"
# split_url.path "/asdf/login.php"
# split_url.query "q=abc"
# split_url.fragment "stackoverflow"
# Use all the path except everything after the last '/'
clean_path = "".join(split_url.path.rpartition("/")[:-1])
# "/asdf/"
# urlunsplit joins a urlsplit tuple
clean_url = urlunsplit(split_url)
# "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"
# A more advanced example
advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
# You now have *in addition* to the above:
# advanced_split_url.username "foo"
# advanced_split_url.password "bar"
# advanced_split_url.hostname "127.0.0.1"
# advanced_split_url.port 5000 (an int)
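The snippet above computes clean_path but then rebuilds the original URL unchanged. One way to finish the job, sketched here on the assumption that the query string and fragment should be dropped as well, is to swap the trimmed path into the split result before joining it again:

```python
from urllib.parse import urlsplit, urlunsplit

split_url = urlsplit("http://127.0.0.1/asdf/login.php?q=abc#stackoverflow")

# Keep everything up to and including the last "/" of the path.
clean_path = "".join(split_url.path.rpartition("/")[:-1])

# SplitResult is a named tuple, so _replace returns a modified copy.
base = split_url._replace(path=clean_path, query="", fragment="")

print(urlunsplit(base))  # http://127.0.0.1/asdf/
```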

Well, for one, you could just use os.path.dirname:
>>> import os
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'
It's not explicitly for URLs, but it happens to work on them (even on Windows); it just doesn't leave the trailing slash (you can add it back yourself).
You may also want to look at urllib.parse.urlparse for more fine-grained parsing; if the URL has a query string or hash involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing query and hash info.
Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:
>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

The shortest solution for Python 3 uses urljoin from the urllib library (I don't know whether it is the fastest):
from urllib.parse import urljoin
base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
# output: http://127.0.0.1/asdf/
Keep in mind that urljoin resolves references the way relative links in HTML are resolved, so a URL ending with '/' is treated differently from one without it, as explained here: https://stackoverflow.com/a/1793282/7750840/
base_url = urljoin('http://127.0.0.1/asdf/', '.')
# output: http://127.0.0.1/asdf/
base_url = urljoin('http://127.0.0.1/asdf', '.')
# output: http://127.0.0.1/
Here is a link to a urllib tutorial for Python: https://pythonprogramming.net/urllib-tutorial-python-3/
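A side effect worth knowing, shown here as a small sketch with a made-up query string and fragment: because '.' is resolved as a relative reference against the base URL, the base's query and fragment do not carry over to the result.

```python
from urllib.parse import urljoin

url = "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"

# Resolving "." against the URL yields the directory of the last path segment.
base_url = urljoin(url, ".")
print(base_url)  # http://127.0.0.1/asdf/
```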

I agree that the best way to do it is with urllib.parse.
Specifically, you can decompose the URL with urllib.parse.urlparse and then replace every attribute other than scheme and netloc with an empty string. If you want to keep the path attribute (as in your question), you can do so with an extra string parsing step. Example function below:
import urllib.parse
def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    # Drop the last path segment (the page) when the path is kept.
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path)
    parsed = parsed._replace(params='')
    parsed = parsed._replace(query='')
    parsed = parsed._replace(fragment='')
    return parsed.geturl()
Examples:
>>> base_url('http://127.0.0.1/asdf/login.php', with_path=True)
'http://127.0.0.1/asdf'
>>> base_url('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1'

No need to use a regex, you can just use rsplit():
>>> url = 'http://127.0.0.1/asdf/login.php'
>>> url.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

When you use urlsplit, it returns a SplitResult object:
from urllib.parse import urlsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
print(split_url)
# SplitResult(scheme='http', netloc='127.0.0.1', path='/asdf/login.php', query='', fragment='')
You can make your own SplitResult() object and pass it through urlunsplit. This code should work for multiple url splits, regardless of their length, as long as you know what the last path element you want is.
from urllib.parse import urlsplit, urlunsplit, SplitResult
# splitting url:
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
# editing the variables you want to change (in this case, path):
last_element = 'asdf' # this can be any element in the path.
path_array = split_url.path.split('/')
# print(path_array)
# >>> ['', 'asdf', 'login.php']
path_array.remove('')
ind = path_array.index(last_element)
new_path = '/' + '/'.join(path_array[:ind+1]) + '/'
# making SplitResult() object with edited data:
new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')
# unsplitting:
base_url = urlunsplit(new_url)

Get the right-most occurrence of the slash, then slice the original string through that position. The +1 keeps that final slash at the end.
link = "http://127.0.0.1/asdf/login.php"
link[:link.rfind('/')+1]

If you use Python 3, you can use urlparse and urlunparse.
>>> from urllib.parse import urlparse, urlunparse
>>> url = "http://127.0.0.1/asdf/login.php"
>>> result = urlparse(url)
>>> new = list(result)
>>> new[2] = new[2].replace("login.php", "")
>>> urlunparse(new)
'http://127.0.0.1/asdf/'

how to validate url and redirect to some url using flask

I want to validate a URL before redirecting, using Flask.
My abstract code is here:
@app.before_request
def before():
    if request.before_url == "http://127.0.0.0:8000":
        return redirect("http://127.0.0.1:5000")
Do you have any idea?
Thanks in advance.

Use urlparse (part of the standard library), then use Flask's built-in redirect function.
>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
You can then check for the parsed out port and reconstruct a url (using the same library) with the correct port or path. This will keep the integrity of your urls, instead of dealing with string manipulation.

You can use urlparse from urllib to parse the URL. The function below checks the scheme, netloc and path attributes produced by parsing the URL. It supports both Python 2 and 3.
try:
    # python 3
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
def url_validator(url):
    try:
        result = urlparse(url)
        # All three parts must be non-empty for the URL to count as valid.
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        return False

You can do something like this (not tested):
@app.route('/<path>')
def redirection(path):
    if path == '':  # your condition
        return redirect('redirect URL')
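The check itself can be written without any Flask code, which makes it easy to test on its own. The sketch below is only an illustration: redirect_target, OLD_ORIGIN and NEW_BASE are hypothetical names, and the host and port values are copied from the question. In a Flask app, a before_request function could pass request.url to it and return redirect(target) when the result is not None.

```python
from urllib.parse import urlsplit

# Hypothetical values copied from the question; adjust for a real deployment.
OLD_ORIGIN = ("127.0.0.0", 8000)
NEW_BASE = "http://127.0.0.1:5000"

def redirect_target(url):
    """Return the URL to redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    # Compare host and port as parsed values instead of raw string prefixes.
    if (parts.hostname, parts.port) != OLD_ORIGIN:
        return None
    # Keep the path and query of the incoming URL on the new host.
    target = NEW_BASE + parts.path
    if parts.query:
        target += "?" + parts.query
    return target

print(redirect_target("http://127.0.0.0:8000/data?x=1"))  # http://127.0.0.1:5000/data?x=1
```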
