Url Parse is missing fragment - Python

I need to save a file with the name of the given acquisition path's file.
Given a URL, I would like to parse it and extract the name of the file. Here's my code.
I read a JSON parameter and pass it to the URL-parsing function. The acquisition path is a string.
ParseUrl.py:
import os
from urllib.parse import urlparse as up
a = up(jtp["AcquisitionPath"]) # => http://127.0.0.1:8000/Users/YodhResearch/Desktop/LongCtrl10min.tiff
print(a)
print(os.path.basename(a))
Result:
ParseResult(scheme='http', netloc='127.0.0.1:8000', path='/Users/YodhResearch/Desktop/LongCtrl10min.tiff', params='', query='', fragment='')
[....]
TypeError: expected str, bytes or os.PathLike object, not ParseResult
As you can see, it parses the URL, but "LongCtrl10min.tiff" is not in the fragment field; it is all in the path field. Why is that happening? Maybe because "AcquisitionPath" is a string and urlparse treats all of it as a single path?
EDIT:
a.path WORKS, I would like to know why I don't get it into the fragment section.
Here's another example:
import os
from urllib.parse import urlparse as up
string = "http://127.0.0.1:8000/GIULIO%20FERRARI%20FOLDER/Giulio%20_%20CSV/Py%20Script/sparse%20python/tiff_test.tiff_IDAnal#1_IDAcq#10_TEMP_.json"
a = up(string)
print(a)
print(os.path.basename(a))
Results:
ParseResult(scheme='http', netloc='127.0.0.1:8000', path='/GIULIO%20FERRARI%20FOLDER/Giulio%20_%20CSV/Py%20Script/sparse%20python/tiff_test.tiff_IDAnal', params='', query='', fragment='1_IDAcq#10_TEMP_.json')
See, Now it doesn't get the right fragment that should be: "tiff_test.tiff_IDAnal#1_IDAcq#10_TEMP_.json"
SOLUTION:
Fragment needs '#' symbol! Thanks to all.

There are two issues here: how to identify the components of a URL, and how to create the desired path from those components.
First, you are confused over what the fragment actually is. From RFC 3986:
The following are two example URIs and their component parts:
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose
The fragment is only the portion following the #, not the entire final component of the path.
Second, the urlparse() function from the urllib.parse module returns a ParseResult object, while os.path.basename() expects a str as its argument.
What you probably want is the path from the ParseResult object, which you get with a.path (the path component of the URL you passed to urlparse is stored in the path attribute of the ParseResult object).
import os
from urllib.parse import urlparse as up
a = up("http://127.0.0.1:8000/Users/YodhResearch/Desktop/LongCtrl10min.tiff")
print(os.path.basename(a.path))
This will output:
LongCtrl10min.tiff
If you also want to include the fragment, add it explicitly. The fragment is stored in a separate attribute of the ParseResult object, a.fragment in your case:
import os
from urllib.parse import urlparse as up
a = up("http://127.0.0.1:8000/Users/YodhResearch/Desktop/LongCtrl10min.tiff#anyfragment")
print(os.path.basename(a.path) + "#" + a.fragment)
will output:
LongCtrl10min.tiff#anyfragment
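If the '#' characters are meant to be part of the file name, as in the second URL from the question, one option is to percent-encode the name before it goes into the URL. The following is only a sketch: the "/data/" directory is made up for illustration, and it assumes you control how the URL is built.

```python
import posixpath
from urllib.parse import quote, unquote, urlparse

# File name from the question, containing literal '#' characters.
raw_name = "tiff_test.tiff_IDAnal#1_IDAcq#10_TEMP_.json"

# Percent-encode the name before building the URL: '#' becomes '%23'.
# The "/data/" directory is a placeholder, not taken from the question.
url = "http://127.0.0.1:8000/data/" + quote(raw_name)

parsed = urlparse(url)
# No literal '#' is left in the URL, so the fragment is empty
# and the whole name stays in the path.
file_name = unquote(posixpath.basename(parsed.path))
print(repr(parsed.fragment))
print(file_name)
```

Decoding with unquote after taking the basename restores the original name, including both '#' characters.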

Related

How can I remove 'www.' from original URL through [urllib] parse in python?

Original URL ▶ https://www.exeam.org/index.html
I want to extract exeam.org/ or exeam.org from original URL.
To do this, I used urllib the most powerful parser in Python that I know,
but unfortunately urllib (url.scheme, url.netloc ...) couldn't give me the type of format I wanted.

To extract the domain name from a URL using urllib:
from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)
This will give you www.exam.org, but if you are after just the exam.org part you need to decompose it further into the registered domain. Besides simple splits, which could be sufficient, you could also use a library such as tldextract, which knows how to parse subdomains, suffixes and more:
from tldextract import extract
ext = extract(surl)
print(ext.registered_domain)
this will produce:
exam.org
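For the "simple splits" route mentioned above, a standard-library-only sketch could look like this. The helper name strip_www is hypothetical, and it only removes a literal leading "www."; it does not know about public suffixes the way tldextract does.

```python
from urllib.parse import urlparse

def strip_www(url):
    """Return the host of `url` without a leading 'www.' (hypothetical helper)."""
    # hostname is lower-cased and excludes any port or credentials.
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(strip_www("https://www.exeam.org/index.html"))
```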

Python Code to Parse & Retrieve value from URL string

I have some Dot Net code that parses and retrieves a value from a URL string.
However, I would like to perform the same function but now use python code instead.
Dot Net code snippet is below:
string queryString = string.Empty;
string application_id = string.Empty;
string currentURL = Browser.getDriver.Url;
Uri url = new Uri(currentURL);
string query_String = url.Query;
application_id = query_String.Split(new char[] { '=' }).Last();
Thanks in advance

Always best to use std lib functions if they are available. Python 3 has urllib.parse (if you are still on Py2, it's urlparse). Use the urlparse method of this module to extract the query part of the url (the stuff after the '?'). Then parse_qs will convert this query to a dict of key:list values - the values are lists to handle query strings that have repeated keys.
url = 'http://www.somesite.blah/page?id=12345&attr=good&attr=bad&attr=ugly'
try:
    from urllib.parse import urlparse, parse_qs
except ImportError:
    # still using Python 2? time to move up
    from urlparse import urlparse, parse_qs
parts = urlparse(url)
print(parts)
query_dict = parse_qs(parts.query)
print(query_dict)
print(query_dict['id'][0])
prints:
ParseResult(scheme='http', netloc='www.somesite.blah', path='/page', params='',
query='id=12345&attr=good&attr=bad&attr=ugly', fragment='')
{'attr': ['good', 'bad', 'ugly'], 'id': ['12345']}
12345
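To mirror the .NET snippet more closely, which takes the value after the last '=', a small sketch using parse_qsl might look like the following. The function name and the example URL are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlsplit

def last_query_value(url):
    """Return the value of the last query parameter, or '' if there is none."""
    # parse_qsl keeps the parameters in order as (name, value) pairs.
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return pairs[-1][1] if pairs else ""

# Hypothetical URL; the real one would come from the browser driver.
application_id = last_query_value("http://example.com/app?application_id=42")
print(application_id)
```

Unlike splitting on '=', this also decodes percent-escapes and copes with more than one parameter.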

first, last = query_String.split('=')

How can I get the base of a URL in Python?

I'm trying to determine the base of a URL, or everything besides the page and parameters. I tried using split, but is there a better way than splitting it up into pieces? Is there a way I can remove everything from the last '/'?
Given this:
http://127.0.0.1/asdf/login.php
I would like:
http://127.0.0.1/asdf/

The best way to do this is use urllib.parse.
From the docs:
The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators. It supports the following URL schemes:
file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp,
prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn,
svn+ssh, telnet, wais, ws, wss.
You'd want to do something like this using urlsplit and urlunsplit:
from urllib.parse import urlsplit, urlunsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
# You now have:
# split_url.scheme "http"
# split_url.netloc "127.0.0.1"
# split_url.path "/asdf/login.php"
# split_url.query "q=abc"
# split_url.fragment "stackoverflow"
# Use all the path except everything after the last '/'
clean_path = "".join(split_url.path.rpartition("/")[:-1])
# "/asdf/"
# urlunsplit joins a urlsplit tuple
clean_url = urlunsplit(split_url)
# "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"
# A more advanced example
advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
# You now have *in addition* to the above:
# advanced_split_url.username "foo"
# advanced_split_url.password "bar"
# advanced_split_url.hostname "127.0.0.1"
# advanced_split_url.port 5000
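The snippet above computes clean_path but never puts it back into the URL. One way to finish the job, sketched here with a hypothetical helper name, is to swap the trimmed path in with _replace and clear the query and fragment before calling urlunsplit:

```python
from urllib.parse import urlsplit, urlunsplit

def base_of(url):
    """Return `url` up to and including the last '/' of its path."""
    parts = urlsplit(url)
    # Everything before the last '/', plus the slash itself.
    directory = parts.path.rpartition("/")[0] + "/"
    return urlunsplit(parts._replace(path=directory, query="", fragment=""))

print(base_of("http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"))
```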

Well, for one, you could just use os.path.dirname:
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'
It's not explicitly for URLs, but it happens to work on them (even on Windows), it just doesn't leave the trailing slash (you can just add it back yourself).
You may also want to look at urllib.parse.urlparse for more fine-grained parsing; if the URL has a query string or hash involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing query and hash info.
Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:
>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'
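The parse, trim and recombine approach mentioned above could be sketched as follows. The function name is made up, and posixpath is used instead of os.path so the result does not depend on the operating system.

```python
import posixpath
from urllib.parse import urlparse

def trim_last_segment(url):
    """Drop the last path segment but keep the query string and fragment."""
    parsed = urlparse(url)
    parent = posixpath.dirname(parsed.path)
    # dirname drops the trailing slash, so add it back.
    if not parent.endswith("/"):
        parent += "/"
    return parsed._replace(path=parent).geturl()

print(trim_last_segment("http://127.0.0.1/asdf/login.php?q=abc#top"))
```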

The shortest solution for Python 3 uses urljoin from the urllib library (I don't know whether it is the fastest):
from urllib.parse import urljoin
base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
# output: http://127.0.0.1/asdf/
Keep in mind that urljoin resolves references the same way links are resolved in HTML, so a base URL ending with '/' behaves differently from one without it (see https://stackoverflow.com/a/1793282/7750840/):
base_url = urljoin('http://127.0.0.1/asdf/', '.')
# output: http://127.0.0.1/asdf/
base_url = urljoin('http://127.0.0.1/asdf', '.')
# output: http://127.0.0.1/
This is link to urllib for python: https://pythonprogramming.net/urllib-tutorial-python-3/

Agree that best way to do it is with urllib.parse
Specifically, you can decompose the url with urllib.parse.urlparse and then replace every attribute other than scheme and netloc with an empty string. If you want to keep the path attribute (as in your question), you can do so with an extra string parsing step. Example function below:
import urllib.parse
def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path)
    parsed = parsed._replace(params='')
    parsed = parsed._replace(query='')
    parsed = parsed._replace(fragment='')
    return parsed.geturl()
Examples:
>>> base_url('http://127.0.0.1/asdf/login.php', with_path=True)
'http://127.0.0.1/asdf'
>>> base_url('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1'

No need to use a regex, you can just use rsplit():
>>> url = 'http://127.0.0.1/asdf/login.php'
>>> url.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'

When you use urlsplit, it returns a SplitResult object:
from urllib.parse import urlsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
print(split_url)
>>> SplitResult(scheme='http', netloc='127.0.0.1', path='/asdf/login.php', query='', fragment='')
You can make your own SplitResult() object and pass it through urlunsplit. This code should work for multiple url splits, regardless of their length, as long as you know what the last path element you want is.
from urllib.parse import urlsplit, urlunsplit, SplitResult
# splitting url:
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
# editing the variables you want to change (in this case, path):
last_element = 'asdf' # this can be any element in the path.
path_array = split_url.path.split('/')
# print(path_array)
# >>> ['', 'asdf', 'login.php']
path_array.remove('')
ind = path_array.index(last_element)
new_path = '/' + '/'.join(path_array[:ind+1]) + '/'
# making SplitResult() object with edited data:
new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')
# unsplitting:
base_url = urlunsplit(new_url)

Get the right-most occurrence of a slash, then slice the original string up to that position. The +1 keeps that final slash at the end.
link = "http://127.0.0.1/asdf/login.php"
link[:link.rfind('/')+1]

If you use python3, you can use urlparse and urlunparse.
In :from urllib.parse import urlparse, urlunparse
In :url = "http://127.0.0.1/asdf/login.php"
In :result = urlparse(url)
In :new = list(result)
In :new[2] = new[2].replace("login.php", "")
In :urlunparse(new)
Out:'http://127.0.0.1/asdf/'

How to remove scheme from url in Python?

I am working with an application that returns urls, written with Flask. I want the URL displayed to the user to be as clean as possible so I want to remove the http:// from it. I looked and found the urlparse library, but couldn't find any examples of how to do this.
What would be the best way to go about it, and if urlparse is overkill is there a simpler way? Would simply removing the "http://" substring from the URL just using the regular string parsing tools be bad practice or cause problems?

I don't think urlparse offers a single method or function for this. This is how I'd do it:
from urllib.parse import urlparse
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
def strip_scheme(url):
    parsed = urlparse(url)
    scheme = "%s://" % parsed.scheme
    return parsed.geturl().replace(scheme, '', 1)
print(strip_scheme(url))
Output:
stackoverflow.com/questions/tagged/python?page=2
If you'd use (only) simple string parsing, you'd have to deal with http[s], and possibly other schemes yourself. Also, this handles weird casing of the scheme.

If you are using these programmatically rather than using a replace, I suggest having urlparse recreate the url without a scheme.
The ParseResult object is a tuple. So you can create another removing the fields you don't want.
# py2/3 compatibility
try:
    from urllib.parse import urlparse, ParseResult
except ImportError:
    from urlparse import urlparse, ParseResult
def strip_scheme(url):
    parsed_result = urlparse(url)
    return ParseResult('', *parsed_result[1:]).geturl()
You can remove any component of the ParseResult by simply replacing the input with an empty string.
It's important to note there is a functional difference between this answer and @Lukas Graf's answer. The most likely functional difference is that the '//' component of a URL isn't technically part of the scheme, so this answer preserves it, whereas Lukas Graf's answer removes it.
>>> Lukas_strip_scheme('https://yoman/hi?whatup')
'yoman/hi?whatup'
>>> strip_scheme('https://yoman/hi?whatup')
'//yoman/hi?whatup'

A simple regex search and replace works.
import re
def strip_scheme(url: str):
    return re.sub(r'^https?:\/\/', '', url)
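The pattern above only covers http and https. If other schemes should be stripped too, one possible generalisation, sketched here under the assumption that the scheme is always followed by "://", matches the scheme syntax from RFC 3986 (a letter followed by letters, digits, '+', '.' or '-'). The function name is made up.

```python
import re

# A letter, then letters/digits/'+'/'.'/'-', then '://', anchored at the start.
_SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")

def strip_any_scheme(url: str) -> str:
    """Remove a leading '<scheme>://' if present; otherwise return url unchanged."""
    return _SCHEME_PREFIX.sub("", url, count=1)

print(strip_any_scheme("ftp://example.com/file.txt"))
```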

I've seen this done in Flask libraries and extensions. Note that it relies on ._replace of the ParseResult/SplitResult; despite the leading underscore, _replace is a documented method of named tuples.
from urllib.parse import urlsplit, urlunsplit
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
split_url = urlsplit(url)
# >>> SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
split_url_without_scheme = split_url._replace(scheme="")
# >>> SplitResult(scheme='', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
new_url = urlunsplit(split_url_without_scheme)

Almost 9 years since the question was asked and still not much has changed :D.
This is the cleanest way I came up with to solve that issue:
from urllib.parse import urlparse
def strip_scheme(url: str) -> str:
    schemaless = urlparse(url)._replace(scheme='').geturl()
    return schemaless[2:] if schemaless.startswith("//") else schemaless
And some unit tests:
import pytest
@pytest.mark.parametrize(
    ['url', 'expected_url'],
    [
        # Test url not changed when no scheme
        ('www.test-url.com', 'www.test-url.com'),
        # Test https scheme stripped
        ('https://www.test-url.com', 'www.test-url.com'),
        # Test http scheme stripped
        ('http://www.test-url.com', 'www.test-url.com'),
        # Test only scheme stripped when url with path
        ('https://www.test-url.com/de/fr', 'www.test-url.com/de/fr'),
        # Test only scheme stripped when url with path and params
        ('https://test.com/de/fr?param1=foo', 'test.com/de/fr?param1=foo'),
    ]
)
def test_strip_scheme(url: str, expected_url: str) -> None:
    assert strip_scheme(url) == expected_url

According to documentation (https://docs.python.org/3/library/urllib.parse.html#url-parsing) the return value is a named tuple, its items can be accessed by index or as named attributes. So we can get access to certain parts of parsed url by using named attributes:
from urllib.parse import urlparse
def delete_http(link):
    url = urlparse(link)
    return url.netloc + url.path
user_link = input()
print(delete_http(user_link))
Input: https://stackoverflow.com/
Output: stackoverflow.com/

Python: How to resolve URLs containing '..'

I need to uniquely identify and store some URLs. The problem is that sometimes they come containing ".." like http://somedomain.com/foo/bar/../../some/url which basically is http://somedomain.com/some/url if I'm not wrong.
Is there a Python function or a tricky way to resolve these URLs?

There’s a simple solution using urllib.parse.urljoin:
>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
'http://www.example.com/baz/bux/'
However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
This fix uses the urlparse function to extract the path, then uses (the posixpath version of) os.path to normalize the components. It compensates for a mysterious issue with trailing slashes, then joins the URL back together. The following is doctestable:
from urllib.parse import urlparse
import posixpath
def resolve_components(url):
    """
    >>> resolve_components('http://www.example.com/foo/bar/../../baz/bux/')
    'http://www.example.com/baz/bux/'
    >>> resolve_components('http://www.example.com/some/path/../file.ext')
    'http://www.example.com/some/file.ext'
    """
    parsed = urlparse(url)
    new_path = posixpath.normpath(parsed.path)
    if parsed.path.endswith('/'):
        # Compensate for issue1707768
        new_path += '/'
    cleaned = parsed._replace(path=new_path)
    return cleaned.geturl()

Those are file paths. Look at os.path.normpath:
>>> import os
>>> os.path.normpath('/foo/bar/../../some/url')
'/some/url'
EDIT:
If this is on Windows, your input path will use backslashes instead of slashes. In this case, you still need os.path.normpath to get rid of the .. patterns (and // and /./ and whatever else is redundant), then convert the backslashes to forward slashes:
def fix_path_for_URL(path):
    result = os.path.normpath(path)
    if os.sep == '\\':
        result = result.replace('\\', '/')
    return result
EDIT 2:
If you want to normalize URLs, do it (before you strip off the method and such) with urlparse module, as shown in the answer to this question.
EDIT 3:
It seems that urljoin doesn't normalize the base path it's given:
>>> import urlparse
>>> urlparse.urljoin('http://somedomain.com/foo/bar/../../some/url', '')
'http://somedomain.com/foo/bar/../../some/url'
normpath by itself doesn't quite cut it either:
>>> import os
>>> os.path.normpath('http://somedomain.com/foo/bar/../../some/url')
'http:/somedomain.com/some/url'
Note the initial double slash got eaten.
So we have to make them join forces:
def fix_URL(urlstring):
    parts = list(urlparse.urlparse(urlstring))
    parts[2] = os.path.normpath(parts[2].replace('/', os.sep)).replace(os.sep, '/')
    return urlparse.urlunparse(parts)
Usage:
>>> fix_URL('http://somedomain.com/foo/bar/../../some/url')
'http://somedomain.com/some/url'

urljoin won't work, as it only resolves dot segments if the second argument isn't absolute(!?) or empty. Not only that, it doesn't handle excessive ..s properly according to RFC 3986 (they should be removed; urljoin doesn't do so). posixpath.normpath can't be used either (much less os.path.normpath), since it resolves multiple slashes in a row to only one (e.g. ///// becomes /), which is incorrect behavior for URLs.
The following short function resolves any URL path string correctly. It shouldn't be used with relative paths, however, since additional decisions about its behavior would then need to be made (Raise an error on excessive ..s? Remove . in the beginning? Leave them both?) - instead, join URLs before resolving if you know you might handle relative paths. Without further ado:
def resolve_url_path(path):
    segments = path.split('/')
    segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
    resolved = []
    for segment in segments:
        if segment in ('../', '..'):
            if resolved[1:]:
                resolved.pop()
        elif segment not in ('./', '.'):
            resolved.append(segment)
    return ''.join(resolved)
This handles trailing dot segments (that is, without a trailing slash) and consecutive slashes correctly. To resolve an entire URL, you can then use the following wrapper (or just inline the path resolution function into it).
try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:
    # Python 2
    from urlparse import urlsplit, urlunsplit
def resolve_url(url):
    parts = list(urlsplit(url))
    parts[2] = resolve_url_path(parts[2])
    return urlunsplit(parts)
You can then call it like this:
>>> resolve_url('http://example.com/../thing///wrong/../multiple-slashes-yeah/.')
'http://example.com/thing///multiple-slashes-yeah/'
Correct URL resolution has more than a few pitfalls, it turns out!

I wanted to comment on the resolveComponents function in the top response.
Notice that if your path is /, the code will add another one which can be problematic.
I therefore changed the IF condition to:
if parsed.path.endswith( '/' ) and parsed.path != '/':
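Put together with the function from the top response, the adjusted condition might look like the sketch below. The early return for an empty path is an extra assumption added here, since posixpath.normpath('') returns '.', which would otherwise end up in the URL.

```python
import posixpath
from urllib.parse import urlparse

def resolve_components(url):
    """Normalize '.' and '..' in the path of `url`, keeping a trailing slash."""
    parsed = urlparse(url)
    if not parsed.path:
        # Assumption: leave URLs without a path untouched.
        return url
    new_path = posixpath.normpath(parsed.path)
    # normpath drops the trailing slash; restore it, except for the root path,
    # which would otherwise become '//'.
    if parsed.path.endswith('/') and parsed.path != '/':
        new_path += '/'
    return parsed._replace(path=new_path).geturl()

print(resolve_components('http://www.example.com/'))
```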

According to RFC 3986, this should happen as part of the "relative resolution" process, so the answer could be urlparse.urljoin(url, ''). But due to a bug, urlparse.urljoin does not remove dot segments when the second argument is an empty URL. You can use yurl, an alternative URL manipulation library, which does this correctly:
>>> import yurl
>>> print yurl.URL('http://somedomain.com/foo/bar/../../some/url') + yurl.URL()
http://somedomain.com/some/url

import urlparse
import posixpath
parsed = list(urlparse.urlparse(url))
parsed[2] = posixpath.normpath(posixpath.join(parsed[2], rel_path))
proper_url = urlparse.urlunparse(parsed)

First, you need the base URL; then you can use urljoin from urllib.parse.
Example:
from urllib.parse import urljoin
def resolve_urls(urls, site):
    for url in urls:
        print(urljoin(site, url))
    return
urls= ["/aboutMytest", "#", "/terms-of-use", "https://www.example.com/Admission"]
resolve_urls(urls,'https://example.com/')
output :
https://example.com/aboutMytest
https://example.com/
https://example.com/terms-of-use
https://www.example.com/Admission
