my users will be redirected to my site with some information like this
http://127.0.0.1:8000/accounts/dasbboard/?trxref=621538940cbc9865e63ec43857ed0f&reference=621538940cbc9865e63ec43857ed0f
using urllib will get give me
query='trxref=621538940cbc9865e63ec43857ed0f&reference=621538940cbc9865e63ec43857ed0f'
But I'm no okay with that, I want to be able to get the number after the reference(621538940cbc9865e63ec43857ed0f), what is the best way to do it in my view? thank you in advance.
You can first parse the url with urlparse, and then use parse_qs to parse the querystring part:
>>> from urllib.parse import urlparse, parse_qs
>>> url = 'http://127.0.0.1:8000/accounts/dasbboard/?trxref=621538940cbc9865e63ec43857ed0f&reference=621538940cbc9865e63ec43857ed0f'
>>> parse_qs(urlparse(url).query)
{'trxref': ['621538940cbc9865e63ec43857ed0f'], 'reference': ['621538940cbc9865e63ec43857ed0f']}
This is a dictionary that maps keys to a list of values, since a key can occur multiple times. We can then retrieve the reference with:
>>> data = parse_qs(urlparse(url).query)
>>> data['reference'][0]
'621538940cbc9865e63ec43857ed0f'
Related
Original URL ▶ https://www.exeam.org/index.html
I want to extract exeam.org/ or exeam.org from original URL.
To do this, I used urllib the most powerful parser in Python that I know,
but unfortunately urllib (url.scheme, url.netloc ...) couldn't give me the type of format I wanted.
to extract the domain name from a url using `urllib):
from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)
this will give you www.exam.org, but you want to further decompose this to registered domain if you are after just the exam.org part. so besides doing simple splits, which could be sufficient, you could also use library such as tldextract which knows how to parse subdmains, suffixes and more:
from tldextract import extract
ext = extract(surl)
print(ext.registered_domain)
this will produce:
exam.org
I have the following URL:
https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1
I need URLs for example:
https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
My attempt:
url='https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
url=url.split('?')[0]
if '#' in url:
url = url.split('#')[0]
I think this is a stupid way
The very helpful library furl makes it trivial to remove both query and fragment parts:
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
https://hi.com/
You can split on something that doesn't exist in the string, you'll just get a list of one element, so depending on your goal, you could do something like this to simplify your existing code:
url = url.split('?')[0].split('#')[0]
Not saying this is the best way (furl is a great solution), but it is a way.
In your example you're also removing the fragment (the thing after a #), not just the query.
You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns and converting back to a string URL with urllib.parse.unsplit:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
Output:
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'
You could try
urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
"https://stackoverflow.com/questions/7990300?fr=aladdin",
"https://stackoverflow.com/questions/22375#6",
"https://stackoverflow.com/questions/22375"?,
"https://stackoverflow.com/questions/22375#3_1"]
urls_without_query = [url.split('?')[0] for url in urls]
for example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split() returns a list that looks like ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], and if that string is url, url.split('?')[0] would give you "https://stackoverflow.com/questions/7990301".
Edit: I didn't think about # arguments. The other answers might help you more :)
You can use w3lib
from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)
Here is an answer using standard libraries, and which parses the URL properly:
from urllib.parse import urlparse
url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme,"://",parsed.netloc,parsed.path]))
expected output:
http://www.example.com/this/category
Note: this also strips params and the fragment, but is easy to modify to include those if you want.
I have an URL like this:
https://user:password#example.com/path?key=value#hash
The result should be:
https://user:???#example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL a high level data structure, then operate on this data structure, then serializing to a string.
Is this possible with Python?
You can use the built in urlparse to query out the password from a url. It is available in both Python 2 and 3, but under different locations.
Python 2 import urlparse
Python 3 from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password#example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}#{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???#example.com/path?key=value#hash'
See also this question: Changing hostname in a url
from urllib.parse import urlparse
def redact_url(url: str) -> str:
url_components = urlparse(url)
if url_components.username or url_components.password:
url_components = url_components._replace(
netloc=f"{url_components.username}:???#{url_components.hostname}",
)
return url_components.geturl()
The pip module already have an internal utility function which does exactly this.
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password#example.com/path?key=value#hash")
'https://user:****#example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This will provide the expected result even if the url does not contain username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'
I have some Dot Net code that parses and retrieves a value from a URL string.
However, I would like to perform the same function but now use python code instead.
Dot Net code snippet is below:
string queryString = string.Empty;
string application_id = string.Empty;
string currentURL = Browser.getDriver.Url;
Uri url = new Uri(currentURL);
string query_String = url.Query;
application_id = query_String.Split(new char[] { '=' }).Last();
Thanks in advance
Always best to use std lib functions if they are available. Python 3 has urllib.parse (if you are still on Py2, it's urlparse). Use the urlparse method of this module to extract the query part of the url (the stuff after the '?'). Then parse_qs will convert this query to a dict of key:list values - the values are lists to handle query strings that have repeated keys.
url = 'http://www.somesite.blah/page?id=12345&attr=good&attr=bad&attr=ugly'
try:
from urllib.parse import urlparse, parse_qs
except ImportError:
# still using Python 2? time to move up
from urlparse import urlparse, parse_qs
parts = urlparse(url)
print(parts)
query_dict = parse_qs(parts.query)
print(query_dict)
print(query_dict['id'][0])
prints:
ParseResult(scheme='http', netloc='www.somesite.blah', path='/page', params='',
query='id=12345&attr=good&attr=bad&attr=ugly', fragment='')
{'attr': ['good', 'bad', 'ugly'], 'id': ['12345']}
12345
first, last = query_String.split('=')
Facebook returns access tokens in the form of a string:
'access_token=159565124071460|2.D98PLonBwOyYWlLMhMyNqA__.3600.1286373600-517705339|bFRH8d2SAeV-PpPUhbRkahcERfw&expires=4375'
Is there a way to parse the access_token without using regex? I'm afraid using regex would be unaccurate since I don't know what FB uses as access tokens
I'm getting the result like this:
result=urlfetch.fetch(url="https://graph.facebook.com/oauth/access_token",payload=payload,method=urlfetch.POST)
result2=result.content
Facebook access_token and expires are returned as key=value pairs. One way to parse them is to use the parse_qs function from the urlparse module.
>>> import urlparse
>>> s = 'access_token=159565124071460|2.D98PLonBwOyYWlLMhMyNqA__.3600.1286373600-517705339|bFRH8d2SAeV-PpPUhbRkahcERfw&expires=4375'
>>> urlparse.parse_qs(s)
{'access_token': ['159565124071460|2.D98PLonBwOyYWlLMhMyNqA__.3600.1286373600-517705339|bFRH8d2SAeV-PpPUhbRkahcERfw'], 'expires': ['4375']}
>>>
There is also parse_qsl should you wish to get the values as a list of tuples.
>>> urlparse.parse_qsl(s)
[('access_token', '159565124071460|2.D98PLonBwOyYWlLMhMyNqA__.3600.1286373600-517705339|bFRH8d2SAeV-PpPUhbRkahcERfw'), ('expires', '4375')]
>>> dict(urlparse.parse_qsl(s)).get('access_token')
'159565124071460|2.D98PLonBwOyYWlLMhMyNqA__.3600.1286373600-517705339|bFRH8d2SAeV-PpPUhbRkahcERfw'
>>>