I have a Django project that contains models with a URLField. For the same project, I am writing a non-Django Python script and would like to normalise URLs.
url1 = //habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
url2 = http://habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
url3 = www.habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
How can I normalise the URLs? Is it possible to cast a URL to an instance of Django's URLField? Ideally, I would prefer all URLs to be in the same format as url2.
Thanks!
I'm not sure what you want to achieve, but to normalize your URLs to look like the second one you could use a regular expression and do a substitution with the re module.
formatted_url = re.sub(r'^((http\:|)//|www\.)?(?P<url>.*)', r'http://\g<url>', your_url)
That takes any URL of the form //blabla.com, www.blabla.com or http://blabla.com and returns http://blabla.com.
Here's an example of how it could be used:
import re
def getNormalized(url):
    """Returns the normalized version of a url"""
    return re.sub(r'^((http\:|)//|www\.)?(?P<url>.*)',
                  r'http://\g<url>', url)
url1 = '//habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg'
url2 = 'http://habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg'
url3 = 'www.habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg'
formatted_url1 = getNormalized(url1)
formatted_url2 = getNormalized(url2)
formatted_url3 = getNormalized(url3)
print(formatted_url1)
# http://habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
print(formatted_url2)
# http://habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
print(formatted_url3)
# http://habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
If you want to know how Django does it, check the source code. There you will find the to_python method that formats the returned string.
https://github.com/django/django/blob/master/django/forms/fields.py#L705-L738
It uses urlparse, or rather Django's own copy of it, for Python 2 and 3 support.
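For a standalone script, a similar effect can be had with the standard library alone. The following is a minimal Python 3 sketch, not Django's implementation: the helper name normalize_url and its default_scheme parameter are made up for illustration. It keeps an existing scheme (so https stays https) and, unlike the regular expression above, it does not strip a leading www.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, default_scheme="http"):
    """Return url with an explicit scheme (hypothetical helper)."""
    url = url.strip()
    parts = urlsplit(url)
    if not parts.netloc:
        # Without a leading "//" (e.g. "www.example.com/x"), urlsplit treats
        # the host as part of the path, so re-parse with "//" prepended.
        parts = urlsplit("//" + url)
    scheme = parts.scheme or default_scheme
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


url1 = '//habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg'
url3 = 'www.habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg'
print(normalize_url(url1))
# http://habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
print(normalize_url(url3))
# http://www.habrastorage.org/files/fa9/f33/091/fa9f330913c0462c8f576393f4135ec6.jpg
```

Inputs such as host:port without a scheme are not handled by this sketch and would need extra care.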
I have very basic knowledge of python, so sorry if my question sounds dumb.
I need to query a website for a personal project I am doing, but I need to query it 500 times, and each time I need to change 1 specific part of the url, then take the data and upload it to gsheets.
(The () signifies what part of the url I need to change)
'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=(symbol)&apikey=apikey'
I thought about using while and format {} to do it, but I was unsure how to change the string each time, bar writing out the names for variables by hand (defeating the whole purpose of this).
I already have a list of the symbols I need to use, but I don't know how to input them
Example of how I get 1 piece of data
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo'
r = requests.get(url)
data = r.json()
Example of what I'd like to change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo'
r = requests.get(url)
data = r.json()
#then change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo'
r = requests.get(url)
data = r.json()
so on and so forth, 500 times.
You can combine .format with a for loop. Consider the following simple example:
symbols = ["abc","xyz","123"]
for s in symbols:
    url = 'https://www.example.com?symbol={}'.format(s)
    print(url)
output
https://www.example.com?symbol=abc
https://www.example.com?symbol=xyz
https://www.example.com?symbol=123
You might also elect to use any other way of formatting, e.g. an f-string (requires Python 3.6 or newer), in which case the code would be
symbols = ["abc","xyz","123"]
for s in symbols:
    url = f'https://www.example.com?symbol={s}'
    print(url)
Alternatively, you might use the optional params argument of the requests.get function, as follows
import requests
symbols = ["abc","xyz","123"]
for s in symbols:
    r = requests.get('https://www.example.com', params={'symbol': s})
    print(r.url)
output
https://www.example.com/?symbol=abc
https://www.example.com/?symbol=xyz
https://www.example.com/?symbol=123
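Applied to the URL from the question, the same idea might look like the sketch below. It only builds the URLs and sends no requests; the endpoint and query parameters are taken from the question, while the helper name build_url and the three-symbol list are placeholders for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_url(symbol, apikey="demo"):
    """Build the balance-sheet query URL for one symbol (hypothetical helper)."""
    query = urlencode({"function": "BALANCE_SHEET", "symbol": symbol, "apikey": apikey})
    return f"{BASE_URL}?{query}"


symbols = ["MMM", "AOS", "ABT"]  # stand-in for the full list of 500 symbols
urls = [build_url(s) for s in symbols]
for url in urls:
    print(url)
# https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo
# https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo
# https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo
```

Each URL can then be passed to requests.get(url) and decoded with r.json(), exactly as in the single-symbol example from the question.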
I am trying to match a string in Python with the regex method re.search(), without luck.
This is my code:
import re
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
for url in urls:
    c_url = re.compile(url)
    result = re.search(c_url, request_path)
    if isinstance(result, re.Match):
        allowed_url = url
        break
print(allowed_url) # must be /colpos/papanicolaou2
What do I want to happen? If url is contained in request_path (in this case partially), I expect result to be a re.Match instance, not None.
How can I achieve this? Is there a better way to know whether my request_path matches one of the urls?
The code above only works if url and request_path are exactly the same, which I don't want. How should I use re.search() in Python to achieve this?
Thank you.
I tried checking it with the "in" keyword instead of using re module. I think it is simpler and more readable.
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
allowed_urls = []
for url in urls:
    if url in request_path:
        allowed_urls.append(url)
print(allowed_urls) # this contains '/colpos/papanicolaou2' like you wanted
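One caveat: the in operator matches the substring anywhere in the path, so '/other/colpos/transfer' would also be accepted. If only paths that begin with an allowed URL should match, a prefix check is a possible alternative. This is a minimal sketch; the function name find_allowed_url is made up for illustration.

```python
def find_allowed_url(request_path, urls):
    """Return the first url that request_path starts with, else None."""
    for url in urls:
        # Require an exact match or a "/" right after the prefix, so that
        # '/colpos/transfer' does not match '/colpos/transfers'.
        if request_path == url or request_path.startswith(url + "/"):
            return url
    return None


urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
print(find_allowed_url('/colpos/papanicolaou2/124579/1254', urls))
# /colpos/papanicolaou2
print(find_allowed_url('/other/colpos/transfer', urls))
# None
```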
In case you have just 2 fixed (real) parts in your request_path, you could do the following (no loops, no regex, just Python):
/colpos/papanicolaou2/124579/1254
/part_1/part_2 /param1/param2/...
Code:
urls = ['colpos/prescription', 'colpos/transfer', 'colpos/papanicolaou2', 'colpos/biopsia']
request_path = "/colpos/papanicolaou2/124579/1254"
p1, p2, params = request_path[1:].split('/', 2)
if '/'.join([p1, p2]).lower() not in urls:
    #raise Error(404)
    print("url not found")
Note: You would need to make it more stable for production usage :)
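As one example of what "more stable" could mean: the three-way unpacking above raises a ValueError when the path has fewer than three segments. A slightly more defensive sketch, with the hypothetical helper name is_known_path, could look like this:

```python
def is_known_path(request_path, urls):
    """Check whether the first two path segments form a known url."""
    parts = request_path.strip("/").split("/")
    if len(parts) < 2:
        # Too short to contain both fixed parts.
        return False
    return "/".join(parts[:2]).lower() in urls


urls = ['colpos/prescription', 'colpos/transfer', 'colpos/papanicolaou2', 'colpos/biopsia']
print(is_known_path("/colpos/papanicolaou2/124579/1254", urls))
# True
print(is_known_path("/colpos", urls))
# False
```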
I have this URL:
path('<slug>/thank_you/<user_id>', thank_you, name='thank_you'),
I want the <user_id> to be optional, but I don't want to define two URLs like this:
path('<slug>/thank_you', thank_you, name='thank_you'),
path('<slug>/thank_you/<user_id>', thank_you, name='thank_you2'),
I understand that you can make it optional using a regex, but that only applies if you're using Django < 2 (with url, not path).
How do I achieve this?
You can use a URL query string for this. For example:
# URL
path('<slug>/thank_you/', thank_you, name='thank_you'),
# View
def thank_you(request, slug):
    user_id = request.GET.get('from')
    # rest of the code
# Example route
http://localhost:8000/dummy-slug/thank_you/?from=dummy_user_id
Is there a cleaner way to modify some parts of a URL in Python 2?
For example
http://foo/bar -> http://foo/yah
At present, I'm doing this:
import urlparse
url = 'http://foo/bar'
# Modify path component of URL from 'bar' to 'yah'
# Use nasty convert-to-list hack due to urlparse.ParseResult being immutable
parts = list(urlparse.urlparse(url))
parts[2] = 'yah'
url = urlparse.urlunparse(parts)
Is there a cleaner solution?
Unfortunately, the documentation is out of date; the results produced by urlparse.urlparse() (and urlparse.urlsplit()) use a collections.namedtuple()-produced class as a base.
Don't turn this namedtuple into a list, but make use of the utility method provided for just this task:
parts = urlparse.urlparse(url)
parts = parts._replace(path='yah')
url = parts.geturl()
The namedtuple._replace() method lets you create a new copy with specific elements replaced. The ParseResult.geturl() method then re-joins the parts into a url for you.
Demo:
>>> import urlparse
>>> url = 'http://foo/bar'
>>> parts = urlparse.urlparse(url)
>>> parts = parts._replace(path='yah')
>>> parts.geturl()
'http://foo/yah'
mgilson filed a bug report (with patch) to address the documentation issue.
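The question is about Python 2, where these functions live in the urlparse module. In Python 3 the same functions are in urllib.parse, and the approach is otherwise unchanged. A minimal sketch, with the helper name replace_path chosen only for illustration:

```python
from urllib.parse import urlparse


def replace_path(url, new_path):
    """Return url with its path component swapped for new_path."""
    return urlparse(url)._replace(path=new_path).geturl()


print(replace_path('http://foo/bar', '/yah'))
# http://foo/yah
print(replace_path('http://foo/bar?q=1#top', '/yah'))
# http://foo/yah?q=1#top
```

The query string and fragment are carried over untouched, because _replace only swaps the named field.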
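I think the proper way to do it is the following, since using methods or variables that start with an underscore, such as _replace, is generally discouraged. (Note, however, that the namedtuple documentation lists _replace as a public method; the underscore is only there to avoid clashes with field names.)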
from urlparse import urlparse, urlunparse
res = urlparse('http://www.goog.com:80/this/is/path/;param=paramval?q=val&foo=bar#hash')
l_res = list(res)
# l_res is now ['http', 'www.goog.com:80', '/this/is/path/', 'param=paramval', 'q=val&foo=bar', 'hash']
l_res[2] = '/new/path'
urlunparse(l_res)
# outputs 'http://www.goog.com:80/new/path;param=paramval?q=val&foo=bar#hash'
I have this line in my Python script:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-GB']")
but sometimes the last two letters of the storeURL-GB key (the country code) change, so I am trying to use something like this, but it doesn't work:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-\.*']")
Any suggestions please?
You should probably try .xpath() and starts-with():
urls = tree.xpath("//video/products/product/read_only_info/read_only_value[starts-with(@key, 'storeURL-')]")
if urls:
    url = urls[0]
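The snippet above assumes tree comes from lxml, which provides .xpath() with full XPath 1.0 functions. If only the standard library's xml.etree.ElementTree is available, its limited XPath subset has no starts-with(), so the prefix test has to be done in Python instead. The sketch below shows that swapped-in approach; the sample XML document is made up to match the path in the question.

```python
import xml.etree.ElementTree as ET

# Made-up sample document shaped like the path in the question.
xml_text = """
<video>
  <products>
    <product>
      <read_only_info>
        <read_only_value key="title">Example</read_only_value>
        <read_only_value key="storeURL-US">https://example.com/us</read_only_value>
      </read_only_info>
    </product>
  </products>
</video>
"""

root = ET.fromstring(xml_text)  # root is the <video> element

# Select all candidate elements, then filter on the key prefix in Python.
matches = [
    el
    for el in root.iterfind("products/product/read_only_info/read_only_value")
    if el.get("key", "").startswith("storeURL-")
]
url = matches[0].text if matches else None
print(url)
# https://example.com/us
```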