Slicing url with Python - python

Hi, how can I use Python to transform the URL of an article into its print URL?
article url: http://www.indianexpress.com/news/second-time-as-farce/800228/0
print url: http://www.indianexpress.com/story-print/800228/
How do I convert the article URL to the print URL?

Use urllib.parse.urlparse() to carve the path from the rest of the url, and posixpath.split() and posixpath.join() to reform the path, and urllib.parse.urlunparse() to put it all back together again.

from urllib.parse import urlparse
def transform(url):
    parsed = urlparse(url)
    # The article id is the second-to-last path segment (e.g. '800228').
    return '{0}://{1}/story-print/{2}/'.format(parsed.scheme, parsed.netloc, parsed.path.split('/')[-2])
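The posixpath-based route described in the answer can be sketched roughly as follows. This is only an illustration: the function name to_print_url is hypothetical, and it assumes the article id is always the second-to-last path segment, as in the example URL.

```python
import posixpath
from urllib.parse import urlparse, urlunparse

def to_print_url(url):
    parsed = urlparse(url)
    # Split off the trailing page segment ('0'), leaving the path up to the id.
    head, _page = posixpath.split(parsed.path)
    # The last component of what remains is assumed to be the article id.
    article_id = posixpath.basename(head)
    new_path = posixpath.join('/story-print', article_id) + '/'
    # Rebuild the URL with the new path and no params, query or fragment.
    return urlunparse(parsed._replace(path=new_path, params='', query='', fragment=''))

print(to_print_url('http://www.indianexpress.com/news/second-time-as-farce/800228/0'))
```

Rebuilding with urlunparse keeps the scheme and host handling in the standard library rather than in string formatting.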

Related

How can I remove 'www.' from original URL through [urllib] parse in python?

Original URL ▶ https://www.exeam.org/index.html
I want to extract exeam.org/ or exeam.org from original URL.
To do this, I used urllib, the most powerful parser in Python that I know of,
but unfortunately urllib (url.scheme, url.netloc ...) couldn't give me the format I wanted.
To extract the domain name from a URL using urllib:
from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)
This will give you www.exam.org. If you are after just the exam.org part, you need to decompose this further to the registered domain. Simple splits could be sufficient, but you could also use a library such as tldextract, which knows how to parse subdomains, suffixes and more:
from tldextract import extract
ext = extract(surl)
print(ext.registered_domain)
this will produce:
exam.org
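If only a leading www. needs to go, the "simple split" route mentioned above might look like this standard-library sketch. The helper name strip_www is hypothetical, and unlike tldextract it knows nothing about other subdomains or public suffixes.

```python
from urllib.parse import urlparse

def strip_www(url):
    # hostname is the lower-cased host without any port; it is None when
    # the URL has no network location, hence the fallback to "".
    host = urlparse(url).hostname or ""
    # Drop only a literal leading 'www.'; other subdomains are left alone.
    return host[4:] if host.startswith("www.") else host

print(strip_www("https://www.exam.org/index.html"))
```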

How to use regex & python to parse lat/long from a url?

I have the url: https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
I wanted to extract the lat/long from the url, so that I have 44.864505,-93.44873.
So far I have (^[maps?ll=]*$|(?<=\?).*)* which gives me ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
but this needs improvement. I have been trying to use pythex to work this out, but I am stuck.
Any suggestions? Thanks
I wouldn't use regex; I'd use urlparse.
For Python 2:
import urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
For Python 3 (urllib.parse):
import urllib.parse as urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
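Since parse_qs returns a list of strings per key, one more step is needed to get numeric coordinates. A minimal Python 3 sketch, assuming the ll parameter is present and always has the form lat,long (the function name lat_long is made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

def lat_long(url):
    params = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values; take the first 'll' value.
    lat, lng = params["ll"][0].split(",")
    return float(lat), float(lng)

url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
print(lat_long(url))
```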

How to remove query string from a url?

I have the following URLs:
https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1
I need URLs for example:
https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
My attempt:
url = 'https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]
I think this is a stupid way
The very helpful library furl makes it trivial to remove both the query and fragment parts:
>>> import furl
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
'https://hi.com/'
You can split on something that doesn't exist in the string; you'll just get a list of one element. So, depending on your goal, you could do something like this to simplify your existing code:
url = url.split('?')[0].split('#')[0]
Not saying this is the best way (furl is a great solution), but it is a way.
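To make the one-element-list behaviour concrete, here is a small sketch wrapping the chained split in a function. The name strip_query_and_fragment is hypothetical.

```python
def strip_query_and_fragment(url):
    # str.split returns a one-element list when the separator is absent,
    # so indexing [0] is safe whether or not '?' or '#' appears.
    return url.split('?')[0].split('#')[0]

# No '?' in the string: the result is a list holding the whole string.
print("https://stackoverflow.com/questions/22375".split('?'))
print(strip_query_and_fragment("https://stackoverflow.com/questions/22375#3_1"))
```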
In your example you're also removing the fragment (the thing after a #), not just the query.
You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns and converting back to a string URL with urllib.parse.urlunsplit:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
Output:
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'
You could try:
urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
        "https://stackoverflow.com/questions/7990300?fr=aladdin",
        "https://stackoverflow.com/questions/22375#6",
        "https://stackoverflow.com/questions/22375?",
        "https://stackoverflow.com/questions/22375#3_1"]
urls_without_query = [url.split('?')[0] for url in urls]
For example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split('?') returns a list that looks like ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], and if that string is url, url.split('?')[0] would give you "https://stackoverflow.com/questions/7990301".
Edit: I didn't think about # arguments. The other answers might help you more :)
You can use w3lib
from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)
Here is an answer using standard libraries, and which parses the URL properly:
from urllib.parse import urlparse
url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme,"://",parsed.netloc,parsed.path]))
expected output:
http://www.example.com/this/category
Note: this also strips params and the fragment, but is easy to modify to include those if you want.
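If the fragment should survive and only the query should go, one possible variation is sketched below. It swaps in urlsplit/urlunsplit instead of the urlparse-and-join approach above, so any ;params stay part of the path; the function name remove_query_only is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_query_only(url):
    parts = urlsplit(url)
    # Blank out just the query; scheme, host, path and fragment are kept.
    return urlunsplit(parts._replace(query=""))

print(remove_query_only("http://www.example.com/this/category?one=two"))
print(remove_query_only("https://stackoverflow.com/questions/22375?x=1#3_1"))
```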

How to compare Referer URL in Django Request to another URL using reverse()?

How can I compare the referer URL and reverse() url?
Here is my current code:
if request.META.get('HTTP_REFERER') == reverse('dashboard'):
    print('Yeah!')
But this doesn't work, because reverse() will output /dashboard while HTTP_REFERER contains http://localhost:8000/dashboard/
My current solution is:
if reverse('dashboard') in request.META.get('HTTP_REFERER'):
    print('Yeah!')
I don't know if this is the best way to do this. Any suggestion would be great.
You can use urlparse to get the path element from a URL. In Python3:
from urllib import parse
path = parse.urlparse('http://localhost:8000/dashboard/').path
and in Python 2:
import urlparse
path = urlparse.urlparse('http://localhost:8000/dashboard/').path
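Putting the path extraction to use for the comparison might look like the sketch below. It is written without Django so it runs on its own; the helper name referer_matches is hypothetical, and ignoring a trailing slash on both sides is an assumption made because the question shows /dashboard versus /dashboard/.

```python
from urllib.parse import urlparse

def referer_matches(referer, expected_path):
    # The Referer header may be missing entirely.
    if not referer:
        return False
    referer_path = urlparse(referer).path
    # Assumption: a trailing slash should not affect the comparison.
    return referer_path.rstrip("/") == expected_path.rstrip("/")

print(referer_matches("http://localhost:8000/dashboard/", "/dashboard"))
```

In a view, the two arguments would presumably be request.META.get('HTTP_REFERER') and reverse('dashboard'). Unlike the substring check, this does not match a URL such as /other/dashboard/.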

How to find URL in another URL?

A kind of tricky question about regexes. I have a URL with this pattern:
http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480
How can I extract the imgurl value?
Take a look at urlparse:
http://docs.python.org/2/library/urlparse.html
You can easily split your URL into parameters and then extract whatever you need.
Example:
import urlparse
url = "http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480"
urlParams = urlparse.parse_qs(urlparse.urlparse(url).query)
urlInUrl = urlParams['imgurl']
print urlInUrl
This solution assumes that the imgurl param value is always followed by size params such as &w=...:
import re
re.findall(r'imgurl=([^&]+)&', url)
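For Python 3, the urlparse-based answer above could be adapted roughly as follows. The helper name extract_param is hypothetical, and it returns None when the parameter is absent. Unlike the regex, it does not depend on imgurl being followed by another parameter.

```python
from urllib.parse import urlparse, parse_qs

def extract_param(url, name):
    # parse_qs maps each key to a list of values; return the first, if any.
    values = parse_qs(urlparse(url).query).get(name, [])
    return values[0] if values else None

url = "http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480"
print(extract_param(url, "imgurl"))
```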
