How to find URL in another URL? - python

Kinda tricky question about regexes. I have url of such a pattern:
http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480
how can I extract imgurl value?

Take a look at urlparse
http://docs.python.org/2/library/urlparse.html
You can easily split your URL into parameters and then exctract whatever you need.
Example:
import urlparse
url = "http://www.domain.com/img?res=high&refurl=http://www.ahother_domain.com/page/&imgurl=http://www.one_more.com/static/images/mercedes.jpg&w=640&h=480"
urlParams = urlparse.parse_qs(urlparse.urlparse(url).query)
urlInUrl = urlParams['imgurl']
print urlInUrl

This solution asssumes that the imgurl param value is always followed by size params such as: &w=...:
import re
re.findall('imgurl=([^&]+)&', url)

Related

How can I remove 'www.' from original URL through [urllib] parse in python?

Original URL ▶ https://www.exeam.org/index.html
I want to extract exeam.org/ or exeam.org from original URL.
To do this, I used urllib the most powerful parser in Python that I know,
but unfortunately urllib (url.scheme, url.netloc ...) couldn't give me the type of format I wanted.
to extract the domain name from a url using `urllib):
from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)
this will give you www.exam.org, but you want to further decompose this to registered domain if you are after just the exam.org part. so besides doing simple splits, which could be sufficient, you could also use library such as tldextract which knows how to parse subdmains, suffixes and more:
from tldextract import extract
ext = extract(surl)
print(ext.registered_domain)
this will produce:
exam.org

I get InvalidURL: URL can't contain control characters when I try to send a request using urllib

I am trying to get a JSON response from the link used as a parameter to the urllib request. but it gives me an error that it can't contain control characters.
how can I solve the issue?
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(start_url).read()
the error I get is :
URL can't contain control characters. '/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq=' (found at least ' ')
Replacing whitespace with:
url = url.replace(" ", "%20")
if the problem is with the whitespace.
Spaces are not allowed in URL, I removed them and it seems to be working now:
import urllib.request
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
url = start_url.replace(" ","")
source = urllib.request.urlopen(url).read()
Solr search strings can get pretty weird. Better use the 'quote' method to encode characters before making the request. See example below:
from urllib.parse import quote
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(quote(start_url)).read()
Better later than never...
You probably already found out by now but let's get it written here.
There can't be any space character in the URL, and there are 2, after bundle_fq e dm_field_deadlineTo_fq
Remove those and you're good to go
Like the error message says, there are some control characters in your url, which doesn't seem to be a valid one by the way.
You need to encode the control characters inside the URL. Especially spaces need to be encoded to %20.
Parsing the url first and then encoding the url elements would work.
import urllib.request
from urllib.parse import urlparse, quote
def make_safe_url(url: str) -> str:
"""
Returns a parsed and quoted url
"""
_url = urlparse(url)
url = _url.scheme + "://" + _url.netloc + quote(_url.path) + "?" + quote(_url.query)
return url
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
start_url = make_safe_url(start_url)
source = urllib.request.urlopen(start_url).read()
The code returns the JSON-document despite the double forward-slash and the whitespace in the url.

How to use regex & python to parse lat/long from a url?

I have the url: https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
I wanted to extract the lat/long from the url, so that I have 44.864505,-93.44873.
So far I have (^[maps?ll=]*$|(?<=\?).*)* which gives me ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3
but this needs impovement. I have been trying to use pythex to work this out, but I am stuck.
Any suggestions? Thanks
I wouldn't use regex, I'd use urlparse
For Python2:
import urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']
For Python3 (urllib.parse) :
import urllib.parse as urlparse
url = 'https://maps.google.com/maps?ll=44.864505,-93.44873&z=18&t=m&hl=en&gl=US&mapclient=apiv3'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qs(parsed.query)
print(params['ll'])
prints:
['44.864505,-93.44873']

How to remove query string from a url?

I have the following URL:
https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1
I need URLs for example:
https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
My attempt:
url='https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
url=url.split('?')[0]
if '#' in url:
url = url.split('#')[0]
I think this is a stupid way
The very helpful library furl makes it trivial to remove both query and fragment parts:
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
https://hi.com/
You can split on something that doesn't exist in the string, you'll just get a list of one element, so depending on your goal, you could do something like this to simplify your existing code:
url = url.split('?')[0].split('#')[0]
Not saying this is the best way (furl is a great solution), but it is a way.
In your example you're also removing the fragment (the thing after a #), not just the query.
You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns and converting back to a string URL with urllib.parse.unsplit:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
Output:
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'
You could try
urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
"https://stackoverflow.com/questions/7990300?fr=aladdin",
"https://stackoverflow.com/questions/22375#6",
"https://stackoverflow.com/questions/22375"?,
"https://stackoverflow.com/questions/22375#3_1"]
urls_without_query = [url.split('?')[0] for url in urls]
for example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split() returns a list that looks like ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], and if that string is url, url.split('?')[0] would give you "https://stackoverflow.com/questions/7990301".
Edit: I didn't think about # arguments. The other answers might help you more :)
You can use w3lib
from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)
Here is an answer using standard libraries, and which parses the URL properly:
from urllib.parse import urlparse
url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme,"://",parsed.netloc,parsed.path]))
expected output:
http://www.example.com/this/category
Note: this also strips params and the fragment, but is easy to modify to include those if you want.

Slicing url with Python

Hi how to use python to transform the url of a article to it's print url.
article url:http://www.indianexpress.com/news/second-time-as-farce/800228/0
print url:http://www.indianexpress.com/story-print/800228/
How to convert article url to print url?
Use urllib.parse.urlparse() to carve the path from the rest of the url, and posixpath.split() and posixpath.join() to reform the path, and urllib.parse.urlunparse() to put it all back together again.
from urllib.parse import urlparse
def transform(url):
parsed = urlparse(url)
return '{0}://{1}/story-print/{2}/'.format(parsed.scheme, parsed.netloc, parsed.path.split('/')[-2])

Categories

Resources