I have two urls:
url1 = "http://127.0.0.1/test1/test2/test3/test5.xml"
url2 = "../../test4/test6.xml"
How can I get an absolute url for url2?
You should use urlparse.urljoin :
>>> import urlparse
>>> urlparse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
With Python 3 (where urlparse is renamed to urllib.parse) you could use it as follow:
>>> import urllib.parse
>>> urllib.parse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
If your relative path consists of multiple parts, you have to join them separately, since urljoin would replace the relative path, not join it. The easiest way to do that is to use posixpath.
>>> import urllib.parse
>>> import posixpath
>>> url1 = "http://127.0.0.1"
>>> url2 = "test1"
>>> url3 = "test2"
>>> url4 = "test3"
>>> url5 = "test5.xml"
>>> url_path = posixpath.join(url2, url3, url4, url5)
>>> urllib.parse.urljoin(url1, url_path)
'http://127.0.0.1/test1/test2/test3/test5.xml'
See also: How to join components of a path when you are constructing a URL in Python
es = ['http://127.0.0.1', 'test1', 'test4', 'test6.xml']
base = ''
map(lambda e: urlparse.urljoin(base, e), es)
You can use reduce to achieve Shikhar's method in a cleaner fashion.
>>> import urllib.parse
>>> from functools import reduce
>>> reduce(urllib.parse.urljoin, ["http://moc.com/", "path1/", "path2/", "path3/"])
'http://moc.com/path1/path2/path3/'
Note that with this method each fragment should have trailing forward-slash, with no leading forward-slash, to indicate it is a path fragment being joined.
This is more correct/informative, telling you that path1/ is a URI path fragment, and not the full path (e.g. /path1/) or an unknown (e.g. path1). An unknown could be either, but they are handled as a full path.
If you need to add / to a fragment lacking it, you could do:
uri = uri if uri.endswith("/") else f"{uri}/"
To learn more about URI resolution, Wikipedia has some nice examples.
Updates
Just noticed Peter Perron commented about reduce on Shikhar's answer, but I'll leave this here then to demonstrate how that's done.
Updated wikipedia URL
For python 3.0+ the correct way to join urls is:
from urllib.parse import urljoin
urljoin('https://10.66.0.200/', '/api/org')
# output : 'https://10.66.0.200/api/org'
>>> from urlparse import urljoin
>>> url1 = "http://www.youtube.com/user/khanacademy"
>>> url2 = "/user/khanacademy"
>>> urljoin(url1, url2)
'http://www.youtube.com/user/khanacademy'
Simple.
Related
Is it possible to take from a html code value of url from lines:
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
Through python's re module,
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
... print i
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re
for style_tag in soup.find_all('style', text=re.compile('#import\s+url')):
style_text = style_tag.get_text()
urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
print urls
http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2
I don't need the last /ref=zg_bsms_shoes_2
I have the values in the urls=[]
for productlink in products:
self.urls.append(productlink)
def save(self):
self.br.quit()
f=open(self.product_file,"w")
for url in self.urls:
f.write(url+"\n")
f.flush()
How to strip it? Also with a fail proof if I don't have /ref=?
I'd strongly encourage you to start with urlparse:
In python3:
>>> import os
>>> from urllib.parse import urlparse
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
urlparse will turn the URL into all its component pieces and then you can work with the path in any number of ways, simple string splitting, os.path.split, regex, whatever you like.
In Python2 just use from urlparse import urlparse
>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
if 'ref' in url.split('/')[-1]: #Failsafe
url = '/'.join(url.split('/')[:-1]
A url looks like:
http://www.example.com/cgi-bin/blahblah?&PC=abd23423&uy=020
I need to extract the value: abc23423
I tried this regex but its not working:
rx = re.compile(r'PC=(\w*)&uy=')
I then I did:
pc = rx.search(url).groups()
but I get an error:
attribute error: nonetype object has no attribute groups.
Try urlparse.
Update
Sheesh. What was I thinking?
import urlparse
u = 'http://www.example.com/cgi-bin/blahblah?&PC=abd23423&uy=020'
query = urlparse.urlparse(u).query
urlparse.parse_qs(query) # {'PC': ['abd23423'], 'uy': ['020']}
Original Answer
This code snippet worked for me. Take a look:
import urlparse, re
u = 'http://www.example.com/cgi-bin/blahblah?&PC=abd23423&uy=020'
query = urlparse.urlparse(u).query
pattern = re.compile('PC=(\w*)&uy')
pattern.findall(query) # ['abd23423']
lol = "http://www.example.com/cgi-bin/blahblah?&PC=abd23423&uy=020"
s = re.compile("&PC=(\w+)&uy=")
g = s.search(lol)
g.groups()
('abd23423',)
This seems to work for me.
s = 'Tara%2520Stiles%2520Living'
How do I turn it into:
Tara Stiles Living
You need to use urllib.unquote, but it appears you need to use it twice:
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> urllib.unquote(urllib.unquote(s))
'Tara Stiles Living'
After unquoting once, your "%2520" turns into "%20", which unquoting again turns into " " (a space).
Use:
urllib.unquote(string)
http://docs.python.org/library/urllib.html
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> t=urllib.unquote_plus(s)
>>> print t
Tara%20Stiles%20Living
>>> urllib.unquote_plus(t)
'Tara Stiles Living'
>>>
import urllib
s = 'Tara%2520Stiles%2520Living'
t=''
while s<>t: s,t=t,urllib.unquote(s)
If you are using Python you should use urllib.parse.unquote(url) like the following code :
import urllib
url = "http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E"
print(url)
print(urllib.parse.unquote(url))
This code will output the following :
>>> print(url)
http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E
>>> print(urllib.parse.unquote(url))
http://url-with-quoted-char:<span> </span>
I am just learning python and is interested in how this can be accomplished. During the search for the answer, I came across this service: http://www.longurlplease.com
For example:
http://bit.ly/rgCbf can be converted to:
http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place
I did some inspecting with Firefox and see that the original url is not in the header.
Enter urllib2, which offers the easiest way of doing this:
>>> import urllib2
>>> fp = urllib2.urlopen('http://bit.ly/rgCbf')
>>> fp.geturl()
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
For reference's sake, however, note that this is also possible with httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection('bit.ly')
>>> conn.request('HEAD', '/rgCbf')
>>> response = conn.getresponse()
>>> response.getheader('location')
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
And with PycURL, although I'm not sure if this is the best way to do it using it:
>>> import pycurl
>>> conn = pycurl.Curl()
>>> conn.setopt(pycurl.URL, "http://bit.ly/rgCbf")
>>> conn.setopt(pycurl.FOLLOWLOCATION, 1)
>>> conn.setopt(pycurl.CUSTOMREQUEST, 'HEAD')
>>> conn.setopt(pycurl.NOBODY, True)
>>> conn.perform()
>>> conn.getinfo(pycurl.EFFECTIVE_URL)
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'