I am just learning Python and am interested in how this can be accomplished. While searching for the answer, I came across this service: http://www.longurlplease.com
For example:
http://bit.ly/rgCbf can be converted to:
http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place
I did some inspecting with Firefox and saw that the original URL is not in the header.
Enter urllib2, which offers the easiest way of doing this:
>>> import urllib2
>>> fp = urllib2.urlopen('http://bit.ly/rgCbf')
>>> fp.geturl()
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
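If you are on Python 3, where urllib2's functionality lives in urllib.request, the same approach should look roughly like this (a sketch; it assumes the short URL still resolves to the same place):
>>> import urllib.request
>>> fp = urllib.request.urlopen('http://bit.ly/rgCbf')
>>> fp.geturl()
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'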
For reference's sake, however, note that this is also possible with httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection('bit.ly')
>>> conn.request('HEAD', '/rgCbf')
>>> response = conn.getresponse()
>>> response.getheader('location')
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
And with PycURL, although I'm not sure this is the best way to use it:
>>> import pycurl
>>> conn = pycurl.Curl()
>>> conn.setopt(pycurl.URL, "http://bit.ly/rgCbf")
>>> conn.setopt(pycurl.FOLLOWLOCATION, 1)
>>> conn.setopt(pycurl.CUSTOMREQUEST, 'HEAD')
>>> conn.setopt(pycurl.NOBODY, True)
>>> conn.perform()
>>> conn.getinfo(pycurl.EFFECTIVE_URL)
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
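For completeness, if the third-party requests library is available, a HEAD request with redirects enabled exposes the final URL as well; this is a sketch of the idea rather than the canonical way:
>>> import requests
>>> r = requests.head('http://bit.ly/rgCbf', allow_redirects=True)
>>> r.url
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'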
How can I get all the inner HTML from a node that I select using an etree XPath expression:
>>> from lxml import etree
>>> from StringIO import StringIO
>>> doc = '<foo><bar><div>привет привет</div></bar></foo>'
>>> hparser = etree.HTMLParser()
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
How can I now print foo_element's inner HTML as text? I need to get this:
<bar><div>привет привет</div></bar>
By the way, when I tried to use lxml.html.tostring I got strange output:
>>> import lxml.etree
>>> lxml.html.tostring(foo_element[0])
'<foo><bar><div>пÑÐ¸Ð²ÐµÑ Ð¿ÑвиеÑ</div></bar></foo>'
You can apply the same technique as shown in this other SO post. Here is an example in the context of this question:
>>> from lxml import etree
>>> from lxml import html
>>> from StringIO import StringIO
>>> doc = '<foo><bar><div>TEST NODE</div></bar></foo>'
>>> hparser = etree.HTMLParser()
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
>>> print ''.join(html.tostring(e) for e in foo_element[0])
<bar><div>TEST NODE</div></bar>
Or, to handle the case where the element may contain a text node child:
>>> doc = '<foo>text node child<bar><div>TEST NODE</div></bar></foo>'
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
>>> print foo_element[0].text + ''.join(html.tostring(e) for e in foo_element[0])
text node child<bar><div>TEST NODE</div></bar>
Refactoring the code into a separate function, as shown in the linked post, is strongly advised for real-world use.
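For instance, a minimal sketch of such a helper could look like this (inner_html is just an illustrative name):
from lxml import etree, html
from StringIO import StringIO

def inner_html(element):
    # Leading text node (if any) plus the serialized child elements, tails included.
    return (element.text or '') + ''.join(html.tostring(child) for child in element)

doc = '<foo>text node child<bar><div>TEST NODE</div></bar></foo>'
htree = etree.parse(StringIO(doc), etree.HTMLParser())
print inner_html(htree.xpath("//foo")[0])
# text node child<bar><div>TEST NODE</div></bar>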
http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2
I don't need the last /ref=zg_bsms_shoes_2
I have the values collected in urls = []:
for productlink in products:
self.urls.append(productlink)
def save(self):
self.br.quit()
f=open(self.product_file,"w")
for url in self.urls:
f.write(url+"\n")
f.flush()
How do I strip it off? And how can I make it fail-safe for URLs that don't have a /ref= segment?
I'd strongly encourage you to start with urlparse:
In Python 3:
>>> import os
>>> from urllib.parse import urlparse
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
urlparse will turn the URL into all its component pieces and then you can work with the path in any number of ways, simple string splitting, os.path.split, regex, whatever you like.
In Python 2, just use from urlparse import urlparse instead.
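If you want the whole URL back rather than just the path, a minimal sketch pairing urlparse with urlunparse might look like this (strip_ref is a made-up name for illustration):
from urllib.parse import urlparse, urlunparse

def strip_ref(url):
    # Drop a trailing /ref=... segment from the path, if present; otherwise leave the URL alone.
    parts = urlparse(url)
    head, _, last = parts.path.rpartition('/')
    path = head if last.startswith('ref=') else parts.path
    return urlunparse(parts._replace(path=path))

print(strip_ref('http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'))
# http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW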
>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
if 'ref' in url.split('/')[-1]:  # failsafe: only strip when the last segment contains ref
    url = '/'.join(url.split('/')[:-1])
I was testing the snippet below and ran into something that confused me:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<a>Foo</a>")
>>> soup.a.append("Bar")
>>> soup
<a>FooBar</a>
>>> soup.a.contents
[u'Foo', u'Bar']
>>>
I am confused: why did it come out as [u'Foo', u'Bar'] instead of [u'FooBar']?
Can you help me understand this?
append() adds a new NavigableString child to the tag rather than merging it with the existing text node; BeautifulSoup does not merge adjacent strings, so contents ends up with two entries even though they render as FooBar. If you want a single string in contents, rebuild it yourself. Try this:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup("<a>Foo</a>")
>>> soup.a.contents = [NavigableString(str(soup.a.contents[0]) + 'Bar')]
>>> soup
<a>FooBar</a>
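As a side note: if all you need is the merged text rather than a single node in contents, bs4's get_text() should give you the concatenation directly (this continues the interactive session above):
>>> soup.a.get_text()
u'FooBar'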
I have two urls:
url1 = "http://127.0.0.1/test1/test2/test3/test5.xml"
url2 = "../../test4/test6.xml"
How can I get an absolute url for url2?
You should use urlparse.urljoin:
>>> import urlparse
>>> urlparse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
With Python 3 (where urlparse is renamed to urllib.parse) you can use it as follows:
>>> import urllib.parse
>>> urllib.parse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
If your relative path consists of multiple parts, you have to join them separately, since urljoin would replace the relative path, not join it. The easiest way to do that is to use posixpath.
>>> import urllib.parse
>>> import posixpath
>>> url1 = "http://127.0.0.1"
>>> url2 = "test1"
>>> url3 = "test2"
>>> url4 = "test3"
>>> url5 = "test5.xml"
>>> url_path = posixpath.join(url2, url3, url4, url5)
>>> urllib.parse.urljoin(url1, url_path)
'http://127.0.0.1/test1/test2/test3/test5.xml'
See also: How to join components of a path when you are constructing a URL in Python
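A small convenience wrapper over the same idea might look like this (url_path_join is an illustrative name, not a standard library function):
import posixpath
import urllib.parse

def url_path_join(base, *parts):
    # Join the path components first, then resolve them against the base URL.
    return urllib.parse.urljoin(base, posixpath.join(*parts))

print(url_path_join("http://127.0.0.1", "test1", "test2", "test3", "test5.xml"))
# http://127.0.0.1/test1/test2/test3/test5.xml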
import urlparse
# Fold the fragments together; all but the last need a trailing slash.
es = ['http://127.0.0.1/', 'test1/', 'test4/', 'test6.xml']
base = ''
for e in es:
    base = urlparse.urljoin(base, e)  # -> 'http://127.0.0.1/test1/test4/test6.xml'
You can use reduce to achieve Shikhar's method in a cleaner fashion.
>>> import urllib.parse
>>> from functools import reduce
>>> reduce(urllib.parse.urljoin, ["http://moc.com/", "path1/", "path2/", "path3/"])
'http://moc.com/path1/path2/path3/'
Note that with this method each fragment should have a trailing forward slash and no leading forward slash, to indicate that it is a path fragment being joined.
This is more correct and more informative: it tells you that path1/ is a URI path fragment, rather than a full path (e.g. /path1/) or an unknown (e.g. path1). An unknown could be intended as either, but urljoin resolves it like a plain resource name, so the next fragment in the chain replaces it instead of appending to it.
If you need to add / to a fragment lacking it, you could do:
uri = uri if uri.endswith("/") else f"{uri}/"
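For example, a sketch that normalizes the fragments before reducing could look like this (join_url is a made-up helper name):
from functools import reduce
from urllib.parse import urljoin

def join_url(*fragments):
    # Give every fragment except the last a trailing slash so urljoin appends instead of replacing.
    normalized = [f if f.endswith('/') else f + '/' for f in fragments[:-1]]
    return reduce(urljoin, normalized + [fragments[-1]])

print(join_url("http://moc.com", "path1", "path2", "file.xml"))
# http://moc.com/path1/path2/file.xml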
To learn more about URI resolution, Wikipedia has some nice examples.
For Python 3.0+, the correct way to join URLs is:
from urllib.parse import urljoin
urljoin('https://10.66.0.200/', '/api/org')
# output : 'https://10.66.0.200/api/org'
>>> from urlparse import urljoin
>>> url1 = "http://www.youtube.com/user/khanacademy"
>>> url2 = "/user/khanacademy"
>>> urljoin(url1, url2)
'http://www.youtube.com/user/khanacademy'
Simple.
s = 'Tara%2520Stiles%2520Living'
How do I turn it into:
Tara Stiles Living
You need to use urllib.unquote, but it appears you need to use it twice:
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> urllib.unquote(urllib.unquote(s))
'Tara Stiles Living'
After unquoting once, your "%2520" turns into "%20", which unquoting again turns into " " (a space).
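On Python 3, where unquote lives in urllib.parse, the same double unquoting looks like this:
>>> from urllib.parse import unquote
>>> s = 'Tara%2520Stiles%2520Living'
>>> unquote(unquote(s))
'Tara Stiles Living'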
Use:
urllib.unquote(string)
http://docs.python.org/library/urllib.html
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> t=urllib.unquote_plus(s)
>>> print t
Tara%20Stiles%20Living
>>> urllib.unquote_plus(t)
'Tara Stiles Living'
>>>
import urllib
s = 'Tara%2520Stiles%2520Living'
t = ''
while s != t:  # keep unquoting until the string stops changing
    s, t = urllib.unquote(s), s
If you are using Python 3, you should use urllib.parse.unquote(url), as in the following code:
import urllib.parse
url = "http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E"
print(url)
print(urllib.parse.unquote(url))
This code will output the following:
>>> print(url)
http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E
>>> print(urllib.parse.unquote(url))
http://url-with-quoted-char:<span> </span>