If I have this string in Python, how do I decode it?

s = 'Tara%2520Stiles%2520Living'
How do I turn it into:
Tara Stiles Living

You need to use urllib.unquote, but it appears you need to use it twice:
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> urllib.unquote(urllib.unquote(s))
'Tara Stiles Living'
After unquoting once, your "%2520" turns into "%20", which unquoting again turns into " " (a space).
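If you don't know in advance how many times the string was encoded, one option is to keep unquoting until the result stops changing. The following is a minimal Python 3 sketch of that idea; the helper name fully_unquote and the max_rounds safety limit are assumptions made for illustration, not part of urllib.

```python
from urllib.parse import unquote


def fully_unquote(text, max_rounds=5):
    """Unquote repeatedly until the string no longer changes.

    max_rounds is a hypothetical safety limit so the loop always ends.
    """
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            # Nothing left to decode.
            break
        text = decoded
    return text


print(fully_unquote('Tara%2520Stiles%2520Living'))  # Tara Stiles Living
```

Be careful with this approach if a literal percent sign followed by two hex digits can legitimately appear in your decoded data, since it would be decoded as well.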

Use:
urllib.unquote(string)
http://docs.python.org/library/urllib.html

>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> t=urllib.unquote_plus(s)
>>> print t
Tara%20Stiles%20Living
>>> urllib.unquote_plus(t)
'Tara Stiles Living'
>>>

import urllib
s = 'Tara%2520Stiles%2520Living'
t = urllib.unquote(s)
# keep unquoting until the string stops changing
while s != t: s, t = t, urllib.unquote(t)

If you are using Python 3, use urllib.parse.unquote(url), as in the following code:
import urllib.parse
url = "http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E"
print(url)
print(urllib.parse.unquote(url))
This code outputs the following:
>>> print(url)
http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E
>>> print(urllib.parse.unquote(url))
http://url-with-quoted-char:<span> </span>
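The answers above use both unquote and unquote_plus. In Python 3 both live in urllib.parse, and they differ only in how they treat a plus sign: unquote_plus also turns "+" into a space, which is how HTML form values are encoded. A small sketch, where the sample string is a made-up input for illustration:

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical input that mixes %20 and + as space encodings.
sample = "Tara%20Stiles+Living"

print(unquote(sample))       # Tara Stiles+Living
print(unquote_plus(sample))  # Tara Stiles Living
```

For the string in the question there is no "+", so either function gives the same result.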

Related

BeautifulSoup: take the URL of imported modules

Is it possible to extract the URL values from lines like these in HTML code?
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
You can do this with Python's re module:
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
...     print(i)
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re
# raw strings keep \s and \( from being treated as string escapes
for style_tag in soup.find_all('style', text=re.compile(r'#import\s+url')):
    style_text = style_tag.get_text()
    urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
    print(urls)

Strip Some Part Of URL And Save File

http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2
I don't need the last /ref=zg_bsms_shoes_2
I have the values in the urls=[]
for productlink in products:
    self.urls.append(productlink)

def save(self):
    self.br.quit()
    f=open(self.product_file,"w")
    for url in self.urls:
        f.write(url+"\n")
    f.flush()
How do I strip it? It should also be fail-safe for URLs that don't have /ref=.
I'd strongly encourage you to start with urlparse:
In Python 3:
>>> import os
>>> from urllib.parse import urlparse
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
urlparse will turn the URL into all its component pieces and then you can work with the path in any number of ways, simple string splitting, os.path.split, regex, whatever you like.
In Python 2, use from urlparse import urlparse instead.
>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
if 'ref' in url.split('/')[-1]:  # Failsafe: only strip when the last segment is the ref part
    url = '/'.join(url.split('/')[:-1])
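The urlparse suggestion and the failsafe check can be combined. The sketch below is one possible Python 3 version, not the only way to do it; the function name strip_ref is an assumption made for illustration. It removes the last path segment only when that segment starts with ref=, and leaves every other URL unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_ref(url):
    """Drop a trailing /ref=... path segment if present (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.split('/')
    if segments[-1].startswith('ref='):
        segments = segments[:-1]
    # Rebuild the URL with the (possibly shortened) path.
    return urlunsplit(parts._replace(path='/'.join(segments)))


with_ref = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
without_ref = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'

print(strip_ref(with_ref))     # http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW
print(strip_ref(without_ref))  # http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW
```

Working on the parsed path rather than the whole string means a query string or fragment containing "ref" is not affected.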

How to join absolute and relative urls?

I have two urls:
url1 = "http://127.0.0.1/test1/test2/test3/test5.xml"
url2 = "../../test4/test6.xml"
How can I get an absolute url for url2?
You should use urlparse.urljoin :
>>> import urlparse
>>> urlparse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
With Python 3 (where urlparse is renamed to urllib.parse) you could use it as follows:
>>> import urllib.parse
>>> urllib.parse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
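It may help to see how urljoin treats different kinds of second argument against the same base. The sketch below uses the base URL from the question; the extra relative URLs are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://127.0.0.1/test1/test2/test3/test5.xml"

# Relative path with "..": resolved against the directory of the base.
print(urljoin(base, "../../test4/test6.xml"))  # http://127.0.0.1/test1/test4/test6.xml

# Plain file name: replaces the last segment of the base path.
print(urljoin(base, "test6.xml"))  # http://127.0.0.1/test1/test2/test3/test6.xml

# Leading slash: replaces the whole path, keeping scheme and host.
print(urljoin(base, "/test4/test6.xml"))  # http://127.0.0.1/test4/test6.xml
```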
If your relative path consists of multiple parts, you have to join them separately, since urljoin would replace the relative path, not join it. The easiest way to do that is to use posixpath.
>>> import urllib.parse
>>> import posixpath
>>> url1 = "http://127.0.0.1"
>>> url2 = "test1"
>>> url3 = "test2"
>>> url4 = "test3"
>>> url5 = "test5.xml"
>>> url_path = posixpath.join(url2, url3, url4, url5)
>>> urllib.parse.urljoin(url1, url_path)
'http://127.0.0.1/test1/test2/test3/test5.xml'
See also: How to join components of a path when you are constructing a URL in Python
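import urlparse
es = ['http://127.0.0.1', 'test1', 'test4', 'test6.xml']
base = ''
for e in es:
    # a trailing slash makes urljoin append e instead of replacing the last segment
    base = urlparse.urljoin(base + '/' if base else base, e)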
You can use reduce to achieve Shikhar's method in a cleaner fashion.
>>> import urllib.parse
>>> from functools import reduce
>>> reduce(urllib.parse.urljoin, ["http://moc.com/", "path1/", "path2/", "path3/"])
'http://moc.com/path1/path2/path3/'
Note that with this method each fragment should end with a forward slash and not start with one, which marks it as a path fragment to be joined.
This is also more informative: path1/ is clearly a relative path fragment, whereas /path1/ is an absolute path that replaces the base path, and path1 without a trailing slash is treated as a final segment that the next join replaces.
If you need to add / to a fragment lacking it, you could do:
uri = uri if uri.endswith("/") else f"{uri}/"
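Putting the reduce approach and the trailing-slash fix together, a sketch could look like the following. The helper name join_all is an assumption made for illustration, and stripping leading slashes from each fragment is a design choice of this sketch so that no fragment is treated as an absolute path.

```python
from functools import reduce
from urllib.parse import urljoin


def join_all(base, *fragments):
    """Join path fragments onto base one at a time (hypothetical helper)."""
    def step(acc, fragment):
        # Make sure the accumulated URL ends with "/" so urljoin appends
        # the fragment instead of replacing the last path segment.
        acc = acc if acc.endswith("/") else f"{acc}/"
        return urljoin(acc, fragment.lstrip("/"))

    return reduce(step, fragments, base)


print(join_all("http://127.0.0.1", "test1", "test4", "test6.xml"))
# http://127.0.0.1/test1/test4/test6.xml
```

Because the slash is added just before each join, the final fragment keeps whatever ending it was given, so a file name such as test6.xml does not get a trailing slash.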
To learn more about URI resolution, Wikipedia has some nice examples.
For Python 3.0+, join URLs with urljoin from urllib.parse:
from urllib.parse import urljoin
urljoin('https://10.66.0.200/', '/api/org')
# output : 'https://10.66.0.200/api/org'
>>> from urlparse import urljoin
>>> url1 = "http://www.youtube.com/user/khanacademy"
>>> url2 = "/user/khanacademy"
>>> urljoin(url1, url2)
'http://www.youtube.com/user/khanacademy'
Simple.

Python: Regex help

str = "a\b\c\dsdf\matchthis\erwe.txt"
I want to match the last folder name, "matchthis".
Without using regex, just do:
>>> import os
>>> my_str = "a/b/c/dsdf/matchthis/erwe.txt"
>>> my_dir_path = os.path.dirname(my_str)
>>> my_dir_path
'a/b/c/dsdf/matchthis'
>>> my_dir_name = os.path.basename(my_dir_path)
>>> my_dir_name
'matchthis'
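Better to use os.path.split(path), which follows the separator rules of the platform it runs on (so a backslash path like this one only splits on Windows). You'll have to call it twice to get the final directory:
import os
path_file = r"a\b\c\dsdf\matchthis\erwe.txt"  # raw string: "\b" would otherwise be a backspace
path, file_name = os.path.split(path_file)
path, dir_name = os.path.split(path)  # dir_name is 'matchthis' on Windows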
>>> str = "a\\b\\c\\dsdf\\matchthis\\erwe.txt"
>>> str.split("\\")[-2]
'matchthis'
x = r"a\b\c\d\match\something.txt"  # raw string so "\b" is not read as a backspace
match = x.split('\\')[-2]
>>> import re
>>> print(re.match(r".*\\(.*)\\[^\\]*", r"a\b\c\dsdf\matchthis\erwe.txt").groups())
('matchthis',)
As @chrisaycock and @rafe-kettler pointed out, use x.split('\\') if you can: it is faster, more readable, and more Pythonic. Only use a regex if you really need one.
EDIT:
Actually, os.path is best, since it follows the path conventions of the platform (Unix, Windows, etc.) the code runs on.
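If the path always uses backslashes but the code may run on any operating system, Python 3's pathlib offers PureWindowsPath, which parses Windows-style paths regardless of the host platform. This is a sketch of an alternative to the answers above, not something they propose:

```python
from pathlib import PureWindowsPath

# Raw string so the backslashes are kept literally.
p = PureWindowsPath(r"a\b\c\dsdf\matchthis\erwe.txt")

print(p.name)         # erwe.txt
print(p.parent.name)  # matchthis
```

PureWindowsPath only manipulates the string and never touches the file system, so it behaves the same on Windows, Linux, and macOS.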

Python: Convert those TinyURL (bit.ly, tinyurl, ow.ly) to full URLS

I am just learning Python and am interested in how this can be accomplished. While searching for an answer, I came across this service: http://www.longurlplease.com
For example:
http://bit.ly/rgCbf can be converted to:
http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place
I did some inspecting with Firefox and see that the original url is not in the header.
Enter urllib2, which offers the easiest way of doing this:
>>> import urllib2
>>> fp = urllib2.urlopen('http://bit.ly/rgCbf')
>>> fp.geturl()
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
For reference's sake, however, note that this is also possible with httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection('bit.ly')
>>> conn.request('HEAD', '/rgCbf')
>>> response = conn.getresponse()
>>> response.getheader('location')
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
And with PycURL, although I'm not sure if this is the best way to do it using it:
>>> import pycurl
>>> conn = pycurl.Curl()
>>> conn.setopt(pycurl.URL, "http://bit.ly/rgCbf")
>>> conn.setopt(pycurl.FOLLOWLOCATION, 1)
>>> conn.setopt(pycurl.CUSTOMREQUEST, 'HEAD')
>>> conn.setopt(pycurl.NOBODY, True)
>>> conn.perform()
>>> conn.getinfo(pycurl.EFFECTIVE_URL)
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
