Strip Some Part Of URL And Save File - python

http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2
I don't need the last /ref=zg_bsms_shoes_2
I have the URLs collected in self.urls = []:
for productlink in products:
    self.urls.append(productlink)

def save(self):
    self.br.quit()
    f = open(self.product_file, "w")
    for url in self.urls:
        f.write(url + "\n")
    f.flush()
How do I strip it? And with a failsafe for when the URL has no /ref= part?

I'd strongly encourage you to start with urlparse:
In Python 3:
>>> import os
>>> from urllib.parse import urlparse
>>> url = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
urlparse breaks the URL into its component pieces, and you can then work with the path however you like: simple string splitting, os.path.split, a regex, whatever.
In Python 2, the import is from urlparse import urlparse instead.

>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'

if 'ref' in url.split('/')[-1]:  # failsafe: only strip when a ref segment is present
    url = '/'.join(url.split('/')[:-1])
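Combining the two ideas above, here is a small helper (a sketch; strip_ref is a hypothetical name) that drops a trailing /ref=... path segment only when one is present, and leaves other URLs untouched:

```python
from urllib.parse import urlparse, urlunparse
import posixpath

def strip_ref(url):
    """Drop a trailing /ref=... segment from an Amazon-style URL, if present."""
    parsed = urlparse(url)
    head, tail = posixpath.split(parsed.path)
    if tail.startswith('ref='):
        # Rebuild the URL with the shortened path; everything else unchanged.
        parsed = parsed._replace(path=head)
    return urlunparse(parsed)

print(strip_ref('http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'))
# A URL without a /ref= segment comes back unchanged.
print(strip_ref('http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'))
```

Because it checks the last path segment rather than blindly chopping, it doubles as the failsafe the question asks for.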


Updating Query In URL With urllib In Python

I have a url that is being parsed out of an XML file.
product_url = urlparse(item.find('product_url').text)
When I use urllib to break the URL up I get this:
ParseResult(scheme='http', netloc='example.com', path='/dynamic', params='', query='t=MD5-YOUR-OAUTH-TOKEN&p=11111111', fragment='')
I need to update the MD5-YOUR-OAUTH-TOKEN part of the query with an MD5-hashed OAuth key, which I have in tokenHashed = encryptMd5Hash(token).
My goal is that after the URL is parsed and the hash has been substituted for MD5-YOUR-OAUTH-TOKEN, I have the whole URL in a string I can use somewhere else. Originally I tried to do this with a regex, but then found urllib; I can't find anywhere that documents how to do something like this.
Am I right to be using urllib for this? How do I update the URL with the hashed token and get the whole URL back as a string?
So the string should look like this,
newString = 'http://example.com/dynamic?t='+tokenHashed+'&p=11112311312'
You'll first want to use the parse_qs function to parse the query string into a dictionary:
>>> import urlparse
>>> import urllib
>>> url = 'http://example.com/dynamic?t=MD5-YOUR-OAUTH-TOKEN&p=11111111'
>>> parsed = urlparse.urlparse(url)
>>> parsed
ParseResult(scheme='http', netloc='example.com', path='/dynamic', params='', query='t=MD5-YOUR-OAUTH-TOKEN&p=11111111', fragment='')
>>> qs = urlparse.parse_qs(parsed.query)
>>> qs
{'p': ['11111111'], 't': ['MD5-YOUR-OAUTH-TOKEN']}
>>>
Now you can modify the dictionary as desired:
>>> qs['t'] = ['tokenHashed']
Note that because parse_qs returns a list for each query parameter, we replace each value with a list as well; we'll call urlencode with doseq=1 next so those lists are handled correctly.
Next, rebuild the query string:
>>> newqs = urllib.urlencode(qs, doseq=1)
>>> newqs
'p=11111111&t=tokenHashed'
And then reassemble the URL:
>>> newurl = urlparse.urlunparse(
... [newqs if i == 4 else x for i,x in enumerate(parsed)])
>>> newurl
'http://example.com/dynamic?p=11111111&t=tokenHashed'
That list comprehension there is just using all the values from
parsed except for item 4, which we are replacing with our new query
string.
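On Python 3 the whole flow lives in urllib.parse; here is a compact sketch of the same steps (the replacement value 'tokenHashed' is a stand-in for your real hashed token):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

url = 'http://example.com/dynamic?t=MD5-YOUR-OAUTH-TOKEN&p=11111111'
parsed = urlparse(url)

# Parse the query string into a dict of lists.
qs = parse_qs(parsed.query)
qs['t'] = ['tokenHashed']  # swap in your real hashed token here

# Rebuild the query string and reassemble the URL.
newurl = urlunparse(parsed._replace(query=urlencode(qs, doseq=True)))
print(newurl)
```

ParseResult is a namedtuple, so _replace gives a copy with just the query swapped out, which is a bit tidier than the enumerate trick above.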

BeautifulSoup: take the URL of imported modules

Is it possible to extract, from HTML code, the URL values in lines like these:
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
Using Python's re module:
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
... print i
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re
for style_tag in soup.find_all('style', text=re.compile(r'#import\s+url')):
    style_text = style_tag.get_text()
    urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
    print urls
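For reference, a Python 3 variant of the regex approach, run against a made-up inline sample of such lines:

```python
import re

# A made-up sample of the kind of lines shown in the question.
html = '''#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");'''

# Capture the quoted path inside each import url("...") statement.
urls = re.findall(r'#import\s+url\("([^"]+)"\)', html)
print(urls)
```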

Python split url to find image name and extension

I am looking for a way to extract a filename and extension from a particular url using Python
Let's say a URL looks as follows:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
How would I go about getting the following.
filename = "da4ca3509a7b11e19e4a12313813ffc0_7"
file_ext = ".jpg"
try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse
from os.path import splitext, basename

picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
disassembled = urlparse(picture_page)
filename, file_ext = splitext(basename(disassembled.path))
Note that os.path.basename strips the directory part of the path, so the filename comes back without a preceding /.
Try urlparse.urlsplit to split the URL, then os.path.splitext to retrieve the filename and extension (and os.path.basename to keep only the last path component):
import urlparse
import os.path
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
print os.path.splitext(os.path.basename(urlparse.urlsplit(picture_page).path))
# ('da4ca3509a7b11e19e4a12313813ffc0_7', '.jpg')
filename = picture_page.split('/')[-1].split('.')[0]
file_ext = '.'+picture_page.split('.')[-1]
# Here's your link:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
#Here's your filename and ext:
filename, ext = (picture_page.split('/')[-1].split('.'))
When you do picture_page.split('/'), it will return a list of strings from your url split by a /.
If you know python list indexing well, you'd know that -1 will give you the last element or the first element from the end of the list.
In your case, it will be the filename: da4ca3509a7b11e19e4a12313813ffc0_7.jpg
Splitting that by the delimiter ., you get two values:
da4ca3509a7b11e19e4a12313813ffc0_7 and jpg, as expected, because they are separated by a period, which you used as the delimiter in your split() call.
Since that last split returns two values in the resulting list, you can unpack them into a tuple.
Hence, basically, the result would be like:
filename,ext = ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
os.path.splitext will help you extract the filename and extension once you have extracted the relevant string from the URL using urlparse:
fName, ext = os.path.splitext('yourImage.jpg')
This is an easy way to find the image name and extension using a regular expression.
import re
import sys
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
regex = re.compile('(.*\/(?P<name>\w+)\.(?P<ext>\w+))')
print regex.search(picture_page).group('name')
print regex.search(picture_page).group('ext')
>>> import re
>>> s = 'picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"'
>>> re.findall(r'\/([a-zA-Z0-9_]*)\.[a-zA-Z]*\"$',s)[0]
'da4ca3509a7b11e19e4a12313813ffc0_7'
>>> re.findall(r'([a-zA-Z]*)\"$',s)[0]
'jpg'
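On Python 3, pathlib offers yet another way to do the same split; a sketch (this approach is not from the answers above):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"

# Treat the URL path as a POSIX path; .stem and .suffix do the split for us.
path = PurePosixPath(urlparse(picture_page).path)
filename = path.stem
file_ext = path.suffix
print(filename, file_ext)
```

PurePosixPath never touches the filesystem, so it is safe to use on URL paths regardless of the operating system.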

How to join absolute and relative urls?

I have two urls:
url1 = "http://127.0.0.1/test1/test2/test3/test5.xml"
url2 = "../../test4/test6.xml"
How can I get an absolute url for url2?
You should use urlparse.urljoin:
>>> import urlparse
>>> urlparse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
With Python 3 (where urlparse is renamed to urllib.parse) you could use it as follows:
>>> import urllib.parse
>>> urllib.parse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
If your relative path consists of multiple parts, you have to join them separately, since urljoin would replace the relative path, not join it. The easiest way to do that is to use posixpath.
>>> import urllib.parse
>>> import posixpath
>>> url1 = "http://127.0.0.1"
>>> url2 = "test1"
>>> url3 = "test2"
>>> url4 = "test3"
>>> url5 = "test5.xml"
>>> url_path = posixpath.join(url2, url3, url4, url5)
>>> urllib.parse.urljoin(url1, url_path)
'http://127.0.0.1/test1/test2/test3/test5.xml'
See also: How to join components of a path when you are constructing a URL in Python
import urlparse

es = ['http://127.0.0.1/', 'test1/', 'test4/', 'test6.xml']
base = ''
for e in es:
    base = urlparse.urljoin(base, e)
# base is now 'http://127.0.0.1/test1/test4/test6.xml'
You can use reduce to achieve Shikhar's method in a cleaner fashion.
>>> import urllib.parse
>>> from functools import reduce
>>> reduce(urllib.parse.urljoin, ["http://moc.com/", "path1/", "path2/", "path3/"])
'http://moc.com/path1/path2/path3/'
Note that with this method each fragment should have trailing forward-slash, with no leading forward-slash, to indicate it is a path fragment being joined.
This is more correct/informative, telling you that path1/ is a URI path fragment, and not the full path (e.g. /path1/) or an unknown (e.g. path1). An unknown could be either, but it is handled as a full path.
If you need to add / to a fragment lacking it, you could do:
uri = uri if uri.endswith("/") else f"{uri}/"
To learn more about URI resolution, Wikipedia has some nice examples.
Updates
Just noticed Peter Perron commented about reduce on Shikhar's answer, but I'll leave this here then to demonstrate how that's done.
Updated wikipedia URL
For Python 3.0+ the correct way to join URLs is:
from urllib.parse import urljoin
urljoin('https://10.66.0.200/', '/api/org')
# output : 'https://10.66.0.200/api/org'
>>> from urlparse import urljoin
>>> url1 = "http://www.youtube.com/user/khanacademy"
>>> url2 = "/user/khanacademy"
>>> urljoin(url1, url2)
'http://www.youtube.com/user/khanacademy'
Simple.
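Note that a leading slash changes the result: an absolute path replaces the base's whole path, while a relative one is resolved against the base's directory. A quick sketch of both behaviours (the paths here are made up):

```python
from urllib.parse import urljoin

base = "http://www.youtube.com/user/khanacademy"

# An absolute path (leading /) replaces the entire path component...
print(urljoin(base, "/feeds/videos"))   # http://www.youtube.com/feeds/videos

# ...while a relative path is resolved against the base's directory.
print(urljoin(base, "videos"))          # http://www.youtube.com/user/videos
```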

If I have this string in Python, how do I decode it?

s = 'Tara%2520Stiles%2520Living'
How do I turn it into:
Tara Stiles Living
You need to use urllib.unquote, but it appears you need to use it twice:
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> urllib.unquote(urllib.unquote(s))
'Tara Stiles Living'
After unquoting once, your "%2520" turns into "%20", which unquoting again turns into " " (a space).
Use:
urllib.unquote(string)
http://docs.python.org/library/urllib.html
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> t=urllib.unquote_plus(s)
>>> print t
Tara%20Stiles%20Living
>>> urllib.unquote_plus(t)
'Tara Stiles Living'
>>>
import urllib
s = 'Tara%2520Stiles%2520Living'
t = ''
# unquote repeatedly until the string stops changing
while s != t:
    s, t = urllib.unquote(s), s
If you are using Python 3 you should use urllib.parse.unquote(url), as in the following code:
import urllib.parse
url = "http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E"
print(url)
print(urllib.parse.unquote(url))
This code will output the following:
>>> print(url)
http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E
>>> print(urllib.parse.unquote(url))
http://url-with-quoted-char:<span> </span>
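On Python 3, the decode-until-stable idea from the loop above can be wrapped in a helper (fully_unquote is a hypothetical name), which handles any depth of double or triple quoting:

```python
from urllib.parse import unquote

def fully_unquote(s):
    """Unquote repeatedly until the string stops changing."""
    prev = None
    while s != prev:
        prev, s = s, unquote(s)
    return s

print(fully_unquote('Tara%2520Stiles%2520Living'))  # Tara Stiles Living
```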
