Downloadable link from public box link [closed] - python

I'd like to be able to retrieve a pdf from a public Box link through python, but I'm not quite sure how I can do this. Here's an example of the type of pdf I hope to be able to download:
https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje
I can click the download button or a button that gives me a printable link in my browser, but I haven't been able to find the download link in the page's source html. Is there a way to find this link programmatically? Perhaps through selenium or requests, or even through the Box API?
Thanks a lot for the help!

This is code to get the download link of the pdf:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def get_html(url, timeout=15):
    '''Returns the response object for url.
    Usually html = urlopen(url) is enough, but sometimes it doesn't work,
    in which case the request is retried with a browser-like User-Agent header.
    Instead of urllib.request you can use any other library to fetch the url
    (for example requests), but urllib.request comes with the standard
    Python installation.'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except Exception:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(req, None, timeout)
        except Exception:
            pass
    return html
def get_soup(html):
    '''Returns the soup of the html code.
    Beautiful Soup is a Python library for pulling data out of HTML
    and XML files. It works with your favorite parser to provide idiomatic
    ways of navigating, searching, and modifying the parse tree. It
    commonly saves programmers hours or days of work.
    More at https://www.crummy.com/software/BeautifulSoup/bs4/doc/'''
    soup = BeautifulSoup(html, "lxml")
    ## if "lxml" doesn't work you can use any of these parsers instead:
    ## soup = BeautifulSoup(html, "html.parser")
    ## soup = BeautifulSoup(html, "lxml-xml")
    ## soup = BeautifulSoup(html, "xml")
    ## soup = BeautifulSoup(html, "html5lib")
    return soup
def get_data_file_id(html):
    '''Returns the data-file-id found in the html code.'''
    ## to scrape the page I suggest using BeautifulSoup;
    ## you could also do it manually using html.read(), which gives you the
    ## html code as a string, and then do some string searching
    soup = get_soup(html)
    ## the part of the html code we are interested in is:
    ## <div class="preview" data-module="preview" data-file-id="69950302561" data-file-version-id="">
    ## we want to extract this data-file-id
    ## first we find the div it is located in
    classifier = {"class": 'preview'}   ## classifier specifies the div we are looking for
    div = soup.find('div', classifier)  ## we get the div which has the class 'preview'
    ## now we can easily get data-file-id by using
    data_file_id = div.get('data-file-id')
    return data_file_id

## you can install BeautifulSoup
## on Windows from http://www.lfd.uci.edu/~gohlke/pythonlibs/
## or from https://pypi.python.org/pypi/beautifulsoup4/4.4.1
## the official page is https://www.crummy.com/software/BeautifulSoup/
## if you don't want to use BeautifulSoup then you should do something like this:
##
## html_str = str(html.read())
## search_for = 'div class="preview" data-module="preview" data-file-id="'
## start = html_str.find(search_for) + len(search_for)
## end = html_str.find('"', start)
## data_file_id = html_str[start : end]
##
## it may seem easier to do that than to use BeautifulSoup, but the problem is that
## if there is one more space in search_for, or the order of the div attributes is
## different, or the sign " is used instead of ' (and vice versa), this string
## searching won't work, while BeautifulSoup will, so I recommend using BeautifulSoup
def get_url_id(url):
    '''Returns the url_id, which is the last part of the url.'''
    reverse_url = url[::-1]
    start = len(url) - reverse_url.find('/')  # start is the position just after the last '/' in url
    url_id = url[start:]
    return url_id

def get_download_url(url_id, data_file_id):
    '''Returns the download_url.'''
    start = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name='
    download_url = start + url_id + '&file_id=f_' + data_file_id
    return download_url
url = 'https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje'
url = 'https://fnn.app.box.com/s/n74mnmrwyrmtiooqwppqjkrd1hhf3t3j'  # this second assignment overrides the first example url

html = get_html(url)
data_file_id = get_data_file_id(html)  ## we need data_file_id to create the download url
url_id = get_url_id(url)               ## we need url_id to create the download url
download_url = get_download_url(url_id, data_file_id)
## this actually isn't the real download url;
## you can get the real url by using:
## real_download_url = get_html(download_url).geturl()
## but you will get a really long url; for your example it would be
## https://dl.boxcloud.com/d/1/4vx9ZWYeeQikW0KHUuO4okRjjQv3t6VGFTbMkh7weWQQc_tInOFR_1L_FuqVFovLqiycOLUDHu4o2U5EdZjmwnSmVuByY5DhpjmmdlizjaVjk6RMBbLcVhSt0ewtusDNL5tA8aiUKD1iIDlWCnXHJlcVzBc4aH3BXIEU65Ki1KdfZIlG7_jl8wuwP4MQG_yFw2sLWVDZbddJ50NLo2ElBthxy4EMSJ1auyvAWOp6ai2S4WPdqUDZ04PjOeCxQhvo3ufkt3og6Uw_s6oVVPryPUO3Pb2M4-Li5x9Cki882-WzjWUkBAPJwscVxTbDbu1b8GrR9P-5lv2I_DC4uPPamXb07f3Kp2kSJDVyy9rKbs16ATF3Wi2pOMMszMm0DVSg9SFfC6CCI0ISrkXZjEhWa_HIBuv_ptfQUUdJOMm9RmteDTstW37WgCCjT2Z22eFAfXVsFTOZBiaFVmicVAFkpB7QHyVkrfxdqpCcySEmt-KOxyjQOykx1HiC_WB2-aEFtEkCBHPX8BsG7tm10KRbSwzeGbp5YN1TJLxNlDzYZ1wVIKcD7AeoAzTjq0Brr8du0Vf67laJLuBVcZKBUhFNYM54UuOgL9USQDj8hpl5ew-W__VqYuOnAFOS18KVUTDsLODYcgLMzAylYg5pp-2IF1ipPXlbBOJgwNpYgUY0Bmnl6HaorNaRpmLVQflhs0h6wAXc7DqSNHhSnq5I_YbiQxM3pV8K8IWvpejYy3xKED5PM9HR_Sr1dnO0HtaL5PgfKcuiRCdCJjpk766LO0iNiRSWKHQ9lmdgA-AUHbQMMywLvW71rhIEea_jQ84elZdK1tK19zqPAAJ0sgT7LwdKCsT781sA90R4sRU07H825R5I3O1ygrdD-3pPArMf9bfrYyVmiZfI_yE_XiQ0OMXV9y13daMh65XkwETMAgWYwhs6RoTo3Kaa57hJjFT111lQVhjmLQF9AeqwXb0AB-Hu2AhN7tmvryRm7N2YLu6IMGLipsabJQnmp3mWqULh18gerlve9ZsOj0UyjsfGD4I0I6OhoOILsgI1k0yn8QEaVusHnKgXAtmi_JwXLN2hnP9YP20WjBLJ/download
## and we don't really care about the real download url, so I will just use download_url
print(download_url)
I also wrote code to download that pdf:
from urllib.request import Request, urlopen

def get_html(url, timeout=15):
    '''Returns the response object for url.
    Usually html = urlopen(url) is enough, but sometimes it doesn't work,
    in which case the request is retried with a browser-like User-Agent header.
    Instead of urllib.request you can use any other library to fetch the url
    (for example requests), but urllib.request comes with the standard
    Python installation.'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except Exception:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(req, None, timeout)
        except Exception:
            pass
    return html
def get_current_path():
    '''Returns the path of the folder in which the python program is saved.'''
    try:
        path = __file__
    except NameError:
        try:
            import sys
            path = sys.argv[0]
        except Exception:
            path = ''
    if path:
        if '\\' in path:
            path = path.replace('\\', '/')
        end = len(path) - path[::-1].find('/')
        path = path[:end]
    return path
def check_if_name_already_exists(name, path):
    '''Checks whether a pdf file with the same name already exists
    in the folder given by path.'''
    try:
        file = open(path + name + '.pdf', 'r')
        file.close()
        return True
    except IOError:
        return False
def get_new_name(old_name, path):
    '''Asks the user to enter a new name for the file and returns the entered name.'''
    print('A file with the name "{}" already exists.'.format(old_name))
    answer = input('Would you like to replace it (answer with "r")\nor create a new one (answer with "n")? ')
    while answer not in ('r', 'R', 'n', 'N'):
        print('Your answer is inconclusive.')
        print('Please answer again:')
        print('if you would like to replace the existing file answer with "r"')
        print('if you would like to create a new one answer with "n"')
        answer = input('Would you like to replace it (answer with "r")\nor create a new one (answer with "n")? ')
    if answer in ('n', 'N'):
        new_name = input('Enter a new name for the file: ')
        if check_if_name_already_exists(new_name, path):
            return get_new_name(new_name, path)
        else:
            return new_name
    if answer in ('r', 'R'):
        return old_name
def download_pdf(url, name='document1', path=None):
    '''Downloads a pdf file from its url.
    The required argument is the url of the pdf file;
    optional arguments are name (the name for the saved pdf file) and
    path (if you want to choose where your file is saved).
    The variable path must look like:
    'C:\\Users\\Computer name\\Desktop' or
    'C:/Users/Computer name/Desktop' '''
    # and not like
    # 'C:\Users\Computer name\Desktop'
    pdf = get_html(url)
    name = name.replace('.pdf', '')
    if path is None:
        path = get_current_path()
    if '\\' in path:
        path = path.replace('\\', '/')
    if path and path[-1] != '/':
        path += '/'
    if path:
        check = check_if_name_already_exists(name, path)
        if check:
            if name == 'document1':
                i = 2
                name = 'document' + str(i)
                while check_if_name_already_exists(name, path):
                    i += 1
                    name = 'document' + str(i)
            else:
                name = get_new_name(name, path)
        file = open(path + name + '.pdf', 'wb')
    else:
        file = open(name + '.pdf', 'wb')
    file.write(pdf.read())
    file.close()
    if path:
        print(name + '.pdf file downloaded into folder "{}".'.format(path))
    else:
        print(name + '.pdf file downloaded.')
    return
download_url = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name=n74mnmrwyrmtiooqwppqjkrd1hhf3t3j&file_id=f_53868474893'
download_pdf(download_url)
Hope it helps, let me know if it works.
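Since the question also mentions the Box API: as a rough sketch (not something I have run), the official boxsdk package can resolve a public shared link directly, assuming you have created a Box app and have an access/developer token. The output filename 'shared_file.pdf' below is just an example:

from boxsdk import OAuth2, Client

## credentials for a Box app you created; the token values here are placeholders,
## you need to supply your own
oauth = OAuth2(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    access_token='YOUR_DEVELOPER_TOKEN',
)
client = Client(oauth)

shared_url = 'https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje'
shared_item = client.get_shared_item(shared_url)  ## resolves the public shared link to a File object

with open('shared_file.pdf', 'wb') as f:
    shared_item.download_to(f)  ## streams the file content into f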

Related

Scraping specific pdfs from different websites

First question here. I need to download a specific pdf from every url. I just need the pdf of the European Commission proposal from each url that I have, which is always in a specific part of the page.
[Screenshot: the part of the page I always need in pdf form, captioned "The European Commission proposal".]
And here is its html code. The part that interests me is
"http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf", which is the pdf that I need, as you can see from the image:
[<a class="externalDocument" href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="externalDocument">COM(2020)0791</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
<span class="ep_name">
COM(2020)0791
</span>
<span class="ep_icon"> </span>
</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
<span class="ep_name">
COM(2020)0791
</span>
<span class="ep_icon"> </span>
</a>]
I used the following code for the task: it takes every url from my csv file and goes into each page to download every pdf. The problem is that with this approach it also downloads other pdfs which I do not need. It is fine if it downloads them, but I need to distinguish them from the ones that come from the section I care about, which is why I am asking here how to download all the pdfs from just one specific subsection. If it were possible to distinguish them by section in the filename, that would also be fine. For now this code gives me back about 3000 pdfs; I need around 1400, one for each link. If the filename keeps the name of the link, that would also make things easier for me, but it is not my main worry, since the files are downloaded in the order they are read from the csv file and it will be easy to tidy them afterwards.
In short, this code needs to become code which downloads only from one part of the site, instead of all of it:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#import pandas
#data = pandas.read_csv('urls.csv')
#urls = data['urls'].tolist()
urls = ["http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350",
        "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2012/0299",
        "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"]
#url = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"

folder_location = r'C:\Users\myname\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='EN.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
For example, I did not want to download this file:
follow up document
which is a follow-up document that starts with COM and ends with EN.pdf but has a different date (in this case 2018) because it is a follow-up, as you can see from the link:
https://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2018/0564/COM_COM(2018)0564_EN.pdf
The links in your html file all seem to be to the same pdf [or at least they have the same filename], so it'll just be downloading and over-writing the same document. Still, if you just want to target only the first of those links, you could include the class externalDocument in your selector.
for link in soup.select('a.externalDocument[href$="EN.pdf"]'):
If you want to target a specific event like 'Legislative proposal published', then you could do something like this:
# urls....os.mkdir(folder_location)  # setup unchanged from your original code

evtName = 'Legislative proposal published'

tdSel, spSel, aSel = 'div.ep-table-cell', 'span.ep_name', 'a[href$="EN.pdf"]'
dlSel = f'{tdSel}+{tdSel}+{tdSel} {spSel}>{aSel}'
trSel = f'div.ep-table-row:has(>{dlSel}):has(>{tdSel}+{tdSel} {spSel})'

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    pgPdfLinks = [
        tr.select_one(dlSel).get('href') for tr in soup.select(trSel) if
        evtName.strip().lower() in
        tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text().strip().lower()
        ## if you want a [case-sensitive] exact match, change the condition to
        # tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text() == evtName
    ]
    for link in pgPdfLinks[:1]:
        filename = os.path.join(folder_location, link.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)
[The [:1] of pgPdfLinks[:1] is probably unnecessary since more than one match isn't expected, but it's there if you want to absolutely ensure only one download per page.]
Note: you need to be sure that there will be an event named evtName with a link matching aSel (a[href$="EN.pdf"] in this case) - otherwise, no PDF links will be found and nothing will be downloaded for those pages.
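If you want to spot those pages, a small addition (not part of the original answer) is to print a warning when nothing matched; it would go inside the for url in urls: loop, right after pgPdfLinks is built:

    # flag pages where no matching event/PDF link was found
    if not pgPdfLinks:
        print(f'WARNING: no "{evtName}" PDF link found on {url}')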
if it keeps the name of the link it could be also easier for me
It's already doing that in your code, since there doesn't seem to be much difference between link['href'].split('/')[-1] and link.get_text().strip(); but if you meant that you want the page link [i.e. the url], you could include the procnum (since that seems to be an identifying part of the url) in your filename:
# for link in ...
    procnum = url.replace('?', '&').split('&procnum=')[-1].split('&')[0]
    procnum = ''.join(c if (
        c.isalpha() or c.isdigit() or c in '_-[]'
    ) else ('_' if c == '/' else '') for c in procnum)

    filename = f"proc-{procnum} {link.split('/')[-1]}"
    # filename = f"proc-{procnum} {link['href'].split('/')[-1]}"  # in your current code

    filename = os.path.join(folder_location, filename)
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link)).content)
        # f.write(requests.get(urljoin(url, link['href'])).content)  # in your current code
So, [for example] instead of saving to "COM_COM(2020)0791_EN.pdf", it will save to "proc-OLP_2020_0350 COM_COM(2020)0791_EN.pdf".
I tried to solve this by adding steps so that the code also checks what year the pdf comes from and adds it to the name. The code is below, and it is an improvement; however, the answer above by Driftr95 is much better than mine, so anyone who wants to replicate this should use his code.
import requests
import pandas
import os
from urllib.parse import urljoin
from bs4 import BeautifulSoup

data = pandas.read_csv('urls.csv')
urls = data['url'].tolist()
years = data["yearstr"].tolist()
numbers = data["number"].tolist()

folder_location = r'C:\Users\dario.marino5\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url, year, number in zip(urls, years, numbers):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        if year in link['href']:
            # Construct the filename with the number from the CSV file
            filename = f'document_{year}_{number}.pdf'
            filename = os.path.join(folder_location, filename)
            # Download the PDF file and save it to the filename
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(url, link['href'])).content)

Extract URLs from PDF - text doesn't match URL

I'm using the following code to extract URLs from a PDF, and it works fine to extract the anchor, but it does not work when the anchor text is different from the URL behind it.
For example: 'www.page.com/A' is used as a short url in the text but the actual URL behind it is a longer (full) version.
The code I'm using is:
import urllib.request
import PyPDF2

urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)
key = "/Annots"
uri = "/URI"
ank = "/A"
mylist = []

for page_no in range(pdfFile.numPages):
    page = pdfFile.getPage(page_no)
    text = page.extractText()
    pageObject = page.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                pass
As I said, it works ok if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just link).
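One possible way to capture both the visible anchor text and the actual target URL is to read each link's rectangle and extract the text underneath it. This is only a sketch, not part of the question's PyPDF2 code, and it assumes switching to PyMuPDF (the fitz module) is acceptable:

import fitz  # PyMuPDF

doc = fitz.open("remoteFile")          # the file downloaded above
pairs = []
for page in doc:
    for lnk in page.get_links():       # one dict per link annotation
        target = lnk.get("uri")        # the actual URL behind the link
        if not target:
            continue                   # skip internal links without a URI
        anchor = page.get_textbox(lnk["from"]).strip()  # text inside the link rectangle
        pairs.append((anchor, target))

for anchor, target in pairs:
    print(anchor, "->", target)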

Can you use requests.get() without HTTPS/HTTP

I have a code that takes a random flag from the Flag Mashup Bot and downloads it:
import requests

DIR = 'C:/Users/myUser/Desktop/Flags/'
URL = 'https://flagsmashupbot.pythonanywhere.com/mashup?passwd=fl4gsm4shupb0t'

def download_image(img_url: str, dest_dir: str):
    img_data = requests.get(img_url).content
    with open(dest_dir, 'wb') as file:
        file.write(img_data)

if __name__ == "__main__":
    response = requests.get(URL)
    if response.ok:
        page = response.text
        image_url = page[page.find('data:image', page.find('data:image') + 1):page.find('" download=')]
        name = page[page.find('" download=') + 12:page.find('_FlagsMashupBot.png"')]
        DIR += (name + '.png')
        print(DIR)
        download_image(image_url, DIR)
When I run it, I get the following error on line 8:
requests.exceptions.InvalidSchema: No connection adapters were found for [image URL]
When I read about it, I realized that it's because the image URLs from the site don't start with "https://" (or at least that's what I understood).
So, is there a way to use requests.get() without having https at the start of the URL?
The reason you are not getting an HTTP/HTTPS based URL is that the href points to a base64-encoded data: version of the image.
You may use urllib to open up the href download link and save the contents to a file:
data = 'data:image/png;charset=utf-8;base64,iVBORw0KGgoAAAANSUhEUgAABwgAAASwCAIAAABggIlUAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nOzdaZhd92Hf97Pde+fOYGYAzAx2gCBBivsCgqtEUl7piF.......'
import urllib.request

response = urllib.request.urlopen(data)
with open('image.png', 'wb') as f:
    f.write(response.file.read())
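Alternatively (not part of the original answer), since the href is just a base64 payload, you could decode it directly with the standard base64 module; data below is the same href string as above:

import base64

header, b64data = data.split(',', 1)   # strip the "data:image/png;...;base64," prefix
with open('image.png', 'wb') as f:
    f.write(base64.b64decode(b64data))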

Python - Downloading images using Wget. How to add a string to each file?

I'm using the following Python code to download images from a certain website. It's part of a code that I'm using to make a web scraper.
for url in links:
    # Invoke wget download method to download specified url image.
    local_image_filename = wget.download(url)
    # Print out local image file name.
    local_image_filename
    continue
It's working well, but I want to know if it's possible to add a string as a prefix to each file.
My idea is to get the page title via XPath and add it as a prefix to each downloaded file.
I don't know where to add such a string in this code. Can someone help me?
For example, I'm downloading these files:
logo.jpg, plans.jpg, circle.jpg
And I need to add a prefix, like these:
Beautiful_Plan_logo.jpg, Beautiful_Plan_plans.jpg, Beautiful_Plan_circle.jpg
Below is the entire code:
import requests
import bs4 as bs
import urllib.request
import wget

##################################################
#              getting url images                #
##################################################
url = "https://tyreehouseplans.com/shop/house-plans/blackberry-blossom/"

opener = urllib.request.build_opener()
opener.add_headers = [{'User-Agent': 'Mozilla'}]
urllib.request.install_opener(opener)

raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all('img')

links = []
for img in imgs:
    link = img.get('src')
    links.append(link)
print(links)

################################################
#             downloading images               #
################################################
for url in links:
    # Invoke wget download method to download specified url image.
    local_image_filename = wget.download(url)
    # Print out local image file name.
    local_image_filename
    continue
Thank you for any help!
The Python module wget has an option out, which determines the name of the output file. For example, the following script downloads 3 images, adding the prefix Beautiful_Plan_.
import wget

base_url = 'https://homepages.cae.wisc.edu/~ece533/images/'
image_names = ['airplane.png', 'arctichare.png', 'baboon.png']
prefix = 'Beautiful_Plan_'

for image_name in image_names:
    wget.download(base_url + image_name, out=prefix + image_name)
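Applied to the loop from the question, that could look something like the sketch below; page_title is an assumption standing in for whatever title you scrape via XPath or BeautifulSoup:

page_title = soup.title.get_text(strip=True).replace(' ', '_')  # assumed to come from your scraping step
for url in links:
    # name each download as <page title>_<original file name>
    local_image_filename = wget.download(url, out=page_title + '_' + url.split('/')[-1])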
you can use shutil for this
import shutil

prefix = "prefix_"
# your piece of code
for url in links:
    # Invoke wget download method to download specified url image.
    local_image_filename = wget.download(url)
    # Print out local image file name.
    local_image_filename
    shutil.copy(local_image_filename, prefix + local_image_filename)
use os.rename as per this documentation
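A minimal sketch of that suggestion, reusing links and wget from the question's code (the prefix value is just an example):

import os
import wget

prefix = 'Beautiful_Plan_'
for url in links:
    local_image_filename = wget.download(url)
    # rename the freshly downloaded file in place instead of copying it
    os.rename(local_image_filename, prefix + local_image_filename)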
I wrote code that makes a separate file with the extra information up front, joined with a separator.
import requests
import bs4 as bs
import urllib.request
import wget

##################################################
#              getting url images                #
##################################################
url = "https://tyreehouseplans.com/shop/house-plans/blackberry-blossom/"

opener = urllib.request.build_opener()
opener.add_headers = [{'User-Agent': 'Mozilla'}]
urllib.request.install_opener(opener)

raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all('img')

links = []
for img in imgs:
    link = img.get('src')
    links.append(link)
# print(links)

################################################
#             downloading images               #
################################################
for url in links:
    # Invoke wget download method to download specified url image.
    try:
        local_image_filename = wget.download(url)
    except ValueError:
        break
    # Print out local image file name.
    print(local_image_filename)

    with open(local_image_filename, 'r') as myFile:
        try:
            data = myFile.read()
        except UnicodeDecodeError:
            data = "UNICODE DECODE ERROR"
        except ValueError:
            data = "VALUE ERROR"

    print(data)
    print(type(data))

    newSaveString = str(local_image_filename) + "SeperatorOfSomeKind" + str(data)
    newFileName = "NEW_" + local_image_filename
    with open(newFileName, 'w') as myFile:
        myFile.write(newSaveString)
    continue

Python mechanize download file with original filename

I'm working on a python script to scrape multiple files from a website.
The download form html is something like this:
<span>
<a class="tooltip" href="download.php?action=download&id=xxx&authkey=yyy&pass=zzz" title="Download">DL</a>
</span>
What I'm thinking of is:
f1 = open('scraping.log', 'a')

page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
for a in soup.select('a[href^="download.php?action=download"]'):
    link = a.attrs.get('href')
    print >>f1, link
    br.retrieve(url+link, destination)
However, for retrieve I have to specify the output filename. I want to keep the original filename instead of setting a random name. Is there any way to do that?
Moreover, as I want to run this script frequently from crontab, is there a way to check scraping.log and skip the files that have been downloaded before?
If you don't like "download.php", check for a Content-Disposition header, e.g.:
Content-Disposition: attachment; filename="fname.ext"
And ensure the filename complies with your intent:
It is important that the receiving MUA not blindly use the
suggested filename. The suggested filename SHOULD be checked (and
possibly changed) to see that it conforms to local filesystem
conventions, does not overwrite an existing file, and does not
present a security problem (see Security Considerations below).
Python 2:
import re
import mechanize # pip install mechanize
br = mechanize.Browser()
r = br.open('http://yoursite.com')
#print r.info()['Content-Disposition']
unsafe_filename = r.info().getparam('filename') # Could be "/etc/evil".
filename = re.findall("([a-zA-Z0-9 _,()'-]+[.][a-z0-9]+)$", unsafe_filename)[0] # "-]" to match "-".
As for skipping links you've processed before,
f1 = open('scraping.log', 'a+')  # 'a+' so the log can be read as well as appended to
f1.seek(0)
processed_links = [line.strip() for line in f1.readlines()]

page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
for a in soup.select('a[href^="download.php?action=download"]'):
    link = a.attrs.get('href')
    if link not in processed_links:
        print >>f1, link
        processed_links += [link]
        br.retrieve(url+link, destination)
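For reference (not part of the original answer), a rough Python 3 sketch of the same Content-Disposition idea using requests, reusing the url and link variables from the snippet above, might look like this:

import re
import requests

r = requests.get(url + link)
cd = r.headers.get('Content-Disposition', '')
match = re.search(r'filename="?([^";]+)"?', cd)
filename = match.group(1) if match else 'download.bin'  # fall back to a generic name
with open(filename, 'wb') as f:
    f.write(r.content)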
