I'm using the following code to extract URLs from a PDF. It works fine for extracting the anchor, but it does not work when the anchor text is different from the URL behind it.
For example, 'www.page.com/A' is used as a short URL in the text, but the actual URL behind it is a longer (full) version.
The code I'm using is:
import urllib.request
import PyPDF2

# url holds the address of the remote PDF
urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)

key = "/Annots"
uri = "/URI"
ank = "/A"
mylist = []

for page_no in range(pdfFile.numPages):
    page = pdfFile.getPage(page_no)
    text = page.extractText()
    pageObject = page.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                pass
As I said, it works OK if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just the link).
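For reference, here is a minimal sketch (my own, reusing the same older PyPDF2 PdfFileReader API as the code above) that records, for every link annotation, the page number, the full target URL from /URI, and the annotation rectangle. PyPDF2 does not map the rectangle back to the visible anchor text for you, but the /URI value is always the real (long) link rather than the shortened display text:
import PyPDF2

pdf = PyPDF2.PdfFileReader("remoteFile", strict=False)
links = []
for page_no in range(pdf.numPages):
    page = pdf.getPage(page_no).getObject()
    if "/Annots" not in page:
        continue
    for annot in page["/Annots"]:
        annot = annot.getObject()
        if "/A" in annot and "/URI" in annot["/A"]:
            links.append({
                "page": page_no,              # page the link sits on
                "url": annot["/A"]["/URI"],   # the full target URL behind the anchor
                "rect": annot.get("/Rect"),   # bounding box of the clickable area
            })
print(links)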
Related
First question here. I need to download a specific PDF from every URL: just the PDF of the European Commission proposal from each URL that I have, which always sits in the same part of the page.
[Here the part from the website that I would always need in pdf form].
The European Commission proposal
And here is its HTML code. The part that interests me is "http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf", which is the PDF I need, as you can see from the image:
[<a class="externalDocument" href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="externalDocument">COM(2020)0791</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
<span class="ep_name">
COM(2020)0791
</span>
<span class="ep_icon"> </span>
</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
<span class="ep_name">
COM(2020)0791
</span>
<span class="ep_icon"> </span>
</a>]
I used the code below for the task: it takes every URL from my csv file and goes to each page to download every PDF. The problem is that with this approach it also grabs other PDFs that I do not need. It is fine if it downloads them, but I need to be able to tell them apart by the section of the page they came from, which is why I am asking how to download the PDFs from just one specific subsection. If it is possible to distinguish them by section in the filename, that would also work. Right now this code gives me back 3000 PDFs; I need around 1400, one for each link. If it kept the name of the link that would also make things easier, but it is not my main worry, since the files are saved in the order they are read from the csv file and it will be easy to tidy them up afterwards.
In short, this code needs to become one that downloads only from one part of the site instead of all of it:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#import pandas
#data = pandas.read_csv('urls.csv')
#urls = data['urls'].tolist()
urls = ["http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350",
        "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2012/0299",
        "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"]
#url="http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"

folder_location = r'C:\Users\myname\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='EN.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
For example, I did not want to download this file:
follow up document
It is a follow-up document that also starts with COM and ends with EN.pdf, but it has a different date (in this case 2018) because it is a follow-up, as you can see from the link:
https://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2018/0564/COM_COM(2018)0564_EN.pdf
The links in your html file all seem to be to the same pdf [or at least they have the same filename], so it'll just be downloading and over-writing the same document. Still, if you just want to target only the first of those links, you could include the class externalDocument in your selector.
for link in soup.select('a.externalDocument[href$="EN.pdf"]'):
If you want to target a specific event like 'Legislative proposal published', then you could do something like this:
# urls....os.mkdir(folder_location)

evtName = 'Legislative proposal published'

tdSel, spSel, aSel = 'div.ep-table-cell', 'span.ep_name', 'a[href$="EN.pdf"]'
dlSel = f'{tdSel}+{tdSel}+{tdSel} {spSel}>{aSel}'
trSel = f'div.ep-table-row:has(>{dlSel}):has(>{tdSel}+{tdSel} {spSel})'

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    pgPdfLinks = [
        tr.select_one(dlSel).get('href') for tr in soup.select(trSel) if
        evtName.strip().lower() in
        tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text().strip().lower()
        ## if you want [case sensitive] exact match, change condition to
        # tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text() == evtName
    ]
    for link in pgPdfLinks[:1]:
        filename = os.path.join(folder_location, link.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)
[The [:1] of pgPdfLinks[:1] is probably unnecessary since more than one match isn't expected, but it's there if you want to absolutely ensure only one download per page.]
Note: you need to be sure that there will be an event named evtName with a link matching aSel (a[href$="EN.pdf"] in this case) - otherwise, no PDF links will be found and nothing will be downloaded for those pages.
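If you want such misses to be visible rather than silently skipped, one small optional addition (my own, not part of the answer above) inside the for url loop, right after pgPdfLinks is built, would be:
    # hypothetical fallback: report pages where no matching event row / PDF link was found
    if not pgPdfLinks:
        print(f"no '{evtName}' PDF link found on {url}")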
if it keeps the name of the link it could be also easier for me
It's already doing that in your code, since there doesn't seem to be much difference between link['href'].split('/')[-1] and link.get_text().strip(), but if you meant that you want the page link [i.e. the url], you could include the procnum (since that seems to be an identifying part of url) in your filename:
# for link in...
        procnum = url.replace('?', '&').split('&procnum=')[-1].split('&')[0]
        procnum = ''.join(c if (
            c.isalpha() or c.isdigit() or c in '_-[]'
        ) else ('_' if c == '/' else '') for c in procnum)

        filename = f"proc-{procnum} {link.split('/')[-1]}"
        # filename = f"proc-{procnum} {link['href'].split('/')[-1]}" # in your current code

        filename = os.path.join(folder_location, filename)
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)
            # f.write(requests.get(urljoin(url, link['href'])).content) # in your current code
So, [for example] instead of saving to "COM_COM(2020)0791_EN.pdf", it will save to "proc-OLP_2020_0350 COM_COM(2020)0791_EN.pdf".
I tried to solve this by adding extra steps so that the code also checks which year the PDF comes from and adds it to the name. The code is below, and it is an improvement; however, the answer above by Driftr95 is much better than mine, so if someone wants to replicate this they should use his code.
import requests
import pandas
import os
from urllib.parse import urljoin
from bs4 import BeautifulSoup

data = pandas.read_csv('urls.csv')
urls = data['url'].tolist()
years = data["yearstr"].tolist()
numbers = data["number"].tolist()

folder_location = r'C:\Users\dario.marino5\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url, year, number in zip(urls, years, numbers):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        if year in link['href']:
            # Construct the filename with the number from the CSV file
            filename = f'document_{year}_{number}.pdf'
            filename = os.path.join(folder_location, filename)
            # Download the PDF file and save it to the filename
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(url, link['href'])).content)
I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python; as a matter of fact, this is my first script. I tried to simply write url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. I then tried to apply read() and readlines() to it, but that failed because a 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each starting on a new line, as the input of my simple script? Should I convert the string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I built some functions to display the info in the desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which the other methods are then applied to extract their content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests
with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing new line
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
This could help you:
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    print(url)
url_file.close()
You can apply it to your code as follows:
from newspaper import Article
from newspaper import fulltext
import requests

url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    url = url.strip()  # remove the trailing newline before using the URL
    a = Article(url, language='pt')
    html = requests.get(url).text
    text = fulltext(html)
    download = a.download()
    parse = a.parse()
    nlp = a.nlp()
    title = a.title
    publish_date = a.publish_date
    authors = a.authors
    keywords = a.keywords
    summary = a.summary
url_file.close()
How do I change the hyperlinks in a PDF using Python? I am currently using PyPDF2 to open the file and loop through the pages. How do I actually scan for hyperlinks and then change them?
So I couldn't get what you want using the pyPDF2 library.
I did however get something working with another library: pdfrw. This installed fine for me using pip in Python 3.6:
pip install pdfrw
Note: for the following I have been using this example pdf I found online which contains multiple links. Your mileage may vary with this.
import pdfrw

pdf = pdfrw.PdfReader("pdf.pdf")  # Load the pdf
new_pdf = pdfrw.PdfWriter()       # Create an empty pdf

for page in pdf.pages:            # Go through the pages
    # Links are in Annots, but some pages don't have links so Annots returns None
    for annot in page.Annots or []:
        old_url = annot.A.URI

        # >Here you put logic for replacing the URLs<

        # Use the PdfString object to do the encoding for us
        # Note the brackets around the URL here
        new_url = pdfrw.objects.pdfstring.PdfString("(http://www.google.com)")

        # Override the URL with ours
        annot.A.URI = new_url

    new_pdf.addpage(page)

new_pdf.write("new.pdf")
I managed to get it working with PyPDF2.
If you just want to remove all annotations for a page, you just have to do:
if '/Annots' in page: del page['/Annots']
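For completeness, a minimal sketch of that removal applied to a whole file (my own addition, using the same old PyPDF2 reader/writer API as the example below):
import PyPDF2

reader = PyPDF2.PdfFileReader("input.pdf")
writer = PyPDF2.PdfFileWriter()
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    if '/Annots' in page:
        del page['/Annots']  # drop every annotation on the page, links included
    writer.addPage(page)
with open("no_annotations.pdf", "wb") as f:
    writer.write(f)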
Otherwise, here is how you change each link:
import PyPDF2

new_link = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # great video by the way

pdf_reader = PyPDF2.PdfFileReader("input.pdf")
pdf_writer = PyPDF2.PdfFileWriter()

for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    if '/Annots' in page:
        for annot in page['/Annots']:
            annot_obj = annot.getObject()
            if '/A' not in annot_obj:
                continue  # not a link
            # you have to wrap the key and value with a TextStringObject:
            key = PyPDF2.generic.TextStringObject("/URI")
            value = PyPDF2.generic.TextStringObject(new_link)
            annot_obj['/A'][key] = value
    pdf_writer.addPage(page)  # keep every page, whether or not it had links

with open('output.pdf', 'wb') as f:
    pdf_writer.write(f)
An equivalent one-liner for a given page index i and annotation index j would be:
pdf_reader.getPage(i)['/Annots'][j].getObject()['/A'][PyPDF2.generic.TextStringObject("/URI")] = PyPDF2.generic.TextStringObject(new_link)
I'd like to be able to retrieve a pdf from a public Box link through python, but I'm not quite sure how I can do this. Here's an example of the type of pdf I hope to be able to download:
https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje
I can click the download button or click a button to get a printable link on my browser, but I haven't been able to find the link to this page in the source html. Is there a way to find this link programmatically? Perhaps through selenium or requests or even through the box API?
Thanks a lot for the help!
This is code to get the download link of the pdf:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup


def get_html(url, timeout = 15):
    ''' function returns html of url
    usually html = urlopen(url) is enough but sometimes it doesn't work
    also instead of urllib.request you can use any other method to get the html
    code of a url, like urllib or urllib2 (just search it online), but I
    think urllib.request comes with the python installation'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except:
            pass
    return html


def get_soup(html):
    ''' function returns soup of html code
    Beautiful Soup is a Python library for pulling data out of HTML
    and XML files. It works with your favorite parser to provide idiomatic
    ways of navigating, searching, and modifying the parse tree. It
    commonly saves programmers hours or days of work.
    more at https://www.crummy.com/software/BeautifulSoup/bs4/doc/'''
    soup = BeautifulSoup(html, "lxml")
    ## if it doesn't work, instead of using "lxml"
    ## you can use any of these options:
    ## soup = BeautifulSoup(html, "html.parser")
    ## soup = BeautifulSoup(html, "lxml-xml")
    ## soup = BeautifulSoup(html, "xml")
    ## soup = BeautifulSoup(markup, "html5lib")
    return soup


def get_data_file_id(html):
    '''function returns data_file_id which is found in html code'''
    ## to scrape the website I suggest using BeautifulSoup;
    ## you can do it manually using html.read(), which will give you the
    ## html code as a string, and then you need to do some string searching
    soup = get_soup(html)
    ## the part of the html code we are interested in is:
    ## <div class="preview" data-module="preview" data-file-id="69950302561" data-file-version-id="">
    ## we want to extract this data-file-id
    ## first we find the div it's located in
    classifier = {"class": 'preview'}   ## classifier specifies the div we are looking for
    div = soup.find('div', classifier)  ## we will get the div which has class 'preview'
    ## now we can easily get data-file-id by using
    data_file_id = div.get('data-file-id')
    return data_file_id

## you can install BeautifulSoup from:
## on windows http://www.lfd.uci.edu/~gohlke/pythonlibs/
## or from https://pypi.python.org/pypi/beautifulsoup4/4.4.1
## official page is https://www.crummy.com/software/BeautifulSoup/

## if you don't want to use BeautifulSoup then you should do something like this:
##
##html_str = str(html.read())
##search_for = 'div class="preview" data-module="preview" data-file-id="'
##start = html_str.find(search_for) + len(search_for)
##end = html_str.find('"', start)
##data_file_id = html_str[start : end]
##
## it may seem easier to do it that way than to use BeautifulSoup, but the problem is that
## if there is one more space in search_for, or the order of the div attributes is different,
## or the sign " is used instead of ' and vice versa, this string searching
## won't work while BeautifulSoup will, so I recommend using BeautifulSoup


def get_url_id(url):
    ''' function returns url_id which is the last part of url'''
    reverse_url = url[::-1]
    start = len(url) - reverse_url.find('/')  # start is the position of the last '/' in url
    url_id = url[start:]
    return url_id


def get_download_url(url_id, data_file_id):
    ''' function returns download_url'''
    start = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name='
    download_url = start + url_id + '&file_id=f_' + data_file_id
    return download_url


url = 'https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje'
url = 'https://fnn.app.box.com/s/n74mnmrwyrmtiooqwppqjkrd1hhf3t3j'  # the second assignment overrides the first (a different example link)

html = get_html(url)
data_file_id = get_data_file_id(html)  ## we need data_file_id to create the download url
url_id = get_url_id(url)               ## we need url_id to create the download url
download_url = get_download_url(url_id, data_file_id)

## this actually isn't the real download url
## you can get the real url by using:
## real_download_url = get_html(download_url).geturl()
## but you will get a really long url; for your example it would be
## https://dl.boxcloud.com/d/1/4vx9ZWYeeQikW0KHUuO4okRjjQv3t6VGFTbMkh7weWQQc_tInOFR_1L_FuqVFovLqiycOLUDHu4o2U5EdZjmwnSmVuByY5DhpjmmdlizjaVjk6RMBbLcVhSt0ewtusDNL5tA8aiUKD1iIDlWCnXHJlcVzBc4aH3BXIEU65Ki1KdfZIlG7_jl8wuwP4MQG_yFw2sLWVDZbddJ50NLo2ElBthxy4EMSJ1auyvAWOp6ai2S4WPdqUDZ04PjOeCxQhvo3ufkt3og6Uw_s6oVVPryPUO3Pb2M4-Li5x9Cki882-WzjWUkBAPJwscVxTbDbu1b8GrR9P-5lv2I_DC4uPPamXb07f3Kp2kSJDVyy9rKbs16ATF3Wi2pOMMszMm0DVSg9SFfC6CCI0ISrkXZjEhWa_HIBuv_ptfQUUdJOMm9RmteDTstW37WgCCjT2Z22eFAfXVsFTOZBiaFVmicVAFkpB7QHyVkrfxdqpCcySEmt-KOxyjQOykx1HiC_WB2-aEFtEkCBHPX8BsG7tm10KRbSwzeGbp5YN1TJLxNlDzYZ1wVIKcD7AeoAzTjq0Brr8du0Vf67laJLuBVcZKBUhFNYM54UuOgL9USQDj8hpl5ew-W__VqYuOnAFOS18KVUTDsLODYcgLMzAylYg5pp-2IF1ipPXlbBOJgwNpYgUY0Bmnl6HaorNaRpmLVQflhs0h6wAXc7DqSNHhSnq5I_YbiQxM3pV8K8IWvpejYy3xKED5PM9HR_Sr1dnO0HtaL5PgfKcuiRCdCJjpk766LO0iNiRSWKHQ9lmdgA-AUHbQMMywLvW71rhIEea_jQ84elZdK1tK19zqPAAJ0sgT7LwdKCsT781sA90R4sRU07H825R5I3O1ygrdD-3pPArMf9bfrYyVmiZfI_yE_XiQ0OMXV9y13daMh65XkwETMAgWYwhs6RoTo3Kaa57hJjFT111lQVhjmLQF9AeqwXb0AB-Hu2AhN7tmvryRm7N2YLu6IMGLipsabJQnmp3mWqULh18gerlve9ZsOj0UyjsfGD4I0I6OhoOILsgI1k0yn8QEaVusHnKgXAtmi_JwXLN2hnP9YP20WjBLJ/download
## and we don't really care about real download url so i will use just download_url
print(download_url)
I also wrote code to download that pdf:
from urllib.request import Request, urlopen
def get_html(url, timeout = 15):
    ''' function returns html of url
    usually html = urlopen(url) is enough but sometimes it doesn't work
    also instead of urllib.request you can use any other method to get the html
    code of a url, like urllib or urllib2 (just search it online), but I
    think urllib.request comes with the python installation'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except:
            pass
    return html


def get_current_path():
    ''' function returns the path of the folder in which the python program is saved'''
    try:
        path = __file__
    except:
        try:
            import sys
            path = sys.argv[0]
        except:
            path = ''
    if path:
        if '\\' in path:
            path = path.replace('\\', '/')
        end = len(path) - path[::-1].find('/')
        path = path[:end]
    return path


def check_if_name_already_exists(name, path):
    ''' function checks if there is an already existing pdf file
    with the same name in the folder given by path.'''
    try:
        file = open(path + name + '.pdf', 'r')
        file.close()
        return True
    except:
        return False


def get_new_name(old_name, path):
    ''' function asks the user to enter a new name for the file and returns the inputted name.'''
    print('File with name "{}" already exists.'.format(old_name))
    answer = input('Would you like to replace it (answer with "r")\nor create new one (answer with "n") ? ')
    while answer not in 'rRnN':
        print('Your answer is inconclusive')
        print('Please answer again:')
        print('if you would like to replace the existing file answer with "r"')
        print('if you would like to create new one answer with "n"')
        answer = input('Would you like to replace it (answer with "r")\n or create new one (answer with "n") ? ')
    if answer in 'nN':
        new_name = input('Enter new name for file: ')
        if check_if_name_already_exists(new_name, path):
            return get_new_name(new_name, path)
        else:
            return new_name
    if answer in 'rR':
        return old_name


def download_pdf(url, name = 'document1', path = None):
    '''function downloads pdf file from its url
    required argument is the url of the pdf file and
    optional argument is the name for the saved pdf file and
    optional argument path if you want to choose where your file is saved
    variable path must look like:
    'C:\\Users\\Computer name\\Desktop' or
    'C:/Users/Computer name/Desktop' '''
    # and not like
    # 'C:\Users\Computer name\Desktop'
    pdf = get_html(url)
    name = name.replace('.pdf', '')
    if path == None:
        path = get_current_path()
    if '\\' in path:
        path = path.replace('\\', '/')
    if path and path[-1] != '/':
        path += '/'
    if path:
        check = check_if_name_already_exists(name, path)
        if check:
            if name == 'document1':
                i = 2
                name = 'document' + str(i)
                while check_if_name_already_exists(name, path):
                    i += 1
                    name = 'document' + str(i)
            else:
                name = get_new_name(name, path)
        file = open(path + name + '.pdf', 'wb')
    else:
        file = open(name + '.pdf', 'wb')
    file.write(pdf.read())
    file.close()
    if path:
        print(name + '.pdf file downloaded in folder "{}".'.format(path))
    else:
        print(name + '.pdf file downloaded.')
    return
download_url = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name=n74mnmrwyrmtiooqwppqjkrd1hhf3t3j&file_id=f_53868474893'
download_pdf(download_url)
Hope it helps, let me know if it works.
I'm working on a Python script to scrape multiple files from a website.
The download form HTML is something like this:
<span>
<a class="tooltip" href="download.php?action=download&id=xxx&authkey=yyy&pass=zzz" title="Download">DL</a>
</span>
What I'm thinking of is:
f1 = open('scraping.log', 'a')
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
for a in soup.select('a[href^="download.php?action=download"]'):
    link = a.attrs.get('href')
    print >>f1, link
    br.retrieve(url+link, destination)
However, retrieve requires me to state the output filename. I want to keep the original filename instead of setting a random name. Is there any way to do that?
Moreover, as I want to run this script frequently from crontab, is there a way to check scraping.log and skip the links that have already been downloaded?
If you don't like "download.php", check for a Content-Disposition header, e.g.:
Content-Disposition: attachment; filename="fname.ext"
And ensure the filename complies with your intent:
It is important that the receiving MUA not blindly use the
suggested filename. The suggested filename SHOULD be checked (and
possibly changed) to see that it conforms to local filesystem
conventions, does not overwrite an existing file, and does not
present a security problem (see Security Considerations below).
Python 2:
import re
import mechanize # pip install mechanize
br = mechanize.Browser()
r = br.open('http://yoursite.com')
#print r.info()['Content-Disposition']
unsafe_filename = r.info().getparam('filename') # Could be "/etc/evil".
filename = re.findall("([a-zA-Z0-9 _,()'-]+[.][a-z0-9]+)$", unsafe_filename)[0] # "-]" to match "-".
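If you're on Python 3, a rough equivalent using requests instead of mechanize (my own sketch, not part of the original answer; the URL is a placeholder) might look like this:
import re
import requests

r = requests.get('http://yoursite.com/download.php?action=download&id=xxx')
# Content-Disposition is optional; fall back to the last URL path segment if it's missing
cd = r.headers.get('Content-Disposition', '')
m = re.search(r'filename="?([^";]+)"?', cd)
unsafe_filename = m.group(1) if m else r.url.rsplit('/', 1)[-1]
# same sanitising idea as above: only keep a conservative set of characters
filename = re.findall(r"([a-zA-Z0-9 _,()'-]+[.][a-z0-9]+)$", unsafe_filename)[0]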
As for skipping links you've processed before,
f1 = open('scraping.log', 'a+')  # 'a+' so the log can be read as well as appended to
f1.seek(0)
processed_links = [line.rstrip('\n') for line in f1.readlines()]

page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
for a in soup.select('a[href^="download.php?action=download"]'):
    link = a.attrs.get('href')
    if link not in processed_links:
        print >>f1, link
        processed_links += [link]
        br.retrieve(url+link, destination)