Scraping specific PDFs from different websites - Python

First question here. I need to download a specific PDF from every URL. From each URL I have, I only need the PDF of the European Commission proposal, which always sits in the same part of the page (the "The European Commission proposal" section, shown in the screenshot I would normally attach here).
Here is the HTML of that part. The piece that matters to me is
"http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf", which is the PDF I need:
[<a class="externalDocument" href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="externalDocument">COM(2020)0791</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
<span class="ep_name">
COM(2020)0791
</span>
<span class="ep_icon"> </span>
</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
<span class="ep_name">
COM(2020)0791
</span>
<span class="ep_icon"> </span>
</a>]
I used the code below for the task: it takes every URL from my CSV file, goes to each page, and downloads every PDF it finds. The problem is that this approach also downloads PDFs I do not need. It would even be fine if it downloaded them, as long as I could tell them apart by the section they were downloaded from, which is why I am asking how to download all the PDFs from just one specific subsection. Distinguishing them by section in the filename would also work. Right now this code gives me back about 3000 PDFs; I need around 1400, one for each link. If each file kept the name of its link that would make things easier for me too, but it is not my main worry, since the files are saved in the order they are read from the CSV and will be easy to tidy up afterwards.
In short, this code needs to become one that downloads only from one part of the site instead of all of it:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# import pandas
# data = pandas.read_csv('urls.csv')
# urls = data['urls'].tolist()
urls = ["http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350",
        "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2012/0299",
        "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"]
# url = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"

folder_location = r'C:\Users\myname\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # download every PDF linked on the page whose href ends with EN.pdf
    for link in soup.select("a[href$='EN.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
For example, I did not want to download this "follow up document": it also starts with COM and ends with EN.pdf, but it has a different year because it is a follow-up (2018 in this case), as you can see from the link:
https://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2018/0564/COM_COM(2018)0564_EN.pdf

The links in your html file all seem to point to the same pdf [or at least they have the same filename], so it'll just be downloading and overwriting the same document. Still, if you just want to target only the first of those links, you could include the class externalDocument in your selector.
for link in soup.select('a.externalDocument[href$="EN.pdf"]'):
If you want to target a specific event like 'Legislative proposal published', then you could do something like this:
# urls....os.mkdir(folder_location)

evtName = 'Legislative proposal published'
tdSel, spSel, aSel = 'div.ep-table-cell', 'span.ep_name', 'a[href$="EN.pdf"]'
dlSel = f'{tdSel}+{tdSel}+{tdSel} {spSel}>{aSel}'
trSel = f'div.ep-table-row:has(>{dlSel}):has(>{tdSel}+{tdSel} {spSel})'

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    pgPdfLinks = [
        tr.select_one(dlSel).get('href') for tr in soup.select(trSel) if
        evtName.strip().lower() in
        tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text().strip().lower()
        ## if you want a [case-sensitive] exact match, change the condition to
        # tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text() == evtName
    ]
    for link in pgPdfLinks[:1]:
        filename = os.path.join(folder_location, link.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)
[The [:1] of pgPdfLinks[:1] is probably unnecessary since more than one match isn't expected, but it's there if you want to absolutely ensure only one download per page.]
Note: you need to be sure that there will be an event named evtName with a link matching aSel (a[href$="EN.pdf"] in this case) - otherwise, no PDF links will be found and nothing will be downloaded for those pages.
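If you want to notice those pages rather than silently skip them, here is a minimal sketch of my own (not part of the original answer) that reuses the same selectors and simply collects the URLs where no matching row was found; the one-element urls list is just an example, you would use the full list from your CSV:
import requests
from bs4 import BeautifulSoup

urls = ["http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350"]  # or the full list from the CSV

evtName = 'Legislative proposal published'
tdSel, spSel, aSel = 'div.ep-table-cell', 'span.ep_name', 'a[href$="EN.pdf"]'
dlSel = f'{tdSel}+{tdSel}+{tdSel} {spSel}>{aSel}'
trSel = f'div.ep-table-row:has(>{dlSel}):has(>{tdSel}+{tdSel} {spSel})'

missing_urls = []  # procedure pages where no matching event/link was found
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    matched = [
        tr for tr in soup.select(trSel)
        if evtName.lower() in tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text().strip().lower()
    ]
    if not matched:
        missing_urls.append(url)

print(f"{len(missing_urls)} pages without a '{evtName}' PDF link:")
print(*missing_urls, sep="\n")
Those pages can then be checked by hand afterwards.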
if it keeps the name of the link it could be also easier for me
It's already doing that in your code, since there doesn't seem to be much difference between link['href'].split('/')[-1] and link.get_text().strip(), but if you meant that you want the page link [i.e. the url], you could include the procnum (since that seems to be an identifying part of the url) in your filename:
# for link in...
    procnum = url.replace('?', '&').split('&procnum=')[-1].split('&')[0]
    procnum = ''.join(c if (
        c.isalpha() or c.isdigit() or c in '_-[]'
    ) else ('_' if c == '/' else '') for c in procnum)

    filename = f"proc-{procnum} {link.split('/')[-1]}"
    # filename = f"proc-{procnum} {link['href'].split('/')[-1]}"  # in your current code

    filename = os.path.join(folder_location, filename)
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link)).content)
        # f.write(requests.get(urljoin(url, link['href'])).content)  # in your current code
So, [for example] instead of saving to "COM_COM(2020)0791_EN.pdf", it will save to "proc-OLP_2020_0350 COM_COM(2020)0791_EN.pdf".

I tried to solve this by adding steps so that the code checks, at the same time, which year the PDF comes from and adds it to the filename. The code is below, and it is an improvement; however, the answer above by Driftr95 is much better than mine, so anyone who wants to replicate this should use his code.
import requests
import pandas
import os
from urllib.parse import urljoin
from bs4 import BeautifulSoup

data = pandas.read_csv('urls.csv')
urls = data['url'].tolist()
years = data["yearstr"].tolist()
numbers = data["number"].tolist()

folder_location = r'C:\Users\dario.marino5\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url, year, number in zip(urls, years, numbers):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        if year in link['href']:
            # Construct the filename with the number from the CSV file
            filename = f'document_{year}_{number}.pdf'
            filename = os.path.join(folder_location, filename)
            # Download the PDF file and save it to the filename
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(url, link['href'])).content)

Related

Python3. How to save downloaded webpages to a specified dir?

I am trying to save all the <a> links within the Python homepage into a folder named 'Downloaded Pages'. However, after 2 iterations through the for loop I receive the following error:
www.python.org#content <_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network <_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, because it looks as if the pages are being saved: seeing '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>' suggests to me that the path is correct.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
    downloadedPage.close()
Appreciate any advice, thanks.
The problem is that parsing the basename works for a page whose URL ends in something like an .html file, but for one that doesn't specify a file in the URL, like "http://python.org/", the basename is actually empty (you can try printing first the URL and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths, as #Thyebri said.
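To see where that empty basename comes from, here is a quick illustrative check (a sketch of my own, not part of the answer):
import os

print(os.path.basename('https://www.python.org#content'))  # 'www.python.org#content' -> a usable file name
print(repr(os.path.basename('https://www.python.org/')))   # ''  -> empty, because the URL ends in '/'
print(os.path.join('Downloaded Pages', ''))                 # 'Downloaded Pages/' -> a directory, hence IsADirectoryError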
Also, remember that the filename you write cannot contain characters like '/', '\' or '?'.
So, I don't know whether the following code is messy or not, but using the re library I would do the following:
import re  # needed for re.sub

filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])  # replace /, \, *, :, ", ? with '-'
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
So, first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols that are present in URL links with a dash '-', and that is the name that will be given to the file.
Hope it works!

Extract URLs from PDF - text doesn't match URL

I'm using the following code to extract URLs from a PDF. It works fine for extracting the anchor, but it does not work when the anchor text is different from the URL behind it.
For example, 'www.page.com/A' is used as a short URL in the text, but the actual URL behind it is a longer (full) version.
The code I'm using is:
import urllib.request
import PyPDF2

urllib.request.urlretrieve(url, "remoteFile")  # `url` is the remote PDF location, defined earlier
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)

key = "/Annots"
uri = "/URI"
ank = "/A"
mylist = []

for page_no in range(pdfFile.numPages):
    page = pdfFile.getPage(page_no)
    text = page.extractText()
    pageObject = page.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]  # the page's annotation array
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                pass
As I said, it works OK if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just the link).

Saving an image from a URL that does not end with image extension

I'm a Python beginner. I have a dataset column that contains thousands of URLs. I want to save the image at each URL with its extension. I don't have a problem with URLs that end with the image extension, like https://web.archive.org/web/20170628093753im_/http://politicot.com/wp-content/uploads/2016/12/Sean-Spicer.jpg (with urllib or requests).
However, for URLs like link1 = https://nypost.com/wp-content/uploads/sites/2/2017/11/171106-texas-shooter-church-index.jpg?quality=90&strip=all&w=1200 or link2 = https://i2.wp.com/www.huzlers.com/wp-content/uploads/2017/03/maxresdefault.jpeg?fit=1280%2C720&ssl=1, I failed to save them.
I want to save the images from these links as image1.jpg and image2.jpeg. How can I do this?
Any help would be useful.
The following seems to work for me, give it a try:
import requests

urls = ['https://nypost.com/wp-content/uploads/sites/2/2017/11/171106-texas-shooter-church-index.jpg?quality=90&strip=all&w=1200',
        'https://i2.wp.com/www.huzlers.com/wp-content/uploads/2017/03/maxresdefault.jpeg?fit=1280%2C720&ssl=1']

for i, url in enumerate(urls):
    r = requests.get(url)
    filename = 'image{0}.jpg'.format(i+1)
    with open(filename, 'wb') as f:
        f.write(r.content)
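If you also want to keep each image's original extension (image1.jpg vs image2.jpeg, as asked) rather than hard-coding .jpg, one option, just a sketch of my own and not part of the answer above, is to take the extension from the URL path before the query string:
import os
import requests
from urllib.parse import urlparse

urls = ['https://nypost.com/wp-content/uploads/sites/2/2017/11/171106-texas-shooter-church-index.jpg?quality=90&strip=all&w=1200',
        'https://i2.wp.com/www.huzlers.com/wp-content/uploads/2017/03/maxresdefault.jpeg?fit=1280%2C720&ssl=1']

for i, url in enumerate(urls, start=1):
    r = requests.get(url)
    # urlparse(url).path drops the '?quality=90...' query part, so splitext sees '.jpg' / '.jpeg'
    ext = os.path.splitext(urlparse(url).path)[1] or '.jpg'  # fall back to .jpg if the path has no extension
    with open(f'image{i}{ext}', 'wb') as f:
        f.write(r.content)
With the two links above this saves image1.jpg and image2.jpeg.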

How to input a list of URLs saved in a .txt to a Python program?

I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python; as a matter of fact, this is my first script. I tried to simply say url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. So I tried to apply read() and readlines() to it, but it wouldn't work properly because a 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each on a new line, as the input of my simple script? Should I convert the string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I built some functions to display the info in the desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which other methods are then applied to extract its content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests

with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing new line
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
This could help you:
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    print(url)
url_file.close()
You can apply it to your code as follows:
from newspaper import Article
from newspaper import fulltext
import requests

url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    url = url.strip()  # drop the trailing newline here as well
    a = Article(url, language='pt')
    html = requests.get(url).text
    text = fulltext(html)
    download = a.download()
    parse = a.parse()
    nlp = a.nlp()
    title = a.title
    publish_date = a.publish_date
    authors = a.authors
    keywords = a.keywords
    summary = a.summary
url_file.close()

Python mechanize download file with original filename

I'm working on a Python script to scrape multiple files from a website.
The HTML of the download link is something like this:
<span>
<a class="tooltip" href="download.php?action=download&id=xxx&authkey=yyy&pass=zzz" title="Download">DL</a>
</span>
What I'm thinking of is:
f1 = open('scraping.log', 'a')
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
for a in soup.select('a[href^="download.php?action=download"]'):
    link = a.attrs.get('href')
    print >>f1, link
    br.retrieve(url+link, destination)
However, for retrieve I have to specify the output filename. I want to keep the original filename instead of setting a random name. Is there any way to do that?
Moreover, as I want to run this script frequently from crontab, is there a way to check scraping.log and skip the links that have been downloaded before?
If you don't like "download.php", check for a Content-Disposition header, e.g.:
Content-Disposition: attachment; filename="fname.ext"
And ensure the filename complies with your intent:
It is important that the receiving MUA not blindly use the
suggested filename. The suggested filename SHOULD be checked (and
possibly changed) to see that it conforms to local filesystem
conventions, does not overwrite an existing file, and does not
present a security problem (see Security Considerations below).
Python 2:
import re
import mechanize # pip install mechanize
br = mechanize.Browser()
r = br.open('http://yoursite.com')
#print r.info()['Content-Disposition']
unsafe_filename = r.info().getparam('filename') # Could be "/etc/evil".
filename = re.findall("([a-zA-Z0-9 _,()'-]+[.][a-z0-9]+)$", unsafe_filename)[0] # "-]" to match "-".
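For reference, here is a rough Python 3 version of the same idea using requests (a sketch of my own, not part of the original answer; it keeps the placeholder URL from above and assumes the server actually sends a Content-Disposition header with a filename):
import re
import requests

r = requests.get('http://yoursite.com/download.php?action=download&id=xxx', stream=True)
disposition = r.headers.get('Content-Disposition', '')
m = re.search(r'filename="?([^";]+)"?', disposition)
unsafe_filename = m.group(1) if m else 'download.bin'   # could still be something like "/etc/evil"
filename = re.sub(r"[^a-zA-Z0-9 ._,()'-]", '_', unsafe_filename)  # sanitise per the advice quoted above
with open(filename, 'wb') as f:
    for chunk in r.iter_content(8192):
        f.write(chunk)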
As for skipping links you've processed before,
f1 = open('scraping.log', 'a+')  # 'a+' so the log can be read as well as appended to
f1.seek(0)
processed_links = [l.rstrip('\n') for l in f1.readlines()]

page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
for a in soup.select('a[href^="download.php?action=download"]'):
    link = a.attrs.get('href')
    if link not in processed_links:
        print >>f1, link
        processed_links += [link]
        br.retrieve(url+link, destination)
