Scraping PDFs from multiple pages using bs4 - python

I'm a python beginner and I'm hoping that what I'm trying to do isn't too involved. Essentially, I want to extract the text of the minutes (contained in PDF documents) from this municipality's council meetings for the last ~10 years at this website: https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3
Eventually, I want to analyze/categorise the action items from the meeting minutes. All I've been able to do so far is grab the links leading to the PDFs from the first page. Here is my code:
# Import requests for navigating to websites, beautiful soup to scrape website, PyPDF2 for PDF data mining
import sys
import requests
import bs4
import PyPDF2
#import PDfMiner
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Soupify URL
my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
result = requests.get(my_url)
src = result.content
page_soup = soup(src, "lxml")

# List with links
urls = []
for tr_tag in page_soup.find_all("tr"):
    a_tag = tr_tag.find("a")
    urls.append(a_tag.attrs["href"])
print(urls)
A few things I could use help with:
How do I pull the links from pages 1-50 (an arbitrary range) of the 'Previous Meetings' site, instead of just the first page?
How do I go about entering each of the links and pulling the 'Read the minutes' PDFs for text analysis (using PyPDF2)?
Any help is so appreciated! Thank you in advance!
EDIT: I am hoping to get the data into a dataframe, where the first column is the file name and the second column is the text from the PDF. It would look like:
PDF_file_name      PDF_text
spec20210729min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw

Welcome to the exciting world of web scraping!
First of all, great job, you were on the right track.
There are a few points to discuss though.
You essentially have 2 problems here.
1 - How to retrieve the HTML text for all pages (1, ..., 50)?
In web scraping you mainly have two kinds of web pages:
If you are lucky, the page does not render using JavaScript, and you can use requests alone to get the page content.
If you are less lucky, the page uses JavaScript to render partly or entirely.
To get all the pages from 1 to 50, we need to somehow click on the button next at the end of the page.
Why?
If you check what happens in the network tab of the browser developer console, you see that a new request fetching a JS script to generate the page is made for each click on the next button.
Unfortunately, we can't render JavaScript using requests, but we have a solution: Headless Browsers (wiki).
In the solution, I use selenium, which is a library that can use a real browser driver (in our case Chrome) to query a page and render JavaScript.
So we first get the web page with selenium, we extract the HTML, we click on next and wait a bit for the page to load, we extract the HTML, ... and so on.
2 - How to extract the text from the PDFs after getting them?
After downloading the PDFs, we can load each one into a variable, open it with PyPDF2, and extract the text from all its pages. I'll let you look at the solution code.
Here is a working solution. It will iterate over the first n pages you want and return the text from all the PDFs you are interested in:
import os
import time
from io import BytesIO
from urllib.parse import urljoin

import pandas as pd
import PyPDF2
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create a headless chromedriver to query and perform actions on webpages like a browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Main url
my_url = (
    "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
)


def get_n_first_pages(n: int):
    """Get the html text for the first n pages

    Args:
        n (int): The number of pages we want

    Returns:
        List[str]: A list of html text
    """
    # Initialize the variable containing the pages
    pages = []
    # We query the web page with our chrome driver.
    # This way we can iteratively click on the next link to get all the pages we want
    driver.get(my_url)
    # We append the page source code
    pages.append(driver.page_source)
    # Then for all subsequent pages, we click on next and wait to get the page
    for _ in range(1, n):
        driver.find_element_by_css_selector(
            "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
        ).click()
        # Wait for the page to load
        time.sleep(1)
        # Append the page
        pages.append(driver.page_source)
    return pages


def get_pdf(link: str):
    """Get the pdf text, per PDF page, for a given link.

    Args:
        link (str): The link where we can retrieve the PDF

    Returns:
        List[str]: A list containing a string per PDF page
    """
    # We extract the file name
    pdf_name = link.split("/")[-1].split(".")[0]
    # We get the page containing the PDF link
    # Here we don't need the chrome driver since we don't have to click on the link
    # We can just get the PDF using requests after finding the href
    pdf_link_page = requests.get(link)
    page_soup = soup(pdf_link_page.text, "lxml")
    # We get all <a> tags that have an href attribute, then we select only the hrefs
    # containing min.pdf, since we only want the PDF for the minutes
    pdf_link = [
        urljoin(link, l.attrs["href"])
        for l in page_soup.find_all("a", {"href": True})
        if "min.pdf" in l.attrs["href"]
    ]
    # There is only one PDF for the minutes so we get the only element in the list
    pdf_link = pdf_link[0]
    # We get the PDF with requests and then get the PDF bytes
    pdf_bytes = requests.get(pdf_link).content
    # We load the bytes into an in-memory file (to avoid saving the PDF on disk)
    p = BytesIO(pdf_bytes)
    p.seek(0, os.SEEK_END)
    # Now we can load our PDF in PyPDF2 from memory
    read_pdf = PyPDF2.PdfFileReader(p)
    count = read_pdf.numPages
    pages_txt = []
    # For each page we extract the text
    for i in range(count):
        page = read_pdf.getPage(i)
        pages_txt.append(page.extractText())
    # We return the PDF name as well as the text inside each page
    return pdf_name, pages_txt


# Get the first 2 pages, you can change this number
pages = get_n_first_pages(2)

# Initialize a list to store each dataframe row
df_rows = []

# We iterate over each page
for page in pages:
    page_soup = soup(page, "lxml")
    # Here we get only the <a> tags inside the tbody and each tr
    # We avoid getting the links from the head of the table
    all_links = page_soup.select("tbody tr a")
    # We extract the href for only the links containing council (we don't care about the
    # video link)
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    #
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                # We join each page in the list into one string, separating them with a line return
                "PDF_text": "\n".join(pages_text),
            }
        )
        break
    break

# We create the data frame from the list of rows
df = pd.DataFrame(df_rows)
Outputs a dataframe like:
PDF_file_name PDF_text
0 spec20210729ag \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING...
...
Keep scraping the web, it's fun :)

The issue is that BeautifulSoup won't see any results besides those for the first page. BeautifulSoup is just an XML/HTML parser; it's not a headless browser or a JavaScript-capable runtime that can execute scripts asynchronously. When you make a simple HTTP GET request to your page, the response is an HTML document in which the first page's results are baked directly into the HTML. These contents were baked into the document at the time the server served it to you, so BeautifulSoup can see those elements no problem. All the other pages of results, however, are more tricky.
View the page in a browser. While logging your network traffic, click on the "next" button to view the next page's results. If you're filtering your traffic by XHR/Fetch requests only, you'll notice an HTTP POST request being made to an ASP.NET server, the response of which is HTML containing JavaScript containing JSON containing HTML. It's this nested HTML structure that represents the new content with which to update the table. Clicking this button doesn't actually take you to a different URL - the contents of the table simply change. The DOM is being updated/populated asynchronously using JavaScript, which is not uncommon.
The challenge, then, is to mimic these requests and parse the response to extract the HREFs of only those links in which you're interested. I would split this up into three distinct scripts:
1. One script to generate a .txt file of all sub-page URLs (these would be the URLs you navigate to when clicking links like "Agenda and Minutes").
2. One script to read from that .txt file, make requests to each URL, and extract the HREF to the PDF on that page (if one is available). These direct URLs to PDFs will be saved in another .txt file.
3. A script to read from the PDF-URL .txt file, and perform PDF analysis.
You could combine scripts one and two if you really want to. I felt like splitting it up.
The first script makes an initial request to the main page to get some necessary cookies, and to extract a hidden input __OSVSTATE that's baked into the HTML, which the ASP.NET server cares about in our future requests. It then simulates "clicks" on the "next" button by sending HTTP POST requests to a specific ASP.NET server endpoint. We keep going until we can't find a "next" button on the page anymore. It turns out there are around 260 pages of results in total. For each of these 260 responses, we parse the response, pull the HTML out of it, and extract the HREFs. We only keep those tags whose HREF ends with the substring ".htm", and whose text contains the substring "minute" (case-insensitive). We then write all HREFs to a text file page_urls.txt. Some of these will be duplicated for some reason, and others end up being invalid links, but we'll worry about that later. Here's the entire generated text file.
def get_urls():
    import requests
    from bs4 import BeautifulSoup as Soup
    import datetime
    import re
    import json

    # Start by making the initial request to store the necessary cookies in a session
    # Also, retrieve the __OSVSTATE
    url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
    headers = {
        "user-agent": "Mozilla/5.0"
    }
    session = requests.Session()
    response = session.get(url, headers=headers)
    response.raise_for_status()
    soup = Soup(response.content, "html.parser")
    osv_state = soup.select_one("input[id=\"__OSVSTATE\"]")["value"]

    # Get all results from all pages
    url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx"
    headers = {
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }
    payload = {
        "__EVENTTARGET": "LiverpoolTheme_wt93$block$wtMainContent$RichWidgets_wt132$block$wt28",
        "__AJAX": "980,867,LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28,745,882,0,277,914,760,"
    }
    while True:
        params = {
            "_ts": round(datetime.datetime.now().timestamp())
        }
        payload["__OSVSTATE"] = osv_state
        response = session.post(url, params=params, headers=headers, data=payload)
        response.raise_for_status()
        pattern = "OsJSONUpdate\\(({\"outers\":{[^\\n]+})\\)//\\]\\]"
        jsn = re.search(pattern, response.text).group(1)
        data = json.loads(jsn)
        osv_state = data["hidden"]["__OSVSTATE"]
        html = data["outers"]["LiverpoolTheme_wt93_block_wtMainContent_wtTblCommEventTable_Wrapper"]["inner"]
        soup = Soup(html, "html.parser")
        # Select only those a-tags whose href attribute ends with ".htm"
        # and whose text contains the substring "minute"
        tags = soup.select("a[href$=\".htm\"]")
        hrefs = [tag["href"] for tag in tags if "minute" in tag.get_text().casefold()]
        yield from hrefs

        page_num = soup.select_one("a.ListNavigation_PageNumber").get_text()
        records_message = soup.select_one("div.Counter_Message").get_text()
        print("Page #{}:\n\tProcessed {}, collected {} URL(s)\n".format(page_num, records_message, len(hrefs)))
        if soup.select_one("a.ListNavigation_Next") is None:
            break


def main():
    with open("page_urls.txt", "w") as file:
        for url in get_urls():
            file.write(url + "\n")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
The second script reads the output file of the previous one, and makes a request to each URL in the file. Some of these will be invalid, some need to be cleaned up in order to be used, many will be duplicates, some will be valid but won't contain a link to a PDF, etc. We visit each page and extract the PDF URL, and save each in a file. In the end I've managed to collect 287 usable PDF URLs. Here is the generated text file.
def get_pdf_url(url):
    import requests
    from bs4 import BeautifulSoup as Soup

    url = url.replace("/ctyclerk", "")
    base_url = url[:url.rfind("/")+1]
    headers = {
        "user-agent": "Mozilla/5.0"
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return ""
    soup = Soup(response.content, "html.parser")
    pdf_tags = soup.select("a[href$=\".pdf\"]")
    tag = next((tag for tag in pdf_tags if "minute" in tag.get_text()), None)
    if tag is None:
        return ""
    return tag["href"] if tag["href"].startswith("http") else base_url + tag["href"]


def main():
    with open("page_urls.txt", "r") as file:
        page_urls = set(file.read().splitlines())

    with open("pdf_urls.txt", "w") as file:
        for count, pdf_url in enumerate(map(get_pdf_url, page_urls), start=1):
            if pdf_url:
                status = "Success"
                file.write(pdf_url + "\n")
                file.flush()
            else:
                status = "Skipped"
            print("{}/{} - {}".format(count, len(page_urls), status))
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
The third script would read from the pdf_urls.txt file, make a request to each URL, and then interpret the response bytes as a PDF:
def main():
    import requests
    from io import BytesIO
    from PyPDF2 import PdfFileReader

    with open("pdf_urls.txt", "r") as file:
        pdf_urls = file.read().splitlines()

    for pdf_url in pdf_urls:
        response = requests.get(pdf_url)
        response.raise_for_status()
        content = BytesIO(response.content)
        reader = PdfFileReader(content)
        # do stuff with reader
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Related

Is it possible to do a web scrapping on ebi.ac.uk/interpro website?

I would like to get a table on ebi.ac.uk/interpro with the list of all the thousands of protein names, accession numbers, species, and lengths for the entry I put on the website. I tried to write a script with Python using requests, BeautifulSoup, and so on, but I always get the error
AttributeError: 'NoneType' object has no attribute 'find_all'.
The code
import requests
from bs4 import BeautifulSoup

# Set the URL of the website you want to scrape
url = xxxx

# Send a request to the website and get the response
response = requests.get(url)

# Parse the response using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Find the table on the page
table = soup.find("table", class_='xxx')

# Extract the data from the table
# This will return a list of rows, where each row is a list of cells
table_data = []
for row in table.find_all('tr'):
    cells = row.find_all("td")
    row_data = []
    # for cell in cells:
    #     row_data.append(cell.text)
    # table_data.append(row_data)

# Print the extracted table data
#print(table_data)
For table = soup.find("table", class_='xxx'), I fill in the class according to the name I see when I inspect the page.
Thank you.
I would like to get a table listing all the thousands of proteins that the website lists back from my request
Sure it is, take a look at this example:
import requests
url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
querystring = {"search":"","page_size":"9999"}
payload = ""
response = requests.request("GET", url, data=payload, params=querystring)
print(response.text)
Please do not use selenium unless absolutely necessary. In the following example we request all the entries from /hamap/. I have no idea what this means, but this is the API used to fetch the data. You can get the API for the dataset you want to scrape data from by doing the following:
Open Chrome dev tools -> Network -> click Fetch/XHR -> click on the specific source you want -> wait until the page loads -> click the red record icon -> look through the requests for the one that you want. It is important not to record requests after you have retrieved the initial response; this website sends a tracking request every second or so and it becomes cluttered really quickly. Once you have the source that you want, just loop over the array and get the fields that you want. I hope this answer was useful to you.
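For illustration, here is a minimal sketch of that last step. It assumes the /hamap/ endpoint above returns JSON with the same count/results structure used in the scroll example below; the exact fields inside each result depend on the endpoint, so inspect one entry first.
import requests

url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
params = {"search": "", "page_size": "9999"}

response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()

# Assumed structure: a top-level "count" and a "results" array (as in the scroll example below)
print("Total entries:", data.get("count"))
for entry in data.get("results", []):
    # Print a whole entry once to discover the field names, then pull out what you need
    print(entry)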
Hey, I checked it out some more. This site uses something similar to Elasticsearch's scroll; here is a full implementation of what you are looking for:
import requests
import json

results_array = []


def main():
    count = 0
    starturl = "https://www.ebi.ac.uk/interpro/wwwapi//protein/UniProt/entry/InterPro/IPR002300/?page_size=100&has_model=true"  ## This is the URL you want to scrape on page 0
    startpage = requests.get(starturl)  ## This is the page you want to scrape
    count += int(startpage.json()['count'])  ## This is the total number of indexes
    next = startpage.json()['next']  ## This is the next page
    for result in startpage.json()['results']:
        results_array.append(result)
    while count:
        count -= 100
        nextpage = requests.get(next)  ## this is the next page
        if nextpage.json()['next'] is None:
            break
        next = nextpage.json()['next']
        for result in nextpage.json()['results']:
            results_array.append(result)
        print(json.dumps(nextpage.json()))
        print(count)


if __name__ == '__main__':
    main()
    with open("output.json", "w") as f:
        f.write(json.dumps(results_array))
To use this for any other type, replace the starturl string with the one you want. Make sure it is the URL that controls the pages. To get it, click on the data you want, then click on the next page and use that URL.
I hope this answer is what you were looking for.

How to download all the href (pdf) inside a class with python beautiful soup?

I have around 900 pages and each page contains 10 buttons (each button has a PDF). I want to download all the PDFs - the program should browse to all the pages and download the PDFs one by one.
My code only searches for .pdf, but my hrefs do not contain .pdf. The page_no parameter goes from 1 to 900:
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
This is the website, and below is an example of the link text:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
You only need the href associated with the links you call buttons. Then prefix it with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is anchor (a) tags with direct parent element having class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest having a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_". You simply add to this dict during your loop over the desired number of pages.
As there are over 800 pages I have chosen to add in an additional termination page count variable called end_number. I don't want to loop to all pages so this allows me an early exit. You can remove this param if so desired.
Next, you need to determine the actual number of pages. For this you can use the following css selector to get the Last pagination link and then extract its data-ci-pagination-page value and convert to integer. This can then be the num_pages (number of pages) to terminate your loop at:
.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements have a shared parent with class pagination i.e. the anchor tag in the last li, which is the last page link in the pagination element.
Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop the key, value pairs and issue requests for the content. Write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the results in a dictionary to optimize what would be an I/O-bound process (a rough sketch follows the code below). Then loop over that, writing to disk, perhaps with a multi-processing approach to optimize for a more CPU-bound process.
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically currently have something like (836 * 10) + 836 requests.
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
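As a rough sketch of the threading idea from the TODO above (not the answer's tested code), the downloads could be issued concurrently with a thread pool; this assumes the pdf_links dict and the path variable from the snippet above:
from concurrent.futures import ThreadPoolExecutor

import requests

def download(item):
    name, url = item
    # Each worker fetches one PDF and writes it straight to disk
    r = requests.get(url)
    with open(f'{path}/{name}.pdf', 'wb') as f:
        f.write(r.content)
    return name

# I/O-bound work, so a modest thread pool is usually enough
with ThreadPoolExecutor(max_workers=8) as executor:
    for name in executor.map(download, pdf_links.items()):
        print(f'Saved {name}.pdf')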
Your site doesn't work for 90% of people, but you provided examples of the HTML, so I hope this will help you:
import requests
from bs4 import BeautifulSoup

url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')

for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        with open('pdf_name_here.pdf', 'wb') as f:
            # if you have the full link
            href = pdf.get('href')
            # if you have the link without the full path, like /showbidDocument/2993132
            # href = url + pdf.get('href')
            response = requests.get(href)
            f.write(response.content)

How to get data past the "Show More" button that DOESN'T change the URL?

I am trying to scrape article titles and links from Vogue with a site search keyword. I can't get the top 100 results because the "Show More" button obscures them. I've gotten around this before by using the changing URL, but Vogue's URL does not change to include the page number, result number, etc.
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.vogue.com/search?q=HARRY+STYLES&sort=score+desc'
r = requests.get(url)
soup = bs(r.content, 'html')

links = soup.find_all('a', {'class': "summary-item-tracking__hed-link summary-item__hed-link"})
titles = soup.find_all('h2', {'class': "summary-item__hed"})

res = []
for i in range(len(titles)):
    entry = {'Title': titles[i].text.strip(), 'Link': 'https://www.vogue.com' + links[i]['href'].strip()}
    res.append(entry)
Any tips on how to scrape the data past the "Show More" button?
You have to examine the Network tab in the developer tools. Then you have to determine how the website requests the data.
The website uses a page parameter, as you can see in the request URL.
Each page has 8 titles, so you have to loop to get 100 titles.
Code:
import cloudscraper, json, html

counter = 1
for i in range(1, 14):
    url = f'https://www.vogue.com/search?q=HARRY%20STYLES&page={i}&sort=score%20desc&format=json'
    scraper = cloudscraper.create_scraper(browser={'browser': 'firefox', 'platform': 'windows', 'mobile': False}, delay=10)
    byte_data = scraper.get(url).content
    json_data = json.loads(byte_data)
    for j in range(0, 8):
        title_url = 'https://www.vogue.com' + (html.unescape(json_data['search']['items'][j]['url']))
        t = html.unescape(json_data['search']['items'][j]['source']['hed'])
        print(counter, " - " + t + ' - ' + title_url)
        if (counter == 100):
            break
        counter = counter + 1
You can inspect the requests on the website using your browser's web developer tools to find out if its making a specific request for data of your interest.
In this case, the website is loading more info by making GET requests to an URL like this:
https://www.vogue.com/search?q=HARRY STYLES&page=<page_number>&sort=score desc&format=json
Where <page_number> is > 1 as page 1 is what you see by default when you visit the website.
Assuming you can/will request a limited number of pages, and as the data format is JSON, you will have to transform it into a dict() or another data structure to extract the data you want, specifically targeting the "search.items" key of the JSON object, since it contains an array of article data for the requested page.
Then, the "Title" would be search.items[i].source.hed and you could assemble the link with search.items[i].url.
As a tip, I think it is good practice to try to see how the website works manually and then attempt to automate the process.
If you want to request data to that URL, make sure to include some delay between requests so you don't get kicked out or blocked.
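Here is a minimal sketch of this approach, using plain requests (the other answer uses cloudscraper, which may be needed if plain requests get blocked) and the URL format and JSON keys described above:
import time

import requests

base = "https://www.vogue.com/search"
results = []

for page in range(1, 4):  # request a limited number of pages
    params = {"q": "HARRY STYLES", "page": page, "sort": "score desc", "format": "json"}
    r = requests.get(base, params=params, headers={"user-agent": "Mozilla/5.0"})
    r.raise_for_status()
    items = r.json()["search"]["items"]  # array of article data for the requested page
    for item in items:
        results.append({
            "Title": item["source"]["hed"],
            "Link": "https://www.vogue.com" + item["url"],
        })
    time.sleep(1)  # small delay between requests, as suggested above

print(len(results))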

Troubleshooting python code to scrape and store PDF text

The following code searches through the main URL and enters the 'Council' hyperlink to extract text from the Minutes documents on each page (stored in PDFs, and extracted using PyPDF2).
The problem I'm having is that the code is supposed to loop through n pages to pull PDFs, but the output only returns the first PDF. I'm not sure what's happening, as minutes_links does store the correct number of links to the PDF files, but in the for loop to extract pdf_name and pages_text, only the first link is pulled and stored.
import os
import time
from io import BytesIO
from urllib.parse import urljoin

import pandas as pd
import PyPDF2
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create a headless chromedriver to query and perform actions on webpages like a browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Main url
my_url = (
    "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
)


def get_n_first_pages(n: int):
    """Get the html text for the first n pages

    Args:
        n (int): The number of pages we want

    Returns:
        List[str]: A list of html text
    """
    # Initialize the variable containing the pages
    pages = []
    # We query the web page with our chrome driver.
    # This way we can iteratively click on the next link to get all the pages we want
    driver.get(my_url)
    # We append the page source code
    pages.append(driver.page_source)
    # Then for all subsequent pages, we click on next and wait to get the page
    for _ in range(1, n):
        driver.find_element_by_css_selector(
            "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
        ).click()
        # Wait for the page to load
        time.sleep(1)
        # Append the page
        pages.append(driver.page_source)
    return pages


def get_pdf(link: str):
    """Get the pdf text, per PDF page, for a given link.

    Args:
        link (str): The link where we can retrieve the PDF

    Returns:
        List[str]: A list containing a string per PDF page
    """
    # We extract the file name
    pdf_name = link.split("/")[-1].split(".")[0]
    # We get the page containing the PDF link
    # Here we don't need the chrome driver since we don't have to click on the link
    # We can just get the PDF using requests after finding the href
    pdf_link_page = requests.get(link)
    page_soup = soup(pdf_link_page.text, "lxml")
    # We get all <a> tags that have an href attribute, then we select only the hrefs
    # containing min.pdf, since we only want the PDF for the minutes
    pdf_link = [
        urljoin(link, l.attrs["href"])
        for l in page_soup.find_all("a", {"href": True})
        if "min.pdf" in l.attrs["href"]
    ]
    # There is only one PDF for the minutes so we get the only element in the list
    pdf_link = pdf_link[0]
    # We get the PDF with requests and then get the PDF bytes
    pdf_bytes = requests.get(pdf_link).content
    # We load the bytes into an in-memory file (to avoid saving the PDF on disk)
    p = BytesIO(pdf_bytes)
    p.seek(0, os.SEEK_END)
    # Now we can load our PDF in PyPDF2 from memory
    read_pdf = PyPDF2.PdfFileReader(p)
    count = read_pdf.numPages
    pages_txt = []
    # For each page we extract the text
    for i in range(count):
        page = read_pdf.getPage(i)
        pages_txt.append(page.extractText())
    # We return the PDF name as well as the text inside each page
    return pdf_name, pages_txt


# Get the first 16 pages, you can change this number
pages = get_n_first_pages(16)

# Initialize a list to store each dataframe row
df_rows = []

# We iterate over each page
for page in pages:
    page_soup = soup(page, "lxml")
    # Here we get only the <a> tags inside the tbody and each tr
    # We avoid getting the links from the head of the table
    all_links = page_soup.select("tbody tr a")
    # We extract the href for only the links containing council (we don't care about the
    # video link)
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    #
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                # We join each page in the list into one string, separating them with a line return
                "PDF_text": "\n".join(pages_text),
            }
        )
        break
    break

# We create the data frame from the list of rows
df = pd.DataFrame(df_rows)
The desired output is a dataframe that looks like this:
PDF_file_name      PDF_text
spec20210729min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
Right now, I can get the first one in there, but not any subsequent files. TIA!
At the end of your two for loops, you have a break command.
The break command tells the for loop to stop executing and move on to the next block of code. So, each of your for loops only ends up running once.
Remove these two break statements, and it should work as intended.
P.S - I have not tested this, I will remove this answer if it doesn't work
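For reference, the tail of the loop from the question would look like this with the two break statements removed (a sketch based on the code above, untested against the site):
for page in pages:
    page_soup = soup(page, "lxml")
    all_links = page_soup.select("tbody tr a")
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    # No break here: process every minutes link on this page
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append({
            "PDF_file_name": pdf_name,
            "PDF_text": "\n".join(pages_text),
        })
    # And no break here either: move on to the next results page

df = pd.DataFrame(df_rows)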

Scrape multiple pages with BeautifulSoup and Python

My code successfully scrapes the tr align=center tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.
However, there are multiple pages available at the site above which I would like to be able to scrape.
For example, with the url above, when I click the link to "page 2" the overall url does NOT change. I looked at the page source and saw a javascript code to advance to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4
import requests

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
The trick here is to check the requests that are coming in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (via pressing F12) or installing the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer.
Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that appears, and it's a POST method. All the other elements will quickly follow and fill the page.
Click on the above POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this muuuch better than I do).
Whenever the URL has variables like page numbers, location markers, or categories, more often than not, the site uses query strings. Long story made short, it's similar to an SQL query (actually, it is an SQL query, sometimes) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find it.
As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.
POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, etc. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log-in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or somesuch.
What the above paragraph basically means is that you can (but not always) append the form data to your URL and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.
Test if it works by adding it to the URL.
Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
try: # Make sure there are more than one page, otherwise, set to 1.
num_pages = int(page_count_links[-1].get_text())
except IndexError:
num_pages = 1
# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
# Open the text file. Use with to save self from grief.
with open("results.txt","wb") as acct:
for url_ in url_list:
print "Processing {}...".format(url_)
r_new = rq.get(url_)
soup_new = bsoup(r_new.text)
for tr in soup_new.find_all('tr', align='center'):
stack = []
for td in tr.findAll('td'):
stack.append(td.text.replace('\n', '').replace('\t', '').strip())
acct.write(", ".join(stack) + '\n')
We use regular expressions to get the proper links. Then using list comprehension, we built a list of URL strings. Finally, we iterate over them.
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]
Hope that helps.
EDIT:
Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I updated both the above and below code to not error out when there is only a single page available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)
        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the one you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1
        # Add 1 because Python range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
        # Process each page within this class.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')
