Troubleshooting python code to scrape and store PDF text

The following code searches through the main URL and enters the 'Council' hyperlink to extract text from the Minutes documents on each page (stored in PDFs, and extracted using PyPDF2).
The problem I'm having is that the code is supposed to loop through n pages to pull PDFs, but the output only returns the first PDF. I'm not sure what's happening, as minutes_links does store the correct number of links to the PDF files, but in the for loop to extract pdf_name and pages_text, only the first link is pulled and stored.
import os
import time
from io import BytesIO
from urllib.parse import urljoin

import pandas as pd
import PyPDF2
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create a headless chromedriver to query and perform action on webpages like a browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Main url
my_url = (
    "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
)


def get_n_first_pages(n: int):
    """Get the html text for the first n pages

    Args:
        n (int): The number of pages we want

    Returns:
        List[str]: A list of html text
    """
    # Initialize the variables containing the pages
    pages = []
    # We query the web page with our chrome driver.
    # This way we can iteratively click on the next link to get all the pages we want
    driver.get(my_url)
    # We append the page source code
    pages.append(driver.page_source)
    # Then for all subsequent pages, we click on next and wait to get the page
    for _ in range(1, n):
        driver.find_element_by_css_selector(
            "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
        ).click()
        # Wait for the page to load
        time.sleep(1)
        # Append the page
        pages.append(driver.page_source)
    return pages


def get_pdf(link: str):
    """Get the pdf text, per PDF pages, for a given link.

    Args:
        link (str): The link where we can retrieve the PDF

    Returns:
        List[str]: A list containing a string per PDF pages
    """
    # We extract the file name
    pdf_name = link.split("/")[-1].split(".")[0]
    # We get the page containing the PDF link
    # Here we don't need the chrome driver since we don't have to click on the link
    # We can just get the PDF using requests after finding the href
    pdf_link_page = requests.get(link)
    page_soup = soup(pdf_link_page.text, "lxml")
    # We get all <a> tag that have href attribute, then we select only the href
    # containing min.pdf, since we only want the PDF for the minutes
    pdf_link = [
        urljoin(link, l.attrs["href"])
        for l in page_soup.find_all("a", {"href": True})
        if "min.pdf" in l.attrs["href"]
    ]
    # There is only one PDF for the minutes so we get the only element in the list
    pdf_link = pdf_link[0]
    # We get the PDF with requests and then get the PDF bytes
    pdf_bytes = requests.get(pdf_link).content
    # We load the bytes into an in memory file (to avoid saving the PDF on disk)
    p = BytesIO(pdf_bytes)
    p.seek(0, os.SEEK_END)
    # Now we can load our PDF in PyPDF2 from memory
    read_pdf = PyPDF2.PdfFileReader(p)
    count = read_pdf.numPages
    pages_txt = []
    # For each page we extract the text
    for i in range(count):
        page = read_pdf.getPage(i)
        pages_txt.append(page.extractText())
    # We return the PDF name as well as the text inside each pages
    return pdf_name, pages_txt


# Get the first 16 pages, you can change this number
pages = get_n_first_pages(16)
# Initialize a list to store each dataframe rows
df_rows = []
# We iterate over each page
for page in pages:
    page_soup = soup(page, "lxml")
    # Here we get only the <a> tag inside the tbody and each tr
    # We avoid getting the links from the head of the table
    all_links = page_soup.select("tbody tr a")
    # We extract the href for only the links containing council (we don't care about the
    # video link)
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    #
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                # We join each page in the list into one string, separting them with a line return
                "PDF_text": "\n".join(pages_text),
            }
        )
        break
    break

# We create the data frame from the list of rows
df = pd.DataFrame(df_rows)
The desired output is a dataframe that looks like this:
PDF_file_name      PDF_text
spec20210729min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
Right now, I can get the first one in there, but not any subsequent files. TIA!

At the end of your two for loops, you have a break command.
The break command tells the for loop to stop executing and move on to the next block of code. So, each of your for loops only end up running once.
Remove these two break statements, and it should work as intended.
P.S - I have not tested this, I will remove this answer if it doesn't work
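For reference, here is roughly what the tail of the scraping loop looks like with both break statements removed (a sketch only; everything else in the original code stays the same):

# We iterate over every results page and every minutes link on it, with no early exit
for page in pages:
    page_soup = soup(page, "lxml")
    all_links = page_soup.select("tbody tr a")
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                "PDF_text": "\n".join(pages_text),
            }
        )

df = pd.DataFrame(df_rows)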

Related

Is it possible to do web scraping on the ebi.ac.uk/interpro website?

I would like to get a table on ebi.ac.uk/interpro with the list of all the thousands of protein names, accession numbers, species, and lengths for the entry I put on the website. I tried to write a script with Python using requests, BeautifulSoup, and so on, but I am always getting the error
AttributeError: 'NoneType' object has no attribute 'find_all'.
The code
import requests
from bs4 import BeautifulSoup

# Set the URL of the website you want to scrape
url = xxxx

# Send a request to the website and get the response
response = requests.get(url)

# Parse the response using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Find the table on the page
table = soup.find("table", class_ = 'xxx')

# Extract the data from the table
# This will return a list of rows, where each row is a list of cells
table_data = []
for row in table.find_all('tr'):
    cells = row.find_all("td")
    row_data = []
    # for cell in cells:
    #     row_data.append(cell.text)
    # table_data.append(row_data)

# Print the extracted table data
#print(table_data)
For table = soup.find("table", class_ = 'xxx'), I fill in the class according to the name I see when I inspect the page.
Thank you.
I would like to get a table listing all the thousands of proteins that the website lists back from my request
Sure it is possible. Take a look at this example:
import requests
url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
querystring = {"search":"","page_size":"9999"}
payload = ""
response = requests.request("GET", url, data=payload, params=querystring)
print(response.text)
Please do not use selenium unless absolutely necessary. In the following example we request all the entries from /hamap/ (I have no idea what that means, but it is the API endpoint the site uses to fetch the data). You can find the API for the dataset you want to scrape by doing the following:
Open Chrome dev tools -> Network -> click Fetch/XHR -> click on the specific source you want -> wait until the page loads -> click the red record icon -> look through the requests for the one that you want. It is important not to record requests after you have retrieved the initial response; this website sends a tracking request every second or so and the list becomes cluttered really quickly. Once you have the source that you want, just loop over the array and get the fields that you want (see the sketch below). I hope this answer was useful to you.
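As a minimal sketch of that last step (assuming the /hamap/ endpoint returns the same count/next/results JSON structure used in the script further down), you can parse the response as JSON and loop over the results array instead of printing the raw text:

import requests

url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
querystring = {"search": "", "page_size": "9999"}

response = requests.get(url, params=querystring)
data = response.json()

print(data["count"])  # total number of entries, assuming the usual count/next/results shape
for entry in data.get("results", []):
    # Print one entry first to see its structure, then pull out the fields you care about
    print(entry)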
Hey, I checked it out some more. This site uses something similar to Elasticsearch's scroll. Here is a full implementation of what you are looking for:
import requests
import json

results_array = []

def main():
    count = 0
    starturl = "https://www.ebi.ac.uk/interpro/wwwapi//protein/UniProt/entry/InterPro/IPR002300/?page_size=100&has_model=true" ## This is the URL you want to scrape on page 0
    startpage = requests.get(starturl) ## This is the page you want to scrape
    count += int(startpage.json()['count']) ## This is the total number of indexes
    next = startpage.json()['next'] ## This is the next page
    for result in startpage.json()['results']:
        results_array.append(result)
    while count:
        count -= 100
        nextpage = requests.get(next) ## this is the next page
        if nextpage.json()['next'] is None:
            break
        next = nextpage.json()['next']
        for result in nextpage.json()['results']:
            results_array.append(result)
        print(json.dumps(nextpage.json()))
        print(count)

if __name__ == '__main__':
    main()
    with open("output.json", "w") as f:
        f.write(json.dumps(results_array))
To use this for any other type, replace the starturl string with that one. Make sure it is the URL that controls the pages. To get it, click on the data you want, then click on the next page and use that URL.
I hope this answer is what you were looking for.

How to download all the href (pdf) inside a class with python beautiful soup?

I have around 900 pages and each page contains 10 buttons (each button has pdf). I want to download all the pdf's - the program should browse to all the pages and download the pdfs one by one.
My code only searches for links ending in .pdf, but the hrefs on this site do not contain .pdf; the pages run from page_no=1 to 900.
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
This is the website and below is the link:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

#If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)
You only need the href associated with the links you call buttons, then prefix it with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is anchor (a) tags with direct parent element having class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest having a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_". You simply add to this during your loop over the desired number of pages.
As there are over 800 pages I have chosen to add in an additional termination page count variable called end_number. I don't want to loop to all pages so this allows me an early exit. You can remove this param if so desired.
Next, you need to determine the actual number of pages. For this you can use the following css selector to get the Last pagination link and then extract its data-ci-pagination-page value and convert to integer. This can then be the num_pages (number of pages) to terminate your loop at:
.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements have a shared parent with class pagination i.e. the anchor tag in the last li, which is the last page link in the pagination element.
Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop the key, value pairs and issue requests for the content. Write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary to optimize what is an I/O-bound process, then loop over that and write to disk, perhaps with a multi-processing approach to optimize the more CPU-bound part (a rough concurrent-download sketch follows the code below).
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically currently have something like (836 * 10) + 836 requests.
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
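Following up on the TODO above, here is a rough sketch of issuing the PDF downloads concurrently with concurrent.futures. It reuses the pdf_links dict and path built by the code above and is only an illustration under those assumptions, not a drop-in replacement:

from concurrent.futures import ThreadPoolExecutor

import requests

def download(item):
    # item is a (file_name, url) pair taken from the pdf_links dict built above
    name, url = item
    content = requests.get(url).content
    with open(f'{path}/{name}.pdf', 'wb') as f:
        f.write(content)

# Threads suit this I/O-bound workload; keep the pool small to avoid hammering the site
with ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(download, pdf_links.items())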
Your site doesn't work for 90% of people, but you provided examples of the HTML, so I hope this will help you:
import requests
from bs4 import BeautifulSoup

url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)

soup = BeautifulSoup(response.text, features='lxml')
for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        with open('pdf_name_here.pdf', 'wb') as f:
            #if you have the full link
            href = pdf.get('href')
            #if you only have the path instead of the full link, like /showbidDocument/2993132
            #href = url + pdf.get('href')
            response = requests.get(href)
            f.write(response.content)

Scraping PDFs from multiple pages using bs4

I'm a python beginner and I'm hoping that what I'm trying to do isn't too involved. Essentially, I want to extract the text of the minutes (contained in PDF documents) from this municipality's council meetings for the last ~10 years at this website: https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3
Eventually, I want to analyze/categorise the action items from the meeting minutes. All I've been able to do so far is grab the links leading to the PDFs from the first page. Here is my code:
# Import requests for navigating to websites, beautiful soup to scrape website, PyPDF2 for PDF data mining
import sys
import requests
import bs4
import PyPDF2
#import PDfMiner
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# Soupify URL
my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
result = requests.get(my_url)
src = result.content
page_soup = soup(src, "lxml")
#list with links
urls = []
for tr_tag in page_soup.find_all("tr"):
    a_tag = tr_tag.find("a")
    urls.append(a_tag.attrs["href"])
print(urls)
A few things I could use help with:
How do I pull the links from pages 1 - 50 (an arbitrary range) of the 'Previous Meetings' site, instead of just the first page?
How do I go about entering each of the links, and pulling the 'Read the minutes' PDFs for text analysis (using PyPDF2)?
Any help is so appreciated! Thank you in advance!
EDIT: I am hoping to get the data into a dataframe, where the first column is the file name and the second column is the text from the PDF. It would look like:
PDF_file_name      PDF_text
spec20210729min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min    [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
Welcome to the exciting world of web scraping!
First of all, great job you were on the good track.
There are a few points to discuss though.
You essentially have 2 problems here.
1 - How to retrieve the HTML text for all pages (1, ..., 50)?
In web scraping, you mainly encounter two kinds of web pages:
If you are lucky, the page does not render using JavaScript and you can use requests alone to get the page content
If you are less lucky, the page uses JavaScript to render partly or entirely
To get all the pages from 1 to 50, we need to somehow click on the button next at the end of the page.
Why?
If you check what happens in the network tab of the browser developer console, you see that, for each click on the next button, a new request is made to fetch a JS script that generates the page.
Unfortunately, we can't render JavaScript using requests
But we have a solution: Headless Browsers (wiki).
In the solution, I use selenium, which is a library that can use a real browser driver (in our case Chrome) to query a page and render JavaScript.
So we first get the web page with selenium, we extract the HTML, we click on next and wait a bit for the page to load, we extract the HTML, ... and so on.
2 - How to extract the text from the PDFs after getting them?
After downloading the PDFs, we can load each one into a variable, open it with PyPDF2, and extract the text from all of its pages. I'll let you look at the solution code.
Here is a working solution. It will iterate over the first n pages you want and return the text from all the PDF you are interested in:
import os
import time
from io import BytesIO
from urllib.parse import urljoin

import pandas as pd
import PyPDF2
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create a headless chromedriver to query and perform action on webpages like a browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Main url
my_url = (
    "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
)


def get_n_first_pages(n: int):
    """Get the html text for the first n pages

    Args:
        n (int): The number of pages we want

    Returns:
        List[str]: A list of html text
    """
    # Initialize the variables containing the pages
    pages = []
    # We query the web page with our chrome driver.
    # This way we can iteratively click on the next link to get all the pages we want
    driver.get(my_url)
    # We append the page source code
    pages.append(driver.page_source)
    # Then for all subsequent pages, we click on next and wait to get the page
    for _ in range(1, n):
        driver.find_element_by_css_selector(
            "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
        ).click()
        # Wait for the page to load
        time.sleep(1)
        # Append the page
        pages.append(driver.page_source)
    return pages


def get_pdf(link: str):
    """Get the pdf text, per PDF pages, for a given link.

    Args:
        link (str): The link where we can retrieve the PDF

    Returns:
        List[str]: A list containing a string per PDF pages
    """
    # We extract the file name
    pdf_name = link.split("/")[-1].split(".")[0]
    # We get the page containing the PDF link
    # Here we don't need the chrome driver since we don't have to click on the link
    # We can just get the PDF using requests after finding the href
    pdf_link_page = requests.get(link)
    page_soup = soup(pdf_link_page.text, "lxml")
    # We get all <a> tag that have href attribute, then we select only the href
    # containing min.pdf, since we only want the PDF for the minutes
    pdf_link = [
        urljoin(link, l.attrs["href"])
        for l in page_soup.find_all("a", {"href": True})
        if "min.pdf" in l.attrs["href"]
    ]
    # There is only one PDF for the minutes so we get the only element in the list
    pdf_link = pdf_link[0]
    # We get the PDF with requests and then get the PDF bytes
    pdf_bytes = requests.get(pdf_link).content
    # We load the bytes into an in memory file (to avoid saving the PDF on disk)
    p = BytesIO(pdf_bytes)
    p.seek(0, os.SEEK_END)
    # Now we can load our PDF in PyPDF2 from memory
    read_pdf = PyPDF2.PdfFileReader(p)
    count = read_pdf.numPages
    pages_txt = []
    # For each page we extract the text
    for i in range(count):
        page = read_pdf.getPage(i)
        pages_txt.append(page.extractText())
    # We return the PDF name as well as the text inside each pages
    return pdf_name, pages_txt


# Get the first 2 pages, you can change this number
pages = get_n_first_pages(2)
# Initialize a list to store each dataframe rows
df_rows = []
# We iterate over each page
for page in pages:
    page_soup = soup(page, "lxml")
    # Here we get only the <a> tag inside the tbody and each tr
    # We avoid getting the links from the head of the table
    all_links = page_soup.select("tbody tr a")
    # We extract the href for only the links containing council (we don't care about the
    # video link)
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    #
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                # We join each page in the list into one string, separting them with a line return
                "PDF_text": "\n".join(pages_text),
            }
        )
        break
    break

# We create the data frame from the list of rows
df = pd.DataFrame(df_rows)
Outputs a dataframe like:
PDF_file_name PDF_text
0 spec20210729ag \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING...
...
Keep scraping the web, it's fun :)
The issue is that BeautifulSoup won't see any results besides those for the first page. BeautifulSoup is just an XML/HTML parser, it's not a headless browser or JavaScript-capable runtime environment that can run JavaScript asynchronously. When you make a simple HTTP GET request to your page, the response is an HTML document, in which the first page's results are directly baked into the HTML. These contents are baked into the document at the time the server served the document to you, so BeautifulSoup can see these elements no problem. All the other pages of results, however, are more tricky.
View the page in a browser. While logging your network traffic, click on the "next" button to view the next page's results. If you're filtering your traffic by XHR/Fetch requests only, you'll notice an HTTP POST request being made to an ASP.NET server, the response of which is HTML containing JavaScript containing JSON containing HTML. It's this nested HTML structure that represents the new content with which to update the table. Clicking this button doesn't actually take you to a different URL - the contents of the table simply change. The DOM is being updated/populated asynchronously using JavaScript, which is not uncommon.
The challenge, then, is to mimic these requests and parse the response to extract the HREFs of only those links in which you're interested. I would split this up into three distinct scripts:
1. One script to generate a .txt file of all sub-page URLs (these would be the URLs you navigate to when clicking links like "Agenda and Minutes", for example).
2. One script to read from that .txt file, make requests to each URL, and extract the HREF to the PDF on that page (if one is available). These direct URLs to PDFs will be saved in another .txt file.
3. A script to read from the PDF-URL .txt file, and perform PDF analysis.
You could combine scripts one and two if you really want to. I felt like splitting it up.
The first script makes an initial request to the main page to get some necessary cookies, and to extract a hidden input __OSVSTATE that's baked into the HTML which the ASP.NET server cares about in our future requests. It then simulates "clicks" on the "next" button by sending HTTP POST requests to a specific ASP.NET server endpoint. We keep going until we can't find a "next" button on the page anymore. It turns out there are around ~260 pages of results in total. For each of these 260 responses, we parse the response, pull the HTML out of it, and extract the HREFs. We only keep those tags whose HREF ends with the substring ".htm", and whose text contains the substring "minute" (case-insensitive). We then write all HREFs to a text file page_urls.txt. Some of these will be duplicated for some reason, and others end up being invalid links, but we'll worry about that later. Here's the entire generated text file.
def get_urls():
    import requests
    from bs4 import BeautifulSoup as Soup
    import datetime
    import re
    import json

    # Start by making the initial request to store the necessary cookies in a session
    # Also, retrieve the __OSVSTATE
    url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    session = requests.Session()
    response = session.get(url, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")
    osv_state = soup.select_one("input[id=\"__OSVSTATE\"]")["value"]

    # Get all results from all pages
    url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx"

    headers = {
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }

    payload = {
        "__EVENTTARGET": "LiverpoolTheme_wt93$block$wtMainContent$RichWidgets_wt132$block$wt28",
        "__AJAX": "980,867,LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28,745,882,0,277,914,760,"
    }

    while True:
        params = {
            "_ts": round(datetime.datetime.now().timestamp())
        }
        payload["__OSVSTATE"] = osv_state

        response = session.post(url, params=params, headers=headers, data=payload)
        response.raise_for_status()

        pattern = "OsJSONUpdate\\(({\"outers\":{[^\\n]+})\\)//\\]\\]"

        jsn = re.search(pattern, response.text).group(1)
        data = json.loads(jsn)

        osv_state = data["hidden"]["__OSVSTATE"]
        html = data["outers"]["LiverpoolTheme_wt93_block_wtMainContent_wtTblCommEventTable_Wrapper"]["inner"]

        soup = Soup(html, "html.parser")

        # Select only those a-tags whose href attribute ends with ".htm" and whose text contains the substring "minute"
        tags = soup.select("a[href$=\".htm\"]")
        hrefs = [tag["href"] for tag in tags if "minute" in tag.get_text().casefold()]
        yield from hrefs

        page_num = soup.select_one("a.ListNavigation_PageNumber").get_text()
        records_message = soup.select_one("div.Counter_Message").get_text()
        print("Page #{}:\n\tProcessed {}, collected {} URL(s)\n".format(page_num, records_message, len(hrefs)))

        if soup.select_one("a.ListNavigation_Next") is None:
            break

def main():
    with open("page_urls.txt", "w") as file:
        for url in get_urls():
            file.write(url + "\n")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The second script reads the output file of the previous one, and makes a request to each URL in the file. Some of these will be invalid, some need to be cleaned up in order to be used, many will be duplicates, some will be valid but won't contain a link to a PDF, etc. We visit each page and extract the PDF URL, and save each in a file. In the end I've managed to collect 287 usable PDF URLs. Here is the generated text file.
def get_pdf_url(url):
    import requests
    from bs4 import BeautifulSoup as Soup

    url = url.replace("/ctyclerk", "")
    base_url = url[:url.rfind("/")+1]

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return ""

    soup = Soup(response.content, "html.parser")

    pdf_tags = soup.select("a[href$=\".pdf\"]")
    tag = next((tag for tag in pdf_tags if "minute" in tag.get_text()), None)

    if tag is None:
        return ""

    return tag["href"] if tag["href"].startswith("http") else base_url + tag["href"]

def main():
    with open("page_urls.txt", "r") as file:
        page_urls = set(file.read().splitlines())

    with open("pdf_urls.txt", "w") as file:
        for count, pdf_url in enumerate(map(get_pdf_url, page_urls), start=1):
            if pdf_url:
                status = "Success"
                file.write(pdf_url + "\n")
                file.flush()
            else:
                status = "Skipped"
            print("{}/{} - {}".format(count, len(page_urls), status))
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The third script would read from the pdf_urls.txt file, make a request to each URL, and then interpret the response bytes as a PDF:
def main():
    import requests
    from io import BytesIO
    from PyPDF2 import PdfFileReader

    with open("pdf_urls.txt", "r") as file:
        pdf_urls = file.read().splitlines()

    for pdf_url in pdf_urls:
        response = requests.get(pdf_url)
        response.raise_for_status()

        content = BytesIO(response.content)
        reader = PdfFileReader(content)
        # do stuff with reader

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

Webscraping text is returning an empty set

The code is not scraping the text when using Beautiful Soup FindAll as it returns an empty set. There are other issues with the code after this but at this stage I am trying to solve the first problem. I am pretty new to this so I understand the code structure may be less than ideal. I come from a VBA background.
import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\Users\mmanenica\Documents\chromedriver.exe")

#click the search button on Austenders to return all Awarded Contracts
import time

#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')

#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    exit;

print('Search Button found')
Search_Results.click()

#Pause code to prevent blocking by website
time.sleep(1)

i = 0
Awarded = []
#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        exit;

    i = i + 1
    time.sleep(2)

    #Loop through all the Detail links on the current Search Results Page
    print("Checking search results page " + str(i))
    print(driver.current_url)
    soup = BeautifulSoup(driver.current_url, features='lxml')

    #Find all Contract detail links in the current search results page
    Details = soup.findAll('div', {'class': 'list-desc-inner'})
    for each_Contract in Details:
        #Loop through each Contract details link and scrape all the detailed
        #Contract information page
        Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href')
        driver.get(Details_Page)

        #Scrape all the data in the Awarded Contract page
        #r = requests.get(driver.current_url)
        soup = BeautifulSoup(driver.current_url, features='lxml')

        #find a list of all the Contract Info (contained in the the 'Contact Heading'
        #class of the span element)
        Contract = soup.find_all('span', {'class': 'Contact-Heading'})
        Contract_Info = [span.get_text() for span in Contract]

        #find a list of all the Summary Contract info which is in the text of\
        #the 'list_desc_inner' class
        Sub = soup.find_all('div', {'class': 'list_desc_inner'})
        Sub_Info = [div.get_text() for div in Sub]

        #Combine the lists into a unified list and append to the Awarded table
        Combined = [Contract_Info, Sub_Info]
        Awarded.append[Combined]

        #Go back to the Search Results page (from the Detailed Contract page)
        driver.back()

    #Go to the next Search Page by clicking on the Next button at the bottom of the page
    Next_Page.click()
    #
    time.sleep(3)

print(Awarded.Shape)
As stated, you are not actually feeding the HTML source into BeautifulSoup. So the first thing to change is: soup = BeautifulSoup(driver.current_url, features='lxml') to soup = BeautifulSoup(driver.page_source, features='lxml')
Second issue: for some of the elements there is no <a> tag with class=detail, so you won't be able to get the href from a NoneType. I added a try/except to skip over those cases (not sure if that gives your desired results though). You could also just get rid of that class and simply say Details_Page = each_Contract.find('a').get('href')
Next, that href is only the path of the URL; you need to prepend the root, so: driver.get('https://www.tenders.gov.au' + Details_Page)
I also do not see where you are referring to class=Contact-Heading.
You also refer to 'class': 'list-desc-inner' at one point, then 'class': 'list_desc_inner' at another. Again, I don't see a class=list_desc_inner
Next, to append a list to a list, you want Awarded.append(Combined), not Awarded.append[Combined]
I also added .strip() in there to clean up some of that white space in the text.
Anyways, there's a lot you need to fix and clean up, and I also don't know what your expected output should be. But hopefully this gets you started.
Also, as stated in the comments, you COULD just click the download button and get the results straight away, but maybe you're doing it the hard way to practice...
import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")

#click the search button on Austenders to return all Awarded Contracts
import time

#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')

#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    exit;

print('Search Button found')
Search_Results.click()

#Pause code to prevent blocking by website
time.sleep(1)

i = 0
Awarded = []
#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        exit;

    i = i + 1
    time.sleep(2)

    #Loop through all the Detail links on the current Search Results Page
    print("Checking search results page " + str(i))
    print(driver.current_url)
    soup = BeautifulSoup(driver.page_source, features='lxml')

    #Find all Contract detail links in the current search results page
    Details = soup.findAll('div', {'class': 'list-desc-inner'})
    for each_Contract in Details:
        #Loop through each Contract details link and scrape all the detailed
        #Contract information page
        try:
            Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href')
            driver.get('https://www.tenders.gov.au' + Details_Page)

            #Scrape all the data in the Awarded Contract page
            #r = requests.get(driver.current_url)
            soup = BeautifulSoup(driver.page_source, features='lxml')

            #find a list of all the Contract Info (contained in the the 'Contact Heading'
            #class of the span element)
            Contract = soup.find_all('span', {'class': 'Contact-Heading'})
            Contract_Info = [span.text.strip() for span in Contract]

            #find a list of all the Summary Contract info which is in the text of\
            #the 'list_desc_inner' class
            Sub = soup.find_all('div', {'class': 'list-desc-inner'})
            Sub_Info = [div.text.strip() for div in Sub]

            #Combine the lists into a unified list and append to the Awarded table
            Combined = [Contract_Info, Sub_Info]
            Awarded.append(Combined)

            #Go back to the Search Results page (from the Detailed Contract page)
            driver.back()
        except:
            continue

    #Go to the next Search Page by clicking on the Next button at the bottom of the page
    Next_Page.click()
    #
    time.sleep(3)

driver.close()
print(Awarded.Shape)

How to gather entire source of web page (Source only shows top 10 X.)

I'm trying to create a program that will go through a bunch of tumblr photos and extract the username of the person who uploaded them.
http://www.tumblr.com/tagged/food
If you look here, you can see multiple pictures of food with multiple different uploaders. If you scroll down you will begin to see even more pictures with even more uploaders. If you right click in your browser to view the source, and search "username", however, it will only yield 10 results. Every time, no matter how far down you scroll.
Is there any way to counter this and have instead have it display the entire source for all images, or for X amount of images, or for however far you scrolled?
Here is my code to show what I'm doing:
#Imports
import requests
from bs4 import BeautifulSoup
import re

#Start of code
r = requests.get('http://www.tumblr.com/tagged/skateboard')
page = r.content
soup = BeautifulSoup(page)
soup.prettify()
arrayDiv = []
for anchor in soup.findAll("div", { "class" : "post_info" }):
    anchor = str(anchor)
    tempString = anchor.replace('</a>:', '')
    tempString = tempString.replace('<div class="post_info">', '')
    tempString = tempString.replace('</div>', '')
    tempString = tempString.split('>')
    newString = tempString[1]
    newString = newString.strip()
    arrayDiv.append(newString)
print arrayDiv
I had solved a similar problem using BeautifulSoup. What I did was loop through the paged pages, checking with BeautifulSoup whether there is a "continue" element; here (on the Tumblr page), for example, that is an element with the id "next_page_link".
If there is one, I would loop the photo-scraping code while changing the URL fetched by requests. You would need to encapsulate all the code in a function, of course.
Good luck.
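A minimal sketch of that idea, reusing the parsing from the question as a helper function (the next_page_link id and the relative href format are assumptions taken from this answer, so verify them against the actual page source):

import requests
from bs4 import BeautifulSoup

def scrape_post_info(soup):
    # Placeholder for the post_info parsing shown in the question; returns the usernames
    return [div.get_text().strip() for div in soup.findAll("div", {"class": "post_info"})]

url = 'http://www.tumblr.com/tagged/skateboard'
usernames = []
while url:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    usernames.extend(scrape_post_info(soup))
    # Follow the "next page" link if BeautifulSoup finds one, otherwise stop
    next_link = soup.find(id="next_page_link")
    # Assumes the href is a relative path like /tagged/skateboard/page/2
    url = 'http://www.tumblr.com' + next_link['href'] if next_link else None

print(usernames)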
