Scraping flex tag - python

I would like to scrape the paragraphs of a web site with BeautifulSoup, but there are flex boxes in the web page, so the program can't find the chosen tag.
import requests
from bs4 import BeautifulSoup as bs


def content_article(url, file_output):
    """scrape content web page in a file and the plain code
    url: address of web page of international federation of canoe
    file_output: file name created + plain file name
    return two files: file with HTML code and file with only text information
    """
    response = requests.get(url)
    data = response.content
    soup = bs(data, features="html.parser")
    plain_soup = soup.encode("UTF-8")
    section = soup.find("div", {"class": "container"})
    print(section)
    paragraphes = section.find_all("p")
    result = ""
    for paragraphe in paragraphes:
        print("paragraphe")
        print(paragraphe)
        result = result + paragraphe.text + "\n"
    print("result")
    print(result)
    url_file = file_output + ".txt"
    file = open(url_file, 'w', encoding="utf_8")
    file.write("infos provenant de" + url + "\n")
    file.write(result)
    file.close()
    url_plain_file = file_output + "_plain.txt"
    plain_file = open(url_plain_file, 'w')
    plain_file.write(str(plain_soup))
    plain_file.close()
    print("the file " + file_output + " has been created")
Example URL: https://www.fifa.com/about-fifa/president/news/gianni-infantino-congratulates-shaikh-salman-on-re-election-as-afc-president
The program can't find the tag "container" because it is in a flex tag.
I tried to use Selenium but I couldn't find the "activated" flex box.

The program can't find the tag "container" because it is in a flex tag.
You don't need to worry about the flexboxes, as they won't affect the presence of HTML elements: flexbox is just a CSS layout model used to make laying out a page easier.
I'm not sure what you mean by a "container" tag, as I don't believe this tag type exists in HTML; it looks like you mean a div with the class container.
I tried to use Selenium but I couldn't find the "activated" flex box.
It works with selenium. Perhaps you were running it headless or altered some of the options?
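To illustrate the point that flexbox is only CSS: find() still locates elements nested inside a flex wrapper, as long as they are present in the HTML you fetched. A minimal, self-contained sketch (not the FIFA page, just inline HTML):

from bs4 import BeautifulSoup

html = """
<div style="display: flex">
  <div class="container"><p>Hello</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# The flex wrapper does not hide the div or its paragraphs from the parser
print(soup.find("div", {"class": "container"}).find_all("p"))  # [<p>Hello</p>]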
Solution
Try this. I split up the different subtasks into individual functions and then made
one function (i.e. main) that combines them. It produces your desired output, namely 2 text files: one with the html elements (i.e. p tags) that contain the text and one with only the text.
Edit: I just realized that p tags with classes p-large ff-text-grey-slate are also used to store the 3 bullet points at the top of the article and 2 empty lines at the bottom of the article. Thus, the final output will contain these values.
Due to the website's HTML structure I'm not sure how to solve this simply. Try to find any patterns that you can exploit to keep only the p tags that contain the values you want (a rough filtering sketch follows the code below).
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep


def get_page_source(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        sleep(3)  # give the JavaScript time to render
        return driver.page_source
    finally:
        driver.quit()


def store_elements(outpath, p_tags):
    print(p_tags)
    with open(outpath, mode='w') as file:
        # Tag objects must be converted to strings before writing
        file.writelines(str(tag) + '\n' for tag in p_tags)


def store_texts(outpath, texts):
    with open(outpath, mode='w') as file:
        file.writelines(text + '\n' for text in texts)


def get_elements(page_source, tag_name, attr):
    soup = BeautifulSoup(page_source, 'html.parser')
    return soup.find_all(tag_name, attr)


def get_text_from_elements(elements):
    return [element.text for element in elements]


def main(html_path, text_path, url):
    pg_source = get_page_source(url)
    p_tags = get_elements(pg_source, 'p', {'class': 'p-large ff-text-grey-slate'})
    texts = get_text_from_elements(p_tags)
    store_elements(html_path, p_tags)
    store_texts(text_path, texts)


if __name__ == '__main__':
    url = "https://www.fifa.com/about-fifa/president/news/gianni-infantino-congratulates-shaikh-salman-on-re-election-as-afc-president"
    # enter 2 paths, one for the html and other for the paragraphs (i.e. texts)
    main('', '', url)
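If the bullet-point paragraphs and empty lines mentioned in the edit above are a problem, one rough option is to clean the texts before storing them. This is only a sketch; the assumption that the first three entries are the bullet points comes from the article layout described above and should be checked against the page:

def filter_texts(texts, leading_to_skip=3):
    """Drop empty paragraphs and, optionally, the leading bullet points."""
    cleaned = [t.strip() for t in texts if t.strip()]
    return cleaned[leading_to_skip:]

# e.g. texts = filter_texts(get_text_from_elements(p_tags)) before calling store_texts()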

Related

Web Scraping Emails using Python

I'm new to web scraping (using Python) and encountered a problem trying to get an email from a university's athletic department site.
I've managed to get to navigate to the email I want to extract but don't know where to go from here. When I print what I have, all I get is '' and not the actual text of the email.
I'm attaching what I have so far, let me know if it needs a better explanation.
Here's a link to an image of what I'm trying to scrape,
and the website: https://goheels.com/staff-directory
Thanks!
Here's my code:
from bs4 import BeautifulSoup
import requests

urls = ''
with open('websites.txt', 'r') as f:
    for line in f.read():
        urls += line

urls = list(urls.split())
print(urls)

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    try:
        body = soup.find(headers="col-staff_email category-0")
        links = body.a
        print(links)
    except Exception as e:
        print(f'"This url didn\'t work:" {url}')
The emails are hidden inside a <script> element. With a little pushing, shoving, css selecting and string splitting you can get there:
for em in soup.select('td[headers*="col-staff_email"] script'):
    target = em.text.split('var firstHalf = "')[1]
    fh = target.split('";')[0]
    lh = target.split('var secondHalf = "')[1].split('";')[0]
    print(fh + '#' + lh)
Output:
bubba.cunningham#unc.edu
molly.dalton#unc.edu
athgallo#unc.edu
dhollier#unc.edu
etc.
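For reference, the splits above assume the inline script has roughly the shape below. This is only a hypothetical illustration: the variable names firstHalf/secondHalf come from the answer's code, the address is made up.

# Hypothetical inline-script content, just to show what the string splits do.
script_text = 'var firstHalf = "jane.doe"; var secondHalf = "unc.edu";'

target = script_text.split('var firstHalf = "')[1]
fh = target.split('";')[0]                                  # 'jane.doe'
lh = target.split('var secondHalf = "')[1].split('";')[0]   # 'unc.edu'
print(fh + '#' + lh)                                        # jane.doe#unc.edu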

How to download all the href (pdf) inside a class with python beautiful soup?

I have around 900 pages and each page contains 10 buttons (each button has a PDF). I want to download all the PDFs - the program should browse through all the pages and download the PDFs one by one.
My code only searches for links ending in .pdf, but the hrefs on this site don't end in .pdf. The page number (page_no) runs from 1 to 900, e.g.:
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
This is the website, and below is an example of one of the links:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
You only need the href associated with the links you call buttons, then prefix it with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is anchor (a) tags with direct parent element having class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest having a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_". You simply add to this dict during your loop over the desired number of pages.
An example of some of the dictionary entries:
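For illustration only, the dict ends up shaped roughly like this. The key below is the bid number shown in the question and the document path is the example path mentioned in the other answer; real values will differ:

pdf_links = {
    'GEM_2021_B_1804626': 'https://bidplus.gem.gov.in/showbidDocument/2993132',
    # ... one entry per bid link collected while looping over the pages
}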
As there are over 800 pages I have chosen to add in an additional termination page count variable called end_number. I don't want to loop to all pages so this allows me an early exit. You can remove this param if so desired.
Next, you need to determine the actual number of pages. For this you can use the following css selector to get the Last pagination link and then extract its data-ci-pagination-page value and convert to integer. This can then be the num_pages (number of pages) to terminate your loop at:
.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements have a shared parent with class pagination i.e. the anchor tag in the last li, which is the last page link in the pagination element.
Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop the key, value pairs and issue requests for the content. Write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary to optimize what is an I/O-bound process, then loop over that dictionary writing to disk, perhaps with a multi-processing approach for the more CPU-bound part (a thread-pool sketch follows the code below).
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically be making something like (836 * 10) + 836 requests.
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
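On the TODO above, here is a minimal sketch of one way to overlap the downloads. requests itself has no async API, so this uses a thread pool for the I/O-bound part rather than asyncio; it assumes the same pdf_links dict and path variable as the code above:

from concurrent.futures import ThreadPoolExecutor

import requests


def download(item):
    # Fetch one PDF and write it to disk; return the name for progress logging.
    name, link = item
    content = requests.get(link).content
    with open(f'{path}/{name}.pdf', 'wb') as f:
        f.write(content)
    return name


# A handful of workers keeps requests overlapping without hammering the site;
# tune max_workers (and consider adding delays between requests) as needed.
with ThreadPoolExecutor(max_workers=5) as executor:
    for name in executor.map(download, pdf_links.items()):
        print('saved', name)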
The site doesn't load for many people (including me), but you provided examples of the HTML, so I hope this will help you:
url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')

for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        with open('pdf_name_here.pdf', 'wb') as f:
            # if you have a full link
            href = pdf.get('href')
            # if you have a link without the full path, like /showbidDocument/2993132
            # href = url + pdf.get('href')
            response = requests.get(href)
            f.write(response.content)
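One caveat with the snippet above: the fixed name 'pdf_name_here.pdf' means each iteration overwrites the previous file. Borrowing the naming idea from the question's own code, a sketch of one fix (os and folder_location come from the question's code, href from the loop above):

# e.g. an href like /showbidDocument/2993132 becomes 2993132.pdf
filename = os.path.join(folder_location, href.split('/')[-1] + '.pdf')
with open(filename, 'wb') as f:
    f.write(requests.get(href).content)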

Scraping PDFs from multiple pages using bs4

I'm a python beginner and I'm hoping that what I'm trying to do isn't too involved. Essentially, I want to extract the text of the minutes (contained in PDF documents) from this municipality's council meetings for the last ~10 years at this website: https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3
Eventually, I want to analyze/categorise the action items from the meeting minutes. All I've been able to do so far is grab the links leading to the PDFs from the first page. Here is my code:
# Import requests for navigating to websites, beautiful soup to scrape website, PyPDF2 for PDF data mining
import sys
import requests
import bs4
import PyPDF2
#import PDfMiner
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Soupify URL
my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
result = requests.get(my_url)
src = result.content
page_soup = soup(src, "lxml")

# list with links
urls = []
for tr_tag in page_soup.find_all("tr"):
    a_tag = tr_tag.find("a")
    urls.append(a_tag.attrs["href"])
print(urls)
A few things I could use help with:
How do I pull the links from pages 1-50 (an arbitrary number) of the 'Previous Meetings' site, instead of just the first page?
How do I go about entering each of the links and pulling the 'Read the minutes' PDFs for text analysis (using PyPDF2)?
Any help is so appreciated! Thank you in advance!
EDIT: I am hoping to get the data into a dataframe, where the first column is the file name and the second column is the text from the PDF. It would look like:
PDF_file_name    PDF_text
spec20210729min  [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min  [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
Welcome to the exciting world of web scraping!
First of all, great job, you were on the right track.
There are a few points to discuss though.
You essentially have 2 problems here.
1 - How to retrieve the HTML text for all pages (1, ..., 50)?
In web scraping you mainly have two kinds of web pages:
If you are lucky, the page does not render using JavaScript and you can use requests alone to get the page content
If you are less lucky, the page uses JavaScript to render partly or entirely
To get all the pages from 1 to 50, we need to somehow click on the button next at the end of the page.
Why?
If you check what happens in the network tab of the browser developer console, you see that a new request fetching a JS script to generate the page is made for each click on the next button.
Unfortunately, we can't render JavaScript using requests.
But we have a solution: Headless Browsers (wiki).
In the solution, I use selenium, which is a library that can use a real browser driver (in our case Chrome) to query a page and render JavaScript.
So we first get the web page with selenium, we extract the HTML, we click on next and wait a bit for the page to load, we extract the HTML, ... and so on.
2 - How to extract the text from the PDFs after getting them?
After downloading the PDFs, we can load each one into a variable, then open it with PyPDF2 and extract the text from all its pages. I'll let you look at the solution code.
Here is a working solution. It will iterate over the first n pages you want and return the text from all the PDF you are interested in:
import os
import time
from io import BytesIO
from urllib.parse import urljoin

import pandas as pd
import PyPDF2
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create a headless chromedriver to query and perform action on webpages like a browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Main url
my_url = (
    "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
)


def get_n_first_pages(n: int):
    """Get the html text for the first n pages

    Args:
        n (int): The number of pages we want

    Returns:
        List[str]: A list of html text
    """
    # Initialize the variables containing the pages
    pages = []
    # We query the web page with our chrome driver.
    # This way we can iteratively click on the next link to get all the pages we want
    driver.get(my_url)
    # We append the page source code
    pages.append(driver.page_source)
    # Then for all subsequent pages, we click on next and wait to get the page
    for _ in range(1, n):
        driver.find_element_by_css_selector(
            "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
        ).click()
        # Wait for the page to load
        time.sleep(1)
        # Append the page
        pages.append(driver.page_source)
    return pages


def get_pdf(link: str):
    """Get the pdf text, per PDF pages, for a given link.

    Args:
        link (str): The link where we can retrieve the PDF

    Returns:
        List[str]: A list containing a string per PDF pages
    """
    # We extract the file name
    pdf_name = link.split("/")[-1].split(".")[0]
    # We get the page containing the PDF link
    # Here we don't need the chrome driver since we don't have to click on the link
    # We can just get the PDF using requests after finding the href
    pdf_link_page = requests.get(link)
    page_soup = soup(pdf_link_page.text, "lxml")
    # We get all <a> tag that have href attribute, then we select only the href
    # containing min.pdf, since we only want the PDF for the minutes
    pdf_link = [
        urljoin(link, l.attrs["href"])
        for l in page_soup.find_all("a", {"href": True})
        if "min.pdf" in l.attrs["href"]
    ]
    # There is only one PDF for the minutes so we get the only element in the list
    pdf_link = pdf_link[0]
    # We get the PDF with requests and then get the PDF bytes
    pdf_bytes = requests.get(pdf_link).content
    # We load the bytes into an in memory file (to avoid saving the PDF on disk)
    p = BytesIO(pdf_bytes)
    p.seek(0, os.SEEK_END)
    # Now we can load our PDF in PyPDF2 from memory
    read_pdf = PyPDF2.PdfFileReader(p)
    count = read_pdf.numPages
    pages_txt = []
    # For each page we extract the text
    for i in range(count):
        page = read_pdf.getPage(i)
        pages_txt.append(page.extractText())
    # We return the PDF name as well as the text inside each pages
    return pdf_name, pages_txt


# Get the first 2 pages, you can change this number
pages = get_n_first_pages(2)

# Initialize a list to store each dataframe rows
df_rows = []

# We iterate over each page
for page in pages:
    page_soup = soup(page, "lxml")
    # Here we get only the <a> tag inside the tbody and each tr
    # We avoid getting the links from the head of the table
    all_links = page_soup.select("tbody tr a")
    # We extract the href for only the links containing council (we don't care about the
    # video link)
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    #
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                # We join each page in the list into one string, separating them with a line return
                "PDF_text": "\n".join(pages_text),
            }
        )
        break
    break

# We create the data frame from the list of rows
df = pd.DataFrame(df_rows)
Outputs a dataframe like:
PDF_file_name PDF_text
0 spec20210729ag \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING...
...
Keep scraping the web, it's fun :)
The issue is that BeautifulSoup won't see any results besides those for the first page. BeautifulSoup is just an XML/HTML parser, it's not a headless browser or JavaScript-capable runtime environment that can run JavaScript asynchronously. When you make a simple HTTP GET request to your page, the response is an HTML document, in which the first page's results are directly baked into the HTML. These contents are baked into the document at the time the server served the document to you, so BeautifulSoup can see these elements no problem. All the other pages of results, however, are more tricky.
View the page in a browser. While logging your network traffic, click on the "next" button to view the next page's results. If you're filtering your traffic by XHR/Fetch requests only, you'll notice an HTTP POST request being made to an ASP.NET server, the response of which is HTML containing JavaScript containing JSON containing HTML. It's this nested HTML structure that represents the new content with which to update the table. Clicking this button doesn't actually take you to a different URL - the contents of the table simply change. The DOM is being updated/populated asynchronously using JavaScript, which is not uncommon.
The challenge, then, is to mimic these requests and parse the response to extract the HREFs of only those links in which you're interested. I would split this up into three distinct scripts:
1. One script to generate a .txt file of all sub-page URLs (these would be the URLs you navigate to when clicking links like "Agenda and Minutes").
2. One script to read from that .txt file, make requests to each URL, and extract the HREF to the PDF on that page (if one is available). These direct URLs to PDFs will be saved in another .txt file.
3. A script to read from the PDF-URL .txt file and perform PDF analysis.
You could combine scripts one and two if you really want to. I felt like splitting it up.
The first script makes an initial request to the main page to get some necessary cookies, and to extract a hidden input __OSVSTATE that's baked into the HTML, which the ASP.NET server cares about in our future requests. It then simulates "clicks" on the "next" button by sending HTTP POST requests to a specific ASP.NET server endpoint. We keep going until we can't find a "next" button on the page anymore. It turns out there are around ~260 pages of results in total. For each of these 260 responses, we parse the response, pull the HTML out of it, and extract the HREFs. We only keep those tags whose HREF ends with the substring ".htm", and whose text contains the substring "minute" (case-insensitive). We then write all HREFs to a text file page_urls.txt. Some of these will be duplicated for some reason, and others end up being invalid links, but we'll worry about that later. Here's the entire generated text file.
def get_urls():
    import requests
    from bs4 import BeautifulSoup as Soup
    import datetime
    import re
    import json

    # Start by making the initial request to store the necessary cookies in a session
    # Also, retrieve the __OSVSTATE
    url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    session = requests.Session()
    response = session.get(url, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")
    osv_state = soup.select_one("input[id=\"__OSVSTATE\"]")["value"]

    # Get all results from all pages
    url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx"
    headers = {
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }
    payload = {
        "__EVENTTARGET": "LiverpoolTheme_wt93$block$wtMainContent$RichWidgets_wt132$block$wt28",
        "__AJAX": "980,867,LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28,745,882,0,277,914,760,"
    }

    while True:
        params = {
            "_ts": round(datetime.datetime.now().timestamp())
        }
        payload["__OSVSTATE"] = osv_state

        response = session.post(url, params=params, headers=headers, data=payload)
        response.raise_for_status()

        pattern = "OsJSONUpdate\\(({\"outers\":{[^\\n]+})\\)//\\]\\]"
        jsn = re.search(pattern, response.text).group(1)
        data = json.loads(jsn)

        osv_state = data["hidden"]["__OSVSTATE"]
        html = data["outers"]["LiverpoolTheme_wt93_block_wtMainContent_wtTblCommEventTable_Wrapper"]["inner"]
        soup = Soup(html, "html.parser")

        # Select only those a-tags whose href attribute ends with ".htm" and whose text contains the substring "minute"
        tags = soup.select("a[href$=\".htm\"]")
        hrefs = [tag["href"] for tag in tags if "minute" in tag.get_text().casefold()]
        yield from hrefs

        page_num = soup.select_one("a.ListNavigation_PageNumber").get_text()
        records_message = soup.select_one("div.Counter_Message").get_text()
        print("Page #{}:\n\tProcessed {}, collected {} URL(s)\n".format(page_num, records_message, len(hrefs)))

        if soup.select_one("a.ListNavigation_Next") is None:
            break


def main():
    with open("page_urls.txt", "w") as file:
        for url in get_urls():
            file.write(url + "\n")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
The second script reads the output file of the previous one, and makes a request to each URL in the file. Some of these will be invalid, some need to be cleaned up in order to be used, many will be duplicates, some will be valid but won't contain a link to a PDF, etc. We visit each page and extract the PDF URL, and save each in a file. In the end I've managed to collect 287 usable PDF URLs. Here is the generated text file.
def get_pdf_url(url):
    import requests
    from bs4 import BeautifulSoup as Soup

    url = url.replace("/ctyclerk", "")
    base_url = url[:url.rfind("/")+1]

    headers = {
        "user-agent": "Mozilla/5.0"
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return ""

    soup = Soup(response.content, "html.parser")

    pdf_tags = soup.select("a[href$=\".pdf\"]")
    tag = next((tag for tag in pdf_tags if "minute" in tag.get_text()), None)

    if tag is None:
        return ""

    return tag["href"] if tag["href"].startswith("http") else base_url + tag["href"]


def main():
    with open("page_urls.txt", "r") as file:
        page_urls = set(file.read().splitlines())

    with open("pdf_urls.txt", "w") as file:
        for count, pdf_url in enumerate(map(get_pdf_url, page_urls), start=1):
            if pdf_url:
                status = "Success"
                file.write(pdf_url + "\n")
                file.flush()
            else:
                status = "Skipped"
            print("{}/{} - {}".format(count, len(page_urls), status))
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
The third script would read from the pdf_urls.txt file, make a request to each URL, and then interpret the response bytes as a PDF:
def main():
    import requests
    from io import BytesIO
    from PyPDF2 import PdfFileReader

    with open("pdf_urls.txt", "r") as file:
        pdf_urls = file.read().splitlines()

    for pdf_url in pdf_urls:
        response = requests.get(pdf_url)
        response.raise_for_status()

        content = BytesIO(response.content)
        reader = PdfFileReader(content)
        # do stuff with reader

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
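As a sketch of the "do stuff with reader" step, one possibility is to extract the per-page text the same way the earlier answer does with PyPDF2, and collect it per file. These lines would sit where the "# do stuff with reader" comment is:

# Sketch only: pull each page's text using the same PyPDF2 calls as the
# earlier answer (getPage / extractText), then join them per document.
pages_text = [reader.getPage(i).extractText() for i in range(reader.numPages)]
full_text = "\n".join(pages_text)
# e.g. collect (pdf_url, full_text) pairs here for a later dataframe, as in the first answer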

How to access and parse the source frame in a python web scraping application on a classic asp website

I'm trying to create a web-crawler application that downloads a bunch of assorted PDFs from SAI using this tutorial as my basis
https://www.youtube.com/watch?v=sVNJOiTBi_8
I've tried get(url) on the website I believe I want to get the source from
Authentication code (not included) ....
def subscription_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.saiglobal.com/online/Script/listvwstds.asp?TR=' + str(page)
        source_code = session.get(url)  # need to get the frame source!!
        # extra code to find the href for the frame then get the frame source
        # https://www.saiglobal.com/online/Script/ListVwStds.asp
        plain_text = source_code.text  # source_code.text printed is not the one we want
        playFile = open('source_code.text', 'wb')
        for chunk in r.iter_content(100000):
            playFile.write(chunk)
        playFile.close()
        soup = BeautifulSoup(source_code.content, features='html.parser')
        for link in soup.findAll('a', {'class': 'stdLink'}):  # can't find the std link
            href = "https://www.saiglobal.com/online/" + link.get('href')
            '''
            get_pdf(href)
            '''
            print(href)
        page += 1

'''
# function will probably give a bad name to the files but can fix this later
def get_pdf(std_url):
    source_code = session.get(std_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.select("a[href$='.pdf']"):
        # Name the pdf files using the last portion of each link which are unique in this case
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(session.get(urljoin(url, link['href'])).content)
'''

r = session.get("https://www.saiglobal.com/online/")
print(r.status_code)

subscription_spider(1)

r = session.get("https://www.saiglobal.com/online/Script/Logout.asp")  # not sure if this logs out or not
print(r.status_code)
The text file creates this
<frameset rows="*, 1">
<frame SRC="Script/Login.asp?">
<frame src="Script/Check.asp" noresize>
</frameset>
but when I inspect the element, what I want is in a frame source which I can't directly access. I think the issue has something to do with the structure of classic ASP pages, but I'm not really sure what to do.
The output of the program is
200
200
press any key to continue...
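For what it's worth, a minimal sketch of the approach hinted at by the commented-out lines in the question ("find the href for the frame then get the frame source"): parse the frameset, take the frame's SRC attribute, and request that URL with the same session. The function name and fallback behaviour are assumptions, and this hasn't been tested against the site:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_frame_source(session, url):
    # Fetch a frameset page, then fetch and return the HTML of its first frame.
    outer = session.get(url)
    outer_soup = BeautifulSoup(outer.text, features='html.parser')
    frame = outer_soup.find('frame')  # e.g. <frame SRC="Script/Login.asp?">
    if frame is None:
        return outer.text  # no frameset, the page itself is the content
    frame_url = urljoin(url, frame.get('src'))
    return session.get(frame_url).text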

Python Crawler is ignoring links on page

So I wrote a crawler for my friend that will go through a large list of web pages that are search results, pull all the links off the page, check if they're in the output file and add if they're not there. It took a lot of debugging but it works great! Unfortunately, the little bugger is really picky about which anchored tags it deems important enough to add.
Here's the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin  # urljoin is a function that's included in urlparse
import urllib2
import requests  # not necessary but keeping here in case of additions to the code in future

urls_filename = "myurls.txt"    # this is the input text file, list of urls or objects to scan
output_filename = "output.txt"  # this is the output file that you will export to Excel
keyword = "skin"                # optional keyword, not used for this script. Ignore

with open(urls_filename, "r") as f:
    url_list = f.read()  # This command opens the input text file and reads the information inside it

with open(output_filename, "w") as f:
    for url in url_list.split("\n"):  # This command splits the text file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'}  # This (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url)  # tells program to open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read()  # this assigns a variable to the open page. like algebra, X=page opened
        soup = BeautifulSoup(page)  # we are feeding the variable to BeautifulSoup so it can analyze it
        urls_all = soup('a')  # beautiful soup is analyzing all the 'anchored' links in the page
        for link in urls_all:
            if('href' in dict(link.attrs)):
                url = urljoin(url, link['href'])  # this combines the relative link e.g. "/support/contactus.html" and adds to domain
                if url.find("'")!=-1: continue  # explicit statement that the value is not void. if it's NOT void, continue
                url=url.split('#')[0]
                if (url[0:4] == 'http' and url not in output_filename):  # this checks if the item is a webpage and if it's already in the list
                    f.write(url + "\n")  # if it's not in the list, it writes it to the output_filename
It works great except for the following link:
https://research.bidmc.harvard.edu/TVO/tvotech.asp
This link has a number of links like "tvotech.asp?Submit=List&ID=796" and it's straight up ignoring them. The only anchor that goes into my output file is the main page itself. It's bizarre because, looking at the source code, their anchors are pretty standard.
They have 'a' and 'href', I see no reason bs4 would just pass them and only include the main link. Please help. I've tried removing http from line 30 or changing it to https and that just removes all the results, not even the main page comes into the output.
That's because one of the links there has a mailto: in its href; it then gets assigned to the url variable and breaks the rest of the links as well, because they don't pass the url[0:4] == 'http' condition. It looks like this:
mailto:research#bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out (see the sketch after the code below) or not reuse the same url variable in the loop; note the change to url1:
for link in urls_all:
    if('href' in dict(link.attrs)):
        url1 = urljoin(url, link['href'])  # this combines the relative link e.g. "/support/contactus.html" and adds to domain
        if url1.find("'")!=-1: continue  # explicit statement that the value is not void. if it's NOT void, continue
        url1=url1.split('#')[0]
        if (url1[0:4] == 'http' and url1 not in output_filename):  # this checks if the item is a webpage and if it's already in the list
            f.write(url1 + "\n")  # if it's not in the list, it writes it to the output_filename
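The filter-it-out alternative is a small guard at the top of the loop body; a sketch meant to be dropped into the asker's original loop (urls_all, url, output_filename and f come from that code):

for link in urls_all:
    if 'href' in dict(link.attrs):
        href = link['href']
        # skip mailto: links so they never overwrite the url variable
        if href.startswith('mailto:'):
            continue
        url1 = urljoin(url, href).split('#')[0]
        if url1[0:4] == 'http' and url1 not in output_filename:
            f.write(url1 + "\n")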
