Finding the correct elements for scraping a website - python

I am trying to scrape only certain articles from this main page. More specifically, I only want articles from the sub-page Media and from the sub-sub-pages Press releases, Governing Council decisions, Press conferences, Monetary policy accounts, Speeches and Interviews, and only those that are in English.
Based on some tutorials and other Stack Overflow answers, I managed to put together code that scrapes absolutely everything from the website. My original idea was to scrape everything and just clean the output later in a data frame, but the website contains so much that the script always freezes after some time.
Getting the sub-links:
import requests
import re
from bs4 import BeautifulSoup

master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = []
sub_links = {}

for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)
    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []
    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t" + sub_href)
One thing I tried was to change the base link to the sub-links; my idea was that maybe I could do it separately for every sub-page and put the links together later, but that did not work. Another thing I tried was to replace the line that collects the sub-page anchors with the following:
sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)
This seemed to partially solve the problem: even though it did not return only links from the sub-pages, it at least ignored anchors that are not 'doc-title' (which are all the text links on the website). But it was still too much, and some links were not retrieved correctly.
I also tried the following:
for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
        print(master_href)
I thought that, because all hrefs to English documents have .en somewhere in them, this would give me only the links where .en occurs in the href, but this code gives me a syntax error at print(master_href), which I don't understand because print(master_href) worked before.
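For clarity, the filter I was aiming for would be something like the following list comprehension (untested, using the variables from my code above):
# keep only hrefs that contain ".en" and prefix them with the base URL
english_links = [base_url + master_atag.get('href')
                 for master_atag in master_atags
                 if ".en" in master_atag.get('href')]
print(english_links)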
Next, I want to extract the following information from the sub-links. This part of the code works when I test it on a single link, but I never had the chance to try it on top of the code above since that never finishes running. Will this work once I manage to get the proper list of all links?
for link in sub_links:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]

import pandas as pd
ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "Date": [datadate]})
Going back to the scraping: since the first approach with Beautiful Soup did not work for me, I also tried to approach the problem differently. The website has RSS feeds, so I wanted to use the following code:
import feedparser
from pandas.io.json import json_normalize
import pandas as pd
import requests

rss_url = 'https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
df_ecb_feed = json_normalize(ecb_feed.entries)
df_ecb_feed.head()
Here I ran into the problem of not even being able to find the RSS feed URL in the first place. I viewed the page source, searched for "RSS", and tried all the URLs I could find that way, but I always get an empty dataframe.
I am a beginner at web scraping, and at this point I don't know how to proceed or how to approach this problem. In the end, what I want to accomplish is to collect all articles from the sub-pages, with their titles, dates and authors, and put them into one dataframe.

The biggest problem you have with scraping this site is probably the lazy loading: using JavaScript, they load the articles from several HTML pages and merge them into the list. For details, look for index_include in the page source. This is problematic for scraping with only requests and BeautifulSoup, because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:
Instead of the main article list pages (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g., /press/pr/date/2019/html/index_include.en.html (a sketch follows below this list). This is probably the easier option, but you have to repeat it for each year you're interested in.
Use a client that can execute JavaScript like Selenium to obtain the HTML instead of requests.
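For option 1, a quick sketch with plain requests and BeautifulSoup could look like the following; it reuses the span.doc-title selector from the Selenium example further down and assumes the lazy-loaded yearly list has the same structure:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ecb.europa.eu'
# one of the lazy-loaded yearly lists mentioned in option 1
index_url = f'{base_url}/press/pr/date/2019/html/index_include.en.html'
soup = BeautifulSoup(requests.get(index_url).content, 'html.parser')

# collect the absolute article URLs from the list
article_urls = [base_url + a['href'] for a in soup.select('span.doc-title > a[href]')]
print(article_urls[:5])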
Apart from that, I would suggest using CSS selectors for extracting information from the HTML. This way, you only need a few lines for the article extraction. Also, I don't think you have to filter for English articles if you use the index.en.html page for scraping, because it shows English by default and, additionally, other languages if available.
Here's an example I quickly put together; it can certainly be optimized, but it shows how to load the pages with Selenium and extract the article URLs and article contents:
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]

driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')
I get the following output for the Press Releases page:
title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board
date: 20 December 2019
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...
title: Monetary policy decisions
date: 12 December 2019
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...

Related

How to specify needed fields using Beautiful Soup and properly call upon website elements using HTML tags

I have been trying to create a web-scraping program that returns the values of the Title, Company, and Location from job cards on Indeed. I am finally not getting errors anymore; however, I only get one value for each of the desired fields even though there are multiple job cards I am trying to read. Also, the company field returns the value of the title field, because a span tag is used there as well in the HTML code. I am unfamiliar with HTML and with how to specify what I need using Beautiful Soup. I have tried to use the documentation and played with a few different methods to solve this problem, but have been unsuccessful.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://au.indeed.com/jobs?q=web%20developer&l=perth&from=searchOnHP&vjk=6c6cd45320143cdf").text
soup = BeautifulSoup(page, "lxml")

results = soup.find(id="mosaic-zone-jobcards")
job_elements = results.find("div", class_="slider_container")

for job_element in job_elements:
    title = job_element.find("h2")
    company = job_element.find("span")
    location = job_element.find("div", class_="companyLocation")

print(title.text)
print(company.text)
print(location.text)
Here is what is returned to the console:
C:\Users\Admin\PycharmProjects\WebScraper1\venv\Scripts\python.exe
C:/Users/Admin/PycharmProjects/WebScraper1/Indeed.py
Web App Developer
Web App Developer
Perth WA 6000
Process finished with exit code 0
job_elements only contains the first matching element because you used find instead of find_all. For the same reason, company refers to the first span found in div.slider_container. The span you want has class="companyName". Also, the prints should be inside the for loop. Here's the improved code:
job_elements = results.find_all("div", class_="slider_container")

for job_element in job_elements:
    title = job_element.find("h2")
    company = job_element.find("span", class_="companyName")
    location = job_element.find("div", class_="companyLocation")
    print(title.text)
    print(company.text)
    print(location.text)
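Note that find returns None when a tag is missing, so if some job cards lack one of these elements, a small guard avoids an AttributeError (a sketch, not part of the original answer):
for job_element in job_elements:
    title = job_element.find("h2")
    company = job_element.find("span", class_="companyName")
    location = job_element.find("div", class_="companyLocation")
    # only print cards where all three fields were found
    if title and company and location:
        print(title.text, company.text, location.text)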

Trying to Get A Link Embedded in an HTML Page with Python

On the webpage https://podcasts.apple.com/us/podcast/id979020229, there is a title that reads "Python at the US Federal Election Commission". When you click on that title, a link opens. What I'm trying to do is, first, find the first title on the webpage that has an embedded link, and then get that link and print it. I'm not sure how to do this in Python, but I've tried a few ways I thought would work. One of them involved the BeautifulSoup module. My code is below.
Code:
import requests
import bs4

link = 'https://podcasts.apple.com/us/podcast/id979020229'
open = []  # list of matching hrefs

page = requests.get(link)
soup = bs4.BeautifulSoup(page.text, "html.parser")
eps = soup.find_all('a')

i = 0
while (len(open) < 1):
    s = str(eps[i].get('href'))
    if s[8] == 'q':
        open.append(s)
    i += 1

for i in open:
    print(i)
You can use find_all with the class of the link.
I used inspect element to determine which class to select, in this case, tracks__track__link--block
Then you can iterate through the links.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://podcasts.apple.com/us/podcast/id979020229')
soup = BeautifulSoup(page.text, "html.parser")
eps = soup.find_all(class_='tracks__track__link--block')

for a in eps:
    # gets the text of the link
    print(a.text)
    # gets the link
    print(a['href'])
prints
Python at the US Federal Election Commission
https://podcasts.apple.com/us/podcast/python-at-the-us-federal-election-commission/id979020229?i=1000522628049
Flask 2.0
https://podcasts.apple.com/us/podcast/flask-2-0/id979020229?i=1000521751060
Awesome FastAPI extensions and add ons
https://podcasts.apple.com/us/podcast/awesome-fastapi-extensions-and-add-ons/id979020229?i=1000520763681
Ask us about modern Python projects and tools
https://podcasts.apple.com/us/podcast/ask-us-about-modern-python-projects-and-tools/id979020229?i=1000519506466
Automate your data exchange with PyDantic
https://podcasts.apple.com/us/podcast/automate-your-data-exchange-with-pydantic/id979020229?i=1000518286844
Python Apps that Scale to Billions of Users
https://podcasts.apple.com/us/podcast/python-apps-that-scale-to-billions-of-users/id979020229?i=1000517664109
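If you only need the first episode that has a link, select_one returns just the first match (assuming, as in the output above, that the class sits on the <a> tag):
first = soup.select_one('a.tracks__track__link--block')
if first is not None:
    print(first.text)     # text of the link
    print(first['href'])  # the link itself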

Using Beautiful Soup to extract specific groups of links from a blogspot website

I want to extract, let's say, every year 7 link on a school website. In the archives, it's pretty easy to find them with Ctrl+F "year-7". It's not that easy with BeautifulSoup, though. Or maybe I'm doing it wrong.
import requests
from bs4 import BeautifulSoup

URL = '~school URL~'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))
This gives me every link on the website archive. Every link that matters to me is pretty much like this:
~school URL~blogspot.com/2020/10/mathematics-activity-year-x.html
I tried storing link.get('href') in a variable and searching it for "year-x", but that doesn't work.
Any ideas on how I'd search through it? Blogspot search is horrific. I'm doing this as a project to help kids from a poor area find their classes more easily, because everything was just left on the website for the next school year and there are hundreds of links without tags for the different school years. I'm trying to at least compile a list of the links for each school year to help them.
If I understand correctly, you want to extract the years from the links. Try using a regex to extract them.
In your case it would be:
import re
from bs4 import BeautifulSoup

txt = """<a href="blogspot.com/2020/10/mathematics-activity-year-x.html"></a>"""
soup = BeautifulSoup(txt, "html.parser")

years = []
for tag in soup.find_all("a"):
    link = tag.get("href")
    year = re.search(r"year-.?", link).group()
    years.append(year)

print(years)
Output:
['year-x']
Edit: Try using a CSS selector to select all hrefs that end with year-7.html:
...
for tag in soup.select('a[href$="year-7.html"]'):
    print(tag)
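Putting it together against the live archive (the URL is the placeholder from the question, and the year-7.html ending is assumed from the example link), something like this should collect all year 7 links into a list:
import requests
from bs4 import BeautifulSoup

URL = '~school URL~'  # placeholder, as in the question
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

# keep only anchors whose href ends with "year-7.html"
year_7_links = [a['href'] for a in soup.select('a[href$="year-7.html"]')]
print(year_7_links)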

Making Python go to a webpage that has a lot of links, scraping each link and checking whether it has a specific piece of text

I have a question pertaining to Python that I hope can be remedied.
I'm not asking to be spoonfed, but any advice will be extremely helpful.
I'm working on a mini-project of sorts where I "crawl" the WWI database of Canadian soldiers who died and check which pages lack info.
http://www.canadaatwar.ca/memorial/world-war-i/
I'm trying to make Python go to each soldier's page and see whether the "biography" section is empty.
This is my code so far (not actually mine; I'll give credit and link the original page later). It's very messy, and it may make senior developers tear their hair out in frustration, but bear with me.
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'http://www.canadaatwar.ca/memorial/world-war-i/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text)

soldier_links = []
for table_row in soup.select(".soldierslist tr"):
    table_cells = table_row.findAll('td')
    if len(table_cells) > 0:
        relative_link_to_soldier_details = table_cells[0].find('a')['href']
        absolute_link_to_soldier_details = url_to_scrape + relative_link_to_soldier_details
        soldier_links.append(absolute_link_to_soldier_details)

soldiers = []
for soldier_link in soldier_links:
    r = requests.get(soldier_link)
    soup = BeautifulSoup(r.text)
    soldier_details = {}
    soldier_profile_rows = soup.select("#soldierProfile tr")
    soldier_details['additional text information is avalable on this individual just yet. If you have more information please'] = soldier_profile_rows[0].findAll('td')[0].text.strip()
    soldiers.append(soldier_details)
It would be great to know how to make Python go to the next page once it's done scraping everything from the current page, and how to make Python print only the links that have information in the biography section.
I would suggest you use the Scrapy framework. When building Scrapy spiders:
You can specify how to start requests, e.g., in the simplest way by setting start_urls, or by overriding the start_requests method.
You can specify how to parse the response. Here you can do the XPath or CSS selection; if your condition is satisfied, yield the result as items and at the same time yield a new request (this is how you follow the next page).
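A minimal sketch of such a spider (the CSS selectors and the empty-biography check are assumptions you would adapt to the actual page structure):
import scrapy

class SoldierSpider(scrapy.Spider):
    name = 'soldiers'
    start_urls = ['http://www.canadaatwar.ca/memorial/world-war-i/']

    def parse(self, response):
        # follow each soldier's detail page (selector is an assumption)
        for href in response.css('.soldierslist td a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_soldier)
        # follow the pagination link, if there is one (selector is an assumption)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_soldier(self, response):
        # yield the page URL only when the biography section is empty (selector is an assumption)
        biography = ' '.join(response.css('#soldierProfile ::text').getall()).strip()
        if not biography:
            yield {'url': response.url}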
Hope this helps.

Scraping the web in Python

I'm completely new to web scraping, but I really want to learn it in Python. I have a basic understanding of Python.
I'm having trouble understanding some code that scrapes a webpage, because I can't find good documentation for the modules the code uses.
The code scrapes some movie data from this webpage.
I get stuck after the comment "selection in pattern follows the rules of CSS".
I would like to understand the logic behind that code, or find good documentation for those modules. Is there any prerequisite topic I need to learn first?
The code is the following:
import requests
from pattern import web
from BeautifulSoup import BeautifulSoup

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url  # notice it constructs the full url for you

# selection in pattern follows the rules of CSS
dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
    title = movie.by_tag('a')[0].content
    genres = movie.by_tag('span.genre')[0].by_tag('a')
    genres = [g.content for g in genres]
    runtime = movie.by_tag('span.runtime')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, genres, runtime, rating
Here's the documentation for BeautifulSoup, which is an HTML and XML parser.
The comment
selection in pattern follows the rules of CSS
means that strings such as 'td.title' and 'span.runtime' are CSS selectors that help find the data you are looking for; for example, td.title matches <td> elements with the attribute class="title".
The code iterates through the HTML elements in the page body and extracts the title, genres, runtime, and rating using those CSS selectors.
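For comparison, BeautifulSoup has its own CSS selector support via select and select_one, so the same idea can be written without pattern. This is only a sketch in Python 3 with the current bs4 package: it reuses r from the question's code, and IMDb's markup has changed since then, so td.title may no longer match on the live site.
from bs4 import BeautifulSoup  # the maintained bs4 package, not the old BeautifulSoup module

soup = BeautifulSoup(r.text, 'html.parser')
for movie in soup.select('td.title'):  # each movie cell
    title = movie.select_one('a').text
    genres = [g.text for g in movie.select('span.genre a')]
    runtime = movie.select_one('span.runtime').text
    rating = movie.select_one('span.value').text
    print(title, genres, runtime, rating)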
