Scraping text with subscripts with BeautifulSoup in Python

Hi all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper names from PMC when given a "CitedBy" page like the one in the code below.
My program works fine for getting the author names and the URLs; however, I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.
Here's what I've got so far:
import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
req = requests.get(url)
plain_text = req.text
soup = BeautifulSoup(plain_text, "lxml")  # soup object

titles_list = []
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        title = "UHOH"  # problems with some titles
    #print(title)
    titles_list.append(title)
When I run this part of my code, my scraper gives me these results:
Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
UHOH
Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
UHOH
Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants
And so on for the whole page...
Some papers on this page that I get "UHOH" for are:
Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula
The first two I've listed here are, I believe, problematic because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is inside an "em" HTML tag, which I suspect is why my scraper cannot scrape it.
The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        for other in soup.findAll('a', {'class': 'view'}):
            title = other.string
But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!

Thanks to @LukasGraf, I have the answer!
Since I'm using BeautifulSoup, I can use node.get_text(). It works differently from the plain ".string" because it also returns all the text beneath a tag, which was the case for the subscripts and the "em"-tagged text.
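For reference, here's a minimal sketch of how the fix slots into the loop above (same URL and class names as in the question):

import requests
from bs4 import BeautifulSoup

url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
soup = BeautifulSoup(requests.get(url).text, "lxml")

titles_list = []
for item in soup.findAll('div', {'class': 'title'}):
    # get_text() concatenates the text of the tag and all of its descendants,
    # so titles containing <sub>, <sup> or <em> children no longer come back as None
    titles_list.append(item.get_text())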

Related

Select CSS tags with randomized letters at the end

I am currently learning web scraping with Python. I'm reading Web Scraping with Python by Ryan Mitchell.
I am stuck at "Crawling Sites Through Search". For example, the Reuters search given in the book works perfectly, but when I try to find it by myself, as I will do in the future, I get this link.
While the second link works for a human, I cannot figure out how to scrape it due to weird class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like class="search-result-content", that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it, or finding a link with normal names, in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint

text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least notationally the first). However, the idea can be extended to checking for a substring, as sketched below.
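Here's a quick sketch of both variants, reusing the question's setup. Note that [class^=...] matches against the whole class attribute string, so it fails if another class happens to come first; the substring form [class*=...] covers that case:

from bs4 import BeautifulSoup
import requests

url = "https://www.reuters.com/site-search/?query=hello"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# prefix match: the class attribute starts with the stable part of the name
results = soup.select('div[class^="media-story-card__body__"]')

# substring match: also works when the stable part is not the first class listed
results = soup.select('div[class*="media-story-card__body__"]')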

Traversal Parsing of Text from HTML

I am trying to scrape text from webpages that is contained in title, heading, or paragraph tags. When I try the code below, I get mixed results depending on where the URL is from. For some sources (e.g. Wikipedia or Reuters) the code works more or less fine and at least finds all the text. For other sources (e.g. Politico, The Economist) I miss a lot of the text contained in the webpage.
I am using a traversal algorithm to walk through the tree and check whether each tag is "of interest". Maybe find_all(True, recursive=False) is for some reason missing children that subsequently contain the text I am looking for? I'm unsure how to investigate that. Or maybe some sites are blocking the scraping somehow? But then why can I scrape one paragraph from The Economist?
The code below replicates the issue for me: you should see the Wikipedia page (urls[3]) print as desired, the Politico page (urls[0]) missing all text in the article, and The Economist page (urls[1]) missing all but one paragraph.
from bs4 import BeautifulSoup
import requests

urls = ["https://www.politico.com/news/2022/01/17/democrats-biden-clean-energy-527175",
        "https://www.economist.com/finance-and-economics/the-race-to-power-the-defi-ecosystem-is-on/21807229",
        "https://www.reuters.com/world/significant-damage-reported-tongas-main-island-after-volcanic-eruption-2022-01-17/",
        "https://en.wikipedia.org/wiki/World_War_II"]

# get soup
url = urls[0]  # first two urls don't work, last two do work
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")

# tags with text that I want to print
tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]

def read(soup):
    for tag in soup.find_all(True, recursive=False):
        if tag.name in tags_of_interest:
            print(tag.name + ": ", tag.text.strip())
        for child in tag.find_all(True, recursive=False):
            read(child)

# call the function
read(soup)
BeautifulSoup's find_all() returns tags in the order of a DFT (depth-first traversal), as per this answer. This allows easy access to the desired elements.
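Given that, the manual recursion can be dropped entirely; here's a sketch using the question's own tags_of_interest list:

# find_all() accepts a list of tag names and already walks the whole tree
# depth-first, so every matching descendant is visited in document order
# without any explicit recursion
for tag in soup.find_all(tags_of_interest):
    print(tag.name + ": ", tag.text.strip())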

Python: web scrape the "next" piece of text on a website

I am creating a Python program which scrapes company financials off a website. I am aware that websites containing this information make it particularly difficult to scrape data reliably, and as such, I have hit a roadblock.
https://www.reuters.com/companies/3in.L/key-metrics
From this website, I am attempting to scrape the value next to the text "Return on Equity (TTM)" (currently it's 8.86).
I have searched Stack Overflow and plenty of other sites. The closest I have got is this:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/companies/3in.L/key-metrics")
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.find('span', text='Return on Equity (TTM)').find_next('span').text
print(spans)
However, this creates the error:
'NoneType' object has no attribute 'find_next'
The line which creates the spans variable does not raise an error when you remove the find_next bit at the end; instead, it prints None.
I have seen other people successfully use a similar line of code. However, seeing as I am a beginner with BeautifulSoup, there is still clearly a concept I have not grasped.
If anyone could guide me in the right direction, I would be very grateful.
The 'span' you are looking for does not exist. It's actually a 'div' with the text "Return on Equity (TTM)" inside it, so instead of searching for a 'span', you can search for a 'div'. For example:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/companies/3in.L/key-metrics")
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.find('div', text='Return on Equity (TTM)').find_next('span').text
print(spans)
This should return the correct value.
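As a side note, from BeautifulSoup 4.4 onwards the text= keyword is also available under the name string=, so an equivalent call would be:

spans = soup.find('div', string='Return on Equity (TTM)').find_next('span').text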

find_all() not finding any results when using Beautiful Soup + Requests

I'm experimenting with using BeautifulSoup and Requests for the first time, and am trying to learn by scraping some information from a news site. The aim of the project is to just be able to read news highlights from the terminal, so I need to effectively scrape and parse article titles and article body text.
I am still at the stage of getting the titles, but I simply am not storing any data when I try to use the find_all() function. Below is my code:
from bs4 import BeautifulSoup
from time import strftime
import requests

date = strftime("%Y/%m/%d")
url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")

titles = soup.find_all('h1 class="title multiline"')
print(titles)
Any thoughts? If anyone also has any advice / tips to improve what I currently have or the approach I'm taking, I'm always looking to get better so please do tell!
Cheers
You are putting everything here in quotes:
titles = soup.find_all('h1 class="title multiline"')
which makes BeautifulSoup search for h1 class="title multiline" elements.
Instead, use:
titles = soup.find_all("h1", class_="title multiline")
Or, with a CSS selector:
titles = soup.select("h1.title.multiline")
Actually, because of the dynamic nature of the page, to get all of the titles, you have to approach it differently:
import json

results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
    print(result["title"])
Prints:
Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
You're very close, but find_all() only searches by tag name and attributes; it's not a generic search function.
Hence, if you want to filter by tag and an attribute like class, do this:
soup.find_all('h1', {'class' : 'multiline'})

Using regular expressions in Beautiful Soup scraping

I have a list of URLs and I'm trying to use regex to scrape info from each URL. This is my code (well, at least the relevant part):
for url in sammy_urls:
    soup = BeautifulSoup(urlopen(url).read()).find("div", {"id": "page"})
    addy = soup.find("p", "addy").em.encode_contents()
    extracted_entities = re.match(r'"\$(\d+)\. ([^,]+), ([\d-]+)', addy).groups()
    price = extracted_entities[0]
    location = extracted_entities[1]
    phone = extracted_entities[2]
    if soup.find("p", "addy").em.a:
        website = soup.find("p", "addy").em.a.encode_contents()
    else:
        website = ""
When I pull a couple of the URLs and test the regex on them, the extracted entities (price, location, phone, website) come up fine, but I run into trouble when I put it into this larger loop and feed it real URLs.
Did I input the regex incorrectly? (The error message is "'NoneType' object has no attribute 'groups'", so that is my guess.)
My 'addy' seems to be what I want... (it prints
"$10. 2109 W. Chicago Ave., 773-772-0406, "'theoldoaktap.com
"$9. 3619 North Ave., 773-772-8435, "'cemitaspuebla.com
and so on).
Combining HTML/XML with regular expressions has a tendency to turn out badly.
Why not use bs4 to find the 'a' elements in the div you're interested in and get the 'href' attribute from the element?
See also: retrieve links from web page using python and BeautifulSoup
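A rough sketch of that bs4-only approach, reusing the loop and the "addy" class from the question (sammy_urls is the question's list of URLs):

from urllib.request import urlopen
from bs4 import BeautifulSoup

for url in sammy_urls:
    page = BeautifulSoup(urlopen(url).read(), "html.parser").find("div", {"id": "page"})
    em = page.find("p", "addy").em
    # read the link straight from the <a> tag instead of regexing the markup
    website = em.a["href"] if em.a else ""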
