Python webscrape the "next" piece of text in a website

I am creating a Python program which scrapes company financials off a website. I am aware that websites which contain this information make it particularly difficult to scrape data reliably, and as such, I have hit a roadblock.
https://www.reuters.com/companies/3in.L/key-metrics
From this website, I am attempting to scrape the value next to the text "Return on Equity (TTM)". (currently it's 8.86)
I have searched StackOverflow and plenty of other sites. The closest I have got is this:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/companies/3in.L/key-metrics")
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.find('span', text='Return on Equity (TTM)').find_next('span').text
print(spans)
However, this creates the error:
AttributeError: 'NoneType' object has no attribute 'find_next'
The line which creates the spans variable does not raise an error when you remove the .find_next('span').text part at the end; instead, it prints None.
I have seen other people successfully use a similar line of code. However, seeing as I am a beginner with BeautifulSoup, there is still clearly a concept I have not grasped.
If anyone could guide me in the right direction, I would be very grateful.

The 'span' you are looking for does not exist. It is actually a div with the text "Return on Equity (TTM)" inside it, so instead of searching for a span, you can search for a div. For example:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/companies/3in.L/key-metrics")
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.find('div', text='Return on Equity (TTM)').find_next('span').text
print(spans)
This should return the correct value.
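Whenever find() can return None, it is safer to check the result before chaining, and newer BeautifulSoup versions prefer the string= argument over the older text= name. A minimal sketch on a stand-in HTML snippet (the real Reuters markup may differ):

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the label/value layout described above;
# the real Reuters markup may differ.
html = '<div>Return on Equity (TTM)</div><span>8.86</span>'
soup = BeautifulSoup(html, 'html.parser')

label = soup.find('div', string='Return on Equity (TTM)')
if label is None:
    # Fail loudly instead of raising AttributeError on a chained call
    print("Label not found - check the tag name and the exact text")
else:
    value = label.find_next('span').text
    print(value)  # 8.86
```

Checking for None first turns the cryptic AttributeError into an actionable message when the site's markup changes.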


Simple examples on how to use html2json

I am brand new to Python (about two weeks in), so I don't understand a lot of what I am seeing online. I have to download HTML text from multiple similar pages and convert it to JSON. It must include the HTML links; it cannot just be the table contents. I have figured out how to use Beautiful Soup to download the code off a website, and I have been able to get the relevant portions into a Python list of 547 similar groupings (547 in this one file so far). The next step is to convert it to JSON. I have "a", "tr", and "td" tags, and "class", "data-href", and "href" attributes, as well as text and links associated with those.

I think html2json will work for me, but I cannot figure out how to use it. I installed it, and it is in my library. I understand that collect(html, template) does the conversion, but nowhere explains how to build the template. I have only found two pages that describe it, with unfamiliar terms and no actual examples. I can't even tell whether the template uses () or {} or [], or whether it does or does not have an = for assigning it. Can someone provide an example of a few lines of HTML and what the template actually looks like in Python? Here is my code so far:
import requests
from bs4 import BeautifulSoup
from html2json import collect

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
gs_results = soup.find_all('tr', class_='gsc_a_tr')
gs_links = []
for i in gs_results:
    gs_links.append(i)
template = I_HAVE_NO_IDEA_WHAT_GOES_HERE
collect(gs_links, template)
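html2json's template syntax is thinly documented, so one alternative is to skip it and build the JSON by hand with Beautiful Soup plus the standard json module. A hedged sketch on a made-up row (the gsc_a_tr class name comes from the question above; the cell structure inside it is an assumption):

```python
import json
from bs4 import BeautifulSoup

# Made-up HTML resembling one scraped row; the real page's
# cell structure may differ.
html = """
<table>
  <tr class="gsc_a_tr">
    <td class="gsc_a_t"><a href="/citations?view_op=view_citation">Paper title</a></td>
    <td class="gsc_a_c"><a href="#">42</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr', class_='gsc_a_tr'):
    link = tr.find('a')
    rows.append({
        'text': link.get_text(),  # the visible title
        'href': link['href'],     # keep the link itself, not just the cell text
    })

print(json.dumps(rows, indent=2))
```

Building plain dicts and lists first, then serializing with json.dumps, sidesteps the template question entirely and keeps the href attributes the question needs.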

Beautiful Soup (Python) not seeing text inside of span

I can't figure out why BS4 is not seeing the text inside of the span in the following scenario:
Page: https://pypi.org/project/requests/
Text I'm looking for - number of stars on the left hand side (around 43,000 at the time of writing)
My code:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).text
also tried:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).get_text()
Both return an empty string ''. The element itself seems to be located correctly (I can browse through its parents and siblings in the PyCharm debugger without a problem). Fetching text in other parts of the website also works perfectly fine. It's just the GitHub-related stats that fail to fetch.
Any ideas?
This page uses JavaScript to load that content dynamically, so you can't get it directly from response.text.
You could crawl the API directly:
import requests
r = requests.get('https://api.github.com/repos/psf/requests')
print(r.json()["stargazers_count"])
Result:
43010
With bs4 alone, we can't scrape the star count.
If you inspect the site and check the response HTML, you'll find the element with the class "github-repo-info__item", but it contains no text.
In cases like this, use Selenium (or another tool that executes the page's JavaScript).
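The symptom both answers describe is easy to reproduce by parsing the served (pre-JavaScript) HTML directly: the span is present, so the find succeeds, but it is empty, so .text legitimately returns ''. A sketch with a stand-in snippet:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML as served, before JavaScript fills in the count.
html = '<span class="github-repo-info__item" data-key="stargazers_count"></span>'
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('span', {'class': 'github-repo-info__item',
                          'data-key': 'stargazers_count'})
print(span is not None)  # True - the element exists...
print(repr(span.text))   # '' - ...but it holds no text until JS runs
```

This is why debugger navigation works (the tag is there) while .text and get_text() both come back empty.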

scraping data from web page but missing content

I am downloading the verb conjugations to aid my learning. However one thing I can't seem to get from this web page is the english translation near the top of the page.
The code I have is below. When I print results_eng, it prints the section I want, but there is no English translation. What am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
On a normal website, you should be able to find the text in a paragraph with the function get_text(), but in this case this is a search, which means it's probably pulling the data from a database and it's not in the paragraph itself. At least that's what I can come up with, since I tried to use that function and got an empty string in return. Could you try another website and see what happens?
P.S.: I'm a beginner, sorry if I'm guessing wrong.
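One quick way to test that guess is to look at the raw HTML that requests actually received rather than the DOM a browser shows: if the translations are filled in by JavaScript after the page loads, the container will be there but the paragraphs will not. A stand-in sketch mirroring the question's queries (the real Reverso markup may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML as served: the container exists, but the
# translation paragraphs are assumed to be added later by JavaScript.
html = '<div id="list-translations"></div>'
soup = BeautifulSoup(html, 'html.parser')

results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
print(results_eng is not None)  # True - the section is in the raw HTML
print(eng)                      # [] - the translations are not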

Python/Requests/Beautiful Soup Basic Scrape

Hope you're all well. I wrote a basic web scrape of an HTML site earlier today, along similar lines. I was following a tutorial and, as you'll be able to see from my code, I'm a bit of a greenhorn at coding in Python. I'm hoping for a bit of guidance on scraping this site.
As you can see by the commented out code,
#print(results.prettify())
I am able to successfully print out the entire contents of the webpage. What I'd like to do, however, is whittle down what I am printing so that I am only printing the relevant content. There is a lot of content on the page that I don't want, and I'd like to massage it out. Does anyone have any thoughts on why the for loop at the bottom of the code is not sequentially grabbing the paragraphs in the xlmins unit of the HTML and printing them out? Please see the code below.
import requests
from bs4 import BeautifulSoup
URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
page = requests.get(URL)
#we're going to create an object in Beautiful soup that will scrape it.
soup = BeautifulSoup(page.content, 'html.parser')
#this line of code takes
results = soup.find(xmlns='http://www.w3.org/1999/xhtml')
#print(results.prettify())
job_elems = results.find_all('p', xlmins="http://www.w3.org/1999/xhtml")
for job in job_elems:
    paragraph = job.find("p", xlmins='http://www.w3.org/1999/xhtml')
    print(paragraph.text.strip)
No <p> tag contains the attribute xlmins='http://www.w3.org/1999/xhtml' (note that this is a misspelling of xmlns, which only the top-level HTML tag carries). Remove that filter and you'll get all the paragraphs:
job_elems = results.find_all('p')
for job in job_elems:
    print(job.text.strip())
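The underlying rule: attribute filters passed to find_all() must match attributes on the matched tags themselves, never on an ancestor. A small sketch on a hypothetical snippet (not the actual Gutenberg page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the xmlns attribute sits on <html>, not on the <p> tags.
html = ('<html xmlns="http://www.w3.org/1999/xhtml">'
        '<p>first</p><p>second</p></html>')
soup = BeautifulSoup(html, 'html.parser')

with_attr = soup.find_all('p', attrs={'xmlns': 'http://www.w3.org/1999/xhtml'})
without = soup.find_all('p')
print(len(with_attr))  # 0 - no <p> carries that attribute itself
print(len(without))    # 2 - a plain tag-name search finds them all
```

Because the question's loop filtered on an attribute the paragraphs never had, find_all returned an empty list and the loop body never ran.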

Scraping text with subscripts with BeautifulSoup in Python

Hi all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper names from PMC when given a "CitedBy" page like this one.
My program works fine for getting the author names and the URLs; however, I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.
Here's what I've got so far:
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
req = requests.get(url)
plain_text = req.text
soup = BeautifulSoup(plain_text, "lxml") #soup object
titles_list = []
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        title = "UHOH"  # problems with some titles
    # print(title)
    titles_list.append(title)
When I run this part of my code, my scraper gives me these results:
Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
UHOH
Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
UHOH
Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants
And so on for the whole page...
Some papers on this page that I get "UHOH" for are:
Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula
The first two I've listed here are problematic, I believe, because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is in an "em" HTML tag, so I suspect that this is why my scraper cannot scrape it.
The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        for other in soup.findAll('a', {'class': 'view'}):
            title = other.string
But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!
Thanks to @LukasGraf, I have the answer!
Since I'm using BeautifulSoup, I can use node.get_text(). It works differently from plain .string because it also returns all the text beneath a tag, which was the case for the subscripts and the "em"-marked text.
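The difference is easy to reproduce: .string is None whenever a tag has more than one child node, while get_text() concatenates all descendant text. A sketch with a title containing a subscript:

```python
from bs4 import BeautifulSoup

# A title whose subscript splits the text into several child nodes.
html = '<div class="title">Photosynthesis in C<sub>4</sub> lineages</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'class': 'title'})

print(div.string)      # None - <sub> means the div has three children, not one
print(div.get_text())  # Photosynthesis in C4 lineages
```

This is exactly the "UHOH" case above: every title containing a sub, sup, or em tag made .string return None, while get_text() recovers the full title.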
