Trying to Scrape a Span - python

I've been trying to scrape two values from a website using Beautiful Soup in Python, and it's been giving me trouble. Here is the URL of the page I'm scraping:
https://www.stjosephpartners.com/Home/Index
Here are the values I'm trying to scrape:
HTML of Website to be Scraped
I tried:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.stjosephpartners.com/Home/Index').text
soup = BeautifulSoup(source, 'lxml')
gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
print(gold_spot_shell)
The output I got was: <list_iterator object at 0x039FD0A0>
When I tried: gold_spot_shell = list(soup.find('div', class_ = 'col-lg-10').children)
The output was: ['\n']
When I tried: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').span
The output was: None
The HTML definitely has at least one span child. I'm not sure how to scrape the values I'm after. Thanks.

BeautifulSoup + requests is not a good fit for a dynamic website like this. That span is generated by JavaScript, so when you fetch the HTML with requests it simply does not exist yet.
You can use Selenium instead.
You can check whether the website uses JavaScript to render an element by disabling JavaScript on the page and looking for the element again, or simply by using "view page source".
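Once the rendered HTML is in hand (for example from Selenium's driver.page_source), BeautifulSoup can pull the spans out as usual. A minimal sketch, using an inline snippet as a stand-in for the rendered page; the span values here are invented:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML after JavaScript has run (with Selenium this
# would be driver.page_source); the span values are made up.
rendered = """
<div class="col-lg-10">
    <span>1,912.50</span>
    <span>24.10</span>
</div>
"""
soup = BeautifulSoup(rendered, "html.parser")
shell = soup.find("div", class_="col-lg-10")
values = [span.get_text() for span in shell.find_all("span")]
print(values)  # ['1,912.50', '24.10']
```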

Related

Extracting Scraped Web Content from iframe

Attempting to scrape the table at https://coronavirus.health.ny.gov/zip-code-vaccination-data
I've looked at Python BeautifulSoup - Scrape Web Content Inside Iframes and have gotten this far, but I don't know how to extract the information from soup.
Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("https://coronavirus.health.ny.gov/zip-code-vaccination-data")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = '//static-assets.ny.gov/load_global_footer/ajax?iframe=true'
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
The website you're trying to scrape generates the <iframe> dynamically with JavaScript, so you either need a browser-automation tool such as Selenium or Puppeteer, or you can assign the <iframe> URL to a variable, since it seems unlikely to change in the near future. Here is the URL of your <iframe>:
https://public.tableau.com/views/Vaccination_Rate_Public/NYSVaccinationRate?:embed=y&:showVizHome=n&:tabs=n&:toolbar=n&:device=desktop&showShareOptions=false&:apiID=host0#navType=1&navSrc=Parse
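As a sanity check on the extraction step itself, here is a minimal sketch: the markup below is a stand-in for what the page looks like after JavaScript has injected the <iframe> (with Selenium you would use driver.page_source instead):

```python
from bs4 import BeautifulSoup

# Stand-in for the page HTML after the <iframe> has been injected by JS:
rendered = """
<div class="embed">
  <iframe src="https://public.tableau.com/views/Vaccination_Rate_Public/NYSVaccinationRate?:embed=y"></iframe>
</div>
"""
soup = BeautifulSoup(rendered, "html.parser")
iframe_src = soup.find("iframe")["src"]
print(iframe_src)
```

Requesting iframe_src directly then gives you the embedded document to parse.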

Beautiful Soup - Scraping a site with a hidden tag

I am trying to scrape NBA.com's play-by-play table, so I want to get the text of each box shown in the example picture,
for example (https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play).
Checking the HTML, I figured that each line is in an article tag that contains a div tag, which in turn contains two p tags with the information I want. However, with the code below I get back 0 articles and only 9 p tags (there should be far more), and even the tags I do get hold text from somewhere else on the page rather than the boxes, so I must be doing something terribly wrong.
This is the code to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def contains_word(word):
    # Matcher factory: returns a predicate that checks whether an
    # attribute value contains the given word.
    return lambda value: value is not None and word in value

url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
# Pass the predicate itself (don't call it), and match on class_ since
# "playByPlayContainer" and "md:bg" are class-name fragments, not text:
div_tags = soup.find_all('div', class_=contains_word("playByPlayContainer"))
articles = soup.find_all('article')
p_tags = soup.find_all('p', class_=contains_word("md:bg"))
thank you!
Use Selenium, since the page is rendered with JavaScript, and pass the page source to BeautifulSoup. You will also need to pip install selenium and download chromedriver.exe.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
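An alternative worth checking before reaching for Selenium: NBA.com appears to be a Next.js app, which usually embeds its data as JSON in a script tag with id="__NEXT_DATA__" in the static HTML. A sketch, with a made-up payload standing in for the real page (the exact JSON structure on NBA.com is an assumption you would need to verify in the page source):

```python
import json
from bs4 import BeautifulSoup

# Stand-in for requests.get(url).text; the payload shape is invented.
html = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"playByPlay": [{"clock": "12:00", "description": "Jump ball"}]}}'
    '</script>'
)
soup = BeautifulSoup(html, "html.parser")
data = json.loads(soup.find("script", id="__NEXT_DATA__").string)
plays = data["props"]["playByPlay"]
print(plays[0]["description"])  # Jump ball
```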

How to scrape content from div class using beautifulsoup

This is a part of the HTML page I want to scrape.
I am trying to get the titles and values of cryptocurrencies using BeautifulSoup.
I have tried many solutions using find and find_all to get the content inside the div, but I don't see what is wrong. Here is an example of what I tried:
titles = soup.find_all("div", {"class": "tabTitle-qQlkPW5Y"})
Can you please help me with this?
My solution is to use Selenium to make sure the page is fully rendered; then, using BeautifulSoup, we can navigate through its elements.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(pathToChromeWebDriver)
url = "https://fr.tradingview.com/markets/cryptocurrencies/global-charts/"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for title in soup.find_all("div", {"class": "tabTitle-qQlkPW5Y"}):
    print(title.string)
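One pitfall with .string: it returns None whenever a tag has more than one child, so .get_text() is often the safer choice. A quick illustration on stand-in markup (the class name is taken from the question; the content is invented):

```python
from bs4 import BeautifulSoup

# .string is None when a tag has multiple children; .get_text() flattens them.
html = '<div class="tabTitle-qQlkPW5Y">Market <b>cap</b></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "tabTitle-qQlkPW5Y"})
print(div.string)      # None (two children: text node + <b> tag)
print(div.get_text())  # Market cap
```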

BeautifulSoup can't find tag

I am trying to scrape a web-page to collect a list of Fortune 500 companies. However, when I run this code, BeautifulSoup can't find <div class="rt-tr-group" role="rowgroup"> tags.
import requests
from bs4 import BeautifulSoup
url = r'https://fortune.com/fortune500/2019/search/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
data = soup.find_all('div', {'class': 'rt-tr-group'})
Instead, I just get an empty list. I've tried changing the parser but saw no results. The tags definitely exist and can be seen in the browser's inspector.
Data on that page is loaded by JavaScript after some delay.
Using Selenium, you can wait for the page to load completely, or you can try to get the data from the JavaScript directly.
P.S. You can also check the XHR requests in the browser's Network tab and fetch the JSON directly, without any HTML parsing.
The content of the page you are parsing is loaded with JS, so requests.get only returns an empty shell of the page.
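The XHR route mentioned above looks like this in practice: find the request in the browser's Network tab (filter by XHR/Fetch), then fetch that URL and decode the JSON, with no HTML parsing at all. A sketch with a made-up payload; the real endpoint and response shape must be taken from DevTools:

```python
import json

# Stand-in for requests.get(xhr_url).text; both the structure and the
# values below are invented for illustration.
payload = '{"items": [{"rank": 1, "title": "Walmart"}, {"rank": 2, "title": "Exxon Mobil"}]}'
companies = json.loads(payload)["items"]
names = [c["title"] for c in companies]
print(names)  # ['Walmart', 'Exxon Mobil']
```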

How to get sub-content from wikipedia page using BeautifulSoup

I am trying to scrape sub-content from Wikipedia pages based on an internal link using Python. The problem is that my code scrapes all the content from the page; how can I scrape just the paragraph the internal link points to? Thanks in advance.
import requests
from bs4 import BeautifulSoup as bs

base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link="#الأسباب"
total=base_link+sub_link
r=requests.get(total)
soup = bs(r.text, 'html.parser')
results=soup.find('p')
print(results)
That is because what you are trying to scrape is not a sub-link; it's an anchor.
Request the entire page, then find the element with the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id="الأسباب"
r=requests.get(base_link)
page = soup(r.text, 'html.parser')
span = page.find('span', {'id': anchor_id})
results = span.parent.find_next_siblings('p')
print(results[0].text)
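The navigation logic above can be verified on a minimal snippet that mirrors how Wikipedia marks up a section heading (a stand-in, not the real page): the anchor id sits on a span inside the heading, and the section's paragraphs follow the heading as siblings.

```python
from bs4 import BeautifulSoup

# Stand-in markup mirroring MediaWiki's heading structure:
html = """
<h2><span class="mw-headline" id="Causes">Causes</span></h2>
<p>First paragraph of the section.</p>
<p>Second paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"id": "Causes"})
# span.parent is the <h2>; the section's <p> tags are its next siblings.
paragraphs = span.parent.find_next_siblings("p")
print(paragraphs[0].get_text())  # First paragraph of the section.
```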
