BeautifulSoup Number Extraction - python

So I'm trying to get the degrees from this weather site, but it keeps returning a blank answer. This is my code:
Link to a screenshot
import requests
from bs4 import BeautifulSoup
# -----------------------------get site info------------------------------- #
URL = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"
request = requests.get(URL)
# print(request.content)
# ----------------------parse site info---------------- #
soup = BeautifulSoup(request.content, 'html5lib')
#print(soup.prettify().encode("utf-8"))
weatherdata = soup.find('span', class_='temp')
print(weatherdata)

It might be that those values are rendered dynamically, i.e. populated by JavaScript in the page.
requests.get() simply returns the markup received from the server, without any of the client-side changes JavaScript would make, so no amount of waiting will help.
You could perhaps use the Selenium Chrome WebDriver to load the page URL and get the rendered page source. (Or you can use the Firefox driver.)
Go to chrome://settings/help, check your current Chrome version, and download the driver for that version from here. Make sure to keep the driver file either in your PATH or in the same folder as your Python script.
Try this:
from bs4 import BeautifulSoup as bs
from selenium.webdriver import Chrome  # pip install selenium
from selenium.webdriver.chrome.options import Options

url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"

# Make it headless, i.e. run in the background without opening a Chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")

# Use Chrome to get the page with its JavaScript-generated content
with Chrome(executable_path="./chromedriver", options=chrome_options) as browser:
    browser.get(url)
    page_source = browser.page_source

# Parse the final page source
soup = bs(page_source, 'html.parser')
weatherdata = soup.find('span', class_='temp')
print(weatherdata.text)
Output:
10
References:
Get page generated with Javascript in Python
selenium - chromedriver executable needs to be in PATH

The problem seems to be that the data is loaded via JavaScript, so the value for that specific span is only populated a while after the initial HTML arrives. At the time of your request it is still empty. One possible solution is to use Selenium to let the page load and then extract the HTML afterwards.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"

# Load the page in a real browser so the JavaScript runs
browser = webdriver.Chrome()
browser.get(url)

# Grab the rendered HTML and hand it to BeautifulSoup
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
elem = soup.find('span', class_='temp')
print(elem.text)

Related

I'm Trying To Scrape The Duration Of TikTok Videos But I Am Getting 'None'

I want to scrape the duration of TikTok videos for an upcoming project, but my code isn't working:
import requests
from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
Using an example TikTok.
I would think this would work. Could anyone help?
If you turn off JavaScript and then check the element selection in Chrome DevTools, you will see that the value is something like 00/000. But when you turn JS on and the video is playing, the duration keeps increasing until the video finishes. So the real duration value of that element depends on JS, and you have to use an automation tool like Selenium to grab that dynamic value. How much of the duration gets scraped depends on time.sleep() if you are using Selenium; if time.sleep() is longer than the video length, it will show a NoneType error.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source,"lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
The ID appended to the class is likely randomized. Try using a regex to get the element by a class ending in 'TimeContainer' plus some other id:
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
Your next issue is that the page loads before the video, so you'll get 0/0 for the time. Try Selenium instead so you can add timed waits for loading.
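Putting the two suggestions together, here is a minimal sketch; the class pattern is an assumption based on the regex above, and an explicit WebDriverWait replaces the fixed sleep so the script only proceeds once the element exists:
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://vm.tiktok.com/ZMFFKmx3K/')
# Wait up to 30 seconds for any div whose class contains 'TimeContainer'
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div[class*='TimeContainer']"))
)
soup = BeautifulSoup(driver.page_source, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer')})
print(data.text)
driver.quit()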

Web scraping using Beautifulsoup/Selenium unable to pull div class using find or find_all

I am trying to web scrape from the website https://roll20.net/compendium/dnd5e/Monsters%20List#content and having some issues.
My first script kept returning an empty list when finding by div and class name, which I believe is due to the site using JavaScript? But I'm a little uncertain if that is the case or not.
Here was my first attempt:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
soup = BeautifulSoup(page.text, 'html.parser')
card = soup.find_all("div", class_='card')
print(card)
This one returns an empty list, so then I tried to use Selenium and scrape with that. Here is that script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
url='https://roll20.net/compendium/dnd5e/Monsters%20List#content'
driver = webdriver.Firefox(executable_path=r'C:\Windows\System32\geckodriver')
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'html.parser')
Starting the script with that, I then tried all 3 of these different options (I ran these individually; they're just listed together here for simplicity's sake):
for card in body.find('div', {"class":"card"}):
    print(card.text)
    print(card)

for card in body.find_all('div', {"class":"card"}):
    print(card.text)
    print(card)

card = body.find_all('div', {"class":"card"})
print(card)
All of them return the same error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Where am I going wrong here?
Edit:
Fazul, thank you for your input. I guess I should be more specific: I was looking to get the contents of each card. For example, the card has a "body" class, and within that body class there are many fields holding the data I am looking to extract. Maybe I am misunderstanding your script and what you stated. Here is a screenshot to help specify a bit more what content I am looking to extract.
So everything that would be under the body, i.e. name, title, subtitle, etc. Those were the texts I was trying to extract.
That page is being loaded by JavaScript, so BeautifulSoup alone will not work in this case. You have to use Selenium.
The element that you are looking for, a <div> with class name card, shows up only when you click on the drop-down arrow. Since you are not doing any click event in your code, you get an empty result.
Use Selenium to click the <div> with class name dropdown-toggle. That click event loads the <div class='card'>.
Then you can scrape the data you need, as in the sketch below.
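A minimal sketch of that click-then-scrape flow; the dropdown-toggle and card selectors are taken from the answer above and may need adjusting against the live page:
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
sleep(5)  # give the page's JavaScript time to build the list

# Click each drop-down toggle so the <div class='card'> elements load into the DOM
for toggle in driver.find_elements(By.CSS_SELECTOR, 'div.dropdown-toggle'):
    toggle.click()
    sleep(1)

soup = BeautifulSoup(driver.page_source, 'html.parser')
for card in soup.find_all('div', class_='card'):
    print(card.text)
driver.quit()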
Since the page is loaded by JavaScript, you have to use Selenium.
My solution is as follows:
Every card has a link. Use BeautifulSoup to get the link for each card, then open each link using Selenium in headless mode, because each card page is also loaded by JavaScript.
Then you can get the data you need for each card using BeautifulSoup.
Working code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
# Wait until the page is loaded, then grab the source
sleep(13)
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")

# Get all card urls
urls = soup.find_all("div", {"class": "header inapp-hide single-hide"})
cards_list = []
for url in urls:
    card = {}
    card_url = url.a["href"]
    card_link = f'https://roll20.net{card_url}#content'
    # Open Chrome in headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(card_link)
    page = driver.page_source
    soup = BeautifulSoup(page, "html.parser")
    card["name"] = soup.select_one(".name").text.split("\n")[1]
    card["subtitle"] = soup.select_one(".subtitle").text
    # Do the same to get all card details
    cards_list.append(card)
    driver.quit()
Library you need to install:
pip install webdriver_manager
This library opens the Chrome driver without you having to download the driver executable manually, and it fetches an up-to-date driver for you.
You can get the JSON data from this link instead; just parse it and pull out the data you need:
https://roll20.net/compendium/compendium/getList?bookName=dnd5e&pageName=Monsters%20List&_=1630747796747
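A minimal sketch of that approach (dropping the `_=` cache-buster timestamp; the shape of the returned JSON is an assumption, so inspect the response before relying on specific keys):
import requests

url = ('https://roll20.net/compendium/compendium/getList'
       '?bookName=dnd5e&pageName=Monsters%20List')
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()
print(data)  # explore the structure, then pull out the fields you need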

Why doesn't BeautifulSoup scrape the entirety of the webpage, and how to get it?

I'm trying to get some data from this webpage. All the text is visible on the page; however, when I grab it with BSoup I can't find any of the numbers for the odds.
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.sportsbookreview.com/betting-odds/mlb-baseball/money-line/1st-half/?date=20160601"

# Render the page in Chrome, then parse the resulting source
driver = webdriver.Chrome()
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

opener = soup.findAll('span', {'class': 'opener'})
print(opener)
What's even weirder, it worked once, then stopped for no apparent reason or change to the code. When I checked what soup scrapes from this page, the numbers for the odds weren't even included.
Why don't I get all the data, and what can I do to get it?
I think you need to use Selenium so the JavaScript gets triggered, and then extract the data you want. In this case the code below can help:
options = webdriver.ChromeOptions()
# also look at options
driver = webdriver.Chrome('address_of_your_driver', chrome_options=options)
driver.get("your_URL_here")
# now you've got the source
soup = BeautifulSoup(driver.page_source, "html.parser")

How can I get HTML code source of my opened page on Firefox in python

How can I get the HTML source code of my opened page in Firefox with Python? I've tried BeautifulSoup, but it's not getting my actual active, opened HTML page.
I assume you can't obtain what you want because of JavaScript.
If you want to obtain the HTML of an "active" web page, you should probably have a look at Selenium. It can simulate a browser, navigate to a given URL and get the "active" HTML for you.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
source_code = driver.page_source
You can try this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.pagina12.com.ar/'
my_page = requests.get(url)
soup = BeautifulSoup(my_page.text, 'html.parser')
your_html = soup.prettify()
print(your_html)
Don't forget to make sure the libraries are installed.
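If they are not, both can be installed with pip:
pip install requests beautifulsoup4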

Problem with BS4 printing only certain parts of the web page

I am having a problem with bs4 only finding some things in the HTML. To be specific, when I try to print span.nav2__menu-link-main-text it selects it and prints it without a problem, but when I try to select another part of the page it probably selects it, yet it doesn't want to print it out. Here is the code that prints, and the code that doesn't print.
I tried using different parsers other than lxml and none worked.
# This one prints
from bs4 import BeautifulSoup
import requests
import lxml

url = 'https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('span.nav2__menu-link-main-text'):
    print(i.text)
# This one does not print
from bs4 import BeautifulSoup
import requests
import lxml

url = 'https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('div.value-dispaly__value'):
    print(i.text)
I expect this program to print the current value of div.value-dispaly__value, but when I start the program it prints nothing, even though I can see the value is 4000 when I inspect the page.
It seems that the content you are trying to get is dynamically added to the web page by JavaScript.
In order to render the JS part, you can use the requests-html library's render() function.
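A minimal sketch with requests-html (pip install requests-html); note that render() downloads Chromium on first use and executes the page's JavaScript:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://osu.ppy.sh/users/12008062')
r.html.render(sleep=3)  # run the page's JavaScript, then pause briefly
for el in r.html.find('div.value-display__value'):
    print(el.text)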
The website page uses JavaScript rendering to get its data, so you need to use an automation library like Selenium. Download the Selenium web driver as per your browser requirement.
Download the Selenium web driver for the Chrome browser:
http://chromedriver.chromium.org/downloads
Install the web driver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Replace your code with this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get('https://osu.ppy.sh/users/12008062')
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'lxml')
for i in soup.find_all('div', {"class": "value-display__value"}):
    print(i.get_text())
Output:
#47,514
#108
11d 19h 49m
44
4,000
11d 19h 49m
44
4,000
#47,514
#108
0
0
