Extracting content from webpage using BeautifulSoup

Extracting content from webpage using BeautifulSoup - python

I am working on scrapping numbers from the Powerball website with the code below.
However, numbers keeps coming back empty. Why is this?
import requests
from bs4 import BeautifulSoup
url = 'https://www.powerball.com/games/home'
page = requests.get(url).text
bsPage = BeautifulSoup(page)
numbers = bsPage.find_all("div", class_="field_numbers")
numbers

Can confirm #Teprr is absolutely correct. You'll need to download chrome and add chromedriver.exe to your system path for this to work but the following code gets what you are looking for. You can use other browsers too you just need their respective driver.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = 'https://www.powerball.com/games/home'
options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
time.sleep(3) # wait three seconds for all the js to happen
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
draws = soup.findAll("div", {"class":"number-card"})
print(draws)
for d in draws:
info = d.find("div",{"class":"field_draw_date"}).getText()
balls = d.find("div",{"class":"field_numbers"}).findAll("div",{"class":"numbers-ball"})
numbers = [ball.getText() for ball in balls]
print(info)
print(numbers)

If you download that file and inspect it locally, you can see that there is no <div> with that class. That means that it is likely generated dynamically using javascript by your browser, so you would need to use something like selenium to get the full, generated HTML content.
Anyway, in this specific case, this piece of HTML seems to be the container for the data you are looking for:
<div data-url="/api/v1/numbers/powerball/recent?_format=json" class="recent-winning-numbers"
data-numbers-powerball="Power Play" data-numbers="All Star Bonus">
Now, if you check that custom data-url, you can find the information you want in JSON format.

Related

Scraping Data from Table with Multiple Pages

I am trying to scrape data from AGMARKNET website. The tables are split into 11 pages but all of the pages use the same url. I am very new to webscraping (or python in general), but AGMARKNET does not have a public API so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML code and I am able to scrape the initial table, but that only contains the first 500 data points, but I want the entire 11 page data. I am stuck and frustrated. Link and my current code are below. Any direction would be helpful, thank you .
#αԋɱҽԃ αмєяιcαη
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)
# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')
# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]
# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
cells = []
tds = tr.find_all('td')
if len(tds) == 0:
ths = tr.find_all('th')
for th in ths:
cells.append(th.text.strip())
else:
for td in tds:
cells.append(td.text.strip())
rows.append(cells)
# convert table to df
table = pd.DataFrame(rows)

The website you linked to seems to be using JavaScript to navigate to the next page. The requests and BeautifulSoup libraries are only for parsing static HTML pages, so they can't run JavaScript.
Instead of using them, you should try something like Selenium that actually simulates a full browser environment (including HTML, CSS, etc.). In fact, Selenium can even open a full browser window so you can see it in action as it navigates!
Here is a quick sample code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.

Web scraping using Beautifulsoup/Selenium unable to pull div class using find or find_all

I am trying to webscrape form the website https://roll20.net/compendium/dnd5e/Monsters%20List#content and having some issues.
My first script I tried kept returning an empty list when finding by div and class name, which I believe is do to the site using Javascript? But a little uncertain if that is the case or not.
Here was my first attempt:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
soup = BeautifulSoup(page.text, 'html.parser')
card = soup.find_all("div", class_='card')
print(card)
This one returns an empty list so then I tired to use Selenium and scrape with that. Here is that script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
url='https://roll20.net/compendium/dnd5e/Monsters%20List#content'
driver = webdriver.Firefox(executable_path = 'C:\Windows\System32\geckodriver')
driver.get(url)
page = driver.page_source
page_soup = soup(page,'html.parser')
Starting the script with that I then tried all 3 of these different options (individually ran these, just listed them here together for simplicity sake):
for card in body.find('div', {"class":"card"}):
print(card.text)
print(card)
for card in body.find_all('div', {"class":"card"}):
print(card.text)
print(card)
card = body.find_all('div', {"class":"card"})
print(card)
All of them return the same error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Where am I going wrong here?
Edit:
Fazul thank you for your input on this I guess I should be more specific. I was more looking to get the contents of each card. For example, the card has a "body" class and within that body class there are many fields that is the data I am looking to extract. Maybe I am misunderstanding your script and what you stated. Here is a screen shot to maybe help specify my question a bit more to what content I am looking to extract.
So everything that would be under the body i.e. name, title, subtitle, etc.. Those were the texts I was trying to extract.

That page is being loaded by JavaScript. So beautifulsoup will not work in this case. You have to use Selenium.
And the element that you are looking for - <div> with class name as card show up only when you click on the drop-down arrow. Since you are not doing any click event in your code, you get an empty result.
Use selenium to click the <div> with class name as dropdown-toggle. That click event loads the <div class='card'>
Then you can scrape the data you need.

Since the page is loaded by JavaScript, you have to use Selenium.
My solution As follows:
Every card has a link as shown:
Use BeautifulSoup to get the link for each card, Then open each link using Selenium in headless mode because it is also loaded by JavaScript.
Then you get the data you need for each card using BeautifulSoup
Working code:
from selenium import webdriver
from webdriver_manager import driver
from webdriver_manager.chrome import ChromeDriver, ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
page = driver.page_source
# Wait until the page is loaded
sleep(13)
soup = BeautifulSoup(page, "html.parser")
# Get all card urls
urls = soup.find_all("div", {"class":"header inapp-hide single-hide"})
cards_list = []
for url in urls:
card = {}
card_url = url.a["href"]
card_link = f'https://roll20.net{card_url}#content'
# Open chrom in headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
driver.get(card_link )
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")
card["name"] = soup.select_one(".name").text.split("\n")[1]
card["subtitle"] = soup.select_one(".subtitle").text
# Do the same to get all card detealis
cards_list.append(card)
driver.quit()
Library you need to install
pip install webdriver_manager
This library open chrome driver without the need to geckodriver and get up to date driver for you.

https://roll20.net/compendium/compendium/getList?bookName=dnd5e&pageName=Monsters%20List&_=1630747796747
u can get the json data from this link instead just parse it and get the data you need

Parsing HTML using beautifulsoup gives "None"

I can clearly see the tag I need in order to get the data I want to scrape.
According to multiple tutorials I am doing exactly the same way.
So why it gives me "None" when I simply want to display code between li class
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text,'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)

Whilst the page does dynamically update (it makes additional requests from browser to update content which you don't capture with your single request) you can find the source URI in the network tab for the content of interest. You also need to add the expected header.
import requests
from bs4 import BeautifulSoup as bs
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))

There is no such content in the original page. The search results which you're referring to, are loaded dynamically/asynchronously using JavaScript.
Print the variable response.text to verify that. I got the result using ReqBin. You'll find that there's no text list-item inside.
Unfortunately, you can't run JavaScript with BeautifulSoup .

Another way to handle dynamically loading data is to use selenium instead of requests to get the page source. This should wait for the Javascript to load the data correctly and then give you the according html. This can be done like so:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless") # Opens the browser up in background
with Chrome(options=chrome_options) as browser:
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)

Web Scraping Python (BeautifulSoup,Requests)

I am learning web scraping using python but I can't get the desired result. Below is my code and the output
code
import bs4,requests
url = "https://twitter.com/24x7chess"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text,"html.parser")
soup.find_all("span",{"class":"account-group-inner"})
[]
Here is what I was trying to scrape
https://i.stack.imgur.com/tHo5S.png
I keep on getting an empty array. Please Help.

Sites like Twitter load the content dynamically, which sometimes depends upon the browser you are using etc. And due to dynamic loading there could be some elements in the webpage which are lazily loaded, which means that the DOM is inflated dynamically, depending upon the user actions, The tag you are inspecting in your browser Inspect element, is inspected the fully dynamically inflated HTML, But the response you are getting using requests, is inflated HTML, or a simple DOM waiting to load the elements dynamically on the user actions which in your case while fetching from requests module is None.
I would suggest you to use selenium webdriver for scraping dynamic javascript web pages.

Try this. It will give you the items you probably look for. Selenium with BeautifulSoup is easy to handle. I've written it that way. Here it is.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://twitter.com/24x7chess")
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
for title in soup.select("#page-container"):
name = title.select(".ProfileHeaderCard-nameLink")[0].text.strip()
location = title.select(".ProfileHeaderCard-locationText")[0].text.strip()
tweets = title.select(".ProfileNav-value")[0].text.strip()
following = title.select(".ProfileNav-value")[1].text.strip()
followers = title.select(".ProfileNav-value")[2].text.strip()
likes = title.select(".ProfileNav-value")[3].text.strip()
print(name,location,tweets,following,followers,likes)
Output:
akul chhillar New Delhi, India 214 44 17 5

You could have done the whole thing with requests rather than selenium
import requests
from bs4 import BeautifulSoup as bs
import re
r = requests.get('https://twitter.com/24x7chess')
soup = bs(r.content, 'lxml')
bio = re.sub(r'\n+',' ', soup.select_one('[name=description]')['content'])
stats_headers = ['Tweets', 'Following', 'Followers', 'Likes']
stats = [item['data-count'] for item in soup.select('[data-count]')]
data = dict(zip(stats_headers, stats))
print(bio, data)

Trouble Scraping site with BS4

usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table enlisted for this research project I'm working on. I'm planning to verify the script working on one State before entering the URL of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
#I'm printing "Table" just to ensure that the table information I'm looking for is within this sections
I'm not sure if the site is attempting to block people from scraping, but all the info that I'm looking to grab is within "&quot"if you look what Table outputs.

The text is rendered with JavaScript.
First render the page with dryscrape
(If you don't want to use dryscrape see Web-scraping JavaScript page with Python )
Then the text can be extracted, after it has been rendered, from a different position on the page i.e the place it has been rendered to.
As an example this code will extract HTML from the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...

So I finally managed to solve the issue, and successfuly grab the data from the Javascript page the code as follows worked for me if anyone encounters a same issue when trying to use python to scrape a javascript webpage using windows (dryscrape incompatible).
import bs4 as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
trip = str(n.text)
data.append(trip)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting content from webpage using BeautifulSoup - python

Related

Scraping Data from Table with Multiple Pages

Web scraping using Beautifulsoup/Selenium unable to pull div class using find or find_all

Parsing HTML using beautifulsoup gives "None"

Web Scraping Python (BeautifulSoup,Requests)

Trouble Scraping site with BS4

Categories

Resources