Scraping content with Python and Selenium

I would like to extract all the league names (e.g. England Premier League, Scotland Premiership, etc.) from this website https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1
Using the inspector tools in Chrome/Firefox I can see that they are located here:
<span>England Premier League</span>
So I tried this
from lxml import html
from selenium import webdriver
session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
tree = html.fromstring(session.page_source)
leagues = tree.xpath('//span/text()')
print(leagues)
Unfortunately this doesn't return the desired results :-(
To me it looks like the website has different frames and I'm extracting the content from the wrong frame.
Could anyone please help me out here or point me in the right direction? As an alternative, if someone knows how to extract the information through their API, that would obviously be the superior solution.
Any help is much appreciated. Thank you!
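(For reference: if the content really lived inside an iframe, Selenium would have to switch into that frame before page_source would show it. A minimal sketch, with a purely illustrative locator, not taken from the actual page:)
from selenium import webdriver
session = webdriver.Firefox()
session.get('https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1')
# Only needed if the target markup is inside an iframe; the tag-name
# locator here is illustrative.
session.switch_to.frame(session.find_element_by_tag_name('iframe'))
print(session.page_source)  # now reflects the frame's document
session.switch_to.default_content()  # switch back to the top-level page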

Hope you are looking for something like this:
from selenium import webdriver
import bs4, time
driver = webdriver.Chrome()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
driver.get(url)
driver.maximize_window()
# sleep is given so that JS can populate the data in this time
time.sleep(10)
pSource = driver.page_source
soup = bs4.BeautifulSoup(pSource, "html.parser")
for data in soup.findAll('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        print(res.text)
It will print the below data:
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
The only problem is that it prints the result set twice.
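If the duplicates are exact repeats, one simple fix (a sketch based on the loop above) is to collect the texts first and deduplicate them while preserving order:
leagues = []
for data in soup.findAll('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        leagues.append(res.text)
# dict.fromkeys keeps insertion order (Python 3.7+), so duplicates drop out
for league in dict.fromkeys(leagues):
    print(league)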

The required content is absent from the initial page source. It comes dynamically from https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2
To get this content you can use an explicit wait, as below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
WebDriverWait(session, 10).until(EC.presence_of_element_located((By.ID, 'Splash')))
for collapsed in session.find_elements_by_xpath('//h3[contains(@class, "collapsed")]'):
    collapsed.location_once_scrolled_into_view
    collapsed.click()
for event in session.find_elements_by_xpath('//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)
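Alternatively, since the data ultimately comes from that splash.aspx URL, you could try fetching it directly with requests and parsing the HTML it returns; a sketch (the query parameters are copied from the URL above, and some of them, like isocode and tzi, may be session- or location-specific):
import requests
from lxml import html
# Endpoint observed in the browser's network traffic; some parameters
# may be tied to your session or location and need adjusting.
url = ('https://mobile.bet365.com/V6/sport/splash/splash.aspx'
       '?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2')
tree = html.fromstring(requests.get(url).content)
for league in tree.xpath('//div[contains(@class, "eventWrapper")]//span/text()'):
    print(league)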

Selenium Python extracting information from a website and dumping it into JSON Format

I'm trying to open a hotel website, www.booking.com, and extract the name, price, location, and link from the top 50 search results, which are sorted by cheapest first. I'm using Selenium with Python to automate the process. However, some HTML elements are targetable while others are not.
After inspecting the website I realized that all hotel names have the class name: fcab3ed991 a23c043802
I tried to target all of them and put them into an array, as seen in my code below. But I can't seem to target the element correctly. What am I doing wrong?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB&sid=8005d0cc6b75af8d0d2e74451b73cb8b&aid=304142&sb=1&sb_lp=1&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB%26sid%3D8005d0cc6b75af8d0d2e74451b73cb8b%26sb_price_type%3Dtotal%26%26&ss=Jumeirah%2C+Dubai%2C+Dubai+Emirate%2C+United+Arab+Emirates&is_ski_area=&checkin_year=2022&checkin_month=8&checkin_monthday=1&checkout_year=2022&checkout_month=8&checkout_monthday=3&group_adults=2&group_children=0&no_rooms=1&map=1&b_h4u_keep_filters=&from_sf=1&ss_raw=jum&ac_position=1&ac_langcode=en&ac_click_type=b&dest_id=941&dest_type=district&place_id_lat=25.205553&place_id_lon=55.239216&search_pageview_id=c0ac477da63f02c2&search_pageview_id=c0ac477da63f02c2&search_selected=true&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0&order=price#map_closed")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "d4924c9e74"))
    )
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "fcab3ed991 a23c043802"))
    )
    names = element.find_elements_by_class_name("fcab3ed991 a23c043802")
except:
    driver.quit()
To extract the texts from the name and price fields you can use list comprehensions with the following locator strategies:
Code block:
driver.execute("get", {'url': 'https://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB&sid=8005d0cc6b75af8d0d2e74451b73cb8b&aid=304142&sb=1&sb_lp=1&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB%26sid%3D8005d0cc6b75af8d0d2e74451b73cb8b%26sb_price_type%3Dtotal%26%26&ss=Jumeirah%2C+Dubai%2C+Dubai+Emirate%2C+United+Arab+Emirates&is_ski_area=&checkin_year=2022&checkin_month=8&checkin_monthday=1&checkout_year=2022&checkout_month=8&checkout_monthday=3&group_adults=2&group_children=0&no_rooms=1&map=1&b_h4u_keep_filters=&from_sf=1&ss_raw=jum&ac_position=1&ac_langcode=en&ac_click_type=b&dest_id=941&dest_type=district&place_id_lat=25.205553&place_id_lon=55.239216&search_pageview_id=c0ac477da63f02c2&search_pageview_id=c0ac477da63f02c2&search_selected=true&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0&order=price#map_closed'})
names = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[data-testid='title']")))]
prices = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[data-testid='price-and-discounted-price'] > span")))]
for i, j in zip(names, prices):
    print(f"{i} hotel price is {j}")
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
Royal Prestige Hotel hotel price is ₹ 10,871
Rove La Mer Beach hotel price is ₹ 10,328
Dubai Marine Beach Resort & Spa hotel price is ₹ 12,133
Roda Beach Resort hotel price is ₹ 16,525
Bespoke Residences - 3 Bedroom Waikiki Townhouses hotel price is ₹ 20,395
Walking distance to Burj al Arab - 1BR Lamtara 2 hotel price is ₹ 16,724
Mandarin Oriental Jumeira, Dubai hotel price is ₹ 18,108
Four Seasons Resort Dubai at Jumeirah Beach hotel price is ₹ 20,003
Bulgari Resort, Dubai hotel price is ₹ 78,274
Spacious Villa! hotel price is ₹ 62,619
Palm Beach Hotel hotel price is ₹ 64,794
York International Hotel hotel price is ₹ 86,971
Moon , Backpackers , Partition for Couples and for singles hotel price is ₹ 208,731
Hafez Hotel Apartments Al Ras Metro Station hotel price is ₹ 2,022
Grand Pearl Hostel For Boys hotel price is ₹ 2,131
Time Palace Hotel Branch hotel price is ₹ 3,131
Hostel Youth hotel price is ₹ 3,157
Grand Mayfair Hotel hotel price is ₹ 3,601
Explore Old Dubai, Souks, Tastings, Museums hotel price is ₹ 4,592
Panorama Hotel Bur Dubai hotel price is ₹ 3,674
Zain International Hotel hotel price is ₹ 3,827
Panorama Hotel Deira hotel price is ₹ 3,870
Decent Boys Hostel in center of Bur Dubai next to Burjuman metro Station with all FREE Facilities hotel price is ₹ 3,875
Brand New Boys Hostel 1 min walk from Burjuman Metro Station EXIT-4 with all Brand New Furnishings & Free Facilities hotel price is ₹ 3,914
OYO 338 Transworld Hotel hotel price is ₹ 3,914
PS: Following this solution you can similarly extract the location and link texts and dump everything in JSON format.
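A sketch of that follow-up (the title-link selector is an assumption to verify against the live markup):
import json
# "a[data-testid='title-link']" is an assumed selector; check it in the inspector first.
links = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-testid='title-link']")))]
hotels = [{"name": n, "price": p, "link": l} for n, p, l in zip(names, prices, links)]
with open("hotels.json", "w", encoding="utf-8") as f:
    json.dump(hotels, f, ensure_ascii=False, indent=2)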

Web scraping just printing "[]"?

I am attempting to get the names and prices of the listings on a cruise website.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.ncl.com/vacations?cruise-destination=transatlantic'
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
names = soup.find_all('h2', class_='headline -medium -small-xs -variant-1')
prices = soup.find_all('span', class_='headline-3 -variant-1')
print(names)
print(prices)
This just ends up printing brackets.
BeautifulSoup can only see HTML elements which exist in the HTML document at the time the document is served to you from the server. It cannot see elements in the DOM which normally would be populated/created asynchronously using JavaScript (by a browser).
The page you're trying to scrape is of the second kind: The HTML document the server served to you at the time you requested it only contains the "barebones" scaffolding of the page, which, if you're viewing the page in a browser, will be populated at a later point in time via JavaScript. This is typically achieved by the browser by making additional requests to other resources/APIs, whose response contains the information with which to populate the page.
BeautifulSoup is not a browser. It's just an HTML/XML parser. You made a single request to a (mostly empty) template HTML. You can expect BeautifulSoup NOT to work for any "fancy" pages - if you see a spinning "loading" graphic, you should immediately think "this page is populated asynchronously using JavaScript and BeautifulSoup won't work for this".
There are cases where the information you're trying to scrape is actually embedded somewhere in the HTML at the time the server served it to you - in a <script> tag, possibly - and then the browser is expected to use JavaScript to make this data presentable. In such a case, BeautifulSoup would be able to see the data - that's a separate matter, though.
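For example, a sketch of that <script>-tag case (the window.__DATA__ variable name here is made up for illustration):
import re
import json
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://example.com").text, "html.parser")
# Look for a script tag that assigns a JSON blob to a (hypothetical) JS variable
script = soup.find("script", string=re.compile(r"window\.__DATA__"))
if script:
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", script.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # the embedded data, no browser needed
        print(data)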
In your case, one solution would be to view the page in a browser, and log your network traffic. Doing this reveals that, once the page loads, an XHR HTTP GET request is made to a REST API endpoint, the response of which is JSON and contains all the information you're trying to scrape. The trick then is to imitate that request: copy the endpoint URL (including query-string parameters) and any necessary request headers (and payload, if it's a POST request. In this case, it isn't).
Inspecting the response gives us further clues on how to write our script: The JSON response contains ALL itineraries, even ones we aren't interested in (such as non-transatlantic trips). This means that, normally, the browser must run some JavaScript to filter the itineraries - this happens client-side, not server-side. Therefore, our script will have to perform the same kind of filtering.
def get_itineraries():
    import requests
    url = "https://www.ncl.com/fr/en/api/vacations/v1/itineraries"
    params = {
        "guests": "2",
        "v": "1414764913-1626184979267"
    }
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0"
    }
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    def predicate(itinerary):
        return any(dest["code"] == "TRANSATLANTIC" for dest in itinerary["destination"])
    yield from filter(predicate, response.json()["itineraries"])

def main():
    from itertools import islice
    def get_cheapest_price(itinerary):
        def get_valid_option(sailing):
            def predicate(option):
                return "combinedPrice" in option
            return next(filter(predicate, sailing["pricing"]))
        return min(get_valid_option(sailing)["combinedPrice"] for sailing in itinerary["sailings"])
    itineraries = list(islice(get_itineraries(), 50))
    prices = map(get_cheapest_price, itineraries)
    for itinerary, price in sorted(zip(itineraries, prices), key=lambda tpl: tpl[1]):
        print("[{}{}] - {}".format(itinerary["currency"]["symbol"], price, itinerary["title"]["fullTitle"]))
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
[€983] - 12-Day Transatlantic From London To New York: Spain & Bermuda
[€984] - 11-Day Transatlantic from Miami to Barcelona: Ponta Delgada, Azores
[€1024] - 15-Day Transatlantic from Rio de Janeiro to Barcelona: Spain & Brazil
[€1177] - 15-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1190] - 14-Day Transatlantic from Barcelona to New York: Spain & Bermuda
[€1234] - 14-Day Transatlantic from Lisbon to Rio de Janeiro: Spain & Brazil
[€1254] - 11-Day Europe from Rome to London: Italy, France, Spain & Portugal
[€1271] - 15-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1274] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€1296] - 13-Day Transatlantic From New York to London: France & Ireland
[€1411] - 17-Day Transatlantic from Rome to Miami: Italy, France & Spain
[€1420] - 15-Day Transatlantic From New York to Barcelona: France & Spain
[€1438] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1459] - 15-Day Transatlantic from Barcelona to Tampa: Bahamas, Spain & Bermuda
[€1473] - 11-Day Transatlantic from New York to Reykjavik: Halifax & Akureyri
[€1486] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1527] - 15-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1529] - 14-Day Transatlantic From New York to London: France & Ireland
[€1580] - 16-day Transatlantic From Barcelona to New York: Spain & Bermuda
[€1595] - 16-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1675] - 16-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1776] - 14-Day Transatlantic from New York to London: England & Ireland
[€1862] - 12-Day Transatlantic From London to New York: Scotland & Iceland
[€2012] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€2552] - 14-Day Transatlantic from New York to London: England & Ireland
[€2684] - 16-Day Transatlantic from New York to London: France & Ireland
[€3460] - 16-Day Transatlantic from New York to London: France & Ireland
For more information on logging your browser's network traffic, finding REST API endpoints (if they exist), and imitating requests, take a look at this other answer I posted to a similar question.

How to scrape information from tables selecting each of the Dropdown options using Selenium and Python?

Trying to help someone who works for a nonprofit. Currently trying to pull info from the STL County Boards/Commissions website (https://boards.stlouisco.com/).
Having trouble for a few reasons:
Was going to attempt to use BeautifulSoup, but the actual data isn't even shown until you choose a Board/Commission from a dropdown bar above, so I have switched to Selenium, which I am new at.
Is this task possible? When I look at the html code for the site, I see that the info isn't stored in the page, but pulled from another location and just displayed on the site based on the option chosen from the dropdown menu.
function ShowMemberList(selectedBoard) {
    ClearMeetingsAndMembers();
    var htmlString = "";
    var boardsList = [{"id":407,"name":"Aging Ahead","isActive":true,"description":"... ...1.","totalSeats":14}];
    var totalMembers = boardsList[$("select[name='BoardsList'] option:selected").index() - 1].totalSeats;
    $.get("/api/boards/" + selectedBoard + "/members", function (data) {
        if (data.length > 0) {
            htmlString += "<table id=\"MemberTable\" class=\"table table-hover\">";
            htmlString += "<thead><th>Member Name</th><th>Title</th><th>Position</th><th>Expiration Date</th></thead><tbody>";
            for (var i = 0; i < totalMembers; i++) {
                if (i < data.length) {
                    htmlString += "<tr><td>" + FormatString(data[i].firstName) + " " + FormatString(data[i].lastName) + "</td><td>" + FormatString(data[i].title) + "</td><td>" + FormatString(data[i].position) + "</td><td>" + FormatString(data[i].expirationDate) + "</td></tr>";
                } else {
                    htmlString += "<tr><td colspan=\"4\">---Vacant Seat---</td></tr>";
                }
            }
            htmlString += "</tbody></table>";
        } else {
            htmlString = "<span id=\"MemberTable\">There was no data found for this board.</span>";
        }
        $("#Results").append(htmlString);
    });
}
So far, I have this (not a lot), which goes to the page and selects every board from the list:
driver = webdriver.Chrome()
driver.get("https://boards.stlouisco.com/")
select = Select(wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'BoardsList'))))
options = select.options
for board in options:
    select.select_by_visible_text(board.text)
From here I would like to be able to scrape the info from the MemberTable, but I don't know how to move forward, whether it is within the scope of my abilities, or even whether it is possible with Selenium.
I've tried using find_by_* with a few different locators to click on the members table but am met with errors. I have also tried looking for the members table after my select, but it is not able to find that element. Any tips/pointers/advice is appreciated!
You can use this script to save all members from all boards to csv:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://boards.stlouisco.com/'
members_url = 'https://boards.stlouisco.com/api/boards/{}/members'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for o in soup.select('#BoardsList option[value]'):
    print(o['value'], o.text)
    data = requests.get(members_url.format(o['value'])).json()
    for d in data:
        all_data.append(dict(board=o.text, **d))
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
board boardMemberId memberId boardName ... lastName title position expirationDate
0 Aging Ahead 39003 27007 None ... Anderson None ST. LOUIS COUNTY EXECUTIVE APPOINTEE 10/1/2020
1 Aging Ahead 38963 27797 None ... Bauers None St. Charles County Community Action Agency App... None
2 Aging Ahead 39004 27815 None ... Berkowitz None ST. LOUIS COUNTY EXECUTIVE APPOINTEE 10/1/2020
3 Aging Ahead 38964 27798 None ... Biehle None Jefferson County Community Action Corp. Appointee None
4 Aging Ahead 38581 27597 None ... Bowers None Franklin County Commission Appointee None
.. ... ... ... ... ... ... ... ... ...
725 Zoo-Museum District - Zoological Park Subdistr... 38863 26745 None ... Seat (Robert R. Hermann, Jr.) St. Louis County 12/31/2019
726 Zoo-Museum District - Zoological Park Subdistr... 38864 26745 None ... Seat (Winthrop Reed) St. Louis County 12/31/2016
727 Zoo-Museum District - Zoological Park Subdistr... 38669 26745 None ... Seat (Lawrence Thomas) St. Louis County 12/31/2018
728 Zoo-Museum District - Zoological Park Subdistr... 38670 26745 None ... Seat (Peggy Ritter ) Advisory Commissioner Non-Voting St. Louis County 12/31/2019
729 Zoo-Museum District - Zoological Park Subdistr... 38394 27512 None ... Wilson Advisory Commissioner Non-Voting City of St. Louis None
[730 rows x 9 columns]
And it saves data.csv with all boards/members.
To choose each of the Board/Commission options from the html-select dropdown and scrape the page, you have to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategies:
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://boards.stlouisco.com/")
select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'BoardsList'))))
for option in select.options:
    option.click()
    print("Scraping: " + option.text)
Console Output:
Scraping: ---Choose a Board---
Scraping: Aging Ahead
Scraping: Aging Ahead Advisory Council
Scraping: Air Pollution & Noise Control Appeal Board
Scraping: Animal Care & Control Advisory Board
Scraping: Bi-State Development Agency (Metro)
Scraping: Board Of Examiners For Mechanical Licensing
Scraping: Board of Freeholders
Scraping: Boundary Commission
Scraping: Building Code Review Committee
Scraping: Building Commission & Board Of Building Appeals
Scraping: Business Advisory Council
Scraping: Center for Educational Media
Scraping: Civil Service Commission
Scraping: Commission On Disabilities
Scraping: County Health Advisory Board
Scraping: Domestic And Family Violence Council
Scraping: East-West Gateway Council of Governments Board of Directors
Scraping: Economic Development Collaborative Advisory Board
Scraping: Economic Rescue Team
Scraping: Electrical Code Review Committee
Scraping: Electrical Examiners, Board Of
Scraping: Emergency Communications System Commission
Scraping: Equalization, Board Of
Scraping: Fire Standards Commission
Scraping: Friends of the Kathy J. Weinman Shelter for Battered Women, Inc.
Scraping: Fund Investment Advisory Committee
Scraping: Historic Building Commission
Scraping: Housing Authority
Scraping: Housing Resources Commission
Scraping: Human Relations Commission
Scraping: Industrial Development Authority Board
Scraping: Justice Services Advisory Board
Scraping: Lambert Airport Eastern Perimeter Joint Development Commission
Scraping: Land Clearance For Redevelopment Authority
Scraping: Lemay Community Improvement District
Scraping: Library Board
Scraping: Local Emergency Planning Committee
Scraping: Mechanical Code Review Committee
Scraping: Metropolitan Park And Recreation District Board Of Directors (Great Rivers Greenway)
Scraping: Metropolitan St. Louis Sewer District
Scraping: Metropolitan Taxicab Commission
Scraping: Metropolitan Zoological Park and Museum District Board
Scraping: Municipal Court Judges
Scraping: Older Adult Commission
Scraping: Parks And Recreation Advisory Board
Scraping: Planning Commission
Scraping: Plumbing Code Review Committee
Scraping: Plumbing Examiners, Board Of
Scraping: Police Commissioners, Board Of
Scraping: Port Authority Board Of Commissioners
Scraping: Private Security Advisory Committee
Scraping: Productive Living Board
Scraping: Public Transportation Commission of St. Louis County
Scraping: Regional Arts Commission
Scraping: Regional Convention & Sports Complex Authority
Scraping: Regional Convention & Visitors Commission
Scraping: REJIS Commission
Scraping: Restaurant Commission
Scraping: Retirement Board Of Trustees
Scraping: St. Louis Airport Commission
Scraping: St. Louis County Children's Service Fund Board
Scraping: St. Louis County Clean Energy Development Board (PACE)
Scraping: St. Louis County Workforce Development Board
Scraping: St. Louis Economic Development Partnership
Scraping: St. Louis Regional Health Commission
Scraping: St. Louis-Jefferson Solid Waste Management District
Scraping: Tax Increment Financing Commission of St. Louis County
Scraping: Transportation Board
Scraping: Waste Management Commission
Scraping: World Trade Center - St. Louis
Scraping: Zoning Adjustment, Board of
Scraping: Zoo-Museum District - Art Museum Subdistrict Board of Commissioners
Scraping: Zoo-Museum District - Botanical Garden Subdistrict Board of Commissioners
Scraping: Zoo-Museum District - Missouri History Museum Subdistrict Board of Commissioners
Scraping: Zoo-Museum District - St. Louis Science Center Subdistrict Board of Commissioners
Scraping: Zoo-Museum District - Zoological Park Subdistrict Board of Commissioners
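To also pull the member rows after each selection, you could extend the loop; a sketch (it assumes the table keeps the id MemberTable that the page's ShowMemberList() JavaScript above assigns, and that each board actually renders a table):
for option in select.options[1:]:  # skip the "---Choose a Board---" placeholder
    option.click()
    # Wait for the table that ShowMemberList() builds (id="MemberTable")
    rows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#MemberTable tbody tr")))
    for row in rows:
        print([td.text for td in row.find_elements_by_tag_name("td")])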
References
You can find a couple of relevant discussions in:
Message: Element could not be scrolled into view while trying to click on an option within a dropdown menu through Selenium
How to open the option items of a select tag (dropdown) in different tabs/windows?

How to scrape hidden class data using selenium and beautiful soup

I'm trying to scrape JavaScript-enabled web page content. I need to extract data from the table on that website. However, each row of the table has a button (arrow) through which we get additional information for that row.
I need to extract that additional description for each row. By inspecting, I can see that the contents behind those arrows all belong to the same class. However, that class is hidden in the source code; it can be observed only while inspecting. The data I'm trying to parse is from the webpage.
I have used Selenium and Beautiful Soup. I'm able to scrape the data in the table but not the content behind those arrows. My Python code returns an empty list for the class of the arrow rows, but it works for the class of the normal table data.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source,'html.parser')
data = soup.find_all('div',class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data.text)
To print hidden data, you can use this example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
Prints:
Company Layoffs City County Month Industry Company description
Tesla (Temporary layoffs. Factory reopened) 11083 Fremont Alameda County April Industrial Car maker
Bon Appetit Management Co. 3015 San Francisco San Francisco County April Food Food supplier
GSW Arena LLC-Chase Center 1720 San Francisco San Francisco County May Sports Arena vendors
YMCA of Silicon Valley 1657 Santa Clara Santa Clara County May Sports Gym
Nutanix Inc. (Temporary furlough of 2 weeks) 1434 San Jose Santa Clara County April Tech Cloud computing
TeamSanJose 1304 San Jose Santa Clara County April Travel Tourism bureau
San Francisco Giants 1200 San Francisco San Francisco County April Sports Stadium vendors
Lyft 982 San Francisco San Francisco County April Tech Ride hailing
YMCA of San Francisco 959 San Francisco San Francisco County May Sports Gym
Hilton San Francisco Union Square 923 San Francisco San Francisco County April Travel Hotel
Six Flags Discovery Kingdom 911 Vallejo Solano County June Entertainment Amusement park
San Francisco Marriott Marquis 808 San Francisco San Francisco County April Travel Hotel
Aramark 777 Oakland Alameda County April Food Food supplier
The Palace Hotel 774 San Francisco San Francisco County April Travel Hotel
Back of the House Inc 743 San Francisco San Francisco County April Food Restaurant
DPR Construction 715 Redwood City San Mateo County April Real estate Construction
...and so on.
The content you are interested in is generated when you click a button, so you would want to locate the buttons. There are a million ways you could do this, but I would suggest something like:
elements = driver.find_elements(By.XPATH, '//button')
For your specific case you could also use:
elements = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')
Once you have the button elements, you can click each one:
for element in elements:
    element.click()
Parsing the page after this should get you the JavaScript-generated content you are looking for.
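Putting it together with the code from the question, a sketch (the expander-row class is the one observed in the question and may well have changed since, as generated class names like this tend to):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
# Click every expander button so the hidden rows get added to the DOM
for button in browser.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]'):
    button.click()
soup = BeautifulSoup(browser.page_source, 'html.parser')
# 'rdt_ExpanderRow' is the class observed in the question
for row in soup.find_all('div', class_='rdt_ExpanderRow'):
    print(row.text)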

How to scrape all information on a web page after the id = "firstheading" in python?

I am trying to scrape all text from a web page (using Python) that comes after the first heading. The tag for that heading is: <h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>
I don't want any information before this heading; I want to scrape all text written after it. Can I use BeautifulSoup in Python for this?
I am running the following code:
import requests
import bs4
from bs4 import BeautifulSoup
urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = (bs4.BeautifulSoup(res.text, 'lxml')).get_text()
print(soup1)
The web page has the following information:
Albert Einstein - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Albert_Einstein","wgTitle":"Albert Einstein","wgCurRevisionId":920687884,"wgRevisionId":920687884,"wgArticleId":736,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with missing ISBNs","Webarchive template wayback links","CS1 German-language sources (de)","CS1: Julian–Gregorian uncertainty","CS1 French-language sources (fr)","CS1 errors: missing periodical","CS1: long volume value","Wikipedia indefinitely semi-protected pages","Use American English from February 2019","All Wikipedia articles written in American English","Articles with short description","Good articles","Articles containing German-language text","Biography with signature","Articles with hCards","Articles with hAudio microformats","All articles with unsourced statements",
"Articles with unsourced statements from July 2019","Commons category link from Wikidata","Articles with Wikilivres links","Articles with Curlie links","Articles with Project Gutenberg links","Articles with Internet Archive links","Articles with LibriVox links","Use dmy dates from August 2019","Wikipedia articles with BIBSYS identifiers","Wikipedia articles with BNE identifiers","Wikipedia articles with BNF identifiers","Wikipedia articles with GND identifiers","Wikipedia articles with HDS identifiers","Wikipedia articles with ISNI identifiers","Wikipedia articles with LCCN identifiers","Wikipedia articles with LNB identifiers","Wikipedia articles with MGP identifiers","Wikipedia articles with NARA identifiers","Wikipedia articles with NCL identifiers","Wikipedia articles with NDL identifiers","Wikipedia articles with NKC identifiers","Wikipedia articles with NLA identifiers","Wikipedia articles with NLA-person identifiers","Wikipedia articles with NLI identifiers",
"Wikipedia articles with NLR identifiers","Wikipedia articles with NSK identifiers","Wikipedia articles with NTA identifiers","Wikipedia articles with SBN identifiers","Wikipedia articles with SELIBR identifiers","Wikipedia articles with SNAC-ID identifiers","Wikipedia articles with SUDOC identifiers","Wikipedia articles with ULAN identifiers","Wikipedia articles with VIAF identifiers","Wikipedia articles with WorldCat-VIAF identifiers","AC with 25 elements","Wikipedia articles with suppressed authority control identifiers","Pages using authority control with parameters","Articles containing timelines","Pantheists","Spinozists","Albert Einstein","1879 births","1955 deaths","20th-century American engineers","20th-century American writers","20th-century German writers","20th-century physicists","American agnostics","American inventors","American letter writers","American pacifists","American people of German-Jewish descent","American physicists","American science writers",
"American socialists","American Zionists","Ashkenazi Jews","Charles University in Prague faculty","Corresponding Members of the Russian Academy of Sciences (1917–25)","Cosmologists","Deaths from abdominal aortic aneurysm","Einstein family","ETH Zurich alumni","ETH Zurich faculty","German agnostics","German Jews","German emigrants to Switzerland","German Nobel laureates","German inventors","German physicists","German socialists","European democratic socialists","Institute for Advanced Study faculty","Jewish agnostics","Jewish American scientists","Jewish emigrants from Nazi Germany to the United States","Jews who emigrated to escape Nazism","Jewish engineers","Jewish inventors","Jewish philosophers","Jewish physicists","Jewish socialists","Leiden University faculty","Foreign Fellows of the Indian National Science Academy","Foreign Members of the Royal Society","Members of the American Philosophical Society","Members of the Bavarian Academy of Sciences","Members of the Lincean Academy"
,"Members of the Royal Netherlands Academy of Arts and Sciences","Members of the United States National Academy of Sciences","Honorary Members of the USSR Academy of Sciences","Naturalised citizens of Austria","Naturalised citizens of Switzerland","New Jersey socialists","Nobel laureates in Physics","Patent examiners","People from Berlin","People from Bern","People from Munich","People from Princeton, New Jersey","People from Ulm","People from Zürich","People who lost German citizenship","People with acquired American citizenship","Philosophers of science","Relativity theorists","Stateless people","Swiss agnostics","Swiss emigrants to the United States","Swiss Jews","Swiss physicists","Theoretical physicists","Winners of the Max Planck Medal","World federalists","Recipients of the Pour le Mérite (civil class)","Determinists","Activists from New Jersey","Mathematicians involved with Mathematische Annalen","Intellectual Cooperation","Disease-related deaths in New Jersey"],
"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Albert_Einstein","wgRelevantArticleId":736,"wgRequestId":"XaChjApAICIAALSsYfgAAABV","wgCSPNonce":!1,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["sysop"],"wgMediaViewerOnClick":!0,"wgMediaViewerEnabledByDefault":!0,"wgPopupsReferencePreviews":!1,"wgPopupsConflictsWithNavPopupGadget":!1,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":!0,"nearby":!0,"watchlist":!0,"tagline":
!1},"wgWMESchemaEditAttemptStepOversample":!1,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgWikibaseItemId":"Q937","wgCentralAuthMobileDomain":!1,"wgEditSubmitButtonLabelPublish":!0};RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","user":"ready","user.options":"ready","user.tokens":"loading","ext.cite.styles":"ready","ext.math.styles":"ready","mediawiki.legacy.shared":"ready","mediawiki.legacy.commonPrint":"ready","jquery.makeCollapsible.styles":"ready","mediawiki.toc.styles":"ready","wikibase.client.init":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready","ext.3d.styles":"ready","mediawiki.skinning.interface":"ready","skins.vector.styles":"ready"};RLPAGEMODULES=["ext.cite.ux-enhancements","ext.cite.tracking","ext.math.scripts","ext.scribunto.logs","site","mediawiki.page.startup",
"mediawiki.page.ready","jquery.makeCollapsible","mediawiki.toc","mediawiki.searchSuggest","ext.gadget.teahouse","ext.gadget.ReferenceTooltips","ext.gadget.watchlist-notice","ext.gadget.DRN-wizard","ext.gadget.charinsert","ext.gadget.refToolbar","ext.gadget.extra-toolbar-buttons","ext.gadget.switcher","ext.centralauth.centralautologin","mmv.head","mmv.bootstrap.autostart","ext.popups","ext.visualEditor.desktopArticleTarget.init","ext.visualEditor.targetLoader","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","ext.uls.compactlinks","ext.uls.interface","ext.cx.eventlogging.campaigns","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp","skins.vector.js"];
(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.tokens#tffin",function($,jQuery,require,module){/*#nomin*/mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
});});
Albert Einstein
From Wikipedia, the free encyclopedia
Jump to navigation Jump to search "Einstein" redirects here. For other
people, see Einstein (surname). For other uses, see Albert Einstein
(disambiguation) and Einstein (disambiguation).
German-born physicist and developer of the theory of relativity
Albert EinsteinEinstein in 1921Born(1879-03-14)14 March 1879Ulm,
Kingdom of Württemberg, German EmpireDied18 April 1955(1955-04-18)
(aged 76)Princeton, New Jersey, United StatesResidenceGermany, Italy,
Switzerland, Austria (present-day Czech Republic), Belgium, United
StatesCitizenship Subject of the Kingdom of Württemberg during the
German Empire (1879–1896)[note 1] Stateless (1896–1901) Citizen of
Switzerland (1901–1955) Austrian subject of the Austro-Hungarian
Empire (1911–1912) Subject of the Kingdom of Prussia during the German
Empire (1914–1918)[note 1] German citizen of the Free State of Prussia
(Weimar Republic, 1918–1933) Citizen of the United States (1940–1955)
Education Federal polytechnic school (1896–1900; B.A., 1900)
University of Zurich (Ph.D., 1905) Known for General relativity
Special relativity Photoelectric effect E=mc2 (Mass–energy
equivalence) E=hf (Planck–Einstein relation) Theory of Brownian motion
Einstein field equations Bose–Einstein statistics Bose–Einstein
condensate Gravitational wave Cosmological constant Unified field
theory EPR paradox Ensemble interpretation List of other concepts
Spouse(s)Mileva Marić(m. 1903; div. 1919)Elsa Löwenthal(m. 1919;
died[1][2] 1936)Children"Lieserl" Einstein Hans Albert Einstein Eduard
"Tete" EinsteinAwards Barnard Medal (1920) Nobel Prize in Physics
(1921) Matteucci Medal (1921) ForMemRS (1921)[3] Copley Medal
(1925)[3] Gold Medal of the Royal Astronomical Society (1926) Max
Planck Medal (1929) Member of the National Academy of Sciences (1942)
Time Person of the Century (1999) Scientific careerFieldsPhysics,
philosophyInstitutions Swiss Patent Office (Bern) (1902–1909)
University of Bern (1908–1909) University of Zurich (1909–1911)
Charles University in Prague (1911–1912) ETH Zurich (1912–1914)
Prussian Academy of Sciences (1914–1933) Humboldt University of Berlin
(1914–1933) Kaiser Wilhelm Institute (director, 1917–1933) German
Physical Society (president, 1916–1918) Leiden University (visits,
1920) Institute for Advanced Study (1933–1955) Caltech (visits,
1931–1933) University of Oxford (visits, 1931–1933) ThesisEine neue
Bestimmung der Moleküldimensionen (A New Determination of Molecular
Dimensions) (1905)Doctoral advisorAlfred KleinerOther academic
advisorsHeinrich Friedrich WeberInfluences Arthur Schopenhauer Baruch
Spinoza Bernhard Riemann David Hume Ernst Mach Hendrik Lorentz Hermann
Minkowski Isaac Newton James Clerk Maxwell Michele Besso Moritz
Schlick Thomas Young Influenced Virtually all modern physics
Signature Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt
ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born
theoretical physicist[5] who developed the theory of relativity, one
of the two pillars of modern physics (alongside quantum
mechanics).[3][6]:274 His work is also known for its influence on the
philosophy of science.[7][8] He is best known to the general public
for his mass–energy equivalence formula . . . . .
I only want text after the first heading "Albert Einstein"
First find the h1 tag, then use find_next_siblings('div') and print the text value.
import requests
import bs4
urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = bs4.BeautifulSoup(res.text, 'lxml')
h1 = soup1.find('h1')
for item in h1.find_next_siblings('div'):
    print(item.text)
If you do want to get the text as described, I suggest a bit of a "non-parser" way: cutting the string directly from the response object.
Let's do this:
import requests
import bs4
urlpage = "https://en.wikipedia.org/wiki/Albert_Einstein#Publications"
my_string = """<h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>""" # define the string you want
response = requests.get(urlpage).text # get the full response HTML as str
cut_response = response[response.find(my_string):] # cut the str from your string onwards
soup1 = (bs4.BeautifulSoup(cut_response, 'lxml')).get_text() # get soup object, but of the cut string
print(soup1)
Should work.
