Selenium loading, but not printing all HTML - python

I am attempting to use Python and Selenium to scrape dynamically loaded data from a website. The problem is that only about half of the data is reported as present, when in reality it should all be there. Neither pausing before printing the page content nor simple find-element-by-class searches have helped. The URL of the site is https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909. As you can see, there are 13 main sections, but I am only able to retrieve data from the first four games. To best show the problem, I'll attach the code that prints the inner HTML of the entire page, to show the discrepancy between the loaded and non-loaded data.
from selenium import webdriver
import requests
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))
EDIT:
The problem is not the wait time; I am running the script line by line and waiting for the page to load fully. It appears the problem boils down to Selenium not grabbing all of the JS-loaded text on the page, as seen in the console output in the answer below.

sudonym's analysis was in the right direction. You need to induce WebDriverWait for the desired elements to be visible before you attempt to extract them through the execute_script() method, as follows:
Code Block:
# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
print(driver.execute_script("return document.documentElement.innerText;"))
Console Output:
SPORTSBOOK REVIEW
Home
Best Sportsbooks
Rating Guide
Blacklist
Bonuses
BETTING ODDS
FREE PICKS
Sports Picks
NFL
College Football
NBA
NCAAB
MLB
NHL
More Sports
How to Bet
Tools
FORUM
Home
Players Talk
Sportsbooks & Industry
Newbie Forum
Handicapper Think Tank
David Malinsky's Point Blank
Service Plays
Bitcoin Sports Betting
NBA Betting
NFL Betting
NCAAF Betting
MLB Betting
NHL Betting
CONTESTS
EARN BETPOINTS
What Are Betpoints?
SBR Sportsbook
SBR Casino
SBR Racebook
SBR Poker
SBR Store
Today
NFL
NBA
NHL
MLB
College Football
NCAA Basketball
Soccer
Soccer Odds
Major League Soccer
UEFA Champions League
UEFA Nations League
UEFA Europa League
English Premier League
World Cup 2022
Tennis
Tennis Odds
ATP
WTA
UFC
Boxing
More Sports
CFL
WNBA
AFL
Betting Odds/NFL Odds/Consensus
TODAY
|
YESTERDAY
|
DATE
?
Login
?
Settings
?
Bet Tracker
?
Bet Card
?
Favorites
NFL Consensus for Sep 09, 2018
USA - National Football League
Sunday Sep 09, 2018
01:00 PM
/
Pittsburgh vs Cleveland
453
Pittsburgh
454
Cleveland
Current Line
-3½+105
+3½-115
Wagers Placed
10040
54.07%
8530
45.93%
Amount Wagered
$381,520.00
56.10%
$298,550.00
43.90%
Average Bet Size
$38.00
$35.00
SBR Contest Best Bets
22
9
01:00 PM
/
San Francisco vs Minnesota
455
San Francisco
456
Minnesota
Current Line
+6-102
-6-108
Wagers Placed
6250
41.25%
8900
58.75%
Amount Wagered
$175,000.00
29.50%
$418,300.00
70.50%
Average Bet Size
$28.00
$47.00
SBR Contest Best Bets
5
19
01:00 PM
/
Cincinnati vs Indianapolis
457
Cincinnati
458
Indianapolis
Current Line
-1-104
+1-106
Wagers Placed
11640
66.36%
5900
33.64%
Amount Wagered
$1,338,600.00
85.65%
$224,200.00
14.35%
Average Bet Size
$115.00
$38.00
SBR Contest Best Bets
23
12
01:00 PM
/
Buffalo vs Baltimore
459
Buffalo
460
Baltimore
Current Line
+7½-103
-7½-107
Wagers Placed
5220
33.83%
10210
66.17%
Amount Wagered
$78,300.00
16.79%
$387,980.00
83.21%
Average Bet Size
$15.00
$38.00
SBR Contest Best Bets
5
17
01:00 PM
/
Jacksonville vs N.Y. Giants
461
Jacksonville
462
N.Y. Giants
01:00 PM
/
Tampa Bay vs New Orleans
463
Tampa Bay
464
New Orleans
01:00 PM
/
Houston vs New England
465
Houston
466
New England
01:00 PM
/
Tennessee vs Miami
467
Tennessee
468
Miami
04:05 PM
/
Kansas City vs L.A. Chargers
469
Kansas City
470
L.A. Chargers
04:25 PM
/
Seattle vs Denver
471
Seattle
472
Denver
04:25 PM
/
Dallas vs Carolina
473
Dallas
474
Carolina
04:25 PM
/
Washington vs Arizona
475
Washington
476
Arizona
08:20 PM
/
Chicago vs Green Bay
477
Chicago
478
Green Bay
Media
Site Map
Terms of use
Contact Us
Privacy Policy
DMCA
18+. Gamble Responsibly.
© Sportsbook Review. All Rights Reserved.

This solution is only worth considering if there are lots of WebDriverWait calls and reduced runtime is a priority - otherwise go for DebanjanB's approach.
You need to wait some time to let your HTML load completely. You can also set a timeout for script execution. To add an unconditional wait to driver.get(URL) in Selenium, use driver.set_page_load_timeout(n) with n = time in seconds and a loop:
from time import sleep
import random
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# `driver` and `URL` are assumed to be set up as in the question; n is the timeout in seconds
n = 30

driver.set_page_load_timeout(n)                # Set timeout of n seconds for page load
loading_finished = 0                           # Set flag to 0
while loading_finished == 0:                   # Repeat while flag = 0
    try:
        sleep(random.uniform(0.1, 0.5))        # wait some time
        website = driver.get(URL)              # try to load for n seconds
        loading_finished = 1                   # Set flag to 1 and exit while loop
        logger.info("website loaded")          # Indicate load success
    except:
        logger.warning("timeout - retry")      # Indicate load fail
else:                                          # Runs once the while loop exits without a break
    driver.set_script_timeout(n)               # Set timeout of n seconds for script
    script_finished = 0                        # Set flag to 0
    while script_finished == 0:                # Second loop
        try:
            print(driver.execute_script("return document.documentElement.innerText;"))
            script_finished = 1                # Set flag to 1
            logger.info("script done")         # Indicate script done
        except:
            logger.warning("script timeout")
    else:
        logger.info("if you're still missing html here, increase timeout")

Related

Scraping data from Spotify charts

I want to scrape the daily top 200 songs from the Spotify charts website. I am trying to parse the HTML of the page to get each song's artist, name, and stream information, but the following code returns nothing. How can I get this information this way?
for a in soup.find("div", {"class": "Container-c1ixcy-0 krZEp encore-base-set"}):
    for b in a.findAll("main", {"class": "Main-tbtyrr-0 flXzSu"}):
        for c in b.findAll("div", {"class": "Content-sc-1n5ckz4-0 jyvkLv"}):
            for d in c.findAll("div", {"class": "TableContainer__Container-sc-86p3fa-0 fRKUEz"}):
                print(d)
And let's say this is the songs list that I want to scrape from it.
https://charts.spotify.com/charts/view/regional-tr-daily/2022-09-14
And also this is the html code of the page.
A non-Selenium solution:
import requests
import pandas as pd

url = 'https://charts-spotify-com-service.spotify.com/public/v0/charts'
response = requests.get(url)

chart = []
for entry in response.json()['chartEntryViewResponses'][0]['entries']:
    chart.append({
        "Rank": entry['chartEntryData']['currentRank'],
        "Artist": ', '.join([artist['name'] for artist in entry['trackMetadata']['artists']]),
        "TrackName": entry['trackMetadata']['trackName']
    })

df = pd.DataFrame(chart)
print(df.to_string(index=False))
OUTPUT:
Rank Artist TrackName
1 Bizarrap,Quevedo Quevedo: Bzrp Music Sessions, Vol. 52
2 Harry Styles As It Was
3 Bad Bunny,Chencho Corleone Me Porto Bonito
4 Bad Bunny Tití Me Preguntó
5 Manuel Turizo La Bachata
6 ROSALÍA DESPECHÁ
7 BLACKPINK Pink Venom
8 David Guetta,Bebe Rexha I'm Good (Blue)
9 OneRepublic I Ain't Worried
10 Bad Bunny Efecto
11 Chris Brown Under The Influence
12 Steve Lacy Bad Habit
13 Bad Bunny,Bomba Estéreo Ojitos Lindos
14 Kate Bush Running Up That Hill (A Deal With God) - 2018 Remaster
15 Joji Glimpse of Us
16 Nicki Minaj Super Freaky Girl
17 Bad Bunny Moscow Mule
18 Rosa Linn SNAP
19 Glass Animals Heat Waves
20 KAROL G PROVENZA
21 Charlie Puth,Jung Kook,BTS Left and Right (Feat. Jung Kook of BTS)
22 Harry Styles Late Night Talking
23 The Kid LAROI,Justin Bieber STAY (with Justin Bieber)
24 Tom Odell Another Love
25 Central Cee Doja
26 Stephen Sanchez Until I Found You
27 Bad Bunny Neverita
28 Post Malone,Doja Cat I Like You (A Happier Song) (with Doja Cat)
29 Lizzo About Damn Time
30 Nicky Youre,dazy Sunroof
31 Elton John,Britney Spears Hold Me Closer
32 Luar La L Caile
33 KAROL G,Maldy GATÚBELA
34 The Weeknd Die For You
35 Bad Bunny,Jhay Cortez Tarot
36 James Hype,Miggy Dela Rosa Ferrari
37 Imagine Dragons Bones
38 Elton John,Dua Lipa,PNAU Cold Heart - PNAU Remix
39 The Neighbourhood Sweater Weather
40 Ghost Mary On A Cross
41 Shakira,Rauw Alejandro Te Felicito
42 Justin Bieber Ghost
43 Bad Bunny,Rauw Alejandro Party
44 Drake,21 Savage Jimmy Cooks (feat. 21 Savage)
45 Doja Cat Vegas (From the Original Motion Picture Soundtrack ELVIS)
46 Camila Cabello,Ed Sheeran Bam Bam (feat. Ed Sheeran)
47 Rauw Alejandro,Lyanno,Brray LOKERA
48 Rels B cómo dormiste?
49 The Weeknd Blinding Lights
50 Arctic Monkeys 505
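The question also asks for stream information. Rather than guessing key names in this undocumented endpoint, a small sketch reusing response from the code above dumps a single chart entry so you can read off whatever fields (stream counts included, if present) the API actually returns:
import json

# inspect one entry of the same response to see every available field
first_entry = response.json()['chartEntryViewResponses'][0]['entries'][0]
print(json.dumps(first_entry, indent=2, ensure_ascii=False))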
In the example link you provided, there aren't 200 songs, but only 50. The following is one way to get those songs:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time as t
import pandas as pd
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("window-size=1920,1080")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://charts.spotify.com/charts/view/regional-tr-daily/2022-09-14'
browser.get(url)
wait = WebDriverWait(browser, 5)

try:
    wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')

header_to_be_removed = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'header[data-testid="charts-header"]')))
browser.execute_script("""
var element = arguments[0];
element.parentNode.removeChild(element);
""", header_to_be_removed)

while True:
    try:
        show_more_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@data-testid="load-more-entries"]//button')))
        show_more_button.location_once_scrolled_into_view
        t.sleep(5)
        show_more_button.click()
        print('clicked to show more')
        t.sleep(3)
    except TimeoutException:
        print('all done')
        break

songs = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[data-testid="charts-entry-item"]')))
print('we have', len(songs), 'songs')

song_list = []
for song in songs:
    song.location_once_scrolled_into_view
    t.sleep(1)
    title = song.find_element(By.CSS_SELECTOR, 'p[class^="Type__TypeElement-"]')
    artist = song.find_element(By.CSS_SELECTOR, 'span[data-testid="artists-names"]')
    song_list.append((artist.text, title.text))

df = pd.DataFrame(song_list, columns=['Artist', 'Title'])
print(df)
This will print out in terminal:
no cookie button
clicked to show more
clicked to show more
clicked to show more
clicked to show more
all done
we have 50 songs
    Artist              Title
0   Bizarrap,           Quevedo: Bzrp Music Sessions, Vol. 52
1   Harry Styles        As It Was
2   Bad Bunny,          Me Porto Bonito
3   Bad Bunny           Tití Me Preguntó
4   Manuel Turizo       La Bachata
5   ROSALÍA             DESPECHÁ
6   BLACKPINK           Pink Venom
7   David Guetta,       I'm Good (Blue)
8   OneRepublic         I Ain't Worried
9   Bad Bunny           Efecto
10  Chris Brown         Under The Influence
11  Steve Lacy          Bad Habit
12  Bad Bunny,          Ojitos Lindos
13  Kate Bush           Running Up That Hill (A Deal With God) - 2018 Remaster
14  Joji                Glimpse of Us
15  Nicki Minaj         Super Freaky Girl
16  Bad Bunny           Moscow Mule
17  Rosa Linn           SNAP
18  Glass Animals       Heat Waves
19  KAROL G             PROVENZA
20  Charlie Puth,       Left and Right (Feat. Jung Kook of BTS)
21  Harry Styles        Late Night Talking
22  The Kid LAROI,      STAY (with Justin Bieber)
23  Tom Odell           Another Love
24  Central Cee         Doja
25  Stephen Sanchez     Until I Found You
26  Bad Bunny           Neverita
27  Post Malone,        I Like You (A Happier Song) (with Doja Cat)
28  Lizzo               About Damn Time
29  Nicky Youre,        Sunroof
30  Elton John,         Hold Me Closer
31  Luar La L           Caile
32  KAROL G,            GATÚBELA
33  The Weeknd          Die For You
34  Bad Bunny,          Tarot
35  James Hype,         Ferrari
36  Imagine Dragons     Bones
37  Elton John,         Cold Heart - PNAU Remix
38  The Neighbourhood   Sweater Weather
39  Ghost               Mary On A Cross
40  Shakira,            Te Felicito
41  Justin Bieber       Ghost
42  Bad Bunny,          Party
43  Drake,              Jimmy Cooks (feat. 21 Savage)
44  Doja Cat            Vegas (From the Original Motion Picture Soundtrack ELVIS)
45  Camila Cabello,     Bam Bam (feat. Ed Sheeran)
46  Rauw Alejandro,     LOKERA
47  Rels B              cómo dormiste?
48  The Weeknd          Blinding Lights
49  Arctic Monkeys      505
Of course you can get other info like chart ranking, all artists when there are more than one, etc.
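For instance, here is a hedged sketch of that extension, reusing songs, t, By, and pd from the code above: the rank is taken from the list order (the chart renders in rank order), and it assumes each artist is rendered as its own link inside the artists-names span.
# Hypothetical extension of the scraping loop above; the artist-link selector is an assumption.
rows = []
for rank, song in enumerate(songs, start=1):
    song.location_once_scrolled_into_view
    t.sleep(1)
    title = song.find_element(By.CSS_SELECTOR, 'p[class^="Type__TypeElement-"]').text
    artist_links = song.find_elements(By.CSS_SELECTOR, 'span[data-testid="artists-names"] a')
    if artist_links:
        artists = ', '.join(a.text for a in artist_links)
    else:
        # fall back to the span's plain text if the markup assumption doesn't hold
        artists = song.find_element(By.CSS_SELECTOR, 'span[data-testid="artists-names"]').text
    rows.append((rank, artists, title))

full_df = pd.DataFrame(rows, columns=['Rank', 'Artists', 'Title'])
print(full_df.head())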
The Selenium Chrome/chromedriver setup above is for Linux; just keep the imports and the code after the browser is defined, and adapt the driver setup to your own environment.
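For example, on Windows you might only need to change the Service line; a minimal sketch (assuming the webdriver-manager package is installed; otherwise, point Service at your local chromedriver.exe path):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager  # assumption: pip install webdriver-manager

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("window-size=1920,1080")

# webdriver-manager downloads a matching chromedriver and returns its path
webdriver_service = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)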
Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html
For selenium docs, visit: https://www.selenium.dev/documentation/

Find h4 tag using Beautiful Soup

I'm really new to web scraping and saw a few questions similar to mine, but those solutions didn't work for me. I'm trying to scrape this website: https://www.nba.com/schedule for the h4 tags, which hold the dates and times for upcoming basketball games. I'm trying to use Beautiful Soup to grab that tag, but it always returns an empty list. Here's the code I'm using right now:
import requests
from bs4 import BeautifulSoup

url = "https://www.nba.com/schedule"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
schedule = doc.find_all('h4')
I saw something in another answer about the h4 content being embedded in script tags, and I tried to use the json module but couldn't get that to work. Thanks for your help in advance!
The data you see on the page is loaded from an external URL, so BeautifulSoup doesn't see it. To load the data you can use the following example:
import json
import requests

url = "https://cdn.nba.com/static/json/staticData/scheduleLeagueV2_1.json"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for g in data["leagueSchedule"]["gameDates"]:
    print(g["gameDate"])
    for game in g["games"]:
        print(
            game["homeTeam"]["teamCity"],
            game["homeTeam"]["teamName"],
            "-",
            game["awayTeam"]["teamCity"],
            game["awayTeam"]["teamName"],
        )
    print()
Prints:
10/3/2021 12:00:00 AM
Los Angeles Lakers - Brooklyn Nets
10/4/2021 12:00:00 AM
Toronto Raptors - Philadelphia 76ers
Boston Celtics - Orlando Magic
Miami Heat - Atlanta Hawks
Minnesota Timberwolves - New Orleans Pelicans
Oklahoma City Thunder - Charlotte Hornets
San Antonio Spurs - Utah Jazz
Portland Trail Blazers - Golden State Warriors
Sacramento Kings - Phoenix Suns
LA Clippers - Denver Nuggets
10/5/2021 12:00:00 AM
New York Knicks - Indiana Pacers
Chicago Bulls - Cleveland Cavaliers
Houston Rockets - Washington Wizards
Memphis Grizzlies - Milwaukee Bucks
...and so on.
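If you prefer the schedule as a table rather than printed lines, a small follow-up sketch collects the same fields into a pandas DataFrame (pandas is assumed to be installed; the keys are the ones already used above):
import pandas as pd

rows = []
for g in data["leagueSchedule"]["gameDates"]:
    for game in g["games"]:
        rows.append({
            "gameDate": g["gameDate"],
            "home": game["homeTeam"]["teamCity"] + " " + game["homeTeam"]["teamName"],
            "away": game["awayTeam"]["teamCity"] + " " + game["awayTeam"]["teamName"],
        })

schedule_df = pd.DataFrame(rows)
print(schedule_df.head())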

When I run the sportsipy.nba.teams.Teams function I get "The requested page returned a valid response, but no data could be found"

The code that I am running (straight from the sportsipy documentation):
from sportsipy.nba.teams import Teams

teams = Teams()
for team in teams:
    print(team.name, team.abbreviation)
Returns the following:
The requested page returned a valid response, but no data could be found. Has the season begun, and is the data available on www.sports-reference.com?
Does anyone have any tips on moving forward with getting this information from the API?
That package's API is outdated. The table it's trying to parse now has a different id attribute.
A few things you can do:
Go in and edit/patch the code manually to get the correct data.
Raise the issue on the GitHub repo and wait for a fix and an updated release.
Personally, the patch/fix is a quick and easy one, so just do that (but there could potentially be other tables you need to look into).
Open up nba_utils.py and change lines 85 and 86:
From:
teams_list = utils._get_stats_table(doc, 'div#all_team-stats-base')
opp_teams_list = utils._get_stats_table(doc, 'div#all_opponent-stats-base')
To:
teams_list = utils._get_stats_table(doc, '#totals-team')
opp_teams_list = utils._get_stats_table(doc, '#totals-opponent')
This will solve the current error; however, I don't know what other classes and functions may also need to be patched. Since this table changed slightly, others may have as well.
Output:
Charlotte Hornets CHO
Milwaukee Bucks MIL
Utah Jazz UTA
Sacramento Kings SAC
Memphis Grizzlies MEM
Los Angeles Lakers LAL
Miami Heat MIA
Indiana Pacers IND
Houston Rockets HOU
Phoenix Suns PHO
Atlanta Hawks ATL
Minnesota Timberwolves MIN
San Antonio Spurs SAS
Boston Celtics BOS
Cleveland Cavaliers CLE
Golden State Warriors GSW
Washington Wizards WAS
Portland Trail Blazers POR
Los Angeles Clippers LAC
New Orleans Pelicans NOP
Dallas Mavericks DAL
Brooklyn Nets BRK
New York Knicks NYK
Orlando Magic ORL
Philadelphia 76ers PHI
Chicago Bulls CHI
Denver Nuggets DEN
Toronto Raptors TOR
Oklahoma City Thunder OKC
Detroit Pistons DET
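If you would rather not edit the installed file by hand, a runtime monkey-patch is another option. The sketch below is hedged: it assumes nba_utils resolves the table through the shared sportsipy.utils module and that the selector is the second positional argument, exactly as in the two calls shown above.
# Hypothetical runtime patch; editing nba_utils.py in place (above) remains the simpler fix.
from sportsipy import utils  # assumption: nba_utils calls utils._get_stats_table

_original_get_stats_table = utils._get_stats_table

SELECTOR_MAP = {
    'div#all_team-stats-base': '#totals-team',
    'div#all_opponent-stats-base': '#totals-opponent',
}

def _patched_get_stats_table(html_page, div, *args, **kwargs):
    # swap the outdated selector for the current one, pass everything else through
    return _original_get_stats_table(html_page, SELECTOR_MAP.get(div, div), *args, **kwargs)

utils._get_stats_table = _patched_get_stats_table

from sportsipy.nba.teams import Teams
for team in Teams():
    print(team.name, team.abbreviation)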
Another option is to just not use the API and get the data yourself. If you don't need the abbreviations, it's pretty straightforward with pandas:
import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
teams = list(pd.read_html(url)[4].dropna(subset=['Rk'])['Team'])
for team in teams:
    print(team)
If you do need the abbreviations, then it's a little trickier, but it can be achieved by using BeautifulSoup to pull them out of each team's href:
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table', {'id': 'per_game-team'})
rows = table.find_all('td', {'data-stat': 'team'})

teams = {}
for row in rows:
    if row.find('a'):
        name = row.find('a').text
        abbreviation = row.find('a')['href'].split('/')[-2]
        teams.update({name: abbreviation})

for team in teams.items():
    print(team[0], team[1])

How to get a list of tickers in Jupyter Notebook?

Write code to get a list of tickers for all S&P 500 stocks from Wikipedia. As of 2/24/2021, there are 505 tickers in that list. You can use any method you want as long as the code actually queries the following website to get the list:
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
One way would be to use the requests module to get the HTML code and then use the re module to extract the tickers. Another option would be to use the .read_html function in pandas and then export the tickers column to a list.
You should save the tickers in a list with the name sp500_tickers
This will grab the data in the table named 'constituents'.
# find a specific table by table count
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))
Result:
[{"Symbol":"MMM","Security":"3M Company","SEC filings":"reports","GICS Sector":"Industrials","GICS Sub-Industry":"Industrial Conglomerates","Headquarters Location":"St. Paul, Minnesota","Date first added":"1976-08-09","CIK":66740,"Founded":"1902"},{"Symbol":"ABT","Security":"Abbott Laboratories","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"North Chicago, Illinois","Date first added":"1964-03-31","CIK":1800,"Founded":"1888"},{"Symbol":"ABBV","Security":"AbbVie Inc.","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Pharmaceuticals","Headquarters Location":"North Chicago, Illinois","Date first added":"2012-12-31","CIK":1551152,"Founded":"2013 (1888)"},{"Symbol":"ABMD","Security":"Abiomed","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"Danvers, Massachusetts","Date first added":"2018-05-31","CIK":815094,"Founded":"1981"},{"Symbol":"ACN","Security":"Accenture","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"IT Consulting & Other Services","Headquarters Location":"Dublin, Ireland","Date first added":"2011-07-06","CIK":1467373,"Founded":"1989"},{"Symbol":"ATVI","Security":"Activision Blizzard","SEC filings":"reports","GICS Sector":"Communication Services","GICS Sub-Industry":"Interactive Home Entertainment","Headquarters Location":"Santa Monica, California","Date first added":"2015-08-31","CIK":718877,"Founded":"2008"},{"Symbol":"ADBE","Security":"Adobe Inc.","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"Application Software","Headquarters Location":"San Jose, California","Date first added":"1997-05-05","CIK":796343,"Founded":"1982"},
Etc., etc., etc.
That's JSON. If you want a table, kind of like what you would use in Excel, simply print the df.
Result:
[ Symbol Security SEC filings GICS Sector \
0 MMM 3M Company reports Industrials
1 ABT Abbott Laboratories reports Health Care
2 ABBV AbbVie Inc. reports Health Care
3 ABMD Abiomed reports Health Care
4 ACN Accenture reports Information Technology
.. ... ... ... ...
500 YUM Yum! Brands Inc reports Consumer Discretionary
501 ZBRA Zebra Technologies reports Information Technology
502 ZBH Zimmer Biomet reports Health Care
503 ZION Zions Bancorp reports Financials
504 ZTS Zoetis reports Health Care
GICS Sub-Industry Headquarters Location \
0 Industrial Conglomerates St. Paul, Minnesota
1 Health Care Equipment North Chicago, Illinois
2 Pharmaceuticals North Chicago, Illinois
3 Health Care Equipment Danvers, Massachusetts
4 IT Consulting & Other Services Dublin, Ireland
.. ... ...
500 Restaurants Louisville, Kentucky
501 Electronic Equipment & Instruments Lincolnshire, Illinois
502 Health Care Equipment Warsaw, Indiana
503 Regional Banks Salt Lake City, Utah
504 Pharmaceuticals Parsippany, New Jersey
Date first added CIK Founded
0 1976-08-09 66740 1902
1 1964-03-31 1800 1888
2 2012-12-31 1551152 2013 (1888)
3 2018-05-31 815094 1981
4 2011-07-06 1467373 1989
.. ... ... ...
500 1997-10-06 1041061 1997
501 2019-12-23 877212 1969
502 2001-08-07 1136869 1927
503 2001-06-22 109380 1873
504 2013-06-21 1555280 1952
[505 rows x 9 columns]]
Alternatively, you can export the df to a CSV file.
df.to_csv('constituents.csv')
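Since the exercise asks for a list named sp500_tickers, here is a minimal sketch of the pandas .read_html route mentioned in the question, assuming the constituents table is the first table on the page and keeps its 'Symbol' column:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)                      # list of all tables on the page
sp500_tickers = tables[0]["Symbol"].tolist()    # constituents table is assumed to be first

print(len(sp500_tickers))    # roughly 505 tickers at the time of the question
print(sp500_tickers[:5])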

Beautiful soup scraping with selenium

I'm learning how to scrape using Beautiful Soup with Selenium, and I found a website that has multiple tables, with table tags (my first time dealing with them). I'm trying to scrape the text from each table and append each element to its respective list. First I'm trying to scrape the first table, and the rest I want to do on my own, but I cannot access the tag for some reason.
I also incorporated Selenium to access the site, because when I copy the link to the site into another tab, the list of tables disappears for some reason.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"

event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []

try:
    page = requests.get(targetSite)
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class": "popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly? This is my first time dealing with a site whose elements disappear when the URL is copied into a new tab and/or window.
So far, I am unable to append any information to each list.
One issue is with the for loop: you have for i in items:, but then you refer to item instead of i.
Secondly, if you are using Selenium to render the page, then you should probably use Selenium to get the HTML. The page also has tables embedded within tables, so it's not as straightforward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (which returns a list of dataframes), then iterating through those, as there is a pattern to how the dataframes are constructed.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"

event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []

dfs = pd.read_html(driver.page_source)
driver.close()

for idx, table in enumerate(dfs):
    if table.iloc[0, 0] == 'Event Title':
        event_title.append(table.iloc[-1, 0])

        tempA = dfs[idx + 1]
        tempA.index = tempA[0]
        tempB = dfs[idx + 4]
        tempB.index = tempB[0]
        tempC = dfs[idx + 5]
        tempC.index = tempC[0]

        name.append(tempA.loc['Name', 1])
        address.append(tempA.loc['Address', 1])
        city.append(tempA.loc['City', 1])
        state.append(tempA.loc['State', 1])
        zipCode.append(tempA.loc['Zip', 1])
        location.append(tempA.loc['Location', 1])
        webSite.append(tempA.loc['Web Site', 1])
        fee.append(tempB.loc['Fee', 1])
        event_dates.append(tempB.loc['Dates', 1])
        opening_dates.append(tempB.loc['Opening Days', 1])
        description.append(tempC.loc['Event Description', 1])

df = pd.DataFrame({'event_title': event_title,
                   'name': name,
                   'address': address,
                   'city': city,
                   'state': state,
                   'zipCode': zipCode,
                   'location': location,
                   'webSite': webSite,
                   'fee': fee,
                   'event_dates': event_dates,
                   'opening_dates': opening_dates,
                   'description': description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove\'s central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...
