Find h4 tag using Beautiful Soup - python

I'm really new to web scraping and saw a few questions similar to mine but those solutions didn't work for me. So I'm trying to scrape this website: https://www.nba.com/schedule for the h4 tags, which hold the dates and times for upcoming basketball games. I'm trying to use beautiful soup to grab that tag but it always returns and empty list. Here's the code I'm using right now:
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
schedule = doc.find_all('h4')
I saw something in another answer about the h4 tags being inside tags and I tried to use a json module but couldn't get that to work. Thanks for your help in advance!

The data you see on the page is loaded from external URL, so BeautifulSoup doesn't see it. To load the data you can use following example:
import json
import requests
url = "https://cdn.nba.com/static/json/staticData/scheduleLeagueV2_1.json"
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for g in data["leagueSchedule"]["gameDates"]:
print(g["gameDate"])
for game in g["games"]:
print(
game["homeTeam"]["teamCity"],
game["homeTeam"]["teamName"],
"-",
game["awayTeam"]["teamCity"],
game["awayTeam"]["teamName"],
)
print()
Prints:
10/3/2021 12:00:00 AM
Los Angeles Lakers - Brooklyn Nets
10/4/2021 12:00:00 AM
Toronto Raptors - Philadelphia 76ers
Boston Celtics - Orlando Magic
Miami Heat - Atlanta Hawks
Minnesota Timberwolves - New Orleans Pelicans
Oklahoma City Thunder - Charlotte Hornets
San Antonio Spurs - Utah Jazz
Portland Trail Blazers - Golden State Warriors
Sacramento Kings - Phoenix Suns
LA Clippers - Denver Nuggets
10/5/2021 12:00:00 AM
New York Knicks - Indiana Pacers
Chicago Bulls - Cleveland Cavaliers
Houston Rockets - Washington Wizards
Memphis Grizzlies - Milwaukee Bucks
...and so on.

Related

When I run the sportsipy.nba.teams.Teams function I am getting notified that The requested page returned a valid response, but no data could be found

The code that I am running (straight from sportsipy documentation):
from sportsipy.nba.teams import Teams
teams = Teams()
for team in teams:
print(team.name, team.abbreviation)
Returns the following:
The requested page returned a valid response, but no data could be found. Has the season begun, and is the data available on www.sports-reference.com?
Does anyone have any tips on moving forward with getting this information from the API?
That package api is old/outdated. The table it's trying to parse now has a different id attribute.
Few things you can do:
Go in and edit/patch the code manually to get the correct data.
Raise the issue on the github and wait for the fix and update.
Personally, the patch/fix is a quick easy one, so just do that (but there could be potentially other tables you may need to look into).
Open up the nba_utils.py:
change lines 85 and 86:
From:
teams_list = utils._get_stats_table(doc, 'div#all_team-stats-base')
opp_teams_list = utils._get_stats_table(doc, 'div#all_opponent-stats-base')
To:
teams_list = utils._get_stats_table(doc, '#totals-team')
opp_teams_list = utils._get_stats_table(doc, '#totals-opponent')
This will solve the current error, however, I don't know what other classes and functions may also need to be patched. There's a chance since this table slighltly changed, other may have as well.
Output:
Charlotte Hornets CHO
Milwaukee Bucks MIL
Utah Jazz UTA
Sacramento Kings SAC
Memphis Grizzlies MEM
Los Angeles Lakers LAL
Miami Heat MIA
Indiana Pacers IND
Houston Rockets HOU
Phoenix Suns PHO
Atlanta Hawks ATL
Minnesota Timberwolves MIN
San Antonio Spurs SAS
Boston Celtics BOS
Cleveland Cavaliers CLE
Golden State Warriors GSW
Washington Wizards WAS
Portland Trail Blazers POR
Los Angeles Clippers LAC
New Orleans Pelicans NOP
Dallas Mavericks DAL
Brooklyn Nets BRK
New York Knicks NYK
Orlando Magic ORL
Philadelphia 76ers PHI
Chicago Bulls CHI
Denver Nuggets DEN
Toronto Raptors TOR
Oklahoma City Thunder OKC
Detroit Pistons DET
Another option is to just not use the api and get the data yourself. If you don't need the abbreviations, it's pretty straight forward with pandas:
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
teams = list(pd.read_html(url)[4].dropna(subset=['Rk'])['Team'])
for team in teams:
print(team)
If you do need the abbreviations, then it's a little more tricky, but can be achieved using BeautifulSoup to pull it out of the team href:
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id':'per_game-team'})
rows = table.find_all('td', {'data-stat':'team'})
teams = {}
for row in rows:
if row.find('a'):
name = row.find('a').text
abbreviation = row.find('a')['href'].split('/')[-2]
teams.update({name:abbreviation})
for team in teams.items():
print(team[0], team[1])

Beautiful soup scraping with selenium

I'm learning how to scrape using Beautiful soup with selenium and I found a website that has multiple tables and found table tags (first time dealing with them). I'm learning how to try to scrape those texts from each table and append each element to respected list. First im trying to scrape the first table, and the rest I want to do on my own. But I cannot access the tag for some reason.
I also incorporated selenium to access the sites, because when I copy the link to the site onto another tab, the list of tables disappears, for some reason.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
try:
page = requests.get(targetSite )
soup = BeautifulSoup(page.text, 'html.parser')
items = soup.find_all('table', {"class":"popdetail"})
for i in items:
event_title.append(item.find('b', {'class': "text"})).text.strip()
name.append(item.find('td', {'class': "text"})).text.strip()
address.append(item.find('td', {'class': "text"})).text.strip()
city.append(item.find('td', {'class': "text"})).text.strip()
state.append(item.find('td', {'class': "text"})).text.strip()
zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly, This is my first time dealing with site's urls elements disappear when copied onto a new tab and/or window
So far, I am unable to append any information to each list.
One issue is with the for loop.
you have for i in items:, but then you are calling item instead of i.
And secondly, if you are using selenium to render the page, then you should probably use selenium to get the html. They also have some embedded tables within tables, so it's not as straight forward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (returns a list of dataframes), then iterating through those as there is a pattern of how the dataframes are constructed.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
dfs = pd.read_html(driver.page_source)
driver.close
for idx, table in enumerate(dfs):
if table.iloc[0,0] == 'Event Title':
event_title.append(table.iloc[-1,0])
tempA = dfs[idx+1]
tempA.index = tempA[0]
tempB = dfs[idx+4]
tempB.index = tempB[0]
tempC = dfs[idx+5]
tempC.index = tempC[0]
name.append(tempA.loc['Name',1])
address.append(tempA.loc['Address',1])
city.append(tempA.loc['City',1])
state.append(tempA.loc['State',1])
zipCode.append(tempA.loc['Zip',1])
location.append(tempA.loc['Location',1])
webSite.append(tempA.loc['Web Site',1])
fee.append(tempB.loc['Fee',1])
event_dates.append(tempB.loc['Dates',1])
opening_dates.append(tempB.loc['Opening Days',1])
description.append(tempC.loc['Event Description',1])
df = pd.DataFrame({'event_title':event_title,
'name':name,
'address':address,
'city':city,
'state':state,
'zipCode':zipCode,
'location':location,
'webSite':webSite,
'fee':fee,
'event_dates':event_dates,
'opening_dates':opening_dates,
'description':description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove\'s central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...

How to compare a list of similar strings in python for sport games?

I have a list of dictionaries of sports games with various attributes. These lists are all made with only one sport in the list. So I have a basketball games list, baseball games list, etc.
I want to format all of the "game" values the same way so the same particular sports match (Pit Steelers vs LA Rams) for example all have the same string value. Examples of a particular game could be "PIT Steelers vs LA Rams" but this also could be formatted "Pittsburgh Steelers vs Los Angeles Rams". I may have up to 7 dictionaries within the list for a particular game formatted in slightly different ways.
I can't choose to just use the team name or the city because within the same sport there could be the same match with those particular team names or cities just in two different leagues like the NFL and the NCAA.
I was thinking I would use the most expansive game name as the key. For example, I would use "Pittsburgh Steelers vs Los Angeles Rams" instead of "PIT Steelers vs LA Rams" as the key to use as a baseline.
Is there a way I could compare these other matches to the key and say if there is above X percentage of this string in the key replace this string game with the key? How would you do it? I am open to all suggestions!
Thanks!
Edit: Here is an attempt using difflib. I generated 1000 random games and imported into excel and sorted by ratio. We can see that it isn't a perfect fit.
Create a list of short team names that doesn't include the city, then scan the title for those short names. You should find two short team names in each title, which you can then use for grouping the titles into unique games.
team_long_names = ['Arizona Cardinals', 'Atlanta Falcons', 'Carolina Panthers', 'Chicago Bears',
'Dallas Cowboys', 'Detriot Lions','Green Bay Packers','Los Angeles Rams',
'Minnesota Vikings','New Orleans Saints','New York Giants', 'Philadelphia Eagles',
'San Francisco 49ers','Seattle Seahawks','Washington Redskins','Baltimore Ravens',
'Buffalo Bills','Cinncinnati Bangals','Cleveland Browns','Denver Broncos',
'Houston Texans','Indanapolis Colts','Jacksonville Jaguars','Kansas City Chiefs',
'Las Vegas Raiders','Los Angeles Chargers','Miami Dolphins','New England Patriots',
'New York Jets','Pittsburgh Steelers','Tennessee Titans']
team_short_names = [n.lower().split(' ')[-1] for n in team_long_names]
game_titles = ['Atlanta Falcons vs New York Jets', 'ATL Falcons vs NY Jets', 'Falcons v Jets',
'SF 49ers vs PIT Steelers', 'San Fransico 49ers vs Pittsburg Steelers', '49ers vs Steelers',
'Dallas Cowboys vs LA Chargers', 'DAL Cowboys vs Los Angles Chargers', 'Cowboys v Chargers',
'Blah blah Falcons and Foo bar Jets']
titles_by_key = []
for title in game_titles:
game_key = '-'.join([word for word in title.lower().split(' ') if word in team_short_names])
titles_by_key.append(game_key + ": " + title)
print(sorted(titles_by_key))
Output:
['49ers-steelers: 49ers vs Steelers',
'49ers-steelers: SF 49ers vs PIT Steelers',
'49ers-steelers: San Fransico 49ers vs Pittsburg Steelers',
'cowboys-chargers: Cowboys v Chargers',
'cowboys-chargers: DAL Cowboys vs Los Angles Chargers',
'cowboys-chargers: Dallas Cowboys vs LA Chargers',
'falcons-jets: ATL Falcons vs NY Jets',
'falcons-jets: Atlanta Falcons vs New York Jets',
'falcons-jets: Blah blah Falcons and Foo bar Jets',
'falcons-jets: Falcons v Jets']
That doesn't solve the problem of possible team name collisions with different leagues, but I suspect there might be easier strategies for detecting the league as a pre-processing step.
You could use fuzzy wuzzy library to replace the strings. Make a main list of the teams, then you can use the fuzzy wuzzy package to extract the team most similar to the other strings. It won't be perfect though, as you'll need to test out different scenarios, for example 'CAR' won't return Carolina Panthers, it'll return 'Arizona Cardinals', and 'LVR' came back with 'Los Angels Rams'. So you'dd just have to pay, and then possibly for those few cases, just work out some logic to get back the right team.
from fuzzywuzzy import process
nfl_teams = ['Arizona Cardinals', 'Atlanta Falcons', 'Carolina Panthers', 'Chicago Bears',
'Dallas Cowboys', 'Detriot Lions','Green Bay Packers','Los Angeles Rams',
'Minnesota Vikings','New Orleans Saints','New York Giants', 'Philadelphia Eagles',
'San Francisco 49ers','Seattle Seahawks','Washington Redskins','Baltimore Ravens',
'Buffalo Bills','Cinncinnati Bangals','Cleveland Browns','Denver Broncos',
'Houston Texans','Indanapolis Colts','Jacksonville Jaguars','Kansas City Chiefs',
'Las Vegas Raiders','Los Angeles Chargers','Miami Dolphins','New England Patriots',
'New York Jets','Pittsburgh Steelers','Tennessee Titans']
matchups = ['OAK Raiders vs CHI Bears','GB Packers vs MIN Vikings','PIT Steelers vs LA Rams', 'PHI vs DEN']
for each in matchups:
team1, team2 = each.split('vs')[0].strip(), each.split('vs')[-1].strip()
team1_alpha = process.extractOne(team1, nfl_teams)[0]
team2_alpha = process.extractOne(team2, nfl_teams)[0]
print ('%s -----> %s vs %s' %(each, team1_alpha, team2_alpha))
Output:
OAK Raiders vs CHI Bears -----> Las Vegas Raiders vs Chicago Bears
GB Packers vs MIN Vikings -----> Green Bay Packers vs Minnesota Vikings
PIT Steelers vs LA Rams -----> Pittsburgh Steelers vs Los Angeles Rams
PHI vs DEN -----> Philadelphia Eagles vs Denver Broncos
Or if you do want to do a compare to check to see how similar they are.
from fuzzywuzzy import fuzz
var1 = "Pittsburgh Steelers vs Los Angeles Rams"
var2 = "PIT Steelers vs LA Rams"
print(fuzz.ratio(var1, var2))
print(fuzz.token_sort_ratio(var1, var2))
print(fuzz.token_set_ratio(var1, var2))
Output:
68
71
82

How to extract only the n-nth HTML title tag in a sequence in Python with BeautifulSoup?

I'm trying to extract data from a Wikipedia table (https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award) about the MVP winners over NBA history.
This is my code:
wik_req = requests.get("https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award")
wik_webpage = wik_req.content
soup = BeautifulSoup(wik_webpage, "html.parser")
my_table = soup('table', {"class":"wikitable plainrowheaders sortable"})[0].find_all('a')
print(my_table)
for x in my_table:
test = x.get("title")
print(test)
However, this code prints all HTML title tags of the table as in the following (short version):
'1955–56 NBA season
Bob Pettit
Power Forward (basketball)
United States
St. Louis Hawks
1956–57 NBA season
Bob Cousy
Point guard
Boston Celtics'
Eventually, I want to create a pandas dataframe in which I store all the season years in a column, all the player years in a column, and so on and so forth. What code does the trick to only print one of the HTML tag titles (e.g. only the NBA season years)? I can then store those into a column to set up my dataframe and do the same with player, position, nationality and team.
All you should need for that dataframe is:
import pandas as pd
url = "https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award"
df=pd.read_html(url)[5]
Output:
print(df)
Season Player ... Nationality Team
0 1955–56 Bob Pettit* ... United States St. Louis Hawks
1 1956–57 Bob Cousy* ... United States Boston Celtics
2 1957–58 Bill Russell* ... United States Boston Celtics (2)
3 1958–59 Bob Pettit* (2) ... United States St. Louis Hawks (2)
4 1959–60 Wilt Chamberlain* ... United States Philadelphia Warriors
.. ... ... ... ... ...
59 2014–15 Stephen Curry^ ... United States Golden State Warriors (2)
60 2015–16 Stephen Curry^ (2) ... United States Golden State Warriors (3)
61 2016–17 Russell Westbrook^ ... United States Oklahoma City Thunder (2)
62 2017–18 James Harden^ ... United States Houston Rockets (4)
63 2018–19 Giannis Antetokounmpo^ ... Greece Milwaukee Bucks (4)
[64 rows x 5 columns]
If you really want to stick with BeautifulSoup, here's an example to get you started:
my_table = soup('table', {"class":"wikitable plainrowheaders sortable"})[0]
season_col=[]
for row in my_table.find_all('tr')[1:]:
season = row.findChildren(recursive=False)[0]
season_col.append(season.text.strip())
I expect there may be some differences between columns, but as you indicated you want to get familiar with BeautifulSoup, that's for you to explore :)

Selenium loading, but not printing all HTML

I am attempting to use Python and Selenium to web-scrape dynamically loaded data from a website. The problem is, only about half of the data is being reported as present, when in reality it all should be there. Even after using pauses before printing out all the page content, or simple find element by class searches, there seems to be no solution. The URL of the site is https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909. As you can see, there are 13 main sections, however I am only able to retrieve data from the first four games. To best show the problem I'll attach the code for printing the inner-HTML for the entire page to show the discrepancies between the loaded and non-loaded data.
from selenium import webdriver
import requests
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))
EDIT:
The problem is not the wait time, for I am running it line by line and fully waiting for it to load. It appears the problem boild down to selenium not grabbing all the JS loaded text on the page, as seen by the console output in the answer below.
#sudonym's analysis was in the right direction. You need to induce WebDriverWait for the desired elements to be visible before you attempt to extract them through execute_script() method as follows:
Code Block:
# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
print(driver.execute_script("return document.documentElement.innerText;"))
Console Output:
SPORTSBOOK REVIEW
Home
Best Sportsbooks
Rating Guide
Blacklist
Bonuses
BETTING ODDS
FREE PICKS
Sports Picks
NFL
College Football
NBA
NCAAB
MLB
NHL
More Sports
How to Bet
Tools
FORUM
Home
Players Talk
Sportsbooks & Industry
Newbie Forum
Handicapper Think Tank
David Malinsky's Point Blank
Service Plays
Bitcoin Sports Betting
NBA Betting
NFL Betting
NCAAF Betting
MLB Betting
NHL Betting
CONTESTS
EARN BETPOINTS
What Are Betpoints?
SBR Sportsbook
SBR Casino
SBR Racebook
SBR Poker
SBR Store
Today
NFL
NBA
NHL
MLB
College Football
NCAA Basketball
Soccer
Soccer Odds
Major League Soccer
UEFA Champions League
UEFA Nations League
UEFA Europa League
English Premier League
World Cup 2022
Tennis
Tennis Odds
ATP
WTA
UFC
Boxing
More Sports
CFL
WNBA
AFL
Betting Odds/NFL Odds/Consensus
TODAY
|
YESTERDAY
|
DATE
?
Login
?
Settings
?
Bet Tracker
?
Bet Card
?
Favorites
NFL Consensus for Sep 09, 2018
USA - National Football League
Sunday Sep 09, 2018
01:00 PM
/
Pittsburgh vs Cleveland
453
Pittsburgh
454
Cleveland
Current Line
-3½+105
+3½-115
Wagers Placed
10040
54.07%
8530
45.93%
Amount Wagered
$381,520.00
56.10%
$298,550.00
43.90%
Average Bet Size
$38.00
$35.00
SBR Contest Best Bets
22
9
01:00 PM
/
San Francisco vs Minnesota
455
San Francisco
456
Minnesota
Current Line
+6-102
-6-108
Wagers Placed
6250
41.25%
8900
58.75%
Amount Wagered
$175,000.00
29.50%
$418,300.00
70.50%
Average Bet Size
$28.00
$47.00
SBR Contest Best Bets
5
19
01:00 PM
/
Cincinnati vs Indianapolis
457
Cincinnati
458
Indianapolis
Current Line
-1-104
+1-106
Wagers Placed
11640
66.36%
5900
33.64%
Amount Wagered
$1,338,600.00
85.65%
$224,200.00
14.35%
Average Bet Size
$115.00
$38.00
SBR Contest Best Bets
23
12
01:00 PM
/
Buffalo vs Baltimore
459
Buffalo
460
Baltimore
Current Line
+7½-103
-7½-107
Wagers Placed
5220
33.83%
10210
66.17%
Amount Wagered
$78,300.00
16.79%
$387,980.00
83.21%
Average Bet Size
$15.00
$38.00
SBR Contest Best Bets
5
17
01:00 PM
/
Jacksonville vs N.Y. Giants
461
Jacksonville
462
N.Y. Giants
01:00 PM
/
Tampa Bay vs New Orleans
463
Tampa Bay
464
New Orleans
01:00 PM
/
Houston vs New England
465
Houston
466
New England
01:00 PM
/
Tennessee vs Miami
467
Tennessee
468
Miami
04:05 PM
/
Kansas City vs L.A. Chargers
469
Kansas City
470
L.A. Chargers
04:25 PM
/
Seattle vs Denver
471
Seattle
472
Denver
04:25 PM
/
Dallas vs Carolina
473
Dallas
474
Carolina
04:25 PM
/
Washington vs Arizona
475
Washington
476
Arizona
08:20 PM
/
Chicago vs Green Bay
477
Chicago
478
Green Bay
Media
Site Map
Terms of use
Contact Us
Privacy Policy
DMCA
18+. Gamble Responsibly.
© Sportsbook Review. All Rights Reserved.
This solution is only worth to consider if there are lots of WebDriverWait calls
and given the interest in reduced runtime - else go for DebanjanB's
approach
You need to wait some time to let your html load completely. Also, you can set a timeout for script execution. To add a unconditional wait to driver.get(URL) in selenium, driver.set_page_load_timeout(n) with n = time/seconds and loop:
driver.set_page_load_timeout(n) # Set timeout of n seconds for page load
loading_finished = 0 # Set flag to 0
while loading_finished == 0: # Repeat while flag = 0
try:
sleep(random.uniform(0.1, 0.5)) # wait some time
website = driver.get(URL) # try to load for n seconds
loading_finished = 1 # Set flag to 1 and exit while loop
logger.info("website loaded") # Indicate load success
except:
logger.warn("timeout - retry") # Indicate load fail
else: # If flag == 1
driver.set_script_timeout(n) # Set timeout of n seconds for script
script_finished = 0 # Set flag to 0
while script_finished == 0 # Second loop
try:
print driver.execute_script("return document.documentElement.innerText;")
script_finished = 1 # Set flag to 1
logger.info("script done") # Indicate script done
except:
logger.warn("script timeout")
else:
logger.info("if you're still missing html here, increase timeout")

Categories

Resources