Scrape Table Data on Multiple Pages from Multiple URLs (Python & BeautifulSoup)

New coder here! I am trying to scrape web table data from multiple URLs. Each URL's page has one table, but that table is split across multiple pages. My code only iterates through the table pages of the first URL and not the rest, so I am able to get pages 1-5 of NBA data for the year 2000 only; it stops there. How do I get my code to pull every year of data? Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}  # any browser-like user-agent

page = 1
year = 2000
while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
    response = requests.get(base_URL, headers=headers)  # headers passed as a keyword
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_='tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
            page += 1
        else:
            print("too many tables")
    else:
        year += 1
        page = 1

I'd consider using pandas here, as 1) its .read_html() function (which uses BeautifulSoup under the hood) makes parsing <table> tags easier, and 2) it can then easily write straight to file.
Also, it's a waste to iterate through 20 pages for every season (for example, the first season you are after only has 4 pages; the rest are blank). So I'd consider adding something that says: once it reaches a blank table, move on to the next season.
import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

results = pd.DataFrame()
year = 2000
while year < 2020:
    goToNextPage = True
    page = 1
    while goToNextPage == True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
        response = requests.get(base_URL, headers=headers)  # headers must be passed as a keyword
        if response.status_code == 200:
            temp_df = pd.read_html(response.text)[0]    # parse the page we already fetched
            temp_df.columns = list(temp_df.iloc[0, :])  # promote the first row to column names
            temp_df = temp_df[temp_df['RK'] != 'RK']    # drop the header rows repeated inside the table
            if len(temp_df) == 0:                       # blank table -> season is done, move on
                goToNextPage = False
                year += 1
                continue
            print('Acquiring Season: %s\tPage: %s' % (year, page))
            temp_df['Season'] = '%s-%s' % (year - 1, year)
            results = results.append(temp_df, sort=False).reset_index(drop=True)
            page += 1
        else:
            goToNextPage = False  # bad response: stop paging this season
            year += 1

results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)
Output:
print (results.head(25).to_string())
RK NAME TEAM SALARY Season
0 1 Shaquille O'Neal, C Los Angeles Lakers $17,142,000 1999-2000
1 2 Kevin Garnett, PF Minnesota Timberwolves $16,806,000 1999-2000
2 3 Alonzo Mourning, C Miami Heat $15,004,000 1999-2000
3 4 Juwan Howard, PF Washington Wizards $15,000,000 1999-2000
4 5 Scottie Pippen, SF Portland Trail Blazers $14,795,000 1999-2000
5 6 Karl Malone, PF Utah Jazz $14,000,000 1999-2000
6 7 Larry Johnson, F New York Knicks $11,910,000 1999-2000
7 8 Gary Payton, PG Seattle SuperSonics $11,020,000 1999-2000
8 9 Rasheed Wallace, PF Portland Trail Blazers $10,800,000 1999-2000
9 10 Shawn Kemp, C Cleveland Cavaliers $10,780,000 1999-2000
10 11 Damon Stoudamire, PG Portland Trail Blazers $10,125,000 1999-2000
11 12 Antonio McDyess, PF Denver Nuggets $9,900,000 1999-2000
12 13 Antoine Walker, PF Boston Celtics $9,000,000 1999-2000
13 14 Shareef Abdur-Rahim, PF Vancouver Grizzlies $9,000,000 1999-2000
14 15 Allen Iverson, SG Philadelphia 76ers $9,000,000 1999-2000
15 16 Vin Baker, PF Seattle SuperSonics $9,000,000 1999-2000
16 17 Ray Allen, SG Milwaukee Bucks $9,000,000 1999-2000
17 18 Anfernee Hardaway, SF Phoenix Suns $9,000,000 1999-2000
18 19 Kobe Bryant, SF Los Angeles Lakers $9,000,000 1999-2000
19 20 Stephon Marbury, PG New Jersey Nets $9,000,000 1999-2000
20 21 Vlade Divac, C Sacramento Kings $8,837,000 1999-2000
21 22 Bryant Reeves, C Vancouver Grizzlies $8,666,000 1999-2000
22 23 Tom Gugliotta, PF Phoenix Suns $8,558,000 1999-2000
23 24 Nick Van Exel, PG Denver Nuggets $8,354,000 1999-2000
24 25 Elden Campbell, C Charlotte Hornets $7,975,000 1999-2000
...
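One portability note: DataFrame.append, used above, was deprecated and then removed in pandas 2.0. On a current install, the accumulation step would collect the per-page frames in a list and concatenate once at the end; a minimal sketch (the stand-in frame just mimics one scraped page):
import pandas as pd

frames = []  # inside the page loop: frames.append(temp_df) instead of results.append(...)
frames.append(pd.DataFrame({'RK': ['1'], 'NAME': ["Shaquille O'Neal, C"], 'SALARY': ['$17,142,000']}))  # stand-in page
results = pd.concat(frames, sort=False, ignore_index=True)
print(results)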

Related

How to scrape this football page?

https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, and also what's below them, like Fouls, Corners, etc.
What I have now is very overcomplicated code that basically strips and splits this string multiple times to grab the values I want.
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup

# getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)  # was requests.get(URL), but URL was never defined
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]

# putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')

# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))

# collecting stats
stats = soup.find("div", {"id": "team_stats"}).text  # kept as raw text so the .replace/.split chain below works
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:]  # second part

all_stats = {'posse_casa': [], 'posse_fora': [], 'chutestotais_casa': [], 'chutestotais_fora': [],
             'acertopasses_casa': [], 'acertopasses_fora': [], 'chutesgol_casa': [], 'chutesgol_fora': [],
             'faltas_casa': [], 'faltas_fora': [], 'escanteios_casa': [], 'escanteios_fora': [],
             'cruzamentos_casa': [], 'cruzamentos_fora': [], 'contatos_casa': [], 'contatos_fora': [],
             'botedef_casa': [], 'botedef_fora': [], 'aereo_casa': [], 'aereo_fora': [],
             'defesas_casa': [], 'defesas_fora': [], 'impedimento_casa': [], 'impedimento_fora': [],
             'tirometa_casa': [], 'tirometa_fora': [], 'lateral_casa': [], 'lateral_fora': [],
             'bolalonga_casa': [], 'bolalonga_fora': [], 'Em casa': [home_team], 'Sem': [round_week]}

# not gonna copy everything, but it is kinda like this for each stat
#stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

# first grabbing 42% possession
all_stats['posse_casa'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[0]
# grabbing 58% possession
all_stats['posse_fora'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[1]

all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa', 'Sem'])
There are a lot of stats in that dict because FBref has all of those stats for previous championship years; only in the current-year championship are there just 12 of them to fill. I intend to run the code on 5-6 different years, so I made a version with all the stats, and for current-year games I intend to fill with nothing when a stat isn't on the page to scrape.
You can get Fouls, Corners and Offsides and 7 tables worth of data from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()
print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]
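One note on the fill-with-nothing plan from the question: the sibling lookup above raises AttributeError when a label is absent, as some stats are on current-season pages. A minimal tolerant wrapper around the same technique, reusing the soup object from the code above (the get_stat name, the 'Tackles' label, and the empty-string fallback are my assumptions, not FBref's API):
def get_stat(soup, label):
    # Return (home, away) for a labelled stat, or ('', '') when the page lacks it.
    tag = soup.find('div', string=label)
    if tag is None:
        return '', ''
    return tag.previous_sibling.text.strip(), tag.next_sibling.text.strip()

home_tackles, away_tackles = get_stat(soup, 'Tackles')  # '' if this match page has no Tackles block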

Cannot get response.get() to load full webpage

When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the webpage in a browser, the missing resort names need to be scrolled to before they will load. How do I make my response.get() obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests
url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb
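In case a region ever returns more entries than one request covers, the /page/1 segment in that endpoint suggests it paginates; a sketch that walks pages until an empty data list comes back (the empty-list stop condition is an assumption about this API, not documented behavior):
import requests

base = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/{}?limit=100"
titles, page = [], 1
while True:
    chunk = requests.get(base.format(page)).json().get("data", [])
    if not chunk:  # assumed end-of-data marker
        break
    titles.extend(d["title"] for d in chunk)
    page += 1
print(len(titles), titles[:3])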

When I run the sportsipy.nba.teams.Teams function I am getting notified that The requested page returned a valid response, but no data could be found

The code that I am running (straight from sportsipy documentation):
from sportsipy.nba.teams import Teams

teams = Teams()
for team in teams:
    print(team.name, team.abbreviation)
Returns the following:
The requested page returned a valid response, but no data could be found. Has the season begun, and is the data available on www.sports-reference.com?
Does anyone have any tips on moving forward with getting this information from the API?
That package's API is old/outdated. The table it's trying to parse now has a different id attribute.
A few things you can do:
Go in and edit/patch the code manually to get the correct data.
Raise the issue on the GitHub repo and wait for a fix and an update.
Personally, the patch/fix is a quick and easy one, so I'd just do that (but there could potentially be other tables you need to look into).
Open up the nba_utils.py:
change lines 85 and 86:
From:
teams_list = utils._get_stats_table(doc, 'div#all_team-stats-base')
opp_teams_list = utils._get_stats_table(doc, 'div#all_opponent-stats-base')
To:
teams_list = utils._get_stats_table(doc, '#totals-team')
opp_teams_list = utils._get_stats_table(doc, '#totals-opponent')
This will solve the current error; however, I don't know what other classes and functions may also need to be patched. There's a chance that, since this table slightly changed, others may have as well.
Output:
Charlotte Hornets CHO
Milwaukee Bucks MIL
Utah Jazz UTA
Sacramento Kings SAC
Memphis Grizzlies MEM
Los Angeles Lakers LAL
Miami Heat MIA
Indiana Pacers IND
Houston Rockets HOU
Phoenix Suns PHO
Atlanta Hawks ATL
Minnesota Timberwolves MIN
San Antonio Spurs SAS
Boston Celtics BOS
Cleveland Cavaliers CLE
Golden State Warriors GSW
Washington Wizards WAS
Portland Trail Blazers POR
Los Angeles Clippers LAC
New Orleans Pelicans NOP
Dallas Mavericks DAL
Brooklyn Nets BRK
New York Knicks NYK
Orlando Magic ORL
Philadelphia 76ers PHI
Chicago Bulls CHI
Denver Nuggets DEN
Toronto Raptors TOR
Oklahoma City Thunder OKC
Detroit Pistons DET
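If you'd rather not edit the installed package files, the same two selector fixes can be applied as a runtime monkey-patch before importing Teams. A sketch, assuming the package exposes the shared utils module as sportsipy.utils and that nba_utils resolves utils._get_stats_table at call time (which matches the lines quoted above):
from sportsipy import utils

_original_get_stats_table = utils._get_stats_table

# map the outdated selectors to the ids the site uses now
_SELECTOR_FIX = {
    'div#all_team-stats-base': '#totals-team',
    'div#all_opponent-stats-base': '#totals-opponent',
}

def _patched_get_stats_table(html_page, div, *args, **kwargs):
    return _original_get_stats_table(html_page, _SELECTOR_FIX.get(div, div), *args, **kwargs)

utils._get_stats_table = _patched_get_stats_table

from sportsipy.nba.teams import Teams  # import after patching so calls resolve to the patched function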
Another option is to just not use the API and get the data yourself. If you don't need the abbreviations, it's pretty straightforward with pandas:
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
teams = list(pd.read_html(url)[4].dropna(subset=['Rk'])['Team'])
for team in teams:
    print(team)
If you do need the abbreviations, then it's a little more tricky, but can be achieved using BeautifulSoup to pull it out of the team href:
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id':'per_game-team'})
rows = table.find_all('td', {'data-stat':'team'})
teams = {}
for row in rows:
    if row.find('a'):
        name = row.find('a').text
        abbreviation = row.find('a')['href'].split('/')[-2]
        teams.update({name: abbreviation})

for team in teams.items():
    print(team[0], team[1])

Scrape College Football team recruiting rankings page

So I have been able to scrape the first 50 teams in the team rankings webpage from 247sports.
I was able to get the following results:
index Rank Team Total Recruits Average Rating Total Rating
0 0 1 Ohio State 17 94.35 286.75
1 10 11 Alabama 10 94.16 210.61
2 8 9 Georgia 11 93.38 219.60
3 31 32 Clemson 8 92.02 161.74
4 3 4 LSU 14 91.92 240.57
5 4 5 Oklahoma 13 91.81 229.03
6 22 23 USC 9 91.60 174.69
7 11 12 Texas A&M 11 91.59 203.03
8 1 2 Notre Dame 18 91.01 250.35
9 2 3 Penn State 18 90.04 243.95
10 6 7 Texas 14 90.04 222.03
11 14 15 Missouri 12 89.94 196.37
12 7 8 Oregon 15 89.91 220.66
13 5 6 Florida State 15 89.88 224.51
14 25 26 Florida 10 89.15 167.89
15 37 38 North Carolina 9 88.94 152.79
16 9 10 Michigan 16 88.76 216.07
17 33 34 UCLA 10 88.49 160.00
18 23 24 Kentucky 11 88.46 173.12
19 12 13 Rutgers 14 88.44 198.56
20 19 20 Indiana 12 88.41 181.20
21 49 50 Washington 8 88.21 132.55
22 20 21 Oklahoma State 13 88.18 177.91
23 43 44 Ole Miss 10 87.80 143.35
24 44 45 California 9 87.78 141.80
25 17 18 Arkansas 15 87.75 188.64
26 16 17 South Carolina 15 87.61 190.84
27 32 33 Georgia Tech 11 87.30 161.33
28 35 36 Tennessee 11 87.25 157.77
29 39 40 NC State 11 87.18 150.18
30 46 47 SMU 9 87.08 138.50
31 36 37 Wisconsin 11 87.00 157.55
32 21 22 Mississippi State 15 86.96 177.33
33 24 25 West Virginia 13 86.78 171.72
34 30 31 Northwestern 14 86.76 162.66
35 40 41 Maryland 12 86.31 149.77
36 15 16 Virginia Tech 18 86.23 191.06
37 18 19 Baylor 19 85.90 184.68
38 13 14 Boston College 22 85.88 197.15
39 26 27 Michigan State 14 85.85 167.60
40 29 30 Cincinnati 14 85.68 164.90
41 34 35 Minnesota 13 85.55 159.35
42 28 29 Iowa State 14 85.54 166.50
43 48 49 Virginia 10 85.39 133.93
44 45 46 Arizona 11 85.27 140.90
45 41 42 Pittsburgh 12 85.10 147.58
46 47 48 Duke 13 85.02 137.40
47 27 28 Vanderbilt 16 85.01 166.77
48 38 39 Purdue 13 84.83 152.55
49 42 43 Illinois 13 84.15 143.86
From the following script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

year = '2022'
url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeTeamRankings/'
print(url)

# Add the `user-agent`, otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")

data = []
for tag in soup.find_all("li", class_="rankings-page__list-item"):
    rank = tag.find('div', {'class': 'primary'}).text.strip()
    team = tag.find('div', {'class': 'team'}).find('a').text.strip()
    total_recruits = tag.find('div', {'class': 'total'}).find('a').text.split(' ')[0].strip()
    # five_stars = tag.find('div',{'class':'gold'}).text.strip()
    # four_stars = tag.find('div',{'class':'gold'}).text.strip()
    # three_stars = tag.find('div',{'class':'metrics'}).text.strip()
    avg_rating = tag.find('div', {'class': 'avg'}).text.strip()
    total_rating = tag.find('div', {'class': 'points'}).text.strip()
    data.append(
        {
            "Rank": rank,
            "Team": team,
            "Total Recruits": total_recruits,
            # "Five-Star Recruits": five_stars,
            # "Four-Star Recruits": four_stars,
            # "Three-Star Recruits": three_stars,
            "Average Rating": avg_rating,
            "Total Rating": total_rating
        }
    )

df = pd.DataFrame(data)
df[['Rank', 'Total Recruits', 'Average Rating', 'Total Rating']] = df[['Rank', 'Total Recruits', 'Average Rating', 'Total Rating']].apply(pd.to_numeric)
df.sort_values('Average Rating', ascending=False).reset_index()
# soup
However, I would like to achieve three things.
I would like to grab the data from the "5-stars", "4-stars", "3-stars" columns in the webpage.
I would like to not just get the first 50 schools, but also tell the webpage to click "load more" enough times so that I can get the table with ALL schools in it.
I want to not only get the 2022 team rankings, but every team ranking that 247sports has to offer (2000 through 2024).
I tried to give it a go with the script below, but I constantly get the top-50 schools output over and over in the print(row) portion of the code.
print(datetime.datetime.now().time())

# years = ['2000', '2001', '2002', '2003', '2004',
#          '2005', '2006', '2007', '2008', '2009',
#          '2010', '2011', '2012', '2013', '2014',
#          '2015', '2016', '2017', '2018', '2019',
#          '2020', '2021', '2022', '2023']
years = ['2022']

rows = []
page_totals = []
# recruits_final = []
for year in years:
    url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeTeamRankings/'
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}
    page = 0
    while True:
        page += 1
        payload = {'Page': '%s' % page}
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        tags = soup.find_all('li', {'class': 'rankings-page__list-item'})
        if len(tags) == 0:
            print('Page: %s' % page)
            page_totals.append(page)
            break
        continue_loop = True
        while continue_loop == True:
            for tag in tags:
                if tag.text.strip() == 'Load More':
                    continue_loop = False
                    continue
                # primary_rank = tag.find('div',{'class':'rank-column'}).find('div',{'class':'primary'}).text.strip()
                # try:
                #     other_rank = tag.find('div',{'class':'rank-column'}).find('div',{'class':'other'}).text.strip()
                # except:
                #     other_rank = ''
                rank = tag.find('div', {'class': 'primary'}).text.strip()
                team = tag.find('div', {'class': 'team'}).find('a').text.strip()
                total_recruits = tag.find('div', {'class': 'total'}).find('a').text.split(' ')[0].strip()
                # five_stars = tag.find('div',{'class':'gold'}).text.strip()
                # four_stars = tag.find('div',{'class':'gold'}).text.strip()
                # three_stars = tag.find('div',{'class':'metrics'}).text.strip()
                avg_rating = tag.find('div', {'class': 'avg'}).text.strip()
                total_rating = tag.find('div', {'class': 'points'}).text.strip()
                try:
                    team = athlete.find('div', {'class': 'status'}).find('img')['title']
                except:
                    team = ''
                row = {'Rank': rank,
                       'Team': team,
                       'Total Recruits': total_recruits,
                       'Average Rating': avg_rating,
                       'Total Rating': total_rating,
                       'Year': year}
                print(row)
                rows.append(row)

recruits = pd.DataFrame(rows)
print(datetime.datetime.now().time())
Any assistance on this is truly appreciated. Thanks in advance.
First, you can extract the year ranges from the dropdown with BeautifulSoup (no need to click the button, as the dropdown is already on the page), then navigate to each link with selenium, using the latter to interact with the "load more" toggle, before finally scraping the resulting tables:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, urllib.parse

d = webdriver.Chrome('path/to/chromedriver')
d.get((url := 'https://247sports.com/Season/2022-Football/CompositeTeamRankings/'))
result = {}
for i in soup(d.page_source, 'html.parser').select('.rankings-page__header-nav > .rankings-page__nav-block .flyout_cmp.year.tooltip li a'):
    if (y := int(i.get_text(strip=True))) > 1999:
        d.get(urllib.parse.urljoin(url, i['href']))
        while d.execute_script("""return document.querySelector('a[data-js="showmore"]') != null"""):
            d.execute_script("""document.querySelector('a[data-js="showmore"]').click()""")
            time.sleep(1)
        result[y] = [{"Rank": i.select_one('div.wrapper .rank-column .other').get_text(strip=True),
                      "Team": i.select_one('.team').get_text(strip=True),
                      "Total": i.select_one('.total').get_text(strip=True).split()[0],
                      "5-Stars": i.select_one('.star-commits-list li:nth-of-type(1) div').get_text(strip=True),
                      "4-Stars": i.select_one('.star-commits-list li:nth-of-type(2) div').get_text(strip=True),
                      "3-Stars": i.select_one('.star-commits-list li:nth-of-type(3) div').get_text(strip=True),
                      "Ave": i.select_one('.avg').get_text(strip=True),
                      "Points": i.select_one('.points').get_text(strip=True),
                      }
                     for i in soup(d.page_source, 'html.parser').select("""ul[data-js="rankings-list"].rankings-page__list li.rankings-page__list-item""")]
result stores all the team rankings for a given year, 2000-2024 (list(result) produces [2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000]). To convert the results to a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame([{'Year':a, **i} for a, b in result.items() for i in b])
print(df)
Output:
Year Rank Team Total 5-Stars 4-Stars 3-Stars Ave Points
0 2024 N/A Iowa 1 0 0 0 0.00 0.00
1 2024 N/A Florida State 3 0 0 0 0.00 0.00
2 2024 N/A BYU 1 0 0 0 0.00 0.00
3 2023 1 Georgia 4 0 4 0 93.86 93.65
4 2023 3 Notre Dame 2 1 1 0 95.98 51.82
... ... ... ... ... ... ... ... ... ...
3543 2000 N/A NC State 18 0 0 0 70.00 0.00
3544 2000 N/A Colorado State 14 0 0 0 70.00 0.00
3545 2000 N/A Oregon 27 0 0 0 70.00 0.00
3546 2000 N/A California 25 0 0 0 70.00 0.00
3547 2000 N/A Texas Tech 20 0 0 0 70.00 0.00
[3548 rows x 9 columns]
Edit: instead of using selenium, you can send requests to the API endpoints that the site uses to retrieve and display the ranking data:
import requests, pandas as pd
from bs4 import BeautifulSoup as soup

def extract_rankings(source):
    return [{"Rank": i.select_one('div.wrapper .rank-column .other').get_text(strip=True),
             "Team": i.select_one('.team').get_text(strip=True),
             "Total": i.select_one('.total').get_text(strip=True).split()[0],
             "5-Stars": i.select_one('.star-commits-list li:nth-of-type(1) div').get_text(strip=True),
             "4-Stars": i.select_one('.star-commits-list li:nth-of-type(2) div').get_text(strip=True),
             "3-Stars": i.select_one('.star-commits-list li:nth-of-type(3) div').get_text(strip=True),
             "Ave": i.select_one('.avg').get_text(strip=True),
             "Points": i.select_one('.points').get_text(strip=True),
             }
            for i in soup(source, 'html.parser').select("""li.rankings-page__list-item""")]

def year_rankings(year):
    page, results = 1, []
    vals = extract_rankings(requests.get(f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}', headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}).text)
    while vals:
        results.extend(vals)
        page += 1
        vals = extract_rankings(requests.get(f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}', headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}).text)
    return results

results = {y: year_rankings(y) for y in range(2000, 2025)}
df = pd.DataFrame([{'Year': a, **i} for a, b in results.items() for i in b])
print(df)
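A small usage note on the requests version: it fires one request per page across 25 seasons, so reusing a single requests.Session (connection pooling) plus a short delay between pages is gentler on the site. A sketch of just the fetch step under those assumptions (the one-second interval is arbitrary, and fetch_page is a hypothetical helper, not part of the answer above):
import time
import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'

def fetch_page(year, page):
    # same endpoint as year_rankings above, but pooled and throttled
    url = (f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/'
           f'?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}')
    time.sleep(1)  # arbitrary politeness delay
    return session.get(url).text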

Problem concatenating URL and scraping data

I am trying to append a URL in Python to scrape details from the target page.
I have the code below, but it seems to be scraping the data from url1 rather than URL.
I scraped the team names from the NFL website without any issue. The issue is with the spotrac URL, where I am appending the team name that I scraped from the NFL website.
import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://www.nfl.com/teams/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text)

for team in team_name:
    team = team.replace(" ", "-").lower()
    url1 = 'https://www.spotrac.com/nfl/rankings/'
    URL = url1 + str(team)
    print(URL)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
    spotrac_df = pd.DataFrame(columns=['Name', 'Salary'])
    for h3 in bs_soup.select('h3'):
        spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)
I'm almost certain the problem is coming from the URL not appending properly. The scraping is taking the salaries etc from url1 rather than URL.
My console output (using Spyder IDE) is as below for print(URL)
The URL is appending correctly, but you have a leading white space in your team names. I also made a few other changes and noted them in the code.
Lastly (and I used to do this too), creating an empty dataframe and then appending to it after each iteration isn't the best method. I've been told it's better to construct your rows as lists/dictionaries and then, once done, have pandas construct the dataframe in one go, so I changed that as well.
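To see the failure mode, a quick illustration with a hypothetical scraped team string (note the leading space):
team = " Arizona Cardinals"                   # hypothetical h4 text with a leading space
print(repr(team.replace(" ", "-").lower()))   # '-arizona-cardinals' -> malformed slug, so spotrac serves its default rankings page
print(repr('-'.join(team.split()).lower()))   # 'arizona-cardinals'  -> the intended slug
With that fixed, the reworked script: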
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nfl.com/teams/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text.strip())  # <- remove leading/trailing white space

url1 = 'https://www.spotrac.com/nfl/rankings/'  # <- since this is fixed, put it before the loop
spotrac_rows = []
for team in team_name:
    team = '-'.join(team.split()).lower()  # <- changed to split in case there are 2 spaces between city and team
    url = url1 + str(team)
    print(url)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for h3 in bs_soup.select('h3'):
        spotrac_rows.append({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text.strip())})  # <- remove white space from the salary

spotrac_df = pd.DataFrame(spotrac_rows)
Output:
print(spotrac_df)
Name Salary
0 Chandler Jones $21,333,333
1 Patrick Peterson $13,184,588
2 D.J. Humphries $12,800,000
3 DeAndre Hopkins $12,500,000
4 Larry Fitzgerald $11,750,000
5 Jordan Hicks $10,500,000
6 Justin Pugh $10,500,000
7 Kenyan Drake $8,483,000
8 Kyler Murray $8,080,601
9 Robert Alford $7,500,000
10 J.R. Sweezy $6,500,000
11 Corey Peters $4,437,500
12 Haason Reddick $4,288,444
13 Jordan Phillips $4,000,000
14 Isaiah Simmons $3,757,101
15 Maxx Williams $3,400,000
16 Zane Gonzalez $3,259,000
17 Devon Kennard $2,500,000
18 Budda Baker $2,173,184
19 De'Vondre Campbell $2,000,000
20 Andy Lee $2,000,000
21 Byron Murphy $1,815,795
22 Christian Kirk $1,607,691
23 Aaron Brewer $1,168,750
24 Max Garcia $1,143,125
25 Andy Isabella $1,052,244
26 Mason Cole $977,629
27 Zach Allen $975,855
28 Chris Banjo $887,500
29 Jonathan Bullard $887,500
... ...
2530 Khari Blasingame $675,000
2531 Kenneth Durden $675,000
2532 Cody Hollister $675,000
2533 Joey Ivie $675,000
2534 Greg Joseph $675,000
2535 Kareem Orr $675,000
2536 David Quessenberry $675,000
2537 Derick Roberson $675,000
2538 Shaun Wilson $675,000
2539 Cole McDonald $635,421
2540 Chris Jackson $629,570
2541 Kobe Smith $614,333
2542 Aaron Brewer $613,333
2543 Cale Garrett $613,333
2544 Tommy Hudson $613,333
2545 Kristian Wilkerson $613,333
2546 Khaylan Kearse-Thomas $612,500
2547 Nick Westbrook $612,333
2548 Kyle Williams $611,833
2549 Mason Kinsey $611,666
2550 Tucker McCann $611,666
2551 Cameron Scarlett $611,666
2552 Teair Tart $611,666
2553 Brandon Kemp $611,333
2554 Wyatt Ray $610,000
2555 Josh Smith $610,000
2556 Logan Woodside $610,000
2557 Rashard Davis $610,000
2558 Avery Gennesy $610,000
2559 Parker Hesse $610,000
[2560 rows x 2 columns]
