Scrape College Football team recruiting rankings page - python

So far, I have been able to scrape the first 50 teams on the 247sports team rankings page, with the following results:
index Rank Team Total Recruits Average Rating Total Rating
0 0 1 Ohio State 17 94.35 286.75
1 10 11 Alabama 10 94.16 210.61
2 8 9 Georgia 11 93.38 219.60
3 31 32 Clemson 8 92.02 161.74
4 3 4 LSU 14 91.92 240.57
5 4 5 Oklahoma 13 91.81 229.03
6 22 23 USC 9 91.60 174.69
7 11 12 Texas A&M 11 91.59 203.03
8 1 2 Notre Dame 18 91.01 250.35
9 2 3 Penn State 18 90.04 243.95
10 6 7 Texas 14 90.04 222.03
11 14 15 Missouri 12 89.94 196.37
12 7 8 Oregon 15 89.91 220.66
13 5 6 Florida State 15 89.88 224.51
14 25 26 Florida 10 89.15 167.89
15 37 38 North Carolina 9 88.94 152.79
16 9 10 Michigan 16 88.76 216.07
17 33 34 UCLA 10 88.49 160.00
18 23 24 Kentucky 11 88.46 173.12
19 12 13 Rutgers 14 88.44 198.56
20 19 20 Indiana 12 88.41 181.20
21 49 50 Washington 8 88.21 132.55
22 20 21 Oklahoma State 13 88.18 177.91
23 43 44 Ole Miss 10 87.80 143.35
24 44 45 California 9 87.78 141.80
25 17 18 Arkansas 15 87.75 188.64
26 16 17 South Carolina 15 87.61 190.84
27 32 33 Georgia Tech 11 87.30 161.33
28 35 36 Tennessee 11 87.25 157.77
29 39 40 NC State 11 87.18 150.18
30 46 47 SMU 9 87.08 138.50
31 36 37 Wisconsin 11 87.00 157.55
32 21 22 Mississippi State 15 86.96 177.33
33 24 25 West Virginia 13 86.78 171.72
34 30 31 Northwestern 14 86.76 162.66
35 40 41 Maryland 12 86.31 149.77
36 15 16 Virginia Tech 18 86.23 191.06
37 18 19 Baylor 19 85.90 184.68
38 13 14 Boston College 22 85.88 197.15
39 26 27 Michigan State 14 85.85 167.60
40 29 30 Cincinnati 14 85.68 164.90
41 34 35 Minnesota 13 85.55 159.35
42 28 29 Iowa State 14 85.54 166.50
43 48 49 Virginia 10 85.39 133.93
44 45 46 Arizona 11 85.27 140.90
45 41 42 Pittsburgh 12 85.10 147.58
46 47 48 Duke 13 85.02 137.40
47 27 28 Vanderbilt 16 85.01 166.77
48 38 39 Purdue 13 84.83 152.55
49 42 43 Illinois 13 84.15 143.86
From the following script:
year = '2022'
url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeTeamRankings/'
print(url)

# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")

data = []
for tag in soup.find_all("li", class_="rankings-page__list-item"):
    rank = tag.find('div', {'class': 'primary'}).text.strip()
    team = tag.find('div', {'class': 'team'}).find('a').text.strip()
    total_recruits = tag.find('div', {'class': 'total'}).find('a').text.split(' ')[0].strip()
    # five_stars = tag.find('div',{'class':'gold'}).text.strip()
    # four_stars = tag.find('div',{'class':'gold'}).text.strip()
    # three_stars = tag.find('div',{'class':'metrics'}).text.strip()
    avg_rating = tag.find('div', {'class': 'avg'}).text.strip()
    total_rating = tag.find('div', {'class': 'points'}).text.strip()
    data.append(
        {
            "Rank": rank,
            "Team": team,
            "Total Recruits": total_recruits,
            # "Five-Star Recruits": five_stars,
            # "Four-Star Recruits": four_stars,
            # "Three-Star Recruits": three_stars,
            "Average Rating": avg_rating,
            "Total Rating": total_rating
        }
    )

df = pd.DataFrame(data)
df[['Rank', 'Total Recruits', 'Average Rating', 'Total Rating']] = df[['Rank', 'Total Recruits', 'Average Rating', 'Total Rating']].apply(pd.to_numeric)
df.sort_values('Average Rating', ascending=False).reset_index()
# soup
However, I would like to achieve three things:

1. Grab the data from the "5-stars", "4-stars", and "3-stars" columns on the webpage.
2. Get not just the first 50 schools, but tell the webpage to click "load more" enough times that I can get the table with ALL schools in it.
3. Get not only the 2022 team rankings, but every team ranking that 247sports has to offer (2000 through 2024).
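As a side note, the list of seasons in that last point doesn't need to be typed out by hand; it can be generated (a small sketch, assuming the 2000-2024 range mentioned above):

```python
# Generate the season strings 247sports offers, 2000 through 2024
years = [str(y) for y in range(2000, 2025)]
print(years[0], years[-1])  # 2000 2024
```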
I tried to give it a go with the script below, but the `print(row)` portion keeps outputting the same top-50 schools over and over in a loop.
print(datetime.datetime.now().time())

# years = ['2000', '2001', '2002', '2003', '2004',
#          '2005', '2006', '2007', '2008', '2009',
#          '2010', '2011', '2012', '2013', '2014',
#          '2015', '2016', '2017', '2018', '2019',
#          '2020', '2021', '2022', '2023']
years = ['2022']

rows = []
page_totals = []
# recruits_final = []
for year in years:
    url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeTeamRankings/'
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}
    page = 0
    while True:
        page += 1
        payload = {'Page': '%s' % page}
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        tags = soup.find_all('li', {'class': 'rankings-page__list-item'})
        if len(tags) == 0:
            print('Page: %s' % page)
            page_totals.append(page)
            break
        continue_loop = True
        while continue_loop == True:
            for tag in tags:
                if tag.text.strip() == 'Load More':
                    continue_loop = False
                    continue
                # primary_rank = tag.find('div',{'class':'rank-column'}).find('div',{'class':'primary'}).text.strip()
                # try:
                #     other_rank = tag.find('div',{'class':'rank-column'}).find('div',{'class':'other'}).text.strip()
                # except:
                #     other_rank = ''
                rank = tag.find('div', {'class': 'primary'}).text.strip()
                team = tag.find('div', {'class': 'team'}).find('a').text.strip()
                total_recruits = tag.find('div', {'class': 'total'}).find('a').text.split(' ')[0].strip()
                # five_stars = tag.find('div',{'class':'gold'}).text.strip()
                # four_stars = tag.find('div',{'class':'gold'}).text.strip()
                # three_stars = tag.find('div',{'class':'metrics'}).text.strip()
                avg_rating = tag.find('div', {'class': 'avg'}).text.strip()
                total_rating = tag.find('div', {'class': 'points'}).text.strip()
                row = {'Rank': rank,
                       'Team': team,
                       'Total Recruits': total_recruits,
                       'Average Rating': avg_rating,
                       'Total Rating': total_rating,
                       'Year': year}
                print(row)
                rows.append(row)

recruits = pd.DataFrame(rows)
print(datetime.datetime.now().time())
Any assistance on this is truly appreciated. Thanks in advance.

First, you can extract the year range from the dropdown with BeautifulSoup (no need to click the button, as the dropdown is already present in the page source). Then navigate to each year's link with selenium, use it to click the "Load More" toggle until the toggle disappears, and finally scrape the resulting tables:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, urllib.parse, re

d = webdriver.Chrome('path/to/chromedriver')
d.get((url := 'https://247sports.com/Season/2022-Football/CompositeTeamRankings/'))

result = {}
for i in soup(d.page_source, 'html.parser').select('.rankings-page__header-nav > .rankings-page__nav-block .flyout_cmp.year.tooltip li a'):
    if (y := int(i.get_text(strip=True))) > 1999:
        d.get(urllib.parse.urljoin(url, i['href']))
        while d.execute_script("""return document.querySelector('a[data-js="showmore"]') != null"""):
            d.execute_script("""document.querySelector('a[data-js="showmore"]').click()""")
            time.sleep(1)
        result[y] = [{"Rank": i.select_one('div.wrapper .rank-column .other').get_text(strip=True),
                      "Team": i.select_one('.team').get_text(strip=True),
                      "Total": i.select_one('.total').get_text(strip=True).split()[0],
                      "5-Stars": i.select_one('.star-commits-list li:nth-of-type(1) div').get_text(strip=True),
                      "4-Stars": i.select_one('.star-commits-list li:nth-of-type(2) div').get_text(strip=True),
                      "3-Stars": i.select_one('.star-commits-list li:nth-of-type(3) div').get_text(strip=True),
                      "Ave": i.select_one('.avg').get_text(strip=True),
                      "Points": i.select_one('.points').get_text(strip=True),
                      }
                     for i in soup(d.page_source, 'html.parser').select("""ul[data-js="rankings-list"].rankings-page__list li.rankings-page__list-item""")]
result stores all the team rankings for a given year, 2000-2024 (list(result) produces [2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000]). To convert the results to a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame([{'Year':a, **i} for a, b in result.items() for i in b])
print(df)
Output:
Year Rank Team Total 5-Stars 4-Stars 3-Stars Ave Points
0 2024 N/A Iowa 1 0 0 0 0.00 0.00
1 2024 N/A Florida State 3 0 0 0 0.00 0.00
2 2024 N/A BYU 1 0 0 0 0.00 0.00
3 2023 1 Georgia 4 0 4 0 93.86 93.65
4 2023 3 Notre Dame 2 1 1 0 95.98 51.82
... ... ... ... ... ... ... ... ... ...
3543 2000 N/A NC State 18 0 0 0 70.00 0.00
3544 2000 N/A Colorado State 14 0 0 0 70.00 0.00
3545 2000 N/A Oregon 27 0 0 0 70.00 0.00
3546 2000 N/A California 25 0 0 0 70.00 0.00
3547 2000 N/A Texas Tech 20 0 0 0 70.00 0.00
[3548 rows x 9 columns]
Edit: instead of using selenium, you can send requests to the API endpoints that the site uses to retrieve and display the ranking data:
import requests, pandas as pd
from bs4 import BeautifulSoup as soup

def extract_rankings(source):
    return [{"Rank": i.select_one('div.wrapper .rank-column .other').get_text(strip=True),
             "Team": i.select_one('.team').get_text(strip=True),
             "Total": i.select_one('.total').get_text(strip=True).split()[0],
             "5-Stars": i.select_one('.star-commits-list li:nth-of-type(1) div').get_text(strip=True),
             "4-Stars": i.select_one('.star-commits-list li:nth-of-type(2) div').get_text(strip=True),
             "3-Stars": i.select_one('.star-commits-list li:nth-of-type(3) div').get_text(strip=True),
             "Ave": i.select_one('.avg').get_text(strip=True),
             "Points": i.select_one('.points').get_text(strip=True),
             }
            for i in soup(source, 'html.parser').select("""li.rankings-page__list-item""")]

def year_rankings(year):
    page, results = 1, []
    vals = extract_rankings(requests.get(f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}', headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}).text)
    while vals:
        results.extend(vals)
        page += 1
        vals = extract_rankings(requests.get(f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}', headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}).text)
    return results

results = {y: year_rankings(y) for y in range(2000, 2025)}
df = pd.DataFrame([{'Year': a, **i} for a, b in results.items() for i in b])
print(df)
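One follow-up on the output above: early seasons carry "N/A" in the Rank column, so the question's plain `pd.to_numeric` conversion would raise on them. Passing `errors='coerce'` turns unparseable values into `NaN` instead; a minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows mirroring the scraped output: "N/A" ranks can't be parsed
df = pd.DataFrame({"Rank": ["N/A", "1", "3"], "Ave": ["0.00", "93.86", "95.98"]})

# errors="coerce" maps unparseable strings to NaN instead of raising
df[["Rank", "Ave"]] = df[["Rank", "Ave"]].apply(pd.to_numeric, errors="coerce")
print(df)
```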

Related

How to scrape this football page?

https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, as well as what's below them, like Fouls, Corners...
What I have now is very overcomplicated code: basically I strip and split this one string multiple times to grab the values I want.
# getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]

# putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')

# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))

# collecting stats
stats = soup.find("div", {"id": "team_stats"}).text.split()[5:]              # first part of stats, with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:]  # second part

all_stats = {'posse_casa': [], 'posse_fora': [], 'chutestotais_casa': [], 'chutestotais_fora': [],
             'acertopasses_casa': [], 'acertopasses_fora': [], 'chutesgol_casa': [], 'chutesgol_fora': [],
             'faltas_casa': [], 'faltas_fora': [], 'escanteios_casa': [], 'escanteios_fora': [],
             'cruzamentos_casa': [], 'cruzamentos_fora': [], 'contatos_casa': [], 'contatos_fora': [],
             'botedef_casa': [], 'botedef_fora': [], 'aereo_casa': [], 'aereo_fora': [],
             'defesas_casa': [], 'defesas_fora': [], 'impedimento_casa': [], 'impedimento_fora': [],
             'tirometa_casa': [], 'tirometa_fora': [], 'lateral_casa': [], 'lateral_fora': [],
             'bolalonga_casa': [], 'bolalonga_fora': [], 'Em casa': [home_team], 'Sem': [round_week]}

# not going to copy everything, but it's like this for each stat
# stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

# first grabbing 42% possession
all_stats['posse_casa'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[0]
# grabbing 58% possession
all_stats['posse_fora'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[1]

all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa', 'Sem'])
There are a lot of stats in that dict because FBref has all of them for previous championship years; only the current-year championship has just 12 of them to fill. I intend to run the code on 5-6 different years, so I made a version with all the stats, and for current-year games I plan to fill in nothing where there's no stat on the page to scrape.
You can get Fouls, Corners and Offsides and 7 tables worth of data from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()
print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]
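If you then want the "2 of 4 — 50%" style cells from that possession/shots table as numbers, a regex is sturdier than chained `split` calls. A sketch, assuming the "N of M — P%" orientation shown for the home side:

```python
import re

def parse_stat(cell):
    """Parse a cell like '2 of 4 — 50%' into (made, attempted, percent)."""
    m = re.search(r'(\d+)\s+of\s+(\d+).*?(\d+)%', cell)
    return tuple(int(g) for g in m.groups()) if m else None

print(parse_stat('2 of 4\xa0—\xa050%'))  # (2, 4, 50)
```

The away-side cells reverse the order ("0% — 0 of 8"), so those would need the mirrored pattern.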

BeautifulSoup scraper doesn't retrieve any information

I am trying to retrieve football squad data from multiple Wikipedia pages and put it in a pandas DataFrame. One example of the source is this [link][1], but I want to do this for the links between 1930-2018.
The code that I will show used to work in Python 2, and I'm trying to adapt it to Python 3. The information on every page is a set of tables with 7 columns, and all of the tables have the same format.
The code used to crash, but now it runs. The only problem is that it produces an empty .csv file.
Just to give more context, I made some specific changes:
Python 2
path = os.path.join('.cache', hashlib.md5(url).hexdigest() + '.html')
Python 3
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
Python 2
open(path, 'w') as fd:
Python 3
open(path, 'wb') as fd:
Python 2
years = range(1930,1939,4) + range(1950,2015,4)
Python 3: Yes here I also changed the range so I could get World Cup 2018
years = list(range(1930,1939,4)) + list(range(1950,2019,4))
This is the whole chunk of code. If somebody can spot where the problem is and suggest a solution, I would be very thankful.
import os
import hashlib
import requests
from bs4 import BeautifulSoup
import pandas as pd

if not os.path.exists('.cache'):
    os.makedirs('.cache')

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/15612.1.29.41.4'
session = requests.Session()

def get(url):
    '''Return cached BeautifulSoup tree for url'''
    path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
    if not os.path.exists(path):
        print(url)
        response = session.get(url, headers={'User-Agent': ua})
        with open(path, 'wb') as fd:
            fd.write(response.text.encode('utf-8'))
    return BeautifulSoup(open(path), 'html.parser')

def squads(url):
    result = []
    soup = get(url)
    year = url[29:33]
    for table in soup.find_all('table', 'sortable'):
        if "wikitable" not in table['class']:
            country = table.find_previous("span", "mw-headline").text
            for tr in table.find_all('tr')[1:]:
                cells = [td.text.strip() for td in tr.find_all('td')]
                cells += [country, td.a.get('title') if td.a else 'none', year]
                result.append(cells)
    return result

years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))

result = []
for year in years:
    url = "http://en.wikipedia.org/wiki/" + str(year) + "_FIFA_World_Cup_squads"
    result += squads(url)

Final_result = pd.DataFrame(result)
Final_result.to_csv('/Users/home/Downloads/data.csv', index=False, encoding='iso-8859-1')

[1]: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads
To get information about each team for the years 1930-2018, you can use the next example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/{}_FIFA_World_Cup_squads"

dfs = []
for year in range(1930, 2019):
    print(year)
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    tables = soup.find_all(
        lambda tag: tag.name == "table"
        and tag.select_one('th:-soup-contains("Pos.")')
    )
    for table in tables:
        for tag in table.select('[style="display:none"]'):
            tag.extract()
        df = pd.read_html(str(table))[0]
        df["Year"] = year
        df["Country"] = table.find_previous(["h3", "h2"]).span.text
        dfs.append(df)

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
Prints:
...
13 14 FW Moussa Konaté 3 April 1993 (aged 25) 28 Amiens 2018 Senegal 10.0
14 15 FW Diafra Sakho 24 December 1989 (aged 28) 12 Rennes 2018 Senegal 3.0
15 16 GK Khadim N'Diaye 5 April 1985 (aged 33) 26 Horoya 2018 Senegal 0.0
16 17 MF Badou Ndiaye 27 October 1990 (aged 27) 20 Stoke City 2018 Senegal 1.0
17 18 FW Ismaïla Sarr 25 February 1998 (aged 20) 16 Rennes 2018 Senegal 3.0
18 19 FW M'Baye Niang 19 December 1994 (aged 23) 7 Torino 2018 Senegal 0.0
19 20 FW Keita Baldé 8 March 1995 (aged 23) 19 Monaco 2018 Senegal 3.0
20 21 DF Lamine Gassama 20 October 1989 (aged 28) 36 Alanyaspor 2018 Senegal 0.0
21 22 DF Moussa Wagué 4 October 1998 (aged 19) 10 Eupen 2018 Senegal 0.0
22 23 GK Alfred Gomis 5 September 1993 (aged 24) 1 SPAL 2018 Senegal 0.0
and saves data.csv (screenshot from LibreOffice):
Just tested: you have no data because the "wikitable" class is present in every table, so your condition filters all of them out.
You can replace "not in" with "in":
if "wikitable" in table["class"]:
    ...
And your BeautifulSoup data will be there.
Once you've changed this condition, you will have a problem with this line:
cells += [country, td.a.get('title') if td.a else 'none', year]
This is because td is not defined at that point (in Python 3, unlike Python 2, the comprehension variable no longer leaks out of the list comprehension). I'm not quite sure what the aim of this line is, but you can define the tds beforehand and use them afterwards:
tds = tr.find_all('td')
cells += ...
In general, you can add breakpoints to your code to identify more easily where the problem is.
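Putting both fixes together on a minimal, made-up fixture (the HTML snippet and the `tds[-1]` choice are assumptions for illustration; on the real pages you would check which cell's link title you actually want):

```python
from bs4 import BeautifulSoup

# Tiny fixture mimicking the Wikipedia squad-table structure
html = '''
<span class="mw-headline">Brazil</span>
<table class="wikitable sortable">
  <tr><th>No.</th><th>Player</th></tr>
  <tr><td>1</td><td><a title="Goalkeeper">GK</a></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
year, result = '2018', []
for table in soup.find_all('table', 'sortable'):
    if "wikitable" in table['class']:      # was "not in", which skipped every table
        country = table.find_previous("span", "mw-headline").text
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all('td')        # bind the cells once, reuse below
            cells = [td.text.strip() for td in tds]
            cells += [country, tds[-1].a.get('title') if tds and tds[-1].a else 'none', year]
            result.append(cells)
print(result)
```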

BeautifulSoup doesn't display the content

I want to scrape spot price data from MCX India website.
The HTML script as visible on inspecting an element is as follows:
<div class="contents spotmarketprice">
<div id="cont-1" style="display: block;">
<table class="mcx-table mrB20" width="100%" cellspacing="8" id="tblSMP">
<thead>
<tr>
<th class="symbol-head">
Commodity
</th>
<th>
Unit
</th>
<th class="left1">
Location
</th>
<th class="right1">
Spot Price (Rs.)
</th>
<th>
Up/Down
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="symbol" style="width:30%;">ALMOND</td>
<td style="width:17%;">1 KGS</td>
<td align="left" style="width:17%;">DELHI</td>
<td align="right" style="width:17%;">558.00</td>
<td align="right" class="padR20" style="width:19%;">=</td>
</tr>
The code I have written is:
#import the required libraries
from bs4 import BeautifulSoup
import requests
#Getting data from website
source= requests.get('http://www.mcxindia.com/market-data/spot-market-price').text
#Getting the html code of the website
soup = BeautifulSoup(source, 'lxml')
#Navigating to the blocks where required content is present
division_1= soup.find('div', class_="contents spotmarketprice").div.table
#Displaying the results
print(division_1.tbody)
Output:
<tbody>
</tbody>
On the website, the content that I want to get is available in ..., but here it is not showing any content. Please suggest a solution to this.
import requests
import re
import json
import pandas as pd

goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice']

def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        allin.append([item[x] for x in goal])
    df = pd.DataFrame(allin, columns=goal)
    print(df)

main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice
0 ALMOND 1 KGS DELHI 558.00
1 ALUMINIUM 1 KGS THANE 137.60
2 CARDAMOM 1 KGS VANDANMEDU 2525.00
3 CASTORSEED 100 KGS DEESA 3626.00
4 CHANA 100 KGS DELHI 4163.00
5 COPPER 1 KGS THANE 388.30
6 COTTON 1 BALES RAJKOT 15880.00
7 CPO 10 KGS KANDLA 635.90
8 CRUDEOIL 1 BBL MUMBAI 2418.00
9 GOLD 10 GRMS AHMEDABAD 40989.00
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00
11 GOLDM 10 GRMS AHMEDABAD 40989.00
12 GOLDPETAL 1 GRMS MUMBAI 4129.00
13 GUARGUM 100 KGS JODHPUR 5880.00
14 GUARSEED 100 KGS JODHPUR 3660.00
15 KAPAS 20 KGS RAJKOT 927.50
16 LEAD 1 KGS CHENNAI 141.60
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10
18 NATURALGAS 1 mmBtu HAZIRA 138.50
19 NICKEL 1 KGS THANE 892.00
20 PEPPER 100 KGS KOCHI 32700.00
21 RAW JUTE 100 KGS KOLKATA 4999.00
22 RBD PALMOLEIN 10 KGS KANDLA 700.40
23 REFSOYOIL 10 KGS INDORE 845.25
24 SILVER 1 KGS AHMEDABAD 36871.00
25 SILVERM 1 KGS AHMEDABAD 36871.00
26 SILVERMIC 1 KGS AHMEDABAD 36871.00
27 SUGARMDEL 100 KGS DELHI 3380.00
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00
30 TIN 1 KGS MUMBAI 1160.50
31 WHEAT 100 KGS DELHI 1977.50
32 ZINC 1 KGS THANE 155.15
In case you also want the up/down change symbol, here's a version that includes it:
import requests
import re
import json
import pandas as pd

goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice', 'Change']

def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        item = [item[x] for x in goal]
        item[-1] = '▲' if item[-1] > 0 else '▼' if item[-1] < 0 else "="
        allin.append(item)
    df = pd.DataFrame(allin, columns=goal)
    print(df)

main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice Change
0 ALMOND 1 KGS DELHI 558.00 =
1 ALUMINIUM 1 KGS THANE 137.60 =
2 CARDAMOM 1 KGS VANDANMEDU 2525.00 =
3 CASTORSEED 100 KGS DEESA 3626.00 =
4 CHANA 100 KGS DELHI 4163.00 =
5 COPPER 1 KGS THANE 388.30 =
6 COTTON 1 BALES RAJKOT 15880.00 ▲
7 CPO 10 KGS KANDLA 635.90 ▲
8 CRUDEOIL 1 BBL MUMBAI 2418.00 ▲
9 GOLD 10 GRMS AHMEDABAD 40989.00 =
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00 =
11 GOLDM 10 GRMS AHMEDABAD 40989.00 =
12 GOLDPETAL 1 GRMS MUMBAI 4129.00 =
13 GUARGUM 100 KGS JODHPUR 5880.00 =
14 GUARSEED 100 KGS JODHPUR 3660.00 =
15 KAPAS 20 KGS RAJKOT 927.50 ▲
16 LEAD 1 KGS CHENNAI 141.60 =
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10 =
18 NATURALGAS 1 mmBtu HAZIRA 138.50 ▲
19 NICKEL 1 KGS THANE 892.00 =
20 PEPPER 100 KGS KOCHI 32600.00 ▼
21 RAW JUTE 100 KGS KOLKATA 4999.00 =
22 RBD PALMOLEIN 10 KGS KANDLA 700.40 ▼
23 REFSOYOIL 10 KGS INDORE 845.25 =
24 SILVER 1 KGS AHMEDABAD 36871.00 =
25 SILVERM 1 KGS AHMEDABAD 36871.00 =
26 SILVERMIC 1 KGS AHMEDABAD 36871.00 =
27 SUGARMDEL 100 KGS DELHI 3380.00 ▼
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00 ▲
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00 ▼
30 TIN 1 KGS MUMBAI 1160.50 ▼
31 WHEAT 100 KGS DELHI 1977.50 ▲
32 ZINC 1 KGS THANE 155.15 =
It does seem like the data within the table is loaded through JavaScript. That's why, if you fetch the page with the requests library, you don't receive the table's data in the response: requests simply doesn't execute JS. So the problem here isn't BeautifulSoup.
To scrape JS-driven data, consider using selenium and chromedriver. The solution in this case will look like:
# import libraries
from bs4 import BeautifulSoup
from selenium import webdriver

# create a webdriver
chromedriver_path = 'C:\\path\\to\\chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)

# go to the page and get its source
driver.get('http://www.mcxindia.com/market-data/spot-market-price')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# fetch mentioned data
table = soup.find('table', {'id': 'tblSMP'})
for tr in table.tbody.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print(row)

# close the webdriver
driver.quit()
The output of the above script is:
['ALMOND', '1 KGS', 'DELHI', '558.00', '=']
['ALUMINIUM', '1 KGS', 'THANE', '137.60', '=']
['CARDAMOM', '1 KGS', 'VANDANMEDU', '2,525.00', '=']
['CASTORSEED', '100 KGS', 'DEESA', '3,626.00', '▼']
['CHANA', '100 KGS', 'DELHI', '4,163.00', '▲']
['COPPER', '1 KGS', 'THANE', '388.30', '=']
['COTTON', '1 BALES', 'RAJKOT', '15,790.00', '▲']
['CPO', '10 KGS', 'KANDLA', '630.10', '▼']
['CRUDEOIL', '1 BBL', 'MUMBAI', '2,418.00', '▲']
['GOLD', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDGUINEA', '8 GRMS', 'AHMEDABAD', '32,923.00', '=']
['GOLDM', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDPETAL', '1 GRMS', 'MUMBAI', '4,129.00', '=']
['GUARGUM', '100 KGS', 'JODHPUR', '5,880.00', '=']
['GUARSEED', '100 KGS', 'JODHPUR', '3,660.00', '=']
UPD: I must specify that the code above answers the question of seeing this specific table. However, websites sometimes store data in 'application/json' or similar tags that can be reached with the requests library (since they don't require JS).
As discovered by αԋɱҽԃ αмєяιcαη, the current website contains such a tag. Please check his answer; it is indeed better to use requests than selenium in this situation.

Scrape Table Data on Multiple Pages from Multiple URLs (Python & BeautifulSoup)

New coder here! I am trying to scrape web table data from multiple URLs. Each URL's page has one table, but that table is split across multiple pages. My code only iterates through the table pages of the first URL and not the rest, so I only get pages 1-5 of NBA data for the year 2000, and it stops there. How do I get my code to pull every year of data? Any help is greatly appreciated.
page = 1
year = 2000

while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
    response = requests.get(base_URL, headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_='tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
            page += 1
        else:
            print("too many tables")
    else:
        year += 1
        page = 1
I'd consider using pandas here, since 1) its .read_html() function (which uses BeautifulSoup under the hood) makes parsing <table> tags easier, and 2) it can then easily write straight to file.
Also, it's a waste to iterate through 20 pages per season (the first season you are after only has 4 pages; the rest are blank), so I'd add something that moves on to the next season once it reaches a blank table.
import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

results = pd.DataFrame()
year = 2000
while year < 2020:
    goToNextPage = True
    page = 1
    while goToNextPage == True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
        response = requests.get(base_URL, headers)
        if response.status_code == 200:
            temp_df = pd.read_html(base_URL)[0]
            temp_df.columns = list(temp_df.iloc[0, :])
            temp_df = temp_df[temp_df['RK'] != 'RK']
            if len(temp_df) == 0:
                goToNextPage = False
                year += 1
                continue
            print('Acquiring Season: %s\tPage: %s' % (year, page))
            temp_df['Season'] = '%s-%s' % (year - 1, year)
            results = results.append(temp_df, sort=False).reset_index(drop=True)
            page += 1

results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)
Output:
print (results.head(25).to_string())
RK NAME TEAM SALARY Season
0 1 Shaquille O'Neal, C Los Angeles Lakers $17,142,000 1999-2000
1 2 Kevin Garnett, PF Minnesota Timberwolves $16,806,000 1999-2000
2 3 Alonzo Mourning, C Miami Heat $15,004,000 1999-2000
3 4 Juwan Howard, PF Washington Wizards $15,000,000 1999-2000
4 5 Scottie Pippen, SF Portland Trail Blazers $14,795,000 1999-2000
5 6 Karl Malone, PF Utah Jazz $14,000,000 1999-2000
6 7 Larry Johnson, F New York Knicks $11,910,000 1999-2000
7 8 Gary Payton, PG Seattle SuperSonics $11,020,000 1999-2000
8 9 Rasheed Wallace, PF Portland Trail Blazers $10,800,000 1999-2000
9 10 Shawn Kemp, C Cleveland Cavaliers $10,780,000 1999-2000
10 11 Damon Stoudamire, PG Portland Trail Blazers $10,125,000 1999-2000
11 12 Antonio McDyess, PF Denver Nuggets $9,900,000 1999-2000
12 13 Antoine Walker, PF Boston Celtics $9,000,000 1999-2000
13 14 Shareef Abdur-Rahim, PF Vancouver Grizzlies $9,000,000 1999-2000
14 15 Allen Iverson, SG Philadelphia 76ers $9,000,000 1999-2000
15 16 Vin Baker, PF Seattle SuperSonics $9,000,000 1999-2000
16 17 Ray Allen, SG Milwaukee Bucks $9,000,000 1999-2000
17 18 Anfernee Hardaway, SF Phoenix Suns $9,000,000 1999-2000
18 19 Kobe Bryant, SF Los Angeles Lakers $9,000,000 1999-2000
19 20 Stephon Marbury, PG New Jersey Nets $9,000,000 1999-2000
20 21 Vlade Divac, C Sacramento Kings $8,837,000 1999-2000
21 22 Bryant Reeves, C Vancouver Grizzlies $8,666,000 1999-2000
22 23 Tom Gugliotta, PF Phoenix Suns $8,558,000 1999-2000
23 24 Nick Van Exel, PG Denver Nuggets $8,354,000 1999-2000
24 25 Elden Campbell, C Charlotte Hornets $7,975,000 1999-2000
...
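One portability note on the answer above: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on current pandas you would collect each `temp_df` in a list and concatenate once at the end:

```python
import pandas as pd

# pandas >= 2.0 replacement for results = results.append(temp_df, ...):
frames = []
for temp_df in (pd.DataFrame({'RK': [1]}), pd.DataFrame({'RK': [2]})):  # stand-ins for scraped pages
    frames.append(temp_df)                   # plain list append per page
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(results.shape)  # (2, 1)
```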

BeautifulSoup - find + iterate through a table

I am having some trouble trying to cleanly iterate through a table of sold property listings using BeautifulSoup.
In this example:
- Some rows in the main table are irrelevant (like "set search filters")
- The rows have unique IDs

I have tried getting the rows using a style attribute, but this did not return any results.
What would be the best approach to get just the rows for sold properties out of that table?
The end goal is to pluck out the sold price, date of sale, number of bedrooms/bathrooms/car spaces, and land area, and append them to a pandas DataFrame.
from bs4 import BeautifulSoup
import requests

# Globals
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
url = 'http://house.ksou.cn/p.php?q=West+Footscray%2C+VIC'

r = requests.get(url, headers=headers)
c = r.content
soup = BeautifulSoup(c, "html.parser")

prop_table = soup.find('table', id="mainT")
#prop_table = soup.find('table', {"font-size" : "13px"})
#prop_table = soup.select('.addr')  # Pluck out the listings
rows = prop_table.findAll('tr')
for row in rows:
    print(row.text)
This HTML is tricky to parse because it doesn't have a fixed structure. Unfortunately, I don't have pandas installed, so I only print the data to the screen:
import requests
from bs4 import BeautifulSoup

url = 'http://house.ksou.cn/p.php?q=West+Footscray&p={page}&s=1&st=&type=&count=300&region=West+Footscray&lat=0&lng=0&sta=vic&htype=&agent=0&minprice=0&maxprice=0&minbed=0&maxbed=0&minland=0&maxland=0'

data = []
for page in range(0, 2):  # <-- increase to number of pages you want to crawl
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'html.parser')

    # each sold listing sits in its own <table> whose id starts with "r"
    for table in soup.select('table[id^="r"]'):
        name = table.select_one('span.addr').text
        price = table.select_one('span.addr').find_next('b').get_text(strip=True).split()[-1]
        sold = table.select_one('span.addr').find_next('b').find_next_sibling(text=True).replace('in', '').replace('(Auction)', '').strip()

        beds = table.select_one('img[alt="Bed rooms"]')
        beds = beds.find_previous_sibling(text=True).strip() if beds else '-'

        bath = table.select_one('img[alt="Bath rooms"]')
        bath = bath.find_previous_sibling(text=True).strip() if bath else '-'

        car = table.select_one('img[alt="Car spaces"]')
        car = car.find_previous_sibling(text=True).strip() if car else '-'

        land = table.select_one('b:contains("Land size:")')
        land = land.find_next_sibling(text=True).split()[0] if land else '-'

        building = table.select_one('b:contains("Building size:")')
        building = building.find_next_sibling(text=True).split()[0] if building else '-'

        data.append([name, price, sold, beds, bath, car, land, building])

# print the data
print('{:^25} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15}'.format('Name', 'Price', 'Sold', 'Beds', 'Bath', 'Car', 'Land', 'Building'))
for row in data:
    print('{:<25} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15}'.format(*row))
Prints:
Name Price Sold Beds Bath Car Land Building
51 Fontein Street $770,000 07 Dec 2019 - - - - -
50 Fontein Street $751,000 07 Dec 2019 - - - - -
9 Wellington Street $1,024,999 Dec 2019 2 1 1 381 -
239 Essex Street $740,000 07 Dec 2019 2 1 1 358 101
677a Barkly Street $780,000 Dec 2019 4 1 - 380 -
23A Busch Street $800,000 30 Nov 2019 3 1 1 215 -
3/2-4 Dyson Street $858,000 Nov 2019 3 2 - 378 119
3/101 Stanhope Street $803,000 30 Nov 2019 2 2 2 168 113
2/4 Rondell Avenue $552,500 30 Nov 2019 2 - - 1,088 -
3/2 Dyson Street $858,000 30 Nov 2019 3 2 2 378 -
9 Vine Street $805,000 Nov 2019 2 1 2 318 -
39 Robbs Road $957,000 23 Nov 2019 2 2 - 231 100
29 Robbs Road $1,165,000 Nov 2019 2 1 1 266 -
5 Busch Street $700,000 Nov 2019 2 1 1 202 -
46 Indwe Street $730,000 16 Nov 2019 3 1 1 470 -
29/132 Rupert Street $216,000 16 Nov 2019 1 1 1 3,640 -
11/10 Carmichael Street $385,000 15 Nov 2019 2 1 1 1,005 -
2/16 Carmichael Street $515,000 14 Nov 2019 2 1 1 112 -
4/26 Beaumont Parade $410,000 Nov 2019 2 1 1 798 -
5/10 Carmichael Street $310,000 Nov 2019 1 1 1 1,004 -
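Since the stated end goal was a pandas DataFrame, the `data` list built by the loop above can be loaded straight into one. A sketch using two rows in the same shape the loop produces (the column names are my choice, and the price clean-up is one possible approach):

```python
import pandas as pd

# Two rows in the same shape as the `data` list built by the scraping loop
data = [
    ['51 Fontein Street', '$770,000', '07 Dec 2019', '-', '-', '-', '-', '-'],
    ['239 Essex Street', '$740,000', '07 Dec 2019', '2', '1', '1', '358', '101'],
]

df = pd.DataFrame(data, columns=['Name', 'Price', 'Sold', 'Beds',
                                 'Bath', 'Car', 'Land', 'Building'])

# Strip '$' and ',' so the price can be used as a number
df['Price'] = df['Price'].str.replace('[$,]', '', regex=True).astype(int)
print(df)
```

The missing-value placeholder `'-'` can likewise be converted with `df.replace('-', pd.NA)` if numeric columns are needed later.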
