Unable to extract Tables - python

Beginner here. I'm having issues trying to extract data from the second (Team Statistics) and third (Team Analytics 5-on-5) tables on this page:
https://www.hockey-reference.com/leagues/NHL_2021.html
I'm using this code:
import pandas as pd
import requests

url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[1]
print(df)
and
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[2]
print(df)
to get the right tables.
But for some reason I always get this error message:
IndexError: list index out of range
I can extract the first table using the same code with df = df_list[0], that works, but it is useless to me. I really need the 2nd and 3rd tables, and I just don't know why it doesn't work.
Pretty sure that's easy to answer for most of you.
Thanks in advance!

You get this error because read_html() here returns a list with only one element, and that element is at position 0.
Instead of
df = df_list[1]
use this:
df = df_list[0]
You get one combined table of all teams from the site you mentioned, so if you want to extract the rows for the 2nd and 3rd divisions, use the loc[] accessor:
east_division = df.loc[9:17]
north_division = df.loc[18:25]

Use the URL directly in pandas.read_html
df = pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021.html')
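read_html returns a list of DataFrames, one per <table> it can see in the raw HTML, so it helps to check how many tables were actually parsed before indexing. A quick sketch:
import pandas as pd

url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
df_list = pd.read_html(url)
print(len(df_list))  # how many tables read_html actually parsed
df = df_list[0]      # on this page, only the standings table is in the plain HTML; the rest are in comments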

The tables are in fact there in the HTML (within the comments). Use BeautifulSoup to pull out the comments and parse those tables as well. The code below will pull all tables (both commented and uncommented) and put them into a list. Then it's just a matter of pulling out the tables you want by index, in this case indices 1 and 2.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.hockey-reference.com/leagues/NHL_2021.html"

# Gets all uncommented tables
tables = pd.read_html(url, header=1)

# Get the html source
response = requests.get(url, headers=headers)

# Create soup object from the html
soup = BeautifulSoup(response.content, 'html.parser')

# Get the comments in the html
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# Iterate through each comment, parse the table if one is found,
# and append it to the tables list
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:
            continue
Output:
for table in tables[1:3]:
    print(table)
Rk Unnamed: 1 AvAge GP W ... S S% SA SV% SO
0 1.0 New York Islanders 29.0 28 18 ... 841 9.8 767 0.920 5
1 2.0 Tampa Bay Lightning 28.3 26 19 ... 798 12.2 725 0.919 3
2 3.0 Florida Panthers 28.1 27 18 ... 918 10.0 840 0.910 0
3 4.0 Toronto Maple Leafs 28.9 29 19 ... 883 11.2 828 0.909 2
4 5.0 Carolina Hurricanes 27.2 26 19 ... 816 10.9 759 0.912 3
5 6.0 Washington Capitals 30.4 27 17 ... 768 12.0 808 0.895 0
6 7.0 Vegas Golden Knights 29.1 25 18 ... 752 11.0 691 0.920 4
7 8.0 Edmonton Oilers 28.4 30 18 ... 945 10.6 938 0.907 2
8 9.0 Winnipeg Jets 28.0 27 17 ... 795 11.4 856 0.910 1
9 10.0 Pittsburgh Penguins 28.1 27 17 ... 779 11.0 784 0.899 1
10 11.0 Chicago Blackhawks 27.2 29 14 ... 863 10.1 997 0.910 2
11 12.0 Minnesota Wild 28.8 25 16 ... 764 10.3 723 0.913 2
12 13.0 St. Louis Blues 28.2 28 14 ... 836 10.4 835 0.892 0
13 14.0 Boston Bruins 28.8 25 14 ... 772 8.8 665 0.913 2
14 15.0 Colorado Avalanche 26.8 25 15 ... 846 8.7 622 0.905 4
15 16.0 Montreal Canadiens 28.8 27 12 ... 890 9.7 782 0.909 0
16 17.0 Philadelphia Flyers 27.5 25 13 ... 699 11.7 753 0.892 3
17 18.0 Calgary Flames 28.0 28 13 ... 838 8.9 845 0.904 3
18 19.0 Los Angeles Kings 27.7 26 11 ... 748 10.3 814 0.910 2
19 20.0 Vancouver Canucks 27.7 31 13 ... 951 8.8 1035 0.903 1
20 21.0 Columbus Blue Jackets 27.0 29 11 ... 839 9.3 902 0.895 1
21 22.0 Arizona Coyotes 28.5 27 12 ... 689 9.7 851 0.907 1
22 23.0 San Jose Sharks 29.3 25 11 ... 749 9.5 800 0.890 1
23 24.0 New York Rangers 25.7 26 11 ... 773 9.2 746 0.906 2
24 25.0 Nashville Predators 28.9 28 11 ... 880 7.4 837 0.885 1
25 26.0 Anaheim Ducks 28.4 29 8 ... 804 7.7 852 0.891 3
26 27.0 Dallas Stars 28.3 23 8 ... 657 10.2 626 0.904 3
27 28.0 Detroit Red Wings 29.4 28 8 ... 785 8.0 870 0.891 0
28 29.0 Ottawa Senators 26.4 30 9 ... 942 8.2 960 0.874 0
29 30.0 New Jersey Devils 26.2 24 8 ... 708 8.5 741 0.896 2
30 31.0 Buffalo Sabres 27.4 26 6 ... 728 7.7 804 0.893 0
31 NaN League Average 28.1 27 13 ... 808 9.8 808 0.902 2
[32 rows x 32 columns]
Rk Unnamed: 1 S% SV% ... HDGF HDC% HDGA HDCO%
0 1 New York Islanders 8.3 0.931 ... 11 12.2 11 11.8
1 2 Tampa Bay Lightning 8.7 0.933 ... 11 14.9 6 6.3
2 3 Florida Panthers 7.9 0.926 ... 15 14.4 12 17.6
3 4 Toronto Maple Leafs 8.8 0.933 ... 16 13.4 8 11.1
4 5 Carolina Hurricanes 7.5 0.932 ... 12 12.8 7 9.3
5 6 Washington Capitals 9.8 0.919 ... 10 10.9 5 7.8
6 7 Vegas Golden Knights 9.3 0.927 ... 20 15.9 11 14.5
7 8 Edmonton Oilers 8.2 0.920 ... 9 11.3 13 9.8
8 9 Winnipeg Jets 8.5 0.926 ... 15 15.0 8 7.8
9 10 Pittsburgh Penguins 8.8 0.922 ... 10 14.5 15 13.5
10 11 Chicago Blackhawks 7.3 0.925 ... 10 10.5 14 15.1
11 12 Minnesota Wild 9.9 0.930 ... 16 14.2 8 11.9
12 13 St. Louis Blues 8.4 0.914 ... 15 18.1 15 15.8
13 14 Boston Bruins 6.6 0.922 ... 5 7.4 11 12.2
14 15 Colorado Avalanche 6.7 0.916 ... 8 8.1 8 13.3
15 16 Montreal Canadiens 7.8 0.935 ... 15 12.0 8 11.3
16 17 Philadelphia Flyers 10.1 0.907 ... 18 15.9 9 12.9
17 18 Calgary Flames 7.6 0.929 ... 6 6.9 8 9.2
18 19 Los Angeles Kings 7.5 0.925 ... 11 13.1 8 9.8
19 20 Vancouver Canucks 7.3 0.919 ... 17 13.2 20 17.4
20 21 Columbus Blue Jackets 8.1 0.918 ... 5 9.6 15 13.6
21 22 Arizona Coyotes 7.7 0.924 ... 11 14.7 14 12.8
22 23 San Jose Sharks 8.1 0.909 ... 12 14.6 16 14.0
23 24 New York Rangers 7.8 0.921 ... 17 14.0 8 12.7
24 25 Nashville Predators 5.7 0.918 ... 5 10.6 11 13.4
25 26 Anaheim Ducks 7.4 0.909 ... 12 13.3 25 16.8
26 27 Dallas Stars 7.4 0.929 ... 11 13.3 5 12.8
27 28 Detroit Red Wings 7.5 0.923 ... 13 15.3 12 16.7
28 29 Ottawa Senators 7.1 0.894 ... 7 8.6 20 14.3
29 30 New Jersey Devils 7.2 0.923 ... 10 14.3 12 13.2
30 31 Buffalo Sabres 5.8 0.911 ... 6 8.2 16 14.0

Related

Table is not displayed with python requests

There's a website: https://www.hockey-reference.com//leagues/NHL_2022.html
I need to get the table in the div with id=div_stats.
import requests
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
r = requests.get(url=url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', id='div_stats')
print(table)
# None
The response is 200, but there's no such div in the BeautifulSoup object. If I open the page using Selenium or manually, it loads properly.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
with webdriver.Chrome() as browser:
    browser.get(url)
    #sleep(1)
    html = browser.page_source
    #r = requests.get(url=url, stream=True)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('div', id='div_stats')
However, with webdriver the page may take quite a long time to load (even when I can see the whole page, browser.get(url) is still loading and the code can't continue).
Is there any solution that avoids Selenium, or that stops the loading once the table is in the HTML?
I tried stream and timeout in requests.get(), and:
for season in seasons:
    browser.get(url)
    wait = WebDriverWait(browser, 5)
    wait.until(EC.visibility_of_element_located((By.ID, 'div_stats')))
    html = browser.execute_script('return document.documentElement.outerHTML')
Nothing of that worked.
This is one way to get that table as a dataframe:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
url = 'https://www.hockey-reference.com//leagues/NHL_2022.html'

# Strip the comment markers so the hidden tables become part of the parsed HTML
response = requests.get(url, headers=headers).text.replace('<!--', '').replace('-->', '')
soup = bs(response, 'html.parser')
table_w_data = soup.select_one('table#stats')
df = pd.read_html(str(table_w_data), header=1)[0]
print(df)
Result in terminal:
0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 ... Special Teams Shot Data Unnamed: 31_level_0
Rk Unnamed: 1_level_1 AvAge GP W L OL PTS PTS% GF ... PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Florida Panthers* 27.8 82 58 18 6 122 0.744 337 ... 79.54 12 8 10.1 10.8 3062 11.0 2515 0.904 5
1 2.0 Colorado Avalanche* 28.2 82 56 19 7 119 0.726 308 ... 79.66 6 5 9.0 10.4 2874 10.7 2625 0.912 7
2 3.0 Carolina Hurricanes* 28.3 82 54 20 8 116 0.707 277 ... 88.04 4 3 9.2 7.7 2798 9.9 2310 0.913 6
3 4.0 Toronto Maple Leafs* 28.4 82 54 21 7 115 0.701 312 ... 82.05 13 4 8.6 8.5 2835 11.0 2511 0.900 7
4 5.0 Minnesota Wild* 29.4 82 53 22 7 113 0.689 305 ... 76.14 2 5 10.8 10.8 2666 11.4 2577 0.903 3
5 6.0 Calgary Flames* 28.8 82 50 21 11 111 0.677 291 ... 83.20 7 3 9.1 8.6 2908 10.0 2374 0.913 11
6 7.0 Tampa Bay Lightning* 29.6 82 51 23 8 110 0.671 285 ... 80.56 7 5 11.0 11.4 2535 11.2 2441 0.907 3
7 8.0 New York Rangers* 26.7 82 52 24 6 110 0.671 250 ... 82.30 8 2 8.2 8.2 2392 10.5 2528 0.919 9
8 9.0 St. Louis Blues* 28.8 82 49 22 11 109 0.665 309 ... 84.09 9 5 7.5 7.9 2492 12.4 2591 0.908 4
9 10.0 Boston Bruins* 28.5 82 51 26 5 107 0.652 253 ... 81.30 5 6 9.9 9.4 2962 8.5 2354 0.907 4
10 11.0 Edmonton Oilers* 29.1 82 49 27 6 104 0.634 285 ... 79.37 11 6 8.1 7.1 2790 10.2 2647 0.905 4
11 12.0 Pittsburgh Penguins* 29.7 82 46 25 11 103 0.628 269 ... 84.43 3 8 6.9 8.4 2849 9.4 2576 0.914 7
12 13.0 Washington Capitals* 29.5 82 44 26 12 100 0.610 270 ... 80.44 8 9 7.7 8.8 2577 10.5 2378 0.898 8
13 14.0 Los Angeles Kings* 28.0 82 44 27 11 99 0.604 235 ... 76.65 11 9 7.7 8.3 2865 8.2 2341 0.901 5
14 15.0 Dallas Stars* 29.4 82 46 30 6 98 0.598 233 ... 79.00 7 5 6.7 7.5 2486 9.4 2545 0.904 2
15 16.0 Nashville Predators* 27.7 82 45 30 7 97 0.591 262 ... 79.23 2 5 12.6 11.9 2439 10.7 2646 0.906 4
16 17.0 Vegas Golden Knights 28.5 82 43 31 8 94 0.573 262 ... 77.40 10 7 7.6 7.7 2830 9.3 2458 0.901 3
17 18.0 Vancouver Canucks 27.7 82 40 30 12 92 0.561 246 ... 74.89 5 6 8.0 8.6 2622 9.4 2612 0.912 1
18 19.0 Winnipeg Jets 28.2 82 39 32 11 89 0.543 250 ... 75.00 9 8 8.8 9.5 2645 9.5 2721 0.907 5
19 20.0 New York Islanders 30.1 82 37 35 10 84 0.512 229 ... 84.19 5 7 8.9 8.4 2367 9.7 2669 0.913 9
20 21.0 Columbus Blue Jackets 26.6 82 37 38 7 81 0.494 258 ... 78.57 7 6 7.7 7.2 2463 10.5 2887 0.897 2
21 22.0 San Jose Sharks 29.0 82 32 37 13 77 0.470 211 ... 85.20 4 11 8.8 8.6 2400 8.8 2622 0.900 3
22 23.0 Anaheim Ducks 27.9 82 31 37 14 76 0.463 228 ... 80.80 6 4 9.3 9.8 2393 9.5 2725 0.902 4
23 24.0 Buffalo Sabres 27.5 82 32 39 11 75 0.457 229 ... 76.42 6 6 8.1 7.9 2451 9.3 2702 0.894 1
24 25.0 Detroit Red Wings 26.9 82 32 40 10 74 0.451 227 ... 73.78 4 10 8.9 8.5 2414 9.4 2761 0.888 4
25 26.0 Ottawa Senators 26.6 82 33 42 7 73 0.445 224 ... 80.32 9 4 10.0 10.2 2463 9.1 2740 0.904 2
26 27.0 Chicago Blackhawks 28.0 82 28 42 12 68 0.415 213 ... 76.23 2 6 7.9 8.7 2362 9.0 2703 0.893 4
27 28.0 New Jersey Devils 25.8 82 27 46 9 63 0.384 245 ... 80.19 6 14 8.1 8.4 2562 9.6 2540 0.881 2
28 29.0 Philadelphia Flyers 28.3 82 25 46 11 61 0.372 210 ... 75.74 6 11 9.0 9.0 2539 8.3 2785 0.894 1
29 30.0 Seattle Kraken 28.7 82 27 49 6 60 0.366 213 ... 74.89 8 7 8.5 8.0 2380 8.9 2367 0.880 3
30 31.0 Arizona Coyotes 28.0 82 25 50 7 57 0.348 206 ... 75.00 3 4 10.2 8.2 2121 9.7 2910 0.894 1
31 32.0 Montreal Canadiens 27.8 82 22 49 11 55 0.335 218 ... 75.55 6 12 10.2 9.0 2442 8.9 2823 0.888 3
32 NaN League Average 28.2 82 41 32 9 91 0.555 255 ... 79.39 7 7 8.9 8.9 2593 9.8 2593 0.902 4
33 rows × 32 columns
Expect to do a little cleanup of that data, once you get it.
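For instance, a minimal cleanup sketch, assuming the two-level header shown in the printout above (adjust the column names if the site changes):
# Flatten the two-level header down to the 'Rk', 'AvAge', ... level
df.columns = df.columns.droplevel(0)
# Name the team column (it parses as 'Unnamed: 1_level_1' above)
df = df.rename(columns={'Unnamed: 1_level_1': 'Team'})
# Drop the 'League Average' footer row if you only want teams
df = df[df['Team'].ne('League Average')].reset_index(drop=True)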
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
And for requests: https://requests.readthedocs.io/en/latest/
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html

Scraping table returning repeated values

I'm trying to build a simple web scraper. I am trying to scrape a table, but I'm not sure why the output is the same line (School, 20-5, 33.2) 26 times over.
Here is my code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
teams = soup.find_all('tr')
for team in teams:
    teamname = soup.find('th', class_="school").text
    record = soup.find('td', class_="overall dw").text
    rating = soup.find('td', class_="rating sorted dw").text
    print(teamname, record, rating)
Notice that you're never using the Tag that team refers to. Inside the for loop, all of the calls to soup.find() should be calls to team.find():
for team in teams[1:]:
    teamname = team.find('th', class_="school").text
    record = team.find('td', class_="overall dw").text
    rating = team.find('td', class_="rating sorted dw").text
    print(teamname, record, rating)
This outputs:
St. Mary's Prep (Orchard Lake) 20-5 33.2
University of Detroit Jesuit (Detroit) 16-7 30.0
Williamston 25-0 29.3
Ferndale 21-3 28.9
Catholic Central (Grand Rapids) 25-1 28.4
King (Detroit) 18-3 27.4
De La Salle Collegiate (Warren) 18-7 27.2
Catholic Central (Novi) 16-9 26.6
Brother Rice (Bloomfield Hills) 15-7 26.5
Unity Christian (Hudsonville) 21-1 26.4
Hamtramck 21-4 26.3
Grand Blanc 20-5 25.9
East Lansing 18-5 25.0
Muskegon 20-3 24.8
Northview (Grand Rapids) 25-1 24.6
Cass Tech (Detroit) 21-4 24.3
North Farmington (Farmington Hills) 18-4 24.2
Beecher (Flint) 23-2 24.0
Okemos 19-5 23.9
Benton Harbor 23-3 23.2
Rockford 19-3 22.9
Grand Haven 17-4 21.9
Hartland 19-4 21.0
Marshall 20-3 21.0
Freeland 24-0 21.0
We use [1:] to skip the table header, slicing off the first element in the teams list.
Let pandas parse that table for you (it uses BeautifulSoup under the hood).
import pandas as pd
url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
df = pd.read_html(url)[0]
Output:
print(df)
# School Ovr. Rating Str. +/-
0 1 St. Mary's Prep (Orchard Lake) 20-5 33.2 23.0 NaN
1 2 University of Detroit Jesuit (Detroit) 16-7 30.0 24.1 NaN
2 3 Williamston 25-0 29.3 10.9 NaN
3 4 Ferndale 21-3 28.9 16.5 NaN
4 5 Catholic Central (Grand Rapids) 25-1 28.4 11.4 NaN
5 6 King (Detroit) 18-3 27.4 15.2 NaN
6 7 De La Salle Collegiate (Warren) 18-7 27.2 19.6 2.0
7 8 Catholic Central (Novi) 16-9 26.6 22.6 -1.0
8 9 Brother Rice (Bloomfield Hills) 15-7 26.5 21.0 -1.0
9 10 Unity Christian (Hudsonville) 21-1 26.4 10.4 NaN
10 11 Hamtramck 21-4 26.3 14.5 2.0
11 12 Grand Blanc 20-5 25.9 15.3 -1.0
12 13 East Lansing 18-5 25.0 15.6 1.0
13 14 Muskegon 20-3 24.8 11.4 1.0
14 15 Northview (Grand Rapids) 25-1 24.6 8.2 1.0
15 16 Cass Tech (Detroit) 21-4 24.3 11.8 -4.0
16 17 North Farmington (Farmington Hills) 18-4 24.2 13.1 NaN
17 18 Beecher (Flint) 23-2 24.0 8.6 2.0
18 19 Okemos 19-5 23.9 13.7 -1.0
19 20 Benton Harbor 23-3 23.2 9.9 -1.0
20 21 Rockford 19-3 22.9 11.6 NaN
21 22 Grand Haven 17-4 21.9 11.3 NaN
22 23 Hartland 19-4 21.0 10.4 1.0
23 24 Marshall 20-3 21.0 8.6 -1.0
24 25 Freeland 24-0 21.0 2.7 4.0

How do I get all the tables from a website using pandas

I am trying to get 3 tables from a particular website, but only the first two are showing up. I have even tried getting the data using BeautifulSoup, but the third seems to be hidden somehow. Is there something I am missing?
url = "https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats"
html = pd.read_html(url, header=1)
print(html[0])
print(html[1])
print(html[2]) # This prompts an error that the tables does not exist
The first two tables are the squad tables. The table not showing up is the individual player table. This also happens with similar pages from the same site.
You could use Selenium as suggested, but I think it is a bit overkill. The table is available in the static HTML, just within the comments. So you would need to pull the comments out of BeautifulSoup to get those tables.
To get all the tables:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats'
response = requests.get(url)
tables = pd.read_html(response.text, header=1)

# Get the tables within the comments
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in str(each):
        try:
            table = pd.read_html(str(each), header=1)[0]
            table = table[table['Rk'].ne('Rk')].reset_index(drop=True)
            tables.append(table)
        except (ValueError, KeyError):
            continue
Output:
for table in tables:
    print(table)
Squad # Pl 90s GA PKA ... Stp Stp% #OPA #OPA/90 AvgDist
0 Arsenal 2 12.0 17 0 ... 10 8.8 6 0.50 14.6
1 Aston Villa 2 12.0 20 0 ... 6 6.8 13 1.08 16.2
2 Brentford 2 12.0 17 1 ... 10 9.9 18 1.50 15.6
3 Brighton 2 12.0 14 2 ... 17 16.2 13 1.08 15.3
4 Burnley 1 12.0 20 0 ... 14 11.7 17 1.42 16.6
5 Chelsea 2 12.0 4 2 ... 8 8.5 5 0.42 14.0
6 Crystal Palace 1 12.0 17 0 ... 7 7.5 6 0.50 13.5
7 Everton 2 12.0 19 0 ... 8 7.4 7 0.58 13.7
8 Leeds United 1 12.0 20 1 ... 8 12.5 15 1.25 16.3
9 Leicester City 1 12.0 21 2 ... 9 8.4 7 0.58 13.0
10 Liverpool 2 12.0 11 0 ... 9 9.7 16 1.33 17.0
11 Manchester City 2 12.0 6 1 ... 5 8.1 16 1.33 17.5
12 Manchester Utd 1 12.0 21 0 ... 4 4.4 2 0.17 13.3
13 Newcastle Utd 2 12.0 27 4 ... 10 9.8 4 0.33 13.9
14 Norwich City 1 12.0 27 2 ... 6 5.1 5 0.42 12.4
15 Southampton 1 12.0 14 0 ... 16 13.9 2 0.17 12.9
16 Tottenham 1 12.0 17 1 ... 3 2.7 5 0.42 14.1
17 Watford 2 12.0 20 1 ... 6 5.5 9 0.75 15.4
18 West Ham 1 12.0 14 0 ... 6 5.3 1 0.08 11.9
19 Wolves 1 12.0 12 3 ... 9 10.0 10 0.83 15.5
[20 rows x 28 columns]
Squad # Pl 90s GA PKA ... Stp Stp% #OPA #OPA/90 AvgDist
0 vs Arsenal 2 12.0 13 0 ... 4 5.9 11 0.92 15.5
1 vs Aston Villa 2 12.0 16 2 ... 11 8.0 7 0.58 14.8
2 vs Brentford 2 12.0 16 1 ... 16 14.0 9 0.75 15.7
3 vs Brighton 2 12.0 12 3 ... 11 12.5 8 0.67 15.9
4 vs Burnley 1 12.0 14 0 ... 16 10.7 12 1.00 15.1
5 vs Chelsea 2 12.0 30 2 ... 10 11.1 11 0.92 14.2
6 vs Crystal Palace 1 12.0 18 2 ... 7 7.2 9 0.75 14.4
7 vs Everton 2 12.0 16 3 ... 7 7.6 7 0.58 13.8
8 vs Leeds United 1 12.0 12 1 ... 8 7.3 5 0.42 14.2
9 vs Leicester City 1 12.0 16 0 ... 2 3.3 7 0.58 14.3
10 vs Liverpool 2 12.0 35 1 ... 12 9.9 14 1.17 13.7
11 vs Manchester City 2 12.0 25 0 ... 8 6.7 4 0.33 13.1
12 vs Manchester Utd 1 12.0 20 0 ... 7 7.8 7 0.58 14.7
13 vs Newcastle Utd 2 12.0 15 0 ... 8 8.0 8 0.67 15.3
14 vs Norwich City 1 12.0 7 2 ... 5 5.7 16 1.33 17.3
15 vs Southampton 1 12.0 11 2 ... 4 3.7 9 0.75 14.0
16 vs Tottenham 1 12.0 11 1 ... 9 12.2 9 0.75 16.0
17 vs Watford 2 12.0 16 0 ... 8 8.2 9 0.75 15.3
18 vs West Ham 1 12.0 23 0 ... 13 10.5 6 0.50 13.8
19 vs Wolves 1 12.0 12 0 ... 5 6.8 9 0.75 15.3
[20 rows x 28 columns]
Rk Player Nation Pos ... #OPA #OPA/90 AvgDist Matches
0 1 Alisson br BRA GK ... 15 1.36 17.1 Matches
1 2 Kepa Arrizabalaga es ESP GK ... 1 1.00 18.8 Matches
2 3 Daniel Bachmann at AUT GK ... 1 0.25 12.2 Matches
3 4 Asmir Begović ba BIH GK ... 0 0.00 15.0 Matches
4 5 Karl Darlow eng ENG GK ... 4 0.50 14.9 Matches
5 6 Ederson br BRA GK ... 14 1.27 17.5 Matches
6 7 Łukasz Fabiański pl POL GK ... 1 0.08 11.9 Matches
7 8 Álvaro Fernández es ESP GK ... 5 1.67 15.3 Matches
8 9 Ben Foster eng ENG GK ... 8 1.00 16.8 Matches
9 10 David de Gea es ESP GK ... 2 0.17 13.3 Matches
10 11 Vicente Guaita es ESP GK ... 6 0.50 13.5 Matches
11 12 Caoimhín Kelleher ie IRL GK ... 1 1.00 14.6 Matches
12 13 Tim Krul nl NED GK ... 5 0.42 12.4 Matches
13 14 Bernd Leno de GER GK ... 1 0.33 13.1 Matches
14 15 Hugo Lloris fr FRA GK ... 5 0.42 14.1 Matches
15 16 Emiliano Martínez ar ARG GK ... 12 1.09 16.4 Matches
16 17 Alex McCarthy eng ENG GK ... 2 0.17 12.9 Matches
17 18 Edouard Mendy sn SEN GK ... 4 0.36 13.3 Matches
18 19 Illan Meslier fr FRA GK ... 15 1.25 16.3 Matches
19 20 Jordan Pickford eng ENG GK ... 7 0.64 13.6 Matches
20 21 Nick Pope eng ENG GK ... 17 1.42 16.6 Matches
21 22 Aaron Ramsdale eng ENG GK ... 5 0.56 14.9 Matches
22 23 David Raya es ESP GK ... 13 1.44 15.7 Matches
23 24 José Sá pt POR GK ... 10 0.83 15.5 Matches
24 25 Robert Sánchez es ESP GK ... 13 1.18 15.4 Matches
25 26 Kasper Schmeichel dk DEN GK ... 7 0.58 13.0 Matches
26 27 Jason Steele eng ENG GK ... 0 0.00 13.0 Matches
27 28 Jed Steer eng ENG GK ... 1 1.00 14.3 Matches
28 29 Zack Steffen us USA GK ... 2 2.00 17.8 Matches
29 30 Freddie Woodman eng ENG GK ... 0 0.00 11.6 Matches
[30 rows x 34 columns]

bs4 not giving table

import requests
from bs4 import BeautifulSoup

URL = 'https://www.basketball-reference.com/leagues/NBA_2019.html'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', {'class': 'sortable stats_table now_sortable'})
rows = table.find_all('td')
for i in rows:
    print(i.get_text())
I want to get the contents of the table with team per-game stats from this website, but I get this error:
AttributeError: 'NoneType' object has no attribute 'find_all'
The table that you want is dynamically loaded, meaning it is not loaded into the HTML when you first make a request to the page. So the table you are searching for does not yet exist.
To scrape sites that use JavaScript, you can look into using Selenium webdriver and PhantomJS, better described by this post: https://stackoverflow.com/a/26440563/13275492
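For what it's worth, a minimal sketch of that route using headless Chrome instead (PhantomJS is no longer maintained), assuming Selenium 4 with a chromedriver available on the PATH:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome without opening a window
with webdriver.Chrome(options=options) as browser:
    browser.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
    # Hand the fully rendered page source to pandas
    tables = pd.read_html(browser.page_source)
print(len(tables))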
Actually, you can use pandas.read_html(), which will read all the tables in a nice format. It returns the tables as a list, so you can access each one as a DataFrame by index, e.g. print(df[0]).
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
print(df)
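Note that on this page most tables live inside HTML comments (see the next answer), so it's worth checking what read_html actually returned before indexing:
print(len(df))       # number of tables found in the plain, uncommented HTML
print(df[0].head())  # inspect the first table to see which one you got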
The tables (with the exception of a few) on these sports reference sites are within the comments. You would need to pull out the comments, then render those tables with pandas.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.basketball-reference.com/leagues/NBA_2019.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in each and 'id="team-stats-per_game"' in each:
        df = pd.read_html(str(each), attrs={'id': 'team-stats-per_game'})[0]
        break  # stop once the target table is found
Output:
print (df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 241.2 43.4 ... 7.5 5.9 13.9 19.6 118.1
1 2.0 Golden State Warriors* 82 241.5 44.0 ... 7.6 6.4 14.3 21.4 117.7
2 3.0 New Orleans Pelicans 82 240.9 43.7 ... 7.4 5.4 14.8 21.1 115.4
3 4.0 Philadelphia 76ers* 82 241.5 41.5 ... 7.4 5.3 14.9 21.3 115.2
4 5.0 Los Angeles Clippers* 82 241.8 41.3 ... 6.8 4.7 14.5 23.3 115.1
5 6.0 Portland Trail Blazers* 82 242.1 42.3 ... 6.7 5.0 13.8 20.4 114.7
6 7.0 Oklahoma City Thunder* 82 242.1 42.6 ... 9.3 5.2 14.0 22.4 114.5
7 8.0 Toronto Raptors* 82 242.4 42.2 ... 8.3 5.3 14.0 21.0 114.4
8 9.0 Sacramento Kings 82 240.6 43.2 ... 8.3 4.4 13.4 21.4 114.2
9 10.0 Washington Wizards 82 243.0 42.1 ... 8.3 4.6 14.1 20.7 114.0
10 11.0 Houston Rockets* 82 241.8 39.2 ... 8.5 4.9 13.3 22.0 113.9
11 12.0 Atlanta Hawks 82 242.1 41.4 ... 8.2 5.1 17.0 23.6 113.3
12 13.0 Minnesota Timberwolves 82 241.8 41.6 ... 8.3 5.0 13.1 20.3 112.5
13 14.0 Boston Celtics* 82 241.2 42.1 ... 8.6 5.3 12.8 20.4 112.4
14 15.0 Brooklyn Nets* 82 243.7 40.3 ... 6.6 4.1 15.1 21.5 112.2
15 16.0 Los Angeles Lakers 82 241.2 42.6 ... 7.5 5.4 15.7 20.7 111.8
16 17.0 Utah Jazz* 82 240.9 40.4 ... 8.1 5.9 15.1 21.1 111.7
17 18.0 San Antonio Spurs* 82 241.5 42.3 ... 6.1 4.7 12.1 18.1 111.7
18 19.0 Charlotte Hornets 82 241.8 40.2 ... 7.2 4.9 12.2 18.9 110.7
19 20.0 Denver Nuggets* 82 240.6 41.9 ... 7.7 4.4 13.4 20.0 110.7
20 21.0 Dallas Mavericks 82 241.2 38.8 ... 6.5 4.3 14.2 20.1 108.9
21 22.0 Indiana Pacers* 82 240.3 41.3 ... 8.7 4.9 13.7 19.4 108.0
22 23.0 Phoenix Suns 82 242.4 40.1 ... 9.0 5.1 15.6 23.6 107.5
23 24.0 Orlando Magic* 82 241.2 40.4 ... 6.6 5.4 13.2 18.6 107.3
24 25.0 Detroit Pistons* 82 242.1 38.8 ... 6.9 4.0 13.8 22.1 107.0
25 26.0 Miami Heat 82 240.6 39.6 ... 7.6 5.5 14.7 20.9 105.7
26 27.0 Chicago Bulls 82 242.7 39.8 ... 7.4 4.3 14.1 20.3 104.9
27 28.0 New York Knicks 82 241.2 38.2 ... 6.8 5.1 14.0 20.9 104.6
28 29.0 Cleveland Cavaliers 82 240.9 38.9 ... 6.5 2.4 13.5 20.0 104.5
29 30.0 Memphis Grizzlies 82 242.4 38.0 ... 8.3 5.5 14.0 22.0 103.5
30 NaN League Average 82 241.6 41.1 ... 7.6 5.0 14.1 20.9 111.2
[31 rows x 25 columns]

If statement in grouped Pandas dataframe

I have a dataset that contains columns of year, julian day, hour and temperature. I have grouped the data by year and day and now want to perform an operation on the temperature data IF each day contains 24 hours worth of data. Then, I want to create a Dataframe with year, julian day, max temperature and min temperature. However, I'm not sure of the syntax to make sure this condition is met. Any help would be appreciated. My code is below:
df = pd.read_table(data, skiprows=1, sep='\t', usecols=(0, 3, 4, 6),
                   names=['year', 'jday', 'hour', 'temp'], na_values=-999.9)
g = df.groupby(['year', 'jday'])
if #the grouped year and day has 24 hours worth of data
    maxt = g.aggregate({'temp': np.max})
    mint = g.aggregate({'temp': np.min})
else:
    continue
And some sample data (goes from 1942-2015):
Year Month Day Julian Hour Wind TempC DewC Pressure RH
1942 9 24 267 9 2.1 18.5 15.2 1014.2 81.0
1942 9 24 267 10 2.1 23.5 14.6 1014.6 57.0
1942 9 24 267 11 3.6 25.2 12.4 1014.2 45.0
1942 9 24 267 12 3.6 26.8 11.9 1014.2 40.0
1942 9 24 267 13 2.6 27.4 11.9 1014.2 38.0
1942 9 24 267 14 2.1 28.0 11.3 1013.5 35.0
1942 9 24 267 15 4.1 29.1 9.1 1013.5 29.0
1942 9 24 267 16 4.1 29.1 10.7 1013.5 32.0
1942 9 24 267 17 4.6 29.1 13.0 1013.9 37.0
1942 9 24 267 18 3.6 25.7 12.4 1015.2 44.0
1942 9 24 267 19 0.0 23.0 16.3 1015.2 66.0
1942 9 24 267 20 2.6 22.4 15.7 1015.9 66.0
1942 9 24 267 21 2.1 20.2 16.3 1016.3 78.0
1942 9 24 267 22 3.1 20.2 14.6 1016.9 70.0
1942 9 24 267 23 2.6 19.6 15.2 1017.6 76.0
1942 9 25 268 0 3.1 18.5 13.5 1018.3 73.0
1942 9 25 268 1 2.6 16.9 13.0 1018.3 78.0
1942 9 25 268 2 4.1 15.7 5.2 1021.0 50.0
1942 9 25 268 3 4.1 15.2 4.1 1020.7 47.0
1942 9 25 268 4 3.1 14.1 5.8 1021.3 57.0
1942 9 25 268 5 3.1 13.0 5.8 1021.3 62.0
1942 9 25 268 6 2.1 13.0 5.2 1022.4 59.0
1942 9 25 268 7 2.1 12.4 1.9 1022.4 49.0
1942 9 25 268 8 3.6 13.5 5.8 1024.7 60.0
1942 9 25 268 9 4.6 15.7 3.5 1025.1 44.0
1942 9 25 268 10 4.1 17.4 1.3 1025.4 34.0
1942 9 25 268 11 2.6 18.5 3.0 1025.4 36.0
1942 9 25 268 12 2.1 19.1 0.8 1025.1 29.0
1942 9 25 268 13 2.6 19.6 2.4 1024.7 32.0
1942 9 25 268 14 4.1 20.7 4.6 1023.4 35.0
1942 9 25 268 15 3.6 21.3 4.1 1023.7 32.0
1942 9 25 268 16 1.5 21.3 4.6 1023.4 34.0
1942 9 25 268 17 5.1 20.7 7.4 1023.4 42.0
1942 9 25 268 18 5.1 19.1 8.5 1023.0 50.0
1942 9 25 268 19 3.6 18.0 9.6 1022.7 58.0
1942 9 25 268 20 3.1 16.3 9.6 1023.0 65.0
1942 9 25 268 21 1.5 15.2 11.3 1023.0 78.0
1942 9 25 268 22 1.5 14.6 11.3 1023.0 81.0
1942 9 25 268 23 2.1 14.1 10.7 1024.0 80.0
I assume that there is no ['year', 'julian'] group which contains non-integer hours, so we can just use the number of unique hours in the group as the condition.
import pandas as pd

def get_min_max_by_date(df_group):
    if len(df_group['hour'].unique()) < 24:
        new_df = pd.DataFrame()
    else:
        year = df_group['year'].unique()[0]
        j_day = df_group['jday'].unique()[0]
        min_temp = df_group['temp'].min()
        max_temp = df_group['temp'].max()
        new_df = pd.DataFrame({'year': [year],
                               'julian_day': [j_day],
                               'min_temp': [min_temp],
                               'max_temp': [max_temp]}, index=[0])
    return new_df

df = pd.read_table(data,
                   skiprows=1,
                   sep='\t',
                   usecols=(0, 3, 4, 6),
                   names=['year', 'jday', 'hour', 'temp'],
                   na_values=-999.9)

final_df = df.groupby(['year', 'jday'],
                      as_index=False).apply(get_min_max_by_date)
final_df = final_df.reset_index()
I don't have time to test this right now, but this should get you started.
I would start by grouping on day alone, then iterate over the groups, checking the unique hours in each group. You can use set to find the unique hours for each measurement day and compare with a full day's worth of hours, {0, 1, 2, 3, ..., 23}:
a_full_day = set(range(24))
#data_out = {}
gb = df.groupby(['jday'])  # only group by day
for day, inds in gb.groups.items():
    if set(df.loc[inds, 'hour']) == a_full_day:
        maxt = df.loc[inds, 'temp'].max()
        #data_out[day] = {}
        #data_out[day]['maxt'] = maxt
        # etc
I added some commented lines suggesting how you might want to store the output
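If it helps, a small sketch of one way to collect those values into a DataFrame rather than a nested dict (same grouping as above; the records name is just illustrative):
import pandas as pd

records = []
for day, inds in gb.groups.items():
    if set(df.loc[inds, 'hour']) == a_full_day:
        records.append({'jday': day,
                        'maxt': df.loc[inds, 'temp'].max(),
                        'mint': df.loc[inds, 'temp'].min()})

daily_extremes = pd.DataFrame(records)  # one row per complete day of data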
