There's a website https://www.hockey-reference.com//leagues/NHL_2022.html
I need to get table in div with id=div_stats
from bs4 import BeautifulSoup
url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
r = requests.get(url=url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', id='div_stats')
print(table)
#None
Response is 200, but there's no such div in BeautifulSoup object. If I open the page using selenium or manually - it gets loaded properly.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
with webdriver.Chrome() as browser:
browser.get(url)
#sleep(1)
html = browser.page_source
#r = requests.get(url=url, stream=True)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('div', id='div_stats')
However, while using webdriver it may load page for quite a long time (even if I see the whole page, it's still loading browser.get(url), and the code couldn't continue).
Is there any solution that can help avoiding selenium / stop the loading when the table is in the HTML?
I tried: stream and timeout in requests.get(),
for season in seasons:
browser.get(url)
wait = WebDriverWait(browser, 5)
wait.until(EC.visibility_of_element_located((By.ID, 'div_stats')))
html = browser.execute_script('return document.documentElement.outerHTML')
Nothing of that worked.
This is one way to get that table as a dataframe:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
url= 'https://www.hockey-reference.com//leagues/NHL_2022.html'
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = bs(response, 'html.parser')
table_w_data = soup.select_one('table#stats')
df = pd.read_html(str(table_w_data), header=1)[0]
print(df)
Result in terminal:
0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 ... Special Teams Shot Data Unnamed: 31_level_0
Rk Unnamed: 1_level_1 AvAge GP W L OL PTS PTS% GF ... PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Florida Panthers* 27.8 82 58 18 6 122 0.744 337 ... 79.54 12 8 10.1 10.8 3062 11.0 2515 0.904 5
1 2.0 Colorado Avalanche* 28.2 82 56 19 7 119 0.726 308 ... 79.66 6 5 9.0 10.4 2874 10.7 2625 0.912 7
2 3.0 Carolina Hurricanes* 28.3 82 54 20 8 116 0.707 277 ... 88.04 4 3 9.2 7.7 2798 9.9 2310 0.913 6
3 4.0 Toronto Maple Leafs* 28.4 82 54 21 7 115 0.701 312 ... 82.05 13 4 8.6 8.5 2835 11.0 2511 0.900 7
4 5.0 Minnesota Wild* 29.4 82 53 22 7 113 0.689 305 ... 76.14 2 5 10.8 10.8 2666 11.4 2577 0.903 3
5 6.0 Calgary Flames* 28.8 82 50 21 11 111 0.677 291 ... 83.20 7 3 9.1 8.6 2908 10.0 2374 0.913 11
6 7.0 Tampa Bay Lightning* 29.6 82 51 23 8 110 0.671 285 ... 80.56 7 5 11.0 11.4 2535 11.2 2441 0.907 3
7 8.0 New York Rangers* 26.7 82 52 24 6 110 0.671 250 ... 82.30 8 2 8.2 8.2 2392 10.5 2528 0.919 9
8 9.0 St. Louis Blues* 28.8 82 49 22 11 109 0.665 309 ... 84.09 9 5 7.5 7.9 2492 12.4 2591 0.908 4
9 10.0 Boston Bruins* 28.5 82 51 26 5 107 0.652 253 ... 81.30 5 6 9.9 9.4 2962 8.5 2354 0.907 4
10 11.0 Edmonton Oilers* 29.1 82 49 27 6 104 0.634 285 ... 79.37 11 6 8.1 7.1 2790 10.2 2647 0.905 4
11 12.0 Pittsburgh Penguins* 29.7 82 46 25 11 103 0.628 269 ... 84.43 3 8 6.9 8.4 2849 9.4 2576 0.914 7
12 13.0 Washington Capitals* 29.5 82 44 26 12 100 0.610 270 ... 80.44 8 9 7.7 8.8 2577 10.5 2378 0.898 8
13 14.0 Los Angeles Kings* 28.0 82 44 27 11 99 0.604 235 ... 76.65 11 9 7.7 8.3 2865 8.2 2341 0.901 5
14 15.0 Dallas Stars* 29.4 82 46 30 6 98 0.598 233 ... 79.00 7 5 6.7 7.5 2486 9.4 2545 0.904 2
15 16.0 Nashville Predators* 27.7 82 45 30 7 97 0.591 262 ... 79.23 2 5 12.6 11.9 2439 10.7 2646 0.906 4
16 17.0 Vegas Golden Knights 28.5 82 43 31 8 94 0.573 262 ... 77.40 10 7 7.6 7.7 2830 9.3 2458 0.901 3
17 18.0 Vancouver Canucks 27.7 82 40 30 12 92 0.561 246 ... 74.89 5 6 8.0 8.6 2622 9.4 2612 0.912 1
18 19.0 Winnipeg Jets 28.2 82 39 32 11 89 0.543 250 ... 75.00 9 8 8.8 9.5 2645 9.5 2721 0.907 5
19 20.0 New York Islanders 30.1 82 37 35 10 84 0.512 229 ... 84.19 5 7 8.9 8.4 2367 9.7 2669 0.913 9
20 21.0 Columbus Blue Jackets 26.6 82 37 38 7 81 0.494 258 ... 78.57 7 6 7.7 7.2 2463 10.5 2887 0.897 2
21 22.0 San Jose Sharks 29.0 82 32 37 13 77 0.470 211 ... 85.20 4 11 8.8 8.6 2400 8.8 2622 0.900 3
22 23.0 Anaheim Ducks 27.9 82 31 37 14 76 0.463 228 ... 80.80 6 4 9.3 9.8 2393 9.5 2725 0.902 4
23 24.0 Buffalo Sabres 27.5 82 32 39 11 75 0.457 229 ... 76.42 6 6 8.1 7.9 2451 9.3 2702 0.894 1
24 25.0 Detroit Red Wings 26.9 82 32 40 10 74 0.451 227 ... 73.78 4 10 8.9 8.5 2414 9.4 2761 0.888 4
25 26.0 Ottawa Senators 26.6 82 33 42 7 73 0.445 224 ... 80.32 9 4 10.0 10.2 2463 9.1 2740 0.904 2
26 27.0 Chicago Blackhawks 28.0 82 28 42 12 68 0.415 213 ... 76.23 2 6 7.9 8.7 2362 9.0 2703 0.893 4
27 28.0 New Jersey Devils 25.8 82 27 46 9 63 0.384 245 ... 80.19 6 14 8.1 8.4 2562 9.6 2540 0.881 2
28 29.0 Philadelphia Flyers 28.3 82 25 46 11 61 0.372 210 ... 75.74 6 11 9.0 9.0 2539 8.3 2785 0.894 1
29 30.0 Seattle Kraken 28.7 82 27 49 6 60 0.366 213 ... 74.89 8 7 8.5 8.0 2380 8.9 2367 0.880 3
30 31.0 Arizona Coyotes 28.0 82 25 50 7 57 0.348 206 ... 75.00 3 4 10.2 8.2 2121 9.7 2910 0.894 1
31 32.0 Montreal Canadiens 27.8 82 22 49 11 55 0.335 218 ... 75.55 6 12 10.2 9.0 2442 8.9 2823 0.888 3
32 NaN League Average 28.2 82 41 32 9 91 0.555 255 ... 79.39 7 7 8.9 8.9 2593 9.8 2593 0.902 4
33 rows × 32 columns
Expect to do a little cleanup of that data, once you get it.
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
And for requests: https://requests.readthedocs.io/en/latest/
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Related
Beginner here. I'm having issues while trying to extract data from the second (Team Statistics) and third (Team Analytics 5-on-5) Table on this page:
https://www.hockey-reference.com/leagues/NHL_2021.html
I'm using this code:
import pandas as pd
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[1]
print(df)
and
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[2]
print(df)
to get the right tables.
But for some kind of reason I will always get this error message:
IndexError: list index out of range
I could extract the first table by using the same code with df = df_list[0], that will work, but it is useless to me. I really need the 2nd an 3rd Table, and I just don't know why it doesn't work.
Pretty sure that's easy to answer for most of you.
Thanx in advance!
You get this error because the read_html() method returns a list of 1 element and that element is at position 0
instead of
df = df_list[1]
use this
df = df_list[0]
You get combined table of all teams from your mentioned site so if you want to extract the table of 2nd and 3rd team use loc[] accessor:-
east_division=df.loc[9:17]
north_division=df.loc[18:25]
Use the URL directly in pandas.read_html
df = pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021.html')
The tables are in fact there in the html (within the comments). Use BeautifulSoup to pull out the comments and parse those tables as well. The code below will pull all (both commented and uncommented tables). and put it into a list. Just a matter of pulling out the table by index that you want, in this case indices 1 and 2.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.hockey-reference.com/leagues/NHL_2021.html"
# Gets all uncommented tables
tables = pd.read_html(url, header=1)
# Get the html source
response = requests.get(url, headers=headers)
# Creat soup object form html
soup = BeautifulSoup(response.content, 'html.parser')
# Get the comments in html
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Iterate thorugh each comment and parse the table if found
# # Append the table to the tables list
for each in comments:
if 'table' in str(each):
try:
tables.append(pd.read_html(each, header=1)[0])
tables = tables[tables['Rk'].ne('Rk')]
tables = tables.rename(columns={'Unnamed: 1':'Team'})
except:
continue
Output:
for table in tables[1:3]:
print(table)
Rk Unnamed: 1 AvAge GP W ... S S% SA SV% SO
0 1.0 New York Islanders 29.0 28 18 ... 841 9.8 767 0.920 5
1 2.0 Tampa Bay Lightning 28.3 26 19 ... 798 12.2 725 0.919 3
2 3.0 Florida Panthers 28.1 27 18 ... 918 10.0 840 0.910 0
3 4.0 Toronto Maple Leafs 28.9 29 19 ... 883 11.2 828 0.909 2
4 5.0 Carolina Hurricanes 27.2 26 19 ... 816 10.9 759 0.912 3
5 6.0 Washington Capitals 30.4 27 17 ... 768 12.0 808 0.895 0
6 7.0 Vegas Golden Knights 29.1 25 18 ... 752 11.0 691 0.920 4
7 8.0 Edmonton Oilers 28.4 30 18 ... 945 10.6 938 0.907 2
8 9.0 Winnipeg Jets 28.0 27 17 ... 795 11.4 856 0.910 1
9 10.0 Pittsburgh Penguins 28.1 27 17 ... 779 11.0 784 0.899 1
10 11.0 Chicago Blackhawks 27.2 29 14 ... 863 10.1 997 0.910 2
11 12.0 Minnesota Wild 28.8 25 16 ... 764 10.3 723 0.913 2
12 13.0 St. Louis Blues 28.2 28 14 ... 836 10.4 835 0.892 0
13 14.0 Boston Bruins 28.8 25 14 ... 772 8.8 665 0.913 2
14 15.0 Colorado Avalanche 26.8 25 15 ... 846 8.7 622 0.905 4
15 16.0 Montreal Canadiens 28.8 27 12 ... 890 9.7 782 0.909 0
16 17.0 Philadelphia Flyers 27.5 25 13 ... 699 11.7 753 0.892 3
17 18.0 Calgary Flames 28.0 28 13 ... 838 8.9 845 0.904 3
18 19.0 Los Angeles Kings 27.7 26 11 ... 748 10.3 814 0.910 2
19 20.0 Vancouver Canucks 27.7 31 13 ... 951 8.8 1035 0.903 1
20 21.0 Columbus Blue Jackets 27.0 29 11 ... 839 9.3 902 0.895 1
21 22.0 Arizona Coyotes 28.5 27 12 ... 689 9.7 851 0.907 1
22 23.0 San Jose Sharks 29.3 25 11 ... 749 9.5 800 0.890 1
23 24.0 New York Rangers 25.7 26 11 ... 773 9.2 746 0.906 2
24 25.0 Nashville Predators 28.9 28 11 ... 880 7.4 837 0.885 1
25 26.0 Anaheim Ducks 28.4 29 8 ... 804 7.7 852 0.891 3
26 27.0 Dallas Stars 28.3 23 8 ... 657 10.2 626 0.904 3
27 28.0 Detroit Red Wings 29.4 28 8 ... 785 8.0 870 0.891 0
28 29.0 Ottawa Senators 26.4 30 9 ... 942 8.2 960 0.874 0
29 30.0 New Jersey Devils 26.2 24 8 ... 708 8.5 741 0.896 2
30 31.0 Buffalo Sabres 27.4 26 6 ... 728 7.7 804 0.893 0
31 NaN League Average 28.1 27 13 ... 808 9.8 808 0.902 2
[32 rows x 32 columns]
Rk Unnamed: 1 S% SV% ... HDGF HDC% HDGA HDCO%
0 1 New York Islanders 8.3 0.931 ... 11 12.2 11 11.8
1 2 Tampa Bay Lightning 8.7 0.933 ... 11 14.9 6 6.3
2 3 Florida Panthers 7.9 0.926 ... 15 14.4 12 17.6
3 4 Toronto Maple Leafs 8.8 0.933 ... 16 13.4 8 11.1
4 5 Carolina Hurricanes 7.5 0.932 ... 12 12.8 7 9.3
5 6 Washington Capitals 9.8 0.919 ... 10 10.9 5 7.8
6 7 Vegas Golden Knights 9.3 0.927 ... 20 15.9 11 14.5
7 8 Edmonton Oilers 8.2 0.920 ... 9 11.3 13 9.8
8 9 Winnipeg Jets 8.5 0.926 ... 15 15.0 8 7.8
9 10 Pittsburgh Penguins 8.8 0.922 ... 10 14.5 15 13.5
10 11 Chicago Blackhawks 7.3 0.925 ... 10 10.5 14 15.1
11 12 Minnesota Wild 9.9 0.930 ... 16 14.2 8 11.9
12 13 St. Louis Blues 8.4 0.914 ... 15 18.1 15 15.8
13 14 Boston Bruins 6.6 0.922 ... 5 7.4 11 12.2
14 15 Colorado Avalanche 6.7 0.916 ... 8 8.1 8 13.3
15 16 Montreal Canadiens 7.8 0.935 ... 15 12.0 8 11.3
16 17 Philadelphia Flyers 10.1 0.907 ... 18 15.9 9 12.9
17 18 Calgary Flames 7.6 0.929 ... 6 6.9 8 9.2
18 19 Los Angeles Kings 7.5 0.925 ... 11 13.1 8 9.8
19 20 Vancouver Canucks 7.3 0.919 ... 17 13.2 20 17.4
20 21 Columbus Blue Jackets 8.1 0.918 ... 5 9.6 15 13.6
21 22 Arizona Coyotes 7.7 0.924 ... 11 14.7 14 12.8
22 23 San Jose Sharks 8.1 0.909 ... 12 14.6 16 14.0
23 24 New York Rangers 7.8 0.921 ... 17 14.0 8 12.7
24 25 Nashville Predators 5.7 0.918 ... 5 10.6 11 13.4
25 26 Anaheim Ducks 7.4 0.909 ... 12 13.3 25 16.8
26 27 Dallas Stars 7.4 0.929 ... 11 13.3 5 12.8
27 28 Detroit Red Wings 7.5 0.923 ... 13 15.3 12 16.7
28 29 Ottawa Senators 7.1 0.894 ... 7 8.6 20 14.3
29 30 New Jersey Devils 7.2 0.923 ... 10 14.3 12 13.2
30 31 Buffalo Sabres 5.8 0.911 ... 6 8.2 16 14.0
I am very new to coding and I am trying to build a web scraper for Excel so that I can transfer it to Google Sheets. Unfortunately, the code that I have written is working for other people, but not me.
This is the code I have written:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
csv_name = 'nhl_season_stats.csv'
def get_nhl_stats(URL):
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
pageTree = requests.get(URL, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
if 'table' in each:
try:
tables.append(pd.read_html(each, header=1)[0])
except:
continue
df = tables[0]
df = df.rename(columns={'Unnamed: 1':'Team'})
df.to_csv(csv_name, index = False)
print(df)
get_nhl_stats(URL)
After running it, I receive this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 13, in get_nhl_stats
IndexError: list index out of range
Sorry for my bad jargon, as I am very new and very confused, but any help would be greatly appreciated!
this code working, maybe the problem is in the declaration of the class "Comment" or the server does not give you the requested values:
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
csv_name = 'nhl_season_stats.csv'
def get_nhl_stats(URL):
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
pageTree = requests.get(URL, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
comments = pageSoup.find_all(string=lambda text: isinstance(text, str))
tables = []
for each in comments:
if 'table' in each:
try:
tables.append(pd.read_html(each, header=1)[0])
except:
continue
df = tables[0]
df = df.rename(columns={'Unnamed: 1':'Team'})
df.to_csv(csv_name, index = False)
print(df)
get_nhl_stats(URL)
output:
Rk Team AvAge GP W L OL PTS PTS% GF GA SOW SOL SRS SOS TG/G EVGF EVGA PP PPO PP% PPA PPOA PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Toronto Maple Leafs 29.0 6 4 2 0 8 0.667 19 17 0.0 0.0 0.33 -0.01 6.00 11 12 8 18 44.44 4 22 81.82 0 1 10.5 7.5 190 10.0 157 0.892 0
1 2.0 Montreal Canadiens 28.6 5 3 0 2 8 0.800 24 15 0.0 1.0 0.77 -0.83 7.80 14 8 6 20 30.00 6 25 76.00 4 1 11.4 10.6 180 13.3 140 0.893 0
2 3.0 Vegas Golden Knights 28.9 5 4 1 0 8 0.800 18 12 0.0 0.0 1.12 -0.08 6.00 15 8 2 18 11.11 3 18 83.33 1 1 7.2 7.2 150 12.0 125 0.904 0
3 4.0 Minnesota Wild 29.1 5 4 1 0 8 0.800 15 10 0.0 0.0 0.86 -0.14 5.00 13 9 1 23 4.35 1 16 93.75 1 0 7.6 10.4 166 9.0 147 0.932 0
4 5.0 Washington Capitals 30.1 5 3 0 2 8 0.800 18 16 1.0 1.0 0.10 -0.30 6.80 16 12 2 9 22.22 3 18 83.33 0 1 8.6 5.0 130 13.8 141 0.887 0
5 6.0 Philadelphia Flyers 27.0 5 3 1 1 7 0.700 19 15 0.0 1.0 0.36 -0.24 6.80 14 10 5 17 29.41 5 18 72.22 0 0 7.2 6.8 125 15.2 187 0.920 1
6 7.0 Colorado Avalanche 26.9 5 3 2 0 6 0.600 17 12 0.0 0.0 0.47 -0.53 5.80 7 9 10 25 40.00 3 19 84.21 0 0 8.0 10.4 147 11.6 143 0.916 1
7 8.0 Winnipeg Jets 27.9 4 3 1 0 6 0.750 13 10 0.0 0.0 1.10 0.35 5.75 11 6 2 20 10.00 4 12 66.67 0 0 10.3 14.3 119 10.9 134 0.925 0
8 9.0 New York Islanders 28.9 4 3 1 0 6 0.750 9 6 0.0 0.0 0.61 -0.14 3.75 5 5 4 20 20.00 1 15 93.33 0 0 11.5 11.0 108 8.3 114 0.947 2
9 10.0 Tampa Bay Lightning 27.7 3 3 0 0 6 1.000 13 5 0.0 0.0 1.70 -0.97 6.00 11 2 2 8 25.00 3 11 72.73 0 0 9.0 7.0 107 12.1 85 0.941 0
10 11.0 Pittsburgh Penguins 28.6 5 3 2 0 6 0.600 16 21 2.0 0.0 -0.43 0.17 7.40 10 16 5 18 27.78 5 19 73.68 1 0 7.6 7.2 152 10.5 130 0.838 0
11 12.0 New Jersey Devils 26.2 4 2 1 1 5 0.625 9 10 0.0 1.0 -0.35 0.15 4.75 8 3 1 11 9.09 6 16 62.50 0 1 9.8 7.3 112 8.0 150 0.933 0
12 13.0 St. Louis Blues 28.3 4 2 1 1 5 0.625 10 14 0.0 1.0 -1.66 -0.41 6.00 10 6 0 14 0.00 8 21 61.90 0 0 11.0 7.5 109 9.2 129 0.891 0
13 14.0 Boston Bruins 28.8 4 2 1 1 5 0.625 7 9 2.0 0.0 0.07 0.07 4.00 3 7 3 13 23.08 2 18 88.89 1 0 11.3 8.8 135 5.2 96 0.906 0
14 15.0 Arizona Coyotes 28.4 5 2 2 1 5 0.500 17 17 0.0 1.0 -0.04 0.16 6.80 11 11 5 22 22.73 5 24 79.17 1 1 10.4 9.6 144 11.8 157 0.892 0
15 16.0 Calgary Flames 28.1 3 2 0 1 5 0.833 11 6 0.0 0.0 1.14 -0.52 5.67 5 4 6 16 37.50 1 12 91.67 0 1 8.7 11.3 93 11.8 93 0.935 1
16 17.0 Edmonton Oilers 27.9 6 2 4 0 4 0.333 15 20 0.0 0.0 -0.91 -0.08 5.83 10 14 3 23 13.04 4 18 77.78 2 2 7.7 9.3 192 7.8 200 0.900 0
17 18.0 Vancouver Canucks 27.3 6 2 4 0 4 0.333 17 28 1.0 0.0 -1.34 0.33 7.50 12 17 4 26 15.38 9 31 70.97 1 2 13.3 10.7 179 9.5 222 0.874 0
18 19.0 Anaheim Ducks 28.6 5 1 2 2 4 0.400 8 13 0.0 0.0 -0.10 0.90 4.20 8 10 0 12 0.00 2 15 86.67 0 1 6.4 5.2 133 6.0 160 0.919 1
19 20.0 Columbus Blue Jackets 26.6 5 1 2 2 4 0.400 10 16 0.0 0.0 -1.19 0.01 5.20 9 15 1 11 9.09 1 10 90.00 0 0 9.0 9.4 152 6.6 169 0.905 0
20 21.0 Los Angeles Kings 28.3 4 1 1 2 4 0.500 12 13 0.0 0.0 0.43 0.68 6.25 8 10 4 17 23.53 3 21 85.71 0 0 11.0 9.0 119 10.1 121 0.893 0
21 22.0 Detroit Red Wings 29.3 5 2 3 0 4 0.400 10 14 0.0 0.0 -1.54 -0.74 4.80 9 9 1 12 8.33 4 16 75.00 0 1 11.4 9.8 130 7.7 155 0.910 0
22 23.0 San Jose Sharks 29.4 5 2 3 0 4 0.400 12 18 2.0 0.0 -1.32 -0.52 6.00 7 16 5 21 23.81 2 18 88.89 0 0 8.4 9.6 162 7.4 148 0.878 0
23 24.0 Carolina Hurricanes 27.0 3 2 1 0 4 0.667 9 6 0.0 0.0 0.26 -0.74 5.00 6 5 3 12 25.00 1 9 88.89 0 0 7.7 9.7 98 9.2 68 0.912 1
24 25.0 Florida Panthers 27.8 2 2 0 0 4 1.000 10 6 0.0 0.0 1.29 -0.71 8.00 7 3 3 8 37.50 3 5 40.00 0 0 5.0 8.0 66 15.2 66 0.909 0
25 26.0 Nashville Predators 28.7 4 2 2 0 4 0.500 10 14 0.0 0.0 0.01 1.01 6.00 9 7 1 16 6.25 6 16 62.50 0 1 8.0 8.0 135 7.4 126 0.889 0
26 27.0 Buffalo Sabres 27.2 5 1 3 1 3 0.300 14 15 0.0 1.0 -0.18 0.22 5.80 11 14 3 17 17.65 1 6 83.33 0 0 3.8 8.2 161 8.7 133 0.887 0
27 28.0 New York Rangers 25.6 4 1 2 1 3 0.375 11 11 0.0 1.0 -0.15 0.11 5.50 7 7 4 21 19.05 4 16 75.00 0 0 8.5 14.0 140 7.9 112 0.902 1
28 29.0 Chicago Blackhawks 26.9 5 1 3 1 3 0.300 13 21 0.0 0.0 -0.43 1.17 6.80 5 16 7 17 41.18 5 20 75.00 1 0 8.0 6.8 154 8.4 167 0.874 0
29 30.0 Ottawa Senators 27.0 4 1 2 1 3 0.375 11 14 0.0 0.0 -0.04 0.71 6.25 8 10 3 18 16.67 4 21 80.95 0 0 14.3 15.3 113 9.7 120 0.883 0
30 31.0 Dallas Stars 28.8 1 1 0 0 2 1.000 7 0 0.0 0.0 7.30 0.30 7.00 1 0 5 8 62.50 0 5 100.00 1 0 10.0 16.0 28 25.0 34 1.000 1
31 NaN League Average 28.0 4 2 2 1 5 0.574 13 13 NaN NaN NaN NaN 5.94 9 9 4 16 21.33 4 16 78.67 0 0 8.0 8.0 133 9.8 133 0.902 0
Im trying to scrape a table from pro-football-reference, specifically the team offense table from https://www.pro-football-reference.com/years/2019/. Anytime I try the code below I get back an empty list or just a NoneType. I have scraped other websites like ESPN and have had no problems.
import requests
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/years/{}/'
response = requests.get(url.format(2019))
soup = BeautifulSoup(response.text, 'lxml')
team_table = soup.find('table', {'id':'team_stats'})
I have also tried
soup = BeautifulSoup(response.text, 'html.parser')
to see if maybe it was the way I was bringing the data in. The page does have a bunch of tables so im assuming thats why its more difficult to scrape a specific table. Thank you.
The data is inside HTML comments <!-- ... -->. You can use this script to get them:
import requests
from bs4 import BeautifulSoup, Comment
url = "https://www.pro-football-reference.com/years/2019/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('#all_team_stats').find_next(text=lambda t: isinstance(t, Comment))
table = BeautifulSoup(table, 'html.parser')
for tr in table.select('tr'):
tds = [td.get_text(strip=True) for td in tr.select('td')]
print(*tds)
Prints:
Baltimore Ravens 16 531 6521 1064 6.1 15 7 386 289 440 3225 37 8 6.9 171 596 3296 21 5.5 188 109 867 27 52.1 8.6 245.99
San Francisco 49ers 16 479 6097 1012 6.0 23 10 336 331 478 3792 28 13 7.4 195 498 2305 23 4.6 110 105 939 31 44.3 12.0 146.39
Tampa Bay Buccaneers 16 458 6366 1086 5.9 41 11 353 382 630 4845 33 30 7.2 244 409 1521 15 3.7 81 133 1111 28 38.3 20.7 44.00
New Orleans Saints 16 458 5982 1011 5.9 8 2 347 418 581 4244 36 6 7.0 230 405 1738 12 4.3 97 120 1036 20 47.1 4.1 167.04
Kansas City Chiefs 16 451 6067 976 6.2 15 10 350 378 576 4498 30 5 7.5 211 375 1569 16 4.2 93 107 1029 46 49.4 8.0 265.38
Dallas Cowboys 16 434 6904 1069 6.5 18 7 379 388 597 4751 30 11 7.7 229 449 2153 18 4.8 120 109 1008 30 44.6 10.3 214.77
New England Patriots 16 420 5664 1095 5.2 15 6 338 378 620 3961 25 9 6.1 197 447 1703 17 3.8 110 94 828 31 36.8 7.6 39.62
Minnesota Vikings 16 407 5656 970 5.8 20 12 314 319 466 3523 26 8 7.1 171 476 2133 19 4.5 106 96 895 37 41.9 10.5 103.01
Seattle Seahawks 16 405 5991 1046 5.7 20 14 341 341 517 3791 31 6 6.7 190 481 2200 15 4.6 121 109 882 30 36.9 10.2 90.55
Tennessee Titans 16 402 5805 949 6.1 17 9 317 297 448 3582 29 8 7.1 177 445 2223 21 5.0 104 99 932 36 31.7 8.2 119.88
Los Angeles Rams 16 394 5998 1055 5.7 24 7 342 397 632 4499 22 17 6.9 222 401 1499 20 3.7 92 118 899 28 36.0 12.9 42.09
Philadelphia Eagles 16 385 5772 1104 5.2 23 15 354 391 613 3833 27 8 5.9 215 454 1939 16 4.3 104 100 836 35 35.5 10.4 67.85
Atlanta Falcons 16 381 6075 1096 5.5 25 10 383 459 684 4714 29 15 6.4 258 362 1361 10 3.8 84 119 956 41 41.3 13.4 104.51
Houston Texans 16 378 5792 1017 5.7 22 8 346 355 534 3783 27 14 6.5 203 434 2009 17 4.6 112 111 892 31 37.7 12.0 121.67
Green Bay Packers 16 376 5528 1020 5.4 13 9 320 356 573 3733 26 4 6.1 190 411 1795 18 4.4 90 100 774 40 37.1 6.9 118.40
Arizona Cardinals 16 361 5467 1000 5.5 18 6 314 355 554 3477 20 12 5.8 176 396 1990 18 5.0 109 121 956 29 38.8 10.1 67.36
Indianapolis Colts 16 361 5238 1016 5.2 21 11 340 307 513 3108 22 10 5.7 165 471 2130 17 4.5 131 79 670 44 36.1 11.4 78.80
Detroit Lions 16 341 5549 1021 5.4 23 8 313 344 571 3900 28 15 6.4 196 407 1649 7 4.1 82 113 937 35 33.3 11.7 33.07
New York Giants 16 341 5416 1012 5.4 33 16 311 376 607 3731 30 17 5.7 187 362 1685 11 4.7 89 90 784 35 28.3 18.3 -5.30
Carolina Panthers 16 340 5469 1077 5.1 35 14 335 382 633 3650 17 21 5.3 230 386 1819 20 4.7 82 87 754 23 32.3 16.9 -24.07
Los Angeles Chargers 16 337 5879 997 5.9 31 11 349 394 597 4426 24 20 7.0 220 366 1453 12 4.0 90 103 872 39 39.5 18.5 83.43
Cleveland Browns 16 335 5455 973 5.6 28 7 305 318 539 3554 22 21 6.1 180 393 1901 15 4.8 90 122 1106 35 34.1 14.8 16.54
Buffalo Bills 16 314 5283 1018 5.2 19 7 314 299 513 3229 21 12 5.8 162 465 2054 13 4.4 120 117 927 32 30.6 10.4 9.66
Oakland Raiders 16 313 5819 989 5.9 17 9 315 367 523 3926 22 8 7.1 194 437 1893 13 4.3 104 128 1138 17 32.9 9.9 80.57
Miami Dolphins 16 306 4960 1022 4.9 26 8 315 371 615 3804 22 18 5.7 210 349 1156 10 3.3 64 92 769 41 30.6 13.3 -23.85
Jacksonville Jaguars 16 300 5468 1020 5.4 20 12 298 364 589 3760 24 8 6.0 183 389 1708 3 4.4 85 132 1165 30 33.9 10.2 -14.15
Pittsburgh Steelers 16 289 4428 937 4.7 30 11 265 315 510 2981 18 19 5.5 147 395 1447 7 3.7 75 111 893 43 28.6 15.7 -84.56
Denver Broncos 16 282 4777 954 5.0 16 6 279 312 504 3115 16 10 5.7 162 409 1662 11 4.1 77 110 912 40 32.9 9.4 -11.61
Chicago Bears 16 280 4749 1020 4.7 19 7 297 371 580 3291 20 12 5.3 178 395 1458 8 3.7 85 103 838 34 29.1 10.5 -38.05
Cincinnati Bengals 16 279 5169 1049 4.9 30 14 312 356 616 3652 18 16 5.5 191 385 1517 9 3.9 85 93 761 36 30.3 16.0 -57.73
New York Jets 16 276 4368 956 4.6 25 9 253 323 521 3111 19 16 5.4 162 383 1257 6 3.3 61 115 1105 30 23.0 11.5 -108.92
Washington Redskins 16 266 4395 885 5.0 21 8 248 298 479 2812 18 13 5.3 154 356 1583 9 4.4 74 106 835 20 30.1 12.1 -82.30
Avg Team 365.0 5565.8 1016.1 5.5 22.2 9.4 324.0 354.1 557.9 3759.4 24.9 12.8 6.3 193.8 418.3 1806.4 14.0 4.3 97.3 107.8 915.8 32.9 36.0 11.8 56.6
League Total 11680 178107 32516 5.5 711 301 10369 11331 17853 120301 797 410 6.3 6200 13387 57806 447 4.3 3115 3451 29306 1054 36.0 11.8
Avg Tm/G 22.8 347.9 63.5 5.5 1.4 0.6 20.3 22.1 34.9 235.0 1.6 0.8 6.3 12.1 26.1 112.9 0.9 4.3 6.1 6.7 57.2 2.1 36.0 11.8
I am a newbie to python and pandas.
I want to get the Y values of every peak with the smallest gap
Here is my code so far which is very long and quite silly
for i in df.peaks.unique():
min_y= df[df['peaks'] == i][df['gap'] == min(df[df['peaks'] == i]['gap'])]['y'].tolist()[0]
print(min_y)
It works and this is the result:
171
204
246
278
311
416
Since, I saw some advanced syntax where people use apply, map, so they do not need to loop, can my task apply those techni ? Btw, I want to improve my code, please help!
The DataFrame is below for your informatin.
x y w h peaks gap
0 79 171 13 14 178.0 7.0
1 155 171 14 14 178.0 7.0
2 213 170 14 15 178.0 8.0
3 281 171 14 14 178.0 7.0
4 337 171 14 14 178.0 7.0
5 78 203 14 14 211.0 8.0
6 209 204 14 14 211.0 7.0
7 287 204 15 14 211.0 7.0
8 365 204 14 13 211.0 7.0
9 78 236 14 14 251.0 15.0
10 156 246 14 13 251.0 5.0
11 232 246 14 14 251.0 5.0
12 79 277 14 14 284.0 7.0
13 166 278 14 14 284.0 6.0
14 243 278 13 14 284.0 6.0
15 79 303 14 14 316.0 13.0
16 144 310 15 13 316.0 6.0
17 216 310 13 14 316.0 6.0
18 292 311 14 13 316.0 5.0
19 370 311 14 14 316.0 5.0
20 80 405 14 14 420.0 15.0
21 157 414 13 14 420.0 6.0
22 226 414 14 14 420.0 6.0
23 303 416 14 13 420.0 4.0
24 374 416 14 14 420.0 4.0
You can use groupby and combine it with idxmin
data.loc[data.groupby('peaks')['gap'].idxmin(), 'y']
# 0 171
# 6 204
# 10 246
# 13 278
# 18 311
# 23 416
# Name: y, dtype: int64
I have a dataset that contains columns of year, julian day, hour and temperature. I have grouped the data by year and day and now want to perform an operation on the temperature data IF each day contains 24 hours worth of data. Then, I want to create a Dataframe with year, julian day, max temperature and min temperature. However, I'm not sure of the syntax to make sure this condition is met. Any help would be appreciated. My code is below:
df = pd.read_table(data,skiprows=1,sep='\t',usecols=(0,3,4,6),names=['year','jday','hour','temp'],na_values=-999.9)
g = df.groupby(['year','jday'])
if #the grouped year and day has 24 hours worth of data
maxt = g.aggregate({'temp':np.max})
mint = g.aggregate({'temp':np.min})
else:
continue
And some sample data (goes from 1942-2015):
Year Month Day Julian Hour Wind TempC DewC Pressure RH
1942 9 24 267 9 2.1 18.5 15.2 1014.2 81.0
1942 9 24 267 10 2.1 23.5 14.6 1014.6 57.0
1942 9 24 267 11 3.6 25.2 12.4 1014.2 45.0
1942 9 24 267 12 3.6 26.8 11.9 1014.2 40.0
1942 9 24 267 13 2.6 27.4 11.9 1014.2 38.0
1942 9 24 267 14 2.1 28.0 11.3 1013.5 35.0
1942 9 24 267 15 4.1 29.1 9.1 1013.5 29.0
1942 9 24 267 16 4.1 29.1 10.7 1013.5 32.0
1942 9 24 267 17 4.6 29.1 13.0 1013.9 37.0
1942 9 24 267 18 3.6 25.7 12.4 1015.2 44.0
1942 9 24 267 19 0.0 23.0 16.3 1015.2 66.0
1942 9 24 267 20 2.6 22.4 15.7 1015.9 66.0
1942 9 24 267 21 2.1 20.2 16.3 1016.3 78.0
1942 9 24 267 22 3.1 20.2 14.6 1016.9 70.0
1942 9 24 267 23 2.6 19.6 15.2 1017.6 76.0
1942 9 25 268 0 3.1 18.5 13.5 1018.3 73.0
1942 9 25 268 1 2.6 16.9 13.0 1018.3 78.0
1942 9 25 268 2 4.1 15.7 5.2 1021.0 50.0
1942 9 25 268 3 4.1 15.2 4.1 1020.7 47.0
1942 9 25 268 4 3.1 14.1 5.8 1021.3 57.0
1942 9 25 268 5 3.1 13.0 5.8 1021.3 62.0
1942 9 25 268 6 2.1 13.0 5.2 1022.4 59.0
1942 9 25 268 7 2.1 12.4 1.9 1022.4 49.0
1942 9 25 268 8 3.6 13.5 5.8 1024.7 60.0
1942 9 25 268 9 4.6 15.7 3.5 1025.1 44.0
1942 9 25 268 10 4.1 17.4 1.3 1025.4 34.0
1942 9 25 268 11 2.6 18.5 3.0 1025.4 36.0
1942 9 25 268 12 2.1 19.1 0.8 1025.1 29.0
1942 9 25 268 13 2.6 19.6 2.4 1024.7 32.0
1942 9 25 268 14 4.1 20.7 4.6 1023.4 35.0
1942 9 25 268 15 3.6 21.3 4.1 1023.7 32.0
1942 9 25 268 16 1.5 21.3 4.6 1023.4 34.0
1942 9 25 268 17 5.1 20.7 7.4 1023.4 42.0
1942 9 25 268 18 5.1 19.1 8.5 1023.0 50.0
1942 9 25 268 19 3.6 18.0 9.6 1022.7 58.0
1942 9 25 268 20 3.1 16.3 9.6 1023.0 65.0
1942 9 25 268 21 1.5 15.2 11.3 1023.0 78.0
1942 9 25 268 22 1.5 14.6 11.3 1023.0 81.0
1942 9 25 268 23 2.1 14.1 10.7 1024.0 80.0
I assume that there is no ['year', 'julian'] group which contains non-integer hours so we can just use the length of the group as the condition.
import pandas as pd
def get_min_max_by_date(df_group):
if len(df_group['hour'].unique()) < 24:
new_df = pd.DataFrame()
else:
year = df_group['year'].unique()[0]
j_day = df_group['jday'].unique()[0]
min_temp = df_group['temp'].min()
max_temp = df_group['temp'].max()
new_df = pd.DataFrame({'year': [year],
'julian_day': [j_day],
'min_temp': [min_temp],
'max_temp': [max_temp]}, index=[0])
return new_df
df = pd.read_table(data,
skiprows=1,
sep='\t',
usecols=(0, 3, 4, 6),
names=['year', 'jday', 'hour', 'temp'],
na_values=-999.9)
final_df = df.groupby(['year', 'jday'],
as_index=False).apply(get_min_max_by_date)
final_df = final_df.reset_index()
I don't have time to test this right now, but this should get you started.
I would start by grouping on day alone, and then iterate over the groups, checking the unique hours in each group. You can use set to find the unique hours for each measurement day, and compare with a full days worth of hours {0,1,2,3,...23}
a_full_day = set(range(24))
#data_out = {}
gb = df.groupby(['jday']) # only group by day
for day, inds in gb.groups.iteritems():
if set(df.ix[inds, 'hour']) == a_full_day:
maxt = df.ix[inds, 'temp'].max()
#data_out[day] = {}
#data_out[day]['maxt'] = maxt
# etc
I added some commented lines suggesting how you might want to store the output