Pandas: merge two columns into another DataFrame column - python

I want to merge two columns from one DataFrame into another, matching on the Squad column.
df1
Squad
0 Arsenal
1 Aston Villa
2 Bournemouth
3 Brighton
4 Burnley
5 Chelsea
6 Crystal Palace
7 Everton
8 Leicester City
9 Liverpool
10 Manchester City
11 Manchester Utd
12 Newcastle Utd
13 Norwich City
14 Sheffield Utd
15 Southampton
16 Tottenham
17 Watford
18 West Ham
19 Wolves
df2
Rk Squad MP W D L GF GA GD Pts ... L GF GA GD Pts Pts/G xG xGA xGD xGD/90
0 1 Liverpool 19 18 1 0 52 16 36 55 ... 3 33 17 16 44 2.32 31.2 21.6 9.5 0.50
1 2 Manchester City 19 15 2 2 57 13 44 47 ... 7 45 22 23 34 1.79 45.5 19.5 26.0 1.37
2 3 Manchester Utd 19 10 7 2 40 17 23 37 ... 6 26 19 7 29 1.53 28.4 21.1 7.4 0.39
3 4 Chelsea 19 11 3 5 30 16 14 36 ... 7 39 38 1 30 1.58 29.2 27.2 2.0 0.10
4 5 Leicester City 19 11 4 4 35 17 18 37 ... 8 32 24 8 25 1.32 31.0 22.7 8.3 0.44
5 6 Tottenham 19 12 3 4 36 17 19 39 ... 7 25 30 -5 20 1.05 21.6 28.7 -7.1 -0.37
6 7 Wolves 19 8 7 4 27 19 8 31 ... 5 24 21 3 28 1.47 21.2 18.3 2.9 0.15
7 8 Arsenal 19 10 6 3 36 24 12 36 ... 7 20 24 -4 20 1.05 22.0 25.8 -3.8 -0.20
8 9 Sheffield Utd 19 10 3 6 24 15 9 33 ... 6 15 24 -9 21 1.11 15.3 28.2 -12.9 -0.68
9 10 Burnley 19 8 4 7 24 23 1 28 ... 7 19 27 -8 26 1.37 16.6 27.1 -10.4 -0.55
10 11 Southampton 19 6 3 10 21 35 -14 21 ... 6 30 25 5 31 1.63 30.8 24.9 5.9 0.31
11 12 Everton 19 8 7 4 24 21 3 31 ... 11 20 35 -15 18 0.95 21.8 24.9 -3.1 -0.16
12 13 Newcastle Utd 19 6 8 5 20 21 -1 26 ... 11 18 37 -19 18 0.95 15.0 30.3 -15.3 -0.81
13 14 Crystal Palace 19 6 5 8 15 20 -5 23 ... 9 16 30 -14 20 1.05 16.4 31.6 -15.2 -0.80
14 15 Brighton 19 5 7 7 20 27 -7 22 ... 8 19 27 -8 19 1.00 20.3 28.5 -8.2 -0.43
15 16 West Ham 19 6 4 9 30 33 -3 22 ... 10 19 29 -10 17 0.89 22.5 31.9 -9.4 -0.49
16 17 Aston Villa 19 7 3 9 22 30 -8 24 ... 12 19 37 -18 11 0.58 18.5 34.5 -16.0 -0.84
17 18 Bournemouth 19 5 6 8 22 30 -8 21 ... 14 18 35 -17 13 0.68 20.4 32.3 -11.9 -0.63
18 19 Watford 19 6 6 7 22 27 -5 24 ... 13 14 37 -23 10 0.53 18.5 29.9 -11.4 -0.60
19 20 Norwich City 19 4 3 12 19 37 -18 15 ... 15 7 38 -31 6 0.32 17.7 30.2 -12.6 -0.66
I want to create a new column on df1 that holds the value of the W column divided by the value of the MP column.
I tried to merge the W column into df1, but I got TypeError: Cannot convert bool to numpy.ndarray
df1 = pd.merge(df1, df2['Squad','W'], on='Squad',how='left')

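Try selecting the columns with a list (double brackets). df2['Squad','W'] passes the tuple ('Squad', 'W') as a single key instead of selecting two columns.
result = df1.merge(df2[['Squad', 'W', 'MP']], on='Squad', how='left')
result['new_col'] = result['W'] / result['MP']
Note: make sure to handle NaN and 0 in MP before dividing.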
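As a minimal sketch of the merge and the guarded division, assuming small stand-in frames built from three rows of the question's table (the column name win_rate and the choice to turn MP == 0 into NaN are illustrative assumptions, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for df1 and df2 (values taken from the question's table).
df1 = pd.DataFrame({"Squad": ["Arsenal", "Liverpool", "Chelsea"]})
df2 = pd.DataFrame({
    "Squad": ["Liverpool", "Chelsea", "Arsenal"],
    "MP": [19, 19, 19],
    "W": [18, 11, 10],
})

# A list of column names (double brackets) selects several columns at once.
result = df1.merge(df2[["Squad", "W", "MP"]], on="Squad", how="left")

# Treat MP == 0 as missing so the division yields NaN instead of inf.
result["win_rate"] = result["W"] / result["MP"].replace(0, np.nan)
print(result)
```

With how='left', squads that are missing from df2 keep their row in the result and get NaN in W, MP and win_rate.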

Related

Table is not displayed with python requests

There's a website: https://www.hockey-reference.com//leagues/NHL_2022.html
I need to get the table in the div with id=div_stats
import requests
from bs4 import BeautifulSoup
url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
r = requests.get(url=url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', id='div_stats')
print(table)
#None
Response is 200, but there's no such div in BeautifulSoup object. If I open the page using selenium or manually - it gets loaded properly.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
with webdriver.Chrome() as browser:
    browser.get(url)
    #sleep(1)
    html = browser.page_source
#r = requests.get(url=url, stream=True)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('div', id='div_stats')
However, with webdriver the page can take a long time to load: even when I can see the whole page, browser.get(url) is still loading and the code cannot continue.
Is there a way to avoid selenium, or to stop the loading once the table is in the HTML?
I tried stream and timeout in requests.get(), and also an explicit wait:
for season in seasons:
    browser.get(url)
    wait = WebDriverWait(browser, 5)
    wait.until(EC.visibility_of_element_located((By.ID, 'div_stats')))
    html = browser.execute_script('return document.documentElement.outerHTML')
None of that worked.
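The table is in the static HTML, but it is wrapped in an HTML comment, so BeautifulSoup does not see it as an element. Stripping the comment markers before parsing is one way to get that table as a dataframe: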
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
url = 'https://www.hockey-reference.com//leagues/NHL_2022.html'
# Pass the headers defined above, then strip the comment markers so the hidden table becomes regular HTML
response = requests.get(url, headers=headers).text.replace('<!--', '').replace('-->', '')
soup = bs(response, 'html.parser')
table_w_data = soup.select_one('table#stats')
df = pd.read_html(str(table_w_data), header=1)[0]
print(df)
Result in terminal:
0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 ... Special Teams Shot Data Unnamed: 31_level_0
Rk Unnamed: 1_level_1 AvAge GP W L OL PTS PTS% GF ... PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Florida Panthers* 27.8 82 58 18 6 122 0.744 337 ... 79.54 12 8 10.1 10.8 3062 11.0 2515 0.904 5
1 2.0 Colorado Avalanche* 28.2 82 56 19 7 119 0.726 308 ... 79.66 6 5 9.0 10.4 2874 10.7 2625 0.912 7
2 3.0 Carolina Hurricanes* 28.3 82 54 20 8 116 0.707 277 ... 88.04 4 3 9.2 7.7 2798 9.9 2310 0.913 6
3 4.0 Toronto Maple Leafs* 28.4 82 54 21 7 115 0.701 312 ... 82.05 13 4 8.6 8.5 2835 11.0 2511 0.900 7
4 5.0 Minnesota Wild* 29.4 82 53 22 7 113 0.689 305 ... 76.14 2 5 10.8 10.8 2666 11.4 2577 0.903 3
5 6.0 Calgary Flames* 28.8 82 50 21 11 111 0.677 291 ... 83.20 7 3 9.1 8.6 2908 10.0 2374 0.913 11
6 7.0 Tampa Bay Lightning* 29.6 82 51 23 8 110 0.671 285 ... 80.56 7 5 11.0 11.4 2535 11.2 2441 0.907 3
7 8.0 New York Rangers* 26.7 82 52 24 6 110 0.671 250 ... 82.30 8 2 8.2 8.2 2392 10.5 2528 0.919 9
8 9.0 St. Louis Blues* 28.8 82 49 22 11 109 0.665 309 ... 84.09 9 5 7.5 7.9 2492 12.4 2591 0.908 4
9 10.0 Boston Bruins* 28.5 82 51 26 5 107 0.652 253 ... 81.30 5 6 9.9 9.4 2962 8.5 2354 0.907 4
10 11.0 Edmonton Oilers* 29.1 82 49 27 6 104 0.634 285 ... 79.37 11 6 8.1 7.1 2790 10.2 2647 0.905 4
11 12.0 Pittsburgh Penguins* 29.7 82 46 25 11 103 0.628 269 ... 84.43 3 8 6.9 8.4 2849 9.4 2576 0.914 7
12 13.0 Washington Capitals* 29.5 82 44 26 12 100 0.610 270 ... 80.44 8 9 7.7 8.8 2577 10.5 2378 0.898 8
13 14.0 Los Angeles Kings* 28.0 82 44 27 11 99 0.604 235 ... 76.65 11 9 7.7 8.3 2865 8.2 2341 0.901 5
14 15.0 Dallas Stars* 29.4 82 46 30 6 98 0.598 233 ... 79.00 7 5 6.7 7.5 2486 9.4 2545 0.904 2
15 16.0 Nashville Predators* 27.7 82 45 30 7 97 0.591 262 ... 79.23 2 5 12.6 11.9 2439 10.7 2646 0.906 4
16 17.0 Vegas Golden Knights 28.5 82 43 31 8 94 0.573 262 ... 77.40 10 7 7.6 7.7 2830 9.3 2458 0.901 3
17 18.0 Vancouver Canucks 27.7 82 40 30 12 92 0.561 246 ... 74.89 5 6 8.0 8.6 2622 9.4 2612 0.912 1
18 19.0 Winnipeg Jets 28.2 82 39 32 11 89 0.543 250 ... 75.00 9 8 8.8 9.5 2645 9.5 2721 0.907 5
19 20.0 New York Islanders 30.1 82 37 35 10 84 0.512 229 ... 84.19 5 7 8.9 8.4 2367 9.7 2669 0.913 9
20 21.0 Columbus Blue Jackets 26.6 82 37 38 7 81 0.494 258 ... 78.57 7 6 7.7 7.2 2463 10.5 2887 0.897 2
21 22.0 San Jose Sharks 29.0 82 32 37 13 77 0.470 211 ... 85.20 4 11 8.8 8.6 2400 8.8 2622 0.900 3
22 23.0 Anaheim Ducks 27.9 82 31 37 14 76 0.463 228 ... 80.80 6 4 9.3 9.8 2393 9.5 2725 0.902 4
23 24.0 Buffalo Sabres 27.5 82 32 39 11 75 0.457 229 ... 76.42 6 6 8.1 7.9 2451 9.3 2702 0.894 1
24 25.0 Detroit Red Wings 26.9 82 32 40 10 74 0.451 227 ... 73.78 4 10 8.9 8.5 2414 9.4 2761 0.888 4
25 26.0 Ottawa Senators 26.6 82 33 42 7 73 0.445 224 ... 80.32 9 4 10.0 10.2 2463 9.1 2740 0.904 2
26 27.0 Chicago Blackhawks 28.0 82 28 42 12 68 0.415 213 ... 76.23 2 6 7.9 8.7 2362 9.0 2703 0.893 4
27 28.0 New Jersey Devils 25.8 82 27 46 9 63 0.384 245 ... 80.19 6 14 8.1 8.4 2562 9.6 2540 0.881 2
28 29.0 Philadelphia Flyers 28.3 82 25 46 11 61 0.372 210 ... 75.74 6 11 9.0 9.0 2539 8.3 2785 0.894 1
29 30.0 Seattle Kraken 28.7 82 27 49 6 60 0.366 213 ... 74.89 8 7 8.5 8.0 2380 8.9 2367 0.880 3
30 31.0 Arizona Coyotes 28.0 82 25 50 7 57 0.348 206 ... 75.00 3 4 10.2 8.2 2121 9.7 2910 0.894 1
31 32.0 Montreal Canadiens 27.8 82 22 49 11 55 0.335 218 ... 75.55 6 12 10.2 9.0 2442 8.9 2823 0.888 3
32 NaN League Average 28.2 82 41 32 9 91 0.555 255 ... 79.39 7 7 8.9 8.9 2593 9.8 2593 0.902 4
33 rows × 32 columns
Expect to do a little cleanup of that data, once you get it.
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
And for requests: https://requests.readthedocs.io/en/latest/
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
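To see why stripping the comment markers makes the table appear, here is a small offline sketch that uses only the standard library. The sample_html string is a hypothetical miniature of such a page, and TableCounter and visible_table_ids are illustrative names, not part of any library:

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Records the id of every <table> the parser sees as a real element."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def visible_table_ids(html):
    """Return the ids of tables that are not hidden inside comments."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.table_ids


# Hypothetical miniature of the page: one live table, one inside a comment.
sample_html = """
<div id="all_standings"><table id="standings"><tr><td>1</td></tr></table></div>
<div id="all_stats">
<!--
<div id="div_stats"><table id="stats"><tr><td>2</td></tr></table></div>
-->
</div>
"""

print(visible_table_ids(sample_html))   # only the live table is found
uncommented = sample_html.replace("<!--", "").replace("-->", "")
print(visible_table_ids(uncommented))   # both tables are found
```

Blindly removing every comment marker is a blunt tool: it also exposes any other commented-out markup, which is one reason to expect some cleanup afterwards.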

How do I get all the tables from a website using pandas

I am trying to get 3 tables from a particular website, but only the first two are showing up. I have even tried getting the data using BeautifulSoup, but the third seems to be hidden somehow. Is there something I am missing?
url = "https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats"
html = pd.read_html(url, header=1)
print(html[0])
print(html[1])
print(html[2]) # This raises an error saying the table does not exist
The first two tables are the squad tables. The table not showing up is the individual player table. This also happens with similar pages from the same site.
You could use Selenium as suggested, but I think that is a bit overkill. The table is available in the static HTML, just within the comments. So you need to pull the comments out with BeautifulSoup to get those tables.
To get all the tables:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats'
response = requests.get(url)
tables = pd.read_html(response.text, header=1)
# Get the tables within the Comments
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
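for each in comments:
    if 'table' in str(each):
        try:
            table = pd.read_html(str(each), header=1)[0]
            # Drop header rows that are repeated inside the table body
            table = table[table['Rk'].ne('Rk')].reset_index(drop=True)
            tables.append(table)
        except Exception:
            # Skip comments that hold no parsable table or no 'Rk' column
            continue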
Output:
for table in tables:
    print(table)
Squad # Pl 90s GA PKA ... Stp Stp% #OPA #OPA/90 AvgDist
0 Arsenal 2 12.0 17 0 ... 10 8.8 6 0.50 14.6
1 Aston Villa 2 12.0 20 0 ... 6 6.8 13 1.08 16.2
2 Brentford 2 12.0 17 1 ... 10 9.9 18 1.50 15.6
3 Brighton 2 12.0 14 2 ... 17 16.2 13 1.08 15.3
4 Burnley 1 12.0 20 0 ... 14 11.7 17 1.42 16.6
5 Chelsea 2 12.0 4 2 ... 8 8.5 5 0.42 14.0
6 Crystal Palace 1 12.0 17 0 ... 7 7.5 6 0.50 13.5
7 Everton 2 12.0 19 0 ... 8 7.4 7 0.58 13.7
8 Leeds United 1 12.0 20 1 ... 8 12.5 15 1.25 16.3
9 Leicester City 1 12.0 21 2 ... 9 8.4 7 0.58 13.0
10 Liverpool 2 12.0 11 0 ... 9 9.7 16 1.33 17.0
11 Manchester City 2 12.0 6 1 ... 5 8.1 16 1.33 17.5
12 Manchester Utd 1 12.0 21 0 ... 4 4.4 2 0.17 13.3
13 Newcastle Utd 2 12.0 27 4 ... 10 9.8 4 0.33 13.9
14 Norwich City 1 12.0 27 2 ... 6 5.1 5 0.42 12.4
15 Southampton 1 12.0 14 0 ... 16 13.9 2 0.17 12.9
16 Tottenham 1 12.0 17 1 ... 3 2.7 5 0.42 14.1
17 Watford 2 12.0 20 1 ... 6 5.5 9 0.75 15.4
18 West Ham 1 12.0 14 0 ... 6 5.3 1 0.08 11.9
19 Wolves 1 12.0 12 3 ... 9 10.0 10 0.83 15.5
[20 rows x 28 columns]
Squad # Pl 90s GA PKA ... Stp Stp% #OPA #OPA/90 AvgDist
0 vs Arsenal 2 12.0 13 0 ... 4 5.9 11 0.92 15.5
1 vs Aston Villa 2 12.0 16 2 ... 11 8.0 7 0.58 14.8
2 vs Brentford 2 12.0 16 1 ... 16 14.0 9 0.75 15.7
3 vs Brighton 2 12.0 12 3 ... 11 12.5 8 0.67 15.9
4 vs Burnley 1 12.0 14 0 ... 16 10.7 12 1.00 15.1
5 vs Chelsea 2 12.0 30 2 ... 10 11.1 11 0.92 14.2
6 vs Crystal Palace 1 12.0 18 2 ... 7 7.2 9 0.75 14.4
7 vs Everton 2 12.0 16 3 ... 7 7.6 7 0.58 13.8
8 vs Leeds United 1 12.0 12 1 ... 8 7.3 5 0.42 14.2
9 vs Leicester City 1 12.0 16 0 ... 2 3.3 7 0.58 14.3
10 vs Liverpool 2 12.0 35 1 ... 12 9.9 14 1.17 13.7
11 vs Manchester City 2 12.0 25 0 ... 8 6.7 4 0.33 13.1
12 vs Manchester Utd 1 12.0 20 0 ... 7 7.8 7 0.58 14.7
13 vs Newcastle Utd 2 12.0 15 0 ... 8 8.0 8 0.67 15.3
14 vs Norwich City 1 12.0 7 2 ... 5 5.7 16 1.33 17.3
15 vs Southampton 1 12.0 11 2 ... 4 3.7 9 0.75 14.0
16 vs Tottenham 1 12.0 11 1 ... 9 12.2 9 0.75 16.0
17 vs Watford 2 12.0 16 0 ... 8 8.2 9 0.75 15.3
18 vs West Ham 1 12.0 23 0 ... 13 10.5 6 0.50 13.8
19 vs Wolves 1 12.0 12 0 ... 5 6.8 9 0.75 15.3
[20 rows x 28 columns]
Rk Player Nation Pos ... #OPA #OPA/90 AvgDist Matches
0 1 Alisson br BRA GK ... 15 1.36 17.1 Matches
1 2 Kepa Arrizabalaga es ESP GK ... 1 1.00 18.8 Matches
2 3 Daniel Bachmann at AUT GK ... 1 0.25 12.2 Matches
3 4 Asmir Begović ba BIH GK ... 0 0.00 15.0 Matches
4 5 Karl Darlow eng ENG GK ... 4 0.50 14.9 Matches
5 6 Ederson br BRA GK ... 14 1.27 17.5 Matches
6 7 Łukasz Fabiański pl POL GK ... 1 0.08 11.9 Matches
7 8 Álvaro Fernández es ESP GK ... 5 1.67 15.3 Matches
8 9 Ben Foster eng ENG GK ... 8 1.00 16.8 Matches
9 10 David de Gea es ESP GK ... 2 0.17 13.3 Matches
10 11 Vicente Guaita es ESP GK ... 6 0.50 13.5 Matches
11 12 Caoimhín Kelleher ie IRL GK ... 1 1.00 14.6 Matches
12 13 Tim Krul nl NED GK ... 5 0.42 12.4 Matches
13 14 Bernd Leno de GER GK ... 1 0.33 13.1 Matches
14 15 Hugo Lloris fr FRA GK ... 5 0.42 14.1 Matches
15 16 Emiliano Martínez ar ARG GK ... 12 1.09 16.4 Matches
16 17 Alex McCarthy eng ENG GK ... 2 0.17 12.9 Matches
17 18 Edouard Mendy sn SEN GK ... 4 0.36 13.3 Matches
18 19 Illan Meslier fr FRA GK ... 15 1.25 16.3 Matches
19 20 Jordan Pickford eng ENG GK ... 7 0.64 13.6 Matches
20 21 Nick Pope eng ENG GK ... 17 1.42 16.6 Matches
21 22 Aaron Ramsdale eng ENG GK ... 5 0.56 14.9 Matches
22 23 David Raya es ESP GK ... 13 1.44 15.7 Matches
23 24 José Sá pt POR GK ... 10 0.83 15.5 Matches
24 25 Robert Sánchez es ESP GK ... 13 1.18 15.4 Matches
25 26 Kasper Schmeichel dk DEN GK ... 7 0.58 13.0 Matches
26 27 Jason Steele eng ENG GK ... 0 0.00 13.0 Matches
27 28 Jed Steer eng ENG GK ... 1 1.00 14.3 Matches
28 29 Zack Steffen us USA GK ... 2 2.00 17.8 Matches
29 30 Freddie Woodman eng ENG GK ... 0 0.00 11.6 Matches
[30 rows x 34 columns]
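The filter on the 'Rk' column in the code above presumably exists because the player table repeats its header row inside the body. A minimal sketch of that cleanup on a hypothetical three-column frame (the frame raw is made up for illustration; only the column names and a few values come from the output above):

```python
import pandas as pd

# Hypothetical frame shaped like a parsed player table: a header row is
# repeated inside the body, so every column arrives as strings.
raw = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"],
    "Player": ["Alisson", "Ederson", "Player", "Nick Pope"],
    "#OPA": ["15", "14", "#OPA", "17"],
})

# Keep only rows whose 'Rk' cell is not the literal header text.
clean = raw[raw["Rk"].ne("Rk")].reset_index(drop=True)

# Convert the numeric columns; errors='coerce' turns leftovers into NaN.
for col in ["Rk", "#OPA"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean)
print(clean.dtypes)
```

Converting after the filter matters: while the repeated header row is present, pandas keeps the whole column as object dtype.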

Unable to extract Tables

Beginner here. I'm having issues while trying to extract data from the second (Team Statistics) and third (Team Analytics 5-on-5) Table on this page:
https://www.hockey-reference.com/leagues/NHL_2021.html
I'm using this code:
import pandas as pd
import requests
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[1]
print(df)
and
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[2]
print(df)
to get the right tables.
But for some kind of reason I will always get this error message:
IndexError: list index out of range
I could extract the first table by using the same code with df = df_list[0]; that works, but it is useless to me. I really need the 2nd and 3rd tables, and I just don't know why it doesn't work.
Pretty sure that's easy to answer for most of you.
Thanx in advance!
You get this error because read_html() returns a list with a single element here, and that element is at position 0.
Instead of
df = df_list[1]
use
df = df_list[0]
That single table combines all the divisions, so if you want to extract the second and third divisions, slice it with the loc[] accessor:
east_division = df.loc[9:17]
north_division = df.loc[18:25]
You can also pass the URL directly to pandas.read_html, which returns a list of DataFrames:
df_list = pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021.html')
The tables are in fact there in the HTML, inside the comments. Use BeautifulSoup to pull out the comments and parse those tables as well. The code below collects all tables, both commented and uncommented, into a list. After that it is just a matter of pulling out the tables you want by index, in this case indices 1 and 2.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.hockey-reference.com/leagues/NHL_2021.html"
# Get all uncommented tables
tables = pd.read_html(url, header=1)
# Get the HTML source
response = requests.get(url, headers=headers)
# Create a soup object from the HTML
soup = BeautifulSoup(response.content, 'html.parser')
# Get the comments in the HTML
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Iterate through each comment, parse the table if one is found,
# and append it to the tables list
for each in comments:
    if 'table' in str(each):
        try:
            table = pd.read_html(str(each), header=1)[0]
            # Optional per-table cleanup (not applied in the output below):
            # table = table[table['Rk'].ne('Rk')].rename(columns={'Unnamed: 1': 'Team'})
            tables.append(table)
        except Exception:
            continue
Output:
for table in tables[1:3]:
    print(table)
Rk Unnamed: 1 AvAge GP W ... S S% SA SV% SO
0 1.0 New York Islanders 29.0 28 18 ... 841 9.8 767 0.920 5
1 2.0 Tampa Bay Lightning 28.3 26 19 ... 798 12.2 725 0.919 3
2 3.0 Florida Panthers 28.1 27 18 ... 918 10.0 840 0.910 0
3 4.0 Toronto Maple Leafs 28.9 29 19 ... 883 11.2 828 0.909 2
4 5.0 Carolina Hurricanes 27.2 26 19 ... 816 10.9 759 0.912 3
5 6.0 Washington Capitals 30.4 27 17 ... 768 12.0 808 0.895 0
6 7.0 Vegas Golden Knights 29.1 25 18 ... 752 11.0 691 0.920 4
7 8.0 Edmonton Oilers 28.4 30 18 ... 945 10.6 938 0.907 2
8 9.0 Winnipeg Jets 28.0 27 17 ... 795 11.4 856 0.910 1
9 10.0 Pittsburgh Penguins 28.1 27 17 ... 779 11.0 784 0.899 1
10 11.0 Chicago Blackhawks 27.2 29 14 ... 863 10.1 997 0.910 2
11 12.0 Minnesota Wild 28.8 25 16 ... 764 10.3 723 0.913 2
12 13.0 St. Louis Blues 28.2 28 14 ... 836 10.4 835 0.892 0
13 14.0 Boston Bruins 28.8 25 14 ... 772 8.8 665 0.913 2
14 15.0 Colorado Avalanche 26.8 25 15 ... 846 8.7 622 0.905 4
15 16.0 Montreal Canadiens 28.8 27 12 ... 890 9.7 782 0.909 0
16 17.0 Philadelphia Flyers 27.5 25 13 ... 699 11.7 753 0.892 3
17 18.0 Calgary Flames 28.0 28 13 ... 838 8.9 845 0.904 3
18 19.0 Los Angeles Kings 27.7 26 11 ... 748 10.3 814 0.910 2
19 20.0 Vancouver Canucks 27.7 31 13 ... 951 8.8 1035 0.903 1
20 21.0 Columbus Blue Jackets 27.0 29 11 ... 839 9.3 902 0.895 1
21 22.0 Arizona Coyotes 28.5 27 12 ... 689 9.7 851 0.907 1
22 23.0 San Jose Sharks 29.3 25 11 ... 749 9.5 800 0.890 1
23 24.0 New York Rangers 25.7 26 11 ... 773 9.2 746 0.906 2
24 25.0 Nashville Predators 28.9 28 11 ... 880 7.4 837 0.885 1
25 26.0 Anaheim Ducks 28.4 29 8 ... 804 7.7 852 0.891 3
26 27.0 Dallas Stars 28.3 23 8 ... 657 10.2 626 0.904 3
27 28.0 Detroit Red Wings 29.4 28 8 ... 785 8.0 870 0.891 0
28 29.0 Ottawa Senators 26.4 30 9 ... 942 8.2 960 0.874 0
29 30.0 New Jersey Devils 26.2 24 8 ... 708 8.5 741 0.896 2
30 31.0 Buffalo Sabres 27.4 26 6 ... 728 7.7 804 0.893 0
31 NaN League Average 28.1 27 13 ... 808 9.8 808 0.902 2
[32 rows x 32 columns]
Rk Unnamed: 1 S% SV% ... HDGF HDC% HDGA HDCO%
0 1 New York Islanders 8.3 0.931 ... 11 12.2 11 11.8
1 2 Tampa Bay Lightning 8.7 0.933 ... 11 14.9 6 6.3
2 3 Florida Panthers 7.9 0.926 ... 15 14.4 12 17.6
3 4 Toronto Maple Leafs 8.8 0.933 ... 16 13.4 8 11.1
4 5 Carolina Hurricanes 7.5 0.932 ... 12 12.8 7 9.3
5 6 Washington Capitals 9.8 0.919 ... 10 10.9 5 7.8
6 7 Vegas Golden Knights 9.3 0.927 ... 20 15.9 11 14.5
7 8 Edmonton Oilers 8.2 0.920 ... 9 11.3 13 9.8
8 9 Winnipeg Jets 8.5 0.926 ... 15 15.0 8 7.8
9 10 Pittsburgh Penguins 8.8 0.922 ... 10 14.5 15 13.5
10 11 Chicago Blackhawks 7.3 0.925 ... 10 10.5 14 15.1
11 12 Minnesota Wild 9.9 0.930 ... 16 14.2 8 11.9
12 13 St. Louis Blues 8.4 0.914 ... 15 18.1 15 15.8
13 14 Boston Bruins 6.6 0.922 ... 5 7.4 11 12.2
14 15 Colorado Avalanche 6.7 0.916 ... 8 8.1 8 13.3
15 16 Montreal Canadiens 7.8 0.935 ... 15 12.0 8 11.3
16 17 Philadelphia Flyers 10.1 0.907 ... 18 15.9 9 12.9
17 18 Calgary Flames 7.6 0.929 ... 6 6.9 8 9.2
18 19 Los Angeles Kings 7.5 0.925 ... 11 13.1 8 9.8
19 20 Vancouver Canucks 7.3 0.919 ... 17 13.2 20 17.4
20 21 Columbus Blue Jackets 8.1 0.918 ... 5 9.6 15 13.6
21 22 Arizona Coyotes 7.7 0.924 ... 11 14.7 14 12.8
22 23 San Jose Sharks 8.1 0.909 ... 12 14.6 16 14.0
23 24 New York Rangers 7.8 0.921 ... 17 14.0 8 12.7
24 25 Nashville Predators 5.7 0.918 ... 5 10.6 11 13.4
25 26 Anaheim Ducks 7.4 0.909 ... 12 13.3 25 16.8
26 27 Dallas Stars 7.4 0.929 ... 11 13.3 5 12.8
27 28 Detroit Red Wings 7.5 0.923 ... 13 15.3 12 16.7
28 29 Ottawa Senators 7.1 0.894 ... 7 8.6 20 14.3
29 30 New Jersey Devils 7.2 0.923 ... 10 14.3 12 13.2
30 31 Buffalo Sabres 5.8 0.911 ... 6 8.2 16 14.0
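Picking tables by position breaks as soon as the page adds or removes a table. As a hedged alternative, a table can be selected by the columns it contains. The helper find_table below is a hypothetical name, and the three small frames are stand-ins for the scraped list, reusing a few column names and values from the output above:

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame containing all required columns, else None."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    return None


# Hypothetical stand-ins for the list built from the page.
tables = [
    pd.DataFrame({"Rk": [1], "W": [18]}),
    pd.DataFrame({"Rk": [1], "Unnamed: 1": ["New York Islanders"],
                  "AvAge": [29.0], "SV%": [0.920]}),
    pd.DataFrame({"Rk": [1], "Unnamed: 1": ["New York Islanders"],
                  "HDGF": [11], "HDC%": [12.2]}),
]

team_stats = find_table(tables, ["AvAge", "SV%"])
analytics = find_table(tables, ["HDGF", "HDC%"])
missing = find_table(tables, ["NoSuchColumn"])

print(team_stats.columns.tolist())
print(analytics.columns.tolist())
print(missing)  # None when no table matches
```

Returning None instead of raising makes the "table not found" case explicit, so the caller can report it rather than hit an IndexError.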

I can't figure out why my web scraping code isn't working

I am very new to coding and I am trying to build a web scraper for Excel so that I can transfer it to Google Sheets. Unfortunately, the code that I have written is working for other people, but not me.
This is the code I have written:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
csv_name = 'nhl_season_stats.csv'
def get_nhl_stats(URL):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    pageTree = requests.get(URL, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
    tables = []
    for each in comments:
        if 'table' in each:
            try:
                tables.append(pd.read_html(each, header=1)[0])
            except:
                continue
    df = tables[0]
    df = df.rename(columns={'Unnamed: 1':'Team'})
    df.to_csv(csv_name, index = False)
    print(df)
get_nhl_stats(URL)
After running it, I receive this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 13, in get_nhl_stats
IndexError: list index out of range
Sorry for my bad jargon, as I am very new and very confused, but any help would be greatly appreciated!
This code works for me. Maybe the problem is in the import of the Comment class, or the server is not returning the requested content to you:
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
csv_name = 'nhl_season_stats.csv'
def get_nhl_stats(URL):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    pageTree = requests.get(URL, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    comments = pageSoup.find_all(string=lambda text: isinstance(text, str))
    tables = []
    for each in comments:
        if 'table' in each:
            try:
                tables.append(pd.read_html(each, header=1)[0])
            except:
                continue
    df = tables[0]
    df = df.rename(columns={'Unnamed: 1':'Team'})
    df.to_csv(csv_name, index = False)
    print(df)
get_nhl_stats(URL)
output:
Rk Team AvAge GP W L OL PTS PTS% GF GA SOW SOL SRS SOS TG/G EVGF EVGA PP PPO PP% PPA PPOA PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Toronto Maple Leafs 29.0 6 4 2 0 8 0.667 19 17 0.0 0.0 0.33 -0.01 6.00 11 12 8 18 44.44 4 22 81.82 0 1 10.5 7.5 190 10.0 157 0.892 0
1 2.0 Montreal Canadiens 28.6 5 3 0 2 8 0.800 24 15 0.0 1.0 0.77 -0.83 7.80 14 8 6 20 30.00 6 25 76.00 4 1 11.4 10.6 180 13.3 140 0.893 0
2 3.0 Vegas Golden Knights 28.9 5 4 1 0 8 0.800 18 12 0.0 0.0 1.12 -0.08 6.00 15 8 2 18 11.11 3 18 83.33 1 1 7.2 7.2 150 12.0 125 0.904 0
3 4.0 Minnesota Wild 29.1 5 4 1 0 8 0.800 15 10 0.0 0.0 0.86 -0.14 5.00 13 9 1 23 4.35 1 16 93.75 1 0 7.6 10.4 166 9.0 147 0.932 0
4 5.0 Washington Capitals 30.1 5 3 0 2 8 0.800 18 16 1.0 1.0 0.10 -0.30 6.80 16 12 2 9 22.22 3 18 83.33 0 1 8.6 5.0 130 13.8 141 0.887 0
5 6.0 Philadelphia Flyers 27.0 5 3 1 1 7 0.700 19 15 0.0 1.0 0.36 -0.24 6.80 14 10 5 17 29.41 5 18 72.22 0 0 7.2 6.8 125 15.2 187 0.920 1
6 7.0 Colorado Avalanche 26.9 5 3 2 0 6 0.600 17 12 0.0 0.0 0.47 -0.53 5.80 7 9 10 25 40.00 3 19 84.21 0 0 8.0 10.4 147 11.6 143 0.916 1
7 8.0 Winnipeg Jets 27.9 4 3 1 0 6 0.750 13 10 0.0 0.0 1.10 0.35 5.75 11 6 2 20 10.00 4 12 66.67 0 0 10.3 14.3 119 10.9 134 0.925 0
8 9.0 New York Islanders 28.9 4 3 1 0 6 0.750 9 6 0.0 0.0 0.61 -0.14 3.75 5 5 4 20 20.00 1 15 93.33 0 0 11.5 11.0 108 8.3 114 0.947 2
9 10.0 Tampa Bay Lightning 27.7 3 3 0 0 6 1.000 13 5 0.0 0.0 1.70 -0.97 6.00 11 2 2 8 25.00 3 11 72.73 0 0 9.0 7.0 107 12.1 85 0.941 0
10 11.0 Pittsburgh Penguins 28.6 5 3 2 0 6 0.600 16 21 2.0 0.0 -0.43 0.17 7.40 10 16 5 18 27.78 5 19 73.68 1 0 7.6 7.2 152 10.5 130 0.838 0
11 12.0 New Jersey Devils 26.2 4 2 1 1 5 0.625 9 10 0.0 1.0 -0.35 0.15 4.75 8 3 1 11 9.09 6 16 62.50 0 1 9.8 7.3 112 8.0 150 0.933 0
12 13.0 St. Louis Blues 28.3 4 2 1 1 5 0.625 10 14 0.0 1.0 -1.66 -0.41 6.00 10 6 0 14 0.00 8 21 61.90 0 0 11.0 7.5 109 9.2 129 0.891 0
13 14.0 Boston Bruins 28.8 4 2 1 1 5 0.625 7 9 2.0 0.0 0.07 0.07 4.00 3 7 3 13 23.08 2 18 88.89 1 0 11.3 8.8 135 5.2 96 0.906 0
14 15.0 Arizona Coyotes 28.4 5 2 2 1 5 0.500 17 17 0.0 1.0 -0.04 0.16 6.80 11 11 5 22 22.73 5 24 79.17 1 1 10.4 9.6 144 11.8 157 0.892 0
15 16.0 Calgary Flames 28.1 3 2 0 1 5 0.833 11 6 0.0 0.0 1.14 -0.52 5.67 5 4 6 16 37.50 1 12 91.67 0 1 8.7 11.3 93 11.8 93 0.935 1
16 17.0 Edmonton Oilers 27.9 6 2 4 0 4 0.333 15 20 0.0 0.0 -0.91 -0.08 5.83 10 14 3 23 13.04 4 18 77.78 2 2 7.7 9.3 192 7.8 200 0.900 0
17 18.0 Vancouver Canucks 27.3 6 2 4 0 4 0.333 17 28 1.0 0.0 -1.34 0.33 7.50 12 17 4 26 15.38 9 31 70.97 1 2 13.3 10.7 179 9.5 222 0.874 0
18 19.0 Anaheim Ducks 28.6 5 1 2 2 4 0.400 8 13 0.0 0.0 -0.10 0.90 4.20 8 10 0 12 0.00 2 15 86.67 0 1 6.4 5.2 133 6.0 160 0.919 1
19 20.0 Columbus Blue Jackets 26.6 5 1 2 2 4 0.400 10 16 0.0 0.0 -1.19 0.01 5.20 9 15 1 11 9.09 1 10 90.00 0 0 9.0 9.4 152 6.6 169 0.905 0
20 21.0 Los Angeles Kings 28.3 4 1 1 2 4 0.500 12 13 0.0 0.0 0.43 0.68 6.25 8 10 4 17 23.53 3 21 85.71 0 0 11.0 9.0 119 10.1 121 0.893 0
21 22.0 Detroit Red Wings 29.3 5 2 3 0 4 0.400 10 14 0.0 0.0 -1.54 -0.74 4.80 9 9 1 12 8.33 4 16 75.00 0 1 11.4 9.8 130 7.7 155 0.910 0
22 23.0 San Jose Sharks 29.4 5 2 3 0 4 0.400 12 18 2.0 0.0 -1.32 -0.52 6.00 7 16 5 21 23.81 2 18 88.89 0 0 8.4 9.6 162 7.4 148 0.878 0
23 24.0 Carolina Hurricanes 27.0 3 2 1 0 4 0.667 9 6 0.0 0.0 0.26 -0.74 5.00 6 5 3 12 25.00 1 9 88.89 0 0 7.7 9.7 98 9.2 68 0.912 1
24 25.0 Florida Panthers 27.8 2 2 0 0 4 1.000 10 6 0.0 0.0 1.29 -0.71 8.00 7 3 3 8 37.50 3 5 40.00 0 0 5.0 8.0 66 15.2 66 0.909 0
25 26.0 Nashville Predators 28.7 4 2 2 0 4 0.500 10 14 0.0 0.0 0.01 1.01 6.00 9 7 1 16 6.25 6 16 62.50 0 1 8.0 8.0 135 7.4 126 0.889 0
26 27.0 Buffalo Sabres 27.2 5 1 3 1 3 0.300 14 15 0.0 1.0 -0.18 0.22 5.80 11 14 3 17 17.65 1 6 83.33 0 0 3.8 8.2 161 8.7 133 0.887 0
27 28.0 New York Rangers 25.6 4 1 2 1 3 0.375 11 11 0.0 1.0 -0.15 0.11 5.50 7 7 4 21 19.05 4 16 75.00 0 0 8.5 14.0 140 7.9 112 0.902 1
28 29.0 Chicago Blackhawks 26.9 5 1 3 1 3 0.300 13 21 0.0 0.0 -0.43 1.17 6.80 5 16 7 17 41.18 5 20 75.00 1 0 8.0 6.8 154 8.4 167 0.874 0
29 30.0 Ottawa Senators 27.0 4 1 2 1 3 0.375 11 14 0.0 0.0 -0.04 0.71 6.25 8 10 3 18 16.67 4 21 80.95 0 0 14.3 15.3 113 9.7 120 0.883 0
30 31.0 Dallas Stars 28.8 1 1 0 0 2 1.000 7 0 0.0 0.0 7.30 0.30 7.00 1 0 5 8 62.50 0 5 100.00 1 0 10.0 16.0 28 25.0 34 1.000 1
31 NaN League Average 28.0 4 2 2 1 5 0.574 13 13 NaN NaN NaN NaN 5.94 9 9 4 16 21.33 4 16 78.67 0 0 8.0 8.0 133 9.8 133 0.902 0
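The IndexError on tables[0] means the list stayed empty, so no comment produced a table. One way to narrow that down is to check whether the response contains any commented-out table at all before indexing. The sketch below uses only the standard library and two hypothetical response bodies; CommentCollector and commented_tables are illustrative names:

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def commented_tables(html):
    """Return the comment bodies that appear to contain a <table>."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return [c for c in collector.comments if "<table" in c]


# Hypothetical responses: a normal page and a block or error page.
good_page = "<html><!-- <table id='stats'><tr><td>1</td></tr></table> --></html>"
blocked_page = "<html><body>Too many requests</body></html>"

for name, html in [("good", good_page), ("blocked", blocked_page)]:
    found = commented_tables(html)
    if found:
        print(name, "->", len(found), "commented table(s)")
    else:
        print(name, "-> no commented tables; tables[0] would raise IndexError")
```

If the real response looks like the second case, the problem lies with what the server returned (for example a rate-limit page) rather than with the parsing code.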

Pandas DataFrame- create a 14 day moving average, but show simple averages for the first 14 days of data?

I have a pandas dataframe similar to this.
score avg
date
1/1/2017 0 0
1/2/2017 1 0.5
1/3/2017 2 1
1/4/2017 3 1.5
1/5/2017 4 2
1/6/2017 5 2.5
1/7/2017 6 3
1/8/2017 7 3.5
1/9/2017 8 4
1/10/2017 9 4.5
1/11/2017 10 5
1/12/2017 11 5.5
1/13/2017 12 7.5
1/14/2017 13 6.5
1/15/2017 14 7.5
1/16/2017 15 8.5
1/17/2017 16 9.5
1/18/2017 17 10.5
1/19/2017 18 11.5
1/20/2017 19 12.5
1/21/2017 20 13.5
1/22/2017 21 14.5
1/23/2017 22 15.5
1/24/2017 23 16.5
1/25/2017 24 17.5
1/26/2017 25 18.5
1/27/2017 26 19.5
1/28/2017 27 20.5
1/29/2017 28 21.5
Basically I am looking to create a 14 day rolling average of the data, but instead of showing NaNs for the first 14 days, simply showing the simple averages. For example, the average on day 2 is the average of days 1 and 2, the average on day 10 is the average of days 1-10, etc. How would I go about doing this without having to manually create averages? Thanks for the help!
What you need is rolling with min_periods=1 as a parameter, so that a window with fewer than 14 rows still returns the mean of the rows it has:
df['avg2'] = df.rolling(14, min_periods=1)['score'].mean()
Output:
date score avg avg2
0 2017-01-01 0 0.0 0.0
1 2017-01-02 1 0.5 0.5
2 2017-01-03 2 1.0 1.0
3 2017-01-04 3 1.5 1.5
4 2017-01-05 4 2.0 2.0
5 2017-01-06 5 2.5 2.5
6 2017-01-07 6 3.0 3.0
7 2017-01-08 7 3.5 3.5
8 2017-01-09 8 4.0 4.0
9 2017-01-10 9 4.5 4.5
10 2017-01-11 10 5.0 5.0
11 2017-01-12 11 5.5 5.5
12 2017-01-13 12 7.5 6.0
13 2017-01-14 13 6.5 6.5
14 2017-01-15 14 7.5 7.5
15 2017-01-16 15 8.5 8.5
16 2017-01-17 16 9.5 9.5
17 2017-01-18 17 10.5 10.5
18 2017-01-19 18 11.5 11.5
19 2017-01-20 19 12.5 12.5
20 2017-01-21 20 13.5 13.5
21 2017-01-22 21 14.5 14.5
22 2017-01-23 22 15.5 15.5
23 2017-01-24 23 16.5 16.5
24 2017-01-25 24 17.5 17.5
25 2017-01-26 25 18.5 18.5
26 2017-01-27 26 19.5 19.5
27 2017-01-28 27 20.5 20.5
28 2017-01-29 28 21.5 21.5
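A self-contained sketch that rebuilds the score column from the question (0 to 28 over 29 consecutive days) and compares the result with an expanding mean. The column name avg14 is an illustrative choice:

```python
import pandas as pd

# Rebuild the question's data: scores 0..28 on consecutive days.
df = pd.DataFrame(
    {"score": range(29)},
    index=pd.date_range("2017-01-01", periods=29, freq="D"),
)
df.index.name = "date"

# Window of 14 rows; min_periods=1 averages whatever rows exist so far
# instead of returning NaN until the window is full.
df["avg14"] = df["score"].rolling(14, min_periods=1).mean()

# For comparison: the expanding (cumulative) mean over all rows so far.
df["expanding"] = df["score"].expanding().mean()

print(df.head(3))
print(df.iloc[12:16])
```

The two columns agree for the first 14 rows and diverge from row 15 onward, where the rolling window starts dropping the oldest day.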
