My data relates to cricket, a bat-and-ball sport (similar to baseball). Each innings has a maximum of 20 overs, and each over has roughly 6 balls.
data:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
32 2008 60 1 61 0 5.1 0
33 2008 60 1 61 1 5.2 0
34 2008 60 1 61 1 5.3 0
35 2008 60 1 61 1 5.4 0
36 2008 60 1 61 1 5.5 0
... ... ... ... ... ... ... ...
179073 2019 11415 2 152 5 19.2 0
179074 2019 11415 2 154 5 19.3 0
179075 2019 11415 2 155 6 19.4 0
179076 2019 11415 2 157 6 19.5 0
179077 2019 11415 2 157 7 19.6 0
111972 rows × 7 columns
innings_score is a new column I created (with a default value of 0) and I want to update it. The values I want to enter into it are the results of the df.groupby below.
In[]:
df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()
Out[]:
season match_id inning
2008 60 1 222
2 82
61 1 240
2 207
62 1 129
...
2019 11413 2 170
11414 1 155
2 162
11415 1 152
2 157
Name: sum_total_runs, Length: 1276, dtype: int64
I want innings_score to be like:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
32 2008 60 1 61 0 5.1 222
33 2008 60 1 61 1 5.2 222
34 2008 60 1 61 1 5.3 222
35 2008 60 1 61 1 5.4 222
36 2008 60 1 61 1 5.5 222
... ... ... ... ... ... ... ...
179073 2019 11415 2 152 5 19.2 157
179074 2019 11415 2 154 5 19.3 157
179075 2019 11415 2 155 6 19.4 157
179076 2019 11415 2 157 6 19.5 157
179077 2019 11415 2 157 7 19.6 157
111972 rows × 7 columns
I would use assign. Starting from a simple example:
import pandas as pd
dt = pd.DataFrame({"name1":["A", "A", "B", "B", "C", "C"], "name2":["C", "C", "C", "D", "D", "D"], "value":[1, 2, 3, 4, 5, 6]})
grouping_variables = ["name1", "name2"]
dt = dt.set_index(grouping_variables)
dt = dt.assign(new_column=dt.groupby(grouping_variables)["value"].max())
As you can see, you set your grouping_variables as indices before running the assignment.
You can always reset the index at the end if you don't want to keep the dataframe indexed by the grouping_variables:
dt.reset_index()
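For a quick sanity check, the group maxima in this toy frame are 2 for (A, C), 3 for (B, C), 4 for (B, D) and 6 for (C, D), so the assignment should broadcast those values back to every member row (hand-computed expectation):
print(dt["new_column"].tolist())  # expected: [2, 2, 3, 4, 6, 6]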
One way is to set those 3 columns as the index, assign the groupby result as a new column, and reset the index afterwards.
While those columns are the index, the groupby result and the dataframe share the same index, so pandas will automatically align them and insert the correct values at the correct positions. Resetting the index then turns them back into normal columns.
Something like this:
In [46]: df
Out[46]:
season match_id inning sum_total_runs sum_total_wickets over/ball
0 2008 60 1 61 0 5.1
1 2008 60 1 61 1 5.2
2 2008 60 1 61 1 5.3
3 2008 60 1 61 1 5.4
4 2008 60 1 61 1 5.5
5 2019 11415 2 152 5 19.2
6 2019 11415 2 154 5 19.3
7 2019 11415 2 155 6 19.4
8 2019 11415 2 157 6 19.5
9 2019 11415 2 157 7 19.6
In [47]: df.set_index(['season', 'match_id', 'inning']).assign(
    ...:     innings_score=df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()
    ...: ).reset_index()
Out[47]:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
0 2008 60 1 61 0 5.1 61
1 2008 60 1 61 1 5.2 61
2 2008 60 1 61 1 5.3 61
3 2008 60 1 61 1 5.4 61
4 2008 60 1 61 1 5.5 61
5 2019 11415 2 152 5 19.2 157
6 2019 11415 2 154 5 19.3 157
7 2019 11415 2 155 6 19.4 157
8 2019 11415 2 157 6 19.5 157
9 2019 11415 2 157 7 19.6 157
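As a side note, you can get the same result without touching the index at all: groupby(...).transform('max') broadcasts each group's maximum back onto the original rows, so it aligns with df directly. A minimal sketch of that variant:
df['innings_score'] = (df.groupby(['season', 'match_id', 'inning'])['sum_total_runs']
                         .transform('max'))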
I found code to calculate a YTD (year-to-date) value (basically a cumulative sum applied to a groupby on "year").
But now I want this cumulative sum only where the "Type" column is "Actual", not "Budget". I'd like either empty cells where Type = "Budget", or ideally 332 (the last YTD value) displayed on every row where Type = "Budget".
Initial table:
Value Type year month
0 100 Actual 2018 1
1 50 Actual 2018 2
2 20 Actual 2018 3
3 123 Actual 2018 4
4 56 Actual 2018 5
5 76 Actual 2018 6
6 98 Actual 2018 7
7 126 Actual 2018 8
8 90 Actual 2018 9
9 80 Actual 2018 10
10 67 Actual 2018 11
11 87 Actual 2018 12
12 101 Actual 2019 1
13 98 Actual 2019 2
14 76 Actual 2019 3
15 57 Actual 2019 4
16 98 Budget 2019 5
17 109 Budget 2019 6
18 123 Budget 2019 7
19 67 Budget 2019 8
20 98 Budget 2019 9
21 67 Budget 2019 10
22 98 Budget 2019 11
23 123 Budget 2019 12
This is the code that produced my current table:
df['YTD'] = df.groupby('year')['Value'].cumsum()
Value Type year month YTD
0 100 Actual 2018 1 100
1 50 Actual 2018 2 150
2 20 Actual 2018 3 170
3 123 Actual 2018 4 293
4 56 Actual 2018 5 349
5 76 Actual 2018 6 425
6 98 Actual 2018 7 523
7 126 Actual 2018 8 649
8 90 Actual 2018 9 739
9 80 Actual 2018 10 819
10 67 Actual 2018 11 886
11 87 Actual 2018 12 973
12 101 Actual 2019 1 101
13 98 Actual 2019 2 199
14 76 Actual 2019 3 275
15 57 Actual 2019 4 332
16 98 Budget 2019 5 430
17 109 Budget 2019 6 539
18 123 Budget 2019 7 662
19 67 Budget 2019 8 729
20 98 Budget 2019 9 827
21 67 Budget 2019 10 894
22 98 Budget 2019 11 992
23 123 Budget 2019 12 1115
Desired table:
Value Type year month YTD
0 100 Actual 2018 1 100
1 50 Actual 2018 2 150
2 20 Actual 2018 3 170
3 123 Actual 2018 4 293
4 56 Actual 2018 5 349
5 76 Actual 2018 6 425
6 98 Actual 2018 7 523
7 126 Actual 2018 8 649
8 90 Actual 2018 9 739
9 80 Actual 2018 10 819
10 67 Actual 2018 11 886
11 87 Actual 2018 12 973
12 101 Actual 2019 1 101
13 98 Actual 2019 2 199
14 76 Actual 2019 3 275
15 57 Actual 2019 4 332
16 98 Budget 2019 5 332
17 109 Budget 2019 6 332
18 123 Budget 2019 7 332
19 67 Budget 2019 8 332
20 98 Budget 2019 9 332
21 67 Budget 2019 10 332
22 98 Budget 2019 11 332
23 123 Budget 2019 12 332
A solution I found was simply to set a condition (where Type = "Actual"), but then the whole table wouldn't display, whereas I need to display it entirely...
Do you have an idea to overcome this partial-selection problem?
Thank you,
Alex
With DataFrame.loc we select only the rows where Type equals "Actual" and assign the cumsum to the new column YTD.
Then we fill the resulting NaN gaps with GroupBy.ffill:
m = df['Type'].eq('Actual')
df.loc[m, 'YTD'] = df.loc[m].groupby('year')['Value'].cumsum()
df['YTD'] = df.groupby('year')['YTD'].ffill()
Value Type year month YTD
0 100 Actual 2018 1 100.0
1 50 Actual 2018 2 150.0
2 20 Actual 2018 3 170.0
3 123 Actual 2018 4 293.0
4 56 Actual 2018 5 349.0
5 76 Actual 2018 6 425.0
6 98 Actual 2018 7 523.0
7 126 Actual 2018 8 649.0
8 90 Actual 2018 9 739.0
9 80 Actual 2018 10 819.0
10 67 Actual 2018 11 886.0
11 87 Actual 2018 12 973.0
12 101 Actual 2019 1 101.0
13 98 Actual 2019 2 199.0
14 76 Actual 2019 3 275.0
15 57 Actual 2019 4 332.0
16 98 Budget 2019 5 332.0
17 109 Budget 2019 6 332.0
18 123 Budget 2019 7 332.0
19 67 Budget 2019 8 332.0
20 98 Budget 2019 9 332.0
21 67 Budget 2019 10 332.0
22 98 Budget 2019 11 332.0
23 123 Budget 2019 12 332.0
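A variant that avoids the NaN/ffill round trip (and keeps YTD as integers) is to let Budget rows contribute 0 to the running sum, so the total simply stays flat at the last Actual value. A sketch:
# Budget rows add 0, so the cumulative sum stays at the last Actual YTD
df['YTD'] = df['Value'].where(df['Type'].eq('Actual'), 0).groupby(df['year']).cumsum()
This relies on not needing blanks for the Budget rows; if you want those instead, stick with the loc/ffill version above.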
I'm new to Python, so any help or advice is much appreciated, and sorry if I'm asking very obvious things.
I have the following data:
WMO_NO YEAR MONTH DAY HOUR MINUTE H PS T RH TD WDIR WSP
0 4018 2006 1 1 11 28 38 988.6 0.9 98 0.6 120 14.4
1 4018 2006 1 1 11 28 46 987.6 0.5 91 -0.7 122 15.0
2 4018 2006 1 1 11 28 57 986.3 0.5 89 -1.1 124 15.5
3 4018 2006 1 1 11 28 66 985.1 0.5 90 -1.1 126 16.0
4 4018 2006 1 1 11 28 74 984.1 0.4 90 -1.1 127 16.5
I would like to combine YEAR, MONTH, DAY, HOUR and MINUTE into a new column formatted as YEAR:MONTH:DAY:HOUR:MINUTE (and then index the T data with this column) and do some analysis.
My first question is: how do I create such a new column? The second is: can I do comparisons and analysis on this column, like (YEAR:MONTH:DAY:HOUR:MINUTE > 2007:04:13:04:44)?
Cheers.
You can use to_datetime and then, if necessary, Series.dt.strftime with a custom format; see http://strftime.org/:
df['date'] = pd.to_datetime(df[['YEAR','MONTH','DAY','HOUR','MINUTE']])
df['date_new'] = df['date'].dt.strftime('%Y:%m:%d:%H:%M')
print (df)
WMO_NO YEAR MONTH DAY HOUR MINUTE H PS T RH TD WDIR \
0 4018 2006 1 1 11 28 38 988.6 0.9 98 0.6 120
1 4018 2006 1 1 11 28 46 987.6 0.5 91 -0.7 122
2 4018 2006 1 1 11 28 57 986.3 0.5 89 -1.1 124
3 4018 2006 1 1 11 28 66 985.1 0.5 90 -1.1 126
4 4018 2006 1 1 11 28 74 984.1 0.4 90 -1.1 127
WSP date date_new
0 14.4 2006-01-01 11:28:00 2006:01:01:11:28
1 15.0 2006-01-01 11:28:00 2006:01:01:11:28
2 15.5 2006-01-01 11:28:00 2006:01:01:11:28
3 16.0 2006-01-01 11:28:00 2006:01:01:11:28
4 16.5 2006-01-01 11:28:00 2006:01:01:11:28
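On the second part of the question (comparisons): do them on the real datetime column rather than on the formatted string, since pandas compares datetime columns against a timestamp-like string directly. Using the threshold from the question:
mask = df['date'] > '2007-04-13 04:44'
print(df[mask])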
If your data consists of integers instead of strings you can use this to create a datetime index:
import pandas as pd
import datetime as dt
columns = ['ID', 'Year', 'Month', 'Day', 'Hour', 'Minute']
data = [['1', 2006, 1, 1, 11, 28],
        ['2', 2006, 1, 1, 11, 29]]
df = pd.DataFrame(data=data, columns=columns)
df.index = df.apply(lambda x: dt.datetime(x['Year'], x['Month'], x['Day'], x['Hour'], x['Minute']), axis=1)
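One note on this: a row-wise apply gets slow on large frames. Since pd.to_datetime matches component column names case-insensitively (the previous answer uses all-uppercase names), a vectorized equivalent should work here too, assuming the columns hold valid date components:
# vectorized equivalent of the row-wise apply above
df.index = pd.to_datetime(df[['Year', 'Month', 'Day', 'Hour', 'Minute']])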
I am a Stata user trying to switch to Python, and I am having trouble with some code. I have the following panel data:
id year quarter fecha jobs
1 2007 1 220 10
1 2007 2 221 12
1 2007 3 222 12
1 2007 4 223 12
1 2008 1 224 12
1 2008 2 225 13
1 2008 3 226 14
1 2008 4 227 9
1 2009 1 228 12
1 2009 2 229 15
1 2009 3 230 18
1 2009 4 231 15
1 2010 1 232 15
1 2010 2 233 16
1 2010 3 234 17
1 2010 4 235 18
2 2007 1 220 10
2 2007 2 221 12
2 2007 3 222 12
2 2007 4 223 12
2 2008 1 224 12
2 2008 2 225 13
2 2008 3 226 14
2 2008 4 227 9
2 2009 1 228 12
2 2009 2 229 15
2 2009 3 230 18
2 2009 4 231 15
2 2010 1 232 15
2 2010 2 233 16
2 2010 4 235 18
(My panel data is much bigger than the example; this is just to illustrate my problem.) I want to calculate the variation in jobs relative to the same quarter three years before.
So the result should look like this:
id year quarter fecha jobs jobs_variation
1 2007 1 220 10 Nan
1 2007 2 221 12 Nan
1 2007 3 222 12 Nan
1 2007 4 223 12 Nan
1 2008 1 224 12 Nan
1 2008 2 225 13 Nan
1 2008 3 226 14 Nan
1 2008 4 227 9 Nan
1 2009 1 228 12 Nan
1 2009 2 229 15 Nan
1 2009 3 230 18 Nan
1 2009 4 231 15 Nan
1 2010 1 232 15 0.5
1 2010 2 233 16 0.33
1 2010 3 234 17 0.30769
1 2010 4 235 18 0.5
2 2007 1 220 10 Nan
2 2007 4 223 12 Nan
2 2008 1 224 12 Nan
2 2008 2 225 13 Nan
2 2008 3 226 14 Nan
2 2008 4 227 9 Nan
2 2009 1 228 12 Nan
2 2009 2 229 15 Nan
2 2009 3 230 18 Nan
2 2009 4 231 15 Nan
2 2010 1 232 15 0.5
2 2010 2 233 16 Nan
2 2010 3 234 20 Nan
2 2010 4 235 18 0.5
Note that for the second id, the calculation must not be made for the second and third quarters of 2010, because that id was not present in 2007Q2 and 2007Q3.
In Stata the code would be:
bys id: gen jobs_variation=jobs/jobs[_n-12]-1 if fecha[_n-12]==fecha-12
IIUC, you need a groupby on id and quarter followed by apply:
df['jobs_variation'] = df.groupby(['id', 'quarter']).jobs \
                         .apply(lambda x: x / x.shift(3) - 1)
df
id year quarter fecha jobs jobs_variation
0 1 2007 1 220 10 NaN
1 1 2007 2 221 12 NaN
2 1 2007 3 222 12 NaN
3 1 2007 4 223 12 NaN
4 1 2008 1 224 12 NaN
5 1 2008 2 225 13 NaN
6 1 2008 3 226 14 NaN
7 1 2008 4 227 9 NaN
8 1 2009 1 228 12 NaN
9 1 2009 2 229 15 NaN
10 1 2009 3 230 18 NaN
11 1 2009 4 231 15 NaN
12 1 2010 1 232 15 0.500000
13 1 2010 2 233 16 0.333333
14 1 2010 3 234 17 0.416667
15 1 2010 4 235 18 0.500000
16 2 2007 1 220 10 NaN
17 2 2007 4 223 12 NaN
18 2 2008 1 224 12 NaN
19 2 2008 2 225 13 NaN
20 2 2008 3 226 14 NaN
21 2 2008 4 227 9 NaN
22 2 2009 1 228 12 NaN
23 2 2009 2 229 15 NaN
24 2 2009 3 230 18 NaN
25 2 2009 4 231 15 NaN
26 2 2010 1 232 15 0.500000
27 2 2010 2 233 16 NaN
28 2 2010 3 234 20 NaN
29 2 2010 4 235 18 0.500000
x / x.shift(3) will divide the current year's job count (for that quarter) by the corresponding value from 3 years ago.
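One caveat: shift(3) trusts row positions within each (id, quarter) group, so it matches the Stata code only when the three prior observations for that quarter really are the three prior years. The Stata version also guards with fecha[_n-12] == fecha - 12; a closer translation might be (a sketch, reusing the fecha column the same way):
g = df.groupby(['id', 'quarter'])
ratio = df['jobs'] / g['jobs'].shift(3) - 1
# keep the ratio only when the row 3 positions back is exactly 12 fecha periods earlier
df['jobs_variation'] = ratio.where(g['fecha'].shift(3).eq(df['fecha'] - 12))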
I'd like to reshape a pandas dataframe from wide to long. The challenge is that the columns have multi-indexed headers. The dataframe looks like this:
category price1 price2
year 2011 2012 2013 2011 2012 2013
1 33 22 48 135 144 149
2 22 26 37 136 127 129
3 39 30 47 123 148 148
4 45 42 21 140 126 121
5 20 37 35 141 142 147
6 29 20 34 122 121 132
7 20 35 45 128 123 130
8 39 34 49 125 120 131
9 24 20 36 122 146 130
10 24 37 43 142 133 138
11 23 22 40 124 135 131
12 27 22 40 147 149 132
Below is a snippet that produces the very same dataframe. You will also see that I've built this dataframe by concatenating two other dataframes.
Here's the snippet:
import pandas as pd
import numpy as np
# Make dataframe df1 with 12 observations over 3 years
# with multiindexed column headers
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(20, 50, size = (12,3)), columns=[2011,2012,2013])
df1.index = np.arange(1,len(df1)+1)
colNames1 = df1.columns
header1 = pd.MultiIndex.from_product([['price1'], colNames1], names=['category','year'])
df1.columns = header1
# Make dataframe df2 with 12 observations over 3 years
# with multiindexed column headers
df2 = pd.DataFrame(np.random.randint(120, 150, size = (12,3)), columns=[2011,2012,2013])
df2.index = np.arange(1,len(df2)+1)
colNames1 = df2.columns
header1 = pd.MultiIndex.from_product([['price2'], colNames1], names=['category','year'])
df2.columns = header1
df3 = pd.concat([df1, df2], axis = 1)
And here is the desired output:
price1 price2
1 2011 33 135
2 2011 22 136
3 2011 39 123
4 2011 45 140
5 2011 20 141
6 2011 29 122
7 2011 20 128
8 2011 39 125
9 2011 24 122
10 2011 24 142
11 2011 23 124
12 2011 27 147
1 2012 22 144
2 2012 26 127
3 2012 30 148
4 2012 42 126
5 2012 37 142
6 2012 20 121
7 2012 35 123
8 2012 34 120
9 2012 20 146
10 2012 37 133
11 2012 22 135
12 2012 22 149
1 2013 48 149
2 2013 37 129
3 2013 47 148
4 2013 21 121
5 2013 35 147
6 2013 34 132
7 2013 45 130
8 2013 49 131
9 2013 36 130
10 2013 43 138
11 2013 40 131
12 2013 40 132
I've tried different solutions based on suggestions using reshape and pandas.wide_to_long, but I'm struggling with the multi-indexed column names. So why not just remove them? Mostly because this is what my real-world problem will look like, and also because I refuse to believe it can't be done.
Thank you for any suggestions!
Use stack on the last level with sort_index, then add rename_axis and reset_index to get regular columns:
df3 = (df3.stack()
          .sort_index(level=[1, 0])
          .rename_axis(['months', 'year'])
          .reset_index()
          .rename_axis(None, axis=1))
print (df3.head(15))
months year price1 price2
0 1 2011 33 135
1 2 2011 22 136
2 3 2011 39 123
3 4 2011 45 140
4 5 2011 20 141
5 6 2011 29 122
6 7 2011 20 128
7 8 2011 39 125
8 9 2011 24 122
9 10 2011 24 142
10 11 2011 23 124
11 12 2011 27 147
12 1 2012 22 144
13 2 2012 26 127
14 3 2012 30 148
If you need a MultiIndex:
df3 = df3.stack().sort_index(level=[1,0])
print (df3.head(15))
category price1 price2
year
1 2011 33 135
2 2011 22 136
3 2011 39 123
4 2011 45 140
5 2011 20 141
6 2011 29 122
7 2011 20 128
8 2011 39 125
9 2011 24 122
10 2011 24 142
11 2011 23 124
12 2011 27 147
1 2012 22 144
2 2012 26 127
3 2012 30 148
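The same pipeline reads a bit more explicitly with level names (a sketch; 'month' is just my label for the unnamed row index):
df3_long = (df3.stack('year')                       # move the 'year' column level into the index
               .rename_axis(index=['month', 'year'])
               .sort_index(level=['year', 'month'])
               .reset_index()
               .rename_axis(None, axis=1))          # drop the leftover 'category' columns name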
I am reading an HTML table with pd.read_html, but the result comes back as a list. I want to convert it into a pandas dataframe so I can continue with further operations on it. I am using the following script:
import pandas as pd
import html5lib
data=pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)
and since my results come back as one list, I tried to convert it into a dataframe with
data1 = pd.DataFrame(data)
and the result came as
0
0 0 1 2 3 4...
and because the result is a list, I can't apply functions such as rename, dropna or drop.
I will appreciate any help.
I think you need to add [0] to select the first item of the list, because read_html returns a list of DataFrames. So you can use:
import pandas as pd
data1 = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)[0]
print (data1)
0 1 2 3 4 5 6 7 8 9 \
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
1 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
2 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
3 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
4 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
5 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
6 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
7 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
8 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
9 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
10 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
11 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
12 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
13 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
14 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
15 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
16 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
17 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94
18 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87
19 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85
20 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01
21 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91
22 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87
23 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
24 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
25 21 Max Pacioretty, LW MTL 80 37 30 67 38 32 0.84
26 NaN Logan Couture, C SJ 82 27 40 67 -6 12 0.82
27 23 Jonathan Toews, C CHI 81 28 38 66 30 36 0.81
28 NaN Erik Karlsson, D OTT 82 21 45 66 7 42 0.80
29 NaN Henrik Zetterberg, LW DET 77 17 49 66 -6 32 0.86
30 26 Pavel Datsyuk, C DET 63 26 39 65 12 8 1.03
31 NaN Joe Thornton, C SJ 78 16 49 65 -4 30 0.83
32 28 Nikita Kucherov, RW TB 82 28 36 64 38 37 0.78
33 NaN Patrick Kane, RW CHI 61 27 37 64 10 10 1.05
34 NaN Mark Stone, RW OTT 80 26 38 64 21 14 0.80
35 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
36 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
37 NaN Alexander Steen, LW STL 74 24 40 64 8 33 0.86
38 NaN Kyle Turris, C OTT 82 24 40 64 5 36 0.78
39 NaN Johnny Gaudreau, LW CGY 80 24 40 64 11 14 0.80
40 NaN Anze Kopitar, C LA 79 16 48 64 -2 10 0.81
41 35 Radim Vrbata, RW VAN 79 31 32 63 6 20 0.80
42 NaN Jaden Schwartz, LW STL 75 28 35 63 13 16 0.84
43 NaN Filip Forsberg, C NSH 82 26 37 63 15 24 0.77
44 NaN Jordan Eberle, RW EDM 81 24 39 63 -16 24 0.78
45 NaN Ondrej Palat, LW TB 75 16 47 63 31 24 0.84
46 40 Zach Parise, LW MIN 74 33 29 62 21 41 0.84
10 11 12 13 14 15 16
0 SOG PCT GWG G A G A
1 253 13.8 6 10 13 2 3
2 278 13.7 8 13 18 0 1
3 237 11.8 3 10 21 0 0
4 395 13.4 11 25 9 0 0
5 221 10.0 3 11 22 0 0
6 153 11.8 3 3 30 0 0
7 280 13.2 5 13 16 0 0
8 158 19.6 5 6 10 0 0
9 226 8.9 5 4 21 0 0
10 264 14.0 6 8 10 0 0
11 NaN NaN NaN NaN NaN NaN NaN
12 SOG PCT GWG G A G A
13 182 17.0 3 11 15 0 0
14 279 9.0 4 14 23 0 0
15 101 17.8 0 5 20 0 0
16 268 16.0 6 13 12 0 0
17 203 14.3 6 8 9 0 0
18 202 12.9 0 7 19 2 0
19 261 14.2 5 19 12 0 0
20 212 13.2 4 9 17 0 0
21 191 13.1 6 3 10 0 2
22 304 13.8 8 6 6 4 1
23 NaN NaN NaN NaN NaN NaN NaN
24 SOG PCT GWG G A G A
25 302 12.3 10 7 4 3 2
26 263 10.3 4 6 18 2 0
27 192 14.6 7 6 11 2 1
28 292 7.2 3 6 24 0 0
29 227 7.5 3 4 24 0 0
30 165 15.8 5 8 16 0 0
31 131 12.2 0 4 18 0 0
32 190 14.7 2 2 13 0 0
33 186 14.5 5 6 16 0 0
34 157 16.6 6 5 8 1 0
35 NaN NaN NaN NaN NaN NaN NaN
36 SOG PCT GWG G A G A
37 223 10.8 5 8 16 0 0
38 215 11.2 6 4 12 1 0
39 167 14.4 4 8 13 0 0
40 134 11.9 4 6 18 0 0
41 267 11.6 7 12 11 0 0
42 184 15.2 4 8 8 0 2
43 237 11.0 6 6 13 0 0
44 183 13.1 2 6 15 0 0
45 139 11.5 5 3 8 1 1
46 259 12.7 3 11 5 0 0
If your dataframe ends up with columns indexed as 0, 1, 2, etc. and the headings in the first row (as above), just specify that the column names are in the first row with header=0.
Without this, pandas may see a mix of data types (text in row 1 and numbers in the rest) and cast the column as object rather than, say, int64.
The full line would be:
data1 = pd.read_html(url, skiprows=1, header=0)[0]
[0] is the first table in the list of possible tables.
There are options for handling NA values as well. Check out the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
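A small habit worth adopting: inspect what read_html actually parsed before indexing into it, since the number of tables on a page can change:
tables = pd.read_html(url, skiprows=1, header=0)  # one DataFrame per <table> found
print(len(tables))                                # how many tables were parsed
data1 = tables[0]                                 # then pick the one you need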
I know this is late, but here's a better way...
I noticed that the DataFrames in the list are all part of the same table/dataset you are trying to analyze, so instead of breaking them up and then merging them back together, a better solution is to concatenate the list of DataFrames.
Check out the results of this code:
df = pd.concat(pd.read_html('https://www.espn.com/nhl/stats/player/_/view/goaltending'), axis=1)
output:
df.head(1)
index RK Name POS GP W L OTL GA/G SA GA SV SV% SO TOI PIM SOSA SOS SOS%
0 1 Igor ShesterkinNYR G 53 36 13 4 2.07 1622 106 1516 0.935 6 3070:32 2 28 20 0.714
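Note that axis=1 stitches the sub-tables together side by side as extra columns, which fits this page because the site splits one logical stats table into parallel column blocks; if read_html instead returns row-wise chunks of the same table, the default axis=0 concatenation is the right call.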