I need to create two new pandas columns using logic based on the previous row's values.
I have the following data:
Day Vol Price Income Outgoing
1 499 75
2 3233 90
3 1812 70
4 2407 97
5 3474 82
6 1057 53
7 2031 68
8 304 78
9 1339 62
10 2847 57
11 3767 93
12 1096 83
13 3899 88
14 4090 63
15 3249 52
16 1478 52
17 4926 75
18 1209 52
19 1982 90
20 4499 93
My challenge is to come up with logic where both the Income and Outgoing columns (which are currently empty) get the value of (Vol * Price).
But the Income column should carry this value only when the previous day's Price is lower than the present day's, and the Outgoing column should carry it only when the previous day's Price is higher. The rest of the Income and Outgoing cells should just hold NaNs. If the Price is unchanged, that day's row should be dropped.
The entire logic should start at day (n + 1): the first row is skipped and the logic applies from row 2 onwards.
I have tried using shift, for example:
if sample_data['Price'].shift(1) < sample_data['Price'].shift(2):
    sample_data['Income'] = sample_data['Vol'] * sample_data['Price']
else:
    sample_data['Outgoing'] = sample_data['Vol'] * sample_data['Price']
But it isn't working.
I feel there must be a simpler, more comprehensive way to go about this. Could someone please help?
Update (The final output should look like this):
For day 16, the row is deleted because days 15 and 16 have the same price.
I'd calculate the product and the mask separately, and then update the columns:
In [11]: vol_price = df["Vol"] * df["Price"]
In [12]: incoming = df["Price"].diff() < 0
In [13]: df.loc[incoming, "Income"] = vol_price
In [14]: df.loc[~incoming, "Outgoing"] = vol_price
In [15]: df
Out[15]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 NaN 290970.0
2 3 1812 70 126840.0 NaN
3 4 2407 97 NaN 233479.0
4 5 3474 82 284868.0 NaN
5 6 1057 53 56021.0 NaN
6 7 2031 68 NaN 138108.0
7 8 304 78 NaN 23712.0
8 9 1339 62 83018.0 NaN
9 10 2847 57 162279.0 NaN
10 11 3767 93 NaN 350331.0
11 12 1096 83 90968.0 NaN
12 13 3899 88 NaN 343112.0
13 14 4090 63 257670.0 NaN
14 15 3249 52 168948.0 NaN
15 16 1478 52 NaN 76856.0
16 17 4926 75 NaN 369450.0
17 18 1209 52 62868.0 NaN
18 19 1982 90 NaN 178380.0
19 20 4499 93 NaN 418407.0
Or is it the other way around:
In [21]: incoming = df["Price"].diff() > 0
In [22]: df.loc[incoming, "Income"] = vol_price
In [23]: df.loc[~incoming, "Outgoing"] = vol_price
In [24]: df
Out[24]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 290970.0 NaN
2 3 1812 70 NaN 126840.0
3 4 2407 97 233479.0 NaN
4 5 3474 82 NaN 284868.0
5 6 1057 53 NaN 56021.0
6 7 2031 68 138108.0 NaN
7 8 304 78 23712.0 NaN
8 9 1339 62 NaN 83018.0
9 10 2847 57 NaN 162279.0
10 11 3767 93 350331.0 NaN
11 12 1096 83 NaN 90968.0
12 13 3899 88 343112.0 NaN
13 14 4090 63 NaN 257670.0
14 15 3249 52 NaN 168948.0
15 16 1478 52 NaN 76856.0
16 17 4926 75 369450.0 NaN
17 18 1209 52 NaN 62868.0
18 19 1982 90 178380.0 NaN
19 20 4499 93 418407.0 NaN
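Going by the question's wording (Income when the previous day's price is lower than the present day's), this second variant is the one that matches. If you also need to drop the unchanged-price days and skip the first row, a minimal sketch along these lines should work (assuming df holds the sample data):
In [25]: vol_price = df["Vol"] * df["Price"]
In [26]: diff = df["Price"].diff()  # NaN on the first row, so day 1 gets NaN in both columns
In [27]: df["Income"] = vol_price.where(diff > 0)    # previous day's price was lower
In [28]: df["Outgoing"] = vol_price.where(diff < 0)  # previous day's price was higher
In [29]: df = df[diff != 0]  # drop unchanged-price days (day 16 here); NaN != 0 is True, so day 1 survives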
I have 2 dataframes. For every row, grouped by unique id, I want the sum over the previous 3 rows, where each of those values is multiplied by a weight from the other dataframe.
For example:
dataframe A                          dataframe B
    unique_id  value  out_value         num_values
1           1     45                  0      0.15
2           1     33                  1      0.30
3           1     18                  2      0.18
4           1     26      20.70
5           2     66
6           2     44
7           2     22
8           2     19      28.38
Expected out_value column:
4th row = 18*0.15 + 33*0.30 + 45*0.18 = 2.7 + 9.9 + 8.1 = 20.70
8th row = 22*0.15 + 44*0.30 + 66*0.18 = 3.3 + 13.2 + 11.88 = 28.38
Grouped by unique_id, each value should be calculated from the previous 3 values.
For every qualifying row there will be 3 previous rows available.
import pandas as pd
import numpy as np
df_a = pd.DataFrame({
'uni_id':[1, 1, 1, 1, 2, 2, 2, 2, 152, 152, 152, 152, 152],
'value':[45,33,18,26,66,44,22,19,36,27,45,81,90]
}, index=range(1,14))
df_b = pd.DataFrame({
'num_values':[0.15,0.30,0.18]
})
df_a
###
uni_id value
1 1 45
2 1 33
3 1 18
4 1 26
5 2 66
6 2 44
7 2 22
8 2 19
9 152 36
10 152 27
11 152 45
12 152 81
13 152 90
df_b
###
num_values
0 0.15
1 0.30
2 0.18
# main calculation: for each row, collect the three preceding values (most recent first)
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
arr_b = pd.Series(np.inner(arr, df_b['num_values']), index=df_a.index)  # align with df_a's 1-based index
# filter and clean
mask = df_a.groupby('uni_id').cumcount()+1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id value out_value
1 1 45 NaN
2 1 33 NaN
3 1 18 NaN
4 1 26 20.70
5 2 66 NaN
6 2 44 NaN
7 2 22 NaN
8 2 19 28.38
9 152 36 NaN
10 152 27 NaN
11 152 45 NaN
12 152 81 21.33
13 152 90 30.51
If you want to keep only the non-null rows, filter with query:
df_a.query('out_value.notnull()')
###
uni_id value out_value
4 1 26 20.70
8 2 19 28.38
12 152 81 21.33
13 152 90 30.51
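For what it's worth, here is a sketch of an alternative formulation (my own restructuring, not the code above): the same weighted sum over the previous three rows can be expressed with groupby plus rolling, which keeps every window inside its uni_id group by construction:
# the example applies the weights most-recent-first, so reverse them
# to match rolling's oldest-to-newest window order
w = df_b['num_values'].to_numpy()[::-1]
df_a['out_value'] = (
    df_a.groupby('uni_id')['value']
        .apply(lambda s: s.shift(1).rolling(3).apply(lambda win: (win * w).sum(), raw=True))
        .reset_index(level=0, drop=True)
)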
Grouping by uni_id and Year_Month
Data preparation:
# create a date-range series (27 periods, 5-day frequency), converted to monthly periods
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
rng.integers(10,100, 26)  # discarded draw; it advances the RNG state so the 'value' column below is the second draw
date_range = pd.Series(pd.date_range(start='01.30.2020', periods=27, freq='5D')).dt.to_period('M')
df_a = pd.DataFrame({
'uni_id':[1, 1, 1, 1,1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 152, 152, 152, 152, 152,152, 152, 152, 152, 152],
'Year_Month':date_range,
'value':rng.integers(10,100, 26)
}, index=range(1,27))
df_b = pd.DataFrame({
'num_values':[0.15,0.30,0.18]
})
df_a
###
uni_id Year_Month value
1 1 2020-02 46
2 1 2020-02 84
3 1 2020-02 59
4 1 2020-02 49
5 1 2020-02 50
6 1 2020-02 30
7 1 2020-03 18
8 1 2020-03 59
9 2 2020-03 89
10 2 2020-03 15
11 2 2020-03 87
12 2 2020-03 84
13 2 2020-04 34
14 2 2020-04 66
15 2 2020-04 24
16 2 2020-04 78
17 152 2020-04 73
18 152 2020-04 41
19 152 2020-05 16
20 152 2020-05 97
21 152 2020-05 50
22 152 2020-05 90
23 152 2020-05 71
24 152 2020-05 80
25 152 2020-06 78
26 152 2020-06 27
Processing
# for each row, collect the three preceding values (most recent first)
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
arr_b = pd.Series(np.inner(arr, df_b['num_values']), index=df_a.index)  # align with df_a's 1-based index
# filter and clean
mask = df_a.groupby(['uni_id','Year_Month']).cumcount()+1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id Year_Month value out_value
1 1 2020-02 46 NaN
2 1 2020-02 84 NaN
3 1 2020-02 59 NaN
4 1 2020-02 49 42.33
5 1 2020-02 50 40.17
6 1 2020-02 30 32.82
7 1 2020-03 18 NaN
8 1 2020-03 59 NaN
9 2 2020-03 89 NaN
10 2 2020-03 15 NaN
11 2 2020-03 87 NaN
12 2 2020-03 84 33.57
13 2 2020-04 34 NaN
14 2 2020-04 66 NaN
15 2 2020-04 24 NaN
16 2 2020-04 78 29.52
17 152 2020-04 73 NaN
18 152 2020-04 41 NaN
19 152 2020-05 16 NaN
20 152 2020-05 97 NaN
21 152 2020-05 50 NaN
22 152 2020-05 90 39.48
23 152 2020-05 71 45.96
24 152 2020-05 80 46.65
25 152 2020-06 78 NaN
26 152 2020-06 27 NaN
I have a datamatrix of five columns:
0 1 2 3 4
nan 34 23 34 11
43 34 123 4 44
45 12 4 nan 66
89 78 43 435 23
nan 89 nan 12 687
6 232 34 4 nan
24 56 34 121 56
nan 9 nan 54 12
24 nan 54 12 nan
76 11 123 76 78
43 nan 65 23 89
68 233 34 nan 89
65 53 nan 7 78
34 65 12 8 12
56 98 43 nan 43
I also have an fvector:
fvector
23
67
23
nan
nan
87
323
nan
78
32
78
112
nan
56
nan
56
So far I have only been able to find the correlation based on the full column:
from scipy.stats import spearmanr

for i in datamatrix:
    coef, p = spearmanr(datamatrix[i], fvector)
    print(coef, p, "for column", i)
I want to achieve 2 things:
1) I want to find the Spearman correlation between fvector and each column of datamatrix, but if either of the two paired values (or both) is NaN, I want to drop that pair from the calculation.
For example, the 4th value in column 1 is 78 and the 4th value in fvector is NaN, so I want to exclude that particular pair (not the whole column) from the correlation. I have no idea how to work with specific pairs of values when computing a correlation.
2) If the total number of NaN values in fvector and a datamatrix column is > 30%, exclude the whole column from the correlation.
Any resource or reference would be helpful.
Thanks
1) If you set nan_policy="omit", the NaNs will be ignored in the calculation. See scipy.stats.spearmanr.
2) You can compute the percentage of NaNs in each column this way: (df[i].isna().sum()*100)/df.shape[0]
All together:
from scipy import stats

nan_fvectr = int(vector.isna().sum())
for i in df:
    if ((df[i].isna().sum() + nan_fvectr) * 100) / (df.shape[0] * 2) >= 30:
        continue
    coef, p = stats.spearmanr(df[i], vector, nan_policy="omit")
    print(coef, p, "for column", i)
I have two columns in a dataframe containing more than 1000 rows. Column A can take the values X, Y, or None. Column B contains random numbers from 50 to 100.
Every time there is a non-None occurrence in Column A, it is considered occurrence4; the previous non-None occurrence in Column A is occurrence3, the one before that is occurrence2, and the one before that is occurrence1. I want to find the minimum value of Column B between occurrence4 and occurrence3 and check whether it is greater than the minimum value of Column B between occurrence2 and occurrence1. The result can be stored in a new column in the dataframe as "YES" or "NO".
SAMPLE INPUT
ROWNUM A B
1 None 68
2 None 83
3 X 51
4 None 66
5 None 90
6 Y 81
7 None 81
8 None 100
9 None 83
10 None 78
11 X 68
12 None 53
13 None 83
14 Y 68
15 None 94
16 None 50
17 None 71
18 None 71
19 None 52
20 None 67
21 None 82
22 X 76
23 None 66
24 None 92
For example, I need to find the minimum value of Column B between ROWNUM 14 and ROWNUM 11 and check whether it is GREATER THAN the minimum value of Column B between ROWNUM 6 and ROWNUM 3. Next, I need to find the minimum value between ROWNUM 22 and ROWNUM 14 and check whether it is GREATER THAN the minimum value between ROWNUM 11 and ROWNUM 6, and so on.
EDIT:
In the sample data, we start our calculation from row 14, since that is where we have the fourth non-None occurrence in column A. The minimum value between rows 14 and 11 is 53. The minimum value between rows 6 and 3 is 51. Since 53 > 51, the minimum of column B between occurrence 4 and occurrence 3 is GREATER THAN the minimum between occurrence 2 and occurrence 1, so the output at row 14 would be "YES" or 1.
Next, at row 22, the minimum value between rows 22 and 14 is 50. The minimum value between rows 11 and 6 is 68. Since 50 < 68, the minimum between occurrence 4 and occurrence 3 is NOT GREATER THAN the minimum between occurrence 2 and occurrence 1, so the output at row 22 would be "NO" or 0.
I have the following code.
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 0]]*100, columns=list('AB'), index=range(1, 101))
df.loc[[3, 6, 11, 14, 22, 26, 38, 51, 64, 69, 78, 90, 98], 'A'] = 1
df['B'] = np.random.randint(50, 100, size=len(df))
df['result'] = df.index[df['A'] != 0].to_series().rolling(4).apply(
lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)
print(df)
This code works when column A has inputs [0, 1], but I need code where column A contains [None, X, Y]. Also, this code produces output as [0, 1]; I need output as [YES, NO] instead.
I read your sample data as follows:
df = pd.read_fwf('input.txt', widths=[7, 6, 3], na_values=['None'])
Note na_values=['None'], which ensures that the string None in the input
is read as NaN.
This way the DataFrame is:
ROWNUM A B
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76
22 23 NaN 66
23 24 NaN 92
The code to do your task is:
res = df.index[df.A.notnull()].to_series().rolling(4).apply(
lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)\
.dropna().map(lambda x: 'YES' if x > 0 else 'NO').rename('Result')
df = df.join(res)
df.Result.fillna('', inplace=True)
As you can see, it is in part a slight change of your code, with some
additions.
The result is:
ROWNUM A B Result
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69 YES
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76 NO
22 23 NaN 66
23 24 NaN 92
The advantages of my solution over the other are that:
the content is either YES or NO, just as you want,
this content shows up only for non-null values in the A column,
"ignoring" the first 3, which don't have enough "predecessors".
Here's my approach:
def is_incr(x):
    return x[:2].min() > x[2:].min()
# replace with s = df['A'] == 'None' if needed
s = df['A'].isna()
df['new_col'] = df.loc[s, 'B'].rolling(4).apply(is_incr)  # rolling window of 4 over B, restricted to rows where A is missing
Output:
ROWNUM A B new_col
0 1 NaN 68 NaN
1 2 NaN 83 NaN
2 3 X 51 NaN
3 4 NaN 66 NaN
4 5 NaN 90 1.0
5 6 Y 81 NaN
6 7 NaN 81 0.0
7 8 NaN 100 0.0
8 9 NaN 83 0.0
9 10 NaN 78 1.0
10 11 X 68 NaN
11 12 NaN 53 1.0
12 13 NaN 83 1.0
13 14 Y 68 NaN
14 15 NaN 94 0.0
15 16 NaN 50 1.0
16 17 NaN 71 1.0
17 18 NaN 71 0.0
18 19 NaN 52 0.0
19 20 NaN 67 1.0
20 21 NaN 82 0.0
21 22 X 76 NaN
22 23 NaN 66 0.0
23 24 NaN 92 1.0
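If the YES/NO labels from the question are needed, one way is to map the numeric result afterwards; NaNs stay NaN since they have no entry in the dict:
df['new_col'] = df['new_col'].map({1.0: 'YES', 0.0: 'NO'})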
The Scenario:
I have 2 dataframes, fc0 and yc0, where fc0 is a cluster and yc0 is another dataframe that needs to be merged into fc0.
The nature of the data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and a few hundred values in yc0. Now I need yc0 to go into fc0.
In my haste to resolve it, I even tried yc0.reset_index(inplace=True), but it wasn't really helpful.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1: Tried this, but ended up inserting NaN values for the first 16 columns, with the rest of the data shifted by that many columns.
Link2: Couldn't match the column keys; besides, I tried it for rows.
Link3: Merging doesn't match the columns.
Link4: Concatenation doesn't work that way.
Link5: Same issues with join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MCVE. Does this small sample data show the functionality that you are expecting?
df1 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
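Applied to the question's frames, the same idea would look roughly like this (assuming yc0's iid column, which does not appear in the expected output, should be dropped first; errors='ignore' covers the case where it is already absent, as yc0.info() suggests):
fc0 = pd.concat([fc0, yc0.drop(columns='iid', errors='ignore')], ignore_index=True, sort=False)
Columns present in only one frame (3, 4, and so on) are filled with NaN, exactly as in the small example above.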
I am reading an HTML table with pd.read_html, but the result comes back as a list. I want to convert it into a pandas DataFrame so I can continue further operations on it. I am using the following script:
import pandas as pd
import html5lib
data=pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)
and since my results are coming as one list, I tried to convert it into a dataframe with
data1 = pd.DataFrame(data)
and the result came as
0
0 0 1 2 3 4...
and because the result is a list, I can't apply any functions such as rename, dropna, or drop.
I will appreciate any help.
I think you need to add [0] to select the first item of the list, because read_html returns a list of DataFrames.
So you can use:
import pandas as pd
data1 = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)[0]
print (data1)
0 1 2 3 4 5 6 7 8 9 \
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
1 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
2 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
3 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
4 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
5 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
6 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
7 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
8 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
9 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
10 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
11 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
12 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
13 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
14 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
15 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
16 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
17 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94
18 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87
19 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85
20 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01
21 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91
22 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87
23 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
24 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
25 21 Max Pacioretty, LW MTL 80 37 30 67 38 32 0.84
26 NaN Logan Couture, C SJ 82 27 40 67 -6 12 0.82
27 23 Jonathan Toews, C CHI 81 28 38 66 30 36 0.81
28 NaN Erik Karlsson, D OTT 82 21 45 66 7 42 0.80
29 NaN Henrik Zetterberg, LW DET 77 17 49 66 -6 32 0.86
30 26 Pavel Datsyuk, C DET 63 26 39 65 12 8 1.03
31 NaN Joe Thornton, C SJ 78 16 49 65 -4 30 0.83
32 28 Nikita Kucherov, RW TB 82 28 36 64 38 37 0.78
33 NaN Patrick Kane, RW CHI 61 27 37 64 10 10 1.05
34 NaN Mark Stone, RW OTT 80 26 38 64 21 14 0.80
35 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
36 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
37 NaN Alexander Steen, LW STL 74 24 40 64 8 33 0.86
38 NaN Kyle Turris, C OTT 82 24 40 64 5 36 0.78
39 NaN Johnny Gaudreau, LW CGY 80 24 40 64 11 14 0.80
40 NaN Anze Kopitar, C LA 79 16 48 64 -2 10 0.81
41 35 Radim Vrbata, RW VAN 79 31 32 63 6 20 0.80
42 NaN Jaden Schwartz, LW STL 75 28 35 63 13 16 0.84
43 NaN Filip Forsberg, C NSH 82 26 37 63 15 24 0.77
44 NaN Jordan Eberle, RW EDM 81 24 39 63 -16 24 0.78
45 NaN Ondrej Palat, LW TB 75 16 47 63 31 24 0.84
46 40 Zach Parise, LW MIN 74 33 29 62 21 41 0.84
10 11 12 13 14 15 16
0 SOG PCT GWG G A G A
1 253 13.8 6 10 13 2 3
2 278 13.7 8 13 18 0 1
3 237 11.8 3 10 21 0 0
4 395 13.4 11 25 9 0 0
5 221 10.0 3 11 22 0 0
6 153 11.8 3 3 30 0 0
7 280 13.2 5 13 16 0 0
8 158 19.6 5 6 10 0 0
9 226 8.9 5 4 21 0 0
10 264 14.0 6 8 10 0 0
11 NaN NaN NaN NaN NaN NaN NaN
12 SOG PCT GWG G A G A
13 182 17.0 3 11 15 0 0
14 279 9.0 4 14 23 0 0
15 101 17.8 0 5 20 0 0
16 268 16.0 6 13 12 0 0
17 203 14.3 6 8 9 0 0
18 202 12.9 0 7 19 2 0
19 261 14.2 5 19 12 0 0
20 212 13.2 4 9 17 0 0
21 191 13.1 6 3 10 0 2
22 304 13.8 8 6 6 4 1
23 NaN NaN NaN NaN NaN NaN NaN
24 SOG PCT GWG G A G A
25 302 12.3 10 7 4 3 2
26 263 10.3 4 6 18 2 0
27 192 14.6 7 6 11 2 1
28 292 7.2 3 6 24 0 0
29 227 7.5 3 4 24 0 0
30 165 15.8 5 8 16 0 0
31 131 12.2 0 4 18 0 0
32 190 14.7 2 2 13 0 0
33 186 14.5 5 6 16 0 0
34 157 16.6 6 5 8 1 0
35 NaN NaN NaN NaN NaN NaN NaN
36 SOG PCT GWG G A G A
37 223 10.8 5 8 16 0 0
38 215 11.2 6 4 12 1 0
39 167 14.4 4 8 13 0 0
40 134 11.9 4 6 18 0 0
41 267 11.6 7 12 11 0 0
42 184 15.2 4 8 8 0 2
43 237 11.0 6 6 13 0 0
44 183 13.1 2 6 15 0 0
45 139 11.5 5 3 8 1 1
46 259 12.7 3 11 5 0 0
If your dataframe ends up with columns indexed as 0, 1, 2, etc. and the headings in the first row (as above), just specify that the column names are in the first row with header=0.
Without this, pandas may see a mix of data types (text in row 1 and numbers in the rest) and cast the column as object rather than, say, int64.
Full line would be:
data1 = pd.read_html(url, skiprows=1, header=0)[0]
[0] is the first table in the list of possible tables.
There are options for handling NA values as well. Check out the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
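For instance, a hedged sketch (the na_values list here is illustrative, not taken from that page):
data1 = pd.read_html(url, skiprows=1, header=0, na_values=['--'])[0]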
I know this is late, but here's a better way...
I noticed that the DataFrames in the list are all part of the same table/dataset you are trying to analyze, so instead of breaking them up and then merging them together, a better solution is to concat the list of DataFrames.
Check out the results of this code:
df = pd.concat(pd.read_html('https://www.espn.com/nhl/stats/player/_/view/goaltending'),axis=1)
output:
df.head(1)
index RK Name POS GP W L OTL GA/G SA GA SV SV% SO TOI PIM SOSA SOS SOS%
0 1 Igor ShesterkinNYR G 53 36 13 4 2.07 1622 106 1516 0.935 6 3070:32 2 28 20 0.714