Sort outer multi-index - python

I want to sort a dataframe highest to lowest based on column B. I can't find an answer on how to sort the outer (i.e. first) index column.
I have this example data:
A B
Item Type
0 X 'rtr' 2
Tier 'sfg' 104
1 X 'zad' 7
Tier 'asd' 132
2 X 'frs' 4
Tier 'plg' 140
3 X 'gfq' 9
Tier 'bcd' 100
Each multi-index row contains a "Tier" row. I want to sort the outer index "Item" based on the "B" column value relating to each "Tier". The "A" column can be ignored for sorting purposes but needs to be included in the dataframe.
A B
Item Type
2 X 'frs' 4
Tier 'plg' 140
1 X 'zad' 7
Tier 'asd' 132
0 X 'rtr' 2
Tier 'sfg' 104
3 X 'gfq' 9
Tier 'bcd' 100

New Response #2
Based on all the inputs received, here's the solution. Hope this works for you.
import pandas as pd
df = pd.read_csv("xyz.txt")
df1 = df.copy()
#capture the original index of each row. This will be used for sorting later
df1['idx'] = df1.index
#create a dataframe with only items that match 'Tier'
#assumption is each Index has a row with 'Tier'
tier = df1.loc[df1['Type']=='Tier']
#sort Total for only the Tier rows
tier = tier.sort_values('Total')
#Create a list of the indexes in sorted order
#this will be the order to print the rows
tier_list = tier['Index'].tolist()
# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(tier_list, range(len(tier_list))))
# Generate a rank column that will be used to sort the dataframe numerically
df1['Tier_Rank'] = df1['Index'].map(sorterIndex)
#Now sort the dataframe based on rank column and original index
df1.sort_values(['Tier_Rank','idx'],ascending = [True, True],inplace = True)
#drop the temporary column we created
df1.drop(['Tier_Rank','idx'], 1, inplace = True)
#print the dataframe
print (df1)
Based on the source data, here's the final output. Let me know if this is in line with what you were looking for.
Index Type Id ... Intellect Strength Total
12 2 Chest Armor "6917529202229928161" ... 17 8 62
13 2 Gauntlets "6917529202229927889" ... 16 14 60
14 2 Helmet "6917529202223945870" ... 10 9 66
15 2 Leg Armor "6917529202802011569" ... 15 2 61
16 2 Set NaN ... 58 33 249
17 2 Tier NaN ... 5 3 22
24 4 Chest Armor "6917529202229928161" ... 17 8 62
25 4 Gauntlets "6917529202802009244" ... 7 9 63
26 4 Helmet "6917529202223945870" ... 10 9 66
27 4 Leg Armor "6917529202802011569" ... 15 2 61
28 4 Set NaN ... 49 28 252
29 4 Tier NaN ... 4 2 22
42 7 Chest Armor "6917529202229928161" ... 17 8 62
43 7 Gauntlets "6917529202791088503" ... 7 14 61
44 7 Helmet "6917529202223945870" ... 10 9 66
45 7 Leg Armor "6917529202229923870" ... 7 19 57
46 7 Set NaN ... 41 50 246
47 7 Tier NaN ... 4 5 22
0 0 Chest Armor "6917529202229928161" ... 17 8 62
1 0 Gauntlets "6917529202778947311" ... 10 15 62
2 0 Helmet "6917529202223945870" ... 10 9 66
3 0 Leg Armor "6917529202802011569" ... 15 2 61
4 0 Set NaN ... 52 34 251
5 0 Tier NaN ... 5 3 23
6 1 Chest Armor "6917529202229928161" ... 17 8 62
7 1 Gauntlets "6917529202778947311" ... 10 15 62
8 1 Helmet "6917529202223945870" ... 10 9 66
9 1 Leg Armor "6917529202229923870" ... 7 19 57
10 1 Set NaN ... 44 51 247
11 1 Tier NaN ... 4 5 23
18 3 Chest Armor "6917529202229928161" ... 17 8 62
19 3 Gauntlets "6917529202229927889" ... 16 14 60
20 3 Helmet "6917529202223945870" ... 10 9 66
21 3 Leg Armor "6917529202229923870" ... 7 19 57
22 3 Set NaN ... 50 50 245
23 3 Tier NaN ... 5 5 23
30 5 Chest Armor "6917529202229928161" ... 17 8 62
31 5 Gauntlets "6917529202802009244" ... 7 9 63
32 5 Helmet "6917529202223945870" ... 10 9 66
33 5 Leg Armor "6917529202229923870" ... 7 19 57
34 5 Set NaN ... 41 45 248
35 5 Tier NaN ... 4 4 23
36 6 Chest Armor "6917529202229928161" ... 17 8 62
37 6 Gauntlets "6917529202791088503" ... 7 14 61
38 6 Helmet "6917529202223945870" ... 10 9 66
39 6 Leg Armor "6917529202802011569" ... 15 2 61
40 6 Set NaN ... 49 33 250
41 6 Tier NaN ... 4 3 23
[48 rows x 11 columns]
New Response:
Based on the source data file shared, here's the group by and sort. Let me know how you want the values to be sorted. I have assumed that you want it sorted by Index, then Total.
df = df.groupby(['Index','Type',])\
.agg({'Total':'mean'})\
.sort_values(['Index','Total'])
The output of this will be as follows:
Total
Index Type
0 Tier 23
Leg Armor 61
Chest Armor 62
Gauntlets 62
Helmet 66
Set 251
1 Tier 23
Leg Armor 57
Chest Armor 62
Gauntlets 62
Helmet 66
Set 247
2 Tier 22
Gauntlets 60
Leg Armor 61
Chest Armor 62
Helmet 66
Set 249
3 Tier 23
Leg Armor 57
Gauntlets 60
Chest Armor 62
Helmet 66
Set 245
4 Tier 22
Leg Armor 61
Chest Armor 62
Gauntlets 63
Helmet 66
Set 252
Initial Response:
I dont have your raw data. Created some data to show you how sorting would work on groupby data. See if this is what you are looking for.
import pandas as pd
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon','Parrot', 'Parrot'],
'Type':['Wild', 'Captive', 'Wild', 'Captive'],
'Air': ['Good','Bad', 'Bad', 'Good'],
'Max Speed': [380., 370., 24., 26.]})
df = df.groupby(['Animal','Type','Air'])\
.agg({'Max Speed':'mean'})\
.sort_values('Max Speed')
print(df)
The output will be as follows:
Max Speed
Animal Type Air
Parrot Wild Bad 24.0
Captive Good 26.0
Falcon Captive Bad 370.0
Wild Good 380.0
Without the sort command, the output will be a bit different.
df = df.groupby(['Animal','Type','Air'])\
.agg({'Max Speed':'mean'})
This will result in below. The Max Speed is not sorted. Instead it is using the group by sort of Animal then Type:
Max Speed
Animal Type Air
Falcon Captive Bad 370.0
Wild Good 380.0
Parrot Captive Good 26.0
Wild Bad 24.0

Related

Pandas: Joining two Dataframes based on two criteria matches

Hi have the following Dataframe that contains sends and open totals df_send_open:
date user_id name send open
0 2022-03-31 35 sally 50 20
1 2022-03-31 47 bob 100 55
2 2022-03-31 01 john 500 102
3 2022-03-31 45 greg 47 20
4 2022-03-30 232 william 60 57
5 2022-03-30 147 mary 555 401
6 2022-03-30 35 sally 20 5
7 2022-03-29 41 keith 65 55
8 2022-03-29 147 mary 100 92
My other Dataframe contains calls and cancelled totals df_call_cancel:
date user_id name call cancel
0 2022-03-31 21 percy 54 21
1 2022-03-31 47 bob 150 21
2 2022-03-31 01 john 100 97
3 2022-03-31 45 greg 101 13
4 2022-03-30 232 william 61 55
5 2022-03-30 147 mary 5 3
6 2022-03-30 35 sally 13 5
7 2022-03-29 41 keith 14 7
8 2022-03-29 147 mary 102 90
Like a VLOOKUP in excel, i want to add the additional columns from df_call_cancel to df_send_open, however I need to do it on the unique combination of BOTH date and user_id and this is where i'm tripping up.
I have two desired Dataframes outcomes (not sure which to go forward with so thought i'd ask for both solutions):
Desired Dataframe 1:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-30 232 william 60 57 61 55
5 2022-03-30 147 mary 555 401 5 3
6 2022-03-30 35 sally 20 5 13 5
7 2022-03-29 41 keith 65 55 14 7
8 2022-03-29 147 mary 100 92 102 90
Dataframe 1 only joins the call and cancel columns if the combination of date and user_id exists in df_send_open as this is the primary dataframe.
Desired Dataframe 2:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-31 21 percy 0 0 54 21
5 2022-03-30 232 william 60 57 61 55
6 2022-03-30 147 mary 555 401 5 3
7 2022-03-30 35 sally 20 5 13 5
8 2022-03-29 41 keith 65 55 14 7
9 2022-03-29 147 mary 100 92 102 90
Dataframe 2 will do the same as df1 but will also add any new date and user combinations in df_call_cancel that isn't in df_send_open (see percy).
Many thanks.
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=['date', 'user_id'])
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=['date', 'user_id']).fillna(0)
This should work for your 2 cases, one left and one outer join.

how to split an integer value from one column to two columns in text file using pandas or numpy (python)

I have a text file which has a number of integer values like this.
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to make a file by merging several files like this but you guys can see a problem with this data.
In 4 and 5 lines, the first values, 1017 and 1106, right next to period index make a problem.
When I try to read these two lines, I always have had this result.
It came out that first values in first column next to index columns couldn't recognized as first values themselves.
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to make it correct with indexing but failed.
The desirable result is,
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem?
+
I used this code to read this file.
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
A better fit for this would be pandas.read_fwf. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.
One possibility would be to manually construct the dataframe, this way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd
def read_file(f_name):
data = []
with open(f_name) as f:
for line in f.readlines():
idx1 = line[0:8]
idx2 = line[10:18]
points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
data.append([idx1, idx2, *points])
return pd.DataFrame(data).set_index([0, 1])
It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
num_str = str(fw.iat[i,1])
a,b = map(int,[num_str[:-4],num_str[-4:]])
fw.iloc[i,3:] = fw.iloc[i,2:-1]
fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])
The result of print(fw) from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True) for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0

Simulating a football league in python

I could use some help on this. I want to simulate a football league in python for an arbitrary number of teams and tally the points over a season in a table. The rules are simple:
Every team in the league plays each other twice. so each team plays 2*(Nteams_in_league -1)
Teams have a 50% chance of a winning.
There are only two possible outcomes, win or lose.
A win gets 3 points, and a loss gets a team 0 points.
Here's an example of the output I'm looking for with a league of 8 teams over 11 seasons. It's based off an attempt I made but isn't completely correct because it's not allocating point across the winner and loser correctly.
columns = season,
rows = team,
observations are the points tally.
1
2
3
4
5
6
7
8
9
10
11
1
57
51
66
54
60
51
57
54
45
72
2
51
51
42
51
66
60
63
60
81
63
3
51
69
51
48
36
48
57
54
48
60
4
54
57
66
54
75
60
60
66
69
42
5
72
57
63
57
60
54
48
66
54
42
6
54
45
54
45
60
57
51
60
66
51
7
51
63
72
63
63
54
60
63
54
66
8
66
57
42
57
51
57
51
75
72
60
Here is one approach. This simulates each season independently. For each season and pair of teams, we simulate two outcomes for two games, assuming each team has a 50% chance at victory.
import numpy as np
import pandas as pd
from itertools import combinations
def simulate_naive(n_teams):
'Simulate a single season'
scores = np.zeros(n_teams, dtype=int)
for i, j in combinations(range(n_teams), 2):
# each pair of teams play twice, each time with 50/50 chance of
# either team winning; the winning team gets three points
scores[i if np.random.rand() < 0.5 else j] += 3
scores[i if np.random.rand() < 0.5 else j] += 3
return scores
n_teams = 8
n_seasons = 10
df = pd.DataFrame({season: simulate_naive(n_teams) for season in range(n_seasons)})
print(df)
# 0 1 2 3 4 5 6 7 8 9
# 0 15 30 21 12 24 24 9 21 18 33
# 1 21 18 24 24 15 21 12 30 18 21
# 2 21 27 21 18 21 27 27 15 12 24
# 3 27 12 9 36 18 12 30 15 24 21
# 4 24 24 27 24 18 18 33 18 30 15
# 5 18 15 21 15 15 27 15 24 24 15
# 6 18 18 30 21 33 21 24 27 18 21
# 7 24 24 15 18 24 18 18 18 24 18
I wonder if there is a nicer statistical approach that avoids simulating each game.

converting an HTML table in Pandas Dataframe

I am reading an HTML table with pd.read_html but the result is coming in a list, I want to convert it inot a pandas dataframe, so I can continue further operations on the same. I am using the following script
import pandas as pd
import html5lib
data=pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)
and since My results are coming as 1 list, I tried to convert it into a data frame with
data1=pd.DataFrame(Data)
and result came as
0
0 0 1 2 3 4...
and because of result as a list, I can't apply any functions such as rename, dropna, drop.
I will appreciate every help
I think you need add [0] if need select first item of list, because read_html return list of DataFrames:
So you can use:
import pandas as pd
data1 = pd.read_html('http://www.espn.com/nhl/statis‌​tics/player/‌​_/stat/point‌​s/sort/point‌​s/year/2015&‌​#47;seasontype/2‌​',skiprows=1)[0]
print (data1)
0 1 2 3 4 5 6 7 8 9 \
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
1 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
2 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
3 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
4 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
5 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
6 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
7 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
8 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
9 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
10 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
11 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
12 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
13 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
14 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
15 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
16 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
17 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94
18 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87
19 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85
20 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01
21 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91
22 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87
23 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
24 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
25 21 Max Pacioretty, LW MTL 80 37 30 67 38 32 0.84
26 NaN Logan Couture, C SJ 82 27 40 67 -6 12 0.82
27 23 Jonathan Toews, C CHI 81 28 38 66 30 36 0.81
28 NaN Erik Karlsson, D OTT 82 21 45 66 7 42 0.80
29 NaN Henrik Zetterberg, LW DET 77 17 49 66 -6 32 0.86
30 26 Pavel Datsyuk, C DET 63 26 39 65 12 8 1.03
31 NaN Joe Thornton, C SJ 78 16 49 65 -4 30 0.83
32 28 Nikita Kucherov, RW TB 82 28 36 64 38 37 0.78
33 NaN Patrick Kane, RW CHI 61 27 37 64 10 10 1.05
34 NaN Mark Stone, RW OTT 80 26 38 64 21 14 0.80
35 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
36 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
37 NaN Alexander Steen, LW STL 74 24 40 64 8 33 0.86
38 NaN Kyle Turris, C OTT 82 24 40 64 5 36 0.78
39 NaN Johnny Gaudreau, LW CGY 80 24 40 64 11 14 0.80
40 NaN Anze Kopitar, C LA 79 16 48 64 -2 10 0.81
41 35 Radim Vrbata, RW VAN 79 31 32 63 6 20 0.80
42 NaN Jaden Schwartz, LW STL 75 28 35 63 13 16 0.84
43 NaN Filip Forsberg, C NSH 82 26 37 63 15 24 0.77
44 NaN Jordan Eberle, RW EDM 81 24 39 63 -16 24 0.78
45 NaN Ondrej Palat, LW TB 75 16 47 63 31 24 0.84
46 40 Zach Parise, LW MIN 74 33 29 62 21 41 0.84
10 11 12 13 14 15 16
0 SOG PCT GWG G A G A
1 253 13.8 6 10 13 2 3
2 278 13.7 8 13 18 0 1
3 237 11.8 3 10 21 0 0
4 395 13.4 11 25 9 0 0
5 221 10.0 3 11 22 0 0
6 153 11.8 3 3 30 0 0
7 280 13.2 5 13 16 0 0
8 158 19.6 5 6 10 0 0
9 226 8.9 5 4 21 0 0
10 264 14.0 6 8 10 0 0
11 NaN NaN NaN NaN NaN NaN NaN
12 SOG PCT GWG G A G A
13 182 17.0 3 11 15 0 0
14 279 9.0 4 14 23 0 0
15 101 17.8 0 5 20 0 0
16 268 16.0 6 13 12 0 0
17 203 14.3 6 8 9 0 0
18 202 12.9 0 7 19 2 0
19 261 14.2 5 19 12 0 0
20 212 13.2 4 9 17 0 0
21 191 13.1 6 3 10 0 2
22 304 13.8 8 6 6 4 1
23 NaN NaN NaN NaN NaN NaN NaN
24 SOG PCT GWG G A G A
25 302 12.3 10 7 4 3 2
26 263 10.3 4 6 18 2 0
27 192 14.6 7 6 11 2 1
28 292 7.2 3 6 24 0 0
29 227 7.5 3 4 24 0 0
30 165 15.8 5 8 16 0 0
31 131 12.2 0 4 18 0 0
32 190 14.7 2 2 13 0 0
33 186 14.5 5 6 16 0 0
34 157 16.6 6 5 8 1 0
35 NaN NaN NaN NaN NaN NaN NaN
36 SOG PCT GWG G A G A
37 223 10.8 5 8 16 0 0
38 215 11.2 6 4 12 1 0
39 167 14.4 4 8 13 0 0
40 134 11.9 4 6 18 0 0
41 267 11.6 7 12 11 0 0
42 184 15.2 4 8 8 0 2
43 237 11.0 6 6 13 0 0
44 183 13.1 2 6 15 0 0
45 139 11.5 5 3 8 1 1
46 259 12.7 3 11 5 0 0
If your dataframe ends up with columns indexed as 0,1,2 etc and the headings in the first row, (as above) just specify that the column names are in the first row with header=0
Without this, pandas may see a mix of data types - text in row 1 and numbers in the rest and cast the column as object rather than, say, int64.
Full line would be:
data1 = pd.read_html(url, skiprows=1, header=0)[0]
[0] is the first table in the list of possible tables.
There are options for handling NA values as well. Check out the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
I know this is late, but here's a better way...
I noticed that the DataFrames in the list are all part of the same table/dataset you are trying to analyze, so instead of breaking them up and then merging them together, a better solution is to contact the list of DataFrames.
Check out the results of this code:
df = pd.concat(pd.read_html('https://www.espn.com/nhl/stats/player/_/view/goaltending'),axis=1)
output:
df.head(1)
index RK Name POS GP W L OTL GA/G SA GA SV SV% SO TOI PIM SOSA SOS SOS%
0 1 Igor ShesterkinNYR G 53 36 13 4 2.07 1622 106 1516 0.935 6 3070:32 2 28 20 0.714

performing differences between rows in pandas based on columns values

I have this dataframe, I'm trying to create a new column where I want to store the difference of products sold based on code and date.
for example this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1= full_df1.sort(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
that gives me back this output for a single date 20150609 :
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
is there a better solution to have the same result in a more pythonic way?
I would like to create a new column [difference] and store the data there having as result 4 columns [date, code, sold, difference]
This exactly the kind of thing that panda's groupby functionality is built for, and I highly recommend reading and working through this documentation: panda's groupby documentation
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5

Categories

Resources