Building a summary table from a pandas DataFrame - Python

Example df:

id  home  away  winner  loser  winner_score  loser_score
 0     A     B       A      B            20           10
 1     C     D       D      C            20            5
 2     A     D       D      A            30            0
My goal is to build a win/loss/for/against sort of table. End result:

Team  W  L  Total for  Total against
A     1  1         20             40
B     0  1         10             20
C     0  1          5             20
D     2  0         50              5
I can use groupby('winner/loser') to get the wins and losses, but I am unable to group by properly to get the summed scores (and the mean, but that's pretty similar). My main issue is that when a team has never won a match, I end up with NaNs.
My method now is along the lines of:
by_winner = df.groupby('winner').sum()
by_loser = df.groupby('loser').sum()
overall_table['score_for'] = by_winner.score + by_loser.score
I'm also not sure how to even phrase the question, but I would like to be able to run these stats from a concept of one line = one match. I don't know how to group by the winner and loser at the same time, so that I get summed results for all teams at once.

Let's try:

# .sum(level=1), used in older pandas, was removed in 2.0; grouping on the index level is equivalent
counts = pd.get_dummies(df[['winner', 'loser']].stack()).groupby(level=1).sum().T
winner_score = (df.groupby('winner')[['winner_score', 'loser_score']].sum()
                  .rename(columns={'winner_score': 'for', 'loser_score': 'against'}))
loser_score = (df.groupby('loser')[['winner_score', 'loser_score']].sum()
                 .rename(columns={'winner_score': 'against', 'loser_score': 'for'}))
pd.concat((counts, winner_score.add(loser_score, fill_value=0)), axis=1)
Output:

   winner  loser  against   for
A       1      1     40.0  20.0
B       0      1     20.0  10.0
C       0      1     20.0   5.0
D       2      0      5.0  50.0
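For modern pandas, the same numbers can also be produced in one pass by reshaping to one row per team per match and aggregating once. A minimal sketch assuming the df above (the names long, table, W, L, total_for and total_against are illustrative):

import pandas as pd

# One row per (team, match): winners tagged win=1, losers win=0.
long = pd.concat([
    df.rename(columns={'winner': 'team', 'winner_score': 'for', 'loser_score': 'against'})
      .assign(win=1)[['team', 'for', 'against', 'win']],
    df.rename(columns={'loser': 'team', 'loser_score': 'for', 'winner_score': 'against'})
      .assign(win=0)[['team', 'for', 'against', 'win']],
])

# Aggregate once; no NaNs even for teams that never won.
table = long.groupby('team').agg(
    W=('win', 'sum'),
    L=('win', lambda s: (s == 0).sum()),
    total_for=('for', 'sum'),
    total_against=('against', 'sum'),
)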

Related

Comparing two data frames columns and addition of matching values

I have two data frames with similar data, and I would like to subtract matching values. Example:
df1:

  Letter  FREQ  Diff
0      A    20   NaN
1      B    12   NaN
2      C     5   NaN
3      D     4   NaN

df2:

  Letter  FREQ
0      A    19
1      B    11
3      D     2
If we can find the same letter in the column "Letter", I would like to create a new column with the subtraction of the two frequency columns.
Expected output:

df1:

  Letter  FREQ  Diff
0      A    20     1
1      B    12     1
2      C     5     5
3      D     4     2
I have tried to begin like this, but obviously it doesn't work:

for i in df1.Letter:
    for j in df2.Letter:
        if i == j:
            df1.Difference[j] == (df1.Frequency[i] - df2.Frequency[j])
        else:
            pass
Thank you for your help!
Use df.merge with fillna (note the question's frames name the column FREQ, so the merge suffixes are FREQ_x and FREQ_y):

In [1101]: res = df1.merge(df2, on='Letter', how='outer')

In [1108]: res['Diff'] = (res.FREQ_x - res.FREQ_y).fillna(res.FREQ_x)

In [1110]: res = res.drop(columns='FREQ_y').rename(columns={'FREQ_x': 'FREQ'})

In [1111]: res
Out[1111]:
  Letter  FREQ  Diff
0      A    20   1.0
1      B    12   1.0
2      C     5   5.0
3      D     4   2.0
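If you'd rather avoid the merge, a map-based variant does the same lookup; a sketch assuming the df1/df2 above (lookup is an illustrative name):

# Build a Letter -> FREQ lookup from df2, then subtract row-wise;
# letters missing from df2 yield NaN, which falls back to df1's FREQ.
lookup = df2.set_index('Letter')['FREQ']
df1['Diff'] = (df1['FREQ'] - df1['Letter'].map(lookup)).fillna(df1['FREQ'])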

Mapping a big dataframe to calculate

I've got a table like this:

  account_id  costs
a          1      1
b          1      2
c          1      3
d          2     90
e          2     50
f          2     30
I'm trying to calculate another column, called total_costs, with something like this:

def calculate_balance(x):
    balance.append(final[final.account_id == x].costs.cumsum())

final["total_costs"] = final["account_id"].map(calculate_balance)

But it's taking TOO LONG. Can I use another, much faster solution?
You can use groupby with the cumsum function:

final['total_costs'] = final.groupby('account_id')['costs'].cumsum()
Results:

  account_id  costs  total_costs
0          1      1            1
1          1      2            3
2          1      3            6
3          2     90           90
4          2     50          140
5          2     30          170
You should use .groupby to calculate the values fast (and just once per group), and then .map to write them back to your new column.
Try this:

import pandas as pd
from io import StringIO

final = pd.read_csv(StringIO("""
account_id costs
a 1 1
b 1 2
c 1 3
d 2 90
e 2 50
f 2 30"""), sep=r"\s+")

final["total_costs"] = final.groupby("account_id")["costs"].cumsum()
print(final)
Output:

   account_id  costs  total_costs
a           1      1            1
b           1      2            3
c           1      3            6
d           2     90           90
e           2     50          140
f           2     30          170
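The groupby-then-.map pattern the second answer mentions is also handy when you want a single value per group rather than a running total; a minimal sketch assuming the final frame built above (totals and group_total are illustrative names):

# One sum per account_id, computed once, then broadcast back by lookup.
totals = final.groupby('account_id')['costs'].sum()
final['group_total'] = final['account_id'].map(totals)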

Get Sum of Every Time Two Values Match

My googling has failed me; I think my main issue is that I'm unsure how to phrase the question (sorry about the crappy title). I'm trying to find the total each time 2 people vote the same way. Below you will see an example of how the data looks and the output I was looking for. I have a working solution, but it's very slow (see bottom), and I was wondering if there's a better way to go about this.
This is how the data is shaped
----------------------------------
event person vote
1 a y
1 b n
1 c nv
1 d nv
1 e y
2 a n
2 b nv
2 c y
2 d n
2 e n
----------------------------------
This is the output I'm looking for
----------------------------------
Person a b c d e
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
----------------------------------
Working Code
df = df.pivot(index='event', columns='person', values='vote')
frame = pd.DataFrame(columns=df.columns, index=df.columns)

for person1, value in frame.iterrows():
    for person2 in frame:
        count = 0
        for i, row in df.iterrows():
            person1_votes = row[person1]
            person2_votes = row[person2]
            if person1_votes == person2_votes:
                count += 1
        frame.at[person1, person2] = count
Try looking at your problem in a different way:

df = df.assign(key=1)
mergedf = df.merge(df, on=['event', 'key'])
mergedf['equal'] = mergedf['vote_x'].eq(mergedf['vote_y'])
output = mergedf.groupby(['person_x', 'person_y'])['equal'].sum().unstack()
output
Out[1241]:
person_y    a    b    c    d    e
person_x
a         2.0  0.0  0.0  1.0  2.0
b         0.0  2.0  0.0  0.0  0.0
c         0.0  0.0  2.0  1.0  0.0
d         1.0  0.0  1.0  2.0  1.0
e         2.0  0.0  0.0  1.0  2.0
@Wen-Ben already answered your question. It is based on the concept of finding all pair-wise combinations of persons and counting those having the same vote. Finding all pair-wise combinations is a cartesian product (cross join). You may read the great post from @cs95 on cartesian product (CROSS JOIN) with pandas.
In your problem, you count same votes per event, so it is a cross join per event. Therefore, you don't need to add a helper key column as in @cs95's post. You may cross join directly on column event. After the cross join, keep only the pair-wise person<->person rows having the same vote using query. Finally, use crosstab to count those pair-wise matches.
Below is my solution:
df_match = df.merge(df, on='event').query('vote_x == vote_y')
pd.crosstab(index=df_match.person_x, columns=df_match.person_y)
Out[1463]:
person_y  a  b  c  d  e
person_x
a         2  0  0  1  2
b         0  2  0  0  0
c         0  0  2  1  0
d         1  0  1  2  1
e         2  0  0  1  2
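Another option, as a rough sketch assuming the question's original df (before the pivot in the working code), is to compare the event x person vote matrix against itself with numpy broadcasting:

import numpy as np

# Rows are events, columns are persons.
wide = df.pivot(index='event', columns='person', values='vote')
votes = wide.to_numpy()

# Broadcasting (events, persons, 1) == (events, 1, persons) gives a pairwise
# match per event; summing over the event axis counts agreements per pair.
same = (votes[:, :, None] == votes[:, None, :]).sum(axis=0)
result = pd.DataFrame(same, index=wide.columns, columns=wide.columns)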

Loop that counts unique values in a pandas df

I am trying to create a loop or a more efficient process that can count the number of current values in a pandas df. At the moment I'm selecting the value I want to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of the same remaining values left in ['Code', 'Area'], i.e. how many more times the same value occurs.
2) ['On'] returns the number of values currently occurring in ['Area']. It achieves this by parsing through the df to see if those values occur again, so it essentially looks into the future.
import pandas as pd

d = ({
    'Code' : ['A','A','A','A','B','A','B','A','A','A'],
    'Area' : ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park'],
})
df = pd.DataFrame(data=d)

# Select value
df1 = df[df.Code == 'A'].copy()

df1['u'] = df1[::-1].groupby('Area').Area.cumcount()

ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0

df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is I want to run this script on all values in Code, instead of selecting one at a time. For instance, if I want to do the same thing on Code 'B', I would have to change it to df2 = df1[df1.Code == 'B'].copy() and then run the script again.
If I have numerous values in Code it becomes very inefficient. I need a loop that finds all unique values in 'Code'. Ideally, the script would look like:

df1 = df[df.Code == 'All unique values'].copy()
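One minimal way to write that loop, reusing the question's own 'u' logic per Code value (a sketch: the 'On' block would slot into the same loop body, and pieces/result are illustrative names):

pieces = []
for code in df['Code'].unique():
    part = df[df.Code == code].copy()
    part['u'] = part[::-1].groupby('Area').Area.cumcount()
    # ... compute 'On' for `part` exactly as in the script above ...
    pieces.append(part)
result = pd.concat(pieces).sort_index()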
Intended Output:

  Code   Area    u   On
0    A   Home  2.0  1.0
1    A   Work  1.0  2.0
2    A  Shops  0.0  3.0
3    A   Park  1.0  3.0
4    B   Cafe  1.0  1.0
5    A   Home  1.0  3.0
6    B   Cafe  0.0  1.0
7    A   Work  0.0  3.0
8    A   Home  0.0  2.0
9    A   Park  0.0  1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
(df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0)
which gives me

In [212]: df
Out[212]:
  Code   Area  u  nunique   On
0    A   Home  2        1  1.0
1    A   Work  1        2  2.0
2    A  Shops  0        3  3.0
3    A   Park  1        4  3.0
4    B   Cafe  1        1  1.0
5    A   Home  1        4  3.0
6    B   Cafe  0        1  1.0
7    A   Work  0        4  3.0
8    A   Home  0        4  2.0
9    A   Park  0        4  1.0
In this, u is the number of matching (Code, Area) pairs after that row, and nunique is the number of unique Area values seen so far within that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area -- once it's not used any more -- we start subtracting it from nunique.
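As a quick sanity check of that reading of u (assuming the df built by the code above): row 0 is (A, Home), and exactly two more (A, Home) rows follow it.

# Row 0 is Code 'A', Area 'Home'; u should count the later duplicates.
later = df.iloc[1:]
assert ((later.Code == 'A') & (later.Area == 'Home')).sum() == df.loc[0, 'u']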
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
  Code   Area  u
0    A   Home  2
1    A   Work  1
2    A  Shops  0
3    A   Park  1
4    B   Cafe  1
5    A   Home  1
6    B   Cafe  0
7    A   Work  0
8    A   Home  0
9    A   Park  0

Cumulative count based off different values in a pandas df

The code below provides a cumulative count of how many times a specified value changes. The value has to change to return a count.
import pandas as pd
import numpy as np

d = ({
    'Who' : ['Out','Even','Home','Home','Even','Away','Home','Out','Even','Away','Away','Home','Away'],
})
df = pd.DataFrame(d)

# Specified values
Teams = ['Home', 'Away']

for who in Teams:
    s = df[df.Who == who].index.to_series().diff() != 1
    df['Change_' + who] = s[s].cumsum()
Output:

     Who  Change_Home  Change_Away
0    Out          NaN          NaN
1   Even          NaN          NaN
2   Home          1.0          NaN
3   Home          NaN          NaN
4   Even          NaN          NaN
5   Away          NaN          1.0
6   Home          2.0          NaN
7    Out          NaN          NaN
8   Even          NaN          NaN
9   Away          NaN          2.0
10  Away          NaN          NaN
11  Home          3.0          NaN
12  Away          NaN          3.0
I'm trying to further sort the output based on which value precedes Home and Away. The code above doesn't differentiate what Home and Away were changed from; it just counts the number of times the value changed to Home/Away.
Is there a way to alter the code above to split it up by what Home/Away was changed from? Or will it have to start again?
My intended output is:

    Even_Away  Even_Home  Swap_Away  Swap_Home   Who
0                                                 Out
1                                                Even
2                      1                         Home
3                                                Home
4                                                Even
5          1                                     Away
6                                            1   Home
7                                                 Out
8                                                Even
9          2                                     Away
10                                               Away
11                                           2   Home
12                                1              Away
So Even_ represents how many times it went from Even to Home/Away and Swap_ represents how many times it went from Home to Away or vice versa.
The main function is get_dummies, which makes the solution dynamic by creating new columns for all previous values defined in the Teams list:
# create DataFrame
df = pd.DataFrame(d)
Teams = ['Home', 'Away']

# boolean mask: check values against the list and compare with the shifted column
shifted = df['Who'].shift().fillna('')
m1 = df['Who'].isin(Teams)
# mask to exclude unchanged previous values (Home_Home, Away_Away)
m2 = df['Who'] == shifted
# chain together; ~ inverts the mask
m = m1 & ~m2

# join previous and current value where the mask holds and create an indicator df
df1 = pd.get_dummies(np.where(m, shifted + '_' + df['Who'], np.nan))
# rename columns dynamically: Home_/Away_ prefixes become Swap_
c = df1.columns[df1.columns.str.startswith(tuple(Teams))]
c1 = ['Swap_' + x.split('_')[1] for x in c]
df1 = df1.rename(columns=dict(zip(c, c1)))

# count values by cumulative sum, add column Who
df2 = df1.cumsum().mask(df1 == 0, 0).join(df[['Who']])
print(df2)
    Swap_Home  Even_Away  Even_Home  Swap_Away   Who
0           0          0          0          0   Out
1           0          0          0          0  Even
2           0          0          1          0  Home
3           0          0          0          0  Home
4           0          0          0          0  Even
5           0          1          0          0  Away
6           1          0          0          0  Home
7           0          0          0          0   Out
8           0          0          0          0  Even
9           0          2          0          0  Away
10          0          0          0          0  Away
11          2          0          0          0  Home
12          0          0          0          1  Away
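If you want the blank cells shown in the intended output rather than zeros, one cosmetic option (an assumption about the desired display, not part of the computation) is:

# Replace zero counts with empty strings purely for printing.
print(df2.replace(0, '').to_string())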
