Getting sum of metrics for combinations of multiple columns - python

I have a pandas data frame that looks like this.
set language group version metric_1 metric_2 metric_3
X English 1 A 100 20 5
X French 2 A 90 10 10
X English 1 B 80 30 15
X French 2 B 70 20 20
Y English 1 A 200 20 30
Y French 2 A 180 30 20
Y English 1 B 160 10 10
Y French 2 B 140 20 5
I want to summarize the metrics with all the combinations of the experiment attributes - set, language, group & version. So the summary data frame will look like this.
set language group version metric_1 metric_2 metric_3
X 800 140 80
Y 1000 140 80
English 1200 200 80
French 600 80 80
1 1050 120 60
2 750 160 100
A 850 140 80
B 950 140 80
X English 500 100 40
X French 300 40 40
Y English 700 100 40
Y French 300 40 40
X 1 350 60 30
X 2 450 80 50
Y 1 700 60 30
Y 2 300 80 50
X A 350 70 40
X B 450 70 40
Y A 500 70 40
Y B 500 70 40
English 1 ...
English 2 ...
French 1 ...
French 2 ...
English A ...
English B ...
French A ...
French B ...
1 A ...
1 B ...
2 A ...
2 B ...
X English 1 ...
X English 2 ...
X French 1 ...
X French 2 ...
Y English 1 ...
Y English 2 ...
Y French 1 ...
Y French 2 ...
X English A ...
X English B ...
X French A ...
X French B ...
Y English A ...
Y English B ...
Y French A ...
Y French B ...
X 1 A ...
X 1 B ...
X 2 A ...
X 2 B ...
Y 1 A ...
Y 1 B ...
Y 2 A ...
Y 2 B ...
English 1 A ...
English 1 B ...
English 2 A ...
English 2 B ...
French 1 A ...
French 1 B ...
French 2 A ...
French 2 B ...
I know I can brute-force this by running groupby on each combination of attributes and concatenating the results into a single data frame. This could potentially extend to many more attributes, so I'm looking for a more scalable solution. I've been reading about the functions available in itertools, but I'm not sure how they would apply.
Appreciate any ideas / directions on this. Thanks!

Indeed, the function combinations from itertools will help you create all the combinations you need. Let's suppose your data is in a dataframe called df.
import pandas as pd
from itertools import combinations

# create two lists: one with the metric columns to sum, one with the attribute columns
list_metric = [col for col in df.columns if 'metric' in col]
list_non_metric = [col for col in df.columns if 'metric' not in col]

# create the dataframe grouped on all attribute columns
df_grouped = df.groupby(list_non_metric, as_index=False)[list_metric].sum()

# use concat and a list comprehension to create all the combinations
df_output = (pd.concat([df_grouped.groupby(list(combi), as_index=False)[list_metric].sum()
                        for j in range(1, len(list_non_metric) + 1)
                        for combi in combinations(list_non_metric, j)])
               .fillna(''))

# reorder the columns as in the input data (if necessary)
df_output = df_output[df.columns]
If you want to understand how combinations works, try printing this line:
[combi for combi in combinations(list_non_metric, 2)]
The outer loop for j in range(1, len(list_non_metric)+1) then creates the combinations of 1, 2, 3, ... elements of list_non_metric.
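For example, with the four attribute columns from the question, printing the pairs gives the following (plain itertools, nothing pandas-specific):
from itertools import combinations

list_non_metric = ['set', 'language', 'group', 'version']

# all unordered pairs of attribute columns
print([combi for combi in combinations(list_non_metric, 2)])
# [('set', 'language'), ('set', 'group'), ('set', 'version'),
#  ('language', 'group'), ('language', 'version'), ('group', 'version')]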

Here's one way to do it. I'm assuming you've only given a subset of your data because the totals don't add up:
In []:
import itertools as it
import pandas as pd

cols = df.columns.tolist()
index = ['set', 'language', 'group', 'version']
df = df.set_index(index)

pd.concat([df.groupby(level=x).sum().reset_index()
           for n in range(1, len(index) + 1)
           for x in it.combinations(range(len(index)), n)],
          sort=True)[cols].fillna('')
Out[]:
set language group version metric_1 metric_2 metric_3
0 X 340 80 50
1 Y 680 80 65
0 English 540 80 60
1 French 480 80 55
0 1 540 80 60
1 2 480 80 55
0 A 570 80 65
1 B 450 80 50
0 X English 180 50 20
1 X French 160 30 30
2 Y English 360 30 40
3 Y French 320 50 25
...
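To see which index levels each tuple x in the comprehension selects, you can map the positions back to names (a quick check of my own, not part of the answer above):
import itertools as it

index = ['set', 'language', 'group', 'version']
for x in it.combinations(range(len(index)), 2):
    print(x, '->', tuple(index[i] for i in x))
# (0, 1) -> ('set', 'language')
# (0, 2) -> ('set', 'group')
# (0, 3) -> ('set', 'version')
# (1, 2) -> ('language', 'group')
# (1, 3) -> ('language', 'version')
# (2, 3) -> ('group', 'version')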

Related

How to lookup in python between 2 dataframes with match mode -> an exact match or the next larger item?

I'd like to create a lookup (similar to Excel, for example) with the match mode "an exact match or the next larger item".
Let's say I have these 2 dataframes:
import numpy as np
import pandas as pd

np.random.seed(1)
Wins_Range = np.arange(1, 101, 1)
Wins = pd.DataFrame({"Wins Needed": Wins_Range})
Wins
Wins Needed
0 1
1 2
2 3
3 4
4 5
... ...
95 96
96 97
97 98
98 99
99 100
And the second one:
Levels_Range = np.arange(1,101,1)
Levels = pd.DataFrame({"Level": Levels_Range})
Levels["Wins"]=np.random.choice([1,2,3,4,5],size=len(Levels), p=[0.2,0.2,0.2,0.2,0.2]).cumsum()
Levels
Level Wins
0 1 3
1 2 7
2 3 8
3 4 10
4 5 11
... ... ...
95 96 281
96 97 286
97 98 289
98 99 290
99 100 294
Now, I'd like to pull the level from Levels df to the Wins df when the condition is Wins Needed=Wins but as I said - the match mode will be an exact match or the next larger item.
BTW - the type of Levels["Wins"] is float and the type of Wins["Wins Needed"] is int, if that matters.
I've tried to use the merge function but it doesn't work (I'm new at python) -
Wins.merge(Levels, on='Wins Needed', how='left')
Thanks in advance!
You need a merge_asof:
out = pd.merge_asof(Wins, Levels, left_on='Wins Needed', right_on='Wins',
                    direction='forward')[['Wins Needed', 'Level']]
Or
Wins['Level'] = pd.merge_asof(Wins, Levels, left_on='Wins Needed', right_on='Wins',
                              direction='forward')['Level']
NB. the keys must be sorted for a merge_asof.
Output:
Wins Needed Level
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
.. ... ...
95 96 35
96 97 35
97 98 36
98 99 36
99 100 37
[100 rows x 2 columns]
If the values are not initially sorted:
Wins['Level'] = pd.merge_asof(Wins[['Wins Needed']].reset_index().sort_values(by='Wins Needed'),
                              Levels.sort_values(by='Wins'),
                              left_on='Wins Needed', right_on='Wins',
                              direction='forward').set_index('index')['Level']
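For the direction semantics themselves, here is a minimal, self-contained illustration with toy data (my own example, not the asker's frames):
import pandas as pd

left = pd.DataFrame({'Wins Needed': [1, 2, 3, 4, 5]})
right = pd.DataFrame({'Wins': [3, 7], 'Level': [1, 2]})

# direction='forward' matches each left key to the nearest right key >= it
print(pd.merge_asof(left, right, left_on='Wins Needed', right_on='Wins',
                    direction='forward'))
#    Wins Needed  Wins  Level
# 0            1     3      1
# 1            2     3      1
# 2            3     3      1
# 3            4     7      2
# 4            5     7      2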

Python find mean of all rows by a column and then find distance

I have a dataframe as below. I understand that df.groupby("degree").mean() would give me the mean per degree. I would like to take those means and, for each data point, compute the distance to each of them. Here there are 3 means, (4, 40), (2, 80) and (4, 94), so each data point gets 3 new columns. The distance should be calculated by the formulas: BCA_mean = (name-4)^3 + (score-40)^3, M.Tech_mean = (name-2)^3 + (score-80)^3, MBA_mean = (name-4)^3 + (score-94)^3.
import pandas as pd
# dictionary of lists
dict = {'name':[5, 4, 2, 3],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print (df)
name degree score
0 5 MBA 90
1 4 BCA 40
2 2 M.Tech 80
3 3 MBA 98
df.groupby("degree").mean()
        name  score
degree
BCA        4     40
M.Tech     2     80
MBA        4     94
Update 1
My real dataset has more than 100 columns, so I would prefer something that scales to that. The logic is still the same: for each mean, subtract the mean value from the column, take the cube of each cell, and add them up.
I found something like the below, but I'm not sure whether there is a more efficient way:
import numpy as np

y = df.groupby("degree").mean()
print(y)

(np.square(df[['name', 'score']].subtract(y.iloc[0, :], axis=1))).sum(axis=1)
df["mean0"] = (np.square(df[['name', 'score']].subtract(y.iloc[0, :], axis=1))).sum(axis=1)
df
import pandas as pd

# dictionary of lists
dict = {'degree': ["MBA", "BCA", "M.Tech", "MBA", "BCA"],
        'name': [5, 4, 2, 3, 2],
        'score': [90, 40, 80, 98, 60],
        'game': [100, 200, 300, 100, 400],
        'money': [100, 200, 300, 100, 400],
        'loan': [100, 200, 300, 100, 400],
        'rent': [100, 200, 300, 100, 400],
        'location': [100, 200, 300, 100, 400]}

# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print(df)

dfx = df.groupby("degree").mean()
print(dfx)

# subtract the matching group mean from every value column of the row
def fun(x):
    if x[0] == 'BCA':
        return x[1:] - dfx.iloc[0, :].tolist()
    if x[0] == 'M.Tech':
        return x[1:] - dfx.iloc[1, :].tolist()
    if x[0] == 'MBA':
        return x[1:] - dfx.iloc[2, :].tolist()

df_added = df.apply(fun, axis=1)
df_added
Result:
degree name score game money loan rent location
0 MBA 5 90 100 100 100 100 100
1 BCA 4 40 200 200 200 200 200
2 M.Tech 2 80 300 300 300 300 300
3 MBA 3 98 100 100 100 100 100
4 BCA 2 60 400 400 400 400 400
Mean, which is dfx:
name score game money loan rent location
degree
BCA 3 50 300 300 300 300 300
M.Tech 2 80 300 300 300 300 300
MBA 4 94 100 100 100 100 100
df_added (difference of each element from its group-mean column value):
name score game money loan rent location
0 1 -4 0 0 0 0 0
1 1 -10 -100 -100 -100 -100 -100
2 0 0 0 0 0 0 0
3 -1 4 0 0 0 0 0
4 -1 10 100 100 100 100 100
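The above stops at the differences; to get the cubed distances the question actually asks for, one vectorized sketch (my own code, assuming the df and dfx defined above) is:
import numpy as np

value_cols = [c for c in df.columns if c != 'degree']

# one new column per group mean: sum over columns of (value - mean) ** 3
for degree, mean_row in dfx.iterrows():
    df[degree + '_mean'] = ((df[value_cols] - mean_row) ** 3).sum(axis=1)
This scales to any number of columns, because the subtraction aligns on column names rather than positions.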

Adding total row to pandas DataFrame groupby

I am aware of this link but I didn't manage to solve my problem.
I have this below DataFrame from pandas.DataFrame.groupby().sum():
                    Value
Level Company Item
1     X       a       100
              b       200
      Y       a        35
              b       150
              c        35
2     X       a        48
              b       100
              c        50
      Y       a        80
and would like to add total rows for each level of index that I have to get:
                     Value
Level Company Item
1     X       a        100
              b        200
              Total    300
      Y       a         35
              b        150
              c         35
              Total    220
      Total            520
2     X       a         48
              b        100
              c         50
              Total    198
      Y       a         80
              Total     80
      Total            278
Total                  798
As requested, the setup code:
import pandas as pd

level = list(map(int, list('111112222')))
company = list('XXYYYXXXY')
item = list('ababcabca')
value = [100, 200, 35, 150, 35, 48, 100, 50, 80]
col = ['Level', 'Company', 'Item', 'Value']
df = pd.DataFrame([level, company, item, value]).T
df.columns = col
df.groupby(['Level', 'Company', 'Item']).sum()
You can use:
m=df.groupby(['Level','Company','Item'])['Value'].sum().unstack()
m.assign(total=m.sum(1)).stack().to_frame('Value')
                     Value
Level Company Item
1     X       a      100.0
              b      200.0
              total  300.0
      Y       a       35.0
              b      150.0
              c       35.0
              total  220.0
2     X       a       48.0
              b      100.0
              c       50.0
              total  198.0
      Y       a       80.0
              total   80.0
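This covers the per-Company totals; the per-Level and grand totals from the expected output can be appended in the same spirit. A sketch of my own (not the answerer's code, ordering/sorting left aside):
import pandas as pd

s = df.groupby(['Level', 'Company', 'Item'])['Value'].sum()

# per-level totals, re-labelled to fit the 3-level index
level_totals = s.groupby(level='Level').sum()
level_totals.index = pd.MultiIndex.from_tuples(
    [(lvl, 'Total', '') for lvl in level_totals.index])

# grand total as a one-row Series with the same index shape
grand_total = pd.Series([s.sum()],
                        index=pd.MultiIndex.from_tuples([('Total', '', '')]))

out = pd.concat([s, level_totals, grand_total]).to_frame('Value')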
Try this: basically, it creates two new dfs from the sums at the two grouping levels and concats the three data frames.
level = list(map(int, list('111112222')))
company = list('XXYYYXXXY')
item = list('ababcabca')
value = [100,200,35,150,35,48,100,50,80]
col = ['Level', 'Company', 'Item', 'Value']
df = pd.DataFrame([level,company,item,value]).T
df.columns = col
df1 = (df.groupby(['Level', 'Company', 'Item'])['Value'].sum())
df2 = (df1.sum(level=0).to_frame().assign(Company='total').set_index('Company', append=True))
df3 = (df1.groupby(['Level','Company']).sum().to_frame().assign(Item='total').set_index('Item', append=True))
dfx = pd.concat([df1.to_frame().reset_index(),
                 df2.reset_index(),
                 df3.reset_index()], sort=False)
print(dfx)
Output:
Level Company Item Value
0 1 X a 100
1 1 X b 200
2 1 Y a 35
3 1 Y b 150
4 1 Y c 35
5 2 X a 48
6 2 X b 100
7 2 X c 50
8 2 Y a 80
0 1 total NaN 520
1 2 total NaN 278
0 1 X total 300
1 1 Y total 220
2 2 X total 198
3 2 Y total 80
This is not sorted though as you expect.
If I concat the 3 dfs without resetting the index I get the expected sort order, but the index is a MultiIndex shown as tuples:
dfx = pd.concat([df1.to_frame(), df2, df3]).sort_index()
Output
Value
(1, X, a) 100
(1, X, b) 200
(1, X, total) 300
(1, Y, a) 35
(1, Y, b) 150
(1, Y, c) 35
(1, Y, total) 220
(1, total) 520
(2, X, a) 48
(2, X, b) 100
(2, X, c) 50
(2, X, total) 198
(2, Y, a) 80
(2, Y, total) 80
(2, total) 278
I am not sure how to convert this to columns as in your df.
You can try stacking it one level at a time:
m = df.groupby(['Level','Company','Item'])['Value'].sum().unstack(level=['Company','Item'])
m = m.assign(total=m.sum(1))
m = m.stack(level='Company')
m = m.assign(total=m.sum(1))
m = m.stack(level='Item')
The output has duplicate totals though:
Level Company Item
1 X a 100.0
b 200.0
total 300.0
Y a 35.0
b 150.0
c 35.0
total 220.0
total 520.0
total 520.0
2 X a 48.0
b 100.0
c 50.0
total 198.0
Y a 80.0
total 80.0
total 278.0
total 278.0
dtype: float64
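If the duplicated total rows are unwanted, one way to finish (my own addition, not part of the answer) is to drop repeated index entries afterwards:
# keep only the first occurrence of each (Level, Company, Item) label
m = m[~m.index.duplicated()]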

Dataframe set_index produces duplicate index values instead of doing hierarchical grouping

I have a dataframe that looks like this (index not shown)
Time Letter Type Value
0 A x 10
0 B y 20
1 A y 30
1 B x 40
3 C x 50
I want to produce a dataframe that looks like this:
Time Letter TypeX TypeY
0    A      10
0    B            20
1    A            30
1    B      40
3    C      50
To do that, I decided I would first create a table with multiple indices, Time and Letter, and then unstack the last index, Type.
Let's say my original dataframe was named my_table. After
my_table.reset_index().set_index(['Time', 'Letter'])
instead of grouping so that under every (Time, Letter) pair there is BOTH type x and type y, the rows seem to have been sorted by Type (adding a few more entries to demonstrate the point):
Time(i) Letter(i) Type Value
0 A x 10
D x 25
H x 15
G x 33
1 B x 40
G x 10
3 C x 50
0 B y 20
H y 10
1 A y 30
Why does this happen? I expected a result like so:
Time Letter Type Value
0 A x 10
y 30
B y 20
H x 15
y 10
D x 25
G x 33
1 B x 40
G x 10
3 C x 50
The same behavior occurs when I make Type one of the indices, it just becomes bold as an index.
How do I successfully group columns using Time and Letter to get X and Y to be matched by those columns, so I can successfully use unstack?
You need to set Type as an index as well:
df.set_index(['Time','Letter','Type']).Value.unstack(fill_value='').reset_index()
Out[178]:
Type  Time Letter   x   y
0        0      A  10
1        0      B      20
2        1      A      30
3        1      B  40
4        3      C  50
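An equivalent route (my suggestion, not part of the answer above) is pivot_table, which does the set_index + unstack in one call:
out = (df.pivot_table(index=['Time', 'Letter'], columns='Type',
                      values='Value', aggfunc='first')
         .fillna('')
         .reset_index())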

Subtracting many columns in a df by one column in another df

I'm trying to subtract a df "p_df" (144 rows x 1 col) from a df "stock_returns" (144 rows x 517 cols).
I have tried;
stock_returns - p_df
stock_returns.rsub(p_df,axis=1)
stock_returns.substract(p_df)
But none of them work; they all return NaN values.
I'm passing it through this function, and using the for loop to get the args:
def disp_calc(returns, p, wi):  # apply(disp_calc, rows = ...)
    wi = wi / np.sum(wi)
    rp = (col_len(returns) * (returns - p) ** 2).sum()  # returns - p causing problems
    return np.sqrt(rp)

for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]  # .apply(np.mean, axis=1)
    portfolio_return = returns_rolling[i]
    p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc, args=(portfolio_return, wi))
My expected output is to subtract the single p_df column from all 517 cols in the first df, so the final result would still have 517 cols. Thanks!
You're almost there, just need to set axis=0 to subtract along the indexes:
>>> stock_returns = pd.DataFrame([[10, 100, 200],
...                               [15, 115, 215],
...                               [20, 120, 220],
...                               [25, 125, 225],
...                               [30, 130, 230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
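Applied to the frames in the question, that would be (assuming the single p_df column holds the portfolio returns):
# subtract the one p_df column from every one of the 517 columns, row by row
result = stock_returns.sub(p_df.iloc[:, 0], axis=0)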
