In R's data.table it is easy to aggregate several columns in a single call, picking values with which.min or which.max (argmin/argmax). For example, for DT:
> DT = data.table(id=c(1,1,1,2,2,2,2,3,3,3), col1=c(1,3,5,2,5,3,6,3,67,7), col2=c(4,6,8,3,65,3,5,4,4,7), col3=c(34,64,53,5,6,2,4,6,4,67))
> DT
id col1 col2 col3
1: 1 1 4 34
2: 1 3 6 64
3: 1 5 8 53
4: 2 2 3 5
5: 2 5 65 6
6: 2 3 3 2
7: 2 6 5 4
8: 3 3 4 6
9: 3 67 4 4
10: 3 7 7 67
> DT_agg = DT[, .(agg1 = col1[which.max(col2)]
, agg2 = col2[which.min(col3)]
, agg3 = col1[which.max(col3)])
, by= id]
> DT_agg
id agg1 agg2 agg3
1: 1 5 4 3
2: 2 5 3 5
3: 3 7 4 7
agg1 is the value of col1 on the row where col2 is at its maximum, grouped by id.
agg2 is the value of col2 on the row where col3 is at its minimum, grouped by id.
agg3 is the value of col1 on the row where col3 is at its maximum, grouped by id.
How is this possible in pandas, doing all three aggregates in one operation using groupby and agg? I can't figure out how to combine three different index lookups in one agg call in Python. Here's the dataframe in Python:
DF =pd.DataFrame({'id':[1,1,1,2,2,2,2,3,3,3], 'col1':[1,3,5,2,5,3,6,3,67,7], 'col2':[4,6,8,3,65,3,5,4,4,7], 'col3':[34,64,53,5,6,2,4,6,4,67]})
DF
Out[70]:
id col1 col2 col3
0 1 1 4 34
1 1 3 6 64
2 1 5 8 53
3 2 2 3 5
4 2 5 65 6
5 2 3 3 2
6 2 6 5 4
7 3 3 4 6
8 3 67 4 4
9 3 7 7 67
You can try named aggregation, using each group's index to look up the other column in DF:
DF.groupby('id').agg(agg1=('col1',lambda x:x[DF.loc[x.index,'col2'].idxmax()]),
agg2 = ('col2',lambda x:x[DF.loc[x.index,'col3'].idxmin()]),
agg3 = ('col1',lambda x:x[DF.loc[x.index,'col3'].idxmax()]))
agg1 agg2 agg3
id
1 5 4 3
2 5 3 5
3 7 4 7
How about a tidyverse way in python:
>>> from datar.all import f, tibble, group_by, which_max, which_min, summarise
>>>
>>> DF = tibble(
... id=[1,1,1,2,2,2,2,3,3,3],
... col1=[1,3,5,2,5,3,6,3,67,7],
... col2=[4,6,8,3,65,3,5,4,4,7],
... col3=[34,64,53,5,6,2,4,6,4,67]
... )
>>>
>>> DF >> group_by(f.id) >> summarise(
... agg1=f.col1[which_max(f.col2)],
... agg2=f.col2[which_min(f.col3)],
... agg3=f.col1[which_max(f.col3)]
... )
id agg1 agg2 agg3
<int64> <int64> <int64> <int64>
0 1 5 4 3
1 2 5 3 5
2 3 7 4 7
I am the author of the datar package. Feel free to submit issues if you have any questions.
I toyed with this question, primarily to see whether I could improve on the speed of the original solution. With df being the DataFrame from the question (DF there), this is faster than named aggregation.
grp = df.groupby("id")
pd.DataFrame({ "col1": df.col1[grp.col2.idxmax()].array,
"col2": df.col2[grp.col3.idxmin()].array,
"col3": df.col1[grp.col3.idxmax()].array},
index=grp.indices)
col1 col2 col3
1 5 4 3
2 5 3 5
3 7 4 7
Speedup ~3x.
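The same idxmax/idxmin lookup can be wrapped in a small helper so that each aggregate reads much like the data.table expression. This is only a sketch: value_at_extreme is a hypothetical name, not a pandas function, and it assumes the row labels of DF are unique.

```python
import pandas as pd

DF = pd.DataFrame({'id':   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'col1': [1, 3, 5, 2, 5, 3, 6, 3, 67, 7],
                   'col2': [4, 6, 8, 3, 65, 3, 5, 4, 4, 7],
                   'col3': [34, 64, 53, 5, 6, 2, 4, 6, 4, 67]})


def value_at_extreme(frame, by, value_col, key_col, how="max"):
    """Hypothetical helper: per group, the value of value_col on the row
    where key_col is largest (how="max") or smallest (how="min")."""
    grouped = frame.groupby(by)[key_col]
    # One row label per group, indexed by the group key.
    row_labels = grouped.idxmax() if how == "max" else grouped.idxmin()
    picked = frame.loc[row_labels.to_numpy(), value_col]
    # Re-index the picked values by group key instead of row label.
    return pd.Series(picked.to_numpy(), index=row_labels.index, name=value_col)


DF_agg = pd.DataFrame({
    "agg1": value_at_extreme(DF, "id", "col1", "col2", "max"),
    "agg2": value_at_extreme(DF, "id", "col2", "col3", "min"),
    "agg3": value_at_extreme(DF, "id", "col1", "col3", "max"),
})
print(DF_agg)
```

Like which.max in R, idxmax returns the first row when several rows tie for the extreme value.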
I'm trying to create a rolling function that:
1. Divides two DataFrames with 3 columns each.
2. Calculates the mean of each row of the output from step 1.
3. Sums the averages from step 2.
This could be done with DataFrame.iterrows(), looping through each row. However, that would be inefficient on larger datasets, so my objective is to build this on DataFrame.rolling(), which should be much faster.
What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values)-1)*100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be time-consuming on large datasets. Therefore, I've tried to write a function that can be applied to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop=True)
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSum = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
The problem here is that the result contains one value per column instead of a single value per window. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
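The per-column result arises because DataFrame.rolling(...).apply calls the function separately for each column and passes it a one-dimensional window each time, never a 3x3 block of rows. A small sketch that records what the function receives (probe and the toy frame are made up for illustration):

```python
import pandas as pd

seen = []  # (type name, shape) of every window handed to the function


def probe(window):
    # With the default raw=False, `window` is a Series taken from ONE column.
    seen.append((type(window).__name__, window.shape))
    return window.sum()


toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
out = toy.rolling(window=3).apply(probe)

print(set(seen))  # every call received a length-3 Series
print(out)        # one rolling result per column
```

So inside SumOfAverageFunction, vals is a length-3 Series from a single column, and the output keeps the column layout of df1.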
After reordering the operations, your calculations can be simplified:
BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, change it to:
df1.iloc[4:].rdiv(...
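For comparison, the loop from the question can also be vectorised with NumPy windows. This is a sketch rather than the method used in the answer above; it assumes NumPy 1.20 or later (for sliding_window_view) and reuses df1 and df2 as defined in the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

# Shape (n_windows, n_columns, 3): the window axis is appended last.
windows = sliding_window_view(df1.to_numpy(dtype=float), 3, axis=0)
# Reorder to (n_windows, 3 rows, n_columns) so each block lines up with df2.
blocks = windows.transpose(0, 2, 1)

div = np.abs(df2.to_numpy() / blocks - 1) * 100
# Mean over the 3 rows of each block, then sum over the columns.
running_sum = div.mean(axis=1).sum(axis=1)

# The loop starts at index 4, i.e. at the window covering rows 2..4,
# so the first two windows are skipped to match its output.
result = running_sum[2:]
print(result[:3])
```

The first and last entries of result agree with the looped output in the question (about 330.423 and 204.167).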
My input dataframe:
   MinA  MinB  MaxA  MaxB
0     1     2     5     7
1     1     0     8     6
2     2   NaN    15    15
3   NaN     3   NaN   NaN
4   NaN   NaN   NaN    10
I want to merge the "min" columns into one column and the "max" columns into another, with priority (A columns take priority over B columns).
If both columns are null, the result should fall back to a default value: 0 for min and 100 for max.
The desired output is:
   MinA  MinB  MaxA  MaxB  Min  Max
0     1     2     5     7    1    5
1     1     0     8     6    1    8
2     2   NaN    15    15    2   15
3   NaN     3   NaN   NaN    3  100
4   NaN   NaN   NaN    10    0   10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
'MinA': [1,1,2,None,None],
'MinB': [2,0,None,3,None],
'MaxA': [5,8,15,None,None],
'MaxB': [7,6,15,None,10],
})
# Create new Column, using A as the base, if it is Nan, then use B.
# Then do the same again using specified values
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 NaN 15 15 2 15
3 NaN 3 NaN NaN 3 100
4 NaN NaN NaN 10 0 10
Just using fillna() is enough:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
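If there are more than two candidate columns, chaining fillna calls gets long. One possible generalisation is to back-fill across the columns in priority order and take the first one. This is a sketch; first_valid is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'MinA': [1, 1, 2, None, None],
    'MinB': [2, 0, None, 3, None],
    'MaxA': [5, 8, 15, None, None],
    'MaxB': [7, 6, 15, None, 10],
})


def first_valid(frame, columns, default):
    """Hypothetical helper: first non-null value per row, scanning
    `columns` from highest to lowest priority, else `default`."""
    # bfill(axis=1) pulls the next non-null value leftwards into each gap,
    # so the first column ends up holding the highest-priority value.
    return frame[columns].bfill(axis=1).iloc[:, 0].fillna(default)


df['Min'] = first_valid(df, ['MinA', 'MinB'], 0)
df['Max'] = first_valid(df, ['MaxA', 'MaxB'], 100)
print(df)
```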
I have a dataframe which can be generated from the code given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],'date3derived':[0,0,0],'val3':[7,9,11]})
The dataframe looks as shown below
I would like to keep the records of each person as separate rows, not as columns as in the screenshot above. In addition, I want the date1derived, date2derived, ... columns to be dropped.
I did try below approaches but they didn't provide the expected output
1) df.set_index(['person_id']).stack()/unstack
2) df.set_index(['person_id','date1','date2','date3']).stack()/unstack()
3) df.set_index('person_id').unstack()/stack
How can I get an output to be like this? I have more than 600 columns, so I don't think writing the column names manually would help me.
This is a wide_to_long problem:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
date val
person_id grp
1 1 12/31/2007 2
2 12/31/2017 1
3 12/31/2027 7
2 1 11/25/2009 4
2 11/25/2019 3
3 11/25/2029 9
3 1 10/06/2005 6
2 10/06/2015 5
3 10/06/2025 11
To match your expected output:
df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
df = df.reset_index(level=1, drop=True).reset_index()
person_id date val
0 1 12/31/2007 2
1 1 12/31/2017 1
2 1 12/31/2027 7
3 2 11/25/2009 4
4 2 11/25/2019 3
5 2 11/25/2029 9
6 3 10/06/2005 6
7 3 10/06/2015 5
8 3 10/06/2025 11
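The question also asks for the date1derived, date2derived, ... columns to be dropped. One way to make that explicit is to filter them out before reshaping, since wide_to_long carries along columns that do not match a stub. A sketch, assuming every unwanted column name ends in "derived" (trimmed and long_df are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
                   'date1derived': [0, 0, 0], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
                   'date2derived': [0, 0, 0], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'],
                   'date3derived': [0, 0, 0], 'val3': [7, 9, 11]})

# Keep only the columns whose name does not end in "derived".
trimmed = df.loc[:, ~df.columns.str.endswith('derived')]

long_df = (pd.wide_to_long(trimmed, stubnames=['date', 'val'],
                           i='person_id', j='grp')
             .sort_index()                      # order by person_id, then grp
             .reset_index(level='grp', drop=True)
             .reset_index())
print(long_df)
```

Because the filter works on name patterns, it does not require listing the 600+ columns by hand.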
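You can also do it without wide_to_long(), using a loop and pd.concat() (DataFrame.append() was removed in pandas 2.0):
frames = []
for i in range(1, 4):
    new_df = df[['person_id', f'date{i}', f'val{i}']].copy()
    new_df.columns = ['person_id', 'date', 'val']
    frames.append(new_df)
df2 = pd.concat(frames)
# a stable sort keeps date1, date2, date3 in order within each person
df2.sort_values('person_id', kind='stable').reset_index(drop=True)
Output: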
person_id date val
0 1 12/31/2007 2
1 1 12/31/2017 1
2 1 12/31/2027 7
3 2 11/25/2009 4
4 2 11/25/2019 3
5 2 11/25/2029 9
6 3 10/06/2005 6
7 3 10/06/2015 5
8 3 10/06/2025 11
I have a dataframe as show below:
df =
index value1 value2 value3
001 0.3 1.3 4.5
002 1.1 2.5 3.7
003 0.1 0.9 7.8
....
365 3.4 1.2 0.9
The index is the day of the year (so the last index value is sometimes 366). I want to group the rows into blocks of an arbitrary number of days (for example 10 days or 30 days). I think the code would look something like this:
df_new = df.groupby( "method" ).mean()
In some questions I saw that people used a datetime type to group by. However, in my dataframe the index values are just numbers. Is there a better way to group them? Thanks in advance!
I think you need to floor-divide the index values and aggregate with mean:
df_new = df.groupby( df.index // 10).mean()
A more general solution, for when the index is not the default unique numeric index:
df_new = df.groupby( np.arange(len(df.index)) // 10).mean()
Sample:
c = 'val1 val2 val3'.split()
df = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=c)
print (df)
val1 val2 val3
0 5 9 4
1 5 7 1
2 8 3 5
3 2 4 2
4 2 8 4
5 8 5 6
6 0 9 8
7 2 3 6
8 7 0 0
9 3 3 5
10 6 6 3
11 8 9 6
12 5 1 6
13 1 5 9
14 1 4 5
15 3 2 2
16 4 5 4
17 3 5 1
18 9 4 5
19 9 8 7
df_new = df.groupby( df.index // 10).mean()
print (df_new)
val1 val2 val3
0 4.2 5.1 4.1
1 4.9 4.9 4.8
Just create a new grouping key with the floor-division operator // and group by it. Here is an example with 155 rows. You can drop the original index column from the result.
df = pd.DataFrame({'index': list(range(1, 156)),
                   'val1': np.random.rand(155),
                   'val2': np.random.rand(155),
                   'val3': np.random.rand(155)})
df['new_index'] = df['index'] // 10
res = df.groupby('new_index', as_index=False).mean().drop(columns='index')
# new_index val1 val2 val3
# 0 0 0.315851 0.462080 0.491779
# 1 1 0.377690 0.566162 0.588248
# 2 2 0.314571 0.471430 0.626292
# 3 3 0.725548 0.572577 0.530589
# 4 4 0.569597 0.466964 0.443815
# 5 5 0.470747 0.394189 0.321107
# 6 6 0.362968 0.362278 0.415093
# 7 7 0.403529 0.626155 0.322582
# 8 8 0.555819 0.415741 0.525251
# 9 9 0.454660 0.336846 0.524158
# 10 10 0.435777 0.495191 0.380897
# 11 11 0.345916 0.550897 0.487255
# 12 12 0.676762 0.464794 0.612018
# 13 13 0.524610 0.450550 0.472724
# 14 14 0.466074 0.542736 0.680481
# 15 15 0.456921 0.565800 0.442543
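One detail to watch with day numbers that start at 1: index // 10 puts only days 1 to 9 in the first group. Subtracting 1 before dividing gives full blocks of 10. A minimal sketch with made-up data (the frame and the bins name are illustrative only):

```python
import numpy as np
import pandas as pd

days = np.arange(1, 366)  # day-of-year labels 1..365
df = pd.DataFrame({"value1": np.arange(365, dtype=float)}, index=days)

block = 10                      # block length in days; use 30 for 30-day groups
bins = (df.index - 1) // block  # days 1..10 -> 0, days 11..20 -> 1, ...

df_new = df.groupby(bins).mean()
sizes = df.groupby(bins).size()

print(sizes.head(3))   # full blocks of 10 rows each
print(sizes.iloc[-1])  # the last block holds the remaining 5 days
```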
I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should differ, though), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is an "index" row that doesn't make any sense to me, and it causes trouble in my subsequent code. I neither wrote the code that loads the files into the dataframes nor created those files. So I am mainly interested in how to check whether such a row exists and how I might remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The index is displayed like a first column, but it holds the row labels. By default it numbers the rows, but you can change it in a number of ways (e.g. by filling it with the values from one of the columns).
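To check for a named index and strip the name only when one is present, something along these lines should work. It is a sketch: strip_index_name is a hypothetical helper, and the small frame just mimics the second dataframe from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
df.index.name = "index"  # mimic the dataframe that prints an "index" line


def strip_index_name(frame):
    """Hypothetical helper: return the frame without an index name."""
    if frame.index.name is not None:
        # rename_axis(None) returns a copy whose index has no name.
        return frame.rename_axis(None)
    return frame


clean = strip_index_name(df)
print(df.index.name)     # index
print(clean.index.name)  # None
```

Only the label of the index changes; the data and the row labels themselves stay the same.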