Mean and standard deviation with multiple dataframes - python

I have multiple dataframes having the same columns and the same number of observations:
For example
import pandas as pd

d1 = {'ID': ['A','B','C','D'], 'Amount': [1,2,3,4]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': ['A','B','C','D'], 'Amount': [6,0,1,5]}
df2 = pd.DataFrame(data=d2)
d3 = {'ID': ['A','B','C','D'], 'Amount': [8,1,2,3]}
df3 = pd.DataFrame(data=d3)
I need to drop the ID D (and its corresponding value) in each of the dataframes and then, for each remaining ID, calculate the mean and standard deviation.
The expected output should be
   avg  std
A    5  ...
B  ...  ...
C  ...  ...
Generally, for one dataframe, I would use drop and then compute the average with mean() and the standard deviation with std().
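For reference, a minimal sketch of that single-dataframe approach, assuming the row with ID 'D' is filtered out first (the name s is just for illustration):

# single-dataframe version: filter out ID 'D', then aggregate the remaining amounts
s = df1[df1['ID'] != 'D'].set_index('ID')['Amount']
print(s.mean(), s.std())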
How can I do this in an easy and fast way with multiple dataframes? (I have at least 10 of them).

Use concat, remove D with DataFrame.query, and aggregate with GroupBy.agg and named aggregation:
df = (pd.concat([df1, df2, df3])
        .query('ID != "D"')
        .groupby('ID')
        .agg(avg=('Amount', 'mean'), std=('Amount', 'std')))
print(df)
    avg       std
ID
A     5  3.605551
B     1  1.000000
C     2  1.000000
Or remove D in the last step with DataFrame.drop:
df = (pd.concat([df1, df2, df3])
        .groupby('ID')
        .agg(avg=('Amount', 'mean'), std=('Amount', 'std'))
        .drop('D'))

You can use pivot_table as well:
import numpy as np
pd.concat([df1, df2, df3]).pivot_table(index='ID', aggfunc=[np.mean, np.std]).drop('D')
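Newer pandas versions may warn when NumPy callables are passed as aggregation functions, so a hedged variant of the same idea uses the string names instead (values='Amount' is added here only to be explicit):

pd.concat([df1, df2, df3]).pivot_table(index='ID', values='Amount', aggfunc=['mean', 'std']).drop('D')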

Related

How to aggregate, combining dataframes, with pandas groupby

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:
Original dataframe:
name table
Bob Pandas df1
Joe Pandas df2
Bob Pandas df3
Bob Pandas df4
Emily Pandas df5
After groupby:
name table
Bob Pandas df containing the appended df1, df3, and df4
Joe Pandas df2
Emily Pandas df5
I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.
df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))
I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.
Thanks for your help!!
Given 3 dataframes
import pandas as pd
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
Given another dataframe, with dataframes in the columns
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})
# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
Each group of dataframes can be combined into a single dataframe by using .groupby, aggregating a list for each group, and combining the dataframes in the list with pd.concat.
# if there is only one column, or if there are multiple columns of dataframes to aggregate
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
# display(dfg.loc['Bob', 'table'])
       a
0      1
1      2
2      3
3      a
4      b
5      c
6    pie
7  steak
8   milk
# to specify a single column, or specify multiple columns, from many columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
Not a duplicate
Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.
df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
However, these all result in a StopIteration error when there are dataframes to aggregate.
Here, let's create a dataframe with dataframes as column values:
First, I start with three dataframes:
import pandas as pd
# creating dataframes that we will assign to Bob and Joe; notice the b's and j's:
df1 = pd.DataFrame({'var1':[12, 34, -4, None], 'letter':['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1':[1, 23, 44, 0], 'letter':['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1':[22, -3, 7, 78], 'letter':['b5', 'b6', 'b7', 'b8']})
# let's make a list of dictionaries:
list_of_dfs = [
    {'name': 'Bob', 'table': df1},
    {'name': 'Joe', 'table': df2},
    {'name': 'Bob', 'table': df3}
]
# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)
original_df.shape #shows (3, 2)
Now that we have the original dataframe as input, we can produce the resulting new dataframe. In doing so, we use groupby(), agg(), and pd.concat(). We also reset the index.
new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)
#check that Bob's table is now a concatenated table of df1 and df3:
new_df[new_df['name']=='Bob']['table'][0]
The output to the last line of code is:
   var1 letter
0  12.0     b1
1  34.0     b2
2  -4.0     b3
3   NaN     b4
0  22.0     b5
1  -3.0     b6
2   7.0     b7
3  78.0     b8
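If you would rather have a fresh 0..7 index instead of the repeated 0..3 indices shown above, a hedged variant is to let pd.concat renumber the rows:

# ignore_index=True renumbers the concatenated rows 0..n-1
new_df = original_df.groupby('name')['table'].agg(
    lambda series: pd.concat(series.tolist(), ignore_index=True)).reset_index()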

Performing a correlation on multiple columns in pandas

Is it possible to do a correlation between multiple columns against one column in pandas? Like:
DF[['A']['B']].corr(DF['C'])
I believe you need corrwith, selecting multiple columns with a list:
DF = pd.DataFrame({
    'B': [4,5,4,5,5,4],
    'C': [7,8,9,4,2,3],
    'A': [1,3,5,7,1,0],
})
print (DF[['A', 'B']].corrwith(DF['C']))
A    0.319717
B   -0.316862
dtype: float64
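If you want every other column against 'C' without listing them, a hedged variant (assuming the same DF) is:

# correlate all columns except 'C' against 'C'
print(DF.drop(columns='C').corrwith(DF['C']))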

Index columns disappeared after lambda function in Pandas

I wanted to calculate the percentage of some object in one hour ('Time'), so I tried to write a lambda function. I think it does the job, but the index columns disappeared, i.e. the columns the dataframe is grouped by.
df = df.groupby(['id', 'name', 'time', 'object', 'type'], as_index=True, sort=False)['col1', 'col2', 'col3', 'col4', 'col5'].apply(lambda x: x * 100 / 3600).reset_index()
After that code I printed df.columns and got this:
Index([u'index', u'col1', u'col2', u'col3', u'col4', u'col5'],
      dtype='object')
If there is a need, I can add a table with values for each column.
Thanks in advance.
Moving the loop outward will make the code run significantly faster:
for c in ['col1', 'col2', 'col3', 'col4', 'col5']:
    df[c] *= 100. / 3600
This is because each column's calculation inside the loop is done in a vectorized way.
This also won't modify the index in any way.
pd.DataFrame.groupby is used to aggregate data, not to apply a function to multiple columns.
For simple functions, you should look for a vectorised solution. For example:
# set up simple dataframe
df = pd.DataFrame({'id': [1, 2, 1], 'name': ['A', 'B', 'A'],
                   'col1': [5, 6, 8], 'col2': [9, 4, 5]})
# apply logic in a vectorised way on multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].values * 100 / 3600
If you wish to set your index as multiple columns, and are keen to use pd.DataFrame.apply, this is possible as two separate steps. For example:
df = df.set_index(['id', 'name'])
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda x: x * 100 / 3600)
You apply .reset_index(), which resets the index. Take a look at the pandas documentation and you'll see that .reset_index() moves the index into the columns.
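A minimal, self-contained sketch of that effect (tmp and out are hypothetical names, not from the question):

# grouping keys end up in the index after aggregation...
tmp = pd.DataFrame({'id': [1, 2, 1], 'name': ['A', 'B', 'A'], 'col1': [5, 6, 8]})
out = tmp.groupby(['id', 'name']).sum()
print(out.index.names)                     # ['id', 'name']
# ...and reset_index() moves them back into ordinary columns
print(out.reset_index().columns.tolist())  # ['id', 'name', 'col1']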
Data from jpp's answer:
df[['col1', 'col2']] *= 100 / 3600
df
Out[110]:
       col1      col2  id name
0  0.138889  0.250000   1    A
1  0.166667  0.111111   2    B
2  0.222222  0.138889   1    A

Pandas create a custom groupby aggregation for column

Is there a way in Pandas to create a new column that is a function of two column's aggregation, so that for any arbitrary grouping it preserves the function? This would be functionally similar to creating a calculated column in excel and pivoting by labels.
df1 = pd.DataFrame({'lab':['lab1','lab2']*5,'A':[1,2]*5,'B':[4,5]*5})
df1['C'] = df1.apply(lambda x: x['A']/x['B'],axis=1)
pd.pivot_table(df1,index='lab',{'A':sum,'B':sum,'C':lambda x: x['A']/x['B']})
should return:
|lab |A  |B  |C   |
|----|---|---|----|
|lab1|5  |20 |.25 |
|lab2|10 |25 |.4  |
I'd like to aggregate by 'lab' (or any combination of labels) and have the dataframe return the aggregation without having to re-define the column calculation. I realize this is trivial to code manually, but it's repetitive when you have many columns.
There are two ways you can do this, using apply or agg:
import numpy as np
import pandas as pd
# Method 1
df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(), 'B': df['B'].sum(), 'C': df['C'].unique()[0]})).reset_index()
# Method 2
df1.groupby('lab').agg({'A': 'sum',
                        'B': 'sum',
                        'C': lambda x: np.unique(x)}).reset_index()
# output
    lab   A   B     C
0  lab1   5  20  0.25
1  lab2  10  25  0.40
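An alternative that matches the "calculated column that survives any grouping" idea is to aggregate A and B and then recompute C from the aggregated values; a hedged sketch, assuming C should be the ratio of the summed columns:

# sum A and B per group, then recompute C = A/B on the aggregated values
out = (df1.groupby('lab')[['A', 'B']].sum()
          .assign(C=lambda d: d['A'] / d['B'])
          .reset_index())
print(out)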

How to use groupby agg and rename functions for all columns

Question
How do I get the following result without having to assign a function dictionary for every column?
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'},
                         'two': {'SUM': 'sum', 'HowMany': 'count'}})
What I've done so far
Consider the df:
import pandas as pd
import numpy as np
idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
                                 names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])
df
I want to use groupby().agg() where I run the set of functions and rename their output columns.
This works fine.
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'}})
But I want to do this for all columns. I could do this:
df.groupby(level=0).agg(['sum', 'count'])
But I'm missing the great renaming I've done. I'd hoped that this would work:
df.groupby(level=0).agg({'SUM': 'sum', 'HowMany': 'count'})
But it doesn't. I get this error:
KeyError: 'SUM'
This makes sense. Pandas is looking at the keys of the passed dictionary for column names. It's how I got the example at the start to work.
You can use set_levels:
g = df.groupby(level=0).agg(['sum', 'count'])
g.columns.set_levels(['SUM', 'HowMany'], 1, inplace=True)
g
>>>
      one          two
      SUM HowMany  SUM HowMany
Alpha
A       2       2    4       2
B      10       2   12       2
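On newer pandas versions, calling set_levels with a positional level and inplace=True is deprecated or removed, so a hedged variant is to assign the rebuilt column index back:

# same idea without inplace: rebuild the column MultiIndex and assign it back
g = df.groupby(level=0).agg(['sum', 'count'])
g.columns = g.columns.set_levels(['SUM', 'HowMany'], level=1)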
Is using .rename() an option for you?
In [7]: df.groupby(level=0).agg(['sum', 'count']).rename(columns=dict(sum='SUM', count='HowMany'))
Out[7]:
      one          two
      SUM HowMany  SUM HowMany
Alpha
A       2       2    4       2
B      10       2   12       2
This is an ugly answer:
gb = df.stack(0).groupby(level=[0, -1])
df1 = gb.agg({'SUM': 'sum', 'HowMany': 'count'})
df1.unstack().swaplevel(0, 1, 1).sort_index(1, 0)
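If you are on an older pandas where nested renaming dictionaries are still supported (as in the question's own snippet), another hedged option is to build the per-column dictionary programmatically instead of typing it out:

# build the same nested renaming dict for every column (relies on the old nested-dict behaviour)
spec = {col: {'SUM': 'sum', 'HowMany': 'count'} for col in df.columns}
df.groupby(level=0).agg(spec)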
