How to choose multiple columns in aggregate functions? - Python

I have data like this:
A,B,C,D
1,50,1,3.9
2,20,22,1.5
3,10,10,2.3
2,15,11,1.8
1,16,13,4.2
and I want to group them by A, taking the mean of B and C and the sum of D.
The solution would be like this:
df = df.groupby(['A']).agg({
    'B': 'mean', 'C': 'mean', 'D': sum
})
I am asking whether there is a way to assign the same function to multiple columns rather than repeating it, as in the case of B and C.

If you require at most one aggregation per column, you can store the aggregations in a dict {func: col_list}, then unpack it when you aggregate.
d = {'mean': ['B', 'C'], sum: ['D']}
df.groupby(['A']).agg({col: f for f,cols in d.items() for col in cols})
#       B     C    D
# A
# 1  33.0   7.0  8.1
# 2  17.5  16.5  3.3
# 3  10.0  10.0  2.3
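Another option, if you'd rather not build the flat dict at all, is to apply each function to its whole column list and concat the pieces; a rough sketch using the same d mapping as above:
# sketch: aggregate each column list with its function, then line the pieces up on A
parts = [df.groupby('A')[cols].agg(f) for f, cols in d.items()]
out = pd.concat(parts, axis=1)  # columns B, C (means) and D (sum), same table as above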

Related

Python Pandas: group by and take average, count, median

Suppose I have a dataframe that looks like this
d = {'User': ['A', 'A', 'B', 'C', 'C', 'C'],
     'time': [1, 2, 3, 4, 4, 4],
     'state': ['CA', 'CA', 'ID', 'OR', 'OR', 'OR']}
df = pd.DataFrame(data=d)
Now suppose I want to create a new dataframe that takes the average and median of time, grabs the user's state, and also generates a new column that counts the number of times that user appears in the User column, i.e.
d = {'User': ['A', 'B', 'C'],
     'avg_time': [1.5, 3, 4],
     'median_time': [1.5, 3, 4],
     'state': ['CA', 'ID', 'OR'],
     'user_count': [2, 1, 3]}
df_res = pd.DataFrame(data=d)
I know that I can do a group by mean statement like this
df.groupby(['User'], as_index=False).mean().groupby('User')['time'].mean()
This gives me a pandas Series, and I assume I can turn that into a dataframe if I want to, but how would I do the above for all the other columns I am interested in?
Try using pd.NamedAgg:
df.groupby('User').agg(avg_time=('time', 'mean'),
                       median_time=('time', 'median'),
                       state=('state', 'first'),
                       user_count=('time', 'count')).reset_index()
Output:
  User  avg_time  median_time state  user_count
0    A       1.5          1.5    CA           2
1    B       3.0          3.0    ID           1
2    C       4.0          4.0    OR           3
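For reference, the tuple shorthand above is equivalent to spelling out pd.NamedAgg objects explicitly; a sketch of the same call:
df.groupby('User').agg(
    avg_time=pd.NamedAgg(column='time', aggfunc='mean'),
    median_time=pd.NamedAgg(column='time', aggfunc='median'),
    state=pd.NamedAgg(column='state', aggfunc='first'),
    user_count=pd.NamedAgg(column='time', aggfunc='count'),
).reset_index()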
You can even pass multiple aggregate functions per column in the form of a dictionary, something like this:
out = df.groupby('User').agg({'time': ['mean', 'median'], 'state': ['first']})
     time        state
     mean median first
User
A     1.5    1.5    CA
B     3.0    3.0    ID
C     4.0    4.0    OR
It gives multi-level columns; you can either drop a level or just join the levels:
>>> out.columns = ['_'.join(col) for col in out.columns]
     time_mean  time_median state_first
User
A          1.5          1.5          CA
B          3.0          3.0          ID
C          4.0          4.0          OR

How to get pandas crosstab to sum up values for multiple columns?

Let's assume we have a table like:
id chr val1 val2
... A 2 10
... B 4 20
... A 3 30
...and we'd like to have a contingency table like this (grouped by chr, thus using 'A' and 'B' as the row indices and then summing up the values for val1 and val2):
val1 val2 total
A 5 40 45
B 4 20 24
total 9 60 69
How can we achieve this?
pd.crosstab(index=df.chr, columns=["val1", "val2"]) looked quite promising but it just counts the rows and does not sum up the values.
I have also tried (numerous times) to supply the values manually...
pd.crosstab(
index=df.chr.unique(),
columns=["val1", "val2"],
values=[
df.groupby("chr")["val1"],
df.groupby("chr")["val2"]
],
aggfunc=sum
)
...but this always ends up in shape mismatches and when I tried to reshape via NumPy:
values=np.array([
df.groupby("chr")["val1"].values,
df.groupby("chr")["val2"].values
].reshape(-1, 2)
...crosstab tells me that it expected 1 value instead of the two given for each row.
import pandas as pd

df = pd.DataFrame({'chr': {0: 'A', 1: 'B', 2: 'A'},
                   'val1': {0: 2, 1: 4, 2: 3},
                   'val2': {0: 10, 1: 20, 2: 30}})
# aggregate values by chr (chr becomes the index)
df = df.groupby('chr').sum()
# totals row: sum of each column
df.loc['total', :] = df.sum()
# totals column: sum of each row
df['total'] = df.sum(axis=1)
Output
val1 val2 total
chr
A 5.0 40.0 45.0
B 4.0 20.0 24.0
total 9.0 60.0 69.0
What you want is pivot_table:
table = pd.pivot_table(df, values=['val1', 'val2'], index=['chr'], aggfunc='sum')
table['total'] = table['val1'] + table['val2']
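If you also want the totals row from the question, pivot_table has a margins option that should add it for you; a sketch (margins_name is only a label):
table = pd.pivot_table(df, values=['val1', 'val2'], index=['chr'],
                       aggfunc='sum', margins=True, margins_name='total')
table['total'] = table['val1'] + table['val2']  # row totals, including for the totals row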

Taking columns from a dataframe based on row values of another dataframe in Python?

I am working with 2 dataframes and am trying to create multiple dfs from df1 based on the row values of df2. I am unable to find any documentation on how to get this done.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'A': 'foo bar bro bir fin car zoo loo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8), 'D': np.arange(8) * 2
})
print(df1)

df2 = pd.DataFrame({
    'col1': 'foo bar bro bir'.split(),
    'col2': 'B B C B'.split(),
    'col3': 'D C D D'.split()
})
print(df2)
How do I create a dataframe called 'foo' which takes only columns B and D from df1 (as specified by df2)?
Same for the other dataframes 'bar', 'bro' & 'bir'. So an example of the output of df_foo & df_bar would be
df_foo = pd.DataFrame({'B': 'one', 'D': 0})
df_bar = pd.DataFrame({'B': 'one', 'C': 1})
I could not find any documentation on how this can be done.
What about using loc for (label based) indexing? An example:
df1_ = df1.set_index('A') # use column A to "rename" rows.
print(df1_.loc[('foo',), ('B', 'D')]) # use `.loc` to access values via their label coordinates.
#
# B D
# A
# foo one 0
So, to build a new dataframe by taking df2's rows as input to be used within df1, you can do
df_all = pd.concat((
df1_.loc[(row.col1,), (row.col2, row.col3)]
for _, row in df2.iterrows()
))
print(df_all)
# B C D
# A
# foo one NaN 0.0
# bar one 1.0 NaN
# bro NaN 2.0 4.0
# bir three NaN 6.0
And finally, an example with 'bar' (replace 'bar' with 'foo' or any other name):
df_bar = df_all.loc['bar'].dropna()
print(df_bar)
# B one
# C 1
# Name: bar, dtype: object
# or, to keep playing with dataframes
print( df_all.loc[('bar',), :].dropna(axis=1) )
# B C
# A
# bar one 1.0
If you have more than 3 columns, let's say 70-80 columns in df1, something you can do is:
idx = 'col1'
cols = [c for c in df2.columns.tolist() if c != idx]
df_agno = pd.concat((
df1_.loc[
(row[idx],), row[cols]
] for _, row in df2.iterrows()
))
print(df_agno)
# B C D
# A
# foo one NaN 0.0
# bar one 1.0 NaN
# bro NaN 2.0 4.0
# bir three NaN 6.0
print( df_agno.loc[('bar',), :].dropna(axis=1) )
# B C
# A
# bar one 1.0
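If you really want one named dataframe per row of df2 rather than the concatenated df_all, a sketch that collects them in a dict (frames is just an illustrative name; df1_ is the reindexed frame from above):
frames = {
    row['col1']: df1_.loc[[row['col1']], [row['col2'], row['col3']]]
    for _, row in df2.iterrows()
}
df_foo = frames['foo']  # one row, columns B and D
df_bar = frames['bar']  # one row, columns B and C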

Expanding an array of dictionaries in pandas

So I have a pandas dataframe with what amounts to an array of dictionaries inside it, and I'm struggling with how to turn these into columns named after the keys in the original dictionaries.
df3 = pd.DataFrame({'SomeCol':
                    ["[{'Source': 'A', 'Value': '4.7'}]",
                     "[{'Source': 'A', 'Value': '8.2'},"
                     "{'Source': 'B', 'Value': '100%'}]",
                     "[{'Source': 'A', 'Value': '8.1'}, "
                     "{'Source': 'C', 'Value': '870'},"
                     "{'Source': 'B', 'Value': '98%'}]",
                     "[{}]"],
                    'Other Stuff': ['One', 'Two', 'Three', 'Four']
                    })
I would like to have the following result
A B C
0 4.7 na na
1 8.2 100% na
2 8.1 98% 870
I have tried
data.map(eval).apply(pd.Series)
and also numerous variations on the theme
def f2(x):
df_r = pd.DataFrame()
for i in x:
df_r = pd.DataFrame.from_dict(x, orient='columns')
return df_r
dfa = pd.concat([df3, df3['SomeCol'].map(eval).apply(f2)])
I seem to be missing something important. The closest I've come is the result of the first pass of calling the f2 function, which looks like this:
Source Value
0 A 4.7
0 A 8.2
1 B 100%
0 A 8.1
1 C 870
2 B 98%
But when I concat them together I get a mess. Just some help on where to go from here would be appreciated. I've spent the last two days struggling with both a simple approach and a brute-force one, and neither seems to cut it.
You can create dictionaries with ast.literal_eval to convert the strings to dicts:
import ast
out = [{x.get('Source'):x.get('Value') for x in ast.literal_eval(v)}
for k, v in df3.pop('SomeCol').items()]
print (out)
[{'A': '4.7'}, {'A': '8.2', 'B': '100%'}, {'A': '8.1', 'C': '870', 'B': '98%'}, {None: None}]
Then pass them to the DataFrame constructor and remove all-NaN columns with DataFrame.dropna:
df = pd.DataFrame(out, index=df3.index).dropna(how='all', axis=1)
print (df)
A B C
0 4.7 NaN NaN
1 8.2 100% NaN
2 8.1 98% 870
3 NaN NaN NaN
Last, DataFrame.join back to the original:
df = df3.join(df)
print (df)
Other Stuff A B C
0 One 4.7 NaN NaN
1 Two 8.2 100% NaN
2 Three 8.1 98% 870
3 Four NaN NaN NaN
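For reference, the same three steps can be chained into a single expression; a sketch assuming df3 still contains SomeCol (the version above popped it):
import ast
import pandas as pd

expanded = pd.DataFrame(
    [{d.get('Source'): d.get('Value') for d in ast.literal_eval(v)}
     for v in df3['SomeCol']],
    index=df3.index,
).dropna(how='all', axis=1)
result = df3.drop(columns='SomeCol').join(expanded)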

Pandas grouping and summing just a certain column

Below is a minimal example showing the problem that I am facing. Let our initial state be the following (I only use a dictionary for the purpose of demonstration):
A = [{'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 2}, {'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 4}, {'D': '16.5.2013', 'A':1, 'B': 0.5, 'C': 7}]
df = pd.DataFrame(A)
>>> df
A B C D
0 1 0.0 2 16.5.2013
1 1 0.0 4 16.5.2013
2 1 0.5 7 16.5.2013
How do I get from df to df_new which is:
A_new = [{'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 6}, {'D': '16.5.2013', 'A':1, 'B': 0.5, 'C': 7}]
df_new = pd.DataFrame(A_new)
>>> df_new
A B C D
0 1 0.0 6 16.5.2013
1 1 0.5 7 16.5.2013
The first and the second rows of the 'C' column are summed because 'B' is the same for these two rows. The rest is left the same; for instance, column 'A' is not summed and column 'D' is unchanged. How do I do that, assuming I only have df and I want to get df_new? I would really like to find some kind of elegant solution if possible.
Thanks in advance.
Assuming the other columns are always the same, and should not be treated specially.
First create df_new grouped by B, taking for each column the first value in the group:
In [17]: df_new = df.groupby('B', as_index=False).first()
and then specifically calculate the C column as a sum for each group:
In [18]: df_new['C'] = df.groupby('B', as_index=False)['C'].sum()['C']
In [19]: df_new
Out[19]:
B A C D
0 0.0 1 6 16.5.2013
1 0.5 1 7 16.5.2013
If you have a limited number of columns, you can also do this in one step by specifying the desired function for each column (though the above will be handier, i.e. less manual, if you have more columns):
In [20]: df_new = df.groupby('B', as_index=False).agg({'A':'first', 'C':'sum', 'D':'first'})
If A and D are always equal when grouping by B, then you can just group by A, B, D, and sum C:
df.groupby(['A', 'B', 'D'], as_index=False).agg('sum')
Output:
A B D C
0 1 0.0 16.5.2013 6
1 1 0.5 16.5.2013 7
Alternatively:
You essentially want to aggregate the data grouped by column 'B'. To aggregate column C you will just use the built-in sum function. For the other columns, you basically just want to select a sole value, since you believe they are always the same within groups. To do that, just write a very simple function that aggregates those columns by taking the first value.
# will take first value of the grouped data
sole_value = lambda x : list(x)[0]
#dictionary that maps columns to aggregation functions
agg_funcs = {'A' : sole_value, 'C' : sum, 'D' : sole_value}
#group and aggregate
df.groupby('B', as_index = False).agg(agg_funcs)
Output:
B A C D
0 0.0 1 6 16.5.2013
1 0.5 1 7 16.5.2013
Of course you really need to be sure that the values in columns A and D are indeed equal within each group, otherwise you might be preserving the wrong data.
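On pandas 0.25+ the same thing can also be written with named aggregation, which avoids the lambda; a sketch:
df_new = (df.groupby('B', as_index=False)
            .agg(A=('A', 'first'), C=('C', 'sum'), D=('D', 'first')))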
