I have a large Pandas DataFrame and I want to aggregate the columns differently. There are 24 columns (hours of the day) that I would like to sum, and for all others I just want to take the maximum.
I know that I can manually write out the required conditions like this:
df_agg = df.groupby('user_id').agg({'hour_0': 'sum',
                                    'hour_1': 'sum',
                                    ...
                                    'hour_23': 'sum',
                                    'all other columns': 'max'})
but I was wondering whether a more elegant solution exists, along the lines of:
df_agg = df.groupby('user_id').agg({'hour_*': 'sum',
                                    'all other columns != hour_*': 'max'})
You can build a dictionary mapping all columns that start with hour to 'sum', add all other columns to another dictionary mapping to 'max', merge the two, and finally pass the result to agg:
c1 = df.columns[df.columns.str.startswith('hour')].tolist()
#also exclude the user_id column to avoid aggregating it with max
c2 = df.columns.difference(c1 + ['user_id'])
#https://stackoverflow.com/a/26853961
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}
df_agg = df.groupby('user_id').agg(d)
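For example, on a small made-up frame (the hour_0, hour_1, and clicks columns are hypothetical, just for illustration), d comes out as {'hour_0': 'sum', 'hour_1': 'sum', 'clicks': 'max'}:
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2],
                   'hour_0': [1, 2, 3],
                   'hour_1': [0, 5, 1],
                   'clicks': [7, 2, 9]})
c1 = df.columns[df.columns.str.startswith('hour')].tolist()
c2 = df.columns.difference(c1 + ['user_id'])
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}
df_agg = df.groupby('user_id').agg(d)
#         hour_0  hour_1  clicks
#user_id
#1             3       5       7
#2             3       1       9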
Or you can call groupby twice and concat the results:
df_agg = pd.concat([df.groupby('user_id')[c1].sum(),
                    df.groupby('user_id')[c2].max()], axis=1)
Let's say I have this dataframe:
df = pd.DataFrame({'col_1': ['yes','no'], 'test_1':['a','b'], 'test_2':['a','b']})
What I want is to group by all the columns except the first one and aggregate the rows whose groups are the same.
This is what I'm trying:
col_names = df.columns.to_list()
df_out = df.groupby([col_names[1:]])[col_names[0]].agg(list)
This is my end data frame goal:
df = pd.DataFrame({'col_1': [['yes','no']], 'test_1':['a'], 'test_2':['b']})
And if I have more rows, I want it to follow the same principle: join into a list the values whose groups are the same based on columns [1:] (from the second to the end).
Using the pandas agg() method:
df = df.groupby(df.columns.difference(["col_1"]).tolist()).agg(
    lambda x: x.tolist()).reset_index()
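As a quick check, here is the same line on a made-up three-row variant of the frame in which the grouping columns actually repeat (the data is hypothetical, for illustration only):
import pandas as pd

df = pd.DataFrame({'col_1': ['yes', 'no', 'maybe'],
                   'test_1': ['a', 'a', 'b'],
                   'test_2': ['a', 'a', 'b']})
df_out = df.groupby(df.columns.difference(['col_1']).tolist()).agg(
    lambda x: x.tolist()).reset_index()
#  test_1 test_2      col_1
#0      a      a  [yes, no]
#1      b      b    [maybe]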
I have a dataset with a set of columns I want to sum for each row. The columns in question all follow a specific naming pattern that I have been able to group in the past via the .sum() function:
pd.DataFrame.sum(data.filter(regex=r'_name$'),axis=1)
Now I need to perform the same summation, but grouped by the values of a column:
data.groupby('group').sum(data.filter(regex=r'_name$'),axis=1)
However, this does not work, since .sum() on a groupby does not accept filtered columns as an argument. Is there another way to approach this while keeping my data.filter() code?
Example toy dataset below; the real dataset contains over 500 columns, and the columns are not cleanly ordered:
toy_data = {'id': [1,2,3,4,5,6],
            'group': ["a","a","b","b","c","c"],
            'a_name': [1,6,7,3,7,3],
            'b_name': [4,9,2,4,0,2],
            'c_not': [5,7,8,4,2,5],
            'q_name': [4,6,8,2,1,4]}
df = pd.DataFrame(toy_data, columns=['id','group','a_name','b_name','c_not','q_name'])
Edit: Missed this in the original post. My objective is to get a variable "sum" holding the summation of all the selected columns, as shown below:
You can filter first and then pass df['group'] instead of 'group' to groupby; last, add the sum column with DataFrame.assign:
df1 = (df.filter(regex=r'_name$')
         .groupby(df['group']).sum()
         .assign(sum=lambda x: x.sum(axis=1)))
An alternative is to filter the column names and pass them after groupby:
cols = df.filter(regex=r'_name$').columns
df1 = df.groupby('group')[cols].sum()
Or:
cols = df.columns[df.columns.str.contains(r'_name$')]
df1 = df.groupby('group')[cols].sum().assign(sum = lambda x: x.sum(axis=1))
print (df1)
a_name b_name q_name sum
group
a 7 13 10 30
b 10 6 10 26
c 10 2 5 17
I have a dataframe like this:
df_test = pd.DataFrame({'ID1': ['A','A','A','A','A','A','A','A','A','A'],
                        'ID2': ['a','a','a','aa','aaa','aaa','b','b','b','b'],
                        'ID3': ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'],
                        'condition1': [1,2,1,1,1,1,1,2,1,1],
                        'condition2': [80,85,88,80,70,83,85,90,90,70]})
df_test
I want to pick values in ID3 after grouping by ['ID1','ID2','condition1'], where (1) if there is only one row in the group, that row is picked (such as c2, c4, and c8), and (2) if there is more than one row in the group, the row where condition2 is the maximum within the group is picked (such as c3, c6, and c9). The result will look like this:
df_test_result = pd.DataFrame({'ID1': ['A','A','A','A','A','A'],
                               'ID2': ['a','a','aa','aaa','b','b'],
                               'condition1': [2,1,1,1,2,1],
                               'condition2': [85,88,80,83,90,90],
                               'ID3': ['c2','c3','c4','c6','c8','c9']})
df_test_result
The process below appears to do it, but it is too inefficient (because I then need to concat the pieces back together):
groups = df_test.groupby(['ID1','ID2','condition1'])
for group in groups:
    dfi = group[1][group[1]['condition2'] == group[1]['condition2'].max()]
    print(dfi, '\n')
Your condition (1) is just a special case of (2), so you can always take the first row in each group after ordering by condition2:
(
    df_test
    .sort_values("condition2", ascending=False)  # sort everything by condition2
    .groupby(["ID1", "ID2", "condition1"])
    .first()        # select first row in each group (now ordered by condition2)
    .reset_index()  # restore the groupby columns
)
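An equivalent approach (not from the original answer, just a common alternative) is to ask each group for the index label of its condition2 maximum with idxmax and then select those rows, which also preserves the original column order:
idx = df_test.groupby(['ID1', 'ID2', 'condition1'])['condition2'].idxmax()
df_test_result = df_test.loc[idx].reset_index(drop=True)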
I have a dataframe df1 containing 1226 rows × 13 columns. I want to group it by the 'Region' column, but it is not working.
Try this out for grouping based on a column:
blockedGroup = df1.groupby('Region')
blocking_df = {}
for x in blockedGroup.groups:
    temp_df = blockedGroup.get_group(x)
    blocking_df[x] = temp_df  # key each sub-frame by its Region value
This collects the groups into a dict, where the keys are the unique values of Region and the values are DataFrames.
e.g : {"USA": DataFrame}
df.groupby does not change the dataframe in place; it returns a new grouped object. Instead, try doing it like this:
df2 = df1.groupby('Region')
df2
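Note that df2 here is a DataFrameGroupBy object, not a dataframe; you still need an aggregation step to see actual data. A minimal sketch (size is one of several built-in aggregations):
print(type(df2))   #<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
print(df2.size())  #number of rows per Region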
I have two pandas dataframes a_df and b_df, both with columns ID, atext, and var1-var25.
I want to add ONLY the corresponding var columns from a_df and b_df and leave ID and atext alone.
The code below adds ALL the corresponding columns. Is there a way to get it to add just the columns of interest?
absum_df=a_df.add(b_df)
What could I do to achieve this?
Use filter:
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
If you want to keep additional columns as-is, use concat after summing:
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
Alternatively, instead of subselecting columns from a_df, you could drop the columns of absum_df from a_df, if you want to keep every column of a_df that is not in absum_df:
absum_df = pd.concat([a_df.drop(absum_df.columns, axis=1), absum_df], axis=1)
You can subset a dataframe to particular columns:
var_columns = ['var{}'.format(i) for i in range(1, 26)]  #var1 through var25
absum_df=a_df[var_columns].add(b_df[var_columns])
Note that this will result in a dataframe with only the var columns. If you want a dataframe with the non-var columns from a_df, and the var columns being the sum of a_df and b_df, you can do
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
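A quick end-to-end check with made-up two-variable frames (var1/var2 stand in for var1-var25; the data is hypothetical):
import pandas as pd

a_df = pd.DataFrame({'ID': [1, 2], 'atext': ['x', 'y'],
                     'var1': [1, 2], 'var2': [3, 4]})
b_df = pd.DataFrame({'ID': [1, 2], 'atext': ['x', 'y'],
                     'var1': [10, 20], 'var2': [30, 40]})
var_columns = ['var{}'.format(i) for i in range(1, 3)]
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
#ID and atext are untouched copies from a_df; var1/var2 hold the sums
#   ID atext  var1  var2
#0   1     x    11    33
#1   2     y    22    44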