Different aggregation for dataframe with several columns - python

I am looking for a shortcut to reduce the manual grouping required:
I have a dataframe with many columns. When grouping the dataframe by 'Level', I want to aggregate two columns using nunique(), but all other columns (about 60 columns representing years from 2021 onward) using mean().
Does anyone have an idea how to specify 'the rest' of the columns?
Thanks!

I would do it the following way:
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 2], 'A': [1, 2, 3, 4, 5, 6],
                   'B': [1, 2, 3, 4, 5, 6], 'C': [7, 8, 9, 10, 11, 12],
                   'D': [13, 14, 15, 16, 17, 18], 'E': [19, 20, 21, 22, 23, 24]})
# map every column to mean ...
aggdct = dict.fromkeys(df.columns, pd.Series.mean)
# ... drop the groupby key ...
del aggdct['X']
# ... then override the one column that needs a distinct count
aggdct['A'] = pd.Series.nunique
print(df.groupby('X').agg(aggdct))
output

   A  B   C   D   E
X
1  3  2   8  14  20
2  3  5  11  17  23
Explanation: I prepare a dict describing how to aggregate each column. dict.fromkeys(df.columns, pd.Series.mean) builds a dict whose keys are the column names and whose values are all pd.Series.mean. I then remove the column used in the groupby, and change the selected column to hold pd.Series.nunique rather than pd.Series.mean.
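Applied to the question above, the same pattern handles the two nunique() columns and the ~60 year columns without listing the years out. A minimal sketch, assuming the groupby key is 'Level' and the two distinct-count columns are called 'col_a' and 'col_b' (hypothetical names standing in for the asker's real columns):

agg_map = dict.fromkeys(df.columns, 'mean')  # every column -> mean by default
del agg_map['Level']                         # drop the groupby key itself
agg_map['col_a'] = 'nunique'                 # override the two columns that
agg_map['col_b'] = 'nunique'                 # need a distinct count instead

result = df.groupby('Level').agg(agg_map)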

Related

How to drop dataframe columns using both, a list and not from a list?

I am trying to drop pandas columns in the following way. I have a list of columns to drop. This list will be used many times in my notebook. I also have 2 columns which are only referenced once:
drop_cols = ['var1', 'var2']
df = df.drop(columns={'var0', drop_cols})
So basically, I want to drop all columns in the list drop_cols, plus the hard-coded 'var0' column, in one swoop. This raises an error. How do I resolve it?
df = df.drop(columns=drop_cols+['var0'])
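Since drop_cols is reused many times across the notebook, one optional refinement (my assumption, not part of the original answer) is to pass errors='ignore', so the same call also works on dataframes that are missing some of the listed columns:

# errors='ignore' silently skips any listed column that is absent,
# making the shared drop_cols list safe to reuse across dataframes.
df = df.drop(columns=drop_cols + ['var0'], errors='ignore')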
From what I gather, you have a set of columns you wish to drop from several different dataframes, while also adding one more column unique to a given dataframe. The command you have used is close, but you can't build the combined list with a set literal like that. This is how I would approach the problem.
Given a Dataframe of the form:
   V0  V1  V2  V3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
Define a function to merge the column names:
def mergeNames(spc_col, multi_cols):
    # combine the single column name with the list of names
    rslt = [spc_col]
    rslt.extend(multi_cols)
    return rslt
Then with
drop_cols = ['V1', 'V2']
df.drop(columns=mergeNames('V0', drop_cols))
yields:
   V3
0   4
1   8
2  12

Filtering pandas dataframe using dictionary for column values

Premise
I need to use a dictionary as a filter on a large dataframe, where the key-value pairs are values in different columns.
This dictionary is obtained from a separate dataframe, using dict(zip(df.id_col, df.rank_col)) so if a dictionary isn't the best way to go, that is open to change.
This is very similar to this question: Filter a pandas dataframe using values from a dict, but it is fundamentally (I think) different because my dictionary contains column-paired values:
Example data
df_x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'B': [1, 1, 1, 0, 1, 0, 1, 0, 1],
                     'Rank': ['1', '2', '3', '1', '2', '3', '1', '2', '3'],
                     'D': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
filter_dict = {'1':'1', '2':'3', '3':'2'}
For this dataframe df_x, I want to apply the filter dictionary to a pair of columns, here id and Rank, so the dataframe is pared down to only the rows whose (id, Rank) pair matches a dictionary entry.
The actual source dataframe is approx 1M rows, and the dictionary is >100 key-value pairs.
Thanks for any help.
You can check with isin:
df_x[df_x[['id', 'Rank']].astype(str).apply(tuple, axis=1).isin(filter_dict.items())]
Out[182]:
   id  B Rank  D
0   1  1    1  1
5   2  0    3  6
7   3  0    2  8
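On roughly 1M rows, the row-wise apply(tuple, axis=1) can get slow. A vectorized alternative, sketched on the same example data (my own suggestion, not part of the original answer), maps id through the dictionary and compares the result against Rank:

# Map each id (as a string) to the Rank required by the dictionary,
# then keep only the rows where the actual Rank matches.
mask = df_x['id'].astype(str).map(filter_dict) == df_x['Rank']
filtered = df_x[mask]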

dask: how to groupby, aggregate without losing column used for groupby

How does one get a SQL-style grouped output when grouping the following data:
item  frequency
A     5
A     9
B     2
B     4
C     6
df.groupby(by = ["item"]).sum()
results in this:
      frequency
item
A            14
B             6
C             6
In pandas this is achieved by setting as_index=False, but dask doesn't support this argument in groupby. It currently moves item into the index and returns just the frequency column.
Perhaps call .reset_index() afterwards?
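A minimal sketch of that suggestion, assuming the table above lives in a pandas DataFrame named df:

import dask.dataframe as dd

# Wrap the pandas frame in a dask dataframe.
ddf = dd.from_pandas(df, npartitions=2)

# Aggregate, then move 'item' from the index back into a column.
result = ddf.groupby('item').sum().reset_index().compute()
print(result)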

Aggregated Columns in Pandas within a Dataframe

I'm creating columns with aggregated values from a Pandas DataFrame using the groupby() and reset_index() functions, like this:
df = data.groupby(["subscription_id"])["count_boxes"].sum().reset_index(name="amount_boxes")
df1 = data.groupby(["subscription_id"])["product"].count().reset_index(name="count_product")
I want to combine all these aggregated columns ("amount_boxes" and "count_product") in one dataframe with the groupby column "subscription_id". Is there any way to do that within a function rather than merging the dataframes?
Let's look at using .agg with a dictionary of column and aggregation function:
(data.groupby('subscription_id')
     .agg({'count_boxes': 'sum', 'product': 'count'})
     .reset_index()
     .rename(columns={'count_boxes': 'amount_boxes', 'product': 'count_product'}))
Sample output:
   subscription_id  amount_boxes  count_product
0                1            16              2
1                2            39              6
2                3            47              7
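On pandas 0.25 or newer, named aggregation folds the rename into the agg call itself. A sketch under the same assumptions about the source columns:

# Named aggregation: new_column=(source_column, aggregation).
out = (data.groupby('subscription_id')
           .agg(amount_boxes=('count_boxes', 'sum'),
                count_product=('product', 'count'))
           .reset_index())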

How to copy one DataFrame column into another DataFrame if their index values are the same

After creating a DataFrame with some duplicated cell values in a column named 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames, which are consolidated versions of the original DataFrame df. The newly created DataFrames have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys').sum().reset_index()
df_mean = df.groupby('keys').mean().reset_index()
As you can see, the df_sum['values'] cell values were all summed together, while the df_mean['values'] cell values were averaged with the mean() method.
Lastly, I rename the 'values' column in both dataframes:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How do I achieve this?
The desired result is a single DataFrame containing both the 'sums' and 'means' columns.
There are several ways to do this. Using the DataFrame's merge method is the most direct:
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
   keys  sums  means
0     1     1    1.0
1     2     5    2.5
2     3    22    5.5
I think pandas.merge() is the function you are looking for, e.g. pd.merge(df_sum, df_mean, on="keys"). Besides, the same result can also be produced with one agg call on the original frame:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
#    keys  sum  mean
# 0     1    1   1.0
# 1     2    5   2.5
# 2     3   22   5.5
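Since the question is specifically about copying a column between frames whose key values line up, another option (my own sketch, not from the answers above) is to map 'means' across via the shared 'keys' column:

# Index df_mean by 'keys', then look each key of df_sum up in it;
# this copies 'means' over without a full merge.
df_sum['means'] = df_sum['keys'].map(df_mean.set_index('keys')['means'])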
