I need to group my dataframe and apply several aggregation functions to different columns. Some of these aggregations have conditions.
Here is an example. The data are all the orders from 2 customers, and I would like to calculate some information for each customer, such as their order count, total spending and average spending.
import pandas as pd
data = {'order_id' : range(1,9),
'cust_id' : [1]*5 + [2]*3,
'order_amount' : [100,50,70,75,80,105,30,20],
'cust_days_since_reg' : [0,10,25,37,52,0,17,40]}
orders = pd.DataFrame(data)
aggregation = {'order_id' : 'count',
'order_amount' : ['sum', 'mean']}
cust = orders.groupby('cust_id').agg(aggregation).reset_index()
cust.columns = ['_'.join(col) for col in cust.columns.values]
This works fine and gives me the expected summary (output not shown).
But I have to add an aggregation function with an argument and a condition: the amount a customer spent in their first X months (X must be customizable).
Since I need an argument in this aggregation, I tried:
def spendings_X_month(group, n_months):
    return group.loc[group['cust_days_since_reg'] <= n_months*30,
                     'order_amount'].sum()
aggregation = {'order_id' : 'count',
'order_amount' : ['sum',
'mean',
lambda x: spendings_X_month(x, 1)]}
cust = orders.groupby('cust_id').agg(aggregation).reset_index()
But that last line raises the error: KeyError: 'cust_days_since_reg'.
It must be a scoping error: the cust_days_since_reg column does not seem to be visible in this situation.
I could calculate this last column separately and then join the resulting dataframe to the first, but there must be a better solution that does everything in a single groupby.
Could anyone help me with this problem please ?
Thank You
You cannot use agg here, because each aggregation function receives only one column at a time (as a Series), so filtering on the values of another column is not possible inside it.
One solution is to use GroupBy.apply:
def spendings_X_month(group, n_months):
    a = group['order_id'].count()
    b = group['order_amount'].sum()
    c = group['order_amount'].mean()
    d = group.loc[group['cust_days_since_reg'] <= n_months*30,
                  'order_amount'].sum()
    cols = ['order_id_count','order_amount_sum','order_amount_mean','order_amount_spendings']
    return pd.Series([a,b,c,d], index=cols)
cust = orders.groupby('cust_id').apply(spendings_X_month, 1).reset_index()
print (cust)
cust_id order_id_count order_amount_sum order_amount_mean \
0 1 5.0 375.0 75.000000
1 2 3.0 155.0 51.666667
order_amount_spendings
0 220.0
1 135.0
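If a single groupby with agg is still preferred over apply, one possible sketch is to pre-compute a helper column that keeps order_amount only for orders placed within the first n_months, and then use named aggregation. The names customer_summary and first_months_amount are illustrative assumptions, not part of the answer above, and named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': range(1, 9),
    'cust_id': [1]*5 + [2]*3,
    'order_amount': [100, 50, 70, 75, 80, 105, 30, 20],
    'cust_days_since_reg': [0, 10, 25, 37, 52, 0, 17, 40],
})

def customer_summary(orders, n_months):
    # Hypothetical helper column: amounts outside the first n_months
    # (counted as 30 days each) are replaced by 0 before aggregating.
    in_window = orders['cust_days_since_reg'] <= n_months * 30
    helper = orders.assign(
        first_months_amount=orders['order_amount'].where(in_window, 0))
    return helper.groupby('cust_id').agg(
        order_id_count=('order_id', 'count'),
        order_amount_sum=('order_amount', 'sum'),
        order_amount_mean=('order_amount', 'mean'),
        order_amount_spendings=('first_months_amount', 'sum'),
    ).reset_index()

cust = customer_summary(orders, 1)
print(cust)
```

The condition is evaluated on the whole frame before grouping, so every aggregation function still only has to look at a single column.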
I have:
haves = pd.DataFrame({'Product':['R123','R234'],
'Price':[1.18,0.23],
'CS_Medium':[1, 0],
'CS_Small':[0, 1],
'SC_A':[1,0],
'SC_B':[0,1],
'SC_C':[0,0]})
print(haves)
given a list of columns, like so:
list_of_starts_with = ["CS_", "SC_"]
I would like to arrive here:
wants = pd.DataFrame({'Product':['R123','R234'],
'Price':[1.18,0.23],
'CS':['Medium', 'Small'],
'SC':['A', 'B'],})
print(wants)
I am aware of wide_to_long but don't think it is applicable here?
We could convert "SC" and "CS" column values to boolean mask to filter the column names; then join it back to the original DataFrame:
msk = haves.columns.str.contains('_')
s = haves.loc[:, msk].astype(bool)
s = s.apply(lambda x: dict(s.columns[x].str.split('_')), axis=1)
out = haves.loc[:, ~msk].join(pd.DataFrame(s.tolist(), index=s.index))
Output:
Product Price CS SC
0 R123 1.18 Medium A
1 R234 0.23 Small B
Based on the list of columns (assuming the starts_with is enough to identify them), it is possible to do the changes in bulk:
def preprocess_column_names(list_of_starts_with, column_names):
    "Returns a list of tuples (merged_column_name, options, columns)"
    columns_to_transform = []
    for starts_with in list_of_starts_with:
        len_of_start = len(starts_with)
        columns = [col for col in column_names if col.startswith(starts_with)]
        options = [col[len_of_start:] for col in columns]
        merged_column_name = starts_with[:-1]  # Assuming that the last char is not needed
        columns_to_transform.append((merged_column_name, options, columns))
    return columns_to_transform
def merge_columns(df, merged_column_name, options, columns):
    for col, option in zip(columns, options):
        df.loc[df[col] == 1, merged_column_name] = option
    return df.drop(columns=columns)
def merge_all(df, columns_to_transform):
    for merged_column_name, options, columns in columns_to_transform:
        df = merge_columns(df, merged_column_name, options, columns)
    return df
And to run:
columns_to_transform = preprocess_column_names(list_of_starts_with, haves.columns)
wants = merge_all(haves, columns_to_transform)
If your column names are not surprising (such as Index_ being in list_of_starts_with) the above code should solve the problem with a reasonable performance.
One option is to convert the data to a long form, filter for rows that have a value of 1, then convert back to wide form. We can use pivot_longer from pyjanitor for the wide to long part, and pivot to return to wide form:
# pip install pyjanitor
import pandas as pd
import janitor
( haves
.pivot_longer(index=["Product", "Price"],
names_to=("main", "other"),
names_sep="_")
.query("value==1")
.pivot(index=["Product", "Price"],
columns="main",
values="other")
.rename_axis(columns=None)
.reset_index()
)
Product Price CS SC
0 R123 1.18 Medium A
1 R234 0.23 Small B
You can avoid pyjanitor entirely by transforming the columns before reshaping (it still involves wide to long, then long to wide):
index = [col for col in haves
if not col.startswith(tuple(list_of_starts_with))]
temp = haves.set_index(index)
temp.columns = (temp
                .columns.str.split("_", expand=True)
                .set_names(["main", "other"]))
# reshape to get final dataframe
(temp
.stack(["main", "other"])
.loc[lambda df: df == 1]
.reset_index("other")
.drop(columns=0)
.unstack()
.droplevel(0, 1)
.rename_axis(columns=None)
.reset_index()
)
Product Price CS SC
0 R123 1.18 Medium A
1 R234 0.23 Small B
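As a further sketch for the same question, the indicator columns of each prefix can be collapsed with idxmax along the rows, which returns the label of the column holding the 1. This assumes every row has exactly one 1 per prefix group; collapse_dummies is a hypothetical helper name, not something from the answers above.

```python
import pandas as pd

haves = pd.DataFrame({'Product': ['R123', 'R234'],
                      'Price': [1.18, 0.23],
                      'CS_Medium': [1, 0],
                      'CS_Small': [0, 1],
                      'SC_A': [1, 0],
                      'SC_B': [0, 1],
                      'SC_C': [0, 0]})
list_of_starts_with = ["CS_", "SC_"]

def collapse_dummies(df, prefixes):
    out = df.copy()
    for prefix in prefixes:
        cols = [c for c in df.columns if c.startswith(prefix)]
        # idxmax(axis=1) gives, per row, the name of the column with the
        # largest value (the 1); slicing removes the prefix from that name.
        out[prefix.rstrip('_')] = df[cols].idxmax(axis=1).str[len(prefix):]
        out = out.drop(columns=cols)
    return out

wants = collapse_dummies(haves, list_of_starts_with)
print(wants)
```

If a row could contain only zeros for some prefix, idxmax would still return the first column of that group, so such rows would need to be masked separately.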
I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers shown below
# User defined function : find_outliers
#(I)
from scipy import stats
outlier_threshold = 1.5
ddof = 0
def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
'col_A':['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002','B_1002','B_1002','B_1002','D_1003','D_1003','D_1003','D_1003'],
'col_X':[110.21, 191.12, 190.21, 12.00, 245.09,4321.8,122.99,122.88,134.28,148.14,161.17,132.17],
'col_Y':[100.22,199.10, 191.13,199.99, 255.19,131.22,144.27,192.21,7005.15,12.02,185.42,198.00],
'col_Z':[140.29, 291.07, 390.22, 245.09, 4122.62,4004.52,395.17,149.19,288.91,123.93,913.17,1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs=dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II) and finally flag the outliers as in (III).
Thanks for your attempt. :)
Desired output should look like this, but for the three dataframes
This may not be exactly what you want, but this is how I'd approach it. You'll have to work out the details of the function, because yours is written to receive a Series rather than a DataFrame. groupby().apply() sends each subset of rows to the function, so you can perform the actions on that subset and return the result.
For consideration:
inside the function you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X','col_Y','col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x
newdf = df.groupby('col_A').apply(find_outliers)
col_A col_X col_Y col_Z
0 A_1001 outlier
1 A_1001
2 A_1001
3 A_1001 outlier
4 B_1002 outlier
5 B_1002 outlier
6 B_1002
7 B_1002
8 D_1003 outlier
9 D_1003
10 D_1003
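For the original goal of getting one result per dataframe in dict_of_dfs, a plain dictionary comprehension is one way to run a function over every value. The sketch below computes boolean outlier masks with numpy and pandas only. zscore_mask is a hypothetical helper meant to mirror stats.zscore with ddof=0, and only two groups and one column from the question are reproduced to keep it short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_A': ['A_1001']*4 + ['B_1002']*4,
    'col_X': [110.21, 191.12, 190.21, 12.00, 245.09, 4321.8, 122.99, 122.88],
})
dict_of_dfs = dict(tuple(df.groupby('col_A')))

def zscore_mask(s, threshold=1.5, ddof=0):
    # Population z-score of each value, then flag the ones beyond the threshold
    z = (s - s.mean()) / s.std(ddof=ddof)
    return np.abs(z) > threshold

desired_cols = ['col_X']
# One boolean mask DataFrame per key of the dictionary
masks = {key: sub[desired_cols].apply(zscore_mask)
         for key, sub in dict_of_dfs.items()}

for key, mask in masks.items():
    print(key)
    print(mask)
```

If styled output is wanted instead, the same comprehension could call sub.style.apply(find_outliers, subset=desired_cols) for each sub, since .style is an attribute of each DataFrame rather than of the dictionary.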
I have a DataFrame in Python like below, which presents agreements of clients:
df = pd.DataFrame({"ID" : [1,2,1,1,3],
"amount" : [100,200,300,400,500],
"status" : ["active", "finished", "finished",
"active", "finished"]})
I need to write a function in Python which will calculate:
1. Number (NumAg) and amount (AmAg) of contracts per ID
2. Number (NumAct) and amount (AmAct) of active contracts per ID
3. Number (NumFin) and amount (AmFin) of finished contracts per ID
To be more precise, I need this function to create a DataFrame like the one below:
The below solution should fit your use case.
import pandas as pd
def summarise_df(df):
    # Define mask to filter df by 'active' value in 'status' column for 'NumAct', 'AmAct', 'NumFin', and 'AmFin' columns
    active_mask = df['status'].str.contains('active')
    return df.groupby('ID').agg(  # Create first columns in output df using agg (no mask needed)
        NumAg=pd.NamedAgg(column='amount', aggfunc='count'),
        AmAg=pd.NamedAgg(column='amount', aggfunc='sum')
    ).join(  # Add columns using values with 'active' status
        df[active_mask].groupby('ID').agg(
            NumAct=pd.NamedAgg(column='amount', aggfunc='count'),
            AmAct=pd.NamedAgg(column='amount', aggfunc='sum'))
    ).join(  # Add columns using values with NOT 'active' (i.e. 'finished') status
        df[~active_mask].groupby('ID').agg(
            NumFin=pd.NamedAgg(column='amount', aggfunc='count'),
            AmFin=pd.NamedAgg(column='amount', aggfunc='sum'))
    ).fillna(0)  # Replace nan values with 0
I would recommend reading over this function and its comments alongside documentation for groupby() and join() so that you can develop a better understanding of exactly what is being done here. It is seldom a wise decision to rely upon code that you don't have a good grasp on.
You could use groupby on ID with agg, after adding two helper columns (the amounts of the active and of the finished contracts) that make the aggregation easier:
df['AmAct'] = df.amount[df.status.eq('active')]
df['AmFin'] = df.amount[df.status.eq('finished')]
df = df.groupby('ID').agg(
NumAg = ('ID', 'count'),
AmAg = ('amount', 'sum'),
NumAct = ('status', lambda col: col.eq('active').sum()),
AmAct = ('AmAct', 'sum'),
NumFin = ('status', lambda col: col.eq('finished').sum()),
AmFin = ('AmFin', 'sum')
)
Result:
NumAg AmAg NumAct AmAct NumFin AmFin
ID
1 3 800 2 500.0 1 300.0
2 1 200 0 0.0 1 200.0
3 1 500 0 0.0 1 500.0
Or add some more columns to df to do a simpler groupby on ID with sum:
df.insert(1, 'NumAg', 1)
df['NumAct'] = df.status.eq('active')
df['AmAct'] = df.amount[df.NumAct]
df['NumFin'] = df.status.eq('finished')
df['AmFin'] = df.amount[df.NumFin]
df.drop(columns=['status'], inplace=True)
df = df.groupby('ID').sum().rename(columns={'amount': 'AmAg'})
with the same result.
Or, maybe the easiest way, let pivot_table do most of the work, after adding a count column to df, and some column-rearranging afterwards:
df['count'] = 1
df = df.pivot_table(index='ID', columns='status', values=['count', 'amount'],
                    aggfunc='sum', fill_value=0, margins=True).drop('All')
df.columns = ['AmAct', 'AmFin', 'AmAg', 'NumAct', 'NumFin', 'NumAg']
df = df[['NumAg', 'AmAg', 'NumAct', 'AmAct', 'NumFin', 'AmFin']]
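Since the question asks for a function, the helper-column idea can also be wrapped so that the input dataframe is left untouched. This is only a sketch under the assumption that status takes just the values 'active' and 'finished'; summarise_contracts is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 1, 1, 3],
                   "amount": [100, 200, 300, 400, 500],
                   "status": ["active", "finished", "finished",
                              "active", "finished"]})

def summarise_contracts(df):
    is_active = df["status"].eq("active")
    is_finished = df["status"].eq("finished")
    # assign() returns a new frame, so the caller's df is not modified
    helper = df.assign(
        NumAct=is_active.astype(int),
        AmAct=df["amount"].where(is_active, 0),
        NumFin=is_finished.astype(int),
        AmFin=df["amount"].where(is_finished, 0),
    )
    return helper.groupby("ID").agg(
        NumAg=("amount", "count"),
        AmAg=("amount", "sum"),
        NumAct=("NumAct", "sum"),
        AmAct=("AmAct", "sum"),
        NumFin=("NumFin", "sum"),
        AmFin=("AmFin", "sum"),
    )

summary = summarise_contracts(df)
print(summary)
```

Filling the non-matching amounts with 0 rather than leaving them missing keeps the amount columns as integers.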
I have a dataframe "bb" like this:
Response Unique Count
I love it so much! 246_0 1
This is not bad, but can be better. 246_1 2
Well done, let's do it. 247_0 1
If Count is larger than 1, I would like to split the string so that the dataframe "bb" becomes this (expected result):
Response Unique
I love it so much! 246_0
This is not bad 246_1_0
but can be better. 246_1_1
Well done, let's do it. 247_0
My code:
bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(), index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response','Unique']
bb=bb.replace('', np.nan)
bb=bb.dropna()
print(bb)
But the result is like this:
Response Unique
0 This is not bad 246_1
1 but can be better. 246_1
How can I keep the original dataframe in this case?
First split only the rows that meet the condition into a new helper Series, then add counter values with GroupBy.cumcount, but only for the index values that Index.duplicated flags as duplicated:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0
EDIT: If you need the _0 suffix for all the other values as well, remove the mask:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0_0
Stepwise, we can solve this problem as follows:
Split your dataframe by Count.
Use the explode_str function (shown at the end of this answer) to explode the string to rows.
Group by the index and use cumcount to get the correct Unique column values.
Finally, concat the dataframes together again.
df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1
df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter
# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)
df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0
Function used from linked answer:
def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
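On pandas 0.25 or later, DataFrame.explode can take the place of a hand-written explode helper. The sketch below rebuilds the question's "bb" frame and follows the same steps as the answers above: split only rows with Count > 1, number the pieces with cumcount, then suffix Unique for the split rows. split_counted_rows is a hypothetical name, and stripping the whitespace around each piece is an assumption based on the expected output.

```python
import pandas as pd

bb = pd.DataFrame({
    'Response': ['I love it so much!',
                 'This is not bad, but can be better.',
                 "Well done, let's do it."],
    'Unique': ['246_0', '246_1', '247_0'],
    'Count': [1, 2, 1],
})

def split_counted_rows(df, sep=','):
    # Split only rows whose Count exceeds 1; wrap the others in one-item lists
    pieces = pd.Series(
        [text.split(sep) if count > 1 else [text]
         for text, count in zip(df['Response'], df['Count'])],
        index=df.index)
    out = df.assign(Response=pieces).explode('Response')
    # Position of each piece within its original row: 0, 1, ...
    part = out.groupby(level=0).cumcount().astype(str)
    out = out.reset_index(drop=True)
    part = part.reset_index(drop=True)
    out['Response'] = out['Response'].str.strip()
    # Append the piece number only to rows that were actually split
    was_split = out['Count'] > 1
    out['Unique'] = out['Unique'].where(~was_split, out['Unique'] + '_' + part)
    return out.drop(columns='Count')

result = split_counted_rows(bb)
print(result)
```

Rows with Count equal to 1 are never split, so a comma inside such a response (as in the third row) is left alone.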
I am creating 3 pandas dataframes based off of one original pandas dataframe. I have calculated standard deviations from the norm.
#Mean
stats_over_29000_mean = stats_over_29000['count'].mean().astype(int)
152542
#STDS
stats_over_29000_count_between_std = stats_over_29000_std - stats_over_29000_mean
54313
stats_over_29000_first_std = stats_over_29000_mean + stats_over_29000_count_between_std
206855
stats_over_29000_second_std = stats_over_29000_first_std + stats_over_29000_count_between_std
261168
stats_over_29000_third_std = stats_over_29000_second_std + stats_over_29000_count_between_std
315481
This works to get all rows from df under 2 stds
#Select all rows where count is less than 2 standard deviations
stats_under_2_stds = stats_over_29000[stats_over_29000['count'] < stats_over_29000_second_std]
Next I would like to select all rows from df where >=2 stds and less than 3 stds
I have tried:
stats_2_and_over_under_3_stds = stats_over_29000[stats_over_29000['count'] >= stats_over_29000_second_std < stats_over_29000_third_std]
and
stats_2_and_over_under_3_stds = stats_over_29000[stats_over_29000['count'] >= stats_over_29000_second_std && < stats_over_29000_third_std]
But neither seem to work.
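Pandas has Series.between(left, right, inclusive='both'), which performs both comparisons at the same time. From pandas 1.3 on, inclusive takes a string; 'left' keeps the lower bound and excludes the upper one, which matches ">= 2 stds and < 3 stds".
In your case:
stats_2_and_over_under_3_stds = \
    stats_over_29000[stats_over_29000['count'].between(
        stats_over_29000_second_std, stats_over_29000_third_std, inclusive='left')]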
This is how you filter a DataFrame on two conditions:
df = pd.DataFrame([[1,2],[1,3],[1,5],[1,8]], columns=['A','B'])
res = df[(df['B'] < 8) & (df['B'] > 2)]
Result:
A B
1 1 3
2 1 5
In your case :
stats_2_and_over_under_3_stds = stats_over_29000[(stats_over_29000['count'] >= stats_over_29000_second_std) & (stats_over_29000['count'] < stats_over_29000_third_std)]
The loc indexer lets you apply multiple conditions to filter a dataframe in a very concise syntax. I'm writing 'column of interest' because I do not know the name of the column where the values are stored. Alternatively, if the column of interest is the index, you could just write a condition directly as (stats_over_29000 > 261168) inside loc.
stats_over_29000.loc[(stats_over_29000['column of interest'] > 261168) &
                     (stats_over_29000['column of interest'] < 315481)]
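To compare the two approaches side by side, here is a small sketch on made-up counts. Only the two thresholds come from the question; the counts frame is a hypothetical stand-in for stats_over_29000, and inclusive='left' assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical data; the two bounds are the 2nd and 3rd std values from the question
counts = pd.DataFrame({'count': [100, 261168, 300000, 315481, 400000]})
second_std, third_std = 261168, 315481

# Combined boolean masks: each comparison needs its own parentheses around it
by_mask = counts[(counts['count'] >= second_std) & (counts['count'] < third_std)]

# Series.between with a half-open interval [second_std, third_std)
by_between = counts[counts['count'].between(second_std, third_std, inclusive='left')]

print(by_between)
```

Both selections keep the row equal to the lower bound and drop the row equal to the upper bound. The chained form `a >= b < c` from the question does not work on a Series because Python evaluates it as two separate comparisons joined by `and`.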