Passing multiple columns as arguments to an aggregation function with groupby - python

I am still struggling to get really comfortable with pandas groupby operations. When passing a function to agg, what if the aggregation function needs to consider values in columns other than the one being aggregated?
Consider the following data frame example, which lists sales of products by two salesmen:
import datetime
import numpy as np
import pandas as pd

DateList = np.array([(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days=x) for x in [1, 2, 2, 3, 4, 4, 5]] +
                    [(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days=x) for x in [1, 1, 2, 3, 4, 5, 5]])
Names = np.array(['Joe' for x in range(7)] + ['John' for x in range(7)])
Product = np.array(['Product1', 'Product1', 'Product2', 'Product2', 'Product2', 'Product3', 'Product3',
                    'Product1', 'Product2', 'Product2', 'Product2', 'Product2', 'Product2', 'Product3'])
Volume = np.array([100, 0, 150, 175, 15, 120, 150, 75, 0, 115, 130, 135, 10, 120])
Prices = {'Product1': 25.99, 'Product2': 13.99, 'Product3': 8.99}
SalesDF = pd.DataFrame({'Date': DateList, 'Seller': Names, 'Product': Product, 'Volume': Volume})
SalesDF.sort_values(['Date', 'Seller'], inplace=True)
SalesDF['Prices'] = SalesDF.Product.map(Prices)
On some days each seller sells more than one item. Suppose you wanted to aggregate the data set into a single observation per day/seller, and you wished to do so based upon which product sold the most volume. To be clear, this would be simple for the volume measure: simply pass a max function to agg. However, deciding which Product and Price should remain means determining which volume measure was highest and then returning the values that correspond with that max.
I am able to get the result I want by using the index values in the column that is passed to the function when agg is called and referencing the underlying data frame:
def AggFunc(x, df, col1):
    # Create a list of index values that index the data in the column passed as x
    IndexVals = list(x.index)
    # Use those index values to pull the values of col1 at those positions in the underlying data frame
    ColList = list(df[col1][IndexVals])
    # Find the max value among those values of col1
    MaxVal = np.max(ColList)
    # Find the position of that max value within the list
    MaxValIndex = ColList.index(MaxVal)
    # Return the data point in x at the position corresponding to the max of col1
    return list(x)[MaxValIndex]
FunctionDict = {'Product': lambda x: AggFunc(x, SalesDF, 'Volume'), 'Volume': 'max',
                'Prices': lambda x: AggFunc(x, SalesDF, 'Volume')}
SalesDF.groupby(['Date', 'Seller'], as_index=False).agg(FunctionDict)
But I'm wondering if there is a better way to pass 'Volume' as an argument to the function that aggregates Product, without having to collect the index values and build lists from the data in the underlying dataframe. Something tells me no, as agg passes each column as a Series to the aggregation function, rather than the dataframe itself.
Any ideas?
Thanks

Maybe extracting the right indices first using .idxmax would be simpler?
>>> grouped = SalesDF.groupby(["Date", "Seller"])["Volume"]
>>> max_idx = grouped.apply(pd.Series.idxmax)
>>> SalesDF.loc[max_idx]
          Date   Product Seller  Volume  Prices
0   2013-11-04  Product1    Joe     100   25.99
7   2013-11-04  Product1   John      75   25.99
2   2013-11-05  Product2    Joe     150   13.99
9   2013-11-05  Product2   John     115   13.99
3   2013-11-06  Product2    Joe     175   13.99
10  2013-11-06  Product2   John     130   13.99
5   2013-11-07  Product3    Joe     120    8.99
11  2013-11-07  Product2   John     135   13.99
6   2013-11-08  Product3    Joe     150    8.99
13  2013-11-08  Product3   John     120    8.99
idxmax gives the index of the first occurrence of the maximum value. If you want to keep multiple products when they all attain the maximum volume, it'd be a little different, something more like
>>> max_vols = SalesDF.groupby(["Date", "Seller"])["Volume"].transform(max)
>>> SalesDF[SalesDF.Volume == max_vols]
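As a side note on the original question: unlike agg, groupby().apply hands the function each group as a full DataFrame, so a row can be picked by the max of one column without indexing back into SalesDF. A minimal sketch of that approach (not part of the original answer):
def top_row(g):
    # g is the whole sub-DataFrame for one (Date, Seller) group,
    # so Product and Prices are available alongside Volume
    return g.loc[g['Volume'].idxmax()]

SalesDF.groupby(['Date', 'Seller']).apply(top_row)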

Related

How to apply a custom rolling function to pandas groupby?

I would like to calculate the daily sales from average sales using the following function:
def derive_daily_sales(avg_sales_series, period, first_day_sales):
    """
    derive the daily sales from previous_avg_sales start date to current_avg_sales end date
    for detail formula, please refer to README.md
    # avg_sales_series: an array of avg sales (e.g. 2020-08-04 to 2020-08-06)
    # period: the averaging period in days (e.g. 30 days, 90 days)
    # first_day_sales: the sales at the first day of previous_avg_sales
    """
    x_n1 = avg_sales_series[-1]*period - avg_sales_series[0]*period + first_day_sales
    return x_n1
The avg_sales_series is supposed to be a pandas series.
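As a worked example of the formula: with avg_sales_series = [30, 40], period = 30 and first_day_sales = 30, it gives 40*30 - 30*30 + 30 = 330, which matches the first non-NaN value in the result shown further down.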
The dataframe looks like the following:
date, customer_id, avg_30_day_sales
12/08/2020, 1, 30
13/08/2020, 1, 40
14/08/2020, 1, 40
12/08/2020, 2, 20
13/08/2020, 2, 40
14/08/2020, 2, 30
I would like to first group by customer_id and sort by date. Then get a rolling window of size 2 and apply the custom function derive_daily_sales, assuming that period=30 and first_day_sales equals the first avg_30_day_sales.
I tried:
df_sales_grouped = df_sales.sort_values('date').groupby(['customer_id','date'])
df_daily_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].rolling(2).apply(derive_daily_sales, axis=1, period=30, first_day_sales= df_sales['avg_30_day_sales'][0])
You should not group by the date since you want to roll over that column, so the grouping should be:
df_sales_grouped = df_sales.sort_values('date').groupby('customer_id')
Next, what you actually want to do is apply a rolling window on each group in the dataframe. So you need to use apply twice, once on the grouped dataframe and once on each rolling window. This can be done as follows:
rolling_arguments = {'period': 30, 'first_day_sales': df_sales['avg_30_day_sales'][0]}
# raw=True passes each window to the function as a numpy array,
# so avg_sales_series[-1] indexes positionally as intended
df_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].apply(
    lambda g: g.rolling(2).apply(derive_daily_sales, raw=True, kwargs=rolling_arguments))
For the given input data, the result is:
      date  customer_id  avg_30_day_sales  daily_sales
12/08/2020            1                30          NaN
13/08/2020            1                40        330.0
14/08/2020            1                40         30.0
12/08/2020            2                20          NaN
13/08/2020            2                40        630.0
14/08/2020            2                30       -270.0
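Note that rolling_arguments fixes first_day_sales to the first row of the whole frame, so the same value is used for every customer. If it should instead be each customer's own first average, a hypothetical variant (not from the original answer) would be:
# hypothetical: take first_day_sales from each group's own first value
df_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].apply(
    lambda g: g.rolling(2).apply(derive_daily_sales, raw=True,
                                 kwargs={'period': 30, 'first_day_sales': g.iloc[0]}))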

Complicated function with groupby and between? Python

Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to see if the order date is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to divide the quantity ordered in that range by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the time range, and in that case I would want the sum of the two orders divided by the total quantity. How can I achieve this goal? I really have no idea where to start.
Thank you so much!
You can create a new column masking quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01', '2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start, end)*df.Quantity)
   .groupby('VipNo')[['Quantity', 'QuantitySub']]
   .sum()
   .assign(output=lambda x: x['QuantitySub']/x['Quantity'])
   .drop('QuantitySub', axis=1)
)
With a data frame:
   VipNo  Quantity   OrderDate
0      0       105  2020-01-07
1      0        56  2020-03-04
2      1       167  2020-09-05
3      1        18  2020-05-08
4      2       151  2020-11-01
5      2        14  2020-03-17
The output is:
       Quantity    output
VipNo
0           161  0.347826
1           185  0.000000
2           165  0.084848
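The boolean-times-quantity trick works because True/False multiply as 1/0. An equivalent sketch with the mask made explicit via Series.where (not part of the original answer):
mask = df['OrderDate'].between(start, end)    # orders inside the window
in_range = df['Quantity'].where(mask, 0).groupby(df['VipNo']).sum()
total = df.groupby('VipNo')['Quantity'].sum()
in_range / total                              # same ratio as the 'output' column above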

function in groupby pandas

I would like to calculate a mean value of "bonus" grouped by the column "first_name", but the denominator is not the count of cases, because not all cases have a weight of 1; instead they may have a weight of 0.5.
For instance, in the case of Jason, the value I want is the sum of his bonuses divided by 2.5.
Since in real life I have to group by several columns, like area, etc., I would like to adapt a groupby to this situation.
Here is my try, but it gives me the normal mean:
raw_data = {'area': [1,2,3,3,4],'first_name': ['Jason','Jason','Jason', 'Jake','Jake'],
'bonus': [10,20, 10, 30, 20],'weight': [1,1,0.5,0.5,1]}
df = pd.DataFrame(raw_data, columns = ['area','first_name','bonus','weight'])
df
Use:
(df.groupby('first_name')[['bonus', 'weight']].sum()
   #.add_prefix('sum_')  # you could also want it
   .assign(result=lambda x: x['bonus'].div(x['weight'])))
or
(df[['first_name', 'bonus', 'weight']].groupby('first_name').sum()
   #.add_prefix('sum_')
   .assign(result=lambda x: x['bonus'].div(x['weight'])))
Output
            bonus  weight     result
first_name
Jake           50     1.5  33.333333
Jason          40     2.5  16.000000
One way is to use groupby().apply and np.average:
df.groupby('first_name').apply(lambda x: np.average(x.bonus, weights=x.weight))
Output:
first_name
Jake 23.333333
Jason 14.000000
dtype: float64
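Since the question mentions grouping by several columns such as area, the same pattern extends directly to a list of keys (a sketch, not part of the original answer):
# weighted mean per (area, first_name) combination
df.groupby(['area', 'first_name']).apply(lambda x: np.average(x.bonus, weights=x.weight))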

merge from one dataframe values with one dataframe columns

I have one difficulty here. My goal is to compute the sales for one shop from two dataframes: one that lists prices by product and quantity bracket, and another that lists all the sales in terms of products and quantities (for one period of time).
DataFrame 1 : prices
prices = pd.DataFrame({'qty_from' : ('0','10','20'), 'qty_to' : ('9','19','29'), 'product_A' :(50,30,10),'product_B' :(24,14,12),'product_C' :(70,50,18)})
DataFrame 2 : sales
sales = pd.DataFrame({'product' : ('product_b','product_b','product_a','product_c','product_b'), 'qty' : ('4','12','21','41','7')})
I would like to get the turnover, line by line within the 'sales' DataFrame, with one other column like 'TurnOver'
I used
pd.merge_asof(sales, prices, left_on='qty', right_on='qty_from', direction='backward')
and it gave me the right price bracket for the quantity sold, but how do I get the right price for each product?
How can I match a value in the 'sales' dataframe like 'product_b' with the name of a column in the prices dataframe, here 'product_B', and then apply a calculation to get the turnover?
Thank you for your help,
Eric
If I understand correctly, you can modify the dataframe prices to be able to use the parameter by in merge_asof, using stack:
# reshape prices so that the products become a column
prices_stack = (prices.set_index(['qty_from', 'qty_to']).stack()
                .reset_index(name='price').rename(columns={'level_2': 'product'}))
# make the case uniform
sales['product'] = sales['product'].str.lower()
prices_stack['product'] = prices_stack['product'].str.lower()
# necessary with your data here, as the quantities are strings rather than ints
sales.qty = sales.qty.astype(int)
prices_stack.qty_from = prices_stack.qty_from.astype(int)
# now you can merge_asof, adding the by parameter
sales_prices = (pd.merge_asof(sales.sort_values('qty'), prices_stack,
                              left_on='qty', right_on='qty_from',
                              by='product',  # first match on the product column
                              direction='backward')
                .drop(['qty_from', 'qty_to'], axis=1))  # drop unneeded columns
print(sales_prices)
     product  qty  price
0  product_b    4     24
1  product_b    7     24
2  product_b   12     14
3  product_a   21     10
4  product_c   41     18
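From there, the 'TurnOver' column the question asks for is just price times quantity on the merged frame (a final step sketched here, not part of the original answer):
# turnover per sales line: unit price times quantity sold
sales_prices['TurnOver'] = sales_prices['price'] * sales_prices['qty']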

looping into dates and apply function to pandas dataframe

I'm trying to detect the first dates when an event occurs: here in my dataframe, for product A (see pivot table) I have 20 items stored for the first time on 2017-04-03.
So I want to create a new variable called new_var_2017-04-03 that stores the increment. On the other hand, on the next day, 2017-04-04, I don't mind that the count is now 50 instead of 20; I only want to store the 1st event.
It gives me several errors. I would like to know at least whether the entire logic behind it makes sense and is "pythonic", or if I'm completely on the wrong track.
raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(pd.Series(['2017-03-30', '2017-03-31', '2017-04-03', '2017-04-04'])),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data, columns=['date', 'name', 'age'])
table = pd.pivot_table(df1, index=['name'], columns=['date'], values=['age'], aggfunc='sum')
table
I'm passing the dates to a list
dates=df1['date'].values.tolist()
I want to do a backward loop into my list "dates" and create a variable if an event occurs.
Pseudo code (by i-1 I mean the item before i in the list):
def my_fun(x, list):
    for i in reversed(list):
        if (x[i] - x[i-1]) > 0:
            x[new_var + i] = x[i] - x[i-1]
        else:
            x[new_var + i] = 0
    return x
print (df.apply(lambda x: my_fun(x,dates), axis=1))
Desired output:
raw_data2 = {'new_var': ['new_var_2017-03-30', 'new_var_2017-03-31', 'new_var_2017-04-03', 'new_var_2017-04-04'],
             'result_a': [np.nan, 20, np.nan, np.nan],
             'result_b': [10, np.nan, np.nan, np.nan]}
df2 = pd.DataFrame(raw_data2, columns=['new_var', 'result_a', 'result_b'])
df2.T
Let's try this:
# keep each name's minimum age (the first event in this data) and zero out the rest
df1['age'] = df1.groupby('name')['age'].transform(lambda x: (x == x.min())*x)
df1.pivot_table(index='name', columns='date', values='age').replace(0, np.nan)
date  2017-03-30  2017-03-31  2017-04-03  2017-04-04
name
A            NaN        20.0         NaN         NaN
B           10.0         NaN         NaN         NaN
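The x.min() trick works here because each name's first stored amount happens to also be its smallest. A hypothetical variant keyed on date order instead, in case that assumption does not hold:
# keep the chronologically first row per name, then pivot as before
first = df1.sort_values('date').groupby('name').head(1)
first.pivot_table(index='name', columns='date', values='age')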
