Calculate the mean value using two columns in pandas - python

I have a deal dataframe with three columns, sorted by type and date. It looks like this:
type date price
A 2020-05-01 4
A 2020-06-04 6
A 2020-06-08 8
A 2020-07-03 5
B 2020-02-01 3
B 2020-04-02 4
There are many types (A, B, C, D, E, …). I want to calculate the mean of the previous prices for the same type of product. For example, the pre_mean_price value in the third row of type A is (4+6)/2=5. I want to get a dataframe like this:
type date price pre_mean_price
A 2020-05-01 4 .
A 2020-06-04 6 4
A 2020-06-08 8 5
A 2020-07-03 5 6
B 2020-02-01 3 .
B 2020-04-02 4 3
How can I calculate the pre_mean_price? Thanks a lot!

You can use expanding().mean() within each group after groupby, then shift the values.
df['pre_mean_price'] = df.groupby("type")['price'].transform(lambda x:
x.expanding().mean().shift())
print(df)
type date price pre_mean_price
0 A 2020-05-01 4 NaN
1 A 2020-06-04 6 4.0
2 A 2020-06-08 8 5.0
3 A 2020-07-03 5 6.0
4 B 2020-02-01 3 NaN
5 B 2020-04-02 4 3.0

Something like
df['pre_mean_price'] = df.groupby('type')[['price']].expanding().mean().groupby('type').shift(1)['price'].values
which produces
type date price pre_mean_price
0 A 2020-05-01 4 NaN
1 A 2020-06-04 6 4.0
2 A 2020-06-08 8 5.0
3 A 2020-07-03 5 6.0
4 B 2020-02-01 3 NaN
5 B 2020-04-02 4 3.0
Short explanation
The idea is to
First groupby "type" with .groupby(). This must be done since we want to calculate the (incremental) means within the group "type".
Then, calculate the incremental mean with expanding().mean(). The output at this point is
price
type
A 0 4.00
1 5.00
2 6.00
3 5.75
B 4 3.00
5 3.50
Then, groupby again by "type", and shift the elements inside the groups by one row with shift(1).
Then, just extract the values of the price column (the incremental means)
Note: Because the values are assigned back by position, this assumes your data is sorted by type and then by date. If it is not, call df.sort_values(['type', 'date'], inplace=True) first.
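The same "mean of the previous rows" can also be obtained without expanding(), by dividing the running sum of earlier prices by the number of earlier rows in each group. The following is only a sketch on the sample data from the question; the helper names prev_sum and prev_count are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "type": ["A", "A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2020-05-01", "2020-06-04", "2020-06-08",
        "2020-07-03", "2020-02-01", "2020-04-02",
    ]),
    "price": [4, 6, 8, 5, 3, 4],
})

# Make sure rows are ordered by date within each type.
df = df.sort_values(["type", "date"]).reset_index(drop=True)

grouped = df.groupby("type")["price"]

# Sum of the prices that came before the current row in the same group.
prev_sum = grouped.cumsum() - df["price"]

# Number of rows that came before the current row in the same group.
prev_count = grouped.cumcount()

# Mean of the previous prices; the first row of each group has no
# previous rows, so its count is masked to NaN and the result is NaN.
df["pre_mean_price"] = prev_sum / prev_count.where(prev_count > 0)

print(df)
```

Both cumsum() and cumcount() keep the original row index, so the result can be assigned directly as a new column without relying on row order.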

Related

How to add rows based on a condition with another dataframe

I have two dataframes as follows:
agreement
agreement_id activation term_months total_fee
0 A 2020-12-01 24 4800
1 B 2021-01-02 6 300
2 C 2021-01-21 6 600
3 D 2021-03-04 6 300
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
I want to add another row to the payments dataframe when the total payments for an agreement_id are equal to the total_fee in the agreement dataframe. The row would contain a zero value under payment, and its date would be calculated as min(date) (from payments) plus term_months (from agreement).
Here are the results I want for the payments dataframe:
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
11 2 C 2021-07-21 0
12 3 D 2021-09-04 0
The additional rows are rows 11 and 12. The total payments for agreement_id 'C' and 'D' were equal to the total_fee shown in the agreement dataframe.
import pandas as pd
import numpy as np
First, convert the 'date' column of the payments dataframe to datetime dtype with the to_datetime() method:
payments['date']=pd.to_datetime(payments['date'])
Then aggregate per agreement with the groupby() method:
newdf=payments.groupby('agreement_id').agg({'payment':'sum','date':'min','cust_id':'first'}).reset_index()
Now use boolean masking to keep the rows that meet your condition:
newdf=newdf[agreement['total_fee']==newdf['payment']].assign(payment=np.nan)
Note: in the code above we use the assign() method to set the payment column to NaN. The comparison relies on agreement and newdf having the same row order and index.
Now use pd.tseries.offsets.DateOffset together with the apply() method:
newdf['date']=newdf['date']+agreement['term_months'].apply(lambda x:pd.tseries.offsets.DateOffset(months=x))
Note: The code above may emit a warning. It is a warning, not an error, so it can be ignored.
Finally, use the concat() and fillna() methods:
result=pd.concat((payments,newdf),ignore_index=True).fillna(0)
Now if you print result you will get your desired output:
#output
cust_id agreement_id date payment
0 1 A 2020-12-01 200.0
1 1 A 2021-02-02 200.0
2 1 A 2021-02-03 100.0
3 1 A 2021-05-01 200.0
4 1 B 2021-01-02 50.0
5 1 B 2021-01-09 20.0
6 1 B 2021-03-01 80.0
7 1 B 2021-04-23 90.0
8 2 C 2021-01-21 600.0
9 3 D 2021-03-04 150.0
10 3 D 2021-05-03 150.0
11 2 C 2021-07-21 0.0
12 3 D 2021-09-04 0.0
Note: If you want exactly the same output, use the astype() method to change the dtype of the payment column from float to int:
result['payment']=result['payment'].astype(int)
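As a variation on the steps above, the condition can be checked after an explicit merge on agreement_id, which avoids depending on the two dataframes sharing the same index. This is a sketch under the assumption that every agreement_id in payments also appears in agreement; the names totals, settled and closing are illustrative only.

```python
import pandas as pd

# Sample data from the question.
agreement = pd.DataFrame({
    "agreement_id": ["A", "B", "C", "D"],
    "activation": pd.to_datetime(
        ["2020-12-01", "2021-01-02", "2021-01-21", "2021-03-04"]),
    "term_months": [24, 6, 6, 6],
    "total_fee": [4800, 300, 600, 300],
})
payments = pd.DataFrame({
    "cust_id": [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3],
    "agreement_id": list("AAAABBBBCDD"),
    "date": pd.to_datetime([
        "2020-12-01", "2021-02-02", "2021-02-03", "2021-05-01",
        "2021-01-02", "2021-01-09", "2021-03-01", "2021-04-23",
        "2021-01-21", "2021-03-04", "2021-05-03",
    ]),
    "payment": [200, 200, 100, 200, 50, 20, 80, 90, 600, 150, 150],
})

# One row per agreement: customer, first payment date, total paid.
totals = payments.groupby("agreement_id", as_index=False).agg(
    cust_id=("cust_id", "first"),
    first_date=("date", "min"),
    paid=("payment", "sum"),
)

# Bring in term_months and total_fee by key rather than by row position.
merged = totals.merge(
    agreement[["agreement_id", "term_months", "total_fee"]],
    on="agreement_id",
)

# Agreements whose payments add up exactly to the total fee.
settled = merged[merged["paid"] == merged["total_fee"]]

# Build the extra zero-payment rows: first payment date plus the term.
closing = pd.DataFrame({
    "cust_id": settled["cust_id"].to_numpy(),
    "agreement_id": settled["agreement_id"].to_numpy(),
    "date": [
        d + pd.DateOffset(months=int(m))
        for d, m in zip(settled["first_date"], settled["term_months"])
    ],
    "payment": 0,
})

result = pd.concat([payments, closing], ignore_index=True)
print(result)
```

Because the payment column is never set to NaN here, it keeps its integer dtype and no astype() conversion is needed afterwards.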

Pandas: query() groupby() mean() using second column list

I'm trying to decipher some inherited pandas code and cannot determine what the list [['DemandRate','DemandRateQtr','AcceptRate']] is doing in this line of code:
plot_data = (my_dataframe.query("quote_date>'2020-02-01'")
.groupby(['quote_date'])[['DemandRate', 'DemandRateQtr', 'AcceptRate']]
.mean()
.reset_index()
)
Can anyone tell me what the list does?
It filters by column names: only the columns in this list are aggregated.
['DemandRate', 'DemandRateQtr', 'AcceptRate']
Any other columns, that is, columns that are neither in this list nor in the by list (here ['quote_date']), are omitted:
my_dataframe = pd.DataFrame({
'quote_date':pd.date_range('2020-02-01', periods=3).tolist() * 2,
'DemandRate':[4,5,4,5,5,4],
'DemandRateQtr':[7,8,9,4,2,3],
'AcceptRate':[1,3,5,7,1,0],
'column':[5,3,6,9,2,4]
})
print(my_dataframe)
quote_date DemandRate DemandRateQtr AcceptRate column
0 2020-02-01 4 7 1 5
1 2020-02-02 5 8 3 3
2 2020-02-03 4 9 5 6
3 2020-02-01 5 4 7 9
4 2020-02-02 5 2 1 2
5 2020-02-03 4 3 0 4
plot_data = (my_dataframe.query("quote_date>'2020-02-01'")
.groupby(['quote_date'])[['DemandRate', 'DemandRateQtr', 'AcceptRate']]
.mean()
.reset_index())
print (plot_data)
# the column named 'column' is not in the output
quote_date DemandRate DemandRateQtr AcceptRate
0 2020-02-02 5.0 5.0 2.0
1 2020-02-03 4.0 6.0 2.5
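To make the role of the list concrete, the sketch below compares selecting the columns before aggregating with aggregating everything and selecting afterwards. It reuses the sample frame from the answer; the variable names cols, select_then_mean and mean_then_select are illustrative.

```python
import pandas as pd

my_dataframe = pd.DataFrame({
    "quote_date": pd.date_range("2020-02-01", periods=3).tolist() * 2,
    "DemandRate": [4, 5, 4, 5, 5, 4],
    "DemandRateQtr": [7, 8, 9, 4, 2, 3],
    "AcceptRate": [1, 3, 5, 7, 1, 0],
    "column": [5, 3, 6, 9, 2, 4],
})

cols = ["DemandRate", "DemandRateQtr", "AcceptRate"]
filtered = my_dataframe.query("quote_date > '2020-02-01'")

# Select the columns on the GroupBy object, then aggregate.
select_then_mean = filtered.groupby("quote_date")[cols].mean()

# Aggregate every remaining column, then select the ones of interest.
mean_then_select = filtered.groupby("quote_date").mean()[cols]

# Both routes give the same table; the first skips work on unused columns.
print(select_then_mean.equals(mean_then_select))
print(list(select_then_mean.columns))
```

Selecting on the GroupBy object is usually preferable when the frame also holds non-numeric columns, since those would otherwise have to be handled by mean().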

How to vectorize an operation that uses previous values?

I want to do something like this:
df['indicator'] = df.at[x-1] + df.at[x-2]
or
df['indicator'] = df.at[x-1] > df.at[x-2]
I guess edge cases would be taken care of automatically, e.g. skip the first few rows.
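This line should give you what you need. The first two rows of your indicator column will automatically be filled with NaN. Use bracket access for the column, because df.at is pandas' label-based indexer rather than the column named 'at'.
df['indicator'] = df['at'].shift(1) + df['at'].shift(2)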
For example, if we had the following dataframe:
df = pd.DataFrame({'date':['2017-06-01','2017-06-02','2017-06-03',
'2017-06-04','2017-06-05','2017-06-06'],
'at' :[10,15,17,5,3,7]})
date at
0 2017-06-01 10
1 2017-06-02 15
2 2017-06-03 17
3 2017-06-04 5
4 2017-06-05 3
5 2017-06-06 7
Then running this line will give the below result:
df['indicator'] = df['at'].shift(1) + df['at'].shift(2)
date at indicator
0 2017-06-01 10 NaN
1 2017-06-02 15 NaN
2 2017-06-03 17 25.0
3 2017-06-04 5 32.0
4 2017-06-05 3 22.0
5 2017-06-06 7 8.0
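The question also asks about a comparison between the two previous values. A sketch covering both the sum and the comparison is given below, assuming the column is literally named 'at' as in the printed frame; sum_prev and rising are made-up column names. Note that a comparison involving NaN evaluates to False, so the first two rows are masked explicitly here.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2017-06-01", "2017-06-02", "2017-06-03",
             "2017-06-04", "2017-06-05", "2017-06-06"],
    "at": [10, 15, 17, 5, 3, 7],
})

# Values from one and two rows earlier; the leading rows become NaN.
prev1 = df["at"].shift(1)
prev2 = df["at"].shift(2)

# Sum of the two previous values (NaN propagates for the first two rows).
df["sum_prev"] = prev1 + prev2

# Was the previous value greater than the one before it?
# Rows without two predecessors are set to NaN instead of False,
# which turns the column into object dtype.
df["rising"] = (prev1 > prev2).where(prev2.notna())

print(df)
```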

Shifting the values of a column in pandas dataframe one month forward

Is there a way to shift the values of a column in a pandas dataframe one month forward? (Note that I want to shift the column values and not the date values.)
For example, if I have:
ColumnA ColumnB
2016-10-01 1 0
2016-09-30 2 1
2016-09-29 5 1
2016-09-28 7 1
.
.
2016-09-01 3 1
2016-08-31 4 7
2016-08-30 4 7
2016-08-29 9 7
2016-08-28 10 7
Then I want to be able to shift the values in ColumnB one month forward, to get the desired output:
ColumnA ColumnB
2016-10-01 1 1
2016-09-30 2 7
2016-09-29 5 7
2016-09-28 7 7
.
.
2016-09-01 3 7
2016-08-31 4 X
2016-08-30 4 X
2016-08-29 9 X
2016-08-28 10 X
In the data I have, the value is fixed for each month (for example, the value in ColumnB was 1 during September), so the fact that the number of days differs a bit from month to month should not be a problem.
This seems related to "Python/Pandas - DataFrame Index - Move one month forward", but in the linked question the OP wanted to shift the whole frame, and I want to shift only selected columns.
It is not too elegant, but you can do something like this:
df=df.reset_index()
df['index']=pd.to_datetime(df['index'],infer_datetime_format=True)
df['offset']=df['index']-pd.DateOffset(months=1)
res=df.merge(df,right_on='index',left_on='offset',how='left')
Then just take the columns you want from res.
You can first create a new index of pandas Periods for each month, then get the value for each month and use pandas' automatic index alignment to create a new column.
df1 = df.copy()
orig_idx = df.index
df1.index = orig_idx.to_period('M')
col_b_new = df1.groupby(level=0)['ColumnB'].first()
col_b_new.index = col_b_new.index + 1  # move each monthly value one month forward
df1['ColumnB_new'] = col_b_new
df1.index = orig_idx
Output
ColumnA ColumnB ColumnB_new
2016-10-01 1 0 1.0
2016-09-30 2 1 7.0
2016-09-29 5 1 7.0
2016-09-28 7 1 7.0
2016-09-01 3 1 7.0
2016-08-31 4 7 NaN
2016-08-30 4 7 NaN
2016-08-29 9 7 NaN
2016-08-28 10 7 NaN
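For reference, a self-contained sketch of the period-based idea on the rows shown in the question (the elided rows are left out) could look like this. It assumes a DatetimeIndex and one constant ColumnB value per month; months and monthly are illustrative names.

```python
import pandas as pd

# The rows that are visible in the question, with a DatetimeIndex.
df = pd.DataFrame(
    {
        "ColumnA": [1, 2, 5, 7, 3, 4, 4, 9, 10],
        "ColumnB": [0, 1, 1, 1, 1, 7, 7, 7, 7],
    },
    index=pd.to_datetime([
        "2016-10-01", "2016-09-30", "2016-09-29", "2016-09-28",
        "2016-09-01", "2016-08-31", "2016-08-30", "2016-08-29",
        "2016-08-28",
    ]),
)

# Month of every row as a pandas Period (e.g. 2016-09).
months = df.index.to_period("M")

# One ColumnB value per month.
monthly = df.groupby(months)["ColumnB"].first()

# Adding an integer to a PeriodIndex moves it by that many periods,
# so each monthly value is now labelled with the following month.
monthly.index = monthly.index + 1

# Look up the shifted value for every row's month; months without a
# predecessor in the data come back as NaN.
df["ColumnB_new"] = monthly.reindex(months).to_numpy()

print(df)
```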
