How to add rows based on a condition with another dataframe - python

I have two dataframes as follows:
agreement
agreement_id activation term_months total_fee
0 A 2020-12-01 24 4800
1 B 2021-01-02 6 300
2 C 2021-01-21 6 600
3 D 2021-03-04 6 300
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
I want to add another row in the payments dataframe when the total payments for the agreement_id in the payments dataframe is equal to the total_fee in the agreement_id. The row would contain a zero value under the payments and the date will be calculated as min(date) (from payments) plus term_months (from agreement).
Here's the results I want for the payments dataframe:
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
11 2 C 2021-07-21 0
12 3 D 2021-09-04 0
The additional rows are row 11 and 12. The agreement_id 'C' and 'D' where equal to the total_fee shown in the agreement dataframe.

import pandas as pd
import numpy as np
Firstly convert 'date' column of payment dataframe into datetime dtype by using to_datetime() method:
payments['date']=pd.to_datetime(payments['date'])
You can do this by using groupby() method:
newdf=payments.groupby('agreement_id').agg({'payment':'sum','date':'min','cust_id':'first'}).reset_index()
Now by boolean masking get the data which mets your condition:
newdf=newdf[agreement['total_fee']==newdf['payment']].assign(payment=np.nan)
Note: here in the above code we are using assign() method and making the payments row to NaN
Now make use of pd.tseries.offsets.Dateoffsets() method and apply() method:
newdf['date']=newdf['date']+agreement['term_months'].apply(lambda x:pd.tseries.offsets.DateOffset(months=x))
Note: The above code gives you a warning so just ignore that warning as it's a warning not an error
Finally make use of concat() method and fillna() method:
result=pd.concat((payments,newdf),ignore_index=True).fillna(0)
Now if you print result you will get your desired output
#output
cust_id agreement_id date payment
0 1 A 2020-12-01 200.0
1 1 A 2021-02-02 200.0
2 1 A 2021-02-03 100.0
3 1 A 2021-05-01 200.0
4 1 B 2021-01-02 50.0
5 1 B 2021-01-09 20.0
6 1 B 2021-03-01 80.0
7 1 B 2021-04-23 90.0
8 2 C 2021-01-21 600.0
9 3 D 2021-03-04 150.0
10 3 D 2021-05-03 150.0
11 2 C 2021-07-21 0.0
12 3 D 2021-09-04 0.0
Note: If you want exact same output then make use of astype() method and change payment column dtype from float to int
result['payment']=result['payment'].astype(int)

Related

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
prev = dt
holiday_look_back.append((prev, h))
for i in range(1, look_back+1):
prev = prev - pd.Timedelta(days=1)
holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
columns StateHolidayNew should have the info you need to start analyzing your data
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different alphabets which represent the holidays and then groupby to find out the sales according to each group. An improvement to this would be to backfill the numbers before the groups, exp., groups=0.0 would become b_0 which would make it easier to understand the groups and what holiday they represent, but I am not sure how to do that.
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0

Calculate the mean value using two columns in pandas

I have a deal dataframe with three columns and I have sorted by the type and date, It looks like:
type date price
A 2020-05-01 4
A 2020-06-04 6
A 2020-06-08 8
A 2020-07-03 5
B 2020-02-01 3
B 2020-04-02 4
There are many types (A, B, C,D,E…), I want to calculate the previous mean price of the same type of product. For example: the pre_mean_price value of third row A is (4+6)/2=5. I want to get a dataframe like this:
type date price pre_mean_price
A 2020-05-01 4 .
A 2020-06-04 6 4
A 2020-06-08 8 5
A 2020-07-03 5 6
B 2020-02-01 3 .
B 2020-04-02 4 3
How can I calculate the pre_mean_price? Thanks a lot!
You can use expanding().mean() after groupby for each group , then shift the values.
df['pre_mean_price'] = df.groupby("type")['price'].apply(lambda x:
x.expanding().mean().shift())
print(df)
type date price pre_mean_price
0 A 2020-05-01 4 NaN
1 A 2020-06-04 6 4.0
2 A 2020-06-08 8 5.0
3 A 2020-07-03 5 6.0
4 B 2020-02-01 3 NaN
5 B 2020-04-02 4 3.0
Something like
df['pre_mean_price'] = df.groupby('type').expanding().mean().groupby('type').shift(1)['price'].values
which produces
type date price pre_mean_price
0 A 2020-05-01 4 NaN
1 A 2020-06-04 6 4.0
2 A 2020-06-08 8 5.0
3 A 2020-07-03 5 6.0
4 B 2020-02-01 3 NaN
5 B 2020-04-02 4 3.0
Short explanation
The idea is to
First groupby "type" with .groupby(). This must be done since we want to calculate the (incremental) means within the group "type".
Then, calculate the incremental mean with expanding().mean(). The output in this point is
price
type
A 0 4.00
1 5.00
2 6.00
3 5.75
B 4 3.00
5 3.50
Then, groupby again by "type", and shift the elements inside the groups by one row with shift(1).
Then, just extract the values of the price column (the incremental means)
Note: This assumes your data is sorted by date. It it is not, call df.sort_values('date', inplace=True) before.

How to calculate the average from previous rows in pandas? [duplicate]

I have a deal dataframe with three columns and I have sorted by the type and date, It looks like:
type date price
A 2020-05-01 4
A 2020-06-04 6
A 2020-06-08 8
A 2020-07-03 5
B 2020-02-01 3
B 2020-04-02 4
There are many types (A, B, C,D,E…), I want to calculate the previous mean price of the same type of product. For example: the pre_mean_price value of third row A is (4+6)/2=5. I want to get a dataframe like this:
type date price pre_mean_price
A 2020-05-01 4 .
A 2020-06-04 6 4
A 2020-06-08 8 5
A 2020-07-03 5 6
B 2020-02-01 3 .
B 2020-04-02 4 3
How can I calculate the pre_mean_price? Thanks a lot!
You can use expanding().mean() after groupby for each group , then shift the values.
df['pre_mean_price'] = df.groupby("type")['price'].apply(lambda x:
x.expanding().mean().shift())
print(df)
type date price pre_mean_price
0 A 2020-05-01 4 NaN
1 A 2020-06-04 6 4.0
2 A 2020-06-08 8 5.0
3 A 2020-07-03 5 6.0
4 B 2020-02-01 3 NaN
5 B 2020-04-02 4 3.0
Something like
df['pre_mean_price'] = df.groupby('type').expanding().mean().groupby('type').shift(1)['price'].values
which produces
type date price pre_mean_price
0 A 2020-05-01 4 NaN
1 A 2020-06-04 6 4.0
2 A 2020-06-08 8 5.0
3 A 2020-07-03 5 6.0
4 B 2020-02-01 3 NaN
5 B 2020-04-02 4 3.0
Short explanation
The idea is to
First groupby "type" with .groupby(). This must be done since we want to calculate the (incremental) means within the group "type".
Then, calculate the incremental mean with expanding().mean(). The output in this point is
price
type
A 0 4.00
1 5.00
2 6.00
3 5.75
B 4 3.00
5 3.50
Then, groupby again by "type", and shift the elements inside the groups by one row with shift(1).
Then, just extract the values of the price column (the incremental means)
Note: This assumes your data is sorted by date. It it is not, call df.sort_values('date', inplace=True) before.

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas series (It could also be an additional column in the dataframe or some other datastructure) with the weakly average asset prices. This means I need to calculate the average on every 7 consecutive instances in the column and save it into a series.
Picture of how result should look like
As I am a complete newbie to python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tipp!
I believe need GroupBy.transform by modulo of numpy array create by numpy.arange for general solution also working with all indexes (e.g. with DatetimeIndex):
np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]

pandas DataFrame cumulative value

I have the following pandas dataframe:
>>> df
Category Year Costs
0 A 1 20.00
1 A 2 30.00
2 A 3 40.00
3 B 1 15.00
4 B 2 25.00
5 B 3 35.00
How do I add a cumulative cost column that adds up the cost for the same category and previous years. Example of the extra column with previous df:
>>> new_df
Category Year Costs Cumulative Costs
0 A 1 20.00 20.00
1 A 2 30.00 50.00
2 A 3 40.00 90.00
3 B 1 15.00 15.00
4 B 2 25.00 40.00
5 B 3 35.00 75.00
Suggestions?
This works in pandas 0.17.0 Thanks to #DSM in the comments for the terser solution.
df['Cumulative Costs'] = df.groupby(['Category'])['Costs'].cumsum()
>>> df
Category Year Costs Cumulative Costs
0 A 1 20 20
1 A 2 30 50
2 A 3 40 90
3 B 1 15 15
4 B 2 25 40
5 B 3 35 75

Categories

Resources