pandas DataFrame cumulative value - python

I have the following pandas dataframe:
>>> df
  Category  Year  Costs
0        A     1  20.00
1        A     2  30.00
2        A     3  40.00
3        B     1  15.00
4        B     2  25.00
5        B     3  35.00
How do I add a cumulative cost column that sums the costs for the same category over the current and previous years? Example of the extra column with the previous df:
>>> new_df
  Category  Year  Costs  Cumulative Costs
0        A     1  20.00             20.00
1        A     2  30.00             50.00
2        A     3  40.00             90.00
3        B     1  15.00             15.00
4        B     2  25.00             40.00
5        B     3  35.00             75.00
Suggestions?

This works in pandas 0.17.0. Thanks to @DSM in the comments for the terser solution:
df['Cumulative Costs'] = df.groupby(['Category'])['Costs'].cumsum()
>>> df
  Category  Year  Costs  Cumulative Costs
0        A     1     20                20
1        A     2     30                50
2        A     3     40                90
3        B     1     15                15
4        B     2     25                40
5        B     3     35                75
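For reference, a minimal self-contained sketch of the same approach, rebuilding the frame from the question's data:
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'Category': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Year': [1, 2, 3, 1, 2, 3],
                   'Costs': [20.0, 30.0, 40.0, 15.0, 25.0, 35.0]})

# Cumulative sum of Costs within each Category
df['Cumulative Costs'] = df.groupby('Category')['Costs'].cumsum()
print(df)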

Related

How to filter a dataframe and identify records based on a condition on multiple other columns

        id  zone  price
0  0000001     1   33.0
1  0000001     2   24.0
2  0000001     3   34.0
3  0000001     4   45.0
4  0000001     5   51.0
I have the above pandas dataframe; there are multiple ids in it (only one id is shown here). For each id the dataframe has 5 zones and 5 prices, and the prices should follow the pattern below:
p1 (price of zone 1) < p2 < p3 < p4 < p5
If anything is out of order, we should identify the anomalous records and print them to a file.
In this example p3 < p4 < p5 holds, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected).
Therefore the first 2 records should be printed to a file.
Likewise, this has to be done for all unique ids in the entire dataframe.
My dataframe is huge; what is the most efficient way to do this filtering and identify the erroneous records?
You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone'])   # sort rows
       .groupby('id')                    # group by id
       ['price'].diff()                  # compute the diff
       .le(0)                            # flag diffs ≤ 0 (not strictly increasing)
    )
df[s | s.shift(-1)]  # slice flagged rows + the preceding row
Example output:
   id  zone  price
0   1     1   33.0
1   1     2   24.0
Example input:
   id  zone  price
0   1     1   33.0
1   1     2   24.0
2   1     3   34.0
3   1     4   45.0
4   1     5   51.0
5   2     1   20.0
6   2     2   24.0
7   2     3   34.0
8   2     4   45.0
9   2     5   51.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
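Putting the pieces together, a minimal end-to-end sketch that rebuilds the example input above (the file name is the one used just above):
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'zone':  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'price': [33.0, 24.0, 34.0, 45.0, 51.0,
                             20.0, 24.0, 34.0, 45.0, 51.0]})

# Flag rows whose price did not strictly increase within their id
s = (df.sort_values(by=['id', 'zone'])
       .groupby('id')['price'].diff()
       .le(0))

# Keep each flagged row plus the row before it, and write them out
df[s | s.shift(-1)].to_csv('incorrect_prices.csv', index=False)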
Another way would be to first sort your dataframe by id and zone in ascending order and compare each price with the previous one using groupby.shift(), creating a new column. Then you can just print out the prices that have fallen in value:
import numpy as np
import pandas as pd
df = df.sort_values(by=['id', 'zone'], ascending=True)
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'],
                                   'inc', 'dec'))
>>> df
    id  zone  price   increase
0    1     1     33  no change
1    1     2     24        dec
2    1     3     34        inc
3    1     4     45        inc
4    1     5     51        inc
5    2     1     34  no change
6    2     2     56        inc
7    2     3     22        dec
8    2     4     55        inc
9    2     5     77        inc
10   3     1     44  no change
11   3     2     55        inc
12   3     3     44        dec
13   3     4     66        inc
14   3     5     33        dec
>>> df.loc[df.increase.eq('dec')]
    id  zone  price increase
1    1     2     24      dec
7    2     3     22      dec
12   3     3     44      dec
14   3     5     33      dec
I have added some extra IDs to try to mimic your real data.
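Since the question asks for the anomalies to be written to a file, a one-line follow-up (the file name is illustrative):
df.loc[df.increase.eq('dec')].to_csv('anomalies.csv', index=False)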

(Python) How to calculate the average over a time period?

I have a dataFrame and I am trying to add a new column that calculates the average amount spent with a card over the last 3 days.
I have tried using df['avg_card_3days'] = df.groupby('card')['amount'].resample('3D', on='date').mean()
The dataFrame currently looks like:
card    date  amount
   1  2/1/10      50
   2  2/1/10      40
   3  2/1/10      10
   1  2/2/10      20
   2  2/2/10      30
   3  2/2/10      30
   1  2/3/10      10
   2  2/3/10      30
   3  2/3/10      20
...
But I am looking for this result:
card    date  amount  avg_card_3days
   1  2/1/10      50             NaN
   2  2/1/10      40             NaN
   3  2/1/10      10             NaN
   1  2/2/10      20             NaN
   2  2/2/10      30             NaN
   3  2/2/10      30             NaN
   1  2/3/10      10           26.67
   2  2/3/10      30           33.33
   3  2/3/10      20           20.00
...
Any help would be greatly appreciated!
df['date'] = pd.to_datetime(df.date, format='%m/%d/%y')
df = df.set_index('date')
df['avg_card_3days'] = df.groupby('card').expanding(3).amount.agg('mean').droplevel(0).sort_index()
df = df.reset_index()
df
Output
        date  card  amount  avg_card_3days
0 2010-02-01     1      50             NaN
1 2010-02-01     2      40             NaN
2 2010-02-01     3      10             NaN
3 2010-02-02     1      20             NaN
4 2010-02-02     2      30             NaN
5 2010-02-02     3      30             NaN
6 2010-02-03     1      10       26.666667
7 2010-02-03     2      30       33.333333
8 2010-02-03     3      20       20.000000
Explanation
Convert the date column to datetime type and set it as the index.
Group the df by card and take an expanding mean of amount with a minimum of 3 periods, assigning it to the new column.
Reset the index to get the required output.
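Note that expanding(3) averages over all rows seen so far once at least 3 are available; if you specifically want the mean over the last 3 calendar days per card, a time-based rolling window is a closer fit. A sketch, assuming the rows are sorted by date within each card:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
df['avg_card_3days'] = (df.groupby('card')
                          .rolling('3D', on='date', min_periods=3)['amount']
                          .mean()
                          .droplevel(0))  # drop the 'card' level to realign with df's index
For this particular data (one row per card per day over 3 days) both approaches give the same result.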

How to add rows based on a condition with another dataframe

I have two dataframes as follows:
agreement
  agreement_id  activation  term_months  total_fee
0            A  2020-12-01           24       4800
1            B  2021-01-02            6        300
2            C  2021-01-21            6        600
3            D  2021-03-04            6        300
payments
    cust_id agreement_id        date  payment
0         1            A  2020-12-01      200
1         1            A  2021-02-02      200
2         1            A  2021-02-03      100
3         1            A  2021-05-01      200
4         1            B  2021-01-02       50
5         1            B  2021-01-09       20
6         1            B  2021-03-01       80
7         1            B  2021-04-23       90
8         2            C  2021-01-21      600
9         3            D  2021-03-04      150
10        3            D  2021-05-03      150
I want to add another row to the payments dataframe whenever the total payments for an agreement_id equal the total_fee for that agreement_id in the agreement dataframe. The new row should contain a zero value under payment, and its date is calculated as min(date) (from payments) plus term_months (from agreement).
Here's the results I want for the payments dataframe:
payments
    cust_id agreement_id        date  payment
0         1            A  2020-12-01      200
1         1            A  2021-02-02      200
2         1            A  2021-02-03      100
3         1            A  2021-05-01      200
4         1            B  2021-01-02       50
5         1            B  2021-01-09       20
6         1            B  2021-03-01       80
7         1            B  2021-04-23       90
8         2            C  2021-01-21      600
9         3            D  2021-03-04      150
10        3            D  2021-05-03      150
11        2            C  2021-07-21        0
12        3            D  2021-09-04        0
The additional rows are rows 11 and 12. For agreement_ids 'C' and 'D', the total payments were equal to the total_fee shown in the agreement dataframe.
import pandas as pd
import numpy as np
First, convert the 'date' column of the payments dataframe to datetime dtype using the to_datetime() method:
payments['date'] = pd.to_datetime(payments['date'])
Then aggregate the payments per agreement using the groupby() method:
newdf = payments.groupby('agreement_id').agg({'payment': 'sum', 'date': 'min', 'cust_id': 'first'}).reset_index()
Now use boolean masking to get the rows that meet your condition:
newdf = newdf[agreement['total_fee'] == newdf['payment']].assign(payment=np.nan)
Note: here the assign() method sets the payment column to NaN; the comparison relies on agreement and newdf sharing the same index order, which holds here because both are sorted by agreement_id.
Now make use of the pd.tseries.offsets.DateOffset() and apply() methods:
newdf['date'] = newdf['date'] + agreement['term_months'].apply(lambda x: pd.tseries.offsets.DateOffset(months=x))
Note: the above line may emit a PerformanceWarning (the offset addition is not vectorized); it is safe to ignore here.
Finally, make use of the concat() and fillna() methods:
result = pd.concat((payments, newdf), ignore_index=True).fillna(0)
Now if you print result you will get your desired output:
#output
    cust_id agreement_id        date  payment
0         1            A  2020-12-01    200.0
1         1            A  2021-02-02    200.0
2         1            A  2021-02-03    100.0
3         1            A  2021-05-01    200.0
4         1            B  2021-01-02     50.0
5         1            B  2021-01-09     20.0
6         1            B  2021-03-01     80.0
7         1            B  2021-04-23     90.0
8         2            C  2021-01-21    600.0
9         3            D  2021-03-04    150.0
10        3            D  2021-05-03    150.0
11        2            C  2021-07-21      0.0
12        3            D  2021-09-04      0.0
Note: if you want the exact same output, use the astype() method to change the payment column dtype from float to int:
result['payment'] = result['payment'].astype(int)
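For a variant that avoids relying on index alignment between agreement and the grouped frame, the fee condition and the month offset can be taken from an explicit merge instead. A sketch using the same column names:
import pandas as pd

# Aggregate payments per agreement, then pull in term_months and total_fee via a merge
totals = (payments.groupby('agreement_id', as_index=False)
                  .agg(payment=('payment', 'sum'),
                       date=('date', 'min'),
                       cust_id=('cust_id', 'first'))
                  .merge(agreement[['agreement_id', 'term_months', 'total_fee']],
                         on='agreement_id'))

# Keep only the fully paid agreements and build the zero-payment closing rows
paid_off = totals[totals['payment'] == totals['total_fee']].copy()
paid_off['date'] += paid_off['term_months'].apply(lambda m: pd.DateOffset(months=m))
paid_off['payment'] = 0

result = pd.concat([payments,
                    paid_off[['cust_id', 'agreement_id', 'date', 'payment']]],
                   ignore_index=True)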

Split up the total of a value when merging dataframes with rows that contain id multiple times

I have two dataframes that I would like to merge. The first contains a customer id and a column with a value. The second contains the customer id and a purchase id. When merging, I would like to split the total value in the first dataframe based on how many times the customer id is present in the second dataframe, and attribute to every row the correct share of the total value.
Example: the customer with id 1 has a total value of 3000 but has bought products two times in its lifetime, so the value 3000 should be split when merging so that each row gets 1500.
First dataframe:
import pandas as pd
df_first = pd.DataFrame({'customer_id': [1,2,3,4,5], 'value': [3000,4000,5000,6000,7000]})
df_first.head()
Out[1]:
   customer_id  value
0            1   3000
1            2   4000
2            3   5000
3            4   6000
4            5   7000
Second dataframe:
df_second = pd.DataFrame({'customer_id': [1,2,3,4,5,1,2,3,4,5], 'purchase_id': [11,12,13,14,15,21,22,23,24,25]})
df_second.head(10)
Out[2]:
   customer_id  purchase_id
0            1           11
1            2           12
2            3           13
3            4           14
4            5           15
5            1           21
6            2           22
7            3           23
8            4           24
9            5           25
Expected output when merging:
Out[3]:
   customer_id  value  purchase_id
0            1   1500           11
1            1   1500           21
2            2   2000           12
3            2   2000           22
4            3   2500           13
5            3   2500           23
6            4   3000           14
7            4   3000           24
8            5   3500           15
9            5   3500           25
Use DataFrame.merge with a left join on values sorted by customer_id, then divide each value by the length of its group, obtained by mapping customer_id through Series.value_counts:
df = df_second.sort_values('customer_id').merge(df_first, on='customer_id', how='left')
df['value'] /= df['customer_id'].map(df['customer_id'].value_counts())
# alternative
# df['value'] /= df.groupby('customer_id')['customer_id'].transform('size')
print(df)
   customer_id  purchase_id   value
0            1           11  1500.0
1            1           21  1500.0
2            2           12  2000.0
3            2           22  2000.0
4            3           13  2500.0
5            3           23  2500.0
6            4           14  3000.0
7            4           24  3000.0
8            5           15  3500.0
9            5           25  3500.0
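As a quick sanity check, the split values should sum back to the original per-customer totals:
assert (df.groupby('customer_id')['value'].sum()
          .eq(df_first.set_index('customer_id')['value'])).all()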

Pandas: multiply starting value for one column through each value in another within group

I have a starting value and some future expected growth rates for a number of customers.
Here is a simple sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 1, 10, np.nan], ['A', 2, 10, 1.2], ['A', 3, 10, 1.15],
                   ['B', 1, 20, np.nan], ['B', 2, 20, 1.05], ['B', 3, 20, 1.2]],
                  columns=['Cust', 'Period', 'startingValue', 'Growth'])
print(df)
  Cust  Period  startingValue  Growth
0    A       1             10     NaN
1    A       2             10    1.20
2    A       3             10    1.15
3    B       1             20     NaN
4    B       2             20    1.05
5    B       3             20    1.20
For each Cust, I want to multiply the starting value by the growth rate, then carry that value forward to the next period. I could do this with groupby-apply or an ugly for loop, but I'm hoping there's some faster vectorized method for doing this. I had hoped there would be some .fill() magic, where you could multiply by another column as it fills downwards. Here's what the output should look like:
  Cust  Period  startingValue  Growth  Pred_val
0    A       1             10     NaN      10.0
1    A       2             10    1.20      12.0
2    A       3             10    1.15      13.8
3    B       1             20     NaN      20.0
4    B       2             20    1.05      21.0
5    B       3             20    1.20      25.2
Thoughts?
You can do this with a cumulative product, using the cumprod function:
df['Pred_val'] = df.Growth.fillna(1).groupby(df.Cust).cumprod()*df.startingValue
