How to do calculation on pandas dataframe that require processing multiple rows?

I have a dataframe from which I need to calculate a number of features. The dataframe df looks something like this, with one row per object (id) and event:
id event_id event_date age money_spent rank
1 100 2016-10-01 4 150 2
2 100 2016-09-30 5 10 4
1 101 2015-12-28 3 350 3
2 102 2015-10-25 5 400 5
3 102 2015-10-25 7 500 2
1 103 2014-04-15 2 1000 1
2 103 2014-04-15 3 180 6
From this I need to know, for each id and event_id (basically each row), the number of days since the last event date, the total money spent up to that date, the average money spent up to that date, the rank in the last 3 events, etc.
What is the best way to handle this kind of problem in pandas, where for each row I need information from all rows with the same id up to the date of that row? I want to return a new dataframe with the corresponding calculated features, like
id event_id event_date days_last_event avg_money_spent total_money_spent
1 100 2016-10-01 278 500 1500
2 100 2016-09-30 341 196.67 590
1 101 2015-12-28 622 675 1350
2 102 2015-10-25 558 290 580
3 102 2015-10-25 0 500 500
1 103 2014-04-15 0 1000 1000
2 103 2014-04-15 0 180 180

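I came up with the following solution:
import pandas as pd
df["event_date"] = pd.to_datetime(df["event_date"])
# Sort oldest first so each cumulative value covers events up to that row's date
df1 = df.sort_values(by="event_date")
g = df1.groupby("id")
df1["total_money_spent"] = g["money_spent"].cumsum()
# cumcount() is 0-based, so add 1 to get the number of events so far
df1["avg_money_spent"] = df1["total_money_spent"] / (g.cumcount() + 1)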
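A minimal end-to-end sketch of the cumulative approach, rebuilt from the sample rows in the question. Computing days_last_event with a grouped diff() is an assumption about what the question intends (0 for an id's first event), and the name `out` is only illustrative. The rank-in-last-3-events feature is left out because the question does not define it.

```python
import pandas as pd

# Sample rows from the question (age and rank omitted for brevity).
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 1, 2],
    "event_id": [100, 100, 101, 102, 102, 103, 103],
    "event_date": pd.to_datetime([
        "2016-10-01", "2016-09-30", "2015-12-28", "2015-10-25",
        "2015-10-25", "2014-04-15", "2014-04-15",
    ]),
    "money_spent": [150, 10, 350, 400, 500, 1000, 180],
})

# Sort each id's events oldest first so cumulative values look backwards in time.
out = df.sort_values(["id", "event_date"]).copy()
g = out.groupby("id")

# Days since the previous event of the same id (assumed 0 for the first event).
out["days_last_event"] = g["event_date"].diff().dt.days.fillna(0).astype(int)
# Running total and running mean, both including the current row.
out["total_money_spent"] = g["money_spent"].cumsum()
out["avg_money_spent"] = (out["total_money_spent"] / (g.cumcount() + 1)).round(2)

# Restore the original row order.
out = out.sort_index()
print(out[["id", "event_id", "days_last_event", "avg_money_spent", "total_money_spent"]])
```

Every step here is a vectorised groupby operation, so no explicit loop over rows is needed.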

Related

Find overlapped rows in Pandas Data Frame

What is the easiest way to convert the following ascending data frame:
start end
0 100 500
1 400 700
2 450 580
3 750 910
4 920 940
5 1000 1200
6 1100 1300
into
start end
0 100 700
1 750 910
2 920 940
3 1000 1300
You may notice that rows 0:3 and 5:7 were merged because these rows overlap, or one row is contained in another, so each merged group ends up with a single start and end.

Use a custom group with shift to identify the overlapping intervals, then keep the minimum start and maximum end of each group. Taking the first start and last end is not enough here: row 2 ends at 580, before row 1's end of 700, so 'last' would return 580 for the first group.
group = df['start'].gt(df['end'].shift()).cumsum()
out = df.groupby(group).agg({'start': 'min', 'end': 'max'})
output:
start end
0 100 700
1 750 910
2 920 940
3 1000 1300
intermediate group:
0 0
1 0
2 0
3 1
4 2
5 3
6 3
dtype: int64
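Comparing each start only with the previous row's end can split a group when a short interval is nested inside a longer earlier one. A hedged variant, assuming the rows are sorted by start, compares against the running maximum of end instead. The helper name merge_overlaps is hypothetical.

```python
import pandas as pd

def merge_overlaps(frame):
    """Merge overlapping [start, end] rows; a sketch, not a library function."""
    frame = frame.sort_values("start")
    # Largest end seen in all earlier rows (NaN for the first row).
    running_end = frame["end"].cummax().shift()
    # A new group begins whenever a start lies beyond every earlier end.
    group = frame["start"].gt(running_end).cumsum().rename("group")
    merged = frame.groupby(group).agg(start=("start", "min"), end=("end", "max"))
    return merged.reset_index(drop=True)

df = pd.DataFrame({
    "start": [100, 400, 450, 750, 920, 1000, 1100],
    "end": [500, 700, 580, 910, 940, 1200, 1300],
})
print(merge_overlaps(df))

# Nested case: (400, 500) starts after (200, 300) ends but is still inside (100, 700).
nested = pd.DataFrame({"start": [100, 200, 400], "end": [700, 300, 500]})
print(merge_overlaps(nested))
```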

Sumifs excel formula in Pandas

I have seen a lot of SUMIFS questions answered here, but they are very different from what I need.
The first (Trade) data frame contains the transaction id and C_ID:
transaction C_ID
1 101
2 103
3 104
4 101
5 102
6 104
The second (Customer) data frame contains C_ID, On/Off, and Amount:
C_ID On/Off Amount
102 On 320
101 On 400
101 On 200
103 On 60
104 Off 80
104 On 100
I want to calculate the Amount for each C_ID, with a condition on the 'On/Off' column in the Customer data frame. The resulting Trade data frame should be
transaction C_ID Amount
1 101 600
2 103 60
3 104 100
4 101 600
5 102 320
6 104 100
Here is the Excel formula showing how Amount is calculated:
=SUMIFS(Customer.Amount, Customer.C_ID = Trade.C_ID, Customer.On/Off = On)
I want to replicate this formula in Python using pandas.

You can use groupby() on the filtered data to compute the sums, then map() to assign them as a new column in the transaction data.
s = df2[df2['On/Off']=='On'].groupby('C_ID')['Amount'].sum()
df1['Amount'] = df1['C_ID'].map(s)

Alternatively, filter, group by C_ID, and reindex by the trade C_IDs before assigning:
df1['Amount']=df2.loc[df2['On/Off']=='On'].groupby(['C_ID']).Amount.sum().reindex(df1.C_ID).tolist()
df1
Out[340]:
transaction C_ID Amount
0 1 101 600
1 2 103 60
2 3 104 100
3 4 101 600
4 5 102 320
5 6 104 100
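A self-contained sketch of the filter, groupby, and map approach on the question's data. The frame names trade and customer and the fillna(0) step are assumptions: map() leaves NaN for any C_ID that has no 'On' rows, whereas SUMIFS would return 0 in that case.

```python
import pandas as pd

trade = pd.DataFrame({
    "transaction": [1, 2, 3, 4, 5, 6],
    "C_ID": [101, 103, 104, 101, 102, 104],
})
customer = pd.DataFrame({
    "C_ID": [102, 101, 101, 103, 104, 104],
    "On/Off": ["On", "On", "On", "On", "Off", "On"],
    "Amount": [320, 400, 200, 60, 80, 100],
})

# Sum Amount per C_ID over the rows where On/Off is 'On' (the SUMIFS conditions).
on_totals = customer.loc[customer["On/Off"].eq("On")].groupby("C_ID")["Amount"].sum()

# Look up each trade's C_ID in the totals; unmatched ids become 0 like SUMIFS.
trade["Amount"] = trade["C_ID"].map(on_totals).fillna(0).astype(int)
print(trade)
```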

Pandas totalling balances with date timeline from multiple sheets

I have three sheets inside one Excel spreadsheet and am trying to obtain the output listed below, or something close to it. The goal is to know when there will be a shortage so that I can order in time to prevent it. Everything except the output is in one Excel file, on separate sheets. How hard will this be to achieve, and is it possible? Note that every sheet has many other data columns, so positional references to columns (iloc) or references by name (loc) may be needed.
instock sheet
product someother datapoint qty
5.25 1 2 100
5.25 1 3 200
6 2 1 50
6 4 1 500
ordered
product something ordernum qty date
5 1/4 abc 52521 50 07/01/2019
5 1/4 ddd 22911 100 07/28/2019
6 eeee 72944 10 07/5/2019
promised
product order qty date
5 1/4 456 300 06/12/2019
5 1/4 789 50 06/20/2019
5 1/4 112 50 07/20/2019
6 113 800 07/22/2019
5 1/4 144 50 07/28/2019
9 155 100 08/22/2019
Output
product date onhand qtyordered commited balance shortage
5.25 06/10 300 300 n
5.25 06/12 300 300 0 n
5.25 06/20 0 50 -50 y
5.25 07/01 -50 50 0 n
6 07/05 550 10 0 560 n
5.25 07/20 0 50 -50 y
6 07/22 560 0 800 -240 y
5.25 07/28 -50 100 50 0 n
9 08/22 0 0 100 -100 y

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total over the days of each month, i.e. 01-31. However, some days are missing. The data frame should look like
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390

If you only need the cumulative sum by month, first use groupby with sum to total each date, then group by the index converted to month and take cumsum:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish both month and year, convert the index to a monthly period with to_period:
df = df.groupby(df.index.to_period('M')).cumsum().reset_index()
The difference is easier to see in a changed df with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('M')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
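The two groupings can be compared side by side in one runnable sketch, using the changed data with the 2018 rows. The names daily, by_month, and by_period are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017/01/12", "2017/01/12", "2017/01/15", "2017/01/23",
             "2017/02/01", "2017/02/01", "2017/02/02", "2017/02/03",
             "2018/02/03", "2018/02/04", "2018/02/04", "2018/02/04"],
    "Amount": [50, 30, 70, 80, 90, 10, 10, 10, 20, 60, 90, 100],
})
df["Date"] = pd.to_datetime(df["Date"])

# One total per calendar date.
daily = df.groupby("Date")["Amount"].sum()

# Month number only: February 2017 and February 2018 share one running total.
by_month = daily.groupby(daily.index.month).cumsum()

# Year-month period: the running total restarts in February 2018.
by_period = daily.groupby(daily.index.to_period("M")).cumsum()

print(pd.DataFrame({"by_month": by_month, "by_period": by_period}))
```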

How to find out if there was weekend between days?

I have two data frames. One representing when an order was placed and arrived, while the other one represents the working days of the shop.
Days are given as day of the year, i.e. 32 = 1st February.
orders = DataFrame({'placed':[100,103,104,105,108,109], 'arrived':[103,104,105,106,111,111]})
Out[25]:
arrived placed
0 103 100
1 104 103
2 105 104
3 106 105
4 111 108
5 111 109
calendar = DataFrame({'day':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120'], 'closed':[0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0]})
Out[21]:
closed day
0 0 100
1 1 101
2 1 102
3 0 103
4 0 104
5 0 105
6 0 106
7 0 107
8 1 108
9 1 109
10 0 110
11 0 111
12 0 112
13 0 113
14 0 114
15 1 115
16 1 116
17 0 117
18 0 118
19 0 119
20 0 120
What I want to do is compute the difference between placed and arrived
x = orders['arrived'] - orders['placed']
Out[24]:
0 3
1 1
2 1
3 1
4 3
5 2
dtype: int64
and subtract one for each day between placed and arrived (inclusive) on which the shop was closed.
For example, in the first row the order is placed on day 100 and arrives on day 103, so the days covered are 100, 101, 102, and 103. The difference between 103 and 100 is 3. However, since the shop is closed on days 101 and 102, I want to subtract 1 for each of them: 3 - 1 - 1 = 1. Finally, I want to append this result to the orders df.
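One possible sketch of this calculation uses a running count of closed days, so the number of closed days in an inclusive range is a difference of two lookups. It assumes the calendar lists every day without gaps and casts the string day column to int. The column names closed_days and working_diff are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({"placed": [100, 103, 104, 105, 108, 109],
                       "arrived": [103, 104, 105, 106, 111, 111]})
calendar = pd.DataFrame({
    "day": [str(d) for d in range(100, 121)],
    "closed": [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Running count of closed days, indexed by integer day of year.
closed_cum = (calendar.assign(day=calendar["day"].astype(int))
                      .set_index("day")["closed"].cumsum())

# Closed days in [placed, arrived] = count up to arrived minus count before placed.
# Days before the calendar starts are treated as having no closed days so far.
before = closed_cum.reindex(orders["placed"] - 1).fillna(0).to_numpy()
upto = closed_cum.reindex(orders["arrived"]).to_numpy()
orders["closed_days"] = (upto - before).astype(int)

orders["working_diff"] = orders["arrived"] - orders["placed"] - orders["closed_days"]
print(orders)
```

For the first row this gives 2 closed days and a result of 1, matching the worked example above.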
