How to conditionally aggregate values of previous rows of Pandas DataFrame? - python

I have the following example Pandas DataFrame
UserID Total Date
1 20 2019-01-01
1 18 2019-01-04
1 22 2019-01-05
1 16 2019-01-07
1 17 2019-01-09
1 26 2019-01-11
1 30 2019-01-12
1 28 2019-01-13
1 28 2019-01-15
1 28 2019-01-16
2 22 2019-01-06
2 11 2019-01-07
2 23 2019-01-09
2 14 2019-01-13
2 19 2019-01-14
2 29 2019-01-15
2 21 2019-01-16
2 22 2019-01-18
2 30 2019-01-22
2 16 2019-01-23
3 27 2019-01-01
3 13 2019-01-04
3 12 2019-01-05
3 27 2019-01-06
3 26 2019-01-09
3 26 2019-01-10
3 30 2019-01-11
3 19 2019-01-12
3 27 2019-01-13
3 29 2019-01-14
4 29 2019-01-07
4 12 2019-01-09
4 25 2019-01-10
4 11 2019-01-11
4 19 2019-01-13
4 20 2019-01-14
4 33 2019-01-15
4 24 2019-01-18
4 22 2019-01-19
4 24 2019-01-21
My goal is to add a column named TotalPrev10Days, which is the sum of Total over the previous 10 days for each UserID.
I did a basic implementation using nested loops, comparing each row's date against a 10-day timedelta.
Here's my code:
from datetime import timedelta

users = set(df.UserID)  # set of all unique user IDs
TotalPrev10Days = []
delta = timedelta(days=10)  # 10-day delta to subtract from each row date
for user in users:  # loop over all user IDs
    user_df = df[df["UserID"] == user]  # rows for the current user only
    for row_index in user_df.index:  # loop over each row of that user
        row_date = user_df["Date"][row_index]
        row_date_minus_10 = row_date - delta  # subtract 10 days
        sum_prev_10_days = user_df[(user_df["Date"] < row_date) & (user_df["Date"] >= row_date_minus_10)]["Total"].sum()
        TotalPrev10Days.append(sum_prev_10_days)  # append total to the list
df["TotalPrev10Days"] = TotalPrev10Days  # assign list to new DataFrame column
While it works perfectly, it's very slow for large datasets.
Is there a faster, more Pandas-native approach to this problem?

IIUC, try:
df["TotalPrev10Days"] = df.groupby("UserID") \
    .rolling("9D", on="Date") \
    .sum() \
    .shift() \
    .fillna(0)["Total"] \
    .droplevel(0)
>>> df
UserID Total Date TotalPrev10Days
0 1 20 2019-01-01 0.0
1 1 18 2019-01-04 20.0
2 1 22 2019-01-05 38.0
3 1 16 2019-01-07 60.0
4 1 17 2019-01-09 76.0
5 1 26 2019-01-11 93.0
6 1 30 2019-01-12 99.0
7 1 28 2019-01-13 129.0
8 1 28 2019-01-15 139.0
9 1 28 2019-01-16 145.0
10 2 22 2019-01-06 0.0
11 2 11 2019-01-07 22.0
12 2 23 2019-01-09 33.0
13 2 14 2019-01-13 56.0
14 2 19 2019-01-14 70.0
15 2 29 2019-01-15 89.0
16 2 21 2019-01-16 96.0
17 2 22 2019-01-18 106.0
18 2 30 2019-01-22 105.0
19 2 16 2019-01-23 121.0
20 3 27 2019-01-01 0.0
21 3 13 2019-01-04 27.0
22 3 12 2019-01-05 40.0
23 3 27 2019-01-06 52.0
24 3 26 2019-01-09 79.0
25 3 26 2019-01-10 105.0
26 3 30 2019-01-11 104.0
27 3 19 2019-01-12 134.0
28 3 27 2019-01-13 153.0
29 3 29 2019-01-14 167.0
30 4 29 2019-01-07 0.0
31 4 12 2019-01-09 29.0
32 4 25 2019-01-10 41.0
33 4 11 2019-01-11 66.0
34 4 19 2019-01-13 77.0
35 4 20 2019-01-14 96.0
36 4 33 2019-01-15 116.0
37 4 24 2019-01-18 149.0
38 4 22 2019-01-19 132.0
39 4 24 2019-01-21 129.0
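For comparison, here is a minimal sketch that mirrors the loop's condition (Date >= row_date - 10 days and Date < row_date) directly. It assumes Date is already a datetime column. A time-based rolling window with closed="left" excludes the current row on its own, so no shift is needed. That matters in two ways: shifting a rolling sum anchors each window on the previous row's date, and a plain shift() after a grouped rolling sum also carries values across UserID boundaries. The small frame below is made up for illustration (it is not the question's data), and user 2 has a long gap on purpose.

```python
import pandas as pd

# Illustrative data only (not the question's frame); user 2 has a long gap.
df = pd.DataFrame({
    "UserID": [1, 1, 1, 1, 2, 2, 2],
    "Total": [20, 18, 22, 16, 22, 11, 23],
    "Date": pd.to_datetime([
        "2019-01-01", "2019-01-04", "2019-01-05", "2019-01-11",
        "2019-01-06", "2019-01-16", "2019-02-20",
    ]),
})

# The positional assignment at the end relies on this ordering.
df = df.sort_values(["UserID", "Date"]).reset_index(drop=True)

# closed="left" makes each window [Date - 10 days, Date), so the current row is excluded.
rolled = (
    df.set_index("Date")
      .groupby("UserID")["Total"]
      .rolling("10D", closed="left")
      .sum()
)

# Windows with no earlier rows come back as NaN; treat them as 0.
df["TotalPrev10Days"] = rolled.fillna(0).to_numpy()
print(df)
```

Because the window is anchored on the current row's date, a row that follows a gap longer than 10 days gets 0 rather than a sum carried over from the previous row.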

Related

Python: Pandas merge three dataframes on date, keeping all dates [duplicate]

This question already has answers here:
Merge multiple DataFrames Pandas
(5 answers)
Pandas Merging 101
(8 answers)
Closed 7 months ago.
I have three dataframes
Dataframe df1:
date A
0 2022-04-11 1
1 2022-04-12 2
2 2022-04-14 26
3 2022-04-16 2
4 2022-04-17 1
5 2022-04-20 17
6 2022-04-21 14
7 2022-04-22 1
8 2022-04-23 9
9 2022-04-24 1
10 2022-04-25 5
11 2022-04-26 2
12 2022-04-27 21
13 2022-04-28 9
14 2022-04-29 17
15 2022-04-30 5
16 2022-05-01 8
17 2022-05-07 1241217
18 2022-05-08 211
19 2022-05-09 1002521
20 2022-05-10 488739
21 2022-05-11 12925
22 2022-05-12 57
23 2022-05-13 8515098
24 2022-05-14 1134576
Dataframe df2:
date B
0 2022-04-12 8
1 2022-04-14 7
2 2022-04-16 2
3 2022-04-19 2
4 2022-04-23 2
5 2022-05-07 2
6 2022-05-08 5
7 2022-05-09 2
8 2022-05-14 1
Dataframe df3:
date C
0 2022-04-12 6
1 2022-04-13 1
2 2022-04-14 2
3 2022-04-20 3
4 2022-04-21 9
5 2022-04-22 25
6 2022-04-23 56
7 2022-04-24 49
8 2022-04-25 68
9 2022-04-26 71
10 2022-04-27 40
11 2022-04-28 44
12 2022-04-29 27
13 2022-04-30 34
14 2022-05-01 28
15 2022-05-07 9
16 2022-05-08 20
17 2022-05-09 24
18 2022-05-10 21
19 2022-05-11 8
20 2022-05-12 8
21 2022-05-13 14
22 2022-05-14 25
23 2022-05-15 43
24 2022-05-16 36
25 2022-05-17 29
26 2022-05-18 28
27 2022-05-19 17
28 2022-05-20 6
I would like to merge df1, df2, df3 into a single dataframe with columns date, A, B, C, such that date contains every date that appears in df1, df2, or df3 (without repetition). If a date is missing from one of the dataframes, the respective column should get the value 0.0. So I would like to have something like this:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-13 0.0 0.0 1.0
...
I tried to use this method
merge1 = pd.merge(df1, df2, how='outer')
sorted_merge1 = merge1.sort_values(by=['date'], ascending=False)
full_merge = pd.merge(sorted_merge1, df3, how='outer')
However, it seems it skips the dates which are not common for all three dataframes.
Try this:
print(pd.merge(df1, df2, on='date', how='outer').merge(df3, on='date', how='outer').fillna(0))
Output:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-14 26.0 7.0 2.0
3 2022-04-16 2.0 2.0 0.0
4 2022-04-17 1.0 0.0 0.0
5 2022-04-20 17.0 0.0 3.0
6 2022-04-21 14.0 0.0 9.0
7 2022-04-22 1.0 0.0 25.0
8 2022-04-23 9.0 2.0 56.0
9 2022-04-24 1.0 0.0 49.0
10 2022-04-25 5.0 0.0 68.0
11 2022-04-26 2.0 0.0 71.0
12 2022-04-27 21.0 0.0 40.0
13 2022-04-28 9.0 0.0 44.0
14 2022-04-29 17.0 0.0 27.0
15 2022-04-30 5.0 0.0 34.0
16 2022-05-01 8.0 0.0 28.0
17 2022-05-07 1241217.0 2.0 9.0
18 2022-05-08 211.0 5.0 20.0
19 2022-05-09 1002521.0 2.0 24.0
20 2022-05-10 488739.0 0.0 21.0
21 2022-05-11 12925.0 0.0 8.0
22 2022-05-12 57.0 0.0 8.0
23 2022-05-13 8515098.0 0.0 14.0
24 2022-05-14 1134576.0 1.0 25.0
25 2022-04-19 0.0 2.0 0.0
26 2022-04-13 0.0 0.0 1.0
27 2022-05-15 0.0 0.0 43.0
28 2022-05-16 0.0 0.0 36.0
29 2022-05-17 0.0 0.0 29.0
30 2022-05-18 0.0 0.0 28.0
31 2022-05-19 0.0 0.0 17.0
32 2022-05-20 0.0 0.0 6.0
This chains two outer merges on date and fills the resulting NaN values with 0.
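If there are more than a couple of frames, the same chain can be written with functools.reduce, and a final sort puts the dates back in order (the merged output above lists the dates that are missing from df1 at the end). This is only a sketch: the three small frames are cut-down stand-ins for df1, df2, and df3, and date is assumed to be a datetime column in each.

```python
from functools import reduce

import pandas as pd

# Cut-down stand-ins for df1, df2, df3 (a few rows each, for illustration).
df1 = pd.DataFrame({"date": pd.to_datetime(["2022-04-11", "2022-04-12", "2022-04-14"]),
                    "A": [1, 2, 26]})
df2 = pd.DataFrame({"date": pd.to_datetime(["2022-04-12", "2022-04-14"]),
                    "B": [8, 7]})
df3 = pd.DataFrame({"date": pd.to_datetime(["2022-04-12", "2022-04-13", "2022-04-14"]),
                    "C": [6, 1, 2]})

frames = [df1, df2, df3]

merged = (
    # Outer-merge the frames pairwise on "date", left to right.
    reduce(lambda left, right: pd.merge(left, right, on="date", how="outer"), frames)
    .fillna(0)                # dates missing from a frame get 0 in that column
    .sort_values("date")      # restore chronological order
    .reset_index(drop=True)
)
print(merged)
```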

Pandas: insert missing date values with multiple IDs

I have a pandas dataframe with 1.7 million rows, like this:
ID date value
10 2022-01-01 100
10 2022-01-02 150
10 2022-01-05 200
10 2022-01-07 150
10 2022-01-12 100
23 2022-02-01 490
23 2022-02-03 350
23 2022-02-04 333
23 2022-02-08 211
23 2022-02-09 100
I would like to insert the missing dates in the date column, like this:
ID date value
10 2022-01-01 100
10 2022-01-02 150
10 2022-01-03 0
10 2022-01-04 0
10 2022-01-05 200
10 2022-01-06 0
10 2022-01-07 150
10 2022-01-08 0
10 2022-01-09 0
10 2022-01-10 0
10 2022-01-11 0
10 2022-01-12 100
23 2022-02-01 490
23 2022-02-02 0
23 2022-02-03 350
23 2022-02-04 333
23 2022-02-05 0
23 2022-02-06 0
23 2022-02-07 0
23 2022-02-08 211
23 2022-02-09 100
I used:
s = (pd.MultiIndex.from_tuples([[x, d]
                                for x, y in df.groupby("Id")["Dt"]
                                for d in pd.date_range(min(y), max(df["Dt"]), freq="MS")],
                               names=["Id", "Dt"]))
print(df.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())
But it took too long. Is there a more performant way to do this?
You can try:
df['date'] = pd.to_datetime(df['date'])
df = (df.groupby('ID')['date'].apply(lambda d:
          pd.date_range(start=d.min(), end=d.max()).to_list())
        .explode().reset_index()
        .merge(df, on=['ID', 'date'], how='left'))
df['value'] = df['value'].fillna(0).astype(int)
Output:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
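Another way to fill the gaps per ID is a grouped daily resample, sketched below on a cut-down version of the sample. It assumes date is a datetime column and that there is at most one row per (ID, date) pair, because sum() would otherwise add duplicates together. Days with no rows sum to 0, which gives the requested fill value.

```python
import pandas as pd

# Cut-down version of the sample data, for illustration.
df = pd.DataFrame({
    "ID": [10, 10, 10, 23, 23],
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                            "2022-02-01", "2022-02-03"]),
    "value": [100, 150, 200, 490, 350],
})

filled = (
    df.set_index("date")
      .groupby("ID")["value"]
      .resample("D")     # daily bins between each ID's own first and last date
      .sum()             # empty days sum to 0
      .reset_index()
)
print(filled)
```

Each ID is resampled only between its own first and last date, so no rows are created in the gap between one ID and the next.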
Use asfreq and fillna. Note that this reindexes the whole frame as a single daily range, so the days between one ID's last date and the next ID's first date are filled in as well:
#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").asfreq("D").fillna({"value": 0}).ffill().reset_index()
>>> df
date ID value
0 2022-01-01 10.0 100.0
1 2022-01-02 10.0 150.0
2 2022-01-03 10.0 0.0
3 2022-01-04 10.0 0.0
4 2022-01-05 10.0 200.0
5 2022-01-06 10.0 0.0
6 2022-01-07 10.0 150.0
7 2022-01-08 10.0 0.0
8 2022-01-09 10.0 0.0
9 2022-01-10 10.0 0.0
10 2022-01-11 10.0 0.0
11 2022-01-12 10.0 100.0
12 2022-01-13 10.0 0.0
13 2022-01-14 10.0 0.0
14 2022-01-15 10.0 0.0
15 2022-01-16 10.0 0.0
16 2022-01-17 10.0 0.0
17 2022-01-18 10.0 0.0
18 2022-01-19 10.0 0.0
19 2022-01-20 10.0 0.0
20 2022-01-21 10.0 0.0
21 2022-01-22 10.0 0.0
22 2022-01-23 10.0 0.0
23 2022-01-24 10.0 0.0
24 2022-01-25 10.0 0.0
25 2022-01-26 10.0 0.0
26 2022-01-27 10.0 0.0
27 2022-01-28 10.0 0.0
28 2022-01-29 10.0 0.0
29 2022-01-30 10.0 0.0
30 2022-01-31 10.0 0.0
31 2022-02-01 23.0 490.0
32 2022-02-02 23.0 0.0
33 2022-02-03 23.0 350.0
34 2022-02-04 23.0 333.0
35 2022-02-05 23.0 0.0
36 2022-02-06 23.0 0.0
37 2022-02-07 23.0 0.0
38 2022-02-08 23.0 211.0
39 2022-02-09 23.0 100.0

How to refer to other rows in Pandas DataFrame in context of a single row?

I have the following example Pandas DataFrame
df
UserID Total Date
1 20 2019-01-01
1 18 2019-01-02
1 22 2019-01-03
1 16 2019-01-04
1 17 2019-01-05
1 26 2019-01-06
1 30 2019-01-07
1 28 2019-01-08
1 28 2019-01-09
1 28 2019-01-10
2 22 2019-01-01
2 11 2019-01-02
2 23 2019-01-03
2 14 2019-01-04
2 19 2019-01-05
2 29 2019-01-06
2 21 2019-01-07
2 22 2019-01-08
2 30 2019-01-09
2 16 2019-01-10
3 27 2019-01-01
3 13 2019-01-02
3 12 2019-01-03
3 27 2019-01-04
3 26 2019-01-05
3 26 2019-01-06
3 30 2019-01-07
3 19 2019-01-08
3 27 2019-01-09
3 29 2019-01-10
4 29 2019-01-01
4 12 2019-01-02
4 25 2019-01-03
4 11 2019-01-04
4 19 2019-01-05
4 20 2019-01-06
4 33 2019-01-07
4 24 2019-01-08
4 22 2019-01-09
4 24 2019-01-10
What I'm trying to achieve is to add a column TotalPast3Days that is basically the sum of Total of the previous 3 days (excluding the current date in the row) for that particular UserID
How can this be done?
For the first 3 days of each user you will get NaN, because there are no "previous 3 days (excluding the current date in the row)". For the rest, you can use shift on the Total column, grouped by UserID: g = df.groupby('UserID')['Total']; df['TotalPast3Days'] = g.shift(1) + g.shift(2) + g.shift(3)
totals = []
for i in range(len(df.index)):
    if i < 3:
        totals.append(0)
    elif df['UserID'].iloc[i] == df['UserID'].iloc[i-3]:
        # same user three rows back, so the previous three rows belong to this user
        total = (df['Total'].iloc[i-1] +
                 df['Total'].iloc[i-2] +
                 df['Total'].iloc[i-3])
        totals.append(total)
    else:
        totals.append(0)
df['Sum of past 3'] = totals
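A loop-free variant, as a rough sketch: shift each user's Total by one row so the current row is excluded, then take a 3-row rolling sum within the user. It assumes one row per day with consecutive dates per UserID, as in the sample, and rows sorted by date within each user. The frame below is a shortened copy of the sample.

```python
import pandas as pd

# Shortened copy of the sample: consecutive days per user.
df = pd.DataFrame({
    "UserID": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Total": [20, 18, 22, 16, 17, 22, 11, 23, 14],
    "Date": pd.to_datetime([
        "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05",
        "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
    ]),
})

# shift(1) drops the current row from the window; rolling(3) sums the three rows before it.
df["TotalPast3Days"] = (
    df.groupby("UserID")["Total"]
      .transform(lambda s: s.shift(1).rolling(3).sum())
)
print(df)
```

The first three rows of each user stay NaN because fewer than three earlier rows exist; append .fillna(0) if zeros are preferred.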

How to get Max Value from group of Column Values for each Row [duplicate]

This question already has answers here:
Find the max of two or more columns with pandas
(3 answers)
Closed 2 years ago.
I have a dataframe as below:
F_Time BP BQ BO0 BO1 BO2 BO3 BO4
0 2020-07-10 09:30:00 10780.00 8550 1 28 1 1 2
1 2020-07-10 10:15:00 10788.00 8700 1 5 10 2 1
2 2020-07-10 10:20:00 10780.00 12150 1 1 1 3 76
3 2020-07-10 10:30:00 10770.00 15675 3 2 8 4 94
4 2020-07-10 10:35:00 10760.60 8100 2 1 1 1 29
5 2020-07-10 10:40:00 10750.00 18825 8 9 154 1 1
6 2020-07-10 11:05:00 10725.00 9825 3 4 94 1 1
I want to find the maximum across a group of columns (BO0, BO1, BO2, BO3, BO4) for every row.
Expected output:
F_Time BP BQ BO
0 2020-07-10 09:30:00 10780.00 8550 28
1 2020-07-10 10:15:00 10788.00 8700 10
2 2020-07-10 10:20:00 10780.00 12150 76
3 2020-07-10 10:30:00 10770.00 15675 94
4 2020-07-10 10:35:00 10760.60 8100 29
5 2020-07-10 10:40:00 10750.00 18825 154
6 2020-07-10 11:05:00 10725.00 9825 94
Try this.
df["BO"] = df[["BO0", "BO1".....]].max(axis=1)
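Spelled out with the five column names from the question, a minimal runnable sketch could look like this. Only the first three rows of the sample are reproduced, and dropping the original BO columns afterwards is an assumption based on the expected output.

```python
import pandas as pd

# First three rows of the sample data.
df = pd.DataFrame({
    "F_Time": pd.to_datetime(["2020-07-10 09:30:00", "2020-07-10 10:15:00",
                              "2020-07-10 10:20:00"]),
    "BP": [10780.00, 10788.00, 10780.00],
    "BQ": [8550, 8700, 12150],
    "BO0": [1, 1, 1],
    "BO1": [28, 5, 1],
    "BO2": [1, 10, 1],
    "BO3": [1, 2, 3],
    "BO4": [2, 1, 76],
})

bo_cols = ["BO0", "BO1", "BO2", "BO3", "BO4"]

# Row-wise maximum across the BO columns.
df["BO"] = df[bo_cols].max(axis=1)

# Keep only the columns shown in the expected output.
result = df.drop(columns=bo_cols)
print(result)
```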

How to find number of days in a given month from a date in Python [duplicate]

This question already has answers here:
Numbers of Day in Month
(4 answers)
Closed 4 years ago.
I have a Python dataframe with one column with dates like below:
Date:
2018-10-19
2018-10-20
2018-10-21
(...)
2019-01-31
2019-02-01
Any ideas on how to add an additional column with the number of days in a month, having something like:
Date DaysinMonth
2018-10-19 31
2018-10-20 31
2018-10-21 31
(...)
Thanks
If you have a dataframe df like the following:
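dates = pd.date_range('2018-01-01', periods=19, freq='20D')  # assumed definition; it reproduces the dates shown below
df = pd.DataFrame({'dates': dates, 'vals': np.random.randint(10, size=dates.shape)})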
dates vals
0 2018-01-01 5
1 2018-01-21 8
2 2018-02-10 1
3 2018-03-02 9
4 2018-03-22 0
5 2018-04-11 3
6 2018-05-01 8
7 2018-05-21 2
8 2018-06-10 4
9 2018-06-30 2
10 2018-07-20 7
11 2018-08-09 5
12 2018-08-29 3
13 2018-09-18 7
14 2018-10-08 6
15 2018-10-28 5
16 2018-11-17 3
17 2018-12-07 4
18 2018-12-27 1
Getting the days in the month is as simple as the following:
df['daysinmonth'] = df['dates'].dt.daysinmonth
dates vals daysinmonth
0 2018-01-01 5 31
1 2018-01-21 8 31
2 2018-02-10 1 28
3 2018-03-02 9 31
4 2018-03-22 0 31
5 2018-04-11 3 30
6 2018-05-01 8 31
7 2018-05-21 2 31
8 2018-06-10 4 30
9 2018-06-30 2 30
10 2018-07-20 7 31
11 2018-08-09 5 31
12 2018-08-29 3 31
13 2018-09-18 7 30
14 2018-10-08 6 31
15 2018-10-28 5 31
16 2018-11-17 3 30
17 2018-12-07 4 31
18 2018-12-27 1 31
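Applied to the column names from the question, a short sketch might look like the following. It assumes the Date column may hold strings, so it is converted with pd.to_datetime first; the 2020 date is added only to show a leap-year February.

```python
import pandas as pd

# Dates from the question plus one leap-year date for illustration.
df = pd.DataFrame({"Date": ["2018-10-19", "2019-01-31", "2019-02-01", "2020-02-15"]})

# The .dt accessor needs a datetime column, so convert strings first.
df["Date"] = pd.to_datetime(df["Date"])

# days_in_month is an alias of daysinmonth.
df["DaysinMonth"] = df["Date"].dt.days_in_month
print(df)
```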
