I have a pandas dataframe with 1.7 million rows, like this:
ID  date        value
10  2022-01-01  100
10  2022-01-02  150
10  2022-01-05  200
10  2022-01-07  150
10  2022-01-12  100
23  2022-02-01  490
23  2022-02-03  350
23  2022-02-04  333
23  2022-02-08  211
23  2022-02-09  100
I would like to insert the missing dates in the date column, like this:
ID  date        value
10  2022-01-01  100
10  2022-01-02  150
10  2022-01-03  0
10  2022-01-04  0
10  2022-01-05  200
10  2022-01-06  0
10  2022-01-07  150
10  2022-01-08  0
10  2022-01-09  0
10  2022-01-10  0
10  2022-01-11  0
10  2022-01-12  100
23  2022-02-01  490
23  2022-02-02  0
23  2022-02-03  350
23  2022-02-04  333
23  2022-02-05  0
23  2022-02-06  0
23  2022-02-07  0
23  2022-02-08  211
23  2022-02-09  100
I used:
s = pd.MultiIndex.from_tuples(
    [(x, d)
     for x, y in df.groupby("ID")["date"]
     for d in pd.date_range(y.min(), y.max(), freq="D")],
    names=["ID", "date"])
print(df.set_index(["ID", "date"]).reindex(s, fill_value=0).reset_index())
But it took too long. Is there a more performant way to do this?
You can try:
df['date'] = pd.to_datetime(df['date'])

# build the full daily range for each ID, then merge the original values back in
df = (df.groupby('ID')['date']
        .apply(lambda d: pd.date_range(start=d.min(), end=d.max()).to_list())
        .explode()                              # one row per (ID, date)
        .reset_index()
        .merge(df, on=['ID', 'date'], how='left'))
df['value'] = df['value'].fillna(0).astype(int)
Output:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
Use asfreq and fillna:
# convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])
df = (df.set_index("date")
        .asfreq("D")               # insert the missing daily rows
        .fillna({"value": 0})      # new rows get value 0
        .ffill()                   # forward-fill the ID column
        .reset_index())
>>> df
          date    ID  value
0   2022-01-01  10.0  100.0
1   2022-01-02  10.0  150.0
2   2022-01-03  10.0    0.0
3   2022-01-04  10.0    0.0
4   2022-01-05  10.0  200.0
5   2022-01-06  10.0    0.0
6   2022-01-07  10.0  150.0
7   2022-01-08  10.0    0.0
8   2022-01-09  10.0    0.0
9   2022-01-10  10.0    0.0
10  2022-01-11  10.0    0.0
11  2022-01-12  10.0  100.0
12  2022-01-13  10.0    0.0
13  2022-01-14  10.0    0.0
14  2022-01-15  10.0    0.0
15  2022-01-16  10.0    0.0
16  2022-01-17  10.0    0.0
17  2022-01-18  10.0    0.0
18  2022-01-19  10.0    0.0
19  2022-01-20  10.0    0.0
20  2022-01-21  10.0    0.0
21  2022-01-22  10.0    0.0
22  2022-01-23  10.0    0.0
23  2022-01-24  10.0    0.0
24  2022-01-25  10.0    0.0
25  2022-01-26  10.0    0.0
26  2022-01-27  10.0    0.0
27  2022-01-28  10.0    0.0
28  2022-01-29  10.0    0.0
29  2022-01-30  10.0    0.0
30  2022-01-31  10.0    0.0
31  2022-02-01  23.0  490.0
32  2022-02-02  23.0    0.0
33  2022-02-03  23.0  350.0
34  2022-02-04  23.0  333.0
35  2022-02-05  23.0    0.0
36  2022-02-06  23.0    0.0
37  2022-02-07  23.0    0.0
38  2022-02-08  23.0  211.0
39  2022-02-09  23.0  100.0
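Note that asfreq on the whole frame ignores the ID grouping, which is why ID 10 is stretched through 2022-01-31 above. A per-ID variant is a small change; a sketch, assuming the same df as in the question (out is a hypothetical name):
df["date"] = pd.to_datetime(df["date"])
out = (df.set_index("date")
         .groupby("ID")
         .apply(lambda g: g.drop(columns="ID").asfreq("D"))  # daily rows within each ID only
         .fillna({"value": 0})                               # new rows get value 0
         .reset_index())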
My problem is complex and I haven't been able to find the answer anywhere.
I basically have two dataframes: one is the price history of certain products and the other is an invoice dataframe that contains transaction data.
Sample Data:
Price History:
product_id updated price
id
1 1 2022-01-01 5.0
2 2 2022-01-01 5.5
3 3 2022-01-01 5.7
4 1 2022-01-15 6.0
5 2 2022-01-15 6.5
6 3 2022-01-15 6.7
7 1 2022-02-01 7.0
8 2 2022-02-01 7.5
9 3 2022-02-01 7.7
Invoice:
transaction_date product_id quantity
id
1 2022-01-02 1 2
2 2022-01-02 2 3
3 2022-01-02 3 4
4 2022-01-14 1 1
5 2022-01-14 2 4
6 2022-01-14 3 2
7 2022-01-15 1 3
8 2022-01-15 2 6
9 2022-01-15 3 5
10 2022-01-16 1 3
11 2022-01-16 2 2
12 2022-01-16 3 3
13 2022-02-05 1 1
14 2022-02-05 2 4
15 2022-02-05 3 7
16 2022-05-10 1 4
17 2022-05-10 2 2
18 2022-05-10 3 1
What I am looking to achieve is to add a price column to the Invoice dataframe, based on:
The product id
Comparing the updated and transaction dates such that updated <= transaction_date for that particular record, i.e. finding the most recent price update on or before the transaction (the MAX updated date that is <= transaction_date)
I managed to do this:
invoice['price'] = invoice['product_id'].map(price_history.set_index('id')['price'])
but need to incorporate the date condition now.
The expected result for the sample data is shown as an image in the original post.
Any guidance in the right direction is appreciated, thanks
merge_asof is what you are looking for:
pd.merge_asof(
invoice,
price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
)[["transaction_date", "product_id", "quantity", "price"]]
merge_asof with the direction argument:
merged = pd.merge_asof(
left=invoice,
right=price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
direction="backward",
suffixes=("", "_y")
).drop(columns=["id_y", "updated"]).reset_index(drop=True)
print(merged)
id transaction_date product_id quantity price
0 1 2022-01-02 1 2 5.0
1 2 2022-01-02 2 3 5.5
2 3 2022-01-02 3 4 5.7
3 4 2022-01-14 1 1 5.0
4 5 2022-01-14 2 4 5.5
5 6 2022-01-14 3 2 5.7
6 7 2022-01-15 1 3 6.0
7 8 2022-01-15 2 6 6.5
8 9 2022-01-15 3 5 6.7
9 10 2022-01-16 1 3 6.0
10 11 2022-01-16 2 2 6.5
11 12 2022-01-16 3 3 6.7
12 13 2022-02-05 1 1 7.0
13 14 2022-02-05 2 4 7.5
14 15 2022-02-05 3 7 7.7
15 16 2022-05-10 1 4 7.0
16 17 2022-05-10 2 2 7.5
17 18 2022-05-10 3 1 7.7
I have a log file in the following format:
Item  Month_end_date  old_price  new_price  row
A     2022-03-31      25         30         1
A     2022-06-30      30         40         2
A     2022-08-31      40         45         3
B     2022-04-30      80         70         4
Here, it's assumed from the first row of the table that the price of item A was 25 from the start of the year. I want to get monthly prices from this table. The ideal output looks like the table below:
Item  Month_end_date  price
A     2022-01-31      25
A     2022-02-28      25
A     2022-03-31      30
A     2022-04-30      30
A     2022-05-31      30
A     2022-06-30      40
A     2022-07-31      40
A     2022-08-31      45
A     2022-09-30      45
A     2022-10-31      45
A     2022-11-30      45
A     2022-12-31      45
B     2022-01-31      80
B     2022-02-28      80
B     2022-03-31      80
B     2022-04-30      70
B     2022-05-31      70
B     2022-06-30      70
B     2022-07-31      70
B     2022-08-31      70
B     2022-09-30      70
B     2022-10-31      70
B     2022-11-30      70
B     2022-12-31      70
IIUC, you can reshape, fill in the missing periods and ffill/bfill per group:
(df
 .assign(Month_end_date=pd.to_datetime(df['Month_end_date']))
 .set_index(['Item', 'Month_end_date'])
 [['old_price', 'new_price']]
 # reindex to the full grid: every item for every 2022 month-end
 .reindex(pd.MultiIndex
            .from_product([df['Item'].unique(),
                           pd.date_range('2022-01-01',
                                         '2022-12-31',
                                         freq='M')],
                          names=['Items', 'Month_end_date'])
 )
 # stack old/new price into one long column so a single ffill/bfill
 # chains old_price -> new_price across months within each item
 .stack(dropna=False)
 .groupby(level=0).apply(lambda g: g.ffill().bfill())
 .unstack()['new_price']
 .reset_index(name='price')
)
output:
Items Month_end_date price
0 A 2022-01-31 25.0
1 A 2022-02-28 25.0
2 A 2022-03-31 30.0
3 A 2022-04-30 30.0
4 A 2022-05-31 30.0
5 A 2022-06-30 40.0
6 A 2022-07-31 40.0
7 A 2022-08-31 45.0
8 A 2022-09-30 45.0
9 A 2022-10-31 45.0
10 A 2022-11-30 45.0
11 A 2022-12-31 45.0
12 B 2022-01-31 80.0
13 B 2022-02-28 80.0
14 B 2022-03-31 80.0
15 B 2022-04-30 70.0
16 B 2022-05-31 70.0
17 B 2022-06-30 70.0
18 B 2022-07-31 70.0
19 B 2022-08-31 70.0
20 B 2022-09-30 70.0
21 B 2022-10-31 70.0
22 B 2022-11-30 70.0
23 B 2022-12-31 70.0
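If the stack/unstack trick feels opaque, a more explicit sketch of the same idea: build the full item-by-month grid, forward-fill new_price within each item, and fall back to each item's first old_price for the months before its first update. It assumes the calendar year 2022 and the column names above; months, grid and out are hypothetical names.
df["Month_end_date"] = pd.to_datetime(df["Month_end_date"])

# full grid of every item for every 2022 month-end
months = pd.date_range("2022-01-31", "2022-12-31", freq="M")
grid = pd.MultiIndex.from_product(
    [df["Item"].unique(), months],
    names=["Item", "Month_end_date"]).to_frame(index=False)

out = grid.merge(df, on=["Item", "Month_end_date"], how="left")
out["price"] = (
    out.groupby("Item")["new_price"].ffill()  # carry each update forward
       .fillna(out.groupby("Item")["old_price"].transform("first"))  # months before the first update
)
out = out[["Item", "Month_end_date", "price"]]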
I have three dataframes
Dataframe df1:
date A
0 2022-04-11 1
1 2022-04-12 2
2 2022-04-14 26
3 2022-04-16 2
4 2022-04-17 1
5 2022-04-20 17
6 2022-04-21 14
7 2022-04-22 1
8 2022-04-23 9
9 2022-04-24 1
10 2022-04-25 5
11 2022-04-26 2
12 2022-04-27 21
13 2022-04-28 9
14 2022-04-29 17
15 2022-04-30 5
16 2022-05-01 8
17 2022-05-07 1241217
18 2022-05-08 211
19 2022-05-09 1002521
20 2022-05-10 488739
21 2022-05-11 12925
22 2022-05-12 57
23 2022-05-13 8515098
24 2022-05-14 1134576
Dataframe df2:
date B
0 2022-04-12 8
1 2022-04-14 7
2 2022-04-16 2
3 2022-04-19 2
4 2022-04-23 2
5 2022-05-07 2
6 2022-05-08 5
7 2022-05-09 2
8 2022-05-14 1
Dataframe df3:
date C
0 2022-04-12 6
1 2022-04-13 1
2 2022-04-14 2
3 2022-04-20 3
4 2022-04-21 9
5 2022-04-22 25
6 2022-04-23 56
7 2022-04-24 49
8 2022-04-25 68
9 2022-04-26 71
10 2022-04-27 40
11 2022-04-28 44
12 2022-04-29 27
13 2022-04-30 34
14 2022-05-01 28
15 2022-05-07 9
16 2022-05-08 20
17 2022-05-09 24
18 2022-05-10 21
19 2022-05-11 8
20 2022-05-12 8
21 2022-05-13 14
22 2022-05-14 25
23 2022-05-15 43
24 2022-05-16 36
25 2022-05-17 29
26 2022-05-18 28
27 2022-05-19 17
28 2022-05-20 6
I would like to merge df1, df2, df3 into a single dataframe with columns date, A, B, C, such that date contains every date that appears in df1, df2, or df3 (without repetition); if a particular date is missing from one of the dataframes, the corresponding column gets the value 0.0. So I would like to have something like this:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-13 0.0 0.0 1.0
...
I tried this method:
merge1 = pd.merge(df1, df2, how='outer')
sorted_merge1 = merge1.sort_values(by=['date'], ascending=False)
full_merge = pd.merge(sorted_merge1, df3, how='outer')
However, it seems to skip the dates which are not common to all three dataframes.
Try this,
print(pd.merge(df1, df2, on='date', how='outer').merge(df3, on='date', how='outer').fillna(0))
O/P:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-14 26.0 7.0 2.0
3 2022-04-16 2.0 2.0 0.0
4 2022-04-17 1.0 0.0 0.0
5 2022-04-20 17.0 0.0 3.0
6 2022-04-21 14.0 0.0 9.0
7 2022-04-22 1.0 0.0 25.0
8 2022-04-23 9.0 2.0 56.0
9 2022-04-24 1.0 0.0 49.0
10 2022-04-25 5.0 0.0 68.0
11 2022-04-26 2.0 0.0 71.0
12 2022-04-27 21.0 0.0 40.0
13 2022-04-28 9.0 0.0 44.0
14 2022-04-29 17.0 0.0 27.0
15 2022-04-30 5.0 0.0 34.0
16 2022-05-01 8.0 0.0 28.0
17 2022-05-07 1241217.0 2.0 9.0
18 2022-05-08 211.0 5.0 20.0
19 2022-05-09 1002521.0 2.0 24.0
20 2022-05-10 488739.0 0.0 21.0
21 2022-05-11 12925.0 0.0 8.0
22 2022-05-12 57.0 0.0 8.0
23 2022-05-13 8515098.0 0.0 14.0
24 2022-05-14 1134576.0 1.0 25.0
25 2022-04-19 0.0 2.0 0.0
26 2022-04-13 0.0 0.0 1.0
27 2022-05-15 0.0 0.0 43.0
28 2022-05-16 0.0 0.0 36.0
29 2022-05-17 0.0 0.0 29.0
30 2022-05-18 0.0 0.0 28.0
31 2022-05-19 0.0 0.0 17.0
32 2022-05-20 0.0 0.0 6.0
Perform a chained outer merge and fill the NaNs with 0. Note that right-only rows are appended at the end, so sort by date afterwards if you need chronological order.
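If the number of dataframes grows beyond three, the same chain generalizes with functools.reduce; a sketch, assuming all frames share a 'date' column:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]  # any number of frames with a common 'date' column
merged = reduce(lambda left, right: pd.merge(left, right, on="date", how="outer"), dfs)
merged = merged.fillna(0).sort_values("date").reset_index(drop=True)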
I have the following example Pandas DataFrame
UserID Total Date
1 20 2019-01-01
1 18 2019-01-04
1 22 2019-01-05
1 16 2019-01-07
1 17 2019-01-09
1 26 2019-01-11
1 30 2019-01-12
1 28 2019-01-13
1 28 2019-01-15
1 28 2019-01-16
2 22 2019-01-06
2 11 2019-01-07
2 23 2019-01-09
2 14 2019-01-13
2 19 2019-01-14
2 29 2019-01-15
2 21 2019-01-16
2 22 2019-01-18
2 30 2019-01-22
2 16 2019-01-23
3 27 2019-01-01
3 13 2019-01-04
3 12 2019-01-05
3 27 2019-01-06
3 26 2019-01-09
3 26 2019-01-10
3 30 2019-01-11
3 19 2019-01-12
3 27 2019-01-13
3 29 2019-01-14
4 29 2019-01-07
4 12 2019-01-09
4 25 2019-01-10
4 11 2019-01-11
4 19 2019-01-13
4 20 2019-01-14
4 33 2019-01-15
4 24 2019-01-18
4 22 2019-01-19
4 24 2019-01-21
My goal is to add a column named TotalPrev10Days, which is the sum of Total over the previous 10 days (for each UserID).
I did a basic implementation using nested loops, comparing each row's date against a 10-day timedelta.
Here's my code:
from datetime import timedelta

users = set(df.UserID)  # get set of all unique user IDs
TotalPrev10Days = []
delta = timedelta(days=10)  # 10 day time delta to subtract from each row date
for user in users:  # looping over all user IDs
    user_df = df[df["UserID"] == user]  # dataframe that includes only current userID data
    for row_index in user_df.index:  # looping over each row from UserID dataframe
        row_date = user_df["Date"][row_index]
        row_date_minus_10 = row_date - delta  # subtracting 10 days
        sum_prev_10_days = user_df[(user_df["Date"] < row_date)
                                   & (user_df["Date"] >= row_date_minus_10)]["Total"].sum()
        TotalPrev10Days.append(sum_prev_10_days)  # appending total to a list

df["TotalPrev10Days"] = TotalPrev10Days  # assigning list to new DataFrame column
While it works perfectly, it's very slow for large datasets.
Is there a faster, more Pandas-native approach to this problem?
IIUC, try:
df["TotalPrev10Days"] = df.groupby("UserID") \
.rolling("9D", on="Date") \
.sum() \
.shift() \
.fillna(0)["Total"] \
.droplevel(0)
>>> df
UserID Total Date TotalPrev10Days
0 1 20 2019-01-01 0.0
1 1 18 2019-01-04 20.0
2 1 22 2019-01-05 38.0
3 1 16 2019-01-07 60.0
4 1 17 2019-01-09 76.0
5 1 26 2019-01-11 93.0
6 1 30 2019-01-12 99.0
7 1 28 2019-01-13 129.0
8 1 28 2019-01-15 139.0
9 1 28 2019-01-16 145.0
10 2 22 2019-01-06 0.0
11 2 11 2019-01-07 22.0
12 2 23 2019-01-09 33.0
13 2 14 2019-01-13 56.0
14 2 19 2019-01-14 70.0
15 2 29 2019-01-15 89.0
16 2 21 2019-01-16 96.0
17 2 22 2019-01-18 106.0
18 2 30 2019-01-22 105.0
19 2 16 2019-01-23 121.0
20 3 27 2019-01-01 0.0
21 3 13 2019-01-04 27.0
22 3 12 2019-01-05 40.0
23 3 27 2019-01-06 52.0
24 3 26 2019-01-09 79.0
25 3 26 2019-01-10 105.0
26 3 30 2019-01-11 104.0
27 3 19 2019-01-12 134.0
28 3 27 2019-01-13 153.0
29 3 29 2019-01-14 167.0
30 4 29 2019-01-07 0.0
31 4 12 2019-01-09 29.0
32 4 25 2019-01-10 41.0
33 4 11 2019-01-11 66.0
34 4 19 2019-01-13 77.0
35 4 20 2019-01-14 96.0
36 4 33 2019-01-15 116.0
37 4 24 2019-01-18 149.0
38 4 22 2019-01-19 132.0
39 4 24 2019-01-21 129.0
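If you want an exact vectorized counterpart of the loop (a left-closed window covering [date - 10 days, date), excluding the current day), rolling with closed="left" is a sketch worth trying:
# exact counterpart of the loop: sum Total over [date - 10D, date) per user
df["Date"] = pd.to_datetime(df["Date"])
df["TotalPrev10Days"] = (
    df.groupby("UserID")
      .rolling("10D", on="Date", closed="left")  # window excludes the current day
      .sum()["Total"]
      .fillna(0)       # first row of each user has an empty window
      .droplevel(0)    # drop the UserID level so it aligns with df's index
)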
I am trying to fix every row where a date is missing. The idea is just to fill the gaps between the existing dates and complete the other columns with the previous values.
ds SKU Estoque leadtime
0 2018-01-02 504777 45 11
1 2018-01-04 504777 42 11
2 2018-01-05 504777 41 11
3 2018-01-09 504777 40 11
4 2018-01-12 504777 37 11
5 2018-01-13 504777 36 11
6 2018-01-15 504777 35 11
... ... ... ... ...
6629 2018-08-14 857122 11 10
6630 2018-08-15 857122 10 10
6631 2018-08-16 857122 9 10
6632 2018-08-17 857122 7 10
6633 2018-08-23 857122 14 10
6634 2018-08-24 857122 13 10
I have already tried:
df.set_index('ds', inplace=True)
df = df.resample("D")
or
df.resample("D", how='first', fill_method='ffill')
But I just got this:
DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
When I tried:
(df.groupby('SKU')
.resample('D')
.last()
.reset_index()
.set_index('ds'))
I got this error:
ValueError: cannot insert SKU, already exists
I am trying to have this result:
ds SKU Estoque leadtime
0 2018-01-02 504777 45 11
1 2018-01-03 504777 45 11
2 2018-01-04 504777 42 11
3 2018-01-05 504777 41 11
4 2018-01-06 504777 41 11
5 2018-01-07 504777 41 11
6 2018-01-08 504777 41 11
7 2018-01-09 504777 40 11
... ... ... ... ...
PS: If I set the date as the index, I get duplicate index values, so I need to isolate each product first (group by).
In your case you may need to chain with apply:
# df.set_index('ds', inplace=True)  # if 'ds' is not already the index
df.groupby('SKU').apply(lambda x: x.resample('D').ffill()).reset_index(level=0, drop=True)
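A minimal end-to-end sketch, assuming ds starts as a string column and you want a flat frame like the one in the question back (out is a hypothetical name):
# convert and index by date first; resample needs a DatetimeIndex
df["ds"] = pd.to_datetime(df["ds"])
out = (df.set_index("ds")
         .groupby("SKU")
         .apply(lambda x: x.resample("D").ffill())  # daily rows per SKU, forward-filled
         .reset_index(level=0, drop=True)           # drop the duplicate SKU index level
         .reset_index())                            # bring 'ds' back as a column
print(out.head(8))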