Pandas fill missing dates and values simultaneously for each group - python

I have a dataframe (mydf) with dates for each group in monthly frequency like below:
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
I want to fill in the missing months for each group, starting from that group's first date and going up to the maximum date in the Dt column, while simultaneously filling the Sales column with 0. So each group starts at its own start date but ends at the same end date.
So, for example, Id=A will start from 2020-10-01 and go all the way to 2021-06-01, and the Sales value for the filled dates will be 0.
So the output will be
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
2021-01-01 A 0
2021-02-01 A 0
2021-03-01 A 0
2021-04-01 A 0
2021-05-01 A 0
2021-06-01 A 0
I have tried reindex, but instead of hard-coding the date range I want to derive it from the dates in the groups.
My code is:
f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt'))
mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0)
mydf = mydf.reset_index()
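A minimal adaptation of that attempt (just a sketch, assuming the intent is each group's own minimum Dt up to the global maximum) derives the range inside the lambda instead of hard-coding it:
end = mydf['Dt'].max()  # global maximum date across all groups
f = lambda x: x.reindex(pd.date_range(x.index.min(), end, freq='MS', name='Dt'))
mydf = (mydf.set_index('Dt')
            .groupby('Id')
            .apply(f)          # note: recent pandas versions may warn about the grouping column inside apply
            .drop('Id', axis=1)
            .fillna(0)
            .reset_index())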

Let's try:
Getting the minimum value per group using groupby.min
Add a new column called max to the aggregated mins, which stores the maximum Dt of the whole frame (via Series.max)
Create individual date_range per group based on the min and max values
Series.explode into rows to have a DataFrame that represents the new index.
Create a MultiIndex.from_frame to reindex the DataFrame with.
reindex with midx and set fill_value=0
# Get Min Per Group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from Frame
dates['max'] = mydf['Dt'].max()
# Create MultiIndex with separate Date ranges per Group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)
# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
        .reindex(midx, fill_value=0)
        .reset_index()
)
mydf:
Dt Id Sales
0 2020-10-01 A 47
1 2020-11-01 A 67
2 2020-12-01 A 46
3 2021-01-01 A 0
4 2021-02-01 A 0
5 2021-03-01 A 0
6 2021-04-01 A 0
7 2021-05-01 A 0
8 2021-06-01 A 0
9 2021-03-01 B 2
10 2021-04-01 B 42
11 2021-05-01 B 20
12 2021-06-01 B 4
DataFrame:
import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])

An alternative using pd.MultiIndex with list comprehension:
s = pd.MultiIndex.from_tuples([[x, d]
                               for x, y in df.groupby("Id")["Dt"]
                               for d in pd.date_range(min(y), max(df["Dt"]), freq="MS")],
                              names=["Id", "Dt"])
print(df.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())

Here is a different approach:
from itertools import product

# compute the min-max date range
date_range = pd.date_range(*mydf['Dt'].agg(['min', 'max']), freq='MS', name='Dt')

# make MultiIndex per group, keep only dates above the min date of each group
idx = pd.MultiIndex.from_tuples([e for Id, Dt_min in mydf.groupby('Id')['Dt'].min().items()
                                   for e in product(date_range[date_range > Dt_min], [Id])])

# concatenate the original dataframe and the missing indexes
mydf = mydf.set_index(['Dt', 'Id'])
mydf = pd.concat([mydf,
                  mydf.reindex(idx.difference(mydf.index)).fillna(0)]
                 ).sort_index(level=1).reset_index()
mydf
output:
Dt Id Sales
0 2020-10-01 A 47.0
1 2020-11-01 A 67.0
2 2020-12-01 A 46.0
3 2021-01-01 A 0.0
4 2021-02-01 A 0.0
5 2021-03-01 A 0.0
6 2021-04-01 A 0.0
7 2021-05-01 A 0.0
8 2021-06-01 A 0.0
9 2021-03-01 B 2.0
10 2021-04-01 B 42.0
11 2021-05-01 B 20.0
12 2021-06-01 B 4.0
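A small follow-up: the reindex/fillna path introduces NaN first, so Sales comes out as float here. If the original integer dtype matters, a cast (sketch) restores it:
mydf['Sales'] = mydf['Sales'].astype(int)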

We can use the complete function from pyjanitor to expose the missing values:
Convert Dt to datetime:
df['Dt'] = pd.to_datetime(df['Dt'])
Create a mapping of Dt to new values, via pd.date_range, and set the frequency to monthly begin (MS):
max_time = df.Dt.max()
new_values = {"Dt": lambda df:pd.date_range(df.min(), max_time, freq='1MS')}
# pip install pyjanitor
import janitor
import pandas as pd
df.complete([new_values], by='Id').fillna(0)
Id Dt Sales
0 A 2020-10-01 47.0
1 A 2020-11-01 67.0
2 A 2020-12-01 46.0
3 A 2021-01-01 0.0
4 A 2021-02-01 0.0
5 A 2021-03-01 0.0
6 A 2021-04-01 0.0
7 A 2021-05-01 0.0
8 A 2021-06-01 0.0
9 B 2021-03-01 2.0
10 B 2021-04-01 42.0
11 B 2021-05-01 20.0
12 B 2021-06-01 4.0
Sticking to pandas only, we can combine apply with groupby and reindex; thankfully, Dt is unique, so we can safely reindex:
(df
 .set_index('Dt')
 .groupby('Id')
 .apply(lambda df: df.reindex(pd.date_range(df.index.min(),
                                            max_time,  # max_time = df.Dt.max(), defined above
                                            freq='1MS'),
                              fill_value=0))
 .drop(columns='Id')
 .rename_axis(['Id', 'Dt'])
 .reset_index())
Id Dt Sales
0 A 2020-10-01 47
1 A 2020-11-01 67
2 A 2020-12-01 46
3 A 2021-01-01 0
4 A 2021-02-01 0
5 A 2021-03-01 0
6 A 2021-04-01 0
7 A 2021-05-01 0
8 A 2021-06-01 0
9 B 2021-03-01 2
10 B 2021-04-01 42
11 B 2021-05-01 20
12 B 2021-06-01 4

Related

Calculate average temperature/humidity between 2 dates pandas data frames

I have the following data frames:
df3
Harvest_date Starting_date
2022-10-06   2022-08-06
2022-02-22   2021-12-22
df (I have all temp and humid starting from 2021-01-01 till the present)
date                 temp  humid
2022-10-06 00:30:00   2    30
2022-10-06 00:01:00   1    30
2022-10-06 00:01:30   0    30
2022-10-06 00:02:00   0    30
2022-10-06 00:02:30  -2    30
I would like to calculate the avg temperature and humidity between the starting_date and harvest_date. I tried this:
import pandas as pd

df = pd.read_csv(r'C:\climate.csv')
df3 = pd.read_csv(r'C:\Flower_weight_Seson.csv')

df['date'] = pd.to_datetime(df.date)
df3['Harvest_date'] = pd.to_datetime(df3.Harvest_date)
df3['Starting_date'] = pd.to_datetime(df3.Starting_date)

df.style.format({"date": lambda t: t.strftime("%Y-%m-%d")})
df3.style.format({"Harvest_date": lambda t: t.strftime("%Y-%m-%d")})
df3.style.format({"Starting_date": lambda t: t.strftime("%Y-%m-%d")})

for harvest_date, starting_date in zip(df3['Harvest_date'], df3['Starting_date']):
    df3["Season avg temp"] = df[df["date"].between(starting_date, harvest_date)]["temp"].mean()
    df3["Season avg humid"] = df[df["date"].between(starting_date, harvest_date)]["humid"].mean()
I get the same value for all dates. Can someone point out what I did wrong, please?
Your loop assigns a single scalar to the whole Season avg temp / Season avg humid columns on every iteration, so each pass overwrites all rows and only the averages of the last date pair remain. Instead, write the result per row with DataFrame.loc, matching indices between the two DataFrames:
#changed data for match with df3
print (df)
date temp humid
0 2022-10-06 00:30:00 2 30
1 2022-09-06 00:01:00 1 33
2 2022-09-06 00:01:30 0 23
3 2022-10-06 00:02:00 0 30
4 2022-01-06 00:02:30 -2 25
for i, harvest_date, starting_date in zip(df3.index, df3['Harvest_date'], df3['Starting_date']):
    mask = df["date"].between(starting_date, harvest_date)
    avg = df.loc[mask, ["temp", 'humid']].mean()
    df3.loc[i, ["Season avg temp", 'Season avg humid']] = avg.to_numpy()

print(df3)
Harvest_date Starting_date Season avg temp Season avg humid
0 2022-10-06 2022-08-06 0.5 28.0
1 2022-02-22 2021-12-22 -2.0 25.0
EDIT: To add a new condition that also matches on the Room column, use:
for i, harvest_date, starting_date, room in zip(df3.index,
                                                df3['Harvest_date'],
                                                df3['Starting_date'],
                                                df3['Room']):
    mask = df["date"].between(starting_date, harvest_date) & df['Room'].eq(room)
    avg = df.loc[mask, ["temp", 'humid']].mean()
    df3.loc[i, ["Season avg temp", 'Season avg humid']] = avg.to_numpy()

print(df3)
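If the row-by-row loop gets slow on larger frames, a cross join followed by filtering is one possible vectorized alternative for the original (no Room) case. This is only a sketch, assuming both frames fit in memory after the join and pandas >= 1.2 for how='cross'; the season_avg name is made up here:
# pair every (Starting_date, Harvest_date) with every reading, then filter and average
pairs = df3.merge(df, how='cross')
pairs = pairs[pairs['date'].between(pairs['Starting_date'], pairs['Harvest_date'])]
season_avg = (pairs.groupby(['Starting_date', 'Harvest_date'])[['temp', 'humid']]
                   .mean()
                   .rename(columns={'temp': 'Season avg temp', 'humid': 'Season avg humid'})
                   .reset_index())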

How to join pandas dataframe to itself by condition?

I have a pandas DataFrame with 2 relevant columns, "date" and "value"; let's assume it looks like this and is ordered by date:
data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                              "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data['date'])
Now I want to join the DataFrame to itself in such a way that, for each last available day in a month, I get the next available day where the value is higher. In our example this should basically look like this:
date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN
My current partial solution to this problem looks like this:
last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)
But because of this answer "Don't iterate over rows in a dataframe", it doesn't feel like the pandas way to me.
So the question is, how to solve it the pandas way?
If you don’t have too many rows, you could generate all pairs of items and filter from there.
Let’s start with getting the last days in the month:
>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
Now use a cross join to map each last day to any possible day, then filter on criteria such as later date and larger value:
>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
date value date2 value2
2 2021-01-31 2 2021-02-01 3
3 2021-01-31 2 2021-02-28 4
4 2021-01-31 2 2021-03-01 5
5 2021-01-31 2 2021-03-31 6
6 2021-01-31 2 2021-04-01 5
7 2021-01-31 2 2021-04-02 8
12 2021-02-28 4 2021-03-01 5
13 2021-02-28 4 2021-03-31 6
14 2021-02-28 4 2021-04-01 5
15 2021-02-28 4 2021-04-02 8
23 2021-03-31 6 2021-04-02 8
Finally use GroupBy.idxmin() on value2 to keep one row per (date, value) group; with this data the smallest value2 also happens to be the earliest qualifying date2:
>>> pairs.loc[pairs.groupby(['date', 'value'])['value2'].idxmin().values]
date value date2 value2
2 2021-01-31 2 2021-02-01 3
12 2021-02-28 4 2021-03-01 5
23 2021-03-31 6 2021-04-02 8
Otherwise you might want apply, which is pretty much the same as iterating on rows to be entirely honest.
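For illustration, a rough apply-based sketch of that idea (the first_higher helper is an assumption, not part of the answer above), reusing last from earlier:
# for each last-day-of-month row, find the first later row with a strictly higher value
def first_higher(row):
    later = data.loc[(data['date'] > row['date']) & (data['value'] > row['value'])]
    if later.empty:
        return pd.Series({'date2': pd.NaT, 'value2': float('nan')})
    return later.iloc[0].rename({'date': 'date2', 'value': 'value2'})

pd.concat([last, last.apply(first_higher, axis=1)], axis=1)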
First create 2 masks: one for the end day of month and another one for the first day of the next month.
m1 = data['date'].diff(1).shift(-1) == pd.Timedelta(days=1)
m2 = m1.shift(1, fill_value=False)
Finally, concatenate the 2 results ignoring index:
>>> pd.concat([data.loc[m1].reset_index(drop=True),
...            data.loc[m2].reset_index(drop=True)], axis="columns")
date value date value
0 2021-01-31 2 2021-02-01 3
1 2021-02-28 4 2021-03-01 5
2 2021-03-31 6 2021-04-01 5
3 2021-04-01 5 2021-04-02 8
One option is with the conditional_join from pyjanitor, which uses binary search underneath, and should be faster/more memory efficient than a cross merge, as the data size increases. Also, have a look at the piso library and see if it can be helpful/more efficient:
Get the last dates, via a groupby (assumption here is that the data is already sorted; if not, you can sort it before grouping):
# pip install pyjanitor
import pandas as pd
import janitor
trim = (data
        .groupby([data.date.dt.year, data.date.dt.month], as_index=False)
        .nth(-1)
        )
trim
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
7 2021-04-02 8
Use conditional_join to get rows where the value from trim is less than data, and the date from trim is less than data as well:
trimmed = trim.conditional_join(data,
                                # variable arguments
                                # tuple is of the form:
                                # col_from_left_df, col_from_right_df, comparator
                                ('value', 'value', '<'),
                                ('date', 'date', '<'),
                                how='left')
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
1 2021-01-31 2 2021-02-28 4.0
2 2021-01-31 2 2021-03-01 5.0
3 2021-01-31 2 2021-04-01 5.0
4 2021-01-31 2 2021-03-31 6.0
5 2021-01-31 2 2021-04-02 8.0
6 2021-02-28 4 2021-03-01 5.0
7 2021-02-28 4 2021-04-01 5.0
8 2021-02-28 4 2021-03-31 6.0
9 2021-02-28 4 2021-04-02 8.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
Since the only interest is in the first match, a groupby is required.
trimmed = (trimmed
           .groupby(('left', 'date'), dropna=False, as_index=False)
           .nth(0)
           )
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
You can fix the columns, to flat form:
trimmed.set_axis(['date', 'value', 'date2', 'value2'], axis = 'columns')
date value date2 value2
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN

How can I get the interval start and end of a groupby.agg function when the timestamp is the index?

DataFrame example:
import pandas as pd
import numpy as np
days = pd.date_range('2020-01-01 00:00:00','2020-01-02 00:00:00',freq='1S')
data = np.random.randint(1, high=100, size=len(days))
category = np.random.choice(['A', 'B', 'C', 'D'], size=len(days))
df = pd.DataFrame({'time': days, 'category': category, 'data': data})
df = df.set_index('time')
df
Output:
category data
time
2020-01-01 00:00:00 B 27
2020-01-01 00:00:01 D 10
2020-01-01 00:00:02 D 87
2020-01-01 00:00:03 B 78
2020-01-01 00:00:04 A 49
2020-01-01 00:00:05 C 21
2020-01-01 00:00:06 C 32
2020-01-01 00:00:07 A 95
2020-01-01 00:00:08 B 75
2020-01-01 00:00:09 B 19
... ...
2020-01-01 23:59:51 D 9
2020-01-01 23:59:52 D 67
2020-01-01 23:59:53 B 57
2020-01-01 23:59:54 D 51
2020-01-01 23:59:55 A 75
2020-01-01 23:59:56 D 47
2020-01-01 23:59:57 B 19
2020-01-01 23:59:58 A 90
2020-01-01 23:59:59 D 7
2020-01-02 00:00:00 B 44
[86401 rows x 2 columns]
I'd like to calculate for each category the min, max and average of data, but also the min and max timestamp. However, since the timestamp is the index, I don't know how to do that. I'm getting:
df.groupby('category').agg({'time': [min, max], 'data': [np.min, np.max, np.average]})
KeyError: "Column 'time' does not exist!"
If I remove the "'time': [min, max]", it works:
data
amin amax average
category
A 1 99 50.072437
B 1 99 49.542499
C 1 99 50.291096
D 1 99 49.851255
You can reset the index before grouping; it doesn't matter, because after the groupby the original index is gone anyway:
df.reset_index().groupby('category').agg({'time': [min, max], 'data': [np.min, np.max, np.average]})
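An alternative sketch (assuming pandas >= 0.25 for named aggregation) keeps the DatetimeIndex and exposes it as a column first; the out name and the aggregation labels are just placeholders:
out = (df.assign(time=df.index)
         .groupby('category')
         .agg(time_min=('time', 'min'),
              time_max=('time', 'max'),
              data_min=('data', 'min'),
              data_max=('data', 'max'),
              data_mean=('data', 'mean')))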

Calculate delta between two columns and two following rows for different group

Are there any vector operations for improving runtime?
I found no other way besides for loops.
Sample DataFrame:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'start_value': [12, 15, 1, 3, 2, 6],
                   'end_value': [20, 17, 6, 19, 13.5, 9]})
ID start_date start_value end_value
0 1 01-Jan 12 20.0
1 1 05-Jan 15 17.0
2 1 08-Jan 1 6.0
3 2 05-Jan 3 19.0
4 2 06-Jan 2 13.5
5 2 10-Jan 6 9.0
I've tried:
import pandas as pd

df_original  # contains the data
data_frame_diff = pd.DataFrame()

for ID in df_original['ID'].unique():
    tmp_frame = df_original.loc[df_original['ID'] == ID]
    tmp_start_value = 0
    for label, row in tmp_frame.iterrows():
        last_delta = tmp_start_value - row['start_value']
        tmp_start_value = row['end_value']
        row['last_delta'] = last_delta
        data_frame_diff = data_frame_diff.append(row, True)
Expected Result:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'last_delta': [0, 5, 16, 0, 17, 7.5]})
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
For each user ID, I want to calculate the delta between the end_value of one row and the start_value of the following row.
Is there a way to improve runtime of this code?
Use DataFrame.groupby on ID and shift the column end_value, then use Series.sub to subtract start_value from it; finally use Series.fillna and assign this new column s to the dataframe using DataFrame.assign:
s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)
Result:
print(df1)
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
It's a bit difficult to follow from your description what you need, but you might find this helpful:
import pandas as pd
df = (pd.DataFrame({'t1': pd.date_range(start="2020-01-01", end="2020-01-02", freq="H")})
      .reset_index()
      .rename(columns={'index': 'ID'}))
df['t2'] = df['t1'] + pd.Timedelta(value=10, unit="H")
df['delta_t1_t2'] = df['t2'] - df['t1']
df['delta_to_previous_t1'] = df['t1'] - df['t1'].shift()
print(df)
It results in
ID t1 t2 delta_t1_t2 delta_to_previous_t1
0 0 2020-01-01 00:00:00 2020-01-01 10:00:00 10:00:00 NaT
1 1 2020-01-01 01:00:00 2020-01-01 11:00:00 10:00:00 01:00:00
2 2 2020-01-01 02:00:00 2020-01-01 12:00:00 10:00:00 01:00:00
3 3 2020-01-01 03:00:00 2020-01-01 13:00:00 10:00:00 01:00:00

Pandas fill in missing date within each group with information in the previous row

Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and the max of that group, and the output should keep just the last row for each date within each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04', '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2016-01-01 2 78.0
4 2016-01-01 2 80.0
5 2016-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2016-01-01 2 80.0
5 2016-01-02 2 80.0
6 2016-01-03 2 80.0
7 2016-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, in sub_id=1 a row was added for 2016-01-02 and amount was imputed at 10.0, as the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2016-01-02 and 2016-01-03 and amount is 80.0, as that was the last row before those dates. The first row for 2016-01-01 was also deleted because we just want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt')
                 .groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
By using resample with groupby
x.dt=pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
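If the duplicated sub_id (it appears both as the index and as a column above) is unwanted, a variant of the same idea (just a sketch) drops the group keys:
(x.set_index('dt')
   .groupby('sub_id', group_keys=False)
   .apply(lambda g: g.resample('D').max().ffill())
   .reset_index())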
use asfreq & groupby
first convert dt to datetime & get rid of duplicates
then for each group of sub_id use asfreq('D', method='ffill') to generate missing dates and impute amounts
finally reset_index on amount column as there's a duplicate sub_id column as well as index.
x.dt = pd.to_datetime(x.dt)

x.drop_duplicates(
    ['dt', 'sub_id'], 'last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The below works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0
