I am trying to correct every row where there is no date. The idea is just to fill the gaps between the missing dates and complete the other columns with the previous values.
ds SKU Estoque leadtime
0 2018-01-02 504777 45 11
1 2018-01-04 504777 42 11
2 2018-01-05 504777 41 11
3 2018-01-09 504777 40 11
4 2018-01-12 504777 37 11
5 2018-01-13 504777 36 11
6 2018-01-15 504777 35 11
... ... ... ... ...
6629 2018-08-14 857122 11 10
6630 2018-08-15 857122 10 10
6631 2018-08-16 857122 9 10
6632 2018-08-17 857122 7 10
6633 2018-08-23 857122 14 10
6634 2018-08-24 857122 13 10
I have already tried to:
df.set_index('ds', inplace=True)
df = df.resample("D")
or
df.resample("D", how='first', fill_method='ffill')
But I just got this:
DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
When I tried:
(df.groupby('SKU')
.resample('D')
.last()
.reset_index()
.set_index('ds'))
I got this error:
ValueError: cannot insert SKU, already exists
I am trying to have this result:
ds SKU Estoque leadtime
0 2018-01-02 504777 45 11
1 2018-01-03 504777 45 11
2 2018-01-04 504777 42 11
3 2018-01-05 504777 41 11
4 2018-01-06 504777 41 11
5 2018-01-07 504777 41 11
6 2018-01-08 504777 41 11
7 2018-01-09 504777 40 11
... ... ... ... ...
PS: If I set the date as the index, I get a duplicated index. I need to isolate each product first (group by).
In your case you may need to chain with apply
#df.set_index('ds', inplace=True)
df.groupby('SKU').apply(lambda x : x.resample('D').ffill()).reset_index(level=0,drop=True)
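For reference, a minimal end-to-end sketch of the same approach applied to the frame above (a sketch, assuming 'ds' is still a regular column); dropping the duplicated SKU column before reset_index is what avoids the "cannot insert SKU, already exists" error:
import pandas as pd

df['ds'] = pd.to_datetime(df['ds'])
out = (df.set_index('ds')
         .groupby('SKU')
         .resample('D')
         .ffill()
         .drop(columns='SKU', errors='ignore')  # SKU is already in the index, drop the duplicate column if present
         .reset_index())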
Related
I have a pandas dataframe with 1.7 million rows, like this:
ID  date        value
10  2022-01-01  100
10  2022-01-02  150
10  2022-01-05  200
10  2022-01-07  150
10  2022-01-12  100
23  2022-02-01  490
23  2022-02-03  350
23  2022-02-04  333
23  2022-02-08  211
23  2022-02-09  100
I would like to insert the missing dates in the column date. Like this:
ID  date        value
10  2022-01-01  100
10  2022-01-02  150
10  2022-01-03  0
10  2022-01-04  0
10  2022-01-05  200
10  2022-01-06  0
10  2022-01-07  150
10  2022-01-08  0
10  2022-01-09  0
10  2022-01-10  0
10  2022-01-11  0
10  2022-01-12  100
23  2022-02-01  490
23  2022-02-02  0
23  2022-02-03  350
23  2022-02-04  333
23  2022-02-05  0
23  2022-02-06  0
23  2022-02-07  0
23  2022-02-08  211
23  2022-02-09  100
I used:
s = (pd.MultiIndex.from_tuples([[x, d]
for x, y in df.groupby("Id")["Dt"]
for d in pd.date_range(min(y), max(df["Dt"]), freq="MS")], names=["Id", "Dt"]))
print (df.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())
But it took too long. Is there a more performant way to do this?
You can try:
df['date'] = pd.to_datetime(df['date'])
df = (df.groupby('ID')['date']
        .apply(lambda d: pd.date_range(start=d.min(), end=d.max()).to_list())
        .explode()
        .reset_index()
        .merge(df, on=['ID', 'date'], how='left'))
df['value'] = df['value'].fillna(0).astype(int)
Output:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
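A hedged variant of the OP's own MultiIndex/reindex attempt may also work once the frequency is daily and the range is taken per ID (the original attempt used freq="MS" and the global max date); this is only a sketch, not benchmarked on the full 1.7 million rows:
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
idx = pd.MultiIndex.from_tuples(
    [(i, d)
     for i, g in df.groupby("ID")["date"]
     for d in pd.date_range(g.min(), g.max(), freq="D")],
    names=["ID", "date"],
)
out = df.set_index(["ID", "date"]).reindex(idx, fill_value=0).reset_index()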
Use asfreq and fillna:
#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").asfreq("D").fillna({"value": "0"}).ffill().reset_index()
>>> df
date ID value
0 2022-01-01 10.0 100.0
1 2022-01-02 10.0 150.0
2 2022-01-03 10.0 0
3 2022-01-04 10.0 0
4 2022-01-05 10.0 200.0
5 2022-01-06 10.0 0
6 2022-01-07 10.0 150.0
7 2022-01-08 10.0 0
8 2022-01-09 10.0 0
9 2022-01-10 10.0 0
10 2022-01-11 10.0 0
11 2022-01-12 10.0 100.0
12 2022-01-13 10.0 0
13 2022-01-14 10.0 0
14 2022-01-15 10.0 0
15 2022-01-16 10.0 0
16 2022-01-17 10.0 0
17 2022-01-18 10.0 0
18 2022-01-19 10.0 0
19 2022-01-20 10.0 0
20 2022-01-21 10.0 0
21 2022-01-22 10.0 0
22 2022-01-23 10.0 0
23 2022-01-24 10.0 0
24 2022-01-25 10.0 0
25 2022-01-26 10.0 0
26 2022-01-27 10.0 0
27 2022-01-28 10.0 0
28 2022-01-29 10.0 0
29 2022-01-30 10.0 0
30 2022-01-31 10.0 0
31 2022-02-01 23.0 490.0
32 2022-02-02 23.0 0
33 2022-02-03 23.0 350.0
34 2022-02-04 23.0 333.0
35 2022-02-05 23.0 0
36 2022-02-06 23.0 0
37 2022-02-07 23.0 0
38 2022-02-08 23.0 211.0
39 2022-02-09 23.0 100.0
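For a per-ID version of the same asfreq idea (so each ID only gets the dates between its own first and last observation), a hedged sketch could group first and apply asfreq within each group:
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
out = (df.sort_values("date")
         .set_index("date")
         .groupby("ID")[["value"]]
         .apply(lambda g: g.asfreq("D", fill_value=0))  # fill this ID's missing days with 0
         .reset_index())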
I have the following example Pandas DataFrame
UserID Total Date
1 20 2019-01-01
1 18 2019-01-04
1 22 2019-01-05
1 16 2019-01-07
1 17 2019-01-09
1 26 2019-01-11
1 30 2019-01-12
1 28 2019-01-13
1 28 2019-01-15
1 28 2019-01-16
2 22 2019-01-06
2 11 2019-01-07
2 23 2019-01-09
2 14 2019-01-13
2 19 2019-01-14
2 29 2019-01-15
2 21 2019-01-16
2 22 2019-01-18
2 30 2019-01-22
2 16 2019-01-23
3 27 2019-01-01
3 13 2019-01-04
3 12 2019-01-05
3 27 2019-01-06
3 26 2019-01-09
3 26 2019-01-10
3 30 2019-01-11
3 19 2019-01-12
3 27 2019-01-13
3 29 2019-01-14
4 29 2019-01-07
4 12 2019-01-09
4 25 2019-01-10
4 11 2019-01-11
4 19 2019-01-13
4 20 2019-01-14
4 33 2019-01-15
4 24 2019-01-18
4 22 2019-01-19
4 24 2019-01-21
My goal is to add a column named TotalPrev10Days, which is the sum of Total over the previous 10 days (for each UserID).
I did a basic implementation using nested loops and comparing the current date with a timedelta.
Here's my code:
from datetime import timedelta

users = set(df.UserID)  # set of all unique user IDs
TotalPrev10Days = []
delta = timedelta(days=10)  # 10-day time delta to subtract from each row's date
for user in users:  # loop over all user IDs
    user_df = df[df["UserID"] == user]  # dataframe with only the current user's data
    for row_index in user_df.index:  # loop over each row of the user's dataframe
        row_date = user_df["Date"][row_index]
        row_date_minus_10 = row_date - delta  # subtract 10 days
        sum_prev_10_days = user_df[(user_df["Date"] < row_date) & (user_df["Date"] >= row_date_minus_10)]["Total"].sum()
        TotalPrev10Days.append(sum_prev_10_days)  # append the total to a list
df["TotalPrev10Days"] = TotalPrev10Days  # assign the list to a new DataFrame column
While it works perfectly, it's very slow for large datasets.
Is there a faster, more Pandas-native approach to this problem?
IIUC, try:
df["TotalPrev10Days"] = df.groupby("UserID") \
.rolling("9D", on="Date") \
.sum() \
.shift() \
.fillna(0)["Total"] \
.droplevel(0)
>>> df
UserID Total Date TotalPrev10Days
0 1 20 2019-01-01 0.0
1 1 18 2019-01-04 20.0
2 1 22 2019-01-05 38.0
3 1 16 2019-01-07 60.0
4 1 17 2019-01-09 76.0
5 1 26 2019-01-11 93.0
6 1 30 2019-01-12 99.0
7 1 28 2019-01-13 129.0
8 1 28 2019-01-15 139.0
9 1 28 2019-01-16 145.0
10 2 22 2019-01-06 0.0
11 2 11 2019-01-07 22.0
12 2 23 2019-01-09 33.0
13 2 14 2019-01-13 56.0
14 2 19 2019-01-14 70.0
15 2 29 2019-01-15 89.0
16 2 21 2019-01-16 96.0
17 2 22 2019-01-18 106.0
18 2 30 2019-01-22 105.0
19 2 16 2019-01-23 121.0
20 3 27 2019-01-01 0.0
21 3 13 2019-01-04 27.0
22 3 12 2019-01-05 40.0
23 3 27 2019-01-06 52.0
24 3 26 2019-01-09 79.0
25 3 26 2019-01-10 105.0
26 3 30 2019-01-11 104.0
27 3 19 2019-01-12 134.0
28 3 27 2019-01-13 153.0
29 3 29 2019-01-14 167.0
30 4 29 2019-01-07 0.0
31 4 12 2019-01-09 29.0
32 4 25 2019-01-10 41.0
33 4 11 2019-01-11 66.0
34 4 19 2019-01-13 77.0
35 4 20 2019-01-14 96.0
36 4 33 2019-01-15 116.0
37 4 24 2019-01-18 149.0
38 4 22 2019-01-19 132.0
39 4 24 2019-01-21 129.0
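For comparison, a hedged alternative sketch (assuming a pandas version where time-based rolling windows accept the closed argument): a "10D" window with closed="left" excludes the current row, which matches the OP's [date - 10 days, date) condition directly and avoids the shift:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["UserID", "Date"])  # time-based rolling needs Date monotonic within each group
df["TotalPrev10Days"] = (
    df.groupby("UserID")
      .rolling("10D", on="Date", closed="left")
      .sum()["Total"]
      .fillna(0)       # the first row of each group has an empty window
      .droplevel(0)    # drop the UserID level so the result aligns with df's index
)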
I have a dataframe created using:
year_start = '2020-03-29'
year_end = '2021-04-10'
week_end_sat = pd.DataFrame(pd.date_range(year_start, year_end, freq=f'W-SAT'), columns=['a'])
How can I make another column specifying the week number, with 2020-03-29 as the first day of the calendar? I am trying to make a 4-4-5 calendar which always ends on a Saturday.
Final df that I want is,
a | count
2020-04-04 | 1
2020-04-11 | 2
.
.
.
2021-04-03 | 53 # since 2020 is a leap year there are 53 weeks; otherwise it would be 52 weeks
2021-04-10 | 1
2021-04-17 | 2
.
2022-03-02 | 52
2022-04-09 | 1
I think you can create a baseline date range starting from the first day of your given year_start.
first_day_of_year = week_end_sat.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(week_end_sat), freq=f'W-SAT'))
The baseline's week of year is what you want.
week_end_sat['count'] = baseline.dt.isocalendar().week
# print(week_end_sat)
a count
0 2020-04-04 1
1 2020-04-11 2
2 2020-04-18 3
3 2020-04-25 4
4 2020-05-02 5
5 2020-05-09 6
6 2020-05-16 7
7 2020-05-23 8
8 2020-05-30 9
9 2020-06-06 10
10 2020-06-13 11
11 2020-06-20 12
12 2020-06-27 13
13 2020-07-04 14
14 2020-07-11 15
15 2020-07-18 16
16 2020-07-25 17
17 2020-08-01 18
18 2020-08-08 19
19 2020-08-15 20
20 2020-08-22 21
21 2020-08-29 22
...
43 2021-01-30 44
44 2021-02-06 45
45 2021-02-13 46
46 2021-02-20 47
47 2021-02-27 48
48 2021-03-06 49
49 2021-03-13 50
50 2021-03-20 51
51 2021-03-27 52
52 2021-04-03 53
53 2021-04-10 1
I calculated the week number using a W-SAT frequency and the isocalendar API. I then created a baseline using the first day of the year and assigned its week number to baseline_week_number, so each week now has an associated baseline week number.
import datetime
import pandas as pd

year_start = '2020-03-29'
year_end = '2021-04-10'
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'), columns=['week_date'])
df['week_number'] = df['week_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
first_day_of_year = df.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(df), freq='W-SAT'))
df['baseline_date'] = baseline
df['baseline_week_number'] = df['baseline_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
print(df)
output:
week_date week_number baseline_date baseline_week_number
0 2020-04-04 14 2020-01-04 1
1 2020-04-11 15 2020-01-11 2
2 2020-04-18 16 2020-01-18 3
3 2020-04-25 17 2020-01-25 4
4 2020-05-02 18 2020-02-01 5
5 2020-05-09 19 2020-02-08 6
6 2020-05-16 20 2020-02-15 7
7 2020-05-23 21 2020-02-22 8
8 2020-05-30 22 2020-02-29 9
9 2020-06-06 23 2020-03-07 10
10 2020-06-13 24 2020-03-14 11
11 2020-06-20 25 2020-03-21 12
12 2020-06-27 26 2020-03-28 13
13 2020-07-04 27 2020-04-04 14
14 2020-07-11 28 2020-04-11 15
15 2020-07-18 29 2020-04-18 16
16 2020-07-25 30 2020-04-25 17
17 2020-08-01 31 2020-05-02 18
18 2020-08-08 32 2020-05-09 19
19 2020-08-15 33 2020-05-16 20
20 2020-08-22 34 2020-05-23 21
21 2020-08-29 35 2020-05-30 22
22 2020-09-05 36 2020-06-06 23
23 2020-09-12 37 2020-06-13 24
24 2020-09-19 38 2020-06-20 25
25 2020-09-26 39 2020-06-27 26
26 2020-10-03 40 2020-07-04 27
27 2020-10-10 41 2020-07-11 28
28 2020-10-17 42 2020-07-18 29
29 2020-10-24 43 2020-07-25 30
30 2020-10-31 44 2020-08-01 31
31 2020-11-07 45 2020-08-08 32
32 2020-11-14 46 2020-08-15 33
33 2020-11-21 47 2020-08-22 34
34 2020-11-28 48 2020-08-29 35
35 2020-12-05 49 2020-09-05 36
36 2020-12-12 50 2020-09-12 37
37 2020-12-19 51 2020-09-19 38
38 2020-12-26 52 2020-09-26 39
39 2021-01-02 53 2020-10-03 40
40 2021-01-09 1 2020-10-10 41
41 2021-01-16 2 2020-10-17 42
42 2021-01-23 3 2020-10-24 43
43 2021-01-30 4 2020-10-31 44
44 2021-02-06 5 2020-11-07 45
45 2021-02-13 6 2020-11-14 46
46 2021-02-20 7 2020-11-21 47
47 2021-02-27 8 2020-11-28 48
48 2021-03-06 9 2020-12-05 49
49 2021-03-13 10 2020-12-12 50
50 2021-03-20 11 2020-12-19 51
51 2021-03-27 12 2020-12-26 52
52 2021-04-03 13 2021-01-02 53
53 2021-04-10 14 2021-01-09 1
I'm trying to concatenate a Series onto the right side of a dataframe with the column name 'RSI'. However, because the Series is of shorter length than the other columns in the dataframe, I need to ensure that NaN values are appended to the top of the column and not the bottom. Right now, I've used the following code but I can't find an argument that would allow me to have the desired output.
RSI = pd.Series(RSI)
df = pd.concat((df, RSI.rename('RSI')), axis='columns')
So far, this is my output:
Dates Prices Volumes RSI
0 2013-02-08 201.68 2893254 47.7357
1 2013-02-11 200.16 2944651 53.3967
2 2013-02-12 200.04 2461779 56.3866
3 2013-02-13 200.09 2169757 60.1845
4 2013-02-14 199.65 3294126 62.1784
5 2013-02-15 200.98 3627887 63.9720
6 2013-02-19 200.32 2998317 62.9671
7 2013-02-20 199.31 3715311 63.9232
8 2013-02-21 198.33 3923051 66.8817
9 2013-02-22 201.09 3107876 72.8258
10 2013-02-25 197.51 3845276 69.6578
11 2013-02-26 199.14 3391562 63.8458
12 2013-02-27 202.33 4185545 64.2776
13 2013-02-28 200.83 4689698 67.2445
14 2013-03-01 202.91 3308544 58.2408
15 2013-03-04 205.19 3693365 57.7058
16 2013-03-05 206.53 3807706 53.7482
17 2013-03-06 208.38 3594899 57.5396
18 2013-03-07 209.42 3884317 53.2722
19 2013-03-08 210.38 3700086 58.6824
20 2013-03-11 210.08 3048901 56.0161
21 2013-03-12 210.55 3591261 60.2066
22 2013-03-13 212.06 3355969 55.3322
23 2013-03-14 215.80 5505484 51.7492
24 2013-03-15 214.92 7935024 47.1241
25 2013-03-18 213.21 3006125 46.9102
26 2013-03-19 213.44 3198577 46.6569
27 2013-03-20 215.06 3019153 54.0822
28 2013-03-21 212.26 5830566 56.2525
29 2013-03-22 212.08 3015847 51.8359
... ... ... ... ...
1229 2017-12-26 152.83 2479017 80.1930
1230 2017-12-27 153.13 2149257 80.7444
1231 2017-12-28 154.04 2687624 56.4425
1232 2017-12-29 153.42 3327087 56.9183
1233 2018-01-02 154.25 4202503 63.6958
1234 2018-01-03 158.49 9441567 61.1962
1235 2018-01-04 161.70 7556249 61.3816
1236 2018-01-05 162.49 5195764 64.7724
1237 2018-01-08 163.47 5237523 63.0508
1238 2018-01-09 163.83 4341810 53.9559
1239 2018-01-10 164.18 4174105 54.1351
1240 2018-01-11 164.20 3794453 50.6824
1241 2018-01-12 163.14 5031886 43.0222
1242 2018-01-16 163.85 7794195 32.7428
1243 2018-01-17 168.65 11710033 39.4754
1244 2018-01-18 169.12 14259345 37.3409
1245 2018-01-19 162.37 21172488 NaN
1246 2018-01-22 162.60 8480795 NaN
1247 2018-01-23 166.25 7466232 NaN
1248 2018-01-24 165.37 5645003 NaN
1249 2018-01-25 165.47 3302520 NaN
1250 2018-01-26 167.34 3787913 NaN
1251 2018-01-29 166.80 3516995 NaN
1252 2018-01-30 163.62 4902341 NaN
1253 2018-01-31 163.70 4072830 NaN
1254 2018-02-01 162.40 4434242 NaN
1255 2018-02-02 159.03 5251938 NaN
1256 2018-02-05 152.53 8746599 NaN
1257 2018-02-06 155.34 9867678 NaN
1258 2018-02-07 153.85 6149207 NaN
However, I need it to look like this:
Dates Prices Volumes RSI
0 2013-02-08 201.68 2893254 NaN
1 2013-02-11 200.16 2944651 NaN
2 2013-02-12 200.04 2461779 NaN
3 2013-02-13 200.09 2169757 NaN
4 2013-02-14 199.65 3294126 NaN
5 2013-02-15 200.98 3627887 NaN
6 2013-02-19 200.32 2998317 NaN
7 2013-02-20 199.31 3715311 NaN
8 2013-02-21 198.33 3923051 NaN
9 2013-02-22 201.09 3107876 NaN
10 2013-02-25 197.51 3845276 NaN
11 2013-02-26 199.14 3391562 NaN
12 2013-02-27 202.33 4185545 NaN
13 2013-02-28 200.83 4689698 NaN
14 2013-03-01 202.91 3308544 NaN
15 2013-03-04 205.19 3693365 57.7058
16 2013-03-05 206.53 3807706 53.7482
17 2013-03-06 208.38 3594899 57.5396
18 2013-03-07 209.42 3884317 53.2722
19 2013-03-08 210.38 3700086 58.6824
20 2013-03-11 210.08 3048901 56.0161
21 2013-03-12 210.55 3591261 60.2066
22 2013-03-13 212.06 3355969 55.3322
23 2013-03-14 215.80 5505484 51.7492
24 2013-03-15 214.92 7935024 47.1241
25 2013-03-18 213.21 3006125 46.9102
26 2013-03-19 213.44 3198577 46.6569
27 2013-03-20 215.06 3019153 54.0822
28 2013-03-21 212.26 5830566 56.2525
29 2013-03-22 212.08 3015847 51.8359
... ... ... ... ...
1229 2017-12-26 152.83 2479017 80.1930
1230 2017-12-27 153.13 2149257 80.7444
1231 2017-12-28 154.04 2687624 56.4425
1232 2017-12-29 153.42 3327087 56.9183
1233 2018-01-02 154.25 4202503 63.6958
1234 2018-01-03 158.49 9441567 61.1962
1235 2018-01-04 161.70 7556249 61.3816
1236 2018-01-05 162.49 5195764 64.7724
1237 2018-01-08 163.47 5237523 63.0508
1238 2018-01-09 163.83 4341810 53.9559
1239 2018-01-10 164.18 4174105 54.1351
1240 2018-01-11 164.20 3794453 50.6824
1241 2018-01-12 163.14 5031886 43.0222
1242 2018-01-16 163.85 7794195 32.7428
1243 2018-01-17 168.65 11710033 39.4754
1244 2018-01-18 169.12 14259345 36.9999
1245 2018-01-19 162.37 21172488 41.1297
1246 2018-01-22 162.60 8480795 12.1231
1247 2018-01-23 166.25 7466232 39.0977
1248 2018-01-24 165.37 5645003 63.6958
1249 2018-01-25 165.47 3302520 56.4425
1250 2018-01-26 167.34 3787913 80.7444
1251 2018-01-29 166.80 3516995 61.1962
1252 2018-01-30 163.62 4902341 58.6824
1253 2018-01-31 163.70 4072830 53.7482
1254 2018-02-01 162.40 4434242 43.0222
1255 2018-02-02 159.03 5251938 61.1962
1256 2018-02-05 152.53 8746599 56.4425
1257 2018-02-06 155.34 9867678 36.0978
1258 2018-02-07 153.85 6149207 41.1311
Thanks for the help.
Another way is to manipulate the rsi Series index to match the df index from the bottom up (only the first few rows of your sample are used for the demo).
size_diff = df.index.size - rsi.index.size
rsi.index = df.index[size_diff:]
pd.concat([df, rsi], axis=1)
Out[1490]:
Dates Prices Volumes RSI
0 2013-02-08 201.68 2893254 NaN
1 2013-02-11 200.16 2944651 NaN
2 2013-02-12 200.04 2461779 NaN
3 2013-02-13 200.09 2169757 NaN
4 2013-02-14 199.65 3294126 NaN
5 2013-02-15 200.98 3627887 47.7357
6 2013-02-19 200.32 2998317 53.3967
7 2013-02-20 199.31 3715311 56.3866
8 2013-02-21 198.33 3923051 60.1845
9 2013-02-22 201.09 3107876 62.1784
10 2013-02-25 197.51 3845276 63.9720
11 2013-02-26 199.14 3391562 62.9671
12 2013-02-27 202.33 4185545 63.9232
13 2013-02-28 200.83 4689698 66.8817
Try like this:
df["RSI"].shift(len(df)-len(df["RSI"].dropna()))
We can get the difference in row count between the Series and the dataframe.
Then prepend that many NaN values to the Series (on top) with np.repeat.
Finally, concatenate the padded Series to your original dataframe along axis=1 (columns):
import numpy as np

diff = df.shape[0] - RSI.shape[0]
rpts = np.repeat(np.nan, diff)
RSI = pd.concat([pd.Series(rpts, name='RSI'), RSI], ignore_index=True)
pd.concat([df, RSI.rename('RSI')], axis=1).head(20)
Dates Prices Volumes RSI
0 2013-02-08 201.68 2893254 NaN
1 2013-02-11 200.16 2944651 NaN
2 2013-02-12 200.04 2461779 NaN
3 2013-02-13 200.09 2169757 NaN
4 2013-02-14 199.65 3294126 NaN
5 2013-02-15 200.98 3627887 NaN
6 2013-02-19 200.32 2998317 NaN
7 2013-02-20 199.31 3715311 NaN
8 2013-02-21 198.33 3923051 NaN
9 2013-02-22 201.09 3107876 NaN
10 2013-02-25 197.51 3845276 NaN
11 2013-02-26 199.14 3391562 NaN
12 2013-02-27 202.33 4185545 NaN
13 2013-02-28 200.83 4689698 47.7357
14 2013-03-01 202.91 3308544 53.3967
15 2013-03-04 205.19 3693365 56.3866
16 2013-03-05 206.53 3807706 60.1845
17 2013-03-06 208.38 3594899 62.1784
18 2013-03-07 209.42 3884317 63.9720
19 2013-03-08 210.38 3700086 62.9671
My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe so that I end up with a sub-dataframe for each 'Date' index which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 are your original index: reset the index, set Date as the index, resample daily with ffill, then set the index back to ['ID1', 'ID2']:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, you should convert it to datetime before resampling on that field. You can use the below for that:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
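A hedged grouped sketch of the same resample idea (assuming ID1, ID2 and Date are regular columns), so the daily forward-fill stays inside each (ID1, ID2) pair instead of spanning the whole frame; note this fills missing days rather than building the expanding DateGroup structure asked about above:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
out = (df.set_index('Date')
         .groupby(['ID1', 'ID2'])
         .resample('D')
         .ffill()
         .drop(columns=['ID1', 'ID2'], errors='ignore')  # already present in the index
         .reset_index())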