I would like to resample df by creating monthly data for all columns and filling in missing values with 0, within the time frame of say 2019-01-01 to 2019-12-31.
df:
ITEM_ID Date Value YearMonth
0 101002 2019-03-31 1.0 2019-03
1 101002 2019-04-30 1.0 2019-04
2 101002 2019-10-31 0.0 2019-10
3 101002 2019-11-30 8.0 2019-11
4 101002 2019-12-31 5.0 2019-12
Expected output:
ITEM_ID Date Value YearMonth
... 0 2019-01 (added)
... 0 2019-02 (added)
0 101002 2019-03-31 1.0 2019-03
1 101002 2019-04-30 1.0 2019-04
... 0 2019-05 (added)
... 0 2019-06 (added)
... 0 2019-07 (added)
... 0 2019-08 (added)
... 0 2019-09 (added)
2 101002 2019-10-31 0.0 2019-10
3 101002 2019-11-30 8.0 2019-11
4 101002 2019-12-31 5.0 2019-12
I came across a few methods like multiindex and resample. multiindex seems to be versatile but gets a bit complicated when it involves different levels of indexes; I am not sure if resample allows me to extend the effect to specified time frame. What is the best way to do it?
I think you need DataFrame.reindex:
df['YearMonth'] = pd.to_datetime(df['YearMonth'])
r = pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1MS'))
mux = pd.MultiIndex.from_product([df['ITEM_ID'].unique(), r], names=['ITEM_ID','YearMonth'])
df = df.set_index(['ITEM_ID','YearMonth']).reindex(mux).fillna({'Value':0}).reset_index().reindex(df.columns, axis=1)
print (df)
ITEM_ID Date Value YearMonth
0 101002 NaN 0.0 2019-01-01
1 101002 NaN 0.0 2019-02-01
2 101002 2019-03-31 1.0 2019-03-01
3 101002 2019-04-30 1.0 2019-04-01
4 101002 NaN 0.0 2019-05-01
5 101002 NaN 0.0 2019-06-01
6 101002 NaN 0.0 2019-07-01
7 101002 NaN 0.0 2019-08-01
8 101002 NaN 0.0 2019-09-01
9 101002 2019-10-31 0.0 2019-10-01
10 101002 2019-11-30 8.0 2019-11-01
11 101002 2019-12-31 5.0 2019-12-01
12 101002 NaN 0.0 2020-01-01
Here is the solution
import pandas as pd
df1= # this is the dataframe which you have given example. please change accordingly.
print(df1)
data=[['2019-01'],['2019-02'],['2019-03'],['2019-04'],['2019-05'],['2019-06'],['2019-07'],['2019-08'],
['2019-09'],['2019-10'],['2019-11'],['2019-12']]
df2=pd.DataFrame(data=data,columns=['YearMonth'])
print(df2)
final_DF = pd.merge(df1,df2,on ='YearMonth',how ='outer').sort_values('YearMonth')
final_DF = final_DF.fillna(0)
print(final_DF)
Instead of thinking in terms of year and month columns, we created an empty data frame with a start and end date and time and combined it with the original data frame.
df['Date'] = pd.to_datetime(df['Date'])
df1 = pd.DataFrame(index=pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1M'))).reset_index()
df1 = df1.merge(df, left_on='index', right_on='Date', how='outer')
df1['yearmonth'] = df1['index'].apply(lambda x: str(x.year) + '-' + '{:02}'.format(x.month))
df1
index ITEM_ID Date Value YearMonth yearmonth
0 2019-01-31 NaN NaT NaN NaN 2019-01
1 2019-02-28 NaN NaT NaN NaN 2019-02
2 2019-03-31 101002.0 2019-03-31 1.0 2019-03 2019-03
3 2019-04-30 101002.0 2019-04-30 1.0 2019-04 2019-04
4 2019-05-31 NaN NaT NaN NaN 2019-05
5 2019-06-30 NaN NaT NaN NaN 2019-06
6 2019-07-31 NaN NaT NaN NaN 2019-07
7 2019-08-31 NaN NaT NaN NaN 2019-08
8 2019-09-30 NaN NaT NaN NaN 2019-09
9 2019-10-31 101002.0 2019-10-31 0.0 2019-10 2019-10
10 2019-11-30 101002.0 2019-11-30 8.0 2019-11 2019-11
11 2019-12-31 101002.0 2019-12-31 5.0 2019-12 2019-12
Related
i have a dataframe named zz
zz columns name ['Ancolmekar','Cidurian','Dayeuhkolot','Hantap','Kertasari','Meteolembang','Sapan']
for col in zz.columns:
df = pd.DataFrame(zz[col],index=pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df.resample('1M').mean()
error : invalid syntax
i want to know the mean value by month in 10 minutes data interval. when i run this just sapan values appear with NaN. before, i have replace the NaN data 1 else 0.
Sapan
2017-01-31 NaN
2017-02-28 NaN
2017-03-31 NaN
2017-04-30 NaN
2017-05-31 NaN
2017-06-30 NaN
2017-07-31 NaN
2017-08-31 NaN
2017-09-30 NaN
2017-10-31 NaN
2017-11-30 NaN
2017-12-31 NaN
2018-01-31 NaN
2018-02-28 NaN
2018-03-31 NaN
2018-04-30 NaN
2018-05-31 NaN
2018-06-30 NaN
2018-07-31 NaN
2018-08-31 NaN
2018-09-30 NaN
2018-10-31 NaN
2018-11-30 NaN
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 NaN
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 NaN
2021-04-30 NaN
2021-05-31 NaN
2021-06-30 NaN
2021-07-31 NaN
2021-08-31 NaN
2021-09-30 NaN
2021-10-31 NaN
2021-11-30 NaN
2021-12-31 NaN
what should i do? thanks before
You are re-assigninig variable df to a dataframe with a single column during each pass through the for loop. The last column is sapan. Hence, only this column is shown.
Additionally, you are setting the index on df that probably isn't the index in zz, therefore you get Not A Number NaN for non-existing values.
If the index in zz is corresponding to the one you are setting, this should work:
df = zz.copy()
df['new_column'] = pd.Series(pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df = df.set_index('new_column')
df.resample('1M').mean()
How can I create empty rows from 7 days before 2016-01-01 going to January 2015? I tried reindexing
df
date value
0 2016-01-01 4.0
1 2016-01-08 5.0
2 2016-01-15 1.0
Expected Output
date value
2015-01-02 NaN
....
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
First create DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
And then use DataFrame.reindex with date_range by your minimal value and minimal index value with Index.union for avoid lost original index values:
rng = pd.date_range('2015-01-02', df.index.min(), freq='7d').union(df.index)
df = df.reindex(rng)
print (df)
value
2015-01-02 NaN
2015-01-09 NaN
2015-01-16 NaN
2015-01-23 NaN
2015-01-30 NaN
2015-02-06 NaN
2015-02-13 NaN
2015-02-20 NaN
2015-02-27 NaN
2015-03-06 NaN
2015-03-13 NaN
2015-03-20 NaN
2015-03-27 NaN
2015-04-03 NaN
2015-04-10 NaN
2015-04-17 NaN
2015-04-24 NaN
2015-05-01 NaN
2015-05-08 NaN
2015-05-15 NaN
2015-05-22 NaN
2015-05-29 NaN
2015-06-05 NaN
2015-06-12 NaN
2015-06-19 NaN
2015-06-26 NaN
2015-07-03 NaN
2015-07-10 NaN
2015-07-17 NaN
2015-07-24 NaN
2015-07-31 NaN
2015-08-07 NaN
2015-08-14 NaN
2015-08-21 NaN
2015-08-28 NaN
2015-09-04 NaN
2015-09-11 NaN
2015-09-18 NaN
2015-09-25 NaN
2015-10-02 NaN
2015-10-09 NaN
2015-10-16 NaN
2015-10-23 NaN
2015-10-30 NaN
2015-11-06 NaN
2015-11-13 NaN
2015-11-20 NaN
2015-11-27 NaN
2015-12-04 NaN
2015-12-11 NaN
2015-12-18 NaN
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
I have this DataFrame.
timestamp Val1
2020-04-02 06:44:00 NaN
2020-04-03 16:52:00 NaN
2020-04-03 16:53:00 NaN
2020-04-03 16:54:00 NaN
2020-04-03 16:55:00 NaN
2020-04-17 02:03:00 NaN
2020-04-17 02:04:00 NaN
2020-04-17 02:05:00 NaN
2020-04-17 02:06:00 NaN
And I trying to separate in groups using the sequence of minutes. For example, I can't group rows with more then 1 min with difference.
So the output will be like this:
#Group 1
timestamp Val1
2020-04-02 06:44:00 NaN
#Group 2
timestamp Val1
2020-04-03 16:52:00 NaN
2020-04-03 16:53:00 NaN
2020-04-03 16:54:00 NaN
2020-04-03 16:55:00 NaN
#Group 3
timestamp Val1
2020-04-17 02:03:00 NaN
2020-04-17 02:04:00 NaN
2020-04-17 02:05:00 NaN
2020-04-17 02:06:00 NaN
Now, I just can get the min and max data with all the data. But no like what I want to try.
Take the difference between consecutive rows and check whether it is above your desired difference ('1min'). Taking the cumsum of this Boolean Series creates the grouping label. I've assigned it to a column here for illustration.
#df['timestamp'] = pd.to_datetime(df['timestamp'])
df['group'] = df['timestamp'].diff().gt('1min').cumsum()
timestamp Val1 group
0 2020-04-02 06:44:00 NaN 0
1 2020-04-03 16:52:00 NaN 1
2 2020-04-03 16:53:00 NaN 1
3 2020-04-03 16:54:00 NaN 1
4 2020-04-03 16:55:00 NaN 1
5 2020-04-17 02:03:00 NaN 2
6 2020-04-17 02:04:00 NaN 2
7 2020-04-17 02:05:00 NaN 2
8 2020-04-17 02:06:00 NaN 2
I have a dataframe like the following
LIT__0001 LIT__002 AAA__0001 AAA__0002 XYZ
2019-10-31 13:40:00-04:00 NaN 0.014786 10 55 1
2019-10-31 13:45:00-04:00 NaN 0.012143 33 11 2
2019-10-31 13:50:00-04:00 NaN NaN NaN NaN 3
2019-10-31 13:55:00-04:00 NaN 0.020000 14 13 4
2019-10-31 14:00:00-04:00 0.010000 NaN 14 NaN 5
I need to convert it to a dataframe like the following
LIT AAA XYZ
2019-10-31 13:40:00-04:00 0.014786 10 1
2019-10-31 13:45:00-04:00 0.012143 11 2
2019-10-31 13:50:00-04:00 NaN NaN 3
2019-10-31 13:55:00-04:00 0.020000 13 4
2019-10-31 14:00:00-04:00 0.010000 14 5
That is, for every column having common first characters before '__', take the minimun for each row.
My dataframe is really huge so I would apreciate the faster solution.
Use GroupBy.min by columns with axis=1 and lambda function for split:
df = df.groupby(lambda x: x.split('__')[0], axis=1, sort=False).min()
Or use str.split:
df = df.groupby(df.columns.str.split('__').str[0], axis=1, sort=False).min()
print (df)
LIT AAA XYZ
2019-10-31 13:40:00-04:00 0.014786 10.0 1.0
2019-10-31 13:45:00-04:00 0.012143 11.0 2.0
2019-10-31 13:50:00-04:00 NaN NaN 3.0
2019-10-31 13:55:00-04:00 0.020000 13.0 4.0
2019-10-31 14:00:00-04:00 0.010000 14.0 5.0
I have a dataframe like so:
df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
'X':[-5,-4,-2,5,6,10,11],
'Y':[3,4,5,9,20,22,23]})
As you can see, the time is formed by hours (string format) and are across midnight. The time is given every 5 seconds!
My goal is however to add empty rows (filled with Nan for examples) so that the time is every second. Finally the column time should be converted as a time stamp and set as index.
Could you please suggest a smart and elegant way to achieve my goal?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.
Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If need replace NaNs:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution with resample, but is possible some rows are missing in the end:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
EDIT1:
idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution - change next day to 1 day timedeltas:
df['time'] = pd.to_timedelta(df['time'])
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0