How can I generate a time series dataset in pandas that spans a min and max date at a specific interval?
min_date = 18 oct 2022
Max_date = 20 Oct 2022
interval = 1 hour
Min_date Max_date
18/10/2022 00:00:00 18/10/2022 01:00:00
18/10/2022 01:00:00 18/10/2022 02:00:00
18/10/2022 02:00:00 18/10/2022 03:00:00
18/10/2022 03:00:00 18/10/2022 04:00:00
...
19/10/2022 22:00:00 19/10/2022 23:00:00
19/10/2022 23:00:00 19/10/2022 23:59:00
Thanks in advance
import pandas as pd
min_date = pd.Timestamp('oct 18, 2022')
max_date = pd.Timestamp('oct 20, 2022')
interval = pd.offsets.Hour(+1)
df = pd.DataFrame(pd.date_range(min_date, max_date - interval, freq = interval), columns = ['Min_date'])
df['Max_date'] = df['Min_date'] + interval
print(df)
Output:
Min_date Max_date
0 2022-10-18 00:00:00 2022-10-18 01:00:00
1 2022-10-18 01:00:00 2022-10-18 02:00:00
2 2022-10-18 02:00:00 2022-10-18 03:00:00
.
.
.
45 2022-10-19 21:00:00 2022-10-19 22:00:00
46 2022-10-19 22:00:00 2022-10-19 23:00:00
47 2022-10-19 23:00:00 2022-10-20 00:00:00
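An alternative sketch, not taken from the answer above: build the hourly bin edges once and pair each edge with the next one. The names `edges` and `formatted` are only illustrative, and `pd.Timedelta(hours=1)` is used as the frequency here simply to avoid depending on a particular frequency alias.

```python
import pandas as pd

min_date = pd.Timestamp('2022-10-18')
max_date = pd.Timestamp('2022-10-20')

# All hourly bin edges from min_date to max_date, both ends included
edges = pd.date_range(min_date, max_date, freq=pd.Timedelta(hours=1))

# Each row pairs an edge with the one that follows it
df = pd.DataFrame({'Min_date': edges[:-1], 'Max_date': edges[1:]})

# Optional: render the dd/mm/yyyy layout used in the question
formatted = df.apply(lambda col: col.dt.strftime('%d/%m/%Y %H:%M:%S'))
print(formatted.head())
```

Note that the last interval here ends at 20/10/2022 00:00:00 rather than 19/10/2022 23:59:00, which matches the output of the answer above.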
I understand how to create a regular-frequency index in pandas (Python 3), like this:
import pandas as pd
import datetime
idx = pd.date_range('2017-01-01' ,'2017-06-16', freq='D')
ts = pd.Series(range(len(idx)), index=idx)
ts
How would I do this for irregularly sampled data at 9, 12, and 18 o'clock each day?
You can try:
idx = pd.date_range('2017-01-01' ,'2017-06-16', freq='H')
idx = idx[idx.hour.isin([9,12,18])]
ts = pd.Series(range(len(idx)), index=idx)
Output:
2017-01-01 09:00:00 0
2017-01-01 12:00:00 1
2017-01-01 18:00:00 2
2017-01-02 09:00:00 3
2017-01-02 12:00:00 4
...
2017-06-14 12:00:00 493
2017-06-14 18:00:00 494
2017-06-15 09:00:00 495
2017-06-15 12:00:00 496
2017-06-15 18:00:00 497
Length: 498, dtype: int64
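Another way to get the same index, shown here only as a sketch: start from a daily range and add the three hour offsets to every day, instead of generating every hour and filtering. The names `days`, `hours` and `stamps` are made up for illustration, and the end date is written as 2017-06-15 because the hourly range above stops at midnight on 2017-06-16.

```python
import pandas as pd

days = pd.date_range('2017-01-01', '2017-06-15', freq='D')
hours = pd.to_timedelta([9, 12, 18], unit='h')

# Broadcast: every day (rows) plus every hour offset (columns),
# then flatten row by row so the result stays in chronological order
stamps = (days.values[:, None] + hours.values).ravel()
idx = pd.DatetimeIndex(stamps)

ts = pd.Series(range(len(idx)), index=idx)
print(ts.head())
```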
I have a dataframe with a datetime column called start time, which is set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00, the second row is 00:02:00, and so on, that is, at a one-minute interval.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?
You could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
This gives you:
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
Supply a full date/time string to get a different starting date.
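For instance, a minimal sketch with an explicit start date. The three-row frame and the `Start Clock` column are made up for illustration; a time-only string such as '00:00' appears to be resolved against the current date, which is presumably why the output above shows 2019-09-30.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A001', 'A002', 'A003']})

# An explicit start date replaces the default of the current date
stamps = pd.date_range('2022-01-01 00:00', periods=len(df), freq='min')
df['Start Time'] = stamps

# Keep only the time of day if the date part is not wanted
df['Start Clock'] = stamps.time
print(df)
```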
First we convert your State Time column to datetime type. Then we use pd.date_range with the earliest time as the starting point and a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
                                 periods=len(df),
                                 freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00
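If the column should hold only an offset counted from midnight (0:00:00, 0:01:00, ...) with no date attached, one alternative to the answers above is a timedelta column. This is only a sketch, and the four-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A001', 'A002', 'A003', 'A004']})

# Elapsed time since midnight, one minute per row, with no date part
df['Start Time'] = pd.to_timedelta(np.arange(len(df)), unit='min')
print(df)
```

A timedelta column keeps arithmetic simple (it can be added to any date later), whereas `.time` produces plain `datetime.time` objects.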
I have an Excel data set that looks like this:
24 25 26 27
1 0,3818 0,0713 0,07222 0,3542
2 0,17802 0,04508 0,06877 0,17319
3 0,22356 0,07314 0,04991 0,22448
4 0,1771 0,07038 0,07406 0,19136
5 0,19389 0,06164 0,05497 0,18538
6 0,20401 0,07475 0,06417 0,21413
7 0,18354 0,07245 0,07337 0,17756
8 0,46184 0,04669 0,0506 0,28819
9 0,43838 0,0667 0,06785 0,4692
10 0,78292 0,07038 0,07291 0,66424
11 1,81792 0,06003 0,04508 1,17001
12 2,40833 0,05451 0,07245 1,08422
13 1,55746 0,07038 0,07314 0,61272
14 1,2075 0,06509 0,04485 0,40871
15 2,4196 0,05014 0,07291 0,27393
16 0,95979 0,07015 0,07291 0,2323
17 0,51681 0,06992 0,04554 0,2024
18 0,46529 0,04232 0,85192 0,35558
19 0,58328 0,06992 1,59321 0,60283
20 1,40185 0,07015 0,82869 1,23326
21 0,71484 0,04692 1,05041 1,01131
22 0,48576 0,07291 0,80707 1,4697
23 0,04278 0,07245 0,57523 1,72316
24 0,07291 0,04554 0,5175 0,61364
The first column represents the hour of the day, and the first row the day of the year for 2013 (24 corresponds to the 24th of January). The day numbers continue through the whole year, ending on day number 365.
What I want to obtain is a dataframe whose first column is the date (year-month-day hour) and whose second column is the corresponding hourly value:
'date' 'value'
2013-01-24 01:00 0.3818
2013-01-24 02:00 0.17802
2013-01-24 03:00 0.22356
...
Thank you for your help.
This is the best I've got. If your data is in a pandas.DataFrame called df, you can do:
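import pandas as pd

# Stack the day columns into one long Series indexed by (day, hour)
df2 = df.unstack()
start = pd.Timestamp('01/01/2013')
df2 = df2.reset_index()
# level_0 holds the day of the year, level_1 the row label (the hour)
df2['date'] = [start + pd.DateOffset(days = int(x)-1) for x in df2.level_0.values]
df2['date'] += pd.to_timedelta(df2.level_1, unit='h')
df2.index = df2.date
df2 = df2[0]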
Result
date
2013-01-24 00:00:00 0,3818
2013-01-24 01:00:00 0,17802
2013-01-24 02:00:00 0,22356
2013-01-24 03:00:00 0,1771
2013-01-24 04:00:00 0,19389
2013-01-24 05:00:00 0,20401
2013-01-24 06:00:00 0,18354
2013-01-24 07:00:00 0,46184
2013-01-24 08:00:00 0,43838
2013-01-24 09:00:00 0,78292
2013-01-24 10:00:00 1,81792
2013-01-24 11:00:00 2,40833
2013-01-24 12:00:00 1,55746
2013-01-24 13:00:00 1,2075
2013-01-24 14:00:00 2,4196
2013-01-24 15:00:00 0,95979
2013-01-24 16:00:00 0,51681
2013-01-24 17:00:00 0,46529
2013-01-24 18:00:00 0,58328
2013-01-24 19:00:00 1,40185
2013-01-24 20:00:00 0,71484
2013-01-24 21:00:00 0,48576
2013-01-24 22:00:00 0,04278
2013-01-24 23:00:00 0,07291
2013-01-25 00:00:00 0,0713
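A self-contained sketch of the same idea on a tiny made-up slice of the table. It assumes the values arrive as strings with decimal commas and that the row labels are the hours 1 to 24; the names `wide`, `tidy` and `result` are illustrative. Under these assumptions hour 1 maps to 01:00, and hour 24 would roll over to 00:00 of the following day.

```python
import pandas as pd

# Columns are days of the year, rows are hours (only 3 hours shown here)
wide = pd.DataFrame(
    {24: ['0,3818', '0,17802', '0,22356'],
     25: ['0,0713', '0,04508', '0,07314']},
    index=[1, 2, 3],
)

# One row per (day, hour) pair
tidy = wide.unstack().reset_index()
tidy.columns = ['day', 'hour', 'value']

# Day 1 is 2013-01-01, so shift by (day - 1) days, then add the hour
start = pd.Timestamp('2013-01-01')
tidy['date'] = (start
                + pd.to_timedelta(tidy['day'] - 1, unit='D')
                + pd.to_timedelta(tidy['hour'], unit='h'))

# Turn decimal commas into floats
tidy['value'] = tidy['value'].str.replace(',', '.', regex=False).astype(float)

result = tidy.set_index('date')['value']
print(result)
```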
I have a DataFrame with data similar to the following
import datetime
from datetime import timedelta

import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            # Note: newer pandas versions replace include_end=False with inclusive='left'
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, group on the time attribute, and aggregate with the mean.
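To make the grouping easier to check by hand, here is a small sketch with five made-up readings spread over two days. The timestamps and values are invented for illustration, and `profile` is just an illustrative name.

```python
import datetime

import pandas as pd

idx = pd.to_datetime([
    '2016-01-02 00:00:33',
    '2016-01-02 00:04:52',
    '2016-01-02 00:05:10',
    '2016-01-03 00:01:00',
    '2016-01-03 00:07:30',
])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0, 6.0, 5.0]}, index=idx)

# Floor each timestamp to its 5-minute bin, drop the date, then average per bin
profile = df.groupby(df.index.floor('5min').time).mean()
print(profile)
```

Readings from both days land in the same bin: the 00:00 bin averages 1.0, 2.0 and 6.0, and the 00:05 bin averages 3.0 and 5.0.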