How to extract rows between 2 times with Pandas?

I want to make sub-dataframes out of one dataframe, using its datetime index. For example, I want to split it at 06:00 each day, so that each new dataframe holds the rows from 06:00 of one day through 03:00 of the next (as shown below):
import pandas as pd
int_rows = 24
str_freq = '180min'
i = pd.date_range('2018-04-09', periods=int_rows, freq=str_freq)
df = pd.DataFrame({'A': list(range(int_rows))}, index=i)
>>> df
A
2018-04-09 00:00:00 0
2018-04-09 03:00:00 1
2018-04-09 06:00:00 2
2018-04-09 09:00:00 3
2018-04-09 12:00:00 4
2018-04-09 15:00:00 5
2018-04-09 18:00:00 6
2018-04-09 21:00:00 7
2018-04-10 00:00:00 8
2018-04-10 03:00:00 9
2018-04-10 06:00:00 10
2018-04-10 09:00:00 11
2018-04-10 12:00:00 12
2018-04-10 15:00:00 13
2018-04-10 18:00:00 14
2018-04-10 21:00:00 15
2018-04-11 00:00:00 16
2018-04-11 03:00:00 17
2018-04-11 06:00:00 18
2018-04-11 09:00:00 19
2018-04-11 12:00:00 20
2018-04-11 15:00:00 21
2018-04-11 18:00:00 22
2018-04-11 21:00:00 23
# new dataframes that I want
A
2018-04-09 00:00:00 0
2018-04-09 03:00:00 1
A
2018-04-09 06:00:00 2
2018-04-09 09:00:00 3
2018-04-09 12:00:00 4
2018-04-09 15:00:00 5
2018-04-09 18:00:00 6
2018-04-09 21:00:00 7
2018-04-10 00:00:00 8
2018-04-10 03:00:00 9
A
2018-04-10 06:00:00 10
2018-04-10 09:00:00 11
2018-04-10 12:00:00 12
2018-04-10 15:00:00 13
2018-04-10 18:00:00 14
2018-04-10 21:00:00 15
2018-04-11 00:00:00 16
2018-04-11 03:00:00 17
A
2018-04-11 06:00:00 18
2018-04-11 09:00:00 19
2018-04-11 12:00:00 20
2018-04-11 15:00:00 21
2018-04-11 18:00:00 22
2018-04-11 21:00:00 23
I found the between_time method, but it doesn't care about dates. I could iterate over the original dataframe and check each date and time, but I think that would be inefficient. Is there a simple way to do this?

You can 'shift' the timestamp by 6 hours and group by day:
for k, d in df.groupby((df.index - pd.to_timedelta('6:00:00')).normalize()):
    print(d); print()
Output:
A
2018-04-09 00:00:00 0
2018-04-09 03:00:00 1
A
2018-04-09 06:00:00 2
2018-04-09 09:00:00 3
2018-04-09 12:00:00 4
2018-04-09 15:00:00 5
2018-04-09 18:00:00 6
2018-04-09 21:00:00 7
2018-04-10 00:00:00 8
2018-04-10 03:00:00 9
A
2018-04-10 06:00:00 10
2018-04-10 09:00:00 11
2018-04-10 12:00:00 12
2018-04-10 15:00:00 13
2018-04-10 18:00:00 14
2018-04-10 21:00:00 15
2018-04-11 00:00:00 16
2018-04-11 03:00:00 17
A
2018-04-11 06:00:00 18
2018-04-11 09:00:00 19
2018-04-11 12:00:00 20
2018-04-11 15:00:00 21
2018-04-11 18:00:00 22
2018-04-11 21:00:00 23
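If you want the sub-dataframes as named objects rather than just printing them, the same groupby can feed a dict comprehension — a minimal sketch, where sub_frames is a name introduced here for illustration:
# key = the 'shifted' day that each 06:00-to-06:00 window belongs to
sub_frames = {k: d for k, d in df.groupby(
    (df.index - pd.to_timedelta('6:00:00')).normalize())}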


Replacing NaNs with Mean Value using Pandas

Say I have a DataFrame called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
71062 2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
71063 2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
71064 2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
71065 2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
71066 2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop=True).sort_values(by='StartTime')
StartTime EndDateTime TradeDate Values
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each of the NaN values in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
Let's group the dataframe by HH, then transform the Values with mean to broadcast the group means back to the original column's shape, then use fillna to fill the null values:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
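If you prefer a single statement, the same logic fits in one groupby — a sketch (transform with a lambda is typically a bit slower than the two-step version on large frames):
Data['Values'] = Data.groupby('HH')['Values'].transform(lambda s: s.fillna(s.mean()))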

pandas drop consecutive duplicates selectively

I have been looking at all the questions/answers about how to drop consecutive duplicates selectively in a pandas dataframe, but I still cannot figure out the following scenario:
import pandas as pd
import numpy as np
def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)
    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))
date = random_dates('2018-01-01', '2018-01-12', 20, 'H', seed=[3, 1415])
data = {'Timestamp': date,
        'Message': ['Message received.', 'Sending...', 'Sending...', 'Sending...', 'Work in progress...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...', 'Sending...', 'Work in progress...', 'Work in progress...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...']}
df = pd.DataFrame(data, columns=['Timestamp', 'Message'])
I have the following dataframe:
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
5 2018-01-04 17:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
15 2018-01-08 15:00:00 Work in progress...
16 2018-01-09 00:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
I want to drop the consecutive duplicates in df['Message'] column ONLY when 'Message' is 'Work in progress...' and keep the first instance (here e.g. Index 5, 15 and 16 need to be dropped), ideally I would like to get:
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
I have tried solutions offered in similar posts like:
df['Message'].loc[df['Message'].shift(-1) != df['Message']]
I also calculated the length of the Messages:
df['length'] = df['Message'].apply(lambda x: len(x))
and wrote a conditional drop as:
df.loc[(df['length'] ==17) | (df['length'] ==10) | ~df['Message'].duplicated(keep='first')]
It looks better, but Index 9, 14, 15 and 16 are all dropped, even though 9 and 14 are first instances that should be kept, so it is still ill-behaved; see:
Timestamp Message length
0 2018-01-02 03:00:00 Message received. 17
1 2018-01-02 11:00:00 Sending... 10
2 2018-01-03 04:00:00 Sending... 10
3 2018-01-04 11:00:00 Sending... 10
4 2018-01-04 16:00:00 Work in progress... 19
6 2018-01-05 05:00:00 Message received. 17
7 2018-01-05 11:00:00 Sending... 10
8 2018-01-05 17:00:00 Sending... 10
10 2018-01-06 14:00:00 Message received. 17
11 2018-01-07 07:00:00 Sending... 10
12 2018-01-07 20:00:00 Sending... 10
13 2018-01-08 01:00:00 Sending... 10
17 2018-01-10 03:00:00 Message received. 17
18 2018-01-10 09:00:00 Sending... 10
19 2018-01-10 14:00:00 Sending... 10
Your time and help is appreciated!
First compare each value with the previous one using Series.shift to flag consecutive repeats, then chain that mask with a condition that keeps every row that is not 'Work in progress...':
df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print(df)
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
You can first get all messages equal to 'Work in progress...', compare them with the previous element, and then filter:
condition = (df['Message'] == 'Work in progress...') & (df['Message']==df['Message'].shift(1))
df[~condition]
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
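If the rule ever has to cover several message types, the same shift-based mask generalizes — a sketch, where drop_for is a hypothetical list of messages whose consecutive repeats should be dropped:
drop_for = ['Work in progress...']  # extend with other messages as needed
mask = (df['Message'] == df['Message'].shift()) & df['Message'].isin(drop_for)
df = df[~mask]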

Flagging list of datetimes within date ranges in pandas dataframe

I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items all fall within a row (i.e. between a start and end time) of the dataframe. Is there an easy way to locate the rows whose start/end timeframe contains a given alarm time? (sorry for the poor wording there!)
eg.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Herein a flag would go against line (well, index) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
You were calling your columns start_date and end_Date, but in your for loop you use start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag']=='Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
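For completeness, the IntervalIndex from the first answer can also be queried without ever setting it as the frame's index; get_indexer returns the position of the interval containing each alarm (-1 when none does). A sketch, assuming non-overlapping intervals and the column names above:
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
pos = intervals.get_indexer(alarms)  # -1 where no interval contains the alarm
df.loc[df.index[pos[pos >= 0]], 'Flag'] = 'Alarm'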

merging two different date time column to form a sequence

I have two data sets: one with weekly datetimes and the other with hourly datetimes. My data sets look like this:
df1
Week_date w_values
21-04-2019 20:00:00 10
28-04-2019 20:00:00 20
05-05-2019 20:00:00 30
df2
hour_date h_values
19-04-2019 08:00:00 a
21-04-2019 07:00:00 b
21-04-2019 20:00:00 c
22-04-2019 06:00:00 d
23-04-2019 05:00:00 e
28-04-2019 19:00:00 f
28-04-2019 20:00:00 g
28-04-2019 21:00:00 h
29-04-2019 20:00:00 i
05-05-2019 20:00:00 j
06-05-2019 23:00:00 k
I tried merging but failed to get the desired output. The output data set should look like this:
week_date w_values hour_date h_values
21-04-2019 20:00:00 10 21-04-2019 20:00:00 c
21-04-2019 20:00:00 10 22-04-2019 06:00:00 d
21-04-2019 20:00:00 10 23-04-2019 05:00:00 e
21-04-2019 20:00:00 10 28-04-2019 19:00:00 f
28-04-2019 20:00:00 20 28-04-2019 20:00:00 g
28-04-2019 20:00:00 20 28-04-2019 21:00:00 h
28-04-2019 20:00:00 20 29-04-2019 20:00:00 i
05-05-2019 20:00:00 30 05-05-2019 20:00:00 j
05-05-2019 20:00:00 30 06-05-2019 23:00:00 k
The weekly date should change only when the hour date reaches the next week date; otherwise each hourly row keeps the previous week date.
Use the merge_asof function. From the pandas documentation: "This merge is similar to a left-join except that we match on nearest key rather than equal keys."
df_week['Week_date'] = pd.to_datetime(df_week['Week_date'])
df_hour['hour_date'] = pd.to_datetime(df_hour['hour_date'])
df_week_sort = df_week.sort_values(by='Week_date')
df_hour_sort = df_hour.sort_values(by='hour_date')
df_week_sort.rename(columns={'Week_date': 'Merge_date'}, inplace=True)
df_hour_sort.rename(columns={'hour_date': 'Merge_date'}, inplace=True)
df_merged = pd.merge_asof(df_hour_sort, df_week_sort, on='Merge_date')
Make sure that the two frames are sorted by the date stamp (merge_asof requires sorted keys).
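Applied directly to the question's df1/df2 (column names as in the question), a minimal sketch; note that merge_asof's default direction='backward' is exactly the "take the previous week date" rule:
out = pd.merge_asof(df2.sort_values('hour_date'),
                    df1.sort_values('Week_date'),
                    left_on='hour_date', right_on='Week_date')
out = out.dropna(subset=['w_values'])  # drop hours before the first week date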
The following should do (provided Week_date and hour_date are datetimes):
(df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
    .ffill()
    .dropna())
The way it works
Make sure both dfs are sorted
>>> df1 = df1.sort_values('Week_date')
>>> df2 = df2.sort_values('hour_date')
Do the merge
>>> df3 = df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d NaT NaN
4 2019-04-23 05:00:00 e NaT NaN
5 2019-04-28 19:00:00 f NaT NaN
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h NaT NaN
8 2019-04-29 20:00:00 i NaT NaN
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k NaT NaN
Forward fill the gaps
>>> df3 = df3.ffill()
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k 2019-05-05 20:00:00 30.0
Remove the remaining NaNs
>>> df3 = df3.dropna()
>>> df3
hour_date h_values Week_date w_values
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k 2019-05-05 20:00:00 30.0
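One caveat visible in this output: row 10 shows 2019-06-05 because pd.to_datetime parses the ambiguous string '06-05-2019' month-first by default, while '19-04-2019' can only be parsed day-first. If the source dates are day-first throughout, say so explicitly when parsing:
df2['hour_date'] = pd.to_datetime(df2['hour_date'], dayfirst=True)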

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5-minute period over the course of a 24-hour day, without considering which day the values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, then group on the time attribute and aggregate the mean.
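If you want the OP's exact result frame (proper timestamps on a dummy day rather than datetime.time keys), an equivalent trick is to collapse every timestamp onto one arbitrary day and resample — a sketch, using 1990-01-01 as the dummy day:
tmp = df.copy()
# keep only the time-of-day, re-anchored on the dummy date
tmp.index = tmp.index - tmp.index.normalize() + pd.Timestamp('1990-01-01')
result = tmp.resample('5min').mean()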
