pandas drop consecutive duplicates selectively - python

I have been looking at all the questions and answers about how to drop consecutive duplicates selectively in a pandas dataframe, but I still cannot figure out the following scenario:
import pandas as pd
import numpy as np

def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)
    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))

date = random_dates('2018-01-01', '2018-01-12', 20, 'H', seed=[3, 1415])
data = {'Timestamp': date,
        'Message': ['Message received.', 'Sending...', 'Sending...', 'Sending...',
                    'Work in progress...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...',
                    'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...', 'Sending...',
                    'Work in progress...', 'Work in progress...', 'Work in progress...',
                    'Message received.', 'Sending...', 'Sending...']}
df = pd.DataFrame(data, columns=['Timestamp', 'Message'])
I have the following dataframe:
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
5 2018-01-04 17:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
15 2018-01-08 15:00:00 Work in progress...
16 2018-01-09 00:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
I want to drop the consecutive duplicates in the df['Message'] column ONLY when 'Message' is 'Work in progress...', keeping the first instance (here, e.g., indices 5, 15 and 16 need to be dropped). Ideally I would like to get:
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
I have tried solutions offered in similar posts like:
df['Message'].loc[df['Message'].shift(-1) != df['Message']]
I also calculated the length of the Messages:
df['length'] = df['Message'].apply(lambda x: len(x))
and wrote a conditional drop as:
df.loc[(df['length'] ==17) | (df['length'] ==10) | ~df['Message'].duplicated(keep='first')]
It looks better, but indices 14, 15, and 16 are dropped altogether, so it is still misbehaving, see:
Timestamp Message length
0 2018-01-02 03:00:00 Message received. 17
1 2018-01-02 11:00:00 Sending... 10
2 2018-01-03 04:00:00 Sending... 10
3 2018-01-04 11:00:00 Sending... 10
4 2018-01-04 16:00:00 Work in progress... 19
6 2018-01-05 05:00:00 Message received. 17
7 2018-01-05 11:00:00 Sending... 10
8 2018-01-05 17:00:00 Sending... 10
10 2018-01-06 14:00:00 Message received. 17
11 2018-01-07 07:00:00 Sending... 10
12 2018-01-07 20:00:00 Sending... 10
13 2018-01-08 01:00:00 Sending... 10
17 2018-01-10 03:00:00 Message received. 17
18 2018-01-10 09:00:00 Sending... 10
19 2018-01-10 14:00:00 Sending... 10
Your time and help are appreciated!

First, compare each value with the previous one using Series.shift to flag the start of each consecutive run, then chain that mask with | to a second mask that keeps every row that is not 'Work in progress...':
df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print(df)
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...

You can first get all messages equal to 'Work in progress...', compare them with the previous element, and then filter:
condition = (df['Message'] == 'Work in progress...') & (df['Message']==df['Message'].shift(1))
df[~condition]
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
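Both answers build the same shift-based mask. As a more general sketch (hypothetical variable names, assuming only pandas), you can label consecutive runs with a cumulative sum and prune repeats for any chosen set of messages:

```python
import pandas as pd

df = pd.DataFrame({'Message': [
    'Message received.', 'Sending...', 'Sending...',
    'Work in progress...', 'Work in progress...', 'Work in progress...',
    'Message received.']})

# A new run starts whenever the value differs from the previous row,
# so the cumulative sum of those changes labels each consecutive run.
run_id = df['Message'].ne(df['Message'].shift()).cumsum()

# Keep the first row of each run, or any row whose message is not targeted.
target = {'Work in progress...'}
out = df[~run_id.duplicated() | ~df['Message'].isin(target)]

print(out['Message'].tolist())
```

Here the repeated 'Sending...' rows survive while only the first 'Work in progress...' of each run is kept.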

Related

How to extract the first and last value from a data sequence based on a column value?

I have a time series dataset that can be created with the following code.
idx = pd.date_range("2018-01-01", periods=100, freq="H")
ts = pd.Series(idx)
dft = pd.DataFrame(ts, columns=["date"])
dft["data"] = ""
dft["data"][0:5] = "a"
dft["data"][5:15] = "b"
dft["data"][15:20] = "c"
dft["data"][20:30] = "d"
dft["data"][30:40] = "a"
dft["data"][40:70] = "c"
dft["data"][70:85] = "b"
dft["data"][85:len(dft)] = "c"
In the data column, the unique values are a,b,c,d. These values are repeating in a sequence in different time windows. I want to capture the first and last value of that time window. How can I do that?
Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:
group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])
output:
min max
data
1 2018-01-01 00:00:00 2018-01-01 04:00:00
2 2018-01-01 05:00:00 2018-01-01 14:00:00
3 2018-01-01 15:00:00 2018-01-01 19:00:00
4 2018-01-01 20:00:00 2018-01-02 05:00:00
5 2018-01-02 06:00:00 2018-01-02 15:00:00
6 2018-01-02 16:00:00 2018-01-03 21:00:00
7 2018-01-03 22:00:00 2018-01-04 12:00:00
8 2018-01-04 13:00:00 2018-01-05 03:00:00
Edit: combining with the original data:
dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
output:
data date
first min max
data
1 a 2018-01-01 00:00:00 2018-01-01 04:00:00
2 b 2018-01-01 05:00:00 2018-01-01 14:00:00
3 c 2018-01-01 15:00:00 2018-01-01 19:00:00
4 d 2018-01-01 20:00:00 2018-01-02 05:00:00
5 a 2018-01-02 06:00:00 2018-01-02 15:00:00
6 c 2018-01-02 16:00:00 2018-01-03 21:00:00
7 b 2018-01-03 22:00:00 2018-01-04 12:00:00
8 c 2018-01-04 13:00:00 2018-01-05 03:00:00
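The shift/cumsum grouper used above is a general trick for labelling consecutive runs; a minimal illustration on a toy series:

```python
import pandas as pd

s = pd.Series(list('aabbbca'))

# True wherever the value changes; the running total numbers each run.
group = s.ne(s.shift()).cumsum()
print(group.tolist())  # [1, 1, 2, 2, 2, 3, 4]
```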

Create regular time series from irregular interval with python

I wonder if it is possible to convert an irregular time series to a regular one without interpolating values from another column, like this:
Index count
2018-01-05 00:00:00 1
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-11 00:00:00 2
2018-01-14 00:00:00 5
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
And I expect the result to be something like this:
Index count
2018-01-01 00:00:00 0
2018-01-02 00:00:00 0
2018-01-03 00:00:00 0
2018-01-04 00:00:00 0
2018-01-05 00:00:00 1
2018-01-06 00:00:00 0
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-09 00:00:00 0
2018-01-10 00:00:00 0
2018-01-11 00:00:00 2
2018-01-12 00:00:00 0
2018-01-13 00:00:00 0
2018-01-14 00:00:00 5
2018-01-15 00:00:00 0
2018-01-16 00:00:00 0
2018-01-17 00:00:00 0
2018-01-18 00:00:00 0
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-27 00:00:00 0
2018-12-28 00:00:00 0
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
2018-12-31 00:00:00 0
So far I have tried resample from pandas, but it only partially solved my problem.
Thanks in advance!
Use DataFrame.reindex with date_range:
#if necessary
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2018-01-01','2018-12-31'), fill_value=0)
print (df)
count
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 1
...
2018-12-27 0
2018-12-28 0
2018-12-29 7
2018-12-30 8
2018-12-31 0
[365 rows x 1 columns]
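The same idea on a tiny self-contained frame (hypothetical data): reindex extends the index to the full range and fills the new labels with 0:

```python
import pandas as pd

df = pd.DataFrame({'count': [1, 4]},
                  index=pd.to_datetime(['2018-01-05', '2018-01-07']))

# Reindex to a daily grid covering dates outside the original index.
full = df.reindex(pd.date_range('2018-01-01', '2018-01-08'), fill_value=0)
print(full['count'].tolist())  # [0, 0, 0, 0, 1, 0, 4, 0]
```

reindex (rather than resample) is needed when the desired endpoints fall outside the data, as with '2018-01-01' above.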

How to extract rows between 2 times with Pandas?

I want to make sub dataframes out of one dataframe, using its datetime index. For example, if I want to extract rows between 07:00~06:00 and make new dataframes:
import pandas as pd
int_rows = 24
str_freq = '180min'
i = pd.date_range('2018-04-09', periods=int_rows, freq=str_freq)
df = pd.DataFrame({'A': [i for i in range(int_rows)]}, index=i)
>>> df
A
2018-04-09 00:00:00 0
2018-04-09 03:00:00 1
2018-04-09 06:00:00 2
2018-04-09 09:00:00 3
2018-04-09 12:00:00 4
2018-04-09 15:00:00 5
2018-04-09 18:00:00 6
2018-04-09 21:00:00 7
2018-04-10 00:00:00 8
2018-04-10 03:00:00 9
2018-04-10 06:00:00 10
2018-04-10 09:00:00 11
2018-04-10 12:00:00 12
2018-04-10 15:00:00 13
2018-04-10 18:00:00 14
2018-04-10 21:00:00 15
2018-04-11 00:00:00 16
2018-04-11 03:00:00 17
2018-04-11 06:00:00 18
2018-04-11 09:00:00 19
2018-04-11 12:00:00 20
2018-04-11 15:00:00 21
2018-04-11 18:00:00 22
2018-04-11 21:00:00 23
# new dataframes that I want
A
2018-04-09 00:00:00 0
2018-04-09 03:00:00 1
A
2018-04-09 06:00:00 2
2018-04-09 09:00:00 3
2018-04-09 12:00:00 4
2018-04-09 15:00:00 5
2018-04-09 18:00:00 6
2018-04-09 21:00:00 7
2018-04-10 00:00:00 8
2018-04-10 03:00:00 9
A
2018-04-10 06:00:00 10
2018-04-10 09:00:00 11
2018-04-10 12:00:00 12
2018-04-10 15:00:00 13
2018-04-10 18:00:00 14
2018-04-10 21:00:00 15
2018-04-11 00:00:00 16
2018-04-11 03:00:00 17
A
2018-04-11 06:00:00 18
2018-04-11 09:00:00 19
2018-04-11 12:00:00 20
2018-04-11 15:00:00 21
2018-04-11 18:00:00 22
2018-04-11 21:00:00 23
I found the between_time method, but it doesn't take dates into account. I could iterate over the original dataframe and check each date and time, but I think that would be inefficient. Are there any simple ways to do this?
You can 'shift' the timestamp by 6 hours and group by day:
for k, d in df.groupby((df.index - pd.to_timedelta('6:00:00')).normalize()):
    print(d)
    print()
Output:
A
2018-04-09 00:00:00 0
2018-04-09 03:00:00 1
A
2018-04-09 06:00:00 2
2018-04-09 09:00:00 3
2018-04-09 12:00:00 4
2018-04-09 15:00:00 5
2018-04-09 18:00:00 6
2018-04-09 21:00:00 7
2018-04-10 00:00:00 8
2018-04-10 03:00:00 9
A
2018-04-10 06:00:00 10
2018-04-10 09:00:00 11
2018-04-10 12:00:00 12
2018-04-10 15:00:00 13
2018-04-10 18:00:00 14
2018-04-10 21:00:00 15
2018-04-11 00:00:00 16
2018-04-11 03:00:00 17
A
2018-04-11 06:00:00 18
2018-04-11 09:00:00 19
2018-04-11 12:00:00 20
2018-04-11 15:00:00 21
2018-04-11 18:00:00 22
2018-04-11 21:00:00 23
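The grouping key can also be inspected directly. A small sketch (hypothetical data) showing how subtracting 6 hours maps each 06:00-to-06:00 window onto a single calendar day:

```python
import pandas as pd

idx = pd.date_range('2018-04-09', periods=8, freq='3h')
df = pd.DataFrame({'A': range(8)}, index=idx)

# Rows before 06:00 fall into the previous "logical day" after the shift.
key = (df.index - pd.Timedelta('6:00:00')).normalize()
print(df.groupby(key).size().tolist())  # [2, 6]
```

The first group holds the two rows before 06:00 (00:00 and 03:00); the remaining six rows land in the window starting at 06:00.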

Python Pandas - Get Attributes Associated With Consecutive datetime

I have a data frame that has a list of datetime by minutes (generally in hour increments), for example 2018-01-14 03:00, 2018-01-14 04:00, etc.
What I want to do is capture the number of consecutive records at the minute increment that I define (some could be 60, others 15, etc.). Then, I want to associate the first and last reading time in the block.
Take the following data for instance:
id reading_time type
1 1/6/2018 00:00 Interval
1 1/6/2018 01:00 Interval
1 1/6/2018 02:00 Interval
1 1/6/2018 03:00 Interval
1 1/6/2018 06:00 Interval
1 1/6/2018 07:00 Interval
1 1/6/2018 09:00 Interval
1 1/6/2018 10:00 Interval
1 1/6/2018 14:00 Interval
1 1/6/2018 15:00 Interval
I would like the output to look like the following:
id first_reading_time last_reading_time number_of_records type
1 1/6/2018 00:00 1/6/2018 03:00 4 Received
1 1/6/2018 04:00 1/6/2018 05:00 2 Missed
1 1/6/2018 06:00 1/6/2018 07:00 2 Received
1 1/6/2018 08:00 1/6/2018 08:00 1 Missed
1 1/6/2018 09:00 1/6/2018 10:00 2 Received
1 1/6/2018 11:00 1/6/2018 13:00 3 Missed
1 1/6/2018 14:00 1/6/2018 15:00 2 Received
Now, in this example there is only one day, and I can write the code for one day, but many of the rows extend across multiple days.
What I've been able to do is capture this aggregation up to the end of the first set of consecutive records, but not the next set, using this code:
df = pd.DataFrame(data=d)
df.reading_time = pd.to_datetime(df.reading_time)
d = pd.Timedelta(60, 'm')
df = df.sort_values('reading_time', ascending=True)
consecutive = df.reading_time.diff().fillna(0).abs().le(d)
df['consecutive'] = consecutive
idx_loc = df.index.get_loc(consecutive.idxmin())
first_reading_time = df['reading_time'][0]
last_reading_time = df['reading_time'][idx_loc - 1]
df.iloc[:idx_loc]
where the dataframe 'd' represents the more granular data up top. The line of code that sets the variable 'consecutive' tags each record as True or False based on the number of minutes' difference between the current row and the previous one. The variable idx_loc captures the number of rows that were consecutive, but it only captures the first set (in this case 1/6/2018 00:00 through 1/6/2018 03:00).
Any help is appreciated.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
yields
first last count type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 Received
4 2018-01-06 08:00:00 2018-01-06 08:00:00 1 Missed
5 2018-01-06 09:00:00 2018-01-06 10:00:00 2 Received
6 2018-01-06 11:00:00 2018-01-06 13:00:00 3 Missed
7 2018-01-06 14:00:00 2018-01-06 15:00:00 2 Received
You could use asfreq to expand the DataFrame to include missing rows:
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
# reading_time id type
# 0 2018-01-06 00:00:00 1.0 Interval
# 1 2018-01-06 01:00:00 1.0 Interval
# 2 2018-01-06 02:00:00 1.0 Interval
# 3 2018-01-06 03:00:00 1.0 Interval
# 4 2018-01-06 04:00:00 NaN NaN
# 5 2018-01-06 05:00:00 NaN NaN
# 6 2018-01-06 06:00:00 1.0 Interval
# 7 2018-01-06 07:00:00 1.0 Interval
# 8 2018-01-06 08:00:00 NaN NaN
# 9 2018-01-06 09:00:00 1.0 Interval
# 10 2018-01-06 10:00:00 1.0 Interval
# 11 2018-01-06 11:00:00 NaN NaN
# 12 2018-01-06 12:00:00 NaN NaN
# 13 2018-01-06 13:00:00 NaN NaN
# 14 2018-01-06 14:00:00 1.0 Interval
# 15 2018-01-06 15:00:00 1.0 Interval
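As a minimal illustration of what asfreq does here (hypothetical data): it reindexes the DatetimeIndex to a regular grid and inserts NaN rows at the new timestamps:

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(['2018-01-06 00:00',
                                            '2018-01-06 02:00']))

# Expand to an hourly grid; the missing 01:00 slot becomes NaN,
# which also upcasts the integer values to float.
expanded = s.asfreq('h')
print(len(expanded))  # 3
```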
Next, use the NaNs in, say, the id column to identify groups:
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
then group by the group values to find first and last reading_times for each group:
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
# first last count
# group
# 1 2018-01-06 00:00:00 2018-01-06 03:00:00 4
# 2 2018-01-06 04:00:00 2018-01-06 05:00:00 2
# 3 2018-01-06 06:00:00 2018-01-06 07:00:00 2
# 4 2018-01-06 08:00:00 2018-01-06 08:00:00 1
# 5 2018-01-06 09:00:00 2018-01-06 10:00:00 2
# 6 2018-01-06 11:00:00 2018-01-06 13:00:00 3
# 7 2018-01-06 14:00:00 2018-01-06 15:00:00 2
Since the Missed and Received values alternate, they can be generated from the index:
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
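Since result.index starts at 1, odd group numbers (the ones holding real readings) map to 'Received' and even ones to 'Missed'. A quick standalone check of the Categorical indexing trick:

```python
import pandas as pd

types = pd.Categorical(['Missed', 'Received'])

# index % 2 yields 1, 0, 1, 0, ... which selects alternating labels.
labels = types[pd.Index([1, 2, 3, 4]) % 2]
print(list(labels))  # ['Received', 'Missed', 'Received', 'Missed']
```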
To handle multiple frequencies on a per-id basis, you could use:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.sort_values(by='reading_time')
df = df.set_index('reading_time')
freqmap = {1:'1H', 2:'15T'}
df = df.groupby('id', group_keys=False).apply(
    lambda grp: grp.asfreq(freqmap[grp['id'][0]]))
df = df.reset_index(level='reading_time')
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
grouped = df.groupby('group')
result = grouped['reading_time'].agg(['first','last','count'])
result['id'] = grouped['id'].agg('first')
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
which yields
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1.0 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 NaN Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1.0 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 NaN Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2.0 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 NaN Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2.0 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 NaN Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2.0 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 NaN Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2.0 Received
It seems plausible that "Missed" rows should not be associated with any id, but to bring the result a little closer to the one you posted, you could ffill to forward-fill NaN id values:
result['id'] = result['id'].ffill()
changes the result to
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 1 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 1 Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 2 Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 2 Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 2 Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2 Received

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta

df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, then group on the time attribute and aggregate the mean.
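A self-contained sketch of the same floor-and-group idea (hypothetical data spanning two days):

```python
import pandas as pd

idx = pd.to_datetime(['2016-01-02 00:00:33',   # bucket 00:00
                      '2016-01-03 00:06:52',   # bucket 00:05, different day
                      '2016-01-02 00:04:10'])  # bucket 00:00
s = pd.Series([1.0, 3.0, 5.0], index=idx)

# Floor each timestamp to its 5-minute bucket, keep only the time of day,
# then average across days.
out = s.groupby(s.index.floor('5min').time).mean()
print(out.tolist())  # [3.0, 3.0]
```

The two readings in the 00:00 bucket (1.0 and 5.0) average to 3.0 regardless of which date they came from.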
