I have a dataframe with datetime intervals that I am trying to squash into single events.
I have the start and end time, alongside the respective duration, for every event.
What I have
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
What I'm trying to squash them into
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
4 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
I looked at the grouping and merging options in pandas, but I didn't manage to do what I want.
Groupby.agg with Series.dt.date
new_df = (df.groupby(df['end_time'].dt.date, as_index=False)
            .agg({'start_time': 'first',
                  'end_time': 'last',
                  'duration': 'sum',
                  'id': 'first'}))
print(new_df)
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
1 2020-01-04 05:00:00 2020-01-04 21:00:00 150 E
Note that grouping by end date merges everything on the same calendar day, so the two separate sessions on 2020-01-04 (E-H and I) collapse into one row; it does not reproduce the desired three-row output.
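A gap-aware sketch (my addition, not part of the original answer) that does reproduce the desired three-row output: flag each row whose start time does not continue the previous row's end time, then group on the cumulative sum of that flag:
# True at the first row of each session (its start doesn't meet the previous end)
session = df['start_time'].ne(df['end_time'].shift()).cumsum()
new_df = (df.groupby(session)
            .agg({'start_time': 'first', 'end_time': 'last',
                  'duration': 'sum', 'id': 'first'})
            .reset_index(drop=True))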
You can use DataFrame.shift to compare end times with the shifted start times and set any identical pairs to null:
df['flag'] = df['start_time'].shift(-1)
df.loc[df['end_time'] == df['flag'], 'flag'] = pd.NaT
print(df)
start_time end_time duration id flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A NaT
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B NaT
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C NaT
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E NaT
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F NaT
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G NaT
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I NaT
Then use Series.bfill to backfill those nulls with the next start time that violates your interval condition. You'll need to fill the remaining null for the last row manually.
df['flag'] = df['flag'].bfill().fillna(df['end_time'].iloc[-2])
print(df)
start_time end_time duration id flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A 2020-01-04 05:00:00
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B 2020-01-04 05:00:00
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C 2020-01-04 05:00:00
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E 2020-01-04 20:30:00
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F 2020-01-04 20:30:00
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G 2020-01-04 20:30:00
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I 2020-01-04 07:00:00
Now do as ansev suggested:
df = (df.groupby('flag')
        .agg({'start_time': 'first', 'end_time': 'last',
              'duration': 'sum', 'id': 'first'})
        .reset_index(drop=True))
print(df)
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
1 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
2 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
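Note that groupby sorts by the flag values, which is why the E and I rows swap places above. To keep the original event order, pass sort=False (a small tweak of mine, not in the original answer):
df = (df.groupby('flag', sort=False)
        .agg({'start_time': 'first', 'end_time': 'last',
              'duration': 'sum', 'id': 'first'})
        .reset_index(drop=True))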
Related
I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range(start='1/1/2020 11:00:00', end='1/1/2020 23:00:00', freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals that are separated by a certain time span (e.g. 2h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it was a common problem. My current solution to get the start and end index of each interval is:
from datetime import timedelta

def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    # assumes the timestamps live in an 'event_timestamp' column
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions on how I could do this in a smarter way? I suspect this should be possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
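For clarity, the one-liner decomposes as follows (my annotated breakdown, not part of the original answer):
gaps = np.diff(df.index) > dt     # True where consecutive timestamps are more than 2h apart
split_at = np.where(gaps)[0] + 1  # positions at which a new part starts
parts = np.split(df, split_at)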
@Pierre thanks for your input. I've now arrived at a solution that is convenient for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
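The same grouping can also be written without the helper columns (a compact sketch of mine, using the same >= max_gap rule):
from datetime import timedelta

max_gap = timedelta(hours=2)
# label each row by how many gaps of at least max_gap precede it
gap_id = df.index.to_series().diff().ge(max_gap).cumsum()
parts = [group for _, group in df.groupby(gap_id)]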
If we divide the time of day from 00:00:00 to 23:59:00 into 15-minute blocks, we get 96 blocks, which we can number from 0 to 95.
I want to add a "timeblock" column to the dataframe, numbering each row with the timeblock its timestamp sits in, as shown below.
tagdatetime tagvalue timeblock
2020-01-01 00:00:00 47.874423 0
2020-01-01 00:01:00 14.913561 0
2020-01-01 00:02:00 56.368034 0
2020-01-01 00:03:00 16.555687 0
2020-01-01 00:04:00 42.138176 0
... ... ...
2020-01-01 00:13:00 47.874423 0
2020-01-01 00:14:00 14.913561 0
2020-01-01 00:15:00 56.368034 1
2020-01-01 00:16:00 16.555687 1
2020-01-01 00:17:00 42.138176 1
... ... ...
2020-01-01 23:55:00 18.550685 95
2020-01-01 23:56:00 51.219147 95
2020-01-01 23:57:00 15.098951 95
2020-01-01 23:58:00 37.863191 95
2020-01-01 23:59:00 51.380950 95
I think there's a better way to do it, but the approach below works.
import pandas as pd
import numpy as np
tindex = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='min')
tvalue = np.random.randint(1,50, (1440,))
df = pd.DataFrame({'tagdatetime':tindex, 'tagvalue':tvalue})
min15 = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='15min')
tblock = np.arange(96)
df2 = pd.DataFrame({'min15':min15, 'timeblock':tblock})
df3 = pd.merge(df, df2, left_on='tagdatetime', right_on='min15', how='outer')
df3.ffill(axis=0, inplace=True)
df3 = df3.drop('min15', axis=1)
df3.iloc[10:20,]
tagdatetime tagvalue timeblock
10 2020-01-01 00:10:00 20 0.0
11 2020-01-01 00:11:00 25 0.0
12 2020-01-01 00:12:00 42 0.0
13 2020-01-01 00:13:00 45 0.0
14 2020-01-01 00:14:00 11 0.0
15 2020-01-01 00:15:00 15 1.0
16 2020-01-01 00:16:00 38 1.0
17 2020-01-01 00:17:00 23 1.0
18 2020-01-01 00:18:00 5 1.0
19 2020-01-01 00:19:00 32 1.0
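A simpler alternative (my suggestion, not from the original answer): the block number can be computed directly from the timestamp, which avoids the merge entirely and keeps timeblock an integer rather than the float shown above:
df['timeblock'] = df['tagdatetime'].dt.hour * 4 + df['tagdatetime'].dt.minute // 15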
I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with a datetime, e.g. a Timestamp:
df['start'] = df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, in pandas 0.24+, use the fill_value parameter:
df['start'] = df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regular, always 15 minutes apart, you can instead subtract an offsets.DateOffset:
df['start'] = df['end'] - pd.offsets.DateOffset(minutes=15)
print(df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
How about this?
df = pd.DataFrame({'end': pd.date_range(start=pd.Timestamp(2019, 1, 1, 0, 15),
                                        end=pd.Timestamp(2019, 1, 2), freq='15min')})
df['start'] = df['end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00
I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k+ rows of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items all fall within a row (i.e. between a start and end time) of the dataframe. Is there an easy way to locate the rows whose start and end times contain an alarm time? (sorry for poor wording there!)
e.g.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Here a flag would go against lines (well, indexes) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
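If you'd rather not swap the index in and out, IntervalIndex.get_indexer can map each alarm directly to the position of the interval containing it (a sketch of mine, not part of the original answer; -1 marks alarms that fall in no interval):
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
pos = intervals.get_indexer(pd.Index(alarms))  # position of the containing interval, or -1
df.loc[df.index[pos[pos >= 0]], 'flag'] = 'alarm'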
You named your columns start_date and end_Date, but in your loop you use start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag'] == 'Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
I have two data sets, one with weekly datetimes and the other with hourly datetimes.
My data sets look like this:
df1
Week_date w_values
21-04-2019 20:00:00 10
28-04-2019 20:00:00 20
05-05-2019 20:00:00 30
df2
hour_date h_values
19-04-2019 08:00:00 a
21-04-2019 07:00:00 b
21-04-2019 20:00:00 c
22-04-2019 06:00:00 d
23-04-2019 05:00:00 e
28-04-2019 19:00:00 f
28-04-2019 20:00:00 g
28-04-2019 21:00:00 h
29-04-2019 20:00:00 i
05-05-2019 20:00:00 j
06-05-2019 23:00:00 k
I tried merging but failed to get the desired output.
The output data set should look like this:
week_date w_values hour_date h_values
21-04-2019 20:00:00 10 21-04-2019 20:00:00 c
21-04-2019 20:00:00 10 22-04-2019 06:00:00 d
21-04-2019 20:00:00 10 23-04-2019 05:00:00 e
21-04-2019 20:00:00 10 28-04-2019 19:00:00 f
28-04-2019 20:00:00 20 28-04-2019 20:00:00 g
28-04-2019 20:00:00 20 28-04-2019 21:00:00 h
28-04-2019 20:00:00 20 29-04-2019 20:00:00 i
05-05-2019 20:00:00 30 05-05-2019 20:00:00 j
05-05-2019 20:00:00 30 06-05-2019 23:00:00 k
The weekly date should change only when the week date equals the hour date; otherwise each row keeps the previous week date.
Use the merge_asof function. From the pandas documentation: "This merge is similar to a left-join except that we match on nearest key rather than equal keys."
# dayfirst=True because the dates are in dd-mm-yyyy format
df_week['Week_date'] = pd.to_datetime(df_week['Week_date'], dayfirst=True)
df_hour['hour_date'] = pd.to_datetime(df_hour['hour_date'], dayfirst=True)
df_week_sort = df_week.sort_values(by='Week_date')
df_hour_sort = df_hour.sort_values(by='hour_date')
df_week_sort.rename(columns={'Week_date': 'Merge_date'}, inplace=True)
df_hour_sort.rename(columns={'hour_date': 'Merge_date'}, inplace=True)
df_merged = pd.merge_asof(df_hour_sort, df_week_sort, on='Merge_date')
Make sure that both frames are sorted by the date stamp.
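For reference, merge_asof defaults to direction='backward', which is exactly the "take the previous week date" rule. A variant of the above (my sketch, not part of the original answer) skips the renaming by using left_on/right_on; rows before the first weekly timestamp come back as NaT and can be dropped:
df_merged = pd.merge_asof(
    df_hour_sort, df_week_sort,
    left_on='hour_date', right_on='Week_date',
    direction='backward')  # match the most recent weekly timestamp (the default)
df_merged = df_merged.dropna(subset=['Week_date'])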
The following should do (provided Week_date and hour_date are datetimes):
(df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
.ffill()
.dropna())
The way it works
Make sure both dfs are sorted
>>> df1 = df1.sort_values('Week_date')
>>> df2 = df2.sort_values('hour_date')
Do the merge
>>> df3 = df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d NaT NaN
4 2019-04-23 05:00:00 e NaT NaN
5 2019-04-28 19:00:00 f NaT NaN
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h NaT NaN
8 2019-04-29 20:00:00 i NaT NaN
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-05-06 23:00:00 k NaT NaN
Forward fill the gaps
>>> df3 = df3.ffill()
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-05-06 23:00:00 k 2019-05-05 20:00:00 30.0
Remove the remaining NaNs
>>> df3 = df3.dropna()
>>> df3
hour_date h_values Week_date w_values
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-05-06 23:00:00 k 2019-05-05 20:00:00 30.0
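One caution about the ffill step (my note, not from the original answer): .ffill() fills every column, so any unrelated NaNs in df2 would be filled as well. Restricting the fill to the merged weekly columns avoids that:
df3 = df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
df3[['Week_date', 'w_values']] = df3[['Week_date', 'w_values']].ffill()
df3 = df3.dropna(subset=['Week_date'])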