I've looked around (e.g. "Python - Locating the closest timestamp") but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k+ rows of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
Each datetime in the list falls within a row of the dataframe (i.e. between a start and end time). Is there an easy way to locate the rows whose start/end window contains each alarm time?
e.g.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Here a flag would go against rows (well, indexes) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
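Putting the setup and the answer together, a runnable sketch (reconstructing the example data from the question) looks like this:

```python
import pandas as pd

# Rebuild the example runs: 33 rows of 3-hour windows
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})

alarms = list(pd.to_datetime(['2019-07-18 14:56:21',
                              '2019-07-19 15:05:15',
                              '2019-07-20 15:46:00']))

# Index by interval, label the rows the alarms fall into, restore the index
df = df.set_index(pd.IntervalIndex.from_arrays(df.start_date, df.end_Date))
df.loc[alarms, 'flag'] = 'alarm'
df = df.reset_index(drop=True)

print(df.index[df['flag'] == 'alarm'].tolist())  # [4, 13, 21]
```

Note that `IntervalIndex.from_arrays` builds right-closed intervals by default, so an alarm landing exactly on a boundary matches the earlier interval.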
You named your columns start_date and end_Date, but in your for loop you use start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag'] == 'Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
There is a dataframe with hourly data, e.g.:
DATE TIME Amount
2022-11-07 21:00:00 10
2022-11-07 22:00:00 11
2022-11-08 07:00:00 10
2022-11-08 08:00:00 13
2022-11-08 09:00:00 12
2022-11-08 10:00:00 11
2022-11-08 11:00:00 13
2022-11-08 12:00:00 12
2022-11-08 13:00:00 10
2022-11-08 14:00:00 9
...
I would like to add a new column sum_morning where I calculate the sum of "Amount" for the morning hours only (07:00 - 12:00):
DATE TIME Amount sum_morning
2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 12 71
2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 9 NaN
...
There can be gaps in the dataframe (e.g. from 22:00 - 07:00), so shift is probably not working here.
I thought about
creating a new dataframe where I filter all time slices from 07:00 - 12:00 for all dates
do a group by and calculate the sum for each day
and then merge this back to the original df.
But maybe there is a more effective solution?
I really enjoy working with Python / pandas, but hourly data still makes my head spin.
First set a DatetimeIndex in order to use DataFrame.between_time, then groupby DATE and aggregate by sum. Finally, get the last value of datetimes per day, in order to match the index of the original DataFrame:
df.index = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
s = (df.between_time('7:00','12:00')
.reset_index()
.groupby('DATE')
.agg({'Amount':'sum', 'index':'last'})
.set_index('index')['Amount'])
df['sum_morning'] = s
print (df)
DATE TIME Amount sum_morning
2022-11-07 21:00:00 2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 2022-11-08 12:00:00 12 71.0
2022-11-08 13:00:00 2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 2022-11-08 14:00:00 9 NaN
Lastly, if you need to remove DatetimeIndex you can use:
df = df.reset_index(drop=True)
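For reference, the steps above run end to end on the sample data from the question (a sketch; the column layout is reconstructed from the post):

```python
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2022-11-07'] * 2 + ['2022-11-08'] * 8,
    'TIME': ['21:00:00', '22:00:00', '07:00:00', '08:00:00', '09:00:00',
             '10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00'],
    'Amount': [10, 11, 10, 13, 12, 11, 13, 12, 10, 9],
})

# A DatetimeIndex enables between_time; after reset_index the unnamed
# index comes back as a column called 'index'
df.index = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
s = (df.between_time('7:00', '12:00')
       .reset_index()
       .groupby('DATE')
       .agg({'Amount': 'sum', 'index': 'last'})
       .set_index('index')['Amount'])
df['sum_morning'] = s
df = df.reset_index(drop=True)

print(df['sum_morning'].dropna().tolist())  # [71.0]
```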
You can use:
# get values between 7 and 12h
m = pd.to_timedelta(df['TIME']).between('7h', '12h')
# find last True per day
idx = m & m.groupby(df['DATE']).shift(-1).ne(True)
# assign the sum of the 7-12h values on the last True per day
df.loc[idx, 'sum_morning'] = df['Amount'].where(m).groupby(df['DATE']).transform('sum')
Output:
DATE TIME Amount sum_morning
0 2022-11-07 21:00:00 10 NaN
1 2022-11-07 22:00:00 11 NaN
2 2022-11-08 07:00:00 10 NaN
3 2022-11-08 08:00:00 13 NaN
4 2022-11-08 09:00:00 12 NaN
5 2022-11-08 10:00:00 11 NaN
6 2022-11-08 11:00:00 13 NaN
7 2022-11-08 12:00:00 12 71.0
8 2022-11-08 13:00:00 10 NaN
9 2022-11-08 14:00:00 9 NaN
I have a dataframe like this:
Value
2021-07-15 00:00:00 10
2021-07-15 06:00:00 10
2021-07-15 12:00:00 10
2021-07-15 18:00:00 10
2021-07-16 00:00:00 20
2021-07-16 06:00:00 10
2021-07-16 12:00:00 10
2021-07-16 18:00:00 20
I want to add a column that maps each time of day to a number:
00:00:00 1
06:00:00 2
12:00:00 3
18:00:00 4
Eventually, I want something like this
Value Number
2021-07-15 00:00:00 10 1
2021-07-15 06:00:00 10 2
2021-07-15 12:00:00 10 3
2021-07-15 18:00:00 10 4
2021-07-16 00:00:00 20 1
2021-07-16 06:00:00 10 2
2021-07-16 12:00:00 10 3
2021-07-16 18:00:00 20 4
and so on
I want the numbering column such that 00:00:00 always maps to 1, 06:00:00 to 2, 12:00:00 to 3, and 18:00:00 to 4. This way I will have a categorical column with only the values 1, 2, 3, 4.
Sorry, new here, so I don't have enough rep to comment. But @Keiku's solution is closer than you realise. If you replace .time with .hour, you get the hour of the day. Divide that by 6 to get 0-3 categories for 00:00 to 18:00. If you must have them in the range 1-4 specifically, simply add 1.
To borrow @Keiku's example code:
import pandas as pd
df = pd.DataFrame([
    '2021-07-15 00:00:00 0.48',
    '2021-07-15 06:00:00 80.00',
    '2021-07-15 12:00:00 6.10',
    '2021-07-15 18:00:00 1400.00',
    '2021-07-16 00:00:00 1400.00'
], columns=['value'])  # a list, not a set: DataFrame rejects unordered sets
df['date'] = pd.to_datetime(df['value'].str[:19])
df.sort_values(['date'], ascending=[True], inplace=True)
df['category'] = df['date'].dt.hour // 6  # + 1 if you want this to be 1-4
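A quick check of that mapping on the question's timestamps (floor division keeps an integer dtype; the extra +1 shifts 0-3 to the requested 1-4):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2021-07-15 00:00:00', '2021-07-15 06:00:00', '2021-07-15 12:00:00',
     '2021-07-15 18:00:00', '2021-07-16 00:00:00'])})

# hour // 6 maps hours 0/6/12/18 to 0/1/2/3; +1 shifts to 1-4
df['Number'] = df['date'].dt.hour // 6 + 1
print(df['Number'].tolist())  # [1, 2, 3, 4, 1]
```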
You can use pd.to_datetime to convert to datetime and .dt.time to extract the time. You can use pd.factorize for 1,2,3,4 categories.
import pandas as pd
df = pd.DataFrame([
    '2021-07-15 00:00:00 0.48',
    '2021-07-15 06:00:00 80.00',
    '2021-07-15 12:00:00 6.10',
    '2021-07-15 18:00:00 1400.00',
    '2021-07-16 00:00:00 1400.00'
], columns=['value'])  # a list, not a set: DataFrame rejects unordered sets
df
# value
# 0 2021-07-15 00:00:00 0.48
# 1 2021-07-15 06:00:00 80.00
# 2 2021-07-15 12:00:00 6.10
# 3 2021-07-15 18:00:00 1400.00
# 4 2021-07-16 00:00:00 1400.00
df['date'] = pd.to_datetime(df['value'].str[:19])
df.sort_values(['date'], ascending=[True], inplace=True)
df['time'] = df['date'].dt.time
df['index'], _ = pd.factorize(df['time'])
df['index'] += 1
df
# value date time index
# 0 2021-07-15 00:00:00 0.48 2021-07-15 00:00:00 00:00:00 1
# 1 2021-07-15 06:00:00 80.00 2021-07-15 06:00:00 06:00:00 2
# 2 2021-07-15 12:00:00 6.10 2021-07-15 12:00:00 12:00:00 3
# 3 2021-07-15 18:00:00 1400.00 2021-07-15 18:00:00 18:00:00 4
# 4 2021-07-16 00:00:00 1400.00 2021-07-16 00:00:00 00:00:00 1
I have a dataframe with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with a code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how could I achieve the same kind of slicing on the groups?
I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
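An alternative (a sketch on reconstructed sample data) is groupby().ngroup(), which numbers the groups 0, 1, 2, ... so a simple comparison keeps the first three users:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [0, 0, 3, 3, 8, 8, 15, 23, 24],
    'date': pd.to_datetime(['2019-04-13 02:00', '2019-04-13 03:00',
                            '2019-02-18 22:00', '2019-02-18 23:00',
                            '2019-04-05 04:00', '2019-04-05 05:00',
                            '2019-04-28 19:00', '2019-06-24 02:00',
                            '2019-03-27 12:00']),
})

# ngroup() labels each row with its group number, so "< 3" keeps every
# row belonging to the first three users
first_three = df[df.groupby('user_id').ngroup() < 3]
print(first_three['user_id'].unique())  # [0 3 8]
```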
I have two data sets: one with weekly datetimes and the other with hourly datetimes.
My data sets look like this:
df1
Week_date w_values
21-04-2019 20:00:00 10
28-04-2019 20:00:00 20
05-05-2019 20:00:00 30
df2
hour_date h_values
19-04-2019 08:00:00 a
21-04-2019 07:00:00 b
21-04-2019 20:00:00 c
22-04-2019 06:00:00 d
23-04-2019 05:00:00 e
28-04-2019 19:00:00 f
28-04-2019 20:00:00 g
28-04-2019 21:00:00 h
29-04-2019 20:00:00 i
05-05-2019 20:00:00 j
06-05-2019 23:00:00 k
I tried merging but failed to get the desired output.
The output data set should look like this:
week_date w_values hour_date h_values
21-04-2019 20:00:00 10 21-04-2019 20:00:00 c
21-04-2019 20:00:00 10 22-04-2019 06:00:00 d
21-04-2019 20:00:00 10 23-04-2019 05:00:00 e
21-04-2019 20:00:00 10 28-04-2019 19:00:00 f
28-04-2019 20:00:00 20 28-04-2019 20:00:00 g
28-04-2019 20:00:00 20 28-04-2019 21:00:00 h
28-04-2019 20:00:00 20 29-04-2019 20:00:00 i
05-05-2019 20:00:00 30 05-05-2019 20:00:00 j
05-05-2019 20:00:00 30 06-05-2019 23:00:00 k
The weekly date should change only when the week date equals the hour date; otherwise a row takes the previous week date.
Use the 'merge_asof' function. From pandas documentation "This merge is similar to a left-join except that we match on nearest key rather than equal keys."
df_week['Week_date'] = pd.to_datetime(df_week['Week_date'])
df_hour['hour_date'] = pd.to_datetime(df_hour['hour_date'])
df_week_sort = df_week.sort_values(by='Week_date')
df_hour_sort = df_hour.sort_values(by='hour_date')
df_week_sort.rename(columns={'Week_date': 'Merge_date'}, inplace=True)
df_hour_sort.rename(columns={'hour_date': 'Merge_date'}, inplace=True)
df_merged = pd.merge_asof(df_hour_sort, df_week_sort, on='Merge_date')
Make sure that the two frames are sorted by the date stamp.
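A self-contained sketch of the merge_asof approach on the question's sample data (note dayfirst=True, since the sample dates are DD-MM-YYYY):

```python
import pandas as pd

df1 = pd.DataFrame({'Week_date': ['21-04-2019 20:00:00', '28-04-2019 20:00:00',
                                  '05-05-2019 20:00:00'],
                    'w_values': [10, 20, 30]})
df2 = pd.DataFrame({'hour_date': ['19-04-2019 08:00:00', '21-04-2019 07:00:00',
                                  '21-04-2019 20:00:00', '22-04-2019 06:00:00',
                                  '23-04-2019 05:00:00', '28-04-2019 19:00:00',
                                  '28-04-2019 20:00:00', '28-04-2019 21:00:00',
                                  '29-04-2019 20:00:00', '05-05-2019 20:00:00',
                                  '06-05-2019 23:00:00'],
                    'h_values': list('abcdefghijk')})

# dayfirst=True: the sample dates are DD-MM-YYYY
df1['Week_date'] = pd.to_datetime(df1['Week_date'], dayfirst=True)
df2['hour_date'] = pd.to_datetime(df2['hour_date'], dayfirst=True)

# Match each hourly row to the most recent weekly row at or before it,
# then drop the hourly rows that precede the first week
out = pd.merge_asof(df2.sort_values('hour_date'), df1.sort_values('Week_date'),
                    left_on='hour_date', right_on='Week_date')
out = out.dropna(subset=['Week_date'])
print(out['h_values'].tolist())  # ['c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
```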
The following should do (provided Week_date and hour_date are datetimes):
(df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
.ffill()
.dropna())
The way it works
Make sure both dfs are sorted
>>> df1 = df1.sort_values('Week_date')
>>> df2 = df2.sort_values('hour_date')
Do the merge
>>> df3 = df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d NaT NaN
4 2019-04-23 05:00:00 e NaT NaN
5 2019-04-28 19:00:00 f NaT NaN
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h NaT NaN
8 2019-04-29 20:00:00 i NaT NaN
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k NaT NaN
Forward fill the gaps
>>> df3 = df3.ffill()
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k 2019-05-05 20:00:00 30.0
Remove the remaining NaNs
>>> df3 = df3.dropna()
>>> df3
hour_date h_values Week_date w_values
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k 2019-05-05 20:00:00 30.0
I have got the following data:
data
timestamp
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
and would like to sort it descending by time, add a start and end date on top and bottom of the data, so that it looks like this:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-01 10:00:00 9
2012-06-01 13:00:00 9
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-02 00:00:00 NaN
and finally I would like to extend the dataset to cover all hours from start to end in one hour steps, filling the dataframe with missing timestamps containing 'None'/'NaN' as data.
So far I have the following code:
df2 = pd.DataFrame({'data':temperature, 'timestamp': pd.DatetimeIndex(timestamp)}, dtype=float)
df2.set_index('timestamp',inplace=True)
df3 = pd.DataFrame({ 'timestamp': pd.Series([ts1, ts2]), 'data': [None, None]})
df3.set_index('timestamp',inplace=True)
print(df3)
merged = pd.concat([df3, df2])  # DataFrame.append was removed in pandas 2.0
print(merged)
with the following print outs:
df3:
data
timestamp
2012-06-01 00:00:00 None
2012-06-02 00:00:00 None
merged:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-02 00:00:00 NaN
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
I have tried:
merged = merged.asfreq('H')
but this returned an unsatisfying result:
data
2012-06-01 00:00:00 NaN
2012-06-01 01:00:00 NaN
2012-06-01 02:00:00 NaN
2012-06-01 03:00:00 NaN
2012-06-01 04:00:00 NaN
2012-06-01 05:00:00 NaN
2012-06-01 06:00:00 NaN
2012-06-01 07:00:00 NaN
2012-06-01 08:00:00 NaN
2012-06-01 09:00:00 NaN
2012-06-01 10:00:00 9
Where is the rest of the dataframe? Why does it only contain data till the first valid value?
Help is much appreciated. Thanks a lot in advance!
First create an empty dataframe with the timestamp index that you want and then do a left merge with your original dataset:
df2 = pd.DataFrame(index = pd.date_range('2012-06-01','2012-06-02', freq='H'))
df3 = pd.merge(df2, df, left_index=True, right_index=True, how='left')
df3
Out[103]:
timestamp value
2012-06-01 00:00:00 NaN NaN
2012-06-01 01:00:00 NaN NaN
2012-06-01 02:00:00 NaN NaN
2012-06-01 03:00:00 NaN NaN
2012-06-01 04:00:00 NaN NaN
2012-06-01 05:00:00 NaN NaN
2012-06-01 06:00:00 NaN NaN
2012-06-01 07:00:00 NaN NaN
2012-06-01 08:00:00 NaN NaN
2012-06-01 09:00:00 NaN NaN
2012-06-01 10:00:00 2012-06-01 10:00:00 9
2012-06-01 11:00:00 NaN NaN
2012-06-01 12:00:00 NaN NaN
2012-06-01 13:00:00 2012-06-01 13:00:00 9
2012-06-01 14:00:00 NaN NaN
2012-06-01 15:00:00 NaN NaN
2012-06-01 16:00:00 NaN NaN
2012-06-01 17:00:00 2012-06-01 17:00:00 9
2012-06-01 18:00:00 NaN NaN
2012-06-01 19:00:00 NaN NaN
2012-06-01 20:00:00 2012-06-01 20:00:00 8
2012-06-01 21:00:00 NaN NaN
2012-06-01 22:00:00 NaN NaN
2012-06-01 23:00:00 NaN NaN
2012-06-02 00:00:00 NaN NaN
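As an aside, the truncation in the question happens because asfreq builds its hourly range from the first to the last index value, and the unsorted index ends at 10:00; merged.sort_index().asfreq('H') avoids it. An equivalent route (a sketch, assuming the data is indexed by timestamp) is to reindex against the full hourly range directly, with no sorting needed:

```python
import pandas as pd

df = pd.DataFrame({'data': [9, 8, 9, 9]},
                  index=pd.to_datetime(['2012-06-01 17:00:00',
                                        '2012-06-01 20:00:00',
                                        '2012-06-01 13:00:00',
                                        '2012-06-01 10:00:00']))

# Reindex against every hour in the target window; hours without data get NaN
full_range = pd.date_range('2012-06-01', '2012-06-02', freq='h')
df = df.reindex(full_range)
print(len(df), df['data'].notna().sum())  # 25 4
```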