Split Time Series Data Into Time Intervals in one line (PythonicWay) - Hourly - python

I have minute-level data with a time column. I want to create a new column that holds just the hour, in datetime format, for example format='%Y-%m-%d %H:%M:%S'. I know that in R we can use something like:
value$hour<- cut(as.POSIXct(paste(value$time),
format="%Y-%m-%d %H:%M:%S"), breaks="hour")
When I do this, I get the following output (which is what I need):
time hour
2017-02-10 00:00:00 2017-02-10 00:00:00
2017-02-10 00:01:00 2017-02-10 00:00:00
2017-02-10 00:02:00 2017-02-10 00:00:00
2017-02-10 00:03:00 2017-02-10 00:00:00
....
2017-12-1 10:05:00 2017-12-01 10:00:00
2017-12-1 10:06:00 2017-12-01 10:00:00
I am also aware that many threads discuss dt.date, dt.hour, etc. I can do the following in Python:
value['date'] = value['time'].dt.date
value['hour'] = value['time'].dt.hour
Is there a way to do this in Python in one line, similar to the R code above?
Any thoughts would be appreciated. Thanks in advance!

You need dt.floor:
df['hour'] = df['time'].dt.floor('H')
print (df)
time hour
0 2017-02-10 00:00:00 2017-02-10 00:00:00
1 2017-02-10 00:01:00 2017-02-10 00:00:00
2 2017-02-10 00:02:00 2017-02-10 00:00:00
3 2017-02-10 00:03:00 2017-02-10 00:00:00
4 2017-12-01 10:05:00 2017-12-01 10:00:00
5 2017-12-01 10:06:00 2017-12-01 10:00:00
If the time column is not yet a datetime column, convert it with to_datetime first:
df['hour'] = pd.to_datetime(df['time']).dt.floor('H')
print (df)
time hour
0 2017-02-10 00:00:00 2017-02-10 00:00:00
1 2017-02-10 00:01:00 2017-02-10 00:00:00
2 2017-02-10 00:02:00 2017-02-10 00:00:00
3 2017-02-10 00:03:00 2017-02-10 00:00:00
4 2017-12-1 10:05:00 2017-12-01 10:00:00
5 2017-12-1 10:06:00 2017-12-01 10:00:00
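As a minimal end-to-end sketch of the answer above, the snippet below builds a small frame of minute-level timestamps and floors them to the hour. The sample rows and the variable name value are taken from the question for illustration only; the lowercase "h" alias is used because newer pandas releases deprecate "H".

```python
import pandas as pd

# Illustrative sample: minute-level timestamps stored as strings.
value = pd.DataFrame({"time": ["2017-02-10 00:00:00", "2017-02-10 00:01:00",
                               "2017-12-01 10:05:00", "2017-12-01 10:06:00"]})

# Parse the strings to datetime64, then truncate each timestamp
# to the start of its hour (the pandas counterpart of R's cut(..., breaks="hour")).
value["time"] = pd.to_datetime(value["time"], format="%Y-%m-%d %H:%M:%S")
value["hour"] = value["time"].dt.floor("h")

print(value)
```

The result keeps the datetime64 dtype, so the hour column can be used directly as a groupby key.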

Related

Assign first element of groupby to a column yields NaN

Why does this not work?
I get the right results if I just print them, but if I assign the same expression to a DataFrame column, I get NaT values:
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_Last, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but want to get the first elements:
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
What happens here?
If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output is aggregated by the column cumsum with the aggregation function first.
The index of the result therefore holds the unique cumsum values, so when you assign it to a new column it does not match the original index and every value becomes missing (NaT for datetimes).
The solution is GroupBy.transform, which repeats the aggregated values into a Series (column) of the same size as the original DataFrame. The index then matches the original, and the assignment works:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")
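The index mismatch can be reproduced on a small frame. The data below is made up for illustration (five rows, two groups, an hourly datetime index); only the column names cumsum and Date come from the question.

```python
import pandas as pd

# Made-up sample: a datetime index, a group key, and a Date column.
df = pd.DataFrame(
    {
        "cumsum": [1, 1, 2, 2, 2],
        "Date": pd.to_datetime([
            "2021-01-05 11:00", "2021-01-05 12:00",
            "2021-01-06 08:00", "2021-01-06 09:00", "2021-01-06 10:00",
        ]),
    },
    index=pd.date_range("2021-01-01", periods=5, freq="h"),
)

# Aggregation: one row per group, indexed by the group keys (1 and 2).
agg = df.groupby("cumsum")["Date"].first()

# Assignment aligns on the index. The datetime index shares no labels
# with the keys 1 and 2, so every value becomes NaT.
df["misaligned"] = agg

# transform broadcasts each group's first value back onto the original rows.
df["Date_First"] = df.groupby("cumsum")["Date"].transform("first")

print(df)
```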

How to reset a date time column to be in increments of one minute in python

I have a dataframe that has a datetime column called start time, set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00 and the second row is 00:02:00, that is, in one-minute intervals.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?
You could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
gives you
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
Supply a full date/time string to get a different starting date.
First we convert your State Time column to datetime type. Then we use pd.date_range and use the first time as starting point with a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
periods=len(df),
freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00
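If the column should start at midnight rather than at the existing value, the two answers can be combined. This is only a sketch: the three-row frame is invented, and the start date 2019-09-30 is an arbitrary assumption that is discarded once the time of day is extracted.

```python
import pandas as pd

# Invented sample with the column name used in the second answer.
df = pd.DataFrame({"ID": ["A001", "A002", "A003"],
                   "State Time": ["12:00:00"] * 3})

# Start at midnight of an arbitrary date and step one minute per row.
stamps = pd.date_range("2019-09-30 00:00:00", periods=len(df), freq="min")

# Keep only the time-of-day part; this yields datetime.time objects.
df["State Time"] = stamps.time

print(df)
```

Note that the resulting column has object dtype, so arithmetic on it requires converting back to datetime or timedelta values.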

Flagging list of datetimes within date ranges in pandas dataframe

I've looked around (eg.
Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items are all within a row (i.e. between a start and end time) in the dataframe. Is there an easy way to locate the rows which would contain the timeframe within which the alarm time would be? (sorry for poor wording there!)
eg.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Herein a flag would go against line (well, index) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
You were calling your columns start_date and end_Date, but in your for loop you use start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
# the strings are day/month/year, so state the format explicitly
df['Alarms'] = pd.to_datetime(df['Alarms'], format='%d/%m/%y %H:%M:%S')
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
# flag every run whose (start, end) window contains an alarm time
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag']=='Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
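For a large number of alarms, the Python loop can be replaced by a single broadcast comparison. The sketch below uses the question's example data; the intermediate names starts, ends, alarm_values and hit are introduced here for illustration. It builds a rows-by-alarms boolean matrix, so memory grows with the product of both sizes.

```python
import pandas as pd

# Alarm times from the question (day/month/year strings).
alarms = pd.to_datetime(
    ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"],
    format="%d/%m/%y %H:%M:%S",
)

# Runs with start and end times, as in the question.
n = 33
df = pd.DataFrame({
    "start_date": pd.date_range("2019-07-18", "2019-07-22", periods=n),
    "end_Date": pd.date_range("2019-07-18 03:00:00", "2019-07-22 03:00:00", periods=n),
})

# Reshape so that comparing broadcasts to a (rows, alarms) matrix.
starts = df["start_date"].to_numpy()[:, None]
ends = df["end_Date"].to_numpy()[:, None]
alarm_values = alarms.to_numpy()[None, :]

# A row is hit if any alarm lies strictly between its start and end.
hit = ((starts < alarm_values) & (ends > alarm_values)).any(axis=1)
df.loc[hit, "Flag"] = "Alarm"

print(df[df["Flag"] == "Alarm"])
```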

Convert datetime to the closest time point

I have a dataset as below.
dummy
datetime
2015-10-25 06:00:00 1
2015-04-05 20:00:00 1
2015-11-24 00:00:00 1
2015-08-18 08:00:00 1
2015-10-21 12:00:00 1
I want to change the datetime to the closest predefined time point, say 00:00:00 and 12:00:00:
dummy
datetime
2015-10-25 00:00:00 1
2015-04-05 12:00:00 1
2015-11-24 00:00:00 1
2015-08-18 00:00:00 1
2015-10-21 12:00:00 1
Here it is possible to use DatetimeIndex.floor:
df.index = df.index.floor('12H')
print (df)
dummy
datetime
2015-10-25 00:00:00 1
2015-04-05 12:00:00 1
2015-11-24 00:00:00 1
2015-08-18 00:00:00 1
2015-10-21 12:00:00 1
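Note that floor always moves back to the previous 12-hour boundary, which is what the desired output in the question shows (08:00 becomes 00:00). If the nearest boundary is wanted instead, DatetimeIndex.round is the counterpart. A small sketch on two of the question's timestamps, using the lowercase "12h" alias:

```python
import pandas as pd

# Two timestamps from the question's sample.
idx = pd.DatetimeIndex(["2015-04-05 20:00:00", "2015-08-18 08:00:00"], name="datetime")
df = pd.DataFrame({"dummy": [1, 1]}, index=idx)

floored = df.index.floor("12h")  # previous 12-hour boundary
rounded = df.index.round("12h")  # nearest 12-hour boundary

print(floored)
print(rounded)
```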

Calculate difference between 'times' rows in DataFrame Pandas

My DataFrame is in the Form:
TimeWeek TimeSat TimeHoli
0 6:40:00 8:00:00 8:00:00
1 6:45:00 8:05:00 8:05:00
2 6:50:00 8:09:00 8:10:00
3 6:55:00 8:11:00 8:14:00
4 6:58:00 8:13:00 8:17:00
5 7:40:00 8:15:00 8:21:00
I need to find the time difference between each row in TimeWeek , TimeSat and TimeHoli, the output must be
TimeWeekDiff TimeSatDiff TimeHoliDiff
00:05:00 00:05:00 00:05:00
00:05:00 00:04:00 00:05:00
00:05:00 00:02:00 00:04:00
00:03:00 00:02:00 00:03:00
00:02:00 00:02:00 00:04:00
I tried using df['TimeWeek'] - df['TimeWeek'].shift().fillna(0), but it throws an error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Probably because of the presence of ':' in the column. How do I resolve this?
It looks like the error is thrown because the data is in the form of a string instead of a timestamp. First convert them to timestamps:
df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
They will contain today's date by default, but this shouldn't matter once you difference the time (hopefully you don't have to worry about differencing 23:55 and 00:05 across dates).
Once converted, simply difference the DataFrame:
>>> df2 - df2.shift()
TimeWeek TimeSat TimeHoli
0 NaT NaT NaT
1 00:05:00 00:05:00 00:05:00
2 00:05:00 00:04:00 00:05:00
3 00:05:00 00:02:00 00:04:00
4 00:03:00 00:02:00 00:03:00
5 00:42:00 00:02:00 00:04:00
Depending on your needs, you can just take rows 1+ (ignoring the NaTs):
(df2 - df2.shift()).iloc[1:, :]
or you can fill the NaTs with a zero timedelta:
(df2 - df2.shift()).fillna(pd.Timedelta(0))
Forget everything I just said. Pandas has great timedelta parsing.
df["TimeWeek"] = pd.to_timedelta(df["TimeWeek"])
df['TimeWeek'] - df['TimeWeek'].shift().fillna(pd.to_timedelta("00:00:00"))
Another option is to parse the times with pd.to_datetime and subtract the shifted column:
>>> import pandas as pd
>>> df = pd.DataFrame({'TimeWeek': ['6:40:00', '6:45:00', '6:50:00', '6:55:00', '7:40:00']})
>>> df["TimeWeek_date"] = pd.to_datetime(df["TimeWeek"], format="%H:%M:%S")
>>> print(df)
TimeWeek TimeWeek_date
0 6:40:00 1900-01-01 06:40:00
1 6:45:00 1900-01-01 06:45:00
2 6:50:00 1900-01-01 06:50:00
3 6:55:00 1900-01-01 06:55:00
4 7:40:00 1900-01-01 07:40:00
>>> df['TimeWeekDiff'] = (df['TimeWeek_date'] - df['TimeWeek_date'].shift().fillna(pd.to_datetime("00:00:00", format="%H:%M:%S")))
>>> print(df)
TimeWeek TimeWeek_date TimeWeekDiff
0 6:40:00 1900-01-01 06:40:00 06:40:00
1 6:45:00 1900-01-01 06:45:00 00:05:00
2 6:50:00 1900-01-01 06:50:00 00:05:00
3 6:55:00 1900-01-01 06:55:00 00:05:00
4 7:40:00 1900-01-01 07:40:00 00:45:00
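The timedelta approach extends to all three columns at once. The sketch below rebuilds the question's table, converts every column with pd.to_timedelta, and uses DataFrame.diff in place of the manual shift and subtract; the names durations and diffs are introduced here for illustration.

```python
import pandas as pd

# The table from the question, as strings.
df = pd.DataFrame({
    "TimeWeek": ["6:40:00", "6:45:00", "6:50:00", "6:55:00", "6:58:00", "7:40:00"],
    "TimeSat":  ["8:00:00", "8:05:00", "8:09:00", "8:11:00", "8:13:00", "8:15:00"],
    "TimeHoli": ["8:00:00", "8:05:00", "8:10:00", "8:14:00", "8:17:00", "8:21:00"],
})

# Convert each column of "H:MM:SS" strings to timedelta64.
durations = df.apply(pd.to_timedelta)

# Row-to-row differences; drop the first row, which has no predecessor.
diffs = durations.diff().iloc[1:].add_suffix("Diff")

print(diffs)
```

Because the values are durations rather than timestamps, no placeholder date is attached at any point.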
