Split up duration while upsampling dataframe - python

How do I split up a duration while upsampling a dataframe, as in the example below?
And can I replace the for loop with, e.g., the groupby function?
I want to use pandas to transform data like this:
activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
into this:
time started Bedtime videos Commute
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06
...
And I got this far:
import pandas as pd
df=pd.DataFrame({'activity name':['Bedtime','videos','Commute'],'time started':["2021-10-25 00:00:00","2021-10-25 08:25:42","2021-10-25 08:51:54"],'time ended':["2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34"]})
# converting strings to datetime
df['time ended']=pd.to_datetime(df['time ended'])
df['time started']=pd.to_datetime(df['time started'])
# calculating the duration
df['duration']=df['time ended']-df['time started']
# changing the index
df.index=df['time started']
df=df.drop(columns=['time started','time ended'])
for a in df['activity name'].unique():
    df[a]=(df['activity name']==a)*df['duration']
df=df.drop(columns=['activity name','duration'])
df.resample('H').first()
time started
2021-10-25 00:00:00 0 days 08:25:42 0 days 00:00:00 0 days
2021-10-25 01:00:00 NaT NaT NaT
2021-10-25 02:00:00 NaT NaT NaT
2021-10-25 03:00:00 NaT NaT NaT
2021-10-25 04:00:00 NaT NaT NaT
2021-10-25 05:00:00 NaT NaT NaT
2021-10-25 06:00:00 NaT NaT NaT
2021-10-25 07:00:00 NaT NaT NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12 0 days

Try this:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
    pd.to_datetime
)
df_e = df.assign(
    date=[
        pd.date_range(s, e, freq="s")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")
df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
)
print(df_out)
Output:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
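If you would rather see zero durations than NaT, as in the desired output in the question, the gaps can be filled afterwards. A minimal sketch, using a small stand-in for the `df_out` computed above:

```python
import pandas as pd

# Stand-in for the pivoted result above, where hours with no
# activity show up as NaT in the timedelta columns
df_out = pd.DataFrame(
    {"Bedtime": [pd.Timedelta("1h"), pd.NaT]},
    index=pd.to_datetime(["2021-10-25 00:00:00", "2021-10-25 09:00:00"]),
)

# Replace NaT with a zero-length Timedelta so every cell is a duration
df_out = df_out.fillna(pd.Timedelta(0))
print(df_out)
```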
Addressing @DerekO's comment:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Bedtime 2021-10-25 11:00:00 2021-10-25 13:04:31"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
    pd.to_datetime
)
df_e = df.assign(
    date=[
        pd.date_range(s, e, freq="s")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")
df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
    .sort_index()
)
print(df_out)
Output:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
2021-10-25 11:00:00 0 days 01:00:00 NaT NaT
2021-10-25 12:00:00 0 days 01:00:00 NaT NaT
2021-10-25 13:00:00 0 days 00:04:32 NaT NaT

Although I agree that using groupby and resample would be best, I couldn't make such a solution work. You can instead brute-force the problem by creating a new DataFrame for every row of your original DataFrame and concatenating them together.
The way it works is that we use pd.date_range to create a DatetimeIndex between the floor of the start and end times, with the start and end times themselves inserted into that DatetimeIndex as well. The differences between consecutive datetimes in this DatetimeIndex then become the values of your new DataFrame.
To try to make my solution as robust as possible, I added two additional rows to your original DataFrame with a repeated category, and tested situations where the starting time falls exactly on the hour versus ahead of the hour.
import pandas as pd

df = pd.DataFrame({
    'activity name':['Bedtime','videos','Commute','Work','Commute'],
    'time started':["2021-10-25 00:00:00","2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34","2021-10-25 17:00:00"],
    'time ended':["2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34","2021-10-25 17:00:00","2021-10-25 18:01:00"]})
# converting strings to datetime
df['time ended']=pd.to_datetime(df['time ended'])
df['time started']=pd.to_datetime(df['time started'])
## column names with spaces can't be accessed by name when using itertuples to iterate through the df
df.columns = [col.replace(" ","_") for col in df.columns]
Starting df:
>>> df
activity_name time_started time_ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4 Commute 2021-10-25 17:00:00 2021-10-25 18:01:00
## we use the start and end times to determine what date range we create
start_time = df['time_started'].min().floor('h')
end_time = df['time_ended'].max().ceil('h')
## set up an empty DataFrame to hold the final result
new_columns = list(df.activity_name.unique())
df_new = pd.DataFrame(columns=new_columns)
for row in df.itertuples(index=True):
    new_row = {}
    daterange_start = row.time_started.floor('1h')
    daterange_end = row.time_ended.floor('1h')
    datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')
    all_datetimes = datetimes_index.union([row.time_started, row.time_ended])
    ## take the difference and shift by -1 to drop the first NaT
    new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)
    ## if the first row starts in the middle of an hour, we don't want the difference between the beginning of the hour and the time in that row
    ## (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    if (row.Index == 0) & (row.time_started > daterange_start):
        df_new = pd.concat([df_new, pd.DataFrame(new_row)])[1:]
    else:
        df_new = pd.concat([df_new, pd.DataFrame(new_row)])
df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
9 2021-10-25 08:25:42 NaT NaT NaT NaT
10 2021-10-25 08:00:00 NaT 0 days 00:25:42 NaT NaT
11 2021-10-25 08:25:42 NaT 0 days 00:26:12 NaT NaT
12 2021-10-25 08:51:54 NaT NaT NaT NaT
13 2021-10-25 08:00:00 NaT NaT 0 days 00:51:54 NaT
14 2021-10-25 08:51:54 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
16 2021-10-25 09:29:34 NaT NaT NaT NaT
17 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:29:34
18 2021-10-25 09:29:34 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
26 2021-10-25 17:00:00 NaT NaT NaT NaT
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
29 2021-10-25 18:01:00 NaT NaT NaT NaT
For each activity we created a new DataFrame obtaining the differences between times with all_datetimes.to_series().diff().shift(-1) which means there is NaT between each change in an activity. These aren't useful, so we will drop any rows where the activities are all NaT.
We then drop duplicate timestamps in the time_started column and keep the first value of these duplicates, and take the floor of all timestamps in the time_started column:
df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].dt.floor('1h')
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
11 2021-10-25 08:00:00 NaT 0 days 00:26:12 NaT NaT
14 2021-10-25 08:00:00 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
18 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
Now we fill all NaT with pd.Timedelta("0s"), then we can groupby values in the time_started column and sum them together:
df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()
Final result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1 2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2 2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3 2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4 2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5 2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6 2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7 2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8 2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9 2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00

Related

plotting graph of day from a years data

So I have a dataset that has electricity load over 24 hours:
Time_of_Day = loadData.groupby(loadData.index.hour).mean()
Time_of_Day
Time Load
2019-01-01 01:00:00 38.045
2019-01-01 02:00:00 30.675
2019-01-01 03:00:00 22.570
2019-01-01 04:00:00 22.153
2019-01-01 05:00:00 21.085
... ...
2019-12-31 20:00:00 65.565
2019-12-31 21:00:00 53.513
2019-12-31 22:00:00 49.096
2019-12-31 23:00:00 44.409
2020-01-01 00:00:00 45.744
How do I plot a random day (24 hrs) from the 8760 hours?
With the following toy dataframe:
import pandas as pd
import random
df = pd.DataFrame({"Time": pd.date_range(start="1/1/2019", end="12/31/2019", freq="H")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(df.shape[0])]
Time Load
0 2019-01-01 00:00:00 53.36
1 2019-01-01 01:00:00 34.20
2 2019-01-01 02:00:00 64.19
3 2019-01-01 03:00:00 89.18
4 2019-01-01 04:00:00 27.82
... ... ...
8732 2019-12-30 20:00:00 38.26
8733 2019-12-30 21:00:00 49.66
8734 2019-12-30 22:00:00 64.15
8735 2019-12-30 23:00:00 23.97
8736 2019-12-31 00:00:00 3.72
[8737 rows x 2 columns]
Here is one way to do it using the choice function from the Python standard library's random module:
# In Jupyter cell
df[
    (df["Time"].dt.month == random.choice(df["Time"].dt.month))
    & (df["Time"].dt.day == random.choice(df["Time"].dt.day))
].plot(x="Time")
Output:
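Note that this filter draws the month and the day independently, so it can mix, say, a random month with a day number drawn from a different row. A sketch that samples one actual calendar date instead (using the same toy df as above):

```python
import random
import pandas as pd

# Toy data, as above
df = pd.DataFrame({"Time": pd.date_range(start="1/1/2019", end="12/31/2019", freq="H")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(df.shape[0])]

# Pick one calendar date that actually occurs in the data,
# then keep only that day's rows
day = random.choice(df["Time"].dt.date.unique())
one_day = df[df["Time"].dt.date == day]
one_day.plot(x="Time")
```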

Replacing NaNs with date and time format

I'm working with the following dataframes.
Date Light (umols) Time_difference
0 2018-01-12 07:16:52 2.5 NaT
1 2018-01-12 07:19:52 4.9 0 days 00:03:00
2 2018-01-12 07:22:52 4.9 0 days 00:03:00
3 2018-01-12 07:25:52 7.4 0 days 00:03:00
4 2018-01-12 07:28:50 9.9 0 days 00:02:58
... ... ... ...
6252 2018-12-18 17:54:24 12.2 0 days 00:03:00
6253 2018-12-18 17:57:24 7.6 0 days 00:03:00
6254 2018-12-18 18:00:24 4.9 0 days 00:03:00
6255 2018-12-18 18:03:24 2.5 0 days 00:03:00
6256 2018-12-18 18:06:24 0.2 0 days 00:03:00
Date Light (umols) Time_difference
0 2019-01-10 00:00:00 500.4 NaT
1 2019-01-10 00:00:01 451.2 0 days 00:00:01
2 2019-01-10 00:00:02 343.7 0 days 00:00:01
3 2019-01-10 00:00:03 354.5 0 days 00:00:01
4 2019-01-10 00:00:04 176.4 0 days 00:00:00
... ... ... ...
81264 2021-02-22 23:59:55 937.7 0 days 00:00:00
81265 2021-02-22 23:59:56 634.4 0 days 00:00:00
81266 2021-02-22 23:59:57 574.3 0 days 00:00:00
81267 2021-02-22 23:59:58 598.9 0 days 00:00:00
81268 2021-02-22 23:59:59 676.9 0 days 00:00:00
I want to calculate where there are gaps, how long they are, and how many there are. The idea is to have a consistent timeline of at most every 3 minutes in a day, and anything above that needs to be flagged up; the idea would be to merge the two dataframes together afterwards. There are some pesky NaTs in both their first rows, and I want to replace each one with something like '0 days 00:00:00'. I tried writing the following code with little success:
better = clean['Date'] == '2018-01-12 07:16:52'
clean.loc[better, 'Time_difference'] = clean.loc[clean, 'Time_difference'].replace('NaT', '0 days 00:00:00')
Any suggestions?
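One approach, as a minimal sketch on a toy frame mimicking the structure above: since NaT is a missing value rather than the string 'NaT', fillna is the tool here rather than replace:

```python
import pandas as pd

# Toy frame mimicking the structure above
clean = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-12 07:16:52", "2018-01-12 07:19:52"]),
    "Light (umols)": [2.5, 4.9],
})
clean["Time_difference"] = clean["Date"].diff()  # first row is NaT

# fillna targets the missing value directly, no string matching needed
clean["Time_difference"] = clean["Time_difference"].fillna(pd.Timedelta(0))
print(clean)
```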

Split time series in intervals of non-uniform length

I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range(start='1/1/2020 11:00:00', end='1/1/2020 23:00:00', freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it in intervals which are divided by a certain time span (e.g. 2h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it is a common problem. My current solution to get the start and end index of each interval is:
def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions how I could do this in a smarter way? I suspect this should be somehow possible using groupby.
Yes, you can use the very convenient np.split:
import numpy as np

dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
@Pierre, thanks for your input. I now have a solution that is convenient for me:
from datetime import timedelta

df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]

Convert datetime to the closest time point

I have a dataset as below.
dummy
datetime
2015-10-25 06:00:00 1
2015-04-05 20:00:00 1
2015-11-24 00:00:00 1
2015-08-18 08:00:00 1
2015-10-21 12:00:00 1
I want to change the datetime to the closest predefined time point, say 00:00:00 or 12:00:00:
dummy
datetime
2015-10-25 00:00:00 1
2015-04-05 12:00:00 1
2015-11-24 00:00:00 1
2015-08-18 00:00:00 1
2015-10-21 12:00:00 1
Here it is possible to use DatetimeIndex.floor:
df.index = df.index.floor('12H')
print (df)
dummy
datetime
2015-10-25 00:00:00 1
2015-04-05 12:00:00 1
2015-11-24 00:00:00 1
2015-08-18 00:00:00 1
2015-10-21 12:00:00 1
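Note that floor always snaps backwards, which matches the desired output above. If "closest" should instead mean true nearest-point rounding, DatetimeIndex.round is the alternative; a small sketch of the difference:

```python
import pandas as pd

idx = pd.DatetimeIndex(["2015-10-25 05:00:00", "2015-04-05 20:00:00"])

# floor always moves backwards to the previous 12-hour mark
print(idx.floor("12H"))   # 2015-10-25 00:00, 2015-04-05 12:00
# round snaps to the nearest 12-hour mark instead
print(idx.round("12H"))   # 2015-10-25 00:00, 2015-04-06 00:00
```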

Python Pandas flattening a calendar with overlapping meetings to get actual time in meetings

I have the details of my weekly calendar (obviously changed the subjects to protect the innocent) read into a pandas dataframe. One of my goals is to get the total time in meetings. I would like to have a dataframe indexed by date_range with hourly frequency for the week, showing how many total minutes I was in meetings during those hours. My first challenge is that meetings overlap, and as much as I would like to be in two places at once, I am surely not. I do hop out of one and into another, though. So, for example, rows at index 8 and 9 should be a total meeting time of 90 minutes and not 120 minutes, as would be the case if I just summed the column with df['Duration'].sum(). How do I flatten the time periods in the dataframe to count the overlap only once? It seems like there is an answer somewhere using date_range and periods, but I can't wrap my head around it. Below is my dataframe df.
Start End Duration Subject
0 07/04/16 10:30:00 07/04/16 11:00:00 30 Inspirational Poster Design Session
1 07/04/16 15:00:00 07/04/16 15:30:00 30 Corporate Speak Do's and Don'ts
2 07/04/16 09:00:00 07/04/16 12:00:00 180 Metrics or Matrix -Panel Discussion
3 07/04/16 13:30:00 07/04/16 15:00:00 90 "Do More with Less" kickoff party
4 07/05/16 09:00:00 07/05/16 10:00:00 60 Fiscal or Physical -Panel Discussion
5 07/05/16 14:00:00 07/05/16 14:30:00 30 "Why we can't have nice thing" training video
6 07/06/16 15:00:00 07/06/16 16:00:00 60 One-on-One with manager -Panel Discussion
7 07/06/16 09:00:00 07/06/16 10:00:00 60 Fireing for Performance leadership session
8 07/06/16 13:00:00 07/06/16 14:00:00 60 Birthday Cake in the conference room *MANDATORY*
9 07/06/16 12:30:00 07/06/16 13:30:00 60 Obligatory lunchtime meeting because it was the only time everyone had avaiable
Any help would be greatly appreciated.
EDIT:
This is the output I would be hoping for with the above data set.
2016-07-04 00:00:00 0
2016-07-04 01:00:00 0
2016-07-04 02:00:00 0
2016-07-04 03:00:00 0
2016-07-04 04:00:00 0
2016-07-04 05:00:00 0
2016-07-04 06:00:00 0
2016-07-04 07:00:00 0
2016-07-04 08:00:00 0
2016-07-04 09:00:00 60
2016-07-04 10:00:00 60
2016-07-04 11:00:00 60
2016-07-04 12:00:00 0
2016-07-04 13:00:00 30
2016-07-04 14:00:00 60
2016-07-04 15:00:00 30
2016-07-04 16:00:00 0
2016-07-04 17:00:00 0
2016-07-04 18:00:00 0
2016-07-04 19:00:00 0
2016-07-04 20:00:00 0
2016-07-04 21:00:00 0
2016-07-04 22:00:00 0
2016-07-04 23:00:00 0
2016-07-05 00:00:00 0
2016-07-05 01:00:00 0
2016-07-05 02:00:00 0
2016-07-05 03:00:00 0
2016-07-05 04:00:00 0
2016-07-05 05:00:00 0
2016-07-05 06:00:00 0
2016-07-05 07:00:00 0
2016-07-05 08:00:00 0
2016-07-05 09:00:00 60
2016-07-05 10:00:00 0
2016-07-05 11:00:00 0
2016-07-05 12:00:00 0
2016-07-05 13:00:00 0
2016-07-05 14:00:00 30
2016-07-05 15:00:00 0
2016-07-05 16:00:00 0
2016-07-05 17:00:00 0
2016-07-05 18:00:00 0
2016-07-05 19:00:00 0
2016-07-05 20:00:00 0
2016-07-05 21:00:00 0
2016-07-05 22:00:00 0
2016-07-05 23:00:00 0
2016-07-06 00:00:00 0
2016-07-06 01:00:00 0
2016-07-06 02:00:00 0
2016-07-06 03:00:00 0
2016-07-06 04:00:00 0
2016-07-06 05:00:00 0
2016-07-06 06:00:00 0
2016-07-06 07:00:00 0
2016-07-06 08:00:00 0
2016-07-06 09:00:00 60
2016-07-06 10:00:00 0
2016-07-06 11:00:00 0
2016-07-06 12:00:00 30
2016-07-06 13:00:00 60
2016-07-06 14:00:00 0
2016-07-06 15:00:00 60
2016-07-06 16:00:00 0
2016-07-06 17:00:00 0
2016-07-06 18:00:00 0
2016-07-06 19:00:00 0
2016-07-06 20:00:00 0
2016-07-06 21:00:00 0
2016-07-06 22:00:00 0
2016-07-06 23:00:00 0
2016-07-07 00:00:00 0
One possibility is creating a time series (s below) indexed by minute that keeps track of whether you are in a meeting during that minute or not, and then resampling that by hour. To match your desired output, you may adjust the start and end time of the index of s.
import io
import pandas as pd
data = io.StringIO('''\
Start,End,Duration,Subject
0,07/04/16 10:30:00,07/04/16 11:00:00,30,Inspirational Poster Design Session
1,07/04/16 15:00:00,07/04/16 15:30:00,30,Corporate Speak Do's and Don'ts
2,07/04/16 09:00:00,07/04/16 12:00:00,180,Metrics or Matrix -Panel Discussion
3,07/04/16 13:30:00,07/04/16 15:00:00,90,"Do More with Less" kickoff party
4,07/05/16 09:00:00,07/05/16 10:00:00,60,Fiscal or Physical -Panel Discussion
5,07/05/16 14:00:00,07/05/16 14:30:00,30,"Why we can't have nice thing" training video
6,07/06/16 15:00:00,07/06/16 16:00:00,60,One-on-One with manager -Panel Discussion
7,07/06/16 09:00:00,07/06/16 10:00:00,60,Fireing for Performance leadership session
8,07/06/16 13:00:00,07/06/16 14:00:00,60,Birthday Cake in the conference room *MANDATORY*
9,07/06/16 12:30:00,07/06/16 13:30:00,60,Obligatory lunchtime meeting because it was the only time everyone
''')
df = pd.read_csv(data, usecols=['Start', 'End', 'Subject'])
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# Ranges in datetime indices include the right endpoint
tdel = pd.Timedelta('1min')
s = pd.Series(False, index=pd.date_range(start=df['Start'].min(),
                                         end=df['End'].max() - tdel,
                                         freq='min'))
for _, meeting in df.iterrows():
    s[meeting['Start'] : meeting['End'] - tdel] = True
result = s.resample('1H').sum().astype(int)
print(result)
Output:
2016-07-04 09:00:00 60
2016-07-04 10:00:00 60
2016-07-04 11:00:00 60
2016-07-04 12:00:00 0
2016-07-04 13:00:00 30
2016-07-04 14:00:00 60
2016-07-04 15:00:00 30
2016-07-04 16:00:00 0
2016-07-04 17:00:00 0
2016-07-04 18:00:00 0
2016-07-04 19:00:00 0
2016-07-04 20:00:00 0
2016-07-04 21:00:00 0
2016-07-04 22:00:00 0
2016-07-04 23:00:00 0
2016-07-05 00:00:00 0
2016-07-05 01:00:00 0
2016-07-05 02:00:00 0
2016-07-05 03:00:00 0
2016-07-05 04:00:00 0
2016-07-05 05:00:00 0
2016-07-05 06:00:00 0
2016-07-05 07:00:00 0
2016-07-05 08:00:00 0
2016-07-05 09:00:00 60
2016-07-05 10:00:00 0
2016-07-05 11:00:00 0
2016-07-05 12:00:00 0
2016-07-05 13:00:00 0
2016-07-05 14:00:00 30
2016-07-05 15:00:00 0
2016-07-05 16:00:00 0
2016-07-05 17:00:00 0
2016-07-05 18:00:00 0
2016-07-05 19:00:00 0
2016-07-05 20:00:00 0
2016-07-05 21:00:00 0
2016-07-05 22:00:00 0
2016-07-05 23:00:00 0
2016-07-06 00:00:00 0
2016-07-06 01:00:00 0
2016-07-06 02:00:00 0
2016-07-06 03:00:00 0
2016-07-06 04:00:00 0
2016-07-06 05:00:00 0
2016-07-06 06:00:00 0
2016-07-06 07:00:00 0
2016-07-06 08:00:00 0
2016-07-06 09:00:00 60
2016-07-06 10:00:00 0
2016-07-06 11:00:00 0
2016-07-06 12:00:00 30
2016-07-06 13:00:00 60
2016-07-06 14:00:00 0
2016-07-06 15:00:00 60
Freq: H, dtype: int64
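As the answer notes, you may adjust the index of s to cover full days; alternatively, the hourly result can be padded out with zeros afterwards. A minimal sketch, where `result` stands in for the hourly Series computed above:

```python
import pandas as pd

# Stand-in for the hourly minutes-in-meetings Series computed above
result = pd.Series(
    [60, 30],
    index=pd.to_datetime(["2016-07-04 09:00:00", "2016-07-04 13:00:00"]),
)

# Extend the index from midnight of the first day to 23:00 of the last,
# filling hours with no meetings with 0
full_index = pd.date_range(result.index.min().normalize(),
                           result.index.max().normalize() + pd.Timedelta('23H'),
                           freq='H')
padded = result.reindex(full_index, fill_value=0)
print(padded)
```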
