How do I split up a duration while upsampling a DataFrame, as in the example below? And can I replace the for loop with, e.g., the groupby function?
I want to use pandas to transform data like this:
activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
into this:
time started Bedtime videos Commute
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06
...
And I got this far:
import pandas as pd

df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34"]})
# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])
# calculating the duration
df['duration'] = df['time ended'] - df['time started']
# changing the index
df.index = df['time started']
df = df.drop(columns=['time started', 'time ended'])
# one duration column per activity
for a in df['activity name'].unique():
    df[a] = (df['activity name'] == a) * df['duration']
df = df.drop(columns=['activity name', 'duration'])
df.resample('H').first()
Bedtime videos Commute
time started
2021-10-25 00:00:00 0 days 08:25:42 0 days 00:00:00 0 days
2021-10-25 01:00:00 NaT NaT NaT
2021-10-25 02:00:00 NaT NaT NaT
2021-10-25 03:00:00 NaT NaT NaT
2021-10-25 04:00:00 NaT NaT NaT
2021-10-25 05:00:00 NaT NaT NaT
2021-10-25 06:00:00 NaT NaT NaT
2021-10-25 07:00:00 NaT NaT NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12 0 days
Try this:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
date=[
pd.date_range(s, e, freq="s")
for s, e in zip(df["time started"], df["time ended"])
]
).explode("date")
df_out = (
df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
.count()
.unstack(0)
.apply(pd.to_timedelta, unit="s")
)
print(df_out)
Output:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
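A note on how this works: pd.date_range(..., freq="s") generates one timestamp per second of each activity, and explode turns that list into one row per second, so the hourly groupby count is the number of seconds spent in each hour. A minimal sketch on a single hypothetical row:

import pandas as pd

# one short hypothetical activity, 08:51:54 -> 08:52:00
row = pd.DataFrame({"activity name": ["Commute"],
                    "time started": [pd.Timestamp("2021-10-25 08:51:54")],
                    "time ended": [pd.Timestamp("2021-10-25 08:52:00")]})

# one timestamp per second of the activity, then one row per timestamp
expanded = row.assign(
    date=[pd.date_range(s, e, freq="s")
          for s, e in zip(row["time started"], row["time ended"])]
).explode("date")
print(len(expanded))  # 7 rows: both endpoints are included

Because both endpoints are included, each activity picks up one extra second, which is why the output above shows 0 days 00:25:43 where the raw duration is 00:25:42; in pandas 1.4+, passing inclusive="left" to pd.date_range would avoid this. Also note that expanding to one row per second can get expensive for long time spans.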
To address @DerekO's comment:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Bedtime 2021-10-25 11:00:00 2021-10-25 13:04:31"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
date=[
pd.date_range(s, e, freq="s")
for s, e in zip(df["time started"], df["time ended"])
]
).explode("date")
df_out = (
df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
.count()
.unstack(0)
.apply(pd.to_timedelta, unit="s")
.sort_index()
)
print(df_out)
Output:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
2021-10-25 11:00:00 0 days 01:00:00 NaT NaT
2021-10-25 12:00:00 0 days 01:00:00 NaT NaT
2021-10-25 13:00:00 0 days 00:04:32 NaT NaT
Although I agree that using groupby and resample would be best, I couldn't make such a solution work. You can instead brute-force the problem by creating a new DataFrame for every row of your original DataFrame and concatenating them together.
The way it works is that we use pd.date_range to create a DatetimeIndex between the floor of the start and end times, with the exact start and end times inserted into that DatetimeIndex as well. The differences between consecutive datetimes in this DatetimeIndex then become the values of your new DataFrame.
To make my solution as robust as possible, I added two additional rows to your original DataFrame with a repeated category, and tested situations where the starting time falls exactly on the hour versus partway through the hour.
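Before the full code, here is a minimal sketch of the core trick on a single hypothetical interval: build the hourly grid, splice in the exact endpoints, and diff.

import pandas as pd

# one activity running 08:25:42 -> 09:29:34 (hypothetical interval)
start = pd.Timestamp("2021-10-25 08:25:42")
end = pd.Timestamp("2021-10-25 09:29:34")

# hourly grid covering the interval, plus the exact start/end times
grid = pd.date_range(start.floor("h"), end.floor("h"), freq="h")
all_times = grid.union([start, end])

# consecutive differences = time spent in each hourly bucket
print(all_times.to_series().diff().shift(-1))
# 08:00:00 -> 0 days 00:25:42   (slice before the activity starts)
# 08:25:42 -> 0 days 00:34:18
# 09:00:00 -> 0 days 00:29:34
# 09:29:34 -> NaT

The full solution below applies this to every row and discards the unwanted slice before the first activity starts.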
import pandas as pd

df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute', 'Work', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00", "2021-10-25 18:01:00"]})
# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])
## column names with spaces can't be accessed by name when using itertuples to iterate through the df
df.columns = [col.replace(" ", "_") for col in df.columns]
Starting df:
>>> df
activity_name time_started time_ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4 Commute 2021-10-25 17:00:00 2021-10-25 18:01:00
## we use the start and end times to determine what date range we create
start_time = df['time_started'].min().floor('h')
end_time = df['time_started'].max().ceil('h')

## set up an empty DataFrame to hold the final result
new_columns = list(df.activity_name.unique())
df_new = pd.DataFrame(columns=new_columns)

for row in df.itertuples(index=True):
    new_row = {}
    daterange_start = row.time_started.floor('1h')
    daterange_end = row.time_ended.floor('1h')
    datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')
    all_datetimes = datetimes_index.union([row.time_started, row.time_ended])
    ## take the difference and shift by -1 to drop the first NaT
    new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)
    ## if the first row starts in the middle of an hour, we don't want the
    ## difference between the beginning of the hour and the time in that row
    ## (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent here)
    if (row.Index == 0) and (row.time_started > daterange_start):
        df_new = pd.concat([df_new, pd.DataFrame(new_row)])[1:]
    else:
        df_new = pd.concat([df_new, pd.DataFrame(new_row)])

df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
9 2021-10-25 08:25:42 NaT NaT NaT NaT
10 2021-10-25 08:00:00 NaT 0 days 00:25:42 NaT NaT
11 2021-10-25 08:25:42 NaT 0 days 00:26:12 NaT NaT
12 2021-10-25 08:51:54 NaT NaT NaT NaT
13 2021-10-25 08:00:00 NaT NaT 0 days 00:51:54 NaT
14 2021-10-25 08:51:54 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
16 2021-10-25 09:29:34 NaT NaT NaT NaT
17 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:29:34
18 2021-10-25 09:29:34 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
26 2021-10-25 17:00:00 NaT NaT NaT NaT
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
29 2021-10-25 18:01:00 NaT NaT NaT NaT
For each activity we created a new DataFrame, obtaining the differences between times with all_datetimes.to_series().diff().shift(-1), which means there is a NaT row at each change of activity. These aren't useful, so we drop any rows where all of the activity columns are NaT.
We then drop duplicate timestamps in the time_started column, keeping the first value of each duplicate, and take the floor of all timestamps in the time_started column:
df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].apply(lambda x: x.floor('1h'))
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
11 2021-10-25 08:00:00 NaT 0 days 00:26:12 NaT NaT
14 2021-10-25 08:00:00 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
18 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
Now we fill all NaT with pd.Timedelta(0); then we can group by the values in the time_started column and sum them together:
df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()
Final result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1 2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2 2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3 2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4 2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5 2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6 2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7 2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8 2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9 2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00
I wonder if it is possible to convert an irregular time series to a regular one without interpolating values from another column, like this:
Index count
2018-01-05 00:00:00 1
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-11 00:00:00 2
2018-01-14 00:00:00 5
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
And I expect the result to be something like this:
Index count
2018-01-01 00:00:00 0
2018-01-02 00:00:00 0
2018-01-03 00:00:00 0
2018-01-04 00:00:00 0
2018-01-05 00:00:00 1
2018-01-06 00:00:00 0
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-09 00:00:00 0
2018-01-10 00:00:00 0
2018-01-11 00:00:00 2
2018-01-12 00:00:00 0
2018-01-13 00:00:00 0
2018-01-14 00:00:00 5
2018-01-15 00:00:00 0
2018-01-16 00:00:00 0
2018-01-17 00:00:00 0
2018-01-18 00:00:00 0
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-27 00:00:00 0
2018-12-28 00:00:00 0
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
2018-12-31 00:00:00 0
So far I have just tried pandas' resample, but it only partially solved my problem.
Thanks in advance
Use DataFrame.reindex with date_range:
#if necessary
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2018-01-01','2018-12-31'), fill_value=0)
print(df)
count
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 1
...
2018-12-27 0
2018-12-28 0
2018-12-29 7
2018-12-30 8
2018-12-31 0
[365 rows x 1 columns]
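If you only need to fill the gaps between the existing minimum and maximum dates (rather than padding out the whole year), asfreq is a possible alternative; a minimal sketch with hypothetical values:

import pandas as pd

# a tiny frame with gaps (hypothetical values)
df = pd.DataFrame({'count': [1, 4]},
                  index=pd.to_datetime(['2018-01-05', '2018-01-07']))

# inserts the missing daily rows between the first and last existing
# dates, filling the new rows with 0
print(df.asfreq('D', fill_value=0))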
I have a DataFrame that has a datetime column called start time, and it is set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00 and the second row is 00:02:00, that is, at one-minute intervals.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?
You could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
gives you
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
Supply a full date/time string to get another starting date.
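For example, with a hypothetical start date:

import pandas as pd

df = pd.DataFrame({'ID': ['A001', 'A002', 'A003']})  # hypothetical frame
# anchor the range to an explicit date/time instead of today's date
df['Start Time'] = pd.date_range('2019-09-30 08:00', periods=len(df), freq='1min')
print(df)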
First we convert your State Time column to datetime type. Then we use pd.date_range with the first time as the starting point and a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
                                 periods=len(df),
                                 freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
missing data point 1
6 17111 2018-01-01 07:00:00 1.835
7 17112 2018-01-01 00:00:00 1.095
8 17112 2018-01-01 01:00:00 1.129
missing data point 2
9 17112 2018-01-01 03:00:00 1.833
10 17112 2018-01-01 04:00:00 1.697
11 17112 2018-01-01 05:00:00 1.835
For every customer, I have hourly data. However, some data points are missing in between. I want to check the min and max of UsageDate, fill in the missing UsageDates in that interval (all values are per hour), and set EnergyConsumed to zero for them. I can later use ffill or backfill to take care of this.
Not every customer's max UsageDate is 2018-01-31 23:00:00, so we only want to extend each series to that customer's max date.
missing point 1 is replaced by
17111 2018-01-01 06:00:00 0
missing point 2 is replaced by
17112 2018-01-01 02:00:00 0
My main point of trouble is how to find the min and max date for every customer and then generate the missing dates in between.
I have tried indexing by date and resampling, but that hasn't helped me reach the solution.
Also, I was wondering if there is a way to directly find CustIDs which have missing values in the pattern described above. My data is very large and the solution provided by @Vaishali is computationally heavy. Any inputs would be helpful!
You can group the DataFrame by CustID and create an index with the desired date range, then use this index to reindex the data:
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
idx = df.groupby('CustID')['UsageDate'].apply(
    lambda x: pd.Series(index=pd.date_range(x.min(), x.max(), freq='H'))).index
df.set_index(['CustID', 'UsageDate']).reindex(idx).fillna(0).reset_index().rename(
    columns={'level_1': 'UsageDate'})
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
Explanation: Since the UsageDates have to be all the dates in the range of the minimum and maximum date for that CustID, we group the data by CustID and create a series covering the min-to-max date range using date_range, setting the dates as the index of the series rather than the values. The result of the groupby is a MultiIndex with CustID as level 0 and UsageDate as level 1. We then use this MultiIndex to reindex the original DataFrame: it uses the values where the index matches and assigns NaN elsewhere. Finally we convert the NaN to 0 using fillna.
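To make the intermediate step concrete, here is a sketch of the hourly range built for one customer (only the dates matter; the series values are never used):

# hourly range between customer 17111's first and last reading
one = df[df['CustID'] == 17111]
print(pd.date_range(one['UsageDate'].min(), one['UsageDate'].max(), freq='H'))
# 8 hourly timestamps: 2018-01-01 00:00:00 through 2018-01-01 07:00:00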
First create a DatetimeIndex and then use asfreq inside apply:
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
df = (df.set_index('UsageDate')
        .groupby('CustID')['EnergyConsumed']
        .apply(lambda x: x.asfreq('H'))
        .fillna(0)
        .reset_index())
print(df)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
It is also possible to use the method parameter with ffill or bfill:
df = (df.set_index('UsageDate')
        .groupby('CustID')['EnergyConsumed']
        .apply(lambda x: x.asfreq('H', method='ffill'))
        .reset_index())
print(df)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 1.835
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 1.129
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
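A small compatibility note: in recent pandas (2.2+) the uppercase 'H' frequency alias is deprecated in favor of lowercase 'h', so the same logic would be written as:

# identical logic with the non-deprecated lowercase frequency alias
df = (df.set_index('UsageDate')
        .groupby('CustID')['EnergyConsumed']
        .apply(lambda x: x.asfreq('h'))
        .fillna(0)
        .reset_index())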
I'm new to pandas/Python.
I have a DataFrame (events.number) indexed by a datetime object.
I'm trying to extract an hourly event count on every Monday (or another particular weekday). I wrote:
hour_tally_monday = events.number.groupby(lambda x: (x.hour & x.weekday==0) ).count()
but this does not work correctly.
I can drop the "& x.weekday==0" and it works, but presumably it then uses all the days in the frame. What's the right (simplest) syntax to just average over Mondays?
I think you need to first filter the DataFrame with boolean indexing and then use groupby with size:
import pandas as pd
start = pd.to_datetime('2016-02-01')
end = pd.to_datetime('2016-02-25')
rng = pd.date_range(start, end, freq='12H')
events = pd.DataFrame({'number': [1] * 20 + [2] * 15 + [3] * 14}, index=rng)
print(events)
number
2016-02-01 00:00:00 1
2016-02-01 12:00:00 1
2016-02-02 00:00:00 1
2016-02-02 12:00:00 1
2016-02-03 00:00:00 1
2016-02-03 12:00:00 1
2016-02-04 00:00:00 1
2016-02-04 12:00:00 1
2016-02-05 00:00:00 1
2016-02-05 12:00:00 1
2016-02-06 00:00:00 1
2016-02-06 12:00:00 1
2016-02-07 00:00:00 1
...
...
filtered = events[events.index.weekday == 0]
print(filtered)
number
2016-02-01 00:00:00 1
2016-02-01 12:00:00 1
2016-02-08 00:00:00 1
2016-02-08 12:00:00 1
2016-02-15 00:00:00 2
2016-02-15 12:00:00 2
2016-02-22 00:00:00 3
2016-02-22 12:00:00 3
In version 0.18.1 you can use the new DatetimeIndex.weekday_name attribute:
filtered = events[events.index.weekday_name == 'Monday']
print(filtered)
number
2016-02-01 00:00:00 1
2016-02-01 12:00:00 1
2016-02-08 00:00:00 1
2016-02-08 12:00:00 1
2016-02-15 00:00:00 2
2016-02-15 12:00:00 2
2016-02-22 00:00:00 3
2016-02-22 12:00:00 3
print(filtered.groupby(filtered.index.hour).size())
0 4
12 4
dtype: int64
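Note that weekday_name was later deprecated and removed; in modern pandas the equivalent is the day_name() method:

# modern pandas: day_name() replaces the removed weekday_name attribute
filtered = events[events.index.day_name() == 'Monday']
print(filtered.groupby(filtered.index.hour).size())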