unstacking shift data (start and end time) into hourly data - python

I have a df as follows which shows when a person started a shift, when they ended it, the number of hours worked, and the business date.
Business_Date Number PayTimeStart PayTimeEnd Hours
0 2019-05-24 1 2019-05-24 11:00:00 2019-05-24 12:15:00 1.250
1 2019-05-24 2 2019-05-24 12:30:00 2019-05-24 13:30:00 1.00
Now what I'm trying to do is break this into an hourly format, so I know how many hours were worked between, say, 11:00 and 12:00.
So, in my head, for the first row above, I want to put the 1 hour between 11:00 and 12:00 into the bin for 11:00 and the remaining 0.25 into the next bin, 12:00.
I would end up with something like:
Business Date Time Hour
0 2019-05-24 11:00 1
1 2019-05-24 12:00 0.75
2 2019-05-24 13:00 0.5
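For reference, a minimal reconstruction of the sample frame might look like the sketch below (an assumption on my part: the answers that follow variously spell the time columns with spaces, 'Pay Time Start', and with underscores, Pay_Time_Start, so rename to match whichever snippet you run):
import pandas as pd

df = pd.DataFrame({
    'Business Date': pd.to_datetime(['2019-05-24', '2019-05-24']),
    'Number': [1, 2],
    'Pay Time Start': pd.to_datetime(['2019-05-24 11:00:00', '2019-05-24 12:30:00']),
    'Pay Time End': pd.to_datetime(['2019-05-24 12:15:00', '2019-05-24 13:30:00']),
    'Hours': [1.25, 1.0],
})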

One idea is to work in minutes: first use a list comprehension, flattened into a Series, to expand each shift into per-minute timestamps, then group by date and hour and count with GroupBy.size, and finally divide by 60 to get hours:
s = pd.Series([z for x, y in zip(df['Pay Time Start'],
                                 df['Pay Time End'] - pd.Timedelta(60, unit='s'))
                 for z in pd.date_range(x, y, freq='Min')])

df = (s.groupby([s.dt.date.rename('Business Date'), s.dt.hour.rename('Time')])
       .size()
       .div(60)
       .reset_index(name='Hour'))
print(df)
Business Date Time Hour
0 2019-05-24 11 1.00
1 2019-05-24 12 0.75
2 2019-05-24 13 0.50
If you need to group by a location or ID as well:
df1 = pd.DataFrame([(z, w) for x, y, w in zip(df['Pay Time Start'],
                                              df['Pay Time End'] - pd.Timedelta(60, unit='s'),
                                              df['Location'])
                           for z in pd.date_range(x, y, freq='Min')],
                   columns=['Date', 'Location'])

df = (df1.groupby([df1['Date'].dt.date.rename('Business Date'),
                   df1['Date'].dt.hour.rename('Time'),
                   df1['Location']])
         .size()
         .div(60)
         .reset_index(name='Hour'))

Another idea, similar to jezrael's, but working in seconds for higher precision:
def get_series(a):
    s, e, h = a
    # sample the shift on a 6-second grid and spread its hours evenly over the samples
    idx = pd.date_range(s, e, freq='6s')
    return pd.Series(h / len(idx), index=idx)

(pd.concat(map(get_series, zip(df.Pay_Time_Start,
                               df.Pay_Time_End,
                               df.Hours)))
   .resample('H').sum()
)
Output:
2019-05-24 11:00:00 0.998668
2019-05-24 12:00:00 0.750500
2019-05-24 13:00:00 0.500832
Freq: H, dtype: float64
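The values are slightly off the exact 1.00/0.75/0.50 because pd.date_range includes both endpoints, so each shift's hours are spread over len(idx) grid points rather than over its true duration. A finer grid converges toward the exact split; for example, the same function on a 1-second grid (my variation, not part of the original answer):
def get_series(a):
    s, e, h = a
    # 1-second grid: ~6x more points than '6s', so a tighter estimate;
    # an exact allocation would need interval-overlap arithmetic instead
    idx = pd.date_range(s, e, freq='s')
    return pd.Series(h / len(idx), index=idx)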

Another idea, just for your convenience (plus I like challenging questions), is using melt and then conditionally calculating the minutes.
Basically, you have two formulas for your calculations (pseudocode):
Minutes contributed by the start hour: 60 - minutes of df['Pay Time Start']
Minutes contributed by the end hour: minutes of df['Pay Time End']
So we can use these formulas to create our new data.
First we melt our times into one column:
import numpy as np

new = df.melt(id_vars=['Business Date', 'Number'],
              value_vars=['Pay Time Start', 'Pay Time End'],
              var_name='Pay Time Name',
              value_name='Pay Time Date').sort_values('Number')

# apply the formulas noted above
new['Minutes'] = np.where(new['Pay Time Name'].eq('Pay Time Start'),
                          60 - new['Pay Time Date'].dt.minute,
                          new['Pay Time Date'].dt.minute)
# Out
Business Date Number Pay Time Name Pay Time Date Minutes
0 2019-05-24 1 Pay Time Start 2019-05-24 11:00:00 60
2 2019-05-24 1 Pay Time End 2019-05-24 12:15:00 15
1 2019-05-24 2 Pay Time Start 2019-05-24 12:30:00 30
3 2019-05-24 2 Pay Time End 2019-05-24 13:30:00 30
Now we calculate the number of hours per bin with groupby:
daterange = pd.date_range(df['Pay Time Start'].min(), df['Pay Time End'].max(), freq='H')

# Date/Time kept as plain objects; a datetime64 dtype would reject the time values
df_new = pd.DataFrame({'Date': daterange.date,
                       'Time': daterange.time})

df_new['Hours'] = (new.groupby(new['Pay Time Date'].dt.hour)['Minutes'].sum() / 60).to_numpy()
Final Output
   Date        Time      Hours
0  2019-05-24  11:00:00  1.00
1  2019-05-24  12:00:00  0.75
2  2019-05-24  13:00:00  0.50
Note that this approach assumes each shift contributes minutes only to its start hour and its end hour, and that those hour bins line up positionally with daterange; a shift spanning more than two hour bins would need the full in-between hours filled in as well.

Related

Is there a simple way to calculate the sum of the duration for each hour in the dataframe?

I have a dataframe similar to this one:
user  start_date           end_date             activity
1     2022-05-21 07:23:58  2022-05-21 14:23:48  Sleep
2     2022-05-21 12:59:16  2022-05-21 14:59:16  Eat
1     2022-05-21 18:59:16  2022-05-21 21:20:16  Work
3     2022-05-21 18:50:00  2022-05-21 21:20:16  Work
I want a dataframe in which the columns are the activities and the rows contain the sum of the duration of each activity, across all users, in that hour. Sorry, I find it hard to even put my thoughts into words. The expected result should be similar to this one:
hour  sleep[s]  eat[s]  work[s]
00    0         0       0
01    0         0       0
...   ...       ...     ...
07    2162      0       0
08    3600      0       0
...   ...       ...     ...
18    0         0       644
...   ...       ...     ...
The dataframe has more than 10 million rows, so I'm looking for something fast. I was trying to do something with crosstab to get the expected columns and resample to get the rows, but I'm not even close to finding a solution. I would be really grateful for any ideas on how to do that. This is my first question, so I apologize in advance for any mistakes :)
I am assuming the activities can span more than one day and that you actually want the summary for each day. If not, the answer can be adapted by using datetime.hour instead of binning by the start time.
The approach:
- Calculate the duration within each top-of-the-hour interval
- Expand the rows into 1-hour intervals
- Pivot the dataframe by activity
Code:
import pandas as pd
from datetime import datetime, timedelta

data = {'users': [1, 2, 1, 3],
        'start': ['2022-05-21 07:23:58', '2022-05-21 12:59:16', '2022-05-21 18:59:16', '2022-05-21 18:50:00'],
        'end': ['2022-05-21 14:23:48', '2022-05-21 14:59:16', '2022-05-21 21:20:16', '2022-05-21 21:20:16'],
        'activity': ['Sleep', 'Eat', 'Work', 'Work']}
df = pd.DataFrame(data)

# convert to datetime
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)

# Here is where the answer starts.
# Function to expand each duration into a list of (hour, seconds) pairs
def genhourlist(e):
    start = e.start
    end = e.end
    nexthour = datetime(start.year, start.month, start.day, start.hour) + timedelta(hours=1)
    lst = []
    while end > nexthour:
        inter = (nexthour - start) / pd.Timedelta(seconds=1)
        lst.append((datetime(start.year, start.month, start.day, start.hour), inter))
        start = nexthour
        nexthour = nexthour + timedelta(hours=1)
    inter = (end - start) / pd.Timedelta(seconds=1)
    lst.append((datetime(end.year, end.month, end.day, end.hour), inter))
    return lst

# expand the durations
df['duration'] = df.apply(genhourlist, axis=1)
df = df.explode('duration')

# split the (hour, seconds) pairs back into columns
df['start'] = df.duration.apply(lambda x: x[0])
df['duration'] = df.duration.apply(lambda x: x[1])

pd.pivot_table(df, index=['start'], columns=['activity'], values=['duration'], aggfunc='sum')
Result:
duration
activity Eat Sleep Work
start
2022-05-21 07:00:00 NaN 2162.0 NaN
2022-05-21 08:00:00 NaN 3600.0 NaN
2022-05-21 09:00:00 NaN 3600.0 NaN
2022-05-21 10:00:00 NaN 3600.0 NaN
2022-05-21 11:00:00 NaN 3600.0 NaN
2022-05-21 12:00:00 44.0 3600.0 NaN
2022-05-21 13:00:00 3600.0 3600.0 NaN
2022-05-21 14:00:00 3556.0 1428.0 NaN
2022-05-21 18:00:00 NaN NaN 644.0
2022-05-21 19:00:00 NaN NaN 7200.0
2022-05-21 20:00:00 NaN NaN 7200.0
2022-05-21 21:00:00 NaN NaN 2432.0
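To get closer to the layout asked for (plain activity columns, zeros instead of NaN, hour-of-day rows), the pivot can be post-processed. A small sketch, assuming a single day of data and that the pivoted frame is assigned to result:
result = pd.pivot_table(df, index=['start'], columns=['activity'],
                        values=['duration'], aggfunc='sum')
result = result['duration'].fillna(0)  # drop the outer 'duration' level; NaN -> 0
result.index = result.index.hour       # keep only the hour of day, as in the question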
Use:
# get the difference in seconds between end and start;
# repeat counts must be integers, hence the cast
tot = df['end_date'].sub(df['start_date']).dt.total_seconds().astype(int)

# repeat unique index values, if necessary create a default index first
# df = df.reset_index(drop=True)
df = df.loc[df.index.repeat(tot)].copy()

# add timedeltas in seconds to the start datetimes
df['start'] = df['start_date'] + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='s')

# group by hour bins and activity: the row count per cell equals total seconds
df = df.groupby([pd.Grouper(freq='H', key='start'), 'activity']).size().unstack()
print(df)
print (df)
activity Eat Sleep Work
start
2022-05-21 07:00:00 NaN 2162.0 NaN
2022-05-21 08:00:00 NaN 3600.0 NaN
2022-05-21 09:00:00 NaN 3600.0 NaN
2022-05-21 10:00:00 NaN 3600.0 NaN
2022-05-21 11:00:00 NaN 3600.0 NaN
2022-05-21 12:00:00 44.0 3600.0 NaN
2022-05-21 13:00:00 3600.0 3600.0 NaN
2022-05-21 14:00:00 3556.0 1428.0 NaN
2022-05-21 18:00:00 NaN NaN 644.0
2022-05-21 19:00:00 NaN NaN 7200.0
2022-05-21 20:00:00 NaN NaN 7200.0
2022-05-21 21:00:00 NaN NaN 2432.0
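One caveat with this one-row-per-second expansion: on 10 million input rows it can exhaust memory, since a single 7-hour activity alone becomes 25,200 rows. A quick sanity check before repeating (my suggestion, not part of the original answer):
# the repeat creates one row per elapsed second across all activities
expected_rows = int(tot.sum())
print(f'expansion would create {expected_rows:,} rows')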
Here's what I did with the 4 rows of data you posted, along with my assumptions:
Goal = for each row:
- calculate the total elapsed time,
- determine the start hour sh and the end hour eh,
- work out how many seconds pertain to sh & eh
then:
- create a 24-row df with the results added up for all users (assuming the data starts at 00:00 and ends at 23:59 on the same day, there isn't any other day, and you're interested in the aggregate result for all users)
I'm using a for loop to build the final dataframe. It's probably not efficient enough if you have 10m rows in your original dataset, but it gets the job done on a smaller subset. Hopefully you can improve upon it!
import pandas as pd
import numpy as np

# work out total elapsed time, in seconds
df['total'] = df['end_date'] - df['start_date']
df['total'] = df['total'].dt.total_seconds()

# determine start/end hours (just the value: 07:00 AM => 7)
df['sh'] = df['start_date'].dt.hour
df['eh'] = df['end_date'].dt.hour

# determine the hour boundaries on either side
df['next_hour'] = df['start_date'].dt.ceil('H')      # 07:23:58 => 08:00:00
df['previous_hour'] = df['end_date'].dt.floor('H')   # 14:23:48 => 14:00:00

# determine how many seconds pertain to the start/end hours
df['sh_s'] = df['next_hour'] - df['start_date']
df['sh_s'] = df['sh_s'].dt.total_seconds()
df['eh_s'] = df['end_date'] - df['previous_hour']
df['eh_s'] = df['eh_s'].dt.total_seconds()

# where start hour & end hour are the same (start=07:20, end=07:30)
my_mask = df['sh'].eq(df['eh']).values
# set sh_s to be 07:30 - 07:20 = 10 minutes = 600 seconds
df['sh_s'] = df['sh_s'].where(~my_mask, other=(df['end_date'] - df['start_date']).dt.total_seconds())
# set eh_s to be 0
df['eh_s'] = df['eh_s'].where(~my_mask, other=0)

# add all hour columns to df (initialised to 0, not '', so the later sum works)
hour_columns = np.arange(24)
df[hour_columns] = 0
# Sorry for this horrible loop below... Hopefully you can improve upon it or get rid of it completely!
for i in range(len(df)):  # for each row:
    sh = df.iloc[i, df.columns.get_loc('sh')]      # get start hour value
    sh_s = df.iloc[i, df.columns.get_loc('sh_s')]  # get seconds pertaining to start hour
    eh = df.iloc[i, df.columns.get_loc('eh')]      # get end hour value
    eh_s = df.iloc[i, df.columns.get_loc('eh_s')]  # get seconds pertaining to end hour
    # fill in the hour columns
    if sh == eh:
        # start hour = end hour: sh_s already holds the full duration
        df.iloc[i, df.columns.get_loc(sh)] = sh_s
        # ignore eh_s (it would cancel out the previous line)
    else:
        # start hour != end hour: report both to the hour columns
        df.iloc[i, df.columns.get_loc(sh)] = sh_s
        df.iloc[i, df.columns.get_loc(eh)] = eh_s
        # for each full hour between sh & eh, input 3600 seconds
        for j in range(sh + 1, eh):
            df.iloc[i, df.columns.get_loc(j)] = 3600

df.groupby('activity', as_index=False)[hour_columns].sum().transpose()
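As one possible improvement on the loop (a hedged sketch of an interval-overlap computation, not the answer author's code): the seconds a row contributes to hour bin h are the overlap of [start, end] with [h:00, h+1:00), which can be computed for all rows at once:
# vectorised fill of the 24 hour columns, assuming all rows fall on the
# same day (the same assumption the loop above makes)
day = df['start_date'].dt.normalize()
start_s = (df['start_date'] - day).dt.total_seconds()
end_s = (df['end_date'] - day).dt.total_seconds()
for h in range(24):
    # overlap = max(0, min(end, h+1 hours) - max(start, h hours)), in seconds
    df[h] = (np.minimum(end_s, (h + 1) * 3600)
             - np.maximum(start_s, h * 3600)).clip(lower=0)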

Fraction of a day, of a week and of a month conversion from date in python

I have this pandas dataframe with times:
id time
1 4/01/2019 08:00:00
2 4/02/2019 12:00:00
3 4/03/2019 18:00:00
And I want the fraction of the day, the fraction of the week and the fraction of the month. For example, for the first row, 08:00:00 is one third of a day, so the first column should be 0.333. It was a Monday, so the week fraction should be 0.047 (a complete day is 1/7 = 0.143 of a week, and since only a third of the day has passed, 0.143 * 0.333 = 0.047). And it was the start of the month, so the month fraction should be 0.011 (a complete day is 1/30 = 0.033 of a month, and since it is only 8 AM, 0.033 * 0.333 = 0.011).
Please note that the fractions accumulate over completed days; for example, for 4/02/2019 12:00:00, one and a half days of the week and of the month have elapsed.
The expected result should be:
id time frac_day frac_week frac_month
1 4/01/2019 08:00:00 0.333 0.047 0.011
2 4/02/2019 12:00:00 0.5 0.214 0.050
3 4/03/2019 18:00:00 0.75 0.393 0.092
Please, could you help me with this question in python? Any help will be greatly appreciated.
Try:
import pandas as pd
from pandas.tseries.offsets import MonthEnd, Day

df = pd.DataFrame({
    'id': [1, 2, 3],
    'time': ['4/01/2019 08:00:00', '4/02/2019 12:00:00',
             '4/03/2019 18:00:00']
})
df['time'] = pd.to_datetime(df['time'])

# fraction of the day: current hour divided by the number of hours in a day
df['frac_day'] = df['time'].dt.hour / 24

# get midnight at the beginning of the week for each row
beginning_of_each_week = (
    df['time'] - pd.to_timedelta(df['time'].dt.dayofweek, unit='D')
).dt.normalize()
seconds_in_week = 24 * 7 * 60 * 60
# fraction of the week so far: elapsed seconds over seconds in a week
df['frac_week'] = (
    df['time'] - beginning_of_each_week
).dt.total_seconds() / seconds_in_week

# timedelta since midnight on the first day of the current month
time_so_far = df['time'] - (df['time'] - MonthEnd(1) + Day(1)).dt.normalize()
# total time in the given month
time_in_month = (df['time'] + MonthEnd(1)) - (df['time'] - MonthEnd(1))
# fraction of the month so far
df['frac_month'] = time_so_far / time_in_month
df:
id time frac_day frac_week frac_month
0 1 2019-04-01 08:00:00 0.333333 0.047619 0.011111
1 2 2019-04-02 12:00:00 0.500000 0.214286 0.050000
2 3 2019-04-03 18:00:00 0.750000 0.392857 0.091667
Another solution:
df["frac_day"] = (
df["time"].dt.hour * 60 * 60
+ df["time"].dt.minute * 60
+ df["time"].dt.second
) / (24 * 60 * 60)
df["frac_week"] = (df["time"].dt.dayofweek + df["frac_day"]) / 7
df["frac_month"] = df["time"].apply(lambda x: pd.Period(str(x)).days_in_month)
df["frac_month"] = ((df["time"].dt.day - 1) / df["frac_month"]) + df[
"frac_day"
] / df["frac_month"]
print(df)
Prints:
id time frac_day frac_week frac_month
0 1 2019-04-01 08:00:00 0.333333 0.047619 0.011111
1 2 2019-04-02 12:00:00 0.500000 0.214286 0.050000
2 3 2019-04-03 18:00:00 0.750000 0.392857 0.091667
You'll find a lot of built-in support for time and datetime functions.
First, make sure your df['time'] column is correctly stored as a datetime, and then the following should do the trick:
# seconds elapsed in the day divided by total seconds in a day;
# note the subtraction creates a timedelta
# hack: use df['time'].dt.date to set the time to 00:00:00
df['frac_day'] = (df['time'] - pd.to_datetime(df['time'].dt.date)).dt.total_seconds() / (24 * 3600)
df['frac_week'] = (df['time'].dt.dayofweek + df['frac_day']) / 7
df['frac_month'] = (df['time'].dt.day + df['frac_day'] - 1) / df['time'].dt.days_in_month
df
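As a quick sanity check of these formulas against the expected output, take the first row (2019-04-01 08:00, a Monday, in a 30-day month): frac_day = 8/24 ≈ 0.3333, frac_week = (0 + 0.3333)/7 ≈ 0.0476, and frac_month = (1 + 0.3333 - 1)/30 ≈ 0.0111, matching the table above.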

upsample between two date columns

I have the following df
lst = [[1548828606206000000, 1548840373139000000],
[1548841285708000000, 1548841458405000000],
[1548842198276000000, 1548843109519000000],
[1548844022821000000, 1548844934207000000],
[1548845431090000000, 1548845539219000000],
[1548845555332000000, 1548845846621000000],
[1548847176147000000, 1548851020030000000],
[1548851704053000000, 1548852256143000000],
[1548852436514000000, 1548855900767000000],
[1548856817770000000, 1548857162183000000],
[1548858736931000000, 1548858979032000000]]
df = pd.DataFrame(lst,columns =['start','end'])
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
and I would like to get the duration of the event, given its start and end times, per hour. E.g. in my dummy df, the 6th hour should be 60 mins (the maximum per hour) minus 00:10:06 = 00:49:54. The 7th and 8th should be 1:00:00 each, as the first interval ends at 09:26:13. The 9th should be 00:26:13 plus all the intervals in the following rows that overlap the 9th hour: 09:44 - 09:41 = 3 mins, and 60 mins - 00:56 = 4 mins, so the total for the 9th should be 26 + 3 + 4 ≈ 00:32:28.
My initial approach was to merge start and end, add dummy points every 3rd row, upsample to 1S, get the difference between rows, and sum up only the actual rows. There must be a more pythonic way of doing this. Any hint would be great.
IIUC, something like this:
df.apply(lambda x: pd.to_timedelta(pd.Series(1, index=pd.date_range(x.start, x.end, freq='S'))
                                     .groupby(pd.Grouper(freq='H')).count(), unit='S'),
         axis=1).sum()
Output:
2019-01-30 06:00:00 00:49:54
2019-01-30 07:00:00 01:00:00
2019-01-30 08:00:00 01:00:00
2019-01-30 09:00:00 00:32:28
2019-01-30 10:00:00 00:33:43
2019-01-30 11:00:00 00:40:24
2019-01-30 12:00:00 00:45:37
2019-01-30 13:00:00 00:45:01
2019-01-30 14:00:00 00:09:48
Freq: H, dtype: timedelta64[ns]
Or, to aggregate by hour of day (across dates, rather than by absolute hourly bins), try:
df.apply(lambda r: pd.to_timedelta(pd.Series(1, index=pd.date_range(r.start, r.end, freq='S'))
                                     .pipe(lambda x: x.groupby(x.index.hour).count()), unit='S'),
         axis=1).sum()
Output:
6 00:49:54
7 01:00:00
8 01:00:00
9 00:32:28
10 00:33:43
11 00:40:24
12 00:45:37
13 00:45:01
14 00:09:48
dtype: timedelta64[ns]

Aggregate a time interval on individual dates

I have a pandas dataframe with two timestamp columns, start and end:
start end
2014-08-28 17:00:00 | 2014-08-29 22:00:00
2014-08-29 10:45:00 | 2014-09-01 17:00:00
2014-09-01 15:00:00 | 2014-09-01 19:00:00
The intention is to aggregate the number of hours that were logged on a given date. So in the case of my example, I would be creating a date range and aggregating the hours over multiple entries:
2014-08-28 -> 7 hrs
2014-08-29 -> 10 hrs + 1 hr 15 min => 11 hrs 15 mins
2014-08-30 -> 24 hrs
2014-08-31 -> 24 hrs
2014-09-01 -> 17 hrs + 4 hrs => 21 hrs
I've tried using timedelta, but it only splits into absolute hours, not on a per-day basis.
I've also tried to explode the rows (i.e. split each row on a day basis), but I could only get it to work at a date level, not at a timestamp level.
Any suggestions are greatly appreciated.
You can use pd.date_range to create a minute-by-minute interval for each period spent; after that you can count the minutes spent per date and convert the count to a timedelta:
start end
0 2014-08-28 17:00:00 2014-08-29 22:00:00
1 2014-08-29 10:45:00 2014-09-01 17:00:00
2 2014-09-01 15:00:00 2014-09-01 19:00:00
# create the minute-by-minute intervals from the start to the end of each line, flattened into one series of dates
a = pd.Series(sum(df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='min').tolist(), 1).tolist(), [])).dt.date
# count the minute intervals per date and convert the counts to timedeltas
a.value_counts().apply(lambda x: pd.to_timedelta(x, 'm'))
Out:
2014-08-29   1 days 11:16:00
2014-08-30   1 days 00:00:00
2014-08-31   1 days 00:00:00
2014-09-01   0 days 21:02:00
2014-08-28   0 days 07:00:00
dtype: timedelta64[ns]
(The inclusive endpoints of pd.date_range make each interval one minute too long, which is why 2014-08-29 shows 11:16 rather than the exact 11:15.)
Hope this is useful; I guess you'll be able to adjust it to serve your purpose. The way of thinking is the following: store each day and its corresponding time in a dict. If start and end fall on the same day, just write the difference. Otherwise write the time until the first midnight, iterate over as many whole days as needed, and write the time from the last midnight until the end. FYI, I guess for 2014-09-01 the result should be 21 hrs.
from datetime import datetime, timedelta
from collections import defaultdict

s = [('2014-08-28 17:00:00', '2014-08-29 22:00:00'),
     ('2014-08-29 10:45:00', '2014-09-01 17:00:00'),
     ('2014-09-01 15:00:00', '2014-09-01 19:00:00')]

def aggregate(time):
    store = defaultdict(timedelta)
    for slice in time:
        start = datetime.strptime(slice[0], "%Y-%m-%d %H:%M:%S")
        end = datetime.strptime(slice[1], "%Y-%m-%d %H:%M:%S")
        start_date = start.date()
        end_date = end.date()
        if start_date == end_date:
            store[start_date] += end - start
        else:
            # midnight after the start day (adding a timedelta avoids
            # overflowing the month on its last day, unlike start.day + 1)
            midnight = datetime(start.year, start.month, start.day) + timedelta(days=1)
            store[start_date] += midnight - start
            for i in range(1, (end_date - start_date).days):
                next_date = start_date + timedelta(days=i)
                store[next_date] += timedelta(hours=24)
            last_midnight = datetime(end_date.year, end_date.month, end_date.day, 0, 0, 0)
            store[end_date] += end - last_midnight
    return store

r = aggregate(s)
for i in r:
    print(i, r[i])
2014-08-28 7:00:00
2014-08-29 1 day, 11:15:00
2014-08-30 1 day, 0:00:00
2014-08-31 1 day, 0:00:00
2014-09-01 21:00:00

Hours, Date, Day Count Calculation

I have this huge dataset which has dates for several days and timestamps in UNIX format. The dataset is a log of logins.
The code is supposed to group start and end time logs and provide log counts and unique id counts.
I am trying to get some stats like:
total log counts per hour & unique login ids per hour.
log count with a choice of hours, i.e. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., day of the week, and such options.
I am able to split the data with start and end hours but I am not able to get the stats of counts of logs and unique ids.
Code:
from datetime import datetime, time

# this splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)

with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        col = row.split(',')  # the fields are comma-separated
        t1 = datetime.fromtimestamp(float(col[1])).time()  # StartTime
        t2 = datetime.fromtimestamp(float(col[2])).time()  # StopTime
        print(t1 >= start and t2 <= end)
Input data format: The data has no headers but the fields are given below. The number of days is not known in input.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected Output: Example Output
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and Endtime = Human readable format
Separating the data by a time range is already achieved above; what I'm now trying to do is round off the times and calculate the counts of logs and unique ids. A solution with Pandas is also welcome.
Edit One: more details
StartTime --> EndTIime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
But that falls between 5:00:00 --> 6:00:00, so the count of all the logs in that time range is what I am trying to find. Similarly for other ranges:
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I'm looking for a generic program where I can change the time/hours range as needed.
Unfortunately I couldn't find any elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`,
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60 * 60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because it cancels as a common term
    u = df[np.abs(df.m - 2 * row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

print(r[r.LogCount > 0])
PS: the fewer periods you have in the report DF r, the faster it will count. So you may want to get rid of rows (times) if you know beforehand that those timeframes won't contain any data (for example weekends, holidays, etc.).
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
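As a footnote on the asker's "choice of hours" requirement: if counting each log at its start time only is acceptable (simpler than the overlap test above, which also credits a log to every hour it spans), a pandas-native report with an adjustable window can be built with pd.Grouper; a hedged sketch:
import pandas as pd

df2 = pd.read_csv('input', header=None,
                  names=['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2'])
df2['start'] = pd.to_datetime(df2['StartTime'], unit='s')

# change freq to '1H', '12H' or 'D' for the other requested windows
report = (df2.groupby(pd.Grouper(key='start', freq='6H'))
             .agg(LogCount=('UserID', 'size'),
                  UniqueIDCount=('UserID', 'nunique')))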
