upsample between two date columns - python

I have the following df:

import pandas as pd

lst = [[1548828606206000000, 1548840373139000000],
[1548841285708000000, 1548841458405000000],
[1548842198276000000, 1548843109519000000],
[1548844022821000000, 1548844934207000000],
[1548845431090000000, 1548845539219000000],
[1548845555332000000, 1548845846621000000],
[1548847176147000000, 1548851020030000000],
[1548851704053000000, 1548852256143000000],
[1548852436514000000, 1548855900767000000],
[1548856817770000000, 1548857162183000000],
[1548858736931000000, 1548858979032000000]]
df = pd.DataFrame(lst,columns =['start','end'])
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
and I would like to get the total duration of these events per hour, e.g.
in my dummy df, the 6th hour should be 60 min (the maximum per hour) minus 00:10:06, i.e. 00:49:54. The 7th and 8th hours should be 1:00:00 each, as the first event's end time is 09:26:13. The 9th hour should be 00:26:13 plus the parts of the following rows that overlap the 9th hour: 09:44 - 09:41 = 3 min and 60 min - 00:56 = 4 min. So the total for the 9th hour should be 26 + 3 + 4 min ~= 00:32:28.
My initial approach was to merge start and end, add dummy points every 3rd row, upsample to 1S, take the difference between rows, and sum up only the actual rows. There must be a more pythonic way of doing this. Any hint would be great.

IIUC, something like this:
df.apply(lambda x: pd.to_timedelta(pd.Series(1, index=pd.date_range(x.start, x.end, freq='S'))
.groupby(pd.Grouper(freq='H')).count(), unit='S'), axis=1).sum()
Output:
2019-01-30 06:00:00 00:49:54
2019-01-30 07:00:00 01:00:00
2019-01-30 08:00:00 01:00:00
2019-01-30 09:00:00 00:32:28
2019-01-30 10:00:00 00:33:43
2019-01-30 11:00:00 00:40:24
2019-01-30 12:00:00 00:45:37
2019-01-30 13:00:00 00:45:01
2019-01-30 14:00:00 00:09:48
Freq: H, dtype: timedelta64[ns]
Or to get it down to hours, try:
df.apply(lambda r: pd.to_timedelta(pd.Series(1, index=pd.date_range(r.start, r.end, freq='S'))
.pipe(lambda x: x.groupby(x.index.hour).count()), unit='S'), axis=1)\
.sum()
Output:
6 00:49:54
7 01:00:00
8 01:00:00
9 00:32:28
10 00:33:43
11 00:40:24
12 00:45:37
13 00:45:01
14 00:09:48
dtype: timedelta64[ns]
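For reference, a vectorized alternative that avoids materializing one row per second: clip every interval against each hour bin and sum the overlaps. A minimal sketch against the same df (the bin construction is my own):

bins = pd.date_range(df['start'].min().floor('H'),
                     df['end'].max().ceil('H'), freq='H')
out = {}
for left in bins[:-1]:
    right = left + pd.Timedelta(hours=1)
    # overlap of each [start, end] interval with the hour [left, right);
    # negative values mean no intersection, so zero them out
    overlap = df['end'].clip(upper=right) - df['start'].clip(lower=left)
    out[left] = overlap.where(overlap > pd.Timedelta(0), pd.Timedelta(0)).sum()
print(pd.Series(out))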

Related

Dataframe from Series grouped by weekday and hour of day

I have a Series with a DatetimeIndex, as such:
time my_values
2017-12-20 09:00:00 0.005611
2017-12-20 10:00:00 -0.004704
2017-12-20 11:00:00 0.002980
2017-12-20 12:00:00 0.001497
...
2021-08-20 13:00:00 -0.001084
2021-08-20 14:00:00 -0.001608
2021-08-20 15:00:00 -0.002182
2021-08-20 16:00:00 -0.012891
2021-08-20 17:00:00 0.002711
I would like to create a dataframe of average values with the weekdays as columns names and hour of the day as index, resulting in this :
hour Monday Tuesday ... Sunday
0 0.005611 -0.001083 -0.003467
1 -0.004704 0.003362 -0.002357
2 0.002980 0.019443 0.009814
3 0.001497 -0.002967 -0.003466
...
19 -0.001084 0.009822 0.003362
20 -0.001608 -0.002967 -0.003567
21 -0.002182 0.035600 -0.003865
22 -0.012891 0.002945 -0.002345
23 0.002711 -0.002458 0.006467
How can I do this in Python?
Do as follows:
import numpy as np
import pandas as pd

# Coerce time to datetime
df['time'] = pd.to_datetime(df['time'])
# Extract day name and hour
df = df.assign(day=df['time'].dt.strftime('%A'), hour=df['time'].dt.hour)
# Pivot: hour as index, day names as columns, mean as the aggregate
pd.pivot_table(df, values='my_values', index=['hour'],
               columns=['day'], aggfunc=np.mean)
Since you asked for a solution that returns the average values, I propose this groupby solution:
df["weekday"] = df.time.dt.strftime('%A')
df["hour"] = df.time.dt.strftime('%H')
df = df.drop(["time"], axis=1)
# calculate averages by weekday and hour
df2 = df.groupby(["hour", "weekday"]).mean()
# put it in the right format
df2.unstack()
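One wrinkle with both answers: day names are strings, so the columns come out in alphabetical order (Friday, Monday, ...). To get calendar order, reindex the columns; a small sketch on top of the pivot answer:

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
pd.pivot_table(df, values='my_values', index='hour',
               columns='day', aggfunc='mean').reindex(columns=day_order)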

look up a time if it falls within a time range and return the corresponding value in Pandas?

Still trying to learn Pandas. Let's assume a dataframe includes the start and end of an event, an event type, and the event's Val. Here is an example:
>>> df = pd.DataFrame({ 'start': ["11:00","13:00", "14:00"], 'end': ["12:00","14:00", "15:00"], 'event_type':[1,2,3], 'Val':['a','b','c']})
>>> df['start'] = pd.to_datetime(df['start'])
>>> df['end'] = pd.to_datetime(df['end'])
>>> df
start end event_type Val
0 2021-03-05 11:00:00 2021-03-05 12:00:00 1 a
1 2021-03-05 13:00:00 2021-03-05 14:00:00 2 b
2 2021-03-05 14:00:00 2021-03-05 15:00:00 3 c
What is the best way, for example, to find the corresponding value in the Val column for an event of event_type 1 that starts at 11:10 and ends at 11:30? For this example, since the start and end times fall within the first row of the df, it should return a.
Try pd.IntervalIndex.from_arrays
df.index = pd.IntervalIndex.from_arrays(left = df.start, right = df.end)
Output like below:
df.loc['11:30']
Out[73]:
start 2021-03-05 11:00:00
end 2021-03-05 12:00:00
event_type 1
Val a
Name: (2021-03-05 11:00:00, 2021-03-05 12:00:00], dtype: object
df.loc['11:30','Val']
Out[75]: 'a'
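If the lookup must also verify that both the query's start and end fall inside the same row, and match event_type, plain boolean masking on the original columns works as well. A sketch where q_start, q_end and q_type are hypothetical names (pd.to_datetime fills in today's date, matching how df was built):

q_start, q_end, q_type = pd.to_datetime('11:10'), pd.to_datetime('11:30'), 1
hit = df[(df['start'] <= q_start) & (q_end <= df['end'])
         & (df['event_type'] == q_type)]
print(hit['Val'].iloc[0])  # -> 'a'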

Plot each column mean grouped by specific date range

I have 7 columns of data indexed by datetime (30-minute frequency), starting from 2017-05-31 and ending in 2018-05-25. I want to plot the mean over specific date ranges (seasons). I have been trying groupby, but I can't group by a specific range; I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (these are the ranges I need):
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons = pd.concat([increment_rates_winter, increment_rates_spring, increment_rates_summer, increment_rates_fall], axis=1)
and after plotting, I got this (plot not shown).
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
That is, the seasons on the x-axis and the means plotted for each column. The season ranges are:
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
You can select a specific date range in the following way; define the ranges however you want, then take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
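From there, taking the average over the selected range is one more line (a sketch; numeric_only skips the date column):

f_df.mean(numeric_only=True)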
Hope this helps
How about transposing it:
df_seasons.T.plot()
Output: a line plot with the seasons on the x-axis and one line per column (image omitted).
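For completeness, a small sketch tying the question's pieces together: passing keys to pd.concat labels the columns, so the transpose-and-plot comes out with named seasons.

df_seasons = pd.concat([increment_rates_winter, increment_rates_spring,
                        increment_rates_summer, increment_rates_fall],
                       axis=1, keys=['Winter', 'Spring', 'Summer', 'Fall'])
df_seasons.T.plot()  # seasons on the x-axis, one line per data column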

unstacking shift data (start and end time) into hourly data

I have a df as follows, which shows when a person started a shift, when they ended it, the number of hours, and the date worked.
Business_Date Number PayTimeStart PayTimeEnd Hours
0 2019-05-24 1 2019-05-24 11:00:00 2019-05-24 12:15:00 1.250
1 2019-05-24 2 2019-05-24 12:30:00 2019-05-24 13:30:00 1.00
Now what I'm trying to do is break this into an hourly format, so I know how many hours were worked between e.g. 11:00 and 12:00.
So, in my head, for the above I want to put the 1 hour between 11 and 12 into the 11:00 bin and the remaining 0.25 into the next bin, 12:00,
so I would end up with something like
Business Date Time Hour
0 2019-05-24 11:00 1
1 2019-05-24 12:00 0.75
2 2019-05-24 13:00 0.5
One idea is to work with minutes: first use a list comprehension with flattening to build a per-minute Series, then group by date and hour and count with GroupBy.size, and finally divide by 60 for the final hours:
s = pd.Series([z for x, y in zip(df['PayTimeStart'],
                                 df['PayTimeEnd'] - pd.Timedelta(60, unit='s'))
               for z in pd.date_range(x, y, freq='Min')])

df = (s.groupby([s.dt.date.rename('Business Date'), s.dt.hour.rename('Time')])
       .size()
       .div(60)
       .reset_index(name='Hour'))
print(df)
Business Date Time Hour
0 2019-05-24 11 1.00
1 2019-05-24 12 0.75
2 2019-05-24 13 0.50
If you need to group by a location or ID as well (assuming a Location column exists):

df1 = pd.DataFrame([(z, w) for x, y, w in zip(df['PayTimeStart'],
                                              df['PayTimeEnd'] - pd.Timedelta(60, unit='s'),
                                              df['Location'])
                    for z in pd.date_range(x, y, freq='Min')],
                   columns=['Date', 'Location'])

df = (df1.groupby([df1['Date'].dt.date.rename('Business Date'),
                   df1['Date'].dt.hour.rename('Time'), df1['Location']])
         .size()
         .div(60)
         .reset_index(name='Hour'))
Another idea, similar to @jezrael's but working with seconds for higher precision:

def get_series(a):
    s, e, h = a
    # spread the row's hours evenly over 6-second ticks between start and end
    idx = pd.date_range(s, e, freq='6s')
    return pd.Series(h / len(idx), index=idx)

(pd.concat(map(get_series, zip(df['PayTimeStart'],
                               df['PayTimeEnd'],
                               df['Hours'])))
   .resample('H').sum())
Output:
2019-05-24 11:00:00 0.998668
2019-05-24 12:00:00 0.750500
2019-05-24 13:00:00 0.500832
Freq: H, dtype: float64
Another idea just for your convenience (plus I like challenging questions) is using melt and then conditionally calculating the minutes:
Basically, you have two formulas for your calculations (pseudocode):
minutes credited to the start hour: 60 - minute of df['PayTimeStart']
minutes credited to the end hour: minute of df['PayTimeEnd']
So we can use these formulas to create our new data:
First, we melt our times into one column:

import numpy as np

new = df.melt(id_vars=['Business_Date', 'Number'],
              value_vars=['PayTimeStart', 'PayTimeEnd'],
              var_name='Pay Time Name',
              value_name='Pay Time Date').sort_values('Number')

# Apply the formulas noted above
new['Minutes'] = np.where(new['Pay Time Name'].eq('PayTimeStart'),
                          60 - new['Pay Time Date'].dt.minute,
                          new['Pay Time Date'].dt.minute)
# Out
  Business_Date  Number Pay Time Name       Pay Time Date  Minutes
0    2019-05-24       1  PayTimeStart 2019-05-24 11:00:00       60
2    2019-05-24       1    PayTimeEnd 2019-05-24 12:15:00       15
1    2019-05-24       2  PayTimeStart 2019-05-24 12:30:00       30
3    2019-05-24       2    PayTimeEnd 2019-05-24 13:30:00       30
Now we calculate the amount of hours with groupby:
daterange = pd.date_range(df['PayTimeStart'].min(), df['PayTimeEnd'].max(), freq='H')
df_new = pd.DataFrame({'Date': daterange.date,
                       'Time': daterange.time})
df_new['Hours'] = (new.groupby(new['Pay Time Date'].dt.hour)['Minutes'].sum() / 60).to_numpy()
Final Output
Date Time Hours
0 2019-05-24 11:00:00 1.00
1 2019-05-24 12:00:00 0.75
2 2019-05-24 13:00:00 0.50
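One caveat on the melt approach: it credits minutes only to the start and end hours, so it silently assumes no shift spans an entire intermediate hour (an 11:00 - 14:15 shift would drop hours 12 and 13); the per-minute and 6-second answers above handle that case. A cheap sanity check is that the binned hours add up to the question's totals:

assert np.isclose(df_new['Hours'].sum(), df['Hours'].sum())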

Hours, Date, Day Count Calculation

I have a huge dataset which has dates and timestamps for several days; the datetimes are in UNIX epoch format. The dataset is a log of logins.
The code is supposed to group start and end time logs and provide log counts and unique ID counts.
I am trying to get some stats like:
total log counts per hour & unique login IDs per hour.
log counts for a choice of window, i.e. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., and for day-of-the-week and similar options.
I am able to split the data by start and end hours, but I am not able to get the counts of logs and unique IDs.
Code:
from datetime import datetime, time

# This splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)

with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        # fields: UserID, StartTime, StopTime, GPS1, GPS2
        col = row.split(',')
        t1 = datetime.fromtimestamp(float(col[1])).time()
        t2 = datetime.fromtimestamp(float(col[2])).time()
        print(t1 >= start and t2 <= end)
Input data format: the data has no headers, but the fields are given below. The number of days in the input is not known.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected output (example):
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and EndTime are in human-readable format.
Separating the data by a range of time is already achieved; now I am trying to round times off to buckets and calculate the counts of logs and unique IDs. A solution with Pandas is also welcome.
Edit one: more details.
StartTime --> EndTime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
This falls between 5:00:00 --> 6:00:00, so it should be counted in that hour's bucket. The count of all the logs in each time range is what I am trying to find. Similarly for other ranges, like
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I am looking for a generic program where I can change the time/hour range as needed.
Unfortunately I couldn't find any elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`,
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp('1970-01-01')).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # interval overlap test:
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # the calculations of m and d are slightly simplified
    # by getting rid of the division by 2 (the common terms cancel)
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods you have in the reporting DF r, the faster it will run, so you may want to drop rows (times) if you know beforehand that those timeframes won't contain any data (for example weekends, holidays, etc.).
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
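For the record, a shorter pandas sketch that buckets each log by the hour in which it starts. Note this is not identical to the answer above, which also counts logs that merely overlap an hour; 'input' and the column names follow the question:

import pandas as pd

cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv('input', header=None, names=cols)
ts = pd.to_datetime(df['StartTime'], unit='s')

# change 'H' to '6H', '12H', 'D', ... for other window sizes
out = (df.groupby(ts.dt.floor('H'))
         .agg(LogCount=('UserID', 'size'),
              UniqueIDCount=('UserID', 'nunique')))
out['Day'] = out.index.day_name().str[:3]
print(out)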
