Number of concurrent processes in Pandas - python

Given a Pandas dataframe that records when each program starts and finishes (one row per program):
starts finishes
2018-01-01 12:00 2018-01-01 15:00
2018-01-01 16:00 2018-01-01 20:00
2018-01-01 16:30 2018-01-01 20:00
2018-01-01 17:00 2018-01-01 21:00
I need to calculate the number of concurrent programs at every time that appears in the table. For the table above, the result should be:
time number_of_conc_progs
2018-01-01 12:00 1
2018-01-01 15:00 0
2018-01-01 16:00 1
2018-01-01 16:30 2
2018-01-01 17:00 3
2018-01-01 20:00 1
2018-01-01 21:00 0
If a program starts at 12:00 (e.g.) and the current number of processes is n, then at 12:00 the number has value n+1.
If a program finishes at 12:00 (e.g.) and the current number of processes is n, then at 12:00 the number has value n-1.

import pandas as pd
# creation of the dataframe
df = pd.DataFrame([
["2018-01-01 12:00", "2018-01-01 15:00"],
["2018-01-01 16:00", "2018-01-01 20:00"],
["2018-01-01 16:30", "2018-01-01 20:00"],
["2018-01-01 17:00", "2018-01-01 21:00"]])
df.columns = ["starts", "finishes"]
# the number of progs increases by 1 at each start time
starts = pd.DataFrame()
starts["time"] = df.starts
starts["number_of_conc_progs"] = 1
# the number of progs decreases by 1 at each finish time
finishes = pd.DataFrame()
finishes["time"] = df.finishes
finishes["number_of_conc_progs"] = -1
# then I concatenate the starts and the finishes dataframes
result = pd.concat([starts,finishes])
# I sort the time values
result = result.sort_values(by=['time'])
# If there are several starts or finishes at the same time, I sum them
result = result.groupby(['time']).sum()
# I take a cumulative sum to get the actual number of progs running
result.number_of_conc_progs = result.number_of_conc_progs.cumsum()
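For comparison, the same +1/-1 bookkeeping can be written more compactly. This is only a sketch of the approach above, not a different algorithm; the names `deltas` and `counts` are my own, and the times are kept as strings exactly as in the example (they sort correctly because they share one format).

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["2018-01-01 12:00", "2018-01-01 15:00"],
        ["2018-01-01 16:00", "2018-01-01 20:00"],
        ["2018-01-01 16:30", "2018-01-01 20:00"],
        ["2018-01-01 17:00", "2018-01-01 21:00"],
    ],
    columns=["starts", "finishes"],
)

# +1 at every start time, -1 at every finish time
deltas = pd.concat([
    pd.Series(1, index=pd.Index(df["starts"])),
    pd.Series(-1, index=pd.Index(df["finishes"])),
])

# sum the changes per timestamp (groupby sorts the keys), then accumulate
counts = deltas.groupby(level=0).sum().cumsum()
print(counts)
```

Converting the columns with pd.to_datetime first would be the safer choice if the timestamps were not all in the same format.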


How to calculate the total number of 1-hour intervals in a sequence of intervals?

Let's consider the following dataframe of sorted time intervals:
import pandas as pd
from io import StringIO
s = """start_time,end_time
2022-01-01 12:30:00,2022-01-01 12:45:00
2022-01-01 13:05:00,2022-01-01 13:50:00
2022-01-01 14:00:00,2022-01-01 14:20:00
2022-01-01 16:00:00,2022-01-01 16:45:00
2022-01-01 17:20:00,2022-01-01 17:35:00
2022-01-01 17:45:00,2022-01-01 18:30:00
2022-01-01 19:00:00,2022-01-01 19:25:00"""
df = pd.read_csv(StringIO(s), sep=",")
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)
start_time end_time
0 2022-01-01 12:30:00 2022-01-01 12:45:00
1 2022-01-01 13:05:00 2022-01-01 13:50:00
2 2022-01-01 14:00:00 2022-01-01 14:20:00
3 2022-01-01 16:00:00 2022-01-01 16:45:00
4 2022-01-01 17:20:00 2022-01-01 17:35:00
5 2022-01-01 17:45:00 2022-01-01 18:30:00
6 2022-01-01 19:00:00 2022-01-01 19:25:00
The idea is that a 1-hour interval is basically calculated in the following way:
we start with the start_time of the first interval and we add 1 hour to it.
If the resulting timestamp is within one of the following intervals that are in the dataframe, then we repeat the process by adding 1-hour to this new timestamp and so on.
If, however, the resulting timestamp is not within, but between two intervals, then we continue by adding 1-hour to the start_time of the next interval.
The input would be the dataframe above.
The process is:
We start by adding 1-hour to the start_time of the first interval:
12:30 + 1H -> 13:30 (13:30 is a timestamp that is within one of the available intervals. In particular, it is within 13:05 - 13:50, which is an interval in our dataframe. We shall, then, continue from 13:30).
13:30 + 1H -> 14:30 (14:30 is not contained in any of our df intervals - we pick the closest start_time after 14:30)
16:00 + 1H -> 17:00 (17:00 not included in any interval of our dataframe)
17:20 + 1H -> 18:20 (18:20 is included between 17:45 - 18:30, which is also an interval that we have in our dataframe)
18:20 + 1H -> 19:20 (it is included in our last interval)
19:20 + 1H -> 20:20 (we have reached or passed the end_time of our last interval, i.e. the timestamp is greater than or equal to it, so we stop). If, however, the last end_time in the dataframe were 19:20:00 instead of 19:25:00, we would have stopped at the previous step (since we would have reached a timestamp greater than or equal to the very last end_time).
Output: 6
(The output in the alternative case that the very last end_time is equal to 19:20:00 would have been equal to 5).
The output stands for the total number of times that the process of adding 1H was repeated.
As far as code is concerned I have thought of maybe using .shift() somehow but I am not sure how. The problem is that when the resulting timestamp is not between an available interval, then we should search for the closest following start_time.
Vectorization is unlikely to be possible, because each step depends on the result of the previous steps. Any solution will involve some kind of iteration, and its speed will depend mainly on the algorithm you choose.
It seems to me that a good algorithm would be to see if the end_time and start_time of neighboring records fall into the same hour step as if we were measuring length by hours starting from some point. For this we can use integer division:
import pandas as pd
from io import StringIO
s = """start_time,end_time
2022-01-01 12:30:00,2022-01-01 12:45:00
2022-01-01 13:05:00,2022-01-01 13:50:00
2022-01-01 14:00:00,2022-01-01 14:20:00
2022-01-01 16:00:00,2022-01-01 16:45:00
2022-01-01 17:20:00,2022-01-01 17:35:00
2022-01-01 17:45:00,2022-01-01 18:30:00
2022-01-01 19:00:00,2022-01-01 19:25:00"""
df = pd.read_csv(StringIO(s), parse_dates=[0, 1])
data = df.to_numpy().flatten()
start = data[0]
step = pd.Timedelta(1, 'h') # hour as a unit of length
count = 0
for x, y in data[1:-1].reshape(-1, 2):
    # x is previous end_time
    # y is next start_time
    length = (x-start) // step + 1
    if start + step*length < y:
        count += length
        start = y
integer, decimal = divmod((data[-1] - start) / step, 1)
count += integer if decimal == 0 else integer+1
print(f'{count = }')
Not sure if pandas is really necessary here, but here is a solution following your logic.
from datetime import timedelta
import numpy as np
count = 0
start = df.loc[0,'start_time']
while 1:
    count += 1
    print("hour interval start:", start)
    end_of_interv = start + timedelta(hours=1)
    # first row whose end_time is not before the end of this 1-hour step
    new_row = np.searchsorted(df.end_time, end_of_interv)
    if new_row >= len(df):
        # we are past the last end_time, so we stop
        break
    s, e = df.loc[new_row, ['start_time', 'end_time']]
    if end_of_interv < s:
        start = s
    elif s < end_of_interv < e:
        start = end_of_interv
print("Number of intervals counted: %d" % count)
#hour interval start: 2022-01-01 12:30:00
#hour interval start: 2022-01-01 13:30:00
#hour interval start: 2022-01-01 16:00:00
#hour interval start: 2022-01-01 17:20:00
#hour interval start: 2022-01-01 18:20:00
#hour interval start: 2022-01-01 19:20:00
#Number of intervals counted: 6
You should test this on a few more examples with different intervals (e.g. some longer than 1 hour) and start times, and verify it produces the answers you seek.
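One way to do that kind of testing is to wrap the procedure from the question in a function and check it against both cases the question spells out (6 steps, and 5 steps when the last end_time is 19:20). The sketch below is my own plain-Python restatement, not the code of any answer here. It assumes the intervals are sorted and non-overlapping, and that a timestamp equal to an interval's end_time counts as outside that interval; the name `count_hour_steps` is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def count_hour_steps(intervals, step=timedelta(hours=1)):
    """Count how many times `step` is added before the last end_time is reached."""
    starts = [s for s, _ in intervals]
    last_end = intervals[-1][1]
    current = intervals[0][0]
    count = 0
    while current < last_end:
        current += step
        count += 1
        if current >= last_end:
            break
        # last interval that starts at or before `current`
        i = bisect_right(starts, current) - 1
        if current >= intervals[i][1]:
            # `current` fell into a gap: continue from the next start_time
            current = starts[i + 1]
    return count

def t(hhmm):
    return datetime.strptime("2022-01-01 " + hhmm, "%Y-%m-%d %H:%M")

intervals = [
    (t("12:30"), t("12:45")), (t("13:05"), t("13:50")), (t("14:00"), t("14:20")),
    (t("16:00"), t("16:45")), (t("17:20"), t("17:35")), (t("17:45"), t("18:30")),
    (t("19:00"), t("19:25")),
]
result = count_hour_steps(intervals)

# alternative case from the question: the last end_time is 19:20
shorter = intervals[:-1] + [(t("19:00"), t("19:20"))]
result_shorter = count_hour_steps(shorter)
print(result, result_shorter)
```

Intervals longer than one hour are handled by the same loop, since a step that lands inside the current interval simply continues from there.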
Using the start_time as index allows using the pandas.Index.get_indexer method, which makes searching for the next starting point easier. Here is a solution based on it:
import pandas as pd
import numpy as np
from io import StringIO
s = """start_time,end_time
2022-01-01 12:30:00,2022-01-01 12:45:00
2022-01-01 13:05:00,2022-01-01 13:50:00
2022-01-01 14:00:00,2022-01-01 14:20:00
2022-01-01 16:00:00,2022-01-01 16:45:00
2022-01-01 17:20:00,2022-01-01 17:35:00
2022-01-01 17:45:00,2022-01-01 18:30:00
2022-01-01 19:00:00,2022-01-01 19:25:00"""
def is_in_range(var):
    # check if the given timestamp lies within any of the ranges in the dataframe
    in_range = df.loc[(df['start_time']<=var) & (df['end_time']> var)].shape[0]
    # increment by one hour if it does
    if in_range:
        var = var + pd.Timedelta(1,'h')
    # If not, find the next closest start time (=index)
    else:
        var = df.index[df.index.get_indexer([var],'bfill')][0]
    return var, in_range
df = pd.read_csv(StringIO(s))
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(pd.to_datetime)
glb_start_time = df.loc[0,'start_time']
glb_end_time = df.loc[df.index[-1], 'end_time']
df.index= df['start_time']
count = 0
while glb_start_time< glb_end_time:
    glb_start_time, in_range = is_in_range(glb_start_time)
    if in_range:
        count += 1
def update_df(frame, check_time, hour):
    # returns: new_check_time, count
    # keeps only the rows that can still matter for check_time
    frame = frame[frame['end_time'] > check_time]
    if frame.shape[0] == 0 : return None, 0 # exceeds the latest time
    filt = frame['start_time'] <= check_time
    frame = frame[~filt].reset_index(drop=True)
    if filt[filt].any(): return check_time + hour, 1 # it is inside an interval
    else: return frame.loc[0, 'start_time'] + hour, 1 # it is not in an interval
hour = pd.to_timedelta(1, 'h')
check_time = df.loc[0, 'start_time'] + hour
total_count = 1
while True:
    new_check_time, count = update_df(df, check_time, hour)
    total_count += count
    if new_check_time is None: break
    else: check_time = new_check_time

Calculate the sum between the fixed time range using Pandas

My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over a specified time range and save it to a new dataframe.
Let's say I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00. This is the sum of 6 hours spanning 2 days. I want to sum the data in a time range such as 10 PM to 4 AM the next day and put it in a different dataframe, for example df_timerange_sum. Please note that the range covers two different dates.
What did I do?
I used sum() to calculate the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df rather than the sum over the time range I specified.
I also tried resample, but I cannot find a way to do it for a specific time range.
df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
Here's what I would do:
# mark the rows between 4 AM and 10 PM
# the data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant within each consecutive block of False,
# so it labels the blocks on which we will take the sum
blocks = s.cumsum()
# again we only care about ~s
(df[~s].groupby(blocks[~s], as_index=False) # we don't need the blocks as index
       .agg({'time':'min', 'Open':'sum'})   # time : min -- select the beginning of each block
)                                           # Open : sum -- compute the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
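To make the block trick easier to check by hand, here is a small self-contained sketch on made-up data: three days of hourly rows with Open fixed at 1.0, so each block's sum is just its number of rows. The data and the names `day`, `night` and `out` are my own assumptions. I also use reset_index(drop=True) instead of as_index=False, only to avoid relying on how a given pandas version treats an external grouping key.

```python
import pandas as pd

# synthetic data: 72 hourly rows starting 2017-01-01 00:00, Open is always 1.0
df = pd.DataFrame({
    "time": pd.date_range("2017-01-01", periods=72, freq=pd.Timedelta(hours=1)),
})
df["Open"] = 1.0

day = df["time"].dt.hour.between(4, 21)   # True for 04:00 .. 21:00
blocks = day.cumsum()                     # constant across each night block
night = df[~day]

out = (night.groupby(blocks[~day].rename("block"))
            .agg({"time": "min", "Open": "sum"})
            .reset_index(drop=True))
print(out)
```

The first and last blocks are shorter than six hours because the data starts at midnight and ends at 23:00, which is the same edge effect visible in the first row of the output above.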
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to shorten the code, but I am also relatively new to pandas.
df.set_index(['time'],inplace=True) #make time the index col (not 100% necessary)
df2=pd.DataFrame(columns=['start_time','end_time','sum_Open']) #new df that stores your desired output + start and end times if you need them
df2['start_time']=df[df.index.hour == 22].index #gets/stores all start datetimes
# note: in the sample data the first 04:00 row comes before the first 22:00 row,
# so check that each end time really belongs to the start time on the same row
df2['end_time']=df[df.index.hour == 4].index #gets/stores all end datetimes
for i,row in df2.iterrows():
    # DataFrame.set_value was removed from pandas; .at does the same job
    df2.at[i,'sum_Open'] = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()
You'd have to add an if statement or something to handle the last day, which ends at 11 PM.

unstacking shift data (start and end time) into hourly data

I have a df as follows which shows when a person started a shift, ended a shift, the amount of hours and the date worked.
Business_Date Number PayTimeStart PayTimeEnd Hours
0 2019-05-24 1 2019-05-24 11:00:00 2019-05-24 12:15:00 1.250
1 2019-05-24 2 2019-05-24 12:30:00 2019-05-24 13:30:00 1.00
Now what I'm trying to do is break this into an hourly format, so I know how many hours were used between 11:00 - 12:00
so, in my head, for the above, I want to put the 1 hour between 11 - 12 into the bin for 11:00 and the remainder 0.25 into the next bin of 12
so I would end up with something like
Business Date Time Hour
0 2019-05-24 11:00 1
1 2019-05-24 12:00 0.75
2 2019-05-24 13:00 0.5
One idea is to work in minutes: first build a Series with one timestamp per worked minute using a flattening list comprehension, then group by date and hour, count the minutes with GroupBy.size, and finally divide by 60 to get hours:
s = pd.Series([z for x, y in zip(df['Pay Time Start'],
                                 df['Pay Time End'] - pd.Timedelta(60, unit='s'))
                 for z in pd.date_range(x, y, freq='min')])
df = (s.groupby([s.dt.date.rename('Business Date'), s.dt.hour.rename('Time')])
       .size()
       .div(60)
       .reset_index(name='Hour'))
print (df)
Business Date Time Hour
0 2019-05-24 11 1.00
1 2019-05-24 12 0.75
2 2019-05-24 13 0.50
If you need to group by a location or ID
df1 = pd.DataFrame([(z, w) for x, y, w in zip(df['Pay Time Start'],
                                              df['Pay Time End'] - pd.Timedelta(60, unit='s'),
                                              df['Location'])
                           for z in pd.date_range(x, y, freq='min')],
                   columns=['Date', 'Location'])
df = (df1.groupby([df1['Date'].dt.date.rename('Business Date'),
                   df1['Date'].dt.hour.rename('Time'), df1['Location']])
         .size() .div(60) .reset_index(name='Hour'))
Another idea, similar to @jezrael's but working with seconds for higher precision:
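def get_series(a):
    s, e, h = a
    # spread the shift's hours evenly over 6-second ticks
    idx = pd.date_range(s,e, freq='6s')
    return pd.Series(h/len(idx), index=idx)
(pd.concat(map(get_series, zip(df.Pay_Time_Start,
                               df.Pay_Time_End,
                               df.Hours)))
   .resample('h').sum())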
2019-05-24 11:00:00 0.998668
2019-05-24 12:00:00 0.750500
2019-05-24 13:00:00 0.500832
Freq: H, dtype: float64
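If exact values are wanted rather than a 6-second approximation, one option is to clip each shift against every hourly bin it touches and add up the overlaps. This is a sketch of a different technique from the two answers above, under my own assumptions: the column names `start` and `end` and the variable names are hypothetical, and the data is just the two shifts from the question.

```python
import pandas as pd

shifts = pd.DataFrame({
    "start": pd.to_datetime(["2019-05-24 11:00", "2019-05-24 12:30"]),
    "end": pd.to_datetime(["2019-05-24 12:15", "2019-05-24 13:30"]),
})

one_hour = pd.Timedelta(hours=1)
pieces = []
for start, end in zip(shifts["start"], shifts["end"]):
    # beginning of the first hourly bin touched by this shift
    bin_start = start.replace(minute=0, second=0, microsecond=0, nanosecond=0)
    while bin_start < end:
        # part of the shift that lies inside [bin_start, bin_start + 1h)
        overlap = min(end, bin_start + one_hour) - max(start, bin_start)
        pieces.append((bin_start, overlap / one_hour))
        bin_start += one_hour

hourly = (pd.DataFrame(pieces, columns=["bin_start", "hours"])
            .groupby("bin_start")["hours"].sum())
print(hourly)
```

The loop runs once per shift and hour touched, so it does far less work than expanding every shift into minutes or seconds, and it also copes with shifts that cross midnight.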
Another idea, just for your convenience (plus I like challenging questions), is to use melt and then conditionally calculate the minutes.
Basically, you have two formulas for your calculations (pseudocode):
Minutes in Pay Time Start: 60 - minutes in df['Pay Time Start']
Minutes in Pay Time End: minutes in df['Pay Time End']
So we can use these formulas to create our new data.
First we melt our times into one column:
new = df.melt(id_vars=['Business Date', 'Number'],
value_vars=['Pay Time Start', 'Pay Time End'],
var_name='Pay Time Name',
value_name='Pay Time Date').sort_values('Number')
# Apply the formulas noted above
new['Minutes'] = np.where(new['Pay Time Name'].eq('Pay Time Start'),
60 - new['Pay Time Date'].dt.minute,
new['Pay Time Date'].dt.minute)
# Out
Business Date Number Pay Time Name Pay Time Date Minutes
0 2019-05-24 1 Pay Time Start 2019-05-24 11:00:00 60
2 2019-05-24 1 Pay Time End 2019-05-24 12:15:00 15
1 2019-05-24 2 Pay Time Start 2019-05-24 12:30:00 30
3 2019-05-24 2 Pay Time End 2019-05-24 13:30:00 30
Now we calculate the amount of hours with groupby:
daterange = pd.date_range(df['Pay Time Start'].min(), df['Pay Time End'].max(), freq='h')
df_new = pd.DataFrame({'Date': daterange.date,
                       'Time': daterange.time})
df_new['Hours'] = (new.groupby(new['Pay Time Date'].dt.hour)['Minutes'].sum()/60).to_numpy()
Final Output
Date Time Hours
0 2019-05-24 11:00:00 1.00
1 2019-05-24 12:00:00 0.75
2 2019-05-24 13:00:00 0.50

Most efficient way to break up a dataframe using multiple DateTimeIndexes

I have a dataframe which contains prices for a security each minute over a long period of time.
I would like to extract a subset of the prices, 1 per day between certain hours.
Here is an example of brute-forcing it (using hourly for brevity):
dates = pandas.date_range('20180101', '20180103', freq='H')
prices = pandas.DataFrame(index=dates,
I now have a DateTimeIndex for the hours within each day I want to extract:
start = datetime.datetime(2018,1,1,8)
end = datetime.datetime(2018,1,1,17)
day1 = pandas.date_range(start, end, freq='H')
start = datetime.datetime(2018,1,2,9)
end = datetime.datetime(2018,1,2,13)
day2 = pandas.date_range(start, end, freq='H')
days = [ day1, day2 ]
I can then use prices.index.isin with each of my DateTimeIndexes to extract the relevant day's prices:
daily_prices = [ prices[prices.index.isin(d)] for d in days]
This works as expected.
The problem is that as the length of each selection DateTimeIndex increases, and the number of days I want to extract increases, my list-comprehension slows down to a crawl.
Since I know each selection DateTimeIndex is fully inclusive of the hours it encompasses, I tried using loc and the first and last element of each index in my list comprehension:
daily_prices = [ prices.loc[d[0]:d[-1]] for d in days]
Whilst a bit faster, it is still exceptionally slow when the number of days is very large
Is there a more efficient way to divide up a dataframe into begin and end time ranges like above?
If the hours are consistent from day to day as it seems like they might be, you can just filter the index, which should be pretty fast:
In [5]: prices.loc[prices.index.hour.isin(range(8,18))]
2018-01-01 08:00:00 0.638051
2018-01-01 09:00:00 0.059258
2018-01-01 10:00:00 0.869144
2018-01-01 11:00:00 0.443970
2018-01-01 12:00:00 0.725146
2018-01-01 13:00:00 0.309600
2018-01-01 14:00:00 0.520718
2018-01-01 15:00:00 0.976284
2018-01-01 16:00:00 0.973313
2018-01-01 17:00:00 0.158488
2018-01-02 08:00:00 0.053680
2018-01-02 09:00:00 0.280477
2018-01-02 10:00:00 0.802826
2018-01-02 11:00:00 0.379837
2018-01-02 12:00:00 0.247583
EDIT: To your comment, working directly on the index and then doing a single lookup at the end will still probably be fastest even if it's not always consistent from day to day. Single day frames at the end will be easy with a groupby.
For example:
df = prices.loc[[i for i in prices.index if (i.hour in range(8, 18) and i.day in range(1, 11)) or (i.hour in range(2, 4) and i.day in range(11, 32))]]
framelist = [frame for _, frame in df.groupby(df.index.date)]
will give you a list of dataframes with 1 day per list element, and will include 8:00-17:00 for the first 10 days each month and 2:00-3:00 for days 11-31.
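If the windows really are arbitrary per day, another option is to look up all the window boundaries in one call and then slice by position. This is a sketch of a technique not mentioned above, and it assumes the price index is sorted. The `price` column, its placeholder values and the names `window_starts` and `window_ends` are my own.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-01-01", "2018-01-03", freq=pd.Timedelta(hours=1))
# placeholder prices, only so that the example runs
prices = pd.DataFrame({"price": np.arange(len(dates), dtype=float)}, index=dates)

# first and last timestamp of each window (both inclusive)
window_starts = pd.to_datetime(["2018-01-01 08:00", "2018-01-02 09:00"])
window_ends = pd.to_datetime(["2018-01-01 17:00", "2018-01-02 13:00"])

# one binary search per boundary, done for all windows at once
lo = prices.index.searchsorted(window_starts, side="left")
hi = prices.index.searchsorted(window_ends, side="right")

# positional slices are cheap compared with an isin test per window
daily_prices = [prices.iloc[a:b] for a, b in zip(lo, hi)]
print([len(p) for p in daily_prices])
```

Whether this is faster than prices.loc[d[0]:d[-1]] for your data is something to measure; the main difference is that the boundary lookups are vectorized instead of being repeated inside the list comprehension.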

Hours, Date, Day Count Calculation

I have this huge dataset which has dates for several days and timestamps. The datetime format is in UNIX format. The datasets are logs of some login.
The code is supposed to group start and end time logs and provide log counts and unique id counts.
I am trying to get some stats like:
total log counts per hour & unique login ids per hour.
log count with choice of hours i.e. 24hrs, 12hrs, 6 hrs, 1 hr, etc and day of the week and such options.
I am able to split the data with start and end hours but I am not able to get the stats of counts of logs and unique ids.
from datetime import datetime,time
# This splits data from start to end time
start = time(8,0,0)
end = time(20,0,0)
with open('input', 'r') as infile, open('output','w') as outfile:
    for row in infile:
        col = row.split()
        t1 = datetime.fromtimestamp(float(col[2])).time()
        t2 = datetime.fromtimestamp(float(col[3])).time()
        print (t1 >= start and t2 <= end)
Input data format: The data has no headers but the fields are given below. The number of days is not known in input.
UserID, StartTime, StopTime, GPS1, GPS2
Expected Output: Example Output
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and Endtime = Human readable format
Only to separate data with range of time is already achieved, but I am trying to write a round off time and calculate the counts of logs and uniqueids. Solution with Pandas is also welcome.
Edit one: more details
StartTime --> EndTime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
But that falls between 5:00:00 --> 6:00:00 . So this way count of all the logs in the time range is what I am trying to find. Similarly for others also like
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on Looking for a generic program where I can change the time/hours range as needed.
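To make the request concrete, here is a rough pandas sketch of the kind of generic counting described above, where the bucket length is a parameter. It is a simplification under stated assumptions: each log is assigned to a bucket by its StartTime only, the timestamps are treated as UTC seconds, and the sample rows and the function name `summarize` are made up for illustration.

```python
import pandas as pd

# hypothetical sample rows in the question's layout (Unix timestamps, seconds)
logs = pd.DataFrame({
    "UserID":    [1, 1, 2, 3],
    "StartTime": [1073280601, 1073280700, 1073284300, 1073367000],
    "StopTime":  [1073280603, 1073280800, 1073284400, 1073367100],
})

def summarize(logs, bucket_seconds):
    # round each start time down to the beginning of its bucket
    bucket = logs["StartTime"] // bucket_seconds * bucket_seconds
    bucket = pd.to_datetime(bucket, unit="s").rename("BucketStart")
    out = (logs.groupby(bucket)["UserID"]
               .agg(LogCount="size", UniqueIDCount="nunique")
               .reset_index())
    out["Day"] = out["BucketStart"].dt.day_name().str[:3]
    return out

hourly = summarize(logs, 60 * 60)         # 1-hour buckets
half_day = summarize(logs, 12 * 60 * 60)  # 12-hour buckets
print(hourly)
print(half_day)
```

Changing the range is then a matter of passing a different number of seconds, e.g. 6 * 60 * 60 for six-hour buckets. A log that spans a bucket boundary is counted only in the bucket where it starts.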
Unfortunately I couldn't find an elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd
fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID','StartTime','StopTime','GPS1','GPS2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
# pd.datetime was removed from pandas; pd.Timestamp gives the same epoch
r['start'] = (r.index - pd.Timestamp(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # interval overlap test
    # I've slightly simplified the calculation of m and d
    # by dropping the division by 2,
    # because the common factor cancels out
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    # .ix was removed from pandas; .loc does the label-based assignment
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
# weekday_name was removed from pandas; day_name() replaces it
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods you have in the report DF `r`, the faster the counting will be. So you may want to drop rows (times) if you know beforehand that those timeframes won't contain any data (for example weekends, holidays, etc.).
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00

