EDIT: Session generation from log file analysis with pandas seems to be exactly what I was looking for.
I have a dataframe that includes non-unique time stamps, and I'd like to group them by time windows. The basic logic would be -
1) Create a time range from each time stamp by adding n minutes before and after the time stamp.
2) Group by time ranges that overlap. The end effect here would be that a time window could be as small as a single time stamp +/- the time buffer, but there is no cap on how large a time window could be, as long as consecutive events are less than the time buffer apart.
It feels like df.groupby() with a pd.TimeGrouper is the right answer, but I don't know how to have the TimeGrouper create dynamic time ranges when it sees events that are within a time buffer.
For instance, if I try a TimeGrouper('20s') against a set of events: 10:34:00, 10:34:08, 10:34:08, 10:34:15, 10:34:28 and 10:34:54, then pandas will give me three groups (events falling between 10:34:00 - 10:34:20, 10:34:20 - 10:34:40, and 10:34:40 - 10:35:00). I would like to get just two groups back: 10:34:00 - 10:34:28, since there is no more than a 20 second gap between events in that range, and a second group that is 10:34:54.
What is the best way to find temporal windows that are not static bins of time ranges?
Given a Series that looks something like -
time
0 2013-01-01 10:34:00+00:00
1 2013-01-01 10:34:12+00:00
2 2013-01-01 10:34:28+00:00
3 2013-01-01 10:34:54+00:00
4 2013-01-01 10:34:55+00:00
5 2013-01-01 10:35:19+00:00
6 2013-01-01 10:35:30+00:00
If I do a df.groupby(pd.TimeGrouper('20s')) on that Series, I would get back 5 groups, 10:34:00-:20, :20-:40, :40-10:35:00, etc. What I want is some function that creates elastic time ranges: as long as events are within 20 seconds of each other, keep expanding the time range. So I expect to get back -
2013-01-01 10:34:00 - 2013-01-01 10:34:48
0 2013-01-01 10:34:00+00:00
1 2013-01-01 10:34:12+00:00
2 2013-01-01 10:34:28+00:00
2013-01-01 10:34:54 - 2013-01-01 10:35:15
3 2013-01-01 10:34:54+00:00
4 2013-01-01 10:34:55+00:00
2013-01-01 10:35:19 - 2013-01-01 10:35:50
5 2013-01-01 10:35:19+00:00
6 2013-01-01 10:35:30+00:00
Thanks.
Here's how to create a custom grouper. (The timedelta computations require pandas >= 0.15 for the .dt accessor, but the approach otherwise works in other versions.)
Create your series:
In [31]: s = pd.Series(range(8), pd.to_datetime(['20130101 10:34','20130101 10:34:08', '20130101 10:34:08', '20130101 10:34:15', '20130101 10:34:28', '20130101 10:34:54','20130101 10:34:55','20130101 10:35:12']))
In [32]: s
Out[32]:
2013-01-01 10:34:00 0
2013-01-01 10:34:08 1
2013-01-01 10:34:08 2
2013-01-01 10:34:15 3
2013-01-01 10:34:28 4
2013-01-01 10:34:54 5
2013-01-01 10:34:55 6
2013-01-01 10:35:12 7
dtype: int64
This just computes the time difference in seconds between successive elements, but it could actually be anything:
In [33]: indexer = s.index.to_series().sort_values().diff().dt.total_seconds().fillna(0)
In [34]: indexer
Out[34]:
2013-01-01 10:34:00 0
2013-01-01 10:34:08 8
2013-01-01 10:34:08 0
2013-01-01 10:34:15 7
2013-01-01 10:34:28 13
2013-01-01 10:34:54 26
2013-01-01 10:34:55 1
2013-01-01 10:35:12 17
dtype: float64
Arbitrarily assign gaps < 20s to group 0, else to group 1. The rule could be anything; for example: if the diff from the previous event is < 20s BUT the total diff (from the first event) is > 50s, put it in group 2.
In [35]: grouper = indexer.copy()
In [36]: grouper[indexer<20] = 0
In [37]: grouper[indexer>=20] = 1
In [95]: grouper[(indexer<20) & (indexer.cumsum()>50)] = 2
In [96]: grouper
Out[96]:
2013-01-01 10:34:00 0
2013-01-01 10:34:08 0
2013-01-01 10:34:08 0
2013-01-01 10:34:15 0
2013-01-01 10:34:28 0
2013-01-01 10:34:54 1
2013-01-01 10:34:55 2
2013-01-01 10:35:12 2
dtype: float64
Group them (you can also use an apply here):
In [97]: s.groupby(grouper).sum()
Out[97]:
0 10
1 5
2 13
dtype: int64
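For what it's worth, on current pandas (where TimeGrouper was removed in favor of pd.Grouper) the basic 20-second split above is usually condensed into a diff/cumsum one-liner. A minimal sketch against the series above (it reproduces the plain 20s rule, not the extra group-2 refinement):
import pandas as pd

# A new session starts wherever the gap to the previous event exceeds 20s;
# cumsum turns those breakpoints into consecutive group ids.
# Assumes `s` is the series above, with a sorted DatetimeIndex.
session_id = (s.index.to_series().diff() > pd.Timedelta('20s')).cumsum()
s.groupby(session_id.values).sum()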
You might want to consider using apply:
def my_grouper(datetime_value):
    return some_group(datetime_value)

df.groupby(df['date_time'].apply(my_grouper))
It's up to you to implement any grouping logic in your grouper function. By the way, merging overlapping time ranges is an inherently iterative task: for example, with A = (0, 10), B = (20, 30), C = (10, 20), once C appears all three of A, B and C should be merged.
UPD:
This is my ugly version of the merging algorithm:
groups = {}
max_group_id = 1

def in_range(val, begin, end):
    return begin <= val <= end

def find_merged_group(begin, end):
    global max_group_id
    found_common_group = None
    full_wraps = []
    for (group_start, group_end), group in groups.items():
        begin_inclusion = in_range(begin, group_start, group_end)
        end_inclusion = in_range(end, group_start, group_end)
        full_inclusion = begin_inclusion and end_inclusion
        full_wrap = (not begin_inclusion and not end_inclusion
                     and in_range(group_start, begin, end)
                     and in_range(group_end, begin, end))
        if full_inclusion:
            groups[(begin, end)] = group
            return group
        if full_wrap:
            full_wraps.append(group)
        elif begin_inclusion or end_inclusion:
            if not found_common_group:
                found_common_group = group
            else:  # merge: relabel every range of this group to the common one
                for rng, g in groups.items():
                    if g == group:
                        groups[rng] = found_common_group
    if not found_common_group:
        found_common_group = max_group_id
        max_group_id += 1
    groups[(begin, end)] = found_common_group
    return found_common_group

def my_grouper(date_time):
    # +/- 1 assumes a numeric time value; with real timestamps use e.g. pd.Timedelta('1s')
    return find_merged_group(date_time - 1, date_time + 1)

df['datetime'].apply(my_grouper)  # first run to fill the groups dict
grouped = df.groupby(df['datetime'].apply(my_grouper))  # this run uses the already merged groups
Try this:
create a column tsdiff that has the diffs between consecutive times (using shift or diff)
df['new_group'] = df.tsdiff > timedelta
fillna on the new_group
groupby that column
This is just really rough pseudocode, but the solution's in there somewhere (see the sketch below)...
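A minimal sketch of those steps, assuming a DataFrame df with a datetime64 column 'ts' and a 20-second threshold:
import pandas as pd

# Rough sketch of the steps above; 'ts' and the 20s threshold are assumptions.
df = df.sort_values('ts')
df['tsdiff'] = df['ts'].diff()                        # gap to the previous row (NaT on the first)
df['new_group'] = df['tsdiff'] > pd.Timedelta('20s')  # NaT compares False, so no fillna is needed
sessions = df.groupby(df['new_group'].cumsum())       # each group is one elastic window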
I have a dataframe that looks like this -
Name     Gun_time   Net_time   Pace
John     28:48:00   28:47:00   4:38:00
George   29:11:00   29:10:00   4:42:00
Mike     29:38:00   29:37:00   4:46:00
Sarah    29:46:00   29:46:00   4:48:00
Roy      30:31:00   30:30:00   4:55:00
Q1. How can I add another column stating the difference between Gun_time and Net_time?
Q2. How do I calculate the mean of Gun_time and Net_time? Please help!
I have tried doing the following, but it doesn't work:
df['Difference'] = df['Gun_time'] - df['Net_time']
For the mean value I tried df['Gun_time'].mean, but that doesn't work either.
Q3. What if the times are in 28:48 format (minutes and seconds) and not 28:48:00? Then the function gives a value error:
ValueError: expected hh:mm:ss format
Convert your columns to dtype timedelta, e.g.:
for col in ("Gun_time", "Net_time", "Pace"):
    df[col] = pd.to_timedelta(df[col])
Now you can do calculations like
df['Gun_time'].mean()
# Timedelta('1 days 05:34:48')
or
df['Difference'] = df['Gun_time'] - df['Net_time']
#df['Difference']
# 0 0 days 00:01:00
# 1 0 days 00:01:00
# 2 0 days 00:01:00
# 3 0 days 00:00:00
# 4 0 days 00:01:00
# Name: Difference, dtype: timedelta64[ns]
If you need nicer output as a string, you can use
def timedeltaToString(td):
    hours, remainder = divmod(td.total_seconds(), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"

df['diffString'] = df['Difference'].apply(timedeltaToString)
# df['diffString']
# 0 00:01:00
# 1 00:01:00
# 2 00:01:00
# 3 00:00:00
# 4 00:01:00
#Name: diffString, dtype: object
See also Format timedelta to string.
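As for Q3: pd.to_timedelta does expect hh:mm:ss, so one hedged approach is to pad mm:ss values to hh:mm:ss before parsing. A sketch, assuming every value with a single colon means minutes:seconds:
# Sketch for Q3; assumes single-colon values are minutes:seconds.
def pad_to_hms(value):
    return "00:" + value if value.count(":") == 1 else value

for col in ("Gun_time", "Net_time", "Pace"):
    df[col] = pd.to_timedelta(df[col].map(pad_to_hms))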
I am trying to create a proper bin for a timestamp interval column,
using code such as
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00', '00:30:00',
                                         '00:40:00', '00:50:00', '00:60:00']))
The Resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
Which is a little off, as I want just the time value without the days part, and I also want the upper limit of the last bin to be 60 minutes to inf (or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!
There is no inf for timedeltas in pandas, so the maximal Timedelta value is used instead. Also, include_lowest=True is needed so the lowest values are included. If you want the bins as timedeltas:
b = pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                     '00:30:00', '00:40:00',
                     '00:50:00', '00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print (df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, use zip to create the labels, appending 'inf' to the values first:
vals = ['00:00:00', '00:10:00', '00:20:00',
        '00:30:00', '00:40:00', '00:50:00', '00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print (df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00
You could just use labels to solve it -
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00', '00:30:00',
                                         '00:40:00', '00:50:00', '00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
                           '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
                           '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
I would like to run a calculation over each stretch of consecutive ones.
I have a database of how a compressor works. Every 5 minutes I get the compressor status (ON/OFF) and the electricity consumed at that moment. The On_Off column holds 1 when the compressor is working (ON) and 0 when it is OFF.
import numpy as np
import pandas as pd

Compresor = pd.Series([0,0,1,1,1,0,0,1,1,1,0,0,0,0,1,1,1,0],
                      index=pd.date_range('1/1/2012', periods=18, freq='5 min'))
df = pd.DataFrame(Compresor)
df.index.rename("Date", inplace=True)
df.columns = ["ON_OFF"]
df.loc[(df.ON_OFF == 1), 'Electricity'] = np.random.randint(4, 20, int(df.ON_OFF.sum()))
df.loc[(df.ON_OFF < 1), 'Electricity'] = 0
df
ON_OFF Electricity
Date
2012-01-01 00:00:00 0 0.0
2012-01-01 00:05:00 0 0.0
2012-01-01 00:10:00 1 4.0
2012-01-01 00:15:00 1 10.0
2012-01-01 00:20:00 1 9.0
2012-01-01 00:25:00 0 0.0
2012-01-01 00:30:00 0 0.0
2012-01-01 00:35:00 1 17.0
2012-01-01 00:40:00 1 10.0
2012-01-01 00:45:00 1 5.0
2012-01-01 00:50:00 0 0.0
2012-01-01 00:55:00 0 0.0
2012-01-01 01:00:00 0 0.0
2012-01-01 01:05:00 0 0.0
2012-01-01 01:10:00 1 14.0
2012-01-01 01:15:00 1 5.0
2012-01-01 01:20:00 1 19.0
2012-01-01 01:25:00 0 0.0
What I would like to do is sum the electricity consumption over each run of ones and put the results in another DataFrame. For example, with the data above: the first time the compressor turned on (00:10-00:20) it consumed 23 (4+10+9); the second time (00:35-00:45) it consumed 32 (17+10+5); the third time (01:10-01:20) it consumed 38 (14+5+19).
I would like to do this automatically; I'm new to pandas and can't think of a way to do it.
Let's say you have the following data:
from operator import itemgetter
import numpy as np
import numpy.random as rnd
import pandas as pd
from funcy import concat, repeat
from toolz import partitionby
base_data = {
    'time': list(range(20)),
    'state': list(concat(repeat(0, 3), repeat(1, 4), repeat(0, 5), repeat(1, 6), repeat(0, 2))),
    'value': list(concat(repeat(0, 3), rnd.randint(5, 20, 4), repeat(0, 5), rnd.randint(5, 20, 6), repeat(0, 2)))
}
Well, there are two ways.
The first one is functional and independent of pandas: you simply partition your data by a field, i.e. the method processes the data sequentially and starts a new partition every time the value of that field changes. You can then summarize each partition as desired.
# transform into sample data
sample_data = [dict(zip(base_data.keys(), x)) for x in zip(*base_data.values())]

# and compute statistics the functional way
[sum(x['value'] for x in part if x['state'] == 1)
 for part in partitionby(itemgetter('state'), sample_data)
 if part[0]['state'] == 1]
There is also the pandas way, similar to what @ivallesp mentioned: you compute the change of state by shifting the state column, then summarize your data frame by the group:
pd_data = pd.DataFrame(base_data)
pd_data['shifted_state'] = pd_data['state'].shift(fill_value = pd_data['state'][0])
pd_data['cum_state'] = np.cumsum(pd_data['state'] != pd_data['shifted_state'])
pd_data[pd_data['state'] == 1].groupby('cum_state').sum()
Depending on what you and your peers read best, you can choose your way. Also, the functional version may not be the most readable; it can be rewritten with plain, readable loop statements.
What I would do is create a variable identifying each period of activity with an integer ID, then group by it and sum the Electricity column. An easy way to create it is to cumulatively sum the OFF-to-ON transitions of On_Off (the data has to be sorted by increasing date) and multiply the result by the On_Off column. If you provide a reproducible example of your table in Pandas I can quickly write you the solution.
Hope it helps
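For the data in the question, a hedged sketch of that idea:
# Sketch of the approach above; assumes df from the question, sorted by date.
starts = df["ON_OFF"].diff().fillna(df["ON_OFF"]).eq(1)           # OFF -> ON edges
period_id = starts.cumsum() * df["ON_OFF"]                        # 0 marks OFF rows
consumption = df.groupby(period_id)["Electricity"].sum().drop(0)  # one value per ON period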
I have two dataframes each with a datetime column:
df_long=
mytime_long
0 00:00:01 1/10/2013
1 00:00:05 1/10/2013
2 00:00:55 1/10/2013
df_short=
mytime_short
0 00:00:02 1/10/2013
1 00:00:03 1/10/2013
2 00:00:06 1/10/2013
The timestamps are unique and can be assumed sorted in each of the two dataframes.
I would like to create a new dataframe that, for each entry in mytime_long, contains the nearest (index, mytime_short) at or after that time (hence with a non-negative timedelta), e.g.
0 (0, 00:00:02 1/10/2013)
1 (2, 00:00:06 1/10/2013)
2 (np.nan,np.nat)
Write a function to get the closest index & timestamp in df_short, given a timestamp:
def get_closest(n):
    mask = df_short.mytime_short >= n
    ids = np.where(mask)[0]
    if ids.size > 0:
        return ids[0], df_short.mytime_short[ids[0]]
    else:
        return np.nan, np.nan
Apply this function over df_long.mytime_long to get a new data frame with the index & timestamp values in a tuple:
df = df_long.mytime_long.apply(get_closest)
df
# output:
0 (0, 2013-01-10 00:00:02)
1 (2, 2013-01-10 00:00:06)
2 (nan, nan)
ilia timofeev's answer reminded me of the pandas.merge_asof function, which is perfect for this type of join:
df = pd.merge_asof(df_long,
                   df_short.reset_index(),
                   left_on='mytime_long',
                   right_on='mytime_short',
                   direction='forward')[['index', 'mytime_short']]
df
# output:
index mytime_short
0 0.0 2013-01-10 00:00:02
1 2.0 2013-01-10 00:00:06
2 NaN NaT
A little ugly, but an effective way to solve the task. The idea is to join the two frames on timestamp and pick the first "short" time after each "long" one, if any.
# recreate data
df_long = pd.DataFrame(
    pd.to_datetime(['00:00:01 1/10/2013', '00:00:05 1/10/2013', '00:00:55 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_long'])
df_short = pd.DataFrame(
    pd.to_datetime(['00:00:02 1/10/2013', '00:00:03 1/10/2013', '00:00:06 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_short'])

# join by time, preserving ids
df_all = df_short.assign(inx_s=df_short.index).set_index('mytime_short').join(
    df_long.assign(inx_l=df_long.index).set_index('mytime_long'), how='outer')

# mark all "short" rows with the nearest "long" id
df_all['inx_l'] = df_all.inx_l.ffill().fillna(-1)

# select "short" rows
df_short_candidate = df_all[~df_all.inx_s.isnull()].astype(int)
df_short_candidate['mytime_short'] = df_short_candidate.index

# take the minimal "short" time in each "long" group,
# then join back with long to recover the empty intersection
df_res = df_long.join(df_short_candidate.groupby('inx_l').first())
print(df_res)
Out:
mytime_long inx_s mytime_short
0 2013-01-10 00:00:01 0.0 2013-01-10 00:00:02
1 2013-01-10 00:00:05 2.0 2013-01-10 00:00:06
2 2013-01-10 00:00:55 NaN NaT
Performance comparison on a sample of 100,000 elements:
186 ms to execute this implementation.
1min 3s to execute df_long.mytime_long.apply(get_closest)
UPD: but the winner is @Haleemur Ali's pd.merge_asof, at 10 ms.
I have two columns; the time an event started and the duration of that event. Like so:
time, duration
1:22:51,41
1:56:29,36
2:02:06,12
2:32:37,38
2:34:51,24
3:24:07,31
3:28:47,59
3:31:19,32
3:42:52,37
3:57:04,58
4:21:55,23
4:40:28,17
4:52:39,51
4:54:48,26
5:17:06,46
6:08:12,1
6:21:34,12
6:22:48,24
7:04:22,1
7:06:28,46
7:19:12,51
7:19:19,4
7:22:27,27
7:32:25,53
I want to create a line chart that shows the number of concurrent events happening throughout the day. Renaming time to start_time and adding a new column that computes end_time is easy enough (assuming that's the next step). What I'm not quite sure about is how to resample the data afterwards so I can chart the concurrent counts.
I imagine I want to wind up with something like (but bucketed by the minute):
time, events
1:30:00,1
2:00:00,2
2:30:00,1
3:00:00,1
3:30:00,2
First make it an actual time stamp:
df['time'] = pd.to_datetime('2014-03-14 ' + df['time'])
Now you can get the end times:
df['end_time'] = df['time'] + pd.to_timedelta(df['duration'], unit='m')
A way to get the open events is to combine the start and end times, resample and cumsum:
In [11]: open_events = pd.concat([pd.Series(1, df.time),       # event starts: add 1
   ....:                          pd.Series(-1, df.end_time)   # event ends: subtract 1
   ....:                          ]).resample('30Min').sum().cumsum()

In [12]: open_events
Out[12]:
2014-03-14 01:00:00 1
2014-03-14 01:30:00 2
2014-03-14 02:00:00 1
2014-03-14 02:30:00 1
2014-03-14 03:00:00 2
2014-03-14 03:30:00 4
2014-03-14 04:00:00 2
2014-03-14 04:30:00 2
2014-03-14 05:00:00 2
2014-03-14 05:30:00 1
2014-03-14 06:00:00 2
2014-03-14 06:30:00 0
2014-03-14 07:00:00 3
2014-03-14 07:30:00 2
2014-03-14 08:00:00 0
Freq: 30T, dtype: int64
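The question asked for per-minute buckets; presumably the same pattern just needs a finer frequency, e.g.:
# Same approach at minute resolution; assumes the df built above.
per_minute = pd.concat([pd.Series(1, df.time),        # +1 at each start
                        pd.Series(-1, df.end_time)    # -1 at each end
                        ]).resample('1Min').sum().cumsum()
per_minute.plot()  # line chart of concurrent events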
You could create a list of dictionary items with "time" and "events" values.
Obviously you need to handle the evaluation and manipulation of the time data types appropriately, but you could do something like this:
event_bucket = []
time_interval = (end_time - start_time) / num_of_buckets

# one bucket per interval, counting events active at that instant
for ii in range(num_of_buckets):
    event_bucket.append({"time": start_time + ii * time_interval, "events": 0})

for entry in time_entry:
    for bucket in event_bucket:
        if entry["start_time"] <= bucket["time"] <= entry["end_time"]:
            bucket["events"] += 1
If you make num_of_buckets larger, the graph becomes more precise.