Aggregate events with start and end times with Pandas - python

I have data for a number of events with start and end times like this:
import pandas as pd

df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'],
                   'end': ['2015-01-07', '2015-01-15', '2015-01-13'],
                   'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
Out:
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
Now I need to calculate the number of events active at the same time and, for example, the sum of their values. So the result should look something like this:
date count sum
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-08 0 0
2015-01-09 0 0
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
Any ideas for how to do this? I was thinking about using a custom Grouper for groupby, but as far as I can see a Grouper can only assign a row to a single group so that doesn't look useful.
EDIT: After some testing I found this rather ugly way to get the desired result:
df['count'] = 1
dates = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
start = df[['start', 'value', 'count']].set_index('start').reindex(dates)
end = df[['end', 'value', 'count']].set_index('end').reindex(dates).shift(1)
# pd.rolling_sum was removed from pandas; the rolling() equivalent is used here
rstart = start.rolling(len(start), min_periods=1).sum()
rend = end.rolling(len(end), min_periods=1).sum()
rstart.subtract(rend, fill_value=0).fillna(0)
However, this only works with sums, and I can't see an obvious way to make it work with other functions. For example, is there a way to get it to work with median instead of sum?

If I were using SQL, I would do this by joining an all-dates table to the events table, and then grouping by date. Pandas doesn't make this approach especially easy, since there's no way to left-join on a condition, but we can fake it using dummy columns and reindexing:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df['dummy'] = 1
Then:
date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
cross_join = date_df.merge(df, on='dummy')
cond_join = cross_join[(cross_join.start <= cross_join.date) & (cross_join.date <= cross_join.end)]
grp_join = cond_join.groupby(['date'])
final = (
    pd.DataFrame(dict(
        val_count=grp_join.size(),
        val_sum=grp_join.value.sum(),
        val_median=grp_join.value.median()
    ), index=date_series)
    .fillna(0)
    .reset_index()
)
The fillna(0) isn't perfect, since it makes nulls in the val_median column into 0s, when they should really remain nulls.
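If you want to stay in plain pandas, one way around that (a sketch reusing the grp_join and date_series defined above) is to fill only the count and sum columns and leave the median as NaN:
final = pd.DataFrame(dict(
    val_count=grp_join.size(),
    val_sum=grp_join.value.sum(),
    val_median=grp_join.value.median(),
), index=date_series)
# only count and sum have a meaningful 0 default; the median stays NaN for empty dates
final[['val_count', 'val_sum']] = final[['val_count', 'val_sum']].fillna(0)
final = final.reset_index()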
Alternatively, with pandas-ply we can code that up as:
date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
final = (
    date_df
    .merge(df, on='dummy')
    .ply_where(X.start <= X.date, X.date <= X.end)
    .groupby('date')
    .ply_select(val_count=X.size(), val_sum=X.value.sum(), median=X.value.median())
    .reindex(date_series)
    .ply_select('*', val_count=X.val_count.fillna(0), val_sum=X.val_sum.fillna(0))
    .reset_index()
)
which handles nulls a bit better.
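(Note that pandas-ply has to be installed and activated before the ply_* methods exist; a sketch, assuming the package's documented install_ply hook:)
# pip install pandas-ply
import pandas as pd
from pandas_ply import install_ply, X

install_ply(pd)  # adds .ply_where / .ply_select to pandas objects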

This is what I came up with. I've got to think there's a better way.
Given your frame
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
and then
dList = []
vList = []
d = {}

def buildDict(row):
    for x in pd.date_range(row["start"], row["end"]):  # build a range for each row
        dList.append(x)             # date list
        vList.append(row["value"])  # value list

df.apply(buildDict, axis=1)  # each row in df is passed to buildDict

# this d will be used to create our new frame
d["date"] = dList
d["value"] = vList

# from here you can use whatever agg functions you want
pd.DataFrame(d).groupby("date").agg(["count", "sum"])
yields
value
count sum
date
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
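Note that dates with no active events (2015-01-08 and 2015-01-09) are missing from this result. If you need them, one option (a sketch reusing the d and df built above) is to reindex against the full date range:
result = pd.DataFrame(d).groupby("date").agg(["count", "sum"])
# add the empty dates back with count 0 and sum 0
full_range = pd.date_range(df["start"].min(), df["end"].max(), freq="D")
result = result.reindex(full_range, fill_value=0)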

You can avoid the cross join by exploding the dates and imputing the missing rows with complete from pyjanitor, before aggregating by date:
# pip install pyjanitor
import pandas as pd
import janitor
(df.assign(dates=[pd.date_range(start, end, freq='1D')
                  for start, end in zip(df.start, df.end)])
   .explode('dates')
   .loc[:, ['value', 'dates']]
   .complete({'dates': lambda df: pd.date_range(df.min(), df.max(), freq='1D')})
   .groupby('dates')
   .agg(['size', 'sum'])
   .droplevel(level=0, axis='columns')
)
size sum
dates
2015-01-05 1 3.0
2015-01-06 1 3.0
2015-01-07 1 3.0
2015-01-08 1 0.0
2015-01-09 1 0.0
2015-01-10 1 4.0
2015-01-11 2 9.0
2015-01-12 2 9.0
2015-01-13 2 9.0
2015-01-14 1 4.0
2015-01-15 1 4.0
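Note that size counts the placeholder rows that complete inserted, which is why 2015-01-08 and 2015-01-09 show 1 rather than 0. Aggregating the value column with 'count' instead skips those NaN rows; a sketch of the same chain with that change:
(df.assign(dates=[pd.date_range(start, end, freq='1D')
                  for start, end in zip(df.start, df.end)])
   .explode('dates')
   .loc[:, ['value', 'dates']]
   .complete({'dates': lambda df: pd.date_range(df.min(), df.max(), freq='1D')})
   .groupby('dates')['value']
   .agg(['count', 'sum'])   # 'count' ignores NaN, so empty dates show 0
)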

Related

How to "unroll" time intervals in a dataframe?

I have a dataframe:
df1 = pd.DataFrame(
    [['2011-01-01', '2011-01-03', 'A'],
     ['2011-04-01', '2011-04-01', 'A'],
     ['2012-08-28', '2012-08-30', 'B'],
     ['2015-04-03', '2015-04-05', 'A'],
     ['2015-08-21', '2015-08-21', 'B']],
    columns=['d0', 'd1', 'event'])
d0 d1 event
0 2011-01-01 2011-01-03 A
1 2011-04-01 2011-04-01 A
2 2012-08-28 2012-08-30 B
3 2015-04-03 2015-04-05 A
4 2015-08-21 2015-08-21 B
It contains some events A and B that occurred in the specified interval from d0 to d1. (There are actually more events; they are mixed, but their date ranges do not intersect.) Moreover, this interval can be a single day (d0 = d1). I need to go from df1 to df2, in which these time intervals are "unrolled" for each event, i.e.:
df2 = pd.DataFrame(
    [['2011-01-01', 'A'], ['2011-01-02', 'A'], ['2011-01-03', 'A'],
     ['2011-04-01', 'A'], ['2012-08-28', 'B'], ['2012-08-29', 'B'],
     ['2012-08-30', 'B'], ['2015-04-03', 'A'], ['2015-04-04', 'A'],
     ['2015-04-05', 'A'], ['2015-08-21', 'B']],
    columns=['Date', 'event'])
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I tried doing this based on resample and comparing areas where ffill equals bfill, but couldn't come up with anything. What is the simplest way to do this?
We can set_index to event, then create a date_range per row, then explode to unwind the ranges, and finally reset_index to create the DataFrame:
df2 = (
    df1.set_index('event')
       .apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
       .explode()
       .reset_index(name='Date')[['Date', 'event']]
)
df2:
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
Let us try a comprehension to create the (date, event) pairs; the expression (*v, c) unpacks each row into the two interval dates and the event label, and pd.date_range(*v) expands each interval:
pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
              for d in pd.date_range(*v)), columns=['Date', 'Event'])
Date Event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I don't know if this is the "most simple", but it's the most intuitive way I can think of to do it: iterate over the rows and unroll them manually into a new dataframe. That is, for each row, iterate over the dates between d0 and d1, construct a row for each date, and compile them all into a dataframe:
from datetime import timedelta

import pandas as pd

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        event = row['event']
        start = row['d0']
        end = row['d1']
        current = start
        while current != end:
            rows.append(dict(Date=current, event=event))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=event))  # make sure the last date is included
    return pd.DataFrame(rows)
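A quick usage sketch: the d0/d1 columns need to be real datetimes for the timedelta arithmetic, since df1 above was built from plain strings:
df1[['d0', 'd1']] = df1[['d0', 'd1']].apply(pd.to_datetime)
df2 = unroll_events(df1)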

Shifting selected months in python

I am struggling to find a solution to the following problem: I have a dataframe which reports quarterly values. Unfortunately, some of the companies report their quarterly numbers a month after the typical quarter-end release dates. For this reason, I would like to select these dates and change them to the typical release date. My dataframe looks like this:
import numpy as np
import pandas as pd

# dataframe 1
rng1 = pd.date_range('2014-12-31', periods=5, freq='3M')
df1 = pd.DataFrame({'Date': rng1, 'Company': [1, 1, 1, 1, 1], 'Val': np.random.randn(len(rng1))})
# dataframe 2
rng2 = pd.date_range('2015-01-30', periods=5, freq='3M')
df2 = pd.DataFrame({'Date': rng2, 'Company': [2, 2, 2, 2, 2], 'Val': np.random.randn(len(rng2))})
# Target Dataframe
frames = [df1, df2]
df_fin = pd.concat(frames)
Output:
Date Company Val
0 2014-12-31 1 0.374427
1 2015-03-31 1 0.328239
2 2015-06-30 1 -1.226196
3 2015-09-30 1 -0.153937
4 2015-12-31 1 -0.146096
0 2015-01-31 2 0.283528
1 2015-04-30 2 0.426100
2 2015-07-31 2 -0.044960
3 2015-10-31 2 -1.316574
4 2016-01-31 2 0.353073
So what I would like to do is the following: Company 2 reports their numbers a month later. For this reason I would like to change their dates so they align with Company 1. This means I would change dates such as 2015-01-31 to 2014-12-31.
Any help is highly appreciated.
Thanks in advance.
Use pd.merge_asof with direction='nearest' to merge the dataframe df_fin with the reference quarterly dates qDates:
# Reference quarterly dates (typical release dates)
qDates = pd.date_range('2014-12-31', periods=5, freq='Q')

df = pd.merge_asof(
    df_fin.sort_values(by='Date'), pd.Series(qDates, name='Quarter'),
    left_on='Date', right_on='Quarter', direction='nearest')

df = (
    df.sort_values(by=['Company', 'Quarter'])
      .drop(columns='Date')
      .rename(columns={'Quarter': 'Date'})
      .reindex(df_fin.columns, axis=1)
      .reset_index(drop=True)
)
# print(df)
Date Company Val
0 2014-12-31 1 0.146874
1 2015-03-31 1 0.297248
2 2015-06-30 1 1.444860
3 2015-09-30 1 -0.348871
4 2015-12-31 1 -0.093267
5 2014-12-31 2 -0.238166
6 2015-03-31 2 -1.503571
7 2015-06-30 2 0.791149
8 2015-09-30 2 -0.419414
9 2015-12-31 2 -0.598963
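If the reference quarter ends shouldn't be hard-coded, they can also be derived from the data itself; a sketch under that assumption:
# derive the typical release dates from the span of the observed dates
qDates = pd.date_range(df_fin['Date'].min(), df_fin['Date'].max(), freq='Q')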
I hope I understand what you mean. You can use pd.DateOffset (or pd.offsets.MonthOffset) here to add or subtract a number of months from Date for the rows where Company == 2.
For example:
df_fin.loc[df_fin['Company'] == 2,'Date'] = df_fin.loc[df_fin['Company'] == 2,'Date'] - pd.DateOffset(months=1)
df_fin prints:
# df_fin
Date Company Val
0 2014-12-31 1 -0.794092
1 2015-03-31 1 -2.632114
2 2015-06-30 1 -0.176383
3 2015-09-30 1 0.701986
4 2015-12-31 1 -0.447678
0 2014-12-31 2 -0.003322
1 2015-03-30 2 0.475669
2 2015-06-30 2 -1.024190
3 2015-09-30 2 1.241122
4 2015-12-31 2 0.096882
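Note that DateOffset(months=1) keeps the day of month, which is why 2015-04-30 maps to 2015-03-30 rather than the quarter end 2015-03-31. If the shifted dates must land exactly on month ends, an alternative to the DateOffset line above (a sketch, not part of the original answer) is pd.offsets.MonthEnd:
# roll Company 2's dates back to the previous month end, e.g. 2015-01-31 -> 2014-12-31
mask = df_fin['Company'] == 2
df_fin.loc[mask, 'Date'] = df_fin.loc[mask, 'Date'] - pd.offsets.MonthEnd(1)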

How to truncate a column in a Pandas time series data frame so as to remove leading and trailing zeros?

I have the following time series df in Pandas:
date value
2015-01-01 0
2015-01-02 0
2015-01-03 0
2015-01-04 3
2015-01-05 0
2015-01-06 4
2015-01-07 0
I would like to remove the leading and trailing zeroes, so as to have the following df:
date value
2015-01-04 3
2015-01-05 0
2015-01-06 4
Simply dropping rows with 0s in them would lead to deleting the 0s in the middle as well, which I don't want.
I thought of writing a forward loop that starts from the first row and then continues until the first non 0 value, and a second backwards loop that goes back from the end and stops at the last non 0 value. But that seems like overkill, is there a more efficient way of doing so?
A general solution (it returns an empty DataFrame if all values are 0): build a boolean mask of non-zero values, test where its cumulative sum is non-zero from both directions (the reversed series via [::-1]), chain the two conditions with bitwise AND, and filter by boolean indexing:
s = df['value'].ne(0)
df = df[s.cumsum().ne(0) & s[::-1].cumsum().ne(0)]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
If there is always at least one non-zero value, it is possible to convert the 0s to missing values and use DataFrame.loc with Series.first_valid_index and Series.last_valid_index:
s = df['value'].mask(df['value'] == 0)
df = df.loc[s.first_valid_index():s.last_valid_index()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
Another idea is to use Series.idxmax or Series.idxmin:
s = df['value'].eq(0)
df = df.loc[s.idxmin():s[::-1].idxmin()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
Or equivalently, with the non-zero mask and idxmax:
s = df['value'].ne(0)
df = df.loc[s.idxmax():s[::-1].idxmax()]
You can get a list of the indexes where value is greater than 0, and then find the min.
data = [
    ['2015-01-01', 0],
    ['2015-01-02', 0],
    ['2015-01-03', 0],
    ['2015-01-04', 3],
    ['2015-01-05', 0],
    ['2015-01-06', 4],
]
df = pd.DataFrame(data, columns=['date', 'value'])
print(min(df.index[df['value'] > 0].tolist()))
# 3
Then filter the main df like this:
df.iloc[3:]
Or even better:
df.iloc[min(df.index[df['value'] > 0].tolist()):]
And you get:
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
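The same idea covers the trailing zeros too; a sketch slicing between the first and last non-zero positions (loc is inclusive on both ends with this integer index):
nonzero_idx = df.index[df['value'] > 0]
# keep everything from the first to the last row with a non-zero value
df_trimmed = df.loc[nonzero_idx.min():nonzero_idx.max()]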

Split a pandas date list based on another pandas date list

I'm trying to split one date list by using another. So:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
df['date'].split_by(sf['split'])
would yield:
date num
0 2015-01-15 1.0
1 2015-02-01 NaN
2 2015-02-15 2.0
...but of course, it doesn't. I'm sure there's a simple merge or join I'm missing here, but I can't figure it out. Thanks.
Also, if the 'split' list has multiple dates, some of which fall outside the range of the 'date' list, I don't want them included. So basically, the extents of the new range would be the same as the old.
(side note: if there's a better way to convert a dictionary to a DataFrame and immediately convert the date strings to datetimes, that would be icing on the cake)
I think you need boolean indexing to first filter sf by the min and max of the date column in df, and then concat with sort_values; to align the frames, the split column needs to be renamed:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015', '2/1/2016', '2/1/2014']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
print (df)
date num
0 2015-01-15 1
1 2015-02-15 2
print (sf)
split
0 2015-02-01
1 2016-02-01
2 2014-02-01
mask = (sf.split <= df.date.max()) & (sf.split >= df.date.min())
print (mask)
0 True
1 False
2 False
Name: split, dtype: bool
sf = sf[mask]
print (sf)
split
0 2015-02-01
df = pd.concat([df, sf.rename(columns={'split':'date'})]).sort_values('date')
print (df)
date num
0 2015-01-15 1.0
0 2015-02-01 NaN
1 2015-02-15 2.0
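As for the side note in the question, the string-to-datetime conversion can be folded into the frame construction, and a reset_index tidies the duplicated index left by the concat; a sketch:
# build the frames and parse the date strings in one step
df = pd.DataFrame(d).assign(date=lambda x: pd.to_datetime(x['date']))
sf = pd.DataFrame(s).assign(split=lambda x: pd.to_datetime(x['split']))

# filter, rename, concat, and reset the duplicated index
out = (pd.concat([df, sf[(sf.split >= df.date.min()) & (sf.split <= df.date.max())]
                    .rename(columns={'split': 'date'})])
         .sort_values('date')
         .reset_index(drop=True))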

Counting dates in a range set by pandas dataframe

I have a pandas dataframe that contains two date columns, a start date and an end date that defines a range. I'd like to be able to collect a total count for all dates across all rows in the dataframe, as defined by these columns.
For example, the table looks like:
index start_date end date
0 '2015-01-01' '2015-01-17'
1 '2015-01-03' '2015-01-12'
And the result would be a per date aggregate, like:
date count
'2015-01-01' 1
'2015-01-02' 1
'2015-01-03' 2
and so on.
My current approach works but is extremely slow on a big dataframe as I'm looping across the rows, calculating the range and then looping through this. I'm hoping to find a better approach.
Currently I'm doing:
date = pd.date_range(min(df.start_date), max(df.end_date))
df2 = pd.DataFrame(index=date)
df2['count'] = 0
for index, row in df.iterrows():
    dates = pd.date_range(row['start_date'], row['end_date'])
    for date in dates:
        df2.loc[date, 'count'] += 1
After stacking the relevant columns as suggested by @Sam, just use value_counts.
df[['start_date', 'end date']].stack().value_counts()
EDIT:
Given that you also want to count the dates between the start and end dates:
start_dates = pd.to_datetime(df.start_date)
end_dates = pd.to_datetime(df.end_date)
pd.Series(dt.date()
          for group in [pd.date_range(start, end)
                        for start, end in zip(start_dates, end_dates)]
          for dt in group).value_counts()
2015-01-07 2
2015-01-06 2
2015-01-12 2
2015-01-05 2
2015-01-04 2
2015-01-10 2
2015-01-03 2
2015-01-09 2
2015-01-08 2
2015-01-11 2
2015-01-16 1
2015-01-17 1
2015-01-14 1
2015-01-15 1
2015-01-02 1
2015-01-01 1
2015-01-13 1
dtype: int64
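value_counts orders the result by count; to get it in chronological order instead, sort the index (a sketch, naming the Series above counts):
counts = pd.Series(dt.date()
                   for group in [pd.date_range(start, end)
                                 for start, end in zip(start_dates, end_dates)]
                   for dt in group).value_counts()
counts.sort_index()   # per-date counts in date order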
I think the solution here is to 'stack' your two date columns, group by the date, and do a count. Play around with the df.stack() function. Here is something I threw together that yields a good solution:
import datetime

df = pd.DataFrame({'Start': [datetime.date(2016, 5, i) for i in range(1, 30)],
                   'End': [datetime.date(2016, 5, i) for i in range(1, 30)]})
df.stack().reset_index()[[0, 'level_1']].groupby(0).count()
I would use the melt() method for that:
In [76]: df
Out[76]:
start_date end_date
index
0 2015-01-01 2015-01-17
1 2015-01-03 2015-01-12
2 2015-01-03 2015-01-17
In [77]: pd.melt(df, value_vars=['start_date','end_date']).groupby('value').size()
Out[77]:
value
2015-01-01 1
2015-01-03 2
2015-01-12 1
2015-01-17 2
dtype: int64
