Create features based on cutoff times in featuretools - python

I'm using featuretools and I need to create a feature that uses the cutoff time in its calculation.
My entity set consists of a client table and a subscription table (it has more tables, but only these two are needed for the question):
import featuretools as ft
import pandas as pd
import datetime as dt

client_table = pd.DataFrame({'client_id': (1, 2, 3),
                             'start_date': (dt.date(2015,1,1), dt.date(2017,10,15), dt.date(2011,1,10))})
subscription_table = pd.DataFrame({'subscription_id': (1, 2, 3, 4),
                                   'client_id': (1, 3, 1, 2),
                                   'start_plan_date': (dt.date(2015,1,1), dt.date(2011,1,10), dt.date(2018,2,1), dt.date(2017,10,15)),
                                   'end_plan_date': (dt.date(2018,2,1), dt.date(2019,1,10), dt.date(2021,2,1), dt.date(2019,10,15))})
client table

   client_id  start_date
0          1  2015-01-01
1          2  2017-10-15
2          3  2011-01-10

subscription table

   subscription_id  client_id start_plan_date end_plan_date
0                1          1      2015-01-01    2018-02-01
1                2          3      2011-01-10    2019-01-10
2                3          1      2018-02-01    2021-02-01
3                4          2      2017-10-15    2019-10-15
I created the entity set using client_id as the key and start_date as the time_index:
es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id="client",
                              dataframe=client_table,
                              index="client_id",
                              time_index="start_date")
es = es.entity_from_dataframe(entity_id="subscription",
                              dataframe=subscription_table,
                              index="subscription_id",
                              time_index="start_plan_date",
                              variable_types={"client_id": ft.variable_types.Index,
                                              "end_plan_date": ft.variable_types.Datetime})
relation = ft.Relationship(es["client"]["client_id"], es["subscription"]["client_id"])
es = es.add_relationship(relation)
print(es)
Out:
Entityset: None
  Entities:
    subscription [Rows: 4, Columns: 4]
    client [Rows: 3, Columns: 2]
  Relationships:
    subscription.client_id -> client.client_id
Now, I need to create a feature that estimates the time between the cutoff time (e.g. 01/01/2018) and the closest end_plan_date for each client. In algebraic form the calculation should be:
time_remaining_in_plan = max(subscription.end_plan_date - cutoff_time)
Also I need to calculate the amount of time since the client started:
time_since_start = cutoff_time - client.start_date
In my example the expected output for those features should look like this (I'm expressing the time differences in days, but it could also be in months; also, I'm using a range of cutoff times):
client_id cutoff_time time_remaining_in_plan time_since_start
0 3 2018-10-31 71 2851
1 3 2018-11-30 41 2881
2 1 2018-10-31 824 1399
3 1 2018-11-30 794 1429
4 2 2018-10-31 349 381
5 2 2018-11-30 319 411
Is there a way to use featuretools to create custom primitives (aggregation or transformation) or seed features that can generate this result?
Thanks!!

This can be done with custom primitives that use the uses_calc_time parameter. This parameter sets up the primitive so that the cutoff time gets passed to it during calculation.
In your case, we need to define two primitives:
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Datetime, Numeric

def time_until(array, time):
    diff = pd.DatetimeIndex(array) - time
    return diff.days

TimeUntil = make_trans_primitive(function=time_until,
                                 input_types=[Datetime],
                                 return_type=Numeric,
                                 uses_calc_time=True,
                                 description="Calculates time until the cutoff time in days",
                                 name="time_until")

def time_since(array, time):
    diff = time - pd.DatetimeIndex(array)
    return diff.days

TimeSince = make_trans_primitive(function=time_since,
                                 input_types=[Datetime],
                                 return_type=Numeric,
                                 uses_calc_time=True,
                                 description="Calculates time since the cutoff time in days",
                                 name="time_since")
Then we can use the primitives in a call to ft.dfs
cutoff_times = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3, 3],
    "cutoff_time": pd.to_datetime([dt.date(2018,10,31), dt.date(2018,11,30)] * 3)
})

fm, fl = ft.dfs(entityset=es,
                target_entity="client",
                cutoff_time=cutoff_times,
                agg_primitives=["max"],
                trans_primitives=[TimeUntil, TimeSince],
                cutoff_time_in_index=True)

# these columns correspond to time_remaining_in_plan and time_since_start
fm = fm[["MAX(subscription.TIME_UNTIL(end_plan_date))", "TIME_SINCE(start_date)"]]
this returns
                      MAX(subscription.TIME_UNTIL(end_plan_date))  TIME_SINCE(start_date)
client_id time
1         2018-10-31                                         -272                     1399
2         2018-10-31                                          349                      381
3         2018-10-31                                           71                     2851
1         2018-11-30                                         -302                     1429
2         2018-11-30                                          319                      411
3         2018-11-30                                           41                     2881
This matches the result you're looking for in your question, with the exception of time_remaining_in_plan for client id 1. I double checked the numbers Featuretools came up with and I believe they are right for this dataset.
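For reference, the client 3 rows are easy to check by hand with plain pandas (the dates come from the tables above):

# quick manual check of the client 3 / 2018-10-31 row
cutoff = pd.Timestamp("2018-10-31")
print((pd.Timestamp("2019-01-10") - cutoff).days)   # 71   -> MAX(subscription.TIME_UNTIL(end_plan_date))
print((cutoff - pd.Timestamp("2011-01-10")).days)   # 2851 -> TIME_SINCE(start_date)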


Perform a cross column calculation in Python

Context
I am trying to build a portfolio dashboard following this example, only instead of Excel I am using Python. I am currently not sure how to reproduce the part from 3:47 onwards, where each period's balance is cross-calculated from the previous period's balance.
Problem
Is there a way to do this in Python? I tried a for loop, but it returned the same number repeated for every forward period. Below is my attempt:
date_range = pd.date_range(start=today, periods=period_of_investments, freq=contribution_periods)
returns_port = 12
rs = []
balance_total = []
for one in range(len(date_range)):
    return_loss = (returns_port/period_of_investments)*capital_insert
    rs.append(return_loss)
    period_one_balance = capital_insert+return_loss
    period_two_return_loss = (returns_port/period_of_investments)*(period_one_balance + capital_insert)
    period_two_balance = period_one_balance + capital_insert + period_two_return_loss
    balance_total.append(period_two_balance)
I did not watch the video, but I will explain how to write Python code for a problem similar to the one in the video.
Suppose you want to calculate the return on a fixed monthly deposit over the next 20 years with a fixed interest rate.
The first step is understanding how pd.date_range() works. If you started at the beginning of this month, the whole period would be pd.date_range(start='4-1-2021', periods=240, freq='1m') (240 comes from 20 years, 12 months each). Basically, we are calculating the return at the end of each month.
import pandas as pd

portfolio = pd.DataFrame(columns=['Date', 'Investment', 'Return/Loss', 'Balance'])
interest_rate = 0.121
monthly_deposit = 500
dates = pd.date_range(start="3-31-2021", periods=240, freq='1m')
investment = [monthly_deposit]*len(dates)
return_losses = []
balances = []
current_balance = 500

for date in dates:
    current_return_loss = (interest_rate/12)*current_balance
    return_losses.append(round(current_return_loss, 2))
    balances.append(round(current_balance + current_return_loss))
    current_balance += (current_return_loss + monthly_deposit)

portfolio['Date'] = pd.to_datetime(dates)
portfolio['Investment'] = investment
portfolio['Return/Loss'] = return_losses
portfolio['Balance'] = balances
balance_at_end = balances[-1]

print(portfolio.head(10))
print(balance_at_end)
You will get the following result, which is identical to the video:
Date Investment Return/Loss Balance
0 2021-03-31 500 5.04 505
1 2021-04-30 500 10.13 1015
2 2021-05-31 500 15.28 1530
3 2021-06-30 500 20.47 2051
4 2021-07-31 500 25.72 2577
5 2021-08-31 500 31.02 3108
6 2021-09-30 500 36.38 3644
7 2021-10-31 500 41.79 4186
8 2021-11-30 500 47.25 4733
9 2021-12-31 500 52.77 5286
506397
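If you want a cross-check without the loop, the same schedule (a deposit at the start of each month, interest credited at the end) can be computed with the future-value formula from the numpy-financial package. This is a sketch of my own, not part of the original answer; it should land within rounding distance of the loop's result, since the loop rounds each recorded balance.

import numpy_financial as npf

# 240 monthly periods, 500 deposited at the start of each, 12.1% annual rate
fv = npf.fv(rate=0.121/12, nper=240, pmt=-500, pv=0, when='begin')
print(round(fv))  # should be very close to the 506397 printed above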

create a new column in pandas dataframe using if condition from another dataframe

I have two dataframes as follows
transactions
buy_date buy_price
0 2018-04-16 33.23
1 2018-05-09 33.51
2 2018-07-03 32.74
3 2018-08-02 33.68
4 2019-04-03 33.58
and
cii
from_fy to_fy score
0 2001-04-01 2002-03-31 100
1 2002-04-01 2003-03-31 105
2 2003-04-01 2004-03-31 109
3 2004-04-01 2005-03-31 113
4 2005-04-01 2006-03-31 117
In the transactions dataframe I need to create a new column cii_score based on the following condition:
if transactions['buy_date'] is between cii['from_fy'] and cii['to_fy'], take the cii['score'] value for transactions['cii_score']
I have tried a list comprehension but it did not work.
I would appreciate your input on how to tackle this.
First, we set up your dataframes. Note that I modified the dates in transactions in this short example to make it more interesting.
import pandas as pd
import numpy as np
from io import StringIO

trans_data = StringIO(
"""
,buy_date,buy_price
0,2001-04-16,33.23
1,2001-05-09,33.51
2,2002-07-03,32.74
3,2003-08-02,33.68
4,2003-04-03,33.58
"""
)

cii_data = StringIO(
"""
,from_fy,to_fy,score
0,2001-04-01,2002-03-31,100
1,2002-04-01,2003-03-31,105
2,2003-04-01,2004-03-31,109
3,2004-04-01,2005-03-31,113
4,2005-04-01,2006-03-31,117
"""
)

tr_df = pd.read_csv(trans_data, index_col=0)
tr_df['buy_date'] = pd.to_datetime(tr_df['buy_date'])
cii_df = pd.read_csv(cii_data, index_col=0)
cii_df['from_fy'] = pd.to_datetime(cii_df['from_fy'])
cii_df['to_fy'] = pd.to_datetime(cii_df['to_fy'])
The main step is the following calculation: for each row of tr_df, find the index of the row in cii_df that satisfies the condition. The list comprehension below computes this match; each element of the list is the appropriate row index of cii_df:
match = [[(f <= d) & (d <= e) for f, e in zip(cii_df['from_fy'], cii_df['to_fy'])].index(True) for d in tr_df['buy_date']]
match
produces
[0, 0, 1, 2, 2]
Now we can merge on this:
tr_df.merge(cii_df, left_on = np.array(match), right_index = True)
so that we get
key_0 buy_date buy_price from_fy to_fy score
0 0 2001-04-16 33.23 2001-04-01 2002-03-31 100
1 0 2001-05-09 33.51 2001-04-01 2002-03-31 100
2 1 2002-07-03 32.74 2002-04-01 2003-03-31 105
3 2 2003-08-02 33.68 2003-04-01 2004-03-31 109
4 2 2003-04-03 33.58 2003-04-01 2004-03-31 109
and the score column is what you asked for.
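As an aside (my own addition, not part of the original answer): if the fiscal-year ranges never overlap, a pd.IntervalIndex lookup avoids the Python-level loop. The column names are the ones from your dataframes.

# build one interval per fiscal year, then find which interval each buy_date falls into
intervals = pd.IntervalIndex.from_arrays(cii_df['from_fy'], cii_df['to_fy'], closed='both')
pos = intervals.get_indexer(tr_df['buy_date'])        # -1 would mean "no matching fiscal year"
tr_df['cii_score'] = cii_df['score'].to_numpy()[pos]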

How to unify (collapse) multiple columns into one assigning unique values

Edited my previous question:
I want to distinguish each device (there are FOUR types) attached to a particular building's particular elevator (represented by its height).
As there are no unique IDs for the devices, I want to identify them and assign unique IDs to each by grouping on ('BldgID', 'BldgHt', 'Device') to identify any particular device.
Then I want to count their testing results, i.e. how many times a device failed (NG) out of the total number of tests (NG + OK) on any particular date, over a duration of a few months.
The original dataframe looks like this:
BldgID BldgHt Device Date Time Result
1074 34.0 790 2018/11/20 10:30 OK
1072 31.0 780 2018/11/19 11:10 NG
1072 36.0 780 2018/11/17 05:30 OK
1074 10.0 790 2018/11/19 06:10 OK
1074 10.0 790 2018/12/20 11:50 NG
1076 17.0 760 2018/08/15 09:20 NG
1076 17.0 760 2018/09/20 13:40 OK
As 'Time' is irrelevant, I dropped it. I want to find the number of NG per day for each set (consisting of 'BldgID', 'BldgHt', 'Device').
#aggregate both functions only once by groupby
df1 = mel_df.groupby(['BldgID','BldgHt','Device','Date'])\
['Result'].agg([('NG', lambda x :(x=='NG').sum()), \
('ALL','count')]).round(2).reset_index()
#create New_ID by insert with Series with zero fill 3 values
s = pd.Series(np.arange(1, len(mel_df2) + 1),
index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Now the filtered DataFrame looks like:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 2
8 002 1076 17.0 760 2018/11/20 1 1
If I groupby ['BldgID', 'BldgHt', 'Device', 'Date'] then I get the 'NG' count per day.
But that treats every day separately, whereas if I assign 'unique' IDs I can plot how each unique device behaves on every single day.
If I groupby ['BldgID', 'BldgHt', 'Device'] then I get the overall 'NG' for that set (or unique device), which is not my goal.
What I want to achieve is:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
001 1072 31.0 780 2018/11/19 1 2
1072 31.0 780 2018/12/30 3 4
002 1076 17.0 760 2018/11/20 1 1
1076 17.0 760 2018/09/20 2 4
003 1072 36.0 780 2018/08/15 1 3
Any tips would be very much appreciated.
Use:
import numpy as np
import pandas as pd

#aggregate both functions in a single groupby
df1 = mel_df.groupby(['BldgID','BldgHt','Device','Date'])\
        ['Result'].agg([('NG', lambda x: (x=='NG').sum()), ('ALL','count')]).round(2).reset_index()

#filter out rows with 0 NG
mel_df2 = df1[df1.NG != 0]

#keep only the first row per Date
mel_df2 = mel_df2.drop_duplicates('Date')

#create New_ID by inserting a Series zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1), index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Output from data from question:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 1
8 002 1076 17.0 780 2018/11/20 1 1
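Judging by the desired output, the goal seems to be one ID per ('BldgID', 'BldgHt', 'Device') combination that is shared across all of that device's dates. If so, a sketch using groupby().ngroup() on the aggregated frame df1 may be closer to what you want (my own addition, not part of the original answer):

# number each unique (BldgID, BldgHt, Device) set, then zero-fill to 3 digits
df1['New_ID'] = (df1.groupby(['BldgID', 'BldgHt', 'Device']).ngroup() + 1).astype(str).str.zfill(3)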

Facebook Prophet: Providing different data sets to build a better model

My data frame looks like the one below. My goal is to predict event_id 3 based on the data of event_id 1 and event_id 2.
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of that data:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
So far my model is the one below. However, I am not telling the model that these are two separate events. It would be useful to consider all the data from the different events, as they belong to the same organizer and therefore provide more information than just one event. Is that kind of fitting possible with Prophet?
import pandas as pd
from fbprophet import Prophet  # on newer installs: from prophet import Prophet

# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True)
df.head()

# The important things to note are that cap must be specified for every row in the dataframe,
# and that it does not have to be constant. If the market size is growing, then cap can be an increasing sequence.
df['cap'] = 500

# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)

# periods is the number of days to look into the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can use holidays for this by setting the starting date of each event as a holiday; this informs Prophet about the events (and their peaks). I noticed that event 1 and event 2 overlap. I think you have multiple options to deal with this, and you need to ask yourself what the predictive value of each event is in relation to event 3. You don't have much data, which will be the main issue. If the events have equal value, you could change the dates of one event, for example shifting it 11 days earlier. In the unequal-value scenario you could drop one event.
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})

m = Prophet(growth='linear', holidays=events)
m.fit(df)
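If you go the route of shifting one event, a minimal sketch of that idea (my own addition, assuming df still carries the event_id column from the CSV and that event 2 is the one you move) would be:

# move every event 2 row 11 days earlier so the two sales histories no longer overlap
df['ds'] = pd.to_datetime(df['ds'])
df.loc[df['event_id'] == 2, 'ds'] -= pd.Timedelta(days=11)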
Also, I noticed you forecast on the cumulative sum. I think your events are stationary, so Prophet probably benefits from forecasting on the daily ticket sales rather than the cumsum.
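A minimal sketch of that last suggestion (again my own addition, reusing the column names from your CSV) is to fit on tickets_sold instead of the running total y:

daily = pd.read_csv('event_data_prophet.csv')
daily = daily[['ds', 'tickets_sold']].rename(columns={'tickets_sold': 'y'})
m_daily = Prophet(growth='linear', holidays=events)
m_daily.fit(daily)
forecast_daily = m_daily.predict(m_daily.make_future_dataframe(periods=10))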

How to map a function in pandas which compares each record in a column to previous and next records

I have a time series of water levels for which I need to calculate monthly and annual statistics in relation to several arbitrary flood stages. Specifically, I need to determine the duration per month that the water exceeded flood stage, as well as the number of times these excursions occurred. Additionally, because of the noise associated with the dataloggers, I need to exclude floods that lasted less than 1 hour as well as floods with less than 1 hour between events.
Mock up data:
import datetime
import numpy as np
import pandas as pd

start = datetime.datetime(2014, 9, 5, 12, 00)
daterange = pd.date_range(start, periods=10000, freq='30min', name="Datetime")
data = np.random.random_sample((len(daterange), 3)) * 10
columns = ["Pond_A", "Pond_B", "Pond_C"]
df = pd.DataFrame(data=data, index=daterange, columns=columns)
flood_stages = [('Stage_1', 4.0), ('Stage_2', 6.0)]
My desired output is:
                     Pond_A_Stage_1_duration  Pond_A_Stage_1_events
2014-09-30 12:00:00                     35.5                      2
2014-10-31 12:00:00                     40.5                     31
2014-11-30 12:00:00                      100                     16
2014-12-31 12:00:00                       36                     12
etc. for the duration and events at each flood stage, at each reservoir.
I've tried grouping by month, iterating through the ponds and then iterating through each row like:
grouper = pd.TimeGrouper(freq="1MS")
month_groups = df.groupby(grouper)
for name, group in month_groups:
    flood_stage_a = group.sum()[1]
    flood_stage_b = group.sum()[2]
    inundation_a = False
    inundation_30_a = False
    inundation_hour_a = False
    change_inundation_a = 0
    for level in group.values:
        if level[1]:
            inundation_a = True
        else:
            inundation_a = False
        if inundation_hour_a == False and inundation_a == True and inundation_30_a == True:
            change_inundation_a += 1
        inundation_hour_a = inundation_30_a
        inundation_30_a = inundation_a
But this is a caveman solution and the heuristics are getting messy, since I don't want to count a new event if a flood started in one month and continued into the next. This also doesn't combine events with less than one hour between their start and end. Is there a better way to compare a record to its previous and next records?
My other thought is to create new columns with the series shifted t+1, t+2, t-1, t-2, so I can evaluate each row once, but this still seems inefficient. Is there a smarter way to do this by mapping a function?
Let me give a quick, partial answer since no one has answered yet, and maybe someone else can do something better later on if this does not suffice for you.
You can do the time spent above flood stage pretty easily. I divided by 48 so the units are in days.
df[ df > 4 ].groupby(pd.TimeGrouper( freq = "1MS" )).count() / 48
Pond_A Pond_B Pond_C
Datetime
2014-09-01 15.375000 15.437500 14.895833
2014-10-01 18.895833 18.187500 18.645833
2014-11-01 17.937500 17.979167 18.666667
2014-12-01 18.104167 18.354167 18.958333
2015-01-01 18.791667 18.645833 18.708333
2015-02-01 16.583333 17.208333 16.895833
2015-03-01 18.458333 18.458333 18.458333
2015-04-01 0.458333 0.520833 0.500000
Counting distinct events is a little harder, but something like this will get you most of the way. (Note that this produces an unrealistically high number of flooding events, but that's just because of how the sample data is set up and not reflective of a typical pond, though I'm not an expert on pond flooding!)
for c in df.columns:
    df[c+'_events'] = ((df[c] > 4) & (df[c].shift() <= 4))

df.iloc[:, -3:].groupby(pd.TimeGrouper(freq="1MS")).sum()
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 306 291 298
2014-10-01 381 343 373
2014-11-01 350 346 357
2014-12-01 359 352 361
2015-01-01 355 335 352
2015-02-01 292 337 316
2015-03-01 344 360 386
2015-04-01 9 10 9
A couple things to note. First, an event can span months and this method will group it with the month where the event began. Second, I'm ignoring the duration of the event here, but you can adjust that however you want. For example, if you want to say the event doesn't start unless there are 2 consecutive periods below flood level followed by 2 consecutive periods above flood level, just change the relevant line above to:
df[c+'_events'] = ((df[c] > 4) & (df[c].shift(1) <= 4) &
                   (df[c].shift(-1) > 4) & (df[c].shift(2) <= 4))
That produces a pretty dramatic reduction in the count of distinct events:
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 70 71 72
2014-10-01 91 85 81
2014-11-01 87 75 91
2014-12-01 88 87 77
2015-01-01 91 95 94
2015-02-01 79 90 83
2015-03-01 83 78 85
2015-04-01 0 2 2
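To tie this back to the desired output format, here is a sketch (my own addition, not from the original answer) that applies both ideas to every pond and every flood stage. Note that pd.TimeGrouper has since been removed from pandas, with pd.Grouper as its replacement, and that the one-hour debouncing requirement from the question is still not handled here.

# one duration column (in days) and one event-count column per pond and flood stage
out = {}
monthly = pd.Grouper(freq="1MS")
for pond in columns:
    for stage_name, stage_level in flood_stages:
        above = df[pond] > stage_level
        # half-hour samples above the stage, converted to days
        out[f"{pond}_{stage_name}_duration"] = above.groupby(monthly).sum() / 48
        # rising edges: below the stage on the previous sample, above it now
        starts = above & ~above.shift(fill_value=False)
        out[f"{pond}_{stage_name}_events"] = starts.groupby(monthly).sum()
result = pd.DataFrame(out)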
