I am using Lifetimes to compute the CLV of some of my customers.
I have transactional data and, by means of summary_data_from_transaction_data (the implementation can be found here), I would like to compute
the recency, the frequency and the time interval T of each customer.
Unfortunately, it seems that the method does not compute the frequency correctly.
Here is the code for testing my dataset:
import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_library = summary_data_from_transaction_data(df_test,
                                                      'Customer',
                                                      'Transaction date',
                                                      observation_period_end='2020-02-12',
                                                      freq='D')
According to the code, the result is:
frequency recency T
Customer
1158624 18.0 389.0 401.0
1171970 67.0 396.0 406.0
1188564 12.0 105.0 401.0
The problem is that customers 1171970 and 1188564 made 69 and 14 transactions respectively, so the frequency should have been 68 and 13.
Printing the size of each customer confirms that:
print(df_test.groupby('Customer').size())
Customer
1158624 19
1171970 69
1188564 14
I did try to run the underlying code of summary_data_from_transaction_data natively, like this:
import numpy as np
import pandas as pd

# make sure the dates are actual datetimes before computing time differences
df_test['Transaction date'] = pd.to_datetime(df_test['Transaction date'])

RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = (
    pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
)
# subtract 1 from count, as we ignore their first order.
RFT_native["frequency"] = RFT_native["count"] - 1
RFT_native["T"] = (observation_period_end - RFT_native["min"]) / np.timedelta64(1, 'D')
RFT_native["recency"] = (RFT_native["max"] - RFT_native["min"]) / np.timedelta64(1, 'D')
As you can see, the result is indeed correct.
min max count frequency T recency
Customer
1171970 2019-01-02 15:45:39 2020-02-02 13:40:18 69 68 405.343299 395.912951
1188564 2019-01-07 18:10:55 2019-04-22 14:27:08 14 13 400.242419 104.844595
1158624 2019-01-07 10:52:33 2020-01-31 13:50:36 19 18 400.546840 389.123646
Of course my dataset is much bigger, and a slight difference in the frequency and/or recency changes the fit of the BGF model considerably.
What am I missing? Is there something that I should consider when using the method?
I might be a bit late to answer your query, but here goes.
The documentation for the Lifetimes package defines frequency as:
frequency represents the number of repeat purchases the customer has made. This means that it’s one less than the total number of purchases. This is actually slightly wrong. It’s the count of time periods the customer had a purchase in. So if using days as units, then it’s the count of days the customer had a purchase on.
So, it's basically the number of time periods in which the customer made a repeat purchase, not the number of individual repeat purchases. A quick scan of your sample dataset confirms that both 1188564 and 1171970 indeed made 2 purchases on a single day (13 Jan 2019 and 15 Jun 2019, respectively). Those same-day transactions are counted as 1 when calculating frequency, which is why summary_data_from_transaction_data returns a frequency 2 less than your manual count: 1 for the excluded first transaction and 1 for the purchases collapsed into a single day.
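You can check this directly: counting the unique purchase days per customer, minus one, reproduces the library's numbers. A minimal sketch, assuming 'Transaction date' parses cleanly as datetimes:
import pandas as pd

df_test = pd.read_csv('test_clv.csv', sep=',', parse_dates=['Transaction date'])

# frequency = number of distinct periods (here: days) with a purchase, minus 1
freq_check = (
    df_test.groupby('Customer')['Transaction date']
           .apply(lambda s: s.dt.normalize().nunique() - 1)
)
print(freq_check)  # should give 18, 67, 12 - matching the library output above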
According to the documentation, you need to set:
include_first_transaction = True
include_first_transaction (bool, optional) – Default: False By default
the first transaction is not included while calculating frequency and
monetary_value. Can be set to True to include it. Should be False if
you are going to use this data with any fitters in lifetimes package
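A minimal sketch of that call, reusing the dataframe from the question (and keeping in mind the caveat above: leave it at False if the summary will be fed to the lifetimes fitters):
RFT_with_first = summary_data_from_transaction_data(df_test,
                                                    'Customer',
                                                    'Transaction date',
                                                    observation_period_end='2020-02-12',
                                                    freq='D',
                                                    include_first_transaction=True)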
I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the latest date with recorded NAV data at least three months earlier. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what I could do. I only learnt it three days ago. Please help me.
This picture is what I want if it were in Excel. The yellow area is the original data. The white part is the intermediate step I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows before is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check first whether the data used for the division exists.
What you need is an implementation of a sliding-window maximum (for your example one week, i.e. 7 days).
I recreated your example as follows (to create the dataframe you have):
import datetime
from random import randint

import pandas as pd

# DataFrame.append was removed in pandas 2.0; build a list of rows instead
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
rows = []
for i in range(10):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows, columns=["fund code", "date", "NAV"])
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (with a variable window length in days) looks like this:
import datetime
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back; they can never be the max again
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The result will look like this, with an added max column. This assumes that each fund's rows are contiguous in the dataframe, but you could just as well keep a separate max_queue per fund.
Using a max queue to keep track of only the window maximum gives the correct O(n) complexity, which is important if you are dealing with huge datasets and especially bigger date ranges (months instead of a week).
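That covers the window maximum; for the original lookup itself (dividing today's NAV by the NAV recorded on the latest date at least 91 days earlier, i.e. the Excel max(if(...)) formula), one possible sketch uses pd.merge_asof. This assumes the dataframe above with real datetimes in date; lookup_date, past_NAV and return_3m are just illustrative names:
import pandas as pd

# for every row, look up the latest NAV recorded on or before (date - 91 days)
left = df.assign(lookup_date=df["date"] - pd.Timedelta(days=91))
past = df.rename(columns={"date": "past_date", "NAV": "past_NAV"})

merged = pd.merge_asof(
    left.sort_values("lookup_date"),
    past.sort_values("past_date"),
    left_on="lookup_date",
    right_on="past_date",
    by="fund code",
    direction="backward",  # latest past_date <= lookup_date
)

# past_NAV is NaN when no record goes back that far, so the existence check is built in
merged["return_3m"] = merged["NAV"] / merged["past_NAV"] - 1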
I have a dataframe similar to the one below:
Price  return  indicator
5      0.05    1
6      0.20    -1
5      -0.16   1
Where the indicator is based upon the forecasted return on the following day.
What I would like to achieve is a strategy where, when the indicator is positive 1, I buy the stock at the price on that date/row, and when the indicator is negative we sell at that price. Then I would like to create a new column that represents the value of the portfolio on each day. Assuming I have $1000 to invest, the value of the portfolio should equal the holdings plus the cash amount. I'm assuming that any fraction of a stock can be purchased.
I'm unsure where to start with this one. I tried calculating a buy-and-hold strategy using:
df['Holding'] = df['return'].add(1).cumprod() * 5000
This worked for a buy-and-hold strategy, but modifying it for the new strategy seems difficult.
I tried:
df['HOLDING'] = (df['return'].add(1).cumprod() * 5000 * df['Indicator'])
# to get the value of the buy or the sell
# then using
df['HOLDING'] = np.where(df['HOLDING'] > 0, df['HOLDING'], df['HON HOLDING 2'] * -1)
# my logic was: if it's positive it's the value of the stock holding, and if it's negative it is a cash inflow, so I made it positive as it would be cash.
The issue is that my logic is massively flawed: if the holding is cash, the return shouldn't apply to it. Further, I don't think using cumprod is correct for this strategy.
Has anyone used this strategy before and can offer tips on how to make it work?
Thank you
I'm not sure the returns and prices are in the correct place: they shouldn't really be in the same row if they represent the buying price (presumably yesterday's close) and the daily return (assuming the position was held for the whole day). But anyway...
import pandas as pd

# the data you provided
df = pd.read_csv("Data.csv", header=0)
# an initial starting row (explanation provided)
starting = pd.DataFrame({'Price': [0], 'return': [0], 'indicator': [0]})
# concatenate so starting is first row
df = pd.concat([starting, df]).reset_index(drop=True)
# setting holding to 0 at start (no shares), and cash at 1000 (therefore portfolio = 1000)
df[["Holding", "Cash", "Portfolio"]] = [0.0, 1000.0, 1000.0]
# buy/sell is the difference (explanation provided)
df["BuySell"] = df["indicator"].diff()
# simulating every day (df.loc avoids the chained assignment of df[col].iloc[i] = ...)
for i in range(1, len(df)):
    # buying
    if df.loc[i, "BuySell"] > 0:
        df.loc[i, "Holding"] += df.loc[i - 1, "Cash"] / df.loc[i, "Price"]
        df.loc[i, "Cash"] = 0
    # selling
    elif df.loc[i, "BuySell"] < 0:
        df.loc[i, "Cash"] = df.loc[i - 1, "Holding"] * df.loc[i, "Price"]
        df.loc[i, "Holding"] = 0
    # holding position
    else:
        df.loc[i, "Cash"] = df.loc[i - 1, "Cash"]
        df.loc[i, "Holding"] = df.loc[i - 1, "Holding"]
    # multiply holding by return (assuming all-in, so holding=0 not affected)
    df.loc[i, "Holding"] *= (1 + df.loc[i, "return"])
    df.loc[i, "Portfolio"] = df.loc[i, "Holding"] * df.loc[i, "Price"] + df.loc[i, "Cash"]
Explanations:
Starting row:
This is needed so that the loop can refer to the previous holdings and cash (otherwise the loop would need a special case for i=0).
Buy/Sell:
The difference is necessary here: if the position changes from buy to sell, the shares are sold (and vice versa), while if the previous signal is the same as the current row's there is no change (diff=0) and no shares are bought or sold.
Portfolio:
This is an "equivalent" amount (the amount you would hold if you converted all shares to cash at the time).
Holding:
This is the number of shares held.
NOTE: from what I understood of your question, this is an all-in strategy - there is no partial exposure, which keeps the strategy simple and easier to code.
Output:
#Out:
# Price return indicator Holding Cash Portfolio BuySell
#0 0 0.00 0 0.00 1000 1000.0 NaN
#1 5 0.05 1 210.00 0 1050.0 1.0
#2 6 0.20 -1 0.00 1260 1260.0 -2.0
#3 5 -0.16 1 211.68 0 1058.4 2.0
Hopefully this will give you a good starting point to create something more to your specification and more advanced, such as with multiple shares, or being a certain percentage exposed, etc.
I have data in Excel of employees and the number of hours worked in a week. I tagged each employee with the project he/she is working on. I can get the sum of hours worked on each project by doing a groupby as below:
util_breakup_sum = df[["Tag", "Bill. Hours"]].groupby("Tag").sum()
Bill. Hours
Tag
A61H 92.00
A63B 139.75
An 27.00
B32B 33.50
H 37.00
Manager 8.00
PP 23.00
RP0117 38.50
Se 37.50
However, when I try to calculate the average time spent on each project per person, it gives me (sum / total number of entries), whereas the correct average should be (sum / number of unique employees in the group).
Example of mean is given below:
util_breakup_mean = df[["Tag", "Bill. Hours"]].groupby("Tag").mean()
Bill. Hours
Tag
A61H 2.243902
A63B 1.486702
An 1.000000
B32B 0.712766
H 2.055556
Manager 0.296296
PP 1.095238
RP0117 1.425926
Se 3.750000
For example, group A61H has just two employees, so their average should be (92/2) = 46. However, the code is dividing by the total number of entries for these employees, hence giving an average of 2.24.
How to get the average from unique employee names in the group?
Try:
df.groupby("Tag")["Bill. Hours"].sum().div(df.groupby("Tag")["Employee"].nunique()
Where Employee is column identifying employees.
You can try nunique, dividing along the index so the Tag labels align:
util_breakup_mean = util_breakup_sum.div(df.groupby("Tag")['Employee'].nunique(), axis=0)
I'm new to Pyomo and trying to use data from my pandas dataframe as parameters within the optimisation model. The dataframe looks like this:
Ticker Margin Avg. Volume M_ratio V_ratio
Index
0 ES1 6600.00 1250970 0.126036 0.212996
1 TY1 1150.00 1232311 0.021961 0.209819
2 FV1 700.00 488906 0.013367 0.083244
3 TU1 570.00 293885 0.010885 0.050038
4 ED3 500.00 137802 0.009548 0.023463
5 NQ1 7500.00 427061 0.143223 0.072713
6 FDAX1 24074.12 98838 0.459728 0.016829
7 FESX1 2641.28 832836 0.050439 0.141803
8 FGBL1 2502.75 546878 0.047793 0.093114
9 FGBM1 1042.10 330517 0.019900 0.056275
10 FGBS1 262.97 232801 0.005022 0.039638
11 F2MX1 4822.81 398 0.092098 0.000068
The model I'm constructing aims to find the maximum number of contracts one may hold across all assets, based on balance and a number of constraints.
I need to iterate through the rows in order to add all the relevant data to model.utilisation:
from pyomo.environ import ConcreteModel, Param, Var, Objective, NonNegativeReals, maximize

model = ConcreteModel()
model.Vw = Param()  # <- V_ratio from df
model.M = Param()   # <- Margin from df
model.L = Var(domain=NonNegativeReals)
model.utilisation = Objective(expr=model.M * model.L, sense=maximize)
Effectively it needs to take in the Margin for each ticker and determine how many contracts you can hold relative to the balance - i.e.
(ES1 Margin * model.L) + (TY1 Margin * model.L) etc. throughout the dataframe.
I've tested the logic by plugging in dummy data and it seems to work, but it's not efficient to write in each piece of data and add it to the utilisation objective by hand, as I have hundreds of rows in my dataframe.
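From reading around, I suspect what I actually need is an indexed formulation rather than one component per row. A sketch of what I mean (assuming the dataframe is called df and the Ticker values are unique):
from pyomo.environ import (ConcreteModel, Set, Param, Var, Objective,
                           NonNegativeReals, maximize)

data = df.set_index('Ticker')

model = ConcreteModel()
model.TICKERS = Set(initialize=data.index.tolist())
model.M = Param(model.TICKERS, initialize=data['Margin'].to_dict())    # Margin per ticker
model.Vw = Param(model.TICKERS, initialize=data['V_ratio'].to_dict())  # V_ratio per ticker
model.L = Var(model.TICKERS, domain=NonNegativeReals)                  # contracts per ticker

# (ES1 Margin * L[ES1]) + (TY1 Margin * L[TY1]) + ... across the whole dataframe
model.utilisation = Objective(
    expr=sum(model.M[t] * model.L[t] for t in model.TICKERS),
    sense=maximize,
)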
Apologies if there are some blinding errors, very new to Pyomo
I have data with two columns: one is description and the other is publishedAt. I applied the sort function to the publishedAt column and got the output in descending order of date. Here is a sample of my dataframe:
description publishedAt
13 Bitcoin price has failed to secure momentum in... 2018-05-06T15:22:22Z
16 Brian Kelly, a long-time contributor to CNBC’s... 2018-05-05T15:56:48Z
2 The bitcoin price is less than $100 away from ... 2018-05-05T13:14:45Z
12 Mati Greenspan, a senior analyst at eToro and ... 2018-05-04T16:05:37Z
52 A Singaporean startup developing ‘smart bankno... 2018-05-04T14:02:30Z
75 Cryptocurrencies are set to make a comeback on... 2018-05-03T08:10:19Z
76 The bitcoin price is hovering near its best le... 2018-04-30T16:26:57Z
74 In today’s climate of ICOs with 100 billion to... 2018-04-30T12:03:31Z
27 Investment guru Warren Buffet remains unsold o... 2018-04-29T17:22:19Z
22 The bitcoin price has increased by around $400... 2018-04-28T12:28:35Z
68 Bitcoin futures volume reached an all-time hig... 2018-04-27T16:32:44Z
14 Biotech-company-turned-cryptocurrency-investme... 2018-04-27T14:25:15Z
67 The bitcoin price has rebounded to $9,200 afte... 2018-04-27T06:24:42Z
Now I want the descriptions from the last 3 hours, 6 hours, 12 hours and 24 hours.
How can I find them?
Thanks
As a simple solution within pandas, you can use the DataFrame.last(offset) function. Be sure to set the publishedAt column as the dataframe's DatetimeIndex. A similar function to get rows at the start of a dataframe is DataFrame.first(offset).
Here is an example using the provided offsets:
df.last('24h')
df.last('12h')
df.last('6h')
df.last('3h')
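Note that DataFrame.last needs a sorted DatetimeIndex to slice on, so convert and set the index first. A minimal setup sketch, assuming publishedAt arrives as strings:
import pandas as pd

df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df = df.set_index('publishedAt').sort_index()
df.last('3h')['description']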
Assuming that the dataframe is called df
import datetime as dt
import pandas as pd

df['publishedAt'] = pd.to_datetime(df['publishedAt'], utc=True)  # must be datetimes, not strings
df[df['publishedAt'] >= (pd.Timestamp.now(tz='UTC') - dt.timedelta(hours=3))]['description']  # hours = 6, 12, 24
If you need the intervals to be exclusive, i.e. the descriptions within the last 6 hours but not the ones within the last 3 hours, you'll need to combine array-like logical operators from numpy, such as numpy.logical_and(arr1, arr2), inside the first bracket.
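For example, a mask for "within the last 6 hours but not the last 3" might look like this (assuming publishedAt has already been converted to datetimes as above):
import datetime as dt
import numpy as np
import pandas as pd

now = pd.Timestamp.now(tz='UTC')
mask = np.logical_and(df['publishedAt'] >= now - dt.timedelta(hours=6),
                      df['publishedAt'] < now - dt.timedelta(hours=3))
df[mask]['description']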