I am working with a subscription based data set of which this is an exemplar:
import pandas as pd
import numpy as np
from datetime import timedelta
start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
cancel_date = [d + timedelta(days=np.random.exponential(scale=100)) for d in start_date]
churned = [bool(np.random.randint(0, 2)) for _ in range(len(start_date))]
df = pd.DataFrame(
    {"start_date": start_date,
     "cancel_date": cancel_date,
     "churned": churned}
)
# truncate cancel_date to midnight while keeping the datetime64[ns] dtype
df["cancel_date"] = df["cancel_date"].dt.normalize()
I need a way to calculate monthly customer churn in Python using the following steps:
First, I need the number of subscriptions that started before the 1st of each month and are still active.
Second, I need the number of subscriptions that started before the 1st of each month and were cancelled after the 1st of that month.
Together, these two counts form the denominator of the monthly calculation.
Finally, I need the number of subscriptions that were cancelled during each month.
This count is the numerator of the monthly calculation.
The numerator is divided by the denominator and multiplied by 100 to obtain the percentage of customers that churn each month.
I am really lost with this problem. Can someone please point me in the right direction? I have been working on it for a long time.
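A minimal sketch of those steps, using the example df built above (the month grid and the active/cancelled masks are one reading of the question, not a canonical method):
import pandas as pd
import numpy as np

# 1st of each month covered by the data
month_starts = pd.date_range("2015-02-01", "2022-09-01", freq="MS")

rows = []
for m0 in month_starts:
    m1 = m0 + pd.offsets.MonthBegin(1)  # 1st of the following month
    started_before = df["start_date"] < m0
    # denominator: started before the 1st and still active on the 1st
    # (never churned, or churned with a cancel date on/after the 1st)
    active = started_before & (~df["churned"] | (df["cancel_date"] >= m0))
    # numerator: of those, cancelled during this month
    cancelled = active & df["churned"] & (df["cancel_date"] < m1)
    denom = active.sum()
    rows.append({"month": m0,
                 "churn_pct": 100 * cancelled.sum() / denom if denom else np.nan})

monthly_churn = pd.DataFrame(rows)
print(monthly_churn.head())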
Working with loan data.
I have a dataframe with the columns:
df_irr = df1[['id', 'funded_amnt_t', 'Expect_NoPayments','installment']]
ID of the Loan | Funded Amount | Expected Number of Payments | Fixed Installment of the Annuity
I have estimated the number of payments with regression analysis.
The loans have 36 or 60 months' maturity.
Now I am trying to calculate the expected IRR (internal rate of return),
but I am stuck.
I was planning to use numpy.irr.
However, I have not been able to use it, as my data is not in the right format.
I have tried pandas pivot and reshape functions. No luck.
Time series of cash flows:
- Columns: Months 0, ..., 60
- Rows: ID for each loan
- Value in Month 0 = -funded_amnt
- Values in Months 1-60: installment if expected_number_of_payments >= month
My old Stata code was:
keep id installment funded_amnt expectednumberofpayments
sort id
expand 61, generate(expand)
bysort id : gen month = _n
gen cf = 0
replace cf = installment if (expectednumberofpayments+1)>=month
replace cf = funded_amnt*-1 if month==1
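For reference, a rough pandas equivalent of that Stata logic might look like this (a sketch; it assumes the df_irr columns named above and a default integer index):
import numpy as np
import pandas as pd

# one row per loan and month 0..60, mirroring Stata's `expand 61`
months = np.arange(61)
cf = (df_irr.loc[df_irr.index.repeat(61)]
            .assign(month=np.tile(months, len(df_irr))))

# installment while payments are still expected, otherwise 0
cf["cf"] = np.where(cf["month"] <= cf["Expect_NoPayments"], cf["installment"], 0.0)
# month 0 is the payout of the funded amount
cf.loc[cf["month"] == 0, "cf"] = -cf["funded_amnt_t"]

# wide matrix: rows = loan id, columns = months 0..60
cf_matrix = cf.pivot(index="id", columns="month", values="cf")
Each row of cf_matrix could then be fed to an IRR routine, although for a constant annuity the rate-based answer below is simpler.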
numpy.irr is the wrong function to use here. That function is for irregular payments (e.g. $100 in month 1, $0 in month 2, and $400 in month 3). Since your loans pay a fixed installment, you want numpy.rate instead. I'm making some assumptions about your data for this solution:
import numpy as np

# nper: expected number of payments; pmt: installment received each period.
# The funded amount is an outflow from the lender, so pv enters with a
# negative sign, and fv=0 once the annuity is fully paid down.
df_irr['rate'] = np.rate(nper=df_irr['Expect_NoPayments'],
                         pmt=df_irr['installment'],
                         pv=-df_irr['funded_amnt_t'],
                         fv=0)
More information can be found in the numpy documentation. (In recent NumPy the financial functions have moved to the separate numpy-financial package, where this is numpy_financial.rate.)
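If the installments are monthly (an assumption about your data), the rate that comes back is a monthly rate, and one way to express it as an annual IRR is to compound it:
# assuming monthly installments, compound the periodic rate to an annual IRR
df_irr['irr_annual'] = (1 + df_irr['rate']) ** 12 - 1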
I'm trying to figure out how to subtract a monthly loan payment with daily compounded interest. Right now I think I've got the right code for subtracting the payment amount daily over a 10-year loan:
P = 20000        # principal
r = .068         # annual interest rate
t = 10           # years
n = 365          # compounding periods per year
payment = 200

for payment_number in range(1, n * t):
    daily_interest = P * (1 + (r / n)) - P
    P = (P + daily_interest) - payment
    print(P)
I'd like, if possible, to still print the daily balances but subtract the payment every month rather than every day. Initially I thought maybe to use a nested for loop with range(1, 30), but I'm not sure that worked correctly. Thanks in advance for the suggestions!
What about inserting an if statement for the purpose?
P = P - 200 if payment_number % 30 == 0 else P
This subtracts the payment only when payment_number is a multiple of 30.
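Put together, the loop might look like this (a sketch; treating every 30th day as a payment date is an approximation of "monthly"):
P = 20000
r = .068
t = 10
n = 365
payment = 200

for payment_number in range(1, n * t):
    P = P * (1 + r / n)               # daily compounding
    if payment_number % 30 == 0:      # crude 30-day "month"
        P = P - payment
    print(P)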
"Monthly" is a complicated idea. To fully handle the months you will need to use the datetime module.
from datetime import date, timedelta
date_started = date(2000,1,1)
Say you are 123 days from the start date; we need to calculate that date:
current_date = date_started + timedelta(days=123)
>>> current_date
datetime.date(2000, 5, 3)
So now we need to figure out how many days are between that date and the first of the month; the dateutil module can help us with that (you may have to install it).
from dateutil.relativedelta import relativedelta
firstofmonth_date = date_started + relativedelta(months=4)
tddays = firstofmonth_date - date_started
days = tddays.days
Then just put "days" into the function you already have and you should be good. The only part I left for you to do is figuring out how many months have passed between your dates.
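One way to fill in that last step, counting whole months between two dates with relativedelta (a sketch under the same imports):
from datetime import date
from dateutil.relativedelta import relativedelta

date_started = date(2000, 1, 1)
current_date = date(2000, 5, 3)

# whole calendar months elapsed between the two dates
delta = relativedelta(current_date, date_started)
months_passed = delta.years * 12 + delta.months
print(months_passed)  # 4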
Assume I have a time series covering a number of years, as in:
rng = pd.date_range(start='2001-01-01', periods=5113)
ts = pd.Series(np.random.randn(len(rng)), rng)
Then I can calculate its standard year (the average value of each calendar day over all years) by doing:
std = ts.groupby([ts.index.month, ts.index.day]).mean()
Now I was wondering how I could subtract this standard year from my multi-year time series, in order to get a time series that shows which days were below or above the standard.
You can do this with the same groupby: just subtract each group's mean from the values in that group:
average_diff = ts.groupby([ts.index.month, ts.index.day]).apply(
    lambda g: g - g.mean()
)
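An alternative worth noting: transform aligns the group means back to the original index, which avoids the extra index levels apply can introduce (a sketch):
# same idea; the result keeps the original DatetimeIndex of ts
anomaly = ts - ts.groupby([ts.index.month, ts.index.day]).transform('mean')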
I have some code which calculates the beta of the S&P 500 vs any stock - in this case the ticker symbol "FET". However, the result seems completely different from what I am seeing on Yahoo Finance. Historically this stock has been very volatile, which would explain the beta value of 1.55 on Yahoo Finance - http://finance.yahoo.com/q?s=fet. Can someone please advise why I am seeing a completely different number (0.0088)? Thanks in advance.
from pandas.io.data import DataReader
from datetime import datetime
from datetime import date
import numpy
import sys
today = date.today()
stock_one = DataReader('FET','yahoo',datetime(2009,1,1), today)
stock_two = DataReader('^GSPC','yahoo',stock_one['Adj Close'].keys()[0], today)
a = stock_one['Adj Close'].pct_change()
b = stock_two['Adj Close'].pct_change()
covariance = numpy.cov(a[1:],b[1:])[0][1]
variance = numpy.var(b[1:])
beta = covariance / variance
print 'beta value ' + str(beta)
Ok, so I played with the code a bit and this is what I have.
from pandas.io.data import DataReader
import pandas.io.data as web
from datetime import datetime
from datetime import date
import numpy
import sys
start = datetime(2009, 1, 1)
today = date.today()
stock1 = 'AAPL'
stock2 = '^GSPC'
stocks = web.DataReader([stock1, stock2],'yahoo', start, today)
# stock_two = DataReader('^GSPC','yahoo', start, today)
a = stocks['Adj Close'].pct_change()
covariance = a.cov()                   # covariance matrix of the return series
variance = a.var()                     # variance of each return series
var = variance[stock2]                 # variance of the index (^GSPC)
cov = covariance.loc[stock2, stock1]   # covariance between stock and index
beta = cov / var
print "The Beta for %s is: " % (stock1), str(beta)
The lengths of the two price series did not match, so that was problem #1. Also, the covariance matrix holds a value for every pair, and you only want the off-diagonal entry: you don't need a beta based on cov(0,0) or cov(1,1), you just need cov(0,1) or cov(1,0). Those are the positions in the matrix, not the values.
Anyway, here is the answer I got:
The Beta for AAPL is: 0.885852632799
* Edit *
Made the code easier to run, and changed it so there is only one line for inputting what stocks you want to pull from Yahoo.
You need to convert the closing prices into the correct format for the calculation: the prices should be converted into percentage returns for both the index and the stock.
In order to match Yahoo Finance, you need to use three years of monthly Adjusted Close prices.
https://help.yahoo.com/kb/finance/SLN2347.html?impressions=true
Beta
The Beta used is Beta of Equity. Beta is the monthly price change of a particular company relative to the monthly price change of the S&P500. The time period for Beta is 3 years (36 months) when available.
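A sketch of that monthly calculation (it assumes a DataFrame px of daily Adjusted Close prices with a DatetimeIndex and one column per ticker; the 36-month window follows the Yahoo note above):
import pandas as pd

# month-end prices over the most recent 3 years: 37 prices -> 36 returns
monthly = px.resample('M').last().tail(37)
rets = monthly.pct_change().dropna()

beta = rets['FET'].cov(rets['^GSPC']) / rets['^GSPC'].var()
print(beta)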
I have financial trade data (timestamped with the trade time, so there are duplicate times and the datetimes are irregularly spaced). Basically I have just a datetime column and a price column in a pandas dataframe, and I've calculated returns, but I want to linearly interpolate the data so that I can get an estimate of prices every second, minute, day, etc...
It seems the best way to do this is to treat the beginning of a Tuesday as occurring just after the end of Monday, essentially modding out the time between days. Does pandas provide an easy way to do this? I've searched the documentation and found BDay, but that doesn't seem to do what I want.
Edit: Here's a sample of my code:
import numpy as np
import pandas as pd

df = pd.read_csv(filePath, usecols=[0, 4])  # column 0 is date_time and column 4 is price
df.date_time = pd.to_datetime(df.date_time, format='%m-%d-%Y %H:%M:%S.%f')

def get_returns(df):
    return np.log(df.Price.shift(1) / df.Price)
But my issue is that this is trade data, so I have every trade that occurs for a given stock over some time period, trading happens only during the trading day (9:30 am - 4 pm), and the data is timestamped. I can take the price at which every trade happens and make a price series, but when I calculate kurtosis and other stylized facts I get very strange results, because these statistics are usually run on evenly spaced time series data.
What I started to do was write code to interpolate my data linearly so that I could get the price every 10 seconds, minute, 10 minutes, hour, day, etc. However, with weekends, holidays, and all the hours when trading can't happen, I want to make Python think that the only time that exists is during a business day, so that my real-world times still match up with the correct datetimes, but without needing a price stamp for all the times when trading is closed.
import datetime

def lin_int_tseries(series, timeChange):
    tDelta = datetime.timedelta(seconds=timeChange)
    data_times = series['date_time']
    new_series = []
    sample_times = []
    sample_times.append(data_times[0])
    # build an evenly spaced grid of sample times covering the data
    while max(sample_times) < max(data_times):
        sample_times.append(sample_times[-1] + tDelta)
    for position, time in enumerate(sample_times):
        try:
            # exact match: reuse the observed trade
            ind = data_times.index(time)
            new_series.append(series[ind])
        except ValueError:
            t_next = getnextTime(time, data_times)  # get next largest timestamp in data
            t_prev = getprevTime(time, data_times)  # get next smallest timestamp in data
            ind_next = data_times.index(t_next)     # index of next largest timestamp
            ind_prev = data_times.index(t_prev)     # index of next smallest timestamp
            p_next = series[ind_next][1]            # price at next timestamp
            p_prev = series[ind_prev][1]            # price at prev timestamp
            omega = (time - t_prev) / (t_next - t_prev)  # linear interpolation weight
            p_interp = (1 - omega) * p_prev + omega * p_next
            new_series.append([time, p_interp])
    return new_series
Sorry if it's still unclear. I just want to find some way to stitch the end of one trading day to the beginning of the next trading day, while not losing the actual datetime information.
You should use pandas resample:
df = df.resample("D").mean()
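To go from irregular ticks to an evenly spaced series while dropping the closed hours, a sketch (the 10-second grid and the 9:30-16:00 window are taken from the question; the per-bucket aggregation choice is an assumption):
# assumes df has a 'date_time' column of trade times and a 'Price' column
prices = df.set_index('date_time')['Price']

# evenly spaced 10-second grid, linearly interpolated between trades
regular = prices.resample('10s').mean().interpolate(method='linear')

# keep only the trading day, so the closed hours drop out of the series
regular = regular.between_time('09:30', '16:00')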