Working with loan data.
I have a dataframe with the columns:
df_irr = df1[['id', 'funded_amnt_t', 'Expect_NoPayments','installment']]
ID of the loan | Funded amount | Expected number of payments | Fixed installment of the annuity.
I have estimated the number of payments with regression analysis.
The loans have a maturity of 36 or 60 months.
Now I am trying to calculate the expected IRR (internal rate of return), but I am stuck.
I was planning to use numpy.irr
However, I have not been able to use it, as my data is not in the right format.
I have tried pandas pivot and reshape functions. No Luck.
Time series of cash flows:
- Columns: months 0, ..., 60
- Rows: one row per loan ID
- Value in month 0: -funded_amount
- Values in months 1-60: installment if expected_number_of_payments >= month, otherwise 0
My old Stata code was:
keep id installment funded_amnt expectednumberofpayments
sort id
expand 61, generate(expand)
bysort id : gen month = _n
gen cf = 0
replace cf = installment if (expectednumberofpayments+1)>=month
replace cf = funded_amnt*-1 if month==1
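For reference, here is a rough pandas equivalent of that Stata expansion (just a sketch, using the column names from df_irr above, a 60-month horizon, and the same >= condition as the Stata code):

import numpy as np
import pandas as pd

# One row per loan, one column per month 0..60.
# Month 0 holds the (negative) funded amount; months 1..60 hold the installment
# for as long as payments are still expected, otherwise 0.
cf = pd.DataFrame(
    {m: np.where(df_irr['Expect_NoPayments'] >= m, df_irr['installment'], 0.0)
     for m in range(1, 61)},
    index=df_irr['id'],
)
cf.insert(0, 0, -df_irr['funded_amnt_t'].to_numpy())

# Monthly IRR per loan (np.irr in older NumPy; numpy_financial.irr in newer versions)
monthly_irr = cf.apply(np.irr, axis=1)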
numpy.irr is the wrong function to use. That function is for irregular cash flows (e.g. $100 in month 1, $0 in month 2, and $400 in month 3). Since your loans pay a fixed installment, you want numpy.rate instead. I'm making some assumptions about your data for this solution:
import numpy as np
# Sign convention: pv (the funded amount paid out today) and pmt must have
# opposite signs; fv is a required argument and is 0 for a fully repaid loan.
df_irr['rate'] = np.rate(nper=df_irr['Expect_NoPayments'],
                         pmt=df_irr['installment'],
                         pv=-df_irr['funded_amnt_t'],
                         fv=0)
More information can be found in the numpy documentation.
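Note that with monthly installments the rate that comes out is a monthly rate, so you may want to annualize it. Also, in recent NumPy releases the financial functions have moved to the separate numpy_financial package, where the same call is numpy_financial.rate. A small follow-up sketch:

# Compound the monthly rate into an annualized IRR
df_irr['annual_irr'] = (1 + df_irr['rate'])**12 - 1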
Related
I am working with a subscription based data set of which this is an exemplar:
import random
from datetime import timedelta

import numpy as np
import pandas as pd

start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
cancel_date = [sd + timedelta(days=np.random.exponential(scale=100)) for sd in start_date]
churned = [bool(random.randint(0, 1)) for _ in range(len(start_date))]
df = pd.DataFrame(
    {"start_date": start_date,
     "cancel_date": cancel_date,
     "churned": churned}
)
df["cancel_date"] = df["cancel_date"].dt.date
df["cancel_date"] = df["cancel_date"].astype("datetime64[ns]")
I need a way to calculate monthly customer churn in python using the following steps:
Firstly, I need to obtain the number of subscriptions that started before the 1st of each month that are still active
Secondly, I need to obtain the number of subscriptions that started before the 1st of each month and which were cancelled after the 1st of each month
These two steps constitute the denominator of the monthly calculation
Finally, I need to obtain the number of subscriptions that cancelled in each month
This step produces the numerator of the monthly calculation.
The numerator is divided by the denominator and multiplied by 100 to obtain the percentage of customers that churn each month.
I am really lost with this problem; can someone please point me in the right direction? I have been working on it for a long time.
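For what it's worth, here is a rough sketch of those three steps in pandas (just an illustration based on the df constructed above; the month range is an assumption):

# Month starts covering the observation window (range chosen for this example)
month_starts = pd.date_range("2015-02-01", "2022-09-01", freq="MS")

rows = []
for m in month_starts:
    started_before = df["start_date"] < m
    # Denominator: started before the 1st and still active on the 1st,
    # i.e. never churned or cancelled on/after the 1st of this month
    at_risk = (started_before & (~df["churned"] | (df["cancel_date"] >= m))).sum()
    # Numerator: subscriptions cancelled during this month
    next_m = m + pd.offsets.MonthBegin(1)
    cancelled = (df["churned"] & (df["cancel_date"] >= m) & (df["cancel_date"] < next_m)).sum()
    rows.append({"month": m, "churn_pct": 100 * cancelled / at_risk if at_risk else np.nan})

monthly_churn = pd.DataFrame(rows)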
I have the following code that starts like this:
# Import libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(10)
The file that I import can be downloaded from here: Data
And it looks like this:
Then I group the values and create two metrics, Rolling Year (RY_ACTUAL) and (RY_LAST), which tell me the sales of each category (for example the Blue category) over the last twelve months. These metrics work fine:
# ROLLING YEAR
# I want a rolling year for each category, i.e. how much each category sold from 12 months ago to the current month
# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling window
f = lambda x: x.rolling(12).sum()
df_group["RY_ACTUAL"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: a rolling window of 24 months, to compare the current RY vs the previous RY
f_1 = lambda x: x.rolling(24).sum()
df_group["RY_24"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the amount of the previous rolling year (RY-1)
df_group["RY_LAST"] = df_group["RY_24"] - df_group["RY_ACTUAL"]
My problem is with the metric called Year To Date, which is simply the accumulated sales of each category from January up to the month where you read the table; for example, if I stop in March 2015, it tells me how much each category sold from January to March. The column I created, YTD_ACTUAL, does just that, and I compute it like this:
# YTD_ACTUAL
df_group['YTD_ACTUAL'] = df_group.groupby(["CATEGORY","DATE"]).Sales.cumsum()
However, what I have not been able to do is the YTD_LAST column, i.e. the same metric for the previous year: recalling the example where we stopped in March 2015, for the Blue category it should return the accumulated sales of that category from January to March of 2014.
My try >.<
#YTD_LAST
df_group['YTD_LAST'] = df_group.groupby(["CATEGORY", "DATE"]).Sales.apply(f)
Could someone help me to make this column correctly?
Thank you in advance, community!
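For what it's worth, one possible approach (just a sketch, assuming df_group has exactly one row per CATEGORY and month and is sorted by DATE, so that twelve rows back within a category is the same month of the previous year) is:

# Same period last year: shift each category's YTD series back 12 months
df_group["YTD_LAST"] = df_group.groupby("CATEGORY")["YTD_ACTUAL"].shift(12)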
I have a table of "expected payments" that either are a one-time payment, or recurring monthly, quarterly or yearly. The goal is to calculate the total amount of expected payments for each of the upcoming 12 months.
The model looks like:
class ExpectedPayment(models.Model):
    id = models.BigIntegerField(primary_key=True)
    period = models.BigIntegerField()
    date_start = models.DateField(blank=True, null=True)
    date_end = models.DateField(blank=True, null=True)
    amount = models.DecimalField(max_digits=1000, decimal_places=1000)
Period 1 = Monthly recurring
Period 2 = Quarterly recurring, following the start date (so if it started in February, the next payment is in May)
Period 3 = Yearly recurring, also following the start date
Period 4 = One time payment
How can I properly calculate the total "amount" for each month, while taking into account when a payment should take place? Especially the quarterly payments.
I can't figure out how to incorporate when a payment started and ended, as it might have started halfway through the year, ended three months later, or already be finished. How do I know in which month a quarterly payment should count as an expected payment, when it is always relative to its start_date?
An example for a single month: (every monthly payment that is active in the current month)
start_of_current_month = datetime.today().replace(day=1)
current_month_expected_monthly_payments = ExpectedPayment.objects.filter(period=1, date_start__lte=start_of_current_month, date_end__gte=start_of_current_month).aggregate(total=Sum("amount"))
Assume that for every given period the payment is collected at the beginning of the period. For example, for a quarterly payment beginning in February, the amount for February, March and April is collected in February itself. For reporting you want to spread that as amount/3, since you are reporting on a monthly basis. Here is what I think:
For every period you can calculate the monthly amount by using an annotation:
from django.db.models import Case, F, Sum, When

ExpectedPayment.objects.filter(
    date_start__lte=start_of_current_month,
    date_end__gte=start_of_current_month
).annotate(
    monthly_amount=Case(
        When(period=2, then=F('amount') / 3),
        When(period=3, then=F('amount') / 12),
        default=F('amount'),
    )
).aggregate(monthly_sum=Sum('monthly_amount'))
You would not need a separate case for monthly payments, since you are already reporting monthly amounts. For one-time payments, the amount will only be counted if the transaction takes place in that month, so no separate case is needed for those either.
I referred to the documentation here:
Conditional expressions
Query expressions
Here is an issue that uses annotation and aggregation which I think you might find useful.
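To extend this to the upcoming twelve months, one option (a sketch, reusing the monthly_amount annotation above; the month-advancing trick is just one way to do it) is to run the same query once per month start:

from datetime import date, timedelta

from django.db.models import Case, F, Sum, When

totals = {}
month_start = date.today().replace(day=1)
for _ in range(12):
    qs = (ExpectedPayment.objects
          .filter(date_start__lte=month_start, date_end__gte=month_start)
          .annotate(monthly_amount=Case(
              When(period=2, then=F('amount') / 3),
              When(period=3, then=F('amount') / 12),
              default=F('amount'),
          )))
    totals[month_start] = qs.aggregate(monthly_sum=Sum('monthly_amount'))['monthly_sum']
    # Jump to the first day of the next month
    month_start = (month_start.replace(day=28) + timedelta(days=4)).replace(day=1)

This keeps the same simplification as above: quarterly and yearly amounts are spread evenly rather than aligned to each payment's own start month.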
I'm using LightGBM to solve a time-series regression problem using decision tree methods (determining the price of strawberries over several years). The function lightgbm.Dataset accepts a list of categorical features, and I'm not sure if time features should be included in the list.
I've separated my time index data into year, month, season etc:
df['year'] = pd.DatetimeIndex(df.index).year
df['month'] = pd.DatetimeIndex(df.index).month
df['week'] = np.int64(pd.DatetimeIndex(df.index).isocalendar().week)
df['day'] = pd.DatetimeIndex(df.index).dayofweek # Mon=0, ..., Sun=6
df['weekend'] = np.int64(pd.DatetimeIndex(df.index).dayofweek >= 5) # weekday=0, weekend=1
# 1=winter, 2=spring, 3=summer, 4=autumn
df['season'] = pd.DatetimeIndex(df.index).month%12 // 3 + 1
# national public holidays
df['hols'] = pd.Series(pd.DatetimeIndex(df.index)).apply(lambda x: holidays.CountryHoliday('BEL').get(x)).values.astype('bool').astype('int')
Now I'm trying to determine which of these should be classified as categorical variables. I've had a look at this Data Science post, but it still seems inconclusive.
I have used this kind of temporal variable in the past, but I think there is a drawback to using it like that. For instance, take a look at season and these 3 dates:
January 10 -> winter
March 19 -> winter
March 20 -> spring
Which days are nearest in your case: (10-01, 19-03) or (19-03, 20-03)? IMHO, the price of strawberries should probably be closer between (19-03, 20-03) than between (10-01, 19-03), even though they are not in the same season.
I had a similar problem with DayOfYear (1->365):
2019-12-31: 365
2020-01-01: 1
2020-07-01: 183
2020-12-31: 365
The distance between days was not representative. I solved it by using this formula:
Value of day = min(365 - DoY(day), DoY(day) - 1)
2019-12-31: 0
2020-01-01: 0
2020-07-01: 182
2020-12-31: 0
It was the right choice for European electricity consumption because the load is really seasonal.
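If you want to try the same idea on the features from the question, a small sketch (assuming the df from the question; the sine/cosine encoding is a common alternative I'm adding, not something from the answer above) could be:

# Distance-style encoding: how far a day is from the year boundary,
# so 31 December and 1 January end up close together
doy = pd.DatetimeIndex(df.index).dayofyear
df['doy_dist'] = np.minimum(365 - doy, doy - 1)

# Cyclical (sine/cosine) encoding of the same idea
df['doy_sin'] = np.sin(2 * np.pi * doy / 365)
df['doy_cos'] = np.cos(2 * np.pi * doy / 365)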
I have financial trade data (timestamped with the trade time, so there are duplicate times and the datetimes are irregularly spaced). Basically I have just a datetime column and a price column in a pandas dataframe, and I've calculated returns, but I want to linearly interpolate the data so that I can get an estimate of prices every second, minute, day, etc...
It seems the best way to do this is to treat the beginning of Tuesday as occurring just after the end of Monday, essentially modding out by the time between days. Does pandas provide an easy way to do this? I've searched the documentation and found BDay, but that doesn't seem to do what I want.
Edit: Here's a sample of my code:
df = pd.read_csv(filePath, usecols=[0, 4])  # column 0 is date_time and column 4 is price
df.date_time = pd.to_datetime(df.date_time, format='%m-%d-%Y %H:%M:%S.%f')

def get_returns(df):
    return np.log(df.Price.shift(1) / df.Price)
But my issue is that this is trade data, so that I have every trade that occurs for a given stock over some time period, trading happens only during a trading day (9:30 am - 4 pm), and the data is timestamped. I can take the price that every trade happens at and make a price series, but when I calculate kurtosis and other stylized facts, I'm getting very strange results because these sorts of statistics are usually run on evenly spaced time series data.
What I started to do was write code to interpolate my data linearly so that I could get the price every 10 seconds, minute, 10 minutes, hour, day, etc. However, with weekends, holidays, and all the other time when trading can't happen, I want to make Python treat trading hours as the only time that exists, so that my real-world times still match up with the correct datetimes, without needing a price stamp for all the times when trading is closed.
def lin_int_tseries(series, timeChange):
    tDelta = datetime.timedelta(seconds=timeChange)
    data_times = series['date_time']
    new_series = []
    sample_times = []
    sample_times.append(data_times[0])
    while max(sample_times) < max(data_times):
        sample_times.append(sample_times[-1] + tDelta)
    for position, time in enumerate(sample_times):
        try:
            ind = data_times.index(time)
            new_series.append(series[ind])
        except:
            t_next = getnextTime(time, data_times)  # get next largest timestamp in data
            t_prev = getprevTime(time, data_times)  # get next smallest timestamp in data
            ind_next = data_times.index(t_next)     # index of next largest timestamp
            ind_prev = data_times.index(t_prev)     # index of next smallest timestamp
            p_next = series[ind_next][1]            # price at next timestamp
            p_prev = series[ind_prev][1]            # price at prev timestamp
            omega = (time - t_prev) / (t_next - t_prev)  # linear interpolation weight
            p_interp = (1 - omega) * p_prev + omega * p_next
            new_series.append([time, p_interp])
    return new_series
Sorry if it's still unclear. I just want to find some way to stitch the end of one trading day to the beginning of the next trading day, while not losing the actual datetime information.
You should use pandas resample. First make date_time the index, then resample to whatever frequency you want, e.g. daily mean prices:
df = df.set_index('date_time')
daily = df.resample('D').mean()
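For the finer-grained, trading-hours-only series described in the question, one possible sketch (the 1-second frequency and the 09:30-16:00 window are assumptions; holidays would still need separate filtering) is:

# 1-second bins, linear interpolation between trades,
# then keep only regular trading hours on weekdays
prices = df['Price'].resample('1S').mean().interpolate()
prices = prices.between_time('09:30', '16:00')
prices = prices[prices.index.dayofweek < 5]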