I have a dataframe with quarterly returns of financial entities and I want to calculate 1-, 3-, 5- and 10-year annualized returns. The formula for calculating annualized returns is:
R = product(1+r)^(4/N) -1
where r are the quarterly returns of an entity and N is the number of quarters.
For example, the 3-year annualized return is:
R_3yr = product(1+r)^(4/12) -1 = ((1+r1)*(1+r2)*(1+r3)*...*(1+r12))^(1/3) -1
r1, r2, r3 ... r12 are the quarterly returns of the previous 11 quarters plus current quarter.
I wrote code that gives the right results, but it is very slow because it loops through each row of the dataframe. The code below is an extract for the 1-year and 3-year annualized returns (I applied the same concept for 5, 7, 10, 15 and 20-year returns). r_qrt is the field with the quarterly returns.
import pandas as pd
import numpy as np
# create the dataframe to which the results are appended
df_final = pd.DataFrame()
columns=['Date','Entity','r_qrt','R_1yr','R_3yr']
# loop through the dataframe
for row in df.itertuples():
    R_1yr=np.nan #1-year annualized return
    R_3yr=np.nan #3-year annualized return
    #Calculate 1 YR Annualized Return
    date_previous_period=row.Date+ pd.DateOffset(years=-1)
    temp_table=df.loc[(df['Date']>date_previous_period) &
                      (df['Date']<=row.Date) &
                      (df['Entity']==row.Entity)]
    if temp_table['r_qrt'].count()>=4:
        b=(1+(temp_table.r_qrt))[-4:].product()
        R_1yr=(b-1)
    #Calculate 3 YR Annualized Return
    date_previous_period=row.Date+ pd.DateOffset(years=-3)
    temp_table=df.loc[(df['Date']>date_previous_period) &
                      (df['Date']<=row.Date) &
                      (df['Entity']==row.Entity)]
    if temp_table['r_qrt'].count()>=12:
        b=(1+(temp_table.r_qrt))[-12:].product()
        R_3yr=((b**(1/3))-1)
    d=[row.Date,row.Entity,row.r_qrt,R_1yr,R_3yr]
    df_final = df_final.append(pd.Series(d, index=columns), ignore_index=True)
df_final looks as below (only reporting 1-year return results for space limitations)
Date        Entity  r_qrt     R_1yr
2015-03-31  A       0.035719  NaN
2015-06-30  A       0.031417  NaN
2015-09-30  A       0.030872  NaN
2015-12-31  A       0.029147  0.133335
2016-03-31  A       0.022100  0.118432
2016-06-30  A       0.020329  0.106408
2016-09-30  A       0.017676  0.092245
2016-12-31  A       0.017304  0.079676
2015-03-31  B       0.034705  NaN
2015-06-30  B       0.037772  NaN
2015-09-30  B       0.036726  NaN
2015-12-31  B       0.031889  0.148724
2016-03-31  B       0.029567  0.143020
2016-06-30  B       0.028958  0.133312
2016-09-30  B       0.028890  0.124746
2016-12-31  B       0.030389  0.123110
I am sure there is a more efficient way to run the same calculations but I have not been able to find it. My code is not efficient and takes more than 2 hours for large dataframes with long time series and many entities.
Thanks
See https://www.investopedia.com/terms/a/annualized-total-return.asp for the definition of annualized return.
# returns are given in percent (3 means 3%)
data=[ 3, 7, 5, 12, 1]
def annualize_rate(data):
    retVal=0
    accum=1
    for item in data:
        print(1+(item/100))
        accum*=1+(item/100)
    # geometric mean of the growth factors, minus 1
    retVal=pow(accum,1/len(data))-1
    return retVal
print(annualize_rate(data))
output
0.05533402290765199
2015 (a and b)
data=[0.133335,0.148724]
print(annualize_rate(data))
output:
0.001410292043902306
2016 (a&b)
data=[0.079676,0.123110]
print(annualize_rate(data))
output
0.0010139064424810051
You can store each year's annualized value and then use pct_change to get a 3-year result:
data=[0.05,0.06,0.07]
df=pd.DataFrame({'Annualized':data})
df['Percent_Change']=df['Annualized'].pct_change().fillna(0)
amount=1
returns_plus_one=df['Percent_Change']+1
cumulative_return = returns_plus_one.cumprod()
df['Cumulative']=cumulative_return.mul(amount)
# store the 2-period rolling mean first, then plot it (plot() returns an Axes object, not column data)
df['2item']=df['Cumulative'].rolling(window=2).mean()
df['2item'].plot()
print(df)
For the future reference of other users, this is the new version of the code that I implemented following Golden Lion's suggestion:
def compoundfunct(arr):
    # np.prod replaces np.product, which was removed in NumPy 2.0
    return np.prod(1+arr)**(4/len(arr)) - 1
# 1-yr annualized return
df["R_1Yr"]=df.groupby('Entity')['r_qrt'].rolling(4).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity',axis=1)
# 3-yr annualized return
df["R_3Yr"]=df.groupby('Entity')['r_qrt'].rolling(12).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity',axis=1)
The previous code took 36.4 sec for a dataframe of 5,640 rows. The new code is more than 10x faster: it took 2.8 sec.
One of the issues with this new code is that one has to make sure that rows are sorted by group (Entity in my case) and date before running the calculations, otherwise results could be wrong.
Thanks,
S.
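A possible further speed-up, sketched here under assumptions rather than taken from the thread: since the product of (1+r) equals the exponential of the sum of log(1+r), the rolling apply can be replaced by a built-in rolling sum. The function name annualized_returns and its parameters are hypothetical; the sketch assumes the columns are named Date, Entity and r_qrt as in the question, that the index is unique, and that every quarterly return is greater than -1. It sorts by entity and date internally, which addresses the ordering concern mentioned above.

```python
import numpy as np
import pandas as pd

def annualized_returns(df, years=(1, 3), periods_per_year=4):
    """Hypothetical helper: rolling annualized returns per entity."""
    # Sort so that each entity's rows are in chronological order.
    out = df.sort_values(['Entity', 'Date']).copy()
    # log(1 + r) turns the rolling product into a rolling sum.
    log_growth = np.log1p(out['r_qrt'])
    for y in years:
        n = y * periods_per_year  # number of quarters in the window
        rolled = (log_growth.groupby(out['Entity'])
                            .rolling(n).sum()
                            .droplevel(0))  # back to the original row index
        # product ** (4 / N) - 1 with N = 4 * y is exp(sum / y) - 1
        out[f'R_{y}yr'] = np.expm1(rolled / y)
    return out

# The first four quarters of entities A and B from the question, shuffled on purpose.
dates = pd.to_datetime(['2015-03-31', '2015-06-30', '2015-09-30', '2015-12-31'])
example = pd.DataFrame({
    'Date': list(dates) * 2,
    'Entity': ['A'] * 4 + ['B'] * 4,
    'r_qrt': [0.035719, 0.031417, 0.030872, 0.029147,
              0.034705, 0.037772, 0.036726, 0.031889],
}).sample(frac=1, random_state=0)

result = annualized_returns(example, years=(1,))
print(result)
```

The first three quarters of each entity stay NaN because the window is not yet full, which matches the behaviour of the original loop.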
I have a dataframe that contains sales information for products. I need a function that, given the product id, product type and date, calculates the average sales over all records up to and including the given date.
This is how I have implemented it, but this approach takes a lot of time and I was wondering if there is a faster way to do this.
Dataframe:
product_type = ['A','B']
df = pd.DataFrame({'prod_id':np.repeat(np.arange(start=2,stop=5,step=1),235),'prod_type': np.random.choice(np.array(product_type), 705),'sales_time': pd.date_range(start ='1-1-2018',
end ='3-30-2018', freq ='3H'),'sale_amt':np.random.randint(4,100,size = 705)})
Current code:
def cal_avg(product,ptype,pdate):
    temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
    return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you run the cal_avg function only rarely, I suggest ignoring my answer. Otherwise, it may be worth calculating the expanding-window average once for each product/product type. It might be slow depending on your dataset size (in which case maybe just run it on specific product types?), but you'll only need to run it once. First sort by the column you want to expand over (expanding has no 'on' parameter) to ensure the proper row order. Then 'groupby' and transform each group (to keep the indices of the original dataframe) with your expanding-window aggregation of choice (in this case 'mean').
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
prod_id prod_type sales_time sale_amt exp_mean_sales
0 2 B 2018-01-01 00:00:00 8 8.000000
1 2 B 2018-01-01 03:00:00 72 40.000000
2 2 B 2018-01-01 06:00:00 33 37.666667
3 2 A 2018-01-01 09:00:00 81 81.000000
4 2 B 2018-01-01 12:00:00 83 49.000000
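Once the expanding mean is stored, answering a batch of (product, type, date) queries becomes a lookup of the latest row at or before each date. One way to sketch that lookup is pandas.merge_asof; the small deterministic frame and the queries table below are made up for illustration, and the sketch assumes both frames are sorted by sales_time.

```python
import pandas as pd

# Small made-up dataset with the same columns as in the question.
df = pd.DataFrame({
    'prod_id': [2] * 8 + [3] * 8 + [4] * 8,
    'prod_type': ['A', 'B'] * 12,
    'sales_time': pd.date_range('2018-01-01', periods=24,
                                freq=pd.Timedelta(hours=3)),
    'sale_amt': list(range(10, 34)),
})
df = df.sort_values('sales_time')
df['exp_mean_sales'] = (df.groupby(['prod_id', 'prod_type'])['sale_amt']
                          .transform(lambda gr: gr.expanding().mean()))

# Hypothetical batch of lookups, sorted by time as merge_asof requires.
queries = pd.DataFrame({
    'prod_id': [2, 3, 4],
    'prod_type': ['A', 'B', 'A'],
    'sales_time': pd.to_datetime(['2018-01-01 10:00',
                                  '2018-01-02 12:00',
                                  '2018-01-03 20:00']),
}).sort_values('sales_time')
# Keep the datetime dtype identical on both sides of the join.
queries['sales_time'] = queries['sales_time'].astype(df['sales_time'].dtype)

# For each query, take the last row with the same product/type
# whose sales_time is at or before the query time.
answers = pd.merge_asof(
    queries,
    df[['prod_id', 'prod_type', 'sales_time', 'exp_mean_sales']],
    on='sales_time',
    by=['prod_id', 'prod_type'],
    direction='backward',
)
print(answers)
```

Each value in answers['exp_mean_sales'] should equal what cal_avg returns for the same arguments, without re-filtering the whole frame per call.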
Check Below code, with %%timeit comparison (Google Colab)
import numpy as np
import pandas as pd
product_type = ['A','B']
df = pd.DataFrame({'prod_id':np.repeat(np.arange(start=2,stop=5,step=1),235),'prod_type': np.random.choice(np.array(product_type), 705),'sales_time': pd.date_range(start ='1-1-2018',
end ='3-30-2018', freq ='3H'),'sale_amt':np.random.randint(4,100,size = 705)})
## OP's function
def cal_avg(product,ptype,pdate):
    temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
    return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function -
%%timeit
cal_avg(2,'A','2018-02-12 15:00:00')
Output:
Numpy version
%%timeit -n 1000
cal_vals = [2,'A','2018-02-12 15:00:00']
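# np.logical_and takes only two inputs (a third positional argument is `out`), so combine the three conditions with &
mask = (prod_id_array == cal_vals[0]) & (prod_type_array == cal_vals[1]) & (sales_time_array <= np.datetime64(cal_vals[2]))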
np.mean(values[mask])
Output:
I have a dataset that includes a country's temperature in 2020 and the projected temperature rise by 2050. I'm hoping to create a dataset that assumes linear growth of temperature between 2020 and 2050 for this country. Take the sample df as an example. The temperature in 2020 for country A is 5 degrees; by 2050, the temperature is projected to rise by 3 degrees. In other words, the temperature would rise by 0.1 degrees per year.
Country Temperature 2020 Temperature 2050
A 5 3
The desired output is df2
Country Year Temperature
A 2020 5
A 2021 5.1
A 2022 5.2
I tried to use resample but it seems to only work for scenario when the frequency is within a year (month, quarter). I also tried interpolate. But neither works.
df = df.reindex(pd.date_range(start='20211231', end='20501231', freq='12MS'))
df2 = df.interpolate(method='linear')
You can use something like this:
import numpy as np
import pandas as pd
def interpolate(df, start, stop):
    a = np.empty((stop - start, df.shape[0]))
    a[1:-1] = np.nan
    a[0] = df[f'Temperature {start}']
    a[-1] = df[f'Temperature {stop}']
    df2 = pd.DataFrame(a, index=pd.date_range(start=f'{start+1}', end=f'{stop+1}', freq='Y'))
    return df2.interpolate(method='linear')
df = pd.DataFrame([["A", 5, 3]], columns=["Country", f"Temperature 2020", f"Temperature 2050"])
df[f"Temperature 2050"] += df[f"Temperature 2020"]
print(interpolate(df, 2020, 2050))
This will output
2021-01-01 5.000000
2022-01-01 5.103448
2023-01-01 5.206897
2024-01-01 5.310345
2025-01-01 5.413793
2026-01-01 5.517241
2027-01-01 5.620690
2028-01-01 5.724138
2029-01-01 5.827586
2030-01-01 5.931034
2031-01-01 6.034483
2032-01-01 6.137931
2033-01-01 6.241379
2034-01-01 6.344828
2035-01-01 6.448276
2036-01-01 6.551724
2037-01-01 6.655172
2038-01-01 6.758621
2039-01-01 6.862069
2040-01-01 6.965517
2041-01-01 7.068966
2042-01-01 7.172414
2043-01-01 7.275862
2044-01-01 7.379310
2045-01-01 7.482759
2046-01-01 7.586207
2047-01-01 7.689655
2048-01-01 7.793103
2049-01-01 7.896552
2050-01-01 8.000000
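Note that the listing above advances by 3/29 ≈ 0.1034 per row, because 30 rows span the full rise. To get exactly 0.1 per year from 2020 to 2050 inclusive, 31 points are needed. A minimal sketch of that variant in the long Country/Year/Temperature layout the question asks for follows; the function name expand_linear is hypothetical, and it assumes (as in the question) that the "Temperature 2050" column holds the rise rather than the final level.

```python
import numpy as np
import pandas as pd

def expand_linear(df, start=2020, stop=2050):
    """Hypothetical helper: one row per country and year, linear path."""
    years = np.arange(start, stop + 1)  # both endpoints included
    base = df[f'Temperature {start}'].to_numpy(dtype=float)
    rise = df[f'Temperature {stop}'].to_numpy(dtype=float)  # total rise by `stop`
    # Fraction of the rise reached in each year: 0 at start, 1 at stop.
    frac = (years - start) / (stop - start)
    temps = base[:, None] + rise[:, None] * frac[None, :]
    return pd.DataFrame({
        'Country': np.repeat(df['Country'].to_numpy(), len(years)),
        'Year': np.tile(years, len(df)),
        'Temperature': temps.ravel(),
    })

df = pd.DataFrame([['A', 5, 3]],
                  columns=['Country', 'Temperature 2020', 'Temperature 2050'])
df2 = expand_linear(df)
print(df2.head(3))
```

Because the whole grid is computed with broadcasting, the same call handles any number of countries without a loop.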
I have a large DataFrame of thousands of rows but only 2 columns. The 2 columns are of the below format:
Dt          Val
2020-01-01  10.5
2020-01-01  11.2
2020-01-01  10.9
2020-01-03  11.3
2020-01-05  12.0
The first column is date and the second column is a value. For each date, there may be zero, one or more values.
What I need to do is the following: compute the 95th percentile based on the 30 days that have just passed and see whether the current value is above or below that 95th percentile value. There must, however, be a minimum of 50 values available for the past 30 days.
For example, if a record has date "2020-12-01" and value "10.5", then I need to first see how many values are there available for the date range 2020-11-01 to 2020-11-30. If there are at least 50 values available over that date range, then I will want to compute the 95th percentile of those values and compare 10.5 against that. If 10.5 is greater than the 95th percentile value, then the result for that record is "Above Threshold". If 10.5 is less than the 95th percentile value, then the result for that record is "Below Threshold". If there are less than 50 values over the date range 2020-11-01 to 2020-11-30, then the result for that record is "Insufficient Data".
I would like to avoid running a loop if possible as it may be expensive from a resource and time perspective to loop through thousands of records to process them one by one. I hope someone can advise of a simple(r) python / pandas solution here.
Use rolling on DatetimeIndex to get the number of values available and the 95th percentile in the last 30 days. Here is an example with 3 days rolling window:
import datetime
import pandas as pd
df = pd.DataFrame({'val':[1,2,3,4,5,6]},
index = [datetime.date(2020,10,1), datetime.date(2020,10,1), datetime.date(2020,10,2),
datetime.date(2020,10,3), datetime.date(2020,10,3), datetime.date(2020,10,4)])
df.index = pd.DatetimeIndex(df.index)
df['number_of_values'] = df.rolling('3D')['val'].count()
df['rolling_percentile'] = df.rolling('3D')['val'].quantile(0.9, interpolation='nearest')
Then you can simply do your comparison:
# Above Threshold
(df['val']>df['rolling_percentile'])&(df['number_of_values']>=50)
# Below Threshold
(df['val']<df['rolling_percentile'])&(df['number_of_values']>=50)
# Insufficient Data
df['number_of_values']<50
To exclude the current date, the closed argument would not work when a day has more than one row, so one option is a rolling apply:
import numpy as np

def f(x, metric):
    # drop every row that shares the current (last) date
    x = x[x.index!=x.index[-1]]
    if metric == 'count':
        return len(x)
    elif metric == 'percentile':
        return x.quantile(0.9, interpolation='nearest')
    else:
        return np.nan
df = pd.DataFrame({'val':[1,2,3,4,5,6]},
index = [datetime.date(2020,10,1), datetime.date(2020,10,1), datetime.date(2020,10,2),
datetime.date(2020,10,3), datetime.date(2020,10,3), datetime.date(2020,10,4)])
df.index = pd.DatetimeIndex(df.index)
df['count'] = df.rolling('3D')['val'].apply(f, args = ('count',))
df['percentile'] = df.rolling('3D')['val'].apply(f, args = ('percentile',))
val count percentile
2020-10-01 1 0.0 NaN
2020-10-01 2 0.0 NaN
2020-10-02 3 2.0 2.0
2020-10-03 4 3.0 3.0
2020-10-03 5 3.0 3.0
2020-10-04 6 3.0 5.0
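The count and percentile columns can then be turned into the three labels in one step. The sketch below uses numpy.select for that; the function name label_rows and its parameters are hypothetical, and the toy values are made up. One point to keep in mind: a time-based rolling window of '3D' ending on day D covers the interval (D - 3 days, D], so after dropping the current day only the two previous days remain. Covering a full 30 previous days would therefore need a '31D' window.

```python
import numpy as np
import pandas as pd

def label_rows(df, window='31D', q=0.95, min_count=50):
    """Hypothetical helper: label each row against the rolling percentile
    of earlier days. Expects a DatetimeIndex and a 'val' column."""
    df = df.sort_index()
    roll = df['val'].rolling(window)
    # Both statistics ignore every row that shares the current date.
    count = roll.apply(lambda x: float((x.index != x.index[-1]).sum()),
                       raw=False)
    pct = roll.apply(lambda x: x[x.index != x.index[-1]].quantile(q),
                     raw=False)
    # Conditions are checked in order; the first match wins.
    labels = np.select(
        [count < min_count, df['val'] > pct],
        ['Insufficient Data', 'Above Threshold'],
        default='Below Threshold',
    )
    return df.assign(count=count, percentile=pct, result=labels)

toy = pd.DataFrame(
    {'val': [5, 6, 1, 4, 2, 6]},
    index=pd.to_datetime(['2020-10-01', '2020-10-01', '2020-10-02',
                          '2020-10-03', '2020-10-03', '2020-10-04']),
)
# Tiny window, median and minimum count so the toy data yields all three labels.
labelled = label_rows(toy, window='3D', q=0.5, min_count=2)
print(labelled)
```

In this sketch a value exactly equal to the percentile falls into "Below Threshold", which is an assumption, since the question does not say how ties should be treated.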
I'm trying to put together a generic piece of code that would:
1. Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date        4. close  decile
2017-01-03  1158.2    0
2017-01-04  1166.5    1
2017-01-05  1181.4    2
2017-01-06  1175.7    1
...         ...
2018-04-23  1326.0    7
2018-04-24  1333.2    8
2018-04-25  1327.2    7
[374 rows x 2 columns]
2. Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
3. Take an average of those 30-day price returns for each decile
level1 = gold.ix[datelist]
level2 = gold.ix[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1,level2, how='inner', left_index=True, right_index=True)
def ret(one, two):
    return (two - one)/one
pricereturns = result.apply(lambda x :ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
4. Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3 but only for a single decile, but I'm struggling to expand this to a looped-code for all 10 deciles at once with a clean CSV output
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold.loc[gold.index + pd.DateOffset(months=1), 'close']
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
In practice, however, what matters is the number of trading days: you won't have prices for weekends or holidays. So I'd suggest shifting by a number of rows rather than by a date range, i.e. to the next 20 trading days.
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1. and 2. directly so you don't need the additional column (only if you shift by number of rows, not by fixed daterange due to reindexing)
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data you get something like the dataframe below. close and ret are the averages for each decile. You can create a csv from a dataframe via pandas.DataFrame.to_csv
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
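Put together, the steps above can be sketched end to end as follows. The prices are synthetic (a random walk), so the numbers carry no meaning; the column name close and the 20-row horizon follow the answer, while the use of pandas.qcut to build the deciles is an assumption, since the question does not show how its decile column was produced.

```python
import numpy as np
import pandas as pd

# Synthetic business-day price series standing in for the gold data.
rng = np.random.default_rng(0)
dates = pd.bdate_range('2017-01-03', periods=374)
gold = pd.DataFrame(
    {'close': 1200 + rng.normal(0, 5, len(dates)).cumsum()},
    index=dates,
)

# Step 1: assign each day to a price decile, labelled 0..9.
gold['decile'] = pd.qcut(gold['close'], 10, labels=False)

# Step 2: forward return over the next 20 rows (trading days).
gold['ret'] = gold['close'].shift(-20) / gold['close'] - 1

# Step 3: average forward return per decile (NaN rows at the end are skipped).
summary = gold.groupby('decile')['ret'].mean()

# Step 4: CSV text; pass a file path to to_csv to write a file instead.
csv_text = summary.to_csv()
print(csv_text)
```

Grouping on the decile column handles all ten deciles in one pass, so no explicit loop over deciles is needed.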
I use data from a past kaggle challenge based on panel data across a number of stores and a period spanning 2.5 years. Each observation includes the number of customers for a given store-date. For each store-date, my objective is to compute the average number of customers that visited this store during the past 60 days.
Below is code that does exactly what I need. However, it takes forever: it would take a night to process the c. 800k rows. I am looking for a clever way to achieve the same objective faster.
I have included 5 observations of the initial dataset with the relevant variables: store id (Store), Date and number of customers ("Customers").
Note:
For each row in the iteration, I end up writing the results using .loc instead of e.g. row["Lagged No of customers"] because "row" does not write anything in the cells. I wonder why that's the case.
I normally populate new columns using "apply, axis = 1" so I would really appreciate any solution based on that. I found that "apply" works fine when for each row, computation is done across columns using values at the same row level. However, I don't know how an "apply" function can involve different rows, which is what this problem requires. the only exception I have seen so far is "diff", which is not useful here.
Thanks.
Sample data:
pd.DataFrame({
'Store': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'Customers': {0: 668, 1: 578, 2: 619, 3: 635, 4: 785},
'Date': {
0: pd.Timestamp('2013-01-02 00:00:00'),
1: pd.Timestamp('2013-01-03 00:00:00'),
2: pd.Timestamp('2013-01-04 00:00:00'),
3: pd.Timestamp('2013-01-05 00:00:00'),
4: pd.Timestamp('2013-01-07 00:00:00')
}
})
Code that works but is incredibly slow:
import pandas as pd
import numpy as np
data = pd.read_csv("Rossman - no of cust/dataset.csv")
data.Date = pd.to_datetime(data.Date)
data.Customers = data.Customers.astype(int)
for index, row in data.iterrows():
    d = row["Date"]
    store = row["Store"]
    time_condition = (d - data["Date"]<np.timedelta64(60, 'D')) & (d > data["Date"])
    sub_df = data.loc[ time_condition & (data["Store"] == store), :]
    data.loc[ (data["Date"]==d) & (data["Store"] == store), "Lagged No customers"] = sub_df["Customers"].sum()
    data.loc[ (data["Date"]==d) & (data["Store"] == store), "No of days"] = len(sub_df["Customers"])
    if len(sub_df["Customers"]) > 0:
        data.loc[ (data["Date"]==d) & (data["Store"] == store), "Av No of customers"] = int(sub_df["Customers"].sum()/len(sub_df["Customers"]))
Given your small sample data, I used a two day rolling average instead of 60 days.
>>> (data.pivot(columns='Store', index='Date', values='Customers')
...      .rolling(window=2).mean()
...      .stack('Store'))
Date Store
2013-01-03 1 623.0
2013-01-04 1 598.5
2013-01-05 1 627.0
2013-01-07 1 710.0
dtype: float64
By taking a pivot of the data with dates as your index and stores as your columns, you can simply take a rolling average. You then need to stack the stores to get the data back into the correct shape.
Here is some sample output of the original data prior to the final stack:
Store 1 2 3
Date
2015-07-29 541.5 686.5 767.0
2015-07-30 534.5 664.0 769.5
2015-07-31 550.5 613.0 822.0
After .stack('Store'), this becomes:
Date Store
2015-07-29 1 541.5
2 686.5
3 767.0
2015-07-30 1 534.5
2 664.0
3 769.5
2015-07-31 1 550.5
2 613.0
3 822.0
dtype: float64
Assuming the above is named df, you can then merge it back into your original data as follows:
data.merge(df.reset_index(),
how='left',
on=['Date', 'Store'])
EDIT:
There is a clear seasonal pattern in the data for which you may want to make adjustments. In any case, you probably want your rolling average to be in multiples of seven to represent even weeks. I've used a time window of 63 days in the example below (9 weeks).
In order to avoid losing data on stores that just open (and those at the start of the time period), you can specify min_periods=1 in the rolling mean function. This will give you the average value over all available observations for your given time window
df = data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]
result = (df.pivot(columns='Store', index='Date', values='Customers')
            .rolling(window=63, min_periods=1).mean()
            .stack('Store'))
result.name = 'Customers_63d_mvg_avg'
df = df.merge(result.reset_index(), on=['Store', 'Date'], how='left')
>>> df.sort_values(['Store', 'Date']).head(8)
Date Store Customers Customers_63d_mvg_avg
843212 2013-01-02 1 668 668.000000
842103 2013-01-03 1 578 623.000000
840995 2013-01-04 1 619 621.666667
839888 2013-01-05 1 635 625.000000
838763 2013-01-07 1 785 657.000000
837658 2013-01-08 1 654 656.500000
836553 2013-01-09 1 626 652.142857
835448 2013-01-10 1 615 647.500000
To more clearly see what is going on, here is a toy example:
s = pd.Series([1,2,3,4,5] + [np.nan] * 2 + [6])
>>> pd.concat([s, s.rolling(window=4, min_periods=1).mean()], axis=1)
0 1
0 1 1.0
1 2 1.5
2 3 2.0
3 4 2.5
4 5 3.5
5 NaN 4.0
6 NaN 4.5
7 6 5.5
The window is four observations, but note that the final value of 5.5 equals (5 + 6) / 2. The 4.0 and 4.5 values are (3 + 4 + 5) / 3 and (4 + 5) / 2, respectively.
In our example, the NaN rows of the pivot table do not get merged back into df because we did a left join and all the rows in df have one or more Customers.
You can view a chart of the rolling data as follows:
df.set_index(['Date', 'Store']).unstack('Store').plot(legend=False)
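The pivot-based rolling mean above counts rows (observations), not calendar days, and it includes the current day. If the goal is closer to the loop in the question, that is, the average over the calendar days before each date, one option is a time-based window per store. The sketch below is an assumption-laden illustration, not the answerer's method: the function name trailing_avg is hypothetical, it assumes at most one row per store and date, and with closed='left' the window is [d - 60 days, d), so it also includes the day exactly 60 days earlier, which the original loop excludes.

```python
import pandas as pd

# The five sample rows from the question.
data = pd.DataFrame({
    'Store': [1, 1, 1, 1, 1],
    'Customers': [668, 578, 619, 635, 785],
    'Date': pd.to_datetime(['2013-01-02', '2013-01-03', '2013-01-04',
                            '2013-01-05', '2013-01-07']),
})

def trailing_avg(data, window='60D'):
    """Hypothetical helper: mean customers over the days before each date."""
    # Each store's dates must be in increasing order for a time window.
    ordered = data.sort_values(['Store', 'Date'])
    avg = (ordered.set_index('Date')
                  .groupby('Store')['Customers']
                  .rolling(window, closed='left')  # exclude the current day
                  .mean()
                  .rename('Av No of customers')
                  .reset_index())
    # Attach the result to the original rows.
    return data.merge(avg, on=['Store', 'Date'], how='left')

result = trailing_avg(data)
print(result)
```

The first row of each store is NaN because no earlier observation exists; the other rows average only the days strictly before the current one.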