Pandas Groupby Geometric Average?

Pandas Groupby Geometric Average? - python

I'm trying to calculate the stock returns for my portfolio which requires "geometrically averaging" the percentages by year.
For simplicity, I have a dataframe that looks likes this:
Date Returns
2013-06-01 1%
2013-07-01 5%
2013-08-01 -4%
2014-01-01 12%
2014-02-01 -9%
I'd like the output to show:
Date Geo Return
2013 1.8%
2015 1.9%
Which is derived by: (1+.01)(1+.05)(1+-.04) = 1.8%
I am able to use the groupby function by year, but it only sums for me and I can't get the geometric average to work. Could someone please help?
Thanks!

Note that you have requested the cumulative product, which is different that the usual definition for the geometric mean.
df["returns"] = 1 + .01*df.Returns.str.split("%").str[0].astype(int)
df["geom_ave"] = df.groupby(df.Date.dt.year).returns.transform("prod")
output:
Date Returns returns geom_ave
0 2013-06-01 1% 1.01 1.01808
1 2013-07-01 5% 1.05 1.01808
2 2013-08-01 -4% 0.96 1.01808
3 2014-01-01 12% 1.12 1.01920
4 2014-02-01 -9% 0.91 1.01920
If instead you want the geometric mean, you can try:
from scipy import stats
series = df.groupby(df.Date.dt.year).returns.apply(stats.gmean)

Related

Pandas: Annualized Returns

I have a dataframe with quarterly returns of financial entities and I want to calculate 1, 3, 5 10-year annualized returns. The formula for calculating annualized returns is:
R = product(1+r)^(4/N) -1
r are the quarterly return of an entity, N is the number of quarters
for example 3-year annualized return is:
R_3yr = product(1+r)^(4/12) -1 = ((1+r1)*(1+r2)*(1+r3)*...*(1+r12))^(1/3) -1
r1, r2, r3 ... r12 are the quarterly returns of the previous 11 quarters plus current quarter.
I created a code which provides the right results but it is very slow because it is looping through each row of the dataframe. The code below is an extract of my code for 1-year and 3-year annualized retruns (I applied the same concept for 5, 7, 10, 15 and 20-year returns). r_qrt is the field with the quarterly returns
import pandas as pd
import numpy as np
#create dataframe where I append the results
df_final = pd.DataFrame()
columns=['Date','Entity','r_qrt','R_1yr','R_3yr']
#loop thorugh the dataframe
for row in df.itertuples():
R_1yr=np.nan #1-year annualized return
R_3yr=np.nan #3-year annualized return
#Calculate 1 YR Annualized Return
date_previous_period=row.Date+ pd.DateOffset(years=-1)
temp_table=df.loc[(df['Date']>date_previous_period) &
(df['Date']<=row.Date) &
(df['Entity']==row.Entity)]
if temp_table['r_qrt'].count()>=4:
b=(1+(temp_table.r_qrt))[-4:].product()
R_1yr=(b-1)
#Calculate 3 YR Annualized Return
date_previous_period=row.Date+ pd.DateOffset(years=-3)
temp_table=df.loc[(df['Date']>date_previous_period) &
(df['Date']<=row.Date) &
(df['Entity']==row.Entity)]
if temp_table['r_qrt'].count()>=12:
b=(1+(temp_table.r_qrt))[-12:].product()
R_3yr=((b**(1/3))-1)
d=[row.Date,row.Entity,row.r_qrt,R_1yr,R_3yr]
df_final = df_final.append(pd.Series(d, index=columns), ignore_index=True)
df_final looks as below (only reporting 1-year return results for space limitations)
Date
Entity
r_qrt
R_1yr
2015-03-31
A
0.035719
NaN
2015-06-30
A
0.031417
NaN
2015-09-30
A
0.030872
NaN
2015-12-31
A
0.029147
0.133335
2016-03-31
A
0.022100
0.118432
2016-06-30
A
0.020329
0.106408
2016-09-30
A
0.017676
0.092245
2016-12-31
A
0.017304
0.079676
2015-03-31
B
0.034705
NaN
2015-06-30
B
0.037772
NaN
2015-09-30
B
0.036726
NaN
2015-12-31
B
0.031889
0.148724
2016-03-31
B
0.029567
0.143020
2016-06-30
B
0.028958
0.133312
2016-09-30
B
0.028890
0.124746
2016-12-31
B
0.030389
0.123110
I am sure there is a more efficient way to run the same calculations but I have not been able to find it. My code is not efficient and takes more than 2 hours for large dataframes with long time series and many entities.
Thanks

see (https://www.investopedia.com/terms/a/annualized-total-return.asp) for the definition of annualized return
data=[ 3, 7, 5, 12, 1]
def annualize_rate(data):
retVal=0
accum=1
for item in data:
print(1+(item/100))
accum*=1+(item/100)
retVal=pow(accum,1/len(data))-1
return retVal
print(annualize_rate(data))
output
0.05533402290765199
2015 (a and b)
data=[0.133335,0.148724]
print(annualize_rate(data))
output:
0.001410292043902306
2016 (a&b)
data=[0.079676,0.123110]
print(annualize_rate(data))
output
0.0010139064424810051
you can store each year annualized value then use pct_chg to get a 3 year result
data=[0.05,0.06,0.07]
df=pd.DataFrame({'Annualized':data})
df['Percent_Change']=df['Annualized'].pct_change().fillna(0)
amount=1
returns_plus_one=df['Percent_Change']+1
cumulative_return = returns_plus_one.cumprod()
df['Cumulative']=cumulative_return.mul(amount)
df['2item']=df['Cumulative'].rolling(window=2).mean().plot()
print(df)

For future reference of other users, this is the new version of the code that I implemented following Golden Lion suggestion:
def compoundfunct(arr):
return np.product(1+arr)**(4/len(arr)) - 1
# 1-yr annulized return
df["R_1Yr"]=df.groupby('Entity')['r_qrt'].rolling(4).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity',axis=1)
# 3-yr annulized return
df["R_3Yr"]=df.groupby('Entity')['r_qrt'].rolling(12).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity',axis=1)
The performance of the previous code was 36.4 sec for a dataframe of 5,640 rows. The new code is more than 10x faster, it took 2.8 sec
One of the issues with this new code is that one has to make sure that rows are sorted by group (Entity in my case) and date before running the calculations, otherwise results could be wrong.
Thanks,
S.

Pandas group three variables and calculate mean, mode, and median

I have a pandas dataframe that looks like this:
SCORE ZIP CODE DATE
0 95.2 90210 2016-01-01
1 98.36 70024 2019-03-02
2 78.2 34567 2017-09-01
3 99.25 00345 2018-05-02
4 ..... ..... .....
For each ZIP CODE, I need to calculate the mean, mode, and median, of the SCORE per each year in the DATE column.
How can I do that?

Why isn't the groupby method working in this rolling sum calculation in Pandas and how can I fix it?

I have the following data, which I multi-index with Date and Ticker as indices and I am then adding a rolling sum of the Vol column for each stock.
Raw Data:
Date,Ticker,SharePrice,Vol
2014-12-31,MSFT,10.79,16.24
2015-03-31,MSFT,19.44,14.94
2015-06-30,MSFT,3.73,19.79
2015-09-30,MSFT,3.76,6.52
2015-12-31,MSFT,10.56,17.91
2016-03-31,MSFT,13.56,11.96
2016-06-30,MSFT,16.27,19.79
2015-03-31,GM,18.22,9.92
2015-06-30,GM,17.16,18.69
2015-09-30,GM,15.92,17.45
Here is the code I use to calculate my rolling sum of vol - note that I do not want the rolling sum to include vol related to a different Ticker (I attempt to use gruopby to stop this but it doesn't work):
Code:
import pandas as pd
stocks = pd.read_csv("C:\\Users\\stocks.csv", index_col=["Date", "Ticker"])
stocks['RollingVol'] = stocks['Vol'].groupby(level=1).fillna(0).rolling(1095, min_periods=2).sum()
print(stocks)
Here is the result I get:
Date,Ticker,SharePrice,Vol,RollingVol
2014-12-31,MSFT,10.79,16.24,
2015-03-31,MSFT,19.44,14.94,31.18
2015-06-30,MSFT,3.73,19.79,50.97
2015-09-30,MSFT,3.76,6.52,57.489999999999995
2015-12-31,MSFT,10.56,17.91,75.39999999999999
2016-03-31,MSFT,13.56,11.96,87.35999999999999
2016-06-30,MSFT,16.27,19.79,107.14999999999998
2015-03-31,GM,18.22,9.92,117.06999999999998
2015-06-30,GM,17.16,18.69,135.76
2015-09-30,GM,15.92,17.45,153.20999999999998
My problem for example here is that the first rolling sum entry for GM (117.0699999) is including the MSFT values whereas it should just be NaN (since min_periods = 2) and then the second entry for GM should be 9.92+18.69= 28.61 and so on as detailed below. I don't understand why the groupby(level=1) in my code is not achieving this and how I can fix it?
Many thanks in advance
Expected Result:
Date,Ticker,SharePrice,Vol,RollingVol
2014-12-31,MSFT,10.79,16.24,
2015-03-31,MSFT,19.44,14.94,31.18
2015-06-30,MSFT,3.73,19.79,50.97
2015-09-30,MSFT,3.76,6.52,57.49
2015-12-31,MSFT,10.56,17.91,75.4
2016-03-31,MSFT,13.56,11.96,87.36
2016-06-30,MSFT,16.27,19.79,107.15
2015-03-31,GM,18.22,9.92,
2015-06-30,GM,17.16,18.69,28.61
2015-09-30,GM,15.92,17.45,46.06

The problem with your code is that when you call groupby then
for each group is invoked actually only the first function
from the following content, in your example only fillna, which
changes nothing.
Invocation of the following methods is performed on the final
("consolidated") result of the preceding groupby.
To compute what you really want, change your code to:
stocks['RollingVol'] = stocks.Vol.groupby(level=1).apply(
lambda grp: grp.rolling(1095, min_periods=2).sum())
The result, for your sample data, is:
SharePrice Vol RollingVol
Date Ticker
2014-12-31 MSFT 10.79 16.24 NaN
2015-03-31 MSFT 19.44 14.94 31.18
2015-06-30 MSFT 3.73 19.79 50.97
2015-09-30 MSFT 3.76 6.52 57.49
2015-12-31 MSFT 10.56 17.91 75.40
2016-03-31 MSFT 13.56 11.96 87.36
2016-06-30 MSFT 16.27 19.79 107.15
2015-03-31 GM 18.22 9.92 NaN
2015-06-30 GM 17.16 18.69 28.61
2015-09-30 GM 15.92 17.45 46.06
Note that the first value in each group is NaN, since you want
min_periods=2.
And a final detail to consider: You chose a strikingly big window
size (1095).
This raises suspicion that you actually want an expanding window,
from the start of the current group, up to the current row.
Something like:
stocks['RollingVol'] = stocks.Vol.groupby(level=1).apply(
lambda grp: grp.expanding(min_periods=2).sum())
Or maybe you want the rolling sum for 3 years, assuming that you have
data for each day.

Pandas: Merging two dataframe with diffrent time index

I have a large dateset that includes categorical data which are my labels ( non-uniform timestamp). I have another dataset which is aggregate of the measurement.
When I want to assemble these two dataset, they have two different timestamp ( aggregated vs non-aggregated).
Categorical dataframe (df_Label)
count 1185
unique 10
top ABCD
freq 1165
Aggregated Dataset (MeasureAgg),
In order to assemble the label dataframe with measurement dataframe.
I use df_Label=df_Label.reindex(MeasureAgg.index, method='nearest')
The issue is that the result of this reindexing will eliminate many of my labels, so the df.describe() will be:
count 4
unique 2
top ABCD
freq 3
I looked in two several lines of where the labels get replaced by nan but couldn't find any indication of where this come from.
I was suspicious that this issue might be due clustering of the labels in between two timestamp which eliminate many of them but this is not the case.
I tried this for fabricated dataset and it work as expected but not sure why is not working in my case df_Label=df_Label.reindex(MeasureAgg.index, method='nearest')
my apology on vague nature of my question, I couldn't replicate the issue with fabricated dataset( for fabricated dataset it worked fine). I would greatly appreciate if any one can guide me with alternative way/modified way that I can assemble these two dataframes.
Thanks in advance
Update:
There is only timestamp since it is mostly missing data
df_Label.head(5)
Time
2000-01-01 00:00:10.870 NaN
2000-01-01 00:00:10.940 NaN
2000-01-01 00:00:11.160 NaN
2000-01-01 00:00:11.640 NaN
2000-01-01 00:00:12.460 NaN
Name: SUM, dtype: object
df_Label.describe()
count 1185
unique 10
top 9_33_2_0_0_0
freq 1165
Name: SUM, dtype: object
MeasureAgg.head(5)
Time mean std skew kurt
2000-01-01 00:00:00 0.0 0.0
2010-01-01 00:00:00 0.0
2015-01-01 00:00:00
2015-12-01 00:00:00
2015-12-01 12:40:00 0.0
MeasureAgg.describe()
mean std skew kurt
count 407.0 383.0 382.0 382.0
mean 487.3552791234544 35.67631749396375 -0.7545081710390299 2.52171909979003
std 158.53524231679074 43.66050329988979 1.3831195437535115 6.72280956322486
min 0.0 0.0 -7.526780108501018 -1.3377292623812096
25% 474.33696969696973 11.5126181533734 -1.1790982769904146 -0.4005545816076801
50% 489.03428571428566 13.49696931937243 -0.2372819584684056 -0.017202890096714274
75% 532.3371929824561 51.40084557371704 0.12755009341999793 1.421205718986767
max 699.295652173913 307.8822231525122 1.2280152015331378 66.9243304128838

How to impute missing data within a date range in a table?

I have the following problem with imputing the missing or zero values in a table. It seems like it's more of an algorithm problem. I wanted to know if someone could help me out figure this out in python or R.
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 0 01/24/2017 00:00:00
A 0 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 0 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 0 01/12/2017 00:00:00
B 0 01/11/2017 00:00:00
B 0 01/10/2017 00:00:00
B 0 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
for each asset(A,B,etc..) traverse through the records chronologically(date) replace all the zeros with the average of mileage between the points =
(earlier mileage that is not zero - later mileage that is not zero) /
( number of records from the earlier mileage to the later mileage) +
the earlier mileage.
for instance for the above table the data will look like this after it's fixed
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 40,974 01/24/2017 00:00:00
A 40,919 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 39,800 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 4,000 01/12/2017 00:00:00
B 3,500 01/11/2017 00:00:00
B 3,000 01/10/2017 00:00:00
B 2,500 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
in the above case for instance the calculation for one of the records is as below:
(41084-40864)/4(# of records from 40,864 to 41,084) = 110 + previous
value(40,864) = 40919

It seems like you want to be using an analysis method that uses some sort of by to iterate over your data frame and find averages. You could consider something using by() and apply(). The specific iterative changes make it harder without adding in an ordered variable (i.e., right now your rows are implied to be numbered, but should be numbered by date within asset).
Steps to solving this yourself:
Create an ordered variable that provides a number from mileage (0) to mileage (X).
Use either by() or dplyr::group_by() to create averages within each asset. You might want to merge() or dplyr::inner_join() that to the original dataset, or use a lookup.
Use ifelse() to add that average to rows where mileage is 0, multiplying it by the ordered variable.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas Groupby Geometric Average? - python

Related

Pandas: Annualized Returns

Pandas group three variables and calculate mean, mode, and median

Why isn't the groupby method working in this rolling sum calculation in Pandas and how can I fix it?

Pandas: Merging two dataframe with diffrent time index

How to impute missing data within a date range in a table?

Categories

Resources