Annualized return by group - python

I have monthly returns and want to calculate the annualized return by two groups. Below is the sample data.
Return_M Rise
0.097425 1
0.188547 1
-0.1509 1
0.28011 1
-0.09596 1
0.041459 1
0.106838 1
0.046581 0
-0.16068 0
0.009242 0
0.006104 0
-0.00709 0
0.050352 0
-0.01023 0
-0.00731 0
0.031946 0
0.048552 0
This is what I tried, but the code actually counts the length of the whole df1, not of each group. I am hoping for a method that can be applied broadly.
df2 = df1.groupby(['Rise'])[['Return_M']].apply(lambda x:np.prod(1+x)**(12/len(x)))
This is the expected output:
Rise Return_M
1 0.249862
0 -0.00443

You only have to groupby on the Rise column and aggregate on the Return_M column.
The following snippet assumes you want to divide by 12 (based on your question):
df2 = df1.groupby('Rise').agg({'Return_M': 'sum'}).reset_index()
df2['avg'] = df2['Return_M']/12
df2[['Rise', 'avg']]
But if you need the average based on however many records you have for each group of Rise, you can simply do:
df2 = df1.groupby('Rise').agg({'Return_M': 'mean'})
EDIT: Editing the answer based on OP's comment:
To get the geometric annualized return as per your formula, the following will work:
df.groupby('Rise').Return_M.apply(lambda x: (1+x).product() ** (12/float(len(x))))
However, the output is different from the expected output you posted in your question:
Rise
0 0.986765
1 1.952498
This however is exactly the correct output as per the formula you described.
I did this calculation manually too, for Rise = 1:
I took the product of each (1 + Return_M) value.
Raised that product to the power of (12 divided by the length of the group, which is 7 for this group):
(1 + 0.097425) * (1 + 0.188547) * (1 + -0.1509) * (1 + 0.28011) * (1 + -0.09596)* (1 + 0.041459)* (1 + 0.106838) = 1.4774446702
1.4774446702 ^ (12/7) = 1.9524983367
So just check if your logic is correct. Please mark this answer as accepted if it solves your problem.
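For reference, here is the whole calculation as a self-contained sketch with the sample data from the question (the variable names are just illustrative), which reproduces 1.952498 for Rise = 1 and 0.986765 for Rise = 0:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Return_M': [0.097425, 0.188547, -0.1509, 0.28011, -0.09596, 0.041459,
                 0.106838, 0.046581, -0.16068, 0.009242, 0.006104, -0.00709,
                 0.050352, -0.01023, -0.00731, 0.031946, 0.048552],
    'Rise': [1] * 7 + [0] * 10,
})

# Geometric annualization: compound the monthly growth factors,
# then raise to (12 / number of months in the group)
ann = df1.groupby('Rise')['Return_M'].apply(
    lambda x: (1 + x).prod() ** (12 / len(x))
)
print(ann)
```

Note that `len(x)` inside the groupby apply is evaluated per group (7 for Rise = 1, 10 for Rise = 0), which is why this produces a per-group annualization.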


Iterate over rows of a dataframe based on index in python

I am trying to build a loop that iterates over the rows of several DataFrames in order to create two new columns. The original dataframes contain two columns (time, velocity), which can vary in length and are stored in nested dictionaries. Here is an example of one of them:
time velocity
0 0.000000 0.136731
1 0.020373 0.244889
2 0.040598 0.386443
3 0.060668 0.571861
4 0.080850 0.777680
5 0.101137 1.007287
6 0.121206 1.207533
7 0.141284 1.402833
8 0.161388 1.595385
9 0.181562 1.762003
10 0.201640 1.857233
11 0.221788 2.006104
12 0.241866 2.172649
The two new columns should be a normalization of the 'time' and 'velocity' columns, respectively. Each row of the new columns should therefore be equal to the following transformation:
t_norm = (time(n) - time(n-1)) / (time(max) - time(min))
vel_norm = (velocity(n) - velocity(n-1)) / (velocity(max) - velocity(min))
Also, the first value of the two new columns should be set to 0.
My problem is that I don't know how to properly tell python how to access the n and n-1 values to realize such operations, and I don't know whether that can be done using pd.DataFrame.iterrows() or .iloc.
I have come up with the following piece of code, but it misses the crucial parts:
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        dflist['t_norm'] = ? / (dflist['time'].max() - dflist['time'].min())
        dflist['vel_norm'] = ? / (dflist['velocity'].max() - dflist['velocity'].min())
        dflist['acc_norm'] = dflist['vel_norm'] / dflist['t_norm']
Any help is welcome..! :)
If you just want to normalise, you can write the expression directly, using Series.min and Series.max:
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
However, if you want the difference between successive elements, you can use Series.diff:
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
Testing:
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598], 'velocity': [0.136731, 0.244889, 0.386443]})
print(df)
# time velocity
# 0 0.000000 0.136731
# 1 0.020373 0.244889
# 2 0.040598 0.386443
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
print(df)
# time velocity normtime difftime
# 0 0.000000 0.136731 0.000000 NaN
# 1 0.020373 0.244889 0.501823 0.501823
# 2 0.040598 0.386443 1.000000 0.498177
You can use Series.shift to create lagged columns:
df['time_n-1'] = df['time'].shift(1)
Also, the first value of the two new column should be set to 0.
Use df['column']=df['column'].fillna(0) after your calculations
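Putting the pieces together inside the question's nested loop, a sketch (the dictionary layout is a minimal stand-in assumed from the question):

```python
import pandas as pd

# Minimal stand-in for the question's nested dict of DataFrames
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598],
                   'velocity': [0.136731, 0.244889, 0.386443]})
dict_all_raw = {'run1': {'trial1': df}}

for nested_dict in dict_all_raw.values():
    for d in nested_dict.values():
        # (n) minus (n-1) via diff(), scaled by the column's range
        d['t_norm'] = d['time'].diff() / (d['time'].max() - d['time'].min())
        d['vel_norm'] = d['velocity'].diff() / (d['velocity'].max() - d['velocity'].min())
        # diff() leaves NaN in the first row; the question wants 0 there
        d[['t_norm', 'vel_norm']] = d[['t_norm', 'vel_norm']].fillna(0)
```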

Pandas, get pct change period mean

I have a Data Frame which contains a column like this:
pct_change
0 NaN
1 -0.029767
2 0.039884 # period of one
3 -0.026398
4 0.044498 # period of two
5 0.061383 # period of two
6 -0.006618
7 0.028240 # period of one
8 -0.009859
9 -0.012233
10 0.035714 # period of three
11 0.042547 # period of three
12 0.027874 # period of three
13 -0.008823
14 -0.000131
15 0.044907 # period of one
I want to get the lengths of all the periods where the pct change was positive into a list, so with the example column it would be:
raise_periods = [1,2,1,3,1]
Assuming that the column of your dataframe is a series called y containing the pct_change values, the following code provides a vectorized solution without loops.
y = df['pct_change']
raise_periods = (y < 0).cumsum()[y > 0]
raise_periods.groupby(raise_periods).count()
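A quick check of this idea on the sample column (values copied from the question; the first NaN compares False in both masks, so it falls out naturally):

```python
import pandas as pd

y = pd.Series([float('nan'), -0.029767, 0.039884, -0.026398, 0.044498,
               0.061383, -0.006618, 0.028240, -0.009859, -0.012233,
               0.035714, 0.042547, 0.027874, -0.008823, -0.000131, 0.044907])

# Each positive entry is labelled by how many negatives precede it,
# so consecutive positives share a label; counting per label gives run lengths
labels = (y < 0).cumsum()[y > 0]
raise_periods = labels.groupby(labels).count().tolist()
print(raise_periods)  # -> [1, 2, 1, 3, 1]
```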
Eventually, the answer provided by @gioxc88 didn't get me where I wanted, but it did put me in the right direction.
What I ended up doing is this:
def get_rise_avg_period(cls, df):
    df[COMPOUND_DIFF] = df[NEWS_COMPOUND].diff()
    df[CONSECUTIVE_COMPOUND] = df[COMPOUND_DIFF].apply(lambda x: 1 if x > 0 else 0)
    # group together the periods of rise and down changes
    unfiltered_periods = [list(group) for key, group in itertools.groupby(df.consecutive_high.values.tolist())]
    # filter out only the rise periods
    positive_periods = [li for li in unfiltered_periods if 0 not in li]
I wanted to get the average length of these positive periods, so I added this at the end:
positive_periods_lens = [len(li) for li in positive_periods]
period = round(np.mean(positive_periods_lens))
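The run-length part of this approach can be sketched stand-alone with plain 0/1 flags (the column names in the original are class constants not shown in the question, so they are replaced with a hypothetical flags list mirroring the sample column):

```python
import itertools
import numpy as np

# 1 where pct_change is positive, 0 otherwise (NaN counts as 0),
# mirroring the sample column from the question
flags = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]

# itertools.groupby splits the list into runs of equal values
runs = [list(group) for _, group in itertools.groupby(flags)]
positive_periods = [run for run in runs if 0 not in run]
positive_periods_lens = [len(run) for run in positive_periods]
print(positive_periods_lens)  # -> [1, 2, 1, 3, 1]
period = round(np.mean(positive_periods_lens))  # average run length, rounded
```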

Calculate weighted sum using two columns in pandas dataframe

I am trying to calculate a weighted sum using two columns in a python dataframe.
Dataframe structure:
unique_id weight value
1 0.061042375 20.16094523
1 0.3064548 19.50932003
1 0.008310739 18.76469039
1 0.624192086 21.25
2 0.061042375 20.23776924
2 0.3064548 19.63366165
2 0.008310739 18.76299395
2 0.624192086 21.25
.......
Output I desired is:
Weighted sum for each unique_id = sum((weight) * (value))
Example: Weighted sum for unique_id 1 = ( (0.061042375 * 20.16094523) + (0.3064548 * 19.50932003) + (0.008310739 * 18.76469039) + (0.624192086 * 21.25) )
I checked out this answer (Calculate weighted average using a pandas/dataframe) but could not figure out the correct way of applying it to my specific scenario.
This is what I am doing based on the above answer:
#Assume temp_weighted_sum_dataframe is the dataframe stated above
grouped_data = temp_weighted_sum_dataframe.groupby('unique_id') #I think this groups data based on unique_id values
weighted_sum_output = (grouped_data.weight * grouped_data.value).transform("sum") #This should let me multiply weight and value for every record within each group and sum them to one value per group.
# On the above line I get the error > TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'SeriesGroupBy'
Any help is appreciated, thanks
The accepted answer in the linked question would indeed solve your problem. However, I would solve it differently with just one groupby:
u = (df.assign(s=df['weight'] * df['value'])
       .groupby('unique_id')[['s', 'weight']]
       .sum())
u['s'] / u['weight']
Output:
unique_id
1 20.629427
2 20.672208
dtype: float64
you could do it this way:
df['partial_sum'] = df['weight']*df['value']
out = df.groupby('unique_id')['partial_sum'].agg('sum')
output:
unique_id
1 20.629427
2 20.672208
or..
df['weight'].mul(df['value']).groupby(df['unique_id']).sum()
same output
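To sanity-check against the question's numbers, a self-contained run of this last variant (data copied from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'unique_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'weight': [0.061042375, 0.3064548, 0.008310739, 0.624192086] * 2,
    'value': [20.16094523, 19.50932003, 18.76469039, 21.25,
              20.23776924, 19.63366165, 18.76299395, 21.25],
})

# Weighted sum per group: sum(weight * value)
out = df['weight'].mul(df['value']).groupby(df['unique_id']).sum()
print(out.round(6))
# unique_id
# 1    20.629427
# 2    20.672208
# dtype: float64
```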
You may take advantage of agg together with @ (it is the dot-product operator):
df.groupby('unique_id')[['weight']].agg(lambda x: x.weight @ x.value)
Out[24]:
weight
unique_id
1 20.629427
2 20.672208

Calculate absolute from percent change using pandas dataframe

I know how to calculate percent change from absolute using a pandas dataframe, using the following:
df_pctChange = df_absolute.pct_change()
But I can't seem to figure out how to calculate the inverse: using the initial row of df_absolute as the starting point, how do I calculate the absolute number from the percent change located in df_pctChange?
As an example, let's say that the initial row for the two columns in df_absolute is 548625 and 525980, and the df_pctChange is the following:
NaN NaN
-0.004522 -0.000812
-0.009018 0.001385
-0.009292 -0.002438
How can I produce the content of df_absolute? It should look as follows:
548625 525980
546144 525553
541219 526281
536190 524998
You should be able to use the formula:
(1 + r).cumprod()
to get a cumulative growth factor.
Example:
>>> data
0 1
0 548625 525980
1 546144 525553
2 541219 526281
3 536190 524998
>>> pctchg = data.pct_change()
>>> init = data.iloc[0] # may want to use `data.iloc[0].copy()`
>>> res = (1 + pctchg).cumprod() * init
>>> res.iloc[0] = init
>>> res
0 1
0 548625.0 525980.0
1 546144.0 525553.0
2 541219.0 526281.0
3 536190.0 524998.0
To confirm you worked backwards into the correct absolute figures:
>>> np.allclose(data, res)
True
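The session above, condensed into one runnable check:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({0: [548625, 546144, 541219, 536190],
                     1: [525980, 525553, 526281, 524998]})

pctchg = data.pct_change()  # forward: absolute -> percent change
init = data.iloc[0]

# backward: compound the growth factors onto the initial row
res = (1 + pctchg).cumprod() * init
res.iloc[0] = init          # first row of the cumprod is NaN

print(np.allclose(data, res))  # -> True
```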

Building complex subsets in Pandas DataFrame

I'm making my way around GroupBy, but I still need some help. Let's say I have a DataFrame with a column Group giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
    'R': (-21.0, -21.5, -22.1, -23.7, -23.8, -20.4, -21.8, -19.3, -22.5, -24.7, -19.9),
    'RA': (154.362789, 154.409301, 154.419191, 154.474165, 154.424842, 162.568516,
           8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Dec': (-0.495605, -0.453085, -0.481657, -0.614827, -0.584243, 8.214719,
            8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Group': (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 objects of the group closest to it (so I keep 4 objects in each group; we can assume that no group has fewer than 4 objects if needed).
We assume here that we have defined the following functions:
#deg to rad
def d2r(x):
    return x * np.pi / 180.0

#rad to deg
def r2d(x):
    return x * 180.0 / np.pi

#Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that the separation between two objects is given by r2d(calc_sep(RA1, Dec1, RA2, Dec2)), with RA1 the RA of the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep + 1).index
    return df.loc[idx]
Then just apply it, and you get a MultiIndex DataFrame whose first index level is the group.
print(df.groupby('Group').apply(sep_df))
               Dec  Group      R         RA
Group
1     3   -0.61483      1  -23.7  154.47416
      2   -0.48166      1  -22.1  154.41919
      0   -0.49561      1  -21.0  154.36279
      4   -0.58424      1  -23.8  154.42484
2     8    8.72822      2  -22.5    8.72822
      10   8.79980      2  -19.9    8.79980
      6    8.35545      2  -21.8    8.35545
      9    8.75962      2  -24.7    8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)

    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row

    # Calculate separation for each pair including the minimum-R row.
    # The result is a Series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))

    # Get index values of the `keep` (default 3) smallest results.
    # Retain `keep + 1` values because one will be the minimum-R
    # row itself, where separation == 0
    idx = sep.nsmallest(keep + 1).index

    # Restrict the result to those 3 index labels + your minimum-R row
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[1,1]
).groupby('Group')[df.columns].head(4)
Produces the following data frame with your sample data:
          Dec  Group      R          RA
4   -0.584243      1  -23.8  154.424842
3   -0.614827      1  -23.7  154.474165
2   -0.481657      1  -22.1  154.419191
0   -0.495605      1  -21.0  154.362789
9    8.759622      2  -24.7    8.759622
8    8.728223      2  -22.5    8.728223
10   8.799796      2  -19.9    8.799796
6    8.355454      2  -21.8    8.355454
