AttributeError: 'DataFrame' object has no attribute 'colmap' in Python

I am a Python beginner and I am trying to use the following code from this source: Portfolio rebalancing with bandwidth method in python
The code works well so far.
The problem is that if I want to call the function not as usual, like rebalance(df, tol), but from a certain location in the dataframe onward, like rebalance(df[500:], tol), I get the following error:
AttributeError: 'DataFrame' object has no attribute 'colmap'. So my question is: how do I have to adjust the code in order to make this possible?
Here is the code:
import datetime as DT
import numpy as np
import pandas as pd
import pandas.io.data as PID

def setup_df():
    df1 = PID.get_data_yahoo("IBM",
                             start=DT.datetime(1970, 1, 1),
                             end=DT.datetime.today())
    df1.rename(columns={'Adj Close': 'ibm'}, inplace=True)
    df2 = PID.get_data_yahoo("F",
                             start=DT.datetime(1970, 1, 1),
                             end=DT.datetime.today())
    df2.rename(columns={'Adj Close': 'ford'}, inplace=True)
    df = df1.join(df2.ford, how='inner')
    df = df[['ibm', 'ford']]
    df['sh ibm'] = 0
    df['sh ford'] = 0
    df['ibm value'] = 0
    df['ford value'] = 0
    df['ratio'] = 0
    # This is useful in conjunction with iloc for referencing column names by
    # index number
    df.colmap = dict([(col, i) for i, col in enumerate(df.columns)])
    return df
def invest(df, i, amount):
    """
    Invest amount dollars evenly between ibm and ford
    starting at ordinal index i.
    This modifies df.
    """
    c = df.colmap
    halfvalue = amount/2
    df.iloc[i:, c['sh ibm']] = halfvalue / df.iloc[i, c['ibm']]
    df.iloc[i:, c['sh ford']] = halfvalue / df.iloc[i, c['ford']]
    df.iloc[i:, c['ibm value']] = (
        df.iloc[i:, c['ibm']] * df.iloc[i:, c['sh ibm']])
    df.iloc[i:, c['ford value']] = (
        df.iloc[i:, c['ford']] * df.iloc[i:, c['sh ford']])
    df.iloc[i:, c['ratio']] = (
        df.iloc[i:, c['ibm value']] / df.iloc[i:, c['ford value']])
def rebalance(df, tol):
    """
    Rebalance df whenever the ratio falls outside the tolerance range.
    This modifies df.
    """
    i = 0
    amount = 100
    c = df.colmap
    while True:
        invest(df, i, amount)
        mask = (df['ratio'] >= 1+tol) | (df['ratio'] <= 1-tol)
        # ignore prior locations where the ratio falls outside tol range
        mask[:i] = False
        try:
            # Move i one index past the first index where mask is True
            # Note that this means the ratio at i will remain outside tol range
            i = np.where(mask)[0][0] + 1
        except IndexError:
            break
        amount = (df.iloc[i, c['ibm value']] + df.iloc[i, c['ford value']])
    return df
df = setup_df()
tol = 0.05  # setting the bandwidth tolerance
rebalance(df, tol)

df['portfolio value'] = df['ibm value'] + df['ford value']
df["ibm_weight"] = df['ibm value']/df['portfolio value']
df["ford_weight"] = df['ford value']/df['portfolio value']

print(df['ibm_weight'].min())
print(df['ibm_weight'].max())
print(df['ford_weight'].min())
print(df['ford_weight'].max())

# This shows the rows which trigger rebalancing
mask = (df['ratio'] >= 1+tol) | (df['ratio'] <= 1-tol)
print(df.loc[mask])

The problem you encountered is due to a poor design decision on my part.
colmap is an attribute defined on df in setup_df:
df.colmap = dict([(col, i) for i,col in enumerate(df.columns)])
It is not a standard attribute of a DataFrame.
df[500:] returns a new DataFrame which is generated by copying data from df into the new DataFrame. Since colmap is not a standard attribute, it is not copied into the new DataFrame.
To call rebalance on a DataFrame other than the one returned by setup_df, replace c = df.colmap with
c = dict([(col, j) for j,col in enumerate(df.columns)])
I've made this change in the original post as well.
PS. In the other question, I had chosen to define colmap on df itself so
that this dict would not have to be recomputed with every call to rebalance
and invest.
Your question shows me that this minor optimization is not worth making these
functions so dependent on the specialness of the DataFrame returned by
setup_df.
There is a second problem you will encounter using rebalance(df[500:], tol):
Since df[500:] returns a copy of a portion of df, rebalance(df[500:], tol) will modify
this copy and not the original df. If the object, df[500:],
has no reference outside of rebalance(df[500:], tol), it will be garbage
collected after the call to rebalance is completed. So the entire computation
would be lost. Therefore rebalance(df[500:], tol) is not useful.
Instead, you could modify rebalance to accept i as a parameter:
def rebalance(df, tol, i=0):
    """
    Rebalance df whenever the ratio falls outside the tolerance range.
    This modifies df.
    """
    c = dict([(col, j) for j, col in enumerate(df.columns)])
    while True:
        mask = (df['ratio'] >= 1+tol) | (df['ratio'] <= 1-tol)
        # ignore prior locations where the ratio falls outside tol range
        mask[:i] = False
        try:
            # Move i one index past the first index where mask is True
            # Note that this means the ratio at i will remain outside tol range
            i = np.where(mask)[0][0] + 1
        except IndexError:
            break
        amount = (df.iloc[i, c['ibm value']] + df.iloc[i, c['ford value']])
        invest(df, i, amount)
    return df
Then you can rebalance df starting at the 500th row using
rebalance(df, tol, i=500)
Note that this finds the first row on or after i=500 that needs
rebalancing. It does not necessarily rebalance at i=500 itself. This allows you to call rebalance(df, tol, i) for arbitrary i without having to determine in advance if rebalancing is required on row i.
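A side note (my own sketch, not part of the original answer): the same label-to-position mapping can be rebuilt from any DataFrame or slice with Index.get_loc, so nothing special has to be stored on the object:

import pandas as pd

# toy frame standing in for the real data
df_slice = pd.DataFrame({'ibm': [100.0, 101.0], 'ford': [10.0, 11.0]})
c = {col: df_slice.columns.get_loc(col) for col in df_slice.columns}
print(c)                            # {'ibm': 0, 'ford': 1}
print(df_slice.iloc[:, c['ibm']])   # positional column access, as in invest/rebalance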

Related

A more efficient way to split timeseries data (pd.Series) at gaps?

I am trying to split a pd.Series of sorted dates that sometimes have gaps between them that are bigger than the normal ones. To do this, I calculated the size of the gaps with pd.Series.diff() and then iterated over all the elements in the series with a while-loop. But this is unfortunately quite computationally intensive. Is there a better way (apart from parallelization)?
Minimal example with my function:
import pandas as pd
import time
import pandas as pd
import time

def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # create the list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: add data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # not a normal gap: create a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# make sample data as example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data contain over 150k entries and take several minutes to compute... :/
I'm not sure I understand completely what you want but I think this could work:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)

idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices directly after a gap, and the rest just splits the series at these indices and packs the pieces into the list samples_list.
If the index is non-standard, you need some overhead (resetting the index and later setting it back to the original) to make sure that iloc can be used.
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})

idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)

samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(You don't need that for your example.)
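As a further variation (my own sketch, not from the answer above), the same split can be done purely positionally with iloc, which sidesteps the index question entirely; it assumes data_with_samples and normal_distance are built as in the question:

import numpy as np

# positions right after each oversized gap
gap_positions = np.flatnonzero(data_with_samples.diff() > normal_distance)
boundaries = [0, *gap_positions, len(data_with_samples)]
samples_list = [data_with_samples.iloc[a:b]
                for a, b in zip(boundaries[:-1], boundaries[1:])]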
Your code is a bit unclear regarding how these two different lists should be stored. Specifically, I'm not sure what the correct structure of samples_list is that you have in mind.
Regardless, using Series.pct_change and np.unique() you should achieve approximately what you're looking for.
uniques, indices = np.unique(
    data_with_samples.diff()[1:].pct_change(),
    return_index=True)
Now indices points you to the start and end of that wrong gap.
If your data has more than one gap, then you'd want to use only diff()[1:].pct_change() and look for all values that are different from 0 using where().
Using the same sample data as in the question above:
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
Use the time diff to compare with normal_distance.seconds.
Create an auxiliary column tag to separate the gap groups.
import numpy as np

# start sampling
start_time = time.time()

df = data_with_samples.to_frame()
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal_distance.seconds) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()

my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])

print(f"Duration: {time.time() - start_time}")

fast way to iterate through list, find duplicates and perform calculations

I have two lists, one of areas and one of prices which are the same size.
For example:
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
For each entry, I need to return the list of prices of the other entries with the same area, not including the entry itself.
So for 1500, the lists that I need returned are: [750] and [200]
For the 2000, the lists that I need returned are [600,1000], [800,1000] and [800,600]
For the 1800 and 500, the lists I need returned are both empty lists [].
The goal is then to determine whether a value is an outlier: a price is flagged as an outlier if the absolute value of (the price minus the mean of the other prices, excluding the price itself) exceeds 5 times the population standard deviation (also calculated excluding the price itself).
import statistics

area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]

outlier_idx = []
for idx, val in enumerate(area):
    comp_idx = [i for i, x in enumerate(area) if x == val]
    comp_idx.remove(idx)
    comp_price = [price[i] for i in comp_idx]
    if len(comp_price) > 2:
        sigma = statistics.stdev(comp_price)
        p_m = statistics.mean(comp_price)
        if abs(price[idx]-p_m) > 5 * sigma:
            outlier_idx.append(idx)

area = [i for j, i in enumerate(area) if j not in outlier_idx]
price = [i for j, i in enumerate(price) if j not in outlier_idx]
The problem is that this calculation takes up a lot of time and I am dealing with arrays that can be quite large.
I am stuck as to how I can increase the computational efficiency.
I am open to using numpy, pandas or any other common packages.
Additionally, I have tried the problem in pandas:
df['p-p_m'] = ''
df['sigma'] = ''
df['outlier'] = False
for name, group in df.groupby('area'):
    if len(group) > 1:
        idx = list(group.index)
        for i in range(len(idx)):
            tmp_idx = idx.copy()
            tmp_idx.pop(i)
            df['p-p_m'][idx[i]] = abs(group.price[idx[i]] - group.price[tmp_idx].mean())
            df['sigma'][idx[i]] = group.price[tmp_idx].std(ddof=0)
            if df['p-p_m'][idx[i]] > 3*df['sigma'][idx[i]]:
                df['outlier'][idx[i]] = True
Thanks.
This code shows how to create the list for each area:
df = pd.DataFrame({'area': area, 'price': price})
price_to_delete = [item for idx_array in df.groupby('price').groups.values() for item in idx_array[1:]]
df.loc[price_to_delete, 'price'] = None
df = df.groupby('area').agg(lambda x: [] if all(x.isnull()) else x.tolist())
df
I don't fully understand what you want, but this part calculates outliers for each price within each area:
df['outlier'] = False
df['outlier'] = df['price'].map(lambda x: abs(np.array(x) - np.mean(x)) > 3*np.std(x) if len(x) > 0 else [])
df
I hope this helps you in some way!
Here is a solution that combines Numpy and Numba. Although correct, I did not test it against alternative approaches regarding efficiency, but Numba usually results in significant speedups for tasks that require looping through data. I have added one extra point which is an outlier, according to your definition.
import numpy as np
from numba import jit

# data input
price = np.array([200,800,600,800,1000,750,200, 2000])
area = np.array([1500,2000,2000,1800,2000,1500,500, 1500])

@jit(nopython=True)
def outliers(price, area):
    is_outlier = np.full(len(price), False)
    for this_area in set(area):
        indexes = area == this_area
        these_prices = price[indexes]
        for this_price in set(these_prices):
            arr2 = these_prices[these_prices != this_price]
            if arr2.size > 1:
                std = arr2.std()
                mean = arr2.mean()
                indices = (this_price == price) & (this_area == area)
                is_outlier[indices] = np.abs(mean - this_price) > 5 * std
    return is_outlier

>>> outliers(price, area)
array([False, False, False, False, False, False, False,  True])
The code should be fast in case you have several identical price levels for each area, as they will be updated all at once.
I hope this helps.
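For larger arrays, a fully vectorized variant of the same leave-one-out test is also possible with pandas groupby transforms. This is a sketch of my own (it uses the population standard deviation, like the ddof=0 pandas attempt above, and the same len(comp_price) > 2 guard as the original loop):

import numpy as np
import pandas as pd

area = [1500, 2000, 2000, 1800, 2000, 1500, 500]
price = [200, 800, 600, 800, 1000, 750, 200]
df = pd.DataFrame({'area': area, 'price': price})

g = df.groupby('area')['price']
n_excl = g.transform('size') - 1                              # number of "other" prices per row
s = g.transform('sum')                                        # group sum per row
q = df['price'].pow(2).groupby(df['area']).transform('sum')   # group sum of squares per row

# leave-one-out mean and population variance (each row excluded from its own statistics)
mean_excl = (s - df['price']) / n_excl
var_excl = (q - df['price'] ** 2) / n_excl - mean_excl ** 2
std_excl = np.sqrt(var_excl.clip(lower=0))

df['outlier'] = (n_excl > 2) & ((df['price'] - mean_excl).abs() > 5 * std_excl)

On the question's example data nothing is flagged, matching the loop version; each group statistic is computed once, so this scales to large arrays.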

How can I combine 3 matrices into 1 matrix with a reversible approach?

I want to reshape my 24x20 matrices 'A', 'B', 'C', which are extracted from a text file and saved before and after normalizing with def normalize() in a for-loop over the cycles, in such a way that each cycle becomes one row containing all elements of the 3 matrices side by side, like below:
[[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)] #cycle1
[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)] #cycle2
[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)]] #cycle3
So far, based on @odyse's suggestion, I used the following snippet at the end of the for-loop:
for cycle in range(cycles):
    dff = pd.DataFrame({'A_norm': A_norm[cycle], 'B_norm': B_norm[cycle], 'C_norm': C_norm[cycle]}, index=[0])
    D = dff.as_matrix().ravel()
    if cycle == 0:
        Results = np.array(D)
    else:
        Results = np.vstack((Results, D))

np.savetxt("Results.csv", Results, delimiter=",")
but there is a problem when I use it after def normalize() in the for-loop: besides the error (ValueError) it also raises FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead for D = dff.as_matrix().ravel(), which is not the important part right now since it is only a FutureWarning. Nevertheless, I checked the shape of the output for 3 cycles by using print(data1.shape) and it was (3, 1440), which is 3 rows for the 3 cycles, and the number of columns should be 3 times 480 = 1440; but all in all it wasn't a stable solution.
The complete script follows:
import numpy as np
import pandas as pd
import os

def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value

# the size of the matrices is (24,20)
df1 = np.zeros((24,20))
df2 = np.zeros((24,20))
df3 = np.zeros((24,20))

# next iteration creates all plots; change the number of cycles
cycles = int(len(df)/480)
print(cycles)

for cycle in range(3):
    count = '{:04}'.format(cycle)
    j = cycle * 480
    new_value1 = df['A'].iloc[j:j+480]
    new_value2 = df['B'].iloc[j:j+480]
    new_value3 = df['C'].iloc[j:j+480]
    df1 = print_df(mkdf(new_value1))
    df2 = print_df(mkdf(new_value2))
    df3 = print_df(mkdf(new_value3))
    for i in df:
        try:
            os.mkdir(i)
        except:
            pass
        min_val = df[i].min()
        min_nor = -1
        max_val = df[i].max()
        max_nor = 1
        ordered_data = mkdf(df.iloc[j:j+480][i])
        csv = print_df(ordered_data)
        # Print .csv files containing the matrix of each parameter, named by cycle
        csv.to_csv(f'{i}/{i}{count}.csv', header=None, index=None)
        if 'C' in i:
            min_nor = -40
            max_nor = 150
            # Applying normalization for C between [-40,+150]
            new_value3 = normalize(df['C'].iloc[j:j+480], min_val, max_val, -40, 150)
            C_norm = print_df(mkdf(new_value3))
            C_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
        else:
            # Applying normalization for A,B between [-1,+1]
            new_value1 = normalize(df['A'].iloc[j:j+480], min_val, max_val, -1, 1)
            new_value2 = normalize(df['B'].iloc[j:j+480], min_val, max_val, -1, 1)
            A_norm = print_df(mkdf(new_value1))
            B_norm = print_df(mkdf(new_value2))
            A_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
            B_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
    dff = pd.DataFrame({'A_norm': A_norm[cycle], 'B_norm': B_norm[cycle], 'C_norm': C_norm[cycle]}, index=[0])
    D = dff.as_matrix().ravel()
    if cycle == 0:
        Results = np.array(D)
    else:
        Results = np.vstack((Results, D))

np.savetxt("Results.csv", Results, delimiter=',', encoding='utf-8')

# Check whether the output shape is (3, 1440) or not
data1 = np.loadtxt('Results.csv', delimiter=',')
print(data1.shape)
Note1: my data in the txt file looks like the following:
id_set: 000
A: -2.46882615679
B: -2.26408246559
C: -325.004619528
Note2: I provided a dataset in text file for 3 cycles:
Text dataset
Note3: for mapping the A, B, C parameters into matrices in the right order I used the print_df() and mkdf() functions, but I didn't include them in order to reduce this to the core problem and keep a minimal example at the start of this post. Let me know if you need them.
The expected result should be produced by completing the for-loop over 'A_norm', 'B_norm', 'C_norm', which are the normalized versions of 'A', 'B', 'C' respectively. The output, let's call it "Results.csv", should be reversible, so that the 'A', 'B', 'C' matrices can be regenerated through the cycles again and saved in csv files for checking. Therefore, if you have any ideas about the reverse part, please mention that separately; otherwise just verify it by using print(data.shape), which should give (3, 1440).
Have a nice day and thanks in advance!
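For the interleaving itself, here is a minimal NumPy sketch of the row layout described at the top of the question and of its reversal (it assumes A_norm, B_norm and C_norm are already plain 24x20 arrays for one cycle; random data stands in for the real values):

import numpy as np

A_norm = np.random.rand(24, 20)
B_norm = np.random.rand(24, 20)
C_norm = np.random.rand(24, 20)

# one row per cycle, ordered A(1,1), B(1,1), C(1,1), A(1,2), ...
row = np.stack([A_norm, B_norm, C_norm], axis=-1).ravel()   # shape (1440,)

# reverse: reshape back to (24, 20, 3) and split along the last axis
restored = row.reshape(24, 20, 3)
A_back, B_back, C_back = (restored[..., k] for k in range(3))
assert np.allclose(A_back, A_norm) and np.allclose(C_back, C_norm)

Stacking the rows for all cycles with np.vstack then gives the (3, 1440) array that np.savetxt writes out, and the same reshape recovers the matrices from any row of Results.csv.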

Apply function but remember index position

I have a pandas df containing weights. Rows contain dates and columns contain asset names. Every row sums to 1.
I want to run
df_with_stocks_weight.apply(rescale_w, weight_min=0.01, weight_max=0.30)
in order to change the weights so that they still sum to 1 but have a minimum value of 1% and a maximum value of 30%. I tried using the function below, but I get problems with the index: the calculated values are correct, but the output refers to the wrong asset!
def rescale_w(row_input, weight_min, weight_max):
    '''
    :param row_input: a row from a pandas df
    :param weight_min: the floor. type float.
    :param weight_max: the cap. type float.
    :return: a pandas row where weights are adjusted to the specified min and max.

    step 1:
    while any asset has weight above weight_max,
    set that asset's weight to == weight_max
    and distribute the leftovers to all other assets (whose weight is > 0)
    in accordance with their weight.

    step 2:
    if there is a positive weight below min_weight,
    force it to == min_weight
    by stealing from every other asset
    (except those whose weight == max_weight).

    note that the function produces strange output with few assets.
    for example with 3 assets and max 30% the sum is 0.90
    and if A=50% B=20% and one other asset is 1% then
    these are not practical problems as we will analyze data with many assets.
    '''
    # rename
    w1 = row_input

    # na
    # the script returned many errors regarding na
    # so I do a fillna(0) here.
    # if that will be the final solution, some cleaning up can be done
    # e.g. remove _null objects and remove some assertions.
    w1 = w1.fillna(0)

    # remove zeroes to get a faster script
    w1nz = w1[w1 > 0]
    w1z = w1[w1 == 0]
    assert len(w1) == len(w1nz) + len(w1z)
    assert set(w1nz.index).intersection(set(w1z.index)) == set()

    # input must sum to 1
    assert abs(w1nz.sum() - 1) < 0.001

    # only execute if there is at least one notnull value
    # below will work with nz
    if len(w1nz) > 0:
        # step 1: make sure the upper threshold is satisfied
        while max(w1nz) > weight_max:
            # clip at 30%
            w2 = w1nz.clip(upper=weight_max)
            # calc leftovers from this upper clip
            leftover_upper = 1 - w2.sum()
            # add leftovers to the untouched, in accordance with weight
            w2_touched = w2[w2 == weight_max]
            w2_unt = w2[(weight_max > w2) & (w2 > 0)]
            w2_unt_added = w2_unt + leftover_upper * w2_unt / w2_unt.sum()
            # concat all back
            w3 = pd.concat([w2_touched, w2_unt_added], axis=0)
            # same index for output and input
            #w3 = w3.reindex(w1nz.index)  # todo: currently trying to remove .reindex everywhere; checking whether pandas resolves it by itself
            # rename w3 so that it works in a while loop
            w1nz = w3

        usestep2 = False
        if usestep2:
            # step 2: make sure the lower threshold is satisfied
            if min(w1nz) < weight_min:
                # three parts: lower, middle, upper.
                # those in "lower" will receive from those in "middle"
                upper = w1nz[w1nz >= weight_max]
                middle = w1nz[(w1nz > weight_min) & (w1nz < weight_max)]
                lower = w1nz[w1nz <= weight_min]
                # assert len
                assert (len(upper) + len(middle) + len(lower) == len(w1nz))
                # change lower to == weight_min
                lower_modified = lower.clip(lower=weight_min)
                # the weight given to "lower" is stolen from "middle"
                stolen_weights = lower_modified.sum() - lower.sum()
                middle_modified = middle - stolen_weights * middle / middle.sum()
                # concat
                w4 = pd.concat([lower_modified,
                                middle_modified,
                                upper], axis=0)
                # reindex
                #w4 = w4.reindex(w1nz.index)
                # rename
                w1nz = w4

    # lastly, concat adjusted nonzero with zero.
    w1adj = pd.concat([w1nz, w1z], axis=0)
    w1adj = w1adj.reindex(w1.index)  # works?
    assert (w1adj.index == w1.index).all()
    assert abs(w1adj.sum() - 1) < 0.001
    return w1adj
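One thing worth checking (an assumption on my part, not something confirmed in the question): DataFrame.apply passes each column to the function by default, not each row, so if the call in the question runs with the default axis, the "rows" rescale_w sees are actually asset columns and the results can end up attached to the wrong labels. A sketch of the row-wise call:

# hypothetical call, assuming df_with_stocks_weight has dates as the index
# and assets as columns, as described above
rescaled = df_with_stocks_weight.apply(rescale_w, axis=1,
                                       weight_min=0.01, weight_max=0.30)
assert (rescaled.columns == df_with_stocks_weight.columns).all()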

financial python library that has xirr and xnpv function?

numpy has irr and npv functions, but I need xirr and xnpv functions.
This link points out that xirr and xnpv will be coming soon:
http://www.projectdirigible.com/documentation/spreadsheet-functions.html#coming-soon
Is there any Python library that has those two functions? Thanks.
Here is one way to implement the two functions.
import scipy.optimize

def xnpv(rate, values, dates):
    '''Equivalent of Excel's XNPV function.

    >>> from datetime import date
    >>> dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    >>> values = [-10000, 20, 10100]
    >>> xnpv(0.1, values, dates)
    -966.4345...
    '''
    if rate <= -1.0:
        return float('inf')
    d0 = dates[0]    # or min(dates)
    return sum([vi / (1.0 + rate)**((di - d0).days / 365.0) for vi, di in zip(values, dates)])

def xirr(values, dates):
    '''Equivalent of Excel's XIRR function.

    >>> from datetime import date
    >>> dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    >>> values = [-10000, 20, 10100]
    >>> xirr(values, dates)
    0.0100612...
    '''
    try:
        return scipy.optimize.newton(lambda r: xnpv(r, values, dates), 0.0)
    except RuntimeError:    # Failed to converge?
        return scipy.optimize.brentq(lambda r: xnpv(r, values, dates), -1.0, 1e10)
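A quick usage example (the same cashflows as in the docstrings above):

from datetime import date

dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
values = [-10000, 20, 10100]
print(xnpv(0.1, values, dates))   # about -966.43
print(xirr(values, dates))        # about 0.0100612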
With the help of various implementations I found in the net, I came up with a python implementation:
def xirr(transactions):
    years = [(ta[0] - transactions[0][0]).days / 365.0 for ta in transactions]
    residual = 1
    step = 0.05
    guess = 0.05
    epsilon = 0.0001
    limit = 10000
    while abs(residual) > epsilon and limit > 0:
        limit -= 1
        residual = 0.0
        for i, ta in enumerate(transactions):
            residual += ta[1] / pow(guess, years[i])
        if abs(residual) > epsilon:
            if residual > 0:
                guess += step
            else:
                guess -= step
                step /= 2.0
    return guess - 1

from datetime import date
tas = [(date(2010, 12, 29), -10000),
       (date(2012, 1, 25), 20),
       (date(2012, 3, 8), 10100)]
print(xirr(tas))  # 0.0100612640381
Created a package for fast XIRR calculation, PyXIRR
It doesn't have external dependencies and works faster than any existing implementation.
from datetime import date
from pyxirr import xirr
dates = [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)]
amounts = [-1000, 1000, 1000]
# feed columnar data
xirr(dates, amounts)
# feed tuples
xirr(zip(dates, amounts))
# feed DataFrame
import pandas as pd
xirr(pd.DataFrame({"dates": dates, "amounts": amounts}))
This answer is an improvement on @uuazed's answer and derives from it. However, there are a few changes:
It uses a pandas dataframe instead of a list of tuples
It is cashflow direction agnostic, i.e., whether you treat inflows as negative and outflows as positive or vice versa, the result will be the same, as long as the treatment is consistent for all transactions.
XIRR calculation with this method doesn't work if cashflows are not ordered by date. Hence I have handled sorting of the dataframe internally.
In the earlier answer, there was an implicit assumption that XIRR will mostly be positive, which created the problem pointed out in the comments that XIRR between -100% and -95% cannot be calculated. This solution does away with that problem.
import pandas as pd
import numpy as np

def xirr(df, guess=0.05, date_column='date', amount_column='amount'):
    '''Calculates XIRR from a series of cashflows.
    Needs a dataframe with columns date and amount, customisable through parameters.
    Requires Pandas, NumPy libraries'''

    df = df.sort_values(by=date_column).reset_index(drop=True)
    df['years'] = df[date_column].apply(lambda x: (x - df[date_column][0]).days / 365)

    step = 0.05
    epsilon = 0.0001
    limit = 1000
    residual = 1

    # Test for direction of cashflows
    disc_val_1 = df[[amount_column, 'years']].apply(
        lambda x: x[amount_column] / ((1 + guess)**x['years']), axis=1).sum()
    disc_val_2 = df[[amount_column, 'years']].apply(
        lambda x: x[amount_column] / ((1.05 + guess)**x['years']), axis=1).sum()
    mul = 1 if disc_val_2 < disc_val_1 else -1

    # Calculate XIRR
    for i in range(limit):
        prev_residual = residual
        df['disc_val'] = df[[amount_column, 'years']].apply(
            lambda x: x[amount_column] / ((1 + guess)**x['years']), axis=1)
        residual = df['disc_val'].sum()
        if abs(residual) > epsilon:
            if np.sign(residual) != np.sign(prev_residual):
                step /= 2
            guess = guess + step * np.sign(residual) * mul
        else:
            return guess
Explanation:
In the test block, it checks whether increasing the discounting rate increases the discounted value or reduces it. Based on this test, it is determined which direction the guess should move. This block makes the function handle cashflows regardless of direction assumed by the user.
The np.sign(residual) != np.sign(prev_residual) checks when the guess has increased/decreased beyond the required XIRR rate, because that's when the residual goes from negative to positive or vice versa. The step size is reduced at this point.
The numpy package is not absolutely necessary. Without numpy, np.sign(residual) can be replaced with residual/abs(residual). I have used numpy to make the code more readable and intuitive.
I have tried to test this code with a variety of cash flows. If you find any cases which are not handled by this function, do let me know.
Edit: Here's a cleaner and faster version of the code using numpy arrays. In my test with about 700 transactions, this code ran 5 times faster than the one above:
def xirr(df, guess=0.05, date_column='date', amount_column='amount'):
    '''Calculates XIRR from a series of cashflows.
    Needs a dataframe with columns date and amount, customisable through parameters.
    Requires Pandas, NumPy libraries'''

    df = df.sort_values(by=date_column).reset_index(drop=True)

    amounts = df[amount_column].values
    dates = df[date_column].values
    years = np.array(dates - dates[0], dtype='timedelta64[D]').astype(int) / 365

    step = 0.05
    epsilon = 0.0001
    limit = 1000
    residual = 1

    # Test for direction of cashflows
    disc_val_1 = np.sum(amounts / ((1 + guess)**years))
    disc_val_2 = np.sum(amounts / ((1.05 + guess)**years))
    mul = 1 if disc_val_2 < disc_val_1 else -1

    # Calculate XIRR
    for i in range(limit):
        prev_residual = residual
        residual = np.sum(amounts / ((1 + guess)**years))
        if abs(residual) > epsilon:
            if np.sign(residual) != np.sign(prev_residual):
                step /= 2
            guess = guess + step * np.sign(residual) * mul
        else:
            return guess
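A small usage sketch for the function above (the cashflows are the same example used in the first answer, so the result should come out near 0.0100612):

from datetime import date
import pandas as pd

cashflows = pd.DataFrame({
    'date': [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)],
    'amount': [-10000, 20, 10100],
})
print(xirr(cashflows))   # roughly 0.01006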
I started from @KT's solution but improved on it in a few ways:
as pointed out by others, there is no need for xnpv to return inf if the discount rate <= -100%
if the cashflows are all positive or all negative, we can return a nan straight away: no point in letting the algorithm search forever for a solution which doesn't exist
I have made the daycount convention an input; sometimes it is 365, other times it is 360 - it depends on the case. I have not modelled 30/360. More details are in Matlab's docs
I have added optional inputs for the maximum number of iterations and for the starting point of the algorithm
I have not changed the default tolerance of the algorithms but that's very easy to change
Key findings for the specific example below (results may well be different for other cases; I have not had the time to test many others):
starting from a value = -sum(all cashflows) / sum(negative cashflows) slows the algorithm a little bit (by 7-10%)
scipy's newton is faster than scipy's fsolve
Execution time with newton vs fsolve:
import numpy as np
import pandas as pd
import scipy
import scipy.optimize
from datetime import date
import timeit

def xnpv(rate, values, dates, daycount=365):
    daycount = float(daycount)
    # Why would you want to return inf if the rate <= -100%? I removed it, I don't see how it makes sense
    # if rate <= -1.0:
    #     return float('inf')
    d0 = dates[0]    # or min(dates)
    # NB: this xnpv implementation discounts the first value LIKE EXCEL
    # numpy's npv does NOT, it only starts discounting from the 2nd
    return sum([vi / (1.0 + rate)**((di - d0).days / daycount) for vi, di in zip(values, dates)])

def find_guess(cf):
    whereneg = np.where(cf < 0)
    sumneg = np.sum(cf[whereneg])
    return -np.sum(cf) / sumneg

def xirr_fsolve(values, dates, daycount=365, guess=0, maxiters=1000):
    cf = np.array(values)
    if (np.where(cf < 0, 1, 0).sum() == 0) | (np.where(cf > 0, 1, 0).sum() == 0):
        # if the cashflows are all positive or all negative, no point letting the algorithm
        # search forever for a solution which doesn't exist
        return np.nan

    result = scipy.optimize.fsolve(lambda r: xnpv(r, values, dates, daycount),
                                   x0=guess, maxfev=maxiters, full_output=True)

    if result[2] == 1:  # ie if the solution converged; if it didn't, result[0] will be the last iteration, which won't be a solution
        return result[0][0]
    else:
        # consider raising a warning
        return np.nan

def xirr_newton(values, dates, daycount=365, guess=0, maxiters=1000, a=-100, b=1e5):
    # a and b: lower and upper bound for the brentq algorithm
    cf = np.array(values)
    if (np.where(cf < 0, 1, 0).sum() == 0) | (np.where(cf > 0, 1, 0).sum() == 0):
        # if the cashflows are all positive or all negative, no point letting the algorithm
        # search forever for a solution which doesn't exist
        return np.nan

    res_newton = scipy.optimize.newton(lambda r: xnpv(r, values, dates, daycount),
                                       x0=guess, maxiter=maxiters, full_output=True)

    if res_newton[1].converged == True:
        out = res_newton[0]
    else:
        res_b = scipy.optimize.brentq(lambda r: xnpv(r, values, dates, daycount),
                                      a=a, b=b, maxiter=maxiters, full_output=True)
        if res_b[1].converged == True:
            out = res_b[0]
        else:
            out = np.nan
    return out

# let's compare how long each takes
d0 = pd.to_datetime(date(2010, 1, 1))

# an investment in which we pay 100 in the first month, then get 2 each month for the next 59 months
df = pd.DataFrame()
df['month'] = np.arange(0, 60)
df['dates'] = df.apply(lambda x: d0 + pd.DateOffset(months=x['month']), axis=1)
df['cf'] = 0
df.iloc[0, 2] = -100
df.iloc[1:, 2] = 2

r = 100
n = 5

t_newton_no_guess = timeit.Timer("xirr_newton(df['cf'], df['dates'], guess = find_guess(df['cf'].to_numpy() ) )", globals=globals()).repeat(repeat=r, number=n)
t_fsolve_no_guess = timeit.Timer("xirr_fsolve(df['cf'], df['dates'], guess = find_guess(df['cf'].to_numpy() ) )", globals=globals()).repeat(repeat=r, number=n)
t_newton_guess_0 = timeit.Timer("xirr_newton(df['cf'], df['dates'], guess = 0.)", globals=globals()).repeat(repeat=r, number=n)
t_fsolve_guess_0 = timeit.Timer("xirr_fsolve(df['cf'], df['dates'], guess = 0.)", globals=globals()).repeat(repeat=r, number=n)

resdf = pd.DataFrame(index=['min time'])
resdf['newton no guess'] = [min(t_newton_no_guess)]
resdf['fsolve no guess'] = [min(t_fsolve_no_guess)]
resdf['newton guess 0'] = [min(t_newton_guess_0)]
resdf['fsolve guess 0'] = [min(t_fsolve_guess_0)]
# the docs explain why we should take the min and not the avg
resdf = resdf.transpose()
resdf['% diff vs fastest'] = (resdf / resdf.min() - 1) * 100
Conclusions
I noticed there were some cases in which newton and brentq didn't converge, but fsolve did, so I modified the function so that, in order, it starts with newton, then brentq, then, lastly, fsolve.
I haven't actually found a case in which brentq was used to find a solution. I'd be curious to understand when it would work, otherwise it's probably best to just remove it.
I went back to try/except because I noticed the code above wasn't identifying all the cases of non-convergence. That's something I'd like to look into when I have a bit more time.
This is my final code:
def xirr(values, dates, daycount=365, guess=0, maxiters=10000, a=-100, b=1e10):
    # a and b: lower and upper bound for the brentq algorithm
    cf = np.array(values)
    if (np.where(cf < 0, 1, 0).sum() == 0) | (np.where(cf > 0, 1, 0).sum() == 0):
        # if the cashflows are all positive or all negative, no point letting the algorithm
        # search forever for a solution which doesn't exist
        return np.nan

    try:
        output = scipy.optimize.newton(lambda r: xnpv(r, values, dates, daycount),
                                       x0=guess, maxiter=maxiters, full_output=True, disp=True)[0]
    except RuntimeError:
        try:
            output = scipy.optimize.brentq(lambda r: xnpv(r, values, dates, daycount),
                                           a=a, b=b, maxiter=maxiters, full_output=True, disp=True)[0]
        except:
            result = scipy.optimize.fsolve(lambda r: xnpv(r, values, dates, daycount),
                                           x0=guess, maxfev=maxiters, full_output=True)
            if result[2] == 1:  # ie if the solution converged; if it didn't, result[0] will be the last iteration, which won't be a solution
                output = result[0][0]
            else:
                output = np.nan
    return output
Tests
These are some tests I have put together with pytest
import pytest
import numpy as np
import pandas as pd
import whatever_the_file_name_was as finc
from datetime import date

def test_xirr():
    dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    values = [-10000, 20, 10100]
    assert pytest.approx(finc.xirr(values, dates)) == 1.006127e-2

    dates = [date(2010, 1, 1), date(2010, 12, 27)]
    values = [-100, 110]
    assert pytest.approx(finc.xirr(values, dates, daycount=360)) == 0.1
    values = [100, -110]
    assert pytest.approx(finc.xirr(values, dates, daycount=360)) == 0.1
    values = [-100, 90]
    assert pytest.approx(finc.xirr(values, dates, daycount=360)) == -0.1

    # test numpy arrays
    values = np.array([-100, 0, 121])
    dates = [date(2010, 1, 1), date(2011, 1, 1), date(2012, 1, 1)]
    assert pytest.approx(finc.xirr(values, dates, daycount=365)) == 0.1

    # with a pandas df
    df = pd.DataFrame()
    df['values'] = values
    df['dates'] = dates
    assert pytest.approx(finc.xirr(df['values'], df['dates'], daycount=365)) == 0.1

    # with a pandas df and datetime types
    df['dates'] = pd.to_datetime(dates)
    assert pytest.approx(finc.xirr(df['values'], df['dates'], daycount=365)) == 0.1

    # now for some unrealistic values
    df['values'] = [-100, 5000, 0]
    assert pytest.approx(finc.xirr(df['values'], df['dates'], daycount=365)) == 49
    df['values'] = [-1e3, 0, 1]
    rate = finc.xirr(df['values'], df['dates'], daycount=365)
    npv = finc.xnpv(rate, df['values'], df['dates'])
    # this is an extreme case; as long as the corresponding NPV is between these values it's not a bad result
    assertion = (npv < 0.1 and npv > -.1)
    assert assertion == True
P.S. Important difference between this xnpv and numpy.npv
This is not, strictly speaking, relevant to this answer, but useful to know for whoever runs financial calculations with numpy:
numpy.npv doesn't discount the first item of cashflow - it starts from the second, e.g.
np.npv(0.1,[110,0]) = 110
and
np.npv(0.1,[0,110]) = 100
Excel, however, discounts from the very first item:
NPV(0.1,[110,0]) = 100
Numpy's financial functions will be deprecated and replaced with those of numpy_financial, which however will likely continue to behave the same, if only for backward compatibility.
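A quick way to see this behaviour for yourself (a small sketch of mine; numpy_financial.npv keeps the old numpy.npv convention):

import numpy_financial as npf

print(npf.npv(0.1, [110, 0]))   # 110.0 - the first cashflow is not discounted
print(npf.npv(0.1, [0, 110]))   # ~100.0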
Created a python package finance-calulator which can be used for xirr calculation. Under the hood, it uses Newton's method.
I also did some time profiling, and it is a little better than the scipy-based xnpv method suggested in @KT.'s answer.
Here's the implementation.
With Pandas, I got the following to work:
(note, I'm using ACT/365 convention)
import pandas
from scipy import optimize

rate = 0.10
dates = pandas.date_range(start=pandas.Timestamp('2015-01-01'), periods=5, freq="AS")
cf = pandas.Series([-500, 200, 200, 200, 200], index=dates)

# intermediate calculations (if interested)
# cf_xnpv_days = [(cf.index[i]-cf.index[i-1]).days for i in range(1,len(cf.index))]
# cf_xnpv_days_cumulative = [(cf.index[i]-cf.index[0]).days for i in range(1,len(cf.index))]
# cf_xnpv_days_disc_factors = [(1+rate)**(float((cf.index[i]-cf.index[0]).days)/365.0)-1 for i in range(1,len(cf.index))]

cf_xnpv_days_pvs = [cf[i]/float(1+(1+rate)**(float((cf.index[i]-cf.index[0]).days)/365.0)-1) for i in range(1, len(cf.index))]
cf_xnpv = cf[0] + sum(cf_xnpv_days_pvs)
def xirr(cashflows, transactions, guess=0.1):
    # function to calculate internal rate of return.
    # cashflows: list of tuples of (date, transaction)
    # transactions: list of transactions
    try:
        return optimize.newton(lambda r: xnpv(r, cashflows), guess)
    except RuntimeError:
        positives = [x if x > 0 else 0 for x in transactions]
        negatives = [x if x < 0 else 0 for x in transactions]
        return_guess = (sum(positives) + sum(negatives)) / (-sum(negatives))
        return optimize.newton(lambda r: xnpv(r, cashflows), return_guess)
