Speeding up a pd.apply() function

I have some example data:
import numpy as np
import pandas as pd
users = 5
size = users*6
df = pd.DataFrame(
{'userid': np.random.choice(np.arange(0, users), size),
'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
}
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)
For each userid, I need to determine how many times a_time > b_time or vice versa, depending on column focus.
I have a custom function:
def some_func(x):
    if (x.focus == 'a').all():
        a = x.a_time
        b = x.b_time
        x['changes'] = (b > a).sum()
        x['days'] = len(a)
    elif (x.focus == 'b').all():
        a = x.a_time
        b = x.b_time
        x['changes'] = (a > b).sum()
        x['days'] = len(a)
    elif (x.focus == 'both').all():
        x['changes'] = 0
        x['days'] = len(x)  # was len(a), but a is not defined in this branch
    else:
        x['changes'] = None
        x['days'] = None
    return x
test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index(name = 'n_changes')
That works just fine when the number of userid values is small. However, as the number of unique userid values grows beyond 100K, this function becomes almost unbearably slow.
Is there a way to speed up this function? My guess is that there might be an alternative to the if-elif-else syntax in some_func(), but I'm not sure what that syntax might be. The number of rows for each userid is arbitrarily long.
I'm open to non-pandas options if necessary.

A now-deleted answer from another user helped me get to a reasonably-quick solution. After learning that GroupBy.apply() does operate on dataframes and that this answer suggests that apply() isn't very performant (at least compared to transform()), I thought I'd try dropping apply() entirely. I don't think this solution is very pretty but with 277161 userid values in 1.36MM rows, it runs in about 32 seconds:
# Simulate some data
users = 277161
size = 1360000
df = pd.DataFrame(
{'userid': np.random.choice(np.arange(0, users), size),
'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
}
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)
# Evaluate time
from datetime import datetime
start = datetime.now()
test_dat['aux'] = np.where(
    test_dat.focus == 'a', test_dat.a_time.lt(test_dat.b_time),
    np.where(test_dat.focus == 'b', test_dat.a_time.gt(test_dat.b_time), None))
test = test_dat.groupby(['userid', 'focus'])['aux'].agg([np.sum, np.size]).reset_index()
# the row above for some reason doesn't handle single boolean values well and just
# returns the boolean values when the number of rows in a group = 1,
# so I force those remaining boolean values into ints here
test['sum'] = test['sum'].astype(int)
test = test.rename(columns = {'sum': 'changes', 'size': 'days'})
end = datetime.now()
(end - start).seconds
I recognize that this isn't an explicit answer to my question (i.e., "speeding up an apply() function") but the reality is that this is a workable, reasonably-quick solution.
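
For reference, the same idea can be written a little more compactly with named aggregation. This is only a minimal sketch, not the code from the answer above: it assumes every row's focus is either 'a' or 'b' (the 'both' and mixed cases from some_func are not handled), and the helper name count_changes is made up for illustration.

```python
import numpy as np
import pandas as pd


def count_changes(df):
    # Hypothetical helper: one row per (userid, focus) with the number of
    # "changes" and the number of rows ("days") in that group.
    # focus == 'a' counts rows where b_time > a_time,
    # any other focus counts rows where a_time > b_time.
    changed = np.where(df["focus"] == "a",
                       df["b_time"] > df["a_time"],
                       df["a_time"] > df["b_time"])
    return (df.assign(changed=changed.astype(int))
              .groupby(["userid", "focus"])["changed"]
              .agg(changes="sum", days="size")
              .reset_index())


demo = pd.DataFrame({
    "userid": [0, 0, 1, 1, 1],
    "focus":  ["a", "a", "b", "b", "b"],
    "a_time": [1.0, 2.0, 3.0, 1.0, 2.0],
    "b_time": [2.0, 1.0, 1.0, 2.0, 1.0],
})
result = count_changes(demo)
print(result)
```

Casting the boolean flags to int before grouping sidesteps the single-row boolean issue mentioned in the comments above, since the sum is then always an integer.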

Related

python: processing data so that only constant values remain

I have data from a measurement and I want to process it so that only the constant values remain. The measured signal consists of parts where the value stays constant for some time; then I make a change to the system that causes the value to increase. It takes time for the system to reach the new constant value after each adjustment.
I wrote a program that compares every value with the 10 previous values. If it is equal to them within a tolerance, it gets saved.
The code works, but I feel like this can be done more cleanly and efficiently so that it is suitable for processing larger amounts of data. I don't know how to make the code in the for-loop more efficient. Do you have any suggestions?
Thank you in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('radiale Steifigkeit_22_04_2022_raw.csv',
sep= ";",
decimal = ',',
skipinitialspace=True,
comment = '\t')
#df = df.drop(df.columns[[0,4]], axis=1)
#print(df.head())
#print(df.dtypes)
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Kraft')
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Weg')
#plt.show()
s = pd.Series(df['Weg'], name = 'Weg')
f = pd.Series(df['Kraft'], name= 'Kraft')
t = pd.Series(df['Time_SYS 01-cDAQ:1_A-In-All_Rec_rel'], name= 'Zeit')
#s_const = pd.Series()
s_const = []
f_const = []
t_const = []
s = np.abs(s)
#plt.plot(s)
#plt.show()
c = 0
#this for-loop compares the value s[i] with the previous 10 measurements.
#If it is equal within a tolerance it is saved into s_const.
for i in range(len(s)):
#for i in range(0,2000):
    if i > 10:
        si = round(s[i],3)
        s1i = round(s[i-1],3)
        s2i = round(s[i-2],3)
        s3i = round(s[i-3],3)
        s4i = round(s[i-4],3)
        s5i = round(s[i-5],3)
        s6i = round(s[i-6],3)
        s7i = round(s[i-7],3)
        s8i = round(s[i-8],3)
        s9i = round(s[i-9],3)
        s10i = round(s[i-10],3)
        if si == s1i == s2i == s3i == s4i == s5i== s6i == s7i== s8i == s9i == s10i:
            c = c+1
            s_const.append(s[i])
            f_const.append(f[i])

Here is a very performant implementation using itertools (based on Check if all elements in a list are identical):
from itertools import groupby
def all_equal(iterable):
    g = groupby(iterable)
    return next(g, True) and not next(g, False)
data = [1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
window = 3
stable = [i for i in range(len(data) - window + 1) if all_equal(data[i:i+window])]
print(stable) # -> [1, 2, 7, 8, 9, 10, 13]
The algorithm produces a list of indices in your data where a stable period of length window starts.
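
If the data already lives in a pandas Series, a rolling window is another option. The sketch below is an alternative under stated assumptions rather than part of the answer above: it marks a sample as stable when the spread (max minus min) of the last `window` samples is within `tol`, which is close to, but not identical to, comparing rounded values. The name stable_mask and both parameter defaults are illustrative.

```python
import pandas as pd


def stable_mask(s, window=11, tol=1e-3):
    # Hypothetical helper: True where the current sample and the previous
    # window - 1 samples all lie within tol of each other.
    rolled = s.rolling(window)
    spread = rolled.max() - rolled.min()
    # Positions without a full window have a NaN spread, and NaN <= tol
    # evaluates to False, so they are excluded automatically.
    return spread <= tol


signal = pd.Series([0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.6, 2.0, 2.0, 2.0])
mask = stable_mask(signal, window=3, tol=1e-3)
stable_positions = signal[mask].index.tolist()
print(stable_positions)
```

The resulting boolean mask can then be used to filter several columns of the same DataFrame at once, instead of appending to separate lists inside a loop.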

How to replace a pandas column row with the previous row if a condition is met

I'm trying to speed up my trading strategy backtesting.
Right now, I have
for i in trange(1, len(real_choice), disable=not backtesting, desc="Converting HOLDs and calculating backtest correct/incorrect... [3/3]"):
    if (advice[i] == "HOLD"):
        advice[i] = advice[i-1]
    if (real_choice[i] == "HOLD"):
        real_choice[i] = real_choice[i-1]
    if advice[i] == real_choice[i]:
        correct[i] = "CORRECT"
    else:
        correct[i] = "INCORRECT"
This part of the code takes the longest, so I want to speed it up.
I'm learning Python so this was simple and worked but now I'm paying for it with how long the backtests take.
Is there a way to do this faster?

You can use np.where to compare two columns and assign a value to those rows:
correct = np.where(advice == real_choice, "CORRECT", "INCORRECT")
To make it look more pandas-like, it would be:
df['correct'] = np.where(df['advice'] == df['real_choice'], "CORRECT", "INCORRECT")
With some time comparisons (full code):
import time
import numpy as np
import pandas as pd
from numpy.random import randint
A = randint(0, 10, 10000)
B = randint(0, 10, 10000)
df = pd.DataFrame({'A': A, 'B':B, 'C': "INCORRECT"})
print(df)
# method 1: explicit Python loop over the rows
start = time.process_time()
for i in range(0, len(df)):
    if df['A'][i] == df['B'][i]:
        df['C'][i] = "CORRECT"
    else:
        df['C'][i] = "INCORRECT"
print("method 1", time.process_time() - start)
# method 2: vectorized comparison with np.where
start = time.process_time()
df['C2'] = np.where( df['A'] == df['B'], "CORRECT", "INCORRECT")
print("method 2", time.process_time() - start)
Method 2 took far less time to compute:
method 1 1.0530679999999997
method 2 0.0022619999999999862
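
The comparison above does not cover the first half of the loop, where each "HOLD" is replaced by the previous value. One way to vectorize that part, sketched here under the assumption that advice and real_choice are columns of a DataFrame, is to blank out the "HOLD" entries and forward-fill. The toy data is made up. Unlike the loop, which starts at index 1, this sketch also evaluates row 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "advice":      ["BUY", "HOLD", "HOLD", "SELL", "HOLD"],
    "real_choice": ["BUY", "SELL", "HOLD", "HOLD", "BUY"],
})

for col in ["advice", "real_choice"]:
    # Blank out the HOLD entries, then carry the last non-HOLD value forward.
    filled = df[col].mask(df[col] == "HOLD").ffill()
    # A leading HOLD has no predecessor, so it stays HOLD (as in the loop).
    df[col] = filled.fillna("HOLD")

df["correct"] = np.where(df["advice"] == df["real_choice"], "CORRECT", "INCORRECT")
print(df)
```

Forward-filling matches the sequential loop because a run of consecutive "HOLD" values all end up with the last non-"HOLD" value before the run.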

fast way to iterate through list, find duplicates and perform calculations

I have two lists, one of areas and one of prices which are the same size.
For example:
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
I need to return a list of prices for each unique area not including the original area.
So for 1500, the lists that I need returned are: [750] and [200]
For the 2000, the lists that I need returned are [600,1000], [800,1000] and [800,600]
For the 1800 and 500, the lists I need returned are both empty lists [].
The goal is then to flag a price as an outlier when the absolute value of (price - mean) exceeds 5 * the population standard deviation, where both the mean and the standard deviation are calculated excluding the price itself.
import statistics
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
outlier_idx = []
for idx, val in enumerate(area):
    comp_idx = [i for i, x in enumerate(area) if x == val]
    comp_idx.remove(idx)
    comp_price = [price[i] for i in comp_idx]
    if len(comp_price)>2:
        sigma = statistics.stdev(comp_price)
        p_m = statistics.mean(comp_price)
        if abs(price[idx]-p_m) > 5 * sigma:
            outlier_idx.append(idx)
area = [i for j, i in enumerate(area) if j not in outlier_idx]
price = [i for j, i in enumerate(price) if j not in outlier_idx]
The problem is that this calculation takes up a lot of time and I am dealing with arrays that can be quite large.
I am stuck as to how I can increase the computational efficiency.
I am open to using numpy, pandas or any other common packages.
Additionally, I have tried the problem in pandas:
df['p-p_m'] = ''
df['sigma'] = ''
df['outlier'] = False
for name, group in df.groupby('area'):
    if len(group)>1:
        idx = list(group.index)
        for i in range(len(idx)):
            tmp_idx = idx.copy()
            tmp_idx.pop(i)
            df['p-p_m'][idx[i]] = abs(group.price[idx[i]] - group.price[tmp_idx].mean())
            df['sigma'][idx[i]] = group.price[tmp_idx].std(ddof=0)
            if df['p-p_m'][idx[i]] > 3*df['sigma'][idx[i]]:
                df['outlier'][idx[i]] = True
Thanks.

This code creates the list of prices for each area:
df = pd.DataFrame({'area': area, 'price': price})
price_to_delete = [item for idx_array in df.groupby('price').groups.values() for item in idx_array[1:]]
df.loc[price_to_delete, 'price'] = None
df = df.groupby('area').agg(lambda x: [] if all(x.isnull()) else x.tolist())
df
I'm not sure I understand exactly what you want, but this part calculates the outliers for each price in each area:
df['outlier'] = False
df['outlier'] = df['price'].map(lambda x: abs(np.array(x) - np.mean(x)) > 3*np.std(x) if len(x) > 0 else [])
df
I hope this helps in some way!

Here is a solution that combines Numpy and Numba. Although correct, I did not test it against alternative approaches regarding efficiency, but Numba usually results in significant speedups for tasks that require looping through data. I have added one extra point which is an outlier, according to your definition.
import numpy as np
from numba import jit
# data input
price = np.array([200,800,600,800,1000,750,200, 2000])
area = np.array([1500,2000,2000,1800,2000,1500,500, 1500])
@jit(nopython=True)
def outliers(price, area):
    is_outlier = np.full(len(price), False)
    for this_area in set(area):
        indexes = area == this_area
        these_prices = price[indexes]
        for this_price in set(these_prices):
            arr2 = these_prices[these_prices != this_price]
            if arr2.size > 1:
                std = arr2.std()
                mean = arr2.mean()
                indices = (this_price == price) & (this_area == area)
                is_outlier[indices] = np.abs(mean - this_price) > 5 * std
    return is_outlier
> outliers(price, area)
> array([False, False, False, False, False, False, False, True])
The code should be fast in case you have several identical price levels for each area, as they will be updated all at once.
I hope this helps.
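
A loop-free variant is also possible, because the leave-one-out mean and population variance can be derived from per-group sums: with n rows in a group, sum S and sum of squares Q, removing a price x leaves mean (S - x)/(n - 1) and variance (Q - x²)/(n - 1) minus the squared mean. The sketch below is an illustration under assumptions, not a tested replacement for the answers above: it excludes only the row itself (as in the question's first snippet), uses the population standard deviation, and requires at least min_others comparison prices. The name loo_outliers and its parameters are made up.

```python
import numpy as np
import pandas as pd


def loo_outliers(area, price, k=5.0, min_others=3):
    # Hypothetical helper: flag price[i] when it lies more than k population
    # standard deviations from the mean of the OTHER prices in its area.
    # Rows with fewer than min_others comparison prices are never flagged.
    df = pd.DataFrame({"area": area, "price": np.asarray(price, dtype=float)})
    grouped = df.groupby("area")["price"]
    n = grouped.transform("size")
    total = grouped.transform("sum")
    total_sq = (df["price"] ** 2).groupby(df["area"]).transform("sum")

    n_others = n - 1
    safe_n = n_others.where(n_others > 0)  # NaN avoids division by zero
    mean_others = (total - df["price"]) / safe_n
    var_others = (total_sq - df["price"] ** 2) / safe_n - mean_others ** 2
    std_others = np.sqrt(var_others.clip(lower=0))  # guard against rounding

    is_outlier = (df["price"] - mean_others).abs() > k * std_others
    return (is_outlier & (n_others >= min_others)).to_numpy()


area = [1500, 1500, 1500, 1500, 1500, 500]
price = [200, 210, 190, 205, 2000, 100]
flags = loo_outliers(area, price)
print(flags)
```

Only the price of 2000 is flagged here. The other prices in area 1500 are compared against a set that still contains 2000, which inflates their standard deviation, and the single row in area 500 has nothing to compare against.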

How can combine 3 matrices into 1 matrice with reversible-approach?

I want to reshape my 24x20 matrices 'A', 'B' and 'C', which are extracted from a text file and saved before and after normalization by normalize() inside a for-loop over cycles, so that each cycle becomes one row containing all elements of the 3 matrices side by side, like below:
[[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)] #cycle1
[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)] #cycle2
[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)]] #cycle3
So far, based on @odyse's suggestion, I have used the following snippet at the end of the for-loop:
for cycle in range(cycles):
    dff = pd.DataFrame({'A_norm':A_norm[cycle] , 'B_norm': B_norm[cycle] , 'C_norm': C_norm[cycle] } , index=[0])
    D = dff.as_matrix().ravel()
    if cycle == 0:
        Results = np.array(D)
    else:
        Results = np.vstack((Results, D))
np.savetxt("Results.csv", Results, delimiter=",")
But there is a problem when I use it after normalize() in the for-loop. Besides raising a ValueError, it also emits the warning "FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead" for D = dff.as_matrix().ravel(), which is not important right now since it is only a FutureWarning. I checked the shape of the output for 3 cycles with print(data1.shape) and it was correct: (3, 1440), i.e. 3 rows for the 3 cycles and 3 * 480 = 1440 columns. All in all, though, it wasn't a stable solution.
the complete scripts are following:
import numpy as np
import pandas as pd
import os
def normalize(value, min_value, max_value, min_norm, max_norm):
new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
return new_value
#the size of matrices are (24,20)
df1 = np.zeros((24,20))
df2 = np.zeros((24,20))
df3 = np.zeros((24,20))
#next iteration create all plots, change the number of cycles
cycles = int(len(df)/480)
print(cycles)
for cycle in range(3):
    count = '{:04}'.format(cycle)
    j = cycle * 480
    new_value1 = df['A'].iloc[j:j+480]
    new_value2 = df['B'].iloc[j:j+480]
    new_value3 = df['C'].iloc[j:j+480]
    df1 = print_df(mkdf(new_value1))
    df2 = print_df(mkdf(new_value2))
    df3 = print_df(mkdf(new_value3))
    for i in df:
        try:
            os.mkdir(i)
        except:
            pass
        min_val = df[i].min()
        min_nor = -1
        max_val = df[i].max()
        max_nor = 1
        ordered_data = mkdf(df.iloc[j:j+480][i])
        csv = print_df(ordered_data)
        #Print .csv files contains matrix of each parameters by name of cycles respectively
        csv.to_csv(f'{i}/{i}{count}.csv', header=None, index=None)
        if 'C' in i:
            min_nor = -40
            max_nor = 150
            #Applying normalization for C between [-40,+150]
            new_value3 = normalize(df['C'].iloc[j:j+480], min_val, max_val, -40, 150)
            C_norm = print_df(mkdf(new_value3))
            C_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
        else:
            #Applying normalization for A,B between [-1,+1]
            new_value1 = normalize(df['A'].iloc[j:j+480], min_val, max_val, -1, 1)
            new_value2 = normalize(df['B'].iloc[j:j+480], min_val, max_val, -1, 1)
            A_norm = print_df(mkdf(new_value1))
            B_norm = print_df(mkdf(new_value2))
            A_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
            B_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
    dff = pd.DataFrame({'A_norm':A_norm[cycle] , 'B_norm': B_norm[cycle] , 'C_norm': C_norm[cycle] } , index=[0])
    D = dff.as_matrix().ravel()
    if cycle == 0:
        Results = np.array(D)
    else:
        Results = np.vstack((Results, D))
np.savetxt("Results.csv", Results , delimiter=',', encoding='utf-8')
#Check output shape whether is (3, 1440) or not
data1 = np.loadtxt('Results.csv', delimiter=',')
print(data1.shape)
Note1: my data in the text file looks like the following:
id_set: 000
A: -2.46882615679
B: -2.26408246559
C: -325.004619528
Note2: I provided a dataset in text file for 3 cycles:
Text dataset
Note3: to map the A, B, C parameters into matrices in the right order I used the print_df() and mkdf() functions. I left them out to reduce the post to the core problem and keep the example at the start of this post minimal. Let me know if you need them.
The expected result should come from completing the for-loop over 'A_norm', 'B_norm' and 'C_norm', which are the normalized versions of 'A', 'B' and 'C' respectively. The output, let's call it "Results.csv", should be reversible, so that the 'A', 'B' and 'C' matrices can be regenerated per cycle and saved to csv files again for checking. If you have any ideas about the reverse part, please mention them separately; otherwise just check the output with print(data.shape), which should be (3, 1440).
Have a nice day and thanks in advance!
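
One possible way to get the requested layout, and to reverse it, is to stack the three matrices along a new last axis and flatten. This is a minimal sketch that assumes A, B and C are available as NumPy arrays of shape (24, 20) for each cycle (a DataFrame would need .to_numpy() first). The names pack and unpack and the random data are illustrative only.

```python
import numpy as np


def pack(a, b, c):
    # Interleave three equally shaped matrices element by element:
    # A(1,1), B(1,1), C(1,1), A(1,2), B(1,2), C(1,2), ...
    return np.stack([a, b, c], axis=-1).ravel()


def unpack(row, shape=(24, 20)):
    # Inverse of pack: recover the three matrices from one flat row.
    cube = np.asarray(row).reshape(*shape, 3)
    return cube[..., 0], cube[..., 1], cube[..., 2]


rng = np.random.default_rng(0)
# Three cycles, each with made-up A, B and C matrices.
cycles = [tuple(rng.normal(size=(24, 20)) for _ in range(3)) for _ in range(3)]

results = np.vstack([pack(a, b, c) for a, b, c in cycles])
print(results.shape)

a0, b0, c0 = unpack(results[0])
```

Because the reshape in unpack mirrors the stacking in pack exactly, each row of results can be turned back into its three 24x20 matrices without any bookkeeping beyond the matrix shape.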

financial python library that has xirr and xnpv function?

numpy has irr and npv functions, but I need xirr and xnpv functions.
This link says that xirr and xnpv will be coming soon:
http://www.projectdirigible.com/documentation/spreadsheet-functions.html#coming-soon
Is there any Python library that has those two functions? Thanks.

Here is one way to implement the two functions.
import scipy.optimize
def xnpv(rate, values, dates):
    '''Equivalent of Excel's XNPV function.
    >>> from datetime import date
    >>> dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    >>> values = [-10000, 20, 10100]
    >>> xnpv(0.1, values, dates)
    -966.4345...
    '''
    if rate <= -1.0:
        return float('inf')
    d0 = dates[0]    # or min(dates)
    return sum([ vi / (1.0 + rate)**((di - d0).days / 365.0) for vi, di in zip(values, dates)])
def xirr(values, dates):
    '''Equivalent of Excel's XIRR function.
    >>> from datetime import date
    >>> dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    >>> values = [-10000, 20, 10100]
    >>> xirr(values, dates)
    0.0100612...
    '''
    try:
        return scipy.optimize.newton(lambda r: xnpv(r, values, dates), 0.0)
    except RuntimeError:    # Failed to converge?
        return scipy.optimize.brentq(lambda r: xnpv(r, values, dates), -1.0, 1e10)

With the help of various implementations I found in the net, I came up with a python implementation:
def xirr(transactions):
    years = [(ta[0] - transactions[0][0]).days / 365.0 for ta in transactions]
    residual = 1
    step = 0.05
    guess = 0.05
    epsilon = 0.0001
    limit = 10000
    while abs(residual) > epsilon and limit > 0:
        limit -= 1
        residual = 0.0
        for i, ta in enumerate(transactions):
            residual += ta[1] / pow(guess, years[i])
        if abs(residual) > epsilon:
            if residual > 0:
                guess += step
            else:
                guess -= step
                step /= 2.0
    return guess-1
from datetime import date
tas = [ (date(2010, 12, 29), -10000),
    (date(2012, 1, 25), 20),
    (date(2012, 3, 8), 10100)]
print(xirr(tas)) #0.0100612640381

Created a package for fast XIRR calculation, PyXIRR
It doesn't have external dependencies and works faster than any existing implementation.
from datetime import date
from pyxirr import xirr
dates = [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)]
amounts = [-1000, 1000, 1000]
# feed columnar data
xirr(dates, amounts)
# feed tuples
xirr(zip(dates, amounts))
# feed DataFrame
import pandas as pd
xirr(pd.DataFrame({"dates": dates, "amounts": amounts}))

This answer is an improvement on @uuazed's answer and derives from it. However, there are a few changes:
It uses a pandas dataframe instead of a list of tuples
It is cashflow direction agnostic, i.e., whether you treat inflows as negative and outflows as positive or vice versa, the result will be the same, as long as the treatment is consistent for all transactions.
XIRR calculation with this method doesn't work if cashflows are not ordered by date. Hence I have handled sorting of the dataframe internally.
In the earlier answer, there was an implicit assumption that XIRR will mostly be positive, which created the problem pointed out in the other comment: XIRR between -100% and -95% cannot be calculated. This solution does away with that problem.
import pandas as pd
import numpy as np
def xirr(df, guess=0.05, date_column = 'date', amount_column = 'amount'):
    '''Calculates XIRR from a series of cashflows.
    Needs a dataframe with columns date and amount, customisable through parameters.
    Requires Pandas, NumPy libraries'''
    df = df.sort_values(by=date_column).reset_index(drop=True)
    df['years'] = df[date_column].apply(lambda x: (x-df[date_column][0]).days/365)
    step = 0.05
    epsilon = 0.0001
    limit = 1000
    residual = 1
    #Test for direction of cashflows
    disc_val_1 = df[[amount_column, 'years']].apply(
        lambda x: x[amount_column]/((1+guess)**x['years']), axis=1).sum()
    disc_val_2 = df[[amount_column, 'years']].apply(
        lambda x: x[amount_column]/((1.05+guess)**x['years']), axis=1).sum()
    mul = 1 if disc_val_2 < disc_val_1 else -1
    #Calculate XIRR
    for i in range(limit):
        prev_residual = residual
        df['disc_val'] = df[[amount_column, 'years']].apply(
            lambda x: x[amount_column]/((1+guess)**x['years']), axis=1)
        residual = df['disc_val'].sum()
        if abs(residual) > epsilon:
            if np.sign(residual) != np.sign(prev_residual):
                step /= 2
            guess = guess + step * np.sign(residual) * mul
        else:
            return guess
Explanation:
In the test block, it checks whether increasing the discounting rate increases the discounted value or reduces it. Based on this test, it is determined which direction the guess should move. This block makes the function handle cashflows regardless of direction assumed by the user.
The np.sign(residual) != np.sign(prev_residual) checks when the guess has increased/decreased beyond the required XIRR rate, because that's when the residual goes from negative to positive or vice versa. The step size is reduced at this point.
The numpy package is not absolutely necessary. Without numpy, np.sign(residual) can be replaced with residual/abs(residual). I have used numpy to make the code more readable and intuitive.
I have tried to test this code with a variety of cash flows. If you find any cases which are not handled by this function, do let me know.
Edit: Here's a cleaner and faster version of the code using numpy arrays. In my test with about 700 transaction, this code ran 5 times faster than the one above:
def xirr(df, guess=0.05, date_column='date', amount_column='amount'):
    '''Calculates XIRR from a series of cashflows.
    Needs a dataframe with columns date and amount, customisable through parameters.
    Requires Pandas, NumPy libraries'''
    df = df.sort_values(by=date_column).reset_index(drop=True)
    amounts = df[amount_column].values
    dates = df[date_column].values
    years = np.array(dates-dates[0], dtype='timedelta64[D]').astype(int)/365
    step = 0.05
    epsilon = 0.0001
    limit = 1000
    residual = 1
    #Test for direction of cashflows
    disc_val_1 = np.sum(amounts/((1+guess)**years))
    disc_val_2 = np.sum(amounts/((1.05+guess)**years))
    mul = 1 if disc_val_2 < disc_val_1 else -1
    #Calculate XIRR
    for i in range(limit):
        prev_residual = residual
        residual = np.sum(amounts/((1+guess)**years))
        if abs(residual) > epsilon:
            if np.sign(residual) != np.sign(prev_residual):
                step /= 2
            guess = guess + step * np.sign(residual) * mul
        else:
            return guess

I started from @KT's solution but improved on it in a few ways:
as pointed out by others, there is no need for xnpv to return inf if the discount rate <= -100%
if the cashflows are all positive or all negative, we can return a nan straight away: no point in letting the algorithm search forever for a solution which doesn't exist
I have made the daycount convention an input; sometimes it is 365, some other times it is 360 - it depends on the case. I have not modelled 30/360. More details on Matlab's docs
I have added optional inputs for the maximum number of iterations and for the starting point of the algorithm
I have not changed the default tolerance of the algorithms but that's very easy to change
Key findings for the specific example below (results may well be different for other cases, I have not had the time to test many other cases):
starting from a value = -sum(all cashflows) / sum(negative cashflows) slows the algorithms a little bit (by 7-10%)
scipy's newton is faster than scipy's fsolve
Execution time with newton vs fsolve:
import numpy as np
import pandas as pd
import scipy
import scipy.optimize
from datetime import date
import timeit
def xnpv(rate, values, dates , daycount = 365):
    daycount = float(daycount)
    # Why would you want to return inf if the rate <= -100%? I removed it, I don't see how it makes sense
    # if rate <= -1.0:
    #     return float('inf')
    d0 = dates[0] # or min(dates)
    # NB: this xnpv implementation discounts the first value LIKE EXCEL
    # numpy's npv does NOT, it only starts discounting from the 2nd
    return sum([ vi / (1.0 + rate)**((di - d0).days / daycount) for vi, di in zip(values, dates)])
def find_guess(cf):
    whereneg = np.where(cf < 0)
    sumneg = np.sum( cf[whereneg] )
    return -np.sum(cf) / sumneg
def xirr_fsolve(values, dates, daycount = 365, guess = 0, maxiters = 1000):
    cf = np.array(values)
    # the original test used "== 0 | ... == 0", which Python parses as a chained comparison
    if not (cf < 0).any() or not (cf > 0).any():
        #if the cashflows are all positive or all negative, no point letting the algorithm
        #search forever for a solution which doesn't exist
        return np.nan
    result = scipy.optimize.fsolve(lambda r: xnpv(r, values, dates, daycount), x0 = guess , maxfev = maxiters, full_output = True )
    if result[2]==1: #ie if the solution converged; if it didn't, result[0] will be the last iteration, which won't be a solution
        return result[0][0]
    else:
        #consider raising a warning
        return np.nan
def xirr_newton(values, dates, daycount = 365, guess = 0, maxiters = 1000, a = -100, b =1e5):
    # a and b: lower and upper bound for the brentq algorithm
    cf = np.array(values)
    # the original test used "== 0 | ... == 0", which Python parses as a chained comparison
    if not (cf < 0).any() or not (cf > 0).any():
        #if the cashflows are all positive or all negative, no point letting the algorithm
        #search forever for a solution which doesn't exist
        return np.nan
    res_newton = scipy.optimize.newton(lambda r: xnpv(r, values, dates, daycount), x0 = guess, maxiter = maxiters, full_output = True)
    if res_newton[1].converged == True:
        out = res_newton[0]
    else:
        res_b = scipy.optimize.brentq(lambda r: xnpv(r, values, dates, daycount), a = a , b = b, maxiter = maxiters, full_output = True)
        if res_b[1].converged == True:
            out = res_b[0]
        else:
            out = np.nan
    return out
# let's compare how long each takes
d0 = pd.to_datetime(date(2010,1,1))
# an investment in which we pay 100 in the first month, then get 2 each month for the next 59 months
df = pd.DataFrame()
df['month'] = np.arange(0,60)
df['dates'] = df.apply( lambda x: d0 + pd.DateOffset(months = x['month']) , axis = 1 )
df['cf'] = 0
df.iloc[0,2] = -100
df.iloc[1:,2] = 2
r = 100
n = 5
t_newton_no_guess = timeit.Timer ("xirr_newton(df['cf'], df['dates'], guess = find_guess(df['cf'].to_numpy() ) ) ", globals = globals() ).repeat(repeat = r, number = n)
t_fsolve_no_guess = timeit.Timer ("xirr_fsolve(df['cf'], df['dates'], guess = find_guess(df['cf'].to_numpy() ) )", globals = globals() ).repeat(repeat = r, number = n)
t_newton_guess_0 = timeit.Timer ("xirr_newton(df['cf'], df['dates'] , guess =0.) ", globals = globals() ).repeat(repeat = r, number = n)
t_fsolve_guess_0 = timeit.Timer ("xirr_fsolve(df['cf'], df['dates'], guess =0.) ", globals = globals() ).repeat(repeat = r, number = n)
resdf = pd.DataFrame(index = ['min time'])
resdf['newton no guess'] = [min(t_newton_no_guess)]
resdf['fsolve no guess'] = [min(t_fsolve_no_guess)]
resdf['newton guess 0'] = [min(t_newton_guess_0)]
resdf['fsolve guess 0'] = [min(t_fsolve_guess_0)]
# the docs explain why we should take the min and not the avg
resdf = resdf.transpose()
resdf['% diff vs fastest'] = (resdf / resdf.min() -1) * 100
Conclusions
I noticed there were some cases in which newton and brentq didn't converge, but fsolve did, so I modified the function so that, in order, it starts with newton, then brentq, then, lastly, fsolve.
I haven't actually found a case in which brentq was used to find a solution. I'd be curious to understand when it would work, otherwise it's probably best to just remove it.
I went back to try/except because I noticed the code above wasn't identifying all the cases of non-convergence. That's something I'd like to look into when I have a bit more time
This is my final code:
def xirr(values, dates, daycount = 365, guess = 0, maxiters = 10000, a = -100, b =1e10):
    # a and b: lower and upper bound for the brentq algorithm
    cf = np.array(values)
    # the original test used "== 0 | ... == 0", which Python parses as a chained comparison
    if not (cf < 0).any() or not (cf > 0).any():
        #if the cashflows are all positive or all negative, no point letting the algorithm
        #search forever for a solution which doesn't exist
        return np.nan
    try:
        output = scipy.optimize.newton(lambda r: xnpv(r, values, dates, daycount),
                                       x0 = guess, maxiter = maxiters, full_output = True, disp = True)[0]
    except RuntimeError:
        try:
            output = scipy.optimize.brentq(lambda r: xnpv(r, values, dates, daycount),
                                           a = a , b = b, maxiter = maxiters, full_output = True, disp = True)[0]
        except:
            result = scipy.optimize.fsolve(lambda r: xnpv(r, values, dates, daycount),
                                           x0 = guess , maxfev = maxiters, full_output = True )
            if result[2]==1: #ie if the solution converged; if it didn't, result[0] will be the last iteration, which won't be a solution
                output = result[0][0]
            else:
                output = np.nan
    return output
Tests
These are some tests I have put together with pytest
import pytest
import numpy as np
import pandas as pd
import whatever_the_file_name_was as finc
from datetime import date
def test_xirr():
    dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    values = [-10000, 20, 10100]
    assert pytest.approx( finc.xirr(values, dates) ) == 1.006127e-2
    dates = [date(2010, 1,1,), date(2010,12,27)]
    values = [-100,110]
    assert pytest.approx( finc.xirr(values, dates, daycount = 360) ) == 0.1
    values = [100,-110]
    assert pytest.approx( finc.xirr(values, dates, daycount = 360) ) == 0.1
    values = [-100,90]
    assert pytest.approx( finc.xirr(values, dates, daycount = 360) ) == -0.1
    # test numpy arrays
    values = np.array([-100,0,121])
    dates = [date(2010, 1,1,), date(2011,1,1), date(2012,1,1)]
    assert pytest.approx( finc.xirr(values, dates, daycount = 365) ) == 0.1
    # with a pandas df
    df = pd.DataFrame()
    df['values'] = values
    df['dates'] = dates
    assert pytest.approx( finc.xirr(df['values'], df['dates'], daycount = 365) ) == 0.1
    # with a pandas df and datetypes
    df['dates'] = pd.to_datetime(dates)
    assert pytest.approx( finc.xirr(df['values'], df['dates'], daycount = 365) ) == 0.1
    # now for some unrealistic values
    df['values'] =[-100,5000,0]
    assert pytest.approx( finc.xirr(df['values'], df['dates'], daycount = 365) ) == 49
    df['values'] =[-1e3,0,1]
    rate = finc.xirr(df['values'], df['dates'], daycount = 365)
    npv = finc.xnpv(rate, df['values'], df['dates'])
    # this is an extreme case; as long as the corresponding NPV is between these values it's not a bad result
    assertion = ( npv < 0.1 and npv > -.1)
    assert assertion == True
P.S. Important difference between this xnpv and numpy.npv
This is not, strictly speaking, relevant to this answer, but useful to know for whoever runs financial calculations with numpy:
numpy.npv doesn't discount the first item of cashflow - it starts from the second, e.g.
np.npv(0.1,[110,0]) = 110
and
np.npv(0.1,[0,110]) = 100
Excel, however, discounts from the very first item:
NPV(0.1,[110,0]) = 100
Numpy's financial functions will be deprecated and replaced with those of numpy_financial, which however will likely continue to behave the same, if only for backward compatibility.
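
The difference between the two conventions comes down to which period the first cashflow is assigned to. The plain-Python sketch below only illustrates the arithmetic; it does not call numpy or Excel, and the function name npv_from_period and its first_period parameter are made up for this example.

```python
def npv_from_period(rate, cashflows, first_period):
    # Discount cashflows[i] by (1 + rate) ** (first_period + i).
    return sum(cf / (1.0 + rate) ** (first_period + i)
               for i, cf in enumerate(cashflows))


# numpy-style convention: the first cashflow sits at t = 0 (undiscounted)
numpy_style = npv_from_period(0.1, [110, 0], first_period=0)
# Excel-style convention: the first cashflow sits at t = 1
excel_style = npv_from_period(0.1, [110, 0], first_period=1)
print(round(numpy_style, 6), round(excel_style, 6))
```

With a rate of 10%, the first convention leaves the 110 untouched while the second discounts it by one period to roughly 100, matching the figures quoted above.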

I created a Python package, finance-calulator, which can be used for XIRR calculation. Under the hood, it uses Newton's method.
I also did some time profiling, and it is a little faster than the scipy-based xnpv method suggested in @KT.'s answer.
Here's the implementation.

With Pandas, I got the following to work:
(note, I'm using ACT/365 convention)
import pandas
rate = 0.10
dates= pandas.date_range(start=pandas.Timestamp('2015-01-01'),periods=5, freq="AS")
cf = pandas.Series([-500,200,200,200,200],index=dates)  # was named cfs, but the code below uses cf
# intermediate calculations( if interested)
# cf_xnpv_days = [(cf.index[i]-cf.index[i-1]).days for i in range(1,len(cf.index))]
# cf_xnpv_days_cumulative = [(cf.index[i]-cf.index[0]).days for i in range(1,len(cf.index))]
# cf_xnpv_days_disc_factors = [(1+rate)**(float((cf.index[i]-cf.index[0]).days)/365.0)-1 for i in range(1,len(cf.index))]
# .iloc makes the positional access explicit on the date-indexed Series
cf_xnpv_days_pvs = [cf.iloc[i]/float(1+(1+rate)**(float((cf.index[i]-cf.index[0]).days)/365.0)-1) for i in range(1,len(cf.index))]
cf_xnpv = cf.iloc[0]+ sum(cf_xnpv_days_pvs)

def xirr(cashflows,transactions,guess=0.1):
    #function to calculate internal rate of return.
    #cashflow: list of tuple of date,transactions
    #transactions: list of transactions
    try:
        return optimize.newton(lambda r: xnpv(r,cashflows),guess)
    except RuntimeError:
        positives = [x if x > 0 else 0 for x in transactions]
        negatives = [x if x < 0 else 0 for x in transactions]
        return_guess = (sum(positives) + sum(negatives)) / (-sum(negatives))
        return optimize.newton(lambda r: xnpv(r,cashflows),return_guess)
