Financial python library that has xirr and xnpv function?

numpy has irr and npv functions, but I need xirr and xnpv functions.
This link points out that xirr and xnpv will be coming soon:
http://www.projectdirigible.com/documentation/spreadsheet-functions.html#coming-soon
Is there any Python library that has those two functions? Thanks.

Here is one way to implement the two functions.

import scipy.optimize

def xnpv(rate, values, dates):
    '''Equivalent of Excel's XNPV function.
    >>> from datetime import date
    >>> dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    >>> values = [-10000, 20, 10100]
    >>> xnpv(0.1, values, dates)
    -966.4345...
    '''
    if rate <= -1.0:
        return float('inf')
    d0 = dates[0]  # or min(dates)
    return sum([vi / (1.0 + rate)**((di - d0).days / 365.0) for vi, di in zip(values, dates)])

def xirr(values, dates):
    '''Equivalent of Excel's XIRR function.
    >>> from datetime import date
    >>> dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    >>> values = [-10000, 20, 10100]
    >>> xirr(values, dates)
    0.0100612...
    '''
    try:
        return scipy.optimize.newton(lambda r: xnpv(r, values, dates), 0.0)
    except RuntimeError:  # Failed to converge?
        return scipy.optimize.brentq(lambda r: xnpv(r, values, dates), -1.0, 1e10)
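Since the examples in the docstrings above are doctests, a quick way to sanity-check the implementation (assuming the two functions live in the module you run this from) is:

import doctest

# ELLIPSIS is needed because the expected outputs above are truncated with "..."
doctest.testmod(optionflags=doctest.ELLIPSIS, verbose=True)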

With the help of various implementations I found on the net, I came up with a Python implementation:

def xirr(transactions):
    years = [(ta[0] - transactions[0][0]).days / 365.0 for ta in transactions]
    residual = 1
    step = 0.05
    guess = 0.05
    epsilon = 0.0001
    limit = 10000
    while abs(residual) > epsilon and limit > 0:
        limit -= 1
        residual = 0.0
        for i, ta in enumerate(transactions):
            residual += ta[1] / pow(guess, years[i])
        if abs(residual) > epsilon:
            if residual > 0:
                guess += step
            else:
                guess -= step
                step /= 2.0
    return guess - 1

from datetime import date
tas = [(date(2010, 12, 29), -10000),
       (date(2012, 1, 25), 20),
       (date(2012, 3, 8), 10100)]
print(xirr(tas))  # 0.0100612640381

Created a package for fast XIRR calculation, PyXIRR
It doesn't have external dependencies and works faster than any existing implementation.
from datetime import date
from pyxirr import xirr
dates = [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)]
amounts = [-1000, 1000, 1000]
# feed columnar data
xirr(dates, amounts)
# feed tuples
xirr(zip(dates, amounts))
# feed DataFrame
import pandas as pd
xirr(pd.DataFrame({"dates": dates, "amounts": amounts}))

This answer is an improvement on @uuazed's answer and derives from it. However, there are a few changes:
It uses a pandas dataframe instead of a list of tuples.
It is cashflow direction agnostic, i.e., whether you treat inflows as negative and outflows as positive or vice versa, the result will be the same, as long as the treatment is consistent for all transactions.
XIRR calculation with this method doesn't work if cashflows are not ordered by date. Hence I have handled sorting of the dataframe internally.
In the earlier answer, there was an implicit assumption that XIRR will mostly be positive, which created the problem pointed out in the other comment, that XIRR between -100% and -95% cannot be calculated. This solution does away with that problem.
import pandas as pd
import numpy as np

def xirr(df, guess=0.05, date_column='date', amount_column='amount'):
    '''Calculates XIRR from a series of cashflows.
       Needs a dataframe with columns date and amount, customisable through parameters.
       Requires Pandas, NumPy libraries'''

    df = df.sort_values(by=date_column).reset_index(drop=True)
    df['years'] = df[date_column].apply(lambda x: (x - df[date_column][0]).days / 365)

    step = 0.05
    epsilon = 0.0001
    limit = 1000
    residual = 1

    # Test for direction of cashflows
    disc_val_1 = df[[amount_column, 'years']].apply(
        lambda x: x[amount_column] / ((1 + guess) ** x['years']), axis=1).sum()
    disc_val_2 = df[[amount_column, 'years']].apply(
        lambda x: x[amount_column] / ((1.05 + guess) ** x['years']), axis=1).sum()
    mul = 1 if disc_val_2 < disc_val_1 else -1

    # Calculate XIRR
    for i in range(limit):
        prev_residual = residual
        df['disc_val'] = df[[amount_column, 'years']].apply(
            lambda x: x[amount_column] / ((1 + guess) ** x['years']), axis=1)
        residual = df['disc_val'].sum()
        if abs(residual) > epsilon:
            if np.sign(residual) != np.sign(prev_residual):
                step /= 2
            guess = guess + step * np.sign(residual) * mul
        else:
            return guess
Explanation:
In the test block, it checks whether increasing the discounting rate increases the discounted value or reduces it. Based on this test, it is determined which direction the guess should move. This block makes the function handle cashflows regardless of direction assumed by the user.
The np.sign(residual) != np.sign(prev_residual) checks when the guess has increased/decreased beyond the required XIRR rate, because that's when the residual goes from negative to positive or vice versa. The step size is reduced at this point.
The numpy package is not absolutely necessary. Without numpy, np.sign(residual) can be replaced with residual/abs(residual). I have used numpy to make the code more readable and intuitive.
I have tried to test this code with a variety of cash flows. If you find any cases which are not handled by this function, do let me know.
Edit: Here's a cleaner and faster version of the code using numpy arrays. In my test with about 700 transactions, this code ran 5 times faster than the one above:
def xirr(df, guess=0.05, date_column='date', amount_column='amount'):
    '''Calculates XIRR from a series of cashflows.
       Needs a dataframe with columns date and amount, customisable through parameters.
       Requires Pandas, NumPy libraries'''

    df = df.sort_values(by=date_column).reset_index(drop=True)

    amounts = df[amount_column].values
    dates = df[date_column].values
    years = np.array(dates - dates[0], dtype='timedelta64[D]').astype(int) / 365

    step = 0.05
    epsilon = 0.0001
    limit = 1000
    residual = 1

    # Test for direction of cashflows
    disc_val_1 = np.sum(amounts / ((1 + guess) ** years))
    disc_val_2 = np.sum(amounts / ((1.05 + guess) ** years))
    mul = 1 if disc_val_2 < disc_val_1 else -1

    # Calculate XIRR
    for i in range(limit):
        prev_residual = residual
        residual = np.sum(amounts / ((1 + guess) ** years))
        if abs(residual) > epsilon:
            if np.sign(residual) != np.sign(prev_residual):
                step /= 2
            guess = guess + step * np.sign(residual) * mul
        else:
            return guess
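A quick usage sketch for either version above (the column names 'date' and 'amount' match the function defaults; the data is the same example used earlier in this thread):

import pandas as pd

cashflows = pd.DataFrame({
    'date': pd.to_datetime(['2010-12-29', '2012-01-25', '2012-03-08']),
    'amount': [-10000, 20, 10100],
})
print(xirr(cashflows))  # roughly 0.01, i.e. about 1% p.a.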

I started from @KT's solution but improved on it in a few ways:
as pointed out by others, there is no need for xnpv to return inf if the discount rate <= -100%
if the cashflows are all positive or all negative, we can return a nan straight away: no point in letting the algorithm search forever for a solution which doesn't exist
I have made the daycount convention an input; sometimes it is 365, some other times it is 360 - it depends on the case. I have not modelled 30/360. More details are in Matlab's docs
I have added optional inputs for the maximum number of iterations and for the starting point of the algorithm
I have not changed the default tolerance of the algorithms but that's very easy to change
Key findings for the specific example below (results may well be different for other cases, I have not had the time to test many other cases):
starting from a value = -sum(all cashflows) / sum(negative cashflows) slows the algorithms a little bit (by 7-10%)
scipy's newton is faster than scipy's fsolve
Execution time with newton vs fsolve:
import numpy as np
import pandas as pd
import scipy
import scipy.optimize
from datetime import date
import timeit

def xnpv(rate, values, dates, daycount=365):
    daycount = float(daycount)
    # Why would you want to return inf if the rate <= -100%? I removed it, I don't see how it makes sense
    # if rate <= -1.0:
    #     return float('inf')
    d0 = dates[0]  # or min(dates)
    # NB: this xnpv implementation discounts the first value LIKE EXCEL
    # numpy's npv does NOT, it only starts discounting from the 2nd
    return sum([vi / (1.0 + rate)**((di - d0).days / daycount) for vi, di in zip(values, dates)])

def find_guess(cf):
    whereneg = np.where(cf < 0)
    sumneg = np.sum(cf[whereneg])
    return -np.sum(cf) / sumneg

def xirr_fsolve(values, dates, daycount=365, guess=0, maxiters=1000):
    cf = np.array(values)
    if (np.where(cf < 0, 1, 0).sum() == 0) or (np.where(cf > 0, 1, 0).sum() == 0):
        # if the cashflows are all positive or all negative, no point letting the algorithm
        # search forever for a solution which doesn't exist
        return np.nan

    result = scipy.optimize.fsolve(lambda r: xnpv(r, values, dates, daycount), x0=guess, maxfev=maxiters, full_output=True)

    if result[2] == 1:  # ie if the solution converged; if it didn't, result[0] will be the last iteration, which won't be a solution
        return result[0][0]
    else:
        # consider raising a warning
        return np.nan

def xirr_newton(values, dates, daycount=365, guess=0, maxiters=1000, a=-100, b=1e5):
    # a and b: lower and upper bound for the brentq algorithm
    cf = np.array(values)
    if (np.where(cf < 0, 1, 0).sum() == 0) or (np.where(cf > 0, 1, 0).sum() == 0):
        # if the cashflows are all positive or all negative, no point letting the algorithm
        # search forever for a solution which doesn't exist
        return np.nan

    res_newton = scipy.optimize.newton(lambda r: xnpv(r, values, dates, daycount), x0=guess, maxiter=maxiters, full_output=True)

    if res_newton[1].converged == True:
        out = res_newton[0]
    else:
        res_b = scipy.optimize.brentq(lambda r: xnpv(r, values, dates, daycount), a=a, b=b, maxiter=maxiters, full_output=True)

        if res_b[1].converged == True:
            out = res_b[0]
        else:
            out = np.nan

    return out

# let's compare how long each takes
d0 = pd.to_datetime(date(2010, 1, 1))

# an investment in which we pay 100 in the first month, then get 2 each month for the next 59 months
df = pd.DataFrame()
df['month'] = np.arange(0, 60)
df['dates'] = df.apply(lambda x: d0 + pd.DateOffset(months=x['month']), axis=1)
df['cf'] = 0
df.iloc[0, 2] = -100
df.iloc[1:, 2] = 2

r = 100
n = 5

t_newton_no_guess = timeit.Timer("xirr_newton(df['cf'], df['dates'], guess = find_guess(df['cf'].to_numpy() ) )", globals=globals()).repeat(repeat=r, number=n)
t_fsolve_no_guess = timeit.Timer("xirr_fsolve(df['cf'], df['dates'], guess = find_guess(df['cf'].to_numpy() ) )", globals=globals()).repeat(repeat=r, number=n)
t_newton_guess_0 = timeit.Timer("xirr_newton(df['cf'], df['dates'], guess = 0.)", globals=globals()).repeat(repeat=r, number=n)
t_fsolve_guess_0 = timeit.Timer("xirr_fsolve(df['cf'], df['dates'], guess = 0.)", globals=globals()).repeat(repeat=r, number=n)

resdf = pd.DataFrame(index=['min time'])
resdf['newton no guess'] = [min(t_newton_no_guess)]
resdf['fsolve no guess'] = [min(t_fsolve_no_guess)]
resdf['newton guess 0'] = [min(t_newton_guess_0)]
resdf['fsolve guess 0'] = [min(t_fsolve_guess_0)]
# the docs explain why we should take the min and not the avg
resdf = resdf.transpose()
resdf['% diff vs fastest'] = (resdf / resdf.min() - 1) * 100
Conclusions
I noticed there were some cases in which newton and brentq didn't converge, but fsolve did, so I modified the function so that, in order, it starts with newton, then brentq, then, lastly, fsolve.
I haven't actually found a case in which brentq was used to find a solution. I'd be curious to understand when it would work, otherwise it's probably best to just remove it.
I went back to try/except because I noticed the code above wasn't identifying all the cases of non-convergence. That's something I'd like to look into when I have a bit more time
This is my final code:
def xirr(values, dates, daycount=365, guess=0, maxiters=10000, a=-100, b=1e10):
    # a and b: lower and upper bound for the brentq algorithm
    cf = np.array(values)
    if (np.where(cf < 0, 1, 0).sum() == 0) or (np.where(cf > 0, 1, 0).sum() == 0):
        # if the cashflows are all positive or all negative, no point letting the algorithm
        # search forever for a solution which doesn't exist
        return np.nan

    try:
        output = scipy.optimize.newton(lambda r: xnpv(r, values, dates, daycount),
                                       x0=guess, maxiter=maxiters, full_output=True, disp=True)[0]
    except RuntimeError:
        try:
            output = scipy.optimize.brentq(lambda r: xnpv(r, values, dates, daycount),
                                           a=a, b=b, maxiter=maxiters, full_output=True, disp=True)[0]
        except:
            result = scipy.optimize.fsolve(lambda r: xnpv(r, values, dates, daycount),
                                           x0=guess, maxfev=maxiters, full_output=True)
            if result[2] == 1:  # ie if the solution converged; if it didn't, result[0] will be the last iteration, which won't be a solution
                output = result[0][0]
            else:
                output = np.nan
    return output
Tests
These are some tests I have put together with pytest
import pytest
import numpy as np
import pandas as pd
import whatever_the_file_name_was as finc
from datetime import date

def test_xirr():

    dates = [date(2010, 12, 29), date(2012, 1, 25), date(2012, 3, 8)]
    values = [-10000, 20, 10100]
    assert pytest.approx(finc.xirr(values, dates)) == 1.006127e-2

    dates = [date(2010, 1, 1), date(2010, 12, 27)]
    values = [-100, 110]
    assert pytest.approx(finc.xirr(values, dates, daycount=360)) == 0.1
    values = [100, -110]
    assert pytest.approx(finc.xirr(values, dates, daycount=360)) == 0.1
    values = [-100, 90]
    assert pytest.approx(finc.xirr(values, dates, daycount=360)) == -0.1

    # test numpy arrays
    values = np.array([-100, 0, 121])
    dates = [date(2010, 1, 1), date(2011, 1, 1), date(2012, 1, 1)]
    assert pytest.approx(finc.xirr(values, dates, daycount=365)) == 0.1

    # with a pandas df
    df = pd.DataFrame()
    df['values'] = values
    df['dates'] = dates
    assert pytest.approx(finc.xirr(df['values'], df['dates'], daycount=365)) == 0.1

    # with a pandas df and datetime types
    df['dates'] = pd.to_datetime(dates)
    assert pytest.approx(finc.xirr(df['values'], df['dates'], daycount=365)) == 0.1

    # now for some unrealistic values
    df['values'] = [-100, 5000, 0]
    assert pytest.approx(finc.xirr(df['values'], df['dates'], daycount=365)) == 49

    df['values'] = [-1e3, 0, 1]
    rate = finc.xirr(df['values'], df['dates'], daycount=365)
    npv = finc.xnpv(rate, df['values'], df['dates'])
    # this is an extreme case; as long as the corresponding NPV is between these values it's not a bad result
    assertion = (npv < 0.1 and npv > -0.1)
    assert assertion == True
P.S. Important difference between this xnpv and numpy.npv
This is not, strictly speaking, relevant to this answer, but useful to know for whoever runs financial calculations with numpy:
numpy.npv doesn't discount the first item of cashflow - it starts from the second, e.g.
np.npv(0.1, [110, 0]) = 110
and
np.npv(0.1, [0, 110]) = 100
Excel, however, discounts from the very first item:
NPV(0.1, [110, 0]) = 100
Numpy's financial functions have been deprecated and replaced with those of numpy_financial, which however will likely continue to behave the same, if only for backward compatibility.
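For evenly spaced periods the two conventions differ by exactly one discounting period, so one can be converted to the other by dividing by (1 + rate). A small illustration (assuming numpy_financial is installed):

import numpy_financial as npf

rate = 0.1
cf = [110, 0]
print(npf.npv(rate, cf))               # 110.0 - the first flow is not discounted
print(npf.npv(rate, cf) / (1 + rate))  # 100.0 - matches Excel's NPV convention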

Created a Python package, finance-calculator, which can be used for XIRR calculation. Under the hood, it uses Newton's method.
Also, I did some time profiling and it is a little better than the scipy-based xnpv method suggested in @KT's answer.
Here's the implementation.

With Pandas, I got the following to work:
(note, I'm using the ACT/365 convention)

import pandas

rate = 0.10
dates = pandas.date_range(start=pandas.Timestamp('2015-01-01'), periods=5, freq="AS")
cf = pandas.Series([-500, 200, 200, 200, 200], index=dates)

# intermediate calculations (if interested)
# cf_xnpv_days = [(cf.index[i]-cf.index[i-1]).days for i in range(1,len(cf.index))]
# cf_xnpv_days_cumulative = [(cf.index[i]-cf.index[0]).days for i in range(1,len(cf.index))]
# cf_xnpv_days_disc_factors = [(1+rate)**(float((cf.index[i]-cf.index[0]).days)/365.0)-1 for i in range(1,len(cf.index))]

cf_xnpv_days_pvs = [cf.iloc[i] / float(1 + (1+rate)**(float((cf.index[i] - cf.index[0]).days)/365.0) - 1) for i in range(1, len(cf.index))]
cf_xnpv = cf.iloc[0] + sum(cf_xnpv_days_pvs)
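The discount factor expression above simplifies to (1 + rate) ** (days / 365), so an equivalent and arguably clearer vectorized form of the same calculation (a sketch reusing rate and cf from above) would be:

import numpy as np

t = np.asarray((cf.index - cf.index[0]).days) / 365.0  # year fractions, ACT/365
cf_xnpv_alt = float(np.sum(cf.to_numpy() / (1 + rate) ** t))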

from scipy import optimize

def xirr(cashflows, transactions, guess=0.1):
    # Function to calculate the internal rate of return.
    # cashflows: list of (date, transaction) tuples
    # transactions: list of transaction amounts
    # Assumes an xnpv(rate, cashflows) helper that accepts the (date, amount) pairs.
    try:
        return optimize.newton(lambda r: xnpv(r, cashflows), guess)
    except RuntimeError:
        positives = [x if x > 0 else 0 for x in transactions]
        negatives = [x if x < 0 else 0 for x in transactions]
        return_guess = (sum(positives) + sum(negatives)) / (-sum(negatives))
        return optimize.newton(lambda r: xnpv(r, cashflows), return_guess)

Related

Speeding up a pd.apply() function

I have some example data
import numpy as np
import pandas as pd

users = 5
size = users*6

df = pd.DataFrame(
    {'userid': np.random.choice(np.arange(0, users), size),
     'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
     'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
    }
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)
For each userid, I need to determine how many times a_time > b_time or vice versa, depending on column focus.
I have a custom function
def some_func(x):
    if (x.focus == 'a').all():
        a = x.a_time
        b = x.b_time
        x['changes'] = (b > a).sum()
        x['days'] = len(a)
    elif (x.focus == 'b').all():
        a = x.a_time
        b = x.b_time
        x['changes'] = (a > b).sum()
        x['days'] = len(a)
    elif (x.focus == 'both').all():
        x['changes'] = 0
        x['days'] = len(a)
    else:
        x['changes'] = None
        x['days'] = None
    return x

test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index(name = 'n_changes')
that works just fine when the number of userid is small. However, as the number of unique userid increases to >100K, this function is almost unbearably slow.
Is there a way to speed up this function? My guess is that there might be an alternative to the if-elif-else syntax in some_func(), but I'm not sure what that syntax might be. The number of rows for each userid is arbitrarily long.
I'm open to non-pandas options if necessary.
A now-deleted answer from another user helped me get to a reasonably-quick solution. After learning that GroupBy.apply() does operate on dataframes and that this answer suggests that apply() isn't very performant (at least compared to transform()), I thought I'd try dropping apply() entirely. I don't think this solution is very pretty but with 277161 userid values in 1.36MM rows, it runs in about 32 seconds:
from datetime import datetime  # for the timing below

# Simulate some data
users = 277161
size = 1360000

df = pd.DataFrame(
    {'userid': np.random.choice(np.arange(0, users), size),
     'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
     'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
    }
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)

# Evaluate time
start = datetime.now()

test_dat['aux'] = np.where(
    test_dat.focus == 'a', test_dat.a_time.lt(test_dat.b_time),
    np.where(test_dat.focus == 'b', test_dat.a_time.gt(test_dat.b_time), None))

test = test_dat.groupby(['userid', 'focus'])['aux'].agg([np.sum, np.size]).reset_index()
# the row above for some reason doesn't handle single boolean values well and just
# returns the boolean values when the number of rows in a group = 1,
# so I force those remaining boolean values into ints here
test['sum'] = test['sum'].astype(int)
test = test.rename(columns = {'sum': 'changes', 'size': 'days'})

end = datetime.now()
(end - start).seconds
I recognize that this isn't an explicit answer to my question (i.e., "speeding up an apply() function") but the reality is that this is a workable, reasonably-quick solution.
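A slightly tidier variant of the same idea (a sketch, not benchmarked; it assumes focus only takes the values 'a' and 'b', as in the simulated data): casting the per-row comparison to int up front keeps the column numeric, so the aggregation needs no type fix-up afterwards.

test_dat['aux'] = np.where(test_dat.focus == 'a',
                           test_dat.a_time.lt(test_dat.b_time),
                           test_dat.a_time.gt(test_dat.b_time)).astype(int)
test = (test_dat.groupby(['userid', 'focus'])['aux']
        .agg(changes='sum', days='size')
        .reset_index())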

Generating random floats, summing to 1, with minimum value

I saw many solutions for generating random floats within a specific range (like this), which actually helps me, and solutions for generating random floats summing to 1 (like this), and separately the solutions work perfectly, but I can't figure out how to merge them.
Currently my code is:
import random

def sample_floats(low, high, k=1):
    """ Return a k-length list of unique random floats
    in the range of low <= x <= high
    """
    result = []
    seen = set()
    for i in range(k):
        x = random.uniform(low, high)
        while x in seen:
            x = random.uniform(low, high)
        seen.add(x)
        result.append(x)
    return result
And still, applying
weights = sample_floats(0.055, 1.0, 11)
weights /= np.sum(weights)
Returns a weights array in which there are some floats less than 0.055.
Should I somehow incorporate np.random.dirichlet into the function above, or should it be built on the basis of np.random.dirichlet and then enforce the > 0.055 condition? I can't figure out a solution.
Thank you in advance!
IIUC, you want to generate an array of k values with a minimum value of low=0.055.
It is easier to generate numbers from 0 that sum up to 1-low*k, and then to add low so that the final array sums to 1. This guarantees both the lower bound and the sum.
Regarding the high, I am pretty sure it is mathematically impossible to add this constraint: once you fix the lower bound and the sum, there are not enough degrees of freedom to choose an upper bound. The upper bound will be 1-low*(k-1) (here 0.505).
Also, be aware that, with a minimum value, you necessarily enforce a maximum k of 1//low (here 18 values). If you set k higher, the low bound won't be respected.
import numpy as np

# parameters
low = 0.055
k = 10

a = np.random.rand(k)
a = a/a.sum()*(1-low*k)
weights = a + low

# checking that the sum is 1
assert np.isclose(weights.sum(), 1)
Example output:
array([0.13608635, 0.06796974, 0.07444545, 0.1361171 , 0.07217206,
0.09223554, 0.12713463, 0.11012871, 0.1107402 , 0.07297022])
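Since the question mentions np.random.dirichlet: the same shift-and-rescale trick works with a Dirichlet draw as well, if you prefer sampling the simplex directly (a sketch with the same low and k as above):

a = np.random.dirichlet(np.ones(k)) * (1 - low*k)  # non-negative, sums to 1 - low*k
weights = a + low                                  # each value >= low, sums to 1
assert np.isclose(weights.sum(), 1)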
You could generate k-1 numbers iteratively by varying the lower and upper bounds of the uniform random number generator - the constraint at any iteration being that the number generated allows the rest of the numbers to be at least low
import random

def sample_floats(low, high, k=1):
    result = []
    generated = 0
    while generated < k-1:
        current_higher_bound = max(low, 1 - (k - 1 - generated)*low - sum(result))
        next_num = random.uniform(low, current_higher_bound)
        result.append(next_num)
        generated += 1
    last_num = 1 - sum(result)
    result.append(last_num)
    return result
print(sample_floats(0.01, 1, k=15))
#[0.08878760926151083,
# 0.17897435239586243,
# 0.5873150041878156,
# 0.021487776792166513,
# 0.011234379498998357,
# 0.012408564286727042,
# 0.015391011259745103,
# 0.01264921242128719,
# 0.010759267284382326,
# 0.010615007333002748,
# 0.010288605412288477,
# 0.010060487014659121,
# 0.010027216923973544,
# 0.010000064276203318,
# 0.010001441651377285]
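A quick sanity check on the routine above: the values should sum to 1 and each should stay at (or very close to) the requested minimum or above.

vals = sample_floats(0.01, 1, k=15)
print(round(sum(vals), 12), min(vals))  # expect 1.0 and a minimum around 0.01 or higher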
The samples are correlated, so I believe you can't generate them in an IID way. You can, however, do it in an iterative manner, for example as I show in the code below. There are a few more special cases to check, like what if the user inputs low > high or high*k < sum. But I figured you can find and account for them using my modification to your code.
import random
import warnings

def sample_floats(low=0.055, high=1., x_sum=1., k=1):
    """ Return a k-length list of unique random floats
    in the range of 'low' <= x <= 'high' summing up to 'x_sum'.
    """
    sum_i = 0
    xs = []
    if x_sum - (k-1)*low < high:
        warnings.warn(f'high = {high} is too high to be generated under the'
                      f' conditions set by k = {k}, sum = {x_sum}, and low = {low}.'
                      f' high automatically set to {x_sum - (k-1)*low}.')

    if k == 1:
        if high < x_sum:
            raise ValueError(f'The parameter combination k = {k}, sum = {x_sum},'
                             f' and high = {high} is impossible.')
        else:
            return x_sum

    high_i = high
    for i in range(k-1):
        x = random.uniform(low, high_i)
        xs.append(x)
        sum_i = sum_i + x
        if high < (x_sum - sum_i - (k-1-i)*low):
            high_i = high
        else:
            high_i = x_sum - sum_i - (k-1-i)*low

    xs.append(x_sum - sum_i)
    return xs
For example:
random.seed(0)
xs = sample_floats(low = 0.055, high = 0.5, x_sum = 1., k = 5)
print(xs)
print(sum(xs))
Output:
[0.43076772392864643, 0.27801464913542906, 0.08495210994346317, 0.06568433355884717, 0.14058118343361425]
1.0

Battery Storage Pyomo: optimize and iterate yearly data over a 365 hours time horizon

I have yearly data on electricity prices called 'HOEP'. With my pyomo model, I want to determine the behavior of a battery for the whole year, but over a 365-hour time horizon (energy in = Ein and energy out = Eout). In other words, I want to make my algorithm run for the first 365 hours, then run again for the next 365-hour time horizon with the initial battery state equal to the last hour of the previous time horizon.
I have tried dividing my yearly data into chunks (24 chunks of 365 hours in the year). With df_list = np.vsplit(dfa, 24), I create a list of chunks and transform them into 24 different dataframes. Then, I use for idx, df in enumerate([df0, df1, df2]) (here only 3 chunks for testing) before my model to loop over the data. However, when I look at my results, it seems that the model only optimizes for the last element of enumerate([df0, df1, df2]), which is df2.
Does anybody know why it does not work for the 3 chunks? Or how I could do this in a different way?
Thank you in advance for your help!
Here is the edited version of my code that works now, but I know it is probably not the most pythonic way of doing this.
import numpy as np
import pandas as pd
from typing import List
from itertools import chain
from pyomo.environ import *

output = []

for idx, df in enumerate([df0, df1, df2]):
    model = ConcreteModel()

    # Variables of the model
    model.T = Set(initialize=df.hour.tolist(), ordered=True)
    model.Rmax = Param(initialize=1, within=Any)
    model.Smax = Param(initialize=5, within=Any)
    model.Dmax = Param(initialize=5, within=Any)
    model.Ein = Var(model.T, domain=NonNegativeReals)
    model.Eout = Var(model.T, domain=NonNegativeReals)
    model.Z = Var(model.T, domain=NonNegativeReals)
    model.L = Var(model.T, domain=NonNegativeReals)
    model.NES = Var(model.T)

    # Constraints
    def storage_state(model, t):
        if t == model.T.first():
            return model.Z[t] == 0
        else:
            return (model.Z[t] == (model.Z[t-1] + (model.Ein[t]) - (model.Eout[t])))
    model.charge_state = Constraint(model.T, rule=storage_state)

    def discharge_constraint(model, t):
        return model.Eout[t] <= model.Rmax
    model.discharge = Constraint(model.T, rule=discharge_constraint)

    def charge_constraint(model, t):
        return model.Ein[t] <= model.Rmax
    model.charge = Constraint(model.T, rule=charge_constraint)

    def positive_charge(model, t):
        return model.Eout[t] <= model.Z[t]
    model.positive_charge = Constraint(model.T, rule=positive_charge)

    def max_SOC(model, t):
        return model.Z[t] <= model.Smax
    model.max_SOC = Constraint(model.T, rule=max_SOC)

    def demand_constraint(model, t):
        return (model.L[t] == (df.loc[t, 'MktDemand'] + (model.Ein[t]) - (model.Eout[t])))
    model.demand_constraint = Constraint(model.T, rule=demand_constraint)

    def discharge_limit(model, t):
        max_t = model.T.last()
        if t < max_t - 24:
            return sum(model.Eout[i] for i in range(t, t+24)) <= model.Dmax
        else:
            return Constraint.Skip
    model.limit_disch_out = Constraint(model.T, rule=discharge_limit)

    def charge_limit(model, t):
        max_t = model.T.last()
        if t < max_t - 24:
            return sum(model.Ein[i] for i in range(t, t+24)) <= model.Dmax
        else:
            return Constraint.Skip
    model.limit_charg_out = Constraint(model.T, rule=charge_limit)

    def Net_energy_sold(model, t):
        return model.NES[t] == ((model.Eout[t] - model.Ein[t]) / model.Rmax * 100)
    model.net_energy = Constraint(model.T, rule=Net_energy_sold)

    # Objective function and optimization
    income = sum(df.loc[t, 'HOEP'] * model.Eout[t] for t in model.T)
    expenses = sum(df.loc[t, 'HOEP'] * model.Ein[t] for t in model.T)
    profits = (income - expenses)
    model.objective = Objective(expr=profits, sense=maximize)

    # Solve model
    solver = SolverFactory('glpk')
    solver.solve(model)

    # Extract model output in list
    Date = list(df['Date'])
    output.append([Date, model.Ein.get_values().values(), model.Eout.get_values().values(),
                   model.Z.get_values().values(), model.NES.get_values().values(),
                   model.L.get_values().values()])

df_results = pd.DataFrame(output)
df_results.rename(columns={0: 'Date', 1: 'Ein', 2: 'Eout', 3: 'Z', 4: 'NES', 5: 'Load'}, inplace=True)
df_results

# Present final results in dataframe
d = ein = eout = z = l = nes = []
for i in list(df_results.index):
    d = d + list(df_results.loc[i, 'Date'])
    ein = ein + list(df_results.loc[i, 'Ein'])
    eout = eout + list(df_results.loc[i, 'Eout'])
    z = z + list(df_results.loc[i, 'Z'])
    nes = nes + list(df_results.loc[i, 'NES'])
    l = l + list(df_results.loc[i, 'Load'])

results = pd.DataFrame(zip(d, ein, eout, z, nes, l), columns=['Date', 'Ein', 'Eout', 'SOC', 'NES', 'Load'])
results
# Returned dataframe
Date Ein Eout SOC NES Load
0 2019-01-01 0.0 0.00 0.00 0.0 16231.00
1 2019-01-01 0.0 0.00 0.00 0.0 16051.00
2 2019-01-01 1.0 0.00 1.00 -100.0 15806.00
3 2019-01-01 1.0 0.00 2.00 -100.0 15581.00
...
Why it isn't working
(disclaimer: this is one issue I see, there might be others).
At each iteration of the for loop, list_of_series is defined from scratch, so all the results obtained in previous iterations are lost.
I'd also check that df.hour is "hour of the year" or "hour from beginning of data" rather than "hour of the day" (if it's the latter, this will also cause an error).
Fixing the problem
(There are several solutions, obviously.) At each iteration of the for loop, turn list_of_series into a pd.DataFrame and append the dataframe to a list.
At the end of the for loop (once you have run the model on each chunk of data), concatenate the list of dataframes.
from typing import List

...

# find a better name, variable names shouldn't specify the type
list_of_dataframes: List[pd.DataFrame] = []

for ...:  # for each chunk of data
    ...   # create model, solve
    list_of_series = ...
    list_of_dataframes.append(pd.DataFrame(list_of_series))

results = pd.concat(list_of_dataframes, axis=0)  # use `ignore_index=True` if needed
A few tips
Break your code into functions. Create a function which defines the model. That highlights what inputs and outputs are, makes the for loop more readable, allows you to use it in other contexts and potentially to test it.
(opinionated) set your "data" as parameters of the model, instead of using them directly to construct constraints and the objective function. This allows you to have a single place where each piece of data is ingested in the model, creates internal consistency in the model and allows you to extract results purely based on the optimized model.
separate I/O (reading/writing to file) from the rest of the code. If your data source changes format or filetype, you'll be able to change that without changing any of the rest of the code.
def main(input_data: pd.DataFrame) -> pd.DataFrame:
    # group by week, month, or any applicable resolution
    # this assumes the index is a `pd.DatetimeIndex`
    # `MS` is "Month Start" - watch out with weeks because `freq="w"` starts
    # on Mondays, and your data might start on a different weekday.
    # If you want to split into chunks of some number of days,
    # use e.g. `freq="14d"`
    grouped = input_data.groupby(pd.Grouper(freq="MS"))

    results_list: List[pd.DataFrame] = []
    for month, group in grouped:
        model = create_model(group)
        optimization_results = SolverFactory('glpk').solve(model)
        results_list.append(extract_results(model))  # pass `group` if needed

    results = pd.concat(results_list, axis=0, ignore_index=True)
    return results

def create_model(df: pd.DataFrame) -> ConcreteModel:
    # NOTE: instead of hard-coding parameters such as battery capacity,
    # pass them as inputs to the function.
    model = ConcreteModel()
    ...
    return model

def extract_results(model: ConcreteModel) -> pd.DataFrame:
    ...

def load_data(filename) -> pd.DataFrame:
    ...

if __name__ == "__main__":
    input_data = load_data(...)
    results = main(input_data)
    results.to_csv(...)
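To make the second tip concrete, here is a partial sketch of what create_model could look like, ingesting the price data once as a Param keyed on the time set. The column names 'hour' and 'HOEP' follow the question's data, and the battery limits are passed in as arguments instead of being hard-coded; the remaining constraints would carry over from the question's model.

def create_model(df: pd.DataFrame, rmax: float = 1, smax: float = 5) -> ConcreteModel:
    model = ConcreteModel()
    model.T = Set(initialize=df['hour'].tolist(), ordered=True)
    # ingest the data once, keyed on the time set
    model.price = Param(model.T, initialize=dict(zip(df['hour'], df['HOEP'])))
    model.Ein = Var(model.T, domain=NonNegativeReals)
    model.Eout = Var(model.T, domain=NonNegativeReals)
    # ... storage state, charge/discharge and SOC constraints as in the question,
    # using rmax and smax instead of model.Rmax / model.Smax ...
    model.objective = Objective(
        expr=sum(model.price[t] * (model.Eout[t] - model.Ein[t]) for t in model.T),
        sense=maximize,
    )
    return model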

fast way to iterate through list, find duplicates and perform calculations

I have two lists, one of areas and one of prices which are the same size.
For example:
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
I need to return, for each entry, the list of prices that share its area, not including that entry's own price.
So for 1500, the lists that I need returned are [750] and [200].
For the 2000, the lists that I need returned are [600, 1000], [800, 1000] and [800, 600].
For the 1800 and 500, the lists I need returned are both empty lists [].
The goal is then to determine whether a value is an outlier, based on whether the absolute value of the price minus the mean (excluding the price itself) is greater than 5 * the population standard deviation (also calculated excluding the price itself).
import statistics

area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]

outlier_idx = []
for idx, val in enumerate(area):
    comp_idx = [i for i, x in enumerate(area) if x == val]
    comp_idx.remove(idx)
    comp_price = [price[i] for i in comp_idx]
    if len(comp_price) > 2:
        sigma = statistics.stdev(comp_price)
        p_m = statistics.mean(comp_price)
        if abs(price[idx]-p_m) > 5 * sigma:
            outlier_idx.append(idx)

area = [i for j, i in enumerate(area) if j not in outlier_idx]
price = [i for j, i in enumerate(price) if j not in outlier_idx]
The problem is that this calculation takes up a lot of time and I am dealing with arrays that can be quite large.
I am stuck as to how I can increase the computational efficiency.
I am open to using numpy, pandas or any other common packages.
Additionally, I have tried the problem in pandas:
df['p-p_m'] = ''
df['sigma'] = ''
df['outlier'] = False

for name, group in df.groupby('area'):
    if len(group) > 1:
        idx = list(group.index)
        for i in range(len(idx)):
            tmp_idx = idx.copy()
            tmp_idx.pop(i)
            df['p-p_m'][idx[i]] = abs(group.price[idx[i]] - group.price[tmp_idx].mean())
            df['sigma'][idx[i]] = group.price[tmp_idx].std(ddof=0)
            if df['p-p_m'][idx[i]] > 3*df['sigma'][idx[i]]:
                df['outlier'][idx[i]] = True
Thanks.
This code is how you can create the lists for each area:
df = pd.DataFrame({'area': area, 'price': price})
price_to_delete = [item for idx_array in df.groupby('price').groups.values() for item in idx_array[1:]]
df.loc[price_to_delete, 'price'] = None
df = df.groupby('area').agg(lambda x: [] if all(x.isnull()) else x.tolist())
df
I don't fully understand what you want, but this part calculates the outliers for each price in each area:
df['outlier'] = False
df['outlier'] = df['price'].map(lambda x: abs(np.array(x) - np.mean(x)) > 3*np.std(x) if len(x) > 0 else [])
df
I hope this helps you in some way!
Here is a solution that combines Numpy and Numba. Although correct, I did not test it against alternative approaches regarding efficiency, but Numba usually results in significant speedups for tasks that require looping through data. I have added one extra point which is an outlier, according to your definition.
import numpy as np
from numba import jit
# data input
price = np.array([200,800,600,800,1000,750,200, 2000])
area = np.array([1500,2000,2000,1800,2000,1500,500, 1500])
@jit(nopython=True)
def outliers(price, area):
    is_outlier = np.full(len(price), False)
    for this_area in set(area):
        indexes = area == this_area
        these_prices = price[indexes]
        for this_price in set(these_prices):
            arr2 = these_prices[these_prices != this_price]
            if arr2.size > 1:
                std = arr2.std()
                mean = arr2.mean()
                indices = (this_price == price) & (this_area == area)
                is_outlier[indices] = np.abs(mean - this_price) > 5 * std
    return is_outlier
> outliers(price, area)
> array([False, False, False, False, False, False, False, True])
The code should be fast in case you have several identical price levels for each area, as they will be updated all at once.
I hope this helps.
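For completeness, here is a fully vectorized pandas alternative (a sketch, not benchmarked against the Numba version): the leave-one-out mean and population standard deviation can be derived from per-group sums of the prices and their squares, so no Python-level loop over rows is needed. As in the Numba version, a row is only tested when its area has at least two other prices.

import numpy as np
import pandas as pd

# same data as in the Numba example above, including the extra outlier
area = [1500, 2000, 2000, 1800, 2000, 1500, 500, 1500]
price = [200, 800, 600, 800, 1000, 750, 200, 2000]
df = pd.DataFrame({'area': area, 'price': price})

g = df.groupby('area')['price']
n = g.transform('size')                                        # group size, incl. the row itself
s1 = g.transform('sum')                                        # group sum of prices
s2 = (df['price'] ** 2).groupby(df['area']).transform('sum')   # group sum of squared prices

m = (n - 1).clip(lower=1)                                      # peers per row (avoid division by zero)
loo_mean = (s1 - df['price']) / m                              # leave-one-out mean
loo_var = ((s2 - df['price'] ** 2) / m - loo_mean ** 2).clip(lower=0)
loo_std = np.sqrt(loo_var)                                     # leave-one-out population std

is_outlier = (n >= 3) & ((df['price'] - loo_mean).abs() > 5 * loo_std)
df_clean = df[~is_outlier]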

AttributeError: 'DataFrame' object has no attribute 'colmap' in Python

I am a Python beginner and I am trying to use the following code from this source: Portfolio rebalancing with bandwidth method in python
The code works well so far.
The problem is that if I want to call the function not as usual, like rebalance(df, tol), but from a certain location in the dataframe onwards, like rebalance(df[500:], tol), I get the following error:
AttributeError: 'DataFrame' object has no attribute 'colmap'. So my question is: how do I have to adjust the code in order to make this possible?
Here is the code:
import datetime as DT
import numpy as np
import pandas as pd
import pandas.io.data as PID

def setup_df():
    df1 = PID.get_data_yahoo("IBM",
                             start=DT.datetime(1970, 1, 1),
                             end=DT.datetime.today())
    df1.rename(columns={'Adj Close': 'ibm'}, inplace=True)
    df2 = PID.get_data_yahoo("F",
                             start=DT.datetime(1970, 1, 1),
                             end=DT.datetime.today())
    df2.rename(columns={'Adj Close': 'ford'}, inplace=True)
    df = df1.join(df2.ford, how='inner')
    df = df[['ibm', 'ford']]
    df['sh ibm'] = 0
    df['sh ford'] = 0
    df['ibm value'] = 0
    df['ford value'] = 0
    df['ratio'] = 0
    # This is useful in conjunction with iloc for referencing column names by
    # index number
    df.colmap = dict([(col, i) for i, col in enumerate(df.columns)])
    return df

def invest(df, i, amount):
    """
    Invest amount dollars evenly between ibm and ford
    starting at ordinal index i.
    This modifies df.
    """
    c = df.colmap
    halfvalue = amount/2
    df.iloc[i:, c['sh ibm']] = halfvalue / df.iloc[i, c['ibm']]
    df.iloc[i:, c['sh ford']] = halfvalue / df.iloc[i, c['ford']]
    df.iloc[i:, c['ibm value']] = (
        df.iloc[i:, c['ibm']] * df.iloc[i:, c['sh ibm']])
    df.iloc[i:, c['ford value']] = (
        df.iloc[i:, c['ford']] * df.iloc[i:, c['sh ford']])
    df.iloc[i:, c['ratio']] = (
        df.iloc[i:, c['ibm value']] / df.iloc[i:, c['ford value']])

def rebalance(df, tol):
    """
    Rebalance df whenever the ratio falls outside the tolerance range.
    This modifies df.
    """
    i = 0
    amount = 100
    c = df.colmap
    while True:
        invest(df, i, amount)
        mask = (df['ratio'] >= 1+tol) | (df['ratio'] <= 1-tol)
        # ignore prior locations where the ratio falls outside tol range
        mask[:i] = False
        try:
            # Move i one index past the first index where mask is True
            # Note that this means the ratio at i will remain outside tol range
            i = np.where(mask)[0][0] + 1
        except IndexError:
            break
        amount = (df.iloc[i, c['ibm value']] + df.iloc[i, c['ford value']])
    return df

df = setup_df()
tol = 0.05  # setting the bandwidth tolerance
rebalance(df, tol)
df['portfolio value'] = df['ibm value'] + df['ford value']
df["ibm_weight"] = df['ibm value']/df['portfolio value']
df["ford_weight"] = df['ford value']/df['portfolio value']

print(df['ibm_weight'].min())
print(df['ibm_weight'].max())
print(df['ford_weight'].min())
print(df['ford_weight'].max())

# This shows the rows which trigger rebalancing
mask = (df['ratio'] >= 1+tol) | (df['ratio'] <= 1-tol)
print(df.loc[mask])
The problem you encountered is due to a poor design decision on my part.
colmap is an attribute defined on df in setup_df:
df.colmap = dict([(col, i) for i,col in enumerate(df.columns)])
It is not a standard attribute of a DataFrame.
df[500:] returns a new DataFrame which is generated by copying data from df into the new DataFrame. Since colmap is not a standard attribute, it is not copied into the new DataFrame.
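A minimal illustration of this point (a sketch; recent pandas versions may emit a UserWarning when you set a non-column attribute on a DataFrame):

import pandas as pd

df = pd.DataFrame({'a': range(5)})
df.colmap = {'a': 0}              # instance attribute, not part of the DataFrame's data
sliced = df[2:]                   # new DataFrame built by copying data from df
print(hasattr(df, 'colmap'))      # True
print(hasattr(sliced, 'colmap'))  # False - the custom attribute is not carried over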
To call rebalance on a DataFrame other than the one returned by setup_df, replace c = df.colmap with
c = dict([(col, j) for j,col in enumerate(df.columns)])
I've made this change in the original post as well.
PS. In the other question, I had chosen to define colmap on df itself so
that this dict would not have to be recomputed with every call to rebalance
and invest.
Your question shows me that this minor optimization is not worth making these
functions so dependent on the specialness of the DataFrame returned by
setup_df.
There is a second problem you will encounter using rebalance(df[500:], tol):
Since df[500:] returns a copy of a portion of df, rebalance(df[500:], tol) will modify
this copy and not the original df. If the object, df[500:],
has no reference outside of rebalance(df[500:], tol), it will be garbage
collected after the call to rebalance is completed. So the entire computation
would be lost. Therefore rebalance(df[500:], tol) is not useful.
Instead, you could modify rebalance to accept i as a parameter:
def rebalance(df, tol, i=0):
    """
    Rebalance df whenever the ratio falls outside the tolerance range.
    This modifies df.
    """
    c = dict([(col, j) for j, col in enumerate(df.columns)])
    while True:
        mask = (df['ratio'] >= 1+tol) | (df['ratio'] <= 1-tol)
        # ignore prior locations where the ratio falls outside tol range
        mask[:i] = False
        try:
            # Move i one index past the first index where mask is True
            # Note that this means the ratio at i will remain outside tol range
            i = np.where(mask)[0][0] + 1
        except IndexError:
            break
        amount = (df.iloc[i, c['ibm value']] + df.iloc[i, c['ford value']])
        invest(df, i, amount)
    return df
Then you can rebalance df starting at the 500th row using
rebalance(df, tol, i=500)
Note that this finds the first row on or after i=500 that needs
rebalancing. It does not necessarily rebalance at i=500 itself. This allows you to call rebalance(df, tol, i) for arbitrary i without having to determine in advance if rebalancing is required on row i.
