I am learning how to use the Gurobi optimizer, and here is my sample code for portfolio optimization.
import gurobipy as gp
from gurobipy import GRB
from math import sqrt
import pandas as pd
import numpy as np
# Create historical return data for two stocks
equity1 = [0.0107, 0.0122, 0.076, 0.084, 0.0207]
equity2 = [0.0133, 0.0278, 0.0719, 0.0353, 0.0163]
data = pd.DataFrame(list(zip(equity1, equity2)), columns = ['SPX', 'FXAIX'])
stocks = data.columns
# Calculate basic summary statistics for individual stocks
stock_volatility = data.std()
stock_return = data.mean()
# Create an empty model
m = gp.Model('portfolio')
# Add a variable for each stock
vars = pd.Series(m.addVars(stocks), index=stocks)
# Objective is to minimize risk (squared). This is modeled using the
# covariance matrix, which measures the historical correlation between stocks.
sigma = data.cov()
portfolio_risk = sigma.dot(vars).dot(vars)
m.setObjective(portfolio_risk, GRB.MINIMIZE)
# Fix budget with a constraint
m.addConstr(vars.sum() == 1, 'budget')
# Optimize model to find the minimum risk portfolio
m.setParam('OutputFlag', 0)
m.optimize()
# Create an expression representing the expected return for the portfolio
portfolio_return = stock_return.dot(vars)
# Display minimum risk portfolio
print('Minimum Risk Portfolio:\n')
for v in vars:
    if v.x > 0:
        print('\t%s\t: %g' % (v.varname, v.x))
minrisk_volatility = sqrt(portfolio_risk.getValue())
minrisk_return = portfolio_return.getValue()
# Solve for efficient frontier by varying target return
frontier = pd.Series(dtype=np.float64)
for r in np.linspace(stock_return.min(), stock_return.max(), 100):
    m.addConstr(portfolio_return == r, 'target')
    m.optimize()
    print(portfolio_risk.getValue())
    #frontier.loc[sqrt(portfolio_risk.getValue())] = r
I got the error "Unable to retrieve attribute 'x' for the last line of code somehow when I try to create an efficient frontier. Thanks for any suggestions!
First, two important suggestions:
Always check the Status code to see the outcome of a call to Model.optimize().
For debugging, never disable logging, so comment out the line that sets the parameter OutputFlag=0.
With logging, I get this output:
Barrier solved model in 0 iterations and 0.00 seconds
Model is infeasible or unbounded
Traceback (most recent call last):
File "so.py", line 56, in <module>
print(portfolio_risk.getValue())
File "src/gurobipy/quadexpr.pxi", line 404, in gurobipy.QuadExpr.getValue
File "src/gurobipy/var.pxi", line 125, in gurobipy.Var.__getattr__
File "src/gurobipy/var.pxi", line 153, in gurobipy.Var.getAttr
File "src/gurobipy/attrutil.pxi", line 100, in gurobipy.__getattr
AttributeError: Unable to retrieve attribute 'x'
So to resolve the underlying issue, follow the instructions in the Gurobi knowledgebase article: How do I resolve the error "Model is infeasible or unbounded".
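For example, a minimal sketch of that status check inside the efficient-frontier loop (only the check itself is new; everything else is unchanged from the code above):
m.optimize()
if m.Status == GRB.OPTIMAL:
    # Safe to query solution attributes such as X and getValue() here.
    print(portfolio_risk.getValue())
else:
    print('Optimize ended with status %d; no solution to query' % m.Status)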
I'm attempting to identify elements in the Euclidean distance matrix that fall under a certain threshold. I then take the positional arguments from this search and use them to compare elements in a second array (for the sake of demonstration this array is the first eigenvector of a PCA, but the sort is the most relevant part for my question). The approach needs to work for an unknown number of observations, but should run effectively on several million.
import numpy as np
from scipy.spatial.distance import cdist
threshold = 10
data = np.random.uniform(1, 100, (5000, 3))  # e.g. 5000 observations with 3 features
searchValues = np.where(cdist(data, data) < threshold)
My problem is two fold.
Firstly, the Euclidean distance matrix quickly becomes too large to simply apply scipy.spatial.distance.cdist(). To solve this issue I apply the cdist function in batches over the dataset and implement the search iteratively.
cdist(data, data)
Traceback (most recent call last):
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-fb93ae543712>", line 1, in <module>
cdist(data, data)
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
dm = np.zeros((mA, mB), dtype=np.double)
MemoryError
The second problem is a runtime issue that results from constructing the distance matrix iteratively. When I use my iterative approach the runtime increases exponentially. This isn't unexpected, given the nature of the iterative approach.
import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit
threshold = 10
data = np.random.uniform(1, 100, (200000,40)) #Build random data
data = da.asarray(data)
it = round(data.shape[0]/10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
comparisons = itertools.combinations(dataArrays, 2)
start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
    searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
print(time)
Neither of these issues is unexpected given the nature of the problem. To try to offset both problems I've used dask to implement both a large-data framework in Python and parallelization in the batch process. However, this hasn't resulted in a significant improvement in computation time, and I have a pretty strict memory limitation with this iterative method in dask (requiring batches of 1000 observations at a time).
from dask.diagnostics import ProgressBar
import dask.delayed
import dask.bag

#dask.delayed
def eucDist(comparison):
    return da.asarray(cdist(comparison[0], comparison[1]))

#dask.delayed
def findValues(euclideanMatrix):
    return np.where(euclideanMatrix < threshold)

start = timeit.default_timer()
searchvalues = []
test = []
for comparison in comparisons:
    comp = dask.delayed(eucDist)(comparison)
    test.append(comp)

look = []
with ProgressBar():
    for element in test:
        look.append(dask.delayed(findValues)(element).compute())
I'm hoping that I can parallelize the comparisons to increase my speed, but I'm not sure how to implement that in python. Any help with that, or any recommendations for how I can improve the initial comparison code would be appreciated.
You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x,y).
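A minimal sketch of what that could look like, assuming the dask_distance package is installed and exposes euclidean(x, y) for chunked arrays as described (I have not verified the exact signature):
import dask.array as da
import dask_distance  # assumed third-party package providing euclidean(x, y)

threshold = 10
data = da.random.uniform(1, 100, size=(200000, 40), chunks=(10000, 40))
dist = dask_distance.euclidean(data, data)    # lazy pairwise distance matrix
pairs = da.argwhere(dist < threshold)         # lazy index pairs below the threshold
result = pairs.compute()                      # triggers the actual computation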
I believe that the dask-image package has some dask-enabled distance algorithms.
https://github.com/dask/dask-image
I am using the threading library for the first time in order to speed up the training time of my SARIMAX model. But the code keeps failing with the following error:
Bad direction in the line search; refresh the lbfgs memory and restart the iteration.
This problem is unconstrained.
This problem is unconstrained.
This problem is unconstrained.
Following is my code:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.tsa.api as smt
from threading import Thread

def process_id(ndata):
    train = ndata[0:-7]
    test = ndata[len(train):]
    try:
        model = smt.SARIMAX(train.asfreq(freq='1d'), exog=None, order=(0, 1, 1), seasonal_order=(0, 1, 1, 7)).fit()
        pred = model.get_forecast(len(test))
        fcst = pred.predicted_mean
        fcst.index = test.index
        mapelist = []
        for i in range(len(fcst)):
            mapelist.insert(i, (np.absolute(test[i] - fcst[i])) / test[i])
        mape = np.mean(mapelist) * 100
        print(mape)
    except:
        mape = 0
        pass
    return mape

def process_range(ndata, store=None):
    if store is None:
        store = {}
    for id in ndata:
        store[id] = process_id(ndata[id])
    return store

def threaded_process_range(nthreads, ndata):
    store = {}
    threads = []
    # create the threads
    k = 0
    tk = ndata.columns
    for i in range(nthreads):
        dk = tk[k:len(tk)//nthreads+k]   # integer division so the slice indices stay ints
        k = k + len(tk)//nthreads
        t = Thread(target=process_range, args=(ndata[dk], store))
        threads.append(t)
    [ t.start() for t in threads ]
    [ t.join() for t in threads ]
    return store

outdata = threaded_process_range(4, ndata)
A few things I would like to mention:
Data is daily stock time series in a dataframe
Threading works for ARIMA model
SARIMAX model works when done in a for loop
Any insights would be highly appreciated thanks!
I got the same error with lbfgs. I'm not sure why lbfgs fails to do the gradient evaluations, but I tried changing the optimizer. You can try this too; choose among any of these optimizers:
'newton' for Newton-Raphson
'nm' for Nelder-Mead
'bfgs' for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
'lbfgs' for limited-memory BFGS with optional box constraints
'powell' for modified Powell's method
'cg' for conjugate gradient
'ncg' for Newton-conjugate gradient
'basinhopping' for global basin-hopping solver
Change this line in your code:
model = smt.SARIMAX(train.asfreq(freq='1d'), exog=None, order=(0, 1, 1), seasonal_order=(0, 1, 1, 7)).fit(method='cg')
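If 'cg' also struggles, a rough sketch of an alternative (my own addition, not from the original answer) is to loop over a few of the methods listed above and keep the first fit whose optimizer reports convergence:
# Hedged sketch: try several optimizers and keep the first converged fit.
model = None
for opt in ('cg', 'nm', 'powell', 'bfgs'):
    res = smt.SARIMAX(train.asfreq(freq='1d'), exog=None,
                      order=(0, 1, 1), seasonal_order=(0, 1, 1, 7)).fit(method=opt, disp=False)
    if res.mle_retvals.get('converged', False):
        model = res
        break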
It's an old question, but I'm still answering it in case someone faces the same problem in the future.
I'm trying to fit the Einstein approximation of resistivity in a solid to a set of experimental data.
I have resistivity vs. temperature (from 200 K down to 4 K).
import xlrd as xd
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
import scipy as sp
from scipy.optimize import curve_fit
#retrieve data from file
data = pl.loadtxt('salita.txt')
Temp = data[:, 1]
Res = data[:, 2]
#define fitting function
def einstein_func(T, ro0, AE, TE):
    nl = np.sinh(TE/(2*T))
    return ro0 + AE*nl*T
p0 = sp.array([1 , 1, 1])
coeffs, cov = curve_fit(einstein_func, Temp, Res, p0)
But I get these warnings
crio.py:14: RuntimeWarning: divide by zero encountered in divide
nl = np.sinh(TE/(2*T))
crio.py:14: RuntimeWarning: overflow encountered in sinh
nl = np.sinh(TE/(2*T))
crio.py:15: RuntimeWarning: divide by zero encountered in divide
return ro0 + AE*np.sinh(TE/(2*T))*T
crio.py:15: RuntimeWarning: overflow encountered in sinh
return ro0 + AE*np.sinh(TE/(2*T))*T
crio.py:15: RuntimeWarning: invalid value encountered in multiply
return ro0 + AE*np.sinh(TE/(2*T))*T
Traceback (most recent call last):
File "crio.py", line 19, in <module>
coeffs, cov = curve_fit(einstein_func, Temp, Res, p0)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/optimize/minpack.py", line 511, in curve_fit
raise RuntimeError(msg)
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
I don't understand why it keeps saying that there is a divide by zero in sinh, since I have strictly positive values. Varying my starting guess has no effect on it.
EDIT: My dataset is organized like this:
4.39531E+0 1.16083E-7
4.39555E+0 -5.92258E-8
4.39554E+0 -3.79045E-8
4.39525E+0 -2.13213E-8
4.39619E+0 -4.02736E-8
4.43130E+0 -1.42142E-8
4.45900E+0 -2.60594E-8
4.46129E+0 -9.00232E-8
4.46181E+0 1.42142E-7
4.46195E+0 -2.13213E-8
4.46225E+0 4.26426E-8
4.46864E+0 -2.60594E-8
4.47628E+0 1.37404E-7
4.47747E+0 9.47612E-9
4.48008E+0 2.84284E-8
4.48795E+0 1.35035E-7
4.49804E+0 1.39773E-7
4.51151E+0 -1.75308E-7
4.54916E+0 -1.63463E-7
4.59176E+0 -2.36902E-9
where the first column is temperature and the second one is resistivity (the negative values are due to noise in the trial current, since the sample is a PbIn alloy which becomes superconductive at temperatures below 6.7-6.9 K; here we are at 4.5 K).
The arguments I'm providing to sinh are NumPy arrays, and with a linear function ro0 + AE*T my code works. I've tried with scipy.optimize.minimize but the result is the same.
Now I see that I have almost nine hundred values in my file; might that be the problem?
I have edited my dataset removing some lines and now the only warning showing is
RuntimeWarning: overflow encountered in sinh
How can I work around it?
Here are a couple of observations that could help:
You could try the least-squares fit directly with leastsq, providing the Jacobian, which might help tame it (see the sketch after this answer).
I'm guessing you don't want the superconducting temperatures in your data set at all if you're fitting to an Einstein model (do you have a source for this eqn, btw?)
Do make sure your initial guesses are as good as they could possibly be (ro0=AE=TE=1 probably won't cut it).
Plot your data and make sure there aren't any weird artefacts
You seem to be indexing your data array in the wrong way in your code example: if the data is structured as you say, you want:
Temp = data[:, 0]
Res = data[:, 1]
(Python indexes start at 0).
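On the first point, a minimal sketch of what a leastsq call with an analytic Jacobian could look like (the residual and derivative expressions below are my own, derived from einstein_func, not code from the original post):
from scipy.optimize import leastsq

def residuals(p, T, R):
    ro0, AE, TE = p
    return ro0 + AE * T * np.sinh(TE / (2 * T)) - R

def jacobian(p, T, R):
    ro0, AE, TE = p
    s = np.sinh(TE / (2 * T))
    c = np.cosh(TE / (2 * T))
    # Columns are the partial derivatives with respect to ro0, AE and TE.
    return np.column_stack((np.ones_like(T), T * s, 0.5 * AE * c))

p0 = [Res.mean(), 1e-3, 50.0]   # rough guesses only; tune these to your data
popt, ier = leastsq(residuals, p0, args=(Temp, Res), Dfun=jacobian)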
I'm trying to get my first power spectral density graph plotted using actual data instead of something that's purely theoretical and generated within Python. I'm having problems getting anything to work, however. Code is attached below, followed by the error I get in my console after line 19.
I don't know if it makes a difference, but I'm transitioning to Python from mostly working in MATLAB. I am not counting on having access to a license forever, so I really want to learn how to start doing everything in Python. But it's hard.
Code:
import numpy as np
from scipy import signal
import scipy.io
import matplotlib.pyplot as plt
#import data from a .mat file using the loadmat command
mat = scipy.io.loadmat('Mic_Data_Sums.mat')
# 1 x 1 array, sampling frequency of 22050 Hz
fs = mat['Fs']
# Attempted fix: change data type to 8-point float?
# fs = fs.astype('f8')
# 13 x 1323000 array - 13 separate time series of data, 60 seconds each
data = mat['Mic_Data_Sums']
# Welch function - transpose 'data' and use the 2nd time series
f, Pxx_spec = signal.welch(data.T[1], fs, window='hanning', nperseg=fs,
                           noverlap=fs/2, scaling='spectrum')
Console:
/Users/******/anaconda/lib/python3.4/site-packages/scipy/signal/spectral.py:297: RuntimeWarning: divide by zero encountered in double_scalars
scale = 1.0 / win.sum()**2
Traceback (most recent call last):
File "plotPSDs.py", line 20, in <module>
noverlap = fs/2, scaling = 'spectrum')
File "/Users/******/anaconda/lib/python3.4/site-packages/scipy/signal/spectral.py", line 333, in welch
xft = fftpack.rfft(x_dt*win, nfft)
ValueError: operands could not be broadcast together with shapes (22050,) (0,22051)
Note how the ValueError tag gives me weird shape (dimension) results: I have no idea where the 22051 is coming from.
Edit: As a workaround solution, I commented out the line of fs = mat['Fs'] and simply replaced it with fs = 22050, which made the code execute successfully. However, the question still remains, why can't I simply reference the variable as it was stored in the .mat file?
[from the comments above] If you know fs is 1x1, try passing fs[0,0] to welch. The docstring for welch says fs should be a float, so it might behave unpredictably if you give it a two-dimensional array. – Warren Weckesser
This worked well. The code I implemented is:
# 1 x 1 array, sampling frequency (22050 Hz)
fs = mat['Fs']
fs = fs[0,0]
then using the code from before,
f, Pxx_spec = signal.welch(data.T[1], fs, window='hanning', nperseg=fs,
                           noverlap=fs/2, scaling='spectrum')
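An equivalent shortcut (my own note, not from the quoted comment) is to pull the scalar straight out of the 1x1 array with NumPy's item():
# Extract the scalar sampling frequency from the 1x1 array stored in the .mat file
fs = mat['Fs'].item()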
Here's a MWE of a much larger code I'm using. Basically, it performs a Monte Carlo integration over a KDE (kernel density estimate) for all values located below a certain threshold (the integration method was suggested over at this question BTW: Integrate 2D kernel density estimate).
import numpy as np
from scipy import stats
import time
# Generate some random two-dimensional data:
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
# Get data.
m1, m2 = measure(20000)
# Define limits.
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
# Perform a kernel density estimate on the data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
# Define point below which to integrate the kernel.
x1, y1 = 0.5, 0.5
# Get kernel value for this point.
tik = time.time()
iso = kernel((x1,y1))
print 'iso: ', time.time()-tik
# Sample from KDE distribution (Monte Carlo process).
tik = time.time()
sample = kernel.resample(size=1000)
print 'resample: ', time.time()-tik
# Filter the sample leaving only values for which
# the kernel evaluates to less than what it does for
# the (x1, y1) point defined above.
tik = time.time()
insample = kernel(sample) < iso
print 'filter/sample: ', time.time()-tik
# Integrate for all values below iso.
tik = time.time()
integral = insample.sum() / float(insample.shape[0])
print 'integral: ', time.time()-tik
The output looks something like this:
iso: 0.00259208679199
resample: 0.000817060470581
filter/sample: 2.10829401016
integral: 4.2200088501e-05
which clearly means that the filter/sample call is consuming almost all of the time the code uses to run. I have to run this block of code iteratively several thousand times so it can get pretty time consuming.
Is there any way to speed up the filtering/sampling process?
Add
Here's a slightly more realistic MWE of my actual code with Ophion's multi-threading solution written into it:
import numpy as np
from scipy import stats
from multiprocessing import Pool

def kde_integration(m_list):
    m1, m2 = [], []
    for item in m_list:
        # Color data.
        m1.append(item[0])
        # Magnitude data.
        m2.append(item[1])

    # Define limits.
    xmin, xmax = min(m1), max(m1)
    ymin, ymax = min(m2), max(m2)

    # Perform a kernel density estimate on the data:
    x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    values = np.vstack([m1, m2])
    kernel = stats.gaussian_kde(values)

    out_list = []
    for point in m_list:
        # Compute the point below which to integrate.
        iso = kernel((point[0], point[1]))

        # Sample KDE distribution
        sample = kernel.resample(size=1000)

        # Create definition.
        def calc_kernel(samp):
            return kernel(samp)

        # Choose number of cores and split input array.
        cores = 4
        torun = np.array_split(sample, cores, axis=1)

        # Calculate
        pool = Pool(processes=cores)
        results = pool.map(calc_kernel, torun)

        # Reintegrate and calculate results
        insample_mp = np.concatenate(results) < iso

        # Integrate for all values below iso.
        integral = insample_mp.sum() / float(insample_mp.shape[0])
        out_list.append(integral)

    return out_list

# Generate some random two-dimensional data:
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2

# Create list to pass.
m_list = []
for i in range(60):
    m1, m2 = measure(5)
    m_list.append(m1.tolist())
    m_list.append(m2.tolist())

# Call KDE integration function.
print 'Integral result: ', kde_integration(m_list)
The solution presented by Ophion works great on the original code I presented, but fails with the following error in this version:
Integral result: Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I tried moving the calc_kernel function around since one of the answers in this question Multiprocessing: How to use Pool.map on a function defined in a class? states that "the function that you give to map() must be accessible through an import of your module"; but I still can't get this code to work.
Any help will be very much appreciated.
Add 2
Implementing Ophion's suggestion to remove the calc_kernel function and simply using:
results = pool.map(kernel, torun)
works to get rid of the PicklingError but now I see that if I create an initial m_list of anything more than around 62-63 items I get this error:
Traceback (most recent call last):
File "~/gauss_kde_temp.py", line 67, in <module>
print 'Integral result: ', kde_integration(m_list)
File "~/gauss_kde_temp.py", line 38, in kde_integration
pool = Pool(processes=cores)
File "/usr/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 161, in __init__
self._result_handler.start()
File "/usr/lib/python2.7/threading.py", line 494, in start
_start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Since my actual list in my real implementation of this code can have up to 2000 items, this issue renders the code unusable. Line 38 is this one:
pool = Pool(processes=cores)
so apparently it has something to do with the number of cores I'm using?
This question "Can't start a new thread error" in Python suggests using:
threading.active_count()
to check the number of threads I have going when I get that error. I checked and it always crashes when it reaches 374 threads. How can I code around this issue?
Here's the new question dealing with this last issue: Thread error: can't start new thread
Probably the easiest way to speed this up is to parallelize kernel(sample):
Taking this code fragment:
tik = time.time()
insample = kernel(sample) < iso
print 'filter/sample: ', time.time()-tik
#filter/sample: 1.94065904617
Change this to use multiprocessing:
from multiprocessing import Pool
tik = time.time()

# Create definition.
def calc_kernel(samp):
    return kernel(samp)

# Choose number of cores and split input array.
cores = 4
torun = np.array_split(sample, cores, axis=1)

# Calculate
pool = Pool(processes=cores)
results = pool.map(calc_kernel, torun)

# Reintegrate and calculate results
insample_mp = np.concatenate(results) < iso

print 'multiprocessing filter/sample: ', time.time()-tik
#multiprocessing filter/sample: 0.496874094009
Double check they are returning the same answer:
print np.all(insample==insample_mp)
#True
A 3.9x improvement on 4 cores. Not sure what you are running this on, but after about 6 processors your input array size is not large enough to get considerable gains. For example, using 20 processors it's only about 5.8x faster.
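One housekeeping note of my own (not part of the original answer): if this block runs inside a loop, as in the "Add" section of the question, it may help to tear the pool down explicitly instead of leaking a fresh Pool on every iteration, roughly:
# Sketch: create the pool, use it, then release its worker processes.
pool = Pool(processes=cores)
try:
    results = pool.map(calc_kernel, torun)
finally:
    pool.close()   # stop accepting new work
    pool.join()    # wait for the worker processes to exit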
The claim in the comments section of this article (link below) is
"SciPy’s gaussian_kde doesn’t use FFT, while there is a statsmodels implementation that does"
…which is a possible cause of the observed poor performance. It goes on to report orders-of-magnitude improvements using FFT. See @jseabold's reply.
http://slendrmeans.wordpress.com/2012/05/01/will-it-python-machine-learning-for-hackers-chapter-2-part-1-summary-stats-and-density-estimators/
Disclaimer: I have no experience with statsmodels or scipy.