Speed up sampling of kernel estimate

Speed up sampling of kernel estimate - python

Here's a MWE of a much larger code I'm using. Basically, it performs a Monte Carlo integration over a KDE (kernel density estimate) for all values located below a certain threshold (the integration method was suggested over at this question BTW: Integrate 2D kernel density estimate).
import numpy as np
from scipy import stats
import time
# Generate some random two-dimensional data:
def measure(n):
"Measurement model, return two coupled measurements."
m1 = np.random.normal(size=n)
m2 = np.random.normal(scale=0.5, size=n)
return m1+m2, m1-m2
# Get data.
m1, m2 = measure(20000)
# Define limits.
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
# Perform a kernel density estimate on the data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
# Define point below which to integrate the kernel.
x1, y1 = 0.5, 0.5
# Get kernel value for this point.
tik = time.time()
iso = kernel((x1,y1))
print 'iso: ', time.time()-tik
# Sample from KDE distribution (Monte Carlo process).
tik = time.time()
sample = kernel.resample(size=1000)
print 'resample: ', time.time()-tik
# Filter the sample leaving only values for which
# the kernel evaluates to less than what it does for
# the (x1, y1) point defined above.
tik = time.time()
insample = kernel(sample) < iso
print 'filter/sample: ', time.time()-tik
# Integrate for all values below iso.
tik = time.time()
integral = insample.sum() / float(insample.shape[0])
print 'integral: ', time.time()-tik
The output looks something like this:
iso: 0.00259208679199
resample: 0.000817060470581
filter/sample: 2.10829401016
integral: 4.2200088501e-05
which clearly means that the filter/sample call is consuming almost all of the time the code uses to run. I have to run this block of code iteratively several thousand times so it can get pretty time consuming.
Is there any way to speed up the filtering/sampling process?
Add
Here's a slightly more realistic MWE of my actual code with Ophion's multi-threading solution written into it:
import numpy as np
from scipy import stats
from multiprocessing import Pool
def kde_integration(m_list):
m1, m2 = [], []
for item in m_list:
# Color data.
m1.append(item[0])
# Magnitude data.
m2.append(item[1])
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Perform a kernel density estimate on the data:
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
out_list = []
for point in m_list:
# Compute the point below which to integrate.
iso = kernel((point[0], point[1]))
# Sample KDE distribution
sample = kernel.resample(size=1000)
#Create definition.
def calc_kernel(samp):
return kernel(samp)
#Choose number of cores and split input array.
cores = 4
torun = np.array_split(sample, cores, axis=1)
#Calculate
pool = Pool(processes=cores)
results = pool.map(calc_kernel, torun)
#Reintegrate and calculate results
insample_mp = np.concatenate(results) < iso
# Integrate for all values below iso.
integral = insample_mp.sum() / float(insample_mp.shape[0])
out_list.append(integral)
return out_list
# Generate some random two-dimensional data:
def measure(n):
"Measurement model, return two coupled measurements."
m1 = np.random.normal(size=n)
m2 = np.random.normal(scale=0.5, size=n)
return m1+m2, m1-m2
# Create list to pass.
m_list = []
for i in range(60):
m1, m2 = measure(5)
m_list.append(m1.tolist())
m_list.append(m2.tolist())
# Call KDE integration function.
print 'Integral result: ', kde_integration(m_list)
The solution presented by Ophion works great on the original code I presented, but fails with the following error in this version:
Integral result: Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I tried moving the calc_kernel function around since one of the answers in this question Multiprocessing: How to use Pool.map on a function defined in a class? states that "the function that you give to map() must be accessible through an import of your module"; but I still can't get this code to work.
Any help will be very much appreciated.
Add 2
Implementing Ophion's suggestion to remove the calc_kernel function and simply using:
results = pool.map(kernel, torun)
works to get rid of the PicklingError but now I see that if I create an initial m_list of anything more than around 62-63 items I get this error:
Traceback (most recent call last):
File "~/gauss_kde_temp.py", line 67, in <module>
print 'Integral result: ', kde_integration(m_list)
File "~/gauss_kde_temp.py", line 38, in kde_integration
pool = Pool(processes=cores)
File "/usr/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 161, in __init__
self._result_handler.start()
File "/usr/lib/python2.7/threading.py", line 494, in start
_start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Since my actual list in my real implementation of this code can have up to 2000 items, this issue renders the code unusable. Line 38 is this one:
pool = Pool(processes=cores)
so apparently it has something to do with the number of cores I'm using?
This question "Can't start a new thread error" in Python suggests using:
threading.active_count()
to check the number of threads I have going when I get that error. I checked and it always crashes when it reaches 374 threads. How can I code around this issue?
Here's the new question dealing with this last issue: Thread error: can't start new thread

Probably the easiest way to speed this up is to parallelize kernel(sample):
Taking this code fragment:
tik = time.time()
insample = kernel(sample) < iso
print 'filter/sample: ', time.time()-tik
#filter/sample: 1.94065904617
Change this to use multiprocessing:
from multiprocessing import Pool
tik = time.time()
#Create definition.
def calc_kernel(samp):
return kernel(samp)
#Choose number of cores and split input array.
cores = 4
torun = np.array_split(sample, cores, axis=1)
#Calculate
pool = Pool(processes=cores)
results = pool.map(calc_kernel, torun)
#Reintegrate and calculate results
insample_mp = np.concatenate(results) < iso
print 'multiprocessing filter/sample: ', time.time()-tik
#multiprocessing filter/sample: 0.496874094009
Double check they are returning the same answer:
print np.all(insample==insample_mp)
#True
A 3.9x improvement on 4 cores. Not sure what you are running this on, but after about 6 processors your input array size is not large enough to get considerably gains. For example using 20 processors its only about 5.8x faster.

The claim in the comments section of this article (link below) is
"SciPy’s gaussian_kde doesn’t use FFT, while there is a statsmodels implementation that does"
…which is a possible cause of the observed poor performance. It goes on to report orders of magnitude improvement using FFT. See #jseabold's reply.
http://slendrmeans.wordpress.com/2012/05/01/will-it-python-machine-learning-for-hackers-chapter-2-part-1-summary-stats-and-density-estimators/
Disclaimer: I have no experience with statsmodels or scipy.

Related

portfolio optimization - unable to retrieve attribute for 'x'

I am learning how to use Gurobi optimizer and here is the sample code for portfolio optimization.
import gurobipy as gp
from gurobipy import GRB
from math import sqrt
import pandas as pd
import numpy as np
# Create historical return data for two stocks
equity1 = [0.0107, 0.0122, 0.076, 0.084, 0.0207]
equity2 = [0.0133, 0.0278, 0.0719, 0.0353, 0.0163]
data = pd.DataFrame(list(zip(equity1, equity2)), columns = ['SPX', 'FXAIX'])
stocks = data.columns
# Calculate basic summary statistics for individual stocks
stock_volatility = data.std()
stock_return = data.mean()
# Create an empty model
m = gp.Model('portfolio')
# Add a variable for each stock
vars = pd.Series(m.addVars(stocks), index=stocks)
# Objective is to minimize risk (squared). This is modeled using the
# covariance matrix, which measures the historical correlation between stocks.
sigma = data.cov()
portfolio_risk = sigma.dot(vars).dot(vars)
m.setObjective(portfolio_risk, GRB.MINIMIZE)
# Fix budget with a constraint
m.addConstr(vars.sum() == 1, 'budget')
# Optimize model to find the minimum risk portfolio
m.setParam('OutputFlag', 0)
m.optimize()
# Create an expression representing the expected return for the portfolio
portfolio_return = stock_return.dot(vars)
# Display minimum risk portfolio
print('Minimum Risk Portfolio:\n')
for v in vars:
if v.x > 0:
print('\t%s\t: %g' % (v.varname, v.x))
minrisk_volatility = sqrt(portfolio_risk.getValue())
minrisk_return = portfolio_return.getValue()
# Solve for efficient frontier by varying target return
frontier = pd.Series(dtype=np.float64)
for r in np.linspace(stock_return.min(), stock_return.max(), 100):
m.addConstr(portfolio_return == r, 'target')
m.optimize()
print(portfolio_risk.getValue())
#frontier.loc[sqrt(portfolio_risk.getValue())] = r
I got the error "Unable to retrieve attribute 'x' for the last line of code somehow when I try to create an efficient frontier. Thanks for any suggestions!

First, two important suggestions:
Always check the Status code to see the status of a call to Model.optimize()
For debugging, never disable logging, so comment out the line that sets the parameter OutputFlag=0
With logging, I get this output:
Barrier solved model in 0 iterations and 0.00 seconds
Model is infeasible or unbounded
Traceback (most recent call last):
File "so.py", line 56, in <module>
print(portfolio_risk.getValue())
File "src/gurobipy/quadexpr.pxi", line 404, in gurobipy.QuadExpr.getValue
File "src/gurobipy/var.pxi", line 125, in gurobipy.Var.__getattr__
File "src/gurobipy/var.pxi", line 153, in gurobipy.Var.getAttr
File "src/gurobipy/attrutil.pxi", line 100, in gurobipy.__getattr
AttributeError: Unable to retrieve attribute 'x'
So to resolve the underlying issue, follow the instructions in the Gurobi knowledgebase article: How do I resolve the error "Model is infeasible or unbounded".

euclidean distance calculation using Python and Dask

I'm attempting to identify elements in the euclidean distance matrix that fall under a certain threshold. I then take the positional arguments for this search and use them to compare elements in a second array (for sake of demonstration this array is the first eigenvector of PCA, but the sort is the most relevant part for my question). The application needs to be applicable for an unknown number of observations, but should run effectively on several million.
#
import numpy as np
from scipy.spatial.distance import cdist
threshold = 10
data = np.random.uniform((1, 2, 3), 5000)
searchValues = np.where(cdist(data, data) < threshold)
#
My problem is two fold.
Firstly the euclidean distance matrix quickly becomes too large for simply applying scipy.spatial.distance.cdist(). To solve this issue I apply the cdist function in batches over the dataset and implement the search iteratively.
#
cdist(data, data)
Traceback (most recent call last):
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-fb93ae543712>", line 1, in <module>
cdist(data, data)
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
dm = np.zeros((mA, mB), dtype=np.double)
MemoryError
#
The second problem is a runtime issue that results from constructing distance matrix iteratively. When I institute my iterative approach the runtime increases exponentially. This isn't unexpected due to the nature of the iterative approach.
#
import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit
threshold = 10
data = np.random.uniform(1, 100, (200000,40)) #Build random data
data = da.asarray(data)
it = round(data.shape[0]/10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
comparisons = itertools.combinations(dataArrays, 2)
start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
print(time)
#
Neither of these issues are unexpected due to the nature of the problem. To try and offset both problems I've tried using dask to implement both a large data framework in python, and insert parallelization in the batch process. However, this hasn't resulted in a significant improvement in the time calculation, and I have a pretty strict memory limitation with this iterative method in dask (requiring taking in batches of 1000 obs at a time.
from dask.diagnostics import ProgressBar
import dask.delayed
import dask.bag
#dask.delayed
def eucDist(comparison):
return da.asarray(cdist(comparison[0], comparison[1]))
#dask.delayed
def findValues(euclideanMatrix):
return np.where(euclideanMatrix < threshold)
start = timeit.default_timer()
searchvalues = []
test = []
for comparison in comparisons:
comp = dask.delayed(eucDist)(comparison)
test.append(comp)
look = []
with ProgressBar():
for element in test:
look.append(dask.delayed(findValues)(element).compute())
I'm hoping that I can parallelize the comparisons to increase my speed, but I'm not sure how to implement that in python. Any help with that, or any recommendations for how I can improve the initial comparison code would be appreciated.

You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x,y).

I believe that the dask-image package has some dask-enabled distance algorithms.
https://github.com/dask/dask-image

Using Scipy's signal.welch command on data: ValueError and a dimension mismatch?

I'm trying to get my first power spectral density graph plotted using actual data instead of something that's purely theoretical and generated within Python. I'm having problems getting anything to work, however. Code is attached below, followed by the error I get in my console after line 19.
Don't know if it makes a difference, but I'm transitioning to Python from mostly working in MATLAB. I am not counting on having access to a license forever, so I really want to learn how to start doing everything in Python. But it's hard.
Code:
import numpy as np
from scipy import signal
import scipy.io
import matplotlib.pyplot as plt
#import data from a .mat file using the loadmat command
mat = scipy.io.loadmat('Mic_Data_Sums.mat')
# 1 x 1 array, sampling frequency of 22050 Hz
fs = mat['Fs']
# Attempted fix: change data type to 8-point float?
# fs = fs.astype('f8')
# 13 x 1323000 array - 13 separate time series of data, 60 seconds each
data = mat['Mic_Data_Sums']
# Welch function - transpose 'data' and use the 2nd time series
f, Pxx_spec = signal.welch(data.T[1], fs, window = 'hanning', nperseg = fs,
noverlap = fs/2, scaling = 'spectrum')
Console:
/Users/******/anaconda/lib/python3.4/site-packages/scipy/signal/spectral.py:297: RuntimeWarning: divide by zero encountered in double_scalars
scale = 1.0 / win.sum()**2
Traceback (most recent call last):
File "plotPSDs.py", line 20, in <module>
noverlap = fs/2, scaling = 'spectrum')
File "/Users/******/anaconda/lib/python3.4/site-packages/scipy/signal/spectral.py", line 333, in welch
xft = fftpack.rfft(x_dt*win, nfft)
ValueError: operands could not be broadcast together with shapes (22050,) (0,22051)
Note how the ValueError tag gives me weird shape (dimension) results: I have no idea where the 22051 is coming from.
Edit: As a workaround solution, I commented out the line of fs = mat['Fs'] and simply replaced it with fs = 22050, which made the code execute successfully. However, the question still remains, why can't I simply reference the variable as it was stored in the .mat file?

[from the comments above] If you know fs is 1x1, try passing fs[0,0] to welch. The docstring for welch says fs should be a float, so it might behave unpredictably if you give it a two-dimensional array. – Warren Weckesser 22 hours ago
This worked well. The code I implemented is:
# 1 x 1 array, sampling frequency (22050 Hz)
fs = mat['Fs']
fs = fs[0,0]
then using the code from before,
f, Pxx_spec = signal.welch(data.T[1], fs, window = 'hanning', nperseg = fs,
noverlap = fs/2, scaling = 'spectrum')

How to call function for scipy.optimize.fmin_cg(func) in Python

I will simply explain the problem in short. This problem is exactly similar as shown in scipy.doc. The problem is on error occurance as float argument required, not numpy.ndarray
What I have:
Function: y = s*z^t
Variable length/dimensions
t - 1...m,
s - 1...m and 1...n. So, m is row number, n - col number.
z - 1...n.
y - this can be y1, y[2], y[3],..., y[m],
T - s[m,n] matrix
Like this:
y[1] = s[1][1]*z[1]^t[1]+s[1][2]*z[2]^t[1]+...s[1][n]*z[n]^t[1])
y[2] = s[2][1]*z[1]^t[2]+s[2][2]*z[2]^t[2]+...s[2][n]*z[n]^t[2])
...
y[m] = s[m][1]*z[1]^t[m]+s[m][2]*z[2]^t[2]+...s[m][n]*z[n]^t[m])
Problem: Error occured.
Optimization terminated successfully.
Traceback (most recent call last):
solution = optimize.fmin_cg(func, z, fprime=gradf, args=args)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 952, in fmin_cg
res = _minimize_cg(f, x0, args, fprime, callback=callback, **opts)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 1072, in _minimize_cg
print " Current function value: %f" % fval
TypeError: float argument required, not numpy.ndarray
Here is the code
import numpy as np
import scipy as sp
import scipy.optimize as optimize
def func(z, *args):
y,T,t = args[0]
return y - counter(T,z,t)
def counter(T, z, t):
rows,cols = np.shape(T)
res = np.zeros(rows)
for i,row_val in enumerate(T):
res[i] = np.dot(row_val, z**t[i])
return res
def gradf(z, *args):
y,T,t = args[0]
return np.dot(t,counter(T,z,t-1))
def main():
# Inputs
N = 30
M = 20
z0 = np.zeros(N) # initial guess
y = 30*np.random.random(M)
T = 10*np.random.random((M,N))
t = 5*np.random.random(M)
args = [y, T, t]
solution = optimize.fmin_cg(func, z0, fprime=gradf, args=args)
print 'solution: ', solution
if __name__ == '__main__':
main()
I also tried to find similar examples but could not find something very similar. Here is the code for your consideration. Thanks in advance.

The root of your problem is that fmin_cg expects the function to return a single scalar value for the misfit instead of an array.
Basically, you want something vaguely similar to:
def func(z, y, T, t):
return np.linalg.norm(y - counter(T,z,t))
I'm using np.linalg.norm here because there's no builtin function in numpy for the root-mean-square. The actual RMS would be norm(x) / sqrt(x.size), but for minimization the constant multiplier doesn't make any difference.
There are also other minor problems in your code (e.g. args[0] is going to be a single item. You want y, T, t = args or better yet, just func(z, y, T, t)). Your gradient function doesn't make any sense to me, but it's optional regardless. Also, there's no way the solution can produce reasonable values at the moment, as you're testing it against pure noise. I assume those are just meant to be placeholder values, though.
However, you have a larger problem. You're trying to minimize in 30-dimensional space. Most non-linear solvers aren't going to work well with that high of a dimensionality. It may work fine, but you're very likely to run into problems.
All that having been said, you may find it more intuitive to use the scipy.optimize.curve_fit interface rather than using the others, if you're okay with LM instead of CG (they're fairly similar methods).
One final thing: You're trying to solve for 30 model parameters with 20 observations. This is an underdetermined problem. This problem doesn't have a unique solution. You're going to need to apply some a-priori knowledge to get a reasonable answer.

Scipy minimize, fmin, leastsq type problems (setting array element with sequence), bad fit

I'm having trouble with the scipy.optimize.fmin and scipy.optimize.minimize functions. I've checked and confirmed that all the arguments passed to the function are of type numpy.array, as well as the return value of the error function. Also, the carreau function returns a scalar value.
The reason for some of the extra arguments, such as size, is this: I need to fit data with a given model (Carreau). The data is taken at different temperatures, which are corrected with a shift factor (which is also fitted by the model), I end up with several sets of data which should all be used to calculate the same 4 constants (parameters p).
I read that I can't pass the fmin function a list of arrays, so I had to concatenate all data into x_data_lin, keeping track of the different sets with the size parameter. t holds different test temperatures, while t_0 is a one-element array which holds the reference temperature.
I am positive (triple checked) that all the arguments passed to the function, as well as the result, are one-dimensional arrays. Here's the code aside from that:
import numpy as np
import scipy.optimize
from scipy.optimize import fmin as simplex
def err_func2(p, x, y, t, t_0, size):
result = array([])
temp = 0
for i in range(0, int(len(size)-1)):
for j in range(int(temp), int(temp+size[i])):
result = np.append(result, (carreau(p, x[j], t[i], t_0[0])-y[i]))
temp += size[i]
return result
p1 = simplex(err_func2, initial_guess,
args=(x_data_lin, y_data_lin, t_list, t_0, size), full_output=0)
Here's the error:
Traceback (most recent call last):
File "C:\Python27\Scripts\projects\Carreau - WLF\carreau_model_fit.py", line 146, in <module>
main()
File "C:\Python27\Scripts\projects\Carreau - WLF\carreau_model_fit.py", line 105, in main
args=(x_data_lin, y_data_lin, t_list, t_0, size), full_output=0)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 351, in fmin
res = _minimize_neldermead(func, x0, args, callback=callback, **opts)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 415, in _minimize_neldermead
fsim[0] = func(x0)
ValueError: setting an array element with a sequence.
It's worth noting that I got the leastsq function working while passing it lists of arrays. Unfortunately, it did a poor job of fitting the data. But, as it took me a lot of time and research to get to that point, I'll post the code as follows. If somebody is interested in seeing all of the code, I would gladly post it, if you can recommend me somewhere to upload a few files(as it includes another imported script and of course sample data):
##def error_function(p, x, y, t, t_0):
## result = array([])
## for index in range(len(x)):
## result = np.append(result,(carreau(p, x[index],
## t[index], t_0) - y[index]))
## return result
## p1, success = scipy.optimize.leastsq(error_function, initial_guess,
## args=(x_list, y_list, t_list, t_0),
## maxfev=10000)
:( I was going to post a picture of the graphed data with the leastsq fit, but I don't have the requisite 10 points.
Late Edit: I now have gotten optimize.curvefit and optimize.leastsq to work (which probably not-so-coincidentally give the same answer), but the curve is bad. I've been trying to figure out optimize.minimize, but it's been a bit of a headache. the simplex (fmin, Nelder_Mead, whatever you want to call it) will run, but produces a crazy answer nowhere close. I've never worked with nonlinear optimization problems before, and I don't really know what direction to head.
Here's the working curve_fit code:
def temp_shift(t_s, t, t_0):
""" This function calculates the a_t temperature shift factor for polymer
viscosity curves. Variable is the standard temperature, t_s
"""
C_1 = 8.86
C_2 = 101.6
return(np.exp(
(C_1*(t_0-t_s) / (C_2+(t_0-t_s))) - (C_1*(t-t_s) / (C_2 + (t-t_s)))
))
def pass_data(t, t_0):
def carreau_2(x, p0, p1, p2, p3):
visc_0 = p0
m = p1
n = p2
t_s = p3
a_T = temp_shift(p3, t, t_0)
return (visc_0 * a_T / (1 + m * x * a_T)**n)
return carreau_2
initial_guess = array([20000, 3, 0.94, -20])
p1, conv = scipy.optimize.curve_fit(pass_data(t_all, t_0), x_data_lin,
y_data_lin, initial_guess)
Here's some sample data:
x_data_lin = array([0.01998, 0.04304, 0.2004, 0.43160, 0.92870, 2.0000, 4.30900,
9.28500, 15.51954, 21.94936, 37.52960, 90.41786, 204.35230,
331.58495, 811.92250, 1694.55309, 3464.27648, 8826.65738,
14008.00242])
y_data_lin = array([13520.00000, 13740.00000, 12540.00000, 9384.00000, 5201,
3232.00000, 2094.00000, 1484.00000, 999.00000, 1162.05088
942.56946, 705.62489, 429.47341, 254.15136, 185.22916,
122.07113, 76.46324, 47.85064, 25.74315, 18.84875])
t_all = array([190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190,
190, 190, 190, 190, 190, 190, 190])
t_0 = 80
Here's a picture of the result of curve_fit (now that I have 10 points and can post!). Note there are 3 curves drawn because I used 3 sets of data to optimize the curve, at 3 different temperatures. Polymers have the property that the shear_rate - viscosity relationship stays the same, just shifted by a temperature factor a_T:
I'd really appreciate any suggestions about how to improve the fit, or how to define the function so that optimize.minimize works, and which method (Nelder-Mead, Powel, BFGS) might work.
Another edit to add: I got the Nelder-Mead (optimize.fmin, and the default of optimize.minimize) function to work - I'll include the revised error function below. Before, I simply summed the result array and returned it. This led to extremely negative values (obviously, since the function's goal is to minimize). Squaring the result before summing it solved that problem. Note that I also changed the function completely to take advantage of numpy's array broadcasting, as suggested by JaminSore (Thanks Jamin!)
def err_func2(p, x, y, t, t_0):
return ((carreau(p, x, t, t_0)-y)**2).sum()
Unfortunately, the Nelder-Mead function gives me the same result as leastsq and curve_fit. You can see in the graph above that it's not the optimal fit; in fact, at this point, Microsoft Excel's solver function is doing a better job on the data.
At least, I hope this thread can be useful for beginners to scipy.optimize in the future, since it's taken me quite awhile to discover all of this.

Unlike leastsq, fmin can only deal with error functions that return a scalar so if possible you have to rewrite your error function so that it returns a scalar. Here is a simple working example.
Import the necessary libraries
import numpy as np
from scipy.optimize import fmin
Define a helper function (you'll see later)
def prob(a, b):
return (1 + np.exp(b - a))**-1
Simulate some data
true_ = np.random.normal(size = 100) #parameters we're trying to recover
b = np.random.normal(size = 20)
exp_ = prob(true_[:, None], b) #expected
a_s, b_s = true_.shape[0], b.shape[0]
noise = np.random.uniform(size = (a_s, b_s))
response = (noise > (1 - exp_)).astype(int)
Define our error function (I'm using lambdas but this is not recommended in practice)
# sum of the squared residuals
err_func = lambda a : ((prob(a[:, None], b) - response) ** 2).sum()
result = fmin(err_func, np.zeros_like(true_)) #solve
If I remove the .sum() at the end of my error function definition, I get the same error.

OK, now I finally know the answer! First, the final piece, then a recap. The problem of the fit wasn't the fault of curve_fit, leastsq, Nelder_Mead, or Powell (the methods I've tried). It has to do with the relative weights of the errors. Since this data is on a log scale, the errors in the fit near the high y values are very costly, while errors near the low y values are insignificant. To correct this, I made the error relative by dividing by the y value of the data, as follows:
def err_func2(p, x, y, t, t_0):
return (((carreau(p, x, t, t_0)-y)/y)**2).sum()
Now, each relative error is squared, summed, then minimized, giving the following fit (using optimize.minimize with the Powell method, although it should be the same for the other methods as well.)
So now a recap of the answers found in this thread:
The easiest way (or at least for me, most fool-proof) to deal with curve fitting is to collect all the data into 1D numpy.arrays. Then, you can rely on numpy's array broadcasting to perform all operations. This means that arithmetic operations are treated the same way a vector dot-product would be. For example, array_1 = [a,b], array_2 = [c,d], then array_1 + array_2 = [a+c, b+d]. This works for addition, subtraction, multiplication, division, and powers: array+1array_2 = [ac, b**d].
For the optimize.leastsq function, you need to let the objective function return an array; i.e. return result where result is an array. For optimize.curve_fit, you also return an array. In this case, it's a bit more complicated to pass extra arguments (think other constants), but you can do it with a nested function, as I demonstrated above in the pass_data function.
For optimize.minimize, you need to return a scalar - that is, a single number. You can also return an array of answers, I think, but I avoided this by getting all the data into 1D arrays, as I mentioned earlier. To get this scalar, you can simply square and sum the result (like I have written in this post under err_func2) Squaring the data is very important, otherwise negative errors take over and drive the resulting scalar extremely negative.
Finally, as mentioned, when your data crosses several scales (105, 104, 10**3, etc), it may be necessary to normalize the errors. I did this by dividing each error by the y value.
So... I guess that's it? Finally?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Speed up sampling of kernel estimate - python

Related

portfolio optimization - unable to retrieve attribute for 'x'

euclidean distance calculation using Python and Dask

Using Scipy's signal.welch command on data: ValueError and a dimension mismatch?

How to call function for scipy.optimize.fmin_cg(func) in Python

Scipy minimize, fmin, leastsq type problems (setting array element with sequence), bad fit

Categories

Resources