scipy.optimize.curvefit() - array must not contain infs or NaNs - python

I am trying to fit some data to a curve in Python using scipy.optimize.curve_fit. I am running into the error ValueError: array must not contain infs or NaNs.
I don't believe either my x or y data contain infs or NaNs:
>>> x_array = np.asarray_chkfinite(x_array)
>>> y_array = np.asarray_chkfinite(y_array)
>>>
To give some idea of what my x_array and y_array look like at either end (x_array is counts and y_array is quantiles):
>>> type(x_array)
<type 'numpy.ndarray'>
>>> type(y_array)
<type 'numpy.ndarray'>
>>> x_array[:5]
array([0, 0, 0, 0, 0])
>>> x_array[-5:]
array([2919, 2965, 3154, 3218, 3461])
>>> y_array[:5]
array([ 0.9999582, 0.9999163, 0.9998745, 0.9998326, 0.9997908])
>>> y_array[-5:]
array([ 1.67399000e-04, 1.25549300e-04, 8.36995200e-05,
4.18497600e-05, -2.22044600e-16])
And my function:
>>> def func(x,alpha,beta,b):
... return ((x/1)**(-alpha) * ((x+1*b)/(1+1*b))**(alpha-beta))
...
Which I am executing with:
>>> popt, pcov = curve_fit(func, x_array, y_array)
resulting in the error stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 426, in curve_fit
res = leastsq(func, p0, args=args, full_output=1, **kw)
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 338, in leastsq
cov_x = inv(dot(transpose(R),R))
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 285, in inv
a1 = asarray_chkfinite(a)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
I'm guessing the error might not be with respect to my arrays, but rather an array created by scipy in an intermediate step? I've had a bit of a dig through the relevant scipy source
files, but things get hairy pretty quickly debugging the problem that way. Is there something obvious I'm doing wrong here? I've seen casually mentioned in other questions that sometimes certain initial parameter guesses (of which I currently don't have any explicit) might result in these kind of errors, but even if this is the case, it would be good to know a) why that is and b) how to avoid it.

Why it is failing
Not your input arrays are entailing nans or infs, but evaluation of your objective function at some X points and for some values of the parameters results in nans or infs: in other words, the array with values func(x,alpha,beta,b) for some x, alpha, beta and b is giving nans or infs over the optimization routine.
Scipy.optimize curve fitting function uses Levenberg-Marquardt algorithm. It is also called damped least square optimization. It is an iterative procedure, and a new estimate for the optimal function parameters is computed at each iteration. Also, at some point during optimization, algorithm is exploring some region of the parameters space where your function is not defined.
How to fix
1/Initial guess
Initial guess for parameters is decisive for the convergence. If initial guess is far from optimal solution, you are more likely to explore some regions where objective function is undefined. So, if you can have a better clue of what your optimal parameters are, and feed your algorithm with this initial guess, error while proceeding might be avoided.
2/Model
Also, you could modify your model, so that it is not returning nans. For those values of the parameters, params where original function func is not defined, you wish that objective function takes huge values, or in other words that func(params) is far from Y values to be fitted.
Also, at points where your objective function is not defined, you may return a big float, for instance AVG(Y)*10e5 with AVG the average (so that you make sure to be much bigger than average of Y values to be fitted).
Link
You could have a look at this post: Fitting data to an equation in python vs gnuplot

Your function has a negative power (x^-alpha) this is the same as (1/x)^(alpha). If x is ever 0 your function will return inf and your curve fit operation will break, I'm surprised a warning/error isn't thrown earlier informing you of a divide by 0.
BTW why are you multiplying and dividing by 1?

I was able to reproduce this error in python2.7 like so:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this produces the error:
Which always produces the Error:
/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py:58: RuntimeWarning: invalid value encountered in sqrt
return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
Traceback (most recent call last):
File "main.py", line 43, in <module>
ica()
File "main.py", line 18, in ica
ica.fit(X)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 523, in fit
self._fit(X, compute_sources=False)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 479, in _fit
compute_sources=compute_sources, return_n_iter=True)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 335, in fastica
W, n_iter = _ica_par(X1, **kwargs)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 108, in _ica_par
- g_wtx[:, np.newaxis] * W)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 55, in _sym_decorrelation
s, u = linalg.eigh(np.dot(W, W.T))
File "/usr/lib64/python2.7/site-packages/scipy/linalg/decomp.py", line 297, in eigh
a1 = asarray_chkfinite(a)
File "/usr/lib64/python2.7/site-packages/numpy/lib/function_base.py", line 613, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
Solution:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
#this is a column wise normalization function which flattens the
#two dimensional array from very large and very small numbers to
#reasonably sized numbers between roughly -1 and 1
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this works correctly.
Why does that normalization step fix the error?
I found the eureka moment here: sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs"
What I think is happening is that numpy is being fed gigantic numbers and very tiny numbers, and inside it's tiny brain it's creating NaN's and Inf's. So it's a bug in the sklearn. The work around is to flatten your input data to the algorithm so that there are no very large or very small numbers.
Bad sklearn! NO biscuit!

Related

What is rank list in tucker decomposition?

I am going to decompose a 4D tensor using tucker decomposition in python. I found a library, tensorly, to do this.
I only want to perform the decomposition on the first and second dimensions. To perform tucker decomposition on some modes (not all modes) using tensorly I have to use partial_tucker command. This is my code:
F = 256
D = 96
h = 5
w = 6
ranks = [89, 48]
modes = [0, 1]
tensor = tl.tensor((np.arange(F*D*h*w).reshape((F, D, h, w))).astype(np.float64))
core, factors = partial_tucker(tensor, modes=modes, rank=ranks)
This code works well, but when I am trying to change the rank list, for example:
ranks = [3,4]
I get an error as follows:
Traceback (most recent call last):
File "D:\PhD_Thessaloniki\Codes\LRF_Convolutional\tucker-decomposition.py", line 49, in <module>
core, factors = partial_tucker(tensor, modes=modes, rank=ranks)
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\tensorly\decomposition\_tucker.py", line 109, in partial_tucker
eigenvecs, _, _ = svd_fun(unfold(core_approximation, mode), n_eigenvecs=rank[index], random_state=random_state)
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\tensorly\backend\core.py", line 913, in partial_svd
S, V = scipy.sparse.linalg.eigsh(
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py", line 1689, in eigsh
params.iterate()
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py", line 571, in iterate
raise ArpackError(self.info, infodict=self.iterate_infodict)
scipy.sparse.linalg._eigen.arpack.arpack.ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.
I don't know if there is a constraint to define the rank in tucker decomposition or not, but when I am trying to perform decomposition on only one dimension, for example:
ranks = [3]
modes = [0]
or
ranks = [4]
modes = [1]
works well again.
I want to know:
Is this an algorithmic or code (tensorly) problem (constraint)?
What is this problem?
What rank lists are valid?
Thanks
Tucker relies on higher-order PCA. The error you are seeing is is in the sparse SVD, applied to an unfolding of the main tensor.
You can try a different SVD function (svd parameter in partial_tucker), you can see the available option using tensorly.tenalg.svd.SVD_FUNS.
You also might want to try a tensor with random elements, using tensorly.random.random_tensor or tensorly.randn.

Python--calculate normalized probability of a value given a list of samples

So, as the title says, I'm trying to calculate the probability of a value given a list of samples, preferably normalized so the probability is 0<p<1. I found this answer on the topic from about 6 years ago, which seemed promising. To test it, I implemented the example used in the first reply (edited for brevity):
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
10 + np.random.randn(30, 1)))
x = np.linspace(0, 16, 1000)[:, np.newaxis]
# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)
# Get probability for range of values
start = 5 # Start of the range
end = 6 # End of the range
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
However, this approach throws the following error:
Traceback (most recent call last):
File "prob test.py", line 44, in <module>
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 340, in quad
retval = _quad(func, a, b, args, full_output, epsabs, epsrel, limit,
File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 448, in _quad
return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
File "prob test.py", line 44, in <lambda>
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
File "/usr/lib/python3/dist-packages/sklearn/neighbors/_kde.py", line 190, in score_samples
X = check_array(X, order='C', dtype=DTYPE)
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 545, in check_array
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=5.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I'm not sure how to reshape the distribution when its already inside the lambda function, and, in any case, I'm guessing this is happening because Scikit-Learn has been updated in the 6 years since this answer was written. What's the best way to work around this issue to get the probability value?
Thanks!
As said in the library:
score_samples(X): X array-like of shape (n_samples, n_features)
Therefore, you should pass an array-like and not a scalar:
probability = quad(lambda x: np.exp(kd.score_samples(np.array([[x]]))), start, end)
or:
probability = quad(lambda x: np.exp(kd.score_samples(np.array([x]).reshape(-1,1))), start, end)

ValueError: A value in x_new is below the interpolation range

This is a scikit-learn error that I get when I do
my_estimator = LassoLarsCV(fit_intercept=False, normalize=False, positive=True, max_n_alphas=1e5)
Note that if I decrease max_n_alphas from 1e5 down to 1e4 I do not get this error any more.
Anyone has an idea on what's going on?
The error happens when I call
my_estimator.fit(x, y)
I have 40k data points in 40 dimensions.
The full stack trace looks like this
File "/usr/lib64/python2.7/site-packages/sklearn/linear_model/least_angle.py", line 1113, in fit
axis=0)(all_alphas)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
y = self._evaluate(x)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/interpolate.py", line 498, in _evaluate
out_of_bounds = self._check_bounds(x_new)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/interpolate.py", line 525, in _check_bounds
raise ValueError("A value in x_new is below the interpolation "
ValueError: A value in x_new is below the interpolation range.
There must be something particular to your data. LassoLarsCV() seems to be working correctly with this synthetic example of fairly well-behaved data:
import numpy
import sklearn.linear_model
# create 40000 x 40 sample data from linear model with a bit of noise
npoints = 40000
ndims = 40
numpy.random.seed(1)
X = numpy.random.random((npoints, ndims))
w = numpy.random.random(ndims)
y = X.dot(w) + numpy.random.random(npoints) * 0.1
clf = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False, max_n_alphas=1e6)
clf.fit(X, y)
# coefficients are almost exactly recovered, this prints 0.00377
print max(abs( clf.coef_ - w ))
# alphas actually used are 41 or ndims+1
print clf.alphas_.shape
This is in sklearn 0.16, I don't have positive=True option.
I'm not sure why you would want to use a very large max_n_alphas anyway. While I don't know why 1e+4 works and 1e+5 doesn't in your case, I suspect the paths you get from max_n_alphas=ndims+1 and max_n_alphas=1e+4 or whatever would be identical for well behaved data. Also the optimal alpha that is estimated by cross-validation in clf.alpha_ is going to be identical. Check out Lasso path using LARS example for what alpha is trying to do.
Also, from the LassoLars documentation
alphas_ array, shape (n_alphas + 1,)
Maximum of covariances (in
absolute value) at each iteration. n_alphas is either max_iter,
n_features, or the number of nodes in the path with correlation
greater than alpha, whichever is smaller.
so it makes sense that we end with alphas_ of size ndims+1 (ie n_features+1) above.
P.S. Tested with sklearn 0.17.1 and positive=True as well, also tested with some positive and negative coefficients, same result: alphas_ is ndims+1 or less.

Resistivity non linear fit in Python

I'm trying to fit Einstein approximation of resistivity in a solid in a set of experimental data.
I have resistivity vs temperature (from 200 to 4 K)
import xlrd as xd
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
import scipy as sp
from scipy.optimize import curve_fit
#retrieve data from file
data = pl.loadtxt('salita.txt')
Temp = data[:, 1]
Res = data[:, 2]
#define fitting function
def einstein_func( T, ro0, AE, TE):
nl = np.sinh(TE/(2*T))
return ro0 + AE*nl*T
p0 = sp.array([1 , 1, 1])
coeffs, cov = curve_fit(einstein_func, Temp, Res, p0)
But I get these warnings
crio.py:14: RuntimeWarning: divide by zero encountered in divide
nl = np.sinh(TE/(2*T))
crio.py:14: RuntimeWarning: overflow encountered in sinh
nl = np.sinh(TE/(2*T))
crio.py:15: RuntimeWarning: divide by zero encountered in divide
return ro0 + AE*np.sinh(TE/(2*T))*T
crio.py:15: RuntimeWarning: overflow encountered in sinh
return ro0 + AE*np.sinh(TE/(2*T))*T
crio.py:15: RuntimeWarning: invalid value encountered in multiply
return ro0 + AE*np.sinh(TE/(2*T))*T
Traceback (most recent call last):
File "crio.py", line 19, in <module>
coeffs, cov = curve_fit(einstein_func, Temp, Res, p0)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/optimize/minpack.py", line 511, in curve_fit
raise RuntimeError(msg)
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
I don't understand why it keeps saying that there is a divide by zero in sinh, since I have strictly positive values. Varying my starting guess has no effect on it.
EDIT: My dataset is organized like this:
4.39531E+0 1.16083E-7
4.39555E+0 -5.92258E-8
4.39554E+0 -3.79045E-8
4.39525E+0 -2.13213E-8
4.39619E+0 -4.02736E-8
4.43130E+0 -1.42142E-8
4.45900E+0 -2.60594E-8
4.46129E+0 -9.00232E-8
4.46181E+0 1.42142E-7
4.46195E+0 -2.13213E-8
4.46225E+0 4.26426E-8
4.46864E+0 -2.60594E-8
4.47628E+0 1.37404E-7
4.47747E+0 9.47612E-9
4.48008E+0 2.84284E-8
4.48795E+0 1.35035E-7
4.49804E+0 1.39773E-7
4.51151E+0 -1.75308E-7
4.54916E+0 -1.63463E-7
4.59176E+0 -2.36902E-9
where the first column is temperature and the second one is resistivity (negative values are due to noise in trial current since the sample is a PbIn alloy which becomes superconductive at temperature lower than 6.7-6.9K, here we are at 4.5K).
Argument I'm providing to sinh are Numpy arrays, with a linear function ro0 + AE*T my code works. I've tried with scipy.optimize.minimize but the result is the same.
Now I see that I have almost nine hundred values in my file, may that be the problem?
I have edited my dataset removing some lines and now the only warning showing is
RuntimeWarning: overflow encountered in sinh
How can I workaround it?
Here are a couple of observations that could help:
You could try the least-squares fit directly with leastsq, providing the Jacobian, which might help tame it.
I'm guessing you don't want the superconducting temperatures in your data set at all if you're fitting to an Einstein model (do you have a source for this eqn, btw?)
Do make sure your initial guesses are as good as they could possibly be (ro0=AE=TE=1 probably won't cut it).
Plot your data and make sure there aren't any weird artefacts
You seem to be indexing your data array in the wrong way in your code example: if the data is structured as you say, you want:
Temp = data[:, 0]
Res = data[:, 1]
(Python indexes start at 0).

Scipy minimize, fmin, leastsq type problems (setting array element with sequence), bad fit

I'm having trouble with the scipy.optimize.fmin and scipy.optimize.minimize functions. I've checked and confirmed that all the arguments passed to the function are of type numpy.array, as well as the return value of the error function. Also, the carreau function returns a scalar value.
The reason for some of the extra arguments, such as size, is this: I need to fit data with a given model (Carreau). The data is taken at different temperatures, which are corrected with a shift factor (which is also fitted by the model), I end up with several sets of data which should all be used to calculate the same 4 constants (parameters p).
I read that I can't pass the fmin function a list of arrays, so I had to concatenate all data into x_data_lin, keeping track of the different sets with the size parameter. t holds different test temperatures, while t_0 is a one-element array which holds the reference temperature.
I am positive (triple checked) that all the arguments passed to the function, as well as the result, are one-dimensional arrays. Here's the code aside from that:
import numpy as np
import scipy.optimize
from scipy.optimize import fmin as simplex
def err_func2(p, x, y, t, t_0, size):
result = array([])
temp = 0
for i in range(0, int(len(size)-1)):
for j in range(int(temp), int(temp+size[i])):
result = np.append(result, (carreau(p, x[j], t[i], t_0[0])-y[i]))
temp += size[i]
return result
p1 = simplex(err_func2, initial_guess,
args=(x_data_lin, y_data_lin, t_list, t_0, size), full_output=0)
Here's the error:
Traceback (most recent call last):
File "C:\Python27\Scripts\projects\Carreau - WLF\carreau_model_fit.py", line 146, in <module>
main()
File "C:\Python27\Scripts\projects\Carreau - WLF\carreau_model_fit.py", line 105, in main
args=(x_data_lin, y_data_lin, t_list, t_0, size), full_output=0)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 351, in fmin
res = _minimize_neldermead(func, x0, args, callback=callback, **opts)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 415, in _minimize_neldermead
fsim[0] = func(x0)
ValueError: setting an array element with a sequence.
It's worth noting that I got the leastsq function working while passing it lists of arrays. Unfortunately, it did a poor job of fitting the data. But, as it took me a lot of time and research to get to that point, I'll post the code as follows. If somebody is interested in seeing all of the code, I would gladly post it, if you can recommend me somewhere to upload a few files(as it includes another imported script and of course sample data):
##def error_function(p, x, y, t, t_0):
## result = array([])
## for index in range(len(x)):
## result = np.append(result,(carreau(p, x[index],
## t[index], t_0) - y[index]))
## return result
## p1, success = scipy.optimize.leastsq(error_function, initial_guess,
## args=(x_list, y_list, t_list, t_0),
## maxfev=10000)
:( I was going to post a picture of the graphed data with the leastsq fit, but I don't have the requisite 10 points.
Late Edit: I now have gotten optimize.curvefit and optimize.leastsq to work (which probably not-so-coincidentally give the same answer), but the curve is bad. I've been trying to figure out optimize.minimize, but it's been a bit of a headache. the simplex (fmin, Nelder_Mead, whatever you want to call it) will run, but produces a crazy answer nowhere close. I've never worked with nonlinear optimization problems before, and I don't really know what direction to head.
Here's the working curve_fit code:
def temp_shift(t_s, t, t_0):
""" This function calculates the a_t temperature shift factor for polymer
viscosity curves. Variable is the standard temperature, t_s
"""
C_1 = 8.86
C_2 = 101.6
return(np.exp(
(C_1*(t_0-t_s) / (C_2+(t_0-t_s))) - (C_1*(t-t_s) / (C_2 + (t-t_s)))
))
def pass_data(t, t_0):
def carreau_2(x, p0, p1, p2, p3):
visc_0 = p0
m = p1
n = p2
t_s = p3
a_T = temp_shift(p3, t, t_0)
return (visc_0 * a_T / (1 + m * x * a_T)**n)
return carreau_2
initial_guess = array([20000, 3, 0.94, -20])
p1, conv = scipy.optimize.curve_fit(pass_data(t_all, t_0), x_data_lin,
y_data_lin, initial_guess)
Here's some sample data:
x_data_lin = array([0.01998, 0.04304, 0.2004, 0.43160, 0.92870, 2.0000, 4.30900,
9.28500, 15.51954, 21.94936, 37.52960, 90.41786, 204.35230,
331.58495, 811.92250, 1694.55309, 3464.27648, 8826.65738,
14008.00242])
y_data_lin = array([13520.00000, 13740.00000, 12540.00000, 9384.00000, 5201,
3232.00000, 2094.00000, 1484.00000, 999.00000, 1162.05088
942.56946, 705.62489, 429.47341, 254.15136, 185.22916,
122.07113, 76.46324, 47.85064, 25.74315, 18.84875])
t_all = array([190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190,
190, 190, 190, 190, 190, 190, 190])
t_0 = 80
Here's a picture of the result of curve_fit (now that I have 10 points and can post!). Note there are 3 curves drawn because I used 3 sets of data to optimize the curve, at 3 different temperatures. Polymers have the property that the shear_rate - viscosity relationship stays the same, just shifted by a temperature factor a_T:
I'd really appreciate any suggestions about how to improve the fit, or how to define the function so that optimize.minimize works, and which method (Nelder-Mead, Powel, BFGS) might work.
Another edit to add: I got the Nelder-Mead (optimize.fmin, and the default of optimize.minimize) function to work - I'll include the revised error function below. Before, I simply summed the result array and returned it. This led to extremely negative values (obviously, since the function's goal is to minimize). Squaring the result before summing it solved that problem. Note that I also changed the function completely to take advantage of numpy's array broadcasting, as suggested by JaminSore (Thanks Jamin!)
def err_func2(p, x, y, t, t_0):
return ((carreau(p, x, t, t_0)-y)**2).sum()
Unfortunately, the Nelder-Mead function gives me the same result as leastsq and curve_fit. You can see in the graph above that it's not the optimal fit; in fact, at this point, Microsoft Excel's solver function is doing a better job on the data.
At least, I hope this thread can be useful for beginners to scipy.optimize in the future, since it's taken me quite awhile to discover all of this.
Unlike leastsq, fmin can only deal with error functions that return a scalar so if possible you have to rewrite your error function so that it returns a scalar. Here is a simple working example.
Import the necessary libraries
import numpy as np
from scipy.optimize import fmin
Define a helper function (you'll see later)
def prob(a, b):
return (1 + np.exp(b - a))**-1
Simulate some data
true_ = np.random.normal(size = 100) #parameters we're trying to recover
b = np.random.normal(size = 20)
exp_ = prob(true_[:, None], b) #expected
a_s, b_s = true_.shape[0], b.shape[0]
noise = np.random.uniform(size = (a_s, b_s))
response = (noise > (1 - exp_)).astype(int)
Define our error function (I'm using lambdas but this is not recommended in practice)
# sum of the squared residuals
err_func = lambda a : ((prob(a[:, None], b) - response) ** 2).sum()
result = fmin(err_func, np.zeros_like(true_)) #solve
If I remove the .sum() at the end of my error function definition, I get the same error.
OK, now I finally know the answer! First, the final piece, then a recap. The problem of the fit wasn't the fault of curve_fit, leastsq, Nelder_Mead, or Powell (the methods I've tried). It has to do with the relative weights of the errors. Since this data is on a log scale, the errors in the fit near the high y values are very costly, while errors near the low y values are insignificant. To correct this, I made the error relative by dividing by the y value of the data, as follows:
def err_func2(p, x, y, t, t_0):
return (((carreau(p, x, t, t_0)-y)/y)**2).sum()
Now, each relative error is squared, summed, then minimized, giving the following fit (using optimize.minimize with the Powell method, although it should be the same for the other methods as well.)
So now a recap of the answers found in this thread:
The easiest way (or at least for me, most fool-proof) to deal with curve fitting is to collect all the data into 1D numpy.arrays. Then, you can rely on numpy's array broadcasting to perform all operations. This means that arithmetic operations are treated the same way a vector dot-product would be. For example, array_1 = [a,b], array_2 = [c,d], then array_1 + array_2 = [a+c, b+d]. This works for addition, subtraction, multiplication, division, and powers: array+1array_2 = [ac, b**d].
For the optimize.leastsq function, you need to let the objective function return an array; i.e. return result where result is an array. For optimize.curve_fit, you also return an array. In this case, it's a bit more complicated to pass extra arguments (think other constants), but you can do it with a nested function, as I demonstrated above in the pass_data function.
For optimize.minimize, you need to return a scalar - that is, a single number. You can also return an array of answers, I think, but I avoided this by getting all the data into 1D arrays, as I mentioned earlier. To get this scalar, you can simply square and sum the result (like I have written in this post under err_func2) Squaring the data is very important, otherwise negative errors take over and drive the resulting scalar extremely negative.
Finally, as mentioned, when your data crosses several scales (105, 104, 10**3, etc), it may be necessary to normalize the errors. I did this by dividing each error by the y value.
So... I guess that's it? Finally?

Categories

Resources