I want to fit the function f(x) = b + a / x to my data set. For that, scipy's least_squares from optimize seemed suitable.
My code is as follows:
x = np.asarray(range(20,401,20))
y holds the distances that I calculated; it is an array of length 20 (random numbers here, just for the example):
y = np.random.rand(20)
Initial guesses of the params a and b:
params = np.array([1,1])
Function to minimize
def funcinv(x):
    return params[0]/x + params[1]
res = least_squares(funcinv, params, args=(x, y))
Error given:
return np.atleast_1d(fun(x, *args, **kwargs))
TypeError: funcinv() takes 1 positional argument but 3 were given
How can I fit my data?
To add a little clarity: there are two related problems:
Minimizing a function
Fitting a model to data
To fit a model to observed data means finding the model parameters that minimize some measure of error between the model's predictions and the observed data.
The least_squares method simply minimizes the following function with respect to x (x can be a vector):
F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
(rho is a loss function; the default is rho(x) = x, so you can ignore it for now)
least_squares(func, x0) expects that a call to func(x) will return a vector [a1, a2, a3, ...] for which the sum of squares will be computed: S = 0.5 * (a1^2 + a2^2 + a3^2 + ...).
least_squares will tweak x0 to minimize S.
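As a minimal sketch of plain minimization (the residual vector here is made up purely to illustrate the call; no data is involved yet):

import numpy as np
from scipy.optimize import least_squares

# f(x) returns a vector of residuals [a1, a2]; least_squares adjusts the
# vector x to minimize S = 0.5 * (a1**2 + a2**2)
def f(x):
    return np.array([x[0] - 1.0, 10.0 * (x[1] - x[0]**2)])

res = least_squares(f, np.array([2.0, 2.0]))  # second argument is the initial guess x0
print(res.x)  # approaches [1., 1.], where both residuals vanish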
Thus, in order to use it to fit a model to data, one must construct a function that returns the error between the model and the actual data (the residuals) and then minimize that residuals function. In your case you can write it as follows:
import numpy as np
from scipy.optimize import least_squares
x = np.asarray(range(20,401,20))
y = np.random.rand(20)
params = np.array([1,1])
def funcinv(x, a, b):
    return b + a/x

def residuals(params, x, data):
    # evaluates the function for a given vector of params [a, b]
    # and returns the residuals: (observed_data - model_data)
    a, b = params
    func_eval = funcinv(x, a, b)
    return (data - func_eval)
res = least_squares(residuals, params, args=(x, y))
This gives a result:
print(res)
...
message: '`gtol` termination condition is satisfied.'
nfev: 4
njev: 4
optimality: 5.6774618339971994e-10
status: 1
success: True
x: array([ 6.89518618, 0.37118815])
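The fitted parameters can then be read off res.x, in the same order as in the params vector passed to residuals:

a_fit, b_fit = res.x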
However, since the residuals function is pretty much the same every time (res = observed_data - model_data), there is a shortcut in scipy.optimize called curve_fit: curve_fit(func, xdata, ydata, p0). curve_fit builds the residuals function automatically, so you can simply write:
import numpy as np
from scipy.optimize import curve_fit
x = np.asarray(range(20,401,20))
y = np.random.rand(20)
params = np.array([1,1])
def funcinv(x, a, b):
    return b + a/x
res = curve_fit(funcinv, x, y, params)
print(res) # ... array([ 6.89518618, 0.37118815]), ...
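Note that curve_fit actually returns a pair (popt, pcov); the square roots of the diagonal of the covariance matrix give approximate one-sigma uncertainties for the fitted parameters:

popt, pcov = curve_fit(funcinv, x, y, p0=params)
perr = np.sqrt(np.diag(pcov))  # approximate standard errors of a and b
print(popt, perr)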
Related
Using scipy.odr (link), I would like to perform a fit with an implicit model function. For simplicity, let us assume a linear model, namely
y = a*x + b, (1)
and we are interested in its implicit form, i.e.
a*x + b - y = 0. (2)
In the following I show my code:
import scipy.odr
import numpy as np
import matplotlib.pyplot as plt
a = 1
b = 2
x = np.linspace(0, 10, 50)
y = a * x + b + np.random.normal(scale=0.1, size=len(x))
# Define function for scipy.odr
def func(p, var):
    x, y = var
    a, b = p
    return a * x + b - y
# Fit the data using scipy.odr
Model = scipy.odr.Model(func, implicit=True)
Data = scipy.odr.Data(np.array([x,y]), 0)
Odr = scipy.odr.ODR(Data, Model, [a,b], maxit=10000)
From the last line of the code I get this error:
in _check(self)
848 print(res.shape)
849 print(fcn_perms)
--> 850 raise OdrError("fcn does not output %s-shaped array" % y_s)
851
852 if self.model.fjacd is not None:
OdrError: fcn does not output [0, 50]-shaped array
I don't understand what I should modify in the code. There is surely something I'm not getting, even though I carefully read the documentation.
EDIT:
I misunderstood the meaning of the y parameter in scipy.odr.Data. I thought it was the right-hand side of Eq. (2), but instead it is the dimensionality of the response. This is the correct line of code:
Data = scipy.odr.Data(np.array([x,y]), 1)
Now the code properly works.
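For completeness, here is a minimal sketch of the corrected fit run end to end (the names follow the snippet above; output.beta holds the fitted a and b):

import numpy as np
import scipy.odr

a, b = 1, 2
x = np.linspace(0, 10, 50)
y = a * x + b + np.random.normal(scale=0.1, size=len(x))

def func(p, var):
    # implicit model: a*x + b - y = 0
    x, y = var
    a, b = p
    return a * x + b - y

model = scipy.odr.Model(func, implicit=True)
data = scipy.odr.Data(np.array([x, y]), 1)  # 1 = dimensionality of the response
odr = scipy.odr.ODR(data, model, beta0=[a, b], maxit=10000)
output = odr.run()
print(output.beta)  # fitted [a, b]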
I am a newbie to scipy.optimize. I have the function func below, with x and y values given as lists, and I need to estimate a, b, and c. I could use curve_fit to get those estimates; however, I want to explore the possibilities of using least_squares.
When I run the following code, I get the error below. It'd be great if anyone could point me in the right direction.
import numpy as np
from scipy.optimize import curve_fit
from scipy.optimize import least_squares
np.random.seed(0)
x = np.random.randint(0, 100, 100) # sample dataset for independent variables
y = np.random.randint(0, 100, 100) # sample dataset for dependent variables
def func(x, a, b, c):
    return a*x**2 + b*x + c

def result(list_x, list_y):
    popt = curve_fit(func, list_x, list_y)
sol = least_squares(result,x, args=(y,),method='lm',jac='2-point',max_nfev=2000)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be
safely coerced to any supported types according to the casting rule ''safe''
The following code uses the least_squares() routine for optimization. The most important change in comparison to your code is ensuring that func() returns a vector of residuals. I also compared the solution to the linear algebra result to ensure correctness.
import numpy as np
from scipy.optimize import curve_fit
from scipy.optimize import least_squares
np.random.seed(0)
x = np.random.randint(0, 100, 100) # sample dataset for independent variables
y = np.random.randint(0, 100, 100) # sample dataset for dependent variables
def func(theta, x, y):
    # Return residual = fit - observed
    return (theta[0]*x**2 + theta[1]*x + theta[2]) - y
# Initial parameter guess
theta0 = np.array([0.5, -0.1, 0.3])
# Compute solution providing initial guess theta0, x input, and y input
sol = least_squares(func, theta0, args=(x,y))
print(sol.x)
#------------------- OPTIONAL -------------------#
# Compare to linear algebra solution
temp = x.reshape((100,1))
X = np.hstack( (temp**2, temp, np.ones((100,1))) )
OLS = np.linalg.lstsq(X, y.reshape((100,1)), rcond=None)
print(OLS[0])
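Since the model is linear in the parameters, the two solutions should coincide; a quick consistency check (assuming both converge) is:

print(np.allclose(sol.x, OLS[0].ravel(), rtol=1e-4))  # expected: True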
To use least_squares you need a residual function, not curve_fit. Also, least_squares requires an initial guess for the parameters that you are fitting (i.e. a, b, c). In your case, if you want to use least_squares, you can write something like the following (I just used random values for the guess):
import numpy as np
from scipy.optimize import least_squares
np.random.seed(0)
x = np.random.randint(0, 100, 100) # sample dataset for independent variables
y = np.random.randint(0, 100, 100) # sample dataset for dependent variables
def func(x, a, b, c):
    return a*x**2 + b*x + c

def residual(p, x, y):
    return y - func(x, *p)
guess = np.random.rand(3)
sol = least_squares(residual, guess, args=(x, y), method='lm', jac='2-point', max_nfev=2000)
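The fitted coefficients then sit in sol.x, ordered as in func's signature; as a sanity check (assuming both solvers converge), fitting the same model with curve_fit should give essentially the same values:

from scipy.optimize import curve_fit

print(sol.x)                               # fitted a, b, c from least_squares
popt, _ = curve_fit(func, x, y, p0=guess)  # same model fitted via curve_fit
print(popt)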
I'm trying to minimize the MSE between two functions subject to bounds. curve_fit does this very well, but I want to stop the computation as soon as the MSE between the two functions drops below 0.1.
Here is a simple example:
import numpy as np
from scipy import optimize, integrate
def sir_model(y, x, beta, gamma):
    sus = -beta * y[0] * y[1] / N
    rec = gamma * y[1]
    inf = -(sus + rec)
    return sus, inf, rec

def fit_odeint(x, beta, gamma):
    return integrate.odeint(sir_model, (sus0, inf0, rec0), x, args=(beta, gamma))[:,1]
population = float(1000)
xdata = np.arange(0,335,dtype = float)
upper_bounds = np.array([1,0.7])
N = population
inf0 = 10
sus0 = N - inf0
rec0 = 0.0
#curve to approximate
ydata = fit_odeint(xdata, beta = 0.258, gamma = 0.612)
popt, pcov = optimize.curve_fit(fit_odeint, xdata, ydata, bounds=(0, upper_bounds))
The problem is that my real problem is harder than this example, so I want to stop curve_fit at a fixed tolerance (MSE = 0.1). I tried ftol, but it doesn't seem to work.
If I understand you correctly, you want to terminate the optimization as soon as the objective of the underlying least-squares optimization problem is <= 0.1. Unfortunately, neither curve_fit nor least_squares support callbacks or tolerances for the objective value. However, scipy.optimize.minimize does. So let's use it.
For this purpose we have to formulate your curve-fitting problem as a minimization problem:
min ||ydata - fit_odeint(xdata, *coeffs)||**2
s.t. lb <= coeffs <= ub
Then, we solve the problem via minimize and use a callback that terminates the algorithm as soon as the objective function value is <= 0.1:
from scipy.optimize import minimize
from numpy.linalg import norm
# the objective function
def obj(coeffs):
    return norm(ydata - fit_odeint(xdata, *coeffs))**2

# bounds
bnds = [(0, 1), (0, 0.7)]

# initial point
x0 = np.zeros(2)

# xk is the current parameter vector and state is an OptimizeResult object
def my_callback(xk, state):
    if state.fun <= 0.1:
        return True
# call the solver (res.x contains your coefficients)
res = minimize(obj, x0=x0, bounds=bnds, method="trust-constr", callback=my_callback)
This gives me:
barrier_parameter: 0.1
barrier_tolerance: 0.1
cg_niter: 3
cg_stop_cond: 4
constr: [array([0.08205584, 0.44233162])]
constr_nfev: [0]
constr_nhev: [0]
constr_njev: [0]
constr_penalty: 1635226.9491785716
constr_violation: 0.0
execution_time: 0.09222197532653809
fun: 0.007733264340965375
grad: array([-3.99185467, 4.04015523])
jac: [<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>]
lagrangian_grad: array([-0.03385145, 0.19847042])
message: '`callback` function requested termination.'
method: 'tr_interior_point'
nfev: 12
nhev: 0
nit: 4
niter: 4
njev: 4
optimality: 0.1984704226037759
status: 3
success: False
tr_radius: 7.0
v: [array([ 3.95800322, -3.8416848 ])]
x: array([0.08205584, 0.44233162])
Note that this callback signature only works for the 'trust-constr' method; for the other methods the callback only receives xk, so you'd need to calculate the objective function value on your own inside the callback.
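For example, with a method such as 'L-BFGS-B' the callback only receives xk, so a sketch would recompute the objective itself. This assumes SciPy 1.11 or newer, where a callback is documented to be able to raise StopIteration to terminate the solver; on older versions you would have to raise and catch your own exception instead:

from scipy.optimize import minimize

# assumes obj, bnds and x0 from the snippet above
def my_callback_lbfgsb(xk):
    if obj(xk) <= 0.1:        # recompute the objective ourselves
        raise StopIteration

res = minimize(obj, x0=x0, bounds=bnds, method="L-BFGS-B",
               callback=my_callback_lbfgsb)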
I'm trying to perform many iterations of SciPy's curve_fit at once in order to avoid loops and thereby increase speed.
This is very similar to this problem, which was solved. However, the fact that the functions are piecewise (discontinuous) means that solution isn't applicable here.
Consider this example:
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
N=20
X=np.logspace(-1,1,N)
Y = np.zeros((4, N))
for i in range(0, 4):
    b = i + 1
    a = b
    print(a, b)
    Y[i] = (X/b)**(-a) #+ 0.01 * rng.randn(6)
    Y[i, X>b] = 1
This yields arrays that are discontinuous at X == b. I can retrieve the original values of a and b by using curve_fit iteratively:
def plaw(r, a, b):
    """ Theoretical power law for the shape of the normalized conditional density """
    import numpy as np
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

coeffs = []
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(plaw, X, Y[ix])
    coeffs.append(c0)
But this process can be very slow, depending on the size of X and Y and the number of loop iterations, so I'm trying to speed things up by getting coeffs without the need for a loop. So far I haven't had any luck.
Things that might be important:
X and Y only contain positive values
a and b are always positive
Although the data to fit in this example is smooth (for the sake of simplicity), the real data has noise
EDIT
This is as far as I've gotten:
y=np.ma.masked_where(Y<1.01, Y)
lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m,c=np.linalg.lstsq(A, lY.T)[0]
print('a=',-m)
print('b=',np.exp(-c/m))
But even without any noise the output is:
a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]
Which is way worse than I was hoping to get.
Here are three approaches to speeding this up. You gave no desired speed-up or accuracy, or even vector sizes, so buyer beware.
TL;DR
Timings:
len            1      2      3      4
1000       0.045  0.033  0.025  0.022
10000      0.290  0.097  0.029  0.023
100000     3.429  0.767  0.083  0.030
1000000                  0.546  0.046
1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)
Pre-estimate with Subset (Method 2):
A decent speedup can be achieved by simply running the curve_fit twice, where the first time uses a short subset of the data to get a quick estimate. That estimate is then used to seed a curve_fit with the entire dataset.
x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
return curve_fit(power_law, x, y, p0=c0)[0]
M Newville linear log-log estimate (Method 3):
Using the log estimate proposed by M Newville is also considerably faster. Since the OP was concerned about the initial estimate method proposed by Newville, this method uses curve_fit on a subset of the data to provide the estimate of the break point in the curve.
x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
index_max = np.where(x > c0[1])[0][0]
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
m, c = -result[0], np.exp(-result[1] / result[0])
return (m, c), result
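The mapping in the last two lines follows from taking logs of the power law y = (x/b)**(-a): log y = -a*log x + a*log b, so linregress returns slope = -a and intercept = a*log b, which inverts to a = -slope and b = exp(intercept/a) = exp(-intercept/slope).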
Use Less Data (Method 4):
Finally the seed mechanism used for the previous two methods provides pretty good estimates on the sample data. Of course it is sample data so your mileage may vary.
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
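The stride simply subsamples the data down to roughly 200 points before fitting; on the smooth sample data that is enough to land close to the true parameters, but noisier real data may need a larger subset.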
Test Code:
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress
fit_data = {}
current_data = None
def data_for_fit(a, b, n):
    key = a, b, n
    if key not in fit_data:
        rng.seed(0)
        x = np.logspace(-1, 1, n)
        y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
        y[x > b] = 1
        fit_data[key] = x, y
    return fit_data[key]

def power_law(r, a, b):
    """ Power law for the shape of the normalized conditional density """
    import numpy as np
    return np.piecewise(
        r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

def method1():
    x, y = current_data
    return curve_fit(power_law, x, y)[0]

def method2():
    x, y = current_data
    return curve_fit(power_law, x, y, p0=method4()[0])

def method3():
    x, y = current_data
    c0, pcov = method4()
    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result

def method4():
    x, y = current_data
    stride = int(max(1, len(x) / 200))
    return curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])

from timeit import timeit

def runit(stmt):
    print("%s: %.3f %s" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt),
        eval(stmt + '()')[0]
    ))

def runit_size(size):
    print('Length: %d' % size)
    if size <= 100000:
        runit('method1')
        runit('method2')
    runit('method3')
    runit('method4')

for i in (1000, 10000, 100000, 1000000):
    current_data = data_for_fit(3, 3, i)
    runit_size(i)
Two suggestions:
Use numpy.where (and possibly argmin) to find the X value at which the Y data becomes 1, or perhaps just slightly larger than 1, and truncate the data to that point -- effectively ignoring the data where Y=1.
That might be something like:
index_max = numpy.where(y < 1.2)[0][0]
x = x[:index_max]
y = y[:index_max]
Use the hint shown in your log-log plot that the power law is linear in log-log space. You don't need curve_fit; you can use scipy.stats.linregress on log(Y) vs log(X). For your real work, that will at the very least give good starting values for a subsequent fit.
Following up on this and trying to follow your question, you might try something like:
import numpy as np
from scipy.stats import linregress
np.random.seed(0)
npts = 51
x = np.logspace(-2, 2, npts)
YTHRESH = 1.02
for i in range(5):
    b = i + 1.0 + np.random.normal(scale=0.1)
    a = b + np.random.random()
    y = (x/b)**(-a) + np.random.normal(scale=0.0030, size=npts)
    y[x>b] = 1.0

    # to model the power-law decay, first remove the values
    # where y ~= 1, where the data is known to not decay...
    imax = np.where(y < YTHRESH)[0][0]

    # take log of this truncated x and y
    _x = np.log(x[:imax])
    _y = np.log(y[:imax])

    # use linear regression on the log-log data:
    out = linregress(_x, _y)

    # map slope/intercept to scale, exponent
    afit = -out.slope
    bfit = np.exp(out.intercept/afit)

    print(""" === Fit Example {i:3d}
a expected {a:4f}, got {afit:4f}
b expected {b:4f}, got {bfit:4f}
""".format(i=i+1, a=a, b=b, afit=afit, bfit=bfit))
Hopefully that's enough to get you going.
I need to do a simple curve fit using scipy's curve_fit function. However, my data is in the form of a matrix. I can easily do this with numpy, but I wanted to get the goodness of fit from scipy.
Problem:
AX = B --> given A, find X for least square error.
from scipy.optimize import curve_fit
def getXval():
    a = 4; b = 3, c = 1;
    f0 = a*pow(b, 2)*c
    f1 = a*b/c
    return [f0, f1]

def fit(x, a0, a1):
    res = a0*x[0] + a1*x[1]
    return [res]
x = getXval()
y = [0.15]
popt, pcov = curve_fit(fit, x, y)
This is, however, not working. Can someone point out what is going on here?
Your code has a few problems.
1) Use numpy arrays instead of Python lists
2) You are missing values for y.
This works for me:
from scipy.optimize import curve_fit
import numpy as np
def getXval():
    a = 4; b = 3; c = 1;
    f0 = a*pow(b, 2)*c
    f1 = a*b/c
    return np.array([f0, f1])

def fit(x, a0, a1):
    res = a0*x[0] + a1*x[1]
    return np.array([res])
x = getXval()
y = np.array([0.15, 0.34])
popt, pcov = curve_fit(fit, x, y)
print(popt, pcov)