I am new to analytics and am looking for a way to fit a nonlinear model of the form Y = a*(X1^b) + c*(X2^d) + e, where X1 and X2 are independent variables.
Below is the full data set. Unfortunately we don't have many observations; all we need is a simple fit. The data has no outliers, so every observation has to be considered.
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym
x1=np.array([217,160,97,75])
x2=np.array([5.319751464,6.88536761,5.319751464,5.319751464])
x3=np.array([143.420635344,36.7384534912,23.420635344,1.420635344])
y=np.array([14,7,7,1])
def func(X, a, b, c, d, e):
    x1, x2 = X
    return a*x1**b + c*x2**d + e
popt, pcov = curve_fit(func, (x1,x2), y)
plt.plot(y, func((x1,x2), *popt), label="Fitted Curve")
plt.legend(loc='upper left')
plt.show()
but running curve_fit gives me the error
TypeError: Improper input: N=5 must not exceed M=4
I then added a few dummy observations with almost identical values (differing only slightly), which resulted in the error shown below the data:
x1=np.array([217,160,97,75,76,219])
x2=np.array([5.319751464,6.88536761,5.319751464,5.319751464,5.319751464,5.319751464])
x3=np.array([143.420635344,36.7384534912,23.420635344,1.420635344,1.420635344,143.420635344])
y=np.array([14,7,7,1,1,14])
RuntimeError: Optimal parameters not found: Number of calls to
function has reached maxfev = 1200.
Then I had to remove the parameter d and keep the function as
def func(X, a, b, c, e):
    x1, x2 = X
    return a*x1**b + c*x2 + e
Finally it ran, but with the warning below, and the results are not good:
RuntimeWarning: overflow encountered in power
Note that
x3 = max(x2 - {(x1^2)*2.6},0)
and solving
y = a*(x3^b) gives a=0.89 and b=0.58 with r2=0.98 and error=0.19, which is the best I could get so far.
But I would like a result in a generalised form, without having to work out such a relation myself, because depending on the data set the function x3=f(x1,x2) can change; it is not a fixed equation for all cases.
A very good way to get acceptable fit results is to supply initial guesses for the free parameters; curve_fit takes a p0 argument for this. If your data does not vary too much from one data set to the next, this should work well. In my experience, putting some effort into guessing good start values p0 and setting limits via the bounds parameter is a good approach. Otherwise, if you don't need to know the mathematical relation, you could try to interpolate the data.
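As a rough sketch of that advice, the snippet below passes p0 and bounds to curve_fit for the two-variable power model from the question. The data here is synthetic (generated from made-up parameter values, purely so that there are more observations than parameters), and the start values and limits are illustrative assumptions, not recommendations for the real data.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_model(X, a, b, c, d, e):
    x1, x2 = X
    return a * x1**b + c * x2**d + e

# Synthetic observations (assumption: made-up "true" parameters, 40 points)
rng = np.random.default_rng(0)
x1 = rng.uniform(50.0, 250.0, 40)
x2 = rng.uniform(4.0, 8.0, 40)
true_params = (0.5, 0.6, 2.0, 1.2, -3.0)
y = power_model((x1, x2), *true_params)

# Rough start values and generous limits for every parameter
p0 = (0.6, 0.5, 1.5, 1.0, 0.0)
lower = [0.0, 0.0, 0.0, 0.0, -50.0]
upper = [10.0, 2.0, 10.0, 3.0, 50.0]

popt, pcov = curve_fit(power_model, (x1, x2), y, p0=p0,
                       bounds=(lower, upper), max_nfev=5000)

def rss(params):
    # Residual sum of squares of the model for a given parameter tuple
    return float(np.sum((power_model((x1, x2), *params) - y) ** 2))

print("fitted parameters:", popt)
print("RSS at p0:", rss(p0), "RSS at fit:", rss(popt))
```

Keeping the exponents bounded also keeps x1**b from overflowing, which is one way the RuntimeWarning from the question can arise.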
I want to model the following curve:
To perform it, I'm using curve_fit from SciPy, fitting an exponential function.
def exponenial_func(x, a, b, c):
    return a * b**(c*x)
popt, pcov = curve_fit(exponenial_func, x, y, p0=(1,2,2),
                       bounds=((0, 0, 0), (np.inf, np.inf, np.inf)))
When I first do it, I get this:
This minimises the residuals with every point given the same importance.
What I want is a curve that gives more importance to the last values (from x = 30 on the x-axis, for example) than to the first ones, so that it fits better at the end of the curve than at the beginning.
I know there are many ways to approach this (first of all, I need to define how much importance I want to give to each residual). My question is how to go about it.
One idea I had is to set the sigma value so that each data point is weighted by its inverse value.
popt, pcov = curve_fit(exponenial_func, x, y, p0=(1,2,2),
                       bounds=((0, 0, 0), (np.inf, np.inf, np.inf)),
                       sigma=1/y)
In this case, I get something like I was looking for:
It doesn't look bad, but I'm looking for another way of doing this, so that I can "control" each of the data points, like to weight each of the residuals in a linear way, or exponential, or even choosing it manually (rather than all of them by the inverse, as in the previous case).
Thanks in advance
First of all, note that there's no need for three coefficients, since
a * b**(c*x) = a * exp(log(b)*c*x),
so we can define k = log(b)*c.
Here's a suggestion for tackling your problem by hand with scipy.optimize.least_squares and a priority vector:
import numpy as np
from scipy.optimize import least_squares
def exponenial_func2(x, a, k):
    return a * np.exp(k*x)
# returns the vector of residuals
def fitwrapper2(coeffs, *args):
    xdata, ydata, prio = args
    return prio*(exponenial_func2(xdata, *coeffs)-ydata)
# Data
n = 31
xdata = np.arange(n)
ydata = np.array([155.0,229,322,453,655,888,1128,1694,
                  2036,2502,3089,3858,4636,5883,7375,
                  9172,10149,12462,12462,17660,21157,
                  24747,27980,31506,35713,41035,47021,
                  53578,59138,63927,69176])
# The priority vector
prio = np.ones(n)
prio[-1] = 5
res = least_squares(fitwrapper2, x0=[1.0,2.0], bounds=(0,np.inf), args=(xdata,ydata,prio))
With prio[-1] = 5 we give the last point a high priority.
res.x contains your optimal coefficients. Here a, k = res.x.
Note that for prio = np.ones(n) it's a normal least squares fitting (like curve_fit does) where all points have the same priority.
You can control the priority of each point by increasing its value in the prio array. Comparing both results gives me:
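For the "control each data point" part of the question, the same weighting can also be handed to curve_fit, because curve_fit minimises sum(((f(x) - y) / sigma)**2): setting sigma = 1/prio reproduces the prio-weighted residuals above. The sketch below uses a linearly increasing weight vector, which is only an illustrative choice, and takes its start values from a straight-line fit to log(y), which is likewise an assumption.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, k):
    return a * np.exp(k * x)

xdata = np.arange(31)
ydata = np.array([155.0, 229, 322, 453, 655, 888, 1128, 1694,
                  2036, 2502, 3089, 3858, 4636, 5883, 7375,
                  9172, 10149, 12462, 12462, 17660, 21157,
                  24747, 27980, 31506, 35713, 41035, 47021,
                  53578, 59138, 63927, 69176])

# Start values from a log-linear fit: log(y) ~ log(a) + k*x
k0, log_a0 = np.polyfit(xdata, np.log(ydata), 1)
p0 = (np.exp(log_a0), k0)

# Weights grow linearly from 1 (first point) to 10 (last point)
weights = np.linspace(1.0, 10.0, xdata.size)

# Unweighted fit, then a fit where sigma = 1/weights emphasises late points
popt_plain, _ = curve_fit(exp_model, xdata, ydata, p0=p0)
popt_weighted, _ = curve_fit(exp_model, xdata, ydata, p0=p0,
                             sigma=1.0 / weights)

def weighted_cost(params):
    # Same objective as the prio-weighted residuals, summed and squared
    return float(np.sum((weights * (exp_model(xdata, *params) - ydata)) ** 2))

print("plain fit:   ", popt_plain, weighted_cost(popt_plain))
print("weighted fit:", popt_weighted, weighted_cost(popt_weighted))
```

Any other weight vector (exponential, hand-picked values) can be substituted for `weights` without changing the rest of the code.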
I have a set of data consisting of x and the corresponding values of f(x). I know the form of f(x) from the theory of the problem I'm working on; it is given by the expression below:
Essentially, I want to use this set of data to find the parameters a and b. My problem is: How can I do that? What library should I use? I would like an answer using Python. But R or Julia would be ok as well.
So far I've read about the curve_fit function in the SciPy library, but I'm having trouble working out how to write the code, since my x variable appears in one of the integration limits.
For better ways of working with the problem, I also have the following resources:
A sample set for which I know the parameters I'm looking for: a = 2 and b = 1 (and c = 3). In case it raises questions about how I know these parameters: I know them because I created this sample set by integrating the equation above with these parameters, just so I would have a reference while investigating how to find them.
I also have this set, for which the only information I have is that c = 4 and want to find a and b.
I would also like to point out that:
i) right now I have no code to post here because I don't have a clue how to write something to solve my problem. But I would be happy to edit and update the question after reading any answer or help that you guys could provide me.
ii) I'm looking first for a solution where I don't know a and b. But if that is too hard, I would be happy to see a solution where I assume that either a or b is known.
EDIT 1: I would like to reference this question to anyone interested in this problem as it's a parallel but also important discussion to the problem faced here
I would use a purely numeric approach, which you can use even when you cannot solve the integral directly. Here's a snippet for fitting only the a parameter:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
import matplotlib.pyplot as plt
def integrand(x, a):
    b = 1
    c = 3
    return 1/(a*np.sqrt(b*(1+x)**3 + c*(1+x)**4))
def integral(x, a):
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a)
    return np.trapz(arr, dx=dx, axis=-1)
vec_integral = np.vectorize(integral)
df = pd.read_csv('data-with-known-coef-a2-b1-c3.csv')
x = df.domin.values
y = df.resultados2.values
out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2])
plt.plot(x, y)
plt.plot(x, vec_integral(x, out_mean[0]))
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}')
plt.show()
vec_integral = np.vectorize(integral)
Of course, you can lower the value of dx to get the desired precision. While this works for fitting just a, when you try to fit b as well, the fit does not converge properly (in my opinion because a and b are strongly correlated). Here's what you get:
def integrand(x, a, b):
    c = 3
    return 1/(a*np.sqrt(np.abs(b*(1+x)**3 + c*(1+x)**4)))
def integral(x, a, b):
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a, b)
    return np.trapz(arr, dx=dx, axis=-1)
vec_integral = np.vectorize(integral)
out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2,3])
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}\nb = {out_mean[1]:.3f} +- {np.sqrt(out_var[1][1]):.3f}')
plt.plot(x, y, alpha=0.4)
plt.plot(x, vec_integral(x, out_mean[0], out_mean[1]), color='green', label='fitted solution')
plt.plot(x, vec_integral(x, 2, 1),'--', color='red', label='theoretical solution')
plt.legend()
plt.show()
As you can see, even if the resulting a and b parameters from the fit are "not good", the plot is very similar.
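One way to check the suspicion that a and b are strongly correlated is to turn the covariance matrix returned by curve_fit into a correlation matrix. The sketch below does this on synthetic data generated with a = 2, b = 1, c = 3 (the original CSV file is not used, so the sample points and the noise level are assumptions), and it swaps the per-point np.trapz loop for a cumulative trapezoid rule on a fixed grid purely for speed.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.optimize import curve_fit

C = 3.0                                # c is treated as known
GRID = np.linspace(0.0, 5.0, 5001)     # integration grid (assumed x range)

def model(x, a, b):
    # Integral from 0 to x of 1 / (a * sqrt(b(1+t)^3 + c(1+t)^4)) dt
    integrand = 1.0 / (a * np.sqrt(b * (1 + GRID) ** 3 + C * (1 + GRID) ** 4))
    cumulative = cumulative_trapezoid(integrand, GRID, initial=0.0)
    return np.interp(x, GRID, cumulative)

# Synthetic data with a = 2, b = 1 and a little noise (assumption)
rng = np.random.default_rng(1)
x = np.linspace(0.05, 5.0, 60)
y = model(x, 2.0, 1.0) + rng.normal(scale=1e-4, size=x.size)

popt, pcov = curve_fit(model, x, y, p0=[2.0, 3.0],
                       bounds=([0.1, 0.0], [10.0, 10.0]))

# Normalise the covariance matrix to get correlation coefficients
std = np.sqrt(np.diag(pcov))
corr = pcov / np.outer(std, std)
print("a, b =", popt)
print("correlation(a, b) =", corr[0, 1])
```

An off-diagonal value close to -1 or +1 would mean that a change in one parameter can be almost fully compensated by the other, which is consistent with different (a, b) pairs producing nearly the same curve.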
There are three parameters a, b, c, which are not independent. One of them must be given if we want to compute the two others by regression. With c given, solving for a, b is simple:
The numerical example below uses a small data set (n=10) to make it easy to check.
Note that the regression is for the function t(y), which is not exactly the same as a regression for y(x) when the data is scattered (the result is the same if there is no scatter).
If it is absolutely necessary to have the regression for y(x), a non-linear regression is required. This involves an iterative process starting from a good enough initial guess for a, b. The above calculation gives very good initial values.
IN ADDITION :
Meanwhile Andrea posted a pertinent answer. Of course the fit with his method is better, because it is a non-linear regression instead of a linear one, as already pointed out in the note above.
Nevertheless, despite the different values (a=1.881 ; b=1.617) compared to (a=2.346 , b=-0.361), the respective curves drawn below are not far from each other:
Blue curve : from linear regression (above method)
Green curve : from non-linear regression ( Andrea's )
CASE OF THE SECOND SET OF DATA
https://mega.nz/#!echEjQyK!tUEx0gpFND7gucvsTONiB_wn-ewBq-5k-pZlfLxmfvw
The regression fails because the assumption c=3 is false.
In the case c=0 the analytic calculus of the integral is different from above :
I want to fit a function with vector output using Scipy's curve_fit (or something more appropriate if available). For example, consider the following function:
import numpy as np
def fmodel(x, a, b):
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])
Each component is a different function but they share the parameters I wish to fit. Ideally, I would do something like this:
x = np.linspace(1, 20, 50)
a = 0.1
b = 0.5
y = fmodel(x, a, b)
y_noisy = y + 0.2 * np.random.normal(size=y.shape)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=fmodel, xdata=x, ydata=y_noisy, p0=[0.3, 0.1])
But curve_fit does not work with functions with vector output, and the error "Result from function call is not a proper array of floats." is thrown. What I did instead is to flatten the output like this:
def fmodel_flat(x, a, b):
    # x is the original grid tiled three times; // keeps the slice index an integer
    return fmodel(x[0:len(x)//3], a, b).flatten()
popt, pcov = curve_fit(f=fmodel_flat, xdata=np.tile(x, 3),
                       ydata=y_noisy.flatten(), p0=[0.3, 0.1])
and this works. If instead of a vector function I am actually fitting several functions with different inputs as well but which share model parameters, I can concatenate both input and output.
Is there a more appropriate way to fit vector function with Scipy or perhaps some additional module? A main consideration for me is efficiency - the actual functions to fit are much more complex and fitting can take some time, so if this use of curve_fit is mangled and is leading to excessive runtimes I would like to know what I should do instead.
If I can be so blunt as to recommend my own package symfit, I think it does precisely what you need. An example on fitting with shared parameters can be found in the docs.
Your specific problem stated above would become:
import numpy as np
from symfit import variables, parameters, Model, Fit, sin, exp
x, y_1, y_2, y_3 = variables('x, y_1, y_2, y_3')
a, b = parameters('a, b')
a.value = 0.3
b.value = 0.1
model = Model({
    y_1: a * sin(b * x),
    y_2: a * x**2 - b * x,
    y_3: a * exp(b / x),
})
xdata = np.linspace(1, 20, 50)
ydata = model(x=xdata, a=0.1, b=0.5)
y_noisy = ydata + 0.2 * np.random.normal(size=(len(model), len(xdata)))
fit = Fit(model, x=xdata, y_1=y_noisy[0], y_2=y_noisy[1], y_3=y_noisy[2])
fit_result = fit.execute()
Check out the docs for more!
I think what you're doing is perfectly fine from an efficiency stand point. I'll try to look at the implementation and come up with something more quantitative, but for the time being here is my reasoning.
What you're doing during curve fitting is optimizing the parameters (a,b) such that
res = sum_i |f(x_i; a,b)-y_i|^2
is minimal. By this I mean that you have data points (x_i,y_i) of arbitrary dimensionality, two parameters (a,b) and a fitting model that approximates the data at query points x_i.
The curve fitting algorithm starts from a starting (a,b) pair, puts this into a black box that computes the above square error, and tries to come up with a new (a',b') pair that produces a smaller error. My point is that the error above is really a black box for the fitting algorithm: the configurational space of the fitting is defined merely by the (a,b) parameters. If you imagine how you'd implement a simple curve fitting function, you could imagine that you try to do, say, a gradient descent, with the square error as cost function.
Now, it should be irrelevant to the fitting procedure how the black box computes the error. It's easy to see that the dimensionality of x_i is really irrelevant for scalar functions, since it doesn't matter if you have 1000 1d query points to fit for, or a 10x10x10 grid in 3d space. What matters is that you have 1000 points x_i for which you need to compute f(x_i) ~ y_i from the model.
The only subtlety that should further be noted is that in case of a vector-valued function, the calculation of the error is not trivial. In my opinion, it's fine to define the error at each x_i point using the 2-norm of the vector-valued function. But hey: in this case, the square error at point x_i is
|f(x_i; a,b)-y_i|^2 == sum_k (f(x_i; a,b)[k]-y_i[k])^2
which implies that the square error for each component is accumulated. This just means that what you're doing right now is just right: by replicating your x_i points and taking into account each component of the function individually, your square error will contain exactly the 2-norm of the error at each point.
So my point is what you're doing is mathematically correct, and I don't expect any behaviour of the fitting procedure to depend on the way how multivariate/vector-valued functions are handled.
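To make the argument concrete, here is a small sketch using scipy.optimize.least_squares, which accepts a residual vector directly, so the vector-valued model only has to be flattened once inside the residual function. It also checks numerically that the flattened sum of squares equals the sum of squared 2-norms per point. The random seed and noise are assumptions made for reproducibility; the model and start values are taken from the question.

```python
import numpy as np
from scipy.optimize import least_squares

def fmodel(x, a, b):
    return np.vstack([a * np.sin(b * x), a * x**2 - b * x, a * np.exp(b / x)])

rng = np.random.default_rng(0)
x = np.linspace(1, 20, 50)
y_noisy = fmodel(x, 0.1, 0.5) + 0.2 * rng.normal(size=(3, x.size))

def residuals(params):
    # Flattened (3, n) residual matrix: one entry per component per point
    return (fmodel(x, *params) - y_noisy).ravel()

start = np.array([0.3, 0.1])
result = least_squares(residuals, x0=start)

# Compare the two ways of writing the squared error at the fitted parameters
r = fmodel(x, *result.x) - y_noisy
flat_cost = np.sum(residuals(result.x) ** 2)
pointwise_cost = np.sum(np.linalg.norm(r, axis=0) ** 2)
print(result.x, flat_cost, pointwise_cost)
```

Both costs agree up to rounding, so the optimizer sees the same objective whichever way the components are stacked.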
I'm trying to fit a model to my dataset of wind profiles, i.e. wind speed values u(z) at different altitudes z.
The model consists of two parts, which I for now simplified to:
u(z) = ust/k * ln(z/z0) for z < zsl
u(z) = a*z + b for z > zsl
In the logarithmic model, ust and z0 are free parameters and k is fixed. zsl is the height of the surface layer, which is also not known a priori.
I want to fit this model to my data, and I have already tried different approaches. The best result I'm getting so far is with:
import numpy as np
from scipy import optimize
def two_layer(z,hsl,ust,z0,a,b):
    return ust/0.4*(np.log(z/z0)) if z<hsl else a*z+b
two_layer = np.vectorize(two_layer)
def two_layer_err(p,z,u):
    return two_layer(z,*p)-u
popt, pcov ,infodict, mesg, ier = optimize.leastsq(two_layer_err,[150.,0.3,0.002,0.1,10.],(wspd_hgt,prof),full_output=1)
# wspd_hgt are my measurements heights and
# prof are the corresponding wind speed values
This gives me reasonable estimates for all parameters, except for zsl, which is not changed during the fitting procedure. I guess this has to do with the fact that it is used as a threshold rather than as a function parameter. Is there any way I could let zsl vary during the optimization?
I tried something with numpy.piecewise, but that didn't work very well, perhaps because I don't really understand it very well, or I might be completely off here because it's not suitable for my cause.
For the idea, the wind profile looks like this if the axes are reversed (z plotted versus u):
I think I finally have a solution for this type of problem, which I came across while answering a similar question.
The solution seems to be to implement a constraint saying that u1 == u2 at the switch between the two models. Since no data is posted, I cannot try it with your model, so I will instead show how it works for the other model, and you can adapt it to your situation. I solved this using a scipy wrapper I wrote to make fitting of such problems more pythonic, called symfit. But you could do the same using the SLSQP algorithm in scipy if you prefer.
from symfit import parameters, variables, Fit, Piecewise, exp, Eq
import numpy as np
import matplotlib.pyplot as plt
t, y = variables('t, y')
m, c, d, k, t0 = parameters('m, c, d, k, t0')
# Help the fit by bounding the switchpoint between the models
t0.min = 0.6
t0.max = 0.9
# Make a piecewise model
y1 = m * t + c
y2 = d * exp(- k * t)
model = {y: Piecewise((y1, t <= t0), (y2, t > t0))}
# As a constraint, we demand equality between the two models at the point t0
# Substitutes t in the components by t0
constraints = [Eq(y1.subs({t: t0}), y2.subs({t: t0}))]
# Read the data
tdata, ydata = np.genfromtxt('Experimental Data.csv', delimiter=',', skip_header=1).T
fit = Fit(model, t=tdata, y=ydata, constraints=constraints)
fit_result = fit.execute()
print(fit_result)
plt.scatter(tdata, ydata)
plt.plot(tdata, fit.model(t=tdata, **fit_result.params).y)
plt.show()
I think you should be able to adapt this example to your situation!
Edit: as requested in the comment, it is possible to demand matching derivatives as well. In order to do this, the following additional code is needed for the above example:
from symfit import Derivative
dy1dt = Derivative(y1, t)
dy2dt = Derivative(y2, t)
constraints = [
    Eq(y1.subs({t: t0}), y2.subs({t: t0})),
    Eq(dy1dt.subs({t: t0}), dy2dt.subs({t: t0}))
]
That should do the trick! So from a programming point of view it is very doable, although depending on the model this might not actually have a solution.
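For the wind-profile model from the question, a SciPy-only variant of the same idea is to build the continuity condition into the model: requiring ust/k * ln(zsl/z0) = a*zsl + b fixes b, so the model becomes continuous in zsl and the residuals respond when zsl moves. This is a sketch by substitution rather than the constrained fit shown above; the synthetic profile, the start values and the bounds are all assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

K = 0.4  # fixed constant k from the question

def two_layer(z, zsl, ust, z0, a):
    log_part = ust / K * np.log(z / z0)
    # b is eliminated by demanding that both branches agree at z = zsl
    b = ust / K * np.log(zsl / z0) - a * zsl
    return np.where(z < zsl, log_part, a * z + b)

# Synthetic profile (assumption: made-up parameters and measurement heights)
rng = np.random.default_rng(0)
z = np.linspace(10.0, 500.0, 60)
true_params = (150.0, 0.3, 0.002, 0.01)
u = two_layer(z, *true_params) + rng.normal(scale=0.05, size=z.size)

def residuals(p):
    return two_layer(z, *p) - u

# Start values and bounds are illustrative; zsl is kept inside the data range
start = np.array([200.0, 0.4, 0.01, 0.0])
result = least_squares(residuals, start,
                       bounds=([20.0, 0.01, 1e-4, -1.0],
                               [450.0, 2.0, 1.0, 1.0]))
zsl_fit, ust_fit, z0_fit, a_fit = result.x
print("zsl, ust, z0, a =", result.x)
```

With real data, the bounds on zsl should bracket the heights where the transition is physically plausible, in the same spirit as bounding t0 in the symfit example.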
I've been using the numpy.polyfit function to do some forecasting. If I put in a degree of 1, it works, but I need to do a second degree polynomial fit. In some cases it works, in other cases the plot of the prediction goes down and then goes up forever. For example:
import matplotlib.pyplot as plt
from numpy import *
x=[1,2,3,4,5,6,7,8,9,10]
y=[100,85,72,66,52,48,39,33,29,32]
fit = polyfit(x, y, 2)
fitfunction = poly1d(fit)
to_predict=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
plt.plot(to_predict,fitfunction(to_predict))
plt.show()
After I run that, this shows up (I tried putting a picture up but stackoverflow won't let me).
I want to force it to go through zero.
How would I do that?
If you don't need the fit's error to be computed using the original least-squares formula (i.e. minimizing sum_i |y_i - (a*x_i^2 + b*x_i)|^2), you could try to perform a linear fit of y/x instead, because (a*x^2 + b*x)/x = a*x + b.
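A minimal sketch of the y/x idea, assuming made-up data that lies exactly on a parabola through the origin (so the recovered coefficients can be checked by eye) and assuming no x value is zero:

```python
import numpy as np

# Made-up exact data on y = 2*x^2 + 3*x (assumption, for illustration only)
x = np.arange(1.0, 11.0)
y = 2.0 * x**2 + 3.0 * x

# Linear fit of y/x against x gives the slope a and the intercept b
a, b = np.polyfit(x, y / x, 1)

# Rebuild the quadratic with a zero constant term
polynomial = np.poly1d([a, b, 0.0])
print(a, b, polynomial(0.0))
```

Dividing by x changes how the points are weighted, so on noisy data the coefficients will differ somewhat from those of a direct least-squares fit of the quadratic.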
If you must use the same error metric, construct the coefficient matrices directly and use numpy.linalg.lstsq:
import numpy
x = numpy.asarray(x, dtype=float)  # x*x needs an array, not a list
coeff = numpy.transpose([x*x, x])
((a, b), _, _, _) = numpy.linalg.lstsq(coeff, y, rcond=None)
polynomial = numpy.poly1d([a, b, 0])
(Note that your provided data sequence does not look like a parabola having a y-intercept of 0.)
If anyone has to do this under a deadline, a quick solution is to just add a bunch of extra points at 0 to skew the weighting. I did this:
for i in range(0,100):
    x_vent.insert(i,0)
    y_vent.insert(i,0)
slope_vent,intercept_vent=np.polyfit(x_vent,y_vent,1)