I have a set of data: essentially the values of f(x) together with the corresponding values of x. From the theory of the problem I'm working on, I know the form of f(x), which is given by the expression below:
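(The expression was posted as an image; judging from the integrand used in the answer code below, it is

f(x) = integral from 0 to x of dt / ( a*sqrt( b*(1+t)^3 + c*(1+t)^4 ) )

with parameters a, b and c.)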
Essentially, I want to use this data set to find the parameters a and b. My problem is: how can I do that, and what library should I use? I would prefer an answer using Python, but R or Julia would be fine as well.
From everything I've done so far, I've read about a function called curve_fit from the SciPy library, but I'm having trouble working out how to write the code, since my x variable sits in one of the integration limits.
To make the problem easier to work with, I also have the following resources:
A sample set for which I know the parameters I'm looking for: a = 2 and b = 1 (and c = 3). Before this raises questions about how I know these parameters: I know them because I created this sample set myself by integrating the equation above with exactly those values, precisely so I would have a reference while investigating how to recover them.
I also have a second set, for which the only information I have is that c = 4, and I want to find a and b.
I would also like to point out that:
i) Right now I have no code to post here because I don't have a clue how to write something that solves this problem, but I'd be happy to edit and update the question after reading any answer or help you can provide.
ii) I'm looking first for a solution where neither a nor b is known. But if that turns out to be too hard, I'd be happy to see a solution that assumes one of them (either a or b) is known.
EDIT 1: I would like to point anyone interested in this problem to this question, as it is a parallel but equally important discussion of the problem faced here.
I would use a purely numerical approach, which you can use even when you cannot solve the integral directly. Here's a snippet for fitting only the a parameter:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
import matplotlib.pyplot as plt
def integrand(x, a):
    # b and c are fixed to their known values for this first test
    b = 1
    c = 3
    return 1/(a*np.sqrt(b*(1+x)**3 + c*(1+x)**4))

def integral(x, a):
    # numerical integral of the integrand from 0 to x on a fixed grid
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a)
    return np.trapz(arr, dx=dx, axis=-1)
vec_integral = np.vectorize(integral)
df = pd.read_csv('data-with-known-coef-a2-b1-c3.csv')
x = df.domin.values
y = df.resultados2.values
out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2])
plt.plot(x, y)
plt.plot(x, vec_integral(x, out_mean[0]))
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}')
plt.show()
Note the vec_integral = np.vectorize(integral) line: integral only accepts a scalar upper limit, so it has to be wrapped with np.vectorize before curve_fit can evaluate it on the whole x array.
Of course, you can lower the value of dx to get the desired precision. This works well for fitting just a; when you try to fit b as well, however, the fit does not converge properly (in my opinion because a and b are strongly correlated). Here's what you get:
def integrand(x, a, b):
    c = 3
    # np.abs protects the sqrt when the optimizer tries a negative b
    return 1/(a*np.sqrt(np.abs(b*(1+x)**3 + c*(1+x)**4)))

def integral(x, a, b):
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a, b)
    return np.trapz(arr, dx=dx, axis=-1)

vec_integral = np.vectorize(integral)
out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2, 1])
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}\nb = {out_mean[1]:.3f} +- {np.sqrt(out_var[1][1]):.3f}')
plt.plot(x, y, alpha=0.4)
plt.plot(x, vec_integral(x, out_mean[0], out_mean[1]), color='green', label='fitted solution')
plt.plot(x, vec_integral(x, 2, 1),'--', color='red', label='theoretical solution')
plt.legend()
plt.show()
As you can see, even though the a and b parameters resulting from the fit are "not good", the plot is very similar.
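A possible variant, in case the fixed-dx trapezoid grid becomes a precision/speed trade-off: scipy.integrate.quad could be used as an adaptive drop-in replacement for the integral helper above. A minimal sketch, keeping the same c = 3 assumption:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import curve_fit

def integrand(t, a, b):
    c = 3  # same assumption as above
    return 1/(a*np.sqrt(b*(1+t)**3 + c*(1+t)**4))

def integral(x, a, b):
    # adaptive quadrature from 0 to x; no dx to tune by hand
    val, _ = quad(integrand, 0, x, args=(a, b))
    return val

vec_integral = np.vectorize(integral)
# used exactly like the trapezoid version, e.g.
# out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2, 1])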
There are three variables a, b, c, which are not independent. One of them must be given if we want to compute the other two through regression. With c given, solving for a and b is simple:
The example of numerical calculation below is made with a small data set (n = 10) in order to make it easy to check.
Note that the regression is for the function t(y), which is not exactly the same as for y(x) when the data is scattered (the result is the same if there is no scatter).
If it is absolutely necessary to have the regression for y(x), a non-linear regression is required. That involves an iterative process starting from good enough initial guesses for a and b. The calculation above gives very good initial values.
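The detailed calculation is not reproduced here, so here is a rough sketch in code of the same general idea: turn the model into a linear one so that ordinary least squares yields starting values for a and b, assuming c is known. Since dy/dx = 1/(a*sqrt(b*(1+x)^3 + c*(1+x)^4)), the quantity 1/(dy/dx)^2 is linear in p = a^2*b and q = a^2*c:

import numpy as np

# x, y are the data arrays from the sample set; c is assumed known (c = 3 here)
c = 3
dydx = np.gradient(y, x)                       # numerical derivative of the data
u = 1.0/dydx**2                                # model: u = p*(1+x)^3 + q*(1+x)^4
A = np.column_stack([(1 + x)**3, (1 + x)**4])
(p, q), *_ = np.linalg.lstsq(A, u, rcond=None)
a0 = np.sqrt(q/c)                              # since q = a^2*c
b0 = p/a0**2                                   # since p = a^2*b
# a0, b0 can then be used as p0 for the non-linear fit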
IN ADDITION :
Meanwhile, Andrea posted a pertinent answer. Of course the fit obtained with his method is better, because it is a non-linear regression rather than a linear one, as already pointed out in the note above.
Nevertheless, despite the different values (a = 1.881, b = 1.617) compared to (a = 2.346, b = -0.361), the respective curves drawn below are not far from one another:
Blue curve: from the linear regression (method above)
Green curve: from the non-linear regression (Andrea's)
CASE OF THE SECOND SET OF DATA
https://mega.nz/#!echEjQyK!tUEx0gpFND7gucvsTONiB_wn-ewBq-5k-pZlfLxmfvw
The regression fails because the assumption c=3 is false.
In the case c = 0, the analytic calculation of the integral is different from the above:
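For reference, assuming the same integrand as in the numerical answer above, the c = 0 case integrates in closed form:

f(x) = integral from 0 to x of dt / ( a*sqrt( b*(1+t)^3 ) ) = ( 2 / (a*sqrt(b)) ) * ( 1 - 1/sqrt(1+x) )

so a linear regression of y against (1 - 1/sqrt(1+x)) gives the combination a*sqrt(b) directly; a and b cannot be separated without further information.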
Related
I know I can just write the needed method myself, but there must be a function for this, because this problem is as common as heck. In case somebody doesn't understand what I am talking about, take a look at the following formula:
(image of the usual least-squares formulas: k = (n*sum(x_i*y_i) - sum(x_i)*sum(y_i)) / (n*sum(x_i^2) - (sum(x_i))^2), b = (sum(y_i) - k*sum(x_i)) / n)
For example, I have a function y = kx + b, where y is the dependent variable and x is the independent one. I need to calculate k (the slope) and b (the intercept), and I have the formulas from the picture and everything those formulas need. Is there any function in common data-science libraries that can calculate them? I mention "only one independent variable" because sometimes there are multiple independent variables, which leads to multidimensional fits.
Googling gives nothing. I already use my own implementation, but I would prefer native functions from packages such as scipy, numpy, or sklearn.
I'm not sure I fully understand the question (in particular, what do you mean by "one independent variable"?), so let me try to reformulate. If you have two variables, x and y, both represented by samples (x_1, ..., x_n), (y_1, ..., y_n), and you suspect a linear relationship between them, y = a*x + b, then you can use numpy.polyfit to find the coefficients a and b. Here is an example:
import numpy as np
n = 100
x = np.linspace(0, 1, n)
y = 2*x + 0.3
a, b = np.polyfit(x, y, 1)
print(f"a={a}, b={b}")
Returns
a=2.0, b=0.30000000000000016
Hope that helps!
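For completeness, since the question also mentions scipy: scipy.stats.linregress does the same single-variable least-squares fit and additionally reports the correlation coefficient and standard errors. A minimal sketch:

import numpy as np
from scipy import stats

x = np.linspace(0, 1, 100)
y = 2*x + 0.3
res = stats.linregress(x, y)      # ordinary least squares for y = slope*x + intercept
print(res.slope, res.intercept)   # 2.0, 0.3 (up to floating-point error)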
I am new to analytics. I am looking for a way to fit a model for a non-linear equation of the form Y = a*(X1^b) + c*(X2^d) + e (where X1 and X2 are independent variables).
Below is the full set. Unfortunately we don't have many observations, and all we need is any simple fit. But this data does not have any outliers, and every observation has to be considered.
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym
x1=np.array([217,160,97,75])
x2=np.array([5.319751464,6.88536761,5.319751464,5.319751464])
x3=np.array([143.420635344,36.7384534912,23.420635344,1.420635344])
y=np.array([14,7,7,1])
def func(X, a, b, c, d, e):
    x1, x2 = X
    return a*x1**b + c*x2**d + e
popt, pcov = curve_fit(func, (x1,x2), y)
plt.plot(y, func((x1,x2), *popt), label="Fitted Curve")
plt.legend(loc='upper left')
plt.show()
but running curve_fit gives me an error:
TypeError: Improper input: N=5 must not exceed M=4
Then I had to add a few more dummy observations with almost identical values (differing only after the decimal point), which resulted in this error:
x1=np.array([217,160,97,75,76,219])
x2=np.array([5.319751464,6.88536761,5.319751464,5.319751464,5.319751464,5.319751464])
x3=np.array([143.420635344,36.7384534912,23.420635344,1.420635344,1.420635344,143.420635344])
y=np.array([14,7,7,1,1,14])
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 1200.
Then I had to remove the variable d and keep the function as
def func(X, a, b, c, e):
    x1, x2 = X
    return a*x1**b + c*x2 + e
Finally it did run, but with the warning below, and the results are not good:
RuntimeWarning: overflow encountered in power
Note that
x3 = max(x2 - {(x1^2)*2.6},0)
and solving
y = a*(x3^b) gives a = 0.89 and b = 0.58 with r2 = 0.98 and error = 0.19, which is the best result I could get so far.
But I would like to have the result in a generalised form, without having to work out such a relation myself, because depending on the data set the function x3 = f(x1, x2) can change; it is not a fixed equation for all cases.
A very good approach to getting acceptable fit results is to use initial guesses for the free parameters: curve_fit takes a p0 argument. If your data does not vary too much between two data sets, this should be a proper way. In my experience, putting some effort into guessing good start values p0 and setting limits via the bounds parameter is a good approach. Otherwise you could try to interpolate the data, if you don't need to know the mathematical relation.
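A minimal sketch of what that could look like for the reduced model from the question; the start values and bounds here are illustrative guesses, not values derived from the data:

import numpy as np
from scipy.optimize import curve_fit

x1 = np.array([217, 160, 97, 75])
x2 = np.array([5.319751464, 6.88536761, 5.319751464, 5.319751464])
y = np.array([14, 7, 7, 1])

def func(X, a, b, c, e):
    x1, x2 = X
    return a*x1**b + c*x2 + e

p0 = [0.1, 1.0, 1.0, 0.0]                     # illustrative starting guesses
bounds = ([0, 0, -10, -10], [10, 3, 10, 10])  # keeps the power b in a sane range
popt, pcov = curve_fit(func, (x1, x2), y, p0=p0, bounds=bounds)
print(popt)

Whether this converges to something meaningful still depends on how well the model matches the four observations; the point is only that p0 and bounds keep the optimizer away from the overflow region.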
I want to fit a function with vector output using Scipy's curve_fit (or something more appropriate if available). For example, consider the following function:
import numpy as np
def fmodel(x, a, b):
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])
Each component is a different function but they share the parameters I wish to fit. Ideally, I would do something like this:
x = np.linspace(1, 20, 50)
a = 0.1
b = 0.5
y = fmodel(x, a, b)
y_noisy = y + 0.2 * np.random.normal(size=y.shape)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=fmodel, xdata=x, ydata=y_noisy, p0=[0.3, 0.1])
But curve_fit does not work with functions that have vector output, and the error Result from function call is not a proper array of floats. is thrown. What I did instead is to flatten the output like this:
def fmodel_flat(x, a, b):
    return fmodel(x[0:len(x)//3], a, b).flatten()

popt, pcov = curve_fit(f=fmodel_flat, xdata=np.tile(x, 3),
                       ydata=y_noisy.flatten(), p0=[0.3, 0.1])
and this works. If instead of a vector function I am actually fitting several functions with different inputs as well but which share model parameters, I can concatenate both input and output.
Is there a more appropriate way to fit vector function with Scipy or perhaps some additional module? A main consideration for me is efficiency - the actual functions to fit are much more complex and fitting can take some time, so if this use of curve_fit is mangled and is leading to excessive runtimes I would like to know what I should do instead.
If I can be so blunt as to recommend my own package symfit, I think it does precisely what you need. An example on fitting with shared parameters can be found in the docs.
Your specific problem stated above would become:
import numpy as np
from symfit import variables, parameters, Model, Fit, sin, exp

x, y_1, y_2, y_3 = variables('x, y_1, y_2, y_3')
a, b = parameters('a, b')
a.value = 0.3
b.value = 0.1

model = Model({
    y_1: a * sin(b * x),
    y_2: a * x**2 - b * x,
    y_3: a * exp(b / x),
})

xdata = np.linspace(1, 20, 50)
ydata = model(x=xdata, a=0.1, b=0.5)
y_noisy = ydata + 0.2 * np.random.normal(size=(len(model), len(xdata)))

fit = Fit(model, x=xdata, y_1=y_noisy[0], y_2=y_noisy[1], y_3=y_noisy[2])
fit_result = fit.execute()
Check out the docs for more!
I think what you're doing is perfectly fine from an efficiency standpoint. I'll try to look at the implementation and come up with something more quantitative, but for the time being here is my reasoning.
What you're doing during curve fitting is optimizing the parameters (a,b) such that
res = sum_i |f(x_i; a,b)-y_i|^2
is minimal. By this I mean that you have data points (x_i,y_i) of arbitrary dimensionality, two parameters (a,b) and a fitting model that approximates the data at query points x_i.
The curve fitting algorithm starts from a starting (a,b) pair, puts this into a black box that computes the above square error, and tries to come up with a new (a',b') pair that produces a smaller error. My point is that the error above is really a black box for the fitting algorithm: the configurational space of the fitting is defined merely by the (a,b) parameters. If you imagine how you'd implement a simple curve fitting function, you could imagine that you try to do, say, a gradient descent, with the square error as cost function.
Now, it should be irrelevant to the fitting procedure how the black box computes the error. It's easy to see that the dimensionality of x_i is really irrelevant for scalar functions, since it doesn't matter if you have 1000 1d query points to fit for, or a 10x10x10 grid in 3d space. What matters is that you have 1000 points x_i for which you need to compute f(x_i) ~ y_i from the model.
The only subtlety that should further be noted is that in case of a vector-valued function, the calculation of the error is not trivial. In my opinion, it's fine to define the error at each x_i point using the 2-norm of the vector-valued function. But hey: in this case, the square error at point x_i is
|f(x_i; a,b)-y_i|^2 == sum_k (f(x_i; a,b)[k]-y_i[k])^2
which implies that the square error for each component is accumulated. This just means that what you're doing right now is just right: by replicating your x_i points and taking into account each component of the function individually, your square error will contain exactly the 2-norm of the error at each point.
So my point is that what you're doing is mathematically correct, and I don't expect any behaviour of the fitting procedure to depend on the way multivariate/vector-valued functions are handled.
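For what it's worth, the same problem can also be written directly with scipy.optimize.least_squares, which accepts an arbitrary residual vector and therefore does not need the np.tile / flatten wrapper. This is a sketch of the equivalent formulation rather than something expected to be faster:

import numpy as np
from scipy.optimize import least_squares

def fmodel(x, a, b):
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])

x = np.linspace(1, 20, 50)
y_noisy = fmodel(x, 0.1, 0.5) + 0.2*np.random.normal(size=(3, x.size))

def residuals(params):
    a, b = params
    return (fmodel(x, a, b) - y_noisy).ravel()   # one residual per component per x

result = least_squares(residuals, x0=[0.3, 0.1])
print(result.x)                                  # fitted a, b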
I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st-order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st-order function (green). The x-axis is days since the beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough, and I'm unaware of any methods of nonlinear regression that would work in this instance.
This solution isn't accurate enough. This is what I feed it:
x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x,y,1)
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x,y, '.', x_new, y_new, '-')
EDIT:
I have now tried curve_fit using a logarithmic function, since the curve and data behaviour seem to conform to one:
def func(x, a, b):
    return a*numpy.log(x) + b
x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot( future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting well enough to your actual data then either:
You are using the function wrongly, e.g. you are using 1st-order polynomials - so if you are convinced that it is a polynomial, try higher-order polynomials.
You are using the wrong function; it is always worth taking a look at:
your data curve &
what you know about the process that is generating the data
to come up with some speculation/theorem/guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them (see the sketch after this list)!
Finally, if you are not getting a consistent long term trend then you might be able to justify using cubic splines.
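A rough sketch of what "try them" could look like in practice with curve_fit, comparing a logarithmic and a saturating candidate on the same data. The data here is synthetic; in the question it would be dfnew["days_since"] and dfnew["nonbrand"]:

import numpy as np
from scipy.optimize import curve_fit

# synthetic stand-in for the ~600 days of data
x = np.arange(1, 600, dtype=float)
y = 50*np.log(x) + 10 + np.random.normal(scale=5, size=x.size)

def log_model(x, a, b):
    return a*np.log(x) + b

def saturating_model(x, a, b):
    return a*x/(b + x)          # rises fast, then levels off

for model in (log_model, saturating_model):
    popt, _ = curve_fit(model, x, y, p0=[1.0, 1.0], maxfev=5000)
    sse = np.sum((y - model(x, *popt))**2)
    print(model.__name__, popt, sse)    # compare residual sums of squares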
I've been using the numpy.polyfit function to do some forecasting. If I put in a degree of 1, it works, but I need to do a second degree polynomial fit. In some cases it works, in other cases the plot of the prediction goes down and then goes up forever. For example:
import matplotlib.pyplot as plt
from numpy import *
x=[1,2,3,4,5,6,7,8,9,10]
y=[100,85,72,66,52,48,39,33,29,32]
fit = polyfit(x, y, 2)
fitfunction = poly1d(fit)
to_predict=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
plt.plot(to_predict,fitfunction(to_predict))
plt.show()
After I run that, this shows up (I tried putting a picture up but stackoverflow won't let me).
I want to force it to go through zero.
How would I do that?
If you don't need the fit's error to be computed using the original least-squares formula (i.e. minimizing sum_i |y_i - (a*x_i^2 + b*x_i)|^2), you could try to perform a linear fit of y/x instead, because (a*x^2 + b*x)/x = a*x + b.
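A minimal sketch of that first trick, using the data from the question (note that compared with the original error metric this implicitly weights each point by 1/x^2):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([100, 85, 72, 66, 52, 48, 39, 33, 29, 32], dtype=float)

# fit y/x = a*x + b with a plain degree-1 polyfit,
# which corresponds to y = a*x**2 + b*x (no constant term)
a, b = np.polyfit(x, y/x, 1)
polynomial = np.poly1d([a, b, 0])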
If you must use the same error metric, construct the coefficient matrices directly and use numpy.linalg.lstsq:
x = numpy.asarray(x, dtype=float)
y = numpy.asarray(y, dtype=float)
coeff = numpy.transpose([x*x, x])            # columns for the x^2 and x terms, no constant column
((a, b), _, _, _) = numpy.linalg.lstsq(coeff, y, rcond=None)
polynomial = numpy.poly1d([a, b, 0])         # a*x^2 + b*x, forced through the origin
(Note that your provided data sequence does not look like a parabola having a y-intercept of 0.)
If anyone has to do this under a deadline, a quick solution is to just add a bunch of extra points at 0 to skew the weighting. I did this:
for i in range(0, 100):
    x_vent.insert(i, 0)
    y_vent.insert(i, 0)

slope_vent, intercept_vent = np.polyfit(x_vent, y_vent, 1)
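An alternative to inserting 100 literal copies of the origin: np.polyfit accepts a w argument, so a single heavily weighted (0, 0) point achieves the same skew. A sketch, with the weight chosen arbitrarily:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([100, 85, 72, 66, 52, 48, 39, 33, 29, 32], dtype=float)

xw = np.append(x, 0.0)
yw = np.append(y, 0.0)
w = np.append(np.ones_like(x), 1000.0)   # large weight pins the fit near the origin
coeffs = np.polyfit(xw, yw, 2, w=w)      # degree-2 fit, approximately through (0, 0)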