Statistical regression on multi-dimensional data [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I have a set of data in (x, y, z) format where z is the output of some formula involving x and y. I want to find out what the formula is, and my Internet research suggests that statistical regression is the way to do this.
However, all of the examples I have found while researching only deal with two-dimensional data sets (x, y), which is not useful for my situation. Said examples also don't seem to provide a way to see what the resulting formula is; they just provide a function for predicting future outputs from data not in the training set.
The level of precision needed: the formula for z must produce results within +/- 0.5 of the actual values.
Can anyone tell me how to do this? Please note I am not asking for specific recommendations on a software library to use.

If the formula is a linear function, check out this tutorial. It uses ordinary least squares (OLS), which is quite powerful, to fit your data.
Assume that you have data points (x1, y1, z1), (x2, y2, z2), ..., (xn, yn, zn); transform them into three separate numpy arrays X, Y and Z.
import numpy as np
X = np.array([x1, x2, ..., xn])
Y = np.array([y1, y2, ..., yn])
Z = np.array([z1, z2, ..., zn])
Then, use ols to fit them!
import pandas
from statsmodels.formula.api import ols
# Your data, assumed to follow z = a*x + b*y + c.
data = pandas.DataFrame({'x': X, 'y': Y, 'z': Z})
# Fit the data with the OLS model; the formula refers to the DataFrame columns.
model = ols("z ~ x + y", data).fit()
# Get the model summary.
print(model.summary())
# Get the fitted parameters.
print(model.params)
# should be approximately [c, a, b] (intercept first)
If more variables are present
Add as many variables to the DataFrame as you like, then list them all in the formula, as in the sketch below.
# Your data.
data = pandas.DataFrame({'v1': V1, 'v2': V2, 'v3': V3, 'v4': V4, 'z': Z})
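Then fit with a formula that names every predictor column; a minimal sketch, assuming the hypothetical v1..v4 columns above:
model = ols("z ~ v1 + v2 + v3 + v4", data).fit()
print(model.params)  # intercept first, then one coefficient per column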
Reference
Python package statsmodels

The most basic tool you need is multiple linear regression. The basic method models z as a linear function of x and y with Gaussian noise e added on top: f(x,y) = a1*x + a2*y + a3, and z is then produced as f(x,y) + e, where e is usually a zero-mean Gaussian with unknown variance. You need to find the coefficients a1, a2 and the bias a3, which are usually estimated with maximum likelihood; under the Gaussian assumption this boils down to ordinary least squares, which has a closed-form analytic solution.
Since you have access to Python, take a look at linear regression in scikit-learn:
http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
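A minimal sketch with scikit-learn, assuming predictor arrays X and Y and a target array Z as in the answer above (coef_ and intercept_ then recover a1, a2 and a3):
import numpy as np
from sklearn.linear_model import LinearRegression

# Stack the two predictors into an (n, 2) design matrix.
features = np.column_stack([X, Y])
reg = LinearRegression().fit(features, Z)
print(reg.coef_)       # approximately [a1, a2]
print(reg.intercept_)  # approximately a3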

If you can reuse code from an existing Python 3 tkinter GUI application on GitHub, take a look at fitting the linear polynomial surface equation you mentioned using my tkInterFit project - it will also create fitted surface and contour plots. The GitHub source code is at https://github.com/zunzun/tkInterFit with a BSD license.

Related

Fitting a model with some known parameters to an experimental dataset in python, in order to optimise other parameters

I have an experimental dataset which plots intensity as a function of energy; these are arrays of 1800 datapoints.
I have been trying to fit a model to this data, given by the equation below:
Imodel = I0 * ((math.cos(phi) + beta*f1)**2 + (math.sin(phi) + beta*f2)**2) + Ioff
I have two other datasets, f1 vs. energy and f2 vs. energy. These are arrays of 700 datapoints, albeit over the same energy range as the first dataset.
I want to use this model function together with the f1 and f2 data to find optimal values of the other 4 parameters (I0, phi, beta, Ioff) where this model function fits the experimental dataset exactly.
I have been looking into curve_fit and least_squares from the scipy.optimize package, as well as fitting packages such as lmfit and scikit-learn, but to no avail.
Can anyone help? Thanks.
At present I have no representative data from Ayrtonb1 with which to test the method proposed below. The method seems convenient on a theoretical basis, but one cannot be sure that it will be satisfactory with the OP's data.
Nevertheless, a preliminary test was carried out with "toy" data (shown below).
I suppose that the screen copy below is sufficient to understand the method and to reproduce the calculation with real data.
The result of this preliminary test is rather good:
LRMSE < 2 for a range up to 600 (Least Root Mean Square Error).
LRMSRE < 2% (Least Root Mean Square Relative Error).
Your data and formula look suspiciously like resonant (or anomalous) X-ray diffraction data, with measurements of scattered intensity going across the Zn K-edge. Although you do not say this, the discussion here will assume that. You say you have 1800 measurements, presumably as a function of X-ray energy in eV. The resonant scattering factors (f1, f2) you show seem to be idealized and possibly "typical", and perhaps not specifically for the Zn K-edge -- at the very least the energy scale shown is not the same as your data.
You will want to treat the data and model the intensity as a function of X-ray energy. And you will want realistic values for f1 and f2 for the element of interest, and at the actual energy points for your data. I recommend using xraydb (full disclosure: I am the lead author) [pip install xraydb], so that you can do
import numpy as np
import xraydb
#edata, idata = function_to_extract_your_data()
# or perhaps testing with
edata = np.linspace(9500, 10500, 501)
f1 = xraydb.f1_chantler('Zn', edata)
f2 = xraydb.f2_chantler('Zn', edata)
As written, your intensity function does not directly depend on energy, though it might at a later date, say to make that offset be linear in energy, not just a constant. You might write a function like:
def intensity(en, phi, beta, scale=1, slope=0, offset=0, f1=-1, f2=1):
    costerm = np.cos(phi) + beta*f1
    sinterm = np.sin(phi) + beta*f2
    return scale * (costerm**2 + sinterm**2) + slope*en + offset
With that you can start trying out some values to get a feel for the function and how it compares to your data:
import matplotlib.pyplot as plt
beta = 0.025 # Wild Guess!
for phi in np.pi*np.arange(20)/10:
    plt.plot(edata, intensity(edata, phi, beta, f1=f1, f2=f2), label='%.1f' % phi)
plt.legend()
plt.show()
It kind of looks like your value for phi would be around 5.5 to 6 (or -0.8 to -0.3). You could also try different values of beta and plot that with your actual data.
With that model for intensity and a feel for what the range of parameters is, you could then try to fit your data. To do that, I would recommend using lmfit (full disclosure: I am the lead author) [pip install lmfit], where you can create a model from your intensity model function - this will use the names of the function arguments to name the fitting parameters.
from lmfit import Model
imodel = Model(intensity, independent_vars=['en', 'f1', 'f2'])
params = imodel.make_params(scale=1, offset=0, slope=0, beta=0.1, phi=5.5)
That independent_vars will tell Model to not make fitting Parameters for the function arguments f1 and f2 and to expect them to be passed into any evaluation or fit. The other function arguments (scale, phi, etc) will be expected to become fitting variables. You do have to create a "Parameters" object and must give initial values for all parameters. You can put bounds on these or fix them so they are not altered in the fit, as with
params['scale'].min = 0 # force scale to be positive
params['slope'].vary = False # slope will be fixed at 0.
You can then evaluate the model with
init_value = imodel.eval(params, en=edata, f1=f1, f2=f2)
And then eventually do a fit with
result = imodel.fit(idata, params, en=edata, f1=f1, f2=f2)
print(result.fit_report())
plt.plot(edata, idata, label='data')
plt.plot(edata, init_value, label='initial fit')
plt.plot(edata, result.best_fit, label='best fit')
plt.legend()
plt.show()
Finally, for analysis of X-ray resonant scattering be sure to consider including absorption corrections in that intensity calculation. As you go across the Zn K edge, the absorption depth of the sample may change dramatically if the Zn concentration is high.

Using Python, how would I fit a logistic (or similar s-shaped function) to 3 points?

Let's say I have 3 (x, y) coordinates: (xb, yb), (xm, ym), and (xt, yt). For simplicity, the b, m and t notations correspond to "bottom", "middle" and "top" (i.e. (0, 0), (0.5, 0.5) and (1, 1)).
I've seen many similar SO posts using the logistic function to perform a basic "line of best fit" operation, but I need my logistic function to fit these points exactly. I would also prefer to not use a 3rd party library (like scipy or scikit-learn) given this is the only use case I have for these libraries. numpy is an exception as I use it quite liberally in my program.
Thank you in advance for your help.
In this case, what you have is a basic question in algebra; Python is simply your implementation language. You have three points, so you need an appropriate function with three parameters.
For a polynomial, you would need a quadratic equation, ax^2 + bx + c.
For a simple exponential, you would need something like a * e^(b*x) + c
You have to choose your equation form; then simply solve it on each of your three points: you have three equations in three variables. You should be able to do this by hand (since you don't want a canned program to do that for you).
In particular, the logistic function is
y = L / (1 + e^(-k*(x-x0)))
Your three parameters are L, k, and x0. Again, plug in each point in turn, to give you three equations in three variables. Solve for L, k, and x0.
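For the logistic specifically, the resulting system is nonlinear, but it reduces to a one-dimensional search: since ln(L/y - 1) = -k*(x - x0) is affine in x, the correct L is the one that makes the three transformed points collinear. Here is a minimal numpy-only sketch of that idea (the function name and bisection bounds are my own choices; it assumes y is strictly increasing with 0 < y < L, so exact endpoints like (0, 0) and (1, 1) must be nudged inward, since a logistic never actually reaches them):
import numpy as np

def logistic_through_3_points(x, y, n_iter=200):
    # y = L / (1 + exp(-k*(x - x0)))  =>  u := log(L/y - 1) = -k*(x - x0),
    # so u must be affine in x; search for the L that makes the second
    # difference of u vanish, then read off k and x0 from the line.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def second_diff(L):
        u = np.log(L / y - 1.0)
        return (u[1] - u[0]) / (x[1] - x[0]) - (u[2] - u[1]) / (x[2] - x[1])

    lo, hi = y.max() * (1 + 1e-9), y.max() * 1e6  # L must exceed max(y)
    f_lo = second_diff(lo)
    for _ in range(n_iter):                       # plain bisection on L
        mid = 0.5 * (lo + hi)
        if f_lo * second_diff(mid) <= 0:
            hi = mid
        else:
            lo, f_lo = mid, second_diff(mid)
    L = 0.5 * (lo + hi)
    u = np.log(L / y - 1.0)
    k = -(u[2] - u[0]) / (x[2] - x[0])
    x0 = x[0] + u[0] / k
    return L, k, x0

# Points pulled slightly inside (0, 0), (0.5, 0.5), (1, 1):
print(logistic_through_3_points([0.0, 0.5, 1.0], [0.01, 0.5, 0.99]))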

Fitting parameter inside an integral using python (or another useful language)

I have a set of data, basically with the information of f(x) as a function of x, and x itself. I know from the theory of the problem that I'm working on that f(x) has the form given by the expression below:
f(x) = integral from 0 to x of dt / (a*sqrt(b*(1+t)^3 + c*(1+t)^4))
Essentially, I want to use this set of data to find the parameters a and b. My problem is: how can I do that? What library should I use? I would like an answer using Python, but R or Julia would be fine as well.
From everything I have done so far, I've read about a functionality called curve_fit in the SciPy library, but I'm having trouble working out how to write the code, since my x variable is located in one of the integration limits.
To make the problem easier to work with, I also have the following resources:
A sample set for which I know the parameters I'm looking for. For this set I know that a = 2 and b = 1 (and c = 3). Before this raises questions about how I know these parameters: I know them because I created this sample set myself, by integrating the equation above with these parameters, just so I would have a reference while investigating how to recover them.
I also have this set, for which the only information I have is that c = 4, and I want to find a and b.
I would also like to point out that:
i) Right now I have no code to post here because I don't have a clue how to write something to solve my problem. But I would be happy to edit and update the question after reading any answer or help you can provide.
ii) I'm looking first for a solution where neither a nor b is known. But in case that is too hard, I would be happy to see a solution where one of a or b is assumed to be known.
EDIT 1: I would like to point anyone interested in this problem to this question, as it is a parallel and important discussion of the problem faced here.
I would use a pure numeric approach, which you can use even when you cannot directly solve the integral. Here's a snippet for fitting only the a parameter:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
import matplotlib.pyplot as plt
def integrand(x, a):
    b = 1
    c = 3
    return 1/(a*np.sqrt(b*(1+x)**3 + c*(1+x)**4))

def integral(x, a):
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a)
    return np.trapz(arr, dx=dx, axis=-1)
vec_integral = np.vectorize(integral)
df = pd.read_csv('data-with-known-coef-a2-b1-c3.csv')
x = df.domin.values
y = df.resultados2.values
out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2])
plt.plot(x, y)
plt.plot(x, vec_integral(x, out_mean[0]))
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}')
plt.show()
Of course, you can lower the value of dx to get the desired precision. Fitting just a works, but when you try to fit b as well, the fit does not converge properly (in my opinion because a and b are strongly correlated). Here's what you get:
def integrand(x, a, b):
    c = 3
    return 1/(a*np.sqrt(np.abs(b*(1+x)**3 + c*(1+x)**4)))

def integral(x, a, b):
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a, b)
    return np.trapz(arr, dx=dx, axis=-1)
vec_integral = np.vectorize(integral)
out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2, 3])
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}\nb = {out_mean[1]:.3f} +- {np.sqrt(out_var[1][1]):.3f}')
plt.plot(x, y, alpha=0.4)
plt.plot(x, vec_integral(x, out_mean[0], out_mean[1]), color='green', label='fitted solution')
plt.plot(x, vec_integral(x, 2, 1),'--', color='red', label='theoretical solution')
plt.legend()
plt.show()
As you can see, even if the resulting a and b parameters from the fit are "not good", the plot is very similar.
There are three variables a, b, c, which are not independent. One of them must be given if we want to compute the other two by regression. With c given, solving for a and b is simple:
The example of numerical calculation below is made with a small data set (n = 10) in order to make it easy to check.
Note that the regression is for the function t(y), which is not exactly the same as for y(x) when the data is scattered (the result is the same if there is no scatter).
If it is absolutely necessary to have the regression for y(x), a non-linear regression is needed. This involves an iterative process starting from good enough initial guesses for a and b. The above calculation gives very good initial values.
IN ADDITION:
Meanwhile Andrea posted a pertinent answer. Of course the fitting with his method is better, because it is a non-linear regression instead of a linear one, as already pointed out in the note above.
Nevertheless, despite the different values (a = 1.881, b = 1.617) compared to (a = 2.346, b = -0.361), the respective curves drawn below are not far from one another:
Blue curve: from linear regression (above method)
Green curve: from non-linear regression (Andrea's)
CASE OF THE SECOND SET OF DATA
https://mega.nz/#!echEjQyK!tUEx0gpFND7gucvsTONiB_wn-ewBq-5k-pZlfLxmfvw
The regression fails because the assumption c = 3 is false.
In the case c = 0, the analytic calculation of the integral is different from the above:
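For what it's worth, assuming the integrand from Andrea's code above, the c = 0 case has a simple closed form, since the integrand reduces to a pure power of (1+t):
f(x) = integral from 0 to x of dt / (a*sqrt(b*(1+t)^3)) = (2/(a*sqrt(b))) * (1 - 1/sqrt(1+x))
Note that only the combined scale a*sqrt(b) is identifiable in that case: the fit becomes a linear regression of f(x) against 1 - 1/sqrt(1+x).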

Extrapolating data from a curve using Python

I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st order function (green). The x-axis is days since the beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough and I'm unaware of any methods of nonlinear regression that would work in this instance.
This solution isn't accurate enough; for example, if I run:
import numpy
import matplotlib.pyplot as plt

x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x, y, 1)   # first-degree (linear) fit
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x, y, '.', x_new, y_new, '-')
EDIT:
I have now tried curve_fit, using a logarithmic function, since the curve and data behaviour seem to conform to one:
from scipy.optimize import curve_fit

def func(x, a, b):
    return a*numpy.log(x) + b

x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot(future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting your actual data well enough, then either:
You are using the function wrong, e.g. you are using 1st order polynomials - so if you are convinced that it is a polynomial, then try higher order polynomials.
You are using the wrong function; it is always worth taking a look at:
your data curve, and
what you know about the process that is generating the data,
to come up with some speculation/theory/guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them!
Finally, if you are not getting a consistent long-term trend, then you might be able to justify using cubic splines; a sketch follows below.
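A minimal sketch of that spline idea, using scipy's UnivariateSpline (a smoothing cubic spline, chosen here over an exact interpolating spline because daily data is noisy; the array names, the synthetic series and the smoothing factor are made up for the example, and whether spline extrapolation is justified depends entirely on how stable the trend is near the end of the data):
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical stand-ins for dfnew["days_since"] and dfnew["nonbrand"].
days = np.arange(1, 600, dtype=float)
values = 50 * np.log(days) + np.random.normal(0, 2, days.size)

# k=3 gives a cubic spline; larger s means more smoothing, less noise-chasing.
spline = UnivariateSpline(days, values, k=3, s=days.size * 4.0)

future_days = np.arange(600, 660, dtype=float)
predicted = spline(future_days)  # ext=0 (default) extrapolates past the data
print(predicted[:5])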

Python LeastSquares plot

I have to draw a plot using the least squares method in Python 3. I have lists of x and y values:
y = [186,273,308,484]
x = [2.25,2.34,2.47,2.56]
There are many more values for x and for y; this is only an excerpt. Now, I know that f(x) = y should be a linear function. I can get the coefficients "a" and "b" of this function by calculating:
delta_x = x[-1] - x[0] and delta_y = y[-1] - y[0]
etc., using the tangent. I know how to do that.
But there are also uncertainties in y, about 2 percent of y. So I have a y_errors table, which contains all the uncertainties of y.
But what now? How can I do a least-squares fit?
Of course I have used Google; I saw docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#least-square-fitting-leastsq, but there are some problems.
I tried to adapt the example from scipy.org to my own purpose. I edited the x, y and y_meas variables, putting in my own lists, but now I don't know what the p0 variable in this example is, or what I should edit to make my example work.
Of course I can also edit the residuals function; it must take only one variable, y_true. In addition, I don't understand the arguments of the leastsq function.
Sorry for my English and for asking such a newbie question, but I don't understand this method. Thank you in advance.
I believe you are trying to fit a set of {x, y} values (and possibly sigma_y, the uncertainties in y) to a linear expression. This is known as linear regression, and for linear regression (or indeed, for regression with any polynomial) you can use numpy's polyfit. The uncertainties can be used for the weights:
weight = 1/sigma_y
where sigma_y is the standard deviation in y (a short sketch follows below).
The least-squares routines in scipy.optimize allow you to fit a non-linear function to data, but you have to write the function that computes the "residual" (data - model) in terms of the variables that are to be adjusted in order to minimize the calculated residual.
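A minimal sketch of the weighted polyfit route, using the four sample points above and the ~2% uncertainties described in the question (the 0.02 factor is taken from that description):
import numpy as np

y = np.array([186.0, 273.0, 308.0, 484.0])
x = np.array([2.25, 2.34, 2.47, 2.56])
y_errors = 0.02 * y                    # ~2% uncertainties, per the question

# np.polyfit weights each point by w, so points with smaller
# uncertainty should get larger weight: w = 1/sigma_y.
a, b = np.polyfit(x, y, 1, w=1.0/y_errors)
print(f"f(x) = {a:.1f}*x + {b:.1f}")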
