Least Squares method in practice - python

A very simple regression task. I have three variables x1, x2, x3 with some random noise, and I know the target equation: y = q1*x1 + q2*x2 + q3*x3. Now I want to find the target coefficients q1, q2, q3, and evaluate the performance of the prediction methods using the mean Relative Squared Error (RSE), (Prediction/Real - 1)^2.
From my research, I see that this is an ordinary least squares problem, but I can't work out from the examples on the internet how to solve this particular problem in Python. Let's say I have this data:
import numpy as np
sourceData = np.random.rand(1000, 3)
koefs = np.array([1, 2, 3])
target = np.dot(sourceData, koefs)
(In real life the data are noisy, and the noise is not normally distributed.) How can I find these coefficients using a least squares approach in Python? Any library is fine.

@ayhan made a valuable comment.
There is also a problem with your code: there is actually no noise in the data you collect. The input data is noisy, but after the multiplication you don't add any additional noise.
I've added some noise to your measurements and used the least squares formula to fit the parameters. Here's my code:
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
estimated_theta = np.linalg.inv(data.T @ data) @ data.T @ noisy_measurements
The estimated_theta will be close to true_theta. If you don't add noise to the measurements, they will be equal.
I've used the Python 3 matrix multiplication operator, @.
You could use np.dot instead of @.
That makes the code longer, so I've split the formula:
MTM_inv = np.linalg.inv(np.dot(data.T, data))
MTy = np.dot(data.T, noisy_measurements)
estimated_theta = np.dot(MTM_inv, MTy)
You can read up on least squares here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#The_general_problem
Or you could just use the built-in least squares function, which returns the solution as the first element of a tuple:
estimated_theta = np.linalg.lstsq(data, noisy_measurements, rcond=None)[0]
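As a minimal sketch of how the two routes compare, the snippet below solves the same problem with np.linalg.lstsq and with the normal equations. The fixed seed and the zero-mean Gaussian noise are assumptions made for this illustration: np.random.rand noise has mean 0.5, which shifts the estimates when the model has no intercept term.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
data = rng.random((1000, 3))
true_theta = np.array([1.0, 2.0, 3.0])
# Zero-mean Gaussian noise (an assumption; the answer above uses uniform noise in [0, 1))
noisy = data @ true_theta + rng.normal(scale=0.1, size=1000)

# lstsq returns (solution, residual sum of squares, rank, singular values)
theta_lstsq, rss, rank, sing_vals = np.linalg.lstsq(data, noisy, rcond=None)

# Normal equations, solved without forming an explicit inverse
theta_normal = np.linalg.solve(data.T @ data, data.T @ noisy)

print(theta_lstsq, theta_normal, rank)
```

Using np.linalg.solve instead of np.linalg.inv is a common choice for numerical stability; for a well-conditioned 3-column problem like this one, both routes should agree to rounding error.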

In addition to @lhk's answer: I have found the great SciPy least_squares function. It makes it easy to get the requested behaviour.
We can provide a custom function that returns the residuals, and so minimise the Relative Squared Error instead of the absolute squared difference:
import numpy as np
from scipy.optimize import least_squares
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
# noisy_measurements[-1] = data[-1] @ (1000 * true_theta)  # uncomment this outlier to see how much better the Relative Squared Error estimator works than the default absolute difference in this case
def my_func(params, x, y):
    res = (x @ params) / y - 1  # if we change this line to (x @ params) - y, we get the same result as np.linalg.lstsq
    return res
x0 = np.ones(3)  # initial guess; x0 was left undefined in the original snippet, so this value is an assumption
res = least_squares(my_func, x0, args=(data, noisy_measurements))
estimated_theta = res.x
We can also pass a custom loss through the loss argument; it processes the residuals to form the final cost.
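A rough sketch of the loss argument in use, with one of the built-in robust losses ("soft_l1") next to the default. The data set, the shift of the inputs away from zero, the outlier and the f_scale value are all made up for this illustration.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
data = rng.random((1000, 3)) + 0.5         # shifted so y stays away from zero (assumption)
true_theta = np.array([1.0, 2.0, 3.0])
y = data @ true_theta + rng.normal(scale=0.05, size=1000)
y[-1] *= 1000.0                            # one gross outlier

def relative_residuals(params, x, y):
    # Prediction/Real - 1, as in the Relative Squared Error
    return (x @ params) / y - 1.0

x0 = np.ones(3)                            # arbitrary starting point
fit_plain = least_squares(relative_residuals, x0, args=(data, y))
fit_robust = least_squares(relative_residuals, x0, args=(data, y),
                           loss="soft_l1", f_scale=0.1)

print(fit_plain.x)
print(fit_robust.x)
```

With relative residuals the outlier contributes a residual of about -1 at most, so it has little leverage even under the default loss; the robust loss matters more when the residuals are absolute differences.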


Sum of Gaussian random variables using python

Given two independent Gaussian variables X and Y with probability density functions pdf1 and pdf2, I want to calculate the distribution of Z = X + Y, i.e. Z ~ PDF(Z).
The probability density function of Z is given by the convolution of pdf1 and pdf2.
I have taken the code base (see scipy - Python: How to get the convolution of two continuous distributions? - Stack Overflow) and adapted it.
First, I tested the solution with mean=0 and sigma²=1 for both pdf1 and pdf2. I got the correct solution.
E(Z)=E(X)+E(Y)=0 and Var(Z)=Var(X)+Var(Y)=2
Second, I tested the solution with mean=2 and sigma²=8 for both pdf1 and pdf2. I got an approximate solution with large errors: the result was E(Z)=3.21 and Var(Z)=12.21, but the expected values are E(Z)=E(X)+E(Y)=4.0 and Var(Z)=Var(X)+Var(Y)=16.0.
The critical part of the code is the convolution of pmf1 and pmf2. The sum of the convolved PDF should be 1.0 and not 0.93.
Hint: I used a reference implementation based on the "openturns" library to verify my results.
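import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.stats import norm
#given two independent gaussian variables X,Y; calculate Z = X + Y ~ PDF(Z)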
delta = 1e-4
big_grid = np.arange(-10,10,delta)
mean = 2 #E(X)=E(Y)=2
std = np.sqrt(8) #Var(X)=Var(Y)=8
X = norm(loc=mean, scale=std)
Y = norm(loc=mean, scale=std)
pmf1 = X.pdf(big_grid)*delta
print("Sum of gaussian pmf1: "+str(sum(pmf1)))
pmf2 = Y.pdf(big_grid)*delta
print("Sum of gaussian pmf2: "+str(sum(pmf2)))
conv_pmf = signal.fftconvolve(pmf1,pmf2,'same') #convolution of pmf1 and pmf2
print("Sum of convoluted pmf: "+str(sum(conv_pmf)))
pdf1 = pmf1/delta
pdf2 = pmf2/delta
conv_pdf = conv_pmf/delta
print("Integration of convoluted pdf: " + str(np.trapz(conv_pdf, big_grid)))
plt.plot(big_grid, pdf1, label='Gaussian PDF1')
plt.plot(big_grid, pdf2, label='Gaussian PDF2')
plt.plot(big_grid, conv_pdf, label='Sum')
plt.legend(loc='best'), plt.suptitle('PDFs')
Mean and variance of convoluted PDF
#E(Z)=E(X)+E(Y); Var(Z)=Var(X)+Var(Y); if E(X)=E(Y)=2 and Var(X)=Var(Y)=8 it follows E(Z)=4 and Var(Z)=16
E_Z = (big_grid * conv_pmf).sum(); E_Z #E(Z) = Σ z . P(z): sum(z[j] * p(z[j])) expected: E(Z)=4
E_Z_squared = (big_grid**2 * conv_pmf).sum(); E_Z_squared #E(Z²) = Σ z² . P(z): sum(z[j]² * p(z[j]))
Var_Z = E_Z_squared - (E_Z)**2; Var_Z #Var(Z) = E(Z²) - E(Z)²; expected: Var(Z)=16
This is the output I get.
Sum of gaussian pmf1: 0.9976499589626819
Sum of gaussian pmf2: 0.9976499589626819
Sum of convoluted pmf: 0.9321607580277965
Integration of convoluted pdf: 0.9321591482687606
E_Z = 3.210819533318452
E_Z_squared = 22.52303025237063
Var_Z = 12.21366817683131
So what is going wrong here? How can I adapt the code to get correct results?
The results you have now are fine. There is no reason to expect the sums you are printing here to equal 1. Although the integral of the PDF over the entire support (from negative to positive infinity) is 1, this need not hold for the discretised version, because it is an approximation and covers only a finite range.
Remember also that your grid is arange(-10, 10, delta), and that a significant proportion of the total probability of norm(4, 4) lies outside that range.
Luckily, you know the distribution of the sum of normal variables, so you can check your results yourself using the CDF of the true distribution.
from scipy import stats
def realcdf(x):
    return stats.norm(loc = 4, scale = 4).cdf(x)
print("Supposed to be: " + str(realcdf(max(big_grid)) - realcdf(min(big_grid))))
With output:
Supposed to be: 0.9329569316499936
Which is not 1. In fact the fftconvolve approximation is quite close. Errors arising from floating point arithmetic and the discretisation onto the grid likely account for the relatively small difference between the two.
As for the statistics at the end, enlarging the grid should help. For example, the grid
big_grid = np.arange(-20,20,delta)
produces statistics closer to the truth:
E_Z = 3.9994379102826576
Var_Z = 15.991432657282482
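One way to avoid losing mass at the edges is to keep the full convolution and build the matching grid for Z explicitly. The sketch below does that under some assumptions chosen for illustration: a coarser delta than in the question (to keep it fast) and a grid of [-30, 30).

```python
import numpy as np
from scipy import signal
from scipy.stats import norm

delta = 1e-2                               # coarser than the 1e-4 above, to keep the run fast
lo, hi = -30.0, 30.0                       # wide enough to hold essentially all of the mass
grid = np.arange(lo, hi, delta)

mean, std = 2.0, np.sqrt(8.0)              # E(X)=E(Y)=2, Var(X)=Var(Y)=8
pmf1 = norm(loc=mean, scale=std).pdf(grid) * delta
pmf2 = norm(loc=mean, scale=std).pdf(grid) * delta

# mode="full" keeps every term of the convolution; entry k belongs to z = 2*lo + k*delta
conv_pmf = signal.fftconvolve(pmf1, pmf2, mode="full")
z_grid = 2 * lo + delta * np.arange(conv_pmf.size)

total = conv_pmf.sum()
E_Z = (z_grid * conv_pmf).sum()
Var_Z = (z_grid**2 * conv_pmf).sum() - E_Z**2
print(total, E_Z, Var_Z)
```

With mode="same", the result is cut back to the length of the input grid, so any part of Z that falls outside the original range is dropped; the full mode trades a longer output for keeping that mass.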

Solving coupled differential equations with sympy

I am trying to solve the following system of first order coupled differential equations:
- dR/dt=k2*Y(t)-k1*R(t)*L(t)+k4*X(t)-k3*R(t)*I(t)
- dL/dt=k2*Y(t)-k1*R(t)*L(t)
- dI/dt=k4*X(t)-k3*R(t)*I(t)
- dX/dt=k1*R(t)*L(t)-k2*Y(t)
- dY/dt=k3*R(t)*I(t)-k4*X(t)
The known initial conditions are: X(0)=0, Y(0)=0
The known constant values are: k1=6500, k2=0.9
These equations define a kinetic model, and I need to solve them to get the function Y(t), fit it to my data, and find the values of k3 and k4. To do that, I have tried to solve the system symbolically with sympy. Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import sympy
from sympy.solvers.ode.systems import dsolve_system
from scipy.integrate import solve_ivp
from scipy.integrate import odeint
k1 = sympy.Symbol('k1', real=True, positive=True)
k2 = sympy.Symbol('k2', real=True, positive=True)
k3 = sympy.Symbol('k3', real=True, positive=True)
k4 = sympy.Symbol('k4', real=True, positive=True)
t = sympy.Symbol('t',real=True, positive=True)
L = sympy.Function('L')
R = sympy.Function('R')
I = sympy.Function('I')
Y = sympy.Function('Y')
X = sympy.Function('X')
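# The definition of Sys was not included in the post; it is reconstructed here from the equations listed above
Sys = [sympy.Eq(R(t).diff(t), k2*Y(t)-k1*R(t)*L(t)+k4*X(t)-k3*R(t)*I(t)),
       sympy.Eq(L(t).diff(t), k2*Y(t)-k1*R(t)*L(t)),
       sympy.Eq(I(t).diff(t), k4*X(t)-k3*R(t)*I(t)),
       sympy.Eq(X(t).diff(t), k1*R(t)*L(t)-k2*Y(t)),
       sympy.Eq(Y(t).diff(t), k3*R(t)*I(t)-k4*X(t))]
solsys=dsolve_system(eqs=Sys,funcs=[X(t),Y(t),R(t),L(t),I(t)], t=t, ics={Y(0):0, X(0):0})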
This is the result:
The system of ODEs passed cannot be solved by dsolve_system.
I have tried dsolve too, but I get the same result.
Is there any other solver I can use, or some other way of doing this, that will give me the function for the fitting? I'm using Python 3.8 in Spyder with Anaconda on 64-bit Windows.
Thank you!
# Update
You are saying "experiment". So you have data and want to fit the model to them, find appropriate values for k3 and k4 at least, and perhaps for all coefficients and the initial conditions (the first measured data point might not be the initial condition for the best fit)? See stackoverflow.com/questions/71722061/… for a recent attempt on such a task. (Lutz Lehmann)
Here is my new code:
def derivative(S, t, k3, k4):
    x, y,r,l,i = S
    doty = k1*r*l+k2*y
    dotx = k3*r*i-k4*x
    dotr = k2*y-k1*r*l+k4*x-k3*r*i
    dotl = k2*y-k1*r*l
    doti = k4*x-k3*r*i
    return np.array([doty, dotx, dotr, dotl, doti])
def solver(XY,t,para):
    return odeint(derivative, XY, t, args = para, atol=1e-8, rtol=1e-11)
def integration(XY_arr,*para):
    XY0 = para[:5]
    para = para[5:]
    T = np.arange(len(XY_arr))
    res0 = solver(XY0,T, para)
    res1 = [ solver(XY0,[t,t+1],para)[-1]
             for t,XY in enumerate(XY_arr[:-1]) ]
    return np.concatenate([res0,res1]).flatten()
XData =yexp
YData = np.concatenate([ yexp,yexp,yexp,yexp,yexp,yexp[1:],yexp[1:],yexp[1:],yexp[1:],yexp[1:]]).flatten()
p0 =[0,0,100,10,10,1e8,0.01]
params, info = curve_fit(integration,XData,YData,p0=p0, maxfev=5000)
XY0, para = params[:5], params[5:]
t_plot = np.linspace(0,len(t),500)
x_plot = solver(XY0, t_plot, tuple(para))
But the output is not correct, as it is the same as the initial guess p0:
[ 0. 0. 100. 10. 10.] (100000000.0, 0.01)
I understand that the function 'integration' returns packed values of y for each function at each instant of time, but I don't know how to unpack them so that curve_fit treats them separately. Maybe I don't quite understand how it works.
Thank you!
As you observed, sympy is not able to solve this system. This might mean that
- the procedure to classify ODEs in sympy is not complete enough, or
- some trick or method is needed beyond the standard set of methods implemented in sympy, or
- there simply is no symbolic solution.
The last case is the generic one: take a symbolically solvable ODE, add some random term, and almost certainly the resulting ODE is no longer symbolically solvable.
As I understand from the comments, you have a model given by an ODE system with state space (cX,cY,cR,cL,cI) and 4 parameters k1,k2,k3,k4. By the structure of the reaction system R+I <-> X, R+L <-> Y, the sums cR+cX+cY, cL+cY, cI+cX are all constant.
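The three conserved sums can be checked numerically. The sketch below uses the same right-hand side as the model function of this answer and borrows the initial state and rate constants from its test data; these values are illustrative, not fitted.

```python
import numpy as np
from scipy.integrate import solve_ivp

def model(t, u, k1, k2, k3, k4):
    # Reaction system R+I <-> X, R+L <-> Y
    X, Y, R, L, I = u
    dL = k2 * Y - k1 * R * L
    dI = k4 * X - k3 * R * I
    return -dI, -dL, dL + dI, dL, dI

# Illustrative values only (assumed for this sketch)
u0 = [0.0, 0.0, 50.0, 40.0, 25.0]
k = (6.5, 0.9, 5.4, 0.7)
t = np.linspace(0.0, 0.05, 21)

sol = solve_ivp(model, [t[0], t[-1]], u0, args=k, t_eval=t,
                method="DOP853", atol=1e-7, rtol=1e-10)
X, Y, R, L, I = sol.y

# Each row should stay at its initial value for all times
invariants = np.vstack([R + X + Y, L + Y, I + X])
drift = np.abs(invariants - invariants[:, :1]).max()
print("largest drift in any conserved sum:", drift)
```

A drift far above the solver tolerances would point to a sign or pairing error in the right-hand side, which makes this a cheap sanity check before any fitting.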
For some other process that is approximately represented by the model, you have time series data t[k],y[k] for the Y component. You also have partial information on the initial state and the parameter set. If there are sufficiently many data points, one could also ignore this information, fit all parameters, and compare how far the given parameters are from the computed ones.
There are several modules and packages that solve this fitting task in a more or less abstract fashion. I think pyomo and gekko can both be used. More directly one can use the facilities of scipy.odr or scipy.optimize.
Define the forward function that maps the times and parameters to the solution:
import numpy as np
from scipy.integrate import solve_ivp
def model(t,u,k1,k2,k3,k4):
    X,Y,R,L,I = u
    dL = k2*Y - k1*R*L
    dI = k4*X - k3*R*I
    dR = dL+dI
    dX = -dI
    dY = -dL
    return dX,dY,dR,dL,dI
def solver(t,u0,k):
    res = solve_ivp(model, [0, t[-1]], u0, args=tuple(k), t_eval=t,
                    method="DOP853", atol=1e-7, rtol=1e-10)
    return res.y
Prepare some data plus noise
k1o = 6.500; k2o=0.9
T = np.linspace(0,0.05,21)
U = solver(T, [0,0,50,40,25], [k1o, k2o, 5.400, 0.7])
Y = U[1] # equilibrium slightly above 30
Y += np.random.uniform(high=0.05, size=Y.shape)
Prepare the function that splits the combined parameter vector in initial state and coefficients, call the curve fitting function
from scipy.optimize import curve_fit
def partial(t,R,L,I,k3,k4):
    U = solver(t,[0,0,R,L,I],[k1o,k2o,k3,k4])
    return U[1]
params, info = curve_fit(partial,T,Y, p0=[30,20,10, 0.3,3.000])
R,L,I, k3,k4 = params
print(R,L,I, k3,k4)
It turns out that curve_fit goes into strange regions with large negative values. A likely reason is that the Y component is, in the end, not coupled strongly enough to all the other components, meaning that large changes in some of the parameters have minimal influence on Y, so that minimal noise in Y can lead to large deviations in these parameters. Here this apparently happens (first) to k3.
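One way to probe this explanation is to look at how strongly the Y trajectory reacts to each fitted parameter. The following is a rough finite-difference sketch, not part of the original fit; the reference point, the relative step size and the helper name y_of are assumptions chosen for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def model(t, u, k1, k2, k3, k4):
    X, Y, R, L, I = u
    dL = k2 * Y - k1 * R * L
    dI = k4 * X - k3 * R * I
    return -dI, -dL, dL + dI, dL, dI

K1, K2 = 6.5, 0.9                          # held fixed, as in the fit above

def y_of(t, p):
    # p = (R0, L0, I0, k3, k4); returns the Y component at the times t
    R0, L0, I0, k3, k4 = p
    sol = solve_ivp(model, [t[0], t[-1]], [0.0, 0.0, R0, L0, I0],
                    args=(K1, K2, k3, k4), t_eval=t,
                    method="DOP853", atol=1e-9, rtol=1e-10)
    return sol.y[1]

t = np.linspace(0.0, 0.05, 21)
p_ref = np.array([50.0, 40.0, 25.0, 5.4, 0.7])   # values used to generate the test data
y_ref = y_of(t, p_ref)

# Column j: change in Y per unit relative change of parameter j (forward differences)
rel_step = 1e-4
J = np.empty((t.size, p_ref.size))
for j in range(p_ref.size):
    p = p_ref.copy()
    p[j] *= 1.0 + rel_step
    J[:, j] = (y_of(t, p) - y_ref) / rel_step

for name, col in zip(["R0", "L0", "I0", "k3", "k4"], J.T):
    print(f"{name}: {np.linalg.norm(col):.3g}")
print("singular values:", np.linalg.svd(J, compute_uv=False))
```

Columns with a small norm, or a very small trailing singular value, would indicate parameter directions that the Y data hardly constrain. In that situation, passing bounds to curve_fit or fixing some of the parameters is a common remedy.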

Are these functions equivalent?

I am building a neural network that makes use of t-distributed noise. I am using NumPy's np.random.standard_t and TensorFlow's tf.distributions.StudentT, as shown below:
a = np.random.standard_t(df=3, size=10000) # numpy's function
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the Tensorflow implementation, there's a parameter called scale whose description reads
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks
I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
    # The sampling method comes from the fact that if:
    #   X ~ Normal(0, 1)
    #   Z ~ Chi2(df)
    #   Y = X / sqrt(Z / df)
    # then:
    #   Y ~ StudentT(df).
    seed = seed_stream.SeedStream(seed, "student_t")
    shape = tf.concat([[n], self.batch_shape_tensor()], 0)
    normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
    df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
    gamma_sample = tf.random.gamma([n],
                                   0.5 * df,
                                   beta=0.5,
                                   dtype=self.dtype,
                                   seed=seed())
    samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
    return samples * self.scale + self.loc  # Abs(scale) not wanted.
So it is a standard normal sample divided by the square root of a chi-square sample with parameter df divided by df. The chi-square sample is taken as a gamma sample with parameter 0.5 * df and rate 0.5, which is equivalent (chi-square is a special case of gamma). The scale value, like the loc, only comes into play in the last line, as a way to "relocate" the distribution sample at some point and scale. When scale is one and loc is zero, they do nothing.
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
  double num, denom;

  num = legacy_gauss(aug_state);
  denom = legacy_standard_gamma(aug_state, df / 2);
  return sqrt(df / 2) * num / sqrt(denom);
}
So essentially the same thing, slightly rephrased. Here we also have a gamma with shape df / 2, but it is standard (rate one). However, the missing 0.5 now appears in the numerator, as the / 2 within the sqrt. So it's just moving the numbers around. Here there is no scale or loc, though.
In truth, the difference is that in the case of TensorFlow the distribution is really a location-scale t-distribution, that is, a Student's t shifted by loc and multiplied by scale. A simple empirical check that they are the same for loc=0.0 and scale=1.0 is to plot histograms for both distributions and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
t_np = np.random.standard_t(df=3, size=10000)
with tf.Graph().as_default(), tf.Session() as sess:
    t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
    t_tf = sess.run(t_dist.sample(10000))
plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.show()
That looks pretty close. Obviously, from the point of view of statistical samples, this is not any kind of proof. If you are still not convinced, there are statistical tools for testing whether a sample comes from a certain distribution or whether two samples come from the same distribution.
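One such tool is the two-sample Kolmogorov-Smirnov test in scipy.stats. The sketch below does not call TensorFlow; as an assumption made to keep it self-contained, it reproduces the Normal / sqrt(Chi2(df) / df) construction from the TensorFlow source comment with NumPy and compares the result against NumPy's own sampler.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
df, n = 3.0, 10_000

# NumPy's own sampler
t_np = rng.standard_t(df, size=n)

# The construction from the TensorFlow source comment, rebuilt with NumPy:
# a Gamma with shape df/2 and rate 1/2 (scale 2) is a Chi2(df)
normal = rng.standard_normal(n)
chi2 = rng.gamma(shape=0.5 * df, scale=2.0, size=n)
t_manual = normal / np.sqrt(chi2 / df)

result = stats.ks_2samp(t_np, t_manual)
print(result.statistic, result.pvalue)
```

A small statistic together with a p-value that is not tiny is consistent with both samples coming from the same distribution; as with the histogram, this is evidence rather than proof.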

How to fit experimental data in Python to inverse trigonometric function with limited definition area using scipy.curve_fit?

I am trying to fit some experimental data to a nonlinear function with one parameter. The function includes an arccosine, whose domain is therefore limited to [-1, 1]. I use SciPy's curve_fit to find the parameter of the function, but it raises the following error:
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 400.
The function I want to fit is this one:
def fitfunc(x, a):
    y = np.rad2deg(np.arccos(x*np.cos(np.deg2rad(a))))
    return y
For the fitting, I provide NumPy arrays for x and y, which contain values in degrees (which is why the function converts to and from radians).
param, param_cov = curve_fit(fitfunc, xs, ys)
When I use other fit functions, for example a polynomial, curve_fit returns some values; the error mentioned above only occurs with this function, which includes an arccosine.
I suspect that it cannot fit the data points because, depending on the parameter, some data points do not lie inside the domain of the arccosine. I have tried raising the number of iterations (maxfev), but without success.
Sample data:
ys = np.array([113.46125, 129.4225, 140.88125, 145.80375, 145.4425,
146.97125, 97.8025, 112.91125, 114.4325, 119.16125,
130.13875, 134.63125, 129.4375, 141.99, 139.86,
138.77875, 137.91875, 140.71375])
xs = np.array([2.786427013, 3.325624466, 3.473013087, 3.598247534, 4.304280248,
4.958273121, 2.679526725, 2.409388637, 2.606306639, 3.661558062,
4.569923009, 4.836843789, 3.377013596, 3.664550526, 4.335401233,
3.064199519, 3.97155254, 4.100567011])
As HS-nebula mentioned in his comments, you need to define an initial value a0 of a as a starting guess for the curve fitting. Moreover, you need to be careful when choosing a0, as np.arccos() is only defined on [-1, 1] and choosing the wrong a0 results in an error.
import numpy as np
from scipy.optimize import curve_fit
ys = np.array([113.46125, 129.4225, 140.88125, 145.80375, 145.4425, 146.97125,
97.8025, 112.91125, 114.4325, 119.16125, 130.13875, 134.63125,
129.4375, 141.99, 139.86, 138.77875, 137.91875, 140.71375])
xs = np.array([2.786427013, 3.325624466, 3.473013087, 3.598247534, 4.304280248, 4.958273121,
2.679526725, 2.409388637, 2.606306639, 3.661558062, 4.569923009, 4.836843789,
3.377013596, 3.664550526, 4.335401233, 3.064199519, 3.97155254, 4.100567011])
def fit_func(x, a):
    a_in_rad = np.deg2rad(a)
    cos_a_in_rad = np.cos(a_in_rad)
    arcos_xa_product = np.arccos( x * cos_a_in_rad )
    return np.rad2deg(arcos_xa_product)
a0 = 80
param, param_cov = curve_fit(fit_func, xs, ys, a0, bounds = (0, 360))
print('Using curve we retrieve a value of a = ', param[0])
Using curve we retrieve a value of a = 100.05275506147824
However, if you choose a0=60, you get the following error:
ValueError: Residuals are not finite in the initial point.
To be able to use the data with all possible values of a, a normalization as HS-nebula suggested is a good idea.
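The range of admissible a values follows directly from the data: arccos needs |x * cos(a)| <= 1 for every x, so |cos(a)| <= 1 / max|x|. The sketch below computes that interval and passes it to curve_fit as bounds. The margin and the starting value 95 are arbitrary choices for this illustration, and only the interval around 90 degrees is considered (a mirrored interval exists around 270 degrees).

```python
import numpy as np
from scipy.optimize import curve_fit

ys = np.array([113.46125, 129.4225, 140.88125, 145.80375, 145.4425, 146.97125,
               97.8025, 112.91125, 114.4325, 119.16125, 130.13875, 134.63125,
               129.4375, 141.99, 139.86, 138.77875, 137.91875, 140.71375])
xs = np.array([2.786427013, 3.325624466, 3.473013087, 3.598247534, 4.304280248, 4.958273121,
               2.679526725, 2.409388637, 2.606306639, 3.661558062, 4.569923009, 4.836843789,
               3.377013596, 3.664550526, 4.335401233, 3.064199519, 3.97155254, 4.100567011])

def fit_func(x, a):
    return np.rad2deg(np.arccos(x * np.cos(np.deg2rad(a))))

# Admissible interval around 90 degrees, from |cos(a)| <= 1 / max|x|
a_lo = np.rad2deg(np.arccos(1.0 / np.abs(xs).max()))
a_hi = 180.0 - a_lo
print("admissible a (degrees):", a_lo, a_hi)

margin = 0.01                              # stay slightly inside the interval
param, param_cov = curve_fit(fit_func, xs, ys, p0=[95.0],
                             bounds=(a_lo + margin, a_hi - margin))
print("fitted a:", param[0])
```

This also explains the behaviour seen above: a0=80 lies inside the interval, while a0=60 lies outside it and yields NaN residuals at the initial point.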

The shape variable in pymc3.DensityDist does not work properly

I am trying to define a multivariate custom distribution through pymc3.DensityDist(); however, I keep getting the following error that dimensions do not match:
"LinAlgError: 0-dimensional array given. Array must be two-dimensional"
I have already seen https://github.com/pymc-devs/pymc3/issues/535 but I could not find the answer to my question. Just for clarity, here is my simple example
import numpy as np
import pymc3 as pm
def pdf(x):
    y = 0
    sigma = np.identity(2)
    isigma = sigma
    mu = np.array([[1,2],[3,4]])
    for i in range(2):
        x0 = x- mu[i,:]
        xsinv = np.linalg.multi_dot([x0,isigma,x0])
        y = y + np.exp(-0.5*xsinv)
    return y
logp = lambda x: np.log(pdf(x))
with pm.Model() as model:
    pm.DensityDist('x',logp, shape=2)
    step = pm.Metropolis(tune=False, S=np.identity(2))
    trace = pm.sample(100000, step=step, chain=1, tune=0,progressbar=False)
result = trace['x']
In this simple code I want to define an unnormalized pdf, which is the sum of two unnormalized normal distributions, and draw samples from this pdf with the Metropolis algorithm.
Try replacing NumPy with Theano (import theano.tensor as tt) in the following lines:
xsinv = tt.dot(tt.dot(x0, isigma), x0)
y = y + tt.exp(-0.5 * xsinv)
As a side note, try using NUTS instead of Metropolis and let PyMC3 choose the sampling method for you; just do
trace = pm.sample(1000)
For future reference you can also ask questions here
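As a library-free cross-check of what the sampler should produce, here is a plain NumPy random-walk Metropolis sketch for the same unnormalized density. It is not PyMC3 code; the step size, seed, starting point and burn-in length are arbitrary assumptions.

```python
import numpy as np

MU = np.array([[1.0, 2.0], [3.0, 4.0]])    # component means from the question

def log_pdf(x):
    # log( exp(-0.5*|x-mu1|^2) + exp(-0.5*|x-mu2|^2) ), computed stably
    sq = ((x - MU) ** 2).sum(axis=1)
    return np.logaddexp(-0.5 * sq[0], -0.5 * sq[1])

def metropolis(n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    lp = log_pdf(x)
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        prop = x + step * rng.standard_normal(2)   # symmetric Gaussian proposal
        lp_prop = log_pdf(prop)
        # Accept with probability min(1, p(prop) / p(x))
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out[i] = x
    return out

samples = metropolis(30_000)[5_000:]       # drop the first 5000 draws as burn-in
print(samples.mean(axis=0))
```

Since both components have equal weight, the sample mean should land near the midpoint of the two means, which gives a quick target to compare the PyMC3 trace against.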

