How to calculate the intercept using numpy.linalg.lstsq

After running a multiple linear regression using numpy.linalg.lstsq I get 4 arrays, as described in the documentation. However, it is not clear to me how to get the intercept value. Does anyone know how? I'm new to statistical analysis.
Here is my model:
X1 = np.array(a)
X2 = np.array(b)
X3 = np.array(c)
X4 = np.array(d)
X5 = np.array(e)
X6 = np.array(f)
X1l = np.log(X1)
X2l = np.log(X2)
X3l = np.log(X3)
X6l = np.log(X6)
Y = np.array(g)
A = np.column_stack([X1l, X2l, X3l, X4, X5, X6l, np.ones(len(a), float)])
result = np.linalg.lstsq(A, Y)
This is a sample of what my model is generating:
(array([ 654.12744154, -623.28893569, 276.50269246, 11.52493817,
49.92528734, -375.43282832, 3852.95023087]), array([ 4.80339071e+11]),
7, array([ 1060.38693842, 494.69470547, 243.14700033, 164.97697748,
58.58072929, 19.30593045, 13.35948642]))
I believe the intercept is the second array, but I'm not sure about that, as its value seems far too high.

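The intercept is the coefficient that corresponds to the column of ones. The second array you mention is the sum of squared residuals, not the intercept. Since the column of ones is the last column of A, the intercept in this case is: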
result[0][6]
To make it clearer to see, consider your regression, which is something like:
y = c1*x1 + c2*x2 + c3*x3 + c4*x4 + m
written in matrix form as:
[[y1],      [[x1_1, x2_1, x3_1, x4_1, 1],      [[c1],
 [y2],       [x1_2, x2_2, x3_2, x4_2, 1],       [c2],
 [y3],   =   [x1_3, x2_3, x3_3, x4_3, 1],   *   [c3],
 ...         ...                                [c4],
 [yn]]       [x1_n, x2_n, x3_n, x4_n, 1]]       [m]]
or:
Y = A * C
where A is the so-called "coefficient" matrix (the design matrix, in regression terms) and C is the vector containing the solution of your regression. Note that m corresponds to the column of ones.
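As a minimal sketch of the point above, the following fits synthetic data whose intercept is known in advance and reads the intercept back from the last coefficient. The data, the variable names and the true coefficients (2.0, -1.5, 7.0) are made up for illustration and are not taken from the question.

```python
import numpy as np

# Synthetic, noise-free data (assumed for illustration only).
rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 100)
x2 = rng.uniform(1, 10, 100)
y = 2.0 * np.log(x1) - 1.5 * x2 + 7.0

# The column of ones goes last, so the intercept is the last coefficient.
A = np.column_stack([np.log(x1), x2, np.ones(len(x1))])

# lstsq returns: coefficients, sum of squared residuals, rank, singular values.
coeffs, residuals, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)

intercept = coeffs[-1]
print("slopes:", coeffs[:-1])
print("intercept:", intercept)
```

If the column of ones were placed first instead, the intercept would be coeffs[0]; the position in the solution vector always follows the column order of A.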

Related

Solving matrix differential equations in python with odeint?

I'm trying to solve the system:
dP/dt = AP
where,
A is the matrix [ [1,2], [3,4]]
P is the matrix [ [P1, P2], [P3, P4] ]
with initial condition P0 = [P10, P20, P30, P40] = [1,2,3,5]
I've implemented two different methods in python but I'm getting different answers.
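The first is to split the system into 2 pairs of coupled ODEs and then find the eigenvalues and eigenvectors of A. These can then be substituted into the general solution for each pair, which (as used in the code below) has the form C1*exp(lambda1*t)*x1 + C2*exp(lambda2*t)*x2.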
Below is some of the code focusing on obtaining a solution for P1.
A = np.array([[1,2],[3,4]])
# extract eigenvalues
lambda1, lambda2, = np.linalg.eig(A)[0]
# extract eigenvectors
x1, x2 = np.linalg.eig(A)[1]
# combine eigenvectors and solve system at t=0 with initial condition to get constants
x1 = x1.reshape(2,1)
x2 = x2.reshape(2,1)
X = np.concatenate((x1,x2),axis=1)
P10 = 1
P30 = 3
Pinitial = np.array([P10, P30])
C1, C2 = np.linalg.solve(X, Pinitial)
# combine everything to get the solution to P1
t = np.linspace(0,3)
P1true = C1 * np.exp(lambda1 * t) * x1[0] + C2 * np.exp(lambda2 * t) * x2[0]
print(P1true)
My output to this implementation is the following:
[ 1.00000000e+00 4.90610966e-01 -1.96909212e-01 -1.13239194e+00
-2.41285490e+00 -4.17309019e+00 -6.60037624e+00 -9.95491931e+00
-1.45982562e+01 -2.10327188e+01 -2.99562678e+01 -4.23386834e+01
-5.95274276e+01 -8.33947363e+01 -1.16541999e+02 -1.62583733e+02
-2.26542159e+02 -3.15395444e+02 -4.38839463e+02 -6.10346242e+02
-8.48634612e+02 -1.17971363e+03 -1.63972185e+03 -2.27887232e+03
-3.16693407e+03 -4.40084834e+03 -6.11531116e+03 -8.49747721e+03
-1.18073904e+04 -1.64063715e+04 -2.27964611e+04 -3.16752246e+04
-4.40119010e+04 -6.11532094e+04 -8.49703621e+04 -1.18063333e+05
-1.64044684e+05 -2.27933923e+05 -3.16705455e+05 -4.40049940e+05
-6.11432160e+05 -8.49560890e+05 -1.18043123e+06 -1.64016232e+06
-2.27894028e+06 -3.16649669e+06 -4.39972081e+06 -6.11323640e+06
-8.49409776e+06 -1.18022094e+07]
My second implementation is to use scipy's odeint:
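import numpy as np
from scipy.integrate import odeint

# function that defines dP/dt
def model(P, t):
    A = np.array([[1, 2], [3, 4]])
    # rebuild the 2x2 matrix from the flat state vector
    P = np.array([[P[0], P[1]], [P[2], P[3]]])
    RHS = np.matmul(A, P)
    # flatten back to a length-4 list, as odeint expects
    RHS = RHS.reshape(1, 4)
    dPdt = RHS.tolist()
    return dPdt[0]

# initial condition
P0 = [1, 2, 3, 5]
# time points
t = np.linspace(0, 3)
# solve model
P = odeint(model, P0, t)
# print P1 solution
print(P[:, 0])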
For which I get the following output:
[1.00000000e+00 1.50619854e+00 2.20691051e+00 3.17795032e+00
4.52465783e+00 6.39339726e+00 8.98753443e+00 1.25896370e+01
1.75923199e+01 2.45411052e+01 3.41939721e+01 4.76041020e+01
6.62348466e+01 9.21194739e+01 1.28083127e+02 1.78051229e+02
2.47477996e+02 3.43941844e+02 4.77972676e+02 6.64201371e+02
9.22956950e+02 1.28248577e+03 1.78203503e+03 2.47613711e+03
3.44056260e+03 4.78059170e+03 6.64250701e+03 9.22956240e+03
1.28241709e+04 1.78187343e+04 2.47584791e+04 3.44009754e+04
4.77988372e+04 6.64146291e+04 9.22805262e+04 1.28220154e+05
1.78156830e+05 2.47541841e+05 3.43949539e+05 4.77904179e+05
6.64028793e+05 9.22641500e+05 1.28197351e+06 1.78125097e+06
2.47497704e+06 3.43888166e+06 4.77818858e+06 6.63910199e+06
9.22476676e+06 1.28174445e+07]
This isn't the same as the output of my first implementation. Can anyone see where I've gone wrong? I feel like the problem might be with my second implementation.
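One thing that may be worth checking in the first implementation: np.linalg.eig returns the eigenvectors as the columns of its second output, whereas unpacking with x1, x2 = ... iterates over the rows. The sketch below is only a cross-check under that reading, not a confirmed diagnosis. It builds P1 from the column eigenvectors and compares it with the matrix exponential from scipy.linalg.expm. The names eigvecs, P1_eig and P1_expm are introduced here for illustration.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[1.0, 2.0], [3.0, 4.0]])
P0 = np.array([[1.0, 2.0], [3.0, 5.0]])  # [[P10, P20], [P30, P40]]

# Eigenvectors are the COLUMNS of eigvecs: eigvecs[:, k] pairs with eigvals[k].
eigvals, eigvecs = np.linalg.eig(A)

# The first column of P, (P1, P3), obeys d/dt [P1, P3] = A [P1, P3].
# Solve eigvecs @ C = (P10, P30) for the constants.
C = np.linalg.solve(eigvecs, P0[:, 0])

t = np.linspace(0, 3)
# P1(t) is the first component of sum_k C_k * exp(lambda_k * t) * v_k.
P1_eig = (C[0] * np.exp(eigvals[0] * t) * eigvecs[0, 0]
          + C[1] * np.exp(eigvals[1] * t) * eigvecs[0, 1])

# Independent check: P(t) = expm(A * t) @ P0.
P1_expm = np.array([(expm(A * ti) @ P0)[0, 0] for ti in t])

print(np.allclose(P1_eig, P1_expm))
```

With the column convention, P1 stays positive and grows, which is the same qualitative behaviour as the odeint output listed in the question.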

Matplotlib x axis

I have the following issue with matplotlib. I have a NumPy matrix wp, and plotting its first column with plt.plot(wp[:, 0]) works flawlessly. The x-axis is then labelled 1, 2, 3, ..., 9 (for this example).
Instead, I would like the x-axis to show 0, 1, 2, 3, 4, that is, the current value of the second column. The size of wp varies, so I need a general solution.
wp=[[x1,0],
[x2,1],
[x3,1],
[x4,2],
[x5,2],
[x6,2],
[x7,3],
[x8,3],
[x9,4]]
Edit: Sorry, I made a big mistake when I lazily created the example matrix. The x-values are all different.
Edit 2: To be more precise: in this example, x1 should be plotted at position 0 with the x-tick 0. Then x2 should be to the right of x1 and have the x-tick 1. x3 should be to the right of x2 with no x-tick displayed. x4 should be to the right of x3 with the x-tick 2, and so forth. So the data should be plotted as plt.plot(wp[:, 0]) does, but the x-axis should show the regions in which the second column is 0, 1, 2, and so on.

import matplotlib.pyplot as plt
import numpy as np
# creating random values for x1,x2 and x3
x1 = 1
x2 = 2
x3 = 3
wp=[[x1,0],
[x2,1],
[x3,1],
[x1,2],
[x2,2],
[x3,2],
[x1,3],
[x2,3],
[x3,4]]
my_xticks = [x[1] for x in wp] # taking the second values from tuple
my_xticks = sorted(set(my_xticks)) # removing duplicates, keeping the ticks in ascending order
# In [16]: my_xticks
# Out[16]: [0, 1, 2, 3, 4] # values of my_xticks after removing duplicates
x = [x[0] for x in wp]
y = x
plt.xticks(my_xticks)
plt.plot(x, y)
plt.show()
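For the behaviour described in Edit 2 of the question (plot against the row index, and show a tick only where the second column changes), one possible sketch is below. The sample values in wp are invented for illustration, and first_idx is a helper name introduced here.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop for interactive use
import matplotlib.pyplot as plt

# Hypothetical data: first column holds the values, second column the group label.
wp = np.array([[5.0, 0], [3.2, 1], [4.1, 1], [2.7, 2], [6.3, 2],
               [1.9, 2], [3.8, 3], [4.4, 3], [5.6, 4]])

labels = wp[:, 1].astype(int)
# Row indices where the label differs from the previous row (row 0 always counts).
first_idx = np.flatnonzero(np.r_[True, labels[1:] != labels[:-1]])

fig, ax = plt.subplots()
ax.plot(wp[:, 0])                      # x positions are the row indices 0..n-1
ax.set_xticks(first_idx)               # one tick per group, at its first row
ax.set_xticklabels([str(v) for v in labels[first_idx]])
```

Because the ticks are derived from the second column itself, this works for any length of wp, as long as equal labels are stored in consecutive rows.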

Scipy.Odr multiple variable regression

I would like to perform a multidimensional ODR with scipy.odr. I read the API documentation, which says that multi-dimensionality is possible, but I cannot make it work. I cannot find a working example on the internet, and the API documentation is really crude and gives no hints on how to proceed.
Here is my MWE:
import numpy as np
import scipy.odr
def linfit(beta, x):
    return beta[0]*x[:,0] + beta[1]*x[:,1] + beta[2]
n = 1000
t = np.linspace(0, 1, n)
x = np.full((n, 2), float('nan'))
x[:,0] = 2.5*np.sin(2*np.pi*6*t)+4
x[:,1] = 0.5*np.sin(2*np.pi*7*t + np.pi/3)+2
e = 0.25*np.random.randn(n)
y = 3*x[:,0] + 4*x[:,1] + 5 + e
print(x.shape)
print(y.shape)
linmod = scipy.odr.Model(linfit)
data = scipy.odr.Data(x, y)
odrfit = scipy.odr.ODR(data, linmod, beta0=[1., 1., 1.])
odrres = odrfit.run()
odrres.pprint()
It raises the following exception:
scipy.odr.odrpack.odr_error: number of observations do not match
This seems to be related to my matrix shapes, but I do not know how to shape them properly. Does anyone know?

Firstly, in my experience scipy.odr mostly uses arrays, not matrices. The library seems to perform a large number of size checks along the way, and getting it to work with multiple variables can be quite troublesome.
This is the workflow I usually use to get it to work (it worked at least on Python 2.7):
import numpy as np
import scipy.odr
n = 1000
t = np.linspace(0, 1, n)
def linfit(beta, x):
    return beta[0]*x[0] + beta[1]*x[1] + beta[2] #notice changed indices for x
x1 = 2.5*np.sin(2*np.pi*6*t)+4
x2 = 0.5*np.sin(2*np.pi*7*t + np.pi/3)+2
x = np.vstack( (x1, x2) ) #same as the older np.row_stack; odr doesn't seem to work with column_stack
e = 0.25*np.random.randn(n)
y = 3*x[0] + 4*x[1] + 5 + e #indices changed
linmod = scipy.odr.Model(linfit)
data = scipy.odr.Data(x, y)
odrfit = scipy.odr.ODR(data, linmod, beta0=[1., 1., 1.])
odrres = odrfit.run()
odrres.pprint()
So using 1D arrays of identical length, stacking them as rows (row_stack, i.e. vstack) and addressing each variable by a single index seems to work.

Correspondence between a "ij" meshgrid and a long meshgrid

Consider a matrix Z that contains grid-based results for z = z(a,m,e). Z has shape (len(aGrid), len(mGrid), len(eGrid)), and Z[0,1,2] contains z(a=aGrid[0], m=mGrid[1], e=eGrid[2]). However, we may have removed some elements of the state space from the object (for example, and for simplicity, all (a,m,e) with a > 3). Say that the size of the valid state space is x.
I have been given code to transform this object into an object Z2 of shape (x, 3). Every row i of Z2 corresponds to an element of Z: (aGrid[a[i]], mGrid[m[i]], eGrid[e[i]]).
# first create Z, a mesh grid based matrix that has some invalid states (we set them to NaN)
aGrid = np.arange(0, 10, dtype=float)
mGrid = np.arange(100, 110, dtype=float)
eGrid = np.arange(1000, 1200, dtype=float)
A,M,E = np.meshgrid(aGrid, mGrid, eGrid, indexing='ij')
Z = A
Z[Z > 3] = np.nan #remove some states from being "allowed" (np.NaN was removed in NumPy 2.0)
# now, translate them from shape (len(aGrid), len(mGrid), len(eGrid)) to
grids = [A,M,E]
grid_bc = np.broadcast_arrays(*grids)
Z2 = np.column_stack([g.ravel() for g in grid_bc])
Z2[np.isnan(Z.ravel())] = np.nan
Z3 = Z2[~np.isnan(Z2)]
Through some computation, I then get a matrix V4 that has the shape of Z3 but contains 4 columns.
I am given
Z2 (as above)
Z3 (as above)
V4 which is a matrix shape (Z3.shape[0], Z3.shape[1]+1): it has an additional column appended
(if necessary, I still have access to the grid A,M,E)
and I need to recreate
V, which is the matrix that contains the values (of the last column) of V4, but is transformed back to the shape of Z.
That is, if there is a row in V4 that reads (aGrid[0], mGrid[1], eGrid[2], v1), then the value of V at V[0,1,2] is v1, and so on for all rows in V4.
Efficiency is key.

Given your original problem conditions, recreated as follows, modified such that A is a copy of Z:
aGrid = np.arange(0, 10, dtype=float)
mGrid = np.arange(100, 110, dtype=float)
eGrid = np.arange(1000, 1200, dtype=float)
A,M,E = np.meshgrid(aGrid, mGrid, eGrid, indexing='ij')
Z = A.copy()
Z[Z > 3] = np.nan
grids = [A,M,E]
grid_bc = np.broadcast_arrays(*grids)
Z2 = np.column_stack([g.ravel() for g in grid_bc])
Z2[np.isnan(Z.ravel())] = np.nan
Z3 = Z2[~np.isnan(Z2)]
A function can be defined as follows to recreate a dense N-D array from a sparse 2D matrix of shape (number of data points, number of dimensions + 1). The first argument of the function is the aforementioned 2D matrix; the remaining (optional) arguments are the grid values for each dimension:
import numpy as np

def map_array_to_index(uniq_arr):
    # build a value -> position lookup and vectorize it over whole arrays
    return np.vectorize(dict(map(reversed, enumerate(uniq_arr))).__getitem__)

def recreate(arr, *coord_arrays):
    # if the grids are not supplied, infer them from the coordinate columns
    if len(coord_arrays) != arr.shape[1] - 1:
        coord_arrays = [np.unique(col) for col in arr.T[0:-1]]
    # lists instead of map(): map() returns a one-shot iterator in Python 3
    lookups = [map_array_to_index(c) for c in coord_arrays]
    new_array = np.nan * np.ones([len(c) for c in coord_arrays])
    new_array[tuple(l(c) for c, l in zip(arr.T[0:-1], lookups))] = arr[:, -1]
    new_grids = np.meshgrid(*coord_arrays, indexing='ij')
    return new_array, new_grids
Given a 2D matrix V4, defined above with values derived from Z,
V4 = np.column_stack([g.ravel() for g in grid_bc] + [Z.ravel()])
it is possible to recreate Z as follows:
V4_orig_form, V4_grids = recreate(V4, aGrid, mGrid, eGrid)
All non-NaN values correctly test for equality:
np.all(Z[~np.isnan(Z)] == V4_orig_form[~np.isnan(V4_orig_form)])
The function also works without aGrid, mGrid, eGrid passed in, but in this case it will not include any coordinate that is not present in the corresponding column of the input array.

So Z has the same shape as A, M and E, and Z2 has shape (Z.size, len(grids)) = (10*10*200, 3) in this case (if you do not filter out the NaN elements).
This is how you recreate your grids from the values of Z2:
grids = Z2.T
A,M,E = [g.reshape(A.shape) for g in grids]
Z = A # or whatever other calculation you need here
The only thing you need is the shape to which you want to go back. NaN will propagate to the final array.
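Since efficiency was mentioned, here is a sketch of a fully vectorized variant that maps coordinates to indices with np.searchsorted instead of a dictionary lookup. It assumes each grid is sorted and contains the coordinate values exactly. The small grids and the values array are invented for illustration and are not the ones from the question.

```python
import numpy as np

# Small hypothetical grids (sorted, as np.searchsorted requires).
aGrid = np.arange(0, 4, dtype=float)
mGrid = np.arange(100, 103, dtype=float)
eGrid = np.arange(1000, 1005, dtype=float)
A, M, E = np.meshgrid(aGrid, mGrid, eGrid, indexing="ij")

# Long table: one row per grid point, the last column holds the value.
values = A + M + E
V4 = np.column_stack([A.ravel(), M.ravel(), E.ravel(), values.ravel()])
V4 = V4[V4[:, 0] <= 2]  # drop the "invalid" states (here: a > 2)

# Convert each coordinate column into integer positions on its grid.
idx = tuple(np.searchsorted(g, col)
            for g, col in zip((aGrid, mGrid, eGrid), V4.T[:-1]))

# Scatter the values back into a dense array; missing states stay NaN.
V = np.full(A.shape, np.nan)
V[idx] = V4[:, -1]

print(V.shape, V[0, 1, 2])
```

The row order of V4 does not matter here, because every row carries its own coordinates.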

Apply non-linear regression for multi dimension data samples in Python

I have installed NumPy and SciPy, but I don't quite understand their documentation about polyfit.
For example, here are my data samples:
[-0.042780748663101636, -0.0040771571786609945, -0.00506567946276074]
[0.042780748663101636, -0.0044771571786609945, -0.10506567946276074]
[0.542780748663101636, -0.005771571786609945, 0.30506567946276074]
[-0.342780748663101636, -0.0304077157178660995, 0.90506567946276074]
The first two columns are sample features and the third column is the output. My goal is to get a function that takes two parameters (the first two columns) and returns its prediction (the output).
Any simple example ?
====================== EDIT ======================
Note that, I need to fit something like a curve, not only straight lines. The polynomial should be something like this ( n = 3):
a*x1^3 + b*x2^2 + c*x3 + d = y
Not:
a*x1 + b*x2 + c*x3 + d = y
x1, x2, x3 are features of one sample, y is the output

Try something like
edit: added an example function that used results of linear regression to estimate output.
import numpy as np

data = np.array(
    [[-0.042780748663101636, -0.0040771571786609945, -0.00506567946276074],
     [0.042780748663101636, -0.0044771571786609945, -0.10506567946276074],
     [0.542780748663101636, -0.005771571786609945, 0.30506567946276074],
     [-0.342780748663101636, -0.0304077157178660995, 0.90506567946276074]])
coefficient = data[:,0:2]  # the two feature columns
dependent = data[:,-1]     # the output column
# rcond=None avoids the FutureWarning that some NumPy versions emit
x,residuals,rank,s = np.linalg.lstsq(coefficient,dependent,rcond=None)

def f(x,u,v):
    # linear model without an intercept: x[0]*u + x[1]*v
    return u*x[0] + v*x[1]

for datum in data:
    print(f(x,*datum[0:2]))
Which gives
>>> x
array([ 0.16991146, -30.18923739])
>>> residuals
array([ 0.07941146])
>>> rank
2
>>> s
array([ 0.64490113, 0.02944663])
and the function created with your coefficients gave
0.115817326583
0.142430900298
0.266464019171
0.859743371665
More info can be found at the documentation I posted as a comment.
edit 2: fitting your data to an arbitrary model.
edit 3: made my model a function for ease of understanding.
edit 4: made code more easily read/ changed model to a quadratic fit, but you should be able to read this code and know how to make it minimize any residual you want now.
contrived example:
import numpy as np
from scipy.optimize import leastsq

data = np.array(
    [[-0.042780748663101636, -0.0040771571786609945, -0.00506567946276074],
     [0.042780748663101636, -0.0044771571786609945, -0.10506567946276074],
     [0.542780748663101636, -0.005771571786609945, 0.30506567946276074],
     [-0.342780748663101636, -0.0304077157178660995, 0.90506567946276074]])
coefficient = data[:,0:2]
dependent = data[:,-1]

def model(p,x):
    a,b,c = p
    u = x[:,0]
    v = x[:,1]
    return (a*u**2 + b*v + c)

def residuals(p, y, x):
    # difference between the observed output and the model prediction
    err = y - model(p,x)
    return err

p0 = np.array([2,3,4]) #some initial guess
p = leastsq(residuals, p0, args=(dependent, coefficient))[0]

def f(p,x):
    # note: this evaluates a linear form, while model() above squares the first feature
    return p[0]*x[0] + p[1]*x[1] + p[2]

for x in coefficient:
    print(f(p,x))
gives
-0.108798280153
-0.00470479385807
0.570237823475
0.413016072653
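As a side note, a model such as a*u**2 + b*v + c is still linear in its parameters a, b and c, so it can also be fitted with np.linalg.lstsq by putting the transformed features into the columns of the matrix. The sketch below uses synthetic, noise-free data with made-up true parameters (2.0, -3.0, 0.5) so that the result can be checked. The helper predict is a name introduced here, and this is an alternative to the leastsq approach above rather than a replacement for it.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2*u**2 - 3*v + 0.5
rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, 50)
v = rng.uniform(-1, 1, 50)
y = 2.0 * u**2 - 3.0 * v + 0.5

# One column per term of the model; the column of ones carries the constant c.
design = np.column_stack([u**2, v, np.ones_like(u)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict(u_new, v_new):
    # Evaluate the fitted model with the same terms used in the design matrix.
    return coef[0] * u_new**2 + coef[1] * v_new + coef[2]

print(coef)
print(predict(0.5, -0.2))
```

Higher powers or cross terms can be added as further columns in the same way; an iterative solver like leastsq is only needed when the model is nonlinear in the parameters themselves.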
