What's the easiest way to calculate regression coefficient in python? - python

I have a 100 by 1 response variable Y, and a 100 by 3 predictor matrix X. I want to calculate the regression coefficient (X'X)^{-1}X'Y.
Currently I'm doing it as follows:
invXpX=inv(np.dot(np.transpose(X),X))
XpY=np.dot(np.transpose(X),Y)
betahat=np.dot(invXpX,XpY)
This looks pretty cumbersome, while in MATLAB we could do it just like the original math formula: inv(X'*X)*X'*Y. Is there an easier way to calculate this regression coefficient in python?
Thanks!

Yes, it can be written more compactly, but note that this will not always improve your code or its readability.
The transpose of a numpy array is available as the .T attribute. If you use numpy matrix instead of numpy arrays you can also use .I for the inverse, but I would recommend sticking with ndarray. For matrix multiplication you can use the @ operator (Python 3.5+). np.dot(X, Y), X.dot(Y) and X @ Y give the same result when X and Y are 1-D or 2-D numpy arrays.
import numpy as np
# Simulate data using a quadratic equation with coefficients y=ax^2+bx+c
a, b, c = 1, 2, 3
x = np.arange(100)
# Add random component to y values for estimation
y = a*x**2 + b*x + c + np.random.randn(100)
# Get X matrix [100x3]
X = np.vstack([x**2, x, np.ones(x.shape)]).T
# Estimate coefficients a, b, c
x_hat = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
# x_hat: array([0.99998334, 2.00246583, 2.95697339])
x_hat = np.linalg.inv(X.T @ X) @ X.T @ y
# x_hat: array([0.99998334, 2.00246583, 2.95697339])
# Use matrix:
X_mat = np.matrix(X)
x_hat = (X_mat.T @ X_mat).I @ X_mat.T @ y
# x_hat: matrix([[0.99998334, 2.00246583, 2.95697339]])
# without noise:
y = a*x**2 + b*x + c
x_hat = (X_mat.T @ X_mat).I @ X_mat.T @ y
# x_hat: matrix([[1., 2., 3.]])
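As a hedged aside to the explicit-inverse formula above: forming the inverse of X'X is usually not needed, and a linear solver or a least-squares routine should give the same coefficients. The sketch below reuses the simulated quadratic data; the fixed random seed and the names beta_inv, beta_solve and beta_lstsq are my own assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
a, b, c = 1, 2, 3
x = np.arange(100)
y = a * x**2 + b * x + c + rng.standard_normal(100)
X = np.vstack([x**2, x, np.ones(x.shape)]).T  # design matrix, shape (100, 3)

# Textbook formula (X'X)^{-1} X'y with an explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same normal equations, solved without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

All three estimates should agree to several digits here. np.linalg.lstsq is generally the safer choice when X'X is badly conditioned, because it never forms X'X.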

You can try this:
np.linalg.inv(X.T @ X) @ (X.T @ Y)

Related

Evaluating a numpy array

I have a function that returns a numpy array as follows:
import numpy as np
from scipy.misc import derivative
import sympy as sp
x, y = sp.symbols('x y')
f = x**2 + y **2
def grad(f):
    exp = sp.expand(f)
    dfdx = sp.diff(exp,x)
    dfdy = sp.diff(exp,y)
    global grad
    Df = np.array([dfdx,dfdy])
    return Df
I'm using the variable Df in another function and do some computations with it.
As you may have guessed, the results come out including x and y. However, I need the results to be evaluated each time with the initial values I choose for x and y instead of the symbols.
I was wondering whether there is something like sympy's .subs() that works on a numpy array rather than on a function?
Sympy and numpy are two separate worlds that aren't easy to bring together.
With sympy's lambdify, sympy expressions can be turned into functions that accept numpy arguments. When arrays are used as arguments, they all need to be 1D and of the same size. The function np_grad_1 below shows the default behavior: it returns two sub-arrays, one per partial derivative.
To get your desired functionality, a wrapper can take a 2D numpy input and convert the result back to a 2D numpy array:
import sympy as sp
import numpy as np
x, y = sp.symbols('x y')
f = x ** 2 + y ** 2
def grad(f, x, y):
    exp = sp.expand(f)
    dfdx = sp.diff(exp, x)
    dfdy = sp.diff(exp, y)
    return [dfdx, dfdy]
np_grad_1 = sp.lambdify([x, y], grad(f, x, y))
np_grad_2 = lambda points: np.array(np_grad_1(points[:, 0], points[:, 1])).T
points = np.random.uniform(-1, 1, (5, 2))
np_grad_1(points[:, 0], points[:, 1]) # returns a list of 2 arrays, one per partial derivative
np_grad_2(points) # returns an Nx2 array
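For the original use case of plugging one pair of initial values into the gradient, a minimal sketch could look like the following. The values x0 and y0 and the names grad_exprs, grad_subs, grad_fn and grad_num are hypothetical, chosen only for illustration. It shows both the symbolic .subs() route and the lambdify route.

```python
import numpy as np
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**2

# Symbolic gradient as a list of expressions: [2*x, 2*y]
grad_exprs = [sp.diff(f, v) for v in (x, y)]

x0, y0 = 1.5, -2.0  # hypothetical initial values

# Option 1: substitute numbers symbolically, then convert to floats
grad_subs = np.array([float(g.subs({x: x0, y: y0})) for g in grad_exprs])

# Option 2: compile once with lambdify, then call it with plain numbers
grad_fn = sp.lambdify([x, y], grad_exprs, 'numpy')
grad_num = np.array(grad_fn(x0, y0), dtype=float)

print(grad_subs)
print(grad_num)
```

If the gradient is evaluated many times (for example inside an optimization loop), the lambdify route is likely the better fit, since the symbolic work is done only once.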

Density of multivariate t distribution in Python for large number of observations

I am trying to evaluate the density of multivariate t distribution of a 13-d vector. Using the dmvt function from the mvtnorm package in R, the result I get is
[1] 1.009831e-13
When I tried to write the function myself in Python (thanks to the suggestions in the post "multivariate student t-distribution with python"), I realized that the gamma function was taking very large values (since I have n=7512 observations), which made my function overflow.
I tried to modify the algorithm, using the math.lgamma() and np.linalg.slogdet() functions to move it to the log scale, but the result I got was
8.97669876e-15
This is the function that I used in Python:
def dmvt(x,mu,Sigma,df,d):
    '''
    Multivariate t-student density:
    output:
        the density of the given element
    input:
        x = parameter (d dimensional numpy array or scalar)
        mu = mean (d dimensional numpy array or scalar)
        Sigma = scale matrix (dxd numpy array)
        df = degrees of freedom
        d: dimension
    '''
    Num = math.lgamma( 1. *(d+df)/2 ) - math.lgamma( 1.*df/2 )
    (sign, logdet) = np.linalg.slogdet(Sigma)
    Denom =1/2*logdet + d/2*( np.log(pi)+np.log(df) ) + 1.*( (d+df)/2 )*np.log(1 + (1./df)*np.dot(np.dot((x - mu),np.linalg.inv(Sigma)), (x - mu)))
    d = 1. * (Num - Denom)
    return np.exp(d)
Any ideas why this function does not produce the same results as the R equivalent?
Using x = (0,0) produces similar results (up to a point, due to rounding), but with x = (1,1) I get a significant difference!
I finally managed to 'translate' the code from the mvtnorm package in R and the following script works without numerical underflows.
import numpy as np
import scipy.stats
import math
from math import lgamma
from numpy import matrix
from numpy import linalg
from numpy.linalg import slogdet
import scipy.special
from scipy.special import gammaln
mu = np.array([3,3])
x = np.array([1, 1])
Sigma = np.array([[1, 0], [0, 1]])
p=2
df=1
def dmvt(x, mu, Sigma, df, log):
    '''
    Multivariate t-student density. Returns the density
    of the function at points specified by x.
    input:
        x = parameter (n x d numpy array)
        mu = mean (d dimensional numpy array)
        Sigma = scale matrix (d x d numpy array)
        df = degrees of freedom
        log = log scale or not
    '''
    p = Sigma.shape[0] # Dimensionality
    dec = np.linalg.cholesky(Sigma) # lower-triangular Cholesky factor
    R_x_m = np.linalg.solve(dec, np.matrix.transpose(x) - mu)
    rss = np.power(R_x_m, 2).sum(axis=0)
    # p/2.0 avoids integer division under Python 2; np.log1p also accepts arrays
    logretval = lgamma(1.0*(p + df)/2) - (lgamma(1.0*df/2) + np.sum(np.log(dec.diagonal())) \
        + p/2.0 * np.log(math.pi * df)) - 0.5 * (df + p) * np.log1p(rss/df)
    if log == False:
        return(np.exp(logretval))
    else:
        return(logretval)
print(dmvt(x,mu,Sigma,df,True))
print(dmvt(x,mu,Sigma,df,False))
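As a sanity check on the log-scale approach, here is a minimal self-contained sketch, not the mvtnorm code itself. The helper name mvt_logpdf is hypothetical. It evaluates the log-density for the same toy inputs (mu = (3,3), x = (1,1), identity scale matrix, df = 1) and compares it with the closed form for that case, 1/(2*pi) * (1 + 8)^(-3/2).

```python
import math
import numpy as np

def mvt_logpdf(x, mu, Sigma, df):
    """Log-density of a multivariate t at each row of x (hypothetical helper)."""
    x = np.atleast_2d(x)                       # shape (n, p)
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)              # Sigma = L @ L.T, L lower triangular
    z = np.linalg.solve(L, (x - mu).T)         # whitened residuals, shape (p, n)
    maha = np.sum(z**2, axis=0)                # squared Mahalanobis distances
    log_norm = (math.lgamma((p + df) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * p * math.log(math.pi * df)
                - np.sum(np.log(np.diag(L))))  # equals -0.5 * log det(Sigma)
    return log_norm - 0.5 * (df + p) * np.log1p(maha / df)

mu = np.array([3.0, 3.0])
Sigma = np.eye(2)
x = np.array([1.0, 1.0])

logp = mvt_logpdf(x, mu, Sigma, df=1)
density = np.exp(logp)

# Closed form for p = 2, df = 1, identity scale, squared distance 8
expected = 1.0 / (2.0 * math.pi) * (1.0 + 8.0) ** -1.5
print(density, expected)
```

Working with lgamma and the log of the Cholesky diagonal keeps every intermediate value on the log scale, which is what avoids the overflow described in the question. Recent SciPy versions also provide scipy.stats.multivariate_t, which may be worth comparing against.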

linear algebra in python

Given a tall m by n matrix X, I need to calculate s = 1 + x(X.T X)^{-1} x.T. Here, x is a row vector and s is scalar. Is there an efficient (or, recommended) way to compute this in python?
Needless to say, X.T X will be symmetric positive definite.
My attempt:
If we consider the QR decomposition of X, i.e., X = QR, where Q is orthogonal, R is upper triangular, then X.T X = R.T R.
QR decomposition can be easily obtained using numpy.linalg.qr, that is
Q,R = numpy.linalg.qr(X)
But then again, is there a particularly efficient way to calculate inv(R.T R)?
If you are doing the QR factorization of X, resulting in X.T X = R.T R, you may avoid using np.linalg.inv (and np.linalg.solve) by using forward and backward substitution instead (R.T is lower triangular!) with scipy.linalg.solve_triangular:
import numpy as np
import scipy.linalg as LA
Q,R = np.linalg.qr(X)
# solve R.T R z = x in two steps, via an intermediate y with R z = y
# step (a): solve R.T y = x (forward substitution)
y = LA.solve_triangular(R,x,trans='T')
# step (b): solve R z = y (backward substitution)
z = LA.solve_triangular(R,y)
s = 1 + x @ z
where @ is the python3 matrix multiplication operator.
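A possible shortcut, offered as a sketch rather than a definitive recipe: since x (R.T R)^{-1} x.T equals the squared norm of the solution y of R.T y = x, step (b) can be skipped when only the scalar s is needed. The matrix sizes, the seed and the names s_qr and s_inv below are assumptions for illustration.

```python
import numpy as np
import scipy.linalg as LA

rng = np.random.default_rng(0)     # fixed seed: an assumption
X = rng.standard_normal((50, 4))   # tall matrix with hypothetical sizes
x = rng.standard_normal(4)         # row vector stored as a 1-D array

R = np.linalg.qr(X, mode='r')      # only R is needed; X.T @ X == R.T @ R

# x (X'X)^{-1} x' = || R^{-T} x ||^2, so one triangular solve is enough
y = LA.solve_triangular(R, x, trans='T')
s_qr = 1.0 + y @ y

# Reference value with an explicit inverse, for comparison only
s_inv = 1.0 + x @ np.linalg.inv(X.T @ X) @ x
print(s_qr, s_inv)
```

The two values should match up to rounding, and s_qr is at least 1 by construction because it adds a squared norm.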

Matrix of polynomial elements

I am using NumPy for operations on matrices, to calculate matrixA * matrixB, the trace of the matrix, etc. The elements of my matrices are integers. What I want to know is whether it is possible to work with matrices of polynomials. For instance, I would like to work with matrices such as [x,y;a,b] rather than [1,1;1,1], so that calculating the trace gives me the polynomial x + b and not 2. Is there some polynomial class in NumPy which matrices can work with?
One option is to use the SymPy Matrices module. SymPy is a symbolic mathematics library for Python which is quite interoperable with NumPy, especially for simple matrix manipulation tasks such as this.
>>> from sympy import symbols, Matrix
>>> from numpy import trace
>>> x, y, a, b = symbols('x y a b')
>>> M = Matrix(([x, y], [a, b]))
>>> M
Matrix([
[x, y],
[a, b]])
>>> trace(M)
b + x
>>> M.dot(M)
[a*y + x**2, a*b + a*x, b*y + x*y, a*y + b**2]
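The same operations can also be done with SymPy's own Matrix methods, without going through NumPy. The following is a small sketch under that assumption; the variable names tr, M2 and det are mine.

```python
from sympy import symbols, Matrix, expand

x, y, a, b = symbols('x y a b')
M = Matrix([[x, y], [a, b]])

tr = M.trace()   # symbolic trace: sum of the diagonal entries
M2 = M * M       # symbolic matrix product, still a 2x2 Matrix
det = M.det()    # symbolic determinant

print(tr)
print(M2)
print(det)
```

Here M * M keeps the 2x2 shape, which may be more convenient than a flattened list when the product is used in further matrix computations.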

Simultaneous fitting to N datasets in Python

I have a single function that I want to fit to a number of different datasets, all with the same number of points. For example, I might want to fit a polynomial to all rows of an image. Is there an efficient and vectorized way of doing this with scipy or other packages, or do I have to resort to a single loop (or use multiprocessing to speed it up a bit)?
You can use numpy.linalg.lstsq:
import numpy as np
# independent variable
x = np.arange(100)
# some sample outputs with random noise
y1 = 3*x**2 + 2*x + 4 + np.random.randn(100)
y2 = x**2 - 4*x + 10 + np.random.randn(100)
# coefficient matrix, where each column corresponds to a term in your function
# this one is simple quadratic polynomial: 1, x, x**2
a = np.vstack((np.ones(100), x, x**2)).T
# result matrix, where each column is one set of outputs
b = np.vstack((y1, y2)).T
solutions, residuals, rank, s = np.linalg.lstsq(a, b, rcond=None)
# each column in solutions is the coefficients of terms
# for the corresponding output
for i, solution in enumerate(zip(*solutions),1):
    print("y%d = %.1f + (%.1f)x + (%.1f)x^2" % ((i,) + solution))
# outputs:
# y1 = 4.4 + (2.0)x + (3.0)x^2
# y2 = 9.8 + (-4.0)x + (1.0)x^2
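For the specific case of polynomial fits, numpy.polyfit also accepts a 2D array of outputs with one dataset per column, so the design matrix need not be built by hand. A minimal sketch, assuming the same two quadratic datasets (the fixed seed and the names Y and coeffs are my additions):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
x = np.arange(100)
y1 = 3 * x**2 + 2 * x + 4 + rng.standard_normal(100)
y2 = x**2 - 4 * x + 10 + rng.standard_normal(100)

Y = np.column_stack((y1, y2))    # shape (100, 2): one dataset per column

# Degree-2 fit of every column at once; coefficients come highest power first
coeffs = np.polyfit(x, Y, 2)     # shape (3, 2)

for i, (c2, c1, c0) in enumerate(coeffs.T, 1):
    print("y%d = %.1f + (%.1f)x + (%.1f)x^2" % (i, c0, c1, c2))
```

Note the ordering difference: polyfit returns the x^2 coefficient first, whereas the lstsq example above orders the columns as 1, x, x^2.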
