Multiple linear regression in Python

I am using multiple linear regression: I have one dependent variable (var) and several independent variables (varM1, varM2, ...).
I use this code in Python:
z = np.array([varM1, varM2, varM3], np.int32)
n = max(np.shape(var))
X = np.vstack([np.ones(n), z]).T
a = np.linalg.lstsq(X, var)[0]
How can I calculate the R-square change for every variable with Python? I would like to see how the regression changes if I add or remove predictor variables.

If the broadcasting is correct along the way, the following gives you the residual norm (the square root of the residual sum of squares), not the correlation coefficient itself:
R = np.sqrt(((var - X.dot(a))**2).sum())
R-squared can then be obtained as 1 - RSS/TSS, where RSS is the quantity under the square root above and TSS = ((var - var.mean())**2).sum().
One full example of multivariate regression:
import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6])
x2 = np.array([1, 1.5, 2, 2.5, 3.5, 6])
x3 = np.array([6, 5, 4, 3, 2, 1])
y = np.random.random(6)
nvar = 3
one = np.ones(x1.shape)
# one (n_obs, 2) design matrix [xi, 1] per variable, stacked along the first axis
A = np.vstack((x1, one, x2, one, x3, one)).reshape(nvar, 2, -1).transpose(0, 2, 1)
for i, Ai in enumerate(A):
    a = np.linalg.lstsq(Ai, y, rcond=None)[0]
    # residual norm (square root of the residual sum of squares) for this fit
    R = np.sqrt(((y - Ai.dot(a))**2).sum())
    print(R)
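To get at the R-square change the question asks about directly, one option (a minimal sketch, assuming var, varM1, varM2 and varM3 are the 1-D arrays from the question) is to fit the model with and without a given predictor and compare the two R-squared values:

import numpy as np

def r_squared(X, y):
    """Fit y ~ X by least squares and return the coefficient of determination."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ coef) ** 2).sum()
    tss = ((y - y.mean()) ** 2).sum()
    return 1.0 - rss / tss

y = np.asarray(var, dtype=float)
ones = np.ones(len(y))
X_full = np.column_stack([ones, varM1, varM2, varM3])
X_reduced = np.column_stack([ones, varM1, varM2])      # same model without varM3

r2_full = r_squared(X_full, y)
r2_reduced = r_squared(X_reduced, y)
print("R-square change for varM3:", r2_full - r2_reduced)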

Related

Is there a way to suitably adjust this sklearn logistic regression function to account for multiple independent variables and fixed effects?

I would like to adapt the LogitRegression function included below to include additional independent variables and fixed effects.
The code below has been adapted from the answer provided here: how to use sklearn when target variable is a proportion
from sklearn.linear_model import LinearRegression
from random import choices
from string import ascii_lowercase
import numpy as np
import pandas as pd

class LogitRegression(LinearRegression):
    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)

if __name__ == '__main__':
    ### 1. Original version with a single independent variable
    # generate example data
    np.random.seed(42)
    n = 100
    ## orig version provided in the link - single random independent variable
    x = np.random.randn(n).reshape(-1, 1)
    # defining the response (dependent) variable (a proportion between 0 and 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5
    # applying the model - this works
    model = LogitRegression()
    model.fit(x, p)

    ### 2. Adding additional independent variables and a fixed effects variable
    # creating 3 random independent variables
    x1 = np.random.randn(n)
    x2 = np.random.randn(n)
    x3 = np.random.randn(n)
    # a fixed effects variable
    cats = ["".join(choices(["France", "Norway", "Ireland"])) for _ in range(100)]
    # combining these into a dataframe
    df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "countries": cats})
    # adding the fixed effects country columns
    df = pd.concat([df, pd.get_dummies(df.countries)], axis=1)
    print(df)
    # ideally I would like to use the independent variables x1, x2, x3 and the fixed
    # effects column, countries, from the above df, but I'm not sure how best to edit the
    # LogitRegression class to account for this. The dependent variable is a proportion.
    # x = np.array(df)
    model = LogitRegression()
    model.fit(x, p)
I would like the predicted output to be a proportion bounded between 0 and 1. I've previously tried the sklearn linear regression method, but this gave predictions outside the expected range. I've also looked at using the statsmodels OLS function, but although I can include multiple independent variables, I can't find a way to include the fixed effects.
Thanks in advance for any assistance you can provide with this, or please let me know if there is another suitable method that I could use instead.
I managed to solve this with the following small adjustments, passing the independent and fixed-effect variables to the function as a dataframe (writing out a simplified example of the problem helped me a lot in finding the answer):
from sklearn.linear_model import LinearRegression
from random import choices
from string import ascii_lowercase
import numpy as np
import pandas as pd

class LogitRegression(LinearRegression):
    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)

if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    # defining the response (dependent) variable (a proportion between 0 and 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5
    # creating 3 random independent variables
    x1 = np.random.randn(n)
    x2 = np.random.randn(n)
    x3 = np.random.randn(n)
    # a fixed effects variable
    cats = ["".join(choices(["France", "Norway", "Ireland"])) for _ in range(100)]
    # combining these into a dataframe
    df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "countries": cats})
    # adding the fixed effects country columns
    df = pd.concat([df, pd.get_dummies(df.countries)], axis=1)
    print(df)
    # Using the independent variables x1, x2, x3 and the fixed effects column,
    # countries, from the above df. The dependent variable is a proportion.
    # x = np.array(df)
    categories = df['countries'].unique()
    x = df.loc[:, np.concatenate((["x1", "x2", "x3"], categories))]
    model = LogitRegression()
    model.fit(x, p)
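As a quick sanity check (a small sketch that just reuses the model and x fitted above, still inside the __main__ block), the predictions come back as proportions bounded inside (0, 1):

    preds = model.predict(x)
    # the inverse-logit in predict() keeps every prediction strictly between 0 and 1
    print(preds.min(), preds.max())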

Turning a for-loop function into vectorized form with numpy

I am trying to make my program faster by using numpy arrays, but every time I have tried rewriting the vanilla Python in vectorized form it has given me errors. How can I vectorize the code so that I don't have to use the for loop? In the for-loop code below I have the linear regression and standard deviation formulas, which depend on the PC_list values.
PC_list = [457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
           404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
           320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
           400.869995,394.773010,382.556000]
# x_mean and x_squared_mean are used for the linear regression and standard deviation
x_mean = number/2*(1 + number)
x_squared_mean = number*(number+1)*(2*number+1)/6
for i in range(len(PC_list)-number):
    y_mean = sum(PC_list[i:i+number])/number
    xy_mean = sum([x * (j + 1) for j, x in enumerate(PC_list[i:i+number])])/number
    # Linear regression slope (m) and vertical shift (b)
    m = (x_mean*y_mean - xy_mean)/((x_mean)**2 - x_squared_mean)
    b = y_mean - m*x_mean
    # Standard deviation: square root of the sum of squared deviations from y_mean divided by (number - 1)
    std = (sum([(k - y_mean)**2 for k in PC_list[i:i+number]])/(number-1))**0.5
    # Upper and lower boundary calculations
    Upper_Boundary = round((m*i + b + Upper*std), 1)
    Lower_Boundary = round((m*i + b + Lower*std), 1)
    # appends the upper and lower boundary to a list
    upper.append(Upper_Boundary)
    lower.append(Lower_Boundary)
    # Boundary x positions appended in a list for graphing
    Boundary_x = number + i
    Boundary_x_list.append(Boundary_x)
There is a good implementation of simple linear regression with Python and Numpy here: Simple Linear Regression in Python
The first thing I would recommend is converting your original dataset to a numpy array.
import numpy as np
X = np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
# Calculating mean of the array is made trivial
x_mean = X.mean()
# values of array are squared first and then we get the mean
x_squared_mean = np.power(X, 2).mean()
# slope (m), assuming y and y_mean hold the response values and their mean
m = np.sum((X - x_mean) * (y - y_mean)) / np.sum(np.power(X - x_mean, 2))
# intercept (b)
b = y_mean - m * x_mean
# regression line
reg_line = m * X + b
This is just an example, but in general the first step is to convert your data to numpy arrays and then you get access to all the non-loop type functions that are implemented in C.
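Going one step further, the rolling-window part of the loop itself can be vectorized. Below is a minimal sketch: it reuses the formulas from the question exactly (including its definitions of x_mean and x_squared_mean) and assumes number, Upper and Lower carry the same meaning as in the question; the values given here are only placeholders.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view   # requires numpy >= 1.20

number, Upper, Lower = 5, 2.0, -2.0   # placeholder values; use your own

PC = np.array(PC_list)
# one row per window of length `number`; drop the last row so the set of windows
# matches the original loop, which runs i over range(len(PC_list) - number)
windows = sliding_window_view(PC, number)[:-1]

x = np.arange(1, number + 1)
x_mean = number / 2 * (1 + number)
x_squared_mean = number * (number + 1) * (2 * number + 1) / 6

y_mean = windows.sum(axis=1) / number
xy_mean = (windows * x).sum(axis=1) / number

# slope, intercept and sample standard deviation for every window at once
m = (x_mean * y_mean - xy_mean) / (x_mean ** 2 - x_squared_mean)
b = y_mean - m * x_mean
std = windows.std(axis=1, ddof=1)

i = np.arange(windows.shape[0])
upper = np.round(m * i + b + Upper * std, 1)
lower = np.round(m * i + b + Lower * std, 1)
Boundary_x_list = (number + i).tolist()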

What exactly is coef_ from sklearn LinearRegression, and how do I interpret a formula built from it?

When I use LinearRegression in sklearn, I would do
m = 100
X = 6*np.random.rand(m, 1) - 3
y = 0.5*X**2 + X + 2 + np.random.randn(m, 1)
lin_reg = LinearRegression()
lin_reg.fit(X,y)
y_pred_1 = lin_reg.predict(X)
y_pred_1 = [_[0] for _ in y_pred_1]
and when I plot (X,y) and (X, y_pred_1) it seems to be correct.
I wanted to create a formula for the line of best fit:
y = lin_reg.coef_ * x + lin_reg.intercept_
I manually inserted values into the formula built from coef_ and intercept_ and compared the result to the value from lin_reg.predict(value); they are the same, so lin_reg.predict does in fact use the formula above.
My problem is: how do I create a formula for simple polynomial regression?
I would do
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly_2 = poly_features.fit_transform(X)
poly_reg_2 = LinearRegression()
poly_reg_2.fit(X_poly_2, y)
then poly_reg_2.coef_ gives me array([[0.93189329, 0.43283304]]) and poly_reg_2.intercept_ = array([2.20637695]).
Since it is "simple" polynomial regression, it should look something like
y = m2(x^2) + m1x + b, where both terms involve the same variable x.
From poly_reg_2.coef_, which coefficient belongs to x^2 and which one does not?
Thanks to https://www.youtube.com/watch?v=Hwj_9wMXDVo I gained some insight and found out how to interpret the formula for polynomial regression.
So poly_reg_2.coef_ = array([[0.93189329, 0.43283304]])
you know simple linear regression looks like
y = b + m1x
Then 2-degree polynomial regression looks like
y = b + m1x + m2(x^2)
and 3-degree:
y = b + m1x + m2(x^2) + m3(x^3)
and so on... so in my case the two coefficients are just m1 and m2, in that order.
so finally my formula becomes:
y = b + 0.93189329x + 0.43283304(x^2).
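If you would rather confirm the column order than rely on the convention above, a small sketch (reusing the X and y generated earlier) is to inspect the powers_ attribute of the fitted PolynomialFeatures transformer:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.powers_)                 # [[1], [2]] -> first column is x, second is x^2
reg = LinearRegression().fit(X_poly, y)
print(reg.intercept_, reg.coef_)    # b, then [m1, m2] in the same column order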

Is there an equivalent of R's nls in statsmodels?

Does statsmodels support nonlinear regression to an arbitrary equation? (I know that there are some forms that are already built in, e.g. for logistic regression, but I am after something more flexible)
In the solution https://stats.stackexchange.com/a/44249 to a question about non-linear regression,
the code is in R and uses the function nls. There the equation's parameters are defined with start = list(a1=0, ...). These are of course just initial guesses and not the final fitted values. But what is different here compared to lm is that the parameters don't need to come from the columns of the input data.
I've been able to use statsmodels.formula.api.ols as an equivalent for R's lm, but when I try to use it with an equation that has parameters (and not weights for the inputs / combinations of inputs), statsmodels complains about the parameters not being defined. It does not seem to have an equivalent of the start= argument, so it isn't obvious how to introduce them.
Is there a different class or function in statsmodels that accepts definition of these initial parameter values?
EDIT:
My current attempt, and also a workaround using lmfit as suggested:
from statsmodels.formula.api import ols
import numpy as np
import pandas as pd

def eqn_poly(x, a, b):
    ''' simple polynomial '''
    return a*x**2.0 + b*x

def eqn_nl(x, a, b):
    ''' fractional equation '''
    return 1.0 / ((a+x)*b)

x = np.arange(0, 3, 0.1)
y1 = eqn_poly(x, 0.1, 0.5)
y2 = eqn_nl(x, 0.1, 0.5)
sigma = 0.05
y1_noise = y1 + sigma * np.random.randn(*y1.shape)
y2_noise = y2 + sigma * np.random.randn(*y2.shape)
df1 = pd.DataFrame(np.vstack([x, y1_noise]).T, columns=['x', 'y'])
df2 = pd.DataFrame(np.vstack([x, y2_noise]).T, columns=['x', 'y'])
res1 = ols("y ~ 1 + x + I(x ** 2.0)", df1).fit()
print(res1.summary())
res3 = ols("y ~ 1 + x + I(x ** 2.0)", df2).fit()
#res2 = ols("y ~ eqn_nl(x, a, b)", df2).fit()
# ^^^ this fails if a, b are not initialised ^^^
# so initialise a, b
a, b = 1.0, 1.0
res2 = ols("y ~ eqn_nl(x, a, b)", df2).fit()
print(res2.summary())
# ===> and now the fitting is bad: it has an intercept of -4.79 and a weight
# on the equation of 15.7.
Given the model function, lmfit is able to find the parameters.
import lmfit
mod = lmfit.Model(eqn_nl)
lm_result = mod.fit(y2_noise, x=x, a=1.0, b=1.0)
print(lm_result.fit_report())
# ===> this one works fine: a=0.101, b=0.4977
But trying to put y1, x into ols doesn't seem to work ("PatsyError: model is missing required outcome variables"). I didn't really follow that suggestion.
Consider scipy.optimize.curve_fit as the desired nls-like function.
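A minimal sketch of that route, reusing eqn_nl and the noisy data from the question above; the p0 argument plays the role of R's start=:

from scipy.optimize import curve_fit

popt, pcov = curve_fit(eqn_nl, x, y2_noise, p0=[1.0, 1.0])
print(popt)   # fitted (a, b); expected to land close to (0.1, 0.5)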

Sympy: How to compute Lie derivative of matrix with respect to a vector field

I have a system x' = f(x) + g(x)u, where f: R^3 -> R^3 and g: R^3 -> R^(3x2).
It is a MIMO nonlinear control system (the f and g matrices of my system are written out below), and I wish to find the controllability matrix for it. The controllability matrix in this case is
C = [g, [f, g], [f, [f, g]], ...],
where [f, g] denotes the Lie bracket of f and g.
That is why I need to compute the Lie derivative of a matrix with respect to a vector field and vice versa, because [f, g] = (dg/dx) f - (df/dx) g.
Here in my system, f is 3x1 and g is 3x2, as there are two inputs available.
I wish to calculate the above matrix C in Python.
My system is
f=sm.Matrix([[x1**2],[sin(x1)+x3**2],[cos(x3)+x1**2]]) and
g=sm.Matrix([[cos(x1),0],[x1**2,x2],[0,0]]).
My code is:
from sympy.diffgeom import *
from sympy import sqrt,sin,cos
M = Manifold("M",3)
P = Patch("P",M)
coord = CoordSystem("coord",P,["x1","x2","x3"])
x1,x2,x3 = coord.coord_functions()
e_x1,e_x2,e_x3 = coord.base_vectors()
f = x1**2*e_x1 + (sin(x1)+x3**2)*e_x2 + (cos(x3) + x1**2)*e_x3
g = (cos(x1))*e_x1+(x1**2,x2)*e_x2 + 0*e_x3
#h1 = x1
#h2 = x2
#Lfh1 = LieDerivative(f,h1)
#Lfh2 = LieDerivative(f,h2)
#print(Lfh1)
#print(Lfh2)
Lfg = LieDerivative(f,g)
print(Lfg)
Why isn't my code giving me the correct answer?
The only error in your code is the tuple used for the multiple inputs. For LieDerivative in sympy.diffgeom to work, you need a properly defined vector field.
For single-input systems, the exact code that you have works without the tuple. For example, if you have
g = cos(x1)*e_x1 + x1**2*e_x2 + 0*e_x3
(that is, g(x) is a 3 x 1 matrix containing just the first column), then, after making that change, you get the correct Lie derivatives.
For the multiple-input case (as in your question), you can simply separate the two columns into g1 and g2 and proceed as in the single-input case. This works because, for multiple inputs,
L_g h = [L_g1 h, L_g2 h],
where g_1 and g_2 are the two columns. The final result for Lgh is a 1 x 2 matrix, which you can assemble from the two results calculated separately (Lg1h and Lg2h).
Code:
from sympy.diffgeom import *
from sympy import sin, cos

M = Manifold("M", 3)
P = Patch("P", M)
coord = CoordSystem("coord", P, ["x1", "x2", "x3"])
x1, x2, x3 = coord.coord_functions()
e_x1, e_x2, e_x3 = coord.base_vectors()

f = x1**2*e_x1 + (sin(x1) + x3**2)*e_x2 + (cos(x3) + x1**2)*e_x3
# split g column-wise into two vector fields, one per input
g1 = cos(x1)*e_x1 + (x1**2)*e_x2 + 0*e_x3
g2 = 0*e_x1 + x2*e_x2 + 0*e_x3
h = x1

Lg1h = LieDerivative(g1, h)
Lg2h = LieDerivative(g2, h)
Lgh = [Lg1h, Lg2h]
print(Lgh)
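For the bracket terms needed in C, here is a hedged sketch (not part of the original answer): since [v1, v2] applied to a scalar h equals L_v1(L_v2 h) - L_v2(L_v1 h), the components of [f, g1] can be read off by applying the bracket to the coordinate functions. sympy.diffgeom also ships a Commutator class that represents the bracket directly.

def bracket_component(v1, v2, c):
    # component of the Lie bracket [v1, v2] along the coordinate function c
    return LieDerivative(v1, LieDerivative(v2, c)) - LieDerivative(v2, LieDerivative(v1, c))

# components of [f, g1] in the x1, x2, x3 coordinate basis, reusing f and g1 from above
fg1 = [bracket_component(f, g1, c) for c in (x1, x2, x3)]
print(fg1)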
