Scikit-Learn: Std.Error, p-Value from LinearRegression - python
I've been trying to get the standard errors and p-values from scikit-learn's LinearRegression, but with no success.
I ended up finding this article, but the standard errors and p-values it produces do not match those from the statsmodels.api OLS method:
import numpy as np
from sklearn import datasets
from sklearn import linear_model
import regressor
import statsmodels.api as sm

boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False  # drop the CHAS feature
X = boston.data[:, which_betas]
y = boston.target

# scikit-learn + regressor stats
ols = linear_model.LinearRegression()
ols.fit(X, y)
xlabels = boston.feature_names[which_betas]
regressor.summary(ols, X, y, xlabels)

# statsmodels
x2 = sm.add_constant(X)
models = sm.OLS(y, x2)
result = models.fit()
print(result.summary())
Output as follows:
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.735
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 114.2
Date: Sun, 21 Aug 2016 Prob (F-statistic): 7.59e-134
Time: 21:56:26 Log-Likelihood: -1503.8
No. Observations: 506 AIC: 3034.
Df Residuals: 493 BIC: 3089.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 36.9250 5.148 7.173 0.000 26.811 47.039
x1 -0.1122 0.033 -3.405 0.001 -0.177 -0.047
x2 0.0470 0.014 3.396 0.001 0.020 0.074
x3 0.0406 0.062 0.659 0.510 -0.081 0.162
x4 -17.3970 3.852 -4.516 0.000 -24.966 -9.828
x5 3.8452 0.421 9.123 0.000 3.017 4.673
x6 0.0028 0.013 0.214 0.831 -0.023 0.029
x7 -1.4856 0.201 -7.383 0.000 -1.881 -1.090
x8 0.3279 0.067 4.928 0.000 0.197 0.459
x9 -0.0138 0.004 -3.651 0.000 -0.021 -0.006
x10 -0.9917 0.131 -7.547 0.000 -1.250 -0.734
x11 0.0098 0.003 3.635 0.000 0.005 0.015
x12 -0.5349 0.051 -10.479 0.000 -0.635 -0.435
==============================================================================
Omnibus: 190.837 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 897.143
Skew: 1.619 Prob(JB): 1.54e-195
Kurtosis: 8.663 Cond. No. 1.51e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
I've also found the following articles:
Find p-value (significance) in scikit-learn LinearRegression
http://connor-johnson.com/2014/02/18/linear-regression-with-python/
Neither of the code samples in the SO link compiles for me.
Here is the code and data I'm working with, but I haven't been able to get the standard errors and p-values:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy.linalg
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data

def lr_statsmodel(X, y):
    X = sm.add_constant(X)
    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())

def lr_scikit(X, y, featureCols):
    model = LinearRegression()
    results = model.fit(X, y)
    predictions = results.predict(X)
    print('Coefficients')
    print('Intercept\t', results.intercept_)
    df = pd.DataFrame(list(zip(featureCols, results.coef_)))
    print(df.to_string(index=False, header=False))
    # Query: the numbers match Excel's OLS, but I'm skeptical about treating score as R-squared
    rSquare = results.score(X, y)
    print('\nR-Square::', rSquare)
    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y, results.predict(X))
    print('r2', r2)
    # Query: no clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print('Rsquared?!', metrics.explained_variance_score(y, results.predict(X)))
    # INFO: all three of them report the same figure!
    # Adjusted R-square formula: https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML we usually don't train on all of the data, so adjusted R-squared is rarely reported;
    # hence the manual calculation.
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 - rSquare) * (N - 1) / (N - p - 1))
    print('Adjusted R-Square::', adjRsquare)
    # calculate standard errors
    # available metrics: mean_absolute_error, mean_squared_error, median_absolute_error,
    # r2_score, explained_variance_score
    mse = metrics.mean_squared_error(y, results.predict(X))
    print(mse)
    print('Residual Standard Error:', np.sqrt(mse))
    # OLS in matrix form: https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
        metrics.mean_squared_error(y, results.predict(X)) *
        np.linalg.inv(X1.T * X1)
    )
    print('se', np.diagonal(se_matrix))
    # https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    # http://regressors.readthedocs.io/en/latest/usage.html
    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print('Standard Square Error of the Model:', sse)

if __name__ == '__main__':
    # read file
    fileData = readFile('Linear_regression.xlsx', 'Input Data')
    # list of independent variables
    feature_cols = ['Price per week', 'Population of city', 'Monthly income of riders', 'Average parking rates per month']
    # build dependent & independent data sets
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']
    # statsmodels - OLS
    # lr_statsmodel(X, y)
    # scikit-learn - OLS
    lr_scikit(X, y, feature_cols)
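For what it's worth, at least two things in lr_scikit above differ from what statsmodels computes: metrics.mean_squared_error divides the residual sum of squares by n, while statsmodels' OLS uses the residual degrees of freedom n - p - 1, and taking scipy.linalg.sqrtm of the covariance matrix and then its diagonal is not the same as taking the square root of the diagonal. Below is a minimal sketch of the degrees-of-freedom-corrected calculation (my own illustration; ols_se_pvalues is a made-up helper name), assuming the fitted sklearn model results and the same X and y as above:

import numpy as np
from scipy import stats

def ols_se_pvalues(results, X, y):
    # Design matrix with an explicit intercept column, as in sm.add_constant
    n, p = np.asarray(X).shape
    X1 = np.column_stack((np.ones(n), np.asarray(X)))
    resid = np.asarray(y) - results.predict(X)
    dof = n - p - 1
    # Unbiased residual variance (statsmodels' scale), not SSE / n
    sigma2 = resid.dot(resid) / dof
    cov = sigma2 * np.linalg.inv(X1.T.dot(X1))
    se = np.sqrt(np.diag(cov))                    # standard errors, intercept first
    coefs = np.append(results.intercept_, results.coef_)
    t_vals = coefs / se
    p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)  # two-sided p-values
    return se, t_vals, p_vals

Calling se, t_vals, p_vals = ols_se_pvalues(results, X, y) should then reproduce the std err and P>|t| columns of the statsmodels summary, up to rounding.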
My data set (numbers use Indian digit grouping, e.g. 1,92,000 = 192,000):
City  Y: Number of weekly riders  X1: Price per week  X2: Population of city  X3: Monthly income of riders  X4: Average parking rates per month
1 1,92,000 $15 18,00,000 $5,800 $50
2 1,90,400 $15 17,90,000 $6,200 $50
3 1,91,200 $15 17,80,000 $6,400 $60
4 1,77,600 $25 17,78,000 $6,500 $60
5 1,76,800 $25 17,50,000 $6,550 $60
6 1,78,400 $25 17,40,000 $6,580 $70
7 1,80,800 $25 17,25,000 $8,200 $75
8 1,75,200 $30 17,25,000 $8,600 $75
9 1,74,400 $30 17,20,000 $8,800 $75
10 1,73,920 $30 17,05,000 $9,200 $80
11 1,72,800 $30 17,10,000 $9,630 $80
12 1,63,200 $40 17,00,000 $10,570 $80
13 1,61,600 $40 16,95,000 $11,330 $85
14 1,61,600 $40 16,95,000 $11,600 $100
15 1,60,800 $40 16,90,000 $11,800 $105
16 1,59,200 $40 16,30,000 $11,830 $105
17 1,48,800 $65 16,40,000 $12,650 $105
18 1,15,696 $102 16,35,000 $13,000 $110
19 1,47,200 $75 16,30,000 $13,224 $125
20 1,50,400 $75 16,20,000 $13,766 $130
21 1,52,000 $75 16,15,000 $14,010 $150
22 1,36,000 $80 16,05,000 $14,468 $155
23 1,26,240 $86 15,90,000 $15,000 $165
24 1,23,888 $98 15,95,000 $15,200 $175
25 1,26,080 $87 15,90,000 $15,600 $175
26 1,51,680 $77 16,00,000 $16,000 $190
27 1,52,800 $63 16,10,000 $16,200 $200
I've exhausted all the options I could make sense of, so any guidance on how to compute standard errors and p-values that match those from statsmodels.api would be appreciated.
EDIT: I'm trying to find the standard errors and p-values for the intercept and all the independent variables.
Here reg is the object returned by sklearn's linear regression fit method.
To calculate adjusted R-squared:

def adjustedR2(x, y, reg):
    r2 = reg.score(x, y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adjusted_r2

And for p-values:

from sklearn.feature_selection import f_regression

freg = f_regression(x, y)
p = freg[1]
print(p.round(3))
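A possible usage sketch with the X, y and imports from the question (my own addition, not part of the answer). Note that f_regression tests each feature one at a time, so these p-values are univariate and will generally not match the multivariate p-values in a statsmodels OLS summary:

reg = LinearRegression().fit(X, y)
print('Adjusted R-squared:', adjustedR2(X, y, reg))
p_univariate = f_regression(X, y)[1]
print('Univariate p-values:', p_univariate.round(3))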
Related
Using GLM to reproduce built-in regression models in statsmodels
I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is not part of the standard models provided by statsmodels, I clearly have to write it myself using the provided formula api. Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question. You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:

How do I explicitly add covariates to the glm model with a regression coefficient equal to 1? From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure)), i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant, expressed as a formula. Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect separation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.

Apparently both built-in models 1 and 2 have no intercept, even though I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?

Any help would be greatly appreciated, so thanks in advance!

import numpy as np
import pandas as pd

np.random.seed(1+8+2022)

# Number of random samples
n = 4000
# Intercept
alpha = 1
# Regression Coefficients
beta = np.asarray([2.0, -3.0, 1.0])
# Random Data
data = {
    "x1": np.random.normal(1.00, 0.10, size=n),
    "x2": np.random.normal(1.50, 0.15, size=n),
    "x3": np.random.normal(-2.0, 0.20, size=n),
    "exposure": np.random.poisson(14, size=n),
}
# Calculate the response
x = np.asarray([data["x1"], data["x2"], data["x3"]]).T
t = np.asarray(data["exposure"])
# Add response to random data
data["y"] = np.exp(alpha + np.dot(x, beta) + np.log(t))
# Convert dict to df
data = pd.DataFrame(data)
print(data)

#-----------------------------------------------------
# My Model
#-----------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()
print(model.summary())

#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------
import statsmodels.api as sm

data["offset"] = np.ones(n)
model = sm.Poisson(
    endog=data["y"],
    exog=data[["x1", "x2", "x3"]],
    exposure=data["exposure"],
    offset=data["offset"]).fit()
print(model.summary())

#-----------------------------------------------------
# statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------
import statsmodels.api as sm

data["x1"] = sm.add_constant(data["x1"])
model = sm.Poisson(
    endog=data["y"],
    exog=data[["x1", "x2", "x3"]],
    exposure=data["exposure"]).fit()
print(model.summary())

RESULTS:

            x1        x2        x3  exposure         y
0     1.151771  1.577677 -1.811903        13  0.508422
1     0.897012  1.678311 -2.327583        22  0.228219
2     1.040250  1.471962 -1.705458        13  0.621328
3     0.866195  1.512472 -1.766108        17  0.478107
4     0.925470  1.399320 -1.886349        13  0.512518
...        ...       ...       ...       ...       ...
3995  1.073945  1.365260 -1.755071        12  0.804081
3996  0.855000  1.251951 -2.173843        11  0.439639
3997  0.892066  1.710856 -2.183085        10  0.107643
3998  0.763777  1.538938 -2.013619        22  0.363551
3999  1.056958  1.413922 -1.722252        19  1.098932

[4000 rows x 5 columns]

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                            GLM   Df Residuals:                     3996
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2743.7
Date:                Sat, 08 Jan 2022   Deviance:                       141.11
Time:                        09:32:32   Pearson chi2:                     140.
No. Iterations:                     4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.6857      0.378      9.755      0.000       2.945       4.426
x1             2.0020      0.227      8.800      0.000       1.556       2.448
x2            -3.0393      0.148    -20.604      0.000      -3.328      -2.750
x3             0.9937      0.114      8.719      0.000       0.770       1.217
==============================================================================

Optimization terminated successfully.
         Current function value: 0.668293
         Iterations 10
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                        Poisson   Df Residuals:                     3997
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                 0.09462
Time:                        09:32:32   Log-Likelihood:                -2673.2
converged:                       True   LL-Null:                       -2952.6
Covariance Type:            nonrobust   LLR p-value:                4.619e-122
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.0000      0.184     10.875      0.000       1.640       2.360
x2            -3.0000      0.124    -24.160      0.000      -3.243      -2.757
x3             1.0000      0.094     10.667      0.000       0.816       1.184
==============================================================================

Optimization terminated successfully.
         Current function value: 0.677893
         Iterations 5
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                        Poisson   Df Residuals:                     3997
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                 0.08162
Time:                        09:32:32   Log-Likelihood:                -2711.6
converged:                       True   LL-Null:                       -2952.6
Covariance Type:            nonrobust   LLR p-value:                2.196e-105
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9516      0.304      9.711      0.000       2.356       3.547
x2            -2.9801      0.147    -20.275      0.000      -3.268      -2.692
x3             0.9807      0.113      8.655      0.000       0.759       1.203
==============================================================================
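On the exposure question, a hedged aside of my own (not part of the original post): statsmodels' GLM accepts an offset (or exposure) argument whose coefficient is fixed at 1 rather than estimated, so log(exposure) does not have to appear in the formula at all. A minimal sketch, assuming the data DataFrame built in the snippet above:

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# The offset enters the linear predictor with a fixed unit coefficient;
# exposure=data["exposure"] would be equivalent (GLM takes the log internally).
model = smf.glm(
    formula="y ~ x1 + x2 + x3",
    data=data,
    family=sm.families.Poisson(),
    offset=np.log(data["exposure"]),
).fit()
print(model.summary())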
How to Incorporate and Forecast Lagged Time-Series Variables in a Python Regression Model
I'm trying to figure out how to incorporate lagged dependent variables into statsmodels or scikit-learn to forecast time series with AR terms, but cannot seem to find a solution. The general linear equation looks something like this:

y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e

I know I can use pd.Series.shift(t) to create lagged variables and then add them to the model to generate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable? In SAS's Proc Autoreg you can designate which variable is a lagged dependent variable and it will forecast accordingly, but it seems there are no options like that in Python. Any help would be greatly appreciated and thank you in advance.
Since you've already mentioned statsmodels in your tags, you may want to take a look at statsmodels' ARIMA, i.e.:

from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(endog=t, order=(2, 0, 0))  # p=2, d=0, q=0 for AR(2)
fit = model.fit()
fit.summary()

But like you mentioned, you could create new variables manually the way you described (I used some random data):

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']

shifts = 3
for variable in df.columns.values:
    for t in range(1, shifts + 1):
        df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()

>>> df.head()
                   y        x1  x2  ...  x2 AR(1)  x2 AR(2)  x2 AR(3)
date                                ...
1991-10-01  0.715115  3.611003   7  ...       5.0       7.0       7.0
1991-11-01  0.202662  3.565869   3  ...       7.0       5.0       7.0
1991-12-01  0.121624  4.306371   7  ...       3.0       7.0       5.0
1992-01-01  0.043412  5.088335   6  ...       7.0       3.0       7.0
1992-02-01  0.853334  2.814520   2  ...       6.0       7.0       3.0

[5 rows x 12 columns]

I'm using the model you describe in your post:

model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()

>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.691
Method:                 Least Squares   F-statistic:                     150.8
Date:                Tue, 08 Oct 2019   Prob (F-statistic):           6.93e-51
Time:                        17:51:20   Log-Likelihood:                -53.357
No. Observations:                 201   AIC:                             112.7
Df Residuals:                     198   BIC:                             122.6
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
y AR(1)        0.2972      0.072      4.142      0.000       0.156       0.439
x1             0.0211      0.003      6.261      0.000       0.014       0.028
x2 AR(3)       0.0161      0.007      2.264      0.025       0.002       0.030
==============================================================================
Omnibus:                        2.115   Durbin-Watson:                   2.277
Prob(Omnibus):                  0.347   Jarque-Bera (JB):                1.712
Skew:                           0.064   Prob(JB):                        0.425
Kurtosis:                       2.567   Cond. No.                         41.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

Hope this helps you getting started.
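Not part of the original answer, but since the question also asks about forecasting: one illustrative way to get a one-step-ahead prediction is to build the next row of regressors from the latest available lags and feed it to the fitted model. This sketch assumes the df and fit from the answer above, and that x1 for the forecast period is known (here it is simply held at its last observed value):

# Regressors for period T+1: y AR(1) is y at T, x2 AR(3) is x2 at T-2,
# and x1 must be supplied for the forecast period itself.
next_exog = pd.DataFrame({
    'y AR(1)': [df['y'].iloc[-1]],
    'x1': [df['x1'].iloc[-1]],        # placeholder assumption: last observed x1
    'x2 AR(3)': [df['x2'].iloc[-2]],
})
print(fit.predict(next_exog))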
Dual beta in python - multiple linear regression with dummy variable in statsmodel
I am trying to calculate the dual beta in python using a statsmodels regression. Unfortunately I am prompted with an error message. The regression equation for dual betas is given here: Dual Beta Formula. I am neglecting the risk-free rate (rf) for now, but the implementation should be similar once I add it. For now my code looks as follows, where my 'spx.xlsx' file simply has two columns with returns, called 'SPXrets' and 'AAPLrets' (plus one column with dates):

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

df = pd.read_excel('spx.xlsx')
print(df.columns)

mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
print(res.summary())

This prompts a patsy error:

PatsyError: intercept term cannot interact with anything else
    AAPLrets ~ SPXrets:c(D) + SPXrets:(1-c(D))

Grateful for any help - many thanks!
Edit: After my initial suggestions, the OP has changed both the title and the provided code snippet. My suggestions have since been edited accordingly.

New suggestion: I suspect you're experiencing some problems with your dataset. I suggest that you tell us a little more about the data source, how you've loaded the data, what it looks like (structure) and what type your columns have (string, float etc). What I can tell you right now is that your snippet runs fine with some sample data like this:

Code:

            CONret  DAXret:c(D)  DAXret:(1-c(D))  AAPLrets  SPXrets  dummy
2017-01-08     109          107              122       101      100      0
2017-01-09     117          108              124       113      147      0
2017-01-10     142          108              130       107      103      1
2017-01-11     106          121              149       103      104      1
2017-01-12     124          149              143       112      126      0

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:               AAPLrets   R-squared:                       0.095
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     1.044
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.331
Time:                        16:00:01   Log-Likelihood:                -48.388
No. Observations:                  12   AIC:                             100.8
Df Residuals:                      10   BIC:                             101.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     84.3198     31.143      2.708      0.022      14.929     153.711
SPXrets        0.2635      0.258      1.022      0.331      -0.311       0.838
==============================================================================
Omnibus:                        5.649   Durbin-Watson:                   1.882
Prob(Omnibus):                  0.059   Jarque-Bera (JB):                2.933
Skew:                           1.202   Prob(JB):                        0.231
Kurtosis:                       3.290   Cond. No.                         872.
==============================================================================

Here's the whole thing:

# imports
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import statsmodels.api as sm

# sample data
np.random.seed(1)
rows = 12
listVars = ['CONret', 'DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100, 150, size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])

mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
res.summary()

Another suggestion: Personally, I'd feel much more comfortable without patsy. The snippet below will let you run a linear regression and select whether to return the model summary, or a dataframe with other details like coefficient p-values and r-squared.

# imports
import pandas as pd
import numpy as np
import statsmodels.api as sm

# sample data
np.random.seed(1)
rows = 12
listVars = ['CONret', 'DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100, 150, size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])

def LinReg(df, y, x, const, results):
    betas = x.copy()
    # Model with or without a constant
    if const == True:
        x = sm.add_constant(df[x])
        model = sm.OLS(df[y], x).fit()
    else:
        model = sm.OLS(df[y], df[x]).fit()
    # Estimates of R2 and p
    res1 = {'Y': [y],
            'R2': [format(model.rsquared, '.4f')],
            'p': [model.pvalues.tolist()],
            'start': [df.index[0]],
            'stop': [df.index[-1]],
            'obs': [df.shape[0]],
            'X': [betas]}
    df_res1 = pd.DataFrame(data=res1)
    # Regression Coefficients
    theParams = model.params[0:]
    coefs = theParams.to_frame()
    df_coefs = pd.DataFrame(coefs.T)
    xNames = list(df_coefs)
    xValues = list(df_coefs.loc[0].values)
    xValues2 = ['%.2f' % elem for elem in xValues]
    res2 = {'Independent': [xNames], 'beta': [xValues2]}
    df_res2 = pd.DataFrame(data=res2)
    # All results
    df_res = pd.concat([df_res1, df_res2], axis=1)
    df_res = df_res.T
    df_res.columns = ['results']
    if results == 'summary':
        return model.summary()
    else:
        return df_res

df_regression = LinReg(df=df, y='CONret', x=['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const=True, results='summary')
print(df_regression)

Test run 1:

df_regression = LinReg(df=df, y='CONret', x=['DAXret:c(D)', 'DAXret:(1-c(D))'], const=True, results='')

Output 1:

                                                       results
Y                                                       CONret
R2                                                      0.0813
p            [0.13194822614949883, 0.45726622261432304, 0.9...
start                                      2017-01-01 00:00:00
stop                                       2017-01-12 00:00:00
obs                                                         12
X                        [DAXret:c(D), DAXret:(1-c(D)), dummy]
Independent       [const, DAXret:c(D), DAXret:(1-c(D)), dummy]
beta                                [88.94, 0.24, -0.01, 2.20]

Test run 2:

df_regression = LinReg(df=df, y='CONret', x=['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const=True, results='summary')

Output 2:

                            OLS Regression Results
==============================================================================
Dep. Variable:                 CONret   R-squared:                       0.081
Model:                            OLS   Adj. R-squared:                 -0.263
Method:                 Least Squares   F-statistic:                    0.2361
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.869
Time:                        16:04:02   Log-Likelihood:                -47.138
No. Observations:                  12   AIC:                             102.3
Df Residuals:                       8   BIC:                             104.2
Df Model:                           3
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              88.9438     53.019      1.678      0.132     -33.318     211.205
DAXret:c(D)         0.2350      0.301      0.781      0.457      -0.459       0.929
DAXret:(1-c(D))    -0.0060      0.391     -0.015      0.988      -0.908       0.896
dummy               2.2005      8.973      0.245      0.812     -18.490      22.891
==============================================================================
Omnibus:                        1.025   Durbin-Watson:                   2.354
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.720
Skew:                           0.540   Prob(JB):                        0.698
Kurtosis:                       2.477   Cond. No.                     2.15e+03
==============================================================================
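Purely as an illustration of the dual-beta split the question was originally after (my own sketch, not part of the answer above): the up- and down-market terms can be precomputed as ordinary columns, which avoids the patsy "intercept term cannot interact" error entirely. This assumes df holds AAPLrets, SPXrets and a 0/1 column dummy marking up-market periods:

import statsmodels.formula.api as smf

df['SPX_up'] = df['SPXrets'] * df['dummy']          # market return on up days, 0 otherwise
df['SPX_down'] = df['SPXrets'] * (1 - df['dummy'])  # market return on down days, 0 otherwise

dual = smf.ols('AAPLrets ~ SPX_up + SPX_down', data=df).fit()
print(dual.summary())  # the two slopes are the upside and downside betas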
Python: Super Dictionary of Simple OLS
I am trying to build a super dictionary which holds within it a number of lower-level dictionaries.

Concept: I have interest rates for my retail bank for the last 12 years and I am trying to model the interest rates by using a portfolio of different bonds.

Regression formula:

Y_i - Y_i-1 = A + B(X_i - X_i-1) + E

In words: Y_Lag = alpha + beta(X_Lag) + error term

Data (note: Y = Historic Rate):

df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100,17)), columns=['Historic Rate', 'Overnight', '1M', '3M', '6M','1Y','2Y','3Y','4Y','5Y','6Y','7Y','8Y','9Y','10Y','12Y','15Y'])

Code thus far:

#Import packages required for the analysis
import pandas as pd
import numpy as np
import statsmodels.api as sm

def Simulation(TotalSim, j):
    #super dictionary to hold all iterations of the loop
    Super_fit_d = {}
    for i in range(1, TotalSim):
        #Create an introductory loop to run the first set of regressions
        #Each loop produces a univariate regression
        #Each loop has a fixed lag of i
        fit_d = {}  # This will hold all of the fit results and summaries
        for col in [x for x in df.columns if x != 'Historic Rate']:
            Y = df['Historic Rate'] - df['Historic Rate'].shift(1)
            # Need to remove the NaN for fit
            Y = Y[Y.notnull()]
            X = df[col] - df[col].shift(i)
            X = X[X.notnull()]
            #Y now has more observations than X due to lag, drop rows to match
            Y = Y.drop(Y.index[0:i-1])
            if j = 1:
                X = sm.add_constant(X)  # Add a constant to the fit
            fit_d[col] = sm.OLS(Y, X).fit()
        #append the dictionary for each lag onto the super dictionary
        Super_fit_d[lag_i] = fit_d

#Check the output for one column
fit_d['Overnight'].summary()
#Check the output for one column in one segment of the super dictionary
Super_fit_d['lag_5'].fit_d['Overnight'].summary()

Simulation(11, 1)

Question: I seem to be overwriting my dictionary with every loop and I'm not evaluating i properly to index the iteration as lag_1, lag_2, lag_3 etc. How do I fix this? Thanks in advance.
There are a couple of issues here:

you sometimes use i and sometimes lag_i, but only i is defined. I changed all to lag_i for consistency
if j = 1 is incorrect syntax. You need if j == 1
You need to return fit_d so that it persists after your loop

I got it done by applying those changes:

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100, 17)),
                  columns=['Historic Rate', 'Overnight', '1M', '3M', '6M', '1Y', '2Y', '3Y', '4Y',
                           '5Y', '6Y', '7Y', '8Y', '9Y', '10Y', '12Y', '15Y'])

def Simulation(TotalSim, j):
    Super_fit_d = {}
    for lag_i in range(1, TotalSim):
        #Create an introductory loop to run the first set of regressions
        #Each loop produces a univariate regression
        #Each loop has a fixed lag of i
        fit_d = {}  # This will hold all of the fit results and summaries
        for col in [x for x in df.columns if x != 'Historic Rate']:
            Y = df['Historic Rate'] - df['Historic Rate'].shift(1)
            # Need to remove the NaN for fit
            Y = Y[Y.notnull()]
            X = df[col] - df[col].shift(lag_i)
            X = X[X.notnull()]
            #Y now has more observations than X due to lag, drop rows to match
            Y = Y.drop(Y.index[0:lag_i-1])
            if j == 1:
                X = sm.add_constant(X)  # Add a constant to the fit
            fit_d[col] = sm.OLS(Y, X).fit()
        #append the dictionary for each lag onto the super dictionary
        # return fit_d
        Super_fit_d[lag_i] = fit_d
    return Super_fit_d

test_dict = Simulation(11, 1)

First lag:

test_dict[1]['Overnight'].summary()
Out[76]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:          Historic Rate   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.033
Method:                 Least Squares   F-statistic:                     4.303
Date:                Fri, 28 Sep 2018   Prob (F-statistic):             0.0407
Time:                        11:15:13   Log-Likelihood:                -280.39
No. Observations:                  99   AIC:                             564.8
Df Residuals:                      97   BIC:                             570.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0048      0.417     -0.012      0.991      -0.833       0.823
Overnight      0.2176      0.105      2.074      0.041       0.009       0.426
==============================================================================
Omnibus:                        1.449   Durbin-Watson:                   2.756
Prob(Omnibus):                  0.485   Jarque-Bera (JB):                1.180
Skew:                           0.005   Prob(JB):                        0.554
Kurtosis:                       2.465   Cond. No.                         3.98
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

Second lag:

test_dict[2]['Overnight'].summary()
Out[77]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:          Historic Rate   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                   0.06845
Date:                Fri, 28 Sep 2018   Prob (F-statistic):              0.794
Time:                        11:15:15   Log-Likelihood:                -279.44
No. Observations:                  98   AIC:                             562.9
Df Residuals:                      96   BIC:                             568.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0315      0.428      0.074      0.941      -0.817       0.880
Overnight      0.0291      0.111      0.262      0.794      -0.192       0.250
==============================================================================
Omnibus:                        2.457   Durbin-Watson:                   2.798
Prob(Omnibus):                  0.293   Jarque-Bera (JB):                1.735
Skew:                           0.115   Prob(JB):                        0.420
Kurtosis:                       2.391   Cond. No.                         3.84
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
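As a small follow-up sketch of my own (not in the original answer), the nested dictionary returned by Simulation can be flattened into a table, for example R-squared per lag and per tenor, which makes the fits easier to compare. This assumes the test_dict produced above:

import pandas as pd

# Rows = lag, columns = tenor, values = R-squared of each univariate fit
r2_table = pd.DataFrame({
    lag: {col: fit.rsquared for col, fit in fits.items()}
    for lag, fits in test_dict.items()
}).T
print(r2_table.round(3))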
how to get equation for nonlinear multivariate regression in which one variable is dependent on other two independent variables in python
I have a set of 5000 data points like (x, y, z), for example (0, 1, 50), where x=1, y=2, z=120. With the help of these 5000 entries, I have to get an equation such that, given x and y, the equation can produce the value of z.
You can use statsmodels OLS. Some sample data - assuming you can create a pd.DataFrame from your (x, y, z) data:

import pandas as pd
df = pd.DataFrame(np.random.randint(100, size=(150, 3)), columns=list('XYZ'))

df.info()
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
X    150 non-null int64
Y    150 non-null int64
Z    150 non-null int64

Now estimate linear regression parameters:

import numpy as np
import statsmodels.api as sm

model = sm.OLS(df['Z'], df[['X', 'Y']])
results = model.fit()

to get:

results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Z   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.647
Method:                 Least Squares   F-statistic:                     138.6
Date:                Fri, 17 Jun 2016   Prob (F-statistic):           1.21e-34
Time:                        13:48:38   Log-Likelihood:                -741.94
No. Observations:                 150   AIC:                             1488.
Df Residuals:                     148   BIC:                             1494.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
X              0.5224      0.076      6.874      0.000       0.372       0.673
Y              0.3531      0.076      4.667      0.000       0.204       0.503
==============================================================================
Omnibus:                        5.869   Durbin-Watson:                   1.921
Prob(Omnibus):                  0.053   Jarque-Bera (JB):                2.990
Skew:                          -0.000   Prob(JB):                        0.224
Kurtosis:                       2.308   Cond. No.                         2.70
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To predict, use:

params = results.params
df['predictions'] = model.predict(params)

which yields:

    X   Y   Z  predictions
0  31  85  75    54.701830
1  36  46  43    34.828605
2  77  42   8    43.795386
3  78  84  65    66.932761
4  27  54  50    36.737606
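Since the question asks for a nonlinear relationship, one possible extension of the answer above (my own illustration, not part of the original answer) is to add an intercept and polynomial terms and let OLS estimate their coefficients; the fitted params then give the equation z ≈ b0 + b1*x + b2*y + b3*x^2 + b4*y^2 + b5*x*y:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Assumes df has numeric columns X, Y, Z as in the sample above
features = pd.DataFrame({
    'X': df['X'], 'Y': df['Y'],
    'X2': df['X'] ** 2, 'Y2': df['Y'] ** 2, 'XY': df['X'] * df['Y'],
})
features = sm.add_constant(features)        # adds the b0 intercept column
poly_model = sm.OLS(df['Z'], features).fit()
print(poly_model.params)                    # coefficients of the fitted equation
df['Z_hat'] = poly_model.predict(features)  # evaluate the equation for each (x, y)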