I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python and have listed the code I used below.
auto_1= pd.read_csv("Auto.csv")
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())
The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary prints, it shows multiple rows for the horsepower column and the regression results are also not correct. I'm not sure why.
[screenshot of repeated horsepower rows]
It's likely because of the data type of the horsepower column. If its values are categories or plain strings, the model will apply treatment (dummy) coding to them by default, producing one coefficient row per distinct value, which is what you are seeing. Check the data type (run auto_1.dtypes) and cast the column to a numeric type, either when first reading the CSV file (the dtype= parameter of read_csv()) or afterwards with pd.to_numeric().
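For example, a minimal sketch of the check and fix, assuming the non-numeric entries are placeholder strings such as '?' (which some versions of this dataset use for missing horsepower):
import pandas as pd

auto_1 = pd.read_csv("Auto.csv")
print(auto_1.dtypes)  # 'object' for horsepower means it was read as strings

# Coerce placeholder strings to NaN, then drop the affected rows
auto_1['horsepower'] = pd.to_numeric(auto_1['horsepower'], errors='coerce')
auto_1 = auto_1.dropna(subset=['horsepower'])

# Alternatively, handle it at read time:
# auto_1 = pd.read_csv("Auto.csv", na_values='?')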
Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame(
    {
        'mpg': np.random.randint(20, 40, 50),
        'horsepower': np.random.randint(100, 200, 50)
    }
)
# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)
formula = 'mpg ~ horsepower'
results = smf.ols(formula, df).fit()
print(results.summary())
Output (dummy coding):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.778
Model: OLS Adj. R-squared: -0.207
Method: Least Squares F-statistic: 0.7901
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.715
Time: 20:17:51 Log-Likelihood: -110.27
No. Observations: 50 AIC: 302.5
Df Residuals: 9 BIC: 380.9
Df Model: 40
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 32.0000 5.175 6.184 0.000 20.294 43.706
horsepower[T.103] -4.0000 7.318 -0.547 0.598 -20.555 12.555
horsepower[T.112] -1.0000 7.318 -0.137 0.894 -17.555 15.555
horsepower[T.116] -9.0000 7.318 -1.230 0.250 -25.555 7.555
horsepower[T.117] 6.0000 7.318 0.820 0.433 -10.555 22.555
horsepower[T.118] 2.0000 7.318 0.273 0.791 -14.555 18.555
horsepower[T.120] -4.0000 6.338 -0.631 0.544 -18.337 10.337
etc.
Now, converting the strings back to integers:
df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')
results = smf.ols(formula, df).fit()
print(results.summary())
Output (as expected):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.011
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.5388
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.466
Time: 20:24:54 Log-Likelihood: -147.65
No. Observations: 50 AIC: 299.3
Df Residuals: 48 BIC: 303.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 31.7638 3.663 8.671 0.000 24.398 39.129
horsepower -0.0176 0.024 -0.734 0.466 -0.066 0.031
==============================================================================
Omnibus: 3.529 Durbin-Watson: 1.859
Prob(Omnibus): 0.171 Jarque-Bera (JB): 1.725
Skew: 0.068 Prob(JB): 0.422
Kurtosis: 2.100 Cond. No. 834.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
I'm trying to do a linear regression to predict the count of a dataframe based on the count of another dataframe. I am using statsmodels. I have tried the following:
X = df1.count()
Y = df2.count()
from statsmodels.formula.api import ols
fit = ols(Y ~ X, data=kak).fit()
fit.summary()
Using the X and Y variables in the OLS formula is not allowed and I have no idea what to fill in at the data= keyword argument. How would I go about doing this?
Let's assume you have two 1D arrays X and Y:
import numpy as np
X = np.arange(100)
Y = 2*X + 5
Then you can run a linear regression using the line below. There are two important things:
The first argument to ols is a str containing a formula: "Y ~ X", not Y ~ X (note the quotes).
The second argument, data, can be any Python object as long as it has keys data["X"] and data["Y"]. I used a dict here, but it would also work with a DataFrame. It lets statsmodels resolve the names X and Y in the formula you gave it.
ols("Y ~ X", {"X": X, "Y": Y}).fit().summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.916e+33
Date: Fri, 21 May 2021 Prob (F-statistic): 0.00
Time: 14:01:48 Log-Likelihood: 3055.1
No. Observations: 100 AIC: -6106.
Df Residuals: 98 BIC: -6101.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0000 2.62e-15 1.91e+15 0.000 5.000 5.000
X 2.0000 4.57e-17 4.38e+16 0.000 2.000 2.000
==============================================================================
Omnibus: 220.067 Durbin-Watson: 0.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9.816
Skew: 0.159 Prob(JB): 0.00739
Kurtosis: 1.498 Cond. No. 114.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
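Since data can be any mapping-like object, the same call works with a DataFrame, as noted above:
import pandas as pd

df = pd.DataFrame({"X": X, "Y": Y})
ols("Y ~ X", data=df).fit().summary()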
I get completely different results with the same datasets in R and Python, and I cannot understand why.
R:
library(RcppCNPy)
d <- npyLoad("/home/vvkovalchuk/bin/src/python/asks1.npy")
datas = npyLoad('/home/vvkovalchuk/bin/src/python/bids2.npy')
m <- lm(d ~ datas)
summary(m)
Python:
import time
import numpy
import statsmodels.api as sm
from math import log
Y = numpy.load('./asks1.npy', allow_pickle=True)
X = numpy.load('./bids2.npy', allow_pickle=True)
X3 = sm.add_constant(X)
res_ols = sm.OLS(Y, X3).fit()
print(res_ols.params)
What am I doing wrong?
Results:
R:
Call:
lm(formula = d ~ datas)
Residuals:
Min 1Q Median 3Q Max
-6.089e+06 8.797e+07 2.163e+08 2.179e+08 1.122e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.561e+00 2.253e+06 0 1
datas 3.809e+03 2.164e+09 0 1
Residual standard error: 208100000 on 14639 degrees of freedom
Multiple R-squared: 0.2735, Adjusted R-squared: 0.2735
F-statistic: 5512 on 1 and 14639 DF, p-value: < 2.2e-16
Python:
OLS Regression Results
Dep. Variable: y R-squared: 0.112
Model: OLS Adj. R-squared: 0.112
Method: Least Squares F-statistic: 1846.
Date: Thu, 25 Mar 2021 Prob (F-statistic): 0.00
Time: 13:08:43 Log-Likelihood: 1.6948e+05
No. Observations: 14641 AIC: -3.390e+05
Df Residuals: 14639 BIC: -3.389e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0004 3.07e-06 126.136 0.000 0.000 0.000
x1 0.1478 0.003 42.969 0.000 0.141 0.155
Omnibus: 3251.130 Durbin-Watson: 0.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 14606.605
Skew: 1.019 Prob(JB): 0.00
Kurtosis: 7.449 Cond. No. 1.83e+05
I also tried swapping the arguments to the OLS function, but I still get incorrect results. Could this be related to NAs?
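To check the NA hypothesis, I can inspect both arrays before fitting (a minimal diagnostic sketch):
import numpy as np

Y = np.load('./asks1.npy', allow_pickle=True)
X = np.load('./bids2.npy', allow_pickle=True)

# object-dtype arrays and NaNs change what actually gets fit
print(Y.dtype, X.dtype, Y.shape, X.shape)
print(np.isnan(Y.astype(float)).sum(), np.isnan(X.astype(float)).sum())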
I'm trying to figure out how to incorporate lagged dependent variables into statsmodels or scikit-learn to forecast time series with AR terms, but cannot seem to find a solution.
The general linear equation looks something like this:
y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e
I know I can use pd.Series.shift(t) to create lagged variables and then add it to be included in the model and generate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable?
In SAS's Proc Autoreg, you can designate which variable is a lagged dependent variable and will forecast accordingly, but it seems like there are no options like that in Python.
Any help would be greatly appreciated and thank you in advance.
Since you've already mentioned statsmodels in your tags, you may want to take a look at statsmodels - ARIMA, i.e.:
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model path is removed in recent versions
model = ARIMA(endog=y, order=(2, 0, 0))  # p=2, d=0, q=0 for an AR(2), where y is your time series
fit = model.fit()
fit.summary()
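If you also need exogenous regressors alongside the AR terms, as in your equation, AutoReg accepts an exog argument and handles the lagged dependent variable internally, so forecasting works out of the box. A minimal sketch, assuming y is your series and x is an aligned DataFrame of regressors:
from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(endog=y, lags=1, exog=x)  # AR(1) in y plus exogenous terms
fit = model.fit()
print(fit.summary())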
But like you mentioned, you could create new variables manually the way you described (I used some random data):
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']
shifts = 3
for variable in df.columns.values:
    for t in range(1, shifts + 1):
        df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()
>>> df.head()
y x1 x2 ... x2 AR(1) x2 AR(2) x2 AR(3)
date ...
1991-10-01 0.715115 3.611003 7 ... 5.0 7.0 7.0
1991-11-01 0.202662 3.565869 3 ... 7.0 5.0 7.0
1991-12-01 0.121624 4.306371 7 ... 3.0 7.0 5.0
1992-01-01 0.043412 5.088335 6 ... 7.0 3.0 7.0
1992-02-01 0.853334 2.814520 2 ... 6.0 7.0 3.0
[5 rows x 12 columns]
I'm using the model you describe in your post:
model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()
>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.696
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 150.8
Date: Tue, 08 Oct 2019 Prob (F-statistic): 6.93e-51
Time: 17:51:20 Log-Likelihood: -53.357
No. Observations: 201 AIC: 112.7
Df Residuals: 198 BIC: 122.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
y AR(1) 0.2972 0.072 4.142 0.000 0.156 0.439
x1 0.0211 0.003 6.261 0.000 0.014 0.028
x2 AR(3) 0.0161 0.007 2.264 0.025 0.002 0.030
==============================================================================
Omnibus: 2.115 Durbin-Watson: 2.277
Prob(Omnibus): 0.347 Jarque-Bera (JB): 1.712
Skew: 0.064 Prob(JB): 0.425
Kurtosis: 2.567 Cond. No. 41.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Hope this helps you get started.
I am trying to calculate the dual beta in Python using a statsmodels regression. Unfortunately I am getting an error message.
The regression equation for dual betas is given here: [image: Dual Beta Formula]
I am neglecting the risk-free rate (rf) for now, but the implementation should be similar once I add it. For now my code looks as follows, where my 'spx.xlsx' file simply has two columns with returns, called 'SPXrets' and 'AAPLrets' (plus one column with dates):
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df = pd.read_excel('spx.xlsx')
print(df.columns)
mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
print(res.summary())
This prompts a patsy error:
PatsyError: intercept term cannot interact with anything else
AAPLrets ~ SPXrets:c(D) + SPXrets:(1-c(D))
Grateful for any help - many thanks!
Edit:
After my initial suggestions, the OP has changed both the title and the provided code snippet. My suggestions have since been edited accordingly.
New suggestion:
I suspect you're experiencing some problems with your dataset.
I suggest that you tell us a little more about the data source: how you've loaded the data, what it looks like (structure), and what types your columns have (string, float etc.).
What I can tell you right now, is that your snippet runs fine with some sample data like this:
Sample data:
CONret DAXret:c(D) DAXret:(1-c(D)) AAPLrets SPXrets dummy
2017-01-08 109 107 122 101 100 0
2017-01-09 117 108 124 113 147 0
2017-01-10 142 108 130 107 103 1
2017-01-11 106 121 149 103 104 1
2017-01-12 124 149 143 112 126 0
Output:
OLS Regression Results
==============================================================================
Dep. Variable: AAPLrets R-squared: 0.095
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 1.044
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.331
Time: 16:00:01 Log-Likelihood: -48.388
No. Observations: 12 AIC: 100.8
Df Residuals: 10 BIC: 101.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 84.3198 31.143 2.708 0.022 14.929 153.711
SPXrets 0.2635 0.258 1.022 0.331 -0.311 0.838
==============================================================================
Omnibus: 5.649 Durbin-Watson: 1.882
Prob(Omnibus): 0.059 Jarque-Bera (JB): 2.933
Skew: 1.202 Prob(JB): 0.231
Kurtosis: 3.290 Cond. No. 872.
==============================================================================
Here's the whole thing:
# imports
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import statsmodels.api as sm
# sample data
np.random.seed(1)
rows = 12
listVars= ['CONret','DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])
mod = smf.ols(formula='AAPLrets ~ SPXrets', data=df)
res = mod.fit()
res.summary()
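Incidentally, the patsy error itself comes from the (1-c(D)) term: patsy parses the 1 as the intercept, and the intercept cannot interact with other terms. Wrapping the arithmetic in I() avoids that. A minimal sketch of a dual-beta style formula, assuming dummy is a 0/1 up-market indicator column like the one above:
# separate slopes for up- and down-market days; I() makes patsy evaluate
# the arithmetic instead of parsing it as formula syntax
mod = smf.ols(formula='AAPLrets ~ SPXrets:dummy + SPXrets:I(1 - dummy)', data=df)
res = mod.fit()
print(res.summary())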
Another suggestion:
Personally, I'd feel much more comfortable without patsy.
The snippet below will let you run a linear regression and select whether to return the model summary, or a dataframe with other details like coefficient p-values and r-squared.
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
# sample data
np.random.seed(1)
rows = 12
listVars= ['CONret','DAXret:c(D)', 'DAXret:(1-c(D))', 'AAPLrets', 'SPXrets']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
df['dummy'] = np.random.randint(2, size=df.shape[0])
def LinReg(df, y, x, const, results):
    betas = x.copy()
    # Model with or without a constant
    if const:
        X = sm.add_constant(df[x])
        model = sm.OLS(df[y], X).fit()
    else:
        model = sm.OLS(df[y], df[x]).fit()
    # Estimates of R2 and p
    res1 = {'Y': [y],
            'R2': [format(model.rsquared, '.4f')],
            'p': [model.pvalues.tolist()],
            'start': [df.index[0]],
            'stop': [df.index[-1]],
            'obs': [df.shape[0]],
            'X': [betas]}
    df_res1 = pd.DataFrame(data=res1)
    # Regression coefficients
    theParams = model.params[0:]
    coefs = theParams.to_frame()
    df_coefs = pd.DataFrame(coefs.T)
    xNames = list(df_coefs)
    xValues = list(df_coefs.loc[0].values)
    xValues2 = ['%.2f' % elem for elem in xValues]
    res2 = {'Independent': [xNames],
            'beta': [xValues2]}
    df_res2 = pd.DataFrame(data=res2)
    # All results
    df_res = pd.concat([df_res1, df_res2], axis=1)
    df_res = df_res.T
    df_res.columns = ['results']
    if results == 'summary':
        return model.summary()
    else:
        return df_res
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const = True, results = 'summary')
print(df_regression)
Test run 1:
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))'], const = True, results = '')
Output 1:
results
Y CONret
R2 0.0813
p [0.13194822614949883, 0.45726622261432304, 0.9...
start 2017-01-01 00:00:00
stop 2017-01-12 00:00:00
obs 12
X [DAXret:c(D), DAXret:(1-c(D)), dummy]
Independent [const, DAXret:c(D), DAXret:(1-c(D)), dummy]
beta [88.94, 0.24, -0.01, 2.20]
Test run 2:
df_regression = LinReg(df = df, y = 'CONret', x = ['DAXret:c(D)', 'DAXret:(1-c(D))', 'dummy'], const = True, results = 'summary')
Output 2:
OLS Regression Results
==============================================================================
Dep. Variable: CONret R-squared: 0.081
Model: OLS Adj. R-squared: -0.263
Method: Least Squares F-statistic: 0.2361
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.869
Time: 16:04:02 Log-Likelihood: -47.138
No. Observations: 12 AIC: 102.3
Df Residuals: 8 BIC: 104.2
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const 88.9438 53.019 1.678 0.132 -33.318 211.205
DAXret:c(D) 0.2350 0.301 0.781 0.457 -0.459 0.929
DAXret:(1-c(D)) -0.0060 0.391 -0.015 0.988 -0.908 0.896
dummy 2.2005 8.973 0.245 0.812 -18.490 22.891
==============================================================================
Omnibus: 1.025 Durbin-Watson: 2.354
Prob(Omnibus): 0.599 Jarque-Bera (JB): 0.720
Skew: 0.540 Prob(JB): 0.698
Kurtosis: 2.477 Cond. No. 2.15e+03
==============================================================================