Control for a variable in regression (Python)

I am trying to perform a multiple linear regression with statsmodels while controlling for one variable, but I am not sure how to control for a variable in Python.
import statsmodels.formula.api as smf
results = smf.ols('y ~ Var1 + Var2', data=df).fit()
print(results.summary())
How can I control for Var2 in this example? Thank you.

Related

Find RMSE and Standard Deviation of a StatsModels Multiple Regression

I currently have a multiple regression that generates an OLS summary based on life expectancy and the variables that impact it, but the summary does not include RMSE or standard deviation. Does statsmodels have an RMSE function, and is there a way to calculate standard deviation from my code?
I have found a previous example of this problem: regression model statsmodel python, and I read the statsmodels documentation page: https://www.statsmodels.org/stable/generated/statsmodels.tools.eval_measures.rmse.html, but after testing I am still not able to resolve the problem.
import pandas as pd
import openpyxl  # engine pandas uses to read .xlsx files
import statsmodels.formula.api as smf

df = pd.read_excel('C:/Users/File1.xlsx', sheet_name='States')
dfME = df[df['State'] == "Maine"]
pd.set_option('display.max_columns', None)
dfME.head()
# Q() quotes the column name because it contains a space
model = smf.ols("Q('Life Expectancy') ~ Race + Age + Weight + C(Pets)", data=dfME)
modelfit = model.fit()
print(modelfit.summary())
It sounds like you mean the Standard Deviation of the Residuals, which is calculated using the Root Mean Squared Error. This gives you a measure of how spread out the data points are from the line of best fit. It's often used as a measure of Prediction Error.
There is a lot of information left off the summary in Statsmodels. Fortunately, Statsmodels provides us with alternatives. You can find a list of available properties and methods here: Regression Results
Let's use the variable assignment modelfit from your code. To find the mean squared error of the residuals, use the mse_resid attribute of the results object, listed in the link above. To find the RMSE (root mean squared error) of the residuals, take the square root of the mean squared error using NumPy's square-root function, np.sqrt.
Thus the Root Mean Squared Error of the Residuals can be found using this code:
import numpy as np

rmse_residuals = np.sqrt(modelfit.mse_resid)
You could try something like this:
from statsmodels.tools.eval_measures import rmse
# Use the raw column names; the fitted formula applies the C(Pets) coding automatically
X = dfME[["Race", "Age", "Weight", "Pets"]]
rmse_result = rmse(dfME["Life Expectancy"], modelfit.predict(X))
To get the standard deviation of life expectancy, you can simply use:
stdev = dfME["Life Expectancy"].std()
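As a self-contained illustration on made-up data (the original spreadsheet isn't available, so the columns here are hypothetical), this sketch computes both quantities. Note that they differ slightly by construction: mse_resid divides by the residual degrees of freedom, while rmse() divides by the number of observations.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse

# Hypothetical data standing in for the life-expectancy dataframe
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=100)})
demo["y"] = 2.0 * demo["x"] + rng.normal(size=100)

fit = smf.ols("y ~ x", data=demo).fit()

# Square root of the residual mean squared error (denominator: n - k)
print(np.sqrt(fit.mse_resid))
# rmse() compares actual vs. predicted values (denominator: n)
print(rmse(demo["y"], fit.predict(demo)))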

SARIMAX simulation of possible paths

I am trying to create a simulation of possible paths of a stochastic process that is not anchored to any particular point, e.g. fit a SARIMAX model to weather temperature data and then use the model to simulate the temperature.
Here I use the standard demonstration from statsmodels page as a simpler example:
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from datetime import datetime
import requests
from io import BytesIO
Fitting the model:
wpi1 = requests.get('https://www.stata-press.com/data/r12/wpi1.dta').content
data = pd.read_stata(BytesIO(wpi1))
data.index = data.t
# Set the frequency
data.index.freq="QS-OCT"
# Fit the model
mod = sm.tsa.statespace.SARIMAX(data['wpi'], trend='c', order=(1,1,1))
res = mod.fit(disp=False)
print(res.summary())
Creating simulation:
res.simulate(len(data), repetitions=10).plot();
Here is the history: [plot of the original wpi series]
Here is the simulation: [plot of ten simulated paths, spread far more widely than the history]
The simulated curves are so widely distributed and so far apart from each other that this cannot make sense; the initial historical process doesn't have nearly that much variance. What am I misunderstanding? How do I perform the simulation correctly?
When you don't pass an initial state, it uses the first predicted state to start the simulation along with its predicted covariance. Since there is no information available to make the first prediction, it uses a diffuse prior with a variance of 1,000,000. This is why you are getting the wide range in your time series. A simple solution is to pass your own initial state using the smoothed_state.
Taking your code above, but using
initial = res.smoothed_state[:, 0]
res.simulate(len(data),
             repetitions=10,
             initial_state=initial).plot()
I get a plot that looks like this: [plot of the simulated paths, now on the same scale as the history]
The first value is what really matters in this model, and is 30.6. You could add some randomness here directly by drawing the initial state from another (sensible) distribution. The default distribution is not sensible for simulation since it has a diffuse prior (it is, however, very sensible for estimation).
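For instance, here is a minimal sketch of that idea: perturb the first element of the smoothed initial state before simulating each path (the noise scale of 0.5 is an arbitrary illustrative choice, not something prescribed by the model):
rng = np.random.default_rng(0)
initial_mean = res.smoothed_state[:, 0]

sims = []
for i in range(10):
    init = initial_mean.copy()
    # Perturb only the first state element, the one that matters in this model
    init[0] += rng.normal(scale=0.5)
    sims.append(res.simulate(len(data), initial_state=init).rename(f"path {i}"))
pd.concat(sims, axis=1).plot()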
Other Notes
One other small note: You should not use trend="c" with d=1. You should instead use trend="t" when d=1 so that the model includes a drift. The model you estimate should be
mod = sm.tsa.statespace.SARIMAX(data["wpi"], trend="t", order=(1, 1, 1))
I used this model in the picture above to capture the positive trend in the data.

Simple question: I am not getting the output I expected (linear regression)

I am new to programming. Currently, I am learning machine learning from this video.
This is related to linear regression
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv('homeprices.csv')
reg = linear_model.LinearRegression()
Problem 1
reg.fit(df[['area']],df.price)
Expected output should be
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
My output:
LinearRegression()
Problem 2
reg.predict(3300)
It gives an error when I pass a plain number in parentheses, but when I use a 2D array "[[]]" it gives the correct output. I want to know why it does not give the output shown in the video when I use only parentheses.
Problem 1:
This is how fitted models are displayed in newer versions of sklearn (0.23 and later): only parameters that differ from their defaults are printed. The parameters themselves are unchanged; they are just not shown in the output.
You can use reg.get_params() to view the parameters.
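For example (the exact dictionary depends on your sklearn version; this is what 0.23 shows):
print(reg.get_params())
# {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}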
Problem 2:
Newer versions of scikit-learn require 2D inputs of shape (n_samples, n_features) for predict, and we can make 3300 two-dimensional with [[3300]]:
reg.predict( [[3300]] )
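A quick sketch of why the double brackets matter: predict expects an array of shape (n_samples, n_features), so a single value for a single feature must be wrapped twice. np.reshape achieves the same thing:
import numpy as np

x_new = np.array([[3300]])               # shape (1, 1): one sample, one feature
print(x_new.shape)                       # (1, 1)
print(np.array([3300]).reshape(1, -1))   # equivalent 2D input via reshape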
Problem 1:
The printed output depends on the parameters (which you might have changed earlier) and on your sklearn version, but you can easily set your desired parameters when initializing the linear regressor, like this:
reg = linear_model.LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Problem 2:
reg.predict(3300): it is not correct to pass a bare scalar to predict that way, and you can see that the instructor has also corrected his answer to reg.predict([3300]) in the description of the YouTube post.
Try this, but note that you still need to fit the model to your variables to get the desired output:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('homeprices.csv')
reg = LinearRegression()
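To finish that sketch (assuming homeprices.csv has area and price columns, as in the video):
reg.fit(df[['area']], df.price)  # fit on the single 'area' feature
print(reg.predict([[3300]]))     # predict the price for an area of 3300 (2D input)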

Set ANOVA contrast in Python OLS

Is there a way to define the contrasts for an ANOVA in Python using the ols(...).fit() function?
I am trying to translate the R contrast specification c("contr.sum", "contr.sum") to Python without success.
Results=ols('score ~ C(Var3) + C(Var1) + C(Var2)', data=Dataset).fit()
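One possible approach, offered as a sketch rather than a verified equivalent of the R code: patsy formulas (which statsmodels uses) accept a contrast specification inside C(), and Sum is patsy's sum-to-zero coding, analogous to R's contr.sum:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sum is patsy's sum-to-zero contrast, analogous to R's contr.sum
Results = ols('score ~ C(Var3, Sum) + C(Var1, Sum) + C(Var2, Sum)',
              data=Dataset).fit()
# A type III ANOVA is the usual companion to sum-to-zero contrasts
print(sm.stats.anova_lm(Results, typ=3))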

Find the sum of the residuals

I am doing a hands-on exercise on Poisson regression from Stats with Python on Fresco Play.
The problem statement is:
Load the R dataset Insurance from the MASS package.
Capture the data as a pandas dataframe.
Build a Poisson regression model with a log of an independent variable
Holders, and dependent variable Claims.
Fit the model with data, and find the sum of the residuals.
I am stuck on the last step, i.e. finding the sum of the residuals.
I used np.sum(model.resid), but the answer is not accepted.
Here is my code
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
INS_data = sm.datasets.get_rdataset('Insurance','MASS').data
model = smf.poisson('Claims ~ np.log(Holders)', INS_data).fit()
print(np.sum(model.resid))
I was running the code in Python 2, which gave the wrong answer; running it in Python 3 gave the correct answer. I don't know the reason, but the code works perfectly in Python 3.
For the residuals, you can use the basic definition of a residual, i.e. actual - predicted.
Here is the code snippet.
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf
Insurance = sm.datasets.get_rdataset('Insurance','MASS')
data = Insurance.data
data['Holders_'] = np.log(data['Holders'])
model = smf.poisson('Claims ~ Holders_',data).fit()
y_predicted = model.predict(data[['Holders_']])
residual = (data['Claims']-y_predicted)
print(sum(residual))
After much searching, I came to know that it expects the cumulative sum, so use
np.cumsum(model.resid)
This will pass in Fresco Play.
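For clarity, a tiny sketch of the difference between the two calls: np.sum collapses the residuals to one scalar, while np.cumsum returns a running total with one entry per observation (apparently what the grader checks):
import numpy as np

resid = np.array([1.0, -2.0, 0.5])
print(np.sum(resid))     # -0.5 : a single scalar
print(np.cumsum(resid))  # [ 1.  -1.  -0.5] : running totals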
