I have a pandas DataFrame that contains several columns, and I need to perform a multivariate linear regression. Before doing that, I would like to analyze the R, R2, adjusted R2 and p-value of each independent variable with respect to the dependent variable.
For R and R2 I have no problem: I can calculate the correlation matrix, select only the dependent variable, and read off the R coefficient between it and each independent variable. Squaring these values gives R2.
My problem is how to do the same for the adjusted R2 and the p-value.
In the end, what I want to obtain is something like this:
Variable  R       R2      ADJUSTED_R2  p_value
A         0.4193  0.1758  ...
B         0.2620  0.0686  ...
C         0.2535  0.0643  ...
All the values are with respect to the dependent variable, say Y.
The following will not give you ALL the answers, but it WILL get you going using python, pandas and statsmodels for regression analyses.
Given a dataframe like this...
# Imports
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm
# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
...you can get any regression results using the statsmodels library and altering the result = model.rsquared part in the snippet below:
x = df_1['x1']
x = sm.add_constant(x)
model = sm.OLS(df_1['y'], x).fit()
result = model.rsquared
print(result)
Now you have r-squared. Use model.pvalues for the p-values and model.rsquared_adj for the adjusted r-squared. Use dir(model) to have a closer look at the other model results.
Now, this should get you going to obtain your desired results.
To get desired results for ALL combinations of variables / columns, the question and answer here should get you very far.
Edit: You can have a closer look at some common regression results using model.summary(). Using that together with dir(model), you can see that not ALL regression results are available the same way that p-values are through model.pvalues. To get Durbin-Watson, for example, you'll have to compute it from the residuals with durbinwatson = sm.stats.stattools.durbin_watson(model.resid, axis=0).
This post has got more information on the issue.
Related
I have the following code which is trying to predict a y variable, in this case 'distance', based on multiple predictor variables, which are stored in newdf[cols].
However, when I run the code, I get the outcome: 'Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)'.
Am I specifying the smf.ols() command in the wrong way?
I would be so grateful for a helping hand.
import statsmodels.api as sm
cols = newdf.drop(['distance', 'duration','short_id'],axis=1)
X = cols
Y = newdf['distance']
X = sm.add_constant(X)
resultmodel = sm.OLS(Y,X).fit()
print(resultmodel.summary())
The first 20 rows of X are:
The first 20 rows of Y are:
For the formula api you have to enter a formula as a string as the first argument. If you just want to enter X and y and use all columns of X, you can use the non-formula API. Basically just replace smf.ols with sm.OLS in your code.
I have a book dataset. I want to make a fixed effect regression model.
I want to include fixed effects for year, month, day and book_genre in my model, so that I take out the effect of the same books being repeated across multiple observations. I want to implement the fixed effect model in Python. My variables are:
Variables that I want to fix: year, month, day and book_genre.
Other variables in the model: Read_or_not (categorical), and ne_factor, x1, x2, x3, x4, x5 (numerical).
Response variable: Y
I used the code below, but I get the error "DataFrame input must have a MultiIndex with 2 levels".
I would highly appreciate help with fixing my code so that it fits a fixed effect regression model.
I also attach a png of dataset to show the variables:
import pandas as pd
from linearmodels import PanelOLS
import numpy as np
df = pd.read_csv('all_a.csv')
df
# Set the index for fixed effects
data = df.set_index(['year', 'month', 'day','book_genre'])
data = df.dropna(subset=['book_id','year','month','day','Read_or_not ' ,'ne_factor,','Y','book_genre','X1', 'X2','X3',"X4" ,"X5"])
# Regression
FE = PanelOLS(data.attention_data_score, data[ 'Y'],
entity_effects = True,
time_effects=True
)
# Result
result = FE.fit(cov_type = 'clustered',
cluster_entity=True,
cluster_time=True
)
I have a data frame with input and output columns. They have a linear relation. So, I want to remove data that does not fit this relation. My actual df is big and has many samples. Here, I am giving an example.
My code:
xdf = pd.DataFrame({'ip':[10,20,30,40],'op':[105,195,500,410]})
I am not getting any idea on how to proceed.
You can do a linear fit first then filter out the data that is outside of a certain threshold.
Sample code below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ip':[10,20,30,40],'op':[105,195,500,410]})
# do a linear fit on ip and op
f = np.polyfit(df.ip,df.op,1)
fl = np.poly1d(f)
# you will have to determine this threshold in some way
threshold = 100
output = df[(df.op - fl(df.ip)).abs()<threshold]
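If you would rather not pick the threshold by hand, one possible approach (an assumption here, not part of the answer above) is to derive it from the residuals themselves, for example with the median absolute deviation (MAD). The cut-off k = 3 below is a common but arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})

# Linear fit and residuals, as in the answer above
fit = np.poly1d(np.polyfit(df['ip'], df['op'], 1))
resid = df['op'] - fit(df['ip'])

# Robust spread estimate: MAD scaled so it is comparable to a normal sigma
centered = (resid - resid.median()).abs()
robust_sigma = 1.4826 * centered.median()

# k = 3 is an assumed cut-off, not a universal rule
k = 3
output = df[centered < k * robust_sigma]
print(output)
```

On the four sample rows this drops the (30, 500) point and keeps the rest. Keep in mind that an ordinary least-squares fit is itself pulled towards outliers, so on real data it may be worth refitting after the first filtering pass.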
Another way:
You can create a boolean mask that checks whether the ratio op/ip is less than its mean value:
m=xdf.eval("op/ip").lt(xdf.eval("op/ip").mean())
Finally:
import matplotlib.pyplot as plt
out=xdf[m]
plt.scatter(x=out['ip'],y=out['op'])
I am having an issue with this function. I want to perform a cross-sectional regression on 25 portfolios ranked on value and size. I have 7 independent variables on the right side of the equation.
import pandas as pd
import numpy as np
from linearmodels import FamaMacBeth
#creating a multi_index of independent variables
ind_var = pd.read_excel('FAMA_MACBETH.xlsx')
ind_var['date'] = pd.to_datetime(ind_var['date'])
# dropping our dependent variables
ind_var = ind_var.drop(['Mkt_rf', 'div_innovations', 'term_innovations',
'def_innovations', 'rf_innovations', 'hml_innovations',
'smb_innovations'],axis = 1)
ind_var = pd.DataFrame(ind_var.set_index('date').stack())
ind_var.columns = ['x']
x = np.asarray(ind_var)
len(x)
11600
#creating a multi_index of dependent variables
# reading in our data
dep_var = pd.read_excel('FAMA_MACBETH.xlsx')
dep_var['date'] = pd.to_datetime(dep_var['date'])
# dropping our independent variables
dep_var = dep_var.drop(['SMALL_LoBM', 'ME1_BM2', 'ME1_BM3', 'ME1_BM4',
'SMALL_HiBM', 'ME2_BM1', 'ME2_BM2', 'ME2_BM3', 'ME2_BM4', 'ME2_BM5',
'ME3_BM1', 'ME3_BM2', 'ME3_BM3', 'ME3_BM4', 'ME3_BM5', 'ME4_BM1',
'ME4_BM2', 'ME4_BM3', 'ME4_BM4', 'ME4_BM5', 'BIG_LoBM', 'ME5_BM2',
'ME5_BM3', 'ME5_BM4', 'BIG_HiBM'],axis = 1)
dep_var = pd.DataFrame(dep_var.set_index('date').stack())
dep_var.columns = ['y']
y = np.asarray(dep_var)
len(y)
3248
mod = FamaMacBeth(y, x)
res = mod.fit(cov_type='kernel', kernel='Parzen')
output with tstats and errors ideally
I have tried numerous methods of getting this to work. I am really thinking of using SAS at this point. Really, I would prefer to get this running with pandas.
I expect a cross-sectional regression output with standard errors and t stats
I got it to work in one go. See this site and run the lines of code for OLS below: "Here the difference is presented using the canonical Grunfeld data on investment."
(Note that this line is important: etdata = data.set_index(['firm','year']), else Python won't know the correct dimensions to run F&McB on.)
Then run:
from linearmodels import FamaMacBeth
FamaMacBeth(etdata.invest,etdata[['value','capital']]).fit()
Note: I updated linearmodels to the latest version, which gave me access to the data.
I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about the many other problems people have had, but they do not seem to apply to my case.
Edit: I believe 'endog' as defined is incorrect. I should be passing the values for which I want to predict, so I've created a date range of 12 periods past the last recorded value. But I am still missing something, as I am getting the error:
matrices are not aligned
edit here is a snippet of data, the last column (in red) of numbers is the date delta which is a difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code would be something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add a - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept: there's nothing encoded in the number 1. It's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)
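The point about the intercept being "a regressor that always equals 1" can be checked with plain NumPy least squares. This is a minimal sketch on made-up data lying exactly on y = 2 + 3x; it is independent of statsmodels and only meant to show the mechanics.

```python
import numpy as np

# Toy data lying exactly on y = 2 + 3x (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Without a column of ones the fit is forced through the origin
slope_only, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

# With a column of ones, its coefficient is the intercept
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicting new values needs the same two columns, in the same order
x_new = np.array([4.0, 5.0])
X_new = np.column_stack([np.ones_like(x_new), x_new])
pred = X_new @ beta

print(slope_only, beta, pred)
```

The same rule applies when calling .predict() on a model fitted with an explicit constant: the new data must carry the constant column too. A mismatch in columns is a plausible cause of a "matrices are not aligned" error, although that is a guess about the question's case.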