Pandas/Statsmodel OLS predicting future values - python

I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in Pandas) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodel-I realize I must not be using .predict properly and I've read the multiple other problems people have had but they do not seem to apply to my case.
edit I believe 'endog' as defined is incorrect-I should be passing the values for which I want to predict; therefore I've created a date range of 12 periods past the last recorded value. But still I miss something as I am getting the error:
matrices are not aligned
edit here is a snippet of data, the last column (in red) of numbers is the date delta which is a difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085

I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code would be something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodel formula api is much nicer to work with when dealing with DataFrames, and adds an intercept by default (add a - 1 to remove). See below, it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict() For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept - there's nothing encoded in the number 1 it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)

Related

Fixed effect regression model in Python

I have a book dataset. I want to make a fixed effect regression model.
I want to fixed effect of year, month, day and book_genre in my model, so in this case I will take out the effects of repetition of the same books in multiple observations. I want to use Python code for my fixed effect model. My variables are:
Variables that I want to fix them are: year, month, day and book_genre.
Other variables in the model are: Read_or_not: categorical variable, ne_factor, x1, x2, x3, x4, x5= numerical variables
Response variable: Y
I used this code but I get an error "DataFrame input must have a MultiIndex with 2 levels"
I highly appreciate it if you help me with how I can fix my code to make a fixed effect model regression.
I also attach a png of dataset to show the variables:
''''
import pandas as pd
from linearmodels import PanelOLS
import numpy as np
df = pd.read_csv('all_a.csv')
df
# Set the index for fixed effects
data = df.set_index(['year', 'month', 'day','book_genre'])
data = df.dropna(subset=['book_id','year','month','day','Read_or_not ' ,'ne_factor,','Y','book_genre','X1', 'X2','X3',"X4" ,"X5"])
# Regression
FE = PanelOLS(data.attention_data_score, data[ 'Y'],
entity_effects = True,
time_effects=True
)
# Result
result = FE.fit(cov_type = 'clustered',
cluster_entity=True,
cluster_time=True
)

Adjusted R square for each predictor variable in python

I have a pandas data frame that contains several columns. I need to perform a multivariate linear regression. Before doing that i would like to analyze the R,R2,adjusted R2 and p value of each independent variable with respect to the dependent variable.
For the R and R2 I have no problem, since i can calculate the R matrix and the select only the dependent variable and then see the R coefficient between it and all the independent variables. Then i can square these values to obtain the R2.
My problem is how to do the same with the adjusted R2 and the p value
At the end what i want to obtain is somenthing like that:
Variable R R2 ADJUSTED_R2 p_value
A 0.4193 0.1758 ...
B 0.2620 0.0686 ...
C 0.2535 0.0643 ...
All the values are with respect to the dependent variable let's say Y.
The following will not give you ALL the answers, but it WILL get you going using python, pandas and statsmodels for regression analyses.
Given a dataframe like this...
# Imports
import pandas as pd
import numpy as np
import itertools
# A datafrane with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
...you can get any regression results using the statsmodels library and altering the result = model.rsquared part in the snippet below:
x = df_1['x1']
x = sm.add_constant(x)
model = sm.OLS(df_1['y'], x).fit()
result = model.rsquared
print(result)
Now you have r-squared. Use model.pvalues for the p-value. And use dir(model)to have closer look at other model results (there is more in the output than what you can see below):
Now, this should get you going to obtain your desired results.
To get desired results for ALL combinations of variables / columns, the question and answer here should get you very far.
Edit: You can have a closer look at some common regression results using model.summary(). Using that together with dir(model) you can see that not ALL regression results are availabel the same way that pvalues are using model.pvalues. To get Durbin-Watson, for example, you'll have to use durbinwatson = sm.stats.stattools.durbin_watson(model.fittedvalues, axis=0).
This post has got more information on the issue.

Pandas Series: Log Normalize

I have a Pandas Series, that needs to be log-transformed to be normal distributed. But I can´t log transform yet, because there are values =0 and values below 1 (0-4000). Therefore I want to normalize the Series first. I heard of StandardScaler(scikit-learn), Z-score standardization and Min-Max scaling(normalization).
I want to cluster the data later, which would be the best method?
StandardScaler and Z-score standardization use mean, variance etc. Can I use them on "not yet normal distibuted" data?
To transform to logarithms, you need positive values, so translate your range of values (-1,1] to normalized (0,1] as follows
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-1,1,(10,1)))
df['norm'] = (1+df[0])/2 # (-1,1] -> (0,1]
df['lognorm'] = np.log(df['norm'])
results in a dataframe like
0 norm lognorm
0 0.360660 0.680330 -0.385177
1 0.973724 0.986862 -0.013225
2 0.329130 0.664565 -0.408622
3 0.604727 0.802364 -0.220193
4 0.416732 0.708366 -0.344795
5 0.085439 0.542719 -0.611163
6 -0.964246 0.017877 -4.024232
7 0.738281 0.869141 -0.140250
8 0.558220 0.779110 -0.249603
9 0.485144 0.742572 -0.297636
If your data is in the range (-1;+1) (assuming you lost the minus in your question) then log transform is probably not what you need. At least from a theoretical point of view, it's obviously the wrong thing to do.
Maybe your data has already been preprocessed (inadequately)? Can you get the raw data? Why do you think log transform will help?
If you don't care about what is the meaningful thing to do, you can call log1p, which is the same as log(1+x) and which will thus work on (-1;∞).

Beginner stats: Predict binary outcome of set of numbers given history (Logistic regression)

I apologize in advance for the simplicity of this question. I have no background in stats and am getting lost in the complexity of it all.
If I have a couple thousand numbers all with a binary outcome
number,outcome
14,0
27,1
88,1
04,0
42,1
How do I predict future numbers ? for example:
82
45
02
Or is this going to be inaccurate due to there only being a single variable ? All the examples I have seen use multiple variables.
I have been digging through statsmodels and went through this great tutorial: http://blog.yhathq.com/posts/logistic-regression-and-python.html. And through that I've made this:
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("binary.csv")
df.columns = ["number", "outcome"]
data = df[['number', 'outcome']]
train_cols = data.columns[0]
logit = sm.Logit(data['outcome'], data[train_cols])
result = logit.fit()
print result.summary()
But that seems to be analyzing the weight of current numbers, how would you predict new ones ? Am I even going about this the right way ?
The result of the fit should have a method predict(). That is what you need to use to predict future values, for example:
result = sm.Logit(outcomes, values).fit()
result.predict([82,45,2])

GLM in statsmodel returning error

Now that I have figured out how to use OLS ( Pandas/Statsmodel OLS predicting future values ), I am trying to fit a nicer curve to my data...GLM should work similarly I assumed.
import statsmodels.api as sma
df1['intercept'] = 1
y = df1[['intercept', 'date_delta']]
X = df1['monthly_data']
smaresults_normal = sma.GLM(X,y, family=sma.families.Binomial()).fit()
returns ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported. which was a known issue in 2010. I've also tried:
import statsmodels.api as sm
import statsmodels.formula.api as smf
glm_unsmoothed = smf.GLM('monthly_data ~ date_delta', df1, family=sm.families.Binomial() )
glm_unsmoothed.fit()
which raises the error'builtin_function_or_method' object has no attribute 'equals'
I want to graph the model as well as future values as I was able to do with the ols model:
#ols model
df1['intercept'] = 1
X = df1[['intercept', 'date_delta']]
y = df1['monthly_data']
smresults_normal = sm.OLS(y, X).fit()
#future values
smresults_normal.predict(df_future12[['intercept', 'future_monthly']])
#model in sample data
import statsmodels.formula.api as smf
smresults_unsmoothed = smf.ols('monthly_data ~ date_delta', df1).fit()
df1['ols_preds_unsmoothed'] = smresults_unsmoothed.predict()
edit I abandoned trying to use GLM and instead used OLS with a formula for a polynomial fit which I think worked quite well...(though getting future predictions apparently does not work the same as in my other OLS, someday I will hopefully write some code without endless fiddling!)!unfortunately my reputation is too low to post the nice pic! :(
I think I had the same issue, all you need is to secure that your data frame doesn't contain lines where both cases and not cases are equal to zero. Just before estimating the glm, run:
data=data[(data.cases !=0) | (data.notcases!=0)]
Apparently R does it automatically.

Categories

Resources