Now that I have figured out how to use OLS (Pandas/Statsmodels OLS predicting future values), I am trying to fit a nicer curve to my data... I assumed GLM would work similarly.
import statsmodels.api as sma
df1['intercept'] = 1
X = df1[['intercept', 'date_delta']]  # predictors (exog)
y = df1['monthly_data']               # response (endog)
smaresults_normal = sma.GLM(y, X, family=sma.families.Binomial()).fit()
This returns ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported. That was a known issue in 2010. I've also tried:
import statsmodels.api as sm
import statsmodels.formula.api as smf
glm_unsmoothed = smf.GLM('monthly_data ~ date_delta', df1, family=sm.families.Binomial() )
glm_unsmoothed.fit()
which raises the error 'builtin_function_or_method' object has no attribute 'equals'.
I want to graph the model as well as future values, as I was able to do with the OLS model:
# OLS model
df1['intercept'] = 1
X = df1[['intercept', 'date_delta']]
y = df1['monthly_data']
smresults_normal = sm.OLS(y, X).fit()
# future values
smresults_normal.predict(df_future12[['intercept', 'future_monthly']])
# model on in-sample data
import statsmodels.formula.api as smf
smresults_unsmoothed = smf.ols('monthly_data ~ date_delta', df1).fit()
df1['ols_preds_unsmoothed'] = smresults_unsmoothed.predict()
edit: I abandoned trying to use GLM and instead used OLS with a formula for a polynomial fit, which I think worked quite well (though getting future predictions apparently does not work the same as in my other OLS; someday I will hopefully write some code without endless fiddling!). Unfortunately my reputation is too low to post the nice pic! :(
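For reference, a polynomial fit via the formula interface might look like the sketch below; the quadratic term and the 12-period future frame are assumptions, not code from the original post:
import pandas as pd
import statsmodels.formula.api as smf

# Quadratic fit: I() lets the formula evaluate date_delta**2 as a regressor.
poly_fit = smf.ols('monthly_data ~ date_delta + I(date_delta ** 2)', df1).fit()
df1['poly_preds'] = poly_fit.predict()

# Future values: pass a frame containing the same column the formula uses
# (here 12 hypothetical deltas past the end of the sample).
last = df1['date_delta'].max()
future = pd.DataFrame({'date_delta': [last + i for i in range(1, 13)]})
future_preds = poly_fit.predict(future)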
I think I had the same issue. All you need is to make sure that your data frame doesn't contain rows where both cases and non-cases are equal to zero. Just before estimating the GLM, run:
data = data[(data.cases != 0) | (data.notcases != 0)]
Apparently R does it automatically.
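For context, the filter matters when the Binomial endog is supplied as two columns of counts (successes, failures). A minimal sketch, assuming a hypothetical predictor column named x:
import statsmodels.api as sm

# Drop rows where both counts are zero, then fit a Binomial GLM whose
# response is the (cases, notcases) pair of count columns.
data = data[(data.cases != 0) | (data.notcases != 0)]
exog = sm.add_constant(data[['x']])  # 'x' is a hypothetical predictor
model = sm.GLM(data[['cases', 'notcases']], exog,
               family=sm.families.Binomial()).fit()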
Related
I am fitting a simple ARIMAX(1,0,0) model with one dependent variable y and one independent variable x, using a time series of 49 observations.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_excel('/Users/gaetanlion/Google Drive/Python/Arima/df.xlsx', sheet_name = 'final')
from statsmodels.tsa.arima_model import ARIMA
endo = df['y']
exo = df['x']
# Fit an ARIMA(1,0,0)
model = ARIMA(endo, exo, order = (1,0,0)).fit()
When I run this simple model, I get the following error:
TypeError: __new__() got multiple values for argument 'order'
Ok, I was able to resolve this coding issue, though I am not sure it is the best way to resolve it. The legacy statsmodels.tsa.arima_model.ARIMA takes order as its second positional argument, so passing exo positionally collides with the order keyword; the newer sm.tsa.arima.ARIMA accepts exog as its second argument instead.
model = sm.tsa.arima.ARIMA(endo, exo, order=(1, 0, 0)).fit()  # This works
model = ARIMA(endo, exo, order=(1, 0, 0)).fit()               # This does not work
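If you must stay on the legacy class, passing exog by keyword should also avoid the collision. A sketch, assuming statsmodels.tsa.arima_model is still present in your installed version:
from statsmodels.tsa.arima_model import ARIMA

# In the legacy API, order is the second positional argument,
# so exog has to be passed by keyword to avoid the clash.
model = ARIMA(endo, order=(1, 0, 0), exog=exo).fit()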
I wrote the following code in Python, using the statsmodels package, to create an OLS regression model. I tried the code with different datasets and got a model with all the coefficient values near zero except the first (intercept) coefficient. What could possibly be wrong with the code?
import pandas
import statsmodels.api as sm
from statsmodels.tsa.tsatools import lagmat2ds  # source of lagmat2ds

data1 = pandas.concat([Y, X], axis=1)
dta = lagmat2ds(data1, mxlg, trim='both', dropex=1)
dtaown = sm.add_constant(dta[:, 0:(mxlg + 1)], prepend=False)
dtajoint = sm.add_constant(dta[:, 0:], prepend=False)
res2down = sm.OLS(dta[:, 0], dtaown).fit()
res2djoint = sm.OLS(dta[:, 0], dtajoint).fit()
Here sm is statsmodels.api (imported above), and for sample testing you can consider the dataset sm.datasets.spector.
The way your data is structured, you are modeling Y against [Y, lags of Y, constant]. Note that the OLS documentation (https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html) states that:
No constant is added by the model unless you are using formulas.
So the first value that you see is not the intercept but the coefficient from fitting Y against Y itself, which will be 1.0.
To check that you are getting sensible results, you can exclude Y from the predictors like this:
res2down = sm.OLS(dta[:, 0], dtaown[:, 1:]).fit()
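A minimal reproduction with made-up data (not from the question) shows why the leading coefficient pins to 1.0 when Y is left among its own predictors:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

# Including y itself among the predictors reproduces the symptom:
# its coefficient is 1.0 and everything else collapses to ~0.
bad = sm.add_constant(np.column_stack([y, x]), prepend=False)
print(sm.OLS(y, bad).fit().params)   # approx [1, 0, 0]

# Excluding y gives sensible estimates.
good = sm.add_constant(x, prepend=False)
print(sm.OLS(y, good).fit().params)  # approx [2, 0]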
Basically, I am trying to run a regression on a dataframe without an intercept, so I set fit_intercept to False, yet the following code yields parameters that include an intercept. Does anyone have an idea why this might be the case?
model2 = smf.ols('Y ~ X', data=df_final)
result2 = model2.fit(cov_type='HAC', cov_kwds={'maxlags': 5}, fit_intercept=False)
result2.params
Intercept 0.032649
X 0.014521
dtype: float64
When running an OLS model using a formula, an intercept is added by default. One way to omit the intercept term is to add a -1 to the formula:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'X': np.random.randint(0, 100, size=20),
                   'Y': np.random.randint(0, 100, size=20)})
model = smf.ols('Y ~ X - 1', data=df)
result = model.fit()
The fitted model now contains only a single parameter (for X):
X 0.691876
dtype: float64
If you're not using the formula API, then the OLS model doesn't include an intercept, so you don't need to worry about it (in that case, you need to explicitly add one to your data if you want it), as in the sketch below.
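A minimal sketch of both options with the array-style API (made-up data, not the asker's df_final):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=20).astype(float)
y = 0.7 * X + rng.normal(size=20)

# Without a constant column there is simply no intercept to estimate.
no_intercept = sm.OLS(y, X).fit()
print(no_intercept.params)            # one coefficient

# To get an intercept, explicitly prepend a column of ones.
with_intercept = sm.OLS(y, sm.add_constant(X)).fit()
print(with_intercept.params)          # const + coefficient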
I'm not sure where you got the fit_intercept parameter from, as I can't find any reference to it in the statsmodels documentation or source code. Maybe you're thinking of linear regression in scikit-learn, which does use a parameter to control the intercept.
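If the scikit-learn parameter is what you had in mind, it lives on the estimator's constructor rather than on fit. A sketch using sklearn's LinearRegression:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# fit_intercept is a constructor argument in scikit-learn,
# not something passed to .fit() as attempted in the question.
model = LinearRegression(fit_intercept=False).fit(X, y)
print(model.coef_, model.intercept_)  # intercept_ is 0.0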
I'm trying to model some simulated data using the DPGMM classifier from scikit-learn, but I'm getting poor performance. Here is the example I'm using:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
clf = mixture.DPGMM(n_components=5, init_params='wc')
s = 0.1
a = np.random.normal(loc=1, scale=s, size=(1000,))
b = np.random.normal(loc=2, scale=s, size=(1000,))
c = np.random.normal(loc=3, scale=s, size=(1000,))
d = np.random.normal(loc=4, scale=s, size=(1000,))
e = np.random.normal(loc=7, scale=s*2, size=(5000,))
noise = np.random.random(500)*8
data = np.hstack([a,b,c,d,e,noise]).reshape((-1,1))
clf.means_ = np.array([1,2,3,4,7]).reshape((-1,1))
clf.fit(data)
labels = clf.predict(data)
plt.scatter(data.T, np.random.random(len(data)), c=labels, lw=0, alpha=0.2)
plt.show()
I would think that this would be exactly the kind of problem that Gaussian mixture models are made for. I've tried playing around with alpha, using GMM instead of DPGMM, changing the number of starting components, etc. I can't seem to get a reliable and accurate classification. Is there something I'm just missing? Is there another model that would be more appropriate?
Because you didn't iterate long enough for it to converge. Check the value of clf.converged_ and try increasing n_iter to 1000.
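A sketch of the suggested change, reusing the snippet from the question (n_iter and converged_ belong to the old sklearn DPGMM API this question targets):
# Same setup as above, but with far more iterations.
clf = mixture.DPGMM(n_components=5, init_params='wc', n_iter=1000)
clf.means_ = np.array([1, 2, 3, 4, 7]).reshape((-1, 1))
clf.fit(data)
print(clf.converged_)  # should now report True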
Note, however, that the DPGMM still fails miserably (IMHO) on this data set, eventually collapsing the number of clusters to just 2.
I've been trying to get a prediction for future values in a model I've created. I have tried OLS in both pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe, but the values are not the same. When I do the following using pandas, I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about the problems other people have had, but they do not seem to apply to my case.
edit: I believe 'endog' as defined is incorrect; I should be passing the values for which I want to predict. Therefore I've created a date range of 12 periods past the last recorded value. But I am still missing something, as I am getting the error:
matrices are not aligned
edit: here is a snippet of the data; the last column of numbers is the date delta, which is the difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. In your code, the fix would look something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add a - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
                                          'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept: there's nothing special encoded in the number 1; it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull its value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)
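A sketch of the same fit and future prediction using add_constant; note that the new data needs the constant added too (column names here follow the question):
import pandas as pd
import statsmodels.api as sm

# Refit with an explicit constant column instead of a manual intercept.
X = sm.add_constant(dframe[['date_delta']])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], X).fit()

# New data must carry the same constant column before predicting.
future = pd.DataFrame({'date_delta': [0.5, 0.75, 1.0]})
preds = smresults.predict(sm.add_constant(future))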