I am trying to set up a hierarchical linear regression model using PyMC3. In my particular case, I want to see whether postal codes provide meaningful structure for the other features. Suppose I use the following mock data:
import pandas as pd
import numpy as np
import pymc3 as pm
data = pd.DataFrame({"postalcode": np.floor(np.random.uniform(low=10, high=99, size=1000)),
"x": np.random.normal(size=1000),
"y": np.random.normal(size=1000)})
data["postalcode"] = data["postalcode"].astype(int)
I generate postal codes from 10 to 99, as well as a normally distributed feature x and a target value y. Now I set up my indices for postal code level 1 and level 2:
def create_pc_index(level):
pc = data["postalcode"].astype(str).str[0:level]
unique_pc = pc.unique()
pc_dict = dict(zip(unique_pc, range(0, len(unique_pc))))
return pc_dict, pc.apply(lambda x: pc_dict[x]).values
pc1_dict, pc1_index = create_pc_index(1)
pc2_dict, pc2_index = create_pc_index(2)
Using the first digit of the postal code as the hierarchical attribute works fine:
number_of_samples = 1000
x = data["x"]
y = data["y"]
with pm.Model() as model:
    sigma = pm.HalfCauchy('sigma', beta=10, testval=0.5, shape=1)
    mu_i = pm.Normal("mu_i", 5, sd=25, shape=1)
    intercept = pm.Normal('Intercept', mu_i, sd=1, shape=len(pc1_dict))
    mu_s = pm.Normal("mu_x", 0, sd=3, shape=1)
    x_coeffs = pm.Normal("x", mu_s, 1, shape=len(pc1_dict))
    mean = intercept[pc1_index] + x_coeffs[pc1_index] * x
    likelihood_mean = pm.Deterministic("mean", mean)
    likelihood = pm.Normal('y', mu=likelihood_mean, sd=sigma, observed=y)
    trace = pm.sample(number_of_samples)
burned_trace = trace[number_of_samples // 2:]
However, if I want to add a second level to my hierarchy (in this case only on the intercept, ignoring x for the moment), I run into shape problems:
with pm.Model() as model:
    sigma = pm.HalfCauchy('sigma', beta=10, testval=0.5, shape=1)
    mu_i_level_1 = pm.Normal("mu_i", 0, sd=25, shape=1)
    mu_i_level_2 = pm.Normal("mu_i_level_2", mu_i_level_1, sd=1, shape=len(pc1_dict))
    intercept = pm.Normal('Intercept', mu_i_level_2[pc1_index], sd=1, shape=len(pc2_dict))
    mu_s = pm.Normal("mu_x", 0, sd=3, shape=1)
    x_coeffs = pm.Normal("x", mu_s, 1, shape=len(pc1_dict))
    mean = intercept[pc2_index] + x_coeffs[pc1_index] * x
    likelihood_mean = pm.Deterministic("mean", mean)
    likelihood = pm.Normal('y', mu=likelihood_mean, sd=sigma, observed=y)
    trace = pm.sample(number_of_samples)
burned_trace = trace[number_of_samples // 2:]
The error message is:
operands could not be broadcast together with shapes (89,) (1000,)
How do I model multiple levels in my regression correctly? Is this just an issue with the correct shape size or is there a more fundamental error on my part?
Thanks in advance!
I don't think intercept can have shape len(pc2_dict) while its mu, mu_i_level_2[pc1_index], has one entry per observation; that indexing yields a length-1000 vector, hence the (89,) vs (1000,) broadcast error. The contradiction is here:
intercept = pm.Normal('Intercept', mu_i_level_2[pc1_index], sd=1, shape=len(pc2_dict))
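One way to resolve it (a sketch, not tested against your data, reusing pc1_dict, pc2_dict, pc1_index and pc2_index from the question) is to give every two-digit prefix its own intercept and centre it on its parent one-digit group, by mapping each level-2 group to its level-1 parent so that mu has length len(pc2_dict):

# Map each two-digit prefix to the index of its parent one-digit prefix
pc2_to_pc1 = np.array([pc1_dict[code[0]]
                       for code, idx in sorted(pc2_dict.items(), key=lambda kv: kv[1])])

with pm.Model() as model:
    sigma = pm.HalfCauchy('sigma', beta=10, testval=0.5, shape=1)
    mu_i_level_1 = pm.Normal("mu_i", 0, sd=25, shape=1)
    mu_i_level_2 = pm.Normal("mu_i_level_2", mu_i_level_1, sd=1, shape=len(pc1_dict))
    # One intercept per two-digit prefix, centred on its parent group's mean;
    # mu_i_level_2[pc2_to_pc1] now has length len(pc2_dict), matching shape.
    intercept = pm.Normal('Intercept', mu_i_level_2[pc2_to_pc1], sd=1, shape=len(pc2_dict))
    mu_s = pm.Normal("mu_x", 0, sd=3, shape=1)
    x_coeffs = pm.Normal("x", mu_s, 1, shape=len(pc1_dict))
    mean = intercept[pc2_index] + x_coeffs[pc1_index] * x
    likelihood = pm.Normal('y', mu=mean, sd=sigma, observed=y)
    trace = pm.sample(number_of_samples)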
I want to use a hurdle model from statsmodels (https://www.statsmodels.org/dev/examples/notebooks/generated/count_hurdle.html).
Unfortunately, when following the manual, I am running into several problems:
I am only able to load the discrete_model and count_model from discrete:
Hence, the following imports work:
import statsmodels.discrete.discrete_model
import statsmodels.discrete.count_model
However, there is no module named 'statsmodels.discrete.truncated_model'
Furthermore, I am not able to run get_diagnostic() with a Poisson regression. I am running the same code as on the statsmodels website (see below), but it throws the following error:
'PoissonResults' object has no attribute 'get_diagnostic'
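Since the linked notebook lives under the statsmodels dev documentation, my guess is that both statsmodels.discrete.truncated_model and PoissonResults.get_diagnostic may simply be newer than the release I have installed, so one thing worth checking first is the version:

import statsmodels
print(statsmodels.__version__)
# If this is an older release, upgrading may make both the truncated/hurdle
# models and get_diagnostic available:
#   pip install --upgrade statsmodels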
Thanks for your help!
# Imports (not shown in the notebook snippet below)
import numpy as np
from statsmodels.discrete.discrete_model import Poisson

np.random.seed(987456348)
# large sample to get strong results
nobs = 5000
x = np.column_stack((np.ones(nobs), np.linspace(0, 1, nobs)))
mu0 = np.exp(0.5 *2 * x.sum(1))
y = np.random.poisson(mu0, size=nobs)
print(np.bincount(y))
y_ = y
indices = np.arange(len(y))
mask = mask0 = y > 0
for _ in range(10):
    print(mask.sum())
    indices = mask  # indices[mask]
    if not np.any(mask):
        break
    mu_ = np.exp(0.5 * x[indices].sum(1))
    y[indices] = y_ = np.random.poisson(mu_, size=len(mu_))
    np.place(y, mask, y_)
    mask = np.logical_and(mask0, y == 0)
np.bincount(y)
mod_p = Poisson(y, x)
res_p = mod_p.fit()
print(res_p.summary())
dia_p = res_p.get_diagnostic()
dia_p.plot_probs();
I am lost in the Python pymcmcstat documentation. I managed to plot the parameter distributions etc., but for the Bayes factor I need to calculate the integral of the likelihood over the parameter space for each model.
I followed this video. Each model has a different model function with different parameters. According to this link, I am supposed to compare the model evidences for model selection. All I have is the chain results after burn-in: the distribution for each parameter, the chain of sum-of-squares errors (SSE), and the error variances. How do I compare the models with the MCMC chain results I have?
Where do I go from here?
Here is my code for one model; for each model, test_modelfun is changed and the chain results are saved for further comparison of the different models:
# Imports (not shown in the original snippet)
import math, cmath
import numpy as np
from numpy import cos, sin, pi, real, imag, log10
from pymcmcstat.MCMC import MCMC

# Data related lines: input omega and output fm
x = (np.array([76.29395, 152.5879, 305.1758, 610.3516, 1220.703, 2441.406, 4882.813, 9765.625, 19531.25, 39062.5, 78125, 156250, 312500, 625000]))
y = np.array([155.6412886 -63.3826188j , 113.9114436 -79.90544719j, 64.97809441-77.65152741j, 26.87482243-57.38474656j, 7.44462341-34.02438426j, 2.32954856-16.17918216j, 2.30747953 -6.72487436j, 3.39658859 -2.72444011j, 4.0084345 -1.2029167j , 4.25877486 -0.70276446j, 4.11761329 -0.69591231j, 3.83339489 -0.65244854j, 3.47289164 -0.6079278j , 3.07027319 -0.14914359j])
omega, fm = x, y  # the model and sum-of-squares functions below refer to the data by these names
# import mcmc library and add data to the library in the second line below
mcstat = MCMC()
mcstat.data.add_data_set(x,y)
##define transfer function model calculated with theta parameters
def test_modelfun(xdata, theta):
    K, alpha_0, alpha_1, Tp_1, Tp_2, Tz_1 = 10**theta[0], 10**theta[1], 10**theta[2], 10**theta[3], 10**theta[4], 10**theta[5]
    #####################
    Pz_0 = (omega**(alpha_0))
    Pz_1 = (np.sqrt(((Tp_1**2)*(omega**(2*alpha_1))) + (2*Tp_1*(omega**alpha_1)*cos(alpha_1*pi/2)) + 1))
    Pz_2 = (np.sqrt(((Tp_2**2)*(omega**(2*alpha_1))) + (2*Tp_2*(omega**alpha_1)*cos(alpha_1*pi/2)) + 1))
    Zz_1 = (np.sqrt(((Tz_1**2)*(omega**(2*alpha_1))) + (2*Tz_1*(omega**alpha_1)*cos(alpha_1*pi/2)) + 1))
    Pp_0 = np.array([(-1*pi*alpha_0)/2]*len(omega)).T  #[0]
    Pp_1 = np.array([math.atan((Tp_1*(omega[i]**alpha_1)*sin(pi*alpha_1/2))/(1+(Tp_1*(omega[i]**alpha_1)*cos(pi*alpha_1/2)))) for i in range(len(omega))])
    Pp_2 = np.array([math.atan((Tp_2*(omega[i]**alpha_1)*sin(pi*alpha_1/2))/(1+(Tp_2*(omega[i]**alpha_1)*cos(pi*alpha_1/2)))) for i in range(len(omega))])
    Zp_1 = np.array([math.atan((Tz_1*(omega[i]**alpha_1)*sin(pi*alpha_1/2))/(1+(Tz_1*(omega[i]**alpha_1)*cos(pi*alpha_1/2)))) for i in range(len(omega))])
    #####################
    Z_est = (K*Zz_1)/(Pz_0*Pz_1*Pz_2)
    P_est = Zp_1 + Pp_0 - Pp_1 - Pp_2
    #####################
    R_est = np.real([cmath.rect(Z_est[i], P_est[i]) for i in range(len(omega))])  #abs()#[:,0]
    X_est = np.imag([cmath.rect(Z_est[i], P_est[i]) for i in range(len(omega))])  #abs()#[:,0]
    RX_est = (R_est + 1j*X_est)
    return RX_est
def modelfun(xdata, theta):
    ymodel = test_modelfun(xdata, theta)
    Zest = 20*log10(np.abs(ymodel))
    return Zest
##define sum of squares function for the error in evaluating the likelihood function L(Fobs(i)|q)
def test_ssfun(theta, data):
    xdata = data.xdata[0]
    ydata = data.ydata[0]
    ymodel = test_modelfun(xdata, theta)
    return (1/len(omega))*(sum((real(fm) - real(ymodel))**2 + (imag(fm) - imag(ymodel))**2))
    # sumsquares = sum((ymodel[:,0]-ydata[:,0])**2)
##import mcmc library and add data to the library in the second line below
itr = 50.0e4
verb = 1
wbar = 1
mcstat = MCMC()
mcstat.data.add_data_set(x,y)
## add model parameters
mcstat.parameters.add_model_parameter(name='th_1',theta0=1, minimum=-2,maximum=3) #m_k, M_k = -2, 3
mcstat.parameters.add_model_parameter(name='th_2',theta0=-1, minimum=-4,maximum=0) #m_a0, M_a0 = -4, 0
mcstat.parameters.add_model_parameter(name='th_3',theta0=-1, minimum=-3,maximum=0) #m_a1, M_a1 = -3, 0
mcstat.parameters.add_model_parameter(name='th_4',theta0=-4, minimum=-9,maximum=0) #m_p1, M_p1 = -9, 0
mcstat.parameters.add_model_parameter(name='th_5',theta0=-4, minimum=-9,maximum=0) #m_p2, M_p2 = -9, 0
mcstat.parameters.add_model_parameter(name='th_6',theta0=-4, minimum=-9,maximum=0) #m_z1, M_z1 = -9, 0
## define simulation options: mh=metropolis-hastings, am=adaptive metropolis, dr=delayed rejection, dram=dr+am
mcstat.simulation_options.define_simulation_options(nsimu=int(itr), updatesigma=1, method='dr', adaptint=100, verbosity=verb, waitbar=wbar)
## define model settings
mcstat.model_settings.define_model_settings(sos_function=test_ssfun)
mcstat.run_simulation()
## extract results
results=mcstat.simulation_results.results
chain = results['chain']      # chain for each parameter sampled during the simulation
s2chain = results['s2chain']  # chain for the error variances; if updatesigma=0 then s2chain is an empty list
sschain = results['sschain']  # chain for the sum-of-squares error calculated using each set of parameter values in the chain
names = results['names']
burnin = int(itr/2)
## display chain statistics
mcstat.chainstats(chain[burnin:,:],results)
mcpl = mcstat.mcmcplot
figcp = mcpl.plot_chain_panel(chain, names, figsizeinches = (7,6))
axes = figcp.get_axes()
for ii, ax in enumerate(axes):
    ch = chain[:, ii]
    ax.plot([burnin, burnin], [ch.min(), ch.max()], 'r')
figpd = mcpl.plot_density_panel(chain[burnin:,:], names, figsizeinches=(7,6))
figpc = mcpl.plot_pairwise_correlation_panel(chain[burnin:,:], names, figsizeinches = (7,6))
mcstat.PI.setup_prediction_interval_calculation(results=results, data=mcstat.data, modelfunction=modelfun, burnin=burnin)
mcstat.PI.generate_prediction_intervals(calc_pred_int=True, waitbar=False)
fg, ax = mcstat.PI.plot_prediction_intervals(adddata=True, plot_pred_int=True, figsizeinches = (7,5), data_display=dict(color='k'))
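One crude option I am considering (a sketch under strong assumptions, not a pymcmcstat feature): since updatesigma=1 and test_ssfun returns the mean squared error over the len(omega) data points, the Gaussian log-likelihood of every posterior sample can be rebuilt from sschain and s2chain, and a harmonic-mean estimate of the log model evidence can then be computed per model and compared. The harmonic-mean estimator is known to be unstable, so this would only be a starting point:

from scipy.special import logsumexp

n = len(x)                      # number of frequency points
ss = sschain[burnin:, 0]        # mean squared error per posterior sample (see test_ssfun)
s2 = s2chain[burnin:, 0]        # error variance per posterior sample

# Gaussian log-likelihood of each posterior sample; SSE = n * mean squared error
loglik = -0.5 * n * np.log(2 * np.pi * s2) - (n * ss) / (2 * s2)

# Harmonic-mean estimate of the log evidence: log p(y) ~ log N - logsumexp(-loglik)
log_evidence = np.log(len(loglik)) - logsumexp(-loglik)
print(log_evidence)   # repeat per model and compare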
I'm looking for a way to generalize regression using pykalman from 1 to N regressors. We will not bother with online regression initially; I just want a toy example that sets up the Kalman filter for 2 regressors instead of 1, i.e. Y = c1 * x1 + c2 * x2 + const.
For the single regressor case, the following code works. My question is how to change the filter setup so it works for two regressors:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pykalman import KalmanFilter
if __name__ == "__main__":
    file_name = '<path>\KalmanExample.txt'
    df = pd.read_csv(file_name, index_col=0)
    prices = df[['ETF', 'ASSET_1']]  # , 'ASSET_2']]

    delta = 1e-5
    trans_cov = delta / (1 - delta) * np.eye(2)
    obs_mat = np.vstack([prices['ETF'],
                         np.ones(prices['ETF'].shape)]).T[:, np.newaxis]

    kf = KalmanFilter(
        n_dim_obs=1,
        n_dim_state=2,
        initial_state_mean=np.zeros(2),
        initial_state_covariance=np.ones((2, 2)),
        transition_matrices=np.eye(2),
        observation_matrices=obs_mat,
        observation_covariance=1.0,
        transition_covariance=trans_cov
    )

    state_means, state_covs = kf.filter(prices['ASSET_1'].values)

    # Draw slope and intercept...
    pd.DataFrame(
        dict(
            slope=state_means[:, 0],
            intercept=state_means[:, 1]
        ), index=prices.index
    ).plot(subplots=True)
    plt.show()
The example file KalmanExample.txt contains the following data:
Date,ETF,ASSET_1,ASSET_2
2007-01-02,176.5,136.5,141.0
2007-01-03,169.5,115.5,143.25
2007-01-04,160.5,111.75,143.5
2007-01-05,160.5,112.25,143.25
2007-01-08,161.0,112.0,142.5
2007-01-09,155.5,110.5,141.25
2007-01-10,156.5,112.75,141.25
2007-01-11,162.0,118.5,142.75
2007-01-12,161.5,117.0,142.5
2007-01-15,160.0,118.75,146.75
2007-01-16,156.5,119.5,146.75
2007-01-17,155.0,120.5,145.75
2007-01-18,154.5,124.5,144.0
2007-01-19,155.5,126.0,142.75
2007-01-22,157.5,124.5,142.5
2007-01-23,161.5,124.25,141.75
2007-01-24,164.5,125.25,142.75
2007-01-25,164.0,126.5,143.0
2007-01-26,161.5,128.5,143.0
2007-01-29,161.5,128.5,140.0
2007-01-30,161.5,129.75,139.25
2007-01-31,161.5,131.5,137.5
2007-02-01,164.0,130.0,137.0
2007-02-02,156.5,132.0,128.75
2007-02-05,156.0,131.5,132.0
2007-02-06,159.0,131.25,130.25
2007-02-07,159.5,136.25,131.5
2007-02-08,153.5,136.0,129.5
2007-02-09,154.5,138.75,128.5
2007-02-12,151.0,136.75,126.0
2007-02-13,151.5,139.5,126.75
2007-02-14,155.0,169.0,129.75
2007-02-15,153.0,169.5,129.75
2007-02-16,149.75,166.5,128.0
2007-02-19,150.0,168.5,130.0
The single-regressor case produces the slope/intercept plot below (image not included here); for the two-regressor case I want a second "slope" plot representing c2.
Answer edited to reflect my revised understanding of the question.
If I understand correctly, you wish to model an observable output variable Y = ETF as a linear combination of two observable values, ASSET_1 and ASSET_2.
The coefficients of this regression are to be treated as the system states, i.e. ETF = x1*ASSET_1 + x2*ASSET_2 + x3, where x1 and x2 are the coefficients of assets 1 and 2 respectively, and x3 is the intercept. These coefficients are assumed to evolve slowly.
Code implementing this is given below; note that it just extends the existing example to have one more regressor.
Note also that you can get quite different results by playing with the delta parameter. If this is set large (far from zero), then the coefficients will change more rapidly, and the reconstruction of the regressand will be near-perfect. If it is set small (very close to zero) then the coefficients will evolve more slowly and the reconstruction of the regressand will be less perfect. You might want to look into the Expectation Maximisation algorithm - supported by pykalman.
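For instance, a minimal sketch of letting EM pick the noise covariances instead of hand-tuning delta (this reuses the kf object constructed in the code below; the em_vars choice is just one reasonable option):

# Fit the transition/observation covariances by EM, then filter as before.
kf = kf.em(prices['ETF'].values, n_iter=10,
           em_vars=['transition_covariance', 'observation_covariance'])
state_means, state_covs = kf.filter(prices['ETF'].values)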
CODE:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pykalman import KalmanFilter
if __name__ == "__main__":
    file_name = 'KalmanExample.txt'
    df = pd.read_csv(file_name, index_col=0)
    prices = df[['ETF', 'ASSET_1', 'ASSET_2']]

    delta = 1e-3
    trans_cov = delta / (1 - delta) * np.eye(3)
    obs_mat = np.vstack([prices['ASSET_1'], prices['ASSET_2'],
                         np.ones(prices['ASSET_1'].shape)]).T[:, np.newaxis]

    kf = KalmanFilter(
        n_dim_obs=1,
        n_dim_state=3,
        initial_state_mean=np.zeros(3),
        initial_state_covariance=np.ones((3, 3)),
        transition_matrices=np.eye(3),
        observation_matrices=obs_mat,
        observation_covariance=1.0,
        transition_covariance=trans_cov
    )

    # state_means, state_covs = kf.em(prices['ETF'].values).smooth(prices['ETF'].values)
    state_means, state_covs = kf.filter(prices['ETF'].values)

    # Reconstruct ETF from the coefficients and the ASSET_1 and ASSET_2 values:
    ETF_est = np.array([a.dot(b) for a, b in zip(np.squeeze(obs_mat), state_means)])

    # Draw slopes and intercept...
    pd.DataFrame(
        dict(
            slope1=state_means[:, 0],
            slope2=state_means[:, 1],
            intercept=state_means[:, 2],
        ), index=prices.index
    ).plot(subplots=True)
    plt.show()

    # Draw actual y and estimated y:
    pd.DataFrame(
        dict(
            ETF_est=ETF_est,
            ETF_act=prices['ETF'].values
        ), index=prices.index
    ).plot()
    plt.show()
PLOTS: (coefficient/intercept traces and the actual vs. estimated ETF; images not reproduced here)
In the PyMC3 example for multilevel linear regression (the example is here, using the radon data set from Gelman et al. (2007)), the intercepts (for different counties) and slopes (for apartments with and without a basement) each have a Normal prior. How can I model them together with a multivariate normal prior, so that I can examine the correlation between them?
The hierarchical model given in the example is like this:
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Intercept for each county, distributed around group mean mu_a.
    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)
    # Slope for each county, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=n_counties)

    # Model error
    eps = pm.HalfCauchy('eps', 5)

    radon_est = a[county_idx] + b[county_idx] * data.floor.values

    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)

    hierarchical_trace = pm.sample(2000)
And I'm trying to make some changes to the priors:
with pm.Model() as correlation_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)

    # here I want to model a and b together
    # I borrowed some code from a multivariate normal model
    # but the code does not work
    sigma = pm.HalfCauchy('sigma', 5, shape=2)
    C_triu = pm.LKJCorr('C_triu', n=2, p=2)
    C = T.fill_diagonal(C_triu[np.zeros((2, 2), 'int')], 1)
    cov = pm.Deterministic('cov', T.nlinalg.matrix_dot(sigma, C, sigma))
    tau = pm.Deterministic('tau', T.nlinalg.matrix_inverse(cov))
    a, b = pm.MvNormal('mu', mu=(mu_a, mu_b), tau=tau,
                       shape=(n_counties, n_counties))

    # Model error
    eps = pm.HalfCauchy('eps', 5)

    radon_est = a[county_idx] + b[county_idx] * data.floor.values

    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)

    correlation_trace = pm.sample(2000)
Here is the error message I got:
File "<ipython-input-108-ce400c54cc39>", line 14, in <module>
tau = pm.Deterministic('tau', T.nlinalg.matrix_inverse(cov))
File "/home/olivier/anaconda3/lib/python3.5/site-packages/theano/gof/op.py", line 611, in __call__
node = self.make_node(*inputs, **kwargs)
File "/home/olivier/anaconda3/lib/python3.5/site-packages/theano/tensor/nlinalg.py", line 73, in make_node
assert x.ndim == 2
AssertionError
Clearly I've made some mistakes with the covariance matrix, but I'm new to PyMC3 and completely new to Theano, so I have no idea how to fix it. I gather this should be a rather common use case, so maybe there are some examples of it? I just can't find them.
The full replicable code and data can be seen on the example page (link given above). I didn't include it here because it's too long, and I thought those familiar with PyMC3 are very likely already quite familiar with it. :)
You forgot one line when creating the covariance matrix, and you mis-specified the shape of the MvNormal. Your model should look something like this:
import numpy as np
import theano.tensor as tt

with pm.Model() as correlation_model:
    mu = pm.Normal('mu', mu=0., sd=10, shape=2)
    sigma = pm.HalfCauchy('sigma', 5, shape=2)
    C_triu = pm.LKJCorr('C_triu', n=2, p=2)
    C = tt.fill_diagonal(C_triu[np.zeros((2, 2), 'int')], 1.)
    sigma_diag = tt.nlinalg.diag(sigma)  # this line was missing
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
    tau = tt.nlinalg.matrix_inverse(cov)

    ab = pm.MvNormal('ab', mu=mu, tau=tau, shape=(n_counties, 2))

    eps = pm.HalfCauchy('eps', 5)

    radon_est = ab[:, 0][county_idx] + ab[:, 1][county_idx] * data.floor.values

    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)

    trace = pm.sample(2000)
Notice that, alternatively, you can evaluate the correlation of the intercept and the slope from the posterior of hierarchical_model. You can use a frequentist method, or build another Bayesian model that takes the result of hierarchical_model as its observed data. This may be faster.
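For example, a quick frequentist check (a sketch; it assumes hierarchical_trace from the model above and simply correlates the per-county posterior means):

import numpy as np

a_means = hierarchical_trace['a'].mean(axis=0)  # posterior mean intercept per county
b_means = hierarchical_trace['b'].mean(axis=0)  # posterior mean slope per county
print(np.corrcoef(a_means, b_means)[0, 1])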
EDIT
If you want to evaluate the correlation of two variables from the posterior, you can do something like this:
chain = hierarchical_trace[100:]
x_0 = chain['mu_a']
x_1 = chain['mu_b']
X = np.vstack((x_0, x_1)).T
and then you can run the following model:
with pm.Model() as correlation:
    mu = pm.Normal('mu', mu=0., sd=10, shape=2)
    sigma = pm.HalfCauchy('sigma', 5, shape=2)
    C_triu = pm.LKJCorr('C_triu', n=2, p=2)
    C = tt.fill_diagonal(C_triu[np.zeros((2, 2), 'int')], 1.)
    sigma_diag = tt.nlinalg.diag(sigma)
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
    tau = tt.nlinalg.matrix_inverse(cov)

    yl = pm.MvNormal('yl', mu=mu, tau=tau, shape=(2, 2), observed=X)

    trace = pm.sample(5000, pm.Metropolis())
You can replace x_0 and x_1 according to your needs. For example you may want to do:
x_0 = np.random.normal(chain['mu_a'], chain['sigma_a'])
x_1 = np.random.normal(chain['mu_b'], chain['sigma_b'])
I'd like to find the standard deviation and confidence intervals for an out-of-sample prediction from an OLS model.
This question is similar to Confidence intervals for model prediction, but with an explicit focus on using out-of-sample data.
The idea would be for a function along the lines of wls_prediction_std(lm, data_to_use_for_prediction=out_of_sample_df), that returns the prstd, iv_l, iv_u for that out of sample dataframe.
For instance:
import pandas as pd
import random
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
df = pd.DataFrame({"y":[x for x in range(10)],
"x1":[(x*5 + random.random() * 2) for x in range(10)],
"x2":[(x*2.1 + random.random()) for x in range(10)]})
out_of_sample_df = pd.DataFrame({"x1":[(x*3 + random.random() * 2) for x in range(10)],
"x2":[(x + random.random()) for x in range(10)]})
formula_string = "y ~ x1 + x2"
lm = smf.ols(formula=formula_string, data=df).fit()
# Prediction works fine:
print(lm.predict(out_of_sample_df))
# I can also get std and CI for in-sample data:
prstd, iv_l, iv_u = wls_prediction_std(lm)
print(prstd)
# I cannot figure out how to get std and CI for out-of-sample data:
try:
    print(wls_prediction_std(lm, exog=out_of_sample_df))
except ValueError as e:
    print(str(e))
# returns "ValueError: wrong shape of exog"

# trying to concatenate the DFs:
df_both = pd.concat([df, out_of_sample_df],
                    ignore_index=True)
# Only returns results for the data from df, not from out_of_sample_df
lm2 = smf.ols(formula=formula_string, data=df_both).fit()
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm2)
print(prstd2)
It looks like the problem is in the format of the exog parameter. This method is 100% stolen from this workaround by github user thatneat. It is necessary because of this bug.
import numpy as np

def transform_exog_to_model(fit, exog):
    transform = True
    self = fit
    # The following is lifted straight from statsmodels.base.model.Results.predict()
    if transform and hasattr(self.model, 'formula') and exog is not None:
        from patsy import dmatrix
        exog = dmatrix(self.model.data.orig_exog.design_info.builder,
                       exog)
    if exog is not None:
        exog = np.asarray(exog)
        if exog.ndim == 1 and (self.model.exog.ndim == 1 or
                               self.model.exog.shape[1] == 1):
            exog = exog[:, None]
        exog = np.atleast_2d(exog)  # needed in count model shape[1]
    # end lifted code
    return exog
transformed_exog = transform_exog_to_model(lm, out_of_sample_df)
print(transformed_exog)
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm, transformed_exog, weights=[1])
print(prstd2)
Additionally, you can try the get_prediction method (result is the fitted results object, i.e. lm in the question's code):
predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)
This returns the confidence and prediction intervals. I found the summary_frame() method buried here, and you can find the get_prediction() method here. You can change the significance level of the confidence and prediction intervals by modifying the "alpha" parameter.
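For example, with the lm fitted in the question (a sketch; for an OLS fit the summary frame typically includes the predicted mean, its confidence interval, and the observation/prediction interval columns):

pred = lm.get_prediction(out_of_sample_df)
frame = pred.summary_frame(alpha=0.05)
print(frame.head())  # columns such as mean, mean_ci_lower/upper, obs_ci_lower/upper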