I am trying to start using the AR models in statsmodels. However, I seem to be doing something wrong. Consider the following example, which fails:
from statsmodels.tsa.ar_model import AR
import numpy as np
signal = np.ones(20)
ar_mod = AR(signal)
ar_res = ar_mod.fit(4)
ar_res.predict(4, 60)
I think this should just continue the (trivial) time series consisting of ones. However, in this case it seems to return not enough parameters. len(ar_res.params) equals 4, while it should be 5. In the following example it works:
signal = np.ones(20)
signal[range(0, 20, 2)] = -1
ar_mod = AR(signal)
ar_res = ar_mod.fit(4)
ar_res.predict(4, 60)
I have the feeling that this could be a bug but I am not sure as I have no experience using the package. Maybe someone with more experience can help me...
EDIT: I have reported the issue here.
It works after adding a bit of noise, for example
signal = np.ones(20) + 1e-6 * np.random.randn(20)
My guess is that the constant is not added properly because of perfect collinearity with the signal.
You should open an issue to handle this corner case better. https://github.com/statsmodels/statsmodels/issues
My guess is also that the parameters are not identified in this case, so there might not be any good solution.
(Parameters not identified means that several parameter combinations can produce exactly the same fit, but I think they should all produce the same predictions in this case.)
Related
i am seeking a solution for the following equation in Python.
345-0.25*t = 37.5 * x_a
'with'
t = max(0, 10-x_a)*(20-10) + max(0,25-5*x_a)*(3-4) + max(0,4-0.25*x_a)*(30-12.5)
'x_a = ??'
If there is more than one solution to the problem (I am not even sure, whether this can happen from a mathematical point of view?), I want my code to return a positive(!) value for x_a, that minimizes t.
With my previous knowledge in the Basics of Python, Pandas and NumPy I actually have no clue, how to tackle this problem. Can someone give me a hint?
For Clarification: I inserted some exemplary numbers in the equation to make it easier to gasp the problem. In my final code, there might of course be different numbers for different scenarios. However, in every scenario x_a is the only unknown variable.
Update
I thought about the problem again and came up with the following solution, which yields the same result as the calculations done by Michał Mazur:
import itertools
from sympy import Eq, Symbol, solve
import numpy as np
x_a = Symbol('x_a')
possible_elements = np.array([10-x_a, 25-5*x_a, 4-0.25*x_a])
assumptions = np.array(list(itertools.product([True, False], repeat=3)))
for assumption in assumptions:
x_a = Symbol('x_a')
elements = assumption.astype(int) * possible_elements
t = elements[0]*(20-10) + elements[1]*(3-4) + elements[2]*(30-12.5)
eqn = Eq(300-0.25*t, 40*x_a)
solution = solve(eqn)
if len(solution)>2:
print('Warning! the code may suppress possible solutions')
if len(solution)==1:
solution = solution[0]
if (((float(possible_elements[0].subs(x_a,solution))) > 0) == assumption[0]) &\
(((float(possible_elements[1].subs(x_a,solution))) > 0) == assumption[1]) &\
(((float(possible_elements[2].subs(x_a,solution)))> 0) == assumption[2]):
print('solution:', solution)
Compared to the already suggested approach this may have an little advantage as it does not rely on testing all possible values and therefore can be used for very small as well as very big solutions without taking a lot of time (?). However, it probably only is useful as long as you don't have more complex functions for t (even having for example 5 max(...) statements and therefore (2^5=)32 scenarios to test seems quite cumbersome).
As far as I'm concerned, I just realized that my problem is even more complex as i thought. For my project the calculations needed to derive the value of "t" are pretty entangled and can not be written in just one equation. However it still is a function, that only relies on x_a. So I am still hoping for a Python-Solution similar to the proposed Solver in Excel... or I will stick to the approach of simply testing out all possible numbers.
If you are interested in a solution a bit different than the Python one, then I will give you a hand. Open up your Excel, with Solver extention and plug in
the data you are interested in cheking, as the following:
Into the E2 you plug the command I just have writen, into E4 you plug in
=300-0,25*E2
Into the F4 you plug:
=40*F2
Then you open up your Solver menu
Into the Set Objective you put the variable t, which you want to minimize.
Into Changing Variables you put the a.
Into Constraint Menu you put the equality of E4 and F4 cells.
You check the "Make Unconstarained Variables be non-negative" which will prevent your a variable to go below 0. Your method of computing is strictly non-linear, so you leave this option there.
You click solve. The computed value is presented in the screen.
The python approach I can think of:
minimumval=10100
minxa=10000
eps=0.01
for i in range(100000):
k=i/10000
x_a=k
t = max(0, 10-x_a)*(20-10) + max(0,25-5*x_a)*(3-4) + max(0,4-0.25*x_a)*(30-12.5)
val=abs(300-0.25*t-40*x_a)
if (val<eps):
if t<minimumval:
minimumval=t
minxa=x_a
It isn't direct solution, as in it only controls the error that you make in the equality by eps value. However, it gives solution.
I have a hierarchical logit that has observations over time. Following Carter 2010, I have included a time, time^2, and time^3 term. The model mixes using Metropolis or NUTS before I add the time variables. HamiltonianMC fails. NUTS and Metropolis also work with time. But NUTS and Metropolis fail with time^2 and time^3, but they fail differently and in a puzzling way. However, unlike in other models that fail for more obvious model specification reasons, ADVI still gives an estimate, (and ELBO is not inf).
NUTS either stalls early (last time it made it to 60 iterations), or progresses too quickly and returns an empty traceplot with an error about varnames.
Metropolis errors out immediately with a dimension mismatch error. It looks like the one in this github error, but I'm using a Bernoulli outcome, not a negative binomial. The end of the error looks like: ValueError: Input dimension mis-match. (input[0].shape[0] = 1, input[4].shape[0] = 18)
I get an empty trace when I try HamiltonianMC. It returns the starting values with no mixing
ADVI gives me a mean and a standard deviation.
I increased the ADVI iterations by a lot. It gave pretty close to the same starting points and NUTS still failed.
I double checked that the fix for the github issue is in place in the version of pymc3 I'm running. It is.
My intuition is that this has something to do with how huge the time^2 and time^3 variables get, since I'm looking over a large time-frame. Time^3 starts at 0 and goes to 64,000.
Here is what I've tried for sampling so far. Note that I have small sample sizes while testing, since it takes so long to run (if it finishes) and I'm just trying to get it to sample at all. Once I find one that works, I'll up the number of iterations
with my_model:
mu,sds,elbo = pm.variational.advi(n=500000,learning_rate=1e-1)
print(mu['mu_b'])
step = pm.NUTS(scaling=my_model.dict_to_array(sds)**2,
is_cov=True)
my_trace = pm.sample(500,
step=step,
start=mu,
tune=100)
I've also done the above with tune=1000
I've also tried a Metropolis and Hamiltonian.
with my_model:
my_trace = pm.sample(5000,step=pm.Metropolis())
with my_model:
my_trace = pm.sample(5000,step=pm.HamiltonianMC())
Questions:
Is my intuition about the size and spread of the time variables reasonable?
Are there ways to sample square and cubed terms more effectively? I've searched, but can you perhaps point me to a resource on this so I can learn more about it?
What's up with Metropolis and the dimension mismatch error?
What's up with the empty trace plots for NUTS? Usually when it stalls, the trace up until the stall works.
Are there alternative ways to handle time that might be easier to sample?
I haven't posted a toy model, because it's hard to replicate without the data. I'll add a toy model once I replicate with simulated data. But the actual model is below:
with pm.Model() as my_model:
mu_b = pm.Flat('mu_b')
sig_b = pm.HalfCauchy('sig_b',beta=2.5)
b_raw = pm.Normal('b_raw',mu=0,sd=1,shape=n_groups)
b = pm.Deterministic('b',mu_b + sig_b*b_raw)
t1 = pm.Normal('t1',mu=0,sd=100**2,shape=1)
t2 = pm.Normal('t2',mu=0,sd=100**2,shape=1)
t3 = pm.Normal('t3',mu=0,sd=100**2,shape=1)
est =(b[data.group.values]* data.x.values) +\
(t1*data.t.values)+\
(t2*data.t2.values)+\
(t3*data.t3.values)
y = pm.Bernoulli('y', p=tt.nnet.sigmoid(est), observed = data.y)
BREAKTHROUGH 1: Metropolis error
Weird syntax issue. Theano seemed to be confused about a model with both constant and random effects. I created a constant in data equal to 0, data['c']=0 and used it as an index for the time, time^2 and time^3 effects, as follows:
est =(b[data.group.values]* data.x.values) +\
(t1[data.c.values]*data.t.values)+\
(t2[data.c.values]*data.t2.values)+\
(t3[data.c.values]*data.t3.values)
I don't think this is the whole issue, but it's a step in the right direction. I bet this is why my asymmetric specification didn't work, and if so, suspect it may sample better.
UPDATE: It sampled! Will now try some of the suggestions for making it easier on the sampler, including using the specification suggested here. But at least it's working!
Without having the dataset to play with it is hard to give a definite answer, but here is my best guess:
To me, it is a bit unexpected to hear about the third order polynomial in there. I haven't read the paper, so I can't really comment on it, but I think this might be the reason for your problems. Even very small values for t3 will have a huge influence on the predictor. To keep this reasonable, I'd try to change the parametrization a bit: First make sure that your predictor is centered (something like data['t'] = data['t'] - data['t'].mean() and after that define data.t2 and data.t3). Then try to set a more reasonable prior on t2 and t3. They should be pretty small, so maybe try something like
t1 = pm.Normal('t1',mu=0,sd=1,shape=1)
t2 = pm.Normal('t2',mu=0,sd=1,shape=1)
t2 = t2 / 100
t3 = pm.Normal('t3',mu=0,sd=1,shape=1)
t3 = t3 / 1000
If you want to look at other models, you could try to model your predictor as a GaussianRandomWalk or a Gaussian Process.
Updating pymc3 to the latest release candidate should also help, the sampler was improved quit a bit.
Update I just noticed you don't have an intercept term in your model. Unless there is a good reason for that you probably want to add
intercept = pm.Flat('intercept')
est = (intercept
+ b[..] * data.x
+ ...)
I'm working on some python code to interpolate irregular data onto a 180° lat x 360° lon spherical grid. The code is currently hanging when I call the following:
def geo_interp(lats,lons,data,grid_size_deg):
deg2rad = pi/180.
new_lats = np.linspace(grid_size_deg, 180, 180) * deg2rad
new_lons = np.linspace(grid_size_deg, 360, 360) * deg2rad
knotst, knotsp = new_lats.copy(), new_lons.copy()
knotst[0] += .0001
knotst[-1] -= .0001
knotsp[0] += .0001
knotsp[-1] -= .0001
lut = LSQSphereBivariateSpline(lats.ravel(),lons.ravel(),data.T.ravel(),knotst,knotsp)
data_interp = lut(new_lats, new_lons)
return data_interp
The arrays I'm using as arguments when I call the above subroutine all fit the requirements of LSQSphereBivariateSpline as listed in the documentation. When I run it, it takes much longer than I feel like it should take to process a 180x360 dataset.
When I run the script with python -m trace --trace, the last line of output before nothing happens for a long time is
fitpack2.py(1025): w=w, eps=eps)
As far as I can tell, line 1025 of fitpack2.py is in a comment, which is even more confusing.
So my questions are:
1. Is there a way to tell if it's hanging or just very slow?
2. If it's hanging, how might I fix it?
The only thing I can think of is that I have no idea what I'm doing as far as choosing knots. Is there a good way to choose those? I just went with the grid I'll be interpolating to later, since the example in the doc seemed to be an arbitrary grid.
UPDATE: It finally finished after about 3 hours but the "interpolated data" looked like random noise. Also, if this is relevant, as far as I can tell LSQSphereBivariateSpline is the only function I can use for this because my lats and lons are not strictly increasing.
Also, I should add that when it finished it output the following warning: Warning (from warnings module):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/interpolate/fitpack2.py", line 1029
warnings.warn(message)
UserWarning:
WARNING. The coefficients of the spline returned have been computed as the
minimal norm least-squares solution of a (numerically) rank
deficient system (deficiency=16336, rank=48650). Especially if the rank
deficiency, which is computed by 6+(nt-8)*(np-7)+ier, is large,
the results may be inaccurate. They could also seriously depend on
the value of eps.
SOLUTION: I had far too many knots, causing both the glacial pace and the useless results.
I am working on translating a model from MATLAB to Python. The crux of the model lies in MATLAB's ode15s. In the MATLAB execution, the ode15s has standard options:
options = odeset()
[t P] = ode15s(#MODELfun, tspan, y0, options, params)
For reference, y0 is a vector (of size 98) as is MODELfun.
My Python attempt at an equivalent is as follows:
ode15s = scipy.integrate.ode(Model.fun)
ode15s.set_integrator('vode', method = 'bdf', order = 15)
ode15s.set_initial_value(y0).set_f_params(params)
dt = 1
while ode15s.successful() and ode15s.t < duration:
ode15s.integrate(ode15s.t+dt)
This though, does not seem to be working. Any suggestions, or an alternative?
Edit:
After looking at the output, the result I'm getting from the Python is either no change in some elements of y0 over time, or a constant change at each step for the rest of the y0. Any experience with something like this?
According to the SciPy wiki for Matlab Users, the right way for using the ode15s is
scipy.integrate.ode(f).set_integrator('vode', method='bdf', order=15)
One point to make clear is that, unlike Matlab's ode15s, the scipy integrator 'vode' does not support models with a mass matrix. So any recommendation should include this caveat.
I am interested in using python to compute a confidence interval from a student t.
I am using the StudentTCI() function in Mathematica and now need to code the same function in python http://reference.wolfram.com/mathematica/HypothesisTesting/ref/StudentTCI.html
I am not quite sure how to build this function myself, but before I embark on that, is this function in python somewhere? Like numpy? (I haven't used numpy and my advisor advised not using numpy if possible).
What would be the easiest way to solve this problem? Can I copy the source code from the StudentTCI() in numpy (if it exists) into my code as a function definition?
edit: I'm going to need to build the Student TCI using python code (if possible). Installing scipy has turned into a dead end. I am having the same problem everyone else is having, and there is no way I can require Scipy for the code I distribute if it takes this long to set up.
Anyone know how to look at the source code for the algorithm in the scipy version? I'm thinking I'll refactor it into a python definition.
I guess you could use scipy.stats.t and its interval method:
In [1]: from scipy.stats import t
In [2]: t.interval(0.95, 10, loc=1, scale=2) # 95% confidence interval
Out[2]: (-3.4562777039298762, 5.4562777039298762)
In [3]: t.interval(0.99, 10, loc=1, scale=2) # 99% confidence interval
Out[3]: (-5.338545334351676, 7.338545334351676)
Sure, you can make your own function if you like. Let's make it look like in Mathematica:
from scipy.stats import t
def StudentTCI(loc, scale, df, alpha=0.95):
return t.interval(alpha, df, loc, scale)
print StudentTCI(1, 2, 10)
print StudentTCI(1, 2, 10, 0.99)
Result:
(-3.4562777039298762, 5.4562777039298762)
(-5.338545334351676, 7.338545334351676)