I am trying to use an ARIMA model, fitted by arima_mod = sm.tsa.ARIMA(residual, (p,d,q)).fit(trend="c", maxiter=20), for out-of-sample prediction of the next value in the residual series. To do that, I may apply one of the following:
next_pred1 = arima_mod.predict(start,end,dynamic=True)[-1]
next_pred2 = arima_mod.predict(start,end,dynamic=False)[-1]
The results of both predictions are bad. Correction: with dynamic=False they are bad; with dynamic=True they are horrible. I am trying to understand why. When I set start to the distant past and end to the next value (which is out of sample), the prediction is bad but at least it does not change between different start values, i.e., for start = 4 and for start = 8 the prediction gives the same output (I have about 40 samples to base the prediction on). When I set the same start/end values and use dynamic=True, the output gets worse the more past I include. That does not make sense to me: for ARIMA(2,0,2) the prediction should only use the estimates of the last 2 values with the dynamic setting, but it seems to use all past samples for forecasting. The output value is suddenly 1e-20 rather than, say, 0.15 or 1.1. It is as if some strange (weighted by p or similar) approximation of the mean of the series were used for the prediction.
Please advise: how many samples should I take back for forecasting the next out-of-sample value with ARIMA(p,d,q)? The last p values only? Something else? And why are the results of the prediction so bad? The model was trained on 40 values; is that not sufficient for small p and q? (I know that more values are better, but this is all I have.)
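For completeness, here is roughly the full sequence of calls; if I understand the docs correctly, forecast() is the more direct way to get a single out-of-sample step with this API, but I would still like to understand the predict() behaviour:

import statsmodels.api as sm

# fit on my ~40-sample residual series (p, d, q as above)
arima_mod = sm.tsa.ARIMA(residual, (p, d, q)).fit(trend="c", maxiter=20)

# what I have been using
next_pred1 = arima_mod.predict(start, end, dynamic=True)[-1]
next_pred2 = arima_mod.predict(start, end, dynamic=False)[-1]

# one-step out-of-sample forecast; in this version it returns
# (forecast, stderr, conf_int)
fc, stderr, conf_int = arima_mod.forecast(steps=1)
next_pred3 = fc[0]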
Your advice will be appreciated.
Thanks!
Related
I'm forecasting website visits using an ARIMA model for time-series forecasting in Python. Here are the main details to give a picture of the process:
(The following are shown as screenshots in the original post:)
- imported modules
- data and actual plot (the figures represent the number of visits; dates are used as the index, but as strings, i.e. formatted dates)
- differences (1 and 2)
- correlations (auto & partial)
- ARIMA model with (1,1,1) order
- ARIMA error
So, as you can see, 1 is the best choice for both p and q, and a differencing of 1 gives a more stationary series (differencing of 2 doesn't even work). So I chose order = (1,1,1), which should be the best fit, but I get a ValueError that says:
The model specification cannot be estimated. The model contains 7 regressors (0 trend, 0 seasonal, 7 lags) but after adjustment for hold_back and creation of the lags, there are only 2 data points available to estimate parameters.
It doesn't work for (1,2,1) either, with the same error. However, it works for (0,1,1), which gives a reasonable q (MA) coefficient (0.01) and an AIC of 147.596. But the plot doesn't look good when compared with the actuals:
Actual with Forecast (plot_predict())
Please help me identify why ARIMA gives an error for (1,1,1), which should be the best order. How can I solve this?
Thanks in advance!
I built a Linear Regression model in Python, and I had a target variable, for example Sales: 10, 9, 8.
I decided to log my target variable: df["Sales"] = np.log(df["Sales"]), so after that I have values of roughly 2.3, 2.2 and 2.1.
My question is: how can I interpret the results of this model, given that the target was logged? Currently my interpretation reads, for example, "if it is night, sales decrease on average by 1.333", but that is probably a bad interpretation, because without the log of the target the result would be on a much larger scale, for example "if it is night, sales decrease on average by 2,589".
So how can I interpret the results of the Linear Regression after taking the log of the target, given that the logged target has such low values?
Your transformation is called a "log-level" regression. That is, your target variable was log-transformed and your independent variables are left in their normal scales.
The model should be interpreted as follows:
On average, a one-unit change in X_i is associated with a change of approximately 100 * B_i percent in the target.
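As a quick numerical check (the coefficient value here is just made up for illustration):

import numpy as np

b_i = 0.05                           # hypothetical coefficient from the log-level model
approx_pct = 100 * b_i               # rule-of-thumb interpretation: ~5% change per one-unit change in X_i
exact_pct = 100 * (np.exp(b_i) - 1)  # exact percent change: ~5.13%
print(approx_pct, exact_pct)

The 100 * B_i rule is only a good approximation for small coefficients; for larger ones, use 100 * (exp(B_i) - 1).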
Do note that if you transformed any of your independent variables, the interpretation changes too. For example, if you changed X_i to np.log(df['X_i']), then you would interpret B_i as a log-log coefficient (an elasticity: a 1% change in X_i is associated with roughly a B_i percent change in the target).
You can find a handy cheat sheet to help you remember how to interpret models here.
I detrended my data in Python using scipy.signal.detrend with the following code:

from scipy import signal
import numpy as np

detrended = signal.detrend(feature, axis=-1, type='constant', bp=0, overwrite_data=True)
np.savetxt('constant detrend.csv', detrended, delimiter=',', fmt='%s')
The last line saves the data into a CSV file; I then reload this data to run some models. I found that my RandomForest model performs really well on the detrended dataset.
So the next step is to make forecasts using this model. However, I am a bit unsure how to move from the detrended dataset back to a more meaningful dataset that I can understand. From my understanding, the detrend removed the mean and normalized the data. But when I make predictions, I need to be able to see the actual numbers of my forecasts, not the detrended numbers.
Is there a way I can re-add the mean and re-normalize to get a 'meaningful' dataset that I can interpret? For example, my dataset has a rainfall variable, so for each month I can see how much it rained. But once I detrended, the rainfall value is no longer the actual rainfall value. When I make forecasts, I want to be able to say that in this month it rained 200 mm, but my forecasts don't tell me this since the data has been detrended.
Any help would be appreciated.
According to the docs, detrend simply removes the least squares line fit from the data. When you use type='constant', it's even simpler, since it just removes the mean:
If type == 'constant', only the mean of data is subtracted.
The source code bears this out. After checking the inputs, the entire computation is done in one line (scipy/signal/signaltools.py, line 3261):
ret = data - np.expand_dims(np.mean(data, axis), axis)
The easiest way to get the subtracted mean is to implement the calculation by hand, given how simple it is.
mean = np.mean(feature, axis=-1, keepdims=True)
detrended = feature - mean
You can save the mean to a file, or do whatever else you want with it. To "retrend", just add the mean back:
point = prediction + mean
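Since you already save the detrended data to a CSV, you can do the same with the mean and reload it before converting your forecasts back. For instance (the file name and the reshape are just an example; adjust to the shape of your feature array):

# save alongside the detrended data
np.savetxt('mean.csv', mean, delimiter=',')

# later, when you want real-scale forecasts
mean = np.loadtxt('mean.csv', delimiter=',').reshape(-1, 1)
point = prediction + mean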
If you had some other manipulation you were concerned with, like normalizing to the maximum, you could handle it the same way.
max = np.amax(detrended, axis=-1, keepdims=True)
detrended /= max
In this case you'd have to multiply before offsetting to retrend:
point = prediction * max + mean
Simple manipulations like this are easy to reproduce by hand. A more complicated function might be hard to reproduce reliably, but would also be more likely to return the parameters it uses, at least optionally.
I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the predicted value and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression, the only output you get is a predicted value (that's why it is called regression), so for any regressor the probability of a prediction is not available; it only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after a time (x), in seconds for instance. At x = 0 s it is at 20°C, you start heating it, and you want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you have taken care of heteroscedasticity, so your interval is valid across all the data.
You can probably try to get the distribution of your known outputs, place the prediction on that curve, and check the p-value. But that would only give you a measure of how plausible it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously, in practice the outputs would be your real outputs; here I simply simulate them from a normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d

N = 1000  # number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)

# We want a density histogram (since this is a PDF, it must integrate to 1)
nbins = 10
p, x = np.histogram(outputs, bins=nbins, density=True)
plt.hist(outputs, bins=nbins, density=True)
x = x[:-1] + (x[1] - x[0]) / 2  # convert bin edges to bin centers

# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')

x = np.linspace(-2.9 * std, 2.9 * std, 10000)
plt.plot(x, f(x))
plt.show()

# To check: the area under the interpolated PDF (should be close to 1)
area, err = integrate.quad(f, x[0], x[-1])
print(area)
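Once you have f, you can evaluate it at a new prediction to see how plausible that output value is (y_pred below is just a placeholder for a value returned by your regressor):

y_pred = 1.5          # placeholder for a prediction from the regressor
density = f(y_pred)   # estimated density of the known outputs at that value
print(density)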
Now, the interpolation is not great for outliers: if a predicted value is extremely far (more than about 3 standard deviations) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in the time I had. I'm sure there are better ways to do it. If your data follows a normal law, it becomes trivial.
I suggest you look into NGBoost (a gradient-boosting method in the spirit of XGBoost that ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, and the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit a boosting model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
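For illustration, the usage looks roughly like this (X_train, y_train, X_test are placeholders for your own data; this follows the basic example in the NGBoost documentation):

from ngboost import NGBRegressor

ngb = NGBRegressor()               # Gaussian distribution by default
ngb.fit(X_train, y_train)

point_pred = ngb.predict(X_test)   # point predictions (mean of each fitted Gaussian)
dist_pred = ngb.pred_dist(X_test)  # full conditional distributions P(Y|X=x)

print(dist_pred.params)            # e.g. {'loc': ..., 'scale': ...} for the Gaussian case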
Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc

class mymodel:
    rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
    rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
    total_rate = pymc.LinearCombination('total_rate', [1, 1], [rate_x, rate_y])
    data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)

Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)

print MCMC.stats()['rate_x']['quantiles']
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print MCMC.stats()['rate_x']['quantiles']
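In other words, what I am doing amounts to a loop over pseudoexperiments, roughly like this (a sketch using the objects defined above; n_pseudo is an arbitrary number I picked):

n_pseudo = 100
results = []
for i in range(n_pseudo):
    # draw rate_x and rate_y (and hence total_rate) from the prior
    Mod.draw_from_prior()
    # generate a pseudo-observation and force it into the observed node
    Mod.data.set_value(Mod.data.random(), force=True)
    # sample from the posterior given this pseudo-observation
    MCMC.sample(100000, burn=5000, thin=5)
    results.append(MCMC.stats()['rate_x']['quantiles'])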
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?