I have one FB Prophet demand forecast time series for each of three consumer classes. I want to sum these three forecasts and get a confidence interval for the total demand (the stakeholders may look at everything and choose different assumptions for each consumer class). How should I approach that?
I have tried:
- Taking, at each point in time, the usual square root of the sum of the variances plus two times each covariance. That returned an uncertainty interval much wider than makes sense (the sum of the historical series falls entirely well within two standard deviations), possibly because of trend uncertainty. Also, the distribution around each point in the series is not normal, so just taking the standard deviation won't do.
- Adding sample forecasts from each model and using the sums to estimate the confidence intervals (sketched below). But then I remembered that the samples wouldn't be correlated.
Any other ideas? Is there any way for me to correlate samples from the three models?
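For reference, the sampling approach from my second attempt looks roughly like this (a sketch; m1, m2, m3 are the fitted Prophet models, future is the dataframe from make_future_dataframe, and I'm relying on predictive_samples returning an array of shape (n_timesteps, n_samples) under the "yhat" key):

```python
import numpy as np

# Draw posterior predictive samples from each class's model.
samples = [m.predictive_samples(future)["yhat"] for m in (m1, m2, m3)]

# Summing draw-by-draw treats the three classes as independent.
total = sum(samples)

# Sorting each model's draws at every time step lines high draws up with
# high draws, i.e. the perfectly correlated (comonotonic) extreme:
total_upper = sum(np.sort(s, axis=1) for s in samples)

# Empirical 80% interval for the total at each time step.
lower, upper = np.percentile(total, [10, 90], axis=1)
```

The sorted version gives an upper bound on how wide the interval can get under positive dependence, but I don't see how to target the intermediate correlation that the historical series actually show.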
I set up a sensor which measures temperature every 3 seconds. I collected data for 3 days and have 60,000 rows in my CSV export. Now I would like to forecast the next few days. Looking at the data, you can already see a "seasonality" reflecting the fridge's heating and cooling cycle, so I guess it shouldn't be too difficult to predict. I am not really sure if my data is too granular and whether I should do some kind of undersampling. I thought about using a seasonal ARIMA model, but I am having difficulties picking parameters. As the seasonality in the data is pretty obvious, is there maybe a model that fits better? Please bear with me, I'm pretty new to machine learning.
When the goal is to forecast rising temperatures, you can forecast the lower and upper peaks, i.e., their heights and the distances between them. Assuming (as a simplified model) that the temperature change in between is linear, we can model each complete cycle, starting from a lower peak of the temperature curve up to the next upper peak and down to the next lower peak. A complete cycle can then be seen as a triangle, which is easy to integrate (its area plus the area of the rectangle below it). The estimate is obtained by integrating a number of complete cycles that have already been measured. Repeating this procedure, we can run a linear regression on the average temperatures and alert when the slope is above a defined threshold.
As this only catches one kind of problem, one can do the same for the average distances between the upper peaks, and also for the lower peaks: take the times between them over a certain period, fit a curve (linear regression may well be sufficient), and alert when the slope of the curve indicates that the distances are getting too long.
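A rough sketch of that procedure (temps is a hypothetical 1-D array of readings at a fixed interval; scipy's find_peaks and numpy's polyfit stand in for the peak detection and regression steps):

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.integrate import trapezoid

# temps: 1-D array of temperature readings (hypothetical input).
lower_idx, _ = find_peaks(-temps)   # lower peaks = peaks of the negated curve

# Average temperature of each complete cycle, obtained by integrating from
# one lower peak to the next (the trapezoidal rule generalises the
# triangle-plus-rectangle geometry described above).
cycle_means = []
for start, stop in zip(lower_idx[:-1], lower_idx[1:]):
    area = trapezoid(temps[start:stop + 1])
    cycle_means.append(area / (stop - start))

# Linear regression on the per-cycle averages; alert when the slope
# exceeds a chosen threshold.
slope, _ = np.polyfit(np.arange(len(cycle_means)), cycle_means, 1)
THRESHOLD = 0.01  # degrees per cycle; application-specific
if slope > THRESHOLD:
    print("warning: average temperature is rising")
```

The same loop, applied to np.diff of the peak indices instead of the integrated areas, covers the peak-distance variant.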
It's mission impossible. If the fridge works without interference, the graph always looks the same. Changes can be caused, for example, by an open door, a breakdown, or a major change in external conditions, and you cannot predict such events. Instead, you can try to warn about the possibility of problems in the near future, for example based on a constant increase in the average temperature. Such a situation may indicate a leak in the cooling system.
By the way, do you really need to log the temperature every 3 seconds? This is usually unjustified, because it is physically impossible for the temperature to change by a measurable amount in such an interval. Our team usually sets the logging interval to 30 or 60 seconds in such cases, sometimes even longer, depending on the size of the chamber, the way the air is circulated, the ratio of volume to power of the refrigeration unit, etc.
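And if the 3-second data has already been collected, thinning it after the fact is a one-liner in pandas (the column names here are assumptions about the CSV layout):

```python
import pandas as pd

# "timestamp" and "temperature" are guesses at the export's column names.
df = pd.read_csv("fridge.csv", parse_dates=["timestamp"], index_col="timestamp")

# Downsample from one reading every 3 s to one every 30 s by averaging,
# which also smooths out sensor noise.
temps_30s = df["temperature"].resample("30s").mean()
```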
I am using this code from Jindong Wang to estimate MMD (Maximum Mean Discrepancy), with the aim of distinguishing between different characteristics of time series that I artificially generate following this scikit-learn example. I started with a simple A*sin(wx + phi) to test whether it is possible to differentiate phases, amplitudes, or frequencies with such an approach, by comparing each data set against the baseline sin(x). The idea is that the distances should increase as I choose larger frequencies or amplitudes. I have three questions (a stripped-down version of my setup follows the list):
1. How can I estimate the uncertainty of the MMD distances? (This is a more theoretical question.)
2. How can I optimize (in terms of memory and my arrays) so that I can use long time series with more than 10,000 time points with x-y elements?
3. Why does it work fine for differences in amplitude but not for phase or frequency? Could this be related to the sampling frequency of the data?
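For reference, a stripped-down version of the experiment (a plain RBF-kernel MMD rather than the linked implementation):

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    # Biased estimate of squared MMD under a Gaussian RBF kernel
    # (textbook V-statistic form, not the linked implementation).
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    def k(a, b):
        return np.exp(-((a - b.T) ** 2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Compare A*sin(x) against the baseline sin(x) for growing amplitudes.
t = np.linspace(0, 8 * np.pi, 2000)
for A in (1.0, 1.5, 2.0, 3.0):
    print(A, rbf_mmd2(np.sin(t), A * np.sin(t)))

# Memory note (question 2): the kernel matrices are n x n, so cost grows
# quadratically; for >10,000 points, subsample or average MMD over blocks.
```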
I have a time series in which I am trying to detect anomalies. For those anomalies, I want a range within which the data points should lie in order not to be flagged as anomalies. I am using the ML.NET algorithm to detect anomalies and have that part done, but how do I get the range?
If I can somehow get the range for the points in the time series, I can plot it and show that the points outside this range are anomalies.
I have tried to calculate the range using a prediction-interval calculation, but that doesn't work for all the data points in the time series.
For example, assume I have 100 points and I take 100/4, i.e. 25, as the sliding window to calculate the prediction interval for the next point, i.e. the 26th point. The problem then is: how do I calculate the prediction interval for the first 25 points?
A method operating on a fixed-length sliding window generally needs that entire window to be filled in order to produce an output. In that case, you must pad the beginning of the input sequence if you want predictions (and thus anomaly scores) for the first data points. It can be hard to make that padded data realistic, however, which can lead to poor predictions.
A nifty technique is to compute anomaly scores with two different models, one going in the forward direction and the other in the reverse direction, so that you get scores everywhere. However, you must then decide how to handle the areas where you have two sets of predictions: whether to use the min, max, or average anomaly score.
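A minimal sketch of that forward/backward combination, using a rolling Gaussian prediction interval as a stand-in for the ML.NET detector (whose API I won't guess at); values is a placeholder for your data:

```python
import pandas as pd

def rolling_interval(s, window=25, z=1.96):
    # Prediction interval from the trailing window: mean +/- z * std.
    # shift(1) excludes each point from its own interval; the first
    # `window` outputs are NaN.
    m = s.rolling(window).mean().shift(1)
    sd = s.rolling(window).std().shift(1)
    return m - z * sd, m + z * sd

s = pd.Series(values)                        # `values`: your 100 points

lo_f, hi_f = rolling_interval(s)             # forward pass
lo_b, hi_b = rolling_interval(s.iloc[::-1])  # backward pass on reversed data
lo_b, hi_b = lo_b.iloc[::-1], hi_b.iloc[::-1]

# Where both passes give an interval, keep the tighter bound; where only
# one does (the first and last `window` points), use whichever exists.
lo = pd.concat([lo_f, lo_b], axis=1).max(axis=1)
hi = pd.concat([hi_f, hi_b], axis=1).min(axis=1)

anomalies = (s < lo) | (s > hi)
```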
There are also some models that operate well on variable-length inputs, such as sequence-to-sequence models built with Recurrent Neural Networks.
Using Python, I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and that the time steps are not constant; the model, however, uses constant time steps.
As a first step, I would like to use a statistical method to quantify how similar the two light curves are. Which method is best suited for this?
As a second step, I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent piece of software. The model depends on four parameters, all of which are limited to a certain range, and which I currently feed to the software manually (automation is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link https://imgur.com/a/zZ5xoqB provides three different plots: the simulated light curve, the actual measurement, and both together. The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning that the phase and period match, the magnitude is of the same order, and even the specular flashes occur at the same period.
If I understand this correctly, you're asking a more foundational question that would be better answered on https://datascience.stackexchange.com/ rather than something specific to Python.
That said, speaking as a data science layperson: this may be a problem suited to gradient descent with a mean-squared-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points.
Then you make a tiny change to each parameter in turn and calculate how the cost function is affected. You then change all the parameters (by a tiny amount) in the direction that decreases the cost function, and repeat until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
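A sketch of that loop for the four-parameter case. The finite-difference gradient is there because the external software gives no analytic derivatives; run_model is a stand-in (here a toy sine model so the example runs end to end, but in practice it would invoke your external tool and read back the simulated curve):

```python
import numpy as np

T = np.linspace(0, 10, 200)   # measurement times (made up)

def run_model(params):
    # Stand-in for the external simulation: amplitude, period, phase, offset.
    amp, period, phase, offset = params
    return amp * np.sin(2 * np.pi * T / period + phase) + offset

def cost(params, observed):
    return np.mean((run_model(params) - observed) ** 2)

def fit(observed, p0, lr=1e-2, eps=1e-5, n_iter=2000):
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        grad = np.zeros_like(p)
        for i in range(p.size):               # finite-difference gradient
            dp = np.zeros_like(p)
            dp[i] = eps
            grad[i] = (cost(p + dp, observed) - cost(p - dp, observed)) / (2 * eps)
        p -= lr * grad                        # step downhill
    return p

observed = run_model([2.0, 3.0, 0.5, 1.0])    # pretend measurements
print(fit(observed, p0=[1.5, 2.8, 0.0, 0.5]))
```

With only four bounded parameters, wrapping the same cost function in scipy.optimize.minimize, or even scipy.optimize.brute (the brute-force fit you mention), is usually more practical than a hand-rolled loop.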
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning that the phase and period match, the magnitude is of the same order, and even the specular flashes occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just the phase/period/amplitude of each? If so, what you're looking for is the Fourier transform of your signal, which is very easy to compute with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
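A minimal numpy version; for a uniformly sampled signal (the simulated curve qualifies, while the gappy measurements would need resampling or a Lomb-Scargle periodogram first), the dominant frequency, amplitude, and phase fall straight out of the FFT:

```python
import numpy as np

# Toy signal with known parameters: 2.0 * sin(2*pi*0.8*t + 0.3).
t = np.arange(0, 10, 0.01)            # 1000 uniform samples
y = 2.0 * np.sin(2 * np.pi * 0.8 * t + 0.3)

spec = np.fft.rfft(y)
freqs = np.fft.rfftfreq(len(y), d=t[1] - t[0])

peak = np.argmax(np.abs(spec[1:])) + 1               # skip the DC bin
print("frequency:", freqs[peak])                     # ~0.8 cycles/unit
print("amplitude:", 2 * np.abs(spec[peak]) / len(y)) # ~2.0
print("phase:", np.angle(spec[peak]) + np.pi / 2)    # ~0.3 (sine convention)
```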
Recently I've been trying to figure out how to calculate the entropy of a random variable X using sp.stats.entropy() from SciPy's stats package, with the random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the function requires the probability values pk as input, and so far I'm struggling even to compute the empirical probabilities, seeing as I only have observations of the random variable. I've tried different ways of normalising the data to obtain an array of probabilities, but my data contains negative values too. When I compute asset1/np.sum(asset1), where asset1 is the row array of the returns of the stock of "Company 1", I obtain a new array that adds up to 1, but with some negative values, and as we all know, negative probabilities do not exist. Is there any way of computing the empirical probabilities of my observations (ideally with the option of choosing specific bins, or a range of values) in Python?
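For concreteness, the binned route I have in mind would look something like this (illustrative names; random data standing in for the returns):

```python
import numpy as np
from scipy import stats

returns = np.random.normal(0.0005, 0.02, 4000)   # stand-in for asset1

# Empirical probabilities: counts per bin, normalised to sum to 1.
# Negative *values* are no obstacle here, because the probabilities
# come from the counts, not from the values themselves.
counts, bin_edges = np.histogram(returns, bins=50)
pk = counts / counts.sum()

print(stats.entropy(pk))   # nats; pass base=2 for bits
```

Presumably the result depends on the choice of bins, so any comparison across series should use the same binning.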
Furthermore, I've spent countless hours looking for a Python package solely dedicated to calculating random-variable entropies, joint entropies, mutual information, etc., as an alternative to SciPy's entropy option (simply to compare), but most seem to be outdated (I currently have Python 3.5). Does anyone know of a good package compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to stock prices, which are processes. Therefore, entropy can definitely be applied in this context.
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest-neighbour estimator for entropy (Kozachenko & Leonenko, 1987) and the corresponding Kraskov et al. (2004) estimator for mutual information. These circumvent the intermediate step of estimating the probability density function and estimate the entropy directly from the distances of the data points to their k-th nearest neighbours.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large, and hence the entropy is large. In practice, one takes the distance to the k-th nearest neighbour rather than the first, which makes the estimate more robust.
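In outline, the estimator is H = psi(N) - psi(k) + log(c_d) + (d/N) * sum_i log(eps_i), where eps_i is twice the distance from point i to its k-th neighbour and c_d is the volume of the d-dimensional unit ball. A compact sketch of that formula (see the repository below for a tested implementation):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=4):
    # Kozachenko-Leonenko kNN entropy estimate in nats; x is (N, d).
    n, d = x.shape
    # Distance to the k-th neighbour; the first hit is the point itself.
    dist, _ = cKDTree(x).query(x, k=k + 1)
    eps = 2 * dist[:, -1]
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log unit-ball volume
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))

# Sanity check: a standard 1-D normal has entropy 0.5*log(2*pi*e) ~ 1.419.
x = np.random.normal(size=(10000, 1))
print(kl_entropy(x))
```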
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.