numeric value for time series data with frequency every 2 minutes - python

I have a dataset of two months in which I am getting reading every 2 minutes. statsmodel.tsa.seasonal_decompose method asks for the numeric value for frequency. What would be the numeric value for such data and what is the proper method to calculate freq in such time series data.

You need to identify frequency of the seasonality yourself. Often this is done using knowledge of the dataset or by visually inspecting a partial autocorrelation plot, which the statsmodels library provides:
statsmodels - partial autocorrelation
If the data has an hourly seasonality to it, you might see a significant partial autocorrelation lag 30 (as there are 30 data points between this hour's first 2 minutes, and last hour's first 2 minutes). I'm assuming statsmodels would expect this value; I'm assuming if you had monthly data it would expect a 12, or if you had daily data, it would expect a 7 for the weekly shape etc.
It sounds like you have multiple seasonalities to consider judging by your other post. You might see a significant lag that corresponds to the same 2 minutes in the previous hours, previous days, and/or the previous weeks. This approach to seasonal decomposition is considered to be naive and only addresses 1 seasonality as described in its documentation:
Seasonal Decomposition
If you would like to continue down the seasonal decomposition path, you could try the relatively recently released double seasonal model released by facebook. It is specifically designed to work well with daily data where it models intra-year and intra-week seasonalities. Perhaps it can be adapted to your problem.
fbprophet
The shortcoming with seasonal decomposition models is that it does not capture how a season might change through time. For example, the characteristics of a week of electricity demand in the summer is much different than the winter. This method will determine an average seasonal pattern and leave the remaining information in the residuals. So given your characteristics vary by day of week (mentioned in your other post), this won't capture that.
If you'd like to send me your data, I'd be interested in taking a look. Given my experience, you've jumped into the deep end of time-series forecasting that doesn't necessarily have an easy to use off-the-shelf solution. If you do provide it, please also clarify what your objective is:
Are you attempting to forecast ahead, and if so, how many 2 minute intervals?
Do you require confidence intervals, monte carlo results, or neither?
How do you measure accuracy of the model's performance? How 'good' does it need to be?

Related

What is window_size in time-series and What are the advantages and disadvantages of having small and large window_size?

I am quite beginner in machine learning. I have tried a lot to understand this concept but I am unable to understand it on google. I need to understand this concept in easy way.
Please explain this question in easy words and much detail.
This question is best suited for stack exchange as it is not a specific coding question.
Window size is the duration of observations that you ask an algorithm to consider when learning a time series. For example, if you need to predict tomorrow's temperature, and you use a window of 5 days, then the algorithm will divide your entire time series into segments of 6 days (5 training days and 1 prediction days) and try to learn how to use only 5 days of data to predict next 1 day based on the historic records.
Advantage of short window:
You get more samples out of the time series so that your estimation of short term effects are more reliable (100 days historic time series will provide you around 95 samples if you are using a 5 day window - so the model is more certain about what the influence of the past 5 days has on next day temperature)
Advantage of long window
long windows allow you to better learn seasonal and trend effects (think about events that happen yearly, monthly...etc). If your window is small - say 5 days, your model will not learn any seasonal effect that occurs monthly. However, if your window is 60 days, then every sample of data that you feed to the model would have at least 2 occurrences of the monthly seasonal effect, and this would enable your model to learn such seasonality.
The downside of long window is the number of samples decreases. Assuming an 100 day time series, 60 day window will only yield 40 samples of data. This would mean every parameter of your model is now fitted on much smaller sample of data and may be reduce the reliability of the model.
"window size" typically refers to the number of time periods that are used to calculate a statistic or model.
Advantages and Disadvantages of various window sizes relate to the balance between:
the sensitivity to changes in the data vs susceptibility to noise & outliers
If you have ever dealt with moving average indicators on the stock market, you will understand that each window size has a purpose, and these different window sizes are often used in combination to get a more holistic view/understanding. eg. MA20 vs MA50 vs MA100. Each of these indicators are using a different window size to calculate the moving average of the stock of interest.
Image Source: Yahoo Finance

Create a ML model with tensorflow that predicts a values at any given time range at hourly intervals

I am pretty new to ML and completely new to creating my own models. I have went through tensorflows time-series forecasting tutorial and other LSTM time series examples on how to predict with multi-variate inputs. After trying multiple examples I think I realized that this is not what I want to achieve.
My problem involves a dataset that is in hourly intervals and includes 4 different variables with the possibility of more in the future. One column being the datetime. I want to train a model with data that can range from one month to many years. I will then create an input set that involves a few of the variables included during training with at least one of them missing which is what I want it to predict.
For example if I had 2 years worth of solar panel data and weather conditions from Jan-12-2016 to Jan-24-2018 I would like to train the model on that then have it predict the solar panel data from May-05-2021 to May-09-2021 given the weather conditions during that date range. So in my case I am not necessarily forecasting but using existing data to point at a certain day of any year given the conditions of that day at each hour. This would mean I should be able to go back in time as well.
Is this possible to achieve using tensorflow and If so are there any resources or tutorials I should be looking at?
See Best practice for encoding datetime in machine learning.
Relevant variables for solar power seem to be hour-of-day and day-of-year. (I don't think separating into month and day-of-month is useful, as they are part of the same natural cycle; if you had data over spending habits of people who get paid monthly, then it would make sense to model the month cycle, but Sun doesn't do anything particular monthly.) Divide them by hours-in-day and days-in-year respectively to get them to [0,1), multiply by 2*PI to get to a circle, and create a sine column and a cosine column for each of them. This gives you four features that capture the cyclic nature of day and year.
CyclicalFeatures discussed in the link is a nice convenience, but you should still pre-process your day-of-year into a [0,1) interval manually, as otherwise you can't get it to handle leap years correctly - some years days-in-year are 365 and some years 366, and CyclicalFeatures only accepts a single max value per column.

Univariate time series forecasting based (also) on other univariate series

I am given a set of data of energy consumption of some households in some given areas for a period of several months. For some households I have the complete dataset, for others the last month is missing and I have to estimate it in Python. I am trying to find an estimation technique that allows me to merge two aspects:
historical time-series analysis: I want to consider past consumptions in order to estimate the last month;
"similar" households consumptions: I want to integrate in the analysis also information for those similar households that have a complete dataset, with the assumptions that there exists external variables (like temperature) that affects all households in similar ways.
The datasets for each household are univariate (consisting in just month/consumption).
Is there a Python module that allows me to do forecast considering both aspects?
Thanks!

Need approach recomendation in field ML or DL in forecast value task

I am trying to forecast value in 30 days. I have a timeseries data with some parameteers. The example of date I will attach at the bottom.
The main idea is that Y value is our aim variable to predict in 30 days from today. The f1-f5 variables is values which influence on Y value. So I need to predict Y using Date and f1-f5 columns. All the data comes every day.
Recomend me please some ML and DL approaches to predict "Y" value?
My thoughts.
As I understood it is time series data. And the task is regression. But I am a bit disapointed because time series approaches, as I understood, predict value based on only date value, using seasonality and so on. But I afraid that if I will use XGBoost or Linear regression approaches I will loose timeseries effect on this data.
Date,f1,f2,f3,f4,f5,Y
2015-01-01,183,34,15,1166,50,3251
2015-01-02,364,173,5,739,32,8132
2015-01-03,83,72,38,551,49,6271
2015-01-04,183,81,7,937,32,3334
2015-01-05,324,61,73,554,71,3742
2015-01-06,183,97,15,337,17,5543
2015-01-07,38,152,83,883,32,9143
2015-01-08,78,72,5,551,11,6435
2015-01-09,183,30,21,443,92,4353
...,...,...,...,...,...,...
2018-06-08,924,9,53,897,88,7446
Time series are traditionally modeled with AR (auto-regression) and MA (moving average). Trend and seasonality should also be accounted for. So why not use ARIMA or Prophet? Here's some theory on the subject - https://otexts.com/fpp2/
There are some ML/DL implementations based on RNN/LSTM but they are really complex, often hard to explain, and tend to suffer from vanishing gradient problem. If you must use ML/DL, you may want to have a look at LSTNet.

ARIMA order for trend

I am trying to fit ARIMA model. I have 3 months data and it shows count(float) for every minute. Which order I should pass for arima.fit()?
I need to predict for every minute.
A basic ARIMA(p,d,q) model isn't going to work with your data.
Your data violates the assumptions of ARIMA, one of which is that the parameters have to be consistent over time.
I see clusters of 5 spikes, so I'm presuming that you have 5 busy workdays and quiet weekends. Basic ARIMA isn't going to know the difference between a weekday and weekend, so it probably won't give you useful results.
There is such a thing as a SARIMA (Seasonal Autoregressive Integrated Moving Average). That would be useful if you were dealing with daily data points, but not really suitable for minute data either.
I would suggest that you try to filter your data so that it excludes evenings and weekends. Then you might be dealing with a dataset that has consistent parameters. If you're using python, you could try using pyramid's auto_arima() function on your filtered time series data (s) to have it try to auto-find the best parameters p, d, q.
It also does a lot of the statistical tests you might want to look into for this type of analysis. I actually don't always agree with the auto_arima's parameter choices, but it's a start.
model = pyramid.arima.auto_arima(s)
print(model.summary())
http://pyramid-arima.readthedocs.io/en/latest/_submodules/arima.html
1) is your data even good for box -Jenkins model( ARIMA)?
2) I see higher mean in the end. There is clear seasonality in the data. ARIMA will fail. Please try seasonal based models- SARIMA as righly suggested by try, Prophet is another beautiful algo by Facebook for seasonality. - ( R implementation)
https://machinelearningstories.blogspot.com/2017/05/facebooks-phophet-model-for-forecasting.html
3) Not just rely on ARIMA. Try other time series algos that are STL, BTS, TBATS. ( Structural Time series) and Hybrid etc. Let me know if you want some packages information in R?

Categories

Resources