I have measurements of the strength of the sun's rays (solar irradiance) every 15 minutes from March until October, and I want to make some predictions with SARIMA / SARIMAX. My problem is choosing the appropriate seasonal period parameter: for 15-minute data a yearly cycle would be about 35064 observations (365.25 days × 96 observations per day).
Since the data measures solar radiation, I expect a seasonal pattern across the seasons (summer vs. winter) as well as a daily day/night pattern.
So far I have only gotten errors with a period of 35064, so I'm not sure whether that's the right period to choose.
For example: Unable to allocate 9.16 GiB for an array with shape (35067, 35067, 1) and data type float64
Also, seasonal differencing is not possible, since I only have about 18000 data points, which is less than one full seasonal period.
I don't recommend using a SARIMA or SARIMAX model on a dataset like this. You have already run into one of the problems: defining the right seasonal period on such a dataset becomes really challenging.
When datasets get large (roughly 10,000 data points or more), deep learning models such as CNNs, LSTMs or autoregressive deep neural networks become a great tool to obtain accurate forecasts and to leverage all the available data.
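For example, here is a minimal sketch of a one-step-ahead LSTM forecaster in Keras. The column name, window size and hyperparameters are assumptions, not a tuned setup:

import numpy as np
from tensorflow import keras

# df is assumed to be a DataFrame with one column "irradiance" sampled every 15 minutes
values = df["irradiance"].to_numpy(dtype="float32")

window = 96  # one day of 15-minute observations as input
X = np.stack([values[i:i + window] for i in range(len(values) - window)])
y = values[window:]
X = X[..., np.newaxis]  # shape (samples, timesteps, features)

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.1)

The network sees the last day of observations and predicts the next 15-minute value; the daily pattern is learned from the data instead of being encoded as an explicit seasonal period.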
I have time-series data from 2016 to 2021; how could I backcast to get data for 2010 to 2015 using ARIMA in Python?
Could you give me some sample Python code?
Thank you very much
The only possibility I see here is to simply reverse your time series: the last observation becomes the first, the second to last becomes the second, and so on. You then have a series running from 2021 to 2016.
You can do that by:
df = df.reindex(index=df.index[::-1])
You can then train an ARIMA model on this data and predict the "next" values, which correspond to 2015 back to 2010. Remember that the first prediction will be for 2015-12-31, so you need to reverse the predictions again to get the series in order from 2010 to 2015.
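A minimal sketch of the whole round trip with statsmodels (the ARIMA order is arbitrary, and as noted below, with only six observations the fit may not even be meaningful):

from statsmodels.tsa.arima.model import ARIMA

# df is assumed to have a DatetimeIndex from 2016 to 2021 and a column "value"
reversed_series = df["value"].iloc[::-1].reset_index(drop=True)

# fit on the reversed series; the (p, d, q) order here is only a placeholder
model = ARIMA(reversed_series, order=(2, 1, 0)).fit()

# "forecast" the next 6 periods, which correspond to 2015 back to 2010
backcast = model.forecast(steps=6)

# flip the forecasts again so they read from 2010 to 2015
backcast_2010_to_2015 = backcast.iloc[::-1].reset_index(drop=True)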
Keep in mind that the ARIMA predictions will be very, very bad, since your forecasts will be based on forecasts, and so on. ARIMA is not made for predictions over such long horizons, so the results will probably be useless anyway. It is very likely that the predictions will flatten into a straight line after 30 or 40 steps. Also, you can effectively only use the autoregressive part in such a case, since the order of the moving-average part limits the number of steps you can forecast into the future.
Forecasting from a reversed time series would be the solution if you had more data.
However, only having 6 observations is problematic. Creating a forecasting (or backcasting) model requires using some of the observations to train the model and others to validate it. If you train with 4 observations then you only have 2 observations for validation. Is a model good if it forecasts those two well or did you just get lucky? Is it bad if it forecasts one observation well and the other poorly? If you increase the validation set to 3 observations, you get more confidence on whether the model is good or bad but creating the model (with only 3 observations) gets even harder than before.
Like others have stated, regardless of what machine learning model you choose, the results are likely to be poor with so little data. If you had the monthly data it might be more fruitful.
If you can't get the monthly data, then since you are backcasting into the past it might be better to estimate the values manually based on some related variables that you do have data for (if any). E.g. if your time series is about a company's sales, then maybe you could estimate based on the company's annual revenue (or company size, or something else) if you can get historical data for that variable. This is not precise, but it can still be more precise than what ARIMA or similar methods would give with the data you have.
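As a rough illustration only (the frame and column names here are hypothetical):

from sklearn.linear_model import LinearRegression

# hypothetical: 'history' is indexed by year (2010-2021) and has a 'revenue'
# value for every year but 'sales' values only for 2016-2021
known = history.loc[2016:2021]
reg = LinearRegression().fit(known[["revenue"]], known["sales"])
history.loc[2010:2015, "sales"] = reg.predict(history.loc[2010:2015, ["revenue"]])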
I want to train an autoencoder on time series from aircraft systems so that I can recognize when the pattern from a specific system shows an anomaly (i.e. the actual signal "significantly" deviates from the reconstructed signal).
I'm hesitating between two different approaches to pre-process the data and train the model, and I'd like to consult your collective wisdom and experience for guidance.
First, a quick look at how the data is structured:
Each column is a collection of timestamped values from different flights for a specific system.
Since System1 on Aircraft A is completely independent from System1 on Aircraft B, I cannot simply feed the above dataset the "regular way" by batching together slices of 'sequence_length' and feeding them to the network. That would only work as long as I'm reading data from the same aircraft; as soon as I switch aircraft there is no continuity.
Approach number 1: incremental training
I would train the model the "regular way" using data from only one aircraft at a time. At each iteration I keep training the same model, but on a batch from a different aircraft.
Each iteration (hopefully) improves on the preceding one.
Pros: that's the "easy" way in terms of pre-processing, and I could leverage the memory effect to identify longer patterns (using RNN or LSTM layers)…
Cons: … as long as the network does not try to find patterns across batches from different aircraft; would that be the case? (A rough sketch of the pre-processing I have in mind is below.)
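For concreteness, this is roughly what I mean. The names, window size and training parameters are placeholders, and I assume 'autoencoder' has already been built:

import numpy as np

sequence_length = 128  # placeholder window size

def make_windows(signal, sequence_length):
    # slice one aircraft's signal into overlapping windows
    return np.stack([
        signal[i:i + sequence_length]
        for i in range(len(signal) - sequence_length + 1)
    ])

# frames_by_aircraft: dict of {aircraft_id: DataFrame of that aircraft's signals}
for aircraft_id, frame in frames_by_aircraft.items():
    windows = make_windows(frame["System1"].to_numpy(), sequence_length)
    windows = windows[..., np.newaxis]  # (n_windows, sequence_length, 1)
    autoencoder.fit(windows, windows, epochs=5, batch_size=32)  # reconstructs its own input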
Approach number 2: the "Electrocardiogram way"
This one is inspired by the ECG dataset. Instead of looking at the dataset as one big time-series stream, I could think in terms of flights.
In this case I would resample each flight to the same length. The model would then be trained to identify what's wrong during a flight by looking at a collection of independent flights, and the chronology between flights would no longer be relevant, so I could train in one go over the whole dataset.
Pros: simpler training, and 'sequence_length' is equivalent to the flight duration, so it's easier to work with.
Cons: resampling can involve compression and loss of information, and by assuming each flight is independent of the others I prevent the network from finding longer patterns that occur over several flights. (A sketch of this resampling is below.)
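Roughly, the pre-processing I have in mind for this approach (the target length is arbitrary):

import numpy as np
from scipy.signal import resample

target_length = 512  # arbitrary common length for all flights

# flights: list of 1-D numpy arrays, one per flight, of varying lengths
resampled = np.stack([resample(flight, target_length) for flight in flights])
resampled = resampled[..., np.newaxis]  # (n_flights, target_length, 1)
# the whole dataset can then be fed to the autoencoder in one go:
# autoencoder.fit(resampled, resampled, epochs=20, batch_size=16)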
Of course an easy answer would be to "try both and see for yourself", and I would totally go for it were time and resources not a constraint.
Do you have any experience or recommendations as to what I should try first?
Thank you for your help,
I'm trying to find the best type of neural network for time series regression. I would describe my scenario this way:
I have 1D time series data from sensors A, B, C, D, E and F.
I'm trying to generate a regression model for sensor F data, by using data from A-E sensors.
I know that for some sensors I need to take into account the last X hours of data in order to have a proper model. This "delay" is different for every sensor (for example, I need the last 6 hours for sensors A-B and the last 30 minutes for C-E), but it is consistent over time.
I have an approximate estimate of the maximum "delays", but I do not know them precisely for every sensor (which prevents me from preprocessing my data accordingly).
My goal is to be able to produce a model/network trained on data of all sensors, and then apply it to new data from A-E and compare my regression results with actual data (from sensor F in this case).
Before this, I had been using Time Delay Neural Networks in MATLAB, but this approach did not give me much flexibility over the design of the network. By doing some research on popular-science websites, I found many people comparing Time Delay Neural Networks and Recurrent Neural Networks. However, while the MATLAB quick-start documentation made me think I understood how these networks work, I am now a bit confused by the various (and sometimes contradictory) opinions on the subject.
What would be the appropriate type of neural network for my problem? Should I transform my data (for example, instead of 1-D data, use time segments of the past X hours at each timestamp)?
I would gladly accept any reference/book for better understanding.
For time series predictions, the following are usually used:
Radial basis function networks. The network inputs are past states of the time series and the output is a prediction of future states. These are simple and easy to build.
Recurrent neural networks. Harder to build, but designed to work with dynamic behavior.
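Either way, the core idea is to feed the network windows of past values. A minimal sketch of that lagged-inputs pre-processing, with scikit-learn's MLPRegressor standing in for the network (the lag lengths and array names are assumptions):

import numpy as np
from sklearn.neural_network import MLPRegressor

# X_sensors: array of shape (n_samples, 5) holding sensors A-E
# y: array of shape (n_samples,) holding sensor F
# lag lengths (in time steps) are assumptions based on the rough delay estimates
lags = [24, 24, 2, 2, 2]
max_lag = max(lags)

rows = []
for t in range(max_lag, len(y)):
    features = []
    for s, lag in enumerate(lags):
        features.extend(X_sensors[t - lag:t, s])  # past window of sensor s
    rows.append(features)

X_lagged = np.array(rows)
y_target = y[max_lag:]

model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
model.fit(X_lagged, y_target)

If the exact delays are unknown, you can make every window as long as your largest estimate and let the network learn which lags matter.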
I'm sorry for the poorly worded title, I'm just not sure how to phrase my question; explaining my problem is probably easier. Also, I'm quite new to this, having only taken a few Udemy courses in Python, with no professional experience in the field.
TL;DR: I want to create a model that will predict how many minutes a flight will be delayed, or whether it will be canceled.
I'm working with the 2015 Flight Delays dataset on Kaggle, and I'm combining this data with historical weather data for airports from a site I found, trying to build a better delay predictor.
My hypothesis is that bad weather is the primary cause of delays, but really bad weather is the primary cause of cancellations.
My first thought at attacking this is to create a linear model that takes ceiling, visibility, and wind as continuous independent variables, plus origin airport, departure airport, and precipitation intensity as categorical independent variables, and outputs the arrival delay as the dependent variable. However, this doesn't account for flight cancellations.
In the dataset, a cancelled flight has a NaN for its departure and arrival times, and then a one-hot encoded column for Cancelled. I could just edit the dataset and do something like making the arrival delay 999 if the Cancelled column is 1, but that screws with the linear model.
Is it possible for my model to output either a numeric (continuous) value for how many minutes a flight will be delayed, or whether it will be cancelled altogether (categorical)?
If not, can I somehow stack models? I.e., make something like a random forest model that only predicts "Canceled" or "Not Canceled", and feed the "Not Canceled" flights into a linear model that predicts the number of minutes a flight will be delayed?
Slightly unrelated question: I don't think the relationship between ceiling/visibility/wind and delay is linear, but I do think it may be logarithmic. For example, there is a massive difference between a visibility of 0.5, 1.0, and 1.5 miles, but there is zero difference between a visibility of 8 and 10 miles. Same for wind and ceiling. What sort of model is available to me in scikit-learn that accounts for this?
There are a lot of questions here, but let me answer the one related to stacked machine learning models.
The answer is yes: you can train a single neural network that solves both tasks. That is called multitask learning. I only know of implementations of multitask learning with deep neural networks, and it's not hard to implement. The most important part is the output layer; in this case you would have an output layer with two units: one sigmoid output unit (cancelled or not) and one ReLU output unit (if you are planning to predict only positive delays).
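A minimal sketch in Keras (the number of input features and the commented training call are assumptions about your prepared data):

from tensorflow import keras

# n_features is assumed to be the number of encoded weather/airport features
inputs = keras.Input(shape=(n_features,))
hidden = keras.layers.Dense(64, activation="relu")(inputs)
hidden = keras.layers.Dense(32, activation="relu")(hidden)

# two heads: probability of cancellation and delay in minutes
cancelled = keras.layers.Dense(1, activation="sigmoid", name="cancelled")(hidden)
delay = keras.layers.Dense(1, activation="relu", name="delay_minutes")(hidden)

model = keras.Model(inputs, [cancelled, delay])
model.compile(
    optimizer="adam",
    loss={"cancelled": "binary_crossentropy", "delay_minutes": "mse"},
)
# model.fit(X, {"cancelled": y_cancelled, "delay_minutes": y_delay}, epochs=10)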
Another solution for your problem is the one you proposed: just set the delay to a big number and add a condition in your code like "if the delay is longer than n days, then the flight was cancelled".
The last solution I see is implementing two neural networks: one that determines whether the flight was cancelled and one that determines the delay time. If the first one predicts cancelled, then don't use the second one.
Please feel free to ask
I have around 4 years of bank data from different branches. I am trying to predict the number of rows at the daily and hourly level. I have issue_datetime (year, month, day, hour) as the important feature. I applied different regression techniques (linear regression, decision trees, random forest, XGBoost) using GraphLab but could not get good accuracy.
I was also thinking of setting a threshold based on past data, e.g. taking the mean of the counts at the daily or monthly level after removing outliers and using that as the threshold.
What is the best approach?
Since you have 1d time series data, it should be relatively easy to graph your data and look for interesting patterns.
Once you establish that there are some non-stationary aspects to your data, the class of models you probably want to check out first is auto-regressive models, possibly with seasonal components. ARIMA models are pretty standard for time-series data. http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
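For example, a minimal sketch with statsmodels (the series name and the orders are placeholders; in practice you would pick the orders from ACF/PACF plots or a grid search):

from statsmodels.tsa.statespace.sarimax import SARIMAX

# counts: a pandas Series of daily row counts indexed by date (assumed)
model = SARIMAX(counts, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))  # weekly seasonality
results = model.fit(disp=False)
forecast = results.forecast(steps=30)  # forecast the next 30 days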