Recommend approach to train autoencoder on timeseries data - python

I want to train an autoencoder on timeseries from aircraft systems, so that it can recognize when the pattern from a specific system shows an anomaly (i.e., the actual signal "significantly" deviates from the reconstructed one).
I’m hesitant between two different approaches to pre-process and train the model and I’d like to consult your collective wisdom and experience to guide me.
First, a quick look at how the data is structured:
Each column is a collection of timestamped values from different flights for a specific system.
Since System1 on Aircraft A is completely independent from System1 on Aircraft B, I cannot simply feed the above dataset the "regular way" by batching together slices of 'sequence_length' and feeding them to the network. That only works as long as I'm reading data from the same aircraft; as soon as I switch aircraft there is no continuity.
Approach number 1 : Incremental training
I would train the model the “regular way” using only data from one aircraft at a time. At each iteration I’ll keep training the same model but on a different batch from a different Aircraft.
Each iteration (hopefully) improving on the preceding one.
Pros: that's the "easy" way in terms of pre-processing + I could leverage the memory effect to identify longer patterns (using RNN or LSTM layers)…
Cons: … that only holds as long as the network does not try to find patterns across batches from different aircraft; would that be the case?
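For what it's worth, here is a minimal sketch of the per-aircraft slicing this approach implies (df, 'aircraft' and 'value' are assumed names for the data layout); the point is that no window ever crosses an aircraft boundary:

```python
import numpy as np
import pandas as pd

SEQ_LEN = 64  # assumed sequence_length

def windows_per_aircraft(df, seq_len=SEQ_LEN):
    batches = []
    for _, g in df.groupby("aircraft"):        # one aircraft at a time
        values = g["value"].to_numpy()
        n = len(values) // seq_len
        if n:                                  # drop the short remainder
            batches.append(values[: n * seq_len].reshape(n, seq_len, 1))
    return np.concatenate(batches)             # windows never span aircraft

# X = windows_per_aircraft(df)
# model.fit(X, X, ...)  # autoencoder: input == target
```

Built this way, windows from all aircraft can even be shuffled into a single training run (at least for stateless models), since the discontinuity is handled at the slicing stage rather than by training incrementally.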
Approach number 2 : the “Electrocardiogram way”
This one's inspired by the ECG dataset. Instead of looking at the dataset as one big timeseries stream, I could think in terms of flights.
In this case I would resample each flight to the same length. The model would then be trained to identify what's wrong during a flight by looking at a collection of independent flights, and the actual chronology (between flights) would no longer be relevant. So I could train in one go over the whole dataset.
Pros: simpler training + 'sequence_length' is equivalent to the flight duration, so it's easier to work with.
Cons: resampling can involve compression and loss of information + by assuming flights are independent of each other, I prevent the system from finding longer time patterns occurring over several flights.
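A minimal sketch of the resampling step, assuming each flight is a 1-D array and using linear interpolation (other choices, e.g. scipy.signal.resample, trade off artifacts differently):

```python
import numpy as np

TARGET_LEN = 256  # assumed common flight length

def resample_flight(signal, target_len=TARGET_LEN):
    old_t = np.linspace(0.0, 1.0, num=len(signal))
    new_t = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_t, old_t, signal)     # linear interpolation

# X = np.stack([resample_flight(f) for f in flights])  # (n_flights, TARGET_LEN)
```

Linear interpolation keeps the endpoints intact, but fine detail can still be lost when downsampling, which is exactly the compression caveat mentioned above.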
Of course an easy answer would be to "try both and see for yourself", and I would totally go for it, were time and resources not a constraint.
Do you have any experience or recommendation as to what I should try first ?
Thank you for your help,

Related

How do I make a good feature using machine learning on a timeseries forecast that has only traffic volume as an input?

So I have a time series that only has traffic volume. I've done FB Prophet and Neural Prophet. They work okay, but I would like to do something using machine learning. So far my problem is making features. Using the classical dayofyear, month, etc. does not give me good results. I have tried using shift to get the average, minimum, and max of the two previous days. That works for training, but when I try to predict days in advance the feature breaks down, since I can't get the average of a day that hasn't happened yet. My main concern is finding a good feature that my future prediction dataframe will also have. A picture of my data is included. Does anyone know how I would do this?
First of all, you have to clarify some definitions. FBProphet works on the same mechanism as any machine learning algorithm, that is, fitting the model and then predicting the output. Being an additive regression model with a piecewise linear or logistic growth curve trend, it can be considered a Machine Learning method that predicts a continuous outcome variable.
Secondly, I think you missed the most important term your question is about, namely: feature engineering.
Feature engineering includes:
the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data;
the process of transforming raw data into features that better represent the underlying problem to the predictive models, etc.
But it's very unusual to use machine learning to do feature engineering; you do feature engineering in order to improve your machine learning model. Many techniques, such as imputation, handling outliers, binning, log transform, one-hot encoding, grouping operations, feature splitting and scaling, are hybrid methods using a statistical approach and/or domain knowledge.
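That said, the specific problem described above (rolling features that are unavailable when predicting days ahead) has a standard fix: shift every feature by at least the forecast horizon, so the same feature is also computable at prediction time. A minimal sketch, assuming the series lives in a column named 'volume':

```python
import pandas as pd

HORIZON = 7  # days ahead you want to predict

def make_features(df):
    out = pd.DataFrame(index=df.index)
    # Only use values at least HORIZON days old, so the same features
    # can be computed for a date HORIZON days in the future.
    for lag in (HORIZON, HORIZON + 1, HORIZON + 7):
        out[f"lag_{lag}"] = df["volume"].shift(lag)
    out["roll_mean_7"] = df["volume"].shift(HORIZON).rolling(7).mean()
    out["target"] = df["volume"]
    return out.dropna()
```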
Regarding your data, bearing in mind that the seasonality is already handled by FBProphet, I am not confident that feature engineering transformations such as adding the day of the week, adding holiday periods, etc. could really improve performance.
To conclude, it is not possible to create new features ex nihilo that would outperform your model; the information has to come either from processing/transforming your data or from adding an external domain-knowledge dataset.

Best model to predict failure using time series from sensors

I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. In each log file we have time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable recording the faults that occurred. The aim here is to build a model that will use the log files (the time series) as its input and predict whether there will be a failure or not. For this I have some questions:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kinds of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series to sub time series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones compared to the number of zeros is negligible. As a result, the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed: look at some window of time going back and calculate moving averages, min/max values and standard deviation over that window, as in the sketch below. To tackle this problem you will need a healthy set of features. Take a look at the featuretools package; you can either use it or get ideas about what features can be created from time series data.
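A minimal sketch of such window features over one log file (the column names and WINDOW size are assumptions):

```python
import pandas as pd

WINDOW = 30  # number of past readings to aggregate over

def add_window_features(df, cols=("Temperature", "Pressure", "MotorSpeed")):
    for c in cols:
        roll = df[c].rolling(WINDOW)
        df[f"{c}_mean"] = roll.mean()
        df[f"{c}_min"] = roll.min()
        df[f"{c}_max"] = roll.max()
        df[f"{c}_std"] = roll.std()
    return df.dropna()  # the first WINDOW-1 rows have incomplete windows
```

Back to your questions.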
1) What is the best model capable of doing this? Traditional ML methods as mentioned above. You could also use deep learning models, but I would first start with easy models. Also if you do not have a lot of data I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (implemented in the imbalanced-learn package).
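A minimal, self-contained sketch (using a synthetic stand-in for your windowed features, assumed to have roughly 1% failures):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for windowed sensor features with ~1% failures.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(y.mean(), y_res.mean())  # class balance before vs. after

# Train on (X_res, y_res); evaluate on an untouched, imbalanced test split.
```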
Good luck

Training CNN on small subsets to select architecture

Do you know if it's possible to use a very small subset of my training data (only 100 or 500 instances, for example) to quickly train very rough CNN networks, in order to compare different architectures and then select the best performing one?
When I say "possible", I mean: is there evidence that this kind of selection strategy works, and that the selected network will consistently outperform the others for this specific task?
Thank you,
For information, the project in question would consist of two staged CNNs to classify multichannel timeseries. The first CNN would forecast the input data over the next period of time, then the second CNN would use this forecast and classify the result into two categories.
The procedure you are talking about is actually used in practice. When tuning hyperparameters, a lot of people select a subset of the whole dataset to do this.
Is the best architecture on the subset necessarily the best on the full dataset? NO! However, it's the best guess you have and that's why it's useful.
A couple of things to note on your question:
100-500 instances is extremely low! The CNN still needs to be trained. When we say subset we usually mean tens of thousands of images (out of the millions of the dataset). If your dataset is under 50000 images then why do you need a subset? Train on the whole dataset.
Contrary to what a lot of people believe, the details of the architecture are of little importance to classification performance. Some hyperparameters (e.g. kernel size) are of secondary importance. The key things you should focus on are depth, size of layers, and the use of pooling/skip connections/batch norm/dropout, etc.
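If you do go the subset route, here is a minimal sketch of the screening loop (build_fns is an assumed dict mapping names to functions that return freshly compiled Keras models with accuracy as their only metric):

```python
from sklearn.model_selection import train_test_split

def screen_architectures(build_fns, X, y, subset_size=20000, epochs=3):
    # Stratified subset keeps the class balance of the full dataset.
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=subset_size, stratify=y, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_sub, y_sub, test_size=0.2, stratify=y_sub, random_state=0)
    scores = {}
    for name, build in build_fns.items():
        model = build()                                # fresh model each time
        model.fit(X_tr, y_tr, epochs=epochs, verbose=0)
        scores[name] = model.evaluate(X_val, y_val, verbose=0)[1]  # accuracy
    return scores  # rank the candidates, then retrain the winner on all data
```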

Can a single model return either a continuous or categorical result?

I'm sorry for the poorly worded title, I'm just not sure how to phrase my question. Explaining my problem is probably easier. Also, I'm quite new to this, having only taken a few Udemy courses in Python and no professional experience in the field.
TL;DR: I want to create a model that will predict how many minutes a flight will be delayed, or whether it'll be canceled.
I'm working with the 2015 Flight Delays dataset on Kaggle, and I'm combining this data with a site I found that gives historical weather data for airports. I'm trying to combine those two sets of data to build a better delay predictor.
My hypothesis is that bad weather is the primary cause of delays, but really bad weather is the primary cause of cancellations.
My first thought at attacking this is that I could create a linear model that takes ceiling, visibility, wind as continuous independent variables, plus origin airport, departure airport, and precipitation intensity as categorical independent variables, and output the arrival delay as the dependent variable. However, this doesn't account for flight cancellations.
In the dataset, a cancelled flight has a NaN for its departure and arrival times, and then a one-hot encoded column for Cancelled. I could just edit the dataset and do something like making the arrival delay 999 if the Cancelled column is 1, but that screws with the linear model.
Is it possible for my model to output either a numeric (continuous) value for how many minutes a flight will be delayed, or if it'll be cancelled altogether (categorical)?
If not, can I somehow stack models? I.e., make something like a random forest model that only predicts "Canceled" or "Not Canceled", and feed the "Not Canceled" flights into a linear model that predicts the number of minutes a flight will be delayed?
Slightly unrelated question: I don't think there is a linear relationship between ceiling/visibility/winds and the delay, but I do think there is maybe a logarithmic relationship. For example, there is a massive difference between a visibility of 0.5, 1.0, and 1.5 miles, but zero difference between a visibility of 8 and 10 miles. Same for winds and ceiling. What sort of model is available to me in scikit-learn that accounts for this?
There are a lot of questions here, but let me answer the one related to the stacked machine learning models.
The answer is yes: you can train a single neural network which solves both tasks. That is called multitask learning. I only know implementations of multitask learning with deep neural networks, and it's not hard to implement. The most important part is the output layer; in this case you would have an output layer with two activation units: one sigmoid output unit and one ReLU output unit (if you are planning to predict only positive delays), as sketched below.
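A minimal Keras sketch of that two-head output layer (the input dimension and loss weights are assumptions, just to make it self-contained):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 10  # assumed number of weather/airport input features

inputs = keras.Input(shape=(n_features,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
# Head 1: probability the flight is cancelled.
cancelled = layers.Dense(1, activation="sigmoid", name="cancelled")(x)
# Head 2: non-negative delay in minutes (ReLU keeps it positive).
delay = layers.Dense(1, activation="relu", name="delay")(x)

model = keras.Model(inputs, [cancelled, delay])
model.compile(
    optimizer="adam",
    loss={"cancelled": "binary_crossentropy", "delay": "mse"},
    loss_weights={"cancelled": 1.0, "delay": 0.1},  # tunable assumption
)
# model.fit(X, {"cancelled": y_cancelled, "delay": y_delay}, ...)
```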
Another solution for your problem is the one you proposed. Just set the delay to a big number and set a condition in your code like "If the number of delayed hours is longer than n days, then the flight was cancelled".
And the last solution I see is implementing two models: one which determines whether the flight was cancelled, and one which determines the delay time. If the first one returns true, then don't use the second one. A sketch of that cascade follows.
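The same cascade works with any model pair; here is a minimal sketch using the random forest + linear model pairing proposed in the question (X_train, y_cancelled and y_delay are assumed NumPy arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# X_train, y_cancelled (0/1) and y_delay (minutes) are assumed to exist.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_cancelled)
# Fit the delay regressor only on flights that actually departed.
reg = LinearRegression().fit(X_train[y_cancelled == 0],
                             y_delay[y_cancelled == 0])

def predict(X_new):
    cancelled = clf.predict(X_new).astype(bool)
    delay = np.where(cancelled, np.nan, reg.predict(X_new))
    return cancelled, delay  # NaN delay for predicted cancellations
```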
Please feel free to ask follow-up questions.

Predicting Energy Consumption of different buildings

I have a dataset, whose (updated) file you can find here, containing many different characteristics of different office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm that can be trained on the dataset above in order to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms from the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) to predict this continuous variable. Surface_area and number of workers had a correlation with the target variable between 0.3 and 0.4, so I assumed them to be good features and included them in training. However, I got a mean absolute error of about 13,350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or if you could examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small; yes, the number of data samples is very small; but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community would understand (in this case English). After a bit of tinkering, I found out that the language used is Dutch.
Some key features are missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function which (after using Google Translate) tells what the purpose of the building is. Intuitively, this should have a large correlation with power consumption: industrial buildings tend to use more power than normal households. After translation, I found that the main types were Residential, Office, Accommodation and Meeting. This feature thus has to be encoded as a nominal variable to train the model, for example as sketched below.
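A minimal sketch of that encoding (the file name and the translated column name are assumptions):

```python
import pandas as pd

df = pd.read_csv("buildings.csv")
# One 0/1 indicator column per building type (func_Residential, func_Office, ...)
df = pd.get_dummies(df, columns=["building_function"], prefix="func")
```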
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you with some code to perform this regression task. In such tasks it is very important to understand what the data is and to perform feature engineering accordingly.
