I do not understand how model.predict(...) works on a time series forecasting problem. I usually use it with a CNN, where it is pretty straightforward, but for time series I don't understand what it returns.
For example, I am currently doing an exercise where I have to forecast power consumption using an LSTM. I managed to train my model, but when I want to know what the power consumption will be tomorrow (so no data except past data), I don't know what input to use.
Traditional ML algorithms, which you might be more used to, generally expect the data in a 2D structure like this:
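A minimal, made-up sketch of that 2D shape (the column names and values here are purely illustrative, since the original table isn't reproduced):

import pandas as pd

# One row per example: an ID, a fixed set of feature columns, and a target.
df = pd.DataFrame({
    "id":        [1, 2, 3],
    "feature_1": [0.5, 0.1, 0.9],
    "feature_2": [12.0, 7.3, 4.4],
    "y":         [0, 1, 0],
})
print(df.shape)  # (3, 4) -> rows are independent examples, columns are features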
For sequential data, such as a stream of timed events associated with each user, it’s also possible to create a lagged 2D dataset, where the history of different features for different IDs is aligned into single rows, with this structure:
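Roughly, that lagging could be built like this in pandas (a toy sketch; the IDs, times, and feature names are invented for illustration):

import pandas as pd

# Toy event stream: one timestamped value per ID.
events = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "t":     [0, 1, 2, 0, 1, 2],
    "feat1": [5, 6, 7, 3, 2, 1],
})

# Align each ID's history into a single row: feat1 at t, t-1, t-2, ...
lagged = events.sort_values(["id", "t"]).copy()
lagged["feat1_lag1"] = lagged.groupby("id")["feat1"].shift(1)
lagged["feat1_lag2"] = lagged.groupby("id")["feat1"].shift(2)
print(lagged.dropna())  # one 2D row per (id, t) with its recent history as columns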
This can be a good way to work because, once your data is in the correct shape, you can use it with models that are fast to set up and train. However, models using features engineered this way generally have no capacity to "learn" anything about the natural sequence of the data. To something like a tree-based ensemble model receiving this format, feature 1 at time t and at time t-1 in the example above are treated completely independently, which can severely limit the model's predictive power.
There are types of deep learning architecture specifically designed for modelling sequence data, called recurrent neural networks (RNNs). Two of the most popular cells to use in these are long short-term memory (LSTM) and gated recurrent units (GRU). There's a good post here on how LSTM cells work, but the TL;DR is that they have a structure that allows them to learn from sequences of data.
Cells like LSTM expect a 3D tensor of input data. We arrange it so that one axis holds the data features, another holds the sequence steps (like time ticks), and the third stacks each of the different examples we want to predict a single "y" value for. Using the same kind of dataset as the lagged example above, it would look something like this:
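As a concrete sketch using the Keras convention, where the expected axis order is (samples, timesteps, features); the sizes below are arbitrary:

import numpy as np

n_samples, n_timesteps, n_features = 1000, 20, 3

# Keras LSTMs expect input shaped (samples, timesteps, features).
X = np.random.rand(n_samples, n_timesteps, n_features)
y = np.random.rand(n_samples)

print(X.shape)     # (1000, 20, 3): 1000 sequences, 20 time steps each, 3 features per step
print(X[0].shape)  # (20, 3): one sample is itself a 2D (timesteps, features) slice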
The ability to learn patterns in sequences of data like this is particularly beneficial for both time series and text data, which are naturally ordered.
To return to your original question: when you want to predict something for your test set, you'll need to pass the model sequences represented just like the ones it was trained on (this is a reasonably good rule of supervised learning in general). For example, if the model was trained on data shaped like the last example above, you'll need to pass it a 2D (timesteps x features) slice for each ID you want to make a prediction for.
You should explore how the original training data is represented and make sure you understand it well, as you'll need to create the same shape of data to make predictions. X_train.shape is a great place to start if your training data is in a pandas DataFrame or numpy arrays; once you know the dimensionality, you can inspect entries along each axis until you get a good feel for the data it contains.
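For example, assuming model is your trained Keras LSTM and X_train is the 3D array it was trained on (both from your own code), a prediction input might be built like this; the shapes shown are illustrative:

import numpy as np

print(X_train.shape)                         # e.g. (1000, 20, 3) -> (samples, timesteps, features)

# To predict for a single new sequence, keep the same (timesteps, features)
# shape and add a leading batch axis of size 1.
last_window = X_train[-1]                    # shape (20, 3): stand-in for your most recent data
x_new = np.expand_dims(last_window, axis=0)  # shape (1, 20, 3): a batch of one sequence
y_pred = model.predict(x_new)                # one forecast per sample in the batch
print(y_pred.shape)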
Currently, I am working on Human Activity Recognition using wearable sensor data (e.g., accelerometer, gyroscope, etc.). Now, I am trying to generate some synthetic sensor data from accelerometer (x, y, z) data.
I used a GAN to generate a synthetic dataset from 3D accelerometer data. However, the result is not good (the generated/fake data is not similar to the real data). I have used several sequence models (LSTM, bidirectional LSTM), but the result is the same: I get a repeating pattern in my fake data.
[Image: GAN result]
Is there any suggestion for this? Some explanation would be much appreciated.
Thank you :)
Designing and training GANs, specifically for temporal tasks, is a bit tricky.
It's generally a good idea to use state-of-the-art architectures instead of writing and training your own model.
If you must use your own model, something you might want to try is using depth-wise convolutions (i.e. along the time dimension) instead of LSTMs.
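A minimal sketch of what convolutions along the time dimension might look like in Keras (plain Conv1D layers here rather than strictly depth-wise ones; the layer sizes and window length are arbitrary assumptions, not a recommended architecture):

from tensorflow.keras import layers, models

timesteps, channels = 128, 3  # e.g. a window of accelerometer x/y/z readings

# Convolutions run along the time axis instead of using recurrent cells.
model = models.Sequential([
    layers.Input(shape=(timesteps, channels)),
    layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
    layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
    layers.Conv1D(channels, kernel_size=5, padding="same"),
])
model.summary()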
Training GANs is also a tricky process, so it might help to look into some existing implementations to get tips on avoiding mode collapse, etc.
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. In each log file we have time series from sensors (Temperature, Pressure, MototSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that takes the log files (the time series) as input and predicts whether there will be a failure or not. For this I have some questions:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kinds of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series into sub-series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones is negligible compared to the number of zeros, so the model always predicts 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MototSpeed: look at some window of time going back and calculate moving averages, min/max values in that same window, and the standard deviation. To tackle this problem you will need a healthy set of features. Take a look at the featuretools package; you can either use it or get some ideas about what features can be created from time series data. Back to your questions.
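A rough sketch of that kind of windowed feature engineering with pandas (the column names, values, and window size are assumptions standing in for whatever your log files contain):

import numpy as np
import pandas as pd

# Toy stand-in for one log file: one row per timestamp, raw sensor readings.
df = pd.DataFrame({
    "Temperature": np.random.rand(500),
    "Pressure":    np.random.rand(500),
    "MotorSpeed":  np.random.rand(500),
})

window = 60  # look back over the last 60 readings
for col in ["Temperature", "Pressure", "MotorSpeed"]:
    rolled = df[col].rolling(window)
    df[f"{col}_mean"] = rolled.mean()
    df[f"{col}_min"]  = rolled.min()
    df[f"{col}_max"]  = rolled.max()
    df[f"{col}_std"]  = rolled.std()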
1) What is the best model capable of doing this? Traditional ML methods, as mentioned above. You could also use deep learning models, but I would start with simple models first. Also, if you do not have a lot of data, I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (available in the imbalanced-learn package).
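A minimal sketch of SMOTE oversampling with imbalanced-learn (the toy data here stands in for your windowed features and 0/1 fault labels):

from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn
from sklearn.datasets import make_classification

# Toy imbalanced data: ~5% of samples belong to the minority (fault) class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(y.mean(), y_resampled.mean())  # minority share before and after oversampling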
Good luck
Recurrent neural networks (RNNs) with an attention mechanism are generally used for machine translation and natural language processing. In Python, implementations of RNNs with attention are abundant for machine translation (e.g., https://talbaumel.github.io/blog/attention/); however, what I would like to do is use an RNN with an attention mechanism on a temporal data file (not textual/sentence-based data).
I have a CSV file of dimensions 21392 x 1972, which I have converted to a DataFrame using Pandas. The first column is in datetime format and the last column consists of target classes like "Class1", "Class2", "Class3", etc., which I would like to identify. So in total there are 21392 rows (instances of data in 10-minute time steps) and 1971 features. The last (1972nd) column is the label column, with 14 different classes in total.
I have looked into available implementation documentation on Keras (https://medium.com/datalogue/attention-in-keras-1892773a4f22) as well as on Tensorflow (Visualizing attention activation in Tensorflow), but none of them seem to be doing what I want to accomplish. I understand that this is an unusual approach, but would want to try this and use the attention mechanism because many of my features are presumably redundant in the data.
import pandas as pd
# Load the merged sensor data (21392 rows x 1972 columns)
mydataset = pd.read_csv('final_merged_data.csv')
It is clear from the existing literature that an attention mechanism works quite well when coupled with an RNN. I am unable to locate any such implementation of an RNN with attention that can also provide a visualisation. I am also unable to understand how I can convert my data into a sequence (or a list of lists) so that I can use one-hot encoding afterwards for the RNN with attention. I am new to Python as well as Keras/TensorFlow, and am quite confused about how to convert/typecast my data into a form that mimics a sequence classification problem. My problem is basically multi-class classification, like one would normally do using machine learning classifiers to predict the labels, but using an RNN with attention. Any help in this regard would be highly appreciated. Cheers!
Kindly refer to this paper on using a sequence-to-sequence model with attention for time series classification:
https://www.computer.org/csdl/proceedings/icdmw/2016/5910/00/07836709.pdf
I'm running some experiments with neural networks in TensorFlow. The release notes for the latest version say DataSet is henceforth the recommended API for supplying input data.
In general, when taking numeric values from the outside world, the range of values needs to be normalized; if you plug in raw numbers like length, mass, velocity, date or time, the resulting problem will be ill-conditioned; it's necessary to check the dynamic range of values and normalize to the range (0,1) or (-1,1).
This can of course be done in raw Python. However, DataSet provides a number of data transformation features and encourages their use, on the theory that the resulting code will not only be easier to maintain, but run faster. That suggests there should also be a built-in feature for normalization.
Looking over the documentation at https://www.tensorflow.org/programmers_guide/datasets however, I'm not seeing any mention of such. Am I missing something? What is the recommended way to do this?
My understanding of the main idea behind TensorFlow datasets tells me that complex pre-processing is not directly applicable, because tf.data.Dataset is specifically designed to stream very large amounts of data, more precisely tensors:
A Dataset can be used to represent an input pipeline as a collection of elements (nested structures of tensors) and a "logical plan" of transformations that act on those elements.
The fact that tf.data.Dataset operates with tensors means that obtaining any particular statistic over the data, such as min or max, requires a complete tf.Session and at least one run through the whole pipeline. The following sample lines:
iterator = dataset.make_one_shot_iterator()
batch_x, batch_y = iterator.get_next()
... which are designed to provide the next batch fast, regardless of the size of the dataset, would stop the world until the first batch is ready if the dataset were responsible for pre-processing. That's why the "logical plan" includes only local transformations, which ensures the data can be streamed and, in addition, allows transformations to run in parallel.
This doesn't mean it's impossible to implement normalization with tf.data.Dataset; I just feel it was never designed to do so and, as a result, it will look ugly (though I can't be absolutely sure of that). However, note that batch normalization fits into this picture perfectly, and it's one of the "nice" options I see. Another option is to do simple pre-processing in numpy and feed the result into tf.data.Dataset.from_tensor_slices. This doesn't make the pipeline much more complicated, and it doesn't restrict you from using tf.data.Dataset at all.
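A minimal sketch of that second option, assuming your data already fits in memory as numpy arrays (the array shapes and scaling choice here are invented for illustration):

import numpy as np
import tensorflow as tf

# Toy arrays standing in for your real features/labels.
features = np.random.rand(1000, 10).astype(np.float32)
labels = np.random.randint(0, 2, size=1000)

# Normalize in plain numpy first (per-feature min-max scaling to [0, 1]) ...
x_min, x_max = features.min(axis=0), features.max(axis=0)
features = (features - x_min) / (x_max - x_min)

# ... then hand the already-normalized arrays to the Dataset API.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(1000).batch(32)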
When it comes to normal ANNs, or any of the standard machine learning techniques, I understand what the training, testing, and validation sets should be (both conceptually, and the rule-of-thumb ratios). However, for a bidirectional LSTM (BLSTM) net, how to split the data is confusing me.
I am trying to improve prediction on individual subject data that consists of monitored health values. In the simplest case, for each subject, there is one long time series of values (>20k values), and contiguous parts of that time series are labeled from a set of categories, depending on the current health of the subject. For a BLSTM, the net is trained on all of the data going forwards and backwards simultaneously. The problem then is, how does one split a time series for one subject?
I can't just take the last 2,000 values (for example), because they might all fall into a single category.
And I can't chop the time series up randomly, because then both the learning and testing phases would be made of disjointed chunks.
Finally, each of the subjects (as far as I can tell) has slightly different (but similar) characteristics. So, maybe, since I have thousands of subjects, do I train on some, test on some, and validate on others? However, since there are inter-subject differences, how would I set up the tests if I was only considering one subject to start?
I think this has more to do with your particular dataset than Bi-LSTMs in general.
You're confusing splitting a dataset for training/testing with splitting a sequence within a particular sample. It seems like you have many different subjects, each of which constitutes a different sample. For a standard training/testing split, you would split your dataset between subjects, as you suggested in the last paragraph.
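A minimal sketch of such a subject-level split (subject_ids here is a stand-in for your real subject identifiers; the ratios are just the usual rule of thumb):

import numpy as np

# Each subject's whole time series stays together on one side of the split.
subject_ids = np.arange(1000)  # stand-in for your real subject identifiers
shuffled = np.random.default_rng(0).permutation(subject_ids)

n_train = int(0.7 * len(shuffled))
n_val = int(0.15 * len(shuffled))
train_subjects = shuffled[:n_train]
val_subjects = shuffled[n_train:n_train + n_val]
test_subjects = shuffled[n_train + n_val:]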
For any sort of RNN application, you do NOT split along your temporal sequence; you input your entire sequence as a single sample to your Bi-LSTM. So the question really becomes whether such a model is well-suited to your problem, which has multiple labels at specific points in the sequence. You can use a sequence-to-sequence variant of the LSTM model to predict which label each time point in the sequence belongs to, but again you would NOT be splitting the sequence into multiple parts.
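A minimal Keras sketch of that sequence-to-sequence variant (layer sizes, the number of features, and the number of classes are placeholders, not a recommendation):

from tensorflow.keras import layers, models

n_features, n_classes = 1, 5  # placeholders for your monitored values and label set

model = models.Sequential([
    layers.Input(shape=(None, n_features)),  # variable-length sequences
    # return_sequences=True emits one output per time step, so every point
    # in the sequence gets its own predicted health category.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()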