How to model an RNN with an Attention Mechanism for Non-Text Classification? (Python)

Recurrent Neural Networks (RNNs) with an attention mechanism are generally used for Machine Translation and Natural Language Processing. In Python, implementations of RNNs with attention abound for Machine Translation (e.g. https://talbaumel.github.io/blog/attention/); however, what I would like to do is use an RNN with an attention mechanism on a temporal data file (not any textual/sentence-based data).
I have a CSV file of dimensions 21392 x 1972, which I have converted to a DataFrame using Pandas. The first column is in datetime format and the last column consists of target classes like "Class1", "Class2", "Class3" etc. which I would like to identify. So in total there are 21392 rows (instances of data in 10-minute time-steps) and 1971 features. The last (1972nd) column is the label column, with 14 different classes in total.
I have looked into the available implementation documentation for Keras (https://medium.com/datalogue/attention-in-keras-1892773a4f22) as well as for Tensorflow (Visualizing attention activation in Tensorflow), but neither seems to do what I want to accomplish. I understand that this is an unusual approach, but I want to try it and use the attention mechanism, because many of my features are presumably redundant in the data.
import pandas as pd
mydataset = pd.read_csv('final_merged_data.csv')
It is evident from the existing literature that an attention mechanism works quite well when coupled with an RNN. I am unable to locate any such implementation of an RNN with attention that can also provide a visualisation. I am also unable to understand how I can convert my data into a sequence (or a list of lists) so that I can use it with one-hot encoding afterwards for an RNN with attention. I am new to Python as well as to Keras/Tensorflow, and am quite confused about the procedure for converting/typecasting my data into a form that can mimic a sequence-classification problem. My problem is basically multi-class classification, like one would normally do using machine-learning classifiers to predict the labels, but using an RNN with attention. Any help in this regard would be highly appreciated. Cheers!

Kindly refer to this paper on using a sequence-to-sequence model with attention for time series classification:
https://www.computer.org/csdl/proceedings/icdmw/2016/5910/00/07836709.pdf
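For the "convert my data into a sequence" part of the question, the usual first step is to slice the long recording into fixed-length windows. A minimal sketch of that step, assuming the layout described in the question (first column datetime, last column label; the window length of 50 is an arbitrary choice, not something specified above):

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

mydataset = pd.read_csv('final_merged_data.csv')

# Assumed layout: column 0 is the datetime, the last column is the class label
features = mydataset.iloc[:, 1:-1].values
labels = LabelEncoder().fit_transform(mydataset.iloc[:, -1])

window = 50  # hypothetical window length (50 x 10-minute steps)
X = np.stack([features[i:i + window] for i in range(len(features) - window)])
y = labels[window:]  # class of the step following each window
print(X.shape)  # (num_windows, window, num_features) -- the 3D input RNN layers expect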

Related

How to use an RNN for energy disaggregation?

How can I find out the most impactful inputs to use as predictors for Recurrent Neural Network (RNN) modelling? I have a CSV file with 25 columns, all of them numeric. I want to predict one of the columns using the remaining 24 columns. How can I find out how many of those 24 columns are impactful enough to be used as inputs, using mutual information analysis in Python?
Usually Energy Disaggregation is done from a single input (grid consumption) to multiple outputs (appliances in a particular home). If you want to include multiple inputs, try building a multiple-input branch Neural Network and then stack your RNN layers.
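A minimal sketch of such a multiple-input branch network in the Keras functional API (all shapes, names and layer sizes here are hypothetical):

from keras.layers import Input, LSTM, Dense, concatenate
from keras.models import Model

# Two hypothetical input branches, e.g. grid consumption and weather readings
grid_in = Input(shape=(100, 1), name='grid')       # (time-steps, features)
weather_in = Input(shape=(100, 3), name='weather')

grid_branch = LSTM(32)(grid_in)
weather_branch = LSTM(32)(weather_in)

merged = concatenate([grid_branch, weather_branch])
output = Dense(1)(merged)                          # one appliance's consumption

model = Model(inputs=[grid_in, weather_in], outputs=output)
model.compile(optimizer='adam', loss='mse')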
You can also have a look at this blog for a better understanding of disaggregation.
If you want to get started with Energy Disaggregation and NILM using Deep Learning, you can have a look at this open-source library: https://github.com/plexflo/plexflo. There is also a Deep Learning model included (LSTM) that can do basic Energy Disaggregation.
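As for the mutual-information part of the question, a minimal scikit-learn sketch might look like this (the file and column names are hypothetical):

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv('energy.csv')           # hypothetical 25-column file
X = df.drop(columns=['target'])          # the 24 candidate predictors
y = df['target']                         # the column to predict

mi = mutual_info_regression(X, y)
ranked = sorted(zip(X.columns, mi), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 4))         # keep the highest-scoring inputs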

Recommendation for Best Neural Network Type (in TensorFlow or PyTorch) for Fitting Problems

I am looking to develop a simple Neural Network in PyTorch or TensorFlow to predict one numeric value based on several inputs.
For example, if one has data describing the interior comfort parameters for a building, the NN should predict the numeric value for the energy consumption.
Both the PyTorch and TensorFlow documented examples and tutorials are generally focused on classification and time-dependent series (which is not my case). Any idea which NN available in those libraries is best for this kind of problem? I'm just looking for a hint about the type, not code.
Thanks!
The type of problem you are talking about is called a regression problem. In such problems, you would have a single output neuron with a linear activation (or no activation), and you would use MSE or MAE as the loss to train your network.
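A minimal Keras sketch of that setup (the layer sizes and input width are placeholders):

from keras.models import Sequential
from keras.layers import Dense

num_inputs = 10   # placeholder: number of comfort parameters you measure

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(num_inputs,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))   # single output neuron, no activation (linear)
model.compile(optimizer='adam', loss='mse', metrics=['mae'])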
If your problem is a time series (where you are using previous values to predict the current/next value), then you could try multi-variate time series forecasting using LSTMs.
If your problem is not a time series, then you could just use a vanilla feed-forward neural network. This article explains the concepts of data correlation really well, and you might find it useful in deciding what type of neural network to use based on the type of data and output you have.

Time series forecasting model.predict()

I do not understand how model.predict(...) works on a time series forecasting problem. I usually use it with a CNN, where it is pretty straightforward, but for time series I don't understand what it returns.
For example, I am currently doing an exercise where I have to forecast power consumption using an LSTM. I succeeded in training my model, but when I want to know what the power consumption will be tomorrow (so no data except past observations) I don't know what input to use.
Traditional ML algorithms, which you might be more used to, generally expect the data in a 2D structure: one row per example and one column per feature.
For sequential data, such as a stream of timed events associated with each user, it's also possible to create a lagged 2D dataset, where the history of different features for different IDs is aligned into single rows, with each lagged value (feature 1 at t, t-1, t-2, and so on) becoming its own column.
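As a sketch of that lagging step with pandas (toy data, hypothetical column names):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1],
                   'power': [10.0, 12.0, 11.0, 13.0]})

# One row per time-step becomes one row per example with its recent history
for lag in (1, 2):
    df['power_t-%d' % lag] = df.groupby('id')['power'].shift(lag)
df = df.dropna()   # the earliest rows have no full history
print(df)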
This can be a good way to work, because once your data is in the correct shape you can use it with fast-to-set-up, fast-to-train models. However, models using features engineered with this approach generally don't have any capacity to "learn" anything about the natural sequence of the data. To something like a tree-based ensemble model receiving this format, feature 1 at time t and at time t-1 in the example above are treated completely independently, and this can severely limit the model's predictive power.
There are types of deep learning architecture specifically designed for modelling sequence data, called recurrent neural nets (RNNs). Two of the most popular cells to use in these are long short-term memory (LSTM) and gated recurrent units (GRU). There's a good post on understanding how LSTM cells work here, but the TL;DR is that they have a structure that allows them to learn from sequences of data.
Cells like LSTM expect a 3D tensor of input data. We arrange it so that one axis holds the data features, a second axis holds the sequence steps (like time ticks), and a third axis stacks each of the different examples we want to predict a single "y" value for. Using the same type of dataset as the lagged example above, each example becomes one (time-steps x features) slice in that stack.
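A minimal sketch with made-up shapes, showing the 3D input an LSTM layer expects:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Hypothetical shapes: 32 examples, 10 time-steps, 4 features each
X = np.random.rand(32, 10, 4)
y = np.random.rand(32, 1)

model = Sequential()
model.add(LSTM(16, input_shape=(10, 4)))  # (time-steps, features); examples go on the batch axis
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=1, verbose=0)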
The ability to learn patterns in sequences of data like this is particularly beneficial for both time series and text data, which are naturally ordered.
To return to your original question, when you want to predict something in your test set you'll need to pass it sequences represented just like the ones it was trained on (this is a reasonably good rule of supervised learning in general). For example, if the model was trained like the last example above, you'll need to pass it a 2D (time-steps x features) example for each ID you want to make a prediction for, stacked along the batch axis.
You should explore the way the original training data is represented and make sure you understand it well, as you'll need to create the same shape of data to make predictions. X_train.shape is a great place to start, if you have your training data in a pandas dataframe or numpy arrays, to see what the dimensionality is, and then you can inspect entries along each axis until you get a good feel for the data it contains.
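Continuing the made-up shapes from the sketch above, a single prediction then looks like this (your real X and model come from your own training code):

# Predict for one new, unseen sequence
last_window = X[-1]                                # stand-in for "the most recent data"
prediction = model.predict(np.expand_dims(last_window, axis=0))  # shape (1, 10, 4) in
print(prediction.shape)                            # (1, 1): one forecast value out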

How to structure and size Y-labels for multivariate sequence prediction using Keras LSTMs

I am working on a sequence prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features) where each sample is independent, number of time steps is uniform for each sample (after pre-padding the length with 0's using keras.pad_sequences), and my number of features is 2. To summarize my question(s), I am wondering how to structure my Y-label dataset to feed the model and want to gain some insight on how to properly structure my model to output what I want.
My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.
This is the main source I've been referencing to try to understand how to create a generator for use with keras.fit_generator:
[1]
There is no confusion about how the mini-batch of "X" data is grabbed, but for the "Y" data I am not sure about the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, I guess resulting in a 4-dimensional numpy array? But I'm kinda lost on how to deal with the second, numerical feature.
Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.
Proposed architecture (parameters loosely filled in, nothing set yet):
from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense, Activation

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))  # skip the 0-padded steps
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...)  # I'll figure this out
So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
This is my first attempt at doing anything with neural networks or Keras, and I would not say I'm "great" at Python, but I can get by. However, I feel I have a decent grasp of the fundamental theoretical concepts, but lack the practice.
This question is sort of open-ended, with encouragement to pick apart my current strategy.
Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
E.g. I train on these padded max-length sequences, but in production I want to use the model to predict the remaining, currently unseen time-steps, which would be of variable length.
Okay, so if I understand you properly (correct me if I'm wrong), you would like to predict the next features based on the current ones.
When it comes to the categorical variable, you are on point: your Dense layer should output an N-1 vector containing the probability of each class (while we are at it, if you, by any chance, use pandas.get_dummies, remember to specify the argument drop_first=True; a similar approach should be employed whatever you are using for one-hot encoding).
Besides that N-1 output vector for each sample, it should output one more number for the numerical value.
Remember to output logits (no activation; don't use the softmax at the end as you currently do). Afterwards, the network output should be separated into the N-1 part (your categorical feature) and passed to a loss function able to handle logits (e.g. in Tensorflow it is tf.nn.softmax_cross_entropy_with_logits_v2, which applies a numerically stable softmax for you).
Now, the N-th element of the network output should be passed to a different loss, probably mean squared error.
Based on the values of those two losses (you could take the mean of both to obtain one loss value), you backpropagate through the network, and it might do just fine.
Unfortunately I'm not skilled enough in Keras to help you with the code, but I think you will figure it out yourself. While we're at it, I would suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well; your choice.
An additional, maybe helpful, thought: you may want to check Teacher Forcing for your kind of task. More on the topic and the theory behind it can be found in the outstanding Deep Learning Book, and a code example (though in PyTorch once again) can be found in their docs here.
BTW, interesting idea: mind if I use it in connection with my current research trajectory (with kudos going to you, of course)? Comment on this answer if so, and we can talk it out in chat.
Basically every answer I was looking for is demonstrated and explained in this tutorial, which is an absolutely great resource for understanding how to model multi-output networks. It goes through a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon it, however.
https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
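For reference, a minimal two-headed sketch in the Keras functional API, along the lines the tutorial and the answer above describe (all sizes are placeholders):

from keras.layers import Input, Masking, LSTM, Dense
from keras.models import Model

timesteps, features, vocab_size = 20, 2, 50   # hypothetical sizes

inputs = Input(shape=(timesteps, features))
x = Masking(mask_value=0.)(inputs)
x = LSTM(64)(x)

cat_out = Dense(vocab_size, activation='softmax', name='category')(x)  # feature 1
num_out = Dense(1, name='value')(x)                                    # feature 2

model = Model(inputs=inputs, outputs=[cat_out, num_out])
model.compile(optimizer='adam',
              loss={'category': 'categorical_crossentropy', 'value': 'mse'},
              loss_weights={'category': 1.0, 'value': 1.0})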

How do feature columns work in TensorFlow?

I read that feature columns in TensorFlow are used to define our data, but how and why? How do feature columns work, and why do they even exist if we can make a custom estimator without them?
And if they are necessary, why don't libraries like Keras use them?
Broadly Speaking
This may be too general to answer. You may want to watch some videos or do more reading on machine learning, because this is a broad topic.
I will try to explain what features of data are used for.
A "feature" of the data is a meaningful variable that should separate two classes from each other. For example, if we choose the feature "weight", we can tell elephants apart from squirrels. They have very different weights, and our machine learning algorithm can learn to "understand" that an animal with a heavy weight is more likely to be an elephant than it is to be a squirrel. In a real scenario you would generally have more than one feature.
I'm not sure why you would say that Keras does not use features. They are a fundamental aspect of many classification problems. Some datasets may contain labelled data or labelled features, like this one: https://keras.io/datasets/#cifar100-small-image-classification
When we "don't use features", I think a more accurate way to state that would be that the data is unlabelled. In this case, a machine learning algorithm can still find relationships in the data, but without human labels applied to the data.
If you Ctrl+F for the word "features" on this page you will see places where Keras accepts them as an argument: https://keras.io/layers/core/
I am not a machine learning expert so if anyone is able to correct my answer, I would appreciate that too.
In TensorFlow
My understanding of TensorFlow's feature column implementation in particular is that it allows you to cast raw data into a typed column, which lets the algorithm better distinguish what type of data you are passing. For example, latitude and longitude could be passed as two numerical columns, but, as the docs say here, using a crossed column for latitude x longitude may allow the model to train on the data in a more meaningful/effective way. After all, what "latitude" and "longitude" really mean together is "location". As for why Keras does not have this functionality, I am not sure; hopefully someone else can offer insight on this topic.
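A minimal sketch of that crossed-column pattern, following the TensorFlow docs (the bucket boundaries here are arbitrary):

import tensorflow as tf

latitude = tf.feature_column.numeric_column('latitude')
longitude = tf.feature_column.numeric_column('longitude')

# Crossed columns need categorical inputs, so bucketize the raw numbers first
lat_buckets = tf.feature_column.bucketized_column(latitude, boundaries=list(range(-90, 91, 10)))
lon_buckets = tf.feature_column.bucketized_column(longitude, boundaries=list(range(-180, 181, 10)))

# "Latitude x Longitude" becomes a single "location" feature the model can learn on
location = tf.feature_column.crossed_column([lat_buckets, lon_buckets], hash_bucket_size=1000)
location_onehot = tf.feature_column.indicator_column(location)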
