Multilabel classification using LSTM on variable length signal using Keras - python

I have recently started working on classifying ECG signals into various classes. It is basically a multi-label classification task (4 classes in total). I am new to Deep Learning, LSTMs and Keras, which is why I am confused about a few things.
I am thinking about giving the normalized original signal as input to the network. Is this a good approach?
I also need to understand the training input shape for an LSTM, as ECG signals are of variable length (9000 to 18000 samples) and a classifier usually needs fixed-size input. How can I handle this type of input with an LSTM?
Finally, what should the structure of a deep LSTM network be for such lengthy input, and how many layers should I use?
Thanks for your time.
Regards

I am thinking about giving the normalized original signal as input to the network. Is this a good approach?
Yes, this is a good approach. It is quite standard to normalize or rescale the input to Deep Learning models.
This usually helps your model converge faster, because the values now lie in a small range (e.g. [-1, 1]) rather than the larger un-normalized range of the original input (say [0, 1000]). It can also give better results: well-scaled inputs keep saturating activations such as tanh and sigmoid away from their flat regions, which mitigates vanishing gradients, and they suit the default settings of common initializers and optimizers.
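As a minimal sketch of the rescaling described above, the following NumPy snippet applies per-record z-score normalization and, as an alternative, min-max scaling to [-1, 1]. The function names and the synthetic signal are illustrative assumptions, not something taken from the question.

```python
import numpy as np

def zscore(signal, eps=1e-8):
    """Return the signal shifted to zero mean and scaled to unit variance."""
    signal = np.asarray(signal, dtype=np.float64)
    return (signal - signal.mean()) / (signal.std() + eps)

def scale_to_unit_range(signal, eps=1e-8):
    """Linearly map the signal so that its minimum becomes -1 and its maximum ~1."""
    signal = np.asarray(signal, dtype=np.float64)
    lo, hi = signal.min(), signal.max()
    return 2.0 * (signal - lo) / (hi - lo + eps) - 1.0

# Synthetic stand-in for a raw record with values in roughly [0, 1000] (assumption).
rng = np.random.default_rng(0)
raw = rng.uniform(0, 1000, size=9000)

normalized = zscore(raw)
rescaled = scale_to_unit_range(raw)
```

Whether to compute the statistics per record (as here) or over the whole training set is a design choice that depends on the data.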
I also need to understand the training input shape for an LSTM, as ECG signals are of variable length (9000 to 18000 samples) and a classifier usually needs fixed-size input. How can I handle this type of input with an LSTM?
This part is really important. You are correct that, in practice, an LSTM is fed inputs with a known, fixed shape: every sequence in a batch must have the same number of timesteps so that the batch can be stored as one tensor. (Keras recurrent layers do accept a timesteps dimension of None, so the length may differ between batches, but fixing one length for the whole dataset is the simplest approach.) This is also explained in the Keras docs on Recurrent Layers, which say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
As we can see, it expects your data to have a number of timesteps as well as a dimension on each one of those timesteps; batch_size is the number of sequences processed together. To exemplify, suppose a single input sequence is [[1,4],[2,3],[3,2],[4,1]]. Then, using a batch_size of 1, the shape of your data would be (1,4,2), as you have 4 timesteps, each with 2 features.
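The shape from this example can be checked directly with NumPy. The second part assumes a single-channel ECG record, which the question does not state explicitly.

```python
import numpy as np

# One sequence with 4 timesteps and 2 features per timestep.
sequence = [[1, 4], [2, 3], [3, 2], [4, 1]]

# Add a leading batch axis so the array is (batch_size, timesteps, input_dim).
batch = np.array(sequence, dtype=np.float32)[np.newaxis, ...]
print(batch.shape)  # (1, 4, 2)

# Assumption: a single-channel signal has one feature per timestep.
ecg = np.zeros(9000, dtype=np.float32)
ecg_batch = ecg.reshape(1, -1, 1)
print(ecg_batch.shape)  # (1, 9000, 1)
```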
So, bottom line: you have to make sure that you pre-process your data so that it has a fixed shape which you can then pass to your LSTM layers. How to do this is something you will have to work out yourself, as you know your data and problem better than we do.
Maybe you can fix the number of samples you take from each signal, discarding some and keeping others so that every signal has the same length (since your signals are between 9000 and 18000 samples long, 9000 would be the logical choice, discarding the extra samples from the longer signals). You could also apply some other transformation that maps inputs of 9000-18000 samples to a fixed size.
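A minimal sketch of the truncation idea, using NumPy only. The helper name `to_fixed_length` is hypothetical, and the zero padding of shorter signals is an added assumption for completeness; the answer itself only mentions discarding samples.

```python
import numpy as np

def to_fixed_length(signal, target_len, pad_value=0.0):
    """Truncate a 1-D signal to target_len, or right-pad it with pad_value."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) >= target_len:
        return signal[:target_len]           # discard the extra samples
    out = np.full(target_len, pad_value, dtype=np.float32)
    out[:len(signal)] = signal               # keep the signal, pad the tail
    return out

# Toy records of different lengths standing in for real ECG signals.
signals = [np.ones(9000), np.ones(12000), np.ones(18000)]

# Stack into (batch_size, timesteps) and add a feature axis of size 1.
fixed = np.stack([to_fixed_length(s, 9000) for s in signals])[..., np.newaxis]
print(fixed.shape)  # (3, 9000, 1)
```

Keeping only the first 9000 samples is the simplest choice; taking a random or centered window of the same length would work with the same shape logic.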
Finally, what should the structure of a deep LSTM network be for such lengthy input, and how many layers should I use?
This one is really quite broad and doesn't have a unique answer. It would depend on the nature of your problem, and determining those parameters a priori is not so straightforward.
What I recommend is to start with a simple model and then add layers and blocks (neurons) incrementally until you are satisfied with the results.
Try just one hidden layer first: train and test your model and check its performance. You can then add more blocks and see whether performance improves. You can also add more layers and check again, until you are satisfied.
This is a good way to build Deep Learning models, as you will arrive at the results you want while keeping your network as lean as possible, which in turn reduces execution time and complexity. Good luck with your coding; I hope you find this useful.

Related

Can CNN autoencoders have different input and output dimensions?

I am working on a problem which requires me to build a deep learning model that, given an input image, outputs another image. It is worth noting that these two images are conceptually related but do not have the same dimensions.
At first I thought that a classical CNN with a final dense layer whose size is the product of the height and width of the output image would suit this case, but during training it gave strange figures such as an accuracy of 0.
While looking for answers on the Internet I discovered CNN autoencoders, and I was wondering if this approach could help me solve my problem. In all the examples I saw, the input and output of the autoencoder had the same size and dimensions.
So I wanted to ask whether there is a type of CNN autoencoder that produces an output image with different dimensions from the input image.
An auto-encoder (AE) is an architecture that encodes your image into a lower-dimensional representation while simultaneously learning to reconstruct the data from that representation. AEs therefore rely on unsupervised data (no labels needed), which is used both as the input and as the target (in the loss). Your problem has a separate target image, so it is closer to a supervised image-to-image task.
You can try a U-net based architecture for your use case. A U-net forwards intermediate data representations to later layers of the network, which should help it learn the mapping of the inputs into a new domain faster.
You can also experiment with a simple architecture containing a few ResNet blocks without any downsampling layers, which might or might not be enough for your use case.
If you want to dig a little deeper, you can look into Disco-GAN and related methods. They explicitly try to map an image into a new domain while maintaining the image information.

Can Keras embedding layer give random vector for a certain index (e.g: -1) instead of a fixed vector

I have a problem where I have texts (which can be very long, up to ~9000 words) that I need to embed with a Keras Embedding layer. I chose a fixed size of 5000 for every text, so I need to pad each sequence to get the right shape. The classical way is to use Keras' pad_sequences, which takes a list of lists of indexes as input and pads with zeros or truncates the lists of indexes to 5000.
For my downstream task, I use a sort of convnet inspired by Kim's paper (https://arxiv.org/abs/1408.5882). My concern is that the network learns, in a certain sense, the word count by detecting the pattern of vectors that embed the 0 I used to pad the sequences. I am not saying that this feature is unimportant, but I would like to force the network to learn other features in preference. I was thinking about two things. The first is an additional task (like an adversarial task) that takes the latent representation created by the model before the output and uses a branch of the model to predict the size of the text, or a size cluster, for example:
[,1000 words] -- cluster 1
[1001, 2000 words] -- cluster 2
etc.
The output of that branch would then be used to encourage the network to map other information into the latent space, by adding an adversarial loss to the main loss term. My other idea was that, instead of using one fixed vector to embed the zero paddings, we could use random vectors generated on the fly during training (every time the network sees a particular index, for example -1, it knows it has to generate a random vector). I think this would break the symmetry introduced by the identical padding vectors and help the model generalize better, as it introduces noise into the training process.
As I didn't find any papers on padding with something other than zeros, I am turning to the community. What do you think? I went through the Embedding layer implementation and I am fairly sure that the second idea is straightforward to implement in Keras by replacing the K.gather() call with a flag for the relevant indexes (execution time would be longer, though).
Thanks in advance for your feedback and your resources!
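To make the second idea concrete, here is a framework-free NumPy sketch of an embedding lookup that replaces padding positions with freshly drawn random vectors on every call. It is not the Keras Embedding implementation; `PAD_INDEX`, `embed_with_random_padding` and the noise scale are hypothetical. Index 0 is used for padding instead of -1, because in NumPy -1 would silently select the last row of the matrix.

```python
import numpy as np

PAD_INDEX = 0  # hypothetical index reserved for padding

def embed_with_random_padding(indices, embedding_matrix, rng, scale=0.1):
    """Look up embeddings, but give every padding position a new random vector."""
    indices = np.asarray(indices)
    embedded = embedding_matrix[indices]      # fancy indexing returns a copy
    pad_mask = indices == PAD_INDEX           # True where the input is padding
    n_pad = int(pad_mask.sum())
    dim = embedding_matrix.shape[1]
    embedded[pad_mask] = rng.normal(0.0, scale, size=(n_pad, dim))
    return embedded

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, dim))

# Two padded index sequences of length 4.
padded = np.array([[3, 5, 0, 0], [7, 0, 0, 0]])
first = embed_with_random_padding(padded, embedding_matrix, rng)
second = embed_with_random_padding(padded, embedding_matrix, rng)
```

Real tokens get the same vector on both calls, while padding positions differ between calls. Whether this actually improves generalization is an open question that would need to be tested empirically.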

How to structure and size Y-labels for multivariate sequence prediction using Keras LSTMs

I am working on a sequence prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features) where each sample is independent, number of time steps is uniform for each sample (after pre-padding the length with 0's using keras.pad_sequences), and my number of features is 2. To summarize my question(s), I am wondering how to structure my Y-label dataset to feed the model and want to gain some insight on how to properly structure my model to output what I want.
My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.
This is a main source I've been referencing to try and understand how to create a generator for use with keras.fit_generator.
[1]
There is no confusion with how the mini-batch for "X" data is grabbed, but for the "Y" data, I am not sure about the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, I guess resulting in a 4? Dimensional numpy matrix?? But I'm kinda lost with how to deal with the second numerical feature.
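One possible way to lay out the targets, sketched with NumPy on toy data: the inputs are all timesteps but the last, and the targets are the same sequences shifted by one step, split into a one-hot category target and a numeric target. Both targets end up 3-dimensional, not 4-dimensional. The value of `vocab_size` and the toy sequences are assumptions for illustration.

```python
import numpy as np

vocab_size = 5  # hypothetical number of categories, including padding (0) and EOS (4)

# Two toy sequences: column 0 is the integer category, column 1 the numeric feature.
# The second sequence is pre-padded with zeros, as in the question.
data = np.array([
    [[1, 0.5], [2, 0.1], [3, 0.9], [4, 0.0]],
    [[0, 0.0], [2, 0.3], [3, 0.7], [4, 0.0]],
], dtype=np.float32)

x = data[:, :-1, :]                             # inputs: timesteps 0 .. T-2
next_category = data[:, 1:, 0].astype(int)      # categories at timesteps 1 .. T-1
y_category = np.eye(vocab_size, dtype=np.float32)[next_category]  # one-hot targets
y_numeric = data[:, 1:, 1:]                     # numeric targets, feature axis kept

print(x.shape, y_category.shape, y_numeric.shape)
```

A generator would yield `x` together with the two target arrays for each mini-batch.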
Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.
Proposed architecture (parameters loosely filled in, nothing set yet):
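from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense, Activation

model = Sequential()
# Skip timesteps whose features are all equal to the padding value 0.
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(hidden_size, return_sequences=True))
# Apply the same Dense layer at every timestep, then a softmax over the vocabulary.
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...)  # I'll figure this out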
So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
This is my first attempt at doing anything with neural networks or Keras, and I would not say I'm "great" at python, I can get by though. However, I feel I have a decent grasp at the fundamental theoretical concepts, but lack the practice.
This question is sorta open ended, with encouragement to pick apart my current strategy.
Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
Ex. I train on these padded max-len sequences, but in production I want to use this to predict the remaining part of the currently unseen time-steps, which would be variable length.
Okay, so if I understand you properly (correct me if I'm wrong), you would like to predict the next features based on the current ones.
When it comes to the categorical variable, you are on point: your Dense layer should output a vector of N values, one score per class. (One-hot encode the targets with all N columns; dropping a column, e.g. with drop_first=True in pandas.get_dummies, would leave one class without an output under softmax cross-entropy.)
In addition to those N outputs for each sample, it should output one more number for the numerical value.
Remember to output logits (no activation; don't use softmax at the end as you currently do). Afterwards, the network output should be split: the first N elements (your categorical feature) are passed to a loss function able to handle logits (e.g. in TensorFlow, tf.nn.softmax_cross_entropy_with_logits_v2, which applies a numerically stable softmax for you).
The remaining (N+1)-th element of the network output should be passed to a different loss, probably Mean Squared Error.
Based on the values of those two losses (you could take the mean of both to obtain one loss value), you backpropagate through the network, and it might do just fine.
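The two-loss idea can be sketched without any framework. The NumPy code below assumes the first `n_classes` outputs are the class logits and the last output is the numeric prediction; the function names and the equal weighting of the two losses are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, one_hot_labels):
    """Numerically stable cross-entropy computed directly from raw logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=-1)

def combined_loss(outputs, y_category, y_numeric, n_classes):
    """Mean of the categorical loss and the regression loss."""
    logits = outputs[..., :n_classes]        # categorical part of the output
    numeric_pred = outputs[..., n_classes]   # single numeric prediction
    ce = softmax_cross_entropy_with_logits(logits, y_category).mean()
    mse = ((numeric_pred - y_numeric) ** 2).mean()
    return 0.5 * (ce + mse)

# Toy batch: 2 samples, 3 classes plus 1 numeric output per sample.
outputs = np.array([[10.0, 0.0, 0.0, 0.5],
                    [0.0, 10.0, 0.0, 1.0]])
y_category = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
y_numeric = np.array([0.5, 1.0])

loss = combined_loss(outputs, y_category, y_numeric, n_classes=3)
print(loss)
```

In practice the two terms often have very different scales, so a weighted sum may work better than a plain mean.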
Unfortunately I'm not skilled enough in Keras in order to help you with the code, but I think you will figure it out yourself. While we're at it, I would like to suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well, your choice.
Additional 'maybe helpful' thought: you may check Teacher Forcing for your kind of task. More on the topic and theory behind it can be found in the outstanding Deep Learning Book and code example (though in PyTorch once again), can be found in their docs here.
BTW interesting idea, mind if I use it in connection with my current research trajectory (with kudos going to you of course)? Comment on this answer if so we can talk it out in chat.
Basically every answer I was looking for was demonstrated and explained in this tutorial. It is an absolutely great resource for understanding how to model multi-output networks, with a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon it, however.
https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/

Keras LSTM Input Transpose

I'm training a timeseries LSTM model using Keras. I understand that the input to the model has to be in the format: [samples, timesteps, features].
However, when I reverse transpose each input element, so the input now is in the format: [samples, features, timesteps] my model accuracy improves significantly, and training time is reduced quite a bit as well. Does anyone have an explanation as to why?
For reference, here are the stats on my training data:
samples: 720
timesteps: 256
features: 4
So my input tensor should have the shape [720, 256, 4] but reshaping to [720, 4, 256] produces better results. Why?
As I said in my rather long comments, the answer is "because you are not learning the same thing". Frameworks like tensorflow and keras attempt to make training and using networks convenient, so as long as your inputs are at least approximately the right shape, the framework will try its best to feed the data into the network. But the network has no way to interpret the data you feed into it. It is up to you to make sure that what you are feeding into the network makes sense in the context of your data. No matter what data you send and what labels you use, the network will do its best to learn a mapping between the data and the labels. And it might succeed. Just because the pattern you are trying to learn makes no sense doesn't mean it cannot be learned. So to answer your question, you need to figure out what is the meaning of your input transposition and given that LSTM will treat your data as a sequence of consecutive datapoints, what sequences did you end up learning.
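To see what the network actually receives in each case, the following NumPy sketch uses the shapes from the question with synthetic data. It shows both a true transpose and a plain reshape, since the question mentions both and they are not the same operation.

```python
import numpy as np

samples, timesteps, features = 720, 256, 4
x = np.arange(samples * timesteps * features, dtype=np.float32)
x = x.reshape(samples, timesteps, features)      # intended layout

x_t = np.transpose(x, (0, 2, 1))                 # swap the last two axes
x_r = x.reshape(samples, features, timesteps)    # same shape, different content

print(x_t.shape, x_r.shape)

# After a transpose, "timestep" k is the entire history of feature k:
# the LSTM sees a sequence of only 4 steps, each a 256-dimensional vector.
same_as_feature_history = np.array_equal(x_t[0, 1], x[0, :, 1])

# After a reshape, "timestep" 0 is the first 64 real timesteps with their
# 4 features interleaved, which has no obvious meaning.
same_as_interleaved = np.array_equal(x_r[0, 0], x[0, :64].ravel())
```

Either way, the model is trained on a 4-step sequence instead of a 256-step one, which would explain the shorter training time.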

LSTM Embedding output size and No. of LSTM

I am not sure why the embedding output vector has a size of only 32 while the LSTM has 100 units.
What confuses me is this: if each word vector has only 32 dimensions and is fed into the LSTM, shouldn't an LSTM with 32 units be big enough to hold it?
model.add(Embedding(5000, 32))
model.add(LSTM(100))
Those are hyper-parameters of your model, and there is no best way of setting them without experimentation. The two sizes are independent: the embedding dimension is the size of each word vector, while the number of LSTM units is the size of the hidden state. In your case, embedding single words into a vector of dimension 32 might be enough, but the LSTM will process a sequence of them and might require more capacity (i.e. dimensions) to store information about multiple words. Without knowing the objective or the dataset it is difficult to make an educated guess on what those parameters should be. Often we look at past research papers tackling similar problems, see what hyper-parameters they used, and then tune them via experimentation.
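As a small worked example of how the two sizes interact, the parameter counts of both layers can be computed by hand. The formulas assume a standard LSTM layer with bias terms (four gates, each with input weights, recurrent weights and a bias); the function names are illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

print(embedding_params(5000, 32))  # 160000
print(lstm_params(32, 100))        # 53200
```

The input dimension (32) only determines the size of the input weight matrices; the hidden state size (100) can be chosen freely.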
