LSTM num_units size, ie size of hidden_layer - python

I often see natural language processing tasks use LSTMs in such a way that an embedding layer is followed by an LSTM layer whose size equals the embedding size, i.e. if a word is represented by a 1x300 vector, LSTM(300) is used.
model = Sequential()
model.add(Embedding(vocabulary, hidden_size, input_length=num_steps))
model.add(LSTM(hidden_size, return_sequences=True))
Is there a particular reason for doing so? Like a better representation of the meaning?

I don't think there's any special reason/need for this and frankly I haven't seen that many cases myself where this is the case (i.e. using the LSTM hidden units == Embedding size). The only effect this has is there's a single memory cell for each embedding vector element (which I don't think is a requirement or a necessity).
Having said that, I thought I might mention something extra: there is a reason, in fact a very good one, for having an Embedding layer in this setup. Let's consider the two options:
1. Using one-hot encoding for word representations
2. Using embeddings for word representations
Option 2 has several advantages over option 1.
The dimensionality of the inputs is way smaller when you use an embedding layer (e.g. 300 as opposed to 50000)
You're giving the model the flexibility to learn a word representation that is in fact suited to the task you're solving. In other words, you are not forcing the representation of words to remain constant during the training process.
If you use pretrained word embeddings to initialize the Embedding layer, even better. You are bringing word semantics into the task you are solving, which usually helps to solve the task better. This is analogous to asking a toddler who doesn't understand the meaning of words to do something text-related (e.g. put words in the correct grammatical order) vs asking a 3-year-old to do the same task. They both might eventually do it, but one will do it quicker and better.
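To make the dimensionality argument concrete, here is a minimal sketch in plain Python (no framework) of why an embedding lookup can stand in for a one-hot input: multiplying a one-hot vector by a weight matrix just selects one row of that matrix. The 5-word vocabulary, the 3-dimensional vectors and all function names are illustrative assumptions, not part of the original answer.

```python
# Toy setup (assumed values): 5 words in the vocabulary, 3-dimensional embeddings.
vocab_size = 5
embedding_dim = 3
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix using plain Python lists."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


def embedding_lookup(index, matrix):
    """What an embedding layer effectively does: pick row `index`."""
    return matrix[index]


word_index = 3
# Option 1: a vocab_size-long one-hot input pushed through a weight matrix.
dense_via_one_hot = vector_times_matrix(one_hot(word_index, vocab_size), embedding_matrix)
# Option 2: a direct row lookup, which never builds the large sparse input.
dense_via_lookup = embedding_lookup(word_index, embedding_matrix)
```

Both routes give the same 3-dimensional vector, but the lookup avoids building a vocabulary-sized input for every word, which is where the "300 as opposed to 50000" saving comes from.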


How to handle variable length sentences in PyTorch with GloVe embedding layer?

I am building a text classifier using an RNN in PyTorch. The embeddings I'm using are GloVe. However, I am feeding variable-length index sequences into the model. This will lead to variable-length embeddings, which I take it will not work. How do I get around this and make the embedding output the same length for all sentences?
def forward(self, sentence):
    embeds = self.embedding(sentence)
    hidden = self.__init__hidden(size)
    output, hidden = self.rnn(embeds, hidden)
    out = self.hidden2out(output)
Also, if someone could tell me how to choose the hidden layer size that would be great.
Your data in this case can be represented in pytorch as having shape: [seq_len, batch, vocab_size]. When you create a tensor for your model this way, the batch size is up to you, and sequence length is variable.
When you employ an embedding, such as GloVe or word2vec, what you're doing is condensing your vocab size (from many thousands to <1000) to make it more information dense and to give your model some contextual information about each word. So you can see that your output from an embedding layer will be of shape: [seq_len, batch, embedding_size]. Therefore, it doesn't matter how long your sequences are at this stage because your output is sequence-length-independent.
Without knowing your embedding layer, I just have to assume that it's operating along the correct dimensions. If it is not, a couple of transpose operations to ensure that it is will work.
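Within a single batch the sequences still have to share one length, and a common way to achieve that (not spelled out above) is to pad the shorter index lists with a reserved index. The following is a minimal sketch in plain Python; `PAD_INDEX`, `pad_batch` and `to_time_major` are hypothetical names, and reserving index 0 for padding is an assumption about your vocabulary.

```python
PAD_INDEX = 0  # assumption: index 0 is reserved for padding in the vocabulary


def pad_batch(sentences, pad_index=PAD_INDEX):
    """Pad lists of word indices to the longest one; return (padded, lengths)."""
    lengths = [len(sentence) for sentence in sentences]
    max_len = max(lengths)
    padded = [s + [pad_index] * (max_len - len(s)) for s in sentences]
    return padded, lengths


def to_time_major(padded):
    """Transpose [batch, seq_len] into the [seq_len, batch] layout used above."""
    return [list(step) for step in zip(*padded)]


batch = [[4, 7, 2], [9, 1], [3, 8, 6, 5]]
padded, lengths = pad_batch(batch)
time_major = to_time_major(padded)
```

Keeping the original `lengths` around is useful later, for example to ignore the padded positions when computing the loss.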
Unfortunately, the only way to gain intuition about hidden layer size (I'm assuming you're not talking about your h_0 input into your RNN) and the number of hidden layers is to try things out. I can't really even give you a guess because I have no idea what dataset you're using or what problem you're trying to solve. There is no one-size-fits-all policy in machine learning!

Can Keras embedding layer give random vector for a certain index (e.g: -1) instead of a fixed vector

I have a problem where I have texts (which can be very long, max ~9000 words) that I need to embed with a Keras layer. I chose a fixed size of 5000 for every text, and I need to pad each sequence to get the right shape. The classical way is to use Keras' pad_sequences, which takes a list of lists of indexes as input and pads with zeros or cuts the lists of indexes to 5000.
For my downstream task, I use a sort of convnet inspired by Kim's paper. My concern is that the network learns, in a certain sense, the word count by detecting the pattern of vectors that embed the 0 I used to pad the sequences. I am not saying that this feature is not important, but I would like to force the network to learn other features in preference. I was thinking about two things. The first is an additional task (like an adversarial task) that takes the latent representation created by the model before the output and uses a branch of the model to predict the size of the text or a size cluster, for example:
[0, 1000 words] -- cluster 1
[1001, 2000 words] -- cluster 2
The output would then be used to encourage the network to map other information into the latent space by adding an adversarial loss to the main loss term. My other idea was, instead of using zero vectors to embed the padding, to use random vectors generated on the fly while training (every time the network sees a particular index, for example -1, it knows it has to generate a random vector). I was thinking that this breaks the symmetry introduced by using zero vectors and helps the model generalize better, as it introduces noise into the training process.
As I didn't find any papers on padding with something other than zeros, I turn to the community. What do you think? I went through the Embedding layer implementation, and I am pretty sure that the second idea is fairly straightforward to implement in Keras by replacing the plain K.gather() call with one that checks a flag for the padding indexes (execution time would be longer, though).
Thanks in advance for your feedback and your resources!
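As a framework-free illustration of the second idea, the sketch below looks up ordinary indexes in a fixed table and draws a fresh Gaussian vector whenever it meets the padding marker. It is only a sketch of the concept in plain Python, not a Keras layer; `PAD_INDEX`, `embed_with_random_padding` and the noise scale are assumptions made for the example.

```python
import random

PAD_INDEX = -1  # hypothetical marker for padded positions, as suggested above


def embed_with_random_padding(indices, embedding_matrix, rng, scale=0.1):
    """Look up real words in the table; draw a new random vector for each pad."""
    dim = len(embedding_matrix[0])
    output = []
    for index in indices:
        if index == PAD_INDEX:
            # Fresh noise on every call, so padding never forms a fixed pattern.
            output.append([rng.gauss(0.0, scale) for _ in range(dim)])
        else:
            output.append(list(embedding_matrix[index]))
    return output


embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
rng = random.Random(0)  # seeded only to make the sketch reproducible
sequence = [2, 0, PAD_INDEX, PAD_INDEX]
first_pass = embed_with_random_padding(sequence, embedding_matrix, rng)
second_pass = embed_with_random_padding(sequence, embedding_matrix, rng)
```

Real words map to the same vectors on every pass, while the padded positions differ from pass to pass and from each other. Whether this actually improves generalization is the open question of the post and would have to be checked experimentally.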

How to fill in the blank using bidirectional RNN and pytorch?

I am trying to fill in the blank using a bidirectional RNN and pytorch.
The input will be like: The dog is _____, but we are happy he is okay.
The output will be like:
1. hyper (Perplexity score here)
2. sad (Perplexity score here)
3. scared (Perplexity score here)
I discovered this idea here:
import torch, torch.nn as nn
from torch.autograd import Variable
text = ['BOS', 'How', 'are', 'you', 'EOS']
seq_len = len(text)
batch_size = 1
embedding_size = 1
hidden_size = 1
output_size = 1
random_input = Variable(
    torch.FloatTensor(seq_len, batch_size, embedding_size).normal_(), requires_grad=False)
bi_rnn = torch.nn.RNN(
    input_size=embedding_size, hidden_size=hidden_size, num_layers=1, batch_first=False, bidirectional=True)
bi_output, bi_hidden = bi_rnn(random_input)
# stagger
forward_output, backward_output = bi_output[:-2, :, :hidden_size], bi_output[2:, :, hidden_size:]
staggered_output = torch.cat((forward_output, backward_output), dim=-1)
linear = nn.Linear(hidden_size * 2, output_size)
# only predict on words
labels = random_input[1:-1]
# for language models, use cross-entropy :)
loss = nn.MSELoss()
output = loss(linear(staggered_output), labels)
I am trying to reimplement the code above found at the bottom of the blog post. I am new to pytorch and nlp, and can't understand what the input and output to the code is.
Question about the input: I am guessing the input are the few words that are given. Why does one need beginning of sentence and end of sentence tags in this case? Why don't I see the input being a corpus on which the model is trained like other classic NLP problems? I would like to use the Enron email corpus to train the RNN.
Question about the output: I see the output is a tensor. My understanding is the tensor is a vector, so maybe a word vector in this case. How can you use the tensor to output the words themselves?
As this question is rather open-ended I will start from the last parts, moving towards the more general answer to the main question posed in the title.
Quick note: as pointed out in the comments by @Qusai Alothman, you should find a better resource on the topic; this one is rather sparse when it comes to the necessary information.
Additional note: full code for the process described in the last section would take way too much space to provide as an exact answer; it would be more of a blog post. I will highlight possible steps one should take to create such a network, with helpful links as we go along.
Final note: if there is anything dumb down below, or you would like to expand the answer in any way, please do correct me or add info by posting a comment.
Question about the input
Input here is generated from a random normal distribution and has no connection to actual words. It is supposed to represent word embeddings, i.e. representations of words as numbers carrying semantic (this is important!) meaning, sometimes depending on the context as well (see one of the current state-of-the-art approaches, e.g. BERT).
Shape of the input
In your example it is provided as:
seq_len, batch_size, embedding_size,
where:
seq_len - the length of a single sentence (varies across your dataset), we will get to it later.
batch_size - how many sentences should be processed in one step of the forward pass (in case of PyTorch it is the forward method of a class inheriting from torch.nn.Module).
embedding_size - the size of the vector with which one word is represented (it might range from the usual 100/300 using word2vec up to 768 or 1024 using more recent approaches like the BERT mentioned above).
In this case it's all hard-coded to size one, which is not really useful for a newcomer; it only outlines the idea.
Why does one need beginning of sentence and end of sentence tags in this case?
Correct me if I'm wrong, but you don't need them if your input is separated into sentences. They are used if you provide multiple sentences to the model and want to indicate unambiguously the beginning and end of each (this is done with models which depend on the previous/next sentences, which does not seem to be the case here). Those are encoded by special tokens (ones which are not present anywhere in the corpus), so the neural network "could learn" that they represent the end and beginning of a sentence (one special token would be enough for this approach).
If you were to use a serious dataset, I would advise splitting your text using libraries like spaCy or nltk (the first one is a pleasure to use IMO); they do a really good job at this task.
Your dataset might already be split into sentences, in which case you are kind of ready to go.
Why don't I see the input being a corpus on which the model is trained like other classic NLP problems?
I don't recall models being trained on corpora as is, i.e. using strings. Usually those are represented by floating-point numbers using:
Simple approaches, e.g. Bag of Words.
More sophisticated ones, which provide some information about word relationships (e.g. king is more semantically related to queen than to, say, banana). Those were already mentioned above; other notable ones are GloVe, ELMo and tons of other creative approaches.
Question about the output
One should output indices into the embeddings, which in turn correspond to words represented by a vector (the more sophisticated approach mentioned above).
Each row in such an embedding represents a unique word, and its columns are that word's unique representation (in PyTorch, the first index might be reserved for words for which a representation is unknown [if using pretrained embeddings]; you may also delete those words, or represent them as an average of the sentence/document, and there are some other viable approaches as well).
Loss provided in the example
# for language models, use cross-entropy :)
loss = nn.MSELoss()
For this task it makes no sense, as Mean Squared Error is a regression metric, not a classification one.
We want to use a loss meant for classification, so softmax should be used for the multiclass case (we should be outputting one score for each of the N classes, indexed 0 to N-1, where N is the number of unique words in our corpus).
PyTorch's CrossEntropyLoss already takes logits (the output of the last layer without an activation like softmax), applies log-softmax internally, and returns the loss (averaged over the batch by default). I would advise this approach as it's numerically stable (and I like it as the most minimal one).
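To show what "takes logits" and "numerically stable" mean in practice, here is a small plain-Python sketch of cross-entropy computed directly from logits with the log-sum-exp trick. It illustrates the idea only and is not PyTorch's implementation; the function names and the example logits are made up for this sketch.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, shifted by the max for stability."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_sum_exp for x in logits]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -log_softmax(logits)[target_index]


# Four-word toy vocabulary; the model scores word 0 highest.
logits = [2.0, 0.5, -1.0, 0.1]
loss_if_target_is_0 = cross_entropy(logits, target_index=0)
loss_if_target_is_2 = cross_entropy(logits, target_index=2)

# Very large logits do not overflow because of the max shift.
loss_large_logits = cross_entropy([1000.0, 0.0], target_index=0)
```

The loss is small when the highest-scoring word is the right one and large otherwise, which is the behaviour a fill-in-the-blank model needs and which Mean Squared Error on embeddings does not give.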
I am trying to fill in the blank using a bidirectional RNN and pytorch
This is a long one, I will only highlight steps I would undertake in order to create a model whose idea represents the one outlined in the post.
Basic preparation of dataset
You may use the one you mentioned above or start with something easier like 20 newsgroups from scikit-learn.
First steps should be roughly this:
strip the metadata (if any) from your dataset (those might be HTML tags, some headers etc.)
split your text into sentences using a pre-made library (mentioned above)
Next, you would like to create your target (e.g. words to be filled) in each sentence.
Each word should be replaced by a special token (say <target-token>) and moved to target.
sentence: Neural networks can do some stuff.
would give us the following sentences and their respective targets:
sentence: <target-token> networks can do some stuff. target: Neural
sentence: Neural <target-token> can do some stuff. target: networks
sentence: Neural networks <target-token> do some stuff. target: can
sentence: Neural networks can <target-token> some stuff. target: do
sentence: Neural networks can do <target-token> stuff. target: some
sentence: Neural networks can do some <target-token>. target: stuff
sentence: Neural networks can do some stuff <target-token> target: .
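The target creation listed above can be sketched in a few lines of plain Python. The sentence is assumed to be tokenized already (with the final period as its own token), and `make_fill_in_examples` is a hypothetical helper name.

```python
TARGET_TOKEN = "<target-token>"


def make_fill_in_examples(tokens):
    """Return one (masked_tokens, target_word) pair per position in the sentence."""
    examples = []
    for position, word in enumerate(tokens):
        masked = tokens[:position] + [TARGET_TOKEN] + tokens[position + 1:]
        examples.append((masked, word))
    return examples


tokens = ["Neural", "networks", "can", "do", "some", "stuff", "."]
examples = make_fill_in_examples(tokens)

for masked, target in examples:
    print("sentence:", " ".join(masked), "target:", target)
```

Each sentence of n tokens yields n training pairs, so in a large corpus you may want to sample positions instead of masking every one.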
You should adjust this approach to the problem at hand by correcting typos if there are any, tokenizing, lemmatizing and so on. Experiment!
Each word in each sentence should be replaced by an integer, which in turn points to its embedding.
I would advise you to use a pre-trained one. spaCy provides word vectors, but another interesting approach I would highly recommend is in the open source library flair.
You may train your own, but it would take a lot of time + a lot of data for unsupervised training, and I think it is way beyond the scope of this question.
Data batching
One should use PyTorch's torch.utils.data.Dataset and torch.utils.data.DataLoader.
In my case, a good idea was to provide a custom collate_fn to DataLoader, which is responsible for creating padded batches of data (or batches already represented as torch.nn.utils.rnn.PackedSequence).
Important: currently, you have to sort the batch by length (word-wise) and keep the indices needed to "unsort" the batch into its original form; you should remember that during implementation. You may use torch.sort for that task. In future versions of PyTorch there is a chance one might not have to do that, see this issue.
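The sort-and-unsort bookkeeping described above is easy to get wrong, so here is a minimal plain-Python sketch of it; `sort_by_length` and `unsort` are hypothetical helpers, and in a real pipeline the same logic would live inside your collate_fn, typically using torch.sort. To my knowledge, later PyTorch releases also accept enforce_sorted=False in torch.nn.utils.rnn.pack_padded_sequence, which removes the need for manual sorting, but check the documentation of the version you use.

```python
def sort_by_length(batch):
    """Sort sequences longest-first; also return each item's original position."""
    order = sorted(range(len(batch)), key=lambda i: len(batch[i]), reverse=True)
    sorted_batch = [batch[i] for i in order]
    return sorted_batch, order


def unsort(sorted_items, order):
    """Put per-sequence results back into the original batch order."""
    restored = [None] * len(order)
    for sorted_position, original_position in enumerate(order):
        restored[original_position] = sorted_items[sorted_position]
    return restored


batch = [[5, 2], [7, 1, 4, 9], [3], [8, 6, 2]]
sorted_batch, order = sort_by_length(batch)

# Pretend the model produced one result per sorted sequence, then undo the sort.
restored = unsort(sorted_batch, order)
```

The `order` list is what has to be carried alongside the batch so that predictions can be matched back to their targets.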
Oh, and remember to shuffle your dataset using DataLoader, while we're at it.
You should create a proper model by inheriting from torch.nn.Module. I would advise you to create a more general model, where you can provide PyTorch's cells (like GRU, LSTM or RNN), multilayered and bidirectional (as is described in the post).
Something along those lines when it comes to model construction:
import torch


class Filler(torch.nn.Module):
    def __init__(self, cell, embedding_words_count: int):
        super().__init__()
        self.cell = cell
        # We want to output vector of N
        self.linear = torch.nn.Linear(self.cell.hidden_size, embedding_words_count)

    def forward(self, batch):
        # Assuming batch was properly prepared before passing into the network
        output, _ = self.cell(batch)
        # Batch shape[0] is the length of longest already padded sequence
        # Batch shape[1] is the length of batch, e.g. 32
        # Here we create a view, which allows us to concatenate bidirectional layers in general manner
        output = output.view(
            batch.shape[0],
            batch.shape[1],
            2 if self.cell.bidirectional else 1,
            self.cell.hidden_size,
        )
        # Here outputs of bidirectional RNNs are summed, you may concatenate it
        # It makes up for an easier implementation, and is another often used approach
        summed_bidirectional_output = output.sum(dim=2)
        # Linear layer needs batch first, we have to permute it.
        # You may also try with batch_first=True in self.cell and prepare your batch that way
        # In such case no need to permute dimensions
        linear_input = summed_bidirectional_output.permute(1, 0, 2)
        return self.linear(linear_input)
As you can see, information about shapes can be obtained in a general fashion. Such an approach will allow you to create a model with as many layers as you want, bidirectional or not (the batch_first argument is problematic, but you can get around it in a general way too; I left it out for clarity), see below:
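model = Filler(
    # Size of your embeddings (e.g. 768 for BERT-base, 300 for spaCy's word vectors); values here are illustrative
    torch.nn.GRU(input_size=300, hidden_size=100, num_layers=3, bidirectional=True),
    # How many unique words are there in your dataset (illustrative value)
    embedding_words_count=10000,
)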
You may pass torch.nn.Embedding into your model (if pretrained and already filled), create it from a numpy matrix, or use a plethora of other approaches; it's highly dependent on how you structure your code. Still, please make your code more general and do not hardcode shapes unless it's totally necessary (usually it's not).
Remember it's only a showcase, you will have to tune and fix it on your own.
This implementation returns logits and no softmax layer is used. If you wish to calculate perplexity, you may have to add one in order to obtain a correct probability distribution across all possible words.
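As a rough sketch of that last step, the plain-Python snippet below turns logits for the blank position into probabilities with softmax, ranks candidate words, and computes perplexity as the exponential of the mean negative log-probability. The four-word vocabulary, the logit values and the helper names are all made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(logits, vocabulary, top_k=3):
    """Return the top_k (word, probability) pairs for the blank position."""
    probabilities = softmax(logits)
    ranked = sorted(zip(vocabulary, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def perplexity(token_probabilities):
    """exp of the mean negative log-probability of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(mean_nll)


vocabulary = ["hyper", "sad", "scared", "banana"]
logits = [2.1, 1.3, 0.7, -3.0]  # hypothetical model output for the blank
top_words = rank_candidates(logits, vocabulary)

for rank, (word, probability) in enumerate(top_words, start=1):
    print(f"{rank}. {word} (p={probability:.3f})")
```

This gives a ranked list in the same form as the one asked for in the question, with a probability next to each candidate; a perplexity figure for a whole sentence can be obtained by passing the probabilities of its observed tokens to `perplexity`.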
BTW: Here is some info on concatenation of bidirectional output of RNN.
Model training
I would highly recommend PyTorch ignite as it's quite customizable, you can log a lot of info using it, perform validation and abstract cluttering parts like for loops in training.
Oh, and split your model, training and others into separate modules, don't put everything into one unreadable file.
Final notes
This is the outline of how I would approach this problem, you may have more fun using attention networks instead of merely using the last output layer as in this example, though you shouldn't start with that.
And please check PyTorch's 1.0 documentation and do not follow blindly tutorials or blog posts you see online as they might be out of date really fast and quality of the code varies enormously. For example torch.autograd.Variable is deprecated as can be seen in the link.

LSTM Embedding output size and No. of LSTM

I am not sure why the embedding output vector has size 32 while the LSTM has 100 units.
What confuses me is this: if each word vector has only 32 dimensions, shouldn't an LSTM with 32 units be big enough to hold it when it is fed into the LSTM?
Those are hyper-parameters of your model and there is no best way of setting them without experimentation. In your case, embedding single words into a vector of dimension 32 might be enough, but the LSTM will process a sequence of them and might require more capacity (ie dimensions) to store information about multiple words. Without knowing the objective or the dataset it is difficult to make an educated guess on what those parameters would be. Often we look at past research papers tackling similar problems and see what hyper-parameters they used and then tune them via experimentation.
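One way to see that the two sizes are independent is to count the weights: the embedding size only sets the width of the LSTM's input weights, while the number of units sets its capacity. The sketch below uses the standard LSTM formulation with four gates and a single bias vector per gate, which is, as far as I know, how Keras counts parameters (PyTorch keeps two bias vectors, so its count differs slightly); `lstm_parameter_count` is a hypothetical helper.

```python
def lstm_parameter_count(input_dim, units):
    """Weights in a standard LSTM layer: 4 gates, each with input weights,
    recurrent weights and one bias vector."""
    per_gate = units * input_dim + units * units + units
    return 4 * per_gate


embedding_dim = 32

# LSTM with 100 units fed by 32-dimensional word vectors.
params_100_units = lstm_parameter_count(embedding_dim, 100)
# LSTM with 32 units fed by the same vectors: valid as well, just less capacity.
params_32_units = lstm_parameter_count(embedding_dim, 32)

print(params_100_units, params_32_units)
```

Both configurations are valid; the larger one simply has more room to summarize a whole sequence of 32-dimensional vectors, at the cost of more parameters to train.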

Multilabel classification using LSTM on variable length signal using Keras

I have recently started working on classifying ECG signals into various classes. It is basically a multi-label classification task (4 classes in total). I am new to Deep Learning, LSTMs and Keras, which is why I am confused about a few things.
I am thinking about giving normalized original signal as input to the network, is this a good approach?
I also need to understand the training input shape for an LSTM, as ECG signals are of variable length (9000 to 18000 samples) and classifiers usually need fixed-length input. How can I handle this type of input with an LSTM?
Finally, what should the structure of a deep LSTM network be for such lengthy input, and how many layers should I use?
Thanks for your time.
I am thinking about giving normalized original signal as input to the network, is this a good approach?
Yes this is a good approach. It is actually quite standard for Deep Learning algorithms to give them your input normalized or rescaled.
This usually helps your model converge faster, as you are now inside a smaller range (e.g. [-1, 1]) instead of the larger un-normalized range of your original input (say [0, 1000]). It can also help you get better, more precise results, as it mitigates problems like vanishing gradients and suits modern activation functions and optimizers better.
I also need to understand the training input shape for an LSTM, as ECG signals are of variable length (9000 to 18000 samples) and classifiers usually need fixed-length input. How can I handle this type of input with an LSTM?
This part is really important. You are correct that, within a batch, the LSTM expects all inputs to share one shape. (Keras can accept a different number of timesteps from batch to batch, or padded sequences combined with masking, but fixing the length beforehand is the simplest route.) The expected shape is explained in the Keras docs on Recurrent Layers, where they say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
As we can see, it expects your data to have a number of timesteps as well as a dimension on each one of those timesteps (the batch size is the number of sequences processed together). To exemplify, suppose one input sequence consists of elements like [[1,4],[2,3],[3,2],[4,1]]. Then, using a batch_size of 1, the shape of your data would be (1,4,2), as you have 4 timesteps, each with 2 features.
So, bottom line, you have to make sure that you pre-process your data so it has a fixed shape you can then pass to your LSTM layers. This one you will have to find out by yourself, as you know your data and problem better than we do.
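The (1,4,2) shape can be checked with a few lines of plain Python; `nested_shape` is a hypothetical helper that assumes non-empty, rectangular nested lists.

```python
def nested_shape(data):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


# One sequence: 4 timesteps, each with 2 features.
sequence = [[1, 4], [2, 3], [3, 2], [4, 1]]
# Wrapping it in a list adds the batch dimension (batch_size = 1).
batch = [sequence]

print(nested_shape(sequence))  # (timesteps, input_dim)
print(nested_shape(batch))     # (batch_size, timesteps, input_dim)
```

For a single-lead ECG signal, the same reasoning would give one feature per timestep, so a batch of such signals would have shape (batch_size, timesteps, 1).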
Maybe you can fix the samples you obtain from your signal, discarding some and keeping others so every signal is of the same length (if you say your signals are between 9k and 18k choosing 9000 could be the logical choice, discarding samples from the others you get). You could even do some other conversion to your data in a way that you can map from inputs of 9000-18000 to a fixed size.
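A minimal sketch of the "make every signal the same length" idea, in plain Python: signals longer than the target are cut and shorter ones are zero-padded. The target of 9000 samples follows the suggestion above; `to_fixed_length`, `center_crop` and the padding value are assumptions, and which part of the signal to keep is a choice you would have to validate on your data.

```python
def to_fixed_length(signal, target_length, pad_value=0.0):
    """Cut a signal to target_length, or pad its end if it is too short."""
    if len(signal) >= target_length:
        return signal[:target_length]
    return signal + [pad_value] * (target_length - len(signal))


def center_crop(signal, target_length):
    """Alternative: keep the middle target_length samples of a long signal."""
    start = max(0, (len(signal) - target_length) // 2)
    return signal[start:start + target_length]


TARGET_LENGTH = 9000

# Stand-ins for real ECG recordings of 18000 and 9000 samples.
long_signal = [float(i) for i in range(18000)]
short_signal = [float(i) for i in range(9000)]

fixed = [to_fixed_length(s, TARGET_LENGTH) for s in (long_signal, short_signal)]
cropped = center_crop(long_signal, TARGET_LENGTH)
```

Another option along the same lines is to cut each long recording into several 9000-sample windows and classify each window, which avoids discarding data.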
Finally, what should the structure of a deep LSTM network be for such lengthy input, and how many layers should I use?
This one is really quite broad and doesn't have a unique answer. It would depend on the nature of your problem, and determining those parameters a priori is not so straightforward.
What I recommend you do is to start with a simple model first, and then add layers and blocks (neurons) incrementally until you are satisfied with the results.
Try just one hidden layer first, train and test your model and check your performance. You can then add more blocks and see if your performance improved. You can also add more layers and check for the same until you are satisfied.
This is a good way to create Deep Learning models, as you will arrive at the results you want while keeping your network as lean as possible, which in turn reduces execution time and complexity. Good luck with your coding, hope you find this useful.

