I have a problem where I have texts ( that can be very long max ~9000 words) that I need to embed with Keras Layer. I choose the fixed size 5000 for every text and I need to pad each sequence to get to the right shape. The classical way is to use Keras' pad_sequence that take as input list of lists of indexes and pad with zeros or cut the lists of indexes to 5000.
For my downstream task, I use a sort of convnet inspired by Kim's Paper (https://arxiv.org/abs/1408.5882). My concern is that the network learns in a certain sense the Wordcount by detecting the pattern of vectors that embed the 0 I used to pad the sequences. I am not saying that this feature is not important but I would like to force the network to learn other features in preferences. I was thinking about two things, first using an additional task (like an adversarial task) that take the latent representation created by the model before the output and use a branch of the model to predict the size of the text or a cluster of size, for example :
[,1000 words] -- cluster 1
[1001,2000words] -- cluster 2
ect..
Then use the output to encourage the network to map other information in the latent space by adding an adversarial loss to the main loss term. My other idea was instead of using zeros' vectors to pad the embed the zero paddings, we could use random vectors, generated on the fly while training. (every time the network sees a particular index, for example -1, it knows it has to generate a random vector). I was thinking that it breaks the symmetry introduced by using zeros vectors and helps the model to generalize better as it introduces noise in the training process.
As I didn't find any papers on this task of padding with something else than zeros, I turn to the community. What do you think? I went through the Embedding layer implementation and I am pretty sure that the implementation of the second idea is pretty straightforward in keras by changing the K.gather() by a flag for the right indexes (It would be longer execution time though).
Thanks in advance for your feedback and your ressources !
I already asked this question here: Can Convolutional Neural Networks (CNN) be represented by a Mathematical formula? but I feel that I was not clear enough and also the proposed idea did not work for me.
Let's say that using my computer, I train a certain machine learning algorithm (i.e. naive bayes, decision tree, linear regression, and others). So I already have a trained model which I can give a input value and it returns the result of the prediction (i.e. 1 or 0).
Now, let's say that I still want to give an input and get a predicted output. However, at this time I would like that my input value to be, for example, multiplied by some sort of mathematical formula, weights, or matrix that represents my "trained model".
In other words, I would like that my trained model "transformed" in some sort of formula which I can give an input and get the predicted number.
The reason why I want to do this is because I wanna train a big dataset and use complex prediction model. And use this trained prediciton model in simpler hardwares such as a PIC32 microcontroler. The PIC32 Microntroler would not train the machine learning or store all inputs. Instead, the microcontroler would simple read from the system certain numbers, apply a math formula or some sort of matrix multiplication and give me the predicted output. With that, I can use "fancy" neural networks in much simpler devices that can easily operate math formulas.
If I read this properly, you want a generally continuous function in many variables to replace a CNN. The central point of a CNN existing in a world with ANNs ("normal" neural networks) is that in includes irruptive transformations: non-linearities, discontinuities, etc. that enable the CNN to develop recognitions and relationships that simple linear combinations -- such as matrix multiplication -- cannot handle.
If you want to understand this better, I recommend that you choose an introduction to Deep Learning and CNNs in whatever presentation mode fits your learning styles.
Essentially, every machine learning algorithm is a parameterized formula, with your trained model being the learned parameters that are applied to the input.
So what you're actually asking is to simplify arbitrary computations to, more or less, a matrix multiplication. I'm afraid that's mathematically impossible. If you ever do come up with a solution to this, make sure to share it - you'd instantly become famous, most likely rich, and put a hell of a lot of researchers out of business. If you can't train a matrix multiplication to get the accuracy you want from the start, what makes you think you can boil down arbitrary "complex prediction models" to such simple computations?
I am working on a sequence prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features) where each sample is independent, number of time steps is uniform for each sample (after pre-padding the length with 0's using keras.pad_sequences), and my number of features is 2. To summarize my question(s), I am wondering how to structure my Y-label dataset to feed the model and want to gain some insight on how to properly structure my model to output what I want.
My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.
This is a main source I've been referencing to try and understand how to create a generator for use with keras.fit_generator.
[1]
There is no confusion with how the mini-batch for "X" data is grabbed, but for the "Y" data, I am not sure about the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, I guess resulting in a 4? Dimensional numpy matrix?? But I'm kinda lost with how to deal with the second numerical feature.
Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.
Proposed architecture (parameters loosely filled in, nothing set yet):
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...) #ill figure this out
So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
This is my first attempt at doing anything with neural networks or Keras, and I would not say I'm "great" at python, I can get by though. However, I feel I have a decent grasp at the fundamental theoretical concepts, but lack the practice.
This question is sorta open ended, with encouragement to pick apart my current strategy.
Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
Ex. I train on these padded max-len sequences, but in production I want to use this to predict the remaining part of the currently unseen time-steps, which would be variable length.
Okay, so If I understand you properly (correct me if I'm wrong) you would like to predict next features based on the current ones.
When it comes to categorical variables, you are on point, your Dense layer should output N-1 vector containing probability of each class (while we are at it, if you, by any chance, use pandas.get_dummies remember to specify argument drop_first=True, similiar approach should be employed whatever you are using for one-hot encoding).
Except those N-1 output vector for each sample, it should output one more number for numerical value.
Remember to output logits (no activation, don't use softmax at the end like you currently do). Afterwards network output should be separated into N-1 part (your categorical feature) and passed to loss function able to handle logits (e.g. in Tensorflow it is tf.nn.softmax_cross_entropy_with_logits_v2 which applies numerically stable softmax for you).
Now, your N-th element of network output should be passed to different loss, probably Mean Squared Error.
Based on loss value of those two losses (you could take a mean of both to obtain one loss value), you backpropagate through the network and it might do just fine.
Unfortunately I'm not skilled enough in Keras in order to help you with the code, but I think you will figure it out yourself. While we're at it, I would like to suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well, your choice.
Additional 'maybe helpful' thought: you may check Teacher Forcing for your kind of task. More on the topic and theory behind it can be found in the outstanding Deep Learning Book and code example (though in PyTorch once again), can be found in their docs here.
BTW interesting idea, mind if I use it in connection with my current research trajectory (with kudos going to you of course)? Comment on this answer if so we can talk it out in chat.
Basically every answer I was looking for was exampled and explained in this tutorial. Absolutely great resource for trying to understand how to model multi-output networks. This one goes through a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon, however.
https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
I am trying to fill in the blank using a bidirectional RNN and pytorch.
The input will be like: The dog is _____, but we are happy he is okay.
The output will be like:
1. hyper (Perplexity score here)
2. sad (Perplexity score here)
3. scared (Perplexity score here)
I discovered this idea here: https://medium.com/#plusepsilon/the-bidirectional-language-model-1f3961d1fb27
import torch, torch.nn as nn
from torch.autograd import Variable
text = ['BOS', 'How', 'are', 'you', 'EOS']
seq_len = len(text)
batch_size = 1
embedding_size = 1
hidden_size = 1
output_size = 1
random_input = Variable(
torch.FloatTensor(seq_len, batch_size, embedding_size).normal_(), requires_grad=False)
bi_rnn = torch.nn.RNN(
input_size=embedding_size, hidden_size=hidden_size, num_layers=1, batch_first=False, bidirectional=True)
bi_output, bi_hidden = bi_rnn(random_input)
# stagger
forward_output, backward_output = bi_output[:-2, :, :hidden_size], bi_output[2:, :, hidden_size:]
staggered_output = torch.cat((forward_output, backward_output), dim=-1)
linear = nn.Linear(hidden_size * 2, output_size)
# only predict on words
labels = random_input[1:-1]
# for language models, use cross-entropy :)
loss = nn.MSELoss()
output = loss(linear(staggered_output), labels)
I am trying to reimplement the code above found at the bottom of the blog post. I am new to pytorch and nlp, and can't understand what the input and output to the code is.
Question about the input: I am guessing the input are the few words that are given. Why does one need beginning of sentence and end of sentence tags in this case? Why don't I see the input being a corpus on which the model is trained like other classic NLP problems? I would like to use the Enron email corpus to train the RNN.
Question about the output: I see the output is a tensor. My understanding is the tensor is a vector, so maybe a word vector in this case. How can you use the tensor to output the words themselves?
As this question is rather open-ended I will start from the last parts, moving towards the more general answer to the main question posed in the title.
Quick note: as pointed in the comments by #Qusai Alothman, you should find a better resource on the topic, this one is rather sparse when it comes to necessary informations.
Additional note: full code for the process described in the last section would take way too much space to provide as an exact answer, it would be more of a blog post. I will highlight possible steps one should take to create such a network with helpful links as we go along.
Final note: If there is anything dumb down there below (or you would like to expand the answer in any way or form, please do correct me/add info by posting a comment below).
Question about the input
Input here is generated from the random normal distribution and has no connection to the actual words. It is supposed to represent word embeddings, e.g. representation of words as numbers carrying semantic (this is important!) meaning (sometimes depending on the context as well (see one of the current State Of The Art approaches, e.g. BERT)).
Shape of the input
In your example it is provided as:
seq_len, batch_size, embedding_size,
where
seq_len - means length of a single sentence (varies across your
dataset), we will get to it later.
batch_size - how many sentences
should be processed in one step of forward pass (in case of
PyTorch it is the forward method of class inheriting from
torch.nn.Module)
embedding_size - vector with which one word is represented (it
might range from the usual 100/300 using word2vec up to 4096 or
so using the more recent approaches like the BERT mentioned
above)
In this case it's all hard-coded of size one, which is not really useful for a newcomer, it only outlines the idea that way.
Why does one need beginning of sentence and end of sentence tags in this case?
Correct me if I'm wrong, but you don't need it if your input is separated into sentences. It is used if you provide multiple sentences to the model, and want to indicate unambiguously the beginning and end of each (used with models which depend on the previous/next sentences, it seems to not be the case here). Those are encoded by special tokens (the ones which are not present in the entire corpus), so neural network "could learn" they represent end and beginning of sentence (one special token for this approach would be enough).
If you were to use serious dataset, I would advise to split your text using libraries like spaCy or nltk (the first one is a pleasure to use IMO), they do a really good job for this task.
You dataset might be already splitted into sentences, in those cases you are kind of ready to go.
Why don't I see the input being a corpus on which the model is trained like other classic NLP problems?
I don't recall models being trained on the corpuses as is, e.g. using strings. Usually those are represented by floating-points numbers using:
Simple approaches, e.g. Bag Of
Words or
TF-IDF
More sophisticated ones, which provide some information about word
relationships (e.g. king is more semantically related to queen
than to a, say, banana). Those were already linked above, some
other noticeable might be
GloVe or
ELMo and tons of other creative
approaches.
Question about the output
One should output indices into embeddings, which in turn correspond to words represented by a vector (more sophisticated approach mentioned above).
Each row in such embedding represents a unique word and it's respective columns are their unique representations (in PyTorch, first index might be reserved for the words for which a representation is unknown [if using pretrained embeddings], you may also delete those words, or represent them as aj average of sentence/document, there are some other viable approaches as well).
Loss provided in the example
# for language models, use cross-entropy :)
loss = nn.MSELoss()
For this task it makes no sense, as Mean Squared Error is a regression metric, not a classification one.
We want to use one for classification, so softmax should be used for multiclass case (we should be outputting numbers spanning [0, N], where N is the number of unique words in our corpus).
PyTorch's CrossEntropyLoss already takes logits (output of last layer without activation like softmax) and returns loss value for each example. I would advise this approach as it's numerically stable (and I like it as the most minimal one).
I am trying to fill in the blank using a bidirectional RNN and pytorch
This is a long one, I will only highlight steps I would undertake in order to create a model whose idea represents the one outlined in the post.
Basic preparation of dataset
You may use the one you mentioned above or start with something easier like 20 newsgroups from scikit-learn.
First steps should be roughly this:
scrape the metadata (if any) from your dataset (those might be HTML tags, some headers etc.)
split your text into sentences using a pre-made library (mentioned above)
Next, you would like to create your target (e.g. words to be filled) in each sentence.
Each word should be replaced by a special token (say <target-token>) and moved to target.
Example:
sentence: Neural networks can do some stuff.
would give us the following sentences and it's respective targets:
sentence: <target-token> networks can do some stuff. target: Neural
sentence: Neural <target-token> can do some stuff. target: networks
sentence: Neural networks <target-token> do some stuff. target: can
sentence: Neural networks can <target-token> some stuff. target: do
sentence: Neural networks can do <target-token> stuff. target: some
sentence: Neural networks can do some <target-token>. target: some
sentence: Neural networks can do some stuff <target-token> target: .
You should adjust this approach to the problem at hand by correcting typos if there are any, tokenizing, lemmatizing and others, experiment!
Embeddings
Each word in each sentence should be replaced by an integer, which in turn points to it embedding.
I would advise you to use a pre-trained one. spaCy provides word vectors, but another interesting approach I would highly recommend is in the open source library flair.
You may train your own, but it would take a lot of time + a lot of data for unsupervised training, and I think it is way beyond the scope of this question.
Data batching
One should use PyTorch's torch.utils.data.Dataset and torch.utils.data.DataLoader.
In my case, a good idea is was to provide custom collate_fn to DataLoader, which is responsible for creating padded batches of data (or represented as torch.nn.utils.rnn.PackedSequence already).
Important: currently, you have to sort the batch by length (word-wise) and keep the indices able to "unsort" the batch into it's original form, you should remember that during implementation. You may use torch.sort for that task. In future versions of PyTorch, there is a chance, one might not have to do that, see this issue.
Oh, and remember to shuffle your dataset using DataLoader, while we're at it.
Model
You should create a proper model by inheriting from torch.nn.Module. I would advise you to create a more general model, where you can provide PyTorch's cells (like GRU, LSTM or RNN), multilayered and bidirectional (as is described in the post).
Something along those lines when it comes to model construction:
import torch
class Filler(torch.nn.Module):
def __init__(self, cell, embedding_words_count: int):
self.cell = cell
# We want to output vector of N
self.linear = torch.nn.Linear(self.cell.hidden_size, embedding_words_count)
def forward(self, batch):
# Assuming batch was properly prepared before passing into the network
output, _ = self.cell(batch)
# Batch shape[0] is the length of longest already padded sequence
# Batch shape[1] is the length of batch, e.g. 32
# Here we create a view, which allows us to concatenate bidirectional layers in general manner
output = output.view(
batch.shape[0],
batch.shape[1],
2 if self.cell.bidirectional else 1,
self.cell.hidden_size,
)
# Here outputs of bidirectional RNNs are summed, you may concatenate it
# It makes up for an easier implementation, and is another often used approach
summed_bidirectional_output = output.sum(dim=2)
# Linear layer needs batch first, we have to permute it.
# You may also try with batch_first=True in self.cell and prepare your batch that way
# In such case no need to permute dimensions
linear_input = summed_bidirectional_output.permute(1, 0, 2)
return self.linear(embedding_words_count)
As you can see, information about shapes can be obtained in a general fashion. Such approach will allow you to create a model with how many layers you want, bidirectional or not (batch_first argument is problematic, but you can get around it too in a general way, left it out for improved clarity), see below:
model = Filler(
torch.nn.GRU(
# Size of your embeddings, for BERT it could be 4096, for spaCy's word2vec 300
input_size=300,
hidden_size=100,
num_layers=3,
batch_first=False,
dropout=0.4,
bidirectional=True,
),
# How many unique words are there in your dataset
embedding_words_count=10000,
)
You may pass torch.nn.Embedding into your model (if pretrained and already filled), create it from numpy matrix or plethora of other approaches, it's highly dependent how your structure your code exactly. Still, please, make your code more general, do not hardcode shapes unless it's totally necessary (usually it's not).
Remember it's only a showcase, you will have to tune and fix it on your own.
This implementation returns logits and no softmax layer is used. If you wish to calculate perplexity, you may have to add it in order to obtain a correct probability distribution across all possible vectors.
BTW: Here is some info on concatenation of bidirectional output of RNN.
Model training
I would highly recommend PyTorch ignite as it's quite customizable, you can log a lot of info using it, perform validation and abstract cluttering parts like for loops in training.
Oh, and split your model, training and others into separate modules, don't put everything into one unreadable file.
Final notes
This is the outline of how I would approach this problem, you may have more fun using attention networks instead of merely using the last output layer as in this example, though you shouldn't start with that.
And please check PyTorch's 1.0 documentation and do not follow blindly tutorials or blog posts you see online as they might be out of date really fast and quality of the code varies enormously. For example torch.autograd.Variable is deprecated as can be seen in the link.
I was recently learning about neural networks and came across MNIST data set. i understood that a sigmoid cost function is used to reduce the loss. Also, weights and biases gets adjusted and an optimum weights and biases are found after the training. the thing i did not understand is, on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc., becomes features. in MNIST dataset, i did not find any of that. Am i missing something here. Please help me with this
First of all the Network pipeline consists of 3 main parts:
Input Manipulation:
Parameters that effect the finding of minimum:
Parameters like your descission function in your interpretation
layer (often fully connected layer)
In contrast to your regular machine learning pipeline where you have to extract features manually a CNN uses filters. (Filters like in edge detection or viola and jones).
If a filter runs across the images and is convolved with pixels it Produces an output.
This output is then interpreted by a neuron. If the output is above a threshold it is considered as valid (Step function counts 1 if valid or in case of Sigmoid it has a value on the sigmoid function).
The next steps are the same as before.
This is progressed until the interpretation layer (often softmax). This layer interprets your computation (if the filters are good adapted to your problem you will get a good predicted label) which means you have a low difference between (y_guess - y_true_label).
Now you can see that for the guess of y we have multiplied the input x with many weights w and also used functions on it. This can be seen like a chain rule in analysis.
To get better results the effect of a single weight on the input must be known. Therefore, you use Backpropagation which is a derivative of the Error with respect to all w. The Trick is that you can reuse derivatives which is more or less Backpropagation and it becomes easier since you can use Matrix vector notation.
If you have your gradient, you can use the normal concept of minimization where you walk along the steepest descent. (There are also many other gradient methods like adagrad or adam etc).
The steps will repeat until convergence or until you reach the maximum epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)