LSTM skews predictions towards one value - python

I am trying to train an LSTM followed by a Dense layer in Keras, with numerical input sequences of different lengths. The numbers are in the range [1, 13], and every sequence ends with the same number, 13 in my case.
I train the controller on a few sequences, use the trained model to generate a few more sequences with the same properties, add them to the training set, and train the LSTM again. As this loop goes on, the LSTM's predictions start converging towards the final value of each sequence.
The sequences are padded to a certain maximum length, so the x_train data has shape (None, max_len-1) and the y_train data is the categorical encoding of the last element of each input sequence. In this case, every element of y_train is identical: the one-hot encoded vector for the number 13.
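For concreteness, here is a minimal sketch of the setup just described (the sequence values and max_len are made up for illustration):

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# three hypothetical sequences of different lengths, all ending in 13
sequences = [[3, 7, 13], [1, 4, 9, 2, 13], [5, 13]]
max_len = 6

# inputs: everything except the final element, padded to max_len - 1
x_train = pad_sequences([s[:-1] for s in sequences], maxlen=max_len - 1)
# targets: the last element of every sequence - identical one-hot rows
y_train = to_categorical([s[-1] for s in sequences], num_classes=14)
print(x_train.shape)  # (3, 5)
print(y_train)        # every row is the one-hot vector for 13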
Is the way input and output data is structured the reason for this skewing of predictions?
Is there a way to work around it?

Pytorch - Token embeddings using Character level LSTM

I'm trying to train a neural network that classifies a sequence of words. Based on a paper I'm trying to replicate, I'd need to have both token-level embeddings and character-level embeddings of tokens.
For example, take this sentence:
The shop is open
I need 2 embeddings - one is the normal nn.Embedding layer for the token-level embedding (very simplified!):
[The, shop, is, open] -> nn.Embedding -> [4,3,7,2]
the other is a BiLSTM embedding on the character-level:
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]] -> nn.LSTM -> [9,10,23,5]
Both of them produce word-level embeddings but on a different scale. I tried working out how to do this in PyTorch but I can't seem to do it. The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n]), but that will only produce one embedding.
If anyone could help that would be appreciated.
The only time I can do them both at the same time is if I pass the characters as one long sequence ([t,h,e,s,h,o,p,i,s,o,p,e,n])
Essentially, what you have to do is:
Split sentences into words (each sentence has (or will have) its respective nn.Embedding)
Split each word into single letters (essentially adding another dimension)
About the second point
Compare word-level embeddings:
[The, shop, is, open]
This is a single example; let's assume each word is encoded with a 300-dimensional vector. So you get a shape of (1, 4, 300) (batch goes first; padding is also needed, as usual with RNNs). This data can go directly into an RNN or similar "text" models.
[[T,h,e], [s,h,o,p], [i,s], [o,p,e,n]]
In this case, we would have data of shape (1, 4, 4, 50) (assuming a 50-dimensional vector for a single letter). Notice that each word has been padded to the length of the longest word!
Such input cannot go into an RNN for obvious reasons (it is 4D instead of the required 3D). But notice that each word can be treated independently (as a different sample), so we can go to shape (4, 4, 50) (a reshape is needed, see the sketch below), where the zeroth dimension corresponds to single words, the first to the letters contained in each word, and the last is the vector dimensionality.
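A minimal demonstration of that reshape, using the shapes from the example above (random tensors stand in for real embeddings):

import torch

# (batch, words, letters, letter_embedding) as in the example above
chars = torch.randn(1, 4, 4, 50)
# merge the batch and word dimensions: each word becomes its own sample
chars = chars.view(-1, 4, 50)
print(chars.shape)  # torch.Size([4, 4, 50]), ready for nn.LSTM(batch_first=True)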
For batches of data
In general, for word-level encoding it is pretty simple, as you always have (batch, timesteps, embedding).
For character level, you should form your data into vector of shape (batch, word_timesteps, character_timesteps, embedding), which has to be transformed into (batch * word_timesteps, character_timesteps, embedding).
This requires some fun with padding, and the batch size grows really fast, so splitting the data might be needed.
Output from character level LSTM
You should get (batch * word_timesteps, network_embedding) as output (remember to take last timestep from each word!). In our case it would be (4, 50).
Given that, you can reshape this matrix into (batch, timesteps, dimension) ((1, 4, 50) in our case).
Finally, you can concatenate this embedding with the word-level embedding across the last dimension to get a (1, 4, 350) output matrix in total. You can pass this into another RNN layer or proceed however you wish.
Additional points
If you wish to keep information between words for the character-level embedding, you would have to pass the hidden_state across the N elements in the batch (where N is the number of words in the sentence). That might make it a little harder, but it should be doable; just remember that an LSTM can effectively handle on the order of 100-1000 timesteps (AFAIK), and with long sentences you can easily exceed that number of letters.
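Putting the pieces above together, here is a minimal sketch of the whole pipeline in PyTorch (vocabulary sizes, layer widths and variable names are illustrative assumptions, not code from the paper):

import torch
import torch.nn as nn

batch, words, letters = 1, 4, 4                # hypothetical padded sizes
word_dim, char_dim, char_hidden = 300, 50, 25  # 2 * 25 = 50-D char output

word_emb = nn.Embedding(1000, word_dim)        # token-level embedding
char_emb = nn.Embedding(30, char_dim)          # letter-level embedding
char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)

word_ids = torch.randint(0, 1000, (batch, words))         # (1, 4)
char_ids = torch.randint(0, 30, (batch, words, letters))  # (1, 4, 4)

words_vec = word_emb(word_ids)                            # (1, 4, 300)

chars_vec = char_emb(char_ids)                            # (1, 4, 4, 50)
chars_vec = chars_vec.view(batch * words, letters, char_dim)  # (4, 4, 50)
_, (h_n, _) = char_lstm(chars_vec)                        # h_n: (2, 4, 25)
# last hidden state of each direction, concatenated -> (4, 50)
char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
char_repr = char_repr.view(batch, words, -1)              # (1, 4, 50)

combined = torch.cat([words_vec, char_repr], dim=-1)      # (1, 4, 350)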

Understanding what exactly neural network is predicting in documentation example (MNIST)

I've taken a quick course in neural networks to better understand them and now I'm trying them out for myself in R. I'm following this documentation of Keras.
The way I understand what is happening:
We are inputting a series of images and transforming these images into numerical matrices based on the arrangement of the pixels and the colors of those pixels. We then build a neural network model to learn the pattern of these arrangements, depending on the classification (0 to 9). We then use the model to predict which class an image belongs to. I'll be honest and admit I'm not entirely sure what y_train and x_train are. I simply see them as one training set and one validation set, so I'm not sure what the difference between x and y is.
My question:
I've followed the steps to the T and the model runs fine and the predictions look like they do in the documentation. Ultimately, the prediction looks like this:
I take this to mean that observation 1 in x_test is predicted to be category 7.
However, looking at x_test it looks like this:
There is a 0 in every column and row, even if I scroll further down. This is where I get confused. I'm also not sure how to view the original images, to check for myself how well the model is predicting them. I would eventually like to draw a number myself in Paint or so and then see if the model can predict it, but for that I first need to understand what is going on. I feel I am close but I just need a little nudge!
I think it would help to read more about the dimensions of the input and output layers.
In your example:
Input layer:
A single training image has two dimensions, 28*28, and is converted into a single vector of dimension 784. This acts as the input layer for the neural network.
So for m training examples, your input layer will have dimensions (m, 784). Speaking analogically (to traditional ML systems), you can imagine that each pixel of an image is converted into a feature (x1, x2, ... x784), and your training set is a dataframe with m rows and 784 columns, which is then fed into the neural network to compute y_hat = f(x1, x2, ..., x784).
Output layer:
As an output, we want the neural network to predict which number it is, from 0 to 9. So for a single training example the output layer has dimension 10, representing each number from 0 to 9, and for n test examples the output layer would be a matrix of dimension n*10.
Our y is a vector of length n, something like [1,7,8,2,...], containing the true value for each test example. But to match the dimensions of the output layer, the y vector is converted using one-hot encoding: a length-10 vector that represents the number 7 by putting a 1 at the position for 7 (counting positions from 0) and zeros everywhere else, i.e. [0,0,0,0,0,0,0,1,0,0].
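The tutorial is in R, but the same preprocessing can be sketched in Python for concreteness (assuming the standard Keras MNIST loader and to_categorical helper):

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# flatten each 28x28 image into a 784-dimensional feature vector, scaled to [0, 1]
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# one-hot encode the labels: 7 -> [0,0,0,0,0,0,0,1,0,0]
y_train = to_categorical(y_train, num_classes=10)
print(x_train.shape, y_train.shape)  # (60000, 784) (60000, 10)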
So in your question, if you wish to see the original image, you should be able to view it before reshaping the training examples, with something like image(mnist$test$x[1, , ]).
Hope this helps!!
y_train are the labels and x_train is the training data, i.e. the images in this example. You need to use some kind of plotting library to plot the x's. In this example you are probably not expected to input your own drawings; if you want to, you would need to preprocess them in the same way as MNIST and pass them to the model.

How to use Multivariate time-series prediction with Keras, when multiple samples are used

As the title states, I am doing multivariate time-series prediction. I have some experience with this situation and was able to successfully setup and train a working model in TF Keras.
However, I did not know the 'proper' way to handle having multiple unrelated time-series samples. I have about 8,000 unique sample 'blocks' with anywhere from 800 to 30,000 time steps per sample. Of course I couldn't concatenate them all into one single time series, because the first points of sample 2 are not related in time to the last points of sample 1.
Thus my solution was to fit each sample individually in a loop (at great inefficiency).
My new idea is: can/should I pad the start of each sample with empty time steps equal to the amount of look-back for the RNN, and then concatenate the padded samples into one time series? This would mean that the first time step has look-back data of mostly 0s, which sounds like another 'hack' for my problem and not the right way to do it.
The main challenge is in 800 vs. 30,000 timesteps, but nothing you can't do.
Model design: group sequences into chunks - for example, 30 sequences of 800-to-900 timesteps, padded, then 60 sequences of 900-to-1000, etc. - don't have to be contiguous (i.e. next can be 1200-to-1500)
Input shape: (samples, timesteps, channels) - or equivalently, (sequences, timesteps, features)
Layers: Conv1D and/or RNNs - e.g. GRU, LSTM. Each can handle variable timesteps
Concatenation: don't do it. If each of your sequences is independent, then each must be fed along dimension 0 in Keras - the batch or samples dimension. If they are dependent, e.g. a multivariate timeseries like many channels in a signal, then feed them along the channels dimension (dim 2). But never concatenate along the timeseries dimension, as it implies causal continuity where none exists.
Stateful RNNs: can help in processing long sequences - info on how they work here
RNN capability: is limited w.r.t. long sequences, and 800 is already in the danger zone even for LSTMs; I'd suggest dimensionality reduction via either autoencoders or CNNs with strides > 1 at the input, then feeding their outputs to RNNs.
RNN training: is difficult. Long train times, hyperparameter sensitivity, vanishing gradients - but, with proper regularization, they can be powerful. More info here
Zero-padding: before/after/both - debatable, you can read about it, but probably steer clear of "both", as learning to ignore padding is easier with one locality; I personally use "before"
RNN variant: use CuDNNLSTM or CuDNNGRU whenever possible, as they are 10x faster
Note: "samples" above, and in machine learning, refers to independent examples / observations, rather than measured signal datapoints (which would be referred to as timesteps).
Below is minimal code for what a timeseries-suited model would look like:
from tensorflow.keras.layers import Input, Conv1D, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np
def make_data(batch_shape):  # dummy data
    return (np.random.randn(*batch_shape),
            np.random.randint(0, 2, (batch_shape[0], 1)))

def make_model(batch_shape):  # example model
    ipt = Input(batch_shape=batch_shape)
    x = Conv1D(filters=16, kernel_size=10, strides=2, padding='valid')(ipt)
    x = LSTM(units=16)(x)
    out = Dense(1, activation='sigmoid')(x)  # assuming binary classification
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-3), 'binary_crossentropy')
    return model
batch_shape = (32, 100, 16) # 32 samples, 100 timesteps, 16 channels
x, y = make_data(batch_shape)
model = make_model(batch_shape)
model.train_on_batch(x, y)

Processing Text for Classification with Keras

I'm trying to train a basic text classification NN using Keras. I downloaded 12,500 positive and 12,500 negative movie reviews from a website. However, I'm having trouble processing the data into something Keras can use.
First, I open the 25000 text files and store each file into an array. I then run each array (one positive and one negative) through this function:
from keras.preprocessing.text import text_to_word_sequence, hashing_trick

def process_for_model(textArray):
    '''
    Given a 2D array of the form:
    [[fileLines1],[fileLines2]...[fileLinesN]]
    converts the text into integers
    '''
    result = []
    for file_ in textArray:
        inner = []
        for line in file_:
            length = len(set(text_to_word_sequence(line)))
            inner.append(hashing_trick(line, round(length * 1.3), hash_function='md5'))
        result.append(inner)
    return result
With the purpose of converting the words into numbers to get them close to something a Keras model can use.
I then append the converted numbers into a single array, along with appending a 0 or 1 to another array as labels:
training_labels = []
train_batches = []

for i in range(len(positive_encoded)):
    train_batches.append(positive_encoded[i])
    training_labels.append([0])

for i in range(len(negative_encoded)):
    train_batches.append(negative_encoded[i])
    training_labels.append([1])
And finally I convert each array to a np array:
from numpy import array

train_batches = array(train_batches)
training_labels = array(training_labels)
However, I'm not really sure where to go from here. Each review is, I believe, 168 words. I don't know how to create an appropriate model for this data or how to properly scale all the numbers to be between 0 and 1 using sklearn.
The things I am most confused on are: how many layers should I have, how many neurons each layer should have, and how many input dimensions should I have for the first layer.
Should I be taking another approach entirely?
Here is quite a good tutorial with Keras and this dataset: https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
You can also use Keras official tutorial for text classification.
It basically downloads 50k reviews from the IMDB set, equally balanced (half positive, half negative). They split (randomly) half for training, half for testing, and take 10k (40%) of the training examples as a validation set.
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The reviews are already in their word-dictionary representation (i.e. each review is an array of numbers). The total dictionary has about 80k+ words, but they only use the top 10k most frequent words (all the other words in a particular review are mapped to a special token - unknown ('<UNK>')).
(In the tutorial they create a reversed word dictionary - for the sake of showing you the original reviews. But it's not important.)
Each review is at most 256 words, so they pre-process each review and pad it with 0 (the <PAD> token) in case it's shorter. (Padding is done post, i.e. at the end.)
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
Their NN architecture consists of 4 layers:
Input Embedding layer: takes a batch of reviews, each a 256-long vector whose numbers are in [0, 10000), and learns a 16-dimensional vector (for each word) to represent them.
Global Average Pooling layer: average over all the words (16-D representation) in a review, and gives you a single 16 dimensional vector to represent the whole review.
Fully connected dense layer of 16 nodes - the 'vanilla' NN layer. They chose a ReLU activation function.
An output layer of 1 node: with a sigmoid activation function - gives a number from 0 to 1 which represents the confidence it's a positive/negative review.
Here is the code for it:
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
Then they compile and fit the model:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
In summary - they chose to simplify what could have been a 10k-dimensional vector down to only 16 dimensions, and then run a one-dense-layer NN - with which they got pretty good results (87%).

How to train and predict Keras LSTM with time series data?

I have figured out how to train an LSTM using just values, but what would the data look like if I wanted to include the time? Perhaps an input dimension of 2, with time as epoch seconds plus normalized values? There may be time gaps in the data and I want the training to reflect that.
Assuming I only want to periodically train the LSTM, since this is an expensive operation, how would you predict values in the future with a gap between the last training time and the first predicted time? For example, let's say I trained the LSTM 3 days ago, but now I want to predict the values for the next day.
All my work so far is based on this article: http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/. But it doesn't cover these kinds of questions.
I think you can handle this situation when constructing your training set, at least if the time delay between the last value (in the input sequence) and the value to predict is fixed.
Let X_train have dimension (nb_samples, timesteps, input_dim) and y_train have dimension (nb_samples, output_dim). Let x be one training input sample; it corresponds to a multivariate time series of dimension (timesteps, input_dim), and its corresponding output is y, of dimension (output_dim).
In y you put the value to predict, which can be 3 days after the last value in x; the LSTM "should" grasp the temporal dependency. So if the time delay between the last value in the input and the value to predict is fixed, this should work.
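As a minimal sketch of that construction (the series, window length and 3-day horizon are illustrative assumptions):

import numpy as np

def make_windows(series, timesteps, horizon):
    # build (X, y) pairs where y lies `horizon` steps after each input window
    X, y = [], []
    for i in range(len(series) - timesteps - horizon + 1):
        X.append(series[i:i + timesteps])
        y.append(series[i + timesteps + horizon - 1])
    return np.array(X), np.array(y)

# hypothetical hourly univariate series; predict 3 days (72 steps) ahead
series = np.sin(np.arange(1000) / 24.0).reshape(-1, 1)  # (1000, input_dim=1)
X_train, y_train = make_windows(series, timesteps=48, horizon=72)
print(X_train.shape, y_train.shape)  # (881, 48, 1) (881, 1)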
That was the case for such a problem: https://challengedata.ens.fr/en/challenge/9/prediction_of_transaction_volumes_in_financial_markets.html
