I have data that looks like this:
It can be viewed here and has been included in the code below.
In actuality I have ~7000 samples (rows), also downloadable.
The task is: given an antigen, predict the corresponding epitope.
So the epitope is always an exact substring of the antigen. This is equivalent to a sequence-to-sequence learning problem.
Here is my code, which runs a recurrent neural network under Keras. It was modeled on the example.
My questions are:
Can an RNN, LSTM or GRU be used to predict a subsequence as posed above?
How can I improve the accuracy of my code?
How can I modify my code so that it can run faster?
Here is my running code, which gives a very bad accuracy score.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import json
import pandas as pd
from keras.models import Sequential
from keras.engine.training import slice_X
from keras.layers.core import Activation, RepeatVector, Dense
from keras.layers import recurrent, TimeDistributed
import numpy as np
from six.moves import range


class CharacterTable(object):
    '''
    Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    '''
    def __init__(self, chars, maxlen):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        self.maxlen = maxlen

    def encode(self, C, maxlen=None):
        maxlen = maxlen if maxlen else self.maxlen
        X = np.zeros((maxlen, len(self.chars)))
        for i, c in enumerate(C):
            X[i, self.char_indices[c]] = 1
        return X

    def decode(self, X, calc_argmax=True):
        if calc_argmax:
            X = X.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in X)


class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'


INVERT = True
HIDDEN_SIZE = 128
BATCH_SIZE = 64
LAYERS = 3
# Try replacing LSTM with GRU or SimpleRNN
RNN = recurrent.LSTM


def main():
    """
    Epitope_core = answers
    Antigen = questions
    """
    epi_antigen_df = pd.io.parsers.read_table("http://dpaste.com/2PZ9WH6.txt")
    antigens = epi_antigen_df["Antigen"].tolist()
    epitopes = epi_antigen_df["Epitope Core"].tolist()
    if INVERT:
        antigens = [x[::-1] for x in antigens]

    allchars = "".join(antigens + epitopes)
    allchars = list(set(allchars))
    aa_chars = "".join(allchars)
    sys.stderr.write(aa_chars + "\n")

    max_antigen_len = len(max(antigens, key=len))
    max_epitope_len = len(max(epitopes, key=len))

    X = np.zeros((len(antigens), max_antigen_len, len(aa_chars)), dtype=np.bool)
    y = np.zeros((len(epitopes), max_epitope_len, len(aa_chars)), dtype=np.bool)

    ctable = CharacterTable(aa_chars, max_antigen_len)

    sys.stderr.write("Begin vectorization\n")
    for i, antigen in enumerate(antigens):
        X[i] = ctable.encode(antigen, maxlen=max_antigen_len)
    for i, epitope in enumerate(epitopes):
        y[i] = ctable.encode(epitope, maxlen=max_epitope_len)

    # Shuffle (X, y) in unison as the later parts of X will almost all be larger digits
    indices = np.arange(len(y))
    np.random.shuffle(indices)
    X = X[indices]
    y = y[indices]

    # Explicitly set apart 10% for validation data that we never train over
    split_at = len(X) - len(X) // 10
    (X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
    (y_train, y_val) = (y[:split_at], y[split_at:])

    sys.stderr.write("Build model\n")
    model = Sequential()
    # "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE.
    # Note: in a situation where your input sequences have a variable length,
    # use input_shape=(None, nb_feature).
    model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
    # For the decoder's input, we repeat the encoded input for each time step
    model.add(RepeatVector(max_epitope_len))
    # The decoder RNN could be multiple layers stacked or a single layer
    for _ in range(LAYERS):
        model.add(RNN(HIDDEN_SIZE, return_sequences=True))
    # For each step of the output sequence, decide which character should be chosen
    model.add(TimeDistributed(Dense(len(aa_chars))))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    # Train the model each generation and show predictions against the validation dataset
    for iteration in range(1, 200):
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=5,
                  validation_data=(X_val, y_val))
        # Select 10 samples from the validation set at random so we can visualize errors
        for i in range(10):
            ind = np.random.randint(0, len(X_val))
            rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
            preds = model.predict_classes(rowX, verbose=0)
            q = ctable.decode(rowX[0])
            correct = ctable.decode(rowy[0])
            guess = ctable.decode(preds[0], calc_argmax=False)
            # print('Q', q[::-1] if INVERT else q)
            print('T', correct)
            print(colors.ok + '☑' + colors.close if correct == guess else colors.fail + '☒' + colors.close, guess)
            print('---')


if __name__ == '__main__':
    main()
Can an RNN, LSTM or GRU be used to predict a subsequence as posed above?
Yes, you can use any of these. LSTMs and GRUs are types of RNNs; if by RNN you mean a fully-connected RNN, these have fallen out of favor because of the vanishing gradients problem (1, 2). Because of the relatively small number of examples in your dataset, a GRU might be preferable to an LSTM due to its simpler architecture.
How can I improve the accuracy of my code?
You mentioned that training and validation error are both bad. In general, this could be due to one of several factors:
The learning rate is too low (not an issue since you're using Adam, a per-parameter adaptive learning rate algorithm)
The model is too simple for the data (not at all the issue, since you have a very complex model and a small dataset)
You have vanishing gradients (probably the issue since you have a 3-layer RNN). Try reducing the number of layers to 1 (in general, it's good to start by getting a simple model working and then increase the complexity), and also consider hyperparameter search (e.g. a 128-dimensional hidden state may be too large - try 30?).
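For illustration, here is what a pared-down version of the model above might look like, with a single GRU layer on each side and a much smaller hidden state. This is only a sketch using the same Keras API as the code in the question; the layer sizes are starting points to tune, not tested values.
from keras.models import Sequential
from keras.layers.core import Activation, RepeatVector, Dense
from keras.layers import recurrent, TimeDistributed

def build_simple_model(max_antigen_len, max_epitope_len, n_chars, hidden_size=30):
    model = Sequential()
    # Single GRU encoder: compress the antigen into one hidden vector
    model.add(recurrent.GRU(hidden_size, input_shape=(max_antigen_len, n_chars)))
    # Repeat the encoding once per output position
    model.add(RepeatVector(max_epitope_len))
    # Single GRU decoder instead of a 3-layer stack
    model.add(recurrent.GRU(hidden_size, return_sequences=True))
    model.add(TimeDistributed(Dense(n_chars)))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model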
Another option, since your epitope is a substring of your input, is to predict the start and end indices of the epitope within the antigen sequence (potentially normalized by the length of the antigen sequence) instead of predicting the substring one character at a time. This would be a regression problem with two tasks. For instance, if the antigen is FSKIAGLTVT (10 letters long) and its epitope is KIAGL (positions 3 to 7, one-based) then the input would be FSKIAGLTVT and the outputs would be 0.3 (first task) and 0.7 (second task).
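As a sketch of that framing, the two regression targets for one (antigen, epitope) pair could be built like this; the 1-based convention and normalization follow the example above, and the function name is mine:
def start_end_targets(antigen, epitope):
    """Return 1-based (start, end) of the epitope, normalized by antigen length."""
    start = antigen.index(epitope) + 1   # 1-based position of the first residue
    end = start + len(epitope) - 1       # 1-based position of the last residue
    n = float(len(antigen))
    return start / n, end / n

# Example from the text: FSKIAGLTVT (length 10) with epitope KIAGL
print(start_end_targets("FSKIAGLTVT", "KIAGL"))  # (0.3, 0.7)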
Alternatively, if you can make all the antigens the same length (by removing parts of your dataset with short antigens and/or chopping off the ends of long antigens, assuming you know a priori that the epitope is not near the ends), you can frame it as a classification problem with two tasks (start and end) and sequence-length classes, where you try to assign a probability to the epitope starting and ending at each position.
How can I modify my code so that it can run faster?
Reducing the number of layers will speed your code up significantly. Also, GRUs will be faster than LSTMs due to their simpler architecture. However, both types of recurrent networks will be slower than, e.g. convolutional networks.
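To give a flavor of a faster architecture, one option is to swap the recurrent encoder for a convolutional one and keep the decoder scheme. This is only a sketch, not something benchmarked on this data; the filter count and kernel size are placeholders, and the import names may need adjusting depending on your Keras version.
from keras.models import Sequential
from keras.layers import Convolution1D, GlobalMaxPooling1D, recurrent, TimeDistributed
from keras.layers.core import Activation, RepeatVector, Dense

def build_conv_encoder_model(max_antigen_len, max_epitope_len, n_chars, hidden_size=64):
    model = Sequential()
    # Convolutional encoder: cheaper than running an RNN over the whole antigen
    model.add(Convolution1D(hidden_size, 5, border_mode='same', activation='relu',
                            input_shape=(max_antigen_len, n_chars)))
    model.add(GlobalMaxPooling1D())
    # Same RepeatVector + recurrent decoder scheme as before
    model.add(RepeatVector(max_epitope_len))
    model.add(recurrent.GRU(hidden_size, return_sequences=True))
    model.add(TimeDistributed(Dense(n_chars)))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model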
Feel free to send me an email (address in my profile) if you're interested in a collaboration.
TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super(SeqEncoderLSTM, self).__init__()
        self.lstm = nn.LSTM(
            n_features,
            latent_size,
            batch_first=True)

    def forward(self, x):
        _, hs = self.lstm(x)
        return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        x = torch.tensor([])
        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0
        # Reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])
        # Reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])
        return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size
        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)

    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(),
                  optimizer=optim.Adam, lr=1e-6, reverse=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)

    train_loss = []
    valid_loss = []

    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()

        if testload is not None:
            model.eval()
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
            model.train()

        train_loss.append(running_tl / len(trainload))
    return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        super().__init__()
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else:
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")

        self.n_features = n_features
        self.seqs = torch.split(data, window)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        try:
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or how long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this by flipping (flipud) the original input after the forward pass but before calculating the loss. This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
TLDR
You try to predict the next timestep's value instead of the difference between the current timestep and the previous one
Your hidden_features number is too small, making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt


def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)

    if subtract:
        initial_values = input_tensor[:, 0, :]
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values

    return input_tensor


if __name__ == "__main__":
    torch.manual_seed(0)

    HIDDEN_SIZE = 10
    SUBTRACT = False

    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()

    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break

    # Plotting
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()
What it does:
get_data either works on the data you provided (if subtract=False) or, if subtract=True, it subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity, and increasing it, helps, and what happens when we use the difference of timesteps instead of the raw timesteps)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. The model is unable to fit and grasp the phenomena presented in the data (hence the flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
The targets are now far from flat lines, but the model is unable to fit due to too little capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better, and our target was hit after 942 steps. No more flat lines; the model capacity seems quite fine (for this single example!).
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to the desired loss after only 215 iterations.
Finally
Usually, use the difference of timesteps instead of the raw timesteps (or some other transformation; see here for more on that). Otherwise the neural network will try to simply... copy the output from the previous step (as that's the easiest thing to do). Some minimum will be found this way, and getting out of it will require more capacity.
When you use the difference between timesteps, there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies
Use a larger model (for the whole dataset you should try a hidden size of around 300, I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs; that way you can get information from the forward and the backward pass of the LSTM (not to be confused with backprop!). This should also boost your score (a sketch follows below)
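A minimal sketch of what a bidirectional encoder could look like in place of SeqEncoderLSTM. The class name, the projection layers, and the way the two directions are merged are my own choices for illustration, not part of the original post:
import torch
import torch.nn as nn

class BiSeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super().__init__()
        self.lstm = nn.LSTM(n_features, latent_size,
                            batch_first=True, bidirectional=True)
        # Project the concatenated forward/backward states back to latent_size
        # so the decoder interface stays the same as before.
        self.proj_h = nn.Linear(2 * latent_size, latent_size)
        self.proj_c = nn.Linear(2 * latent_size, latent_size)

    def forward(self, x):
        _, (h, c) = self.lstm(x)               # h, c: (2, batch, latent_size)
        h = torch.cat([h[0], h[1]], dim=-1)    # (batch, 2 * latent_size)
        c = torch.cat([c[0], c[1]], dim=-1)
        return self.proj_h(h), self.proj_c(c)  # each (batch, latent_size)

# Quick shape check
enc = BiSeqEncoderLSTM(n_features=5, latent_size=10)
h, c = enc(torch.randn(1, 14, 5))
print(h.shape, c.shape)  # torch.Size([1, 10]) torch.Size([1, 10])
If this replaced the original encoder, the h.squeeze(0) step in LSTMEncoderDecoder.forward would no longer be needed, since the projections already return (batch, latent_size) tensors.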
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Differencing removes the urge of the neural network to base its predictions too heavily on the past timestep (by simply taking the last value and maybe changing it a little); a short sketch of the transformation is below.
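For concreteness, the transformation itself is a one-liner on a (batch, time, features) tensor; this toy snippet is my own illustration:
import torch

x = torch.randn(1, 14, 5)        # (batch, time, features), like the data above
dx = x[:, 1:, :] - x[:, :-1, :]  # first differences along the time axis
print(dx.shape)                  # torch.Size([1, 13, 5])
# torch.diff(x, dim=1) gives the same result on recent PyTorch versions;
# get_data above instead keeps the first row and uses torch.roll.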
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming there is no non-linearity involved, which makes the thing harder (see here for a similar case). In the case of LSTMs there are non-linearities; that's one point.
Another is that we are accumulating timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden and cell state, which is highly unlikely.
One last point: depending on the length of the sequence, LSTMs are prone to forgetting the least relevant information (that's what they were designed to do, not only to remember everything), which makes it even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case (it might be here). Why the identity is hard for the network to learn with non-linearities was answered above.
One last point about identity functions: if they were actually easy to learn, ResNet architectures would be unlikely to succeed, since a network could converge to the identity and make "small fixes" to the output without them, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
The key here was, indeed, increasing model capacity. The subtraction trick really depends on the data. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 1000
Other timestep values vary by 1 at most
What would the neural network do (what is easiest here)? It would probably discard this change of 1 or less as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? The whole neural-network target is within a [0, 1] margin for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, come to think of it, it is connected to normalization in some sense.
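A tiny numerical illustration of this extreme case, with my own toy numbers rather than anything from the post:
import numpy as np

rng = np.random.default_rng(0)
steps = rng.uniform(-1, 1, size=100)   # per-step changes of at most 1
raw = 1000 + np.cumsum(steps)          # series hovering around 1000
diff = np.diff(raw)                    # per-step differences, all within [-1, 1]

# "Lazy" constant prediction: just output the mean of the targets everywhere.
rel_err_raw = np.abs(raw - raw.mean()).mean() / np.abs(raw).mean()
rel_err_diff = np.abs(diff - diff.mean()).mean() / np.abs(diff).mean()

print("relative error on raw values:  %.4f" % rel_err_raw)   # tiny
print("relative error on differences: %.4f" % rel_err_diff)  # large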
I'm following a tutorial on recurrent neural networks, and I am training an RNN to learn how to predict the next letter of the alphabet, given a sequence of letters. The problem is, my RAM usage slowly goes up with every epoch I train the network for. I cannot finish training this network because I have "only" 8192 MB of RAM, and it is exhausted after roughly 100 epochs. Why is this? I think it has something to do with the way LSTMs work, since they do keep some information in memory, but it would be nice if someone could explain some more details.
The code I'm using is relatively simple and completely self-contained (you can copy/paste and run it; there is no need for an external dataset, since the dataset is just the alphabet). Therefore I have included it in full, so the problem is easily reproducible.
The tensorflow version I am using is 1.14.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras_preprocessing.sequence import pad_sequences

np.random.seed(7)

# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

num_inputs = 1000
max_len = 5
dataX = []
dataY = []

for i in range(num_inputs):
    start = np.random.randint(len(alphabet)-2)
    end = np.random.randint(start, min(start+max_len, len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print(sequence_in, "->", sequence_out)

# Pad sequences with 0's, reshape X, then normalize data
X = pad_sequences(dataX, maxlen=max_len, dtype="float32")
X = np.reshape(X, (X.shape[0], max_len, 1))
X = X / float(len(alphabet))
print(X.shape)

# One-hot encode the output variable
y = np_utils.to_categorical(dataY)

# Create & fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1)))
model.add(Dense(y.shape[1], activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)
The problem is that your sequences are rather long (1000 consecutive inputs). As LSTM units do maintain some kind of state over epochs and you are trying to train for 500 epochs (which is a lot), especially when training on a CPU, your RAM will get flooded over time. I suggest you try to train on a GPU, which has dedicated memory of its own. Also check out this issue: https://github.com/Element-Research/rnn/issues/5
I tried to follow a book on deep learning that has a chapter about generating text in the style of an example. The book uses a char-level RNN with two LSTM layers to generate text in the style of Shakespeare. But the code in the book (also online: https://github.com/DOsinga/deep_learning_cookbook/blob/master/05.1%20Generating%20Text%20in%20the%20Style%20of%20an%20Example%20Text.ipynb) is written in Keras, and I only use PyTorch. So I tried to recreate it exactly in PyTorch, with the same network structure and hyperparameters.
After recreating it and making it run without errors, I trained it, and it only learned to write the most common character, a space. Then I tried to overfit it on one really simple sentence, for which I had to decrease the sequence length to 8. This also did not work, but when I decreased the hidden size of the LSTMs to only 32, it learned it nearly perfectly.
So then I continued working on the original text and started to play around with the hidden size, the learning rate, and the optimizer (I also tried Adam), and trained it even longer. The best I could achieve was some random letters, still with a lot of spaces and sometimes something like "her", but far from readable, and still with quite a high loss. I used RMSprop with lr=0.01 and a hidden size of 128 over 20000 epochs. I also tried initializing the hidden state and cell state to zero.
The problem is that my results are far worse than those in the book, even though I did exactly the same thing, just in PyTorch. Can someone please tell me what I should try or what I have done wrong? Any help is appreciated!
PS: Sorry for my bad English.
Here is my code with the original hyperparameters:
# hyperparameters
batch_size = 256
seq_len = 160
hidden_size = 640
layers = 2

# network structure
class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(len(chars), hidden_size, layers)
        self.linear = nn.Linear(hidden_size, len(chars))
        self.softmax = nn.Softmax(dim=2)

    def forward(self, x, h, c):
        x, (h, c) = self.lstm(x, (h, c))
        x = self.softmax(self.linear(x))
        return x, h, c

# create network, optimizer and criterion
rnn = RNN().cuda()
optimizer = torch.optim.RMSprop(rnn.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# training loop
plt.ion()
losses = []
loss_sum = 0
for epoch in range(10000):
    # generate input and target filled with zeros
    input = numpy.zeros((seq_len, batch_size, len(chars)))
    target = numpy.zeros((seq_len, batch_size))
    for batch in range(batch_size):
        # choose random starting index in text
        start = random.randrange(len(text)-seq_len-1)
        # generate sequences for that batch filled with zeros
        input_seq = numpy.zeros((seq_len+1, len(chars)))
        target_seq = numpy.zeros((seq_len+1))
        for i, char in enumerate(text[start:start+seq_len+1]):
            # convert character to index
            idx = char_to_idx[char]
            # set value of index to one (one-hot-encoding)
            input_seq[i, idx] = 1
            # set value to index (only label)
            target_seq[i] = idx
        # insert sequences into input and target
        input[:, batch, :] = input_seq[:-1]
        target[:, batch] = target_seq[1:]
    # convert input and target from numpy array to pytorch tensor on gpu
    input = torch.from_numpy(input).float().cuda()
    target = torch.from_numpy(target).long().cuda()
    # initialize hidden state and cell state to zero
    h0 = torch.zeros(layers, batch_size, hidden_size).cuda()
    c0 = torch.zeros(layers, batch_size, hidden_size).cuda()
    # run the network on the input
    output, h, c = rnn(input, h0, c0)
    # calculate loss and perform gradient descent
    optimizer.zero_grad()
    loss = criterion(output.view(-1, len(chars)), target.view(-1))
    loss.backward()
    optimizer.step()
Plot of the loss with original hyperparameters:
Example of target and output after training:
Target: can bring this instrument of honour
again into his native quarter, be magnanimous in the enterprise,
and go on; I will grace the attempt for a worthy e
Output:
Plot of the loss with hidden size of 128 over 20000 epochs (best results):
I later finally found a way to achieve something close to real sentences; maybe it will help someone. Here is an example result:
-I have not seen him and the prince was a signt of the streme of the sumpering of the property of th
In my case the important change was to not initialize the hidden and cell state to zero for every batch, but only once per epoch. For this to work I had to rewrite the batch generator so that it produces batches that follow on from each other (a sketch of the idea is below).
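This is a minimal sketch of what such a contiguous batch generator could look like; the function name, the numpy-based layout, and the toy text are my own, not code from the original training script. The hidden state would be reset once per pass over the text and carried (and detached) between the batches it yields.
import numpy as np

def contiguous_batches(text, char_to_idx, batch_size, seq_len):
    """Yield (input, target) index arrays of shape (seq_len, batch_size)
    whose consecutive batches continue the same text streams."""
    ids = np.array([char_to_idx[c] for c in text])
    # Split the text into batch_size parallel streams of equal length.
    stream_len = len(ids) // batch_size
    streams = ids[:stream_len * batch_size].reshape(batch_size, stream_len)
    # Walk every stream left to right; consecutive yields are consecutive text.
    for start in range(0, stream_len - 1, seq_len):
        chunk = streams[:, start:start + seq_len + 1]  # (batch, <= seq_len + 1)
        if chunk.shape[1] < 2:
            break
        yield chunk[:, :-1].T, chunk[:, 1:].T          # inputs, next-char targets

# Toy usage: each batch continues exactly where the previous one stopped.
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i for i, c in enumerate(alphabet)}
for x, y in contiguous_batches(alphabet * 4, char_to_idx, batch_size=4, seq_len=5):
    print(x.shape, y.shape)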
I'm an experienced Python developer, but a complete newbie in machine learning. This is my first attempt to use Keras. Can you tell what I'm doing wrong?
I'm trying to make a neural network that takes a number in binary form, and outputs its modulo when dividing by 7. (My goal was to take a very simple task just to see that everything works.)
In the code below I define the network and I train it on 10,000 random numbers. Then I test it on 500 random numbers.
For some reason the accuracy that I get is around 1/7, which is the accuracy you'd expect from a completely random algorithm, i.e. my neural network isn't doing anything.
Can anyone help me figure out what's wrong?
import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7


def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > 20:
        raise NotImplementedError
    bits = (((0,) * (20 - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == 20
    return np.c_[bits]

def get_mod_result_vector(vector):
    return _number_to_vector(_get_mod_result(vector))


def main():
    model = keras.models.Sequential(
        (
            keras.layers.Dense(
                units=20, activation='relu', input_dim=20
            ),
            keras.layers.Dense(
                units=20, activation='relu'
            ),
            keras.layers.Dense(
                units=20, activation='softmax'
            )
        )
    )
    model.compile(optimizer='sgd',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(10000, 20))
    labels = np.vstack(map(get_mod_result_vector, data))

    model.fit(data, labels, epochs=10, batch_size=50)

    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return _get_number(tuple(map(round, foo[0])))

    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)

    predict(7)

    sample = random_tools.shuffled(range(2 ** 20))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')


if __name__ == '__main__':
    main()
This achieves an accuracy of 99.74% and a validation accuracy of 99.69%.
import tensorflow as tf, numpy as np

def int2bits(i, fill=20):
    return list(map(int, bin(i)[2:].zfill(fill)))

def bits2int(b):
    return sum(i * 2**n for n, i in enumerate(reversed(b)))

# Data.
I = np.random.randint(0, 2**20, size=(250_000,))
X = np.array(list(map(int2bits, I)))
Y = np.array([int2bits(2**i, 7) for i in I % 7])

# Test Data.
It = np.random.randint(0, 2**20, size=(10_000,))
Xt = np.array(list(map(int2bits, It)))
Yt = np.array([int2bits(2**i, 7) for i in It % 7])

# Model.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1000, 'relu'),
    tf.keras.layers.Dense(7, 'softmax'),
])
model.compile('adam', 'categorical_crossentropy', ['accuracy'])

# Train.
model.fit(X, Y, 10_000, 100, validation_data=(Xt, Yt))
Some take-aways:
1) You had way too little data. You were uniformly sampling points from 0 to 2**20, but only sampled 10,000, which is only about 1% of the possible vectors the model is supposed to learn about. The point is that a lot of components (in the binary representation) would be mostly fixed at zero or one without any opportunity to learn how they function in the overall data or how they interact with other components.
2) You needed an embedding layer, namely to extend the space into some much higher dimension, so the neurons can move around more easily. This allows the learning to shuffle things around better, hopefully finding the algorithm you're looking for. A single Dense(1000) seems to work.
3) I ran batches of 10_000 (just to maximize my CPU usage) for 100 epochs. I included my validation_data in the training so I could see how the validation set performs at each epoch (including it doesn't affect the training; it just makes it easier to see whether the model is doing well while training).
Thanks. :-)
UPD
After some tinkering I was able to get to a reasonably good solution using RNNs. It trains on less than 5% of all possible unique inputs and gives >90% accuracy on a random test sample. You can increase the number of epochs from 40 to 100 to make it a bit more accurate (though in some runs there is a chance the model won't converge to the right answer; here that chance is higher than usual). I have switched to using the Adam optimizer here and had to increase the number of samples to 50K (10K led to overfitting for me).
Please understand that this solution is a bit of a tongue-in-cheek thing, because it is based on the task-domain knowledge that our target function can be defined by a simple recurring formula on the sequence of input bits (an even simpler formula if you reverse the input bit sequence, but using go_backwards=True in the LSTM didn't help here).
If you reverse the input bit order (so that we always start with the most significant bit), then the recurring formula for the target function is just F_n = G(F_{n-1}, x_n), where F_n = MOD([x_1,...,x_n], 7) and G(x, y) = MOD(2*x + y, 7), which has only 49 different inputs and 7 possible outputs. So the model kind of has to learn an initial state plus this G update function. For a sequence starting with the least significant bit, the recurring formula is slightly more complicated, because it also needs to keep track of the current MOD(2**n, 7) at each step, but it seems this difficulty doesn't matter for training.
Please note: these formulas are only there to explain why an RNN works here. The net below is just a plain LSTM layer plus softmax, with the original input bits treated as a sequence.
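A quick pure-Python sanity check of that recurrence (my own illustration; it is not part of the model):
def mod7_via_recurrence(n, width=20):
    bits = [int(b) for b in bin(n)[2:].zfill(width)]  # most significant bit first
    f = 0
    for bit in bits:
        f = (2 * f + bit) % 7                         # F_n = G(F_{n-1}, x_n)
    return f

assert all(mod7_via_recurrence(n) == n % 7 for n in range(10000))
print("recurrence matches n % 7 for all n < 10000")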
Full code for the answer using RNN layer:
import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7
FEATURE_BITS = 20


def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > FEATURE_BITS:
        raise NotImplementedError
    bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == FEATURE_BITS
    return np.c_[bits]

def get_mod_result_vector(vector):
    v = np.repeat(0, 7)
    v[_get_mod_result(vector)] = 1
    return v


def main():
    model = keras.models.Sequential(
        (
            keras.layers.Reshape(
                (1, -1)
            ),
            keras.layers.LSTM(
                units=100,
            ),
            keras.layers.Dense(
                units=7, activation='softmax'
            )
        )
    )
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(50000, FEATURE_BITS))
    labels = np.vstack(map(get_mod_result_vector, data))

    model.fit(data, labels, epochs=40, batch_size=50)

    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return np.argmax(foo)

    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)

    sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')


if __name__ == '__main__':
    main()
ORIGINAL ANSWER
I'm not sure how it happened, but the particular task you chose to check your code is extremely difficult for a NN. I think the best explanation would be that NNs are not really good when features are interconnected in such a way that changing one feature always changes the value of your target output completely. One way to look at it is to consider the sets of features for which you expect a certain answer: in your case they will look like unions of a very large number of parallel hyperplanes in 20-dimensional space, and for each of the 7 categories these sets of planes are "nicely" interleaved and left for the NN to distinguish.
That said, if your number of examples is large, say 10K, and the number of possible inputs is smaller, say your input numbers are only 8 bits wide (so only 256 unique inputs are possible), networks should "learn" the right function quite well (by "remembering" the correct answer for every input, without generalization). In your case that doesn't happen because the code has the following bug.
Your labels were 20-dimensional vectors encoding the bits of a 0-6 integer (your actual desired label), so I guess you were pretty much trying to teach the NN to learn the bits of the answer as separate classifiers (with only 3 bits ever able to be non-zero). I changed that to what I assume you actually wanted: vectors of length 7 with only one value being 1 and the others 0 (so-called one-hot encoding, which Keras actually expects for categorical_crossentropy according to this). If you wanted to try to learn each bit separately, you definitely shouldn't have used a softmax of size 20 in the last layer, because such an output generates probabilities over 20 classes which sum to 1 (in that case you should have trained 20, or rather 3, binary classifiers instead). Since your code didn't give Keras correct targets, the model you got in the end was kind of random, and with the rounding you applied it tended to output the same value for 95%-100% of inputs.
The slightly changed code below trains a model which can more or less correctly guess the mod-7 answer for every number from 0 to 255 (again, it pretty much remembers the correct answer for every input). If you try to increase FEATURE_BITS you will see a large degradation of the results. If you actually want to train an NN to learn this task as-is with 20 or more bits of input (and without supplying the NN with all possible inputs and infinite time to train), you will need to apply some task-specific feature transformations and/or some layers carefully designed to be good at exactly the task you want to achieve, as others already mentioned in comments to your question.
import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7
FEATURE_BITS = 8


def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > FEATURE_BITS:
        raise NotImplementedError
    bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == FEATURE_BITS
    return np.c_[bits]

def get_mod_result_vector(vector):
    v = np.repeat(0, 7)
    v[_get_mod_result(vector)] = 1
    return v


def main():
    model = keras.models.Sequential(
        (
            keras.layers.Dense(
                units=20, activation='relu', input_dim=FEATURE_BITS
            ),
            keras.layers.Dense(
                units=20, activation='relu'
            ),
            keras.layers.Dense(
                units=7, activation='softmax'
            )
        )
    )
    model.compile(optimizer='sgd',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(10000, FEATURE_BITS))
    labels = np.vstack(map(get_mod_result_vector, data))

    model.fit(data, labels, epochs=100, batch_size=50)

    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return np.argmax(foo)

    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)

    sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')


if __name__ == '__main__':
    main()
I'm trying to train an LSTM to classify sequences of various lengths. I want to get the weights of this model so I can use them in a stateful version of the model. Before training, the weights are normal. The training also seems to run successfully, with a gradually decreasing error. However, when I change the mask value from -10 to np.nan, mod.get_weights() starts returning arrays of NaNs and the validation error suddenly drops to a value close to zero. Why is this occurring?
from keras import models
from keras.layers import Dense, Masking, LSTM
from keras.optimizers import RMSprop
from keras.losses import categorical_crossentropy
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt


def gen_noise(noise_len, mag):
    return np.random.uniform(size=noise_len) * mag


def gen_sin(t_val, freq):
    return 2 * np.sin(2 * np.pi * t_val * freq)


def train_rnn(x_train, y_train, max_len, mask, number_of_categories):
    epochs = 3
    batch_size = 100
    # three hidden layers of 256 each
    vec_dims = 1
    hidden_units = 256
    in_shape = (max_len, vec_dims)

    model = models.Sequential()
    model.add(Masking(mask, name="in_layer", input_shape=in_shape,))
    model.add(LSTM(hidden_units, return_sequences=False))
    model.add(Dense(number_of_categories, input_shape=(number_of_categories,),
                    activation='softmax', name='output'))
    model.compile(loss=categorical_crossentropy, optimizer=RMSprop())

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_split=0.05)
    return model


def gen_sig_cls_pair(freqs, t_stops, num_examples, noise_magnitude, mask, dt=0.01):
    x = []
    y = []
    num_cat = len(freqs)
    max_t = int(np.max(t_stops) / dt)
    for f_i, f in enumerate(freqs):
        for t_stop in t_stops:
            t_range = np.arange(0, t_stop, dt)
            t_len = t_range.size
            for _ in range(num_examples):
                sig = gen_sin(f, t_range) + gen_noise(t_len, noise_magnitude)
                x.append(sig)
                one_hot = np.zeros(num_cat, dtype=np.bool)
                one_hot[f_i] = 1
                y.append(one_hot)
    pad_kwargs = dict(padding='post', maxlen=max_t, value=mask, dtype=np.float32)
    return pad_sequences(x, **pad_kwargs), np.array(y)


if __name__ == '__main__':
    noise_mag = 0.01
    mask_val = -10
    frequencies = (5, 7, 10)
    signal_lengths = (0.8, 0.9, 1)
    dt_val = 0.01
    x_in, y_in = gen_sig_cls_pair(frequencies, signal_lengths, 50, noise_mag, mask_val)
    mod = train_rnn(x_in[:, :, None], y_in, int(np.max(signal_lengths) / dt_val),
                    mask_val, len(frequencies))
This persists even if I change the network architecture to return_sequences=True and wrap the Dense layer in TimeDistributed; removing the LSTM layer doesn't help either.
I had the same problem. In your case I can see it was probably something different, but someone might have the same problem and come here from Google. In my case I was passing the sample_weight parameter to the fit() method, and when the sample weights contained some zeros, get_weights() returned an array with NaNs. When I omitted the samples where sample_weight=0 (they were useless anyway with sample_weight=0), it started to work; a sketch of that filtering is below.
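For illustration only, the filtering could look like this; the array names and shapes are hypothetical stand-ins, not taken from the original code:
import numpy as np

# Hypothetical stand-ins for the real training data and per-sample weights.
x_train = np.random.rand(10, 100, 1)
y_train = np.eye(3)[np.random.randint(0, 3, size=10)]
sample_weight = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1], dtype=np.float32)

keep = sample_weight > 0  # drop the zero-weight samples entirely
x_train, y_train, sample_weight = x_train[keep], y_train[keep], sample_weight[keep]

# model.fit(x_train, y_train, sample_weight=sample_weight, ...)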
The weights are indeed changing. The unchanging weights are from the edge of the image, and they may not have changed because the edge isn't helpful for classifying digits.
To check, select a specific layer and look at the result:
print(model.layers[70].get_weights()[1])
Here 70 is the index of the last layer in my case.
The get_weights() method of a keras.engine.training.Model instance retrieves the weights of the model.
It returns a flat list of NumPy arrays; in other words, the list of all weight tensors in the model.
mw = model.get_weights()
print(mw)
If you get NaNs, this has a specific meaning: you are simply dealing with the vanishing gradients problem (in some cases even with exploding gradients).
I would first try to alter the model to reduce the chance of vanishing gradients. Try reducing hidden_units first, and normalize your activations.
Even though LSTMs are there to solve the vanishing/exploding gradients problem, you still need to keep activations in the (-1, 1) interval.
Note that this interval is where floating-point numbers are most precise.
Working with np.nan under the masking layer is not a predictable operation, since you cannot do comparisons with np.nan.
Try print(np.nan == np.nan) and it will return False. This is an old quirk of the IEEE 754 standard.
Or it may actually be a bug in TensorFlow, rooted in this weakness of the IEEE 754 standard.
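A quick illustration of why a NaN mask value cannot work with equality-based comparisons (my own minimal example):
import numpy as np

print(np.nan == np.nan)   # False: NaN never compares equal, even to itself
print(np.isnan(np.nan))   # True: this is the correct way to detect NaN

x = np.array([1.0, np.nan, 3.0])
print(x == np.nan)        # [False False False] -- an equality mask finds nothing
print(np.isnan(x))        # [False  True False]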