RAM-memory usage while training a RNN with LSTM units

RAM-memory usage while training a RNN with LSTM units - python

I'm following a tutorial on Recurrent neural networks, and I am training a RNN to learn how to predict the next letter from the alphabet, given a sequence of letters. The problem is, my RAM-usage is slowly going up every epoch I train the network for. I can not finish training this network because I have "only" 8192MB of RAM-memory, and it is exhausted after +- 100 epochs. Why is this? I think it has something to do with the way LSTM's work, since they do keep some information in memory, but it would be nice if someone could explain me some more details.
The code I'm using is relatively simple, and completely self-contained (You can copy/paste and run it, no need for a external dataset since the dataset is just the alphabet). Therefore I included it in full, so the problem is easily reproducible.
The tensorflow version I am using is 1.14.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras_preprocessing.sequence import pad_sequences
np.random.seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
start = np.random.randint(len(alphabet)-2)
end = np.random.randint(start, min(start+max_len,len(alphabet)-1))
sequence_in = alphabet[start:end+1]
sequence_out = alphabet[end + 1]
dataX.append([char_to_int[char] for char in sequence_in])
dataY.append(char_to_int[sequence_out])
print(sequence_in, "->" , sequence_out)
#Pad sequences with 0's, reshape X, then normalize data
X = pad_sequences(dataX, maxlen=max_len, dtype= "float32" )
X = np.reshape(X, (X.shape[0], max_len, 1))
X = X / float(len(alphabet))
print(X.shape)
#OHE the output variable.
y = np_utils.to_categorical(dataY)
#Create & fit the model
batch_size=1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1)))
model.add(Dense(y.shape[1], activation= "softmax" ))
model.compile(loss= "categorical_crossentropy" , optimizer= "adam" , metrics=[ "accuracy" ])
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)

The problem is that your sequences are rather long (1000 consecutive inputs). As LSTM-units do maintain some kind of state over epochs and you are trying to train it for 500 epochs (Which is a lot), especially when you're training on a CPU, your RAM will get flooded over time. I suggest you try to train on GPU, which has dedicated memory of its own. Also check out this issue: https://github.com/Element-Research/rnn/issues/5

Related

How can I get constant value val accuracy and val loss in keras [duplicate]

This question already has answers here:
How to get reproducible results in keras
(11 answers)
Closed 16 days ago.
I'm newbie in neural network and I try to do mlp text classification using keras. everytime I run the code, it get different val loss and and val accuracy. Val loss is increase and val accuarcy is decrease everytime I re-run it. The code that I'm using is like this:
#Split data training and testing (80:20)
Train_X2, Test_X2, Train_Y2, Test_Y2 = model_selection.train_test_split(dataset['review'],dataset['sentiment'],test_size=0.2, random_state=1)
Encoder = LabelEncoder()
Train_Y2 = Encoder.fit_transform(Train_Y2)
Test_Y2 = Encoder.fit_transform(Test_Y2)
Tfidf_vect2 = TfidfVectorizer(max_features=None)
Tfidf_vect2.fit(dataset['review'])
Train_X2_Tfidf = Tfidf_vect2.transform(Train_X2)
Test_X2_Tfidf = Tfidf_vect2.transform(Test_X2)
#Model
model = Sequential()
model.add(Dense(100, input_dim= 1148, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
opt = Adam (learning_rate=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model.summary()
from keras.backend import clear_session
clear_session()
es = EarlyStopping(monitor="val_loss",mode='min',patience=10)
history = model.fit(arr_Train_X2_Tfidf, Train_Y2, epochs=100,verbose=1, validation_split=0.2,validation_data=(arr_Test_X2_Tfidf, Test_Y2), batch_size=32, callbacks =[es])
I try using clear_session() to make the model not start off with the computed weights from the previous training. But it still get difference value. How to fix it? thank you

How can I get constant value val accuracy and val loss in keras
I guess what you want is reproducible train runs. For that you will have to seed the random number generator. Getting reproducible result with seed is tricky on the GPU because some operations on GPU are non deterministic. However, with the model architecture you are using it is not a problem.
make model not start off with the computed weights from the previous training.
It is not the case, you are creating the model every time and the Dense layers you are using gets initialised from glorot_uniform distributions.
Sample:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import callbacks
import matplotlib.pyplot as plt
import os
import numpy as np
import random as rn
import random as python_random
def seed():
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
np.random.seed(123)
python_random.seed(123)
tf.random.set_seed(1234)
def train(set_seed):
if set_seed:
seed()
dataset = {
'review': [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
],
'sentiment': [0,1,0,1]
}
Train_X2, Test_X2, Train_Y2, Test_Y2 = train_test_split(
dataset['review'],dataset['sentiment'],test_size=0.2, random_state=1)
Encoder = LabelEncoder()
Train_Y2 = Encoder.fit_transform(Train_Y2)
Test_Y2 = Encoder.fit_transform(Test_Y2)
Tfidf_vect2 = TfidfVectorizer(max_features=None)
Tfidf_vect2.fit(dataset['review'])
Train_X2_Tfidf = Tfidf_vect2.transform(Train_X2).toarray()
Test_X2_Tfidf = Tfidf_vect2.transform(Test_X2).toarray()
#Model
model = keras.Sequential()
model.add(layers.Dense(100, input_dim= 9, activation='sigmoid'))
model.add(layers.Dense(1, activation='sigmoid'))
opt = optimizers.Adam (learning_rate=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
#model.summary()
history = model.fit(Train_X2_Tfidf, Train_Y2, epochs=10,verbose=0,
validation_data=(Test_X2_Tfidf, Test_Y2), batch_size=32)
return history
def run(set_seed=False):
plt.figure(figsize=(7,7))
for i in range(5):
history = train(set_seed)
plt.plot(history.history['val_loss'], label=f"{i+1}")
plt.legend()
plt.title("With Seed" if set_seed else "Wihout Seed")
run()
run(True)
Outout:
You can see how the val_loss is different without seed (as it depends on the initial value of Dense layer and other places where random number generation is used) and how the val_loss is exactly the same with seed which make sure the initial values of Dense layers you are using are same between runs (and at other places where random number generation is used).

During the training process, several different sources of randomness exist. You would need to eliminate all of those to get perfectly reproducible results. Here are a few of those:
The random weight and bias initialization leads to different training runs each time.
During training mini batching is used which influences the trajectory of the training process differently each time. Each training process that uses Stochastic Gradient Descent (SGD) as an optimization technique will thus observe a different ordering of inputs.
Randomness induced by regularization techniques such as Dropout.
Some of those sources of randomness can be overcome by setting a fixed seed, how this is done is described in this older question.
This paper names several other possible sources of bias (such as data augmentation or hardware differences).

Why do I get "logits and labels must have the same shape" error?

I have the following piece of code, which is a simplification of my actual code:
#!/usr/bin/env python3
import tensorflow as tf
import re
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer #text to vector.
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
DATASIZE = 1000
model = Sequential()
model.add(Embedding(1000, 120, input_length = DATASIZE))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(176, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())
training = [[0 for x in range(10)] for x in range (DATASIZE)] #random value
label = [[1 for x in range(10)] for x in range (DATASIZE)] #random value
model.fit(training, label, epochs = 5, batch_size=32, verbose = 'auto')
What I need:
My endgoal is to be able to check whether a given input vector (which is a numerical representation of data) is positive or negative, so it is a binary classification issue here. Here all the 1000 vectors are 10 digits long, but it could be waay longer or even much shorter, length might vary throughout the dataset.
When running this I get the following error:
ValueError: logits and labels must have the same shape ((None, 2) vs (None, 10))
How do I have to structure my vectors in order to not get this error and actually correctly fit the model?
EDIT:
I could change the model.add(Dense(2,activation='sigmoid')) call to model.add(Dense(10,activation='sigmoid')). But this doesn't make much sense, I think. As the first parameter of Dense is the actual number of output possibilities. In my case there are only 2 possibilities: positive or negative. So rnndomly changing this to 10, makes the program run, but doesn't make sense to me. And I am not even sure it is using my 1000 training vectors....

Your last layer dense has 2 neurons, while your dataset has 10 labels. So technically, the last layer must have same number of neurons as the number of classes in your dataset which is 10 in your case. Just replace 2 with 10 in your last dense layer.

Training a LSTM auto-encoder gets NaN / super high MSE loss

I'm trying to train a LSTM ae.
It's like a seq2seq model, you throw a signal in to get a reconstructed signal sequence. And the I'm using a sequence which should be quite easy. The loss function and metric is MSE. The first hundred epochs went well. However after some epochs I got MSE which is super high and it goes to NaN sometimes. I don't know what causes this.
Can you inspect the code and give me a hint?
The sequence gets normalization before, so it's in a [0,1] range, how can it produce such a high MSE error?
This is the input sequence I get from training set:
sequence1 = x_train[0][:128]
looks like this:
I get the data from a public signal dataset(128*1)
This is the code: (I modify it from keras blog)
# lstm autoencoder recreate sequence
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.utils import plot_model
from keras import regularizers
# define input sequence. sequence1 is only a one dimensional list
# reshape sequence1 input into [samples, timesteps, features]
n_in = len(sequence1)
sequence = sequence1.reshape((1, n_in, 1))
# define model
model = Sequential()
model.add(LSTM(1024, activation='relu', input_shape=(n_in,1)))
model.add(RepeatVector(n_in))
model.add(LSTM(1024, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
for epo in [50,100,1000,2000]:
model.fit(sequence, sequence, epochs=epo)
The first few epochs went all well. all the losses are about 0.003X or so. Then it became big suddenly, to some very big number, the goes to NaN all the way up.

You might have a problem with exploding gradient values when doing the backpropagation.
Try using the clipnorm and clipvalue parameters to control gradient clipping: https://keras.io/optimizers/
Alternatively, what is the learning rate you are using? I would also try to reduce the learning rate by 10,100,1000 to check if you observe the same behavior.

'relu' is the main culprit - see here. Possible solutions:
Initialize weights to smaller values, e.g. keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01)
Clip weights (at initialization, or via kernel_constraint, recurrent_constraint, ...)
Increase weight decay
Use a warmup learning rate scheme (start low, gradually increase)
Use 'selu' activation, which is more stable, is ReLU-like, and works better than ReLU on some tasks
Since your training went stable for many epochs, 3 sounds the most promising, as it seems that eventually your weights norm gets too large and gradients explode. Generally, I suggest keeping the weight norms around 1 for 'relu'; you can monitor the l2 norms using the function below. I also recommend See RNN for inspecting layer activations & gradients.
def inspect_weights_l2(model, names='lstm', axis=-1):
def _get_l2(w, axis=-1):
axis = axis if axis != -1 else len(w.shape) - 1
reduction_axes = tuple([ax for ax in range(len(w.shape)) if ax != axis])
return np.sqrt(np.sum(np.square(w), axis=reduction_axes))
def _print_layer_l2(layer, idx, axis=-1):
W = layer.get_weights()
l2_all = []
txt = "{} "
for w in W:
txt += "{:.4f}, {:.4f} -- "
l2 = _get_l2(w, axis)
l2_all.extend([l2.max(), l2.mean()])
txt = txt.rstrip(" -- ")
print(txt.format(idx, *l2_all))
names = [names] if isinstance(names, str) else names
for idx, layer in enumerate(model.layers):
if any([name in layer.name.lower() for name in names]):
_print_layer_l2(layer, idx, axis=axis)

Neural network: estimating sine wave frequency

With an objective of learning Keras LSTM and RNNs, I thought to create a simple problem to work on: given a sine wave, can we predict its frequency?
I wouldn't expect a simple neural network to be able to predict the frequency, given that the notion of time is important here. However, even with LSTMs, I am unable to learn the frequency; I'm able to learn a trivial zero as the estimated frequency (even for train samples).
Here's the code to create the train set.
import numpy as np
import matplotlib.pyplot as plt
def create_sine(frequency):
return np.sin(frequency*np.linspace(0, 2*np.pi, 2000))
train_x = np.array([create_sine(x) for x in range(1, 300)])
train_y = list(range(1, 300))
Now, here's a simple neural network for this example.
from keras.models import Model
from keras.layers import Dense, Input, LSTM
input_series = Input(shape=(2000,),name='Input')
dense_1 = Dense(100)(input_series)
pred = Dense(1, activation='relu')(dense_1)
model = Model(input_series, pred)
model.compile('adam','mean_absolute_error')
model.fit(train_x[:100], train_y[:100], epochs=100)
As expected, this NN doesn't learn anything useful. Next, I tried a simple LSTM example.
input_series = Input(shape=(2000,1),name='Input')
lstm = LSTM(100)(input_series)
pred = Dense(1, activation='relu')(lstm)
model = Model(input_series, pred)
model.compile('adam','mean_absolute_error')
model.fit(train_x[:100].reshape(100, 2000, 1), train_y[:100], epochs=100)
However, this LSTM based model also doesn't learn anything useful.

Why doesn't it learn?
You think it's a simple problem to train an RNN on, but actually your setup isn't easy for the network at all:
As already mentioned, there's lack of important samples. You throw so much data into it (300 * 2000 points), but the actual target (frequency) is seen only once by the network. Even if the network does learn something, there's high chance it will overfit.
Inconsistent data. Remember that RNNs are good at capturing similar patterns in the series data. For instance, in NLP all sentences in the corpus are governed by the same language rules and more sentences help RNN to understand these rules better, i.e., more data helps.
In your case, the series with different frequencies aren't very much alike: compare the sine with frequency=1 and frequency=100. This kind of diversity in the data makes it harder to learn, not easier. It doesn't mean that the frequency is impossible for an RNN to learn, it simply means that you shouldn't be surprised that a trivial RNN like yours has hard time.
Data scale. Changing the frequency from 1 to 300, changes the scale of both x and y by two orders of magnitude, which may be problematic for any neural network.
Solution
Since your goal is rather educational, I solved the second and third items simply by limiting the target frequency to 10, so that scaling and distribution diversity isn't much of an issue (you are welcome to try different values here: you should see that increasing this one parameter to, say, 50 makes the task much more complex).
The first item is solved by giving the RNN 10 examples of each frequency, instead of just one. I've also added one more hidden layer to increase network flexibility, plus a simple regularizer (Dropout layer).
The complete code:
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout, LSTM
max_freq = 10
time_steps = 100
def create_sine(frequency, offset):
return np.sin(frequency * np.linspace(offset, 2 * np.pi + offset, time_steps))
train_y = list(range(1, max_freq)) * 10
train_x = np.array([create_sine(freq, np.random.uniform(0,1)) for freq in train_y])
train_y = np.array(train_y)
input_series = Input(shape=(time_steps, 1), name='Input')
lstm = LSTM(units=100)(input_series)
hidden = Dense(units=100, activation='relu')(lstm)
dropout = Dropout(rate=0.1)(hidden)
output = Dense(units=1, activation='relu')(dropout)
model = Model(input_series, output)
model.compile('adam', 'mean_squared_error')
model.fit(train_x.reshape(-1, time_steps, 1), train_y, epochs=200)
# Trying the network on the same data
test_x = train_x.reshape(-1, time_steps, 1)
test_y = train_y
predicted = model.predict(test_x).reshape([-1])
print()
print((predicted - train_y)[:12])
print(np.mean(np.abs(predicted - train_y)))
The output:
max_freq=10
[-0.05612183 -0.01982236 -0.03744316 -0.02568841 -0.11959982 -0.0770483
0.04643679 0.12057972 -0.00625324 -0.00724655 -0.16919005 -0.04512954]
0.0503574344847
max_freq=20 (everything else is the same)
[ 0.51365542 0.09269333 -0.009691 0.0619092 0.09852839 0.04378462
0.01430321 -0.01953268 0.00722599 0.02558327 -0.04520988 -0.0614748 ]
0.146024380232
max_freq=30 (everything else is the same)
[-0.28205156 -0.28922796 -0.00569081 -0.21314907 0.1068716 0.23497915
0.23975039 0.25955486 0.26333141 0.24235058 0.08320332 -0.03686047]
0.406703719805
Note that results are random and actually increasing the max_freq increases the changes of divergence. But even when it converges, the performance doesn't improve despite having more data, instead gets worse and pretty fast.

sample data item very low, one for each freq,
add small noise and use more data,
normalize output data -1 to 1 range
then try again

As you said, you want to predict the frequency. You also want to use LSTM. First we generate enough data to train, then we build the network. I'm sorry my example is not with keras, I'm using tflearn.
import numpy as np
import tflearn
from random import shuffle
# parameters
n_input=100
n_train=2000
n_test = 500
# generate data
xs=[]
ys=[]
frequencies = np.linspace(1,50,n_train+n_test)
shuffle(frequencies)
t=np.linspace(0,2*np.pi,n_input)
for freq in frequencies:
xs.append(np.sin(t*freq))
ys.append(freq)
xs_train=np.array(xs[:n_train]).reshape(n_train,n_input,1)
xs_test=np.array(xs[n_train:]).reshape(n_test,n_input,1)
ys_train = np.array(ys[:n_train]).reshape(-1,1)
ys_test = np.array(ys[n_train:]).reshape(-1,1)
# LSTM network prediction
net = tflearn.input_data(shape=[None, n_input, 1])
net = tflearn.lstm(net, 10)
net = tflearn.fully_connected(net, 100, activation="relu")
net = tflearn.fully_connected(net, 1)
net = tflearn.regression(net, optimizer='adam', loss='mean_square')
model = tflearn.DNN(net)
model.fit(xs_train, ys_train, n_epoch=100)
print(np.hstack((model.predict(xs_test),ys_test))[:10])
# [[ 13.08494568 12.76470588]
# [ 22.23135376 21.98039216]
# [ 39.0812912 37.58823529]
# [ 15.77548409 15.66666667]
# [ 26.57996941 25.58823529]
# [ 26.57759476 25.11764706]
# [ 16.42217445 15.8627451 ]
# [ 32.55020905 30.80392157]
# [ 44.16622925 43.01960784]
# [ 26.18071365 25.45098039]]
If you have the data in that order, you don't actually need LSTM, you can easily replace the LSTM part with a Deep Neural Network:
# Deep network instead of LSTM
net = tflearn.input_data(shape=[None, n_input])
net = tflearn.fully_connected(net, 100)
net = tflearn.fully_connected(net, 100)
net = tflearn.fully_connected(net, 1)
net = tflearn.regression(net, optimizer='adam',loss='mean_square')
model = tflearn.DNN(net)
model.fit(xs_train, ys_train)
print(np.hstack((model.predict(xs_test),ys_test))[:10])
Both codes are going to give you as result the predicted value of the frequency. I also created a gist with the program.

Using Deep Learning to Predict Subsequence from Sequence

I have a data that looks like this:
It can be viewed here and has been included in the code below.
In actuality I have ~7000 samples (row), downloadable too.
The task is given antigen, predict the corresponding epitope.
So epitope is always an exact substring of antigen. This is equivalent with
the Sequence to Sequence Learning. Here is my code running on Recurrent Neural Network under Keras. It was modeled according the example.
My question are:
Can RNN, LSTM or GRU used to predict subsequence as posed above?
How can I improve the accuracy of my code?
How can I modify my code so that it can run faster?
Here is my running code which gave very bad accuracy score.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import json
import pandas as pd
from keras.models import Sequential
from keras.engine.training import slice_X
from keras.layers.core import Activation, RepeatVector, Dense
from keras.layers import recurrent, TimeDistributed
import numpy as np
from six.moves import range
class CharacterTable(object):
'''
Given a set of characters:
+ Encode them to a one hot integer representation
+ Decode the one hot integer representation to their character output
+ Decode a vector of probabilties to their character output
'''
def __init__(self, chars, maxlen):
self.chars = sorted(set(chars))
self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
self.maxlen = maxlen
def encode(self, C, maxlen=None):
maxlen = maxlen if maxlen else self.maxlen
X = np.zeros((maxlen, len(self.chars)))
for i, c in enumerate(C):
X[i, self.char_indices[c]] = 1
return X
def decode(self, X, calc_argmax=True):
if calc_argmax:
X = X.argmax(axis=-1)
return ''.join(self.indices_char[x] for x in X)
class colors:
ok = '\033[92m'
fail = '\033[91m'
close = '\033[0m'
INVERT = True
HIDDEN_SIZE = 128
BATCH_SIZE = 64
LAYERS = 3
# Try replacing GRU, or SimpleRNN
RNN = recurrent.LSTM
def main():
"""
Epitope_core = answers
Antigen = questions
"""
epi_antigen_df = pd.io.parsers.read_table("http://dpaste.com/2PZ9WH6.txt")
antigens = epi_antigen_df["Antigen"].tolist()
epitopes = epi_antigen_df["Epitope Core"].tolist()
if INVERT:
antigens = [ x[::-1] for x in antigens]
allchars = "".join(antigens+epitopes)
allchars = list(set(allchars))
aa_chars = "".join(allchars)
sys.stderr.write(aa_chars + "\n")
max_antigen_len = len(max(antigens, key=len))
max_epitope_len = len(max(epitopes, key=len))
X = np.zeros((len(antigens),max_antigen_len, len(aa_chars)),dtype=np.bool)
y = np.zeros((len(epitopes),max_epitope_len, len(aa_chars)),dtype=np.bool)
ctable = CharacterTable(aa_chars, max_antigen_len)
sys.stderr.write("Begin vectorization\n")
for i, antigen in enumerate(antigens):
X[i] = ctable.encode(antigen, maxlen=max_antigen_len)
for i, epitope in enumerate(epitopes):
y[i] = ctable.encode(epitope, maxlen=max_epitope_len)
# Shuffle (X, y) in unison as the later parts of X will almost all be larger digits
indices = np.arange(len(y))
np.random.shuffle(indices)
X = X[indices]
y = y[indices]
# Explicitly set apart 10% for validation data that we never train over
split_at = len(X) - len(X) / 10
(X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
(y_train, y_val) = (y[:split_at], y[split_at:])
sys.stderr.write("Build model\n")
model = Sequential()
# "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE
# note: in a situation where your input sequences have a variable length,
# use input_shape=(None, nb_feature).
model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
# For the decoder's input, we repeat the encoded input for each time step
model.add(RepeatVector(max_epitope_len))
# The decoder RNN could be multiple layers stacked or a single layer
for _ in range(LAYERS):
model.add(RNN(HIDDEN_SIZE, return_sequences=True))
# For each of step of the output sequence, decide which character should be chosen
model.add(TimeDistributed(Dense(len(aa_chars))))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
# Train the model each generation and show predictions against the validation dataset
for iteration in range(1, 200):
print()
print('-' * 50)
print('Iteration', iteration)
model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=5,
validation_data=(X_val, y_val))
###
# Select 10 samples from the validation set at random so we can visualize errors
for i in range(10):
ind = np.random.randint(0, len(X_val))
rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
preds = model.predict_classes(rowX, verbose=0)
q = ctable.decode(rowX[0])
correct = ctable.decode(rowy[0])
guess = ctable.decode(preds[0], calc_argmax=False)
# print('Q', q[::-1] if INVERT else q)
print('T', correct)
print(colors.ok + '☑' + colors.close if correct == guess else colors.fail + '☒' + colors.close, guess)
print('---')
if __name__ == '__main__':
main()

Can RNN, LSTM or GRU used to predict subsequence as posed above?
Yes, you can use any of these. LSTMs and GRUs are types of RNNs; if by RNN you mean a fully-connected RNN, these have fallen out of favor because of the vanishing gradients problem (1, 2). Because of the relatively small number of examples in your dataset, a GRU might be preferable to an LSTM due to its simpler architecture.
How can I improve the accuracy of my code?
You mentioned that training and validation error are both bad. In general, this could be due to one of several factors:
The learning rate is too low (not an issue since you're using Adam, a per-parameter adaptive learning rate algorithm)
The model is too simple for the data (not at all the issue, since you have a very complex model and a small dataset)
You have vanishing gradients (probably the issue since you have a 3-layer RNN). Try reducing the number of layers to 1 (in general, it's good to start by getting a simple model working and then increase the complexity), and also consider hyperparameter search (e.g. a 128-dimensional hidden state may be too large - try 30?).
Another option, since your epitope is a substring of your input, is to predict the start and end indices of the epitope within the antigen sequence (potentially normalized by the length of the antigen sequence) instead of predicting the substring one character at a time. This would be a regression problem with two tasks. For instance, if the antigen is FSKIAGLTVT (10 letters long) and its epitope is KIAGL (positions 3 to 7, one-based) then the input would be FSKIAGLTVT and the outputs would be 0.3 (first task) and 0.7 (second task).
Alternatively, if you can make all the antigens be the same length (by removing parts of your dataset with short antigens and/or chopping off the ends of long antigens assuming you know a priori that the epitope is not near the ends), you can frame it as a classification problem with two tasks (start and end) and sequence-length classes, where you're trying to assign a probability to the antigen starting and ending at each of the positions.
How can I modify my code so that it can run faster?
Reducing the number of layers will speed your code up significantly. Also, GRUs will be faster than LSTMs due to their simpler architecture. However, both types of recurrent networks will be slower than, e.g. convolutional networks.
Feel free to send me an email (address in my profile) if you're interested in a collaboration.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.