Encoder-Decoder LSTM model gives 'nan' loss and predictions - python

I am trying to create a basic encoder-decoder model for training a chatbot. X contains the questions (human dialogues) and Y contains the bot answers. I have padded the sequences to the max size of the input and output sentences: X.shape = (2363, 242, 1) and Y.shape = (2363, 144, 1). But during training the loss is 'nan' for all epochs, and prediction returns an array with all values 'nan'. I have tried using the 'rmsprop' optimizer instead of 'adam'. I cannot use the loss function 'categorical_crossentropy' as the output is not one-hot encoded but a sequence. What exactly is wrong with my code?
Model
model = Sequential()
model.add(LSTM(units=64, activation='relu', input_shape=(X.shape[1], 1)))
model.add(RepeatVector(Y.shape[1]))
model.add(LSTM(units=64, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(units=1)))
print(model.summary())
model.compile(optimizer='adam', loss='mean_squared_error')
hist = model.fit(X, Y, epochs=20, batch_size=64, verbose=2)
model.save('encoder_decoder_model_epochs20.h5')
Data Preparation
def remove_punctuation(s):
    s = s.translate(str.maketrans('', '', string.punctuation))
    s = s.encode('ascii', 'ignore').decode('ascii')
    return s

def prepare_data(fname):
    word2idx = {'PAD': 0}
    curr_idx = 1
    sents = list()
    for line in open(fname):
        line = line.strip()
        if line:
            tokens = remove_punctuation(line.lower()).split()
            tmp = []
            for t in tokens:
                if t not in word2idx:
                    word2idx[t] = curr_idx
                    curr_idx += 1
                tmp.append(word2idx[t])
            sents.append(tmp)
    sents = np.array(pad_sequences(sents, padding='post'))
    return sents, word2idx
human = 'rdany-conversations/human_text.txt'
robot = 'rdany-conversations/robot_text.txt'
X, input_vocab = prepare_data(human)
Y, output_vocab = prepare_data(robot)
X = X.reshape((X.shape[0], X.shape[1], 1))
Y = Y.reshape((Y.shape[0], Y.shape[1], 1))

First of all, check that you do not have any NaNs in your input. If that is not the case, it is likely exploding gradients. Standardize your inputs (MinMax or Z-scaling), try smaller learning rates, clip the gradients, or try different weight initializations.
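A minimal sketch of those suggestions applied to the model above (the scaling constant, learning rate, and clipnorm value are illustrative assumptions, not tuned values):

import numpy as np
from tensorflow.keras.optimizers import Adam

# first, rule out NaNs in the data itself
assert not np.isnan(X).any() and not np.isnan(Y).any()

# scale the integer word ids into [0, 1) so the 'relu' LSTMs are less likely to blow up
X_scaled = X / float(len(input_vocab))
Y_scaled = Y / float(len(output_vocab))

# smaller learning rate plus gradient clipping to tame exploding gradients
model.compile(optimizer=Adam(learning_rate=1e-4, clipnorm=1.0),
              loss='mean_squared_error')
hist = model.fit(X_scaled, Y_scaled, epochs=20, batch_size=64, verbose=2)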

Related

Word2Vec embedding to LSTM layers?

I am now working on a neural network that should predict the next activity and the outcome (both or just one, depending on the self.net_out parameter) of a trace (a sequence of events taken from an event log). The inputs of the net are windows (prefixes) of a trace of a specific size. Right now it looks like this:
def nn(self, params):
    # done in this function so that, in case, win_size can easily become a parameter
    X_train, Y_train, Z_train = self.build_windows(self.traces_train, self.win_size)

    if self.net_embedding == 0:
        if self.net_out != 2:
            Y_train = self.leA.fit_transform(Y_train)
            Y_train = to_categorical(Y_train)
            label = Y_train
        if self.net_out != 1:
            Z_train = self.leO.fit_transform(Z_train)
            Z_train = to_categorical(Z_train)
            label = Z_train

    unique_events = len(self.act_dictionary)
    input_act = Input(shape=self.win_size, dtype='int32', name='input_act')
    if self.net_embedding == 0:
        x_act = Embedding(output_dim=params["output_dim_embedding"],
                          input_dim=unique_events + 1,
                          input_length=self.win_size)(input_act)
    else:
        print("WIP")

    n_layers = int(params["n_layers"]["n_layers"])

    l1 = LSTM(params["shared_lstm_size"], return_sequences=True,
              kernel_initializer='glorot_uniform', dropout=params['dropout'])(x_act)
    l1 = BatchNormalization()(l1)

    if self.net_out != 2:
        l_a = LSTM(params["lstmA_size_1"], return_sequences=(n_layers != 1),
                   kernel_initializer='glorot_uniform', dropout=params['dropout'])(l1)
        l_a = BatchNormalization()(l_a)
    elif self.net_out != 1:
        l_o = LSTM(params["lstmO_size_1"], return_sequences=(n_layers != 1),
                   kernel_initializer='glorot_uniform', dropout=params['dropout'])(l1)
        l_o = BatchNormalization()(l_o)

    for i in range(2, n_layers + 1):
        if self.net_out != 2:
            l_a = LSTM(params["n_layers"]["lstmA_size_%s_%s" % (i, n_layers)],
                       return_sequences=(n_layers != i),
                       kernel_initializer='glorot_uniform', dropout=params['dropout'])(l_a)
            l_a = BatchNormalization()(l_a)
        if self.net_out != 1:
            l_o = LSTM(params["n_layers"]["lstmO_size_%s_%s" % (i, n_layers)],
                       return_sequences=(n_layers != i),
                       kernel_initializer='glorot_uniform', dropout=params['dropout'])(l_o)
            l_o = BatchNormalization()(l_o)

    outputs = []
    if self.net_out != 2:
        output_l = Dense(self.outsize_act, activation='softmax', name='act_output')(l_a)
        outputs.append(output_l)
    if self.net_out != 1:
        output_o = Dense(self.outsize_out, activation='softmax', name='outcome_output')(l_o)
        outputs.append(output_o)

    model = Model(inputs=input_act, outputs=outputs)
    print(model.summary())

    opt = Adam(lr=params["learning_rate"])
    if self.net_out == 0:
        loss = {'act_output': 'categorical_crossentropy',
                'outcome_output': 'categorical_crossentropy'}
        loss_weights = [params['gamma'], 1 - params['gamma']]
    if self.net_out == 1:
        loss = {'act_output': 'categorical_crossentropy'}
        loss_weights = [1, 1]
    if self.net_out == 2:
        loss = {'outcome_output': 'categorical_crossentropy'}
        loss_weights = [1, 1]
    model.compile(loss=loss, optimizer=opt, loss_weights=loss_weights, metrics=['accuracy'])

    early_stopping = EarlyStopping(monitor='val_loss', patience=20)
    lr_reducer = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10, verbose=0,
                                   mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)

    if self.net_out == 0:
        history = model.fit(X_train, [Y_train, Z_train], epochs=3,
                            batch_size=2**params['batch_size'], verbose=2,
                            callbacks=[early_stopping, lr_reducer], validation_split=0.2)
    else:
        history = model.fit(X_train, label, epochs=300,
                            batch_size=2**params['batch_size'], verbose=2,
                            callbacks=[early_stopping, lr_reducer], validation_split=0.2)

    scores = [history.history['val_loss'][epoch] for epoch in range(len(history.history['loss']))]
    score = min(scores)
    # global best_score, best_model
    if self.best_score > score:
        self.best_score = score
        self.best_model = model
    return {'loss': score, 'status': STATUS_OK}
As can be seen, I need to consider two types of embeddings. For the one I have already implemented and tested (self.net_embedding=0), each activity/event in each trace (and consequently in each window) is mapped to an integer; then I apply fit_transform and to_categorical.
The second type of embedding I have to try uses word2vec. To that end, I have already changed the format of the input: instead of converting each activity to an integer, I keep it as a string (the actual name of the activity, standardized to just numbers and letters). I don't know how to proceed though; I guess I should do something like
w2vModel= Word2Vec(X_train, size=params['word2vec_size'], min_count=1)
to get the embedded windows from w2vModel.wv, but how do I then pass these to the LSTM layers? What should the embedding layer after the input become (where I put print("WIP") for now)?
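One common pattern (a sketch, not from the post; assuming gensim 4.x, where the size argument became vector_size) is to build an embedding matrix from the trained Word2Vec model and use it to initialize the Keras Embedding layer, frozen or trainable:

import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.layers import Embedding

w2v = Word2Vec(X_train, vector_size=params['word2vec_size'], min_count=1)

# index 0 is reserved for padding; words start at 1
word2idx = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
vocab_size = len(word2idx) + 1
embedding_matrix = np.zeros((vocab_size, w2v.wv.vector_size))
for w, i in word2idx.items():
    embedding_matrix[i] = w2v.wv[w]

# the windows must then be encoded with word2idx so the integer ids line up
x_act = Embedding(input_dim=vocab_size,
                  output_dim=w2v.wv.vector_size,
                  weights=[embedding_matrix],
                  input_length=self.win_size,
                  trainable=False)(input_act)  # trainable=True to fine-tune the vectors

The rest of the network (the LSTM stack above) can then stay unchanged, since it only ever sees the embedded tensor.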

ValueError: No gradients provided for any variable when using model.fit

I'm trying to use the features extracted from two pre-trained models (ResNet and MobileNet) as inputs to train a functional model using Keras. I need to classify images as categories 1, 2, or 3 using a softmax layer.
My model.fit function is giving me the following error:
ValueError: No gradients provided for any variable: ['dense_66/kernel:0', 'dense_66/bias:0',
'dense_64/kernel:0', 'dense_64/bias:0', 'dense_67/kernel:0', 'dense_67/bias:0',
'dense_65/kernel:0', 'dense_65/bias:0', 'dense_68/kernel:0', 'dense_68/bias:0',
'dense_69/kernel:0', 'dense_69/bias:0', 'dense_70/kernel:0', 'dense_70/bias:0'].
Here's the relevant part of code:
Creating the dataset
def datasetgenerator(url, BATCH_SIZE, IMG_SIZE):
    data = image_dataset_from_directory(url,
                                        shuffle=True,
                                        batch_size=BATCH_SIZE,
                                        image_size=IMG_SIZE,
                                        label_mode='int')
    return data
BATCH_SIZE = 20
IMG_SIZE = (160, 160)
train_dir = 'wound_dataset2/train'
train_dataset = datasetgenerator(url=train_dir, BATCH_SIZE=BATCH_SIZE, IMG_SIZE=IMG_SIZE)
val_dir = 'wound_dataset2/val'
validation_dataset = datasetgenerator(url=val_dir, BATCH_SIZE=BATCH_SIZE, IMG_SIZE=IMG_SIZE)
test_dir = 'wound_dataset2/test'
test_dataset = datasetgenerator(url=test_dir, BATCH_SIZE=BATCH_SIZE, IMG_SIZE=IMG_SIZE)
print(train_dataset)
Feature extraction
mobilenet_features = np.empty([20, 1280])
resnet_features = np.empty([20, 2048])
for data in train_dataset:
    image_batch, label_batch = data
    image_batch = data_augmentation(image_batch)
    preprocess_input_image_resnet = preprocess_input_resnet(image_batch)
    preprocess_input_image_mobilenet = preprocess_input_mobilenet(image_batch)
    feature_batch_resnet = base_model_resnet(preprocess_input_image_resnet)
    feature_batch_average_resnet = global_average_layer(feature_batch_resnet)
    feature_batch_mobilenet = base_model_mobilenet(preprocess_input_image_mobilenet)
    feature_batch_average_mobilenet = global_average_layer(feature_batch_mobilenet)
    mobilenet_features = np.concatenate((mobilenet_features, np.array(feature_batch_average_mobilenet)))
    resnet_features = np.concatenate((resnet_features, np.array(feature_batch_average_resnet)))
Model Generation
from tensorflow.keras.layers import concatenate
# define two sets of inputs
inputA = tf.keras.Input(shape=(1280,))
inputB = tf.keras.Input(shape=(2048,))
# the first branch operates on the first input
x = tf.keras.layers.Dense(8, activation="relu")(inputA)
x = tf.keras.layers.Dense(4, activation="relu")(x)
x = tf.keras.Model(inputs=inputA, outputs=x)
# the second branch operates on the second input
y = tf.keras.layers.Dense(64, activation="relu")(inputB)
y = tf.keras.layers.Dense(32, activation="relu")(y)
y = tf.keras.layers.Dense(4, activation="relu")(y)
y = tf.keras.Model(inputs=inputB, outputs=y)
# combine the output of the two branches
combined = concatenate([x.output, y.output])
fc_layers = [1024, 1024]
dropout = 0.5
# apply a FC layer and then a regression prediction on the
# combined outputs
z = Flatten()(combined)
for fc in fc_layers:
    # New FC layer, random init
    z = Dense(fc, activation='relu')(z)
    z = Dropout(dropout)(z)
# New softmax layer
predictions = Dense(3, activation='softmax')(z)
# our model will accept the inputs of the two branches and
# then output a single value
model = tf.keras.Model(inputs=[x.input, y.input], outputs=z)
Training
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
loss= tf.keras.losses.CategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
history = model.fit((mobilenet_features, resnet_features), batch_size=20, epochs=10)
I'm trying this as a method to improve accuracy over what I got using transfer learning. Any help would be appreciated.
The error occurs because the model was built with outputs=z, a layer before the final softmax, and model.fit was called without any target labels, so Keras has nothing to compute a loss (and hence gradients) from. Use the predictions layer as the model output and pass your targets to fit:
z = Flatten()(combined)
for fc in fc_layers:
    z = Dense(fc, activation='relu')(z)
    z = Dropout(dropout)(z)
predictions = Dense(3, activation='softmax')(z)
# use the prediction as the output layer
model = tf.keras.Model(inputs=[x.input, y.input], outputs=predictions)
# add the target tensor (your training labels) to the fit method
history = model.fit((mobilenet_features, resnet_features), your_target, batch_size=20, epochs=10)

Character-based Text Classification with Triplet Loss

I'm trying to implement a text classifier using triplet loss to classify different job descriptions into categories, based on this paper. But whatever I do, the classifier yields very bad results.
For the embedding I followed this tutorial, and the NN architecture is based on this article.
I create my encodings using:
max_char_length = 20
group_numbers = range(0, len(job_groups))
char_vocabulary = {'PAD': 0}
X_char = []
y_temp = []
i = 1
for group, number in zip(job_groups, group_numbers):
    for job in group:
        job_cleaned = some_cleaning_function(job)
        job_enc = []
        for c in job_cleaned:
            if c in char_vocabulary.keys():
                job_enc.append(char_vocabulary[c])
            else:
                char_vocabulary[c] = i
                job_enc.append(char_vocabulary[c])
                i += 1
        X_char.append(job_enc)
        y_temp.append(number)
X_char = pad_sequences(X_char, maxlen=max_char_length, truncating='post')
My Neural Network is set up the following way:
def create_base_model():
    char_in = Input(shape=(max_char_length,), name='Char_Input')
    char_enc = Embedding(input_dim=len(char_vocabulary)+1, output_dim=20,
                         mask_zero=True, name='Char_Embedding')(char_in)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(char_enc)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(x)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(x)
    x = Bidirectional(LSTM(64, return_sequences=False, recurrent_dropout=0.2, dropout=0.4))(x)
    out = Dense(128, activation="softmax")(x)
    return Model(char_in, out)

def get_siamese_triplet_char():
    anchor_input_c = Input(shape=(max_char_length,), name='Char_Input_Anchor')
    pos_input_c = Input(shape=(max_char_length,), name='Char_Input_Positive')
    neg_input_c = Input(shape=(max_char_length,), name='Char_Input_Negative')
    base_model = create_base_model()
    encoded_anchor = base_model(anchor_input_c)
    encoded_positive = base_model(pos_input_c)
    encoded_negative = base_model(neg_input_c)
    inputs = [anchor_input_c, pos_input_c, neg_input_c]
    outputs = [encoded_anchor, encoded_positive, encoded_negative]
    siamese_triplet = Model(inputs, outputs)
    siamese_triplet.add_loss(triplet_loss(outputs))
    siamese_triplet.compile(loss=None, optimizer='adam')
    return siamese_triplet, base_model
The triplet loss (a hinge on the anchor-positive versus anchor-negative Euclidean distances, with a margin of 1) is defined as follows:
def triplet_loss(inputs):
    anchor, positive, negative = inputs
    positive_distance = K.square(anchor - positive)
    negative_distance = K.square(anchor - negative)
    positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims=True))
    negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims=True))
    loss = positive_distance - negative_distance
    loss = K.maximum(0.0, 1 + loss)
    return K.mean(loss)
The model is then trained with:
siamese_triplet_char.fit(
    x=[Anchor_chars_train, Positive_chars_train, Negative_chars_train],
    shuffle=True, batch_size=8, epochs=22, verbose=1)
My goal is to first train the network with no label data in order to structure the embedding space of the different phrases, and second, add a classification layer and create the final classifier.
My general problem is that even though the first phase shows decreasing loss values, it overfits and the validation results jump around, and the second phase fails badly, as I'm not able to train the model to actually classify.
My questions are the following:
Could someone explain the embedding architecture? What is the output dimension referring to? The individual characters? Would that even make sense? Or is there a better way to encode the input data?
How can I add validation_data to a network that does not contain labeled data? I could use validation_split, but I would prefer to pass specific data to validate, as my data is stratified.
Is there a reason why the classification does not work? Applying a simple k-nearest-neighbor algorithm achieves at best 0.5 accuracy! Is it because of the data? Or is there a systematic error in my system?
All ideas and suggestions are really appreciated!
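For what it's worth, a minimal sketch of how the second phase could look (this is an assumption about the intended setup, not code from the post): freeze the triplet-trained base_model and train a small classification head on top of its embeddings.

import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

base_model.trainable = False                       # keep the triplet-trained weights fixed
clf_in = Input(shape=(max_char_length,))
embedding = base_model(clf_in)
clf_out = Dense(len(job_groups), activation='softmax')(embedding)

classifier = Model(clf_in, clf_out)
classifier.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                   metrics=['accuracy'])
classifier.fit(X_char, np.array(y_temp), validation_split=0.2, epochs=10)

As for the validation question, it should also be possible to pass a stratified hold-out triplet set as validation_data=([anchor_val, pos_val, neg_val], None) to a model compiled with loss=None and add_loss, though this depends on the Keras version.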

Loss not converging in visual question answering with keras

I am trying to train a neural network for visual question answering, but the loss keeps diverging.
Basic hyperparameter modifications gave no results, and I've tried different models too with no result. Here is a model I used:
word2vec_dim = 30
num_hidden_nodes_mlp = 1024
num_hidden_nodes_lstm = 30
num_layers_lstm = 2
dropout = 0.3
activation_mlp = 'tanh'
num_epochs = 1
image_model = Sequential()
image_model.add(Reshape(input_shape = (320,480,4), target_shape=(320,480,4)))
image_model.add(Conv2D(4,(3,1)))
image_model.add(Conv2D(4,(1,3)))
image_model.add(MaxPooling2D(pool_size=(2, 2)))
image_model.add(Conv2D(4,(3,1)))
image_model.add(Conv2D(4,(1,3)))
image_model.add(MaxPooling2D(pool_size=(2, 2)))
image_model.add(Conv2D(4,(3,1)))
image_model.add(Conv2D(4,(1,3)))
image_model.add(MaxPooling2D(pool_size=(2, 2)))
image_model.add(Conv2D(4,(3,1)))
image_model.add(Conv2D(4,(1,3)))
image_model.add(Flatten())
image_model.add(Dense(num_hidden_nodes_lstm, activation='relu'))
model1 = Model(inputs = image_model.input, outputs = image_model.output)
model1.summary()
language_model = Sequential()
language_model.add(Embedding(len(unique_words)+1, word2vec_dim, input_length=max_length))
language_model.add(LSTM(units=num_hidden_nodes_lstm,
                        return_sequences=True, input_shape=(None, word2vec_dim)))
for i in range(num_layers_lstm-2):
    language_model.add(LSTM(units=num_hidden_nodes_lstm, return_sequences=True))
language_model.add(LSTM(units=num_hidden_nodes_lstm, return_sequences=False))
model2 = Model(language_model.input, language_model.output)
model2.summary()
combined = concatenate([image_model.output, language_model.output])
model = Dense(512, activation="tanh", kernel_initializer="uniform")(combined)
#model = Activation('tanh')(model)
model = Dropout(0.3)(model)
model = Dense(512, activation="tanh", kernel_initializer="uniform")(model)
#model = Activation('tanh')(model)
#model = Dropout(0.5)(model)
#model = Dense(1024, activation="tanh", kernel_initializer="uniform")(model)
#model = Activation('tanh')(model)
#model = Dropout(0.5)(model)
model = Dense(13, activation="softmax")(model)
model = Model(inputs=[image_model.input, language_model.input], outputs=model)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.summary()
Here instead is the training code. The dataset was split 80/20, batch size 64; the number of epochs is low, but since the dataset is big (3k batches), the loss explodes before even getting through 10% of a single one.
The words' target class is one-hot encoded, and the question encoding is done with a one-to-one dictionary correspondence (using a dictionary with every word, since there are not many), leaving 0 as the padding value. I have ignored commas, question marks, etc., though.
train_gen=image_generator(batch_size=batch_size)
eval_gen=evaluation_generator(batch_size=batch_size)
model.fit(x=train_gen, epochs=2, verbose=1, validation_data=eval_gen, steps_per_epoch=training_batches ,validation_steps=evaluation_batches, shuffle=True, max_queue_size=10, callbacks=[save])
I also get this warning message
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Epoch 1/2
522/3243 [===>..........................] - ETA: 33:05 - loss: 2825421622922535501824.0000
I observed that the model would answer all the questions with the same class (I imagine this to be the cause of the diverging loss).
where image_generator is defined:
def my_hash(word):
    for x in range(dictionary_length-1):
        if word == unique_words[x]:
            return (x+1)
    print("Error, word not in the vocabulary")

def pad(sequence, length, value=0):
    for x in range(len(sequence), length):
        sequence.append(value)
    return sequence
def image_generator(batch_size=32):
    zeros = [0]*13
    while True:
        for x2 in range(training_batches):  # select files (paths/indices) for the batch
            input_img_batch = []
            input_question_batch = []
            output_batch = []
            img_name = ""
            for x in range(batch_size):
                temp = []
                img_name = training_data["questions"][x+x2*batch_size]["image_filename"]
                question = training_data["questions"][x+x2*batch_size]["question"].replace("?", "")
                question = hashing_trick(question, dictionary_length, hash_function=my_hash)
                question = pad(question, max_length)
                img = Image.open("/kaggle/input/ann-and-dl-vqa/dataset_vqa/train/" + img_name, 'r')
                img = img.resize([img_width, img_height])
                img = np.asarray(img)  # same process as before, for the corresponding image
                img = img/255
                input_img_batch.append(img)
                input_question_batch.append(question)
                dummy = zeros
                dummy[encode_answer(training_data["questions"][x+x2*batch_size]["answer"])] = 1
                output_batch.append(dummy)
            # Return a tuple of (input, output) to feed the network
            batch_x1 = np.array(input_img_batch)
            batch_x2 = np.array(input_question_batch)
            batch_y = np.array(output_batch)
            yield ([batch_x1, batch_x2], batch_y)
I solved the problem.
There was an issue in the image_generator: the zeros vector somehow changed value and became equal to dummy (instead of the other way around), which messed up the prediction targets.
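For the record, my reading of that fix (an interpretation, not spelled out in the answer): dummy = zeros binds dummy to the same list object as zeros, so dummy[...] = 1 also mutates zeros, and the supposedly one-hot targets accumulate stale 1s across iterations. Copying the list per sample avoids the aliasing:

# inside the inner loop of image_generator: create a fresh one-hot vector per sample
dummy = zeros.copy()  # or simply: dummy = [0] * 13
dummy[encode_answer(training_data["questions"][x + x2 * batch_size]["answer"])] = 1
output_batch.append(dummy)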

Error when checking model input keras when predicting new results

I am trying to use a Keras model I built on new data, but I get an input error when trying to generate predictions.
Here's my code for the model:
def build_model(max_features, maxlen):
    """Build LSTM model"""
    model = Sequential()
    model.add(Embedding(max_features, 128, input_length=maxlen))
    model.add(LSTM(128))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop')
    return model
And here's my code for predicting on the new data:
LSTM_model = load_model('LSTMmodel.h5')
data = pickle.load(open('traindata.pkl', 'rb'))
#### LSTM ####
# Prepare the data for the LSTM model
# Extract data and labels
X = [x[1] for x in data]
labels = [x[0] for x in data]
# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X)))}
max_features = len(valid_chars) + 1
maxlen = np.max([len(x) for x in X])
# Convert characters to int and pad
X = [[valid_chars[y] for y in x] for x in X]
X = sequence.pad_sequences(X, maxlen=maxlen)
# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels]
y_pred = LSTM_model.predict(X)
The error I get when running this code:
ValueError: Error when checking input: expected embedding_1_input to have shape (57,) but got array with shape (36,)
My error comes from maxlen: for my training data maxlen=57, while for my new data maxlen=36.
So I tried setting maxlen=57 in my prediction code, but then I get this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[31,53] = 38 is not in [0, 38)
[[Node: embedding_1/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/embedding_lookup/axis)]]
What should I do in order to resolve these issues? Change my embedding layer?
Either set the input_length of the Embedding layer to the maximum length you would ever see in the data, or simply reuse the same maxlen value you used when constructing the model in pad_sequences. In that case any sequence shorter than maxlen will be padded and any sequence longer than maxlen will be truncated.
Further, make sure that the features you use are the same at both train and test time; in particular, the character-to-index mapping must not change, otherwise indices can fall outside the embedding's range (which is what the second error shows).
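A minimal sketch of that advice (the file name preprocessing.pkl and the variable X_raw are assumptions for illustration): persist the character mapping and maxlen from training, then reuse them at prediction time instead of recomputing them from the new data.

import pickle
from keras.models import load_model
from keras.preprocessing import sequence

# at training time, save the preprocessing state alongside the model
with open('preprocessing.pkl', 'wb') as f:
    pickle.dump({'valid_chars': valid_chars, 'maxlen': maxlen}, f)

# at prediction time, load and reuse it
LSTM_model = load_model('LSTMmodel.h5')
with open('preprocessing.pkl', 'rb') as f:
    prep = pickle.load(f)

# characters unseen during training map to 0 here (the padding index) as a simple fallback
X_new = [[prep['valid_chars'].get(c, 0) for c in x] for x in X_raw]
X_new = sequence.pad_sequences(X_new, maxlen=prep['maxlen'])
y_pred = LSTM_model.predict(X_new)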
