Character-based Text Classification with Triplet Loss - python

Im trying to implement a text-classifier using triplet loss to classify different job descriptions into categories based on this paper. But whatever i do, the classifier yields very bad results.
For the embedding i followed this tutorial and the NN architecture is based on this article.
I create my encodings using:
max_char_len = 20
group_numbers = range(0, len(job_groups))
char_vocabulary = {'PAD':0}
X_char = []
y_temp = []
i = 1
for group, number in zip(job_groups, group_numbers):
for job in group:
job_cleaned = some_cleaning_function(job)
job_enc = []
for c in job_cleaned:
if c in char_vocabulary.keys():
job_enc.append(char_vocabulary[c])
else:
char_vocabulary[c] = i
job_enc.append(char_vocabulary[c])
i+=1
X_char.append(job_enc)
y_temp.append(number)
X_char = pad_sequences(X_char, maxlen = max_char_length, truncating='post')
My Neural Network is set up the following way:
def create_base_model():
char_in = Input(shape=(max_char_length,), name='Char_Input')
char_enc = Embedding(input_dim=len(char_vocabulary)+1, output_dim=20, mask_zero=True,name='Char_Embedding')(char_in)
x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(char_enc)
x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(x)
x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(x)
x = Bidirectional(LSTM(64, return_sequences=False, recurrent_dropout=0.2, dropout=0.4))(x)
out = Dense(128, activation = "softmax")(x)
return Model(char_in, out)
def get_siamese_triplet_char():
anchor_input_c = Input(shape=(max_char_length,),name='Char_Input_Anchor')
pos_input_c = Input(shape=(max_char_length,),name='Char_Input_Positive')
neg_input_c = Input(shape=(max_char_length,),name='Char_Input_Negative')
base_model = create_base_model(encoding_generator)
encoded_anchor = base_model(anchor_input_c)
encoded_positive = base_model(pos_input_c)
encoded_negative = base_model(neg_input_c)
inputs = [anchor_input_c, pos_input_c, neg_input_c]
outputs = [encoded_anchor, encoded_positive, encoded_negative]
siamese_triplet = Model(inputs, outputs)
siamese_triplet.add_loss((triplet_loss(outputs)))
siamese_triplet.compile(loss=None, optimizer='adam')
return siamese_triplet, base_model
The triplet loss is defined as follows:
def triplet_loss(inputs):
anchor, positive, negative = inputs
positive_distance = K.square(anchor - positive)
negative_distance = K.square(anchor - negative)
positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims = True))
negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims = True))
loss = positive_distance - negative_distance
loss = K.maximum(0.0, 1 + loss)
return K.mean(loss)
The model is then trained with:
siamese_triplet_char.fit(x=
[Anchor_chars_train,
Positive_chars_train,
Negative_chars_train],
shuffle=True, batch_size=8, epochs=22, verbose=1)
My goal is to: First, train the network with no label data in order to minimize the space of the different phrases and second, add a classification layer and create the final classifier.
My general problem is that even the first phase shows sinking cost-values it overfits and the validation results jump around and the second phase fails badly as I'm not able to train the model to actually classify.
My questions are the following:
Could someone explain the Embedding Architecture? What is the output dimension refering to? The individual characters? Would that even make sense? Or is there a better way to encode the input data?
How can i add validation_data to a network that does not contain labeled data? I could use validation_split, but i would rather prefer passing specific data to validate as my data is stratified.
Is there a reason why the classification does not work? Applying a simple K-Nearest Neighbor algorithm achieves at best 0.5 accuracy! Is it because of the data? Or is there a systematic error in my system?
All ideas and suggestions are really appreciated!

Related

How to access all outputs from a single custom loss function in keras

I'm trying to reproduce the architecture of the network proposed in this publication in tensorFlow. Being a total beginner to this, I've been using this tutorial as a base to work on, using tensorflow==2.3.2.
To train this network, they use a loss which implies outputs from two branches of the network at the same time, which made me look towards custom losses function in keras. I've got that you can define your own, as long as the definition of the function looks like the following:
def custom_loss(y_true, y_pred):
I also understood that you could give other arguments like so:
def loss_function(margin=0.3):
def custom_loss(y_true, y_pred):
# And now you can use margin
You then just have to call these while compiling your model. When it comes to using multiple outputs, the most common approach seem to be the one proposed here, where you would give several losses functions, one being called for each of your output.
However, I could not find a solution to give several outputs to a loss function, which is what I need here.
To further explain it, here is a minimal working example showing what I've tried, which you can try for yourself in this collab.
import os
import tensorflow as tf
import keras.backend as K
from tensorflow.keras import datasets, layers, models, applications, losses
from tensorflow.keras.preprocessing import image_dataset_from_directory
_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=_URL, extract=True)
PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')
train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')
BATCH_SIZE = 32
IMG_SIZE = (160, 160)
IMG_SHAPE = IMG_SIZE + (3,)
train_dataset = image_dataset_from_directory(train_dir,
shuffle=True,
batch_size=BATCH_SIZE,
image_size=IMG_SIZE)
validation_dataset = image_dataset_from_directory(validation_dir,
shuffle=True,
batch_size=BATCH_SIZE,
image_size=IMG_SIZE)
data_augmentation = tf.keras.Sequential([
layers.experimental.preprocessing.RandomFlip('horizontal'),
layers.experimental.preprocessing.RandomRotation(0.2),
])
preprocess_input = applications.resnet50.preprocess_input
base_model = applications.ResNet50(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = True
conv = layers.Conv2D(filters=128, kernel_size=(1,1))
global_pooling = layers.GlobalAveragePooling2D()
horizontal_pooling = layers.AveragePooling2D(pool_size=(1, 5))
reshape = layers.Reshape((-1, 128))
def custom_loss(y_true, y_pred):
print(y_pred.shape)
# Do some stuffs involving both outputs
# Returning something trivial here for correct behavior
return K.mean(y_pred)
inputs = tf.keras.Input(shape=IMG_SHAPE)
x = data_augmentation(inputs)
x = preprocess_input(x)
x = base_model(x, training=True)
first_branch = global_pooling(x)
second_branch = conv(x)
second_branch = horizontal_pooling(second_branch)
second_branch = reshape(second_branch)
model = tf.keras.Model(inputs, [first_branch, second_branch])
base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
loss=custom_loss,
metrics=['accuracy'])
model.summary()
initial_epochs = 10
history = model.fit(train_dataset,
epochs=initial_epochs,
validation_data=validation_dataset)
while doing so, I thought that the y_pred given to loss function would be a list, containing both outputs. However, while running it, what I've got in stdout was this:
Epoch 1/10
(None, 2048)
(None, 5, 128)
What I understand from this is that the loss function is called with every output, one by one, instead of being called once with all the outputs, which means I can't define a loss that would use both the outputs at the same time. Is there any way to achieve this?
Please let me know if I'm unclear, or if you need further details.
I had the same problem trying to implement Triplet_Loss function.
I refered to Keras's implementation for Siamese Network with Triplet Loss Function but something didnt work out and I had to implement the network by myself.
def get_siamese_model(input_shape, conv2d_filters):
# Define the tensors for the input images
anchor_input = Input(input_shape, name="Anchor_Input")
positive_input = Input(input_shape, name="Positive_Input")
negative_input = Input(input_shape, name="Negative_Input")
body = build_body(input_shape, conv2d_filters)
# Generate the feature vectors for the images
encoded_a = body(anchor_input)
encoded_p = body(positive_input)
encoded_n = body(negative_input)
distance = DistanceLayer()(encoded_a, encoded_p, encoded_n)
# Connect the inputs with the outputs
siamese_net = Model(inputs=[anchor_input, positive_input, negative_input],
outputs=distance)
return siamese_net
and the "bug" was in DistanceLayer Implementation Keras posted (also in the same link above).
class DistanceLayer(tf.keras.layers.Layer):
"""
This layer is responsible for computing the distance between the anchor
embedding and the positive embedding, and the anchor embedding and the
negative embedding.
"""
def __init__(self, **kwargs):
super().__init__(**kwargs)
def call(self, anchor, positive, negative):
ap_distance = tf.math.reduce_sum(tf.math.square(anchor - positive), axis=1, keepdims=True, name='ap_distance')
an_distance = tf.math.reduce_sum(tf.math.square(anchor - negative), axis=1, keepdims=True, name='an_distance')
return (ap_distance, an_distance)
When I was training the model, the loss function took only one of the vectors ap_distance or an_distance.
FINALLY, THE FIX WAS to concatenate the vectors together (along axis=1 this case) and on the loss function, take them apart:
def call(self, anchor, positive, negative):
ap_distance = tf.math.reduce_sum(tf.math.square(anchor - positive), axis=1, keepdims=True, name='ap_distance')
an_distance = tf.math.reduce_sum(tf.math.square(anchor - negative), axis=1, keepdims=True, name='an_distance')
return tf.concat([ap_distance, an_distance], axis=1)
on my custom loss:
def get_loss(margin=1.0):
def triplet_loss(y_true, y_pred):
# The output of the network is NOT A tuple, but a matrix shape (batch_size, 2),
# containing the distances between the anchor and the positive example,
# and the anchor and the negative example.
ap_distance = y_pred[:, 0]
an_distance = y_pred[:, 1]
# Computing the Triplet Loss by subtracting both distances and
# making sure we don't get a negative value.
loss = tf.math.maximum(ap_distance - an_distance + margin, 0.0)
# tf.print("\n", ap_distance, an_distance)
# tf.print(f"\n{loss}\n")
return loss
return triplet_loss
Ok, here is an easy way to achieve this. We can achieve this by using the loss_weights parameter. We can weigh multiple outputs exactly the same so that we can get the combined loss results. So, for two output we can do
loss_weights = 1*output1 + 1*output2
In your case, your network has two outputs, by the name they are reshape, and global_average_pooling2d. You can do now as follows
# calculation of loss for one output, i.e. reshape
def reshape_loss(y_true, y_pred):
# do some math with these two
return K.mean(y_pred)
# calculation of loss for another output, i.e. global_average_pooling2d
def gap_loss(y_true, y_pred):
# do some math with these two
return K.mean(y_pred)
And while compiling now you need to do as this
model.compile(
optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
loss = {
'reshape':reshape_loss,
'global_average_pooling2d':gap_loss
},
loss_weights = {
'reshape':1.,
'global_average_pooling2d':1.
}
)
Now, the loss is the result of 1.*reshape + 1.*global_average_pooling2d.

Unable to completely separate outputs of model in TensorFlow

I am trying to create a convolutional neural network that has two regression outputs, a score and a confidence. I have frozen the layers they have in common in the hopes that the addition of the confidence output doesn't change the score, but in my experiments it has. For the model with just the score, I used Xception and added a simple GlobalAveragePooling2D and Dense(512) layer then output a single number.
base_model = Xception(input_shape=(224, 224, 3), weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(512, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)
model = Model(inputs=base_model.input, outputs=predictions)
for layer in base_model.layers:
layer.trainable = False
optimizer = Adam(learning_rate=learning_rate)
model.compile(loss='mae', optimizer=optimizer, metrics=['mse','mae'], run_eagerly=True)
Here is what the end of model.summary() looks like:
When I fit it, the model produces good results.
But when I try to add a second output the result of the first becomes much worse. The new model gets trained off tuples where is first number is the same as the first model and the second number is a confidence value. The model is very similar to the one above.
base_model = Xception(input_shape=(224, 224, 3), weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
score_x = Dense(512, activation='relu')(x)
score_out = Dense(1, activation='sigmoid', name='score_model')(score_x)
confidence_x = Dense(512, activation='relu')(x)
confidence_out = Dense(1, name='confidence_model')(confidence_x)
model = Model(inputs=base_model.input, outputs=[score_out, confidence_out])
for layer in base_model.layers:
layer.trainable = False
losses = {'score_model': 'mae', 'confidence_model': 'mae'}
loss_weights = {'score_model': 1, 'confidence_model': 1}
model.compile(loss=losses, loss_weights=loss_weights, optimizer=optimizer, metrics=['mse','mae'], run_eagerly=True)
When I look at model.summary(), it has twice as many trainable parameters as the previous model, which is exactly what I was expecting. Everything looks right to me so far.
But when I train this model the performance on the score is much worse. I was thinking it would be the same (within stochastic variation). After the first epoch, the loss from the first model is around 0.125. The score_model_loss from the second model is around 0.554. Clearly I'm not completely separating the models. What am I missing?
Note: This answer will work well only because the layer that do the feature extraction are frozen. As #Akshay Sehgal stated in the comments :
optimizing for 2 goals together is actually a completely different problem than optimizing 2 independent goals separately
In that case, we are optimizing for 2 goals separately.
The easiest solution is probably to write a custom training loop with 2 tf.GradientTape, one for each goal. Lets consider this really simple example:
Dummy data
Let's create some random Data
import tensorflow as tf
X = tf.random.normal((1000,1))
y1= 3*X + 1
y2 = -2*X +2
ds = tf.data.Dataset.from_tensor_slices((X,y1,y2)).batch(10)
Creating a model with 2 outputs
In that example, I skip the feature extraction step, as a simple linear regression will work for the data. But as your feature extractor network is frozen, the example is similar.
inp = tf.keras.Input((1,))
dense_1 = tf.keras.layers.Dense(1, name="objective1")(inp)
dense_2 = tf.keras.layers.Dense(1, name="objective2")(inp)
model = tf.keras.Model(inputs=inp, outputs=[dense_1, dense_2])
# setting up the loss functions as well as the optimizer
opt = tf.optimizers.SGD()
loss_func1 = tf.losses.mean_squared_error
loss_func2 = tf.losses.mean_absolute_error
Note the name given to the two dense layers: I will use them later to retrieve the appropriate weights.
Getting the weights to optimize
We can use the name set before to retrieve the variable belonging to each objective :
var1, var2 = [],[]
for l in model.layers:
if "objective1" in l.name:
var1 += l.trainable_variables
if "objective2" in l.name:
var2 += l.trainable_variables
The training loop
You simply need to tapes, one for each objective. You can use different optimizer as well, if it makes the training better.
counter = 0
for x, y1, y2 in ds:
counter += 1
with tf.GradientTape() as tape1, tf.GradientTape() as tape2:
pred1, pred2 = model(x)
loss1 = loss_func1(y1, pred1)
loss2 = loss_func2(y2, pred2)
grad1 = tape1.gradient(loss1, var1)
grad2 = tape2.gradient(loss2, var2)
opt.apply_gradients(zip(grad1, var1))
opt.apply_gradients(zip(grad2, var2))
if counter % 10:
print(f"Step : {counter}, objective1: {tf.reduce_mean(loss1)}, objective2: {tf.reduce_mean(loss2)}")
If we run the training, we get:
Step : 1, objective1: 4.609124183654785, objective2: 2.6634981632232666
[...]
Step : 99, objective1: 7.176481902227555e-14, objective2: 0.030187154188752174
The principle advantage training that way is that you just need to extract the features once for the two objectives.

Keras: how to implement target replication for LSTM?

Using examples from Lipton et al (2016), target replication is basically calculating the loss at each time step (except final) of the LSTM (or GRU) and averaging this loss and adding it to the main loss while training. Mathematically, it is given by -
Graphically, it can be represented as -
So how do I go about exactly implementing this in Keras? Say, I have binary classification task. Let's say my model is a simple one given below -
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='binary_crossentropy', class_weights={0:0.5, 1:4}, optimizer=Adam(), metrics=['accuracy'])
model.fit(x_train, y_train)
I think y_train needs to be reshaped/tiled from (batch_size, 1) to (batch_size, time_step) right?
The dense layer needs TimeDistributed to be applied correctly to the LSTM after setting return_sequences=True?
How do I exactly implement the exact loss function given above? Will class_weights need to be modified?
Target replication is only during training. How to implement validation set evaluation using only the main loss?
How should I deal with zero paddings in target replication? My sequences are padded to a max_len of 15 with average length being 7. Since the target replication loss averages over all the steps, how do I make sure it doesn't use the padded words in calculating the loss? Basically, dynamically assign T the actual sequence length.
Question 1:
So, for the targets, you need it shaped as (batch_size, time_steps, 1). Just use:
y_train = np.stack([y_train]*time_steps, axis=1)
Question 2:
You're correct, but TimeDistributed is optional in Keras 2.
Question 3:
I don't know how class weights will behave, but a regular loss function should go like:
from keras.losses import binary_crossentropy
def target_replication_loss(alpha):
def inner_loss(true,pred):
losses = binary_crossentropy(true,pred)
return (alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1])
return inner_loss
model.compile(......, loss = target_replication_loss(alpha), ...)
Question 3a:
Since the above doens't work well with class weights, I created an alternative where the weights go into the loss:
def target_replication_loss(alpha, class_weights):
def get_weights(x):
b = class_weights[0]
a = class_weights[1] - b
return (a*x) + b
def inner_loss(true,pred):
#this will only work for classification with only one class 0 or 1
#and only if the target is the same for all classes
true_classes = true[:,-1,0]
weights = get_weights(true_classes)
losses = binary_crossentropy(true,pred)
return weights*((alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1]))
return inner_loss
Question 4:
To avoid complexity, I'd say you should use an additional metric in validation:
def last_step_BC(true,pred):
return binary_crossentropy(true[:,-1], pred[:,-1])
model.compile(....,
loss = target_replication_loss(alpha),
metrics=[last_step_BC])
Question 5:
This is a hard one and I'd need to research a little....
As an initial workaround, you can set the model with an input shape of (None, features), and train each sequence individually.
Working example without class_weight
def target_replication_loss(alpha):
def inner_loss(true,pred):
losses = binary_crossentropy(true,pred)
#print(K.int_shape(losses))
#print(K.int_shape(losses[:,:-1]))
#print(K.int_shape(K.mean(losses[:,:-1], axis=-1)))
#print(K.int_shape(losses[:,-1]))
return (alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1])
return inner_loss
alpha = 0.6
i1 = Input((5,2))
i2 = Input((5,2))
out = LSTM(1, activation='sigmoid', return_sequences=True)(i1)
model = Model(i1, out)
model.compile(optimizer='adam', loss = target_replication_loss(alpha))
model.fit(np.arange(30).reshape((3,5,2)), np.arange(15).reshape((3,5,1)), epochs = 200)
Working example with class weights:
def target_replication_loss(alpha, class_weights):
def get_weights(x):
b = class_weights[0]
a = class_weights[1] - b
return (a*x) + b
def inner_loss(true,pred):
#this will only work for classification with only one class 0 or 1
#and only if the target is the same for all classes
true_classes = true[:,-1,0]
weights = get_weights(true_classes)
losses = binary_crossentropy(true,pred)
print(K.int_shape(losses))
print(K.int_shape(losses[:,:-1]))
print(K.int_shape(K.mean(losses[:,:-1], axis=-1)))
print(K.int_shape(losses[:,-1]))
print(K.int_shape(weights))
return weights*((alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1]))
return inner_loss
alpha = 0.6
class_weights={0: 0.5, 1:4.}
i1 = Input(batch_shape=(3,5,2))
i2 = Input((5,2))
out = LSTM(1, activation='sigmoid', return_sequences=True)(i1)
model = Model(i1, out)
model.compile(optimizer='adam', loss = target_replication_loss(alpha, class_weights))
model.fit(np.arange(30).reshape((3,5,2)), np.arange(15).reshape((3,5,1)), epochs = 200)

Triplet embedded layer with custom (None) loss on Keras 2

I've been looking for simple implementations of triplet embedding in deep learning. I wanted to use Keras as it is what I am slightly more familiar (although still very inexperienced in it).
Here is a reference on one of the inspiration works: paper on embedded triplets
I've found a pretty good example to start off with, working with the mnist dataset, as far as I can tell it is working pretty well. Problems arise on the implementation of the merge of the 3 embedded layers.
def build_model(input_shape):
base_input = Input(input_shape)
x = Conv2D(32, (3, 3), activation='relu')(base_input)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Dropout(0.25)(x)
x = Flatten()(x)
x = Dense(2, activation='linear')(x)
embedding_model = Model(base_input, x, name='embedding')
anchor_input = Input(input_shape, name='anchor_input')
positive_input = Input(input_shape, name='positive_input')
negative_input = Input(input_shape, name='negative_input')
anchor_embedding = embedding_model(anchor_input)
positive_embedding = embedding_model(positive_input)
negative_embedding = embedding_model(negative_input)
inputs = [anchor_input, positive_input, negative_input]
outputs = [anchor_embedding, positive_embedding, negative_embedding]
triplet_model = Model(inputs, outputs)
triplet_model.add_loss(K.mean(triplet_loss(outputs)))
triplet_model.compile(loss=None, optimizer='adam') # <-- CRITICAL LINE
return embedding_model, triplet_model
With the currently implementation the loss is added through model.add_loss and I haven't find many examples like this. The real issue though, is that I cannot load the saved model. The lines
triplet_model.save('triplet.h5')
model = load_model('triplet.h5')
return:
ValueError: The model cannot be compiled because it has no loss to optimize.
Adding a parameter to the 'loss' argument raises another error when I try to compile the model. I wanted to ask how can I circumvent this issue or if there is a better way to create the model with the embedded models (without the empty loss function, maybe).
Here is the triplet_loss function for reference:
def triplet_loss(inputs, dist='sqeuclidean', margin='maxplus'):
anchor, positive, negative = inputs
positive_distance = K.square(anchor - positive)
negative_distance = K.square(anchor - negative)
if dist == 'euclidean':
positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims=True))
negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims=True))
elif dist == 'sqeuclidean':
positive_distance = K.mean(positive_distance, axis=-1, keepdims=True)
negative_distance = K.mean(negative_distance, axis=-1, keepdims=True)
loss = positive_distance - negative_distance
if margin == 'maxplus':
loss = K.maximum(0.0, 1 + loss)
elif margin == 'softplus':
loss = K.log(1 + K.exp(loss))
return K.mean(loss)
Here is the full script: link
The problem here is that you leave triplet_model.compile(loss=None), but keras does not know how to deal with it properly in load_model(). I understand that you have to do so, but you can load the model in a different way to solve your current issue.
In short, don't load the entire model through load_model(), but just the weights through load_weights().
For example, you can do
# save only weights
triplet_model.save_weights('tmp.h5')
# load saved weights
new_embedding_model, new_triplet_model = build_model(input_shape)
new_triplet_model.load_weights('tmp.h5') # load only weights

Getting extremely low loss in a bidirectional RNN?

I have implemented a bi-directional RNN in TensorFlow using a BasicLSTMCell and rnn.bidirectional_rnn. I am calculating the loss using seq2seq.sequence_loss_by_example after concatenating the outputs I receive. My application is a next character predictor.
I getting an extremely low cost, (~50 times lesser than the unidirectional RNN). I suspect I've made a mistake in the seq2seq.sequence_loss_by_example step.
Here is my model -
# Model begins
cell_fn = rnn_cell.BasicLSTMCell
cell = fw_cell = cell_fn(args.rnn_size, state_is_tuple=True)
cell2 = bw_cell = cell_fn(args.rnn_size, state_is_tuple=True)
input_data = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
targets = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
initial_state = fw_cell.zero_state(args.batch_size, tf.float32)
initial_state2 = bw_cell.zero_state(args.batch_size, tf.float32)
with tf.variable_scope('rnnlm'):
softmax_w = tf.get_variable("softmax_w", [2*args.rnn_size, args.vocab_size])
softmax_b = tf.get_variable("softmax_b", [args.vocab_size])
with tf.device("/cpu:0"):
embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
input_embeddings = tf.nn.embedding_lookup(embedding, input_data)
inputs = tf.unpack(input_embeddings, axis=1)
outputs, last_state, last_state2 = rnn.bidirectional_rnn(fw_cell,
bw_cell,
inputs,
initial_state_fw=initial_state,
initial_state_bw=initial_state2,
dtype=tf.float32)
output = tf.reshape(tf.concat(1, outputs), [-1, 2*args.rnn_size])
logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
loss = seq2seq.sequence_loss_by_example([logits],
[tf.reshape(targets, [-1])],
[tf.ones([args.batch_size * args.seq_length])],
args.vocab_size)
cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
args.grad_clip)
optimizer = tf.train.AdamOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
I think there is no any mistake in your code.
The problem is the objective function with the Bi-RNN model in your application (next character predictor).
The unidirectional RNN (such as ptb_word_lm or char-rnn-tensorflow), it is really a model used for the prediction, for example, if raw_text is 1,3,5,2,4,8,9,0, then, your inputs and target will be:
inputs: 1,3,5,2,4,8,9
target: 3,5,2,4,8,9,0
and the prediction is (1)->3, (1,3)->5, ..., (1,3,5,2,4,8,9)->0
But in Bi-RNN, the first prediction is really not just (1)->3, because the output[0] in your code contians the reverse information of the raw_text by use bw_cell (also not (1,3)->5, ..., (1,3,5,2,4,8,9)->0). A similar example is: I tell you that flower is a rose, and than I let you to predict what the flower is? I think you can give me the right answer very easy, and this is also the reason why you getting an extremely low loss in your Bi-RNN model for the application.
In fact, I think Bi-RNN (or Bi-LSTM) is not an appropriate model for the application of next character predictor. Bi-RNN need the full sequence when it works, you will find you can't use this model easily when you want to predict the next character.

Categories

Resources