I am trying to model sentence similarity use tensorflow. The ideal is that first feed the concatenated sentence to a rnn, then feed the output of the rnn to a softmax to do a binary classification, similar or not. I make some minus changes to the PTB language model example , but the cost won't decrease as expected.
cost: 0.694479
cost: 0.695012
cost: 0.6955
...
The code is as follows. Any kind of help would be appreciated.
class PTBModel(object):
"""The PTB model."""
def __init__(self, is_training, config):
self.batch_size = batch_size = config.batch_size
self.num_steps = num_steps = config.num_steps
size = config.hidden_size
vocab_size = config.vocab_size
label_size = 2
self._input_data = tf.placeholder(tf.int32, [batch_size, num_steps], name="inputs")
# for each sentence pair we only have one output
self._targets = tf.placeholder(tf.int64, [batch_size], name="labels")
# Slightly better results can be obtained with forget gate biases
# initialized to 1 but the hyperparameters of the model would need to be
# different than reported in the paper.
lstm_cell = rnn_cell.BasicLSTMCell(size, forget_bias=0.0)
if is_training and config.keep_prob < 1:
lstm_cell = rnn_cell.DropoutWrapper(
lstm_cell, output_keep_prob=config.keep_prob)
cell = rnn_cell.MultiRNNCell([lstm_cell] * config.num_layers)
self._initial_state = cell.zero_state(batch_size, tf.float32)
with tf.device("/cpu:0"):
embedding = tf.get_variable("embedding", [vocab_size, size])
inputs = tf.nn.embedding_lookup(embedding, self._input_data)
if is_training and config.keep_prob < 1:
inputs = tf.nn.dropout(inputs, config.keep_prob)
# Simplified version of tensorflow.models.rnn.rnn.py's rnn().
# This builds an unrolled LSTM for tutorial purposes only.
# In general, use the rnn() or state_saving_rnn() from rnn.py.
#
# The alternative version of the code below is:
#
# from tensorflow.models.rnn import rnn
# inputs = [tf.squeeze(input_, [1])
# for input_ in tf.split(1, num_steps, inputs)]
# outputs, states = rnn.rnn(cell, inputs, initial_state=self._initial_state)
outputs = []
states = []
state = self._initial_state
with tf.variable_scope("RNN"):
for time_step in range(num_steps):
if time_step > 0: tf.get_variable_scope().reuse_variables()
(cell_output, state) = cell(inputs[:, time_step, :], state)
outputs.append(cell_output)
states.append(state)
# use the output of the last word as the input feature to softmax
output = tf.reshape(tf.concat(1, outputs[-1]), [-1, size])
softmax_w = tf.get_variable("softmax_w", [size, label_size])
softmax_b = tf.get_variable("softmax_b", [label_size])
self.logits = logits = tf.matmul(output, softmax_w) + softmax_b
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, self.targets)
self._cost = cost = tf.reduce_sum(loss) / batch_size
# self._cost = cost = -tf.reduce_sum(tf.reshape(self.targets, [-1, 1])
# *tf.log(tf.clip_by_value(tf.log(tf.nn.softmax(logits)), 1e-10,1.0)))
self._final_state = states[-1]
if not is_training:
return
self._lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
config.max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
self._train_op = optimizer.apply_gradients(zip(grads, tvars))
def step(self, session, inputs, labels, is_training):
if is_training:
output_feeds = [self.cost, self._final_state, self._train_op]
else:
output_feeds = [self.cost]
input_feeds = {
self.input_data: inputs,
self.targets: labels
}
cost, state, logits = session.run(output_feeds, input_feeds)
return cost, state, logits
A small recommendation for you.
You can use two encoders(either RNN or CNN) one for each sentence, and then encoder these two sentence into two sentence embeddings. Once you have the two sentence vectors you just calculate the cosine similarity as the output. 1 if the two sentence have the same meaning and 0 if not for training. In inference you can input any two sentences and get the cosine similarity as the sentence semantic similarity.
With RNN the assumption is that there are sequential relationships between the inputs. Concatenating two sentences implies to the network that they compose a single sequence, but this isn't the case. For a comparison between two sentences, each should be represented individually.
Related
I derived a Bidirectional LSTM for Sentiment analysis on news headlines, but when training the model the loss function value doesn't improve, stays around 0.6 and 0.7.
Certainly I'm making something wrong, I wonder if it is something related with the Embedding layer.
I'm iteratively passing each batch into the Network with a size of 10 and sentence length of 30 words, my vocabulary size is 5745 so the shape of this tensor would be (10, 30, 5745) after one hot encoding.
My Embedding layer has num_embeddings = 5745 and embed-dim = 100 so when I call self.embedding(input), have as output shape: (10, 30, 5745, 100)
I wanted to have an output shape of: (10,30,100)
Therefore I used this line of code:
embeddings = torch.max(embeddings, dim=2)
But I'm not sure if it does what I expect it to do for each word/one-hot vector which is:
if I have one hot encoding vector representing a word with shape (5745,1) and an embbeding matrix with shape (100, 5745), I get an embedding vector of (100,1) and therefore I would have an ouptut of (10,30,100) by doing the above code?
Maybe I'm not thinking correctly and it's affecting my end result
RNN:
class RNN(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, n_layers, dropout):
#calling the init function of the RNN parent
super(RNN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.encoder = nn.LSTM(embed_dim,
hidden_dim,
n_layers,
dropout=dropout,
bidirectional=True
)
#Linear transformation
self.decoder = nn.Linear(hidden_dim*2, output_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, inputs):
#(batch_size, timesteps, embed_dim)
embeddings = self.dropout(self.embedding(inputs))
embeddings = torch.max(embeddings, dim=2)
embeddings = embeddings[0].type(torch.cuda.FloatTensor)
#output of each timestep
output, (hidden, cell) = self.encoder(embeddings)
merge = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
output = self.decoder(merge)
return output
train:
def train(model, text, label, epochs, lr=0.001):
model.train()
opt = torch.optim.Adam(model.parameters(), lr=lr)
#The BCEWithLogitsLoss criterion carries out both the sigmoid and the binary cross entropy steps
criterion = nn.BCEWithLogitsLoss()
counter = 0
for e in range(epochs):
for x, y in batches(text, label):
#In order to avoid gradient accumulation before backpropagation
x = one_hot_encode(x, vocab)
x = torch.from_numpy(x).to(device)
output = model(x)
#print(torch.cuda.memory_summary(device=None, abbreviated=False))
y = torch.from_numpy(y)
y = torch.unsqueeze(y,1).to(device)
#print(output.shape)
#print(y.shape)
loss = criterion(output, y.float())
acc = binaryAccuracy(output,y)
opt.zero_grad()
loss.backward()
opt.step()
counter += 1
print("Epoch {}/{}".format(e+1, epochs),
"Loss: {}".format(loss.item()),
"accuracy: {}".format(acc))
one-hot:
def one_hot_encode(arr, n_labels):
one_hot = np.zeros((np.multiply(*arr.shape), n_labels), dtype=np.int64)
one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1
one_hot = one_hot.reshape((*arr.shape, n_labels))
return one_hot
batches:
def batches(text, label, num_seqs=10):
counter = 0
#create empty arrays with the specified number of columns
x = np.array([], dtype=int).reshape(0,findMaxLen())
y = np.array([], dtype=int).reshape(0,1)
for sent, l in zip(text, label):
# create a np array with zeros with length 30
tmp1 = np.zeros((findMaxLen()), dtype=int)
#tmp1 = np.randint(0, high=vocab,size=findMaxLen(), dtype=int)
#create a 1d array
tmp2 = np.atleast_1d(np.array(l))
for ind, wrd in enumerate(sent):
if wrd in uniqueWrds and ind < 30:
tmp1[ind] = word_to_index[wrd]
# the arrays to the empty arrays
x = np.vstack([x, tmp1])
y = np.vstack([y, tmp2])
counter +=1
if counter == num_seqs:
yield x, np.squeeze(y,1)
counter = 0
x = np.array([], dtype=int).reshape(0,findMaxLen())
y = np.array([], dtype=int).reshape(0,1)
If you want your output to be of dimension (10,30,100), I think you need to input the indexes of the words of the sentences in the form of an array of dimension (10x30). You can probably get the indexes by performing a torch.argmax on the one hot encoded input tensor.
My training set is a set of images (either 3 channel or 1 ofc i use only one type of channel). And the labels are a sequence of points in a specific order that i want to predict from the images.
I am using a model inspired by the image captioning example on the tensorflow website. This is the also the approach that this paper takes https://arxiv.org/pdf/1901.03781.pdf
class CNN_Encoder(tf.keras.Model):
# Since you have already extracted the features and dumped it using pickle
# This encoder passes those features through a Fully connected layer
def __init__(self, embedding_dim):
super(CNN_Encoder, self).__init__()
self.fc = tf.keras.layers.Dense(embedding_dim)
def call(self, x):
x = self.fc(x)
x = tf.nn.relu(x)
return x
class RNN_Decoder(tf.keras.Model):
def __init__(self, embedding_dim, units, output_dim):
super(RNN_Decoder, self).__init__()
self.units = units
self.gru = tf.keras.layers.GRU(self.units,
return_sequences=True,
return_state=True,
recurrent_initializer='glorot_uniform')
self.fc1 = tf.keras.layers.Dense(self.units)
self.fc2 = tf.keras.layers.Dense(output_dim)
def call(self, x, features, hidden):
x = tf.concat((features, x), axis=-1)
output, state = self.gru(x)
x = self.fc1(state)
x = self.fc2(x)
return x
def reset_state(self, batch_size):
return tf.zeros((batch_size, self.units))
#tf.function
def train_step(img_tensor, target):
loss = 0
hidden = decoder.reset_state(batch_size=target.shape[0])
dec_input = tf.expand_dims([[0., 0.]] * target.shape[0], 1)
with tf.GradientTape() as tape:
features = encoder(img_tensor)
for i in (range(1, target.shape[1])):
predictions = decoder(dec_input, features, hidden)
loss += loss_function(target[:, i], predictions)
# using teacher forcing
dec_input = tf.expand_dims(target[:, i], 1)
total_loss = (loss / int(target.shape[1]))
trainable_variables = encoder.trainable_variables + decoder.trainable_variables
gradients = tape.gradient(loss, trainable_variables)
optimizer.apply_gradients(zip(gradients, trainable_variables))
return loss, total_loss
EPOCHS = 20
batch_size = 8
for epoch in tqdm(range(start_epoch, EPOCHS)):
start = time.time()
total_loss = 0
for (batch, (img_tensor, target)) in enumerate((data_generator(preds_t, labels_t))):
img_tensor = img_tensor.reshape((-1, 1, 128*128))
batch_loss, t_loss = train_step(img_tensor, target)
total_loss += t_loss
if batch % 100 == 0:
print ('Epoch {} Batch {} Loss {:.4f}'.format(
epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
if batch == 10000:
break
# storing the epoch end loss value to plot later
#loss_plot.append(total_loss / num_steps)
if epoch % 5 == 0:
ckpt_manager.save()
print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
total_loss/num_steps))
print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
For the features vector. I am extracting the last layer of a unet. So each image has a size 1x128x128. I reshape it to be 1x1x128*128. Which i then pass through a fully connected layer. The shape then becomes 1x1x256
My labels i want to predict are image coordinates so (x, y). The input to the gru layer is the
concatenated 1x1x256 , 1x1x2 (t-1 coordinates). Which i then further pass through a 2 layer fc layer with output dimension 2 for the 2 coordinates. I have removed attention for now to get a simpler model. I normalize my images. I pad the coordinate sequences with 0,0 for the start -1, -1 for the end and -2,-2 for the regular padding to get uniform sequence length of 350x2.
The network doesnt seem to learn much. I just get a few points scattered diagonally across the image. The biggest difference i see with the image captioning model is that the words can be converted to embeddings and then you have a 128 image features 128 word features being concatenated and fed into the lstm. In my case the sequence information is just 1 entry. Could that be the reason that the network is not learning much.
If someone has any insights into what i should change that would be great
Your question requires certain experience and a deep investigation. I'd only suggest general advices for under-fitting issue. Here's a list of things to try.
Personally, I'd start by trying to overfit on a single batch.
I am doing a sequence to label learning model in PyTorch. I have two sentences and I am classifying whether they are entailed or not (SNLI dataset). I concatenate two 50 word sentences together (sometimes padded) into a vector of length 100. I then send in minibatches into word embeddings -> LSTM -> Linear layer. I am doing cross entropy loss but I need a vector of [mini_batch, C] to go into the CrossEntropyLoss function. Instead I still have the 100 words in my vector as [mini_batch, 100, C]
Here is my model:
class myLSTM(nn.Module):
def __init__(self, h_size=128, v_size=10, embed_d=300, mlp_d=256):
super(myLSTM, self).__init__()
self.embedding = nn.Embedding(v_size, embed_d)
self.lstm = nn.LSTM(embed_d, h_size, num_layers=1, bidirectional=True, batch_first=True)
self.mlp = nn.Linear(mlp_d, 1024)
# Set static embedding vectors
self.embedding.weight.requires_grad = False
#self.sm = nn.CrossEntropyLoss()
def display(self):
for param in self.parameters():
print(param.data.size())
def filter_params(self):
# Might not be compatible with python 3
#self.parameters = filter(lambda p: p.requires_grad, self.parameters())
pass
def init_hidden(self):
# Need to init hidden weights in LSTM
pass
def forward(self, sentence):
print(sentence.size())
embeds = self.embedding(sentence)
print(embeds.size())
out, _ = self.lstm(embeds)
print(out.size())
out = self.mlp(out)
return out
My training sequences with output:
batch_size = 3
SGD_optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.01, weight_decay=1e-4)
ADM_optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.01)
criterion = nn.CrossEntropyLoss()
num_epochs = 50
from torch.autograd import Variable
from torch import optim
for epoch in range(num_epochs):
print("Epoch {0}/{1}: {2}%".format(epoch, num_epochs, float(epoch)/num_epochs))
for start, end in tqdm(batch_index_gen(batch_size, len(n_data))):
# Convert minibatch to numpy
s1, s2, y = convert_to_numpy(n_data[start:end])
# Convert numpy to Tensor
res = np.concatenate((s1,s2), axis=1) # Attach two sentences into 1 input vector
input_tensor = torch.from_numpy(res).type(torch.LongTensor)
target_tensor = torch.from_numpy(y).type(torch.FloatTensor)
data, target = Variable(input_tensor), Variable(target_tensor)
# Zero gradients
SGD_optimizer.zero_grad()
# Forward Pass
output = model.forward(data)
print("Output size: ")
print(output.size())
print("Target size: ")
print(target.size())
# Calculate loss with respect to training labels
loss = criterion(output, target)
# Backprogogate and update optimizer
loss.backward()
SGD_optimizer.step()
#ADAM_optimizer.step()
output:
Epoch 0/50: 0.0%
torch.Size([3, 100])
torch.Size([3, 100, 300])
torch.Size([3, 100, 256])
Output size:
torch.Size([3, 100, 1024])
Target size:
torch.Size([3])
error:
ValueError: Expected 2 or 4 dimensions (got 3)
EDITED -------------------------------------------------------------------
I have now got my model training but I am getting low accuracy. Is there an issue with my LSTM outputs being concatenated and then condensed to a smaller tensor to go through my linear layer?
New Model:
class myLSTM(nn.Module):
def __init__(self, h_size=128, v_size=10, embed_d=300, mlp_d=256, num_classes=3, lstm_layers=1):
super(myLSTM, self).__init__()
self.num_layers = lstm_layers
self.hidden_size = h_size
self.embedding = nn.Embedding(v_size, embed_d)
self.lstm = nn.LSTM(embed_d, h_size, num_layers=lstm_layers, bidirectional=True, batch_first=True)
self.mlp = nn.Linear(2 * h_size * 2, num_classes)
# Set static embedding vectors
self.embedding.weight.requires_grad = False
def forward(self, s1, s2):
# Set initial states
#h0 = Variable(torch.zeros(self.num_layers*2, s1.size(0), self.hidden_size)).cuda() # 2 for bidirection
#c0 = Variable(torch.zeros(self.num_layers*2, s1.size(0), self.hidden_size)).cuda()
batch_size = s1.size()[0]
embeds_1 = self.embedding(s1)
embeds_2 = self.embedding(s2)
_, (h_1_last, _) = self.lstm(embeds_1)#, (h0, c0)) #note the change here. Last hidden state is taken
_, (h_2_last, _) = self.lstm(embeds_2)#, (h0, c0))
concat = torch.cat( (h_1_last, h_2_last), dim=2) #double check the dimension
concat = concat.view(batch_size, -1)
scores = self.mlp(concat)
return scores
New Training
batch_size = 64
SGD_optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
num_epochs = 10
model.train()
if cuda:
model = model.cuda()
criterion = criterion.cuda()
from torch.autograd import Variable
from torch import optim
epoch_losses = []
for epoch in range(num_epochs):
print("Epoch {0}/{1}: {2}%".format(epoch, num_epochs, 100*float(epoch)/num_epochs))
# Batch loss aggregator
losses = []
for start, end in tqdm(batch_index_gen(batch_size, len(n_data))):
# Convert minibatch to numpy
s1, s2, y = convert_to_numpy(n_data[start:end])
# Convert numpy to Tensor
s1_tensor = torch.from_numpy(s1).type(torch.LongTensor)
s2_tensor = torch.from_numpy(s2).type(torch.LongTensor)
target_tensor = torch.from_numpy(y).type(torch.LongTensor)
s1 = Variable(s1_tensor)
s2 = Variable(s2_tensor)
target = Variable(target_tensor)
if cuda:
s1 = s1.cuda()
s2 = s2.cuda()
target = target.cuda()
# Zero gradients
SGD_optimizer.zero_grad()
# Forward Pass
output = model.forward(s1,s2)
# Calculate loss with respect to training labels
loss = criterion(output, target)
losses.append(loss.data[0])
# Backprogogate and update optimizer
loss.backward()
SGD_optimizer.step()
# concat losses to epoch losses
epoch_losses += losses
training with tensor sizes printed:
Epoch 0/10: 0.0%
Batch size: 64
Sentences
torch.Size([64, 50])
torch.Size([64, 50])
torch.Size([64, 50, 300])
torch.Size([64, 50, 300])
Hidden states
torch.Size([2, 64, 128])
torch.Size([2, 64, 128])
Concatenated hidden states
torch.Size([2, 64, 256])
Reshaped tensors for linear layer
torch.Size([64, 512])
Linear propogation
torch.Size([64, 3])
Evaluation
def eval_model(model, mode='dev'):
file_name = 'snli_1.0/snli_1.0_dev.jsonl' if mode == 'dev' else 'snli_1.0/snli_1.0_test.jsonl'
dev_data, _ = obtain_data(file_name)
dev_n_data = vocab.process_data(dev_data)
print("Length of data: {}".format(len(dev_n_data)))
eval_batch_size = 1024
model.eval()
total = len(dev_n_data)
hit = 0
correct = 0
# Batch dev eval
for start, end in batch_index_gen(eval_batch_size, len(dev_n_data)):
s1, s2, y = convert_to_numpy(dev_n_data[start:end])
s1_tensor = torch.from_numpy(s1).type(torch.LongTensor)
s2_tensor = torch.from_numpy(s2).type(torch.LongTensor)
target_tensor = torch.from_numpy(y).type(torch.LongTensor)
s1 = Variable(s1_tensor, volatile=True)
s2 = Variable(s2_tensor, volatile=True)
target = Variable(target_tensor, volatile=True)
if cuda:
s1 = s1.cuda()
s2 = s2.cuda()
target = target.cuda()
output = model.forward(s1,s2)
loss = criterion(output, target)
#print("output size: {}".format(output.size()))
#print("target size: {}".format(target.size()))
pred = output.data.max(1)[1] # get the index of the max log-probability
#print(pred[:5])
#print(output[:])
correct += pred.eq(target.data).cpu().sum()
return correct / float(total)
eval_model(model)
I think there is an issue in a way you are trying to solve an entailment problem.
Maybe you can do it this way:
design your module to accept two sentences as input
embed both of them with your embeddings
encode them using the LSTM module.
now you have two fixed length vector representations of two sentences. Simpliest thing to do is to just concatenate them
together.
Add liner layer on top to evaluate scores for each entailment class (3 I suppose)
apply softmax to get a proper probability distribution
So your model can look like this (double check the dimensions):
class myLSTM(nn.Module):
def __init__(self, h_size=128, v_size=10, embed_d=300, num_classes = 3):
super(myLSTM, self).__init__()
self.embedding = nn.Embedding(v_size, embed_d)
self.lstm = nn.LSTM(embed_d, h_size, num_layers=1, bidirectional=True, batch_first=True)
self.mlp = nn.Linear(2*h_size*2, num_classes) #<- change here
def forward(self, sentence1, sentence2):
embeds_1 = self.embedding(sentence1)
embeds_2 = self.embedding(sentence2)
_, (h_1_last, _) = self.lstm(embeds_1) #note the change here. Last hidden state is taken
_, (h_2_last, _) = self.lstm(embeds_2)
concat = torch.concat([h_1_last, h_2_last], dim=1) #double check the dimension
scores = self.mlp(concat)
probas = F.softmax(scores) #from torch.functional ...
Then you can play around with adding more hidden layers or thinking how combining two sentences can be done in more intelligent way (attention, etc).
Double check what CrossEntropyLoss accepts as input and target and adjust (is it unnormalized class scores or probability distribution). Check http://pytorch.org/docs/master/nn.html#lstm for LSTM module documentation to clarify what LSTM returns (do you need hidden states for every word or just the representation after the last one).
I'm trying a basic averaging example, but the validation and loss don't match and the network fails to converge if I increase the training time. I'm training a network with 2 hidden layers, each 500 units wide on three integers from the range [0,9] with a learning rate of 1e-1, Adam, batch size of 1, and dropout for 3000 iterations and validate every 100 iterations. If the absolute difference between the label and the hypothesis is less than a threshold, here I set the threshold to 1, I consider that correct. Could someone let me know if this is an issue with the choice of loss function, something wrong with Pytorch, or something I'm doing. Below are some plots:
val_diff = 1
acc_diff = torch.FloatTensor([val_diff]).expand(self.batch_size)
Loop 100 times to during validation:
num_correct += torch.sum(torch.abs(val_h - val_y) < acc_diff)
Append after each validation phase:
validate.append(num_correct / total_val)
Here are some examples of the (hypothesis, and labels):
[...(-0.7043088674545288, 6.0), (-0.15691305696964264, 2.6666667461395264),
(0.2827358841896057, 3.3333332538604736)]
I tried six of the loss functions in the API that are typically used for regression:
torch.nn.L1Loss(size_average=False)
torch.nn.L1Loss()
torch.nn.MSELoss(size_average=False)
torch.nn.MSELoss()
torch.nn.SmoothL1Loss(size_average=False)
torch.nn.SmoothL1Loss()
Thanks.
Network code:
class Feedforward(nn.Module):
def __init__(self, topology):
super(Feedforward, self).__init__()
self.input_dim = topology['features']
self.num_hidden = topology['hidden_layers']
self.hidden_dim = topology['hidden_dim']
self.output_dim = topology['output_dim']
self.input_layer = nn.Linear(self.input_dim, self.hidden_dim)
self.hidden_layer = nn.Linear(self.hidden_dim, self.hidden_dim)
self.output_layer = nn.Linear(self.hidden_dim, self.output_dim)
self.dropout_layer = nn.Dropout(p=0.2)
def forward(self, x):
batch_size = x.size()[0]
feat_size = x.size()[1]
input_size = batch_size * feat_size
self.input_layer = nn.Linear(input_size, self.hidden_dim).cuda()
hidden = self.input_layer(x.view(1, input_size)).clamp(min=0)
for _ in range(self.num_hidden):
hidden = self.dropout_layer(F.relu(self.hidden_layer(hidden)))
output_size = batch_size * self.output_dim
self.output_layer = nn.Linear(self.hidden_dim, output_size).cuda()
return self.output_layer(hidden).view(output_size)
Training code:
def train(self):
if self.cuda:
self.network.cuda()
dh = DataHandler(self.data)
# loss_fn = nn.L1Loss(size_average=False)
# loss_fn = nn.L1Loss()
# loss_fn = nn.SmoothL1Loss(size_average=False)
# loss_fn = nn.SmoothL1Loss()
# loss_fn = nn.MSELoss(size_average=False)
loss_fn = torch.nn.MSELoss()
losses = []
validate = []
hypos = []
labels = []
val_size = 100
val_diff = 1
total_val = float(val_size * self.batch_size)
for i in range(self.iterations):
x, y = dh.get_batch(self.batch_size)
x = self.tensor_to_Variable(x)
y = self.tensor_to_Variable(y)
self.optimizer.zero_grad()
loss = loss_fn(self.network(x), y)
loss.backward()
self.optimizer.step()
It looks like you've misunderstood how layers in pytorch works, here are a few tips:
In your forward when you do nn.Linear(...) you are definining new layers instead of using those you pre-defined in your network __init__. Therefore, it cannot learn anything as weights are constantly reinitalized.
You shouldn't need to call .cuda() inside net.forward(...) since you've already copied the network on gpu in your train by calling self.network.cuda()
Ideally the net.forward(...) input should directly have the shape of the first layer so you won't have to modify it. Here you should have x.size() <=> Linear -- > (Batch_size, Features).
Your forward should look close to this:
def forward(self, x):
x = F.relu(self.input_layer(x))
x = F.dropout(F.relu(self.hidden_layer(x)),training=self.training)
x = self.output_layer(x)
return x
I have implemented a bi-directional RNN in TensorFlow using a BasicLSTMCell and rnn.bidirectional_rnn. I am calculating the loss using seq2seq.sequence_loss_by_example after concatenating the outputs I receive. My application is a next character predictor.
I getting an extremely low cost, (~50 times lesser than the unidirectional RNN). I suspect I've made a mistake in the seq2seq.sequence_loss_by_example step.
Here is my model -
# Model begins
cell_fn = rnn_cell.BasicLSTMCell
cell = fw_cell = cell_fn(args.rnn_size, state_is_tuple=True)
cell2 = bw_cell = cell_fn(args.rnn_size, state_is_tuple=True)
input_data = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
targets = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
initial_state = fw_cell.zero_state(args.batch_size, tf.float32)
initial_state2 = bw_cell.zero_state(args.batch_size, tf.float32)
with tf.variable_scope('rnnlm'):
softmax_w = tf.get_variable("softmax_w", [2*args.rnn_size, args.vocab_size])
softmax_b = tf.get_variable("softmax_b", [args.vocab_size])
with tf.device("/cpu:0"):
embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
input_embeddings = tf.nn.embedding_lookup(embedding, input_data)
inputs = tf.unpack(input_embeddings, axis=1)
outputs, last_state, last_state2 = rnn.bidirectional_rnn(fw_cell,
bw_cell,
inputs,
initial_state_fw=initial_state,
initial_state_bw=initial_state2,
dtype=tf.float32)
output = tf.reshape(tf.concat(1, outputs), [-1, 2*args.rnn_size])
logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
loss = seq2seq.sequence_loss_by_example([logits],
[tf.reshape(targets, [-1])],
[tf.ones([args.batch_size * args.seq_length])],
args.vocab_size)
cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
args.grad_clip)
optimizer = tf.train.AdamOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
I think there is no any mistake in your code.
The problem is the objective function with the Bi-RNN model in your application (next character predictor).
The unidirectional RNN (such as ptb_word_lm or char-rnn-tensorflow), it is really a model used for the prediction, for example, if raw_text is 1,3,5,2,4,8,9,0, then, your inputs and target will be:
inputs: 1,3,5,2,4,8,9
target: 3,5,2,4,8,9,0
and the prediction is (1)->3, (1,3)->5, ..., (1,3,5,2,4,8,9)->0
But in Bi-RNN, the first prediction is really not just (1)->3, because the output[0] in your code contians the reverse information of the raw_text by use bw_cell (also not (1,3)->5, ..., (1,3,5,2,4,8,9)->0). A similar example is: I tell you that flower is a rose, and than I let you to predict what the flower is? I think you can give me the right answer very easy, and this is also the reason why you getting an extremely low loss in your Bi-RNN model for the application.
In fact, I think Bi-RNN (or Bi-LSTM) is not an appropriate model for the application of next character predictor. Bi-RNN need the full sequence when it works, you will find you can't use this model easily when you want to predict the next character.