I want to build a multi task learning model on two related datasets with different inputs and targets. The two tasks are sharing lower-level layers but with different header layers, a minimal example:
class MultiMLP(nn.Module):
"""
A simple dense network for MTL on hard parameter sharing.
"""
def __init__(self):
super().__init__()
self.hidden = nn.Linear(100, 200)
self.out_task0= nn.Linear(200, 1)
self.out_task0= nn.Linear(200, 1)
def forward(self, x):
x = self.hidden(x)
x = F.relu(x)
y_task0 = self.out_task0(x)
y_task1 = self.out_task1(x)
return [y_task0, y_task1]
The dataloader is constructed so that the batches are alternatively generated from two datasets, i.e. batch 0, 2, 4, ... from task 0, batch 1, 3, 5, ... from task 1. I wanted to train the network in this way: only update weights for hidden layer and out_task0 for batches from task 0, and update only hidden and out_task1 for task 1.
I then alternatively switch requires_grad for the corresponding tasks during training as following. But I observed that all weights are updated for every iteration.
...
criterion = MSELoss()
for i, data in enumerate(combined_loader):
x, y = data[0], data[1]
optimizer.zero_grad()
# controller is 0 for task0, 1 for task1
# altenate the header layer
controller = i % 2
task0_mode = True if controller == 0 else False
for name, param in model.named_parameters():
if name in ['out_task0.weight', 'out_task0.bias']:
param.requires_grad = task0_mode
elif name in ['out_task1.weight', 'out_task1.bias']:
param.requires_grad = not task0_mode
outputs = model(x)[controller]
loss = criterion(outputs, y)
loss.backward()
optimizer.step()
# Monitor the parameter updates
for name, p in model.named_parameters():
if name in ['out_task0.weight', 'out_task1.weight']:
print(f"Controller: {controller}")
print(name, p)
Did I miss anything in the training procedure? Or the overall setup will not work?
Disclaimer: the question has been answered from PyTorch Forum, I put things together here in case someone runs into the same problem, the credit goes to ptrblk
The problem could arise from any variants of stochastic gradient descent(sgd) which utilizes gradients from previous steps, for instance, stochastic gradient descent with momentum(sgd-m), Nesterov accelerated gradient (NAG), Adagrad, RMSprop, Adam and so on. Zero-ing gradient at step t would not affect the terms relying on historical gradients. Thus the weights are still updated with the setting in the posted question.
One can see that from the following code example.
model = nn.Linear(1, 1, bias=False)
#optimizer = torch.optim.SGD(model.parameters(), lr=1., momentum=0.) # same results for w1 and w2
optimizer = torch.optim.SGD(model.parameters(), lr=1., momentum=0.5) # w2 gets updated
#optimizer = torch.optim.Adam(model.parameters(), lr=1.) # w2 gets updated
w0 = model.weight.clone()
out = model(torch.randn(1, 1))
out.mean().backward()
optimizer.step()
w1 = model.weight.clone()
optimizer.zero_grad()
print(model.weight.grad)
optimizer.step()
w2 = model.weight.clone()
print(w1 - w0)
print(w2 - w1)
With the native SGD optimizer, w2 and w1 are the same. But it is not the case for SGD-M and Adam.
Related
I have defined the model as in the code below, and I used batch normalization merging to make 3 layers into 1 linear layer.
The first layer of the model is a linear layer and there is no bias.
The second layer of the model is a batch normalization and there is no weight and bias ( affine is false )
The third layer of the model is a linear layer.
The variables named new_weight and new_bias are the weight and bias of the newly created linear layer, respectively.
My question is: Why is the output of the following two print functions different? And where is the wrong part in the code below the batch merge comment?
import torch
import torch.nn as nn
import torch.optim as optim
learning_rate = 0.01
in_nodes = 20
internal_nodes = 8
out_nodes = 9
batch_size = 100
# model define
class M(nn.Module):
def __init__(self):
super(M, self).__init__()
self.layer1 = nn.Linear(in_nodes, internal_nodes, bias=False)
self.layer2 = nn.BatchNorm1d(internal_nodes, affine=False)
self.layer3 = nn.Linear(internal_nodes, out_nodes)
def forward(self, x):
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
return x
# optimizer and criterion
model = M()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
# training
for batch_num in range(1000):
model.train()
optimizer.zero_grad()
input = torch.randn(batch_size, in_nodes)
target = torch.ones(batch_size, out_nodes)
output = model(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()
# batch merge
divider = torch.sqrt(model.layer2.eps + model.layer2.running_var)
w_bn = torch.diag(torch.ones(internal_nodes) / divider)
new_weight = torch.mm(w_bn, model.layer1.weight)
new_weight = torch.mm(model.layer3.weight, new_weight)
b_bn = - model.layer2.running_mean / divider
new_bias = model.layer3.bias + torch.squeeze(torch.mm(model.layer3.weight, b_bn.reshape(-1, 1)))
input = torch.randn(batch_size, in_nodes)
print(model(input))
print(torch.t(torch.mm(new_weight, torch.t(input))) + new_bias)
Short Answer: As far as I can tell you need a model.eval() before the line
input = torch.randn(batch_size, in_nodes)
such that the end looks like this
...
model.eval()
input = torch.randn(batch_size, in_nodes)
test_input = torch.ones(batch_size,internal_nodes)/100
print(model(input))
print(torch.t(torch.mm(new_weight, torch.t(input))) + new_bias)
with that (I tested it) the two print-statements should output the same. It fixed the weights.
Long Answer:
When using Batch-Normalization according to PyTorch documentation a default momentum of 0.1 is used to compute the running_mean and running_var. The momentum defines how much the estimated statistics and how much the new observed value influence the value.
Now when you don't set a model.eval() statement the batch_normalization computes an updated running_mean and running_var due to the momentum in line
print(model(input))
For further details and or confirmation: Related Question, PyTorch-Documentation
I did a quick experiment to see if I could understand what the hidden state in an LSTM does...
I tried to make an LSTM predict a sequence of [1,0,1,0,1...] based off an input sequence of X with X[0] = 1 and the remainder as random noise.
X = [1, randFloat, randFloat, randFloat...]
label = [1, 0, 1, 0...]
In my head, the model would understand:
The inputs X mean nothing, or at least very little (as it's noise) - so it'd discard these values for the most part
Solely the hidden state from the previous sequence/timestep n would be used to predict the next timestep n+1... [1, 0, 1, 0...]
I also set X[0] = 1 so the first initial in an attempt to guide the net to predicting 1 on the first item (which it does)
So, this didn't work. In theory, should it not? Can you someone explain?
It essentially never converges, and is on the cusp of guessing between 0 or 1
## Code
import os
import numpy as np
import torch
from torchvision import transforms
from torch import nn
from sklearn import preprocessing
from util import create_sequences
import torch.optim as optim
Create some fake data
sequence_1 = torch.tensor(np.random.uniform(size=50)).float().detach()
sequence_1[0] = 1
sequence_2 = torch.tensor(np.random.uniform(size=50)).float().detach()
sequence_2[0] = 1
labels_1 = np.zeros(50)
labels_1[::2] = 1
labels_1 = torch.tensor(labels_1, dtype=torch.long)
labels_2 = labels_1.clone()
training_data = [sequence_1, sequence_2]
label_data = [labels_1, labels_2]
Create simple LSTM Model
class LSTM(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(LSTM, self).__init__()
self.lstm = nn.LSTM(input_dim, hidden_dim)
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, seq):
lstm_out, _ = self.lstm(seq.view(len(seq), 1, -1))
out = self.fc(lstm_out.view(len(seq), -1))
out = F.log_softmax(out, dim=1)
return out
We try to overfit on the dataset
INPUT_DIM = 1
HIDDEN_DIM = 6
model = LSTM(INPUT_DIM, HIDDEN_DIM, 2)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in range(500):
for i, seq in enumerate(training_data):
labels = label_data[i]
model.zero_grad()
scores = model(seq)
loss = loss_function(scores, labels)
loss.backward()
print(loss)
optimizer.step()
with torch.no_grad():
seq_d = training_data[0]
tag_scores = model(seq_d)
for score in tag_scores:
print(np.argmax(score))
I would say it's not meant to work.
The model would always try to make sense and find patterns in the data it's trained on i.e sequence_1 and to "verify" that it has "found" them, it uses labels_1. Since the data is random the model fails to find the pattern.
The pattern the model tries to find is not in the label but in the data, so it doesn't matter how the label is arranged. The label actually never passes through the model, so NO.
If perhaps, you trained it on a single example then definitely. The model will become overfit and give you your ones and zeros and fail miserably on other examples, otherwise it just won't be able to make sense of the random data no matter the size.
Hidden State
Solely the hidden state from the previous sequence/timestep n would be used to predict the next timestep n+1... [1, 0, 1, 0...]
Concerning Hidden state, NOTE that it is not a trainable parameter, it is the result of performing some operations on the data and parameters, meaning that the input data determines the Hidden state.
What the Hidden state does is to hold the information the model has extracted from the previous timesteps and passes it to the next timestep or as output. In the case of LSTM, it does some forgetting and updating before passing it.
I am trying to create a simple 3D U-net for image segmentation, just to learn how to use the layers. Therefore I do a 3D convolution with stride 2 and then a transpose deconvolution to get back the same image size. I am also overfitting to a small set (test set) just to see if my network is learning.
I created the same net in Keras and it works just fine. Now I want to create in tensorflow but I been having trouble with it.
The cost changes slightly but no matter what I do (reduce learning rate, add more epochs, add more layers, change batch size...) the output is always the same. I believe the net is not updating the weights. I am sure I am doing something wrong but I can find what it is. Any help would be greatly appreciate it.
Here is my code:
def forward_propagation(X):
if ( mode == 'train'): print(" --------- Net --------- ")
# Convolutional Layer 1
with tf.variable_scope('CONV1'):
Z1 = tf.layers.conv3d(X, filters = 16, kernel =[3,3,3], strides = [ 2, 2, 2], padding='SAME', name = 'S2/conv3d')
A1 = tf.nn.relu(Z1, name = 'S2/ReLU')
if ( mode == 'train'): print("Convolutional Layer 1 S2 " + str(A1.get_shape()))
# DEConvolutional Layer 1
with tf.variable_scope('DeCONV1'):
output_deconv1 = tf.stack([X.get_shape()[0] , X.get_shape()[1], X.get_shape()[2], X.get_shape()[3], 1])
dZ1 = tf.nn.conv3d_transpose(A1, filters = 1, kernel =[3,3,3], strides = [2, 2, 2], padding='SAME', name = 'S2/conv3d_transpose')
dA1 = tf.nn.relu(dZ1, name = 'S2/ReLU')
if ( mode == 'train'): print("Deconvolutional Layer 1 S1 " + str(dA1.get_shape()))
return dA1
def compute_cost(output, target, method = 'dice_hard_coe'):
with tf.variable_scope('COST'):
if (method == 'sigmoid_cross_entropy') :
# Make them vectors
output = tf.reshape( output, [-1, output.get_shape().as_list()[0]] )
target = tf.reshape( target, [-1, target.get_shape().as_list()[0]] )
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = output, labels = target)
cost = tf.reduce_mean(loss)
return cost
and the main function for the model:
def model(X_h5, Y_h5, learning_rate = 0.009,
num_epochs = 100, minibatch_size = 64, print_cost = True):
ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables
#tf.set_random_seed(1) # to keep results consistent (tensorflow seed)
#seed = 3 # to keep results consistent (numpy seed)
(m, n_D, n_H, n_W, num_channels) = X_h5["test_data"].shape #TTT
num_labels = Y_h5["test_mask"].shape[4] #TTT
img_size = Y_h5["test_mask"].shape[1] #TTT
costs = [] # To keep track of the cost
accuracies = [] # To keep track of the accuracy
# Create Placeholders of the correct shape
X, Y = create_placeholders(n_H, n_W, n_D, minibatch_size)
# Forward propagation: Build the forward propagation in the tensorflow graph
nn_output = forward_propagation(X)
prediction = tf.nn.sigmoid(nn_output)
# Cost function: Add cost function to tensorflow graph
cost_method = 'sigmoid_cross_entropy'
cost = compute_cost(nn_output, Y, cost_method)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
print('------ Training ------')
# Run the initialization
tf.local_variables_initializer().run(session=sess)
sess.run(init)
# Do the training loop
for i in range(num_epochs*m):
# ----- TRAIN -------
current_epoch = i//m
patient_start = i-(current_epoch * m)
patient_end = patient_start + minibatch_size
current_X_train = np.zeros((minibatch_size, n_D, n_H, n_W,num_channels))
current_X_train[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:]) #TTT
current_X_train = np.nan_to_num(current_X_train) # make nan zero
current_Y_train = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_train[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:]) #TTT
current_Y_train = np.nan_to_num(current_Y_train) # make nan zero
feed_dict = {X: current_X_train, Y: current_Y_train}
_ , temp_cost = sess.run([optimizer, cost], feed_dict=feed_dict)
# ----- TEST -------
# Print the cost every 1/5 epoch
if ((i % (num_epochs*m/5) )== 0):
# Calculate the predictions
test_predictions = np.zeros(Y_h5["test_mask"].shape)
for j in range(0, X_h5["test_data"].shape[0], minibatch_size):
patient_start = j
patient_end = patient_start + minibatch_size
current_X_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_channels))
current_X_test[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:])
current_X_test = np.nan_to_num(current_X_test) # make nan zero
current_Y_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_test[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:])
current_Y_test = np.nan_to_num(current_Y_test) # make nan zero
feed_dict = {X: current_X_test, Y: current_Y_test}
_, current_prediction = sess.run([cost, prediction], feed_dict=feed_dict)
test_predictions[j:j + minibatch_size,:,:,:,:] = current_prediction
costs.append(temp_cost)
print ("[" + str(current_epoch) + "|" + str(num_epochs) + "] " + "Cost : " + str(costs[-1]))
display_progress(X_h5["test_data"], Y_h5["test_mask"], test_predictions, 5, n_H, n_W)
# plot the cost
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('epochs')
plt.show()
return
I call the model with:
model(hdf5_data_file, hdf5_mask_file, num_epochs = 500, minibatch_size = 1, learning_rate = 1e-3)
These are the results that I am currently getting:
Edit:
I have tried reducing the learning rate and it doesn't help. I also tried using tensorboard debug and the weights are not being updated:
I am not sure why this is happening.
I Created the same simple model in keras and it works fine. I am not sure what I am doing wrong in tensorflow.
Not sure if you are still looking for help, as I am answering this question half a year later your posted date. :) I've listed my observations and also some suggestions for you to try below. It my primary observation is right... then you probably just need a coffee break / a night of good sleep.
primary observation:
tf.reshape( output, [-1, output.get_shape().as_list()[0]] ) seems wrong. If you prefer to flatten the vector, it should be something like tf.reshape(output,[-1,np.prod(image_shape_list)]).
other observations:
With such a shallow network, I doubt the network have enough spatial resolution to differentiate tumor voxels from non-tumor voxels. Can you show the keras implementation and the performance compared to a pure tf implementation? I would probably go with 2+ layers, let's .
say with 3 layers, with a stride of 2 per layer, and an input image width of 256, you will end with a width of 32 at your deepest encoder layer. (If you have a limited GPU memory, downsample the input image.)
if changing the loss computation does not work, as #bremen_matt mentioned, reduce LR to say maybe 1e-5.
after the basic architecture tweaks and you "feel" that the network is sort of learning and not stuck, try augmenting the training data, add dropout, batch norm during training, and then maybe fancy up your loss by adding a discriminator.
This is a simple example of using LSTM cell from tensor flow. I am generating a sin wave and training my network for ten periods and I'm trying to predict the eleventh period. The predictor values X are one epoch lag of the true y. After training, I save the session to the disk and I restore it at prediction time - this is typical of training and deploying models to production.
When I predict the last period, y_predicted is matching very well the true y.
If I try to predict the sin wave using an arbitrary starting point, (i.e. uncomment line 114)
test_data = test_data[16:]
such that the true values of y would be shifted by a quarter period, it seems like the LSTM prediction still starts at zero and it takes a couple of epochs to catch up with the true values, eventually matching the previous prediction. As a matter of fact it seems that the prediction in the second case is still a full sin wave instead of the 3/4 wave.
What is the reason why this is happening. If I implement a regressor I would like to use it starting with any point.
https://github.com/fbora/mytensorflow/issues/1
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.contrib.rnn as rnn
def sin_signal():
'''
generate a sin function
the train set is ten periods in length
the test set is one additional period
the return variable is in pandas format for easy plotting
'''
phase = np.arange(0, 2*np.pi*11, 0.1)
y = np.sin(phase)
data = pd.DataFrame.from_dict({'phase': phase, 'y':y})
# fill the last element by 0 - it's the end of the period anyways
data['X'] = data.y.shift(-1).fillna(0.0)
train_data = data[data.phase<=2*np.pi*10].copy()
test_data = data[data.phase>2*np.pi*10].copy()
return train_data, test_data
class lstm_model():
def __init__(self, size_x, size_y, num_units=32, num_layers=3, keep_prob=0.5):
# def single_unit():
# return rnn.DropoutWrapper(
# rnn.LSTMCell(num_units), output_keep_prob=keep_prob)
def single_unit():
return rnn.LSTMCell(num_units)
self.graph = tf.Graph()
with self.graph.as_default():
'''input place holders'''
self.X = tf.placeholder(tf.float32, [None, size_x], name='X')
self.y = tf.placeholder(tf.float32, [None, size_y], name='y')
'''network'''
cell = rnn.MultiRNNCell([single_unit() for _ in range(num_layers)])
X = tf.expand_dims(self.X, -1)
val, state = tf.nn.dynamic_rnn(cell, X, time_major=True, dtype=tf.float32)
val = tf.transpose(val, [1, 0, 2])
last = tf.gather(val, int(val.get_shape()[0])-1)
weights = tf.Variable(tf.truncated_normal([num_units, size_y], 0.0, 1.0), name='weights')
bias = tf.Variable(tf.zeros(size_y), name='bias')
predicted_y = tf.nn.xw_plus_b(last, weights, bias, name='predicted_y')
'''optimizer'''
optimizer = tf.train.AdamOptimizer(name='adam_optimizer')
global_step = tf.Variable(0, trainable=False, name='global_step')
self.loss = tf.reduce_mean(tf.squared_difference(predicted_y, self.y), name='mse_loss')
self.train_op = optimizer.minimize(self.loss, global_step=global_step, name='training_op')
'''initializer'''
self.init_op = tf.global_variables_initializer()
class lstm_regressor():
def __init__(self):
if not os.path.isdir('./check_pts'):
os.mkdir('./check_pts')
#staticmethod
def get_shape(dataframe):
df_shape = dataframe.shape
num_rows = df_shape[0]
num_cols = 1 if len(df_shape)<2 else df_shape[1]
return num_rows, num_cols
def train(self, X_train, y_train, iterations):
train_pts, size_x = lstm_regressor.get_shape(X_train)
train_pts, size_y = lstm_regressor.get_shape(y_train)
model = lstm_model(size_x=size_x, size_y=size_y, num_units=32, num_layers=1)
with tf.Session(graph=model.graph) as sess:
sess.run(model.init_op)
saver = tf.train.Saver()
feed_dict={
model.X: X_train.values.reshape(-1, size_x),
model.y: y_train.values.reshape(-1, size_y)
}
for step in range(iterations):
_, loss = sess.run([model.train_op, model.loss], feed_dict=feed_dict)
if step%100==0:
print('step={}, loss={}'.format(step, loss))
saver.save(sess, './check_pts/lstm')
def predict(self, X_test):
test_pts, size_x = lstm_regressor.get_shape(X_test)
X_np = X_test.values.reshape(-1, size_x)
graph = tf.Graph()
with graph.as_default():
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
saver = tf.train.import_meta_graph('./check_pts/lstm.meta')
saver.restore(sess, './check_pts/lstm')
X = graph.get_tensor_by_name('X:0')
y_tf = graph.get_tensor_by_name('predicted_y:0')
y_np = sess.run(y_tf, feed_dict={X: X_np})
return y_np.reshape(test_pts)
def main():
train_data, test_data = sin_signal()
regressor = lstm_regressor()
regressor.train(train_data.X, train_data.y, iterations=1000)
# test_data = test_data[16:]
y_predicted = regressor.predict(test_data.X)
test_data['y_predicted'] = y_predicted
test_data[['y', 'y_predicted']].plot()
if __name__ == '__main__':
main()
I suspect that since you are starting your predictions at an arbitrary starting point in the future, there is a gap of values between what your model was trained on and what it is starting to see for predictions, and the State of your LSTM has not updated with the values in that gap?
*** UPDATE:
In your code, you have this:
val, state = tf.nn.dynamic_rnn(cell, X, time_major=True, dtype=tf.float32)
and then during training this:
_, loss = sess.run([model.train_op, model.loss], feed_dict=feed_dict)
I would suggest feeding the initial State into dynamic_rnn and re-feeding the updated state at each training iteration, something like this:
inState = tf.placeholder(tf.float32, [YOUR_DIMENSIONS], name='inState')
val, state = tf.nn.dynamic_rnn(cell, X, time_major=True, dtype=tf.float32, initial_state=inState)
And during training:
iState = np.zeros([YOUR_DIMENSIONS])
feed_dict={
model.X: X_train.values.reshape(-1, size_x),
model.y: y_train.values.reshape(-1, size_y),
inState: iState # feed initial value for state placeholder
}
_, loss, oState = sess.run([model.train_op, model.loss, model.state], feed_dict=feed_dict) # run one additional variable from the session
iState = oState # assign latest out-state to be re-fed as in-state
So, this way your model not only learns the parameters during training, but also keeps track of everything that it's seen during training in the State. NOW, you save this State with the rest of your session and use it during the prediction stage.
The small difficulty with this is that technically this State is a placeholder, so it won't be saved in the Graph automatically in my experience. So you create another variable manually at the end of training and assign the State to it; this way it is saved in the graph for later:
# make sure this variable is declared BEFORE the saver is declared
savedState = tf.get_variable('savedState', shape=[YOUR_DIMENSIONS])
# then, at the end of training:
assignOp = tf.assign(savedState, oState)
sess.run(assignOp)
# now save your graph
So now once you restore the Graph, if you want to start your predictions after some artificial gap, then somehow you still have to run your model through this gap so as to update the state. In my case, I just run one dummy prediction for the whole gap, just so as to update the state, and then you continue at your normal intervals from here.
Hope this helps...
I am trying to detect micro-events in a long time series. For this purpose, I will train a LSTM network.
Data. Input for each time sample is 11 different features somewhat normalized to fit 0-1. Output will be either one of two classes.
Batching. Due to huge class imbalance I have extracted the data in batches of each 60 time samples, of which at least 5 will always be class 1, and the rest class to. In this way the class imbalance is reduced from 150:1 to around 12:1 I have then randomized the order of all my batches.
Model. I am attempting to train an LSTM, with initial configuration of 3 different cells with 5 delay steps. I expect the micro events to arrive in sequences of at least 3 time steps.
Problem: When I try to train the network it will quickly converge towards saying that EVERYTHING belongs to the majority class. When I implement a weighted loss function, at some certain threshold it will change to saying that EVERYTHING belongs to the minority class. I suspect (without being expert) that there is no learning in my LSTM cells, or that my configuration is off?
Below is the code for my implementation. I am hoping that someone can tell me
Is my implementation correct?
What other reasons could there be for such behaviour?
ar_model.py
import numpy as np
import tensorflow as tf
from tensorflow.models.rnn import rnn
import ar_config
config = ar_config.get_config()
class ARModel(object):
def __init__(self, is_training=False, config=None):
# Config
if config is None:
config = ar_config.get_config()
# Placeholders
self._features = tf.placeholder(tf.float32, [None, config.num_features], name='ModelInput')
self._targets = tf.placeholder(tf.float32, [None, config.num_classes], name='ModelOutput')
# Hidden layer
with tf.variable_scope('lstm') as scope:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_delays)
self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
outputs, state = rnn.rnn(cell, [self._features], dtype=tf.float32)
# Output layer
output = outputs[-1]
softmax_w = tf.get_variable('softmax_w', [config.num_hidden, config.num_classes], tf.float32)
softmax_b = tf.get_variable('softmax_b', [config.num_classes], tf.float32)
logits = tf.matmul(output, softmax_w) + softmax_b
# Evaluate
ratio = (60.00 / 5.00)
class_weights = tf.constant([ratio, 1 - ratio])
weighted_logits = tf.mul(logits, class_weights)
loss = tf.nn.softmax_cross_entropy_with_logits(weighted_logits, self._targets)
self._cost = cost = tf.reduce_mean(loss)
self._predict = tf.argmax(tf.nn.softmax(logits), 1)
self._correct = tf.equal(tf.argmax(logits, 1), tf.argmax(self._targets, 1))
self._accuracy = tf.reduce_mean(tf.cast(self._correct, tf.float32))
self._final_state = state
if not is_training:
return
# Optimize
optimizer = tf.train.AdamOptimizer()
self._train_op = optimizer.minimize(cost)
#property
def features(self):
return self._features
#property
def targets(self):
return self._targets
#property
def cost(self):
return self._cost
#property
def accuracy(self):
return self._accuracy
#property
def train_op(self):
return self._train_op
#property
def predict(self):
return self._predict
#property
def initial_state(self):
return self._initial_state
#property
def final_state(self):
return self._final_state
ar_train.py
import os
from datetime import datetime
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import ar_network
import ar_config
import ar_reader
config = ar_config.get_config()
def main(argv=None):
if gfile.Exists(config.train_dir):
gfile.DeleteRecursively(config.train_dir)
gfile.MakeDirs(config.train_dir)
train()
def train():
train_data = ar_reader.ArousalData(config.train_data, num_steps=config.max_steps)
test_data = ar_reader.ArousalData(config.test_data, num_steps=config.max_steps)
with tf.Graph().as_default(), tf.Session() as session, tf.device('/cpu:0'):
initializer = tf.random_uniform_initializer(minval=-0.1, maxval=0.1)
with tf.variable_scope('model', reuse=False, initializer=initializer):
m = ar_network.ARModel(is_training=True)
s = tf.train.Saver(tf.all_variables())
tf.initialize_all_variables().run()
for batch_input, batch_target in train_data:
step = train_data.iter_steps
dict = {
m.features: batch_input,
m.targets: batch_target
}
session.run(m.train_op, feed_dict=dict)
state, cost, accuracy = session.run([m.final_state, m.cost, m.accuracy], feed_dict=dict)
if not step % 10:
test_input, test_target = test_data.next()
test_accuracy = session.run(m.accuracy, feed_dict={
m.features: test_input,
m.targets: test_target
})
now = datetime.now().time()
print ('%s | Iter %4d | Loss= %.5f | Train= %.5f | Test= %.3f' % (now, step, cost, accuracy, test_accuracy))
if not step % 1000:
destination = os.path.join(config.train_dir, 'ar_model.ckpt')
s.save(session, destination)
if __name__ == '__main__':
tf.app.run()
ar_config.py
class Config(object):
# Directories
train_dir = '...'
ckpt_dir = '...'
train_data = '...'
test_data = '...'
# Data
num_features = 13
num_classes = 2
batch_size = 60
# Model
num_hidden = 3
num_delays = 5
# Training
max_steps = 100000
def get_config():
return Config()
UPDATED ARCHITECTURE:
# Placeholders
self._features = tf.placeholder(tf.float32, [None, config.num_features, config.num_delays], name='ModelInput')
self._targets = tf.placeholder(tf.float32, [None, config.num_output], name='ModelOutput')
# Weights
weights = {
'hidden': tf.get_variable('w_hidden', [config.num_features, config.num_hidden], tf.float32),
'out': tf.get_variable('w_out', [config.num_hidden, config.num_classes], tf.float32)
}
biases = {
'hidden': tf.get_variable('b_hidden', [config.num_hidden], tf.float32),
'out': tf.get_variable('b_out', [config.num_classes], tf.float32)
}
#Layer in
with tf.variable_scope('input_hidden') as scope:
inputs = self._features
inputs = tf.transpose(inputs, perm=[2, 0, 1]) # (BatchSize,NumFeatures,TimeSteps) -> (TimeSteps,BatchSize,NumFeatures)
inputs = tf.reshape(inputs, shape=[-1, config.num_features]) # (TimeSteps,BatchSize,NumFeatures -> (TimeSteps*BatchSize,NumFeatures)
inputs = tf.add(tf.matmul(inputs, weights['hidden']), biases['hidden'])
#Layer hidden
with tf.variable_scope('hidden_hidden') as scope:
inputs = tf.split(0, config.num_delays, inputs) # -> n_steps * (batchsize, features)
cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
outputs, state = rnn.rnn(cell, inputs, dtype=tf.float32)
#Layer out
with tf.variable_scope('hidden_output') as scope:
output = outputs[-1]
logits = tf.add(tf.matmul(output, weights['out']), biases['out'])
Odd elements
Weighted loss
I am not sure your "weighted loss" does what you want it to do:
ratio = (60.00 / 5.00)
class_weights = tf.constant([ratio, 1 - ratio])
weighted_logits = tf.mul(logits, class_weights)
this is applied before calculating the loss function (further I think you wanted an element-wise multiplication as well? also your ratio is above 1 which makes the second part negative?) so it forces your predictions to behave in a certain way before applying the softmax.
If you want weighted loss you should apply this after
loss = tf.nn.softmax_cross_entropy_with_logits(weighted_logits, self._targets)
with some element-wise multiplication of your weights.
loss = loss * weights
Where your weights have a shape like [2,]
However, I would not recommend you to use weighted losses. Perhaps try increasing the ratio even further than 1:6.
Architecture
As far as I can read, you are using 5 stacked LSTMs with 3 hidden units per layer?
Try removing the multi rnn and just use a single LSTM/GRU (maybe even just a vanilla RNN) and jack the hidden units up to ~100-1000.
Debugging
Often when you are facing problems with an odd behaving network, it can be a good idea to:
Print everything
Literally print the shapes and values of every tensor in your model, use sess to fetch it and then print it. Your input data, the first hidden representation, your predictions, your losses etc.
You can also use tensorflows tf.Print() x_tensor = tf.Print(x_tensor, [tf.shape(x_tensor)])
Use tensorboard
Using tensorboard summaries on your gradients, accuracy metrics and histograms will reveal patterns in your data that might explain certain behavior, such as what lead to exploding weights. Like maybe your forget bias goes to infinity or your not tracking gradient through a certain layer etc.
Other questions
How large is your dataset?
How long are your sequences?
Are the 13 features categorical or continuous? You should not normalize categorical variables or represent them as integers, instead you should use one-hot encoding.
Gunnar has already made lots of good suggestions. A few more small things worth paying attention to in general for this sort of architecture:
Try tweaking the Adam learning rate. You should determine the proper learning rate by cross-validation; as a rough start, you could just check whether a smaller learning rate saves your model from crashing on the training data.
You should definitely use more hidden units. It's cheap to try larger networks when you first start out on a dataset. Go as large as necessary to avoid the underfitting you've observed. Later you can regularize / pare down the network after you get it to learn something useful.
Concretely, how long are the sequences you are passing into the network? You say you have a 30k-long time sequence.. I assume you are passing in subsections / samples of this sequence?