So I'm trying to create a VERY simple neural network with no hidden layers, just input (3 elements) and linear output (2 elements).
I then define some variables to store configurations and weights
# some configs
input_size = 3
action_size = 2
min_delta, max_delta = -1, 1
learning_rate_op = 0.5
w = {} # weights
I then create the training network
# training network
with tf.variable_scope('prediction'):
state_tensor = tf.placeholder('float32', [None, input_size], name='state_tensor')
w['q_w'] = tf.get_variable('Matrix', [state_tensor.get_shape().as_list()[1], action_size], tf.float32, tf.random_normal_initializer(stddev=0.02))
w['q_b'] = tf.get_variable('bias', [action_size], initializer=tf.constant_initializer(0))
q = tf.nn.bias_add(tf.matmul(state_tensor, w['q_w']), w['q_b'])
I define the optimizer to minimize the square different between the target value and the training network
# weight optimizer
with tf.variable_scope('optimizer'):
# tensor to hold target value
# eg, target_q_tensor=[10;11]
target_q_tensor = tf.placeholder('float32', [None], name='target_q_tensor')
# tensors for action_tensor, for action_tensor matrix and for value deltas
# eg, action_tensor=[0;1], action_one_hot=[[1,0];[0,1]], q_acted=[Q_0,Q_1]
action_tensor = tf.placeholder('int64', [None], name='action_tensor')
action_one_hot = tf.one_hot(action_tensor, action_size, 1.0, 0.0, name='action_one_hot')
q_acted = tf.reduce_sum(q * action_one_hot, reduction_indices=1, name='q_acted')
# delta
delta = target_q_tensor - q_acted
clipped_delta = tf.clip_by_value(delta, min_delta, max_delta, name='clipped_delta')
# error function
loss = tf.reduce_mean(tf.square(clipped_delta), name='loss')
# optimizer
# optim = tf.train.AdamOptimizer(learning_rate_op).minimize(loss)
optim = tf.train.GradientDescentOptimizer(learning_rate_op).minimize(loss)
And finally, I run some values in an infinite loop. However, the weights are never updated, they maintain the random values with which they were initialized
with tf.Session() as sess:
tf.initialize_all_variables().run()
s_t = np.array([[1,0,0],[1,0,1],[1,1,0],[1,0,0]])
action = np.array([0, 1, 0, 1])
target_q = np.array([10, -11, -12, 13])
while True:
if counter % 10000 == 0:
q_values = q.eval({state_tensor: s_t})
for i in range(len(s_t)):
print("q", q_values[i])
print("w", sess.run(w['q_w']), '\nb', sess.run(w['q_b']))
sess.run(optim, {target_q_tensor: target_q, action_tensor: action, state_tensor: s_t})
I took the code from a working DQN implementation, so I figure I'm doing something blatantly wrong. The network should converge to:
# 0 | 1
####################
1,0,0 # 10 13
1,0,1 # x -11
1,1,0 # -12 x
But they do not change at all. Any pointers?
Turns out that clipping the loss is causing the issue. However, I don't understand why...
If your loss is always 1, then it means your clipped delta is always clipping it to 1. It strikes me as an odd choice to clip the loss anyways. Perhaps you meant to clip the gradient of the loss? See this also.
Removing the clipping entirely will (probably) work as well in simple cases.
Related
I've been working for the past 4 months with a partner on a machine learning project using the CelebA dataset.
Our next step in the project is to create a Conditional DCGAN that would receive 40 labels for each class in the dataset, as either a 0 for a non-existing facial attribute and a 1 for an existing facial attribute, like this: [1, 0, 0, 1, ..., 1, 0] -> length of list is 40,
and generate from it a human face accordingly. So if the set of 40 labels matches a young male with sunglasses, we'll expect the generator to synthesize for us a young-male-with-sunglasses-ish persona (with variation from generation to generation because we feed it random noise each time).
I'm going to use Google Colab Pro's premium GPU option to train the model, if it matters.
I've didn't want to reinvent the wheel and begin working from scratch, so I copied some working code for a CelebA, image generating, non-conditional DCGAN from this article, and tried to tweak it to be a Conditional GAN.
I got stuck in the training loop section due to a tensor size mismatch.
# Training Loop
# Lists to keep track of progress
img_list = []
G_losses = []
D_losses = []
iters = 0
print("Starting Training Loop...")
# For each epoch
for epoch in range(num_epochs):
# For each batch in the dataloader
for i, (real_cpu, conditions) in enumerate(dataloader, 0):
############################
# (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
###########################
## Train with all-real batch
netD.zero_grad()
# Format batch
b_size = real_cpu.size(0)
label = torch.full((b_size, 1, 1, 1), real_label, device=device, dtype=torch.float32)
# Repeat the conditions tensor along the spatial dimensions
conditions = conditions.view(-1, 40, 1, 1)
conditions = conditions.repeat(1, 1, image_size, image_size)
# Forward pass real batch through D
output = netD(real_cpu.to(device), conditions.to(device))
label = torch.full((b_size, 1, 1, 1), real_label, device=device, dtype=torch.float32)
# Calculate loss on all-real batch
errD_real = criterion(output, label)
# Calculate gradients for D in backward pass
errD_real.backward()
D_x = output.mean().item()
## Train with all-fake batch
# Generate batch of latent vectors
noise = torch.randn(b_size, nz, 1, 1, device=device)
# Generate fake image batch with G
fake = netG(noise, conditions.to(device)) # <-- ERROR HERE
label.fill_(fake_label)
# Classify all fake batch with D
output = netD(fake.detach(), conditions.to(device))
# Calculate D's loss on the all-fake batch
errD_fake = criterion(output, label)
# Calculate the gradients for this batch, accumulated (summed) with previous gradients
errD_fake.backward()
D_G_z1 = output.mean().item()
# Compute error of D as sum over the fake and the real batches
errD = errD_real + errD_fake
# Update D
optimizerD.step()
############################
# (2) Update G network: maximize log(D(G(z)))
###########################
netG.zero_grad()
label.fill_(real_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = netD(fake, conditions.to(device))
# Calculate G's loss based on this output
errG = criterion(output, label)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
# Output training stats
if i % 50 == 0:
print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
% (epoch, num_epochs, i, len(dataloader),
errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))
# Save Losses for plotting later
G_losses.append(errG.item())
D_losses.append(errD.item())
# Check how the generator is doing by saving G's output on fixed_noise
if (iters % 500 == 0) or ((epoch == num_epochs-1) and (i == len(dataloader)-1)):
with torch.no_grad():
fake = netG(fixed_noise).detach().cpu()
img_list.append(vutils.make_grid(fake, padding=2, normalize=True))
iters += 1
The error is in this line:
fake = netG(noise, conditions.to(device))
When I try to concatenate the two tensors in the generator's forward function:
def forward(self, input, label):
# Concatenate the input noise and label
input = torch.cat((input, label), 1)
return self.conv(input)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 64 for tensor number 1 in the list.
since:
noise.shape = torch.Size([64, 100, 1, 1])
and
conditions.shape = torch.Size([64, 40, 64, 64])
I tried to reshape conditions with these lines, before feeding the generator with it as its second parameter:
conditions = conditions.permute(0, 2, 3, 1)
conditions = conditions.view(batch_size, 1, 1, 40)
which didn't work:
RuntimeError: shape '[64, 1, 1, 40]' is invalid for input of size 10485760
Thanks in advance for any suggestions/ideas on how to change the code. I really feel like I'm missing something really fundamental here that's preventing the training loop from working properly.
I have been bashing my head against the wall for the past few days - and I simply cannot figure it out.
Would some of you good people perhaps let me know what I am doing wrong?
I am trying to port code from https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb (written in Tensorflow) to Keras. Here is the original part of the code:
class DQNetwork:
def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
with tf.variable_scope(name):
self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
self.actions_ = tf.placeholder(tf.float32, [None, 3], name="actions_")
self.target_Q = tf.placeholder(tf.float32, [None], name="target")
#First convnet: CNN => BatchNormalization => ELU; Input is 84x84x4
self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
filters = 32, kernel_size = [8,8],strides = [4,4],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(), name = "conv1")
self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,training = True,
epsilon = 1e-5,name = 'batch_norm1')
self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
## --> [20, 20, 32]
#Second convnet: CNN => BatchNormalization => ELU
self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
filters = 64,kernel_size = [4,4],strides = [2,2],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),name = "conv2")
self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,training = True,
epsilon = 1e-5,name = 'batch_norm2')
self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
## --> [9, 9, 64]
#Third convnet: CNN => BatchNormalization => ELU
self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
filters = 128,kernel_size = [4,4],strides = [2,2],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),name = "conv3")
self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,training = True,
epsilon = 1e-5,name = 'batch_norm3')
self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
## --> [3, 3, 128]
self.flatten = tf.layers.flatten(self.conv3_out)
## --> [1152]
self.fc = tf.layers.dense(inputs = self.flatten,
units = 512, activation = tf.nn.elu,
kernel_initializer=tf.contrib.layers.xavier_initializer(),name="fc1")
self.output = tf.layers.dense(inputs = self.fc, kernel_initializer=tf.contrib.layers.xavier_initializer(),
units = 3, activation=None)
# Q is our predicted Q value.
self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
# The loss is the difference between our predicted Q_values and the Q_target
# Sum(Qtarget - Q)^2
self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)
# farther below...
Qs_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})
# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
terminal = dones_mb[i]
# If we are in a terminal state, only equals reward
if terminal:
target_Qs_batch.append(rewards_mb[i])
else:
target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
target_Qs_batch.append(target)
targets_mb = np.array([each for each in target_Qs_batch])
loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
feed_dict={DQNetwork.inputs_: states_mb,
DQNetwork.target_Q: targets_mb,
DQNetwork.actions_: actions_mb})
And here is my conversion:
class DQNetworkA:
def __init__(self, state_size, action_size, learning_rate):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
self.model = keras.models.Sequential()
self.model.add(keras.layers.Conv2D(32, (8, 8), strides=(4, 4), padding = "VALID", input_shape=state_size))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Conv2D(64, (4, 4), strides=(2, 2), padding = "VALID"))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Conv2D(128, (4, 4), strides=(2, 2), padding = "VALID"))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Flatten())
self.model.add(keras.layers.Dense(512))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Dense(action_size))
self.model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(lr=self.learning_rate))
print(self.model.summary())
# farther below...
Qs = DQNetwork.predict(states_mb)
Qs_next_state = DQNetwork.predict(next_states_mb)
# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
terminal = dones_mb[i]
t = np.copy(Qs[i])
a = np.argmax(actions_mb[i])
# If we are in a terminal state, only equals reward
if terminal:
t[a] = rewards_mb[i]
else:
t[a] = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
target_Qs_batch.append(t)
dbg_target_Qs_batch.append(t[a])
targets_mb = np.array([each for each in target_Qs_batch])
loss = DQNetwork.train_on_batch(states_mb, targets_mb)
Everything else is the same. I have even tried to mess around with a custom loss function to minimize differences in the code – and it simply does not work! While the original code quickly converges my Keras doodlings simply does not seem to want to work!
Does anyone have a clue? Any hints or help would be highly appreciated...
A little further explanation:
This is a simple DQN playing Doom - so the after about 100 episodes (games), the model seems to be able to shoot the target without a problem every episode. Loss goes down, rewards per game go up - as one would expect... However, in the Keras model loss graph is flat, reward graph is flat - it almost seems not to be able to learn anything. (see the graphs linked below)
Here is how it works. In TF code, model outputs a tensor [a, b, c] where a, b and c give probability of each action the main character might take (ie: [left, right, shoot]). Model is then given reward for every action, so it is passed a target value (target_mb, f.ex. 10) along with which action this is for (one-hot encoded in actions_mb, ie [0,1,0] - if this is a target for moving right). Loss is then computed with a simple MSE over difference between target and predicted value of the model for the given action.
I have done two things:
1) I tried to use the standard "mse" loss as I have seen in other models of this type. To make the loss behave the same way, I pass the model its own input apart from target value. So if model predicts [3,4,5] and the target is 10 for [0,1,0] - we pass [3,10,5] as the truth to the model. This should be equivalent to the actions of the TF model. ie, difference between 10 and 4, squared and then mean over all differences from the batch.
2) When 1) did not work, I tried to make a custom loss function that basically attempts to mimick behaviour of the TF model as closely as possible. So if model predicts [3,4,5] and the target is 10 for [0,1,0] (as above) - we pass [0,10,0] as the truth to the model. Then the custom loss function through some finicky multiplication and division arrives at difference between 10 and 4 - squares it and takes mean of all squared errors as below:
def custom_loss(y_true, y_pred):
isolated_truths = tf.reduce_sum(y_true, axis=1)
isolated_predictions = tf.divide(tf.reduce_sum(tf.multiply(y_true, y_pred), axis=1), isolated_truths)
delta = isolated_predictions - isolated_truths
return tf.reduce_mean(tf.square(delta))
# when training, this small modification is made to targets:
loss = DQN_Keras.train_on_batch(states_mb, targets_mb.reshape(len(targets_mb),1) * actions_mb)
And it still does not work (although you can see on the graphs that the loss seems to behave far more reasonably!).
Take a look at the graphs:
tf model: https://pasteboard.co/IN1b5MN.png
keras model with mse loss: https://pasteboard.co/IN1kH6P.png
keras model with custom loss: https://pasteboard.co/IN17ktg.png
edit #2 - runnable code
Original TF code - copy pasted from tutorial above, working:
=> https://pastebin.com/QLb7nWZi
My code with custom loss in full:
=> https://pastebin.com/3HiYg6t7
Well, I have made work - by removing BatchNormalization layers. Now I am completely mystified... so does batch normalization work differently in Keras and Tensorflow? Or is the missing clue this mysterious "training=True" parameter in TF (not present in Keras)?
PS.
While digging into the issue, I also found this very useful article describing how to create advanced Keras models with several inputs like masks (like in the original TF code!):
https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26
I am trying to create a simple 3D U-net for image segmentation, just to learn how to use the layers. Therefore I do a 3D convolution with stride 2 and then a transpose deconvolution to get back the same image size. I am also overfitting to a small set (test set) just to see if my network is learning.
I created the same net in Keras and it works just fine. Now I want to create in tensorflow but I been having trouble with it.
The cost changes slightly but no matter what I do (reduce learning rate, add more epochs, add more layers, change batch size...) the output is always the same. I believe the net is not updating the weights. I am sure I am doing something wrong but I can find what it is. Any help would be greatly appreciate it.
Here is my code:
def forward_propagation(X):
if ( mode == 'train'): print(" --------- Net --------- ")
# Convolutional Layer 1
with tf.variable_scope('CONV1'):
Z1 = tf.layers.conv3d(X, filters = 16, kernel =[3,3,3], strides = [ 2, 2, 2], padding='SAME', name = 'S2/conv3d')
A1 = tf.nn.relu(Z1, name = 'S2/ReLU')
if ( mode == 'train'): print("Convolutional Layer 1 S2 " + str(A1.get_shape()))
# DEConvolutional Layer 1
with tf.variable_scope('DeCONV1'):
output_deconv1 = tf.stack([X.get_shape()[0] , X.get_shape()[1], X.get_shape()[2], X.get_shape()[3], 1])
dZ1 = tf.nn.conv3d_transpose(A1, filters = 1, kernel =[3,3,3], strides = [2, 2, 2], padding='SAME', name = 'S2/conv3d_transpose')
dA1 = tf.nn.relu(dZ1, name = 'S2/ReLU')
if ( mode == 'train'): print("Deconvolutional Layer 1 S1 " + str(dA1.get_shape()))
return dA1
def compute_cost(output, target, method = 'dice_hard_coe'):
with tf.variable_scope('COST'):
if (method == 'sigmoid_cross_entropy') :
# Make them vectors
output = tf.reshape( output, [-1, output.get_shape().as_list()[0]] )
target = tf.reshape( target, [-1, target.get_shape().as_list()[0]] )
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = output, labels = target)
cost = tf.reduce_mean(loss)
return cost
and the main function for the model:
def model(X_h5, Y_h5, learning_rate = 0.009,
num_epochs = 100, minibatch_size = 64, print_cost = True):
ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables
#tf.set_random_seed(1) # to keep results consistent (tensorflow seed)
#seed = 3 # to keep results consistent (numpy seed)
(m, n_D, n_H, n_W, num_channels) = X_h5["test_data"].shape #TTT
num_labels = Y_h5["test_mask"].shape[4] #TTT
img_size = Y_h5["test_mask"].shape[1] #TTT
costs = [] # To keep track of the cost
accuracies = [] # To keep track of the accuracy
# Create Placeholders of the correct shape
X, Y = create_placeholders(n_H, n_W, n_D, minibatch_size)
# Forward propagation: Build the forward propagation in the tensorflow graph
nn_output = forward_propagation(X)
prediction = tf.nn.sigmoid(nn_output)
# Cost function: Add cost function to tensorflow graph
cost_method = 'sigmoid_cross_entropy'
cost = compute_cost(nn_output, Y, cost_method)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
print('------ Training ------')
# Run the initialization
tf.local_variables_initializer().run(session=sess)
sess.run(init)
# Do the training loop
for i in range(num_epochs*m):
# ----- TRAIN -------
current_epoch = i//m
patient_start = i-(current_epoch * m)
patient_end = patient_start + minibatch_size
current_X_train = np.zeros((minibatch_size, n_D, n_H, n_W,num_channels))
current_X_train[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:]) #TTT
current_X_train = np.nan_to_num(current_X_train) # make nan zero
current_Y_train = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_train[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:]) #TTT
current_Y_train = np.nan_to_num(current_Y_train) # make nan zero
feed_dict = {X: current_X_train, Y: current_Y_train}
_ , temp_cost = sess.run([optimizer, cost], feed_dict=feed_dict)
# ----- TEST -------
# Print the cost every 1/5 epoch
if ((i % (num_epochs*m/5) )== 0):
# Calculate the predictions
test_predictions = np.zeros(Y_h5["test_mask"].shape)
for j in range(0, X_h5["test_data"].shape[0], minibatch_size):
patient_start = j
patient_end = patient_start + minibatch_size
current_X_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_channels))
current_X_test[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:])
current_X_test = np.nan_to_num(current_X_test) # make nan zero
current_Y_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_test[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:])
current_Y_test = np.nan_to_num(current_Y_test) # make nan zero
feed_dict = {X: current_X_test, Y: current_Y_test}
_, current_prediction = sess.run([cost, prediction], feed_dict=feed_dict)
test_predictions[j:j + minibatch_size,:,:,:,:] = current_prediction
costs.append(temp_cost)
print ("[" + str(current_epoch) + "|" + str(num_epochs) + "] " + "Cost : " + str(costs[-1]))
display_progress(X_h5["test_data"], Y_h5["test_mask"], test_predictions, 5, n_H, n_W)
# plot the cost
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('epochs')
plt.show()
return
I call the model with:
model(hdf5_data_file, hdf5_mask_file, num_epochs = 500, minibatch_size = 1, learning_rate = 1e-3)
These are the results that I am currently getting:
Edit:
I have tried reducing the learning rate and it doesn't help. I also tried using tensorboard debug and the weights are not being updated:
I am not sure why this is happening.
I Created the same simple model in keras and it works fine. I am not sure what I am doing wrong in tensorflow.
Not sure if you are still looking for help, as I am answering this question half a year later your posted date. :) I've listed my observations and also some suggestions for you to try below. It my primary observation is right... then you probably just need a coffee break / a night of good sleep.
primary observation:
tf.reshape( output, [-1, output.get_shape().as_list()[0]] ) seems wrong. If you prefer to flatten the vector, it should be something like tf.reshape(output,[-1,np.prod(image_shape_list)]).
other observations:
With such a shallow network, I doubt the network have enough spatial resolution to differentiate tumor voxels from non-tumor voxels. Can you show the keras implementation and the performance compared to a pure tf implementation? I would probably go with 2+ layers, let's .
say with 3 layers, with a stride of 2 per layer, and an input image width of 256, you will end with a width of 32 at your deepest encoder layer. (If you have a limited GPU memory, downsample the input image.)
if changing the loss computation does not work, as #bremen_matt mentioned, reduce LR to say maybe 1e-5.
after the basic architecture tweaks and you "feel" that the network is sort of learning and not stuck, try augmenting the training data, add dropout, batch norm during training, and then maybe fancy up your loss by adding a discriminator.
I have highly unbalanced data in a two class problem that I am trying to use TensorFlow to solve with a NN. I was able to find a posting that exactly described the difficulty that I'm having and gave a solution which appears to address my problem. However I'm working with an assistant, and neither of us really knows python and so TensorFlow is being used like a black box for us. I have extensive (decades) of experience working in a variety of programming languages in various paradigms. That experience allows me to have a pretty good intuitive grasp of what I see happening in the code my assistant cobbled together to get a working model, but neither of us can follow what is going on enough to be able to tell exactly where in TensorFlow we need to make edits to get what we want.
I'm hoping someone with a good knowledge of Python and TensorFlow can look at this and just tell us something like, "Hey, just edit the file called xxx and at the lines at yyy," so we can get on with it.
Below, I have a link to the solution we want to implement, and I've also included the code my assistant wrote that initially got us up and running. Our code produces good results when our data is balanced, but when highly imbalanced, it tends to classify everything skewed to the larger class to get better results.
Here is a link to the solution we found that looks promising:
Loss function for class imbalanced binary classifier in Tensor flow
I've included the relevant code from this link below. Since I know that where we make these edits will depend on how we are using TensorFlow, I've also included our implementation immediately under it in the same code block with appropriate comments to make it clear what we want to add and what we are currently doing:
# Here is the stuff we need to add some place in the TensorFlow source code:
ratio = 31.0 / (500.0 + 31.0)
class_weight = tf.constant([[ratio, 1.0 - ratio]])
logits = ... # shape [batch_size, 2]
weight_per_label = tf.transpose( tf.matmul(labels
, tf.transpose(class_weight)) ) #shape [1, batch_size]
# this is the weight for each datapoint, depending on its label
xent = tf.mul(weight_per_label
, tf.nn.softmax_cross_entropy_with_logits(logits, labels, name="xent_raw") #shape [1, batch_size]
loss = tf.reduce_mean(xent) #shape 1
# NOW HERE IS OUR OWN CODE TO SHOW HOW WE ARE USING TensorFlow:
# (Obviously this is not in the same file in real life ...)
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf
import numpy as np
from math import exp
from PreProcessData import load_and_process_training_Data,
load_and_process_test_data
from PrintUtilities import printf, printResultCompare
tf.set_random_seed(0)
#==============================================================
# predefine file path
''' Unbalanced Training Data, hence there are 1:11 target and nontarget '''
targetFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Train1-35/tar.txt'
nontargetFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Train1-35/nontar.txt'
testFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Test41/feats41.txt'
labelFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Test41/labs41.txt'
# train_x,train_y =
load_and_process_training_Data(targetFilePath,nontargetFilePath)
train_x, train_y =
load_and_process_training_Data(targetFilePath,nontargetFilePath)
# test_x,test_y = load_and_process_test_data(testFilePath,labelFilePath)
test_x, test_y = load_and_process_test_data(testFilePath,labelFilePath)
# trained neural network path
save_path = "nn_saved_model/model.ckpt"
# number of classes
n_classes = 2 # in this case, target or non_target
# number of hidden layers
num_hidden_layers = 1
# number of nodes in each hidden layer
nodes_in_layer1 = 40
nodes_in_layer2 = 100
nodes_in_layer3 = 30 # We think: 3 layers is dangerous!! try to avoid it!!!!
# number of data features in each blocks
block_size = 3000 # computer may not have enough memory, so we divide the train into blocks
# number of times we iterate through training data
total_iterations = 1000
# terminate training if computed loss < supposed loss
expected_loss = 0.1
# max learning rate and min learnign rate
max_learning_rate = 0.002
min_learning_rate = 0.0002
# These are placeholders for some values in graph
# tf.placeholder(dtype, shape=None(optional), name=None(optional))
# It's a tensor to hold our datafeatures
x = tf.placeholder(tf.float32, [None,len(train_x[0])])
# Every row has either [1,0] for targ or [0,1] for non_target. placeholder to hold one hot value
Y_C = tf.placeholder(tf.int8, [None, n_classes])
# variable learning rate
lr = tf.placeholder(tf.float32)
# neural network model
def neural_network_model(data):
if (num_hidden_layers == 1):
# layers contain weights and bias for case like all neurons fired a 0 into the layer, we will need result out
# When using RELUs, make sure biases are initialised with small *positive* values for example 0.1 = tf.ones([K])/10
hidden_1_layer = {'weights': tf.Variable(tf.random_normal([len(train_x[0]), nodes_in_layer1])),
'bias': tf.Variable(tf.ones([nodes_in_layer1]) / 10)}
# no more bias when come to the output layer
output_layer = {'weights': tf.Variable(tf.random_normal([nodes_in_layer1, n_classes])),
'bias': tf.Variable(tf.zeros([n_classes]))}
# multiplication of the raw input data multipled by their unique weights (starting as random, but will be optimized)
l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['bias'])
l1 = tf.nn.relu(l1)
# We repeat this process for each of the hidden layers, all the way down to our output, where we have the final values still being the multiplication of the input and the weights, plus the output layer's bias values.
Ylogits = tf.matmul(l1, output_layer['weights']) + output_layer['bias']
if (num_hidden_layers == 2):
# layers contain weights and bias for case like all neurons fired a 0 into the layer, we will need result out
# When using RELUs, make sure biases are initialised with small *positive* values for example 0.1 = tf.ones([K])/10
hidden_1_layer = {'weights': tf.Variable(tf.random_normal([len(train_x[0]), nodes_in_layer1])),
'bias': tf.Variable(tf.ones([nodes_in_layer1]) / 10)}
hidden_2_layer = {'weights': tf.Variable(tf.random_normal([nodes_in_layer1, nodes_in_layer2])),
'bias': tf.Variable(tf.ones([nodes_in_layer2]) / 10)}
# no more bias when come to the output layer
output_layer = {'weights': tf.Variable(tf.random_normal([nodes_in_layer2, n_classes])),
'bias': tf.Variable(tf.zeros([n_classes]))}
# multiplication of the raw input data multipled by their unique weights (starting as random, but will be optimized)
l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['bias'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['bias'])
l2 = tf.nn.relu(l2)
# We repeat this process for each of the hidden layers, all the way down to our output, where we have the final values still being the multiplication of the input and the weights, plus the output layer's bias values.
Ylogits = tf.matmul(l2, output_layer['weights']) + output_layer['bias']
if (num_hidden_layers == 3):
# layers contain weights and bias for case like all neurons fired a 0 into the layer, we will need result out
# When using RELUs, make sure biases are initialised with small *positive* values for example 0.1 = tf.ones([K])/10
hidden_1_layer = {'weights':tf.Variable(tf.random_normal([len(train_x[0]), nodes_in_layer1])), 'bias':tf.Variable(tf.ones([nodes_in_layer1]) / 10)}
hidden_2_layer = {'weights':tf.Variable(tf.random_normal([nodes_in_layer1, nodes_in_layer2])), 'bias':tf.Variable(tf.ones([nodes_in_layer2]) / 10)}
hidden_3_layer = {'weights':tf.Variable(tf.random_normal([nodes_in_layer2, nodes_in_layer3])), 'bias':tf.Variable(tf.ones([nodes_in_layer3]) / 10)}
# no more bias when come to the output layer
output_layer = {'weights':tf.Variable(tf.random_normal([nodes_in_layer3, n_classes])), 'bias':tf.Variable(tf.zeros([n_classes]))}
# multiplication of the raw input data multipled by their unique weights (starting as random, but will be optimized)
l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['bias'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['bias'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['bias'])
l3 = tf.nn.relu(l3)
# We repeat this process for each of the hidden layers, all the way down to our output, where we have the final values still being the multiplication of the input and the weights, plus the output layer's bias values.
Ylogits = tf.matmul(l3,output_layer['weights']) + output_layer['bias']
return Ylogits # return the neural network model
# set up the training process
def train_neural_network(x):
# produce the prediction base on output of nn model
Ylogits = neural_network_model(x)
# measure the error use build in cross entropy function, the value that we want to minimize
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C))
# To optimize our cost (cross_entropy), reduce error, default learning_rate is 0.001, but you can change it, this case we use default
# optimizer = tf.train.GradientDescentOptimizer(0.003)
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)
# start the session
with tf.Session() as sess:
# We initialize all of our variables first before start
sess.run(tf.global_variables_initializer())
# iterate epoch count time (cycles of feed forward and back prop), each epoch means neural see through all train_data once
for epoch in range(total_iterations):
# count the total cost per epoch, declining mean better result
epoch_loss=0
i=0
decay_speed = 150
# current learning rate
learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * exp(-epoch/decay_speed)
# divide the dataset in to data_set/batch_size in case run out of memory
while i < len(train_x):
# load train data
start = i
end = i + block_size
batch_x = np.array(train_x[start:end])
batch_y = np.array(train_y[start:end])
train_data = {x: batch_x, Y_C: batch_y, lr: learning_rate}
# train
# sess.run(train_step,feed_dict=train_data)
# run optimizer and cost against batch of data.
_, c = sess.run([train_step, cross_entropy], feed_dict=train_data)
epoch_loss += c
i+=block_size
# print iteration status
printf("epoch: %5d/%d , loss: %f", epoch, total_iterations, epoch_loss)
# terminate training when loss < expected_loss
if epoch_loss < expected_loss:
break
# how many predictions we made that were perfect matches to their labels
# test model
# test data
test_data = {x:test_x, Y_C:test_y}
# calculate accuracy
correct_prediction = tf.equal(tf.argmax(Ylogits, 1), tf.argmax(Y_C, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))
print('Accuracy:',accuracy.eval(test_data))
# result matrix, return the position of 1 in array
result = (sess.run(tf.argmax(Ylogits.eval(feed_dict=test_data),1)))
answer = []
for i in range(len(test_y)):
if test_y[i] == [0,1]:
answer.append(1)
elif test_y[i]==[1,0]:
answer.append(0)
answer = np.array(answer)
printResultCompare(result,answer)
# save the prediction of correctness
np.savetxt('nn_prediction.txt', Ylogits.eval(feed_dict={x: test_x}), delimiter=',',newline="\r\n")
# save the nn model for later use again
# 'Saver' op to save and restore all the variables
saver = tf.train.Saver()
saver.save(sess, save_path)
#print("Model saved in file: %s" % save_path)
# load the trained neural network model
def test_loaded_neural_network(trained_NN_path):
Ylogits = neural_network_model(x)
saver = tf.train.Saver()
with tf.Session() as sess:
# load saved model
saver.restore(sess, trained_NN_path)
print("Loading variables from '%s'." % trained_NN_path)
np.savetxt('nn_prediction.txt', Ylogits.eval(feed_dict={x: test_x}), delimiter=',',newline="\r\n")
# test model
# result matrix
result = (sess.run(tf.argmax(Ylogits.eval(feed_dict={x:test_x}),1)))
# answer matrix
answer = []
for i in range(len(test_y)):
if test_y[i] == [0,1]:
answer.append(1)
elif test_y[i]==[1,0]:
answer.append(0)
answer = np.array(answer)
printResultCompare(result,answer)
# calculate accuracy
correct_prediction = tf.equal(tf.argmax(Ylogits, 1), tf.argmax(Y_C, 1))
print(Ylogits.eval(feed_dict={x: test_x}).shape)
train_neural_network(x)
#test_loaded_neural_network(save_path)
So, can anyone help point us to the right place to make the edits that we need to make to resolve our problem? (i.e. what is the name of the file we need to edit, and where is it located.) Thanks in advance!
-gt-
The answer you want:
You should add these codes in your train_neural_network(x) function.
ratio = (num of classes 1) / ((num of classes 0) + (num of classes 1))
class_weight = tf.constant([[ratio, 1.0 - ratio]])
Ylogits = neural_network_model(x)
weight_per_label = tf.transpose( tf.matmul(Y_C , tf.transpose(class_weight)) )
cross_entropy = tf.reduce_mean( tf.mul(weight_per_label, tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C) ) )
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)
instead of these lines:
Ylogits = neural_network_model(x)
# measure the error use build in cross entropy function, the value that we want to minimize
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C))
# To optimize our cost (cross_entropy), reduce error, default learning_rate is 0.001, but you can change it, this case we use default
# optimizer = tf.train.GradientDescentOptimizer(0.003)
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)
More Details:
Since in neural network, we calculate the error of prediction with respect to the targets( the true labels ), in your case, you use the cross entropy error which finds the sum of targets multiple Log of predicted probabilities.
The optimizer of network backpropagates to minimize the error to achieve more accuracy.
Without weighted loss, the weight for each class are equals, so optimizer reduce the error for the classes which have more amount and overlook the other class.
So in order to prevent this phenomenon, we should force the optimizer to backpropogate larger error for class with small amount, to do this we should multiply the errors with a scalar.
I hope it was useful :)
I have implemented and trained a neural network with Theano of k binary inputs (0,1), one hidden layer and one unit in the output layer. Once it has been trained I want to obtain inputs that maximizes the output (e.g. x which makes unit of output layer closest to 1). So far I haven't found an implementation of it, so I am trying the following approach:
Train network => obtain trained weights (theta1, theta2)
Define the neural network function with x as input and trained theta1, theta2 as fixed parameters. That is: f(x) = sigmoid( theta1*(sigmoid (theta2*x ))). This function takes x and with given trained weights (theta1, theta2) gives output between 0 and 1.
Apply gradient descent w.r.t. x on the neural network function f(x) and obtain x that maximizes f(x) with theta1 and theta2 given.
For these I have implemented the following code with a toy example (k = 2). Based on the tutorial on http://outlace.com/Beginner-Tutorial-Theano/ but changed vector y, so that there is only one combination of inputs that gives f(x) ~ 1 which is x = [0, 1].
Edit1: As suggested optimizer was set to None and bias unit was fixed to 1.
Step 1: Train neural network. This runs well and with out error.
import os
os.environ["THEANO_FLAGS"] = "optimizer=None"
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
x = T.dvector()
y = T.dscalar()
def layer(x, w):
b = np.array([1], dtype=theano.config.floatX)
new_x = T.concatenate([x, b])
m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
h = nnet.sigmoid(m)
return h
def grad_desc(cost, theta):
alpha = 0.1 #learning rate
return theta - (alpha * T.grad(cost, wrt=theta))
in_units = 2
hid_units = 3
out_units = 1
theta1 = theano.shared(np.array(np.random.rand(in_units + 1, hid_units), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(hid_units + 1, out_units), dtype=theano.config.floatX))
hid1 = layer(x, theta1) #hidden layer
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
(theta1, grad_desc(fc, theta1)),
(theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 0, 0, 0]) #training data Y
cur_cost = 0
for i in range(5000):
for k in range(len(inputs)):
cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
print(run_forward([0,1]))
Output of run forward for [0,1] is: 0.968905860574.
We can also get values of weights with theta1.get_value() and theta2.get_value()
Step 2: Define neural network function f(x). Trained weights (theta1, theta2) are constant parameters of this function.
Things get a little trickier here because of the bias unit, which is part of he vector of inputs x. To do this I concatenate b and x. But the code now runs well.
b = np.array([[1]], dtype=theano.config.floatX)
#b_sh = theano.shared(np.array([[1]], dtype=theano.config.floatX))
rand_init = np.random.rand(in_units, 1)
rand_init[0] = 1
x_sh = theano.shared(np.array(rand_init, dtype=theano.config.floatX))
th1 = T.dmatrix()
th2 = T.dmatrix()
nn_hid = T.nnet.sigmoid( T.dot(th1, T.concatenate([x_sh, b])) )
nn_predict = T.sum( T.nnet.sigmoid( T.dot(th2, T.concatenate([nn_hid, b]))))
Step 3:
Problem is now in gradient descent as is not limited to values between 0 and 1.
fc2 = (nn_predict - 1)**2
cost3 = theano.function(inputs=[th1, th2], outputs=fc2, updates=[
(x_sh, grad_desc(fc2, x_sh))])
run_forward = theano.function(inputs=[th1, th2], outputs=nn_predict)
cur_cost = 0
for i in range(10000):
cur_cost = cost3(theta1.get_value().T, theta2.get_value().T) #call our Theano-compiled cost function, it will auto update weights
if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
print('Cost: %s' % (cur_cost,))
print x_sh.get_value()
The last iteration prints:
Cost: 0.000220317356533
[[-0.11492753]
[ 1.99729555]]
Furthermore input 1 keeps becoming more negative and input 2 increases, while the optimal solution is [0, 1]. How can this be fixed?
You are adding b=[1] via broadcasting rules as opposed to concatenating it. Also, once you concatenate it, your x_sh has one dimension to many which is why the error occurs at nn_predict and not nn_hid