I'm using AdamOptimizer to train a simple DNN network.(I'm using version tf1.4).
And I want to throw away some bad samples within a batch during training. Say I have 4096 samples within a batch, and I want to throw 96 samples away and only use the remaining 4000 samples to calculate loss and do backpropagation.
How can I achieve this?
The code set up is very straightforward like below:
lables = tf.reshape(labels, [batch_size, 1])
logits = tf.reshape(logits, [batch_size, 1])
loss_vector = tf.nn.sigmoid_cross_entropy_with_logits(multi_class_labels=labels,
logits=logits)
loss_scalar = tf.reduce_mean(loss_vector)
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = opt.minimize(loss_scalar, global_step=global_step)
one possible solution is to do a mask operation after loss_vector and before reduce_mean. But I'm not sure if it's the right solution and I have some questions about what's going on underhood:
in the minimize() operation, since the input parameter is a scalar, how will it know that some of input samples are masked out?
in the minimize() operation, how many times of backpropagation will happen?
during training, samples are feed in the graph one by one, how can TF know that which should be kept and which should throw away?
Related
This is a strange problem: Imagine a neural network classifier. It is a simple linear layer followed by a sigmoid activation that has an input size of 64, and an output size of 112. There also are 112 training samples, where I expect the output to be a one-hot vector. So the basic structure of a training loop is as follows, where samples is a list of integer indices:
model = nn.Sequential(nn.Linear(64,112),nn.Sequential())
loss_fn = nn.BCELoss()
optimizer = optim.AdamW(model.parameters(),lr=3e-4)
for epoch in range(500):
for input_state, index in samples:
one_hot = torch.zeros(112).float()
one_hot[index] = 1.0
optimizer.zero_grad()
prediction = model(input_state)
loss = loss_fn(prediction,one_hot)
loss.backward()
optimizer.step()
This model does not perform well, but I don't think it's a problem with the model itself, but rather how it's trained. I think that this is happening because for the most part, all of the one_hot tensor is zeros, that the model just tends to gravitate toward all of the outputs being zeros, which is what's happening. The question becomes: "How does this get solved?" I tried using the average loss with all the samples, to no avail. So what do I do?
So this is very embarrassing, but the answer actually lies in how I process my data. This is a text-input project, so I used basic python lists to create blocks of messages, but when I did this, I ended up making it so that all of the inputs the net got were the same, but the output was different every time. I solved tho s problem with the copy method.
I am having some issues with regards to the dimensionality of my tensors in my training function at present. I am using the MNIST dataset, so 10 possible targets, and originally wrote the prototype code using a training batch size of 10, which was in retrospect not the wisest choice. It gave poor results during some earlier tests, and increasing the amount of training iterations saw no benefit. Upon trying to then increase the batch size, I realised that what I had written was not that general, and I was likely never training it on the proper data. Below is my training function:
def Train(tLoops, Lrate):
for _ in range(tLoops):
tempData = train_data.view(batch_size_train, 1, 1, -1)
output = net(tempData)
trainTarget = train_targets
criterion = nn.MSELoss()
print("target:", trainTarget.size())
print("Output:", output.size())
loss = criterion(output, trainTarget.float())
# print("error is:", loss)
net.zero_grad() # zeroes the gradient buffers of all parameters
loss.backward()
for j in net.parameters():
j.data.sub_(j.grad.data * Lrate)
of which the print functions return
target: torch.Size([100])
Output: torch.Size([100, 1, 1, 10])
before the error message on the line where loss is calculated;
RuntimeError: The size of tensor a (10) must match the size of tensor b (100) at non-singleton dimension 3
The first print, target, is a 1-dimensional list of the respective ground truth values for each image. Output contains the output of the neural net for each of those 100 samples, so a 10 x 100 list, however from skimming and reshaping the data from 28 x 28 to 1 x 784 earlier, I seem to have extra dimensions unnecessarily. Does PyTorch provide a way to remove these? I couldn't find anything in the documentation, or is there something else that could be my issue?
There are several problems in your training script. I will address each of them below.
First, you should NOT do data batching by hand. Pytorch/torchvision have functions for that, use a dataset and a data loader: https://pytorch.org/tutorials/recipes/recipes/loading_data_recipe.html.
You should also NEVER update the parameters of you network by hand. Use an Optimizer: https://pytorch.org/docs/stable/optim.html. In your case, SGD without momentum will have the same effect.
The dimensionality of your input seems to be wrong, for MNIST an input tensor should be (batch_size, 1, 28, 28) or (batch_size, 784) if you're training a MLP. Furthermore, the output of your network should be (batch_size, 10)
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less or equal 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invarient (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000* norm_loss. Be very careful with this, it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target)
While trying to train a GAN for image generation I ran into a problem which I cannot explain.
When training the generator, the loss which is returned by train_on_batch after just 2 or 3 iterations directly drops to zero. After investigating I realized some strange behavior of the train_on_batch method:
When I check the following:
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, gen_noise_length])
predictions = GAN.stackedModel.predict(noise)
This returns values all close to zero as I would expect since the generator is not trained yet.
However:
y = np.ones([batch_size, 1])
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, gen_noise_length])
loss = GAN.stackedModel.train_on_batch(noise, y)
here the loss is almost zero even though my expected targets are obvious ones.
When I run:
y = np.ones([batch_size, 1])
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, gen_noise_length])
loss = GAN.stackedModel.test_on_batch(noise, y)
the returned loss is high as I would expect.
What is going on with the train_on_batch method? I'm really clueless here...
edit
My loss is binary-crossentropy and I build the model like:
def createStackedModel(self):
# Build stacked GAN model
gan_in = Input([self.noise_length])
H = self.genModel(gan_in)
gan_V = self.disModel(H)
GAN = Model(gan_in, gan_V)
opt = RMSprop(lr=0.0001, decay=3e-8)
GAN.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
return GAN
edit 2
The generator is constructed by stacking some of those blocks each containing a BatchNormalization:
self.G.add(UpSampling2D())
self.G.add(Conv2DTranspose(int(depth/8), 5, padding='same'))
self.G.add(BatchNormalization(momentum=0.5))
self.G.add(Activation('relu'))
edit 3
I loaded my code to https://gitlab.com/benjamingraf24/DCGAN/
Apparently the problem results from the way how I build the GAN network. So in GANBuilder.py there must be something wrong. However, I cant find it...
BatchNormalization layers behave differently during training and testing phase.
During training phase they will use the current batch mean and variance of the activations to normalize.
However, during testing phase they use the moving mean and moving variance that they collected during training. Without enough previous training these collected values can be far from the actual batch statistics, resulting in significant loss value differences.
Refer to the Keras documentation for BatchNormalization. The momentum argument is used to define how fast the moving mean and moving average will adapt to freshly collected values of batches during training.
I would like to train the weights of a model based on the sum of the loss value of several batches. However it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try and use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that what help minimize the loss. Here's an example tensorflow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])
loss_op = some_loss_function(weights, feature_values, labels)
with tf.Session() as sess:
for batch in batches:
feed_dict = fill_feature_values_and_labels(batch)
#Calculates loss for one batch
loss = sess.run(loss_op, feed_dict=feed_dict)
#Adds it to total loss
total_loss += loss
# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)
with tf.Session() as sess:
for step in xrange(num_steps):
sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around the problem, where I want to use information across many batches but still need the graph intact in order to preserve the fact that the total_loss is a function of the weights?
The thing you optimize in any of the trainers must be a part of the graph, here what you train on is the actual realized result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches e.g.
intput = tf.placeholder("float", (number_of_batches, batch_size, input_size)
Then have your target also be a 3d tensor which can be trained on.