I am training a deep neural network multi-class classifier using TensorFlow. The network outputs the linear values from the final layer, which the tf.nn.softmax_cross_entropy_with_logits cost function takes as input. However, I don't really care about that linear output per se - I want to know what it looks like when the softmax function is applied to it.
Below are the relevant parts of my code:
def train_network(x, num_hidden_layers):
    prediction = neural_network_model(x, num_hidden_layers)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # train the network
        ...

        # get the network output; x_test is my test data (len=663)
        output = sess.run(prediction, feed_dict={x: x_test})

        # get softmax values of output
        for i in range(len(x_test)):
            softm = sess.run(tf.nn.softmax(output[i]))
            pred_class = sess.run(tf.argmax(softm))
            print(pred_class)
    ...
Now, that final for-loop in which I calculate the softmax values is extremely slow. Why is that, and how do I do this properly?
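For reference, a likely reason for the slowdown is that tf.nn.softmax and tf.argmax are created inside the loop, so every iteration adds new ops to the graph and pays the overhead of a separate sess.run call. A minimal sketch of defining the ops once and evaluating them in a single run (prediction, x, and x_test come from the code above; everything else is assumed):

# build the softmax/argmax ops once, outside any loop
softmax_op = tf.nn.softmax(prediction)           # per-class probabilities
pred_class_op = tf.argmax(softmax_op, axis=1)    # predicted class per example

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... train the network ...

    # a single run evaluates both tensors for the whole test set
    softm, pred_classes = sess.run([softmax_op, pred_class_op],
                                   feed_dict={x: x_test})
    print(pred_classes)    # shape (663,)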
I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero and something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)
    x2 = Input((62, 3))
    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])
    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Activation('softmax')(x)
    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)
    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer; however, I found the abs to be more robust, and changing it does not change the outcome. The input x1 of the model is a sequence, and x2 is a set of additional features that are later concatenated to the embedded x1 sequence. For my approach, I'm not using MSE, but it works either way.
I could provide some data; however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know whether that's normal, or how to compare them to the gradients produced by model.fit(...) (one way to inspect them is sketched below).
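For illustration, a minimal sketch of how the per-variable gradient norms could be logged inside the custom loop (my own addition, not part of the original training code):

for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)

        # log the global norm plus each variable's gradient norm
        print('global grad norm:', tf.linalg.global_norm(
            [g for g in grads if g is not None]).numpy())
        for var, g in zip(model.trainable_variables, grads):
            if g is not None:
                print(var.name, tf.norm(g).numpy())

        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))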
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs are shown in two plots (not reproduced here): one for the custom training run and one for the keras training run.
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses, and it matches the behavior shown in the custom training plot.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using MSE instead of my own loss function actually improves the custom training performance. Still, neither MSE nor a different initialization comes close to the performance of keras fit.
I have found the solution: it was a simple shape mismatch, which was somehow not picked up by any error check and worked both with my custom loss function and MSE. Using x = Reshape(())(x) as the final layer did the trick.
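For illustration, a small self-contained toy of the shape issue and the Reshape fix (this is my own minimal reproduction, not the original model; the broadcasting explanation is my assumption about why the mismatch went unnoticed):

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Reshape
from tensorflow.keras.models import Model

# toy model ending the same way as build_model above
inp = Input((8,))
out = Dense(1, bias_initializer='ones')(inp)   # shape (None, 1)
out = tf.math.abs(out)
out = Reshape(())(out)                         # shape (None,), matches 1-D targets
model = Model(inp, out)

y_true = tf.ones((4,))                         # 1-D targets, like those from the generator
y_pred = model(tf.zeros((4, 8)))
print(y_pred.shape)                            # (4,)

# without the Reshape, y_pred would be (4, 1); MSE then broadcasts
# (4,) against (4, 1) into a (4, 4) matrix before reducing, giving a wrong loss
print(tf.keras.losses.mse(y_true, y_pred))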
I am working on a super-resolution GAN and have some doubts about code I found on GitHub. In particular, the model has multiple inputs and multiple outputs. Also, I have two different loss functions.
In the following code will the mse loss be applied to img_hr and fake_features?
# Build and compile the discriminator
self.discriminator = self.build_discriminator()
self.discriminator.compile(loss='mse',
                           optimizer=optimizer,
                           metrics=['accuracy'])
# Build the generator
self.generator = self.build_generator()
# High res. and low res. images
img_hr = Input(shape=self.hr_shape)
img_lr = Input(shape=self.lr_shape)
# Generate high res. version from low res.
fake_hr = self.generator(img_lr)
# Extract image features of the generated img
fake_features = self.vgg(fake_hr)
# For the combined model we will only train the generator
self.discriminator.trainable = False
# Discriminator determines validity of generated high res. images
validity = self.discriminator(fake_hr)
self.combined = Model([img_lr, img_hr], [validity, fake_features])
self.combined.compile(loss=['binary_crossentropy', 'mse'],
                      loss_weights=[1e-3, 1],
                      optimizer=optimizer)
In the following code will the mse loss be applied to img_hr and fake_features?
From the documentation, https://keras.io/models/model/#compile
"If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses."
In this case, the mse loss will be applied to fake_features, comparing it against the corresponding y_true passed as part of self.combined.fit().
In neural networks, a loss is applied to the outputs of a network in order to measure "how wrong is this output?", so you can take that value and minimize it via gradient descent and backprop.
Following this intuition, the losses in keras are a list with the same length as the outputs of your model. They are applied to the output with the same index.
self.combined = Model([img_lr, img_hr], [validity, fake_features])
This gives you a model with two inputs (img_lr, img_hr) and two outputs (validity, fake_features). So combined.compile(loss=['binary_crossentropy', 'mse']...) uses the binary_crossentropy loss for validity and mean squared error (mse) for fake_features.
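To make the pairing concrete, a small self-contained toy example of a two-output model compiled with a list of losses (the layer names and shapes here are invented for illustration and are not from the SRGAN repository):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# toy model with two outputs, analogous to [validity, fake_features]
inp = Input((4,))
validity = Dense(1, activation='sigmoid', name='validity')(inp)
features = Dense(8, name='fake_features')(inp)
m = Model(inp, [validity, features])

# losses and loss_weights are matched to the outputs by position
m.compile(loss=['binary_crossentropy', 'mse'],
          loss_weights=[1e-3, 1],
          optimizer='adam')

# targets are matched the same way: first array -> validity, second -> fake_features
x = np.random.rand(16, 4)
y_validity = np.ones((16, 1))
y_features = np.random.rand(16, 8)
m.fit(x, [y_validity, y_features], epochs=1, verbose=0)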
I am relatively new to machine learning, especially when it comes to implementing algorithms. I am using Python and the TensorFlow library to implement a neural network that trains on a dataset with about 20 classes. I am able to train and get predictions successfully, but I have a question:
Is it possible to get top k classes along with their probabilities using tensorflow instead of just a single prediction?
If it is possible how can this be done? Thanks for your guidance.
Update 01:
I am adding the code of what I am doing. I built a neural network with 3 hidden layers using tanh, sigmoid, and sigmoid respectively as activation functions, and softmax for the output layer. The code for training and prediction is as follows:
y_pred = None
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        # running the training_epoch numbered epoch
        _, cost = sess.run([optimizer, cost_function], feed_dict={X: tr_features, Y: tr_labels})
        cost_history = np.append(cost_history, cost)
    # predict results based on the trained model
    y_pred = sess.run(tf.argmax(y_, 1), feed_dict={X: ts_features})
Right now y_pred is a list of class labels for each test example in ts_features. But instead of getting a single class label for each test example, I am hoping to get the top-k predictions for each example, each of the k predictions accompanied by some kind of probability.
Using tf.nn.top_k():
top_k_values, top_k_indices = tf.nn.top_k(predictions, k=k)
If predictions is a vector of probabilities per class (i.e. predictions[i] = prediction probability for class i), then top_k_values will contain the k highest probabilities in predictions, and top_k_indices will contain the indices of these probabilities, i.e. the corresponding classes.
Supposing that in your code, y_ is the vector of predicted probabilities per class:
k = 3 # replace with your value
# Instead of `y_pred`:
y_k_probs, y_k_pred = sess.run(
    tf.nn.top_k(y_, k=k), feed_dict={X: ts_features})
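One caveat (an assumption about the asker's graph): if y_ actually holds raw logits rather than softmax outputs, the values returned by tf.nn.top_k will not be probabilities; applying tf.nn.softmax first takes care of that. A minimal sketch:

probs = tf.nn.softmax(y_)             # convert logits to per-class probabilities
top_k_op = tf.nn.top_k(probs, k=3)    # values and class indices of the 3 best classes

y_k_probs, y_k_pred = sess.run(top_k_op, feed_dict={X: ts_features})
# y_k_probs[i] -> the 3 highest probabilities for test example i
# y_k_pred[i]  -> the corresponding class labels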
I have trained a binary classifier model. The model class contains self.cost, self.initial_state, self.final_state and self.logits params. It is saved simply with tf.train.Saver:
saver = tf.train.Saver(tf.global_variables(), max_to_keep=1)
saver.save(session, 'model.ckpt')
After the model was trained I load it as:
with tf.variable_scope("Model", reuse=False):
    model = MODEL(config, is_training=False)

with tf.Session() as session:
    saver = tf.train.Saver(tf.global_variables())
    saver.restore(session, 'model.ckpt')
However, my model.run function returns the cross-entropy loss, which is the last op in the graph. I don't need the loss; I need the model predictions for each batch element:
logits = tf.sigmoid(tf.nn.xw_plus_b(last_layer, self.output_w, self.output_b))
where last_layer is an 800x1 matrix, which I later reshape into a 32x25x1 (batch_size, sequence_length, 1) matrix. It is this matrix that contains the model prediction values in the [0, 1] range.
So, how can I use this model to make a prediction for a single-element 1x1x1 matrix?
Add the OPs necessary to compute accuracy, something like what I have copied below (simply copied out of the closest model I had at hand).
self.logits_flat = tf.argmax(logits, axis=1, output_type=tf.int32)
labels_flat = tf.argmax(labels, axis=1, output_type=tf.int32)
accuracy = tf.cast(tf.equal(self.logits_flat, labels_flat), tf.float32, name='accuracy')
Now when you run the model (either during test or training time) add accuracy to the sess.run call as:
sess.run([train_op, accuracy], feed_dict=...)
or
sess.run([accuracy, logits], feed_dict=...)
All you're doing when you call sess.run is to tell tensorflow to compute the value of whatever you ask for. You need to pass it in any data it needs to perform those computations. Tensorflow is lazy, it won't perform any computations that aren't explicitly necessary to produce the results you request. E.g. if you run the second version of sess.run listed above the optimizer will not be run and hence your weights will not be updated.
Note that you can add the OPs after the network was trained because none of them actually add any variables so they won't affect the save/restore process any.
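To address the single-element case directly, a hedged sketch: model.input_data and model.predictions are hypothetical attribute names, since the MODEL class is not shown; the idea is simply to ask sess.run for the prediction tensor only and feed a batch of size 1, provided the placeholder shapes allow it.

with tf.Session() as session:
    saver = tf.train.Saver(tf.global_variables())
    saver.restore(session, 'model.ckpt')

    # hypothetical: one example shaped like a single batch element, e.g. (1, 1, feature_dim)
    single_x = load_single_example()

    preds = session.run(model.predictions,   # only the prediction op is evaluated
                        feed_dict={model.input_data: single_x})
    print(preds)                             # values in the [0, 1] range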
I would like to train the weights of a model based on the sum of the loss values over several batches. However, it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try to use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that would help minimize the loss. Here's an example tensorflow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])

loss_op = some_loss_function(weights, feature_values, labels)

with tf.Session() as sess:
    for batch in batches:
        feed_dict = fill_feature_values_and_labels(batch)
        # Calculates loss for one batch
        loss = sess.run(loss_op, feed_dict=feed_dict)
        # Adds it to total loss
        total_loss += loss

# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)

with tf.Session() as sess:
    for step in xrange(num_steps):
        sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around this problem? I want to use information across many batches, but I still need the graph intact so that total_loss remains a function of the weights.
The thing you optimize in any of the trainers must be part of the graph; here, what you are training on is the already-realized numpy result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches, e.g.
inputs = tf.placeholder("float", (number_of_batches, batch_size, input_size))
Then have your target also be a 3d tensor which can be trained on.
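In the same spirit, a minimal TF1-style sketch that keeps the summed loss inside the graph so gradients can still flow to the weights. Here I use one placeholder pair per batch and tf.add_n instead of the single 3-D placeholder suggested above; some_loss_function, batches, and fill_feature_values_and_labels are carried over from the question, and the constants are assumed:

import tensorflow as tf

num_batches = 4            # assumed for the sketch
num_feature_values = 10    # assumed for the sketch
num_steps = 100            # assumed for the sketch

weights = tf.Variable(tf.ones([num_feature_values], tf.float32))

# one placeholder pair per batch; the per-batch losses stay symbolic
feature_phs, label_phs, batch_losses = [], [], []
for _ in range(num_batches):
    f = tf.placeholder(tf.int32, shape=[num_feature_values])
    l = tf.placeholder(tf.int32, shape=[1])
    feature_phs.append(f)
    label_phs.append(l)
    batch_losses.append(some_loss_function(weights, f, l))

# the total is still a tensor that depends on `weights`, so minimize() works
total_loss_op = tf.add_n(batch_losses)
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss_op)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        # assumed: fill_feature_values_and_labels(batch) yields the numpy
        # features and labels for that batch
        feed_dict = {}
        for f, l, batch in zip(feature_phs, label_phs, batches):
            feed_dict[f], feed_dict[l] = fill_feature_values_and_labels(batch)
        sess.run(train_op, feed_dict=feed_dict)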