I'm following this tutorial for tensorflow:
It describes the implementation of the cross entropy function as:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
First, tf.log computes the logarithm of each element of y. Next, we
multiply each element of y_ with the corresponding element of
tf.log(y). Then tf.reduce_sum adds the elements in the second
dimension of y, due to the reduction_indices=1 parameter. Finally,
tf.reduce_mean computes the mean over all the examples in the batch.
From reading the tutorial, my understanding is that both the actual and predicted values of y are 2D tensors. The rows correspond to the number of MNIST vectors used, and the columns to the size of each vector (784).
The quote above says that "we multiply each element of y_ with the corresponding element of tf.log(y)".
My question is: are we doing traditional matrix multiplication here (i.e., row × column)? The sentence suggests that we are not.
The traditional matrix multiplication is only used when calculating the model hypothesis as seen in the code to multiply x by W:
y = tf.nn.softmax(tf.matmul(x, W) + b)
The code y_ * tf.log(y) in the code block:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y),
reduction_indices=[1]))
performs an element-wise multiplication of the original targets => y_ with the log of the predicted targets => y.
The cross-entropy loss measures how far the predicted class probabilities are from the true labels for each observation in the classification problem.
It is this measure (the cross-entropy loss) that is minimized by an optimization algorithm, of which gradient descent is a popular example, to find the set of parameters W that best improves the classifier's performance. We minimize the loss because the lower the cost of error, the better the model.
We are doing element wise multiplication here: y_ * tf.log(y)
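To make the distinction concrete, here is a minimal sketch in the tutorial's TF 1.x style (the numbers are made up):
import tensorflow as tf

# Made-up one-hot targets and log-predictions for a batch of 2 examples, 2 classes.
y_ = tf.constant([[0.0, 1.0], [1.0, 0.0]])
log_y = tf.log(tf.constant([[0.3, 0.7], [0.6, 0.4]]))

with tf.Session() as sess:
    # Element-wise (Hadamard) product: each cell multiplied by its counterpart.
    print(sess.run(y_ * log_y))
    # Traditional matrix product (rows x columns): a different operation entirely.
    print(sess.run(tf.matmul(y_, log_y)))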
If I have the following layer:
x = Conv2D(filters, kernel_size, activation='linear')(x)
Is this layer trainable? We know the derivative of a linear function is constant, so in this case will the weights ever get updated? Similarly:
tf.keras.activations.linear(x) # no changes
tf.keras.activations.relu(x)   # will change
The layer is trainable; your data will simply be approximated by a linear function.
Training is the process of finding the function that best approximates your data. If you don't use an activation (or use the linear one), your data is approximated by a linear function, but the weights are still updated by gradient descent.
E.g., if your layer is Dense(1), your data will be approximated by a line. If your data is 2D, you can plot the points, run training, and see that they are approximated by the line dense.w * x + dense.b.
The function needs to be differentiable (for backpropagation). A linear function is differentiable, so it is fine.
The loss function cannot be linear, because it must have a minimum; but that requirement does not apply to a layer's activation.
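To convince yourself, you can fit a line with a single linear unit; a minimal sketch using tf.keras (the toy data and hyperparameters are illustrative):
import numpy as np
import tensorflow as tf

# Toy 1-D data on the line y = 3x + 2, plus a little noise.
x = np.random.uniform(-1, 1, size=(256, 1)).astype("float32")
y = 3 * x + 2 + 0.05 * np.random.randn(256, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear")])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=100, verbose=0)

# The kernel and bias converge near 3 and 2 -- the weights clearly do update.
print(model.layers[0].get_weights())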
I am having a hard time with calculating cross entropy in tensorflow. In particular, I am using the function:
tf.nn.softmax_cross_entropy_with_logits()
Using what is seemingly simple code, I can only get it to return zero:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
a = tf.placeholder(tf.float32, shape =[None, 1])
b = tf.placeholder(tf.float32, shape = [None, 1])
sess.run(tf.global_variables_initializer())
c = tf.nn.softmax_cross_entropy_with_logits(
logits=b, labels=a
).eval(feed_dict={b:np.array([[0.45]]), a:np.array([[0.2]])})
print(c)
returns
0
My understanding of cross entropy is as follows:
H(p,q) = -Σ_x p(x) * log(q(x))
Where p(x) is the true probability of event x and q(x) is the predicted probability of event x.
Therefore, if any two numbers for p(x) and q(x) are used such that
0 < p(x) < 1 AND 0 < q(x) < 1
there should be a nonzero cross entropy. I expect that I am using TensorFlow incorrectly. Thanks in advance for any help.
In addition to Don's answer (+1), this answer written by mrry may interest you, as it gives the formula to calculate the cross entropy in TensorFlow:
An alternative way to write:
xent = tf.nn.softmax_cross_entropy_with_logits(logits, labels)
...would be:
softmax = tf.nn.softmax(logits)
xent = -tf.reduce_sum(labels * tf.log(softmax), 1)
However, this alternative would be (i) less numerically stable (since
the softmax may compute much larger values) and (ii) less efficient
(since some redundant computation would happen in the backprop). For
real uses, we recommend that you use
tf.nn.softmax_cross_entropy_with_logits().
Like they say, you can't spell "softmax_cross_entropy_with_logits" without "softmax". Softmax of [0.45] is [1], and log(1) is 0.
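To get a nonzero result, give the op at least two classes per row; a minimal sketch (TF 1.x, made-up numbers):
import tensorflow as tf
import numpy as np

logits = tf.placeholder(tf.float32, shape=[None, 2])
labels = tf.placeholder(tf.float32, shape=[None, 2])
xent = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)

with tf.Session() as sess:
    # softmax([0.45, 0.0]) is no longer [1], so the cross entropy is nonzero (~0.853).
    print(sess.run(xent, feed_dict={logits: np.array([[0.45, 0.0]]),
                                    labels: np.array([[0.2, 0.8]])}))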
Measures the probability error in discrete classification tasks in which the
classes are mutually exclusive (each entry is in exactly one class). For
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.
NOTE: While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of labels is
a valid probability distribution. If they are not, the computation of the
gradient will be incorrect.
If using exclusive labels (wherein one and only
one class is true at a time), see sparse_softmax_cross_entropy_with_logits.
WARNING: This op expects unscaled logits, since it performs a softmax
on logits internally for efficiency. Do not call this op with the
output of softmax, as it will produce incorrect results.
logits and labels must have the same shape [batch_size, num_classes]
and the same dtype (either float16, float32, or float64).
Here is an implementation in TensorFlow 2.0 in case somebody else (probably me) needs it in the future.
@tf.function
def cross_entropy(x, y, epsilon=1e-9):
    return -2 * tf.reduce_mean(y * tf.math.log(x + epsilon), -1) / tf.math.log(2.)
x = tf.constant([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.75, 0.25]
], dtype=tf.float32)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = cross_entropy(x, x)

tf.print(y)
tf.print(tape.gradient(y, x))
Output
[-0 1 0.811278105]
[[-1.44269502 29.8973541]
[-0.442695022 -0.442695022]
[-1.02765751 0.557305]]
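As a sanity check (my addition, not part of the original answer), the same numbers can be reproduced with the built-in tf.keras.losses.categorical_crossentropy, which returns values in nats; dividing by ln 2 converts them to the bits shown above. Note that the factor of 2 in the custom function undoes the reduce_mean over the two classes, so it is specific to two-column inputs.
import tensorflow as tf

x = tf.constant([[1.0, 0.0], [0.5, 0.5], [0.75, 0.25]], dtype=tf.float32)

nats = tf.keras.losses.categorical_crossentropy(x, x)  # -sum(p * log(q)) per row
bits = nats / tf.math.log(2.0)                         # convert nats to bits
tf.print(bits)  # ~[0 1 0.811278], matching the output above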
I am trying to implement a window-based classifier with TensorFlow.
The word embedding matrix is called word_vecs and is initialized randomly (I also tried Xavier initialization).
The ind variable is a matrix of indices into that embedding matrix, one row of config['window_size'] indices per example.
The first layer is config['window_size'] (5) word vectors concatenated.
word_vecs = tf.Variable(tf.random_uniform([len(words), config['embed_size']], -1.0, 1.0),dtype=tf.float32)
ind = tf.placeholder(tf.int32, [None, config['window_size']])
x = tf.concat(1,tf.unpack(tf.nn.embedding_lookup(word_vecs, ind),axis=1))
W0 = tf.Variable(tf.random_uniform([config['window_size']*config['embed_size'], config['hidden_layer']]))
b0 = tf.Variable(tf.zeros([config['hidden_layer']]))
W1 = tf.Variable(tf.random_uniform([config['hidden_layer'], out_layer]))
b1 = tf.Variable(tf.zeros([out_layer]))
y0 = tf.nn.tanh(tf.matmul(x, W0) + b0)
y1 = tf.nn.softmax(tf.matmul(y0, W1) + b1)
y_ = tf.placeholder(tf.float32, [None, out_layer])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y1), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(0.5).minimize(cross_entropy)
And this is how I run the graph:
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(config['iterations']):
    r = random.randint(0, len(sentences) - 1)
    inds = generate_windows([w for w, t in sentences[r]])
    # inds now contains an array of n rows by window_size columns
    ys = [one_hot(tags.index(t), len(tags)) for w, t in sentences[r]]
    # ys now contains an array of n rows by output_size columns
    sess.run(train_step, feed_dict={ind: inds, y_: ys})
The dimensions work out, and the code runs.
However, the accuracy is near zero, and I suspect that the word vectors aren't being updated properly.
How can I make TensorFlow update the word vectors back through the concatenated window form?
Your embeddings are initialised using tf.Variable, which is trainable by default, so they will be updated. The problem might be in the way you are calculating the loss. Look at the following lines:
y1 = tf.nn.softmax(tf.matmul(y0, W1) + b1)
y_ = tf.placeholder(tf.float32, [None, out_layer])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y1), reduction_indices=[1]))
Here you are computing the softmax manually, which converts the scores into probabilities: softmax(x_i) = exp(x_i) / Σ_j exp(x_j).
If the denominator becomes too large or too small, this computation is numerically unstable, and tf.log(y1) goes to -inf as any predicted probability approaches zero. To avoid this instability, a small epsilon is usually added inside the logarithm, i.e. tf.log(y1 + epsilon); the probabilities are left essentially unchanged, but the logarithm stays finite.
If you don't handle this yourself, the gradients may not update properly due to vanishing or exploding gradients.
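The instability is easy to reproduce in isolation; a tiny sketch with an illustrative saturated output:
import tensorflow as tf
import numpy as np

y1 = np.array([[1.0, 0.0]], dtype=np.float32)  # a saturated softmax output

with tf.Session() as sess:
    print(sess.run(tf.log(y1)))          # [[0. -inf]] -- the gradient breaks here
    print(sess.run(tf.log(y1 + 1e-9)))   # [[~0. ~-20.7]] -- finite and safe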
Avoid those three lines of code and use the TensorFlow version instead:
tf.nn.softmax_cross_entropy_with_logits
(your y_ placeholder holds one-hot vectors; if you switch to integer class indices, use tf.nn.sparse_softmax_cross_entropy_with_logits instead).
Note that this function calculates the softmax internally.
It is advisable to use it rather than computing the loss manually. You can use it as follows:
y1 = tf.matmul(y0, W1) + b1
y_ = tf.placeholder(tf.float32, [None, out_layer])
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y1, labels=y_))
You need to initialize your W matrices to a random value.
Right now y1 is always 0 due to zero initialization.
Your algorithm is a reasonable starting point, but I am fairly confident this direct approach won't work. In fact, word-to-vector training only became practical once estimation approximations applicable to NLP were found, for example the techniques called Importance Sampling and Noise-Contrastive Estimation.
So why doesn't the straightforward approach work? I think it is because the model must pick out exactly one right answer from a large vocabulary, say 80,000 words. One out of 80,000 is too hard to optimize; the gradients tell the model almost nothing in most cases.
Update:
I forgot to mention that the main motivation for these estimation approximations is the performance cost of the straightforward approach when the output is large: each iteration requires computing the loss for every output unit (e.g. all 80,000) for every example, so the optimization becomes intractably slow.
How do you implement word2vec properly with sampling and NCE loss? Easily: following the tutorial here, the loss function looks like this:
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=embed, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))
The main idea is that we only need m negative samples and 1 positive, where m is far smaller than the actual vocabulary size.
TensorFlow also has tf.nn.nce_loss.
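A minimal sketch of the NCE variant (the nce_weights and nce_biases names are illustrative; keyword arguments guard against the positional order, which changed between TensorFlow versions):
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,      # [vocabulary_size, embed_size]
                   biases=nce_biases,        # [vocabulary_size]
                   labels=train_labels,      # [batch_size, 1] target word ids
                   inputs=embed,             # [batch_size, embed_size]
                   num_sampled=num_sampled,  # m negative samples per batch
                   num_classes=vocabulary_size))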
You can read more about the mathematics behind these approaches in the online book www.deeplearningbook.org (I. Goodfellow et al.).
I need to learn TensorFlow quickly and I can't understand this part:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
It's explained with this: First, tf.log computes the logarithm of each element of y. Next, we multiply each element of y_ with the corresponding element of tf.log(y). Then tf.reduce_sum adds the elements in the second dimension of y, due to the reduction_indices=[1] parameter. Finally, tf.reduce_mean computes the mean over all the examples in the batch.
Why does it do these manipulations (the ones marked in bold)? Why do we need the extra dimension? Thanks
There are two dimensions because cross_entropy computes values for a whole batch of training examples. Dimension 0 indexes the batch, and dimension 1 indexes the classes of a single example. For example, if there are 3 possible classes and the batch size is 2, then y is a 2D tensor of shape (2, 3).
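For example, a small sketch with made-up numbers for a batch of 2 examples and 3 classes:
import tensorflow as tf

y_ = tf.constant([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # true one-hot labels
y  = tf.constant([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted probabilities

# Reducing over dimension 1 collapses the classes: one loss value per example.
per_example = -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])
# reduce_mean then averages those values across the batch (dimension 0).
batch_loss = tf.reduce_mean(per_example)

with tf.Session() as sess:
    print(sess.run(per_example))  # [0.357, 0.223]
    print(sess.run(batch_loss))   # 0.290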
In my data set I have a weight for every entry (event). This weight consists of several quantities, but it basically represents how important the event is for the data, and it must be accounted for.
How can I use these weights when training in TensorFlow? I don't want to simply use them as another feature.
Thanks
One simple solution is to multiply the computed cost for each example by its weight, before computing the overall cost for a mini-batch.
Let's say you have the following:
# Vector of features per example.
x = tf.placeholder(tf.float32, shape=[batch_size, num_features])
# Scalar weight per example.
x_weights = tf.placeholder(tf.float32, shape=[batch_size])
# Vector of outputs per example.
y = tf.placeholder(tf.float32, shape=[batch_size, num_outputs])
# ...
logits = ...
# Insert appropriate cost function here.
cost = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y)
The computed cost tensor is a vector of length batch_size. You can perform an element-wise multiplication with x_weights and then reduce to a scalar to get the weighted cost:
overall_cost = tf.reduce_sum(tf.mul(cost, x_weights)) / batch_size
Finally you can use overall_cost as the value to minimize in your optimizer.
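The weighted cost then plugs into an optimizer like any other scalar loss; a short sketch (the learning rate is arbitrary):
# Minimize the weighted cost; feed the per-example weights with each batch.
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(overall_cost)
# sess.run(train_step, feed_dict={x: batch_x, y: batch_y, x_weights: batch_w})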