When I use Keras's binary_crossentropy as the loss function (which calls TensorFlow's sigmoid_cross_entropy), it seems to produce loss values only in [0, 1]. However, the equation itself
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(
    relu_logits - logits * labels,
    math_ops.log1p(math_ops.exp(neg_abs_logits)),
    name=name)
implies that the range is [0, infinity). So is TensorFlow doing some sort of clipping that I'm not catching? Moreover, since it's doing math_ops.add(), I'd assume the result would surely be greater than 1. Am I right to assume that the loss range can definitely exceed 1?
The cross entropy function is indeed not bounded above. However, it will only take on large values if the predictions are very wrong. Let's first look at the behavior of a randomly initialized network.
With random weights, the many units/layers will usually compound such that the network outputs approximately uniform predictions. That is, in a classification problem with n classes you will get probabilities of around 1/n for each class (0.5 in the two-class case). In this case, the cross entropy will be around the entropy of an n-class uniform distribution, which is log(n), under certain assumptions (see below).
This can be seen as follows: The cross entropy for a single data point is -sum(p(k)*log(q(k))) where p are the true probabilities (labels), q are the predictions, k are the different classes and the sum is over the classes. Now, with hard labels (i.e. one-hot encoded) only a single p(k) is 1, all others are 0. Thus, the term reduces to -log(q(k)) where k is now the correct class. If with a randomly initialized network q(k) ~ 1/n, we get -log(1/n) = log(n).
We can also start from the definition of cross entropy, which is generally entropy(p) + Kullback-Leibler divergence(p, q). If p and q are the same distribution (e.g. p is uniform when we have the same number of examples for each class, and q is around uniform for random networks), then the KL divergence becomes 0 and we are left with entropy(p).
Now, since the training objective is usually to reduce cross entropy, we can think of log(n) as a kind of worst-case value. If it ever gets higher, there is probably something wrong with your model. Since it looks like you only have two classes (0 and 1), log(2) < 1 and so your cross entropy will generally be quite small.
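To make the worst-case intuition concrete, here is a small sanity check (my own illustration, not part of the original answer) using tf.keras.losses.binary_crossentropy: a 50/50 prediction yields a loss of about log(2) ≈ 0.693, while a confidently wrong prediction easily pushes the loss well above 1.
import numpy as np
import tensorflow as tf

y_true = np.array([[1.0], [1.0]])
y_pred = np.array([[0.5],     # "uniform" prediction -> loss ~ log(2) ~ 0.693
                   [0.01]])   # confidently wrong    -> loss ~ -log(0.01) ~ 4.6

loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
print(loss.numpy())  # approximately [0.693 4.605]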
I have two tensors that I am calculating the Spearmans Rank Correlation from, and I would like to be able to have PyTorch automatically adjust the values in these Tensors in a way that increases my Spearmans Rank Correlation number as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I set up a loop that repeatedly adjusts the values in these two tensors so that the Spearman's rank correlation between the two lists I build from them is as high as possible, with PyTorch doing the work? I just need a pointer in the right direction. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates we can accomplish this by finding the matrix M and vector b which minimize ||y - (Mx + b)||²
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)
    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()
    # compute new gradients
    loss.backward()
    # update parameters M and b
    optimizer.step()
    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that are not necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
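For reference, a rough sketch of that closed-form solution (my own addition, not part of the original answer; it assumes torch.linalg.lstsq is available, i.e. PyTorch >= 1.9): stack the inputs with a constant 1 for the bias and solve a single least squares problem.
X_mat = x.squeeze(-1)                               # (200, 10)
Y_mat = y.squeeze(-1)                               # (200, 10)
A = torch.cat([X_mat, torch.ones(200, 1)], dim=1)   # (200, 11), last column is the bias term
W = torch.linalg.lstsq(A, Y_mat).solution           # (11, 10), stacks M^T on top of b^T
M_closed = W[:10].T                                 # (10, 10), compare with M_true
b_closed = W[10:].T                                 # (10, 1),  compare with b_true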
In your case you want to maximize something so make sure to negate your objective function before calling backward.
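As a minimal sketch of that last point (my own addition; the quadratic objective here is only a stand-in for a differentiable surrogate of Spearman's rank correlation, which is not itself differentiable):
import torch
from torch import optim

a = torch.randn(20, 1, requires_grad=True)
optimizer = optim.SGD([a], lr=0.1)

for _ in range(1000):
    # placeholder objective to be maximized; substitute your own differentiable objective
    objective = -torch.mean((a - 3.0) ** 2)
    loss = -objective            # maximizing the objective == minimizing its negation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(a.mean().item())  # should be close to 3.0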
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, tf.zeros_like(y_norm))
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000 * norm_loss. Be very careful with this, as it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))
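Putting the penalty idea together into a complete custom loss (my own sketch, not from the original answer; the weight of 10.0 and the hinge-style penalty max(norm - 1, 0) are assumptions you would tune for your problem):
import tensorflow as tf

def mse_with_norm_penalty(y_true, y_pred, weight=10.0):
    # standard MSE term, per sample
    mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
    # per-sample norm of the prediction
    y_norm = tf.norm(y_pred, axis=-1)
    # hinge-style penalty: zero while the norm stays <= 1, grows linearly above it
    penalty = tf.maximum(y_norm - 1.0, 0.0)
    return mse + weight * penalty

# model.compile(optimizer='adam', loss=mse_with_norm_penalty)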
This is my first post here, so I hope it complies to the guidelines and is interesting also for other people except myself.
I am building a CNN autoencoder that takes matrices of fixed size as input, with the goal of getting a lower-dimensional representation of them (I call them hashes here). I want these hashes to be similar when the matrices are similar. Since only a few of my data points are labeled, I want the loss function to be a combination of two separate parts. One part is the reconstruction error of the autoencoder (this part works correctly). The other part is for the labeled data: since I will have three different classes, I want to calculate, on each batch, the distance between hash values belonging to the same class (this is the part I am having trouble implementing).
My effort so far:
X = tf.placeholder(shape=[None, 512, 128, 1], dtype=tf.float32)
class1_indices = tf.placeholder(shape=[None], dtype=tf.int32)
class2_indices = tf.placeholder(shape=[None], dtype=tf.int32)
hashes, reconstructed_output = self.conv_net(X, weights, biases_enc, biases_dec, keep_prob)
class1_hashes = tf.gather(hashes, class1_indices)
class1_cost = self.calculate_within_class_loss(class1_hashes)
class2_hashes = tf.gather(hashes, class2_indices)
class2_cost = self.calculate_within_class_loss(class2_hashes)
loss_all = tf.reduce_sum(tf.square(reconstructed_output - X))
loss_labeled = class1_cost + class2_cost
loss_op = loss_all + loss_labeled
optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
Here calculate_within_class_loss is a separate function that I created. I have currently implemented it only for the distance of the first hash of a class to the other hashes of that class in the same batch; however, I am not happy with my current implementation, and it does not look like it is working.
def calculate_within_class_loss(self, hash_values):
    first_hash = tf.slice(hash_values, [0, 0], [1, 256])
    total_loss = tf.foldl(
        lambda d, e: d + tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(e, first_hash)))),
        hash_values, initializer=0.0)
    return total_loss
So, I have two questions / issues:
Is there any easy way to calculate the distance of every row to all other rows in a tensor?
My current implementation of the within-class distance, even though it only compares the first element to the other elements, gives me a 'nan' when I try to optimize it.
Thanks for your time and help :)
In the sample code, you are calculating the sum of Euclidean distances between the points.
For this, you will have to loop over the entire dataset and do O(n^2 * m) calculations and have O(n^2 * m) space, i.e. Tensorflow graph operations.
Here, n is the number of vectors and m is the size of the hash, i.e. 256.
However, if you could change your objective to the sum of squared Euclidean distances between all pairs of hashes within the class, i.e.
sum_{i,j} ||h_i - h_j||^2
then you can use the nifty relationship between the squared Euclidean distance and the variance and rewrite the same calculation as
sum_{i,j} ||h_i - h_j||^2 = 2 * n * sum_k sum_i (h_{i,k} - mu_k)^2
where mu_k is the average value of the kth coordinate for the cluster.
This will allow you to compute the value in O(n * m) time and O(n * m) Tensorflow operations.
This would be the way to go if you think this change (i.e. from Euclidean distance to squared Euclidean distance) will not adversely affect your loss function.
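As a rough sketch of this idea (my own addition; the tensor name and the hash size of 256 are assumptions taken from the question):
import tensorflow as tf

def within_class_loss(hash_values):
    # hash_values: [n, 256] hashes belonging to a single class in the batch
    n = tf.cast(tf.shape(hash_values)[0], tf.float32)
    # per-coordinate mean mu_k over the class
    mu = tf.reduce_mean(hash_values, axis=0, keepdims=True)
    # sum over all pairs of squared Euclidean distances, via the variance identity, in O(n * m)
    return 2.0 * n * tf.reduce_sum(tf.square(hash_values - mu))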
I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
def fit(self, X, Y):
    # Convert to data frame in case X is a numpy matrix
    X = pd.DataFrame(X)
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0] * X.shape[0]))
    # Find dimensions of train
    m, d = X.shape
    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = list(range(0, m))
        shuffle(indices)
        for i in indices:  # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta * xi) - Y[i]  # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi * errori + self.l * beta_prev
            beta = beta_prev - self.alpha * gradient_vector
        epochs += 1
The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weights vector to low values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values of the weight vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.
SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.
Your assumption is not correct.
It's hard to answer this because there are so many different methods/environments, but I will try to mention some points.
Normalization
When a method is not scale-invariant (and I think every linear-regression variant is not), you really should normalize your data
I take it that you are only skipping this for debugging / analysis purposes
Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function: large values may contribute much more to the loss than small ones)! See the standardization sketch after this list.
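A minimal standardization sketch (my own addition, plain NumPy; column-wise z-scoring before running SGD):
import numpy as np

def standardize(X, eps=1e-8):
    # column-wise z-scoring: zero mean, unit variance per feature
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps), mu, sigma

# X_scaled, mu, sigma = standardize(X_train)
# apply the same mu and sigma to validation/test data afterwards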
Convergence
There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= the global minimum in your convex-optimization problem) for certain choices of hyper-parameters (learning rate and learning schedule/decay)
Even optimizing normalized data can fail with SGD when those parameters are bad!
This is one of the most important downsides of SGD: its dependence on hyper-parameters
As SGD is based on gradients and step sizes, non-normalized data can have a huge effect on whether this convergence is achieved!
In order for SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter); in the case of ridge regression it should be less than 2*(1 + p/s^2)/s, where p is the ridge penalty.
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data, it might require smaller step sizes or allow for larger step sizes; it all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether or not that fits your needs. Sometimes it makes sense to normalize, but sometimes (for example, when you want to give different importance to different samples, or when you think a larger signal energy means a better SNR) it doesn't.
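As a small illustration of that bound (my own addition, following the 2/s and 2*(1 + p/s^2)/s expressions quoted above; treat it as a heuristic check rather than a guarantee):
import numpy as np

def max_stable_step_size(X, ridge_penalty=0.0):
    # largest singular value of the (intercept-augmented) data matrix
    s = np.linalg.svd(X, compute_uv=False)[0]
    if ridge_penalty == 0.0:
        return 2.0 / s
    return 2.0 * (1.0 + ridge_penalty / s ** 2) / s

# X = np.random.randn(200, 10)
# print(max_stable_step_size(X))       # plain linear regression
# print(max_stable_step_size(X, 0.5))  # ridge regression with penalty 0.5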
I am having a hard time with calculating cross entropy in tensorflow. In particular, I am using the function:
tf.nn.softmax_cross_entropy_with_logits()
Using what is seemingly simple code, I can only get it to return a zero
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
a = tf.placeholder(tf.float32, shape =[None, 1])
b = tf.placeholder(tf.float32, shape = [None, 1])
sess.run(tf.global_variables_initializer())
c = tf.nn.softmax_cross_entropy_with_logits(
    logits=b, labels=a
).eval(feed_dict={b: np.array([[0.45]]), a: np.array([[0.2]])})
print c
returns
0
My understanding of cross entropy is as follows:
H(p,q) = -sum_x p(x)*log(q(x))
Where p(x) is the true probability of event x and q(x) is the predicted probability of event x.
Therefore, if any two numbers for p(x) and q(x) are used such that
0 < p(x) < 1 AND 0 < q(x) < 1
there should be a nonzero cross entropy. I am expecting that I am using tensorflow incorrectly. Thanks in advance for any help.
In addition to Don's answer (+1), this answer written by mrry may interest you, as it gives the formula to calculate the cross entropy in TensorFlow:
An alternative way to write:
xent = tf.nn.softmax_cross_entropy_with_logits(logits, labels)
...would be:
softmax = tf.nn.softmax(logits)
xent = -tf.reduce_sum(labels * tf.log(softmax), 1)
However, this alternative would be (i) less numerically stable (since the softmax may compute much larger values) and (ii) less efficient (since some redundant computation would happen in the backprop). For real uses, we recommend that you use tf.nn.softmax_cross_entropy_with_logits().
Like they say, you can't spell "softmax_cross_entropy_with_logits" without "softmax". Softmax of [0.45] is [1], and log(1) is 0.
From the documentation of tf.nn.softmax_cross_entropy_with_logits:
Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class). For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.
NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.
If using exclusive labels (wherein one and only one class is true at a time), see sparse_softmax_cross_entropy_with_logits.
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
logits and labels must have the same shape [batch_size, num_classes] and the same dtype (either float16, float32, or float64).
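To tie this back to the question (my own sketch, written in TF 2.x eager style): with a single logit per row, the softmax is always [1.0] and the cross entropy is 0; with two classes per row you get a nonzero value.
import tensorflow as tf

# one "class" per row: softmax([0.45]) == [1.0], so the loss is always 0
single = tf.nn.softmax_cross_entropy_with_logits(labels=[[0.2]], logits=[[0.45]])
print(single)  # ~0.0

# two classes per row: now the loss is nonzero
two_class = tf.nn.softmax_cross_entropy_with_logits(labels=[[0.2, 0.8]], logits=[[0.45, 1.2]])
print(two_class)  # > 0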
Here is an implementation in Tensorflow 2.0 in case somebody else (me probably) needs it in the future.
import tensorflow as tf

@tf.function
def cross_entropy(x, y, epsilon=1e-9):
    # the factor of 2 turns the mean over the 2-class axis back into a sum;
    # dividing by log(2) expresses the result in bits
    return -2 * tf.reduce_mean(y * tf.math.log(x + epsilon), -1) / tf.math.log(2.)
x = tf.constant([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.75, 0.25],
], dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = cross_entropy(x, x)

tf.print(y)
tf.print(tape.gradient(y, x))
Output
[-0 1 0.811278105]
[[-1.44269502 29.8973541]
[-0.442695022 -0.442695022]
[-1.02765751 0.557305]]