This is my first post here, so I hope it complies with the guidelines and is also interesting for people other than myself.
I am building a CNN autoencoder that takes matrices of fixed size as input, with the goal of obtaining a lower-dimensional representation of them (I call these hashes here). I want these hashes to be similar when the matrices are similar. Since only a few of my data points are labeled, I want the loss function to be a combination of two separate parts. One part is the reconstruction error of the autoencoder (this part works correctly). The other part is for the labeled data: since I have three different classes, on each batch I want to calculate the distance between hash values belonging to the same class (this is the part I am having trouble implementing).
My effort so far:
X = tf.placeholder(shape=[None, 512, 128, 1], dtype=tf.float32)
class1_indices = tf.placeholder(shape=[None], dtype=tf.int32)
class2_indices = tf.placeholder(shape=[None], dtype=tf.int32)
hashes, reconstructed_output = self.conv_net(X, weights, biases_enc, biases_dec, keep_prob)
class1_hashes = tf.gather(hashes, class1_indices)
class1_cost = self.calculate_within_class_loss(class1_hashes)
class2_hashes = tf.gather(hashes, class2_indices)
class2_cost = self.calculate_within_class_loss(class2_hashes)
loss_all = tf.reduce_sum(tf.square(reconstructed_output - X))
loss_labeled = class1_cost + class2_cost
loss_op = loss_all + loss_labeled
optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
Where calculate_within_class_loss is a separate function that I created. I have currently implemented it only for the distance between the first hash of a class and the other hashes of that class in the same batch; however, I am not happy with my current implementation and it does not seem to be working.
def calculate_within_class_loss(self, hash_values):
    first_hash = tf.slice(hash_values, [0, 0], [1, 256])
    total_loss = tf.foldl(lambda d, e: d + tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(e, first_hash)))), hash_values, initializer=0.0)
    return total_loss
So, I have two questions / issues:
Is there any easy way to calculate the distance of every row to all other rows in a tensor?
My current implementation of the within-class distance, even though it only compares the first element with the other elements, gives me a 'nan' when I try to optimize it.
Thanks for your time and help :)
In the sample code, you are calculating the sum of Euclidean distances between the points.
For this, you will have to loop over the entire dataset and do O(n^2 * m) calculations and have O(n^2 * m) space, i.e. Tensorflow graph operations.
Here, n is the number of vectors and m is the size of the hash, i.e. 256.
However, if you could change your objective to the sum of squared Euclidean distances within the class,
sum_{i < j} ||x_i - x_j||^2,
then you can use the nifty relationship between the squared Euclidean distance and the variance and rewrite the same calculation as
n * sum_i sum_k (x_{i,k} - mu_k)^2,
where mu_k is the average value of the kth coordinate for the cluster.
This will allow you to compute the value in O(n * m) time and O(n * m) Tensorflow operations.
This would be the way to go if you think this change (i.e. from Euclidean distance to squared Euclidean distance) will not adversely affect your loss function.
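To make that concrete, here is a minimal sketch of the O(n * m) version, assuming the hashes for one class arrive as an [n, 256] tensor; the name calculate_within_class_loss_sq and the wiring are my assumptions, not the asker's exact setup:
import tensorflow as tf

def calculate_within_class_loss_sq(hash_values):
    # hash_values: [n, 256] hashes belonging to a single class.
    # Sum of squared Euclidean distances over all pairs, via the identity
    # sum_{i<j} ||x_i - x_j||^2 = n^2 * sum_k Var_k.
    n = tf.cast(tf.shape(hash_values)[0], tf.float32)
    mean, variance = tf.nn.moments(hash_values, axes=[0])
    return tf.square(n) * tf.reduce_sum(variance)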
I have a working CNN-LSTM model trying to predict keypoints of human body parts on videos.
Currently, I have four keypoints as labels: right hand, left hand, head, and pelvis.
The problem is that in some frames I can't see all four parts of the person that I want to label, so by default I set those values to (0,0) (which is a null coordinate).
The problem I faced was that the model took those points into account and tried to regress on them while being in a sequence.
Thus, I removed the (0,0) points from the loss calculation and the gradient backpropagation, and it works much better.
The problem is that the four points are still predicted, so I am trying to find out, by any means, how to make it predict a variable number of keypoints.
I thought of adding a third parameter (is it visible?), but it will probably add some complexity and weaken the model.
I think that you'll have to write a custom loss function that computes the loss between points only when the target coordinates are not null.
See PyTorch custom loss function on writing custom losses.
Something like:
def loss(outputs, labels):
    err = 0
    n = 0
    for xo, xt in zip(outputs, labels):
        if torch.all(xt == 0):  # null coordinate, skip this keypoint
            continue
        err += torch.nn.functional.mse_loss(xo, xt)
        n += 1
    return err / n
This is pseudo-code only! An alternative form, which avoids the loop, is to have an explicit binary vector (as suggested by @leleogere) that you can then multiply by the loss on each coordinate before reducing.
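As a rough illustration of that mask-based form (the [batch, n_keypoints, 2] shape is an assumption about your data layout):
import torch

def masked_mse_loss(outputs, labels):
    # outputs, labels: [batch, n_keypoints, 2]; a keypoint whose target is (0, 0) is treated as missing
    visible = (labels.abs().sum(dim=-1) > 0).float()   # [batch, n_keypoints] binary mask
    sq_err = ((outputs - labels) ** 2).sum(dim=-1)     # squared error per keypoint
    return (sq_err * visible).sum() / visible.sum().clamp(min=1)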
I am training 2 autoencoders with 2 separate input paths jointly and I would like to randomly set one of the input paths to zero.
I use tensorflow with keras backend (functional API).
I am computing a joint loss (sum of two losses) for backpropagation.
A -> A' & B -> B'
loss => l2(A, A') + l2(B, B')
The networks taking A and B are connected in the latent space.
I would like to randomly set A or B to zero and compute the loss only on the corresponding path, meaning that if input path A is set to zero, the loss should be computed using only the outputs of path B, and vice versa; e.g.:
0 -> A' & B -> B'
loss: l2(B, B')
How do I randomly set an input path to zero? How do I write a callback which does this?
Maybe try the following:
import random

def decision(probability):
    return random.random() < probability
Define a method that makes a random decision based on a certain probability x and make your loss calculation depend on this decision.
if current_epoch == random.choice(epochs):
    keep_mask = tf.ones_like(A.input, dtype=tf.float32)
    throw_mask = tf.zeros_like(A.input, dtype=tf.float32)
    if decision(probability=0.5):
        total_loss = tf.reduce_sum(reconstruction_loss_a * keep_mask
                                   + reconstruction_loss_b * throw_mask)
    else:
        total_loss = tf.reduce_sum(reconstruction_loss_a * throw_mask
                                   + reconstruction_loss_b * keep_mask)
else:
    total_loss = tf.reduce_sum(reconstruction_loss_a + reconstruction_loss_b)
I assume that you do not want to set one of the paths to zero every time you update your model parameters, as then there is a risk that one or even both models will not be sufficiently trained. Also note that I use the input of A to create the zeros_like and ones_like tensors, as I assume that both inputs have the same shape; if this is not the case, it can easily be adjusted.
Depending on what your goal is, you may also consider replacing the input of A or B with a random tensor, e.g. tf.random.normal, based on a random decision. This creates noise in your model, which may be desirable, as your model would be forced to look into the latent space to try to reconstruct your original input. In other words, you would still calculate your reconstruction loss with A.input and A.output, but in reality your model never received A.input, only the random tensor.
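As a purely illustrative sketch of that noise idea (make_noisy_inputs is a hypothetical helper, and it assumes you assemble the batches yourself, e.g. in a custom training loop):
import random
import tensorflow as tf

def make_noisy_inputs(a_batch, b_batch, p_replace=0.5):
    # With probability p_replace, feed Gaussian noise instead of one of the real
    # input paths; the reconstruction targets stay the original a_batch / b_batch.
    if random.random() < p_replace:
        if random.random() < 0.5:
            return tf.random.normal(tf.shape(a_batch)), b_batch
        return a_batch, tf.random.normal(tf.shape(b_batch))
    return a_batch, b_batch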
Note that this answer serves as a simple conceptual example. A working example with Tensorflow can be found here.
You can set an input to 0 simply:
A = A * random.choice([0, 1])
This code can be used inside a loss function.
I am new to PyTorch and am looking for a quick "get score" function that, given a bunch of samples and a distribution, outputs a tensor consisting of the corresponding score for each individual sample. For instance, consider the following code:
norm = torch.distributions.multivariate_normal.MultivariateNormal(torch.zeros(2),torch.eye(2))
samples = norm.sample((1000,))
samples.requires_grad_(True)
Using samples I would like to create a score tensor of size [1000,2] where the ith component score[i] is the gradient of log p(samples[i]), where p is the density of the given distribution. The method I have come up with is the following:
def get_score(samples, distribution):
    log_probs = distribution.log_prob(samples)
    for i in range(log_probs.size()[0]):
        log_probs[i].backward(retain_graph=True)
The resulting score tensor is then samples.grad. The issue is that my method is quite slow for larger samples (e.g. for a sample of size [50000,2] it takes about 25-30 seconds on my CPU). Is this as fast as it can get?
The only alternative I can think of is to hard-code the score function for each distribution I will use, but that doesn't seem like a good solution!
From experimentation, for 50000 samples, the following is about 50% quicker:
for i in range(50000):
    sample = norm.sample((1,))
    sample.requires_grad_(True)
    log_prob = norm.log_prob(sample)
    log_prob.backward()
This indicates that there should be a better way!
I'm assuming that log_probs is stored as a pytorch tensor.
You can take advantage of the linearity of differentiation to calculate the derivative for all samples at once: log_probs.sum().backward(retain_graph = True)
At least with GPU acceleration this will be a lot faster.
If log_probs is not a tensor but a list of scalars (represented as pytorch tensors of rank 0), you can use log_probs = torch.stack(log_probs) first.
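Putting that together, a minimal sketch of the vectorized version could look like this (get_score mirrors the name from the question; the detach/requires_grad_ step is only there to make the snippet self-contained):
import torch

def get_score(samples, distribution):
    samples = samples.detach().requires_grad_(True)
    log_probs = distribution.log_prob(samples)
    # Each sample only affects its own log-probability, so the gradient of the
    # sum w.r.t. samples[i] is exactly the gradient of log p(samples[i]).
    log_probs.sum().backward()
    return samples.grad

norm = torch.distributions.multivariate_normal.MultivariateNormal(torch.zeros(2), torch.eye(2))
score = get_score(norm.sample((1000,)), norm)   # shape [1000, 2]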
I have two tensors from which I am calculating Spearman's rank correlation, and I would like to be able to have PyTorch automatically adjust the values in these tensors in a way that makes my Spearman's rank correlation as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I have a loop of constant adjustments to the values in these two tensors to get the highest Spearman's rank correlation from the two lists I make from these two tensors, while having PyTorch do the work? I just need a guide on where to go. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10-dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates that we can accomplish this by finding the matrix M and vector b which minimize ||y - (Mx + b)||^2.
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute the loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)
    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()
    # compute new gradients
    loss.backward()
    # update parameters M and b
    optimizer.step()
    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems not necessarily related to neural networks. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something, so make sure to negate your objective function before calling backward.
When I use Keras's binary_crossentropy as the loss function (which calls TensorFlow's sigmoid_cross_entropy), it seems to produce loss values only between [0, 1]. However, the equation itself
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(
    relu_logits - logits * labels,
    math_ops.log1p(math_ops.exp(neg_abs_logits)), name=name)
implies that the range is from [0, infinity). So is Tensorflow doing some sort of clipping that I'm not catching? Moreover, since it's doing math_ops.add() I'd assume it'd be for sure greater than 1. Am I right to assume that loss range can definitely exceed 1?
The cross entropy function is indeed not bounded above. However, it will only take on large values if the predictions are very wrong. Let's first look at the behavior of a randomly initialized network.
With random weights, the many units/layers will usually compound to result in the network outputting approximately uniform predictions. That is, in a classification problem with n classes you will get probabilities of around 1/n for each class (0.5 in the two-class case). In this case, the cross entropy will be around the entropy of an n-class uniform distribution, which is log(n), under certain assumptions (see below).
This can be seen as follows: The cross entropy for a single data point is -sum(p(k)*log(q(k))) where p are the true probabilities (labels), q are the predictions, k are the different classes and the sum is over the classes. Now, with hard labels (i.e. one-hot encoded) only a single p(k) is 1, all others are 0. Thus, the term reduces to -log(q(k)) where k is now the correct class. If with a randomly initialized network q(k) ~ 1/n, we get -log(1/n) = log(n).
We can also start from the definition of the cross entropy, which in general is entropy(p) + KL divergence(p, q). If p and q are the same distribution (e.g. p is uniform when we have the same number of examples for each class, and q is around uniform for random networks), then the KL divergence becomes 0 and we are left with entropy(p).
Now, since the training objective is usually to reduce cross entropy, we can think of log(n) as a kind of worst-case value. If it ever gets higher, there is probably something wrong with your model. Since it looks like you only have two classes (0 and 1), log(2) < 1 and so your cross entropy will generally be quite small.
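As a quick numeric check of both points (plain Python/NumPy, just to illustrate the claim, not taken from Keras itself):
import numpy as np

def binary_cross_entropy(y_true, q):
    # cross entropy of a single example with predicted probability q for class 1
    return -(y_true * np.log(q) + (1 - y_true) * np.log(1 - q))

print(binary_cross_entropy(1, 0.5))    # ~0.693 = log(2): the "uniform prediction" baseline
print(binary_cross_entropy(1, 0.01))   # ~4.6: a confident but wrong prediction, well above 1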