Problem with "Regression with Probabilistic Layers in TensorFlow Probability"

Problem with "Regression with Probabilistic Layers in TensorFlow Probability" - python

I'm having trouble using tfp.layers.DistributionLambda, I'm a TF newbie trying hard to make the tensors flow. Can someone please provide some insights into how to set up the output distribution's parameters?
Context:
TFP team wrote a tutorial on Regression with Probabilistic Layers in TensorFlow Probability, it set up the following model:
# Build model.
model = tfk.Sequential([
tf.keras.layers.Dense(1 + 1),
tfp.layers.DistributionLambda(
lambda t: tfd.Normal(loc=t[..., :1],
scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])
My problem:
It outputs a normal distribution using tfp.layers.DistributionLambda, but I'm unclear how tfd.Normal's parameters (mean/loc and standard deviation/scale) were set up, so I'm having trouble changing the Normal to a Gamma Distribution. I tried the following, but didn't work (predicted distribution parameters are nan).
def dist_output_layer (t, softplus_scale=0.05):
"""Create distribution with variable mean and variance
"""
mean = t[..., :1]
std_dev = 1e-3 + tf.math.softplus(softplus_scale * mean)
alpha = (mean/std_dev)**2
beta = alpha/mean
return tfd.Gamma(concentration = alpha,
rate = beta
)
# Build model.
model = tf.keras.Sequential([
tf.keras.layers.Dense(20,activation="relu"), # "By using a deeper neural network and introducing nonlinear activation functions, however, we can learn more complicated functional dependencies!
tf.keras.layers.Dense(1 + 1), #two neurons here b/c the output layer's distribution's mean and std. deviation
tfp.layers.DistributionLambda(dist_output_layer)
])
Thanks a lot in advance.

There is a a lot to say about the code snippet you pasted from Medium, to be honest.
I hope you will find my comments below somewhat useful, though.
# Build model.
model = tfk.Sequential([
# The first layer is a Dense layer with 2 units, one for each of the parameters that will
# be learnt (see next layer). Its implied shape is (batch_size, 2).
# Note that this Dense layer has no activation function as we want are any real value that will be used
# to parameterize the Normal distribution in the Normal distribution component of the following
# layer
tf.keras.layers.Dense(1 + 1),
# The following layer is a DistributionLambda that encapsulates a Normal distribution. The
# DistributionLambda takes a function in its constructor, and this function should take the output
# tensor from the previous layer as its input (this is the Dense layer and the comments above).
# The goal is to learn the 2 parameters of the distribution that is loc (the mean) and scale (the standard
# deviation). For this, a lambda construct is used. The ellipsis you can see for the loc
# and scale arguments (that is the 3 dots) are for the batch size. Also note that scale (the standard deviation)
# cannot be negative. The softplus function was used to make sure that the learnt parameter scale doesn't get
# negative.
tfp.layers.DistributionLambda(
lambda t: tfd.Normal(loc=t[..., :1],
scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

Regarding the question about the .05 being added, it's a small offset to solve some gradient issues that can arise without it. Basically a prior saying we're confident that the real variability is NOT smaller than epsilon (here .05), so we're gonna make sure that the std dev is never smaller by just adding that.
See https://github.com/tensorflow/probability/issues/751
Money quote:
"If infinitesimal scales end up being a problem in practice on a given task, the fix we commonly use is a softplus-and-shift, e.g. scale = epsilon + tf.math.softplus(unconstrained_scale), where epsilon is some tiny value like 1e-5 that we are a priori confident is much smaller than the true scale."
EDIT: Actually what is added is 1e-3 for the reasons I described above. As for the multiplication.... might again just be a scaling or gradient adjustment. Or perhaps to make the scale parameter begin at a certain size.

Related

Where to start on creating a method that saves desired changes in a Tensor with PyTorch?

I have two tensors that I am calculating the Spearmans Rank Correlation from, and I would like to be able to have PyTorch automatically adjust the values in these Tensors in a way that increases my Spearmans Rank Correlation number as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I have a loop of constant adjustments of the values in these two tensors to get the highest spearmans rank correlation from 2 lists I make from these 2 tensors while having PyTorch do the work? I just need a guide of where to go. Thank you!

I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates we can accomplish this by finding the matrix M and vector b which minimize |(y - (M x+b))²|
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
# compute loss that we want to minimize
y_hat = torch.matmul(M, x) + b
loss = torch.mean((y - y_hat)**2)
# zero the gradients of the parameters referenced by the optimizer (M and b)
optimizer.zero_grad()
# compute new gradients
loss.backward()
# update parameters M and b
optimizer.step()
if (i + 1) % 100 == 0:
# scale learning rate by factor of 0.9 every 100 steps
optimizer.param_groups[0]['lr'] *= 0.9
print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems not necessarily neural network related. I also choose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer since I think that would just confuse things.
In your case you want to maximize something so make sure to negate your objective function before calling backward.

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolvded2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
shape = [size_image,size_image,1,1],\
dtype = tf.float64,\
initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
# Run the initialization
sess.run(init)
# Do the training loop
for epoch in range(nb_epochs):
minibatch_cost = 0.
seed = seed + 1
permutation = list(np.random.permutation(n_train))
shuffled_ind_samples = np.array(ind_samples_training)[permutation]
# Learning rate update
if learning_rate.shape[0]>1:
lr = learning_rate[epoch]
nb_minibatches = int(np.ceil(n_train/minibatch_size))
for num_minibatch in range(nb_minibatches):
# Minibatch indices
ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
# Loading of the original image (X) and the blurred image (Y)
minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
_ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the xavier initiliazer method provided by tensorflow, and with the true values of the gaussian kernel we actually would like to learn. In both cases, the filter that is learnt results too different from the true gaussian one as it can be seen on the two images available at https://github.com/megalinier/Helsinki-project.

By examining the photos it seems like the network is learning OK, as the predicted image is not so far off the true label - for better results you can tweak some hyperparams but that is not the case.
I think what you are missing is the fact that different kernels can get quite similar results since it is a convolution.
Think about it, you are multiplying some matrix with another, and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be a results of 2.5 + 2.5 + 2.5 + 2.5 and -10 + 10 + 10 + 0.
What I am trying to say, is that your network could be learning just fine, but you will get a different values in the conv kernel than the filter.

I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong but there could be multiple culprits here. For one, squared error provides a weak signal in the case that target and prediction are already quite similar -- and while the xavier-initalized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for Tensorflow, but I'm also interested in general answers. I'm also not quite clear how I could write something like in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)

You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes dropped out. Then you can simply count the outcomes and interpret the result as a measure for uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.

Since I've found nothing simple to implement, I wrote something myself, that models that explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work but I'm not quite sure how well that works out in practice, and I'd appreciate feedback. This is my loss function:
def meanAndVariance(y_true: tf.Tensor , y_pred: tf.Tensor) -> tf.Tensor :
"""Loss function that has the values of the last axis in y_true
approximate the mean and variance of each value in the last axis of y_pred."""
y_pred = tf.convert_to_tensor(y_pred)
y_true = math_ops.cast(y_true, y_pred.dtype)
mean = y_pred[..., 0::2]
variance = y_pred[..., 1::2]
res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
return K.mean(res, axis=-1)
The output dimension is twice the label dimension - mean and variance of each value in the label. The loss function consists of two parts: a mean squared error that has the mean approximate the mean of the label value, and the variance that approximates the difference of the value from the predicted mean.

When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also checkout our recent work on providing a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow ur idea. Treat the activations as random variables and then propagate mean and variance using error propagation to the output layer. Consequently, we obtain two outputs - the mean and the variance.

Incorporate side conditions into Keras neural network

I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less or equal 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?

In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invarient (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000* norm_loss. Be very careful with this, it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target)

Tensorflow GAN discriminator loss NaN since negativ discriminator output

In my implementation of a GAN network the output of the discriminator is something like 2.05145e+07 which leads to 1 - disc_output -> 1-2.05145e+07=-2.05145e+07 (a negativ number) therefore log(1-2.05145e+07) leads to NaN.
I am not the first one with this kind of problem. One solution is to only allow positive values inside the log like done here.
Does anyone knows any better solution to this?
maybe some different loss function ?

Because discriminator returns a probability value, its output must be between 0 and 1. Try using sigmoid ( https://www.tensorflow.org/api_docs/python/tf/sigmoid) before using discriminator outputs.
Additionally, as others did, I suggest using tf.log(tf.maximum(x, 1e-9)) in case of a numerical instability.

There are standard techniques to avoid log numerical instability. For example, what you often care about is the loss (which is a function of the log), not the log value itself. For instance, with logistic loss:
For brevity, let x = logits, z = labels. The logistic loss is
z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
= max(x, 0) - x * z + log(1 + exp(-abs(x)))
These tricks are already implemented in standard tensorflow losses (like tf.losses.sigmoid_cross_entropy). Note that the naive solution of taking a max or a min inside of the log is not a good solution, since there aren't meaningful gradients in the saturated regions: for instance, d/dx[max(x, 0)] = 0 for x < 0, which means there won't be gradients in the saturated region.
TensorFlow has GAN support with tf.contrib.gan. These losses already implement all of the standard numerical stability tricks, and an avoid you having to recreate the wheel.
tfgan = tf.contrib.gan
tfgan.losses.minimax_discriminator_loss(...)
See https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gan for more details.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Problem with "Regression with Probabilistic Layers in TensorFlow Probability" - python

Related

Where to start on creating a method that saves desired changes in a Tensor with PyTorch?

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

How can I predict the expected value and the variance simultaneously with a neural network?

Incorporate side conditions into Keras neural network

Tensorflow GAN discriminator loss NaN since negativ discriminator output

Categories

Resources