Measuring uncertainty in a Bayesian Neural Network - python

Hi everybody,
I'm getting started with TensorFlow Probability and I'm having some difficulty interpreting the outputs of my Bayesian neural network.
I'm working on a regression case, and started with the example provided in the TensorFlow notebook here: https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html?hl=fr
Since I want to know the uncertainty of my network's predictions, I went straight to example 4, which covers aleatoric & epistemic uncertainty. You can find my code below:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import optimizers
import tensorflow_probability as tfp

tfd = tfp.distributions

def negative_loglikelihood(targets, estimated_distribution):
    return -estimated_distribution.log_prob(targets)

def posterior_mean_field(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size  # total number of parameters (weights and biases)
    c = np.log(np.expm1(1.))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype,
                                 initializer=lambda shape, dtype: random_gaussian_initializer(shape, dtype),
                                 trainable=True),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            # Normal distribution parameterized by location loc and scale.
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + 0.01 * tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1),
            reinterpreted_batch_ndims=1)),
    ])
def build_model(param):
    model = keras.Sequential()
    for i in range(param["n_layers"]):
        name = "n_units_l" + str(i)
        num_hidden = param[name]
        model.add(tfp.layers.DenseVariational(units=num_hidden,
                                              make_prior_fn=prior,
                                              make_posterior_fn=posterior_mean_field,
                                              kl_weight=1/len(X_train),
                                              activation="relu"))
    model.add(tfp.layers.DenseVariational(units=2,
                                          make_prior_fn=prior,
                                          make_posterior_fn=posterior_mean_field,
                                          activation="relu",
                                          kl_weight=1/len(X_train)))
    model.add(tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.01 * t[..., 1:]))))
    lr = param["learning_rate"]
    optimizer = optimizers.Adam(learning_rate=lr)
    model.compile(
        loss=negative_loglikelihood,
        optimizer=optimizer,
        metrics=[keras.metrics.RootMeanSquaredError()],
    )
    return model
I think I have the same network as in the tfp example; I just added a few hidden layers with different numbers of units. I also added the 0.01 factor in front of the softplus in the posterior, as suggested in "Not able to get reasonable results from DenseVariational", which allows the network to reach good performance.
The performance of the model is very good (less than 1% error), but I have some questions:
Since Bayesian neural networks "promise" to measure the uncertainty of their predictions, I was expecting larger errors on high-variance predictions. I plotted the absolute error versus the predicted standard deviation, and the results are not good enough in my opinion. Of course, the model is better at low variance, but I can still get really bad predictions at low variance, and therefore cannot really use the standard deviation to filter out bad predictions. Why is my Bayesian neural network struggling to give me the uncertainty?
The previous network was trained for 2000 epochs, and we can notice a strange phenomenon: a vertical bar at the lowest standard deviations. If I increase the number of epochs up to 25000, my results get better on both the training and validation sets.
But the vertical-bar phenomenon visible in figure 1 becomes much more obvious. It seems that the more I increase the number of epochs, the more all output standard deviations converge to about 0.69. Is this a case of overfitting? Why this value of 0.6931571960449219, and why can't I get a lower standard deviation? Since the phenomenon starts appearing at 2000 epochs, am I already overfitting at 2000 epochs?
At this point the standard deviation is totally useless. So is there a kind of trade-off? With few epochs my model performs worse but gives me some insight about uncertainty (even if I think it's not sufficient), whereas with a lot of epochs I get better performance but no more uncertainty information, as all outputs have the same standard deviation.
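(For context, a minimal sketch of how predictive uncertainty can be extracted from such a model: repeated stochastic forward passes capture the epistemic part, and each returned distribution carries the aleatoric scale. The helper name and the number of samples below are illustrative, not my exact code.)
# Illustrative only: estimate total predictive uncertainty by sampling the
# variational model several times (epistemic) and reading each predicted
# distribution's mean/stddev (aleatoric).
def predict_with_uncertainty(model, X, n_samples=100):
    means, stds = [], []
    for _ in range(n_samples):
        dist = model(X)                      # each call re-samples the weights
        means.append(dist.mean().numpy())
        stds.append(dist.stddev().numpy())
    means = np.stack(means)                  # shape: (n_samples, len(X), 1)
    stds = np.stack(stds)
    pred_mean = means.mean(axis=0)
    # total variance = mean aleatoric variance + variance of the means (epistemic)
    total_std = np.sqrt((stds ** 2).mean(axis=0) + means.var(axis=0))
    return pred_mean, total_std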
Sorry for the long post and the language mistakes.
Thank you in advance for you help and any feed back.

I solved the problem of why my uncertainty could not get lower than 0.6931571960449219.
This value is actually converging to log(2). It is caused by the relu activation function on my last DenseVariational layer.
Indeed, the scale of tfd.Normal is a softplus (tf.math.softplus),
and softplus is implemented as softplus(x) = log(exp(x) + 1). Since my x never goes negative, my minimum uncertainty is softplus(0) = log(2).
A plain linear activation function solved the problem, and my uncertainty now behaves normally.
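(A minimal illustration of the effect and of the fix, reusing the names from build_model above.)
# With relu on the last DenseVariational layer, its outputs t are >= 0, so the
# softplus term in the scale can never go below softplus(0) = log(2) ~ 0.693:
print(tf.math.softplus(0.).numpy())   # 0.6931472
# The fix: give that layer a linear activation so t can also be negative.
model.add(tfp.layers.DenseVariational(units=2,
                                      make_prior_fn=prior,
                                      make_posterior_fn=posterior_mean_field,
                                      activation="linear",
                                      kl_weight=1/len(X_train)))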

Related

Is there a way to improve DNN for linear regression?

I'm creating a deep neural network for linear regression. The net has 3 hidden layers with 256 units per layer (a sketch of the model is shown below).
Each unit has ReLU as its activation function. I also used early stopping to make sure it doesn't overfit.
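(The model code itself did not come through; a minimal Keras sketch matching the description above — 3 hidden layers of 256 ReLU units and a single linear output — might look like this. The input dimension, loss, and early-stopping settings are assumptions.)
import tensorflow as tf

# Rough reconstruction from the description, not the author's exact code.
n_features = 10   # placeholder: the real input dimension is not given
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),   # single linear output for regression
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)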
The target is an integer, and in the training set its values go from 0 to 7860.
After training, I got the following losses:
train_MSE = 33640.5703, train_MAD = 112.6294,
val_MSE = 53932.8125, val_MAD = 138.7836,
test_MSE = 52595.9414, test_MAD= 137.2564
I've tried many different configurations of the net (different optimizers, loss functions, normalizations, regularizers...) but nothing seems to help me reduce the loss further. Even if the training error decreases, the test error never goes below MAD = 130.
Here's the behavior of my net:
My question is whether there's a way to improve my DNN to make more accurate predictions, or whether this is the best I can achieve with my dataset.
If your problem is linear by nature, meaning the real function behind your data is of the form y = a*x + b + epsilon, where the last term is just random noise,
then you won't do any better than fitting the underlying function y = a*x + b. Fitting epsilon would only result in a loss of generalization on new data.
That said, you can try many different things to improve a DNN (a quick sketch of the data-scaling step is given after this list):
Increase the number of hidden layers
Scale or normalize your data
Try a rectified linear unit as the activation
Use more data
Tune learning-algorithm parameters such as the learning rate
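(For the scale/normalize point above, a minimal sketch using scikit-learn; the array names are placeholders.)
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to every split,
# so no information from the validation/test sets leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# For regression it can also help to scale the target and invert the
# transform on the predictions afterwards.
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))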

PyTorch Binary Classification - same network structure, 'simpler' data, but worse performance?

To get to grips with PyTorch (and deep learning in general) I started by working through some basic classification examples. One such example was classifying a non-linear dataset created using sklearn (full code available as a notebook here)
import torch
import torch.nn as nn
from sklearn import datasets

n_pts = 500
X, y = datasets.make_circles(n_samples=n_pts, random_state=123, noise=0.1, factor=0.2)
x_data = torch.FloatTensor(X)
y_data = torch.FloatTensor(y.reshape(500, 1))
This is then accurately classified using a pretty basic neural net
class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)

    def forward(self, x):
        x = torch.sigmoid(self.linear(x))
        x = torch.sigmoid(self.linear2(x))
        return x

    def predict(self, x):
        pred = self.forward(x)
        if pred >= 0.5:
            return 1
        else:
            return 0
As I have an interest in health data, I then decided to try to use the same network structure to classify a basic real-world dataset. I took heart rate data for one patient from here, and altered it so that all values > 91 are labelled as anomalies (i.e. labelled 1, and everything <= 91 labelled 0). This is completely arbitrary, but I just wanted to see how the classification would work. The complete notebook for this example is here.
What is not intuitive to me is why the first example reaches a loss of 0.0016 after 1,000 epochs, whereas the second example only reaches a loss of 0.4296 after 10,000 epochs.
Perhaps I am being naive in thinking that the heart rate example would be much easier to classify. Any insights to help me understand why this is not what I am seeing would be great!
TL;DR
Your input data is not normalized.
Use x_data = (x_data - x_data.mean()) / x_data.std()
Increase the learning rate: optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
You'll get convergence in only 1000 iterations.
More details
The key difference between the two examples you have is that the data x in the first example is centered around (0, 0) and has very low variance.
On the other hand, the data in the second example is centered around 92 and has relatively large variance.
This initial bias in the data is not taken into account when you randomly initialize the weights which is done based on the assumption that the inputs are roughly normally distributed around zero.
It is almost impossible for the optimization process to compensate for this gross deviation - thus the model gets stuck in a sub-optimal solution.
Once you normalize the inputs, by subtracting the mean and dividing by the std, the optimization process becomes stable again and rapidly converges to a good solution.
For more details about input normalization and weights initialization, you can read section 2.2 in He et al Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (ICCV 2015).
What if I cannot normalize the data?
If, for some reason, you cannot compute the mean and std of the data in advance, you can still use nn.BatchNorm1d to estimate and normalize the data as part of the training process. For example:
class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)  # adding batchnorm
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)

    def forward(self, x):
        x = torch.sigmoid(self.linear(self.bn(x)))  # batchnorm the input x
        x = torch.sigmoid(self.linear2(x))
        return x
This modification, without any change to the input data, yields similar convergence after only 1000 epochs:
A minor comment
For numerical stability, it is better to use nn.BCEWithLogitsLoss instead of nn.BCELoss. To this end, you need to remove the torch.sigmoid from the forward() output; the sigmoid will be computed inside the loss (a sketch of this change is shown below).
See, for example, this thread regarding the related sigmoid + cross entropy loss for binary predictions.
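(A minimal sketch of that change, assuming the same Model class as above.)
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)

    def forward(self, x):
        x = torch.sigmoid(self.linear(self.bn(x)))
        return self.linear2(x)          # raw logits, no final sigmoid

criterion = nn.BCEWithLogitsLoss()      # applies sigmoid + BCE in one stable op
# loss = criterion(model(x_data), y_data)   # y_data: float targets, same shape as logits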
Let's start by understanding how neural networks work: neural networks observe patterns, hence the need for large datasets. In the case of the second example, the pattern you intend to find is if HR < 91: label = 0. This if-condition can be represented by the formula sigmoid((HR - 91) * 1); if you plug various values into the formula, you can see that all values < 91 get label 0 and the others get label 1. I have inferred this formula, and it could be anything as long as it gives the correct values.
Basically, we apply the formula w*x + b, where x is our input data, and we learn the values for w and b. Initially the values are all random, so getting the b value from 1030131190 (a random value) to maybe 98 is fast, since the loss is large and the learning rate allows the values to jump quickly. But once you reach 98, your loss is already decreasing, and with the same learning rate it takes more time to get closer to 91, hence the slow decrease in loss. As the values get closer, the steps taken are even smaller.
This can be confirmed via the loss values: they are constantly decreasing, but while the rate of decrease is high initially, it then becomes smaller. Your network is still learning, just slowly.
Hence, in deep learning, you use a method called a stepped learning rate, where as the epochs increase you decrease the learning rate so that learning converges faster.
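(A minimal sketch of a stepped learning rate in PyTorch; the step size and decay factor are illustrative, and model, criterion, x_data, y_data come from the code above.)
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Multiply the learning rate by 0.5 every 1000 epochs (values are illustrative).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

for epoch in range(10000):
    optimizer.zero_grad()
    loss = criterion(model(x_data), y_data)
    loss.backward()
    optimizer.step()
    scheduler.step()   # decay the learning rate on schedule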

setting two different multiple regression layers

I'm now working on a small project, but I don't know how I should build the model.
So, the number of inputs is 27 and the number of outputs is 163.
I need to find weights and biases by training, and I have done this using 5 layers including relu and dropout.
When I look at the cost graph of training loss and validation loss in TensorBoard, it looks OK.
1) However, what I also need to be concerned about is uniformity, which is calculated as below:
uniformity = (max of y - min of y) / (max of y + min of y)
I have real uniformity data, which is given, and when I compute uniformities from the y_predict values, the difference from the real uniformity values is too big.
Is there any way to take uniformity into account while training, so that it not only cares about finding the right weights and biases, but also keeps the uniformity close?
Thank you!
You may incorporate a uniformity constraint into your loss function during training.
def my_loss(labels, predictions):
    lambda_ = 0.01
    return tf.losses.mean_squared_error(labels, predictions) + \
        lambda_ * uniformity(labels) / uniformity(predictions)
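(The uniformity() helper is not defined in the answer; a minimal sketch following the formula from the question, assuming the 163 outputs lie along the last axis.)
import tensorflow as tf

def uniformity(y):
    # (max - min) / (max + min) over the outputs of each example, averaged over
    # the batch; the small epsilon avoids division by zero.
    y_max = tf.reduce_max(y, axis=-1)
    y_min = tf.reduce_min(y, axis=-1)
    return tf.reduce_mean((y_max - y_min) / (y_max + y_min + 1e-8))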

Cross entropy loss suddenly increases to infinity

I am attempting to replicate a deep convolutional neural network from a research paper. I have implemented the architecture, but after 10 epochs my cross entropy loss suddenly increases to infinity. This can be seen in the chart below. You can ignore what happens to the accuracy after the problem occurs.
Here is the github repository with a picture of the architecture.
After doing some research, I think using an AdamOptimizer or relu might be a problem.
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
EDIT
If anyone is interested, the solution was that I was basically feeding in incorrect data.
Solution: Control the solution space. This might mean using smaller datasets when training, it might mean using fewer hidden nodes, it might mean initializing your W and b differently. Your model is reaching a point where the loss is undefined, which might be due to the gradient being undefined, or to the final_conv signal.
Why: Sometimes, no matter what, a numerical instability is reached. Eventually adding a machine epsilon to prevent dividing by zero (as in the cross entropy loss here) just won't help, because even then the number cannot be accurately represented at the precision you are using. (Ref: https://en.wikipedia.org/wiki/Round-off_error and https://floating-point-gui.de/basic/)
Considerations:
1) When tweaking epsilons, be sure to be consistent with your data type: use the machine epsilon of the precision you are using; for float32 it is about 1.2e-7 (ref: https://en.wikipedia.org/wiki/Machine_epsilon and python numpy machine epsilon).
2) Just in case others reading this are confused: the value in the constructor for AdamOptimizer is the learning rate, but you can also set the epsilon value (ref: How does the parameter epsilon affect AdamOptimizer? and https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer).
3) Numerical instability in tensorflow exists and is difficult to get around. Yes, there is tf.nn.softmax_cross_entropy_with_logits, but this is too specific (what if you don't want a softmax?). Refer to Vahid Kazemi's 'Effective Tensorflow' for an insightful explanation: https://github.com/vahidk/EffectiveTensorflow#entropy
That jump in your loss graph is very weird...
I would like you to focus on a few points:
if your images are not normalized between 0 and 1, then normalize them
if you have normalized your values between -1 and 1, then use a sigmoid layer instead of softmax, because softmax squashes the values between 0 and 1
before using softmax, add a sigmoid layer to squash your values (highly recommended)
another thing you can do is add dropout for every layer
I would also suggest using gradient clipping (e.g. tf.clip_by_value) so that your gradients do not explode or implode (see the sketch after this list)
you can also use L2 regularization
and experiment with the learning rate and epsilon of AdamOptimizer
I would also suggest using TensorBoard to keep track of the weights; that way you will come to know where the weights are exploding
you can also use TensorBoard to keep track of loss and accuracy
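(A minimal sketch of the gradient-clipping suggestion, using the TF1-style API and the cross_entropy loss from the question; the clip range is illustrative.)
# Clip each gradient into [-1, 1] before applying it (range is illustrative).
optimizer = tf.train.AdamOptimizer(1e-5)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v) for g, v in grads_and_vars if g is not None]
train_step = optimizer.apply_gradients(clipped)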
See the softmax formula below:
softmax(x_i) = exp(x_i) / sum_j exp(x_j)
Probably that e to the power of x is the problem: x is a very large number, because of which the softmax gives infinity and hence the loss is infinity.
Make heavy use of TensorBoard to debug, and print the values of the softmax so that you can figure out where you are going wrong.
One more thing I noticed: you are not using any kind of activation function after the convolution layers... I would suggest adding a leaky ReLU after every convolution layer.
Your network is a humongous network, and it is important to use leaky ReLU as the activation function so that it adds non-linearity and hence improves performance.
You may want to use a different value for epsilon in the Adam optimizer (e.g. 0.1 -- 1.0). This is mentioned in the documentation:
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
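(A minimal sketch of that change with the TF1 API used in the question.)
# Same architecture and learning rate as before, but with a larger epsilon.
train_step = tf.train.AdamOptimizer(learning_rate=1e-5, epsilon=0.1).minimize(cross_entropy)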

High Train Set Accuracy, Low Test Set Accuracy, Tensorflow, Regularization, Dropout Tried

I am trying to build a model using Tensorflow NN.
Input matrix size: [6699, m] -> m examples
Output matrix size: [11, m] -> 11 output-layer nodes with a softmax implementation
I am consistently getting very high Train accuracy (>95%) and quite low Test accuracy (~20-30%). Some of the things I tried:
Increased training set size from around m = 1200 to m = 13000
Added Regularization (lambda = 0.5 & 0.7)
Added Dropouts (keep_prob = 0.5)
Gradient Descent and then Adam optimizer (which is useful only for speeding up the convergence)
Tried various values of learning rates
Tried changing the number of layers and the number of neurons in each layer, from a shallow network with 1 hidden layer up to 6 hidden layers.
Currently using 1200 num_epochs, but the cost pretty much stabilizes after 500-600 epochs.
Tried some options on mini_batch size as well.
I have tried all of these over the last week or so. Still, I am unable to decipher what is causing so much divergence between training-set and test-set accuracy, and I am getting this divergence for pretty much every scenario described above. It's a clear case of overfitting, and I am pretty much out of options.
Please suggest what more I can do.
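(For reference, a minimal Keras sketch combining the dropout and L2 regularization described above; the layer sizes and regularization strength are illustrative, not the asker's exact setup.)
import tensorflow as tf

l2 = tf.keras.regularizers.l2(0.5)   # lambda = 0.5, one of the values tried
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", kernel_regularizer=l2,
                          input_shape=(6699,)),
    tf.keras.layers.Dropout(0.5),    # keep_prob = 0.5 from the question
    tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(11, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])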
