I am new to programming and am trying to learn machine learning with TensorFlow for Python. The code below is based on (but not copied verbatim from) the official TensorFlow guide (https://www.tensorflow.org/guide/basics). The final curve, showing the model's predictions after training, does not appear on the plot. I have tried two methods of training and both have the same problem. Could anyone help me?
import matplotlib as mp
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as pl
mp.rcParams["figure.figsize"] = [20, 10]
precision = 500
x = tf.linspace(-10.0, 10.0, precision)
def y(x): return 4 * np.sin(x - 1) + 3
newY = y(x) + tf.random.normal(shape=[precision])
class Model(tf.keras.Model):
    def __init__(self, units):
        super().__init__()  # must run before layers are assigned on a Keras model
        self.dense1 = tf.keras.layers.Dense(units = units, activation = tf.nn.relu, kernel_initializer=tf.random.normal, bias_initializer=tf.random.normal)
        self.dense2 = tf.keras.layers.Dense(1)
    def __call__(self, x, training = True):
        x = x[:, tf.newaxis]
        x = self.dense1(x)
        x = self.dense2(x)
        return tf.squeeze(x, axis=1)
model = Model(164)
pl.plot(x, y(x), label = "origin")
pl.plot(x, newY, ".", label = "corrupted")
pl.plot(x, model(x), label = "before training")
""" The first method
vars = model.variables
optimizer = tf.optimizers.SGD(learning_rate = 0.01)
for i in range(1000):
    with tf.GradientTape() as tape:
        prediction = model(x)
        error = (newY-prediction)**2
        mean_error = tf.reduce_mean(error)
    gradient = tape.gradient(mean_error, vars)
    optimizer.apply_gradients(zip(gradient, vars))
"""
# The second method
model.compile(loss = tf.keras.losses.MSE, optimizer = tf.optimizers.SGD(learning_rate = 0.01))
model.fit(x, newY, epochs=100,batch_size=32,verbose=0)
pl.plot(x, model(x), label = "after training")
pl.legend()
pl.show()
I copied your code and investigated it. Your model returns a NaN loss during training. After I removed the kernel and bias initializers, it worked. I don't yet know exactly what is wrong with your initialization, but it seems that some weights were initialized to NaN, which made the predictions NaN as well, so they could not be plotted.
Update: use the initializers module (tensorflow.initializers or tensorflow.keras.initializers), not tensorflow.random. For example, use kernel_initializer=tf.initializers.random_normal instead of what you have.
As far as I can see, your third and fourth curves are the same. They are
pl.plot(x, model(x), label = "before training") and pl.plot(x, model(x), label = "after training"). You can see that the x-axis and y-axis data passed to the two calls are the same.
Hope my answer is helpful to you!
I'm trying to implement a neural network with aleatoric uncertainty estimation for regression in PyTorch, following
Kendall et al.: "What Uncertainties Do We Need in Bayesian Deep
Learning for Computer Vision?" (Link).
However, while the predicted regression values fit the ground truth quite well, the predicted variance looks strange and the loss becomes negative during training.
The paper suggests having two outputs, mean and variance, instead of predicting only the regression value. More precisely, it suggests predicting the mean and log(variance) for numerical stability. Therefore, my network looks as follows:
class ReferenceResNet(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.fcl1 = nn.Linear(1, 32)
        self.fcl2 = nn.Linear(32, 64)
        self.fcl3 = nn.Linear(64, 128)
        self.fcl_mean = nn.Linear(128,1)
        self.fcl_var = nn.Linear(128,1)
    def forward(self, x):
        x = torch.tanh(self.fcl1(x))
        x = torch.tanh(self.fcl2(x))
        x = torch.tanh(self.fcl3(x))
        mean = self.fcl_mean(x)
        log_var = self.fcl_var(x)
        return mean, log_var
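According to the paper, given these outputs, the corresponding loss function consists of a residual regression part and a regularization term:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2} \exp(-s_i) \, \lVert y_i - \hat{y}_i \rVert^2 + \frac{1}{2} s_i \right)$$
where $s_i$ is the log(variance) predicted by the network and $\hat{y}_i$ is the predicted mean.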
I implemented this loss-function accordingly:
def loss_function(pred_mean, pred_log_var, y):
    return 1/len(pred_mean)*(0.5 * torch.exp(-pred_log_var)*torch.sqrt(torch.pow(y-pred_mean, 2))+0.5*pred_log_var).sum()
I tried this code on a self-generated toy dataset (see the image with results). However, the loss becomes negative during training, and when I plot the variance over the dataset after training, it does not really make sense to me, even though the corresponding mean values fit the ground truth quite well:
I already figured out that the negative loss comes from the regularization term, as logarithms are negative for values between 0 and 1. However, I don't believe that the absolute value of the regularization term is supposed to grow larger than the regression part. Does anyone know the reason for this and how I can prevent it? And why does my variance look so strange?
For reproduction, my full code looks as follows:
import torch.nn as nn
import torch
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data.dataset import TensorDataset
from torchvision import datasets, transforms
import math
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class ReferenceRegNet(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.fcl1 = nn.Linear(1, 32)
        self.fcl2 = nn.Linear(32, 64)
        self.fcl3 = nn.Linear(64, 128)
        self.fcl_mean = nn.Linear(128,1)
        self.fcl_var = nn.Linear(128,1)
    def forward(self, x):
        x = torch.tanh(self.fcl1(x))
        x = torch.tanh(self.fcl2(x))
        x = torch.tanh(self.fcl3(x))
        mean = self.fcl_mean(x)
        log_var = self.fcl_var(x)
        return mean, log_var
def toy_function(x):
    return math.sin(x/15-4)+2 + math.sin(x/10-5)
def loss_function(x_mean, x_log_var, y):
    return 1/len(x_mean)*(0.5 * torch.exp(-x_log_var)*torch.sqrt(torch.pow(y-x_mean, 2))+0.5*x_log_var).sum()
# hyperparameters: the values are not shown in the original post and are assumed here
BATCH_SIZE = 100
EVAL_BATCH_SIZE = 100
TRAIN_EPOCHS = 100
# generate toy dataset: A train-set in form of a complex sin-curve
x_train_data = np.array([])
y_train_data = np.array([])
for repeat in range(2):
    for i in range(50, 150):
        for j in range(100):
            sampled_x = i+np.random.randint(101)/100
            sampled_y = toy_function(sampled_x)+np.random.normal(0,0.2)
            x_train_data = np.append(x_train_data, sampled_x)
            y_train_data = np.append(y_train_data, sampled_y)
x_eval_data = list(np.arange(50.0, 150.0, 0.1))
y_eval_data = [toy_function(x) for x in x_eval_data]
LOADER_KWARGS = {'num_workers': 0, 'pin_memory': False} if torch.cuda.is_available() else {}
train_set = TensorDataset(torch.Tensor(x_train_data),torch.Tensor(y_train_data))
eval_set = TensorDataset(torch.Tensor(x_eval_data), torch.Tensor(y_eval_data))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, **LOADER_KWARGS)
eval_loader = torch.utils.data.DataLoader(eval_set, batch_size=EVAL_BATCH_SIZE, shuffle=False, **LOADER_KWARGS)
TRAIN_SIZE = len(train_loader.dataset)
EVAL_SIZE = len(eval_loader.dataset)
assert (TRAIN_SIZE % BATCH_SIZE) == 0
net = ReferenceRegNet().to(DEVICE)
optimizer = optim.Adam(net.parameters(), lr=1e-3)
losses = {}
# train network
for epoch in range(1,TRAIN_EPOCHS+1):
    mean_epoch_loss = 0
    mean_epoch_mse = 0
    # train batches
    for batch_idx, (data, target) in enumerate(tqdm(train_loader), start=1):
        data, target = (data.to(DEVICE)).unsqueeze(dim=1), (target.to(DEVICE)).unsqueeze(dim=1)
        optimizer.zero_grad()  # clear gradients left over from the previous step
        output_means, output_log_var = net(data)
        target_np = target.detach().cpu().numpy()
        output_means_np = output_means.detach().cpu().numpy()
        loss = loss_function(output_means, output_log_var, target)
        loss_value = loss.item() # get raw float-value out of loss-tensor
        mean_epoch_loss += loss_value
        # optimize network
        loss.backward()
        optimizer.step()
    mean_epoch_loss = mean_epoch_loss / len(train_loader)
    losses[epoch] = mean_epoch_loss  # reconstructed: needed for the loss plot below
    print("Epoch " + str(epoch) + ": Train-Loss = " + str(mean_epoch_loss))
    with torch.no_grad():
        mean_loss = 0
        mean_mse = 0
        for data, target in eval_loader:
            data, target = (data.to(DEVICE)).unsqueeze(dim=1), (target.to(DEVICE)).unsqueeze(dim=1)
            output_means, output_log_var = net(data) # perform prediction
            target_np = target.detach().cpu().numpy()
            output_means_np = output_means.detach().cpu().numpy()
            mean_loss += loss_function(output_means, output_log_var, target).item()
        mean_loss = mean_loss/len(eval_loader)
        #print("Epoch " + str(epoch) + ": Eval-loss = " + str(mean_loss))
fig = plt.figure(figsize=(40,12)) # create a 40x12 inch figure
ax = fig.add_subplot(1,3,1)
ax.set_title("regression value")
ax.set_ylabel("regression mean")
ax.plot(x_train_data, y_train_data, 'x', color='black')
ax.plot(x_eval_data, y_eval_data, color='red')
pred_means_list = []
output_vars_list_train = []
output_vars_list_test = []
for x_test in sorted(x_train_data):
    x_test = (torch.Tensor([x_test]).to(DEVICE))
    pred_means, output_log_vars = net.forward(x_test)
    # the append lines are reconstructed; plotting exp(log-variance) is an assumption
    pred_means_list.append(pred_means.item())
    output_vars_list_train.append(torch.exp(output_log_vars).item())
ax.plot(sorted(x_train_data), pred_means_list, color='blue', label = 'training_perform')
pred_means_list = []
for x_test in x_eval_data:
    x_test = (torch.Tensor([x_test]).to(DEVICE))
    pred_means, output_log_vars = net.forward(x_test)
    pred_means_list.append(pred_means.item())
    output_vars_list_test.append(torch.exp(output_log_vars).item())
ax.plot(sorted(x_eval_data), pred_means_list, color='green', label = 'eval_perform')
ax = fig.add_subplot(1,3,2)
ax.set_ylabel("regression var")
ax.plot(sorted(x_train_data), output_vars_list_train, label = 'training data')
ax.plot(x_eval_data, output_vars_list_test, label = 'test data')
ax = fig.add_subplot(1,3,3)
ax.set_title("training loss")
lists = sorted(losses.items())
epoch, loss = zip(*lists)
ax.plot(epoch, loss, label = 'loss')
plt.show()
TLDR: The optimization drives the loss to a minimum where the gradient
becomes zero, regardless of what the nominal loss value is.
A comprehensive explanation by K.Frank:
A smaller loss – algebraically less positive or algebraically more
negative – means (or should mean) better predictions. The
optimization step uses some version of gradient descent to make
your loss smaller. The overall level of the loss doesn’t matter as
far as the optimization goes. The gradient tells the optimizer how
to change the model parameters to reduce the loss, and it doesn’t
care about the overall level of the loss.
An example from the same source:
Consider, for example, optimizing with lossA = MSELoss. Now
imagine optimizing with lossB = lossA - 17.2. The 17.2 doesn’t
really change anything at all. It is true that “perfect” predictions
will yield lossB = -17.2 rather than zero. (lossA will, of course,
be zero for “perfect” predictions.) But who cares?
In your example: you are right, the negative loss value comes from the logarithmic term. This is completely fine; it means that your training is dominated by the contributions of high-confidence (low-variance) loss terms. Regarding the high variance values, I can't comment much, but it should be fine since the loss curve drops as expected.
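To see concretely why a negative value is harmless here, the per-sample loss can be evaluated by hand. The following is a minimal sketch in plain Python, assuming the squared-residual form of the loss, 0.5 * exp(-s) * (y - mean)^2 + 0.5 * s. The function name and the sample numbers are made up for illustration and do not come from the question.

```python
import math

def gaussian_nll_term(y, mean, log_var):
    """Per-sample loss 0.5*exp(-s)*(y-mean)^2 + 0.5*s, with s = log(variance)."""
    return 0.5 * math.exp(-log_var) * (y - mean) ** 2 + 0.5 * log_var

# Hypothetical sample: residual 0.1, predicted variance 0.04 (std 0.2).
confident = gaussian_nll_term(y=1.0, mean=0.9, log_var=math.log(0.04))
# Same residual, but predicted variance 1.0 (log-variance 0).
uncertain = gaussian_nll_term(y=1.0, mean=0.9, log_var=0.0)

print(confident)  # negative: the 0.5*s term outweighs the residual term
print(uncertain)  # positive
```

Whenever the predicted variance is below 1 and the residual is small relative to it, the 0.5 * s term dominates and the sample contributes a negative value, which is expected for a well-fitted model on low-noise data.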
I am new to TensorFlow 2 and started my learning curve with the following simple linear regression model:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Make data
num_samples, w, b = 20, 0.5, 2
xs = np.asarray(range(num_samples))
ys = np.asarray([x*w + b + np.random.normal() for x in range(num_samples)])
xts = tf.convert_to_tensor(xs, dtype=tf.float32)
yts = tf.convert_to_tensor(xs, dtype=tf.float32)
plt.plot(xs, ys, 'ro')
class Linear(tf.keras.Model):
    def __init__(self, name='linear', **kwargs):
        super().__init__(name=name, **kwargs)
        self.w = tf.Variable(0, True, name="w", dtype=tf.float32)
        self.b = tf.Variable(1, True, name="b", dtype=tf.float32)
    def call(self, inputs):
        return self.w*inputs + self.b
class Custom(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if epoch % 20 == 0:
            preds = self.model.predict(xts)
            plt.plot(xs, preds, label='{} {:7.2f}'.format(epoch, logs['loss']))
            print('The average loss for epoch {} is {:7.2f}.'.format(epoch, logs['loss']))
x = tf.keras.Input(dtype=tf.float32, shape=[])
#model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
model = Linear()
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='MSE')
model.fit(x=xts, y=yts, verbose=1, batch_size=4, epochs=250, callbacks=[Custom()])
For a reason I don't understand, my model does not seem to fit the curve.
I also tried keras.layers.Dense(1) and got exactly the same result.
Also, the results don't seem to correspond to a proper loss function: around epoch 120 the model should have a lower loss than at epoch 250.
Can you maybe help me understand what I am doing wrong?
Thanks a lot!
There is a small bug in your code as xts and yts are identical to each other, i.e. you wrote
xts = tf.convert_to_tensor(xs, dtype=tf.float32)
yts = tf.convert_to_tensor(xs, dtype=tf.float32)
instead of
xts = tf.convert_to_tensor(xs, dtype=tf.float32)
yts = tf.convert_to_tensor(ys, dtype=tf.float32)
which is why the loss doesn't make sense. Once this has been fixed the results are as expected, see the plot below.
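The effect of this bug can be checked without TensorFlow. The sketch below, in plain Python, fits a line by closed-form least squares (a stand-in for the gradient-based training in the question, not the same procedure). It shows that when the targets are accidentally the inputs, the best fit is w = 1, b = 0, whatever values were used to generate ys. The helper name fit_line is hypothetical, and the data is a noise-free version of the data in the question.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of ys ~ w*xs + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

xs = [float(i) for i in range(20)]
ys = [0.5 * x + 2.0 for x in xs]  # generated with w = 0.5, b = 2, no noise

w_bug, b_bug = fit_line(xs, xs)  # targets accidentally equal to the inputs
w_ok, b_ok = fit_line(xs, ys)    # targets built from ys

print(w_bug, b_bug)  # the identity line
print(w_ok, b_ok)    # the generating line
```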
I am making a simple PyTorch neural net to approximate the sine function on x = [0, 2pi]. This is a simple architecture I use with different deep learning libraries to test whether I understand how to use it or not. The neural net, when untrained, always produces a straight horizontal line, and when trained, produces a straight line at y = 0. In general, it always produces a straight line at y = (The mean of the function). This leads me to believe something is wrong with the forward prop portion of it, as the boundary should not just be a straight line when untrained. Here is the code for the net:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(1, 20),
            nn.Sigmoid(),
            nn.Linear(20, 50),
            nn.Sigmoid(),
            nn.Linear(50, 50),
            nn.Sigmoid(),
            nn.Linear(50, 1)
        )
    def forward(self, x):
        x = self.model(x)
        return x
Here is the training loop
def train(net, trainloader, valloader, learningrate, n_epochs):
    net = net.train()
    loss = nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr = learningrate)
    for epoch in range(n_epochs):
        for X, y in trainloader:
            X = X.reshape(-1, 1)
            y = y.view(-1, 1)
            optimizer.zero_grad()  # clear gradients from the previous batch
            outputs = net(X)
            error = loss(outputs, y)
            error.backward()  # compute gradients of the loss
            # update the parameters using the gradients and the learning rate
            optimizer.step()
        total_loss = 0
        for X, y in valloader:
            X = X.reshape(-1, 1).float()
            y = y.view(-1, 1)
            outputs = net(X)
            error = loss(outputs, y)
            total_loss += error.data
        print('Val loss for epoch', epoch, 'is', total_loss / len(valloader) )
It is called as:
net = Net()
losslist = train(net, trainloader, valloader, .0001, n_epochs = 4)
Here trainloader and valloader are the training and validation loaders. Can anyone help me see what's wrong with this? I know it's not the learning rate, since it's the one I use in other frameworks, and I know it's not the fact that I'm using SGD or sigmoid activation functions, although I suspect the error is in the activation functions somewhere.
Does anyone know how to fix this? Thanks.
After playing for a while with some hyperparameters, modifying the net, and changing the optimizer (following this excellent recipe), I ended up changing the line optimizer = torch.optim.SGD(net.parameters(), lr = learningrate) to optimizer = torch.optim.Adam(net.parameters()) (with the default optimizer parameters), running for 100 epochs with a batch size of 1.
The following code was used (tested on CPU only):
import torch
import torch.nn as nn
from torch.utils import data
import numpy as np
import matplotlib.pyplot as plt
# for reproducibility
class Dataset(data.Dataset):
    def __init__(self, init, end, n):
        self.n = n
        self.x = np.random.rand(self.n, 1) * (end - init) + init
        self.y = np.sin(self.x)
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        x = self.x[idx, np.newaxis]
        y = self.y[idx, np.newaxis]
        return torch.Tensor(x), torch.Tensor(y)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(1, 20),
            nn.Sigmoid(),
            nn.Linear(20, 50),
            nn.Sigmoid(),
            nn.Linear(50, 50),
            nn.Sigmoid(),
            nn.Linear(50, 1)
        )
    def forward(self, x):
        x = self.model(x)
        return x
def train(net, trainloader, valloader, n_epochs):
    loss = nn.MSELoss()
    # Switch the two following lines and run the code
    # optimizer = torch.optim.SGD(net.parameters(), lr = 0.0001)
    optimizer = torch.optim.Adam(net.parameters())
    for epoch in range(n_epochs):
        for x, y in trainloader:
            optimizer.zero_grad()  # clear gradients from the previous batch
            outputs = net(x).view(-1)
            error = loss(outputs, y)
            error.backward()
            optimizer.step()
        total_loss = 0
        for x, y in valloader:
            outputs = net(x)
            error = loss(outputs, y)
            total_loss += error.data
        print('Val loss for epoch', epoch, 'is', total_loss / len(valloader) )
    f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
    def plot_result(ax, dataloader):
        out, xx, yy = [], [], []
        for x, y in dataloader:
            # collect predictions, inputs and targets batch by batch
            out.append(net(x))
            xx.append(x)
            yy.append(y)
        out = torch.cat(out, dim=0).detach().numpy().reshape(-1)
        xx = torch.cat(xx, dim=0).numpy().reshape(-1)
        yy = torch.cat(yy, dim=0).numpy().reshape(-1)
        ax.scatter(xx, yy, facecolor='green')
        ax.scatter(xx, out, facecolor='red')
        xx = np.linspace(0.0, 3.14159*2, 1000)
        ax.plot(xx, np.sin(xx), color='green')
    plot_result(ax1, trainloader)
    plot_result(ax2, valloader)
    plt.show()
train_dataset = Dataset(0.0, 3.14159*2, 100)
val_dataset = Dataset(0.0, 3.14159*2, 30)
params = {'batch_size': 1,
          'shuffle': True,
          'num_workers': 4}
trainloader = data.DataLoader(train_dataset, **params)
valloader = data.DataLoader(val_dataset, **params)
net = Net()
losslist = train(net, trainloader, valloader, n_epochs = 100)
Result with Adam optimizer:
Result with SGD optimizer:
In general, it always produces a straight line at y = (The mean of the function).
Usually, this means that the NN has only successfully trained the final layer so far. You need to train it for longer or with a better optimizer, as ViniciusArruda shows here.
Edit: To explain further: when only the final layer has been trained, the NN is effectively trying to guess the output y with no knowledge of the input X. In this case, the best guess it can make is the mean value, since that minimizes its MSE loss.
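That the mean is the best input-independent guess under MSE can be verified numerically. Here is a small sketch in plain Python, with sine sampled evenly over one period as in the question. The helper name mse_of_constant and the candidate values are illustrative choices, not taken from the posts above.

```python
import math

def mse_of_constant(c, targets):
    """MSE of predicting the same constant c for every target."""
    return sum((t - c) ** 2 for t in targets) / len(targets)

# sin(x) sampled evenly over [0, 2*pi); its mean is numerically zero
targets = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
mean_target = sum(targets) / len(targets)

# Compare the mean against a few other constant predictions
candidates = [-0.5, -0.1, mean_target, 0.1, 0.5]
best = min(candidates, key=lambda c: mse_of_constant(c, targets))

print(mean_target)
print(best == mean_target)
```

A network whose output does not yet depend on its input can do no better than this constant, which is why the plot shows a flat line at the mean of the function.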
When using GradientDescentOptimizer instead of AdamOptimizer, the model doesn't seem to converge. On the other hand, AdamOptimizer seems to work fine. Is there something wrong with TensorFlow's GradientDescentOptimizer?
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
def randomSample(size=100):
    # sample noisy points around the line y = 2 * x - 3
    x = np.random.randint(500, size=size)
    y = x * 2 - 3 - np.random.randint(-20, 20, size=size)
    return x, y
def plotAll(_x, _y, w, b):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(_x, _y)
    x = np.random.randint(500, size=20)
    y = w * x + b
    ax.plot(x, y,'r')
    plt.show()
def lr(_x, _y):
    w = tf.Variable(2, dtype=tf.float32)
    b = tf.Variable(3, dtype=tf.float32)
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    linear_model = w * x + b
    loss = tf.reduce_sum(tf.square(linear_model - y))
    optimizer = tf.train.AdamOptimizer(0.0003) #GradientDescentOptimizer
    train = optimizer.minimize(loss)
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)  # variables must be initialized before training
    for i in range(10000):
        sess.run(train, {x : _x, y: _y})
    cw, cb, closs = sess.run([w, b, loss], {x:_x, y:_y})
    return cw, cb
x,y = randomSample()
w,b = lr(x,y)
plotAll(x,y, w, b)
I had a similar problem once and it took me a long time to find the real cause. With gradient descent my loss was actually growing instead of getting smaller.
It turned out that my learning rate was too high. If you take too big a step with gradient descent, you can end up jumping over the minimum. And if you are really unlucky, like I was, you end up jumping so far that your error increases.
Lowering the learning rate should make the model converge, but it could take a long time.
The Adam optimizer has momentum: it doesn't just follow the instantaneous gradient, but keeps track of the direction it was going before with a sort of velocity. This way, if you start going back and forth because of the gradient, then the momentum will force you to go slower in that direction. This helps a lot! Adam has a few more tweaks besides momentum that make it the preferred deep learning optimizer.
If you want to read more about optimizers, this blog post is very informative.
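The overshooting effect is easy to reproduce on a one-dimensional quadratic. The following is a minimal sketch in plain Python, not the TensorFlow optimizer itself: it minimizes f(w) = a * w**2 with plain gradient descent, where the curvature a and both learning rates are arbitrary values chosen for illustration. For this function each step multiplies w by (1 - 2 * a * lr), so the iteration converges only when lr < 1 / a.

```python
def gradient_descent(lr, a=100.0, w0=1.0, steps=50):
    """Minimize f(w) = a*w^2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * a * w  # derivative of a*w^2
        w -= lr * grad
    return w

w_small = gradient_descent(lr=0.001)  # factor per step: 1 - 0.2 = 0.8
w_large = gradient_descent(lr=0.011)  # factor per step: 1 - 2.2 = -1.2

print(abs(w_small))  # shrinks toward the minimum at 0
print(abs(w_large))  # grows: each step jumps over the minimum and lands farther away
```

In the question the loss is a sum over 100 samples with x values up to 500, so the curvature with respect to w is very large. If the same 0.0003 is passed to GradientDescentOptimizer, it is likely far above the stable range, while Adam's per-parameter step scaling copes with it.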
I'm trying to build a simple multivariate linear regression with Lasagne.
This is my input:
x_train = np.array([[37.93, 139.5, 329., 16.64,
                     16.81, 16.57, 1., 707.,
                     39.72, 149.25, 352.25, 16.61,
                     16.91, 16.60, 40.11, 151.5,
                     361.75, 16.95, 16.98, 16.79]]).astype(np.float32)
y_train = np.array([37.92, 138.25, 324.66, 16.28, 16.27, 16.28]).astype(np.float32)
For these two data points the network should be able to learn y perfectly.
Here is the model:
i1 = T.matrix()
y = T.vector()
lay1 = lasagne.layers.InputLayer(shape=(None,20),input_var=i1)
out1 = lasagne.layers.get_output(lay1)
lay2 = lasagne.layers.DenseLayer(lay1, 6, nonlinearity=lasagne.nonlinearities.linear)
out2 = lasagne.layers.get_output(lay2)
params = lasagne.layers.get_all_params(lay2, trainable=True)
cost = T.sum(lasagne.objectives.squared_error(out2, y))
grad = T.grad(cost, params)
updates = lasagne.updates.sgd(grad, params, learning_rate=0.1)
f_train = theano.function([i1, y], [out1, out2, cost], updates=updates)
After executing f_train multiple times,
the cost explodes to infinity. Any idea what is going wrong here?
The network has too much capacity for a single training instance. You would need to apply some strong regularization to prevent the training from diverging. Alternatively, and hopefully more realistically, give it more complex training data (many instances).
With a single instance the task can be solved using just one input, instead of 20, and with the DenseLayer's bias disabled:
import numpy as np
import theano
import lasagne
import theano.tensor as T
def compile():
    x, z = T.matrices('x', 'z')
    lh = lasagne.layers.InputLayer(shape=(None, 1), input_var=x)
    ly = lasagne.layers.DenseLayer(lh, 6, nonlinearity=lasagne.nonlinearities.linear,
                                   b=None)  # b=None disables the bias
    y = lasagne.layers.get_output(ly)
    params = lasagne.layers.get_all_params(ly, trainable=True)
    cost = T.sum(lasagne.objectives.squared_error(y, z))
    updates = lasagne.updates.sgd(cost, params, learning_rate=0.0001)
    return theano.function([x, z], [y, cost], updates=updates)
def main():
    f_train = compile()
    x_train = np.array([[37.93]]).astype(theano.config.floatX)
    y_train = np.array([[37.92, 138.25, 324.66, 16.28, 16.27, 16.28]])\
        .astype(theano.config.floatX)
    for _ in range(100):
        print(f_train(x_train, y_train))
main()
Note that the learning rate also needs to be reduced a lot to prevent divergence.
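The size of the required reduction can be estimated by hand. Below is a hedged sketch in plain Python (no Theano or Lasagne), assuming the single-input, bias-free setup from the code above. For one output with cost (w * x - y)**2, an SGD step multiplies the error w * x - y by (1 - 2 * lr * x**2), so the error shrinks only when lr < 1 / x**2. The function name is made up for illustration.

```python
def error_after_steps(lr, x, y, w0=0.0, steps=100):
    """Run SGD on the cost (w*x - y)^2 and return the final absolute error."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * x * (w * x - y)  # d/dw of (w*x - y)^2
        w -= lr * grad
    return abs(w * x - y)

x, y = 37.93, 37.92          # input and first target from the example
max_stable_lr = 1.0 / x**2   # roughly 0.0007

err_small_lr = error_after_steps(0.0001, x, y)       # below the bound
err_large_lr = error_after_steps(0.1, x, y, steps=5)  # far above the bound

print(max_stable_lr)
print(err_small_lr)  # error shrinks toward zero
print(err_large_lr)  # error blows up within a few steps
```

With an input of about 38, the learning rate of 0.1 from the question is more than a hundred times larger than this bound, which is consistent with the cost exploding.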