Neural network - loss not converging - python

This network contains an input layer and an output layer, with no nonlinearities, so the output is just a linear combination of the input. I am using a regression loss to train the network. I generated some random 1D test data from a simple linear function, with Gaussian noise added. The problem is that the loss doesn't converge to zero.
import numpy as np
import matplotlib.pyplot as plt

n = 100
alp = 1e-4

a0 = np.random.randn(100,1) # Also x
y = 7*a0+3+np.random.normal(0,1,(100,1))
w = np.random.randn(100,100)*0.01
b = np.random.randn(100,1)

def compute_loss(a1,y,w,b):
    return np.sum(np.power(y-w*a1-b,2))/2/n

def gradient_step(w,b,a1,y):
    w -= (alp/n)*np.dot((a1-y),a1.transpose())
    b -= (alp/n)*(a1-y)
    return w,b

loss_vec = []
num_iterations = 10000
for i in range(num_iterations):
    a1 = np.dot(w,a0)+b
    loss_vec.append(compute_loss(a1,y,w,b))
    w,b = gradient_step(w,b,a1,y)
plt.plot(loss_vec)

The convergence also depends on the value of alpha (the learning rate) you use. I played with your code a bit, and for
alp = 5e-3
I get the following convergence, plotted on a logarithmic x-axis:
plt.semilogx(loss_vec)
(plot omitted: loss curve over the iterations on a log-scaled x-axis)

If I understand your code correctly, you only have one weight matrix and one bias vector despite the fact that you have 2 layers. This is odd and might be at least part of your problem.
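For comparison, here is a minimal sketch of the same fit using a single scalar weight and bias. With alp = 5e-3 the loss drops quickly; note that it levels off around 0.5 rather than at zero, because of the unit-variance noise added to y:

import numpy as np

n = 100
alp = 5e-3
x = np.random.randn(n, 1)
y = 7 * x + 3 + np.random.normal(0, 1, (n, 1))

w, b = 0.0, 0.0                       # scalar weight and bias
loss_vec = []
for _ in range(10000):
    y_hat = w * x + b
    err = y_hat - y
    loss_vec.append(np.sum(err ** 2) / (2 * n))
    w -= (alp / n) * np.sum(err * x)  # d(loss)/dw
    b -= (alp / n) * np.sum(err)      # d(loss)/db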

Related

Training with parametric partial derivatives in pytorch

Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta using a loss that depends both on the output and on the partial derivatives of the output. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within pytorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)

def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are beside the point.
def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)

optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives the way I do, I do not really create a computational graph through which autodiff could differentiate. Thus, the connection to the parameters theta somehow gets lost, and to the optimizer it now looks as if the loss is completely independent of theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivative in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1

def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time

def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)

data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n), :, torch.arange(n), 0]  # per-sample d(output)/dt
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
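For what it's worth, I believe the immediate cause of the RuntimeError in the question is that torch.tensor([...]) copies the per-sample gradients into a brand-new leaf tensor with no grad_fn, which disconnects them from theta. A minimal sketch of an objective that keeps the graph intact (reusing trainable_derivative from the question, which returns a tuple):

def objective(train_times, observations):
    # torch.stack preserves the grad_fn of each per-sample gradient, so the
    # loss stays connected to theta; trainable_derivative(...)[0] unpacks the
    # tuple returned by torch.autograd.grad.
    preds = torch.stack([trainable_derivative(t)[0] for t in train_times])
    return torch.sum((torch.squeeze(preds) - observations) ** 2)

The jacobian-based snippet above avoids the Python loop entirely, which should generally be faster.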

Where to start on creating a method that saves desired changes in a Tensor with PyTorch?

I have two tensors from which I am calculating Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors in a way that makes my Spearman's rank correlation as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I set up a loop that keeps adjusting the values in these two tensors to get the highest Spearman's rank correlation from the two lists I build from them, while having PyTorch do the work? I just need a guide of where to go. Thank you!
I'm not familiar with Spearman's rank correlation, but if I understand your question, you're asking how to use PyTorch to solve optimization problems other than deep networks?
If that's the case, then I'll provide a simple least-squares example which I believe should be informative for your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates that we can accomplish this by finding the matrix M and vector b which minimize ‖y - (Mx + b)‖².
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)
    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()
    # compute new gradients
    loss.backward()
    # update parameters M and b
    optimizer.step()
    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that are not necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
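Since a closed-form solution is mentioned, here is a sketch of computing it directly for comparison, assuming a PyTorch version that provides torch.linalg.lstsq and reusing x and y from the snippet above:

# Absorb b into M by appending a column of ones to the inputs.
X = x.squeeze(-1)                                          # (200, 10)
Y = y.squeeze(-1)                                          # (200, 10)
X_aug = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)   # (200, 11)

# Solve X_aug @ P ≈ Y in the least-squares sense; the first 10 rows of P
# hold M transposed and the last row holds b transposed.
P = torch.linalg.lstsq(X_aug, Y).solution                  # (11, 10)
M_closed = P[:-1].T                                        # (10, 10)
b_closed = P[-1].view(-1, 1)                               # (10, 1)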
In your case you want to maximize something, so make sure to negate your objective function before calling backward().

Optical Character Recognition using Neural Networks in Python

This code is for OCR using an ANN. It contains one hidden layer, and the input is an image of size 28x28. The code runs without any error, but the output is not at all accurate even after training on 5000+ images. I am using the MNIST dataset in the form of JPG images. Please tell me what is wrong with my logic.
import numpy as np
from PIL import Image
import random
from random import randint

y = [[0,0,0,0,0,0,0,0,0,0]]
W1 = [[random.uniform(-1, 1) for q in range(40)] for p in range(784)]
W2 = [[random.uniform(-1, 1) for q in range(10)] for p in range(40)]

def sigmoid(x):
    global b
    return (1.0 / (1.0 + np.exp(-x)))

# run the neural net forward
def run(X, W):
    return sigmoid(np.matmul(X,W))  # 1x2 * 2x2 = 1x1 matrix

# cost function
def cost(X, y, W):
    nn_output = run(X, W)
    return ((nn_output - y))

def gradient_Descent(X,y,W1,W2):
    alpha = 0.12   # learning rate
    epochs = 15000 # num iterations
    for i in range(epochs):
        Z2 = sigmoid(np.matmul(run(X,W1),W2)) # final activation function (1x10)
        Z1 = run(X,W1)                        # first activation function (1x40)
        phi1 = Z1*(1-Z1)  # differentiation of Z1
        phi2 = Z2*(1-Z2)  # differentiation of Z2
        delta2 = phi2*cost(Z1,y,W2)  # delta for outer layer (1x10)
        delta1 = np.transpose(np.transpose(phi1)*np.matmul(W2,np.transpose(delta2)))
        deltaW2 = alpha*(np.matmul(np.transpose(Z1),delta2))
        deltaW1 = alpha*(np.matmul(np.transpose(X),delta1))
        W1 = W1+deltaW1
        W2 = W2+deltaW2

def Training():
    for j in range(8):
        y[0][j] = 1
        k = 1
        while k <= 15:  # 5421
            print(k)
            q = 0
            img = Image.open('mnist_jpgfiles/train/mnist_'+str(j)+'_'+str(k)+'.jpg')
            iar = np.array(img)  # image array
            ar = np.reshape(iar,(1,np.product(iar.shape)))
            ar = np.array(ar,dtype=float)
            X = ar
            '''
            for p in range(784):
                if X[0][p] > 0:
                    X[0][p] = 1
                else:
                    X[0][p] = 0
            '''
            k += 1
            gradient_Descent(X,y,W1,W2)
            print(np.argmin(cost(run(X,W1),y,W2)))
            #print(W1)
        y[0][j] = 0

Training()

def test():
    global W1,W2
    for j in range(3):
        k = 1
        while k <= 5:  # 890
            img = Image.open('mnist_jpgfiles/test/mnist_'+str(j)+'_'+str(k)+'.jpg')
            iar = np.array(img)  # image array
            ar = np.reshape(iar,(1,np.product(iar.shape)))
            ar = np.array(ar,dtype=float)
            X = ar/256
            '''
            for p in range(784):
                if X[0][p] > 0:
                    X[0][p] = 1
                else:
                    X[0][p] = 0
            '''
            k += 1
            print("Should be "+str(j))
            print((run(run(X,W1),W2)))
            print((np.argmax(run(run(X,W1),W2))))

print("Testing.....")
test()
There is a problem with your cost function: you simply calculate the difference between the hypothesis output and the actual output. That makes your cost function linear, so it's strictly increasing (or strictly decreasing) and can't be optimized that way.
You need to use a cross-entropy cost function (because you use sigmoid as the activation function).
Also, gradient descent on its own can't optimize the ANN cost function here; you should use back-propagation with gradient descent to optimize it.
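For illustration, here is a sketch of what a cross-entropy cost for sigmoid outputs could look like; the function and variable names are my own, not from the code in the question:

import numpy as np

def cross_entropy_cost(y_hat, y, eps=1e-12):
    # y_hat: sigmoid outputs in (0, 1); y: 0/1 targets of the same shape
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

A convenient property: with sigmoid plus cross-entropy, the gradient of the cost with respect to the pre-activation z (where y_hat = sigmoid(z)) is simply y_hat - y, so a sigmoid-derivative factor like phi2 = Z2*(1-Z2) drops out of the output-layer delta.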
I haven't worked with ANNs, but when working with the gradient descent algorithm for regression problems, as in Andrew Ng's Machine Learning course on Coursera, I found it helpful to use a learning rate alpha of less than 0.05 and a number of iterations greater than 100000.
Try tweaking your learning rate, then create a confusion matrix, which will help you understand the accuracy of your system.
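As a rough sketch of the confusion-matrix idea (using made-up arrays of true and predicted digit labels, not the variables from the question):

import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    # rows = true class, columns = predicted class
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 1, 2, 2, 1])  # hypothetical labels
y_pred = np.array([0, 2, 2, 2, 1])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print("accuracy:", np.trace(cm) / len(y_true))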
In my experience there are a lot of things that can go wrong with an ANN. I'll list some possible errors for you to consider.
If the classification accuracy does not increase at all after training:
- Something is wrong with the training or testing sets.
- The learning rate is too high. Too-high learning rates can sometimes cause the algorithm not to converge at all. Try setting it very small, like 0.01 or 0.001. If there is still no convergence, the issue probably has to do with something other than gradient descent.
If the training accuracy does increase but is worse than expected:
- The normalisation process is not correctly implemented. For images it is recommended to use zero-mean, unit-variance normalisation (see the sketch below).
- The learning rate is too low or too high.
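A sketch of the zero-mean, unit-variance normalisation mentioned above, assuming X is a flattened image array like the one built in the question:

import numpy as np

def standardize(X, eps=1e-8):
    # shift to zero mean and scale to unit variance; eps guards against
    # a constant (zero-variance) image
    return (X - X.mean()) / (X.std() + eps)

# e.g. X = standardize(ar) instead of X = ar or X = ar/256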

Neural network tutorial, softmax activation instead of sigmoid

I am working through an FNN tutorial. After some research I learnt that I will need to use softmax activation for my own ML problem rather than the sigmoid shown.
After hours of looking through tutorials I cannot find a basic example of softmax code (outside of modules) that I can learn from to rework the tutorial code in the way that sigmoid is used. If I can figure this out then I can break it down and understand how the math transforms the data and is fed through the NN, and then I can apply this to other basic ML starting points like SVMs etc.
I need pointers on untangling the sigmoid math/code and reworking it to use softmax activation and one-hot encoding on the y outputs. Thank you for any help.
import numpy as np

# sigmoid function
def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

# input dataset
X = np.array([ [0,0,1],
               [0,1,1],
               [1,0,1],
               [1,1,1] ])

# output dataset
y = np.array([[0,0,1,1]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1

for iter in range(10000):
    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))

    # how much did we miss?
    l1_error = y - l1

    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1, True)

    # update weights
    syn0 += np.dot(l0.T, l1_delta)

print("Output After Training:")
print(l1)
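Since the question asks for a bare-bones softmax outside of any modules, here is a sketch of a numerically stable softmax with one-hot targets and the corresponding output-layer delta; the example labels are made up, not from the tutorial:

import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability; each row sums to 1
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def one_hot(labels, num_classes):
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1
    return out

# Example: 4 samples, 3 classes (made-up labels)
labels = np.array([0, 2, 1, 2])
Y = one_hot(labels, 3)        # shape (4, 3)
Z = np.random.randn(4, 3)     # pre-activations from the last layer
P = softmax(Z)                # predicted class probabilities

# With softmax plus cross-entropy loss, the delta at the output layer is
# simply (P - Y), analogous to l1_error * nonlin(l1, True) in the sigmoid code.
delta = P - Y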

Why do I get different weights when using TensorFlow for multiple linear regression?

I have two implementations of multiple linear regression, one using tensorflow and one using only numpy. I generate a dummy set of data and try to recover the weights I used, but although the numpy one recovers the original weights, the tensorflow one always returns different weights (which also sort of work).
The numpy implementation is here, and here's the TF implementation:
import numpy as np
import tensorflow as tf
x = np.array([[i, i + 10] for i in range(100)]).astype(np.float32)
y = np.array([i * 0.4 + j * 0.9 + 1 for i, j in x]).astype(np.float32)
# Add bias
x = np.hstack((x, np.ones((x.shape[0], 1)))).astype(np.float32)
# Create variable for weights
n_features = x.shape[1]
np.random.rand(n_features)
w = tf.Variable(tf.random_normal([n_features, 1]))
w = tf.Print(w, [w])
# Loss function
y_hat = tf.matmul(x, w)
loss = tf.reduce_mean(tf.square(tf.sub(y, y_hat)))
operation = tf.train.GradientDescentOptimizer(learning_rate=0.000001).minimize(loss)
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    for iteration in range(5000):
        session.run(operation)
    weights = w.eval()
    print(weights)
Running the script gets me weights around [-0.481, 1.403, 0.701], while running the numpy version gets me weights around [0.392, 0.907, 0.9288], which are much closer to the weights I used to generate the data: [0.4, 0.9, 1].
Both the learning rate and epoch parameters are the same, and both initialise the weights randomly. I don't normalize the data for either implementation, and I've run them multiple times.
Why are the results different? I also tried to initialise weights in the TF version using w = tf.Variable(np.random.rand(n_features).reshape(n_features,1).astype(np.float32)) but that didn't fix it either. Is there something wrong with the TF implementation?
The problem appears to be with broadcasting. The shape of y_hat in the above is (100, 1), while y is (100,). So, when you do tf.sub(y, y_hat) you end up with a (100, 100) matrix containing all possible pairwise differences between the two vectors. I guess you managed to avoid this in the numpy code.
Two ways to fix your code:
y = np.array([[i * 0.4 + j * 0.9 + 1 for i, j in x]]).astype(np.float32).T
or
y_hat = tf.squeeze(tf.matmul(x, w))
Although, that said, when I run this it still doesn't actually converge to the answer you want, but at least it's actually able to minimize the loss function.
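To make the broadcasting pitfall concrete, here is a small numpy illustration of how a (100,) vector minus a (100, 1) vector silently becomes a (100, 100) matrix:

import numpy as np

y = np.arange(100, dtype=np.float32)                       # shape (100,)
y_hat = np.arange(100, dtype=np.float32).reshape(100, 1)   # shape (100, 1)

diff = y - y_hat
print(diff.shape)                    # (100, 100): all pairwise differences
print((y - y_hat.squeeze()).shape)   # (100,): the intended elementwise difference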
