Loss function increasing instead of decreasing - python

I have been trying to make my own neural network from scratch. After some time I made it, but I have run into a problem I cannot solve. I have been following a tutorial which shows how to do this. The problem I ran into is how my network updates weights and biases. I know that gradient descent won't always decrease the loss, and for a few epochs it might even increase a bit, but overall it should still decrease and work much better than mine does. Sometimes the whole process gets stuck at a loss between 9 and 13 and cannot get out of it. I have checked many tutorials, videos and websites, but I couldn't find anything wrong in my code.
self.activate, self.dactivate, self.loss and self.dloss:
# sigmoid
self.activate = lambda x: np.divide(1, 1 + np.exp(-x))
self.dactivate = lambda x: np.multiply(self.activate(x), (1 - self.activate(x)))
# relu
self.activate = lambda x: np.where(x > 0, x, 0)
self.dactivate = lambda x: np.where(x > 0, 1, 0)
# loss I use (cross-entropy)
clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10) # it's used to squeeze x into a probability between 0 and 1 (which I think is required)
self.loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x))))/y.shape[0])
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
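As a quick sanity check of these definitions (a standalone sketch with the lambdas re-declared outside the class), confident correct predictions should give a loss near 0 and confident wrong ones a large loss:
import numpy as np

clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10)
loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x)))) / y.shape[0])

y = np.array([[1.0], [0.0]])
good = np.array([[0.99], [0.01]])  # confident and correct
bad = np.array([[0.01], [0.99]])   # confident and wrong

print(loss(good, y))  # ~0.01
print(loss(bad, y))   # ~4.6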
The code I use for forward propagation:
self.activate(np.dot(X, self.weights) + self.biases) # it's an example for first hidden layer
And that's the code for backpropagation:
First part, in DenseNeuralNetwork class:
last_derivative = self.dloss(output, y)

for layer in reversed(self.layers):
    last_derivative = layer.backward(last_derivative, self.lr)
And the second part, in Dense class:
def backward(self, last_derivative, lr):
    w = self.weights

    dfunction = self.dactivate(last_derivative)

    d_w = np.dot(self.layer_input.T, dfunction) * (1./self.layer_input.shape[1])
    d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)

    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)

    return np.dot(dfunction, w.T)
I have also made a repl so you can check the whole code and run it without any problems.

1.
line 12
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
If you're going to clip x, you should clip y too. There are other ways to implement this, but if you are going to do it this way, change it to:
self.dloss = lambda x, y: -(np.divide(clip(y), clip(x)) - np.divide(1 - clip(y), 1 - clip(x)))
2.
line 75
dfunction = self.dactivate(last_derivative)
This backpropagation step is just wrong. Change it to:
dfunction = last_derivative*self.dactivate(np.dot(self.layer_input, self.weights) + self.biases)
3.
line 77
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)
last_derivative should be dfunction here; I think this is just a mistake. Change it to:
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), dfunction)
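Putting fixes 2 and 3 together, the Dense.backward method would look roughly like the sketch below (my assembly of the two changes, assuming the layer caches its input in self.layer_input during the forward pass as in the question; recomputing z here is just for illustration, caching it in forward would avoid the extra dot product):
def backward(self, last_derivative, lr):
    w = self.weights

    # Fix 2: chain rule -- multiply the incoming gradient by the activation
    # derivative evaluated at the pre-activation z, not at the gradient itself.
    z = np.dot(self.layer_input, self.weights) + self.biases
    dfunction = last_derivative * self.dactivate(z)

    d_w = np.dot(self.layer_input.T, dfunction) * (1./self.layer_input.shape[1])
    # Fix 3: build the bias gradient from dfunction as well.
    d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], dfunction.shape[0])), dfunction)

    self.weights -= lr * d_w
    self.biases -= lr * d_b

    # Gradient propagated to the previous layer.
    return np.dot(dfunction, w.T)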
4.
line 85
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
Not sure where you are going with this, but I think the initialized values are too big. We're not doing precise hyperparameter tuning here, so I just scaled them down:
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
All good now
After this I changed the learning rate to 0.01 because it was too slow, and it worked fine.
I think you are misunderstanding backpropagation; you should probably double-check how it works. The other parts are OK, I think.

This can be caused by your training data: either it is too small or it has too many diverse labels (which is what I gather from the code at the link you shared).
I re-ran your code several times and it produced different training performance each time. Sometimes the loss kept decreasing until the last epoch, sometimes it kept increasing, and once it decreased up to some point and then started increasing (with a minimum loss of 0.5).
I think it is your training data that matters this time. The learning rate is good enough, though (assuming you did the calculations for the linear combination, backpropagation, etc. correctly).

Related

Matrix multiplication quickly overflows in backpropagation equations

I'm trying to do my own neural network implementation from scratch, but I'm having some problems. Specifically, one of the terms grows exponentially as iterations advance, which makes the accuracy quickly plateau and keeps gradient descent from finding any optimal solution.
I derived the equations I'm using on my own while studying some of the available resources online; I chose to write my own equations so I could understand them better:
δ^L = (A^L − Y) ⊙ σ′(Z^L)   (1)

where L is the last layer, ⊙ is the Hadamard product, A^L is the values from the Lth layer after the activation function, Y is the target, and Z^L is the values before the activation function is applied.

δ^i = ((W^{i+1})^T δ^{i+1}) ⊙ σ′(Z^i)   (2)

where W^{i+1} is the matrix of weights from the (i+1)th layer, with the number of rows equal to the number of neurons in the (i+1)th layer and the number of columns equal to the number of neurons in the ith layer.

∂C/∂W^i = δ^i (A^{i−1})^T   (3)

∂C/∂b^i = δ^i · 1   (4)

where (3) and (4) are the gradients of the cost with respect to the weights and biases of the ith layer, and 1 is a column vector of ones that sums δ^i over the samples.
I struggled a lot with these, but I was finally able to make them work and keep all the dimensions of each matrix correct. Now, implementing these equations in my code:
def back_prop(self):
    delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
    for i, layer in reversed(list(enumerate(self.layers))):
        dw = np.matmul(delta, self.a_record[i].T)
        db = np.matmul(delta, np.ones((self.n, 1)))
        self.dw_record.append(dw)
        self.db_record.append(db)
        if i > 0:
            delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))
a_record and z_record are lists that store the arrays corresponding to each layer. Same for dw and db records. I wrote it like this so it is as close as possible to my equations.
The problem comes once I start to measure performance on any dataset: it stagnates after just a couple of iterations. I also get this warning; since it's not an error, it doesn't stop the script:
RuntimeWarning: overflow encountered in multiply
  delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
When I analyze the values of delta, I find that it grows way too fast: it starts with small values, but grows up to e+200 magnitude and then just returns infinity, which makes the next delta a bunch of np.nan and causes all the weights and biases to stop changing.
I'm fairly certain that the equations are equivalent to others I found for other neural networks, so I don't really know why this is happening. I don't imagine the values from any randomly generated dataset being so insanely big that they overflow after just a couple of iterations.
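As a debugging aid (my own sketch, not part of the original class), a tiny helper that reports only the magnitude of an array can replace the bare print(delta) calls and make the blow-up easier to follow layer by layer:
import numpy as np

def report(name, arr):
    # Largest magnitude and NaN check, to spot values heading toward overflow
    # without dumping the whole array.
    print(f"{name}: max |value| = {np.max(np.abs(arr)):.3e}, any NaN = {bool(np.isnan(arr).any())}")

# e.g. inside back_prop: report(f"delta (layer {i})", delta)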
I will post the whole class I made:
class Network():
    def __init__(self, input, target, layers, alpha=0.1, iter=1000):
        self.input = np.array(input).T
        self.y_true = target
        #self.target = np.array(one_hot(target)).T
        self.target = np.array(target).T
        self.layers = layers
        self.alpha = alpha
        self.iter = iter
        self.n = len(input)
        self.w = []
        self.b = []
        for i, layer in enumerate(layers):
            if i == 0:
                self.w.append(np.random.rand(layer.neurons, len(self.input)) - 0.5)
            else:
                self.w.append(np.random.rand(layer.neurons, layers[i - 1].neurons) - 0.5)
            self.b.append(np.random.rand(layer.neurons, 1) - 0.5)
        self.a_record = []
        self.z_record = []
        self.dw_record = []
        self.db_record = []

    def forward_prop(self):
        a = self.input
        self.a_record.append(a)
        for i, layer in enumerate(self.layers):
            z = np.matmul(self.w[i], a) + self.b[i].reshape(-1, 1)
            a = layer.func(z)
            self.z_record.append(z)
            self.a_record.append(a)
        self.output = a

    def back_prop(self):
        delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
        print(delta)
        for i, layer in reversed(list(enumerate(self.layers))):
            dw = np.matmul(delta, self.a_record[i].T)
            db = np.matmul(delta, np.ones((self.n, 1)))
            self.dw_record.append(dw)
            self.db_record.append(db)
            print(self.a_record[i].T)
            print("------------------------------------------")
            if i > 0:
                delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))

    def gradient_desc(self):
        w_copy = self.w
        for i, (w, dw) in enumerate(zip(w_copy, self.dw_record[::-1])):
            self.w[i] = w - self.alpha*dw
        b_copy = self.b
        for i, (b, db) in enumerate(zip(b_copy, self.db_record[::-1])):
            self.b[i] = b - self.alpha*db

    def fit(self):
        for i in range(self.iter):
            self.forward_prop()
            self.back_prop()
            self.gradient_desc()
            if i % 40 == 0:
                print(f"Iteration: {i + 1} / {self.iter} =====================================")
                print(f"Accuracy: {get_accuracy(self.output, self.y_true)}")
I've been trying for a couple of days now to figure this out, but I can't seem to find the error or a workaround to prevent the overflow. Maybe the equations are wrong from the start? Any help would be greatly appreciated; thank you in advance.

GPU Optimization of k-armed Bandit Problem

I have the following code:
import torch
import torch.nn.functional as F

# Assumption: `device` is defined elsewhere in the original script; for a
# self-contained run it could be set like this.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class k_armed_bandit:
    def __init__(self, epsilon, k, steps):
        self.batches = 4000
        self.epsilon = epsilon
        self.k = k
        self.steps = steps
        self.mean = 4 * torch.rand((self.batches, self.k), device=device) - 2
        self.var = torch.rand((self.batches, self.k), device=device) + 0.5
        self.estimates = 4 * torch.ones((self.batches, self.k), device=device)
        self.counts = torch.ones((self.batches, self.k), device=device)

    def run(self):
        rewards = torch.zeros(self.steps, device=device)
        for i in range(self.steps):
            pos = torch.where(
                torch.rand(self.batches, device=device) < self.epsilon,
                torch.randint(0, self.k, size=(self.batches, ), device=device),
                torch.argmax(self.estimates, dim=1)
            )
            pos_mask = F.one_hot(pos, num_classes=self.k)
            val = torch.normal(mean=torch.sum(pos_mask * self.mean, dim=1), std=torch.sum(pos_mask * self.var, dim=1))
            self.counts += pos_mask
            val = val[:, None] * pos_mask
            self.estimates += ((val - (self.estimates * pos_mask)) / self.counts) * pos_mask
            rewards[i] = torch.sum(val) / self.batches
        return rewards
If you're not familiar with the problem, the idea is that you have k options in front of you, and you are trying to maximize the sum of rewards received from the options across some number of turns. The reward from option i is sampled from N(mean[i], var[i]). You keep a running track of returns from each option (stored in self.estimates, which is updated in streaming fashion). With probability epsilon you pick a random option, and with probability 1 - epsilon you simply pick the option with the best return (the position selected is stored in pos). At the end, you will return the sequence of rewards you obtained from the options you selected.
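The streaming update of self.estimates is the standard incremental average; for a single option it reduces to the following tiny standalone illustration:
Q, n = 0.0, 0
for r in [1.0, 2.0, 3.0]:
    n += 1
    Q += (r - Q) / n   # incremental mean: Q_new = Q_old + (r - Q_old) / n
print(Q)               # 2.0, the plain average of the rewards seen so far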
Anyways, the code runs pretty fast, but I suspect some of my naive mistakes make it slower than it should be. First of all, my torch.where requires that, for every agent in the batch, both a random position and the argmax be computed, even though only one is used. This can't be good. Next, pos_mask is a big problem. It lets me update the averages each agent keeps track of only if that option was picked as the sampled position, and it lets me update the number of times each option was selected. The issue is that pos_mask, and all related computations, are much larger than they need to be, since it is a mask, and I cannot seem to update the relevant indices in parallel. I tried to fiddle with torch's index-related methods, but none of them seem to help.
Any clues how to speed this up? Or maybe this problem is just not nice to implement on GPU?
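For the "update the relevant indices in parallel" part, one way to express the same loop body without the one-hot mask is plain row/column indexing; this is only a sketch under the class's attribute names (mean, var, estimates, counts) inside the run() loop, not a benchmarked replacement:
explore = torch.rand(self.batches, device=device) < self.epsilon
pos = torch.where(explore,
                  torch.randint(0, self.k, (self.batches,), device=device),
                  self.estimates.argmax(dim=1))
rows = torch.arange(self.batches, device=device)

# Sample one reward per agent, from the chosen arm only.
val = torch.normal(self.mean[rows, pos], self.var[rows, pos])

# Update only the selected (row, arm) entries in place.
self.counts[rows, pos] += 1
self.estimates[rows, pos] += (val - self.estimates[rows, pos]) / self.counts[rows, pos]
rewards[i] = val.mean()
Whether this is actually faster than the mask version would need profiling, since both variants launch a handful of element-wise kernels per step.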

Neural Network Backpropagation code not working

I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the XOR data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0 or 1.
I worked out the equations for backpropagation and found that, despite being so simple, my code does not converge on the XOR data.
Let W be the 3x3 matrix of weights connecting the input and hidden layers, and w be the 1x3 matrix that connects the hidden layer to the output layer. Here are some helper functions for my method:
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
This just takes in a value and makes a prediction using the formula sig(w*sig(W*x)). We also have:
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the squared-error objective over a set of given data points. Both of these functions should work, as I checked them by hand. Now the problem comes in with the backpropagation algorithm:
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i = 0
    while(i < epochs):
        prevobj = obj

        dellw = np.zeros((1,3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw

        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s,u,v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
        dellW = np.array([dells, dellu, dellv])
        W -= eta*dellW

        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=", obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and the derivatives I took, does not converge to anything. In fact, the error consistently bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe that the problem lies with the W matrix gradient, but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed to np.random.seed(0) just so that I could be consistent with the matrices I'm dealing with.
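For reference, here is one way to set that data up as arrays and call the function above (a sketch; the column layout is my reading of the listing, with the first number as the label and the trailing 1 as the bias input):
import numpy as np

np.random.seed(0)
traindata = np.array([[0, 0, 1],
                      [1, 1, 1],
                      [1, 0, 1],
                      [0, 1, 1]], dtype=float)
trainlabels = np.array([0, 0, 1, 1], dtype=float)

W, w = back_prop(traindata, trainlabels)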
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so many loops and inconsistencies. In general, it defeats the purpose of using np.array() if you then loop over the elements one by one. Likewise, you should keep the operators straight: * is element-wise multiplication (np.multiply()), while @ is matrix multiplication (np.matmul() / np.dot()). It is also unclear how you are using the derivative: you have it explicitly stated at the start for the activation function and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals, and they should be derived all at once from your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) means solving for a fixed set of weights given a random variable, in stochastic gradient descent we do the opposite: we fix the random variable to our data and optimize the weights. Hence, your functions should only have the weights as "free variables"; everything else is fixed. It is important to track what is fixed and what is free to update. Your code does not reflect that you know what is being updated and what is fixed.
SGD algorithm outline:
1. Start with random params.
2. Update the params by moving them a small percentage in the direction of steepest descent.
3. Repeat step (2) for a specified amount of time.
4. Print your params.
Example of SGD code (performing SGD to find the line of best fit for some data):
import numpy as np

# Data
X = np.random.random((100,))                     # Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,))   # Linear model + Noise

# Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p: p[0]*X + p[1]
dF = lambda p: np.array([X, np.ones(X.shape)])
MSE = lambda p: (1/Y.shape[0])*((Y - F(p))**2).sum(0)
dMSE = lambda p: (1/Y.shape[0])*(-2*(Y - F(p))*dF(p)).sum(1)

# SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0, 0.0])
for i in range(epochs):
    params -= lr*dMSE(params)

print(params)
Hopefully, written this way it is super clear exactly where the subtraction of the gradient is occurring and exactly how it is calculated. Note also, in case it wasn't clear, the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module. Hence, SGD is a clearly useless way to optimize two variables.
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
I think I figured it out: in my code I was not summing the hidden-node weight derivatives and was instead assigning them anew at every loop iteration. The correct version would be as follows (note that the derivative accumulators now need to be zero-initialized before the loop):
dells = np.zeros(len(traindata[0]))
dellu = np.zeros(len(traindata[0]))
dellv = np.zeros(len(traindata[0]))
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s,u,v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x

Get LR from cyclical learning rate in PyTorch

I'm trying to implement the cyclical learning rate approach on top of the PyTorch reimplementation of StyleGAN by rosinality. To do so, I am building on what is suggested in this blog post.
To check how the loss changes as a function of the learning rate, I need to plot the (LR, loss) pairs. Here you can find my modified version of train.py. These are the main changes I made:
Define cyclical_lr, a function regulating the cyclical learning rate
def cyclical_lr(stepsize, min_lr, max_lr):
    # Scaler: we can adapt this if we do not want the triangular CLR
    scaler = lambda x: 1.

    # Lambda function to calculate the LR
    lr_lambda = lambda it: min_lr + (max_lr - min_lr) * relative(it, stepsize)

    # Additional function to see where on the cycle we are
    def relative(it, stepsize):
        cycle = math.floor(1 + it / (2 * stepsize))
        x = abs(it / stepsize - 2 * cycle + 1)
        return max(0, (1 - x)) * scaler(cycle)

    return lr_lambda
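To sanity-check the schedule shape, the returned lambda can be evaluated directly (assuming cyclical_lr above is defined and math is imported; the stepsize and bounds here are made up). Note that torch.optim.lr_scheduler.LambdaLR multiplies the optimizer's initial lr by whatever the lambda returns, so for these values to act as the actual learning rate the optimizer's base lr would have to be 1, which is an assumption about the rest of the training script:
clr = cyclical_lr(stepsize=10, min_lr=0.001, max_lr=0.1)
for it in (0, 5, 10, 15, 20):
    print(it, clr(it))   # 0.001, 0.0505, 0.1, 0.0505, 0.001 -- one full triangle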
Implement the cyclical learning rate for both the discriminator and the generator
step_size = 5*256
end_lr = 10**-1
factor = 10**5
clr = cyclical_lr(step_size, min_lr=end_lr / factor, max_lr=end_lr)
scheduler_g = torch.optim.lr_scheduler.LambdaLR(g_optimizer, [clr, clr])
d_optimizer = optim.Adam(discriminator.parameters(), lr=args.lr, betas=(0.0, 0.99))
scheduler_d = torch.optim.lr_scheduler.LambdaLR(d_optimizer, [clr])
Do you have suggestions on how to plot how the loss changes as a function of the learning rate? Ideally, I would like to do it using TensorBoard, where for now I am plotting the generator loss, the discriminator loss and the size of the generated images as a function of the iteration number:
if (i + 1) % 100 == 0:
    writer.add_scalar('Loss/G', gen_loss_val, i * args.batch.get(resolution))
    writer.add_scalar('Loss/D', disc_loss_val, i * args.batch.get(resolution))
    writer.add_scalar('Step/pixel_size', (4 * 2 ** step), i * args.batch.get(resolution))
    print(args.batch.get(resolution))
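To get the (LR, loss) pairs for plotting, one option (a sketch reusing the writer, optimizers and logging step above) is to read the current learning rate back from the optimizer inside the same logging branch:
# Inside the same `if (i + 1) % 100 == 0:` block as the loss logging above.
lr_d = d_optimizer.param_groups[0]['lr']   # current LR after scheduler_d.step()
lr_g = g_optimizer.param_groups[0]['lr']   # or scheduler_g.get_last_lr()[0] on recent PyTorch
writer.add_scalar('LR/D', lr_d, i * args.batch.get(resolution))
writer.add_scalar('LR/G', lr_g, i * args.batch.get(resolution))
With Loss/D and LR/D logged against the same step value, the two TensorBoard curves line up, which makes it possible to read off how the loss responds over each learning-rate cycle.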

Bad results from LMS stochastic gradient descent

I'm trying to adapt a batch gradient descent algorithm from a previous question to do stochastic gradient descent, but my cost seems to get stuck pretty far from the minimum value (in the example, around 1750 when the minimum is around 1450). It would seem that once it reaches that value, it just starts oscillating there. I also tried shuffling range(0, x.shape[0]-1) every l, but it didn't make any difference. I expect oscillations around the optimal value, but this just seemed too far off, so I think there must be a mistake.
import numpy as np

y = np.asfarray([[400], [330], [369], [232], [540]])
x = np.asfarray([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1)), x), axis=1)

theta = np.asfarray([[0], [.5], [.5]])

fscale = np.sum(x, axis=0)
x /= fscale

alpha = .1

for l in range(1,100000):
    for i in range(0, x.shape[0]-1):
        h = np.dot(x, theta)
        gradient = ((h[i:i+1] - y[i:i+1]) * x[i:i+1]).T
        theta -= alpha * gradient
    print ((h - y)**2).sum(), theta.squeeze() / fscale
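As a reference point for the cost values mentioned above, the exact least-squares optimum can be computed directly (a sketch using the same scaled x and y as in the snippet):
# Closed-form least-squares solution, for comparison with the SGD result.
theta_opt = np.linalg.lstsq(x, y, rcond=None)[0]
print(((np.dot(x, theta_opt) - y) ** 2).sum())   # minimum achievable sum of squared errors
print(theta_opt.squeeze() / fscale)              # parameters on the original, unscaled features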
