Optical Character Recognition using Neural Networks in Python - python

This code is for OCR using ANN ,it contains one hidden layer, the input is an image of size 28x28.the code runs without any error but the output is not at all accurate even after giving 5000+ images for training.I am using the mnist dataset which is of the form of jpg images. Please tell me what is wrong with my logic.
import numpy as np
from PIL import Image
import random
from random import randint
y = [[0,0,0,0,0,0,0,0,0,0]]
W1 = [[ random.uniform(-1, 1) for q in range(40)] for p in range(784)]
W2 = [[ random.uniform(-1, 1) for q in range(10)] for p in range(40)]
def sigmoid(x):
global b
return (1.0 / (1.0 + np.exp(-x)))
#run the neural net forward
def run(X, W):
return sigmoid(np.matmul(X,W)) #1x2 * 2x2 = 1x1 matrix
#cost function
def cost(X, y, W):
nn_output = run(X, W)
return ((nn_output - y))
def gradient_Descent(X,y,W1,W2):
alpha = 0.12 #learning rate
epochs = 15000 #num iterations
for i in range(epochs):
Z2=sigmoid(np.matmul(run(X,W1),W2)) #final activation function(1X10))
Z1=run(X,W1) #first activation function(1X40)
phi1=Z1*(1-Z1) #differentiation of Z1
phi2=Z2*(1-Z2) #differentiation of Z2
delta2 = phi2*cost(Z1,y,W2) #delta for outer layer(1X10)
delta1 = np.transpose(np.transpose(phi1)*np.matmul(W2,np.transpose(delta2)))
deltaW2 = alpha*(np.matmul(np.transpose(Z1),delta2))
deltaW1 = alpha*(np.matmul(np.transpose(X),delta1))
W1=W1+deltaW1
W2=W2+deltaW2
def Training():
for j in range(8):
y[0][j]=1
k=1
while k<=15: #5421
print(k)
q=0
img = Image.open('mnist_jpgfiles/train/mnist_'+str(j)+'_'+str(k)+'.jpg')
iar = np.array(img) #image array
ar=np.reshape(iar,(1,np.product(iar.shape)))
ar=np.array(ar,dtype=float)
X = ar
'''
for p in range(784):
if X[0][p]>0:
X[0][p]=1
else:
X[0][p]=0
'''
k+=1
gradient_Descent(X,y,W1,W2)
print(np.argmin(cost(run(X,W1),y,W2)))
#print(W1)
y[0][j]=0
Training()
def test():
global W1,W2
for j in range(3):
k=1
while k<=5: #890
img = Image.open('mnist_jpgfiles/test/mnist_'+str(j)+'_'+str(k)+'.jpg')
iar = np.array(img) #image array
ar=np.reshape(iar,(1,np.product(iar.shape)))
ar=np.array(ar,dtype=float)
X = ar/256
'''
for p in range(784):
if X[0][p]>0:
X[0][p]=1
else:
X[0][p]=0
'''
k+=1
print("Should be "+str(j))
print((run(run(X,W1),W2)))
print((np.argmax(run(run(X,W1),W2))))
print("Testing.....")
test()

There is a problem with your cost function, because you simply calculate the difference between the hypothesis output with the actual output.It makes your cost function linear, so it's strictly increasing(or strictly decreasing), which can't be optimized.
You need to make a cross-entropy cost function(because you use sigmoid as activation function).
Also, gradient descent simply can't optimize ANN cost function, you should use back-propagation with gradient descent to optimize it.

I haven't worked with ANN but when working with gradient descent algorithm for regression problems like in Andrew Nag Machine Learning course in coursera, I found it is helpful to have learning rate alpha less than 0.05 and no of iterations more than 100000.
Try tweaking your learning rate then create a confusion matrix which will help you understand the accuracy of your system.

In my experience there are a lot of things that can go wrong with an ANN. I'll list some possible errors for you to consider.
Assuming the classification accuracy does not increase at all after training.
Something is wrong with the training or testing sets.
Too high
learning rates can sometimes cause the algorithm to not converge at
all. Try setting it very small like 0.01 or 0.001. If there is still no convergence. The issue probably has to do with something else than the gradient descent.
Assuming the training does increase but the accuracy is worse than expected.
The normalisation process is not correctly implemented. For images it is recommended to use zero-mean-unit-variance.
The learning rate is too low or too high

Related

Training with parametric partial derivatives in pytorch

Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta using a loss depending both on the output and the partial derivatives of the output. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar, however, adapting it to a neural network context within pytorch seems like a huge amount of work and a recipee for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)
def trainable_function(time):
return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
deriv_time = torch.tensor(time, requires_grad=True)
fun_value = trainable_function(deriv_time)
gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
deriv_time.requires_grad = False
return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are besides the point.
def objective(train_times, observations):
predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
return torch.sum((predictions - observations)**2)
optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
optimizer.zero_grad()
loss = objective(data_times, noisy_targets)
loss.backward()
optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives in the way I do, I do not really create a computational graph through which autodiff could differentiate through. Thus, the connection to the parameters theta somehow gets lost and now it looks to the optimizer as if the loss is completely independent of the parameters theta. However, I could be totally wrong..
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advise, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1
def true_function(time):
return true_a*time**3 + true_b*time**2 + true_c*time
def true_derivative(time):
deriv_time = torch.tensor(time, requires_grad=True)
fun_value = true_function(deriv_time)
return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n),:,torch.arange(n),0]
loss = (x -xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()

Gradient descent using TensorFlow is much slower than a basic Python implementation, why?

I'm following a machine learning course. I have a simple linear regression (LR) problem to help me get used to TensorFlow. The LR problem is to find parameters a and b such that Y = a*X + b approximates an (x, y) point cloud (which I generated myself for the sake of simplicity).
I am solving this LR problem using a 'fixed step size gradient descent (FSSGD)'. I implemented it using TensorFlow and it works but I noticed that it is really slow both on GPU and CPU. Because I was curious I implemented the FSSGD myself in Python/NumPy and as expected this runs much faster, about:
10x faster than TF#CPU
20x faster than TF#GPU
If TensorFlow is this slow, I cannot imagine that so many people are using this framework. So I must be doing something wrong. Can anyone help me so I can speedup my TensorFlow implementation.
I'm NOT interested in the difference between the CPU and GPU performance. Both performance indicators are merely provided for completeness and illustration. I'm interested in why my TensorFlow implementation is so much slower than a raw Python/NumPy implementation.
As reference, I add my code below.
Stripped to a minimal (but fully working) example.
Using Python v3.7.9 x64.
Used tensorflow-gpu==1.15 for now (because the course uses TensorFlow v1)
Tested to run in both Spyder and PyCharm.
My FSSGD implementation using TensorFlow (execution time about 40 sec #CPU to 80 sec #GPU):
#%% General imports
import numpy as np
import timeit
import tensorflow.compat.v1 as tf
#%% Get input data
# Generate simulated input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15
#%% Define tensorflow model
# Define data size
n_samples = x_data_input.shape[0]
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
# Define variables to be learned
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE
W = tf.get_variable("weights", (1, 1), initializer=tf.constant_initializer(0.0))
b = tf.get_variable("bias", (1,), initializer=tf.constant_initializer(0.0))
# Define loss function
Y_pred = tf.matmul(X, W) + b
loss = tf.reduce_sum((Y - Y_pred) ** 2 / n_samples) # Quadratic loss function
# %% Solve tensorflow model
#Define algorithm parameters
total_iterations = 1e5 # Defines total training iterations
#Construct TensorFlow optimizer
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE
opt = tf.train.GradientDescentOptimizer(learning_rate = 1e-4)
opt_operation = opt.minimize(loss, name="GDO")
#To measure execution time
time_start = timeit.default_timer()
with tf.Session() as sess:
#Initialize variables
sess.run(tf.global_variables_initializer())
#Train variables
for index in range(int(total_iterations)):
_, loss_val_tmp = sess.run([opt_operation, loss], feed_dict={X: x_data, Y: y_data})
#Get final values of variables
W_val, b_val, loss_val = sess.run([W, b, loss], feed_dict={X: x_data, Y: y_data})
#Print execution time
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')
# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W_val[0,0]))
print('b_val = {0:0.3f}'.format(b_val[0]))
print('')
My own python FSSGD implementation (execution time about 4 sec):
#%% General imports
import numpy as np
import timeit
#%% Get input data
# Define input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15
#%% Define Gradient Descent (GD) model
# Define data size
n_samples = x_data_input.shape[0]
#Initialize data
W = 0.0 # Initial condition
b = 0.0 # Initial condition
# Compute initial loss
y_gd_approx = W*x_data_input+b
loss = np.sum((y_data_input - y_gd_approx)**2)/n_samples # Quadratic loss function
#%% Execute Gradient Descent algorithm
#Define algorithm parameters
total_iterations = 1e5 # Defines total training iterations
GD_stepsize = 1e-4 # Gradient Descent fixed step size
#To measure execution time
time_start = timeit.default_timer()
for index in range(int(total_iterations)):
#Compute gradient (derived manually for the quadratic cost function)
loss_gradient_W = 2.0/n_samples*np.sum(-x_data_input*(y_data_input - y_gd_approx))
loss_gradient_b = 2.0/n_samples*np.sum(-1*(y_data_input - y_gd_approx))
#Update trainable variables using fixed step size gradient descent
W = W - GD_stepsize * loss_gradient_W
b = b - GD_stepsize * loss_gradient_b
#Compute loss
y_gd_approx = W*x_data_input+b
loss = np.sum((y_data_input - y_gd_approx)**2)/x_data_input.shape[0]
#Print execution time
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')
# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W))
print('b_val = {0:0.3f}'.format(b))
print('')
I think it's the result of big iteration number. I've changed the iteration number from 1e5 to 1e3 and also changed x from x_data_input = np.arange(100, step=0.1) to x_data_input = np.arange(100, step=0.0001). This way I've reduced the iteration number but increased the computation by 10x. With np it's done in 22 sec and in tensorflow it's done in 25 sec.
My guess: tensorflow has alot of overhead in each iteration (to give us a framework that can do a lot) but the forward pass and backward pass speed are ok.
The actual answer to my question is hidden in the various comments. For future readers, I will summarize these findings in this answer.
About the speed difference between TensorFlow and a raw Python/NumPy implementation
This part of the answer is actually quite logically.
Each iteration (= each call of Session.run()) TensorFlow performs computations. TensorFlow has a large overhead for starting each computation. On GPU, this overhead is even worse than on CPU. However, TensorFlow executes the actual computations very efficient and more efficiently than the above raw Python/NumPy implementation does.
So, when the number of data points is increased, and therefore the number of computations per iteration you will see that the relative performances between TensorFlow and Python/NumPy shifts in the advantage of TensorFlow. The opposite is also true.
The problem described in the question is very small meaning that the number of computation is very low while the number of iterations is very large. That is why TensorFlow performs so badly. This type of small problems is not the typical use case for which TensorFlow was designed.
To reduce the execution time
Still the execution time of the TensorFlow script can be reduced a lot! To reduce the execution time the number of iterations must be reduced (no matter the size of the problem, this is a good aim anyway).
As #amin's pointed out, this is achieved by scaling the input data. A very briefly explanation why this works: the size of the gradient and variable updates are more balanced compared to the absolute values for which the values are to be found. Therefore, less steps (= iterations) are required.
Followings #amin's advise, I finally ended up by scaling my x-data as follows (some code is repeated to make the position of the new code clear):
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
### START NEW CODE ###
# Scale x_data
x_mean = np.mean(x_data)
x_std = np.std(x_data)
x_data = (x_data - x_mean) / x_std
### END NEW CODE ###
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
Scaling speed up the convergence by a factor 1000. Instead of 1e5 iterations, 1e2 iterations are needed. This is partially because a maximum step size of 1e-1 can be used instead of a step size of 1e-4.
Please note that the found weight and bias are different and that you must feed scaled data from now on.
Optionally, you can choose to unscale the found weight and bias so you can feed unscaled data. Unscaling is done using this code (put somewhere at the end of the code):
#%% Unscaling
W_val_unscaled = W_val[0,0]/x_std
b_val_unscaled = b_val[0]-x_mean*W_val[0,0]/x_std

What Loss Or Reward Is Backpropagated In Policy Gradients For Reinforcement Learning?

I have made a small script in Python to solve various Gym environments with policy gradients.
import gym, os
import numpy as np
#create environment
env = gym.make('Cartpole-v0')
env.reset()
s_size = len(env.reset())
a_size = 2
#import my neural network code
os.chdir(r'C:\---\---\---\Python Code')
import RLPolicy
policy = RLPolicy.NeuralNetwork([s_size,a_size],learning_rate=0.000001,['softmax']) #a 3layer network might be ([s_size, 5, a_size],learning_rate=1,['tanh','softmax'])
#it supports the sigmoid activation function also
print(policy.weights)
DISCOUNT = 0.95 #parameter for discounting future rewards
#first step
action = policy.feedforward(env.reset)
state,reward,done,info = env.step(action)
for t in range(3000):
done = False
states = [] #lists for recording episode
probs2 = []
rewards = []
while not done:
#env.render() #to visualize learning
probs = policy.feedforward(state)[-1] #calculate probabilities of actions
action = np.random.choice(a_size,p=probs) #choose action from probs
#record and update state
probs2.append(probs)
states.append(state)
state,reward,done,info = env.step(action)
rewards.append(reward) #should reward be before updating state?
#calculate gradients
gradients_w = []
gradients_b = []
for i in range(len((rewards))):
totalReward = sum([rewards[t]*DISCOUNT**t for t in range(len(rewards[i:]))]) #discounted reward
## !! this is the line that I need help with
gradient = policy.backpropagation(states[i],totalReward*(probs2[i])) #what should be backpropagated through the network
## !!
##record gradients
gradients_w.append(gradient[0])
gradients_b.append(gradient[1])
#combine gradients and update the weights and biases
gradients_w = np.array(gradients_w,object)
gradients_b = np.array(gradients_b,object)
policy.weights += policy.learning_rate * np.flip(np.sum(gradients_w,0),0) #np.flip because the gradients are calculated backwards
policy.biases += policy.learning_rate * np.flip(np.sum(gradients_b,0),0)
#reset and record
env.reset()
if t%100==0:
print('t'+str(t),'r',sum(rewards))
What should be passed backwards to calculate the gradients? I am using gradient ascent but I could switch it to descent. Some people have defined the reward function as totalReward*log(probabilities). Would that make the score derivative totalReward*(1/probs) or log(probs) or something else? Do you use a cost function like cross entropy?
I have tried
totalReward*np.log(probs)
totalReward*(1/probs)
totalReward*(probs**2)
totalReward*probs
probs = np.zeros(a_size)
probs[action] = 1
totalRewards*probs
and a couple others.
The last one is the only one that was able to solve any of them and it only worked on Cartpole. I have tested the various loss or score functions for thousands of episodes with gradient ascent and descent on Cartpole, Pendulum, and MountainCar. Sometimes it will improve a small amount but it will never solve it. What am I doing wrong?
And here is the RLPolicy code. It is not well written or pseudo coded but I don't think it is the problem because I checked it with gradient checking several times. But it would be helpful even if I could narrow it down to a problem with the neural network or somewhere else in my code.
#Neural Network
import numpy as np
import random, math, time, os
from matplotlib import pyplot as plt
def activation(x,function):
if function=='sigmoid':
return(1/(1+math.e**(-x))) #Sigmoid
if function=='relu':
x[x<0]=0
return(x)
if function=='tanh':
return(np.tanh(x.astype(float))) #tanh
if function=='softmax':
z = np.exp(np.array((x-max(x)),float))
y = np.sum(z)
return(z/y)
def activationDerivative(x,function):
if function=='sigmoid':
return(x*(1-x))
if function=='relu':
x[x<0]==0
x[x>0]==1
return(x)
if function=='tanh':
return(1-x**2)
if function=='softmax':
s = x.reshape(-1,1)
return(np.diagflat(s) - np.dot(s, s.T))
class NeuralNetwork():
def __init__ (self,layers,learning_rate,momentum,regularization,activations):
self.learning_rate = learning_rate
if (isinstance(layers[1],list)):
h = layers[1][:]
del layers[1]
for i in h:
layers.insert(-1,i)
self.layers = layers
self.weights = [2*np.random.rand(self.layers[i]*self.layers[i+1])-1 for i in range(len(self.layers)-1)]
self.biases = [2*np.random.rand(self.layers[i+1])-1 for i in range(len(self.layers)-1)]
self.weights = np.array(self.weights,object)
self.biases = np.array(self.biases,object)
self.activations = activations
def feedforward(self, input_array):
layer = input_array
neuron_outputs = [layer]
for i in range(len(self.layers)-1):
layer = np.tile(layer,self.layers[i+1])
layer = np.reshape(layer,[self.layers[i+1],self.layers[i]])
weights = np.reshape(self.weights[i],[self.layers[i+1],self.layers[i]])
layer = weights*layer
layer = np.sum(layer,1)#,self.layers[i+1]-1)
layer = layer+self.biases[i]
layer = activation(layer,self.activations[i])
neuron_outputs.append(np.array(layer,float))
return(neuron_outputs)
def neuronErrors(self,l,neurons,layerError,n_os):
if (l==len(self.layers)-2):
return(layerError)
totalErr = [] #total error
for e in range(len(layerError)): #-layers
e = e*self.layers[l+2]
a_ws = self.weights[l+1][e:e+self.layers[l+1]]
e = int(e/self.layers[l+2])
err = layerError[e]*a_ws #error
totalErr.append(err)
return(sum(totalErr))
def backpropagation(self,state,loss):
weights_gradient = [np.zeros(self.layers[i]*self.layers[i+1]) for i in range(len(self.layers)-1)]
biases_gradient = [np.zeros(self.layers[i+1]) for i in range(len(self.layers)-1)]
neuron_outputs = self.feedforward(state)
grad = self.individualBackpropagation(loss, neuron_outputs)
return(grad)
def individualBackpropagation(self, difference, neuron_outputs): #number of output
lr = self.learning_rate
n_os = neuron_outputs[:]
w_o = self.weights[:]
b_o = self.biases[:]
w_n = self.weights[:]
b_n = self.biases[:]
gradient_w = []
gradient_b = []
error = difference[:] #error for neurons
for l in range(len(self.layers)-2,-1,-1):
p_n = np.tile(n_os[l],self.layers[l+1]) #previous neuron
neurons = np.arange(self.layers[l+1])
error = (self.neuronErrors(l,neurons,error,n_os))
if not self.activations[l]=='softmax':
error = error*activationDerivative(neuron_outputs[l+1],self.activations[l])
else:
error = error # activationDerivative(neuron_outputs[l+1],self.activations[l]) #because softmax derivative returns different dimensions
w_grad = np.repeat(error,self.layers[l]) #weights gradient
b_grad = np.ravel(error) #biases gradient
w_grad = w_grad*p_n
b_grad = b_grad
gradient_w.append(w_grad)
gradient_b.append(b_grad)
return(gradient_w,gradient_b)
Thanks for any answers, this is my first question here.
Using as reference this post for the computation of the gradient ( https://medium.com/#jonathan_hui/rl-policy-gradients-explained-9b13b688b146) :
It seems to me that totalRewardOfEpisode*np.log(probability of sampled action) is the right computation. However in order to have a good estimate of the gradient I'd suggest using many episodes to compute it. (30 for example, you'd just need to average your end gradient by dividing by 30)
The main difference with your test with totalReward*np.log(probs) is that for each step I think you should only backpropagate on the probability of the action you sampled, not the whole output. Initialy in the cited article they use the total reward but then they suggest in the end using the discounted reward of the present and future rewards as you do, so that part doesn't seem theoretically problematic.
OLD answer :
To my knowledge deepRL methods usely use some estimate of the value of the state in the game or the value of each action. From what I see in your code you have a neural network that only outputs probabilities for each action.
Although what you want is definitely to maximize the total reward, you can't compute a gradient on the end reward because of the environment. I'd suggest you'd look into methods such as deepQLearning or Actor/Critic based methods such as PPO.
Given the method you chose you'll get different answers on how to compute your gradient.
mprouveur's answer was half correct but I felt that I needed to explain the right thing to backpropagate. The answer to my question on ai.stackexchange.com was how I came to understand this. The correct error to backpropagate is the log probability of taking the action multiplied by the goal reward. This can also be calculated as the cross entropy loss between the outputted probabilities and an array of zeros with the action that was taken being one 1. Because of the derivative of cross entropy loss, this will have the effect of pushing only the probability of the action that was taken closer to one. Then, the multiplication of the total reward makes better actions get pushed more to a higher probability. So, with the label being a one-hot encoded vector, the correct equation is label/probs * totalReward because that is the derivative of cross entropy loss and the derivative of the log of probs. I got this working in other code, but even with this equation I think something else in my code is wrong. It probably has something to do with how I made the softmax derivative too complicated instead of calculating the usual way, by combing the cross entropy derivative and softmax derivative. I will update this answer soon with correct code and more information.
The loss here depends on what output on each problem. Generaly, loss for backpropagate should be a number that represents for everything you have processed. For policy gradient, it will be the reward that it think it will get compare with the original reward, the log is just a way to bring it back to a probabily random variable. Single dimension. If you want to inspect the behavior behind codes, you should always check the shape/dimension between each process to fully understand

Neural network - loss not converging

This network contains an input layer and an output layer, with no nonlinearities. The output is just a linear combination of the input.I am using a regression loss to train the network. I generated some random 1D test data according to a simple linear function, with Gaussian noise added. The problem is that the loss function doesn't converge to zero.
import numpy as np
import matplotlib.pyplot as plt
n = 100
alp = 1e-4
a0 = np.random.randn(100,1) # Also x
y = 7*a0+3+np.random.normal(0,1,(100,1))
w = np.random.randn(100,100)*0.01
b = np.random.randn(100,1)
def compute_loss(a1,y,w,b):
return np.sum(np.power(y-w*a1-b,2))/2/n
def gradient_step(w,b,a1,y):
w -= (alp/n)*np.dot((a1-y),a1.transpose())
b -= (alp/n)*(a1-y)
return w,b
loss_vec = []
num_iterations = 10000
for i in range(num_iterations):
a1 = np.dot(w,a0)+b
loss_vec.append(compute_loss(a1,y,w,b))
w,b = gradient_step(w,b,a1,y)
plt.plot(loss_vec)
The convergence also depends on the value of alpha you use. I played with your code a bit and for
alp = 5e-3
I get the following convergence plotted on a logarithmic x-axis
plt.semilogx(loss_vec)
Output
If I understand your code correctly, you only have one weight matrix and one bias vector despite the fact that you have 2 layers. This is odd and might be at least part of your problem.

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters)
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in details . I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
However, just for reference, use following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
At first, you have not written "pred" to "image" conversion to this message(Based on the paper you need to apply subtraction and IDWT to obtain final image).
There is one problem here, logits have to be calculated based on your label data. i.e. if you will use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is L1 distance and it may suffer from some issues (check). Think following situation
Let's say you have following array as output [10, 10, 10, 0, 0] and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For same case, think following output [6, 6, 6, 6, 6]. It still has loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply threshold of 5, you can achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case, you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, optimizer will try to run away from #1 as quick as possible to achieve global success rather than perfect success of some outputs and complete failure of the others. You can apply loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parentheses). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.
Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
for i in range (100000):
if i > 0 and i % N == 0:
# epoch done, decrease learning rate by 2
learning_rate /= 2
print "Epoch completed. LR =", learning_rate
idx = i/batch_size + i%batch_size
f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
_, err = sess.run([train_op, mse], feed_dict = f)
acc_err += err
if i%5000 == 0:
print "Average error = {}".format(acc_err/5000)
acc_err = 0.0

Categories

Resources