Neural network tutorial, softmax activation instead of sigmoid - python

I am working through an FNN tutorial. After some research I learnt that I will need to use softmax activation for my own ML problem rather than the sigmoid shown in the tutorial.
After hours of looking through tutorials I cannot find a basic example of softmax code (outside of modules) that I can use to rework the tutorial code the way sigmoid is used there. If I can figure this out, I can break it down and understand how the math transforms the data as it is fed through the NN, and then apply this to other basic ML starting points such as SVMs.
I need pointers on untangling the sigmoid math/code and reworking it to use softmax activation and one-hot encoding on the y outputs. Thank you for any help.
import numpy as np
# sigmoid function
def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))
# input dataset
X = np.array([ [0,0,1],
               [0,1,1],
               [1,0,1],
               [1,1,1] ])
# output dataset
y = np.array([[0,0,1,1]]).T
# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)
# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1
for iter in xrange(10000):
    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))
    # how much did we miss?
    l1_error = y - l1
    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1,True)
    # update weights
    syn0 += np.dot(l0.T,l1_delta)
print "Output After Training:"
print l1
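For reference, here is a minimal sketch of the kind of rework being asked about. It is not taken from the tutorial. It assumes Python 3, a four-row toy input, two classes, a cross-entropy loss and a learning rate of 0.1, and the helper names `softmax` and `one_hot` are made up for illustration. With softmax and cross-entropy, the error signal at the output layer reduces to `probs - Y`, so no separate derivative function is needed.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum first so np.exp cannot overflow.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def one_hot(labels, n_classes):
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes)[labels]

# Toy data (an assumption): the class equals the first input column.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
labels = np.array([0, 0, 1, 1])
Y = one_hot(labels, 2)

rng = np.random.default_rng(1)
W = 2 * rng.random((3, 2)) - 1  # one weight column per class

learning_rate = 0.1  # assumed value
for _ in range(10000):
    probs = softmax(X @ W)
    # Gradient of the summed cross-entropy loss w.r.t. W.
    grad = X.T @ (probs - Y)
    W -= learning_rate * grad

probs = softmax(X @ W)
pred = probs.argmax(axis=1)
print(pred)
```

Each row of `probs` sums to 1, and `argmax` over a row recovers the predicted class index.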


Training with parametric partial derivatives in pytorch

Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the network's output w.r.t. x, so that I can use the result when training the weights theta with a loss that depends on both the output and its partial derivatives. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within pytorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting my neural network)
import torch
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)
def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are beside the point.
def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)
from torch.optim import Adam
optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that by calculating the partial derivatives this way, I do not really create a computational graph that autodiff can differentiate through. Thus, the connection to the parameters theta somehow gets lost, and to the optimizer it looks as if the loss is completely independent of theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1
def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time
def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n),:,torch.arange(n),0]
loss = (x -xhat).pow(2).sum() + grad.pow(2).sum()
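Applied to the polynomial from the question, the same idea can be sketched as follows. One likely cause of the reported error is that `torch.tensor([...])` copies the values into a fresh tensor with no history, so the loss is no longer connected to theta. The sketch below avoids that by differentiating a whole batch of inputs at once with `create_graph=True`. The noise-free targets, the zero initialization and the number of steps are assumptions made for illustration.

```python
import torch

theta = torch.zeros(3, requires_grad=True)

def trainable_function(time):
    return theta[0] * time**3 + theta[1] * time**2 + theta[2] * time

def trainable_derivative(time):
    # Differentiate w.r.t. a copy of the inputs that tracks gradients.
    t = time.detach().clone().requires_grad_(True)
    value = trainable_function(t)
    # Summing is valid here because each output depends only on its own input.
    # create_graph=True keeps the result connected to theta.
    (grad,) = torch.autograd.grad(value.sum(), t, create_graph=True)
    return grad

data_times = torch.linspace(0, 1, 50)
# Analytical derivative of t^3 + t^2 + t, used as noise-free targets (assumption).
targets = 3 * data_times**2 + 2 * data_times + 1

optimizer = torch.optim.Adam([theta], lr=0.1)
losses = []
for _ in range(500):
    optimizer.zero_grad()
    loss = torch.sum((trainable_derivative(data_times) - targets) ** 2)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1], theta.detach())
```

The key design choice is that the predictions stay a single tensor produced by autograd, rather than a Python list re-wrapped with `torch.tensor`.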

LSTM doesn't learn to add random numbers

I was trying to do a pretty simple thing: train an LSTM that takes a sequence of random numbers and outputs their sum. But after some hours without convergence I decided to ask here which of my premises doesn't hold.
The idea is simple:
I generate a training set of fixed-length sequences of random numbers and label each with its sum (the numbers are drawn from a normal distribution).
I use an LSTM with an RMSE loss to predict the output, the sum of these numbers, given the input sequence.
Intuitively the LSTM should learn to set the weight of the input gate to 1 (bias 0), the weight of the forget gate to 0 (bias 1) and the weight of the output gate to 1 (bias 0), and so learn to add these numbers, but it doesn't. I am pasting the code I use. I tried different learning rates, optimizers and batch sizes, and inspected the gradients and the outputs, but I cannot find the exact reason why it is failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np
def generate_sequences(n_samples, seq_len):
    total_shape = n_samples*seq_len
    random_values = np.random.randn(total_shape)
    random_values = random_values.reshape(n_samples, -1)
    targets = np.sum(random_values, axis=1)
    return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
    my_inp, target = iterator.get_next()
    with tf.GradientTape(persistent=True) as tape:
        my_out = output(tf.reshape(my_inp, shape=(batch_size,seq_len,1)))
        loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)),1)/batch_size
    grads = tape.gradient(loss, output.trainable_variables)
    optimizer.apply_gradients(zip(grads, output.trainable_variables))
I also have a conjecture that this is a theoretical problem: if we sum different random values drawn from a normal distribution, the output is not in the [-1, 1] range, and the LSTM cannot learn it because of the tanh activations. But changing them didn't improve the performance (they are set to linear in the example code).
Set the activations to linear; I realised that tanh() squashes the values.
The problem was actually the tanh() of the gates and recurrent states, which was squashing my outputs and not allowing them to grow by adding up the summands. Setting all activations to linear works fine.
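The squashing effect can be illustrated with a deliberately simplified, hypothetical cell (not the Keras LSTMCell): all gates are fixed to 1, and the only difference between the two runs is the activation applied to the input and to the cell state. The function name `run_cell` and the example sequence are assumptions.

```python
import numpy as np

def run_cell(sequence, activation):
    # Simplified cell: forget gate = 1, input gate = 1, output gate = 1.
    state = 0.0
    for value in sequence:
        state = state + activation(value)
    return activation(state)

sequence = np.array([1.5, 2.0])  # true sum is 3.5

linear_out = run_cell(sequence, lambda v: v)
tanh_out = run_cell(sequence, np.tanh)

print(linear_out)  # the exact sum
print(tanh_out)    # bounded by tanh, so it can never reach 3.5
```

With tanh, the output always lies strictly inside (-1, 1), so any target outside that range is unreachable no matter what the weights are.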

Neural network - loss not converging

This network contains an input layer and an output layer, with no nonlinearities. The output is just a linear combination of the input. I am using a regression loss to train the network. I generated some random 1D test data according to a simple linear function, with Gaussian noise added. The problem is that the loss function doesn't converge to zero.
import numpy as np
import matplotlib.pyplot as plt
n = 100
alp = 1e-4
a0 = np.random.randn(100,1) # Also x
y = 7*a0+3+np.random.normal(0,1,(100,1))
w = np.random.randn(100,100)*0.01
b = np.random.randn(100,1)
def compute_loss(a1,y,w,b):
    return np.sum(np.power(y-w*a1-b,2))/2/n
def gradient_step(w,b,a1,y):
    w -= (alp/n)*np.dot((a1-y),a1.transpose())
    b -= (alp/n)*(a1-y)
    return w,b
loss_vec = []
num_iterations = 10000
for i in range(num_iterations):
    a1 = np.dot(w,a0)+b
    w,b = gradient_step(w,b,a1,y)
The convergence also depends on the value of alpha you use. I played with your code a bit and for
alp = 5e-3
I get the following convergence plotted on a logarithmic x-axis
If I understand your code correctly, you only have one weight matrix and one bias vector despite the fact that you have 2 layers. This is odd and might be at least part of your problem.
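As a point of comparison, here is a minimal sketch of the same regression with a single scalar weight and a single scalar bias, which is an assumption about what the model is meant to be. It uses the learning rate 5e-3 mentioned above. Because the data contain Gaussian noise with variance 1, the loss as defined in the question should settle near 0.5 rather than 0, even with a perfect fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal((n, 1))
y = 7 * x + 3 + rng.normal(0, 1, (n, 1))

w, b = 0.0, 0.0  # one scalar weight and one scalar bias
alp = 5e-3

for _ in range(10000):
    err = w * x + b - y            # prediction error, shape (n, 1)
    w -= alp * np.mean(err * x)    # gradient of the loss w.r.t. w
    b -= alp * np.mean(err)        # gradient of the loss w.r.t. b

loss = np.sum((y - w * x - b) ** 2) / 2 / n
print(w, b, loss)
```

The fitted `w` and `b` end up close to 7 and 3, while the remaining loss reflects the noise in `y`, not a failure to converge.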

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10^-1 to 10^-4 which was reduced in log scale at each epoch.
This paper uses the wavelet transform (WT) and a residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). The Xavier method suggests initializing the variables from a normal distribution with
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in detail. I am unable to find an equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand that MomentumOptimizer is equivalent to stochastic gradient descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you).
Q2: Is it possible to set the learning rate in log scale?
You can, but do you need it? This is not the first thing you need to solve in this network. Please check #3.
However, just for reference, use the following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
First, you have not included the conversion from "pred" to "image" in this message (based on the paper, you need to apply subtraction and IDWT to obtain the final image).
There is one problem here: logits have to be calculated based on your label data, i.e. if you use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is the L1 distance and it may suffer from some issues. Consider the following situation.
Let's say you have the following array as output, [10, 10, 10, 0, 0], and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For the same case, consider the output [6, 6, 6, 6, 6]. It still has a loss of 20 (4 + 4 + 4 + 4 + 4). However, when you apply a threshold of 5, you achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case you will have 10^2 + 10^2 = 200 as the loss. For the second case, you will get 4^2 * 5 = 80.
Hence, the optimizer will try to move away from #1 as quickly as possible, to achieve global success rather than perfect success on some outputs and complete failure on the others. You can apply a loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
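The arithmetic in the two cases above can be checked with a few lines of plain NumPy. This is only a sketch of the comparison, using unnormalized sums of absolute and squared differences; the variable names are made up for illustration.

```python
import numpy as np

target = np.array([10, 10, 10, 10, 10], dtype=float)
case_1 = np.array([10, 10, 10, 0, 0], dtype=float)  # perfect on 3, failing on 2
case_2 = np.array([6, 6, 6, 6, 6], dtype=float)     # slightly off everywhere

# L1 distance: sum of absolute differences.
l1_case_1 = np.sum(np.abs(case_1 - target))
l1_case_2 = np.sum(np.abs(case_2 - target))

# L2 (squared) distance: sum of squared differences.
l2_case_1 = np.sum((case_1 - target) ** 2)
l2_case_2 = np.sum((case_2 - target) ** 2)

print(l1_case_1, l1_case_2)  # 20.0 20.0
print(l2_case_1, l2_case_2)  # 200.0 80.0
```

Note that `tf.nn.l2_loss` returns half of the sum of squares, so it would report 100 and 40 for these two cases. The ordering of the two cases is the same.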
Alternatively, you can look at the cross-entropy loss function (it applies softmax internally, so do not apply softmax twice).
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although it is missing an opening parenthesis). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.
Other answers are very detailed and helpful. Here is a code example that uses a placeholder to decay the learning rate on a log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
    for i in range (100000):
        if i > 0 and i % N == 0:
            # epoch done, decrease learning rate by 2
            learning_rate /= 2
            print "Epoch completed. LR =", learning_rate
        idx = i/batch_size + i%batch_size
        f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
        _, err = sess.run([train_op, mse], feed_dict = f)
        acc_err += err
        if i%5000 == 0:
            print "Average error = {}".format(acc_err/5000)
            acc_err = 0.0
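For the schedule quoted from the paper (10^-1 down to 10^-4, reduced in log scale at each epoch), one possible reading is a per-epoch learning rate that is evenly spaced on a logarithmic axis. The sketch below computes such a schedule with NumPy and shows that it matches an exponential decay applied once per epoch. The number of epochs is an assumption, since the quoted passage does not state it. The resulting values could be fed through a placeholder as in the code above.

```python
import numpy as np

num_epochs = 4        # hypothetical; not given in the quoted passage
start_lr = 1e-1
end_lr = 1e-4

# Evenly spaced in log10: one learning rate per epoch.
log_schedule = np.logspace(np.log10(start_lr), np.log10(end_lr), num=num_epochs)

# The same schedule written as an exponential decay with one step per epoch.
decay_rate = (end_lr / start_lr) ** (1.0 / (num_epochs - 1))
exp_schedule = start_lr * decay_rate ** np.arange(num_epochs)

for epoch, rate in enumerate(log_schedule):
    print(epoch, rate)
```

Under this reading, "log scale" and a staircase exponential decay with one decay step per epoch describe the same sequence of rates.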

Optical Character Recognition using Neural Networks in Python

This code is for OCR using an ANN. It contains one hidden layer, and the input is an image of size 28x28. The code runs without any error, but the output is not at all accurate even after giving 5000+ images for training. I am using the MNIST dataset in the form of jpg images. Please tell me what is wrong with my logic.
import numpy as np
from PIL import Image
import random
from random import randint
y = [[0,0,0,0,0,0,0,0,0,0]]
W1 = [[ random.uniform(-1, 1) for q in range(40)] for p in range(784)]
W2 = [[ random.uniform(-1, 1) for q in range(10)] for p in range(40)]
def sigmoid(x):
    global b
    return (1.0 / (1.0 + np.exp(-x)))
#run the neural net forward
def run(X, W):
    return sigmoid(np.matmul(X,W)) #1x2 * 2x2 = 1x1 matrix
#cost function
def cost(X, y, W):
    nn_output = run(X, W)
    return ((nn_output - y))
def gradient_Descent(X,y,W1,W2):
    alpha = 0.12 #learning rate
    epochs = 15000 #num iterations
    for i in range(epochs):
        Z2=sigmoid(np.matmul(run(X,W1),W2)) #final activation function(1X10))
        Z1=run(X,W1) #first activation function(1X40)
        phi1=Z1*(1-Z1) #differentiation of Z1
        phi2=Z2*(1-Z2) #differentiation of Z2
        delta2 = phi2*cost(Z1,y,W2) #delta for outer layer(1X10)
        delta1 = np.transpose(np.transpose(phi1)*np.matmul(W2,np.transpose(delta2)))
        deltaW2 = alpha*(np.matmul(np.transpose(Z1),delta2))
        deltaW1 = alpha*(np.matmul(np.transpose(X),delta1))
def Training():
    for j in range(8):
        while k<=15: #5421
            img = Image.open('mnist_jpgfiles/train/mnist_'+str(j)+'_'+str(k)+'.jpg')
            iar = np.array(img) #image array
            X = ar
            for p in range(784):
                if X[0][p]>0:
def test():
    global W1,W2
    for j in range(3):
        while k<=5: #890
            img = Image.open('mnist_jpgfiles/test/mnist_'+str(j)+'_'+str(k)+'.jpg')
            iar = np.array(img) #image array
            X = ar/256
            for p in range(784):
                if X[0][p]>0:
            print("Should be "+str(j))
There is a problem with your cost function: you simply calculate the difference between the hypothesis output and the actual output. This makes your cost function linear, so it is strictly increasing (or strictly decreasing) and cannot be optimized.
You need to use a cross-entropy cost function (because you use sigmoid as the activation function).
Also, gradient descent alone can't optimize an ANN cost function; you should use back-propagation together with gradient descent to optimize it.
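As a rough sketch of what that suggestion could look like for the 784-40-10 network in the question: with a sigmoid output and a cross-entropy cost, the output-layer delta simplifies to `output - y`, and the hidden-layer delta is obtained by propagating it back through W2. The random input standing in for one image, the target class, the weight range and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(output, target):
    eps = 1e-12  # avoids log(0)
    return -np.sum(target * np.log(output + eps)
                   + (1 - target) * np.log(1 - output + eps))

rng = np.random.default_rng(0)
X = rng.random((1, 784))            # stand-in for one flattened 28x28 image in [0, 1)
y = np.zeros((1, 10))
y[0, 3] = 1.0                       # hypothetical target: digit 3, one-hot encoded
W1 = rng.uniform(-0.1, 0.1, (784, 40))
W2 = rng.uniform(-0.1, 0.1, (40, 10))
alpha = 0.01                        # assumed learning rate

losses = []
for _ in range(50):
    hidden = sigmoid(X @ W1)                              # (1, 40)
    output = sigmoid(hidden @ W2)                         # (1, 10)
    losses.append(cross_entropy(output, y))
    delta2 = output - y                                   # output-layer error
    delta1 = (delta2 @ W2.T) * hidden * (1 - hidden)      # back-propagated error
    W2 -= alpha * hidden.T @ delta2
    W1 -= alpha * X.T @ delta1

print(losses[0], losses[-1])
```

Unlike the code in the question, the weight updates here are actually applied to W1 and W2 inside the loop, which is what lets the cost go down.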
I haven't worked with ANNs, but when working with the gradient descent algorithm for regression problems, as in Andrew Ng's Machine Learning course on Coursera, I found it helpful to have a learning rate alpha less than 0.05 and more than 100000 iterations.
Try tweaking your learning rate, then create a confusion matrix, which will help you understand the accuracy of your system.
In my experience there are a lot of things that can go wrong with an ANN. I'll list some possible errors for you to consider.
Assuming the classification accuracy does not increase at all after training:
Something is wrong with the training or testing sets.
Too high learning rates can sometimes cause the algorithm to not converge at all. Try setting it very small, like 0.01 or 0.001. If there is still no convergence, the issue probably has to do with something other than the gradient descent.
Assuming the accuracy does increase during training but is worse than expected:
The normalisation process is not correctly implemented. For images it is recommended to use zero-mean-unit-variance.
The learning rate is too low or too high.
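The zero-mean-unit-variance normalisation mentioned above can be sketched as follows. The random array stands in for a small batch of flattened 28x28 grayscale images and is an assumption, as is the choice to use one global mean and standard deviation rather than per-pixel statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 5 flattened 28x28 images with pixel values 0..255.
images = rng.integers(0, 256, size=(5, 784)).astype(np.float64)

mean = images.mean()
std = images.std()
normalized = (images - mean) / std

print(normalized.mean(), normalized.std())
```

In practice the mean and standard deviation would usually be computed on the training set and then reused unchanged for the test images.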

