Weird gradient results with recurrent layers

Weird gradient results with recurrent layers - python

I've been experimenting with very basic recurrent networks and have seen really strange behaviors. I've spent quite a bit of time trying to narrow down where it goes wrong and I ended up finding that the gradients computed by theano and by finite differentiation are radically different when using a recurrent layer. What is going on here?
Here is the kind of problem I have:
I have n_seq sequences of n_steps feature vectors of dimension n_feat, along with their labels among n_class classes. The labels are per time step, not per sequence (so I have n_seq*n_steps labels).
My goal is to train a model to correctly classify the feature vectors.
Here's my minimal example:
(In reality, there would be some sequential information in the data, hence recurrent networks should do better, but in this minimal example I generate purely random data, which is good enough to expose the bug.)
I create 2 minimal networks:
1) A regular feed-forward (not recurrent), with just the input layer and an output layer with softmax (no hidden layer). I discard the sequential information by considering a "batch" of n_seq*n_steps "independent" feature vectors.
2) An identical network but where the output layer is recurrent. The batch is now of size n_seq and each input is a full sequence of n_steps feature vectors. Finally I reshape the output back to a "batch" of size n_seq*n_steps.
If the recurrent weights are set to 0, the 2 networks should be equivalent. Indeed I do see that the initial loss for both networks are the same in this case, no matter what random initialization for the feed-forward weights I have.
If I implement a finite differentiation, I also get that the (initial) gradients for the feed-forward weights are the same (as they should). However the gradients obtained from theano are radically different (but only for the recurrent network).
Here is my code with sample result:
NOTE: when run for the first time, I get this warning, I don't know what triggers it but I bet it is relevant to my problem.
Warning: In the strict mode, all neccessary shared variables must be passed as a part of non_sequences
'must be passed as a part of non_sequences', Warning)
Any insight would be much appreciated!
CODE:
import numpy as np
import theano
import theano.tensor as T
import lasagne
# GENERATE RANDOM DATA
n_steps = 10**4
n_seq = 10
n_feat = 2
n_class = 2
data_X = lasagne.utils.floatX(np.random.randn(n_seq, n_steps, n_feat))
data_y = np.random.randint(n_class, size=(n_seq, n_steps))
# INITIALIZE WEIGHTS
# feed-forward weights (random)
W = theano.shared(lasagne.utils.floatX(np.random.randn(n_feat,n_class)), name="W")
# recurrent weights (set to 0)
W_rec = theano.shared(lasagne.utils.floatX(np.zeros((n_class,n_class))), name="Wrec")
# bias (set to 0)
b = theano.shared(lasagne.utils.floatX(np.zeros((n_class,))), name="b")
def create_functions(model, X, y, givens):
"""Helper for building a network."""
loss = lasagne.objectives.categorical_crossentropy(lasagne.layers.get_output(model, X), y).mean()
get_loss = theano.function(
[], loss,
givens=givens
)
all_params = lasagne.layers.get_all_params(model)
get_theano_grad = [
theano.function(
[], g,
givens=givens
)
for g in theano.grad(loss, all_params)
]
return get_loss, get_theano_grad
def feedforward():
"""Creates a minimal feed-forward network."""
l_in = lasagne.layers.InputLayer(
shape=(n_seq*n_steps, n_feat),
)
l_out = lasagne.layers.DenseLayer(
l_in,
num_units=n_class,
nonlinearity=lasagne.nonlinearities.softmax,
W=W,
b=b
)
model = l_out
X = T.matrix('X')
y = T.ivector('y')
givens={
X: theano.shared(data_X.reshape((n_seq*n_steps, n_feat))),
y: T.cast(theano.shared(data_y.reshape((n_seq*n_steps,))), 'int32'),
}
return (model,) + create_functions(model, X, y, givens)
def recurrent():
"""Creates a minimal recurrent network."""
l_in = lasagne.layers.InputLayer(
shape=(n_seq, n_steps, n_feat),
)
l_out = lasagne.layers.RecurrentLayer(
l_in,
num_units=n_class,
nonlinearity=lasagne.nonlinearities.softmax,
gradient_steps=1,
W_in_to_hid=W,
W_hid_to_hid=W_rec,
b=b,
)
l_reshape = lasagne.layers.ReshapeLayer(l_out, (n_seq*n_steps, n_class))
model = l_reshape
X = T.tensor3('X')
y = T.ivector('y')
givens={
X: theano.shared(data_X),
y: T.cast(theano.shared(data_y.reshape((n_seq*n_steps,))), 'int32'),
}
return (model,) + create_functions(model, X, y, givens)
def finite_diff(param, loss_func, epsilon):
"""Computes a finitie differentation gradient of loss_func wrt param."""
loss = loss_func()
P = param.get_value()
grad = np.zeros_like(P)
it = np.nditer(P , flags=['multi_index'])
while not it.finished:
ind = it.multi_index
dP = P.copy()
dP[ind] += epsilon
param.set_value(dP)
grad[ind] = (loss_func()-loss)/epsilon
it.iternext()
param.set_value(P)
return grad
def theano_diff(net, get_theano_grad):
for p,g in zip(lasagne.layers.get_all_params(net), get_theano_grad):
if p.name == "W":
gW = np.array(g())
if p.name == "b":
gb = np.array(g())
return gW, gb
def compare_ff_rec():
eps = 1e-3 # for finite differentiation
ff, get_loss_ff, get_theano_grad_ff = feedforward()
rec, get_loss_rec, get_theano_grad_rec = recurrent()
gW_ff_finite = finite_diff(W, get_loss_ff, eps)
gb_ff_finite = finite_diff(b, get_loss_ff, eps)
gW_rec_finite = finite_diff(W, get_loss_rec, eps)
gb_rec_finite = finite_diff(b, get_loss_rec, eps)
gW_ff_theano, gb_ff_theano = theano_diff(ff, get_theano_grad_ff)
gW_rec_theano, gb_rec_theano = theano_diff(rec, get_theano_grad_rec)
print "\nloss:"
print "FF:\t", get_loss_ff()
print "REC:\t", get_loss_rec()
print "\ngradients:"
print "W"
print "FF finite:\n", gW_ff_finite.ravel()
print "FF theano:\n", gW_ff_theano.ravel()
print "REC finite:\n", gW_rec_finite.ravel()
print "REC theano:\n", gW_rec_theano.ravel()
print "b"
print "FF finite:\n", gb_ff_finite.ravel()
print "FF theano:\n", gb_ff_theano.ravel()
print "REC finite:\n", gb_rec_finite.ravel()
print "REC theano:\n", gb_rec_theano.ravel()
compare_ff_rec()
RESULTS:
loss:
FF: 0.968060314655
REC: 0.968060314655
gradients:
W
FF finite:
[ 0.23925304 -0.23907423 0.14013052 -0.14001131]
FF theano:
[ 0.23917811 -0.23917811 0.14011626 -0.14011627]
REC finite:
[ 0.23931265 -0.23907423 0.14024973 -0.14001131]
REC theano:
[ 1.77408110e-05 -1.77408110e-05 1.21677476e-05 -1.21677458e-05]
b
FF finite:
[ 0.00065565 -0.00047684]
FF theano:
[ 0.00058145 -0.00058144]
REC finite:
[ 0.00071526 -0.00047684]
REC theano:
[ 7.53380482e-06 -7.53380482e-06]

The problem was coming from non-intuitive (maybe) effect of gradient_steps clipping in BPTT as explained here:
https://groups.google.com/forum/#!topic/theano-users/QNge6fC6C4s

Related

Tensorflow: Simple 3D Convnet not learning

I am trying to create a simple 3D U-net for image segmentation, just to learn how to use the layers. Therefore I do a 3D convolution with stride 2 and then a transpose deconvolution to get back the same image size. I am also overfitting to a small set (test set) just to see if my network is learning.
I created the same net in Keras and it works just fine. Now I want to create in tensorflow but I been having trouble with it.
The cost changes slightly but no matter what I do (reduce learning rate, add more epochs, add more layers, change batch size...) the output is always the same. I believe the net is not updating the weights. I am sure I am doing something wrong but I can find what it is. Any help would be greatly appreciate it.
Here is my code:
def forward_propagation(X):
if ( mode == 'train'): print(" --------- Net --------- ")
# Convolutional Layer 1
with tf.variable_scope('CONV1'):
Z1 = tf.layers.conv3d(X, filters = 16, kernel =[3,3,3], strides = [ 2, 2, 2], padding='SAME', name = 'S2/conv3d')
A1 = tf.nn.relu(Z1, name = 'S2/ReLU')
if ( mode == 'train'): print("Convolutional Layer 1 S2 " + str(A1.get_shape()))
# DEConvolutional Layer 1
with tf.variable_scope('DeCONV1'):
output_deconv1 = tf.stack([X.get_shape()[0] , X.get_shape()[1], X.get_shape()[2], X.get_shape()[3], 1])
dZ1 = tf.nn.conv3d_transpose(A1, filters = 1, kernel =[3,3,3], strides = [2, 2, 2], padding='SAME', name = 'S2/conv3d_transpose')
dA1 = tf.nn.relu(dZ1, name = 'S2/ReLU')
if ( mode == 'train'): print("Deconvolutional Layer 1 S1 " + str(dA1.get_shape()))
return dA1
def compute_cost(output, target, method = 'dice_hard_coe'):
with tf.variable_scope('COST'):
if (method == 'sigmoid_cross_entropy') :
# Make them vectors
output = tf.reshape( output, [-1, output.get_shape().as_list()[0]] )
target = tf.reshape( target, [-1, target.get_shape().as_list()[0]] )
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = output, labels = target)
cost = tf.reduce_mean(loss)
return cost
and the main function for the model:
def model(X_h5, Y_h5, learning_rate = 0.009,
num_epochs = 100, minibatch_size = 64, print_cost = True):
ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables
#tf.set_random_seed(1) # to keep results consistent (tensorflow seed)
#seed = 3 # to keep results consistent (numpy seed)
(m, n_D, n_H, n_W, num_channels) = X_h5["test_data"].shape #TTT
num_labels = Y_h5["test_mask"].shape[4] #TTT
img_size = Y_h5["test_mask"].shape[1] #TTT
costs = [] # To keep track of the cost
accuracies = [] # To keep track of the accuracy
# Create Placeholders of the correct shape
X, Y = create_placeholders(n_H, n_W, n_D, minibatch_size)
# Forward propagation: Build the forward propagation in the tensorflow graph
nn_output = forward_propagation(X)
prediction = tf.nn.sigmoid(nn_output)
# Cost function: Add cost function to tensorflow graph
cost_method = 'sigmoid_cross_entropy'
cost = compute_cost(nn_output, Y, cost_method)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
print('------ Training ------')
# Run the initialization
tf.local_variables_initializer().run(session=sess)
sess.run(init)
# Do the training loop
for i in range(num_epochs*m):
# ----- TRAIN -------
current_epoch = i//m
patient_start = i-(current_epoch * m)
patient_end = patient_start + minibatch_size
current_X_train = np.zeros((minibatch_size, n_D, n_H, n_W,num_channels))
current_X_train[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:]) #TTT
current_X_train = np.nan_to_num(current_X_train) # make nan zero
current_Y_train = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_train[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:]) #TTT
current_Y_train = np.nan_to_num(current_Y_train) # make nan zero
feed_dict = {X: current_X_train, Y: current_Y_train}
_ , temp_cost = sess.run([optimizer, cost], feed_dict=feed_dict)
# ----- TEST -------
# Print the cost every 1/5 epoch
if ((i % (num_epochs*m/5) )== 0):
# Calculate the predictions
test_predictions = np.zeros(Y_h5["test_mask"].shape)
for j in range(0, X_h5["test_data"].shape[0], minibatch_size):
patient_start = j
patient_end = patient_start + minibatch_size
current_X_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_channels))
current_X_test[:,:,:,:,:] = np.array(X_h5["test_data"][patient_start:patient_end,:,:,:,:])
current_X_test = np.nan_to_num(current_X_test) # make nan zero
current_Y_test = np.zeros((minibatch_size, n_D, n_H, n_W, num_labels))
current_Y_test[:,:,:,:,:] = np.array(Y_h5["test_mask"][patient_start:patient_end,:,:,:,:])
current_Y_test = np.nan_to_num(current_Y_test) # make nan zero
feed_dict = {X: current_X_test, Y: current_Y_test}
_, current_prediction = sess.run([cost, prediction], feed_dict=feed_dict)
test_predictions[j:j + minibatch_size,:,:,:,:] = current_prediction
costs.append(temp_cost)
print ("[" + str(current_epoch) + "|" + str(num_epochs) + "] " + "Cost : " + str(costs[-1]))
display_progress(X_h5["test_data"], Y_h5["test_mask"], test_predictions, 5, n_H, n_W)
# plot the cost
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('epochs')
plt.show()
return
I call the model with:
model(hdf5_data_file, hdf5_mask_file, num_epochs = 500, minibatch_size = 1, learning_rate = 1e-3)
These are the results that I am currently getting:
Edit:
I have tried reducing the learning rate and it doesn't help. I also tried using tensorboard debug and the weights are not being updated:
I am not sure why this is happening.
I Created the same simple model in keras and it works fine. I am not sure what I am doing wrong in tensorflow.

Not sure if you are still looking for help, as I am answering this question half a year later your posted date. :) I've listed my observations and also some suggestions for you to try below. It my primary observation is right... then you probably just need a coffee break / a night of good sleep.
primary observation:
tf.reshape( output, [-1, output.get_shape().as_list()[0]] ) seems wrong. If you prefer to flatten the vector, it should be something like tf.reshape(output,[-1,np.prod(image_shape_list)]).
other observations:
With such a shallow network, I doubt the network have enough spatial resolution to differentiate tumor voxels from non-tumor voxels. Can you show the keras implementation and the performance compared to a pure tf implementation? I would probably go with 2+ layers, let's .
say with 3 layers, with a stride of 2 per layer, and an input image width of 256, you will end with a width of 32 at your deepest encoder layer. (If you have a limited GPU memory, downsample the input image.)
if changing the loss computation does not work, as #bremen_matt mentioned, reduce LR to say maybe 1e-5.
after the basic architecture tweaks and you "feel" that the network is sort of learning and not stuck, try augmenting the training data, add dropout, batch norm during training, and then maybe fancy up your loss by adding a discriminator.

Modify neural net to classify single example

This is my custom extension of one of Andrew NG's neural network from deep learning course where instead of producing 0 or 1 for binary classification I'm attempting
to classify multiple examples.
Both the inputs and outputs are one hot encoded.
With not much training I receive an accuracy of 'train accuracy: 67.51658067499625 %'
How can I classify a single training example instead of classifying all training examples?
I think a bug exists in my implementation as an issue with this network is training examples (train_set_x) and output values (train_set_y) both need to have same dimensions or an error related to the dimensionality of matrices is received.
For example using :
train_set_x = np.array([
[1,1,1,1],[0,1,1,1],[0,0,1,1]
])
train_set_y = np.array([
[1,1,1],[1,1,0],[1,1,1]
])
returns error :
ValueError Traceback (most recent call last)
<ipython-input-11-0d356e8d66f3> in <module>()
27 print(A)
28
---> 29 np.multiply(train_set_y,A)
30
31 def initialize_with_zeros(numberOfTrainingExamples):
ValueError: operands could not be broadcast together with shapes (3,3) (1,4)
network code :
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from scipy import ndimage
import pandas as pd
%matplotlib inline
train_set_x = np.array([
[1,1,1,1],[0,1,1,1],[0,0,1,1]
])
train_set_y = np.array([
[1,1,1,0],[1,1,0,0],[1,1,1,1]
])
numberOfFeatures = 4
numberOfTrainingExamples = 3
def sigmoid(z):
s = 1 / (1 + np.exp(-z))
return s
w = np.zeros((numberOfTrainingExamples , 1))
b = 0
A = sigmoid(np.dot(w.T , train_set_x))
print(A)
np.multiply(train_set_y,A)
def initialize_with_zeros(numberOfTrainingExamples):
w = np.zeros((numberOfTrainingExamples , 1))
b = 0
return w, b
def propagate(w, b, X, Y):
m = X.shape[1]
A = sigmoid(np.dot(w.T , X) + b)
cost = -(1/m)*np.sum(np.multiply(Y,np.log(A)) + np.multiply((1-Y),np.log(1-A)), axis=1)
dw = ( 1 / m ) * np.dot( X, ( A - Y ).T ) # consumes ( A - Y )
db = ( 1 / m ) * np.sum( A - Y ) # consumes ( A - Y ) again
# cost = np.squeeze(cost)
grads = {"dw": dw,
"db": db}
return grads, cost
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = True):
costs = []
for i in range(num_iterations):
grads, cost = propagate(w, b, X, Y)
dw = grads["dw"]
db = grads["db"]
w = w - (learning_rate * dw)
b = b - (learning_rate * db)
if i % 100 == 0:
costs.append(cost)
if print_cost and i % 10000 == 0:
print(cost)
params = {"w": w,
"b": b}
grads = {"dw": dw,
"db": db}
return params, grads, costs
def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False):
w, b = initialize_with_zeros(numberOfTrainingExamples)
parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = True)
w = parameters["w"]
b = parameters["b"]
Y_prediction_train = sigmoid(np.dot(w.T , X_train) + b)
print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.0001, print_cost = True)
Update: A bug exists in this implementation in that the training example pairs (train_set_x , train_set_y) must contain the same dimensions. Can point in direction of how linear algebra should be modified?
Update 2 :
I modified #Paul Panzer answer so that learning rate is 0.001 and train_set_x , train_set_y pairs are unique :
train_set_x = np.array([
[1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1]
])
train_set_y = np.array([
[1,0,0],[0,0,1],[0,1,0],[1,0,1]
])
grads = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True)
# To classify single training example :
print(sigmoid(dw # [0,0,1,1,0] + db))
This update produces following output :
-2.09657359028
-3.94918577439
[[ 0.74043089 0.32851512 0.14776077 0.77970162]
[ 0.04810012 0.08033521 0.72846174 0.1063849 ]
[ 0.25956911 0.67148488 0.22029838 0.85223923]]
[[1 0 0 1]
[0 0 1 0]
[0 1 0 1]]
train accuracy: 79.84462279013312 %
[[ 0.51309252 0.48853845 0.50945862]
[ 0.5110232 0.48646923 0.50738869]
[ 0.51354109 0.48898712 0.50990734]]
Should print(sigmoid(dw # [0,0,1,1,0] + db)) produce a vector that once rounded matches train_set_y corresponding value : [0,1,0] ?
Modifying to produce a vector with (adding [0,0,1,1,0] to numpy array and taking transpose):
print(sigmoid(dw # np.array([[0,0,1,1,0]]).T + db))
returns :
array([[ 0.51309252],
[ 0.48646923],
[ 0.50990734]])
Again, rounding these values to nearest whole number produces vector [1,0,1] when [0,1,0] is expected.
These are incorrect operations to produce a prediction for single training example ?

Your difficulties come from mismatched dimensions, so let's walk through the problem and try and get them straight.
Your network has a number of inputs, the features, let's call their number N_in (numberOfFeatures in your code). And it has a number of outputs which correspond to different classes let's call their number N_out. Inputs and outputs are connected by the weights w.
Now here is the problem. Connections are all-to-all, so we need a weight for each of the N_out x N_in pairs of outputs and inputs. Therefore in your code the shape of w must be changed to (N_out, N_in). You probably also want an offset b for each output, so b should be a vector of size (N_out,) or rather (N_out, 1) so it plays well with the 2d terms.
I've fixed that in the modified code below and I tried to make it very explicit. I've also thrown a mock data creator into the bargain.
Re the one-hot encoded categorical output, I'm not an expert on neural networks but I think, most people understand it so that classes are mutually exclusive, so each sample in your mock output should have one one and the rest zeros.
Side note:
At one point a competing answer advised you to get rid of the 1-... terms in the cost function. While that looks like an interesting idea to me my gut feeling (Edit Now confirmed using gradient-free minimizer; use activation="hybrid" in code below. Solver will simply maximize all outputs which are active in at least one training example.) is it won't work just like that because the cost will then fail to penalise false positives (see below for detailed explanation). To make it work you'd have to add some kind of regularization. One method that appears to work is using the softmax instead of the sigmoid. The softmax is to one-hot what the sigmoid is to binary. It makes sure the output is "fuzzy one-hot".
Therefore my recommendation is:
If you want to stick with sigmoid and not explicitly enforce one-hot predictions. Keep the 1-... term.
If you want to use the shorter cost function. Enforce one-hot predictions. For example by using softmax instead of sigmoid.
I've added an activation="sigmoid"|"softmax"|"hybrid" parameter to the code that switches between models. I've also made the scipy general purpose minimizer available, which may be useful when the gradient of the cost is not at hand.
Recap on how the cost function works:
The cost is a sum over all classes and all training samples of the term
-y log (y') - (1-y) log (1-y')
where y is the expected response, i.e. the one given by the "y" training sample for the input (the "x" training sample). y' is the prediction, the response the network with its current weights and biases generates. Now, because the expected response is either 0 or 1 the cost for a single category and a single training sample can be written
-log (y') if y = 1
-log(1-y') if y = 0
because in the first case (1-y) is zero, so the second term vanishes and in the secondo case y is zero, so the first term vanishes.
One can now convince oneself that the cost is high if
the expected response y is 1 and the network prediction y' is close to zero
the expected response y is 0 and the network prediction y' is close to one
In other words the cost does its job in punishing wrong predictions. Now, if we drop the second term (1-y) log (1-y') half of this mechanism is gone. If the expected response is 1, a low prediction will still incur a cost, but if the expected response is 0, the cost will be zero, regardless of the prediction, in particular, a high prediction (or false positive) will go unpunished.
Now, because the total cost is a sum over all training samples, there are three possibilities.
all training samples prescribe that the class be zero:
then the cost will be completely independent of the predictions for this class and no learning can take place
some training samples put the class at zero, some at one:
then because "false negatives" or "misses" are still punished but false positives aren't the net will find the easiest way to minimize the cost which is to indiscriminately increase the prediction of the class for all samples
all training samples prescribe that the class be one:
essentially the same as in the second scenario will happen, only here it's no problem, because that is the correct behavior
And finally, why does it work if we use softmax instead of sigmoid? False positives will still be invisible. Now it is easy to see that the sum over all classes of the softmax is one. So I can only increase the prediction for one class if at least one other class is reduced to compensate. In particular, there can be no false positives without a false negative, and the false negative the cost will detect.
On how to get a binary prediction:
For binary expected responses rounding is indeed the appropriate procedure. For one-hot I'd rather find the largest value, set that to one and all others to zero. I've added a convenience function, predict, implementing that.
import numpy as np
from scipy import optimize as opt
from collections import namedtuple
# First, a few structures to keep ourselves organized
Problem_Size = namedtuple('Problem_Size', 'Out In Samples')
Data = namedtuple('Data', 'Out In')
Network = namedtuple('Network', 'w b activation cost gradient most_likely')
def get_dims(Out, In, transpose=False):
"""extract dimensions and ensure everything is 2d
return Data, Dims"""
# gracefully acccept lists etc.
Out, In = np.asanyarray(Out), np.asanyarray(In)
if transpose:
Out, In = Out.T, In.T
# if it's a single sample make sure it's n x 1
Out = Out[:, None] if len(Out.shape) == 1 else Out
In = In[:, None] if len(In.shape) == 1 else In
Dims = Problem_Size(Out.shape[0], *In.shape)
if Dims.Samples != Out.shape[1]:
raise ValueError("number of samples must be the same for Out and In")
return Data(Out, In), Dims
def sigmoid(z):
s = 1 / (1 + np.exp(-z))
return s
def sig_cost(Net, data):
A = process(data.In, Net)
logA = np.log(A)
return -(data.Out * logA + (1-data.Out) * (1-logA)).sum(axis=0).mean()
def sig_grad (Net, Dims, data):
A = process(data.In, Net)
return dict(dw = (A - data.Out) # data.In.T / Dims.Samples,
db = (A - data.Out).mean(axis=1, keepdims=True))
def sig_ml(z):
return np.round(z).astype(int)
def sof_ml(z):
hot = np.argmax(z, axis=0)
z = np.zeros(z.shape, dtype=int)
z[hot, np.arange(len(hot))] = 1
return z
def softmax(z):
z = z - z.max(axis=0, keepdims=True)
z = np.exp(z)
return z / z.sum(axis=0, keepdims=True)
def sof_cost(Net, data):
A = process(data.In, Net)
logA = np.log(A)
return -(data.Out * logA).sum(axis=0).mean()
sof_grad = sig_grad
def get_net(Dims, activation='softmax'):
activation, cost, gradient, ml = {
'sigmoid': (sigmoid, sig_cost, sig_grad, sig_ml),
'softmax': (softmax, sof_cost, sof_grad, sof_ml),
'hybrid': (sigmoid, sof_cost, None, sig_ml)}[activation]
return Network(w=np.zeros((Dims.Out, Dims.In)),
b=np.zeros((Dims.Out, 1)),
activation=activation, cost=cost, gradient=gradient,
most_likely=ml)
def process(In, Net):
return Net.activation(Net.w # In + Net.b)
def propagate(data, Dims, Net):
return Net.gradient(Net, Dims, data), Net.cost(Net, data)
def optimize_no_grad(Net, Dims, data):
def f(x):
Net.w[...] = x[:Net.w.size].reshape(Net.w.shape)
Net.b[...] = x[Net.w.size:].reshape(Net.b.shape)
return Net.cost(Net, data)
x = np.r_[Net.w.ravel(), Net.b.ravel()]
res = opt.minimize(f, x, options=dict(maxiter=10000)).x
Net.w[...] = res[:Net.w.size].reshape(Net.w.shape)
Net.b[...] = res[Net.w.size:].reshape(Net.b.shape)
def optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True):
w, b = Net.w, Net.b
costs = []
for i in range(num_iterations):
grads, cost = propagate(data, Dims, Net)
dw = grads["dw"]
db = grads["db"]
w -= learning_rate * dw
b -= learning_rate * db
if i % 100 == 0:
costs.append(cost)
if print_cost and i % 10000 == 0:
print(cost)
return grads, costs
def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False, activation='sigmoid'):
data, Dims = get_dims(Y_train, X_train, transpose=True)
Net = get_net(Dims, activation)
if Net.gradient is None:
optimize_no_grad(Net, Dims, data)
else:
grads, costs = optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True)
Y_prediction_train = process(data.In, Net)
print(Y_prediction_train)
print(data.Out)
print(Y_prediction_train.sum(axis=0))
print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - data.Out)) * 100))
return Net
def predict(In, Net, probability=False):
In = np.asanyarray(In)
is1d = In.ndim == 1
if is1d:
In = In.reshape(-1, 1)
Out = process(In, Net)
if not probability:
Out = Net.most_likely(Out)
if is1d:
Out = Out.reshape(-1)
return Out
def create_data(Dims):
Out = np.zeros((Dims.Out, Dims.Samples), dtype=int)
Out[np.random.randint(0, Dims.Out, (Dims.Samples,)), np.arange(Dims.Samples)] = 1
In = np.random.randint(0, 2, (Dims.In, Dims.Samples))
return Data(Out, In)
train_set_x = np.array([
[1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1]
])
train_set_y = np.array([
[1,0,0],[1,0,0],[0,0,1],[0,0,1]
])
Net1 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
Net2 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='softmax')
Net3 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='hybrid')
Dims = Problem_Size(8, 100, 50)
data = create_data(Dims)
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='softmax')
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='sigmoid')

Both the idea of how to fix the bug and how you can extend the implementation to classify between more classes can be solved with some dimensionality analysis.
I am assuming that you by classifying multiple examples mean multiple classes and not multiple samples, as we need multiple samples to train even for 2 classes.
Where N = number of samples, D = number of features, K = number of categories(with K=2 being a special case where one can reduce this down to one dimension,ie K=1 with y=0 signifying one class and y=1 the other). The data should have the following dimensions:
X: N * D #input
y: N * K #output
W: D * K #weights, also dW has same dimensions
b: 1 * K #bias, also db has same dimensions
#A should have same dimensions as y
The order of the dimensions can be switched around, as long as the dot products are done correctly.
First dealing with your bug: You are initializing W as N * K instead of D * K ie. in the binary case:
w = np.zeros((numberOfTrainingExamples , 1))
#instead of
w = np.zeros((numberOfFeatures , 1))
This means that the only time you are initializing W to correct dimensions is when y and X (coincidentally) have same dimensions.
This will mess with your dot products as well:
np.dot(X, w) # or np.dot(w.T,X.T) if you define y as [K * N] dimensions
#instead of
np.dot(w.T , X)
and
np.dot( X.T, ( A - Y ) ) #np.dot( X.T, ( A - Y ).T ) if y:[K * N]
#instead of
np.dot( X, ( A - Y ).T )
Also make sure that the cost function returns one number (ie. not an array).
Secondly going on to K>2 you need to make some changes. b is no longer a single number, but a vector (1D-array). y and W go from being 1D-array to 2D array. To avoid confusion and hard-to-find bugs it could be good to set K, N and D to different values

Neural Network seems to be getting stuck on a single output with each execution

I've created a neural network to estimate the sin(x) function for an input x. The network has 21 output neurons (representing numbers -1.0, -0.9, ..., 0.9, 1.0) with numpy that does not learn, as I think I implemented the neuron architecture incorrectly when I defined the feedforward mechanism.
When I execute the code, the amount of test data it estimates correctly sits around 48/1000. This happens to be the average data point count per category if you split 1000 test data points between 21 categories. Looking at the network output, you can see that the network seems to just start picking a single output value for every input. For example, it may pick -0.5 as the estimate for y regardless of the x you give it. Where did I go wrong here? This is my first network. Thank you!
import random
import numpy as np
import math
class Network(object):
def __init__(self,inputLayerSize,hiddenLayerSize,outputLayerSize):
#Create weight vector arrays to represent each layer size and initialize indices randomly on a Gaussian distribution.
self.layer1 = np.random.randn(hiddenLayerSize,inputLayerSize)
self.layer1_activations = np.zeros((hiddenLayerSize, 1))
self.layer2 = np.random.randn(outputLayerSize,hiddenLayerSize)
self.layer2_activations = np.zeros((outputLayerSize, 1))
self.outputLayerSize = outputLayerSize
self.inputLayerSize = inputLayerSize
self.hiddenLayerSize = hiddenLayerSize
# print(self.layer1)
# print()
# print(self.layer2)
# self.weights = [np.random.randn(y,x)
# for x, y in zip(sizes[:-1], sizes[1:])]
def feedforward(self, network_input):
#Propogate forward through network as if doing this by hand.
#first layer's output activations:
for neuron in range(self.hiddenLayerSize):
self.layer1_activations[neuron] = 1/(1+np.exp(network_input * self.layer1[neuron]))
#second layer's output activations use layer1's activations as input:
for neuron in range(self.outputLayerSize):
for weight in range(self.hiddenLayerSize):
self.layer2_activations[neuron] += self.layer1_activations[weight]*self.layer2[neuron][weight]
self.layer2_activations[neuron] = 1/(1+np.exp(self.layer2_activations[neuron]))
#convert layer 2 activation numbers to a single output. The neuron (weight vector) with highest activation will be output.
outputs = [x / 10 for x in range(-int((self.outputLayerSize/2)), int((self.outputLayerSize/2))+1, 1)] #range(-10, 11, 1)
return(outputs[np.argmax(self.layer2_activations)])
def train(self, training_pairs, epochs, minibatchsize, learn_rate):
#apply gradient descent
test_data = build_sinx_data(1000)
for epoch in range(epochs):
random.shuffle(training_pairs)
minibatches = [training_pairs[k:k + minibatchsize] for k in range(0, len(training_pairs), minibatchsize)]
for minibatch in minibatches:
loss = 0 #calculate loss for each minibatch
#Begin training
for x, y in minibatch:
network_output = self.feedforward(x)
loss += (network_output - y) ** 2
#adjust weights by abs(loss)*sigmoid(network_output)*(1-sigmoid(network_output)*learn_rate
loss /= (2*len(minibatch))
adjustWeights = loss*(1/(1+np.exp(-network_output)))*(1-(1/(1+np.exp(-network_output))))*learn_rate
self.layer1 += adjustWeights
#print(adjustWeights)
self.layer2 += adjustWeights
#when line 63 placed here, results did not improve during minibatch.
print("Epoch {0}: {1}/{2} correct".format(epoch, self.evaluate(test_data), len(test_data)))
print("Training Complete")
def evaluate(self, test_data):
"""
Returns number of test inputs which network evaluates correctly.
The ouput assumed to be neuron in output layer with highest activation
:param test_data: test data set identical in form to train data set.
:return: integer sum
"""
correct = 0
for x, y in test_data:
output = self.feedforward(x)
if output == y:
correct+=1
return(correct)
def build_sinx_data(data_points):
"""
Creates a list of tuples (x value, expected y value) for Sin(x) function.
:param data_points: number of desired data points
:return: list of tuples (x value, expected y value
"""
x_vals = []
y_vals = []
for i in range(data_points):
#parameter of randint signifies range of x values to be used*10
x_vals.append(random.randint(-2000,2000)/10)
y_vals.append(round(math.sin(x_vals[i]),1))
return (list(zip(x_vals,y_vals)))
# training_pairs, epochs, minibatchsize, learn_rate
sinx_test = Network(1,21,21)
print(sinx_test.feedforward(10))
sinx_test.train(build_sinx_data(600),20,10,2)
print(sinx_test.feedforward(10))

I didn't examine thoroughly all of your code, but some issues are clearly visible:
* operator doesn't perform matrix multiplication in numpy, you have to use numpy.dot. This affects, for instance, these lines: network_input * self.layer1[neuron], self.layer1_activations[weight]*self.layer2[neuron][weight], etc.
Seems like you are solving your problem via classification (selecting 1 out of 21 classes), but using L2 loss. This is somewhat mixed up. You have two options: either stick to classification and use a cross entropy loss function, or perform regression (i.e. predict the numeric value) with L2 loss.
You should definitely extract sigmoid function to avoid writing the same expression all over again:
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def sigmoid_derivative(x):
return sigmoid(x) * (1 - sigmoid(x))
You perform the same update of self.layer1 and self.layer2, which clearly wrong. Take some time analyzing how exactly backpropagation works.

I edited how my loss function was integrated into my function and also correctly implemented gradient descent. I also removed the use of mini-batches and simplified what my network was trying to do. I now have a network which attempts to classify something as even or odd.
Some extremely helpful guides I used to fix things up:
Chapter 1 and 2 of Neural Networks and Deep Learning, by Michael Nielsen, available for free at http://neuralnetworksanddeeplearning.com/chap1.html . This book gives thorough explanations for how Neural Nets work, including breakdowns of the math behind their execution.
Backpropagation from the Beginning, by Erik Hallström, linked by Maxim. https://medium.com/#erikhallstrm/backpropagation-from-the-beginning-77356edf427d
. Not as thorough as the above guide, but I kept both open concurrently, as this guide is more to the point about what is important and how to apply the mathematical formulas that are thoroughly explained in Nielsen's book.
How to build a simple neural network in 9 lines of Python code https://medium.com/technology-invention-and-more/how-to-build-a-simple-neural-network-in-9-lines-of-python-code-cc8f23647ca1
. A useful and fast introduction to some neural networking basics.
Here is my (now functioning) code:
import random
import numpy as np
import scipy
import math
class Network(object):
def __init__(self,inputLayerSize,hiddenLayerSize,outputLayerSize):
#Layers represented both by their weights array and activation and inputsums vectors.
self.layer1 = np.random.randn(hiddenLayerSize,inputLayerSize)
self.layer2 = np.random.randn(outputLayerSize,hiddenLayerSize)
self.layer1_activations = np.zeros((hiddenLayerSize, 1))
self.layer2_activations = np.zeros((outputLayerSize, 1))
self.layer1_inputsums = np.zeros((hiddenLayerSize, 1))
self.layer2_inputsums = np.zeros((outputLayerSize, 1))
self.layer1_errorsignals = np.zeros((hiddenLayerSize, 1))
self.layer2_errorsignals = np.zeros((outputLayerSize, 1))
self.layer1_deltaw = np.zeros((hiddenLayerSize, inputLayerSize))
self.layer2_deltaw = np.zeros((outputLayerSize, hiddenLayerSize))
self.outputLayerSize = outputLayerSize
self.inputLayerSize = inputLayerSize
self.hiddenLayerSize = hiddenLayerSize
print()
print(self.layer1)
print()
print(self.layer2)
print()
# self.weights = [np.random.randn(y,x)
# for x, y in zip(sizes[:-1], sizes[1:])]
def feedforward(self, network_input):
#Calculate inputsum and and activations for each neuron in the first layer
for neuron in range(self.hiddenLayerSize):
self.layer1_inputsums[neuron] = network_input * self.layer1[neuron]
self.layer1_activations[neuron] = self.sigmoid(self.layer1_inputsums[neuron])
# Calculate inputsum and and activations for each neuron in the second layer. Notice that each neuron in the second layer represented by
# weights vector, consisting of all weights leading out of the kth neuron in (l-1) layer to the jth neuron in layer l.
self.layer2_inputsums = np.zeros((self.outputLayerSize, 1))
for neuron in range(self.outputLayerSize):
for weight in range(self.hiddenLayerSize):
self.layer2_inputsums[neuron] += self.layer1_activations[weight]*self.layer2[neuron][weight]
self.layer2_activations[neuron] = self.sigmoid(self.layer2_inputsums[neuron])
return self.layer2_activations
def interpreted_output(self, network_input):
#convert layer 2 activation numbers to a single output. The neuron (weight vector) with highest activation will be output.
self.feedforward(network_input)
outputs = [x / 10 for x in range(-int((self.outputLayerSize/2)), int((self.outputLayerSize/2))+1, 1)] #range(-10, 11, 1)
return(outputs[np.argmax(self.layer2_activations)])
# def build_expected_output(self, training_data):
# #Views expected output number y for each x to generate an expected output vector from the network
# index=0
# for pair in training_data:
# expected_output_vector = np.zeros((self.outputLayerSize,1))
# x = training_data[0]
# y = training_data[1]
# for i in range(-int((self.outputLayerSize / 2)), int((self.outputLayerSize / 2)) + 1, 1):
# if y == i / 10:
# expected_output_vector[i] = 1
# #expect the target category to be a 1.
# break
# training_data[index][1] = expected_output_vector
# index+=1
# return training_data
def train(self, training_data, learn_rate):
self.backpropagate(training_data, learn_rate)
def backpropagate(self, train_data, learn_rate):
#Perform for each x,y pair.
for datapair in range(len(train_data)):
x = train_data[datapair][0]
y = train_data[datapair][1]
self.feedforward(x)
# print("l2a " + str(self.layer2_activations))
# print("l1a " + str(self.layer1_activations))
# print("l2 " + str(self.layer2))
# print("l1 " + str(self.layer1))
for neuron in range(self.outputLayerSize):
#Calculate first error equation for error signals of output layer neurons
self.layer2_errorsignals[neuron] = (self.layer2_activations[neuron] - y[neuron]) * self.sigmoid_prime(self.layer2_inputsums[neuron])
#Use recursive formula to calculate error signals of hidden layer neurons
self.layer1_errorsignals = np.multiply(np.array(np.matrix(self.layer2.T) * np.matrix(self.layer2_errorsignals)) , self.sigmoid_prime(self.layer1_inputsums))
#print(self.layer1_errorsignals)
# for neuron in range(self.hiddenLayerSize):
# #Use recursive formula to calculate error signals of hidden layer neurons
# self.layer1_errorsignals[neuron] = np.multiply(self.layer2[neuron].T,self.layer2_errorsignals[neuron]) * self.sigmoid_prime(self.layer1_inputsums[neuron])
#Partial derivative of C with respect to weight for connection from kth neuron in (l-1)th layer to jth neuron in lth layer is
#(jth error signal in lth layer) * (kth activation in (l-1)th layer.)
#Update all weights for network at each iteration of a training pair.
#Update weights in second layer
for neuron in range(self.outputLayerSize):
for weight in range(self.hiddenLayerSize):
self.layer2_deltaw[neuron][weight] = self.layer2_errorsignals[neuron]*self.layer1_activations[weight]*(-learn_rate)
self.layer2 += self.layer2_deltaw
#Update weights in first layer
for neuron in range(self.hiddenLayerSize):
self.layer1_deltaw[neuron] = self.layer1_errorsignals[neuron]*(x)*(-learn_rate)
self.layer1 += self.layer1_deltaw
#Comment/Uncomment to enable error evaluation.
#print("Epoch {0}: Error: {1}".format(datapair, self.evaluate(test_data)))
# print("l2a " + str(self.layer2_activations))
# print("l1a " + str(self.layer1_activations))
# print("l1 " + str(self.layer1))
# print("l2 " + str(self.layer2))
def evaluate(self, test_data):
error = 0
for x, y in test_data:
#x is integer, y is single element np.array
output = self.feedforward(x)
error += y - output
return error
#eval function for sin(x)
# def evaluate(self, test_data):
# """
# Returns number of test inputs which network evaluates correctly.
# The ouput assumed to be neuron in output layer with highest activation
# :param test_data: test data set identical in form to train data set.
# :return: integer sum
# """
# correct = 0
# for x, y in test_data:
# outputs = [x / 10 for x in range(-int((self.outputLayerSize / 2)), int((self.outputLayerSize / 2)) + 1,
# 1)] # range(-10, 11, 1)
# newy = outputs[np.argmax(y)]
# output = self.interpreted_output(x)
# #print("output: " + str(output))
# if output == newy:
# correct+=1
# return(correct)
def sigmoid(self, z):
return 1 / (1 + np.exp(-z))
def sigmoid_prime(self, z):
return (1 - self.sigmoid(z)) * self.sigmoid(z)
def build_simple_data(data_points):
x_vals = []
y_vals = []
for each in range(data_points):
x = random.randint(-3,3)
expected_output_vector = np.zeros((1, 1))
if x > 0:
expected_output_vector[[0]] = 1
else:
expected_output_vector[[0]] = 0
x_vals.append(x)
y_vals.append(expected_output_vector)
print(list(zip(x_vals,y_vals)))
print()
return (list(zip(x_vals,y_vals)))
simpleNet = Network(1, 3, 1)
# print("Pretest")
# print(simpleNet.feedforward(-3))
# print(simpleNet.feedforward(10))
# init_weights_l1 = simpleNet.layer1
# init_weights_l2 = simpleNet.layer2
# simpleNet.train(build_simple_data(10000),.1)
# #sometimes Error converges to 0, sometimes error converges to 10.
# print("Initial Weights:")
# print(init_weights_l1)
# print(init_weights_l2)
# print("Final Weights")
# print(simpleNet.layer1)
# print(simpleNet.layer2)
# print("Post-test")
# print(simpleNet.feedforward(-3))
# print(simpleNet.feedforward(10))
def test_network(iterations,net,training_points):
"""
Casually evaluates pre and post test
:param iterations: number of trials to be run
:param net: name of network to evaluate.
;param training_points: size of training data to be used
:return: four 1x1 arrays.
"""
pretest_negative = 0
pretest_positive = 0
posttest_negative = 0
posttest_positive = 0
for each in range(iterations):
pretest_negative += net.feedforward(-10)
pretest_positive += net.feedforward(10)
net.train(build_simple_data(training_points),.1)
for each in range(iterations):
posttest_negative += net.feedforward(-10)
posttest_positive += net.feedforward(10)
return(pretest_negative/iterations, pretest_positive/iterations, posttest_negative/iterations, posttest_positive/iterations)
print(test_network(10000, simpleNet, 10000))
While much differs between this code and the code posted in the OP, there is a particular difference that is interesting. In the original feedforward method notice
#second layer's output activations use layer1's activations as input:
for neuron in range(self.outputLayerSize):
for weight in range(self.hiddenLayerSize):
self.layer2_activations[neuron] += self.layer1_activations[weight]*self.layer2[neuron][weight]
self.layer2_activations[neuron] = 1/(1+np.exp(self.layer2_activations[neuron]))
The line
self.layer2_activations[neuron] += self.layer1_activations[weight]*self.layer2[neuron][weight]
Resembles
self.layer2_inputsums[neuron] += self.layer1_activations[weight]*self.layer2[neuron][weight]
In the updated code. This line performs the dot product between each weight vector and each input vector (the activations from layer 1) to arrive at the input_sum for a neuron, commonly referred to as z (think sigmoid(z)). In my network, the derivative of the sigmoid function, sigmoid_prime, is used to calculate the gradient of the cost function with respect to all the weights. By multiplying sigmoid_prime(z) * network error between actual and expected output. If z is very big (and positive), the neuron will have an activation value very close to 1. That means that the network is confident that that neuron should be activating. The same is true if z is very negative. The network, then, doesn't want to radically adjust weights that it is happy with, so the scale of the change in each weight for a neuron is given by the gradient of sigmoid(z), sigmoid_prime(z). Very large z means very small gradient and very small change applied to weights (the gradient of sigmoid is maximized at z = 0, when the network is unconfident about how a neuron should be categorized and when the activation for that neuron is 0.5).
Since I was continually adding on to each neuron's input_sum (z) and never resetting the value for new inputs of dot(weights, activations), the value for z kept growing, continually slowing the rate of change for the weights until weight modification grew to a standstill. I added the following line to cope with this:
self.layer2_inputsums = np.zeros((self.outputLayerSize, 1))
The new posted network can be copy and pasted into an editor and executed so long as you have the numpy module installed. The final line of output to print will be a list of 4 arrays representing final network output. The first two are the pretest values for a negative and positive input, respectively. These should be random. The second two are post-test values to determine how well the network classifies as positive and negative number. A number near 0 denotes negative, near 1 denotes positive.

Calculate optimal input of a neural network with theano, by using gradient descent w.r.t. inputs

I have implemented and trained a neural network with Theano of k binary inputs (0,1), one hidden layer and one unit in the output layer. Once it has been trained I want to obtain inputs that maximizes the output (e.g. x which makes unit of output layer closest to 1). So far I haven't found an implementation of it, so I am trying the following approach:
Train network => obtain trained weights (theta1, theta2)
Define the neural network function with x as input and trained theta1, theta2 as fixed parameters. That is: f(x) = sigmoid( theta1*(sigmoid (theta2*x ))). This function takes x and with given trained weights (theta1, theta2) gives output between 0 and 1.
Apply gradient descent w.r.t. x on the neural network function f(x) and obtain x that maximizes f(x) with theta1 and theta2 given.
For these I have implemented the following code with a toy example (k = 2). Based on the tutorial on http://outlace.com/Beginner-Tutorial-Theano/ but changed vector y, so that there is only one combination of inputs that gives f(x) ~ 1 which is x = [0, 1].
Edit1: As suggested optimizer was set to None and bias unit was fixed to 1.
Step 1: Train neural network. This runs well and with out error.
import os
os.environ["THEANO_FLAGS"] = "optimizer=None"
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
x = T.dvector()
y = T.dscalar()
def layer(x, w):
b = np.array([1], dtype=theano.config.floatX)
new_x = T.concatenate([x, b])
m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
h = nnet.sigmoid(m)
return h
def grad_desc(cost, theta):
alpha = 0.1 #learning rate
return theta - (alpha * T.grad(cost, wrt=theta))
in_units = 2
hid_units = 3
out_units = 1
theta1 = theano.shared(np.array(np.random.rand(in_units + 1, hid_units), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(hid_units + 1, out_units), dtype=theano.config.floatX))
hid1 = layer(x, theta1) #hidden layer
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
(theta1, grad_desc(fc, theta1)),
(theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 0, 0, 0]) #training data Y
cur_cost = 0
for i in range(5000):
for k in range(len(inputs)):
cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
print(run_forward([0,1]))
Output of run forward for [0,1] is: 0.968905860574.
We can also get values of weights with theta1.get_value() and theta2.get_value()
Step 2: Define neural network function f(x). Trained weights (theta1, theta2) are constant parameters of this function.
Things get a little trickier here because of the bias unit, which is part of he vector of inputs x. To do this I concatenate b and x. But the code now runs well.
b = np.array([[1]], dtype=theano.config.floatX)
#b_sh = theano.shared(np.array([[1]], dtype=theano.config.floatX))
rand_init = np.random.rand(in_units, 1)
rand_init[0] = 1
x_sh = theano.shared(np.array(rand_init, dtype=theano.config.floatX))
th1 = T.dmatrix()
th2 = T.dmatrix()
nn_hid = T.nnet.sigmoid( T.dot(th1, T.concatenate([x_sh, b])) )
nn_predict = T.sum( T.nnet.sigmoid( T.dot(th2, T.concatenate([nn_hid, b]))))
Step 3:
Problem is now in gradient descent as is not limited to values between 0 and 1.
fc2 = (nn_predict - 1)**2
cost3 = theano.function(inputs=[th1, th2], outputs=fc2, updates=[
(x_sh, grad_desc(fc2, x_sh))])
run_forward = theano.function(inputs=[th1, th2], outputs=nn_predict)
cur_cost = 0
for i in range(10000):
cur_cost = cost3(theta1.get_value().T, theta2.get_value().T) #call our Theano-compiled cost function, it will auto update weights
if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
print('Cost: %s' % (cur_cost,))
print x_sh.get_value()
The last iteration prints:
Cost: 0.000220317356533
[[-0.11492753]
[ 1.99729555]]
Furthermore input 1 keeps becoming more negative and input 2 increases, while the optimal solution is [0, 1]. How can this be fixed?

You are adding b=[1] via broadcasting rules as opposed to concatenating it. Also, once you concatenate it, your x_sh has one dimension to many which is why the error occurs at nn_predict and not nn_hid

Convergence of LSTM network using Tensorflow

I am trying to detect micro-events in a long time series. For this purpose, I will train a LSTM network.
Data. Input for each time sample is 11 different features somewhat normalized to fit 0-1. Output will be either one of two classes.
Batching. Due to huge class imbalance I have extracted the data in batches of each 60 time samples, of which at least 5 will always be class 1, and the rest class to. In this way the class imbalance is reduced from 150:1 to around 12:1 I have then randomized the order of all my batches.
Model. I am attempting to train an LSTM, with initial configuration of 3 different cells with 5 delay steps. I expect the micro events to arrive in sequences of at least 3 time steps.
Problem: When I try to train the network it will quickly converge towards saying that EVERYTHING belongs to the majority class. When I implement a weighted loss function, at some certain threshold it will change to saying that EVERYTHING belongs to the minority class. I suspect (without being expert) that there is no learning in my LSTM cells, or that my configuration is off?
Below is the code for my implementation. I am hoping that someone can tell me
Is my implementation correct?
What other reasons could there be for such behaviour?
ar_model.py
import numpy as np
import tensorflow as tf
from tensorflow.models.rnn import rnn
import ar_config
config = ar_config.get_config()
class ARModel(object):
def __init__(self, is_training=False, config=None):
# Config
if config is None:
config = ar_config.get_config()
# Placeholders
self._features = tf.placeholder(tf.float32, [None, config.num_features], name='ModelInput')
self._targets = tf.placeholder(tf.float32, [None, config.num_classes], name='ModelOutput')
# Hidden layer
with tf.variable_scope('lstm') as scope:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_delays)
self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
outputs, state = rnn.rnn(cell, [self._features], dtype=tf.float32)
# Output layer
output = outputs[-1]
softmax_w = tf.get_variable('softmax_w', [config.num_hidden, config.num_classes], tf.float32)
softmax_b = tf.get_variable('softmax_b', [config.num_classes], tf.float32)
logits = tf.matmul(output, softmax_w) + softmax_b
# Evaluate
ratio = (60.00 / 5.00)
class_weights = tf.constant([ratio, 1 - ratio])
weighted_logits = tf.mul(logits, class_weights)
loss = tf.nn.softmax_cross_entropy_with_logits(weighted_logits, self._targets)
self._cost = cost = tf.reduce_mean(loss)
self._predict = tf.argmax(tf.nn.softmax(logits), 1)
self._correct = tf.equal(tf.argmax(logits, 1), tf.argmax(self._targets, 1))
self._accuracy = tf.reduce_mean(tf.cast(self._correct, tf.float32))
self._final_state = state
if not is_training:
return
# Optimize
optimizer = tf.train.AdamOptimizer()
self._train_op = optimizer.minimize(cost)
#property
def features(self):
return self._features
#property
def targets(self):
return self._targets
#property
def cost(self):
return self._cost
#property
def accuracy(self):
return self._accuracy
#property
def train_op(self):
return self._train_op
#property
def predict(self):
return self._predict
#property
def initial_state(self):
return self._initial_state
#property
def final_state(self):
return self._final_state
ar_train.py
import os
from datetime import datetime
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import ar_network
import ar_config
import ar_reader
config = ar_config.get_config()
def main(argv=None):
if gfile.Exists(config.train_dir):
gfile.DeleteRecursively(config.train_dir)
gfile.MakeDirs(config.train_dir)
train()
def train():
train_data = ar_reader.ArousalData(config.train_data, num_steps=config.max_steps)
test_data = ar_reader.ArousalData(config.test_data, num_steps=config.max_steps)
with tf.Graph().as_default(), tf.Session() as session, tf.device('/cpu:0'):
initializer = tf.random_uniform_initializer(minval=-0.1, maxval=0.1)
with tf.variable_scope('model', reuse=False, initializer=initializer):
m = ar_network.ARModel(is_training=True)
s = tf.train.Saver(tf.all_variables())
tf.initialize_all_variables().run()
for batch_input, batch_target in train_data:
step = train_data.iter_steps
dict = {
m.features: batch_input,
m.targets: batch_target
}
session.run(m.train_op, feed_dict=dict)
state, cost, accuracy = session.run([m.final_state, m.cost, m.accuracy], feed_dict=dict)
if not step % 10:
test_input, test_target = test_data.next()
test_accuracy = session.run(m.accuracy, feed_dict={
m.features: test_input,
m.targets: test_target
})
now = datetime.now().time()
print ('%s | Iter %4d | Loss= %.5f | Train= %.5f | Test= %.3f' % (now, step, cost, accuracy, test_accuracy))
if not step % 1000:
destination = os.path.join(config.train_dir, 'ar_model.ckpt')
s.save(session, destination)
if __name__ == '__main__':
tf.app.run()
ar_config.py
class Config(object):
# Directories
train_dir = '...'
ckpt_dir = '...'
train_data = '...'
test_data = '...'
# Data
num_features = 13
num_classes = 2
batch_size = 60
# Model
num_hidden = 3
num_delays = 5
# Training
max_steps = 100000
def get_config():
return Config()
UPDATED ARCHITECTURE:
# Placeholders
self._features = tf.placeholder(tf.float32, [None, config.num_features, config.num_delays], name='ModelInput')
self._targets = tf.placeholder(tf.float32, [None, config.num_output], name='ModelOutput')
# Weights
weights = {
'hidden': tf.get_variable('w_hidden', [config.num_features, config.num_hidden], tf.float32),
'out': tf.get_variable('w_out', [config.num_hidden, config.num_classes], tf.float32)
}
biases = {
'hidden': tf.get_variable('b_hidden', [config.num_hidden], tf.float32),
'out': tf.get_variable('b_out', [config.num_classes], tf.float32)
}
#Layer in
with tf.variable_scope('input_hidden') as scope:
inputs = self._features
inputs = tf.transpose(inputs, perm=[2, 0, 1]) # (BatchSize,NumFeatures,TimeSteps) -> (TimeSteps,BatchSize,NumFeatures)
inputs = tf.reshape(inputs, shape=[-1, config.num_features]) # (TimeSteps,BatchSize,NumFeatures -> (TimeSteps*BatchSize,NumFeatures)
inputs = tf.add(tf.matmul(inputs, weights['hidden']), biases['hidden'])
#Layer hidden
with tf.variable_scope('hidden_hidden') as scope:
inputs = tf.split(0, config.num_delays, inputs) # -> n_steps * (batchsize, features)
cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
outputs, state = rnn.rnn(cell, inputs, dtype=tf.float32)
#Layer out
with tf.variable_scope('hidden_output') as scope:
output = outputs[-1]
logits = tf.add(tf.matmul(output, weights['out']), biases['out'])

Odd elements
Weighted loss
I am not sure your "weighted loss" does what you want it to do:
ratio = (60.00 / 5.00)
class_weights = tf.constant([ratio, 1 - ratio])
weighted_logits = tf.mul(logits, class_weights)
this is applied before calculating the loss function (further I think you wanted an element-wise multiplication as well? also your ratio is above 1 which makes the second part negative?) so it forces your predictions to behave in a certain way before applying the softmax.
If you want weighted loss you should apply this after
loss = tf.nn.softmax_cross_entropy_with_logits(weighted_logits, self._targets)
with some element-wise multiplication of your weights.
loss = loss * weights
Where your weights have a shape like [2,]
However, I would not recommend you to use weighted losses. Perhaps try increasing the ratio even further than 1:6.
Architecture
As far as I can read, you are using 5 stacked LSTMs with 3 hidden units per layer?
Try removing the multi rnn and just use a single LSTM/GRU (maybe even just a vanilla RNN) and jack the hidden units up to ~100-1000.
Debugging
Often when you are facing problems with an odd behaving network, it can be a good idea to:
Print everything
Literally print the shapes and values of every tensor in your model, use sess to fetch it and then print it. Your input data, the first hidden representation, your predictions, your losses etc.
You can also use tensorflows tf.Print() x_tensor = tf.Print(x_tensor, [tf.shape(x_tensor)])
Use tensorboard
Using tensorboard summaries on your gradients, accuracy metrics and histograms will reveal patterns in your data that might explain certain behavior, such as what lead to exploding weights. Like maybe your forget bias goes to infinity or your not tracking gradient through a certain layer etc.
Other questions
How large is your dataset?
How long are your sequences?
Are the 13 features categorical or continuous? You should not normalize categorical variables or represent them as integers, instead you should use one-hot encoding.

Gunnar has already made lots of good suggestions. A few more small things worth paying attention to in general for this sort of architecture:
Try tweaking the Adam learning rate. You should determine the proper learning rate by cross-validation; as a rough start, you could just check whether a smaller learning rate saves your model from crashing on the training data.
You should definitely use more hidden units. It's cheap to try larger networks when you first start out on a dataset. Go as large as necessary to avoid the underfitting you've observed. Later you can regularize / pare down the network after you get it to learn something useful.
Concretely, how long are the sequences you are passing into the network? You say you have a 30k-long time sequence.. I assume you are passing in subsections / samples of this sequence?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Weird gradient results with recurrent layers - python

The problem was coming from non-intuitive (maybe) effect of gradient_steps clipping in BPTT as explained here: https://groups.google.com/forum/#!topic/theano-users/QNge6fC6C4s

Related

Tensorflow: Simple 3D Convnet not learning

Modify neural net to classify single example

Neural Network seems to be getting stuck on a single output with each execution

Calculate optimal input of a neural network with theano, by using gradient descent w.r.t. inputs

Convergence of LSTM network using Tensorflow

Categories

Resources