I tried to solve Experiment 3a described in the original LSTM paper (http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf) with a TensorFlow LSTM and failed.
From the paper: The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2.
The net architecture described in the paper:
"We use a 3-layer net with 1 input unit, 1 output unit, and 3 cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells and gate units, and have bias weights. Gate units and output unit are logistic sigmoid in [0; 1], h in [-1; 1], and g in [-2; 2]"
I tried to reproduce it with an LSTM with 3 hidden units for T=100 and N=3, but failed.
I used online training (i.e. updating the weights after each sequence), as described in the original paper.
The core of my code was as follows:
self.batch_size = batch_size = config.batch_size
hidden_size = 3
self._input_data = tf.placeholder(tf.float32, (1, T))
self._targets = tf.placeholder(tf.float32, [1, 1])
lstm_cell = rnn_cell.BasicLSTMCell(hidden_size, forget_bias=1.0)
cell = rnn_cell.MultiRNNCell([lstm_cell] * 1)
self._initial_state = cell.zero_state(1, tf.float32)
weights_hidden = tf.constant(1.0, shape=[config.num_features, config.n_hidden])

# prepare the input
inputs = []
for k in range(num_steps):
    nextitem = tf.matmul(tf.reshape(self._input_data[:, k], [1, 1]), weights_hidden)
    inputs.append(nextitem)

outputs, states = rnn.rnn(cell, inputs, initial_state=self._initial_state)

# use the last output
pred = tf.sigmoid(tf.matmul(outputs[-1], tf.get_variable("weights_out", [config.n_hidden, 1])) + tf.get_variable("bias_out", [1]))
self._final_state = states[-1]
self._cost = cost = tf.reduce_mean(tf.square(pred - self.targets))
self._result = tf.abs(pred[0, 0] - self.targets[0, 0])
optimizer = tf.train.GradientDescentOptimizer(learning_rate=config.learning_rate).minimize(cost)
Any idea why it couldn't learn?
My first instinct was to create two outputs, one for each class, but the paper specifically mentions only one output unit.
Thanks
It seems that I needed forget_bias > 1.0. For long sequences the network couldn't learn with the default forget_bias; for T=50, for example, I needed forget_bias = 2.1.
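For reference, the only change relative to the code above was the forget bias (a minimal sketch, using the value that worked for T=50):
# a larger forget bias keeps the cell state alive over long sequences
lstm_cell = rnn_cell.BasicLSTMCell(hidden_size, forget_bias=2.1)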
This is the classic visualization of the perceptron learning model with one neuron. Let's say that I'd like to use 3 or 5 neurons for training; can I do that without a hidden layer? I just can't picture it in my head. Here is the code:
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_derivative(x):
    # x here is the tanh output, so the derivative is 1 - x**2
    return 1 - x**2

# inputs
training_inputs = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,0,1],[1,1,0],[1,1,1]])

# outputs
training_outputs = np.array([[1,0,0,1,0,1,1,0]]).T

# 3 inputs, 1 output
synaptic_weights = 2 * np.random.random((3, 1)) - 1
print('Random weights: {}'.format(synaptic_weights))

for i in range(20000):
    input_layer = training_inputs
    outputs = tanh(np.dot(input_layer, synaptic_weights))
    error = training_outputs - outputs
    weight_adjust = error * tanh_derivative(outputs)
    synaptic_weights += np.dot(input_layer.T, weight_adjust)

print('After training Synaptic Weights: {}'.format(synaptic_weights))
print('\n')
print('After training Outputs:\n{}'.format(outputs))
If you have 3 neurons in the output layer, you have three outputs. This makes sense for some problems - imagine a color with RGB components.
The size of your input determines your number of input nodes; the size of your output determines your number of output nodes. Only the hidden layer sizes can be chosen freely. But any interesting network has at least one hidden layer.
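For illustration, here is a minimal sketch of the question's loop with three output neurons and no hidden layer; the three-column targets are made up purely to show the shapes:
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_derivative(y):
    # derivative expressed in terms of the tanh output y
    return 1 - y ** 2

training_inputs = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],
                            [1,0,0],[1,0,1],[1,1,0],[1,1,1]])
# hypothetical 3-column targets, one column per output neuron
training_outputs = np.array([[1,0,0],[0,1,0],[0,0,1],[1,1,0],
                             [0,1,1],[1,0,1],[1,1,1],[0,0,0]])

# 3 inputs -> 3 output neurons, so the weight matrix is 3x3
synaptic_weights = 2 * np.random.random((3, 3)) - 1

for _ in range(20000):
    outputs = tanh(np.dot(training_inputs, synaptic_weights))   # shape (8, 3)
    error = training_outputs - outputs
    synaptic_weights += np.dot(training_inputs.T, error * tanh_derivative(outputs))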
I was trying to do a pretty simple thing: train an LSTM that takes a sequence of random numbers and outputs their sum. But after some hours without it converging, I decided to ask here which of my premises doesn't work.
The idea is simple:
I generate a training set of fixed-length sequences of random numbers and label each with its sum (the numbers are drawn from a normal distribution)
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input
Intuitively the LSTM should learn to set the weights of the input gate to 1 (bias 0), the weights of the forget gate to 0 (bias 1), and the weights of the output gate to 1 (bias 0), and learn to add these numbers, but it doesn't. I'm pasting the code I use; I tried different learning rates, optimizers, and batching, and observed the gradients and the outputs, but I can't find the exact reason why it is failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np

tf.enable_eager_execution()

def generate_sequences(n_samples, seq_len):
    # draw n_samples * seq_len values from a standard normal and label
    # each sequence with the sum of its elements
    total_shape = n_samples * seq_len
    random_values = np.random.randn(total_shape)
    random_values = random_values.reshape(n_samples, -1)
    targets = np.sum(random_values, axis=1)
    return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
lr = 0.1
epochs = n_samples
batch_size = 1
input_shape = 1

data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(
    tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None),
    input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)

for i in range(epochs):
    my_inp, target = iterator.get_next()
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(my_inp)
        my_out = output(tf.reshape(my_inp, shape=(batch_size, seq_len, 1)))
        loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)), 1) / batch_size
    grads = tape.gradient(loss, output.trainable_variables)
    optimizer.apply_gradients(zip(grads, output.trainable_variables),
                              global_step=tf.train.get_or_create_global_step())
I also had a conjecture that this is a theoretical problem: if we sum random values drawn from a normal distribution, the output is not in the [-1, 1] range, and the LSTM, due to the tanh activations, can't learn it. But changing them didn't improve the performance (they are changed to linear in the example code).
EDIT:
Setting the activations to linear, I realised that the tanh() squashes the values.
SOLVED:
The problem was actually the tanh() of the gates and recurrent states, which was squashing my outputs and not allowing them to grow by adding up the summands. Setting all activations to linear works fine.
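As a quick standalone illustration of the squashing effect (plain numpy, separate from the training code): tanh is bounded in (-1, 1), so an output passed through it can never represent a sum much larger than 1.
import numpy as np

# tanh saturates: even a large accumulated sum is squashed into (-1, 1)
for s in [0.5, 2.0, 5.0, 20.0]:
    print(s, np.tanh(s))
# 0.5 -> 0.462..., 2.0 -> 0.964..., 5.0 -> 0.9999..., 20.0 -> ~1.0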
I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses the wavelet transform (WT) and a residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). The Xavier method suggests initializing the variables from a normal distribution with
stddev = sqrt(2 / (filter_size * filter_size * num_filters))
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in detail. I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume the model() function returns the network mentioned in this paper and λ = 0.0001.
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])

# get the model output and weights for each conv
pred, weights = model()

# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
regularizers = 0.0
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)

learning_rate = tf.train.exponential_decay(???)  # Not sure if we can have a custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-pasted code from here and there, so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you).
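For example, a minimal sketch of the tf.get_variable route with a Xavier initializer (TF 1.x; the 3x3 filter shape and filter count here are only assumptions):
import tensorflow as tf

num_filters = 64  # hypothetical
# filter shape is [height, width, in_channels, out_channels]
weights = tf.get_variable(
    "conv1_weights",
    shape=[3, 3, 1, num_filters],
    initializer=tf.contrib.layers.xavier_initializer())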
Q2: Is it possible to set the learning rate in log scale?
You can, but do you need it? This is not the first thing that you need to solve in this network. Please check Q3.
However, just for reference, use the following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
First, you have not included the "pred" to "image" conversion in this message (based on the paper, you need to apply subtraction and IDWT to obtain the final image).
There is one problem here: the logits have to be calculated based on your label data, i.e. if you use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the Y : Label output to be used.
If your dataset's labelled images are denoised ones, then you need to follow this one:
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this as the result to calculate the rest. Instead of matmul, you can add a conv2d layer here without batch normalization or an activation function, and set the output feature count to 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is the L1 distance and it may suffer from some issues (check). Consider the following situation:
Let's say you have the following array as output: [10, 10, 10, 0, 0], and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For the same target, consider the following output: [6, 6, 6, 6, 6]. It still has a loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply a threshold of 5, you achieve 5/5 success. Hence, this is the case that we want.
If you use the L2 loss, for the first case you will have 10^2 + 10^2 = 200 as the loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, the optimizer will try to move away from the first case as quickly as possible to achieve global success rather than perfect success on some outputs and complete failure on the others. You can apply a loss function like this for that:
tf.reduce_mean(tf.nn.l2_loss(logits - image))
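For a quick check of the arithmetic above, a plain numpy comparison of the two example outputs:
import numpy as np

target = np.array([10., 10., 10., 10., 10.])
out_a = np.array([10., 10., 10., 0., 0.])   # case 1
out_b = np.array([6., 6., 6., 6., 6.])      # case 2

print(np.abs(out_a - target).sum(), np.abs(out_b - target).sum())        # 20.0 20.0  -> L1 cannot tell them apart
print(np.square(out_a - target).sum(), np.square(out_b - target).sum())  # 200.0 80.0 -> L2 prefers case 2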
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although it is missing an opening parenthesis). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
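If you do want the piecewise route mentioned above, here is a minimal sketch (the number of steps per epoch is a hypothetical value; the 1e-1 to 1e-4 values follow the paper's schedule):
import tensorflow as tf

steps_per_epoch = 1000  # hypothetical; set it to dataset size / batch size
global_step = tf.train.get_or_create_global_step()
boundaries = [steps_per_epoch, 2 * steps_per_epoch, 3 * steps_per_epoch]
values = [1e-1, 1e-2, 1e-3, 1e-4]  # one value per epoch, reduced in log scale
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
# pass learning_rate (and global_step) to MomentumOptimizer as in the question's code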
Q3: How to create the loss function described above?
Your loss function looks correct.
Other answers are very detailed and helpful. Here is a code example that uses a placeholder to decay the learning rate on a log scale. HTH.
import tensorflow as tf
import numpy as np

# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D, 1)
y = np.dot(x, w)
print y.shape

# modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size, 1])
W = tf.get_variable("w", shape=[D, 1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X, W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)

learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
    for i in range(100000):
        if i > 0 and i % N == 0:
            # epoch done, decrease learning rate by 2
            learning_rate /= 2
            print "Epoch completed. LR =", learning_rate
        idx = i/batch_size + i%batch_size
        f = {X: x[idx:idx+batch_size, :], Y: y[idx:idx+batch_size, :], lr: learning_rate}
        _, err = sess.run([train_op, mse], feed_dict=f)
        acc_err += err
        if i % 5000 == 0:
            print "Average error = {}".format(acc_err/5000)
            acc_err = 0.0
I'm trying to build a model from scratch that can classify MNIST images (handwritten digits). The model needs to output a list of probabilities representing how likely it is that the input image is a certain number.
This is the code I have so far:
from sklearn.datasets import load_digits
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

digits = load_digits()
features = digits.data
targets = digits.target

train_count = int(0.8 * len(features))
train_x = features[:train_count]
train_y = targets[:train_count]
test_x = features[train_count:]
test_y = targets[train_count:]

bias = np.random.rand()
weights = np.random.rand(len(features[0]))
rate = 0.02

for i in range(1000):
    for i, sample in enumerate(train_x):
        prod = np.dot(sample, weights) - bias
        soft = softmax(prod)
        predicted = np.argmax(soft) + 1
        error = predicted - train_y[i]
        weights -= error * rate * sample
        bias -= rate * error
        # print(error)
I'm trying to build the model so that it uses stochastic gradient descent but I'm a little confused as to what to pass to the softmax function. I understand it's supposed to expect a vector of numbers, but what I'm used to (when building a small NN) is that the model should produce one number, which is passed to an activation function, which in turn produces the prediction. Here, I feel like I'm missing a step and I don't know what it is.
In the simplest implementation, your last layer (just before softmax) should indeed output a 10-dim vector, which will be squeezed to [0, 1] by the softmax. This means that weights should be a matrix of shape [features, 10] and bias should be a [10] vector.
In addition to this, you should one-hot encode your train_y labels, i.e. convert each item to [0, 0, ..., 1, ..., 0] vector. The shape of train_y is thus [size, 10].
Take a look at logistic regression example - it's in tensorflow, but the model is likely to be similar to yours: they use 784 features (all pixels), one-hot encoding for labels and a single hidden layer. They also use mini-batches to speed-up learning.
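To make the shapes concrete, here is a minimal numpy sketch of a softmax-regression layer with a [features, 10] weight matrix and one-hot labels, reusing train_x, train_y, and rate from the question (one full-batch gradient step rather than per-sample SGD):
import numpy as np

n_features = train_x.shape[1]                  # 64 for the sklearn digits data
weights = 0.01 * np.random.rand(n_features, 10)
bias = np.zeros(10)

# one-hot encode the labels: shape [size, 10]
train_y_onehot = np.eye(10)[train_y]

logits = np.dot(train_x, weights) + bias                            # [size, 10]
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax

# with cross-entropy loss, the gradient w.r.t. the logits is (probs - one_hot)
grad_logits = probs - train_y_onehot
weights -= rate * np.dot(train_x.T, grad_logits) / len(train_x)
bias -= rate * grad_logits.mean(axis=0)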
I am now manually building a 1-layer neural network in Python without using packages like tensorflow. For this neural net, each input is a 500-dimensional one-hot encoded vector, and the output is a 3-dimensional vector representing the probability of each class.
The neural net works, but my problem is that the number of training instances is very large, slightly more than 1 million. And because I need to run at least 3 epochs, I cannot find an efficient way to store the weight matrices.
I tried using a 3-dimensional numpy random matrix and dictionaries to represent the weights and then perform the weight updates. The first dimension of the 3-d matrix is the number of training instances, and the latter two are the corresponding dimensions that match the dimension of each input and the hidden layer. Both methods work fine with small samples, but the program dies with the full sample.
# feature.shape[0] is the number of training samples, and feature.shape[1] is 500.
# d is the dimension of the hidden layer

# using 3-d matrices
w_1 = np.random.rand(feature.shape[0], d, feature.shape[1])
b_1 = np.random.rand(feature.shape[0], 1, d)
w_2 = np.random.rand(feature.shape[0], 3, d)
b_2 = np.random.rand(feature.shape[0], 1, 3)

# iterate through every training epoch
for iteration in range(epoch):
    correct, i = 0, 0
    # iterate through every training instance
    while i < feature.shape[0]:
        # net and out for hidden layer
        net1 = feature[i].toarray().flatten().dot(w_1[i].T) + b_1[i].flatten()
        h_1 = sigmoid(net1)
        # net and out for output
        y_hat = h_1.dot(w_2[i].T) + b_2[i].flatten()
        prob = softmax(y_hat)
        loss = squared_loss(label[i], prob)
        # backpropagation steps omitted here

# using dictionaries
w_1 = {i: np.random.rand(d, feature.shape[1]) for i in range(feature.shape[0])}
b_1 = {i: np.random.rand(d) for i in range(feature.shape[0])}
w_2 = {i: np.random.rand(3, d) for i in range(feature.shape[0])}
b_2 = {i: np.random.rand(3) for i in range(feature.shape[0])}

for iteration in range(epoch):
    correct, i = 0, 0
    while i < feature.shape[0]:
        # net and out for hidden layer
        net1 = feature[i].toarray().flatten().dot(w_1[i].T) + b_1[i]
        h_1 = sigmoid(net1)
        # output and probabilities
        y_hat = h_1.dot(w_2[i].T) + b_2[i]
        prob = softmax(y_hat)
        loss = squared_loss(label[i], prob)
As you can see, I need to initialize all the weights first so that when the neural net goes through each epoch, the weights can be updated and will not be lost. But the problem is that this is inefficient, and the program dies!
So could anyone suggest anything about this? How could I store weights and update weights in each training epoch?
Any help is greatly appreciated!