I am manually building a 1-layer neural network in Python without using packages like tensorflow. For this neural net, each input is a 500-dimensional one-hot vector, and the output is a 3-dimensional vector representing the probability of each class.
The neural net works, but my problem is that the number of training instances is very large, slightly more than 1 million, and because I need to run at least 3 epochs, I cannot find an efficient way to store the weight matrices.
I tried using a 3-dimensional numpy random matrix and dictionaries to represent the weights and then perform the weight updates. The first dimension of the 3-d matrix is the number of training instances, and the other two match the dimensions of each input and of the hidden layer. Both methods work fine with small samples, but the program dies with the full sample.
#feature.shape[0] is the number of training samples, and feature.shape[1] is 500
#d is the dimension of the hidden layer
#using 3-d matrices
import numpy as np

w_1 = np.random.rand(feature.shape[0], d, feature.shape[1])
b_1 = np.random.rand(feature.shape[0], 1, d)
w_2 = np.random.rand(feature.shape[0], 3, d)
b_2 = np.random.rand(feature.shape[0], 1, 3)
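# (rough memory check: with ~1e6 training instances, 500 input features and,
#  say, d = 100 hidden units, w_1 alone holds 1e6 * 100 * 500 float64 values,
#  i.e. about 400 GB -- which is why the full-sample run dies)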
#iterate through every training epoch
for iteration in range(epoch):
    correct, i = 0, 0
    #iterate through every training instance
    while i < feature.shape[0]:
        #net and out for hidden layer
        net1 = feature[i].toarray().flatten().dot(w_1[i].T) + b_1[i].flatten()
        h_1 = sigmoid(net1)
        #net and out for output
        y_hat = h_1.dot(w_2[i].T) + b_2[i].flatten()
        prob = softmax(y_hat)
        loss = squared_loss(label[i], prob)
        #backpropagation steps omitted here
#using dictionaries
w_1 = {i: np.random.rand(d, feature.shape[1]) for i in range(feature.shape[0])}
b_1 = {i: np.random.rand(d) for i in range(feature.shape[0])}
w_2 = {i: np.random.rand(3, d) for i in range(feature.shape[0])}
b_2 = {i: np.random.rand(3) for i in range(feature.shape[0])}
for iteration in range(epoch):
    correct, i = 0, 0
    while i < feature.shape[0]:
        #net and out for hidden layer
        net1 = feature[i].toarray().flatten().dot(w_1[i].T) + b_1[i]
        h_1 = sigmoid(net1)
        #output and probabilities
        y_hat = h_1.dot(w_2[i].T) + b_2[i]
        prob = softmax(y_hat)
        loss = squared_loss(label[i], prob)
As you can see, I need to initialize all the weights first so that when the neural net goes through each epoch, the weights can be updated and will not be lost. But this is inefficient, and the program dies!
So could anyone suggest anything about this? How could I store the weights and update them in each training epoch?
Any help is greatly appreciated!
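For comparison, the standard setup keeps a single shared set of weights for all training instances, so memory no longer scales with the number of samples; a minimal sketch, reusing d, sigmoid, softmax and squared_loss from above:

# One shared weight matrix per layer; the same arrays are updated in place
# for every instance and every epoch, so nothing is lost between epochs.
w_1 = np.random.rand(d, feature.shape[1])
b_1 = np.random.rand(d)
w_2 = np.random.rand(3, d)
b_2 = np.random.rand(3)

for iteration in range(epoch):
    correct, i = 0, 0
    while i < feature.shape[0]:
        net1 = feature[i].toarray().flatten().dot(w_1.T) + b_1
        h_1 = sigmoid(net1)
        y_hat = h_1.dot(w_2.T) + b_2
        prob = softmax(y_hat)
        loss = squared_loss(label[i], prob)
        # backpropagation updates w_1, b_1, w_2, b_2 in place here
        i += 1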
I am handling a time series dataset with n timesteps, m features and k objects.
As a result my feature vector has a shape of (n, k, m), while my target's shape is (n, m).
I want to predict the targets for every timestep and object, but with the same weights for every object. My loss function looks like this:
average_loss = loss_func(prediction, labels)
sum_loss = loss_func(sum(prediction), sum(labels))
loss = loss_weight * average_loss + (1-loss_weight) * sum_loss
My plan is to make sure not only that I predict every item as well as possible, but also that the sum of all items gets predicted well. loss_weight is a constant.
Currently I am doing this kind of ugly solution:
features = local_batch.squeeze(dim = 0)
labels = torch.unsqueeze(local_labels.squeeze(dim = 0), 1)
prediction = net(features)
I set my batch size to 1 and squeeze it so that the k objects become my batch.
My network looks like this:
import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))   # activation function for hidden layer
        x = self.predict(x)          # linear output
        return x
How do I make sure I do a reasonable convolution over the object dimension in order to keep the same weights for all objects, without committing to batch_size = 1? Also, how do I achieve the same loss function, where I compute the loss of the prediction sum vs. the target sum for every timestep?
It's not exactly ugly -- I would do the same but generalize it a bit for batch size >1 using view.
# Using your notations
n, k, m = features.shape
features = local_batch.view(n*k, m)
prediction = net(features).view(n, k, m)
With the prediction in the correct shape (n, k, m), implementing your loss function should not be difficult.
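To make that concrete, here is a minimal sketch of the combined loss, assuming prediction and labels have been brought to the same shape with the object dimension at dim=1, and that loss_func is an elementwise criterion such as torch.nn.MSELoss():

import torch

mse = torch.nn.MSELoss()

def combined_loss(prediction, labels, loss_weight, loss_func=mse):
    # loss over every individual element
    average_loss = loss_func(prediction, labels)
    # loss between the sums over the object dimension (assumed to be dim=1)
    sum_loss = loss_func(prediction.sum(dim=1), labels.sum(dim=1))
    # weighted combination as described in the question
    return loss_weight * average_loss + (1 - loss_weight) * sum_loss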
I was trying to do a pretty simple thing: train an LSTM that takes a sequence of random numbers and outputs their sum. But after some hours without convergence I decided to ask here which of my premises doesn't work.
The idea is simple:
I generate a training set of sequences of some fixed length of random numbers and label each with its sum (the numbers are drawn from a normal distribution)
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input
Intuitively the LSTM should learn to set the weights of the input gate to 1 (bias 0), the weights of the forget gate to 0 (bias 1) and the weights of the output gate to 1 (bias 0), and learn to add these numbers, but it doesn't. I am pasting the code I use; I tried different learning rates, optimizers and batching, observed the gradients and the outputs, and can't find the exact reason why it is failing.
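To spell that intuition out with the standard LSTM cell equations, c_t = f_t * c_{t-1} + i_t * g(x_t, h_{t-1}) and h_t = o_t * tanh(c_t): with f_t = 1, i_t = 1, o_t = 1 and a linear g, the recurrence reduces to c_t = c_{t-1} + x_t, i.e. exactly a running sum of the inputs. Note, however, that the output h_t = tanh(c_t) still squashes the cell state into (-1, 1) unless the cell activation is made linear as well, which is what the edit below ends up doing.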
Code for generating sequences:
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
def generate_sequences(n_samples, seq_len):
    total_shape = n_samples * seq_len
    random_values = np.random.randn(total_shape)
    random_values = random_values.reshape(n_samples, -1)
    targets = np.sum(random_values, axis=1)
    return random_values, targets
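For illustration, a quick sanity check of the shapes the generator returns:

# e.g. three sequences of length two
data, targets = generate_sequences(3, 2)
print(data.shape, targets.shape)   # (3, 2) (3,)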
Code for training:
n_samples = 100000
seq_len = 2
lr=0.1
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
    my_inp, target = iterator.get_next()
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(my_inp)
        my_out = output(tf.reshape(my_inp, shape=(batch_size, seq_len, 1)))
        loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out))) / batch_size
    grads = tape.gradient(loss, output.trainable_variables)
    optimizer.apply_gradients(zip(grads, output.trainable_variables),
                              global_step=tf.train.get_or_create_global_step())
I also had a conjecture that this is a theoretical problem: if we sum random values drawn from a normal distribution, the output is not in the [-1, 1] range, and due to the tanh activations the LSTM can't learn it. But changing them didn't improve the performance (they are changed to linear in the example code).
EDIT:
I set the activations to linear, since I realised that the tanh() squashes the values.
SOLVED:
The problem was actually the tanh() of the gates and recurrent states, which was squashing my outputs and not allowing them to grow by adding up the summands. Setting all activations to linear works pretty well.
I was working through the code here. I am currently modifying the loss. Looking at
deltas=tf.square(y_est-y)
loss=tf.reduce_sum(deltas)
I understand this to be calculating the squared difference between the output and the true label; the loss is then the sum of these squares. So, writing the squared single-sample error as S_i for sample i, things are simple when the batch loss is just \sum_{i} f(S_i),
where the summation goes over all samples. But what if you cannot write the loss in this form? That is, the batch loss for the whole data is f({S_i}) for some general f, with i going over all the samples. In other words, the loss for the whole data cannot be calculated as some simple linear combination of the losses for the constituent samples. How do you code this in tensorflow? Thank you.
Below are more details on f. The outputs of the neural network are u_i, where i goes from 1 to n, and n is the number of samples we have. My error is something like
\sum_{i=1}^{n} C_i \log( \sum_{k=1}^{n} I_{ik} \exp(-d(u_i, u_k)) )
C_i is the number of nodes connected to node i, which I already have and which is a constant. I_{ik} is 1 if node i and node k are not connected.
Thanks for the code. Maybe my question has not been worded correctly. I am not really looking for the code for the loss; this I can do myself. If you look at
deltas=tf.square(y_est-y)
loss=tf.reduce_sum(deltas)
the deltas, are they (1, 3)? A bit above, it reads:
# Placeholders for input and output data
X = tf.placeholder(shape=(120, 4), dtype=tf.float64, name='X')
y = tf.placeholder(shape=(120, 3), dtype=tf.float64, name='y')
# Variables for two group of weights between the three layers of the network
W1 = tf.Variable(np.random.rand(4, hidden_nodes), dtype=tf.float64)
W2 = tf.Variable(np.random.rand(hidden_nodes, 3), dtype=tf.float64)
# Create the neural net graph
A1 = tf.sigmoid(tf.matmul(X, W1))
y_est = tf.sigmoid(tf.matmul(A1, W2))
I think it is (1,3). They use (1,3) y_est and y. I want to know the specific tensorflow syntax for working with (m,3) y_est and y for any given m.
I might be wrong with the syntax, but this should give you a general idea. Also, you can optimize it further by vectorization. I have simply put your loss function down as it is.
N is the batch size.
def f(C, I, U, N):
    loss = 0
    for i in range(N):
        sum_ = 0
        for k in range(N):
            # d(., .) is the distance function from your formula
            sum_ += I[i, k] * tf.exp(-d(U[i], U[k]))
        loss += C[i] * tf.log(sum_)
    return loss

loss = f(C, I, U, batch_size)
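If you want the vectorized version, something along these lines should work; since d(., .) is not specified in the question, squared Euclidean distance is used here as a stand-in:

def f_vectorized(C, I, U):
    # pairwise differences u_i - u_k, shape (N, N, dim)
    diffs = tf.expand_dims(U, 1) - tf.expand_dims(U, 0)
    # stand-in for d(u_i, u_k): squared Euclidean distance, shape (N, N)
    D = tf.reduce_sum(tf.square(diffs), axis=-1)
    # inner sum over k of I_{ik} * exp(-d(u_i, u_k)), shape (N,)
    inner = tf.reduce_sum(I * tf.exp(-D), axis=1)
    # outer sum over i of C_i * log(...)
    return tf.reduce_sum(C * tf.log(inner))

loss = f_vectorized(C, I, U)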
In a TensorFlow optimizer (Python), the method _apply_dense gets called separately for the neuron weights (layer connections) and for the bias weights, but I would like to use both in one call of this method.
def _apply_dense(self, grad, weight):
...
For example: a fully connected neural network with two hidden layers with two neurons each and a bias for each.
If we take a look at layer 2, we get in _apply_dense one call for the neuron weights
and one call for the bias weights.
But I would need either both matrices in one call of _apply_dense, or a combined weight matrix like this:
X_2X_4, B_1X_4, ... is just a notation for the weight of the connection between two neurons. B_1X_4, for example, is only a placeholder for the weight between B_1 and X_4.
How to do this?
MWE
For a minimal working example, here is a stochastic gradient descent optimizer implementation with momentum. For every layer, the momentum of all incoming connections from other neurons is reduced to its mean (see ndims == 2). What I need instead is the mean not only of the momentum values from the incoming neuron connections but also from the incoming bias connections (as described above).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.python.training import optimizer
class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = mu
        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, weight):
        learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
        mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
        momentum = self.get_slot(weight, "a")

        if momentum.get_shape().ndims == 2:  # neuron weights
            momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
        elif momentum.get_shape().ndims == 1:  # bias weights
            momentum_mean = momentum
        else:
            momentum_mean = momentum

        momentum_update = grad + (mu_t * momentum_mean)
        momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)

        weight_update = learning_rate_t * momentum_t
        weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)

        return tf.group(*[weight_t, momentum_t])

    def _prepare(self):
        self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")
For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
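For reference, swapping it in would look roughly like this (assuming the example's loss tensor is named loss_op, as in that script):

optimizer = SGDmomentum(learning_rate=0.001, mu=0.9)
train_op = optimizer.minimize(loss_op)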
Update: I'll try to give a better answer (or at least some ideas) now that I have some understanding of your goal, but, as you suggest in the comments, there is probably no infallible way of doing this in TensorFlow.
Since TF is a general computation framework, there is no good way of determining which pairs of weights and biases there are in a model (or whether it is a neural network at all). Here are some possible approaches to the problem that I can think of:
Annotating the tensors. This is probably not practical since you already said you have no control over the model, but an easy option would be to add extra attributes to the tensors to signify the weight/bias relationship. For example, you could do something like W.bias = B and B.weight = W, and then in _apply_dense check hasattr(weight, "bias") and hasattr(weight, "weight") (there may be some better designs in this sense); see the sketch after these options.
You can look into some framework built on top of TensorFlow where you may have better information about the model structure. For example, Keras is a layer-based framework that implements its own optimizer classes (based on TensorFlow or Theano). I'm not too familiar with the code or its extensibility, but probably you have more tools there to use.
Detect the structure of the network yourself from the optimizer. This is quite complicated, but theoretically possible. From the loss tensor passed to the optimizer, it should be possible to "climb up" the model graph to reach all of its nodes (taking the .op of the tensors and the .inputs of the ops). You could detect tensor multiplications and additions with variables and skip everything else (activations, loss computation, etc.) to determine the structure of the network; if the model does not match your expectations (e.g. there are no multiplications, or there is a multiplication without a later addition), you can raise an exception indicating that your optimizer cannot be used for that model.
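As a rough sketch of the first option (annotating the tensors), assuming you can touch the code that builds the variables; the names W and B below are just placeholders, and _apply_dense is shown as it would sit inside the SGDmomentum class above:

# When creating the model, link each bias to its weight matrix:
W = tf.Variable(tf.random_normal([4, 3]), name="W")
B = tf.Variable(tf.zeros([3]), name="B")
W.bias = B
B.weight = W

# Then, inside the custom optimizer:
def _apply_dense(self, grad, weight):
    if hasattr(weight, "bias"):
        bias = weight.bias
        bias_momentum = self.get_slot(bias, "a")   # slot created for the bias in _create_slots
        # ... combine grad, the weight momentum and bias_momentum here ...
    # otherwise fall back to the existing per-variable update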
Old answer, kept for reference.
I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.
Let's say you have a dense layer transforming an input of size M to an output of size N. Following the convention you show, you'd have an N × M weights matrix W and an N-sized bias vector B. An input vector X of size M (or a batch of inputs of size M × K) would then be processed by the layer as W · X + B, followed by the activation function (in the case of a batch, the addition would be a "broadcasted" operation). In TensorFlow:
X = ... # Input batch of size M x K
W = ... # Weights of size N x M
B = ... # Biases of size N
Y = tf.matmul(W, X) + B[:, tf.newaxis] # Output of size N x K
# Activation...
If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as a new column of W, so W* would be N × (M + 1). Then you just need to add a new element to the input vector X containing a constant 1 (or a new row if it's a batch), so you would get X* with size M + 1 (or (M + 1) × K for a batch). The product W* · X* would then give you the same result as before. In TensorFlow:
X = ... # Input batch of size M x K
W_star = ... # Extended weights of size N x (M + 1)
# You can still have a "view" of the original W and B if you need it
W = W_star[:, :M]
B = W_star[:, -1]
X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)
Y = tf.matmul(W_star, X_star) # Output of size N x K
# Activation...
Now you can compute gradients and updates for weights and biases together. A drawback of this approach is that if you want to apply regularization then you should be careful to apply it only on the weights part of the matrix, not on the biases.
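For instance, with the N × (M + 1) layout above, an L2 penalty that excludes the bias column could be written as:

l2_reg = tf.nn.l2_loss(W_star[:, :M])   # regularize only the weights part, not the bias column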
I tried to solve Experiment 3a described in the original LSTM paper here: http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf with tensorflow LSTM and failed
From the paper: The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2.
The net architecture that he described in the paper:
"We use a 3-layer net with 1 input unit, 1 output unit, and 3 cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells and gate units, and have bias weights. Gate units and output unit are logistic sigmoid in [0; 1], h in [-1; 1], and g in [-2; 2]"
I tried to reproduce it with LSTM with 3 hidden units for T=100 and N=3 but failed.
I used online training (i.e. update the weights after each sequence) as described in the original paper
The core of my code was as follow:
self.batch_size = batch_size = config.batch_size
hidden_size = 3
self._input_data = tf.placeholder(tf.float32, (1, T))
self._targets = tf.placeholder(tf.float32, [1, 1])
lstm_cell = rnn_cell.BasicLSTMCell(hidden_size , forget_bias=1.0)
cell = rnn_cell.MultiRNNCell([lstm_cell] * 1)
self._initial_state = cell.zero_state(1, tf.float32)
weights_hidden = tf.constant(1.0, shape= [config.num_features, config.n_hidden])
# prepare the input
inputs = []
for k in range(num_steps):
    nextitem = tf.matmul(tf.reshape(self._input_data[:, k], [1, 1]), weights_hidden)
    inputs.append(nextitem)
outputs, states = rnn.rnn(cell, inputs, initial_state=self._initial_state)
# use the last output
pred = tf.sigmoid(tf.matmul(outputs[-1], tf.get_variable("weights_out", [config.n_hidden,1])) + tf.get_variable("bias_out", [1]))
self._final_state = states[-1]
self._cost = cost = tf.reduce_mean(tf.square((pred - self.targets)))
self._result = tf.abs(pred[0, 0] - self.targets[0,0])
optimizer = tf.train.GradientDescentOptimizer(learning_rate = config.learning_rate).minimize(cost)
Any idea why it couldn't learn?
My first instinct was to create 2 outputs, one for each class, but in the paper he specifically mentioned only one output unit.
Thanks
It seems that I needed forget_bias > 1.0. For long sequences the network couldn't work with the default forget_bias; for T=50, for example, I needed forget_bias = 2.1.
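In code that is just a change to the cell construction, e.g. for T=50:

lstm_cell = rnn_cell.BasicLSTMCell(hidden_size, forget_bias=2.1)  # instead of the default 1.0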