How can I speed up this Keras Attention computation? - python

I have written a custom Keras layer for an AttentiveLSTMCell and AttentiveLSTM(RNN) in line with Keras' new approach to RNNs. This attention mechanism is described by Bahdanau, where, in an encoder/decoder model, a "context" vector is created from all the outputs of the encoder and the decoder's current hidden state. I then append the context vector, at every timestep, to the input.
The model is being used to make a dialog agent, but it is very similar to NMT models in architecture (similar tasks).
However, adding this attention mechanism has slowed down the training of my network fivefold, and I would really like to know how I could write the part of the code that is slowing it down so much in a more efficient way.
The brunt of the computation is done here:
h_tm1 = states[0] # previous memory state
c_tm1 = states[1] # previous carry state
# attention mechanism
# repeat the hidden state to the length of the sequence
_stm = K.repeat(h_tm1, self.annotation_timesteps)
# multiply the weight matrix with the repeated (current) hidden state
_Wxstm = K.dot(_stm, self.kernel_w)
# calculate the attention probabilities
# self._uh is of shape (batch, timestep, self.units)
et = K.dot(activations.tanh(_Wxstm + self._uh), K.expand_dims(self.kernel_v))
at = K.exp(et)
at_sum = K.sum(at, axis=1)
at_sum_repeated = K.repeat(at_sum, self.annotation_timesteps)
at /= at_sum_repeated # vector of size (batchsize, timesteps, 1)
# calculate the context vector
context = K.squeeze(K.batch_dot(at, self.annotations, axes=1), axis=1)
# append the context vector to the inputs
inputs = K.concatenate([inputs, context])
in the call method of the AttentiveLSTMCell (one timestep).
The full code can be found here. If it is necessary that I provide some data and ways to interact with the model, then I can do that.
Any ideas? I am, of course, training on a GPU, in case there is something clever to exploit there.

I would recommend training your model using relu rather than tanh, as this operation is significantly faster to compute. This will save you computation time on the order of your training examples * average sequence length per example * number of epochs.
Also, I would evaluate the performance improvement of appending the context vector, keeping in mind that this will slow your iteration cycle on other parameters. If it's not giving you much improvement, it might be worth trying other approaches.
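As a concrete illustration (a fragment only, referring to the scoring line in the question's code; nothing else in the cell changes):

# before (from the question):
# et = K.dot(activations.tanh(_Wxstm + self._uh), K.expand_dims(self.kernel_v))

# after -- relu is cheaper to evaluate than tanh on most backends:
et = K.dot(activations.relu(_Wxstm + self._uh), K.expand_dims(self.kernel_v))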

You modified the LSTM class, which is fine for CPU computation, but you mentioned that you're training on a GPU.
I recommend looking into the cudnn-recurrent implementation, or further into the TensorFlow ops it uses. Maybe you can extend the code there.
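For reference, a sketch of the cuDNN-backed baseline in Keras 2.x (assuming keras.layers.CuDNNLSTM is available in your build; the layer sizes are placeholders, it is GPU-only, and it cannot wrap a custom cell such as AttentiveLSTMCell, so the attention step would have to live outside the fused recurrence):

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

# Sketch only: CuDNNLSTM runs the whole recurrence in one fused cuDNN kernel,
# which is typically much faster than a per-timestep custom cell.
model = Sequential()
model.add(CuDNNLSTM(256, return_sequences=True, input_shape=(None, 128)))
model.add(Dense(128, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')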

Related

Why does scaling down the parameters many times during training help keep the learning speed the same for all weights in Progressive GAN?

Equalized learning rate is one of the special techniques in Progressive GAN, a paper by the NVIDIA team. Introducing this method, they state that
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialize all learnable parameters from a normal distribution $\mathcal{N}(0, 1)$. During training, on each forward pass, they scale the result by the per-layer normalization constant from He's initializer.
I reproduced the code from the PyTorch GAN Zoo GitHub repo:
def forward(self, x, equalized):
    # generate the He constant depending on the size of tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])                # needs: from numpy import prod
    self.weight = math.sqrt(2.0 / fan_in)  # per-layer He constant c (needs: import math)
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize,
                       padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        x *= self.weight                   # scale the output by c at runtime
    return x
At first, I thought the He constant would be $c = \sqrt{2/\mathrm{fan\_in}}$, as in He's paper.
Normally $c < 1$, so the weights would have to be scaled up (divided by $c$), which would increase the gradient in backpropagation, as per the formula in ProGAN's paper, to prevent vanishing gradients.
However, the code shows that the result is multiplied by $c$, i.e. scaled down.
In summary, I can't understand why scaling the parameters down many times during training helps keep the learning speed stable for all weights.
I asked this question in other communities, e.g. Artificial Intelligence and Mathematics, and still haven't had an answer.
Please help me understand it, thank you!
There is already an explanation in the paper for scaling down the parameters in every single pass:
The benefit of doing this dynamically instead of during initialization is somewhat
subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent
methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These
methods normalize a gradient update by its estimated standard deviation, thus making the update
independent of the scale of the parameter. As a result, if some parameters have a larger dynamic
range than others, they will take longer to adjust. This is a scenario modern initializers cause, and
thus it is possible that a learning rate is both too large and too small at the same time.
I believe multiplying by He's constant in every pass ensures that the dynamic range of the parameters will not be too wide at any point during backpropagation, and therefore they will not take long to adjust. So if, for example, the discriminator at some point in the learning process adjusted faster than the generator, it will not take the generator long to catch up, and consequently their learning processes will equalize.
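One way to make that concrete (my own sketch of the argument, not a quote from the paper): with the runtime scaling $\hat{w} = c\,w$ and $c = \sqrt{2/\mathrm{fan\_in}}$, the chain rule gives $\frac{\partial L}{\partial w} = c\,\frac{\partial L}{\partial \hat{w}}$. Adam and RMSProp normalize each gradient update by its estimated standard deviation, so the raw step applied to $w$ has a magnitude on the order of the learning rate $\eta$ regardless of $c$. The step the layer actually feels is then $\Delta\hat{w} = c\,\Delta w \approx c\,\eta$, i.e. it is automatically proportional to that layer's He constant, so every layer moves by a comparable fraction of its own dynamic range.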

Is there any way to copy all parameters of one PyTorch model to another, especially the Batch Normalization mean and std?

I have found many seemingly correct ways online to copy the parameters of one PyTorch model to another, but somehow the copy operation always misses the batch normalization statistics. Everything works fine as long as I only use modules such as Conv2d, Linear, Dropout, MaxPool, etc. in my model. But as soon as I add batch normalization to the model, the script below stops working and the accuracy at test time is different:
net = model()
copy_net = model()

copy_param = []
for param in net.module.parameters():
    copy_param.append(param.clone().detach())

count = 0
for param in copy_net.module.parameters():
    param.data = copy_param[count]
    param.requires_grad = False
    count = count + 1
Can anybody give me a possible solution for copying the batch normalization parameters as well?
net.load_state_dict(copy_net.state_dict()) should work.
As per #dxtx: in PyTorch's philosophy, the state dict should cover all the state in a 'module'; e.g. in the batch norm module, the running mean and var (if I remember correctly) are part of the state dict.
But if you wrote a module like batch norm yourself, you would have to register those statistics as buffers (or override the 'state_dict' method) so that they are included.
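A minimal sketch of the buffer-aware copy (the model here is a hypothetical example; state_dict() includes both parameters and registered buffers such as BatchNorm's running_mean, running_var and num_batches_tracked):

import torch
import torch.nn as nn

# A small hypothetical model, just for illustration.
def make_model():
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

net = make_model()
copy_net = make_model()

# state_dict() holds parameters *and* buffers, so the BatchNorm statistics
# are copied along with the weights.
copy_net.load_state_dict(net.state_dict())

# Optionally freeze the copy afterwards.
for p in copy_net.parameters():
    p.requires_grad = False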

Creating and training only specified weights in TensorFlow or PyTorch

I am wondering if there is a way in TensorFlow, PyTorch or some other library to selectively connect neurons. I want to make a network with a very large number of neurons in each layer, but that has very few connections between layers.
Note that I do not think this is a duplicate of this answer: Selectively zero weights in TensorFlow?. I implemented a custom Keras layer using essentially the same method that appears in that question - creating a dense layer where all but the specified weights are ignored in training and evaluation. This fulfills part of what I want to do by not training the specified weights and not using them for prediction. But the problem is that I still waste memory storing the untrained weights, and I waste time calculating the gradients of the zeroed weights. What I would like is for the computation of the gradient matrices to involve only sparse matrices, so that I do not waste time and memory.
Is there a way to selectively create and train weights without wasting memory? If my question is unclear or there is more information that it would be helpful for me to provide, please let me know. I would like to be helpful as a question-asker.
The usual, simple solution is to initialize your weight matrices with zeros wherever there should be no connection. You store a mask of the locations of these zeros and set the weights at these positions to zero after each weight update. You need to do this because the gradient for zero weights may be nonzero, which would introduce nonzero weights (i.e. connections) where you don't want any. A concrete PyTorch sketch follows the pseudocode below.
Pseudocode:
# setup network
weights = sparse_init()  # only nonzero for existing connections
zero_mask = where(weights == 0)

# train
for e in range(num_epochs):
    train_operation()        # may lead to introduction of new connections
    weights[zero_mask] = 0   # so we set them to zero again
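A concrete PyTorch version of that recipe (a sketch only; the layer sizes, sparsity level and dummy data below are placeholders, not anything from the question):

import torch
import torch.nn as nn

layer = nn.Linear(128, 64)
mask = (torch.rand_like(layer.weight) < 0.05).float()  # 1 where a connection exists

with torch.no_grad():
    layer.weight.mul_(mask)   # remove the non-connections once at initialization

opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(32, 128), torch.randn(32, 64)       # dummy data

for epoch in range(100):
    loss = ((layer(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        layer.weight.mul_(mask)  # re-zero after each update, as in the pseudocode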
Both tensorflow and pytorch support sparse tensors (torch.sparse, tf.sparse).
My intuitive understanding would be that if you were willing to write your network using the respective low level APIs (e.g. actually implementing the forward-pass yourself), you could cast your weight matrices as sparse tensors. That would in turn result in sparse connectivity, since the weight matrix of layer [L] defines the connectivity between neurons of the previous layer [L-1] with neurons of layer [L].
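As a rough illustration of that direction (a sketch under the assumption that a sparse-by-dense matmul covers your layer; autograd support for sparse layouts varies by op and PyTorch version, so verify against your setup):

import torch

# A weight matrix stored in COO sparse format: only existing connections have
# entries, so memory scales with the number of connections.
indices = torch.tensor([[0, 1, 2],    # row of each connection
                        [5, 0, 7]])   # column of each connection
values = torch.randn(3)
weight = torch.sparse_coo_tensor(indices, values, size=(4, 10),
                                 requires_grad=True)

x = torch.randn(10, 32)               # (in_features, batch)
out = torch.sparse.mm(weight, x)      # forward pass of the sparse "layer"
out.sum().backward()
print(weight.grad)                    # gradient only at the stored positions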

On training LSTMs efficiently but well, parallelism vs training regime

For a model that I intend to use to spontaneously generate sequences, I find that training it sample by sample and keeping state in between feels most natural. I've managed to construct this in Keras after reading many helpful resources. (SO: Q and two fantastic answers, Machine Learning Mastery 1, 2, 3)
First a sequence is constructed (in my case one-hot encoded too). X and Y are produced from this sequence by shifting Y forward one time step. Training is done in batches of one sample and one time step.
For Keras this looks something like this:
data = get_some_data() # Shape (samples, features)
Y = data[1:, :] # Shape (samples-1, features)
X = data[:-1, :].reshape((-1, 1, data.shape[-1])) # Shape (samples-1, 1, features)
model = Sequential()
model.add(LSTM(256, batch_input_shape=(1, 1, X.shape[-1]), stateful=True))
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
for epoch in range(10):
    model.fit(X, Y, batch_size=1, shuffle=False)
    model.reset_states()
It does work. However after consulting my Task Manager, it seems it's only using ~10 % of my GPU resources, which are already quite limited. I'd like to improve this to speed up training. Increasing batch size would allow for parallel computations.
The network in its current state presumably "remembers" things even from the start of the training sequence. For training in batches one would need to first set up the sequence and then predict one value - and do this for multiple values. To train on the full sequence one would need to generate data in the shape (samples-steps, steps, features). I imagine it wouldn't be uncommon to have a sequence spanning at least a couple of hundred time steps. So that would mean a huge increase in the data amount.
Between framing the problem a bit differently and requiring more data to be stored in memory, and utilising only a small amount of processing resources, I must ask:
Is my intuition of the natural way of training and statefulness correct?
Are there other downsides to this training with one sample per batch?
Could the utilisation issues be resolved any other way?
Finally, is there an accepted way of performing this kind of training to generate long sequences?
Any help is greatly appreciated, I'm fairly new with LSTMs.
I do not know your specific application, however sending in only one timestep of data is surely not a good idea. You should instead give the LSTM the entire sequence of previously given one-hot vectors (presumably words), and pre-pad (with zeros) if necessary, as it appears you are working on sequences of varying length. Consider also using an embedding layer before your LSTM if these are indeed words. Read the documentation carefully.
The utilization of your GPU being low is not a problem. You simply do not have enough data to fully utilize all resources in each batch. Training with batches is a sequential process; there is no real way to parallelize it, at least not in a way that is introductory and beneficial to what your goals appear to be. If you do give the LSTM more data per timestep, however, this will surely increase your utilization.
stateful in an LSTM does not do what you think it does. An LSTM always remembers the sequence it is iterating over as it updates its internal hidden states, h and c. Furthermore, the weight transformations that "build" those internal states are learned during training. What stateful does is preserve the previous hidden state from the last batch index. Meaning, the final hidden state of the third element in one batch is sent as the initial hidden state of the third element in the next batch, and so on. I do not believe this is useful for your application.
There are downsides to training the LSTM with one sample per batch. In general, training with mini-batches increases stability. However, you appear to not be training with one sample per batch but instead with one timestep per sample.
Edit (from comments)
If you use stateful and send the next 'character' of your sequence at the same index of the previous batch, this would be analogous to sending the full sequence of timesteps per sample. I would still recommend the initial approach described above, in order to improve the speed of the application and to be more in line with other LSTM applications (a sketch is given below). I see no disadvantages to the approach of sending the full sequence per sample instead of doing it along every batch. The advantages of speed, being able to shuffle your input data per batch, and more readable/consistent code would be worth the change IMO.
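A sketch of that full-sequence framing (window length, layer size and batch size are arbitrary placeholders; get_some_data() is the placeholder from the question and is assumed to return a NumPy array of one-hot rows):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

data = get_some_data()                  # shape (samples, features), as above
steps = 100                             # arbitrary window length

# Build overlapping windows: X has shape (samples-steps, steps, features),
# Y is the step immediately after each window.
X = np.stack([data[i:i + steps] for i in range(len(data) - steps)])
Y = data[steps:]

model = Sequential()
model.add(LSTM(256, input_shape=(steps, X.shape[-1])))   # no statefulness needed
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Larger batches and shuffling are now fine, which helps GPU utilisation.
model.fit(X, Y, batch_size=128, shuffle=True, epochs=10)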

What makes it so hard for a neural network to learn a classifier where the class of x/256 is x?

What I was originally doing was classifying some wave data using a neural network. In that problem, I have a vector of size about 200 and the number of classes is 256. However, the loss never goes down.
So I thought: what if the wave is just the label? For example, $wave_i(x) = N(i/256.0, (1/10000)^2)$ is labeled $i$, where $N$ stands for the normal distribution.
For very small numbers of classes, like 32 or 64, the NN works well and learns rapidly.
When I take it to 256 classes, however, the learning speed is unbearably slow, or it does not learn at all.
The model I'm using is pretty simple. I think it is enough to even memorize the relationship between input and output. (Why? You can easily construct a unit that outputs 1 when abs(input - const) < eps.)
model = Sequential([
    Dense(classes, input_dim=200),
    Activation('sigmoid'),
    Dense(classes * 2),
    Activation('sigmoid'),
    Dense(classes),
    Activation('softmax'),
])
Then I fed it data with a batch size of 256, where every label occurs exactly once per batch.
The result: the loss reached 2.xxxx and the accuracy reached 0.07 after 2500 epochs, and stopped changing after 3000 epochs (accuracy around 0.09 to 0.1).
I know more parameters need more time to learn. However, it seems clear that each single output cell should easily be able to separate itself from the others, since the input sets are very different. The generator I used:
import numpy
from keras.utils import to_categorical

def generator():
    while 1:
        data = [numpy.random.normal(i/255.0, 1/10000.0, 225).tolist() for i in range(0, classes)]
        labels = to_categorical([i for i in range(0, classes)], classes)
        yield (data, labels)
When you have a really simple relationship between input and output, such as the one you are exploring, then this may not play to the strengths of a neural network, which is flexible enough to fit any function but rarely does so perfectly. When you have a simple function, you may find you will spot the imperfections in the fit from a neural network and models other than neural networks will do a better job.
Some things you could potentially do to get a better fit (roughly in order of things I would try):
Try a different optimiser. You don't say which optimiser you are using, but the Keras library comes with a few choices.
Neural networks work better when training and predicting against input features that have been normalised. An effective choice is mean 0, standard deviation 1. In your case, if you pre-process each batch - when training and testing - like this: data = (data - 0.5)/0.289, it may help.
Increase number of neurons in hidden layers, and/or change the activation function. Your ideal activation function here might even be shaped like a gaussian (so a single neuron could immediately tune to each class), but that isn't something you usually find in a NN library. Consider dropping the middle layer too, and just have e.g. 8*classes neurons in a single hidden layer before the softmax*.
Sample from your input examples in the generator instead of calculating one from each class each time (a sketch of this follows the model code below). The generator is potentially too regular - I have seen the classic xor example network get stuck in a similar way to your description when fed the same inputs repeatedly.
* The simpler network model would look like this:
model = Sequential([
    Dense(classes * 8, input_dim=200),
    Activation('sigmoid'),
    Dense(classes),
    Activation('softmax'),
])
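For the fourth suggestion, a rough sketch of a less regular generator that samples random classes per batch instead of the fixed 0..classes-1 sweep (the sizes are copied from the question's generator):

import numpy as np
from keras.utils import to_categorical

def random_generator(classes=256, batch_size=256, dim=225):
    # Draw class labels at random each batch so the network never sees the
    # exact same ordered batch twice.
    while True:
        labels = np.random.randint(0, classes, size=batch_size)
        data = np.stack([np.random.normal(i / 255.0, 1 / 10000.0, dim)
                         for i in labels])
        yield data, to_categorical(labels, classes)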
