I'm currently trying to learn Sonnet.
My network (incomplete, the question is based on this):
class Model(snt.AbstractModule):
    def __init__(self, name="LSTMNetwork"):
        super(Model, self).__init__(name=name)
        with self._enter_variable_scope():
            self.l1 = snt.LSTM(100)
            self.l2 = snt.LSTM(100)
            self.out = snt.LSTM(10)

    def _build(self, inputs):
        # 'inputs' is of shape (batch_size, input_length)
        # I need it to be of shape (batch_size, sequence_length, input_length)
        l1_state = self.l1.initialize_state(np.shape(inputs)[0])  # init with batch_size
        l2_state = self.l2.initialize_state(np.shape(inputs)[0])  # init with batch_size
        out_state = self.out.initialize_state(np.shape(inputs)[0])

        l1_out, l1_state = self.l1(inputs, l1_state)
        l1_out = tf.tanh(l1_out)
        l2_out, l2_state = self.l2(l1_out, l2_state)
        l2_out = tf.tanh(l2_out)
        output, out_state = self.out(l2_out, out_state)
        output = tf.sigmoid(output)
        return output, out_state
In other frameworks (e.g. Keras), LSTM inputs are of the form (batch_size, sequence_length, input_length).
However, the Sonnet documentation states that the input to Sonnet's LSTM is of the form (batch_size, input_length).
How do I use them for sequential input?
So far, I've tried using a for loop inside _build, iterating over each timestep, but that gives seemingly random outputs.
I've tried the same architecture in Keras, which runs without any issues.
I'm executing in eager mode, using GradientTape for training.
We generally wrote the RNNs in Sonnet to work on a single-timestep basis, because for reinforcement learning you often need to run one timestep to pick an action, and without that action you can't get the next observation (and the next input timestep) from the environment. It's easy to unroll a single-timestep module over a sequence using tf.nn.dynamic_rnn (see below). We also have a wrapper, snt.DeepRNN, which takes care of composing several RNN cores and ops per timestep, which I believe is what you're looking to do. This has the advantage that the DeepRNN object supports the start-state methods required for dynamic_rnn, so it's API compatible with LSTM or any other single-timestep module.
What you want to do should be achievable like this:
# Create a single-timestep RNN module by composing recurrent modules and
# non-recurrent ops.
model = snt.DeepRNN([
    snt.LSTM(100),
    tf.tanh,
    snt.LSTM(100),
    tf.tanh,
    snt.LSTM(100),
    tf.sigmoid
], skip_connections=False)
batch_size = 2
sequence_length = 3
input_size = 4

single_timestep_input = tf.random_uniform([batch_size, input_size])
sequence_input = tf.random_uniform([batch_size, sequence_length, input_size])

# Run the module on a single timestep
single_timestep_output, next_state = model(
    single_timestep_input, model.initial_state(batch_size=batch_size))

# Unroll the module on a full sequence
sequence_output, final_state = tf.nn.dynamic_rnn(
    model, sequence_input, dtype=tf.float32)
A few things to note: if you haven't already, please have a look at the RNN example in the repository, as it shows a full graph-mode training setup around a fairly similar model.
Secondly, if you do end up needing to implement a more complex module than DeepRNN allows for, it's important to thread the recurrent state in and out of the module. In your example you're creating the initial state internally, and the l1_state and l2_state outputs are effectively discarded, so the state is never carried between timesteps and the model can't be trained properly. If DeepRNN weren't available, your model would look like this:
class LSTMNetwork(snt.RNNCore):  # Note we inherit from the RNN-specific subclass
    def __init__(self, name="LSTMNetwork"):
        super(LSTMNetwork, self).__init__(name=name)
        with self._enter_variable_scope():
            self.l1 = snt.LSTM(100)
            self.l2 = snt.LSTM(100)
            self.out = snt.LSTM(10)

    def initial_state(self, batch_size):
        return (self.l1.initial_state(batch_size),
                self.l2.initial_state(batch_size),
                self.out.initial_state(batch_size))

    def _build(self, inputs, prev_state):
        # Separate the components of prev_state.
        l1_prev_state, l2_prev_state, out_prev_state = prev_state

        l1_out, l1_next_state = self.l1(inputs, l1_prev_state)
        l1_out = tf.tanh(l1_out)
        l2_out, l2_next_state = self.l2(l1_out, l2_prev_state)
        l2_out = tf.tanh(l2_out)
        output, out_next_state = self.out(l2_out, out_prev_state)

        # Output state of LSTMNetwork contains the output states of the inner modules.
        full_output_state = (l1_next_state, l2_next_state, out_next_state)
        return tf.sigmoid(output), full_output_state
Finally, if you're using eager mode I would strongly encourage you to have a look at Sonnet 2 - it's a complete rewrite for TF 2 / Eager mode. It's not backwards compatible, but all the same kinds of module compositions are possible. Sonnet 1 was written primarily for Graph mode TF, and while it does work with Eager mode you'll probably encounter some things that aren't very convenient.
We worked closely with the TensorFlow team to make sure that TF 2 & Sonnet 2 work nicely together, so please have a look: (https://github.com/deepmind/sonnet/tree/v2). Sonnet 2 should be considered alpha, and is being actively developed, so we don't have loads of examples yet, but more will be added in the near future.
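To give a rough idea, the same kind of stack in Sonnet 2 might look something like the following. This is only a sketch: I'm assuming the snt.DeepRNN, snt.dynamic_unroll and initial_state names from the v2 repo, and the API may still change while it's in alpha.

import sonnet as snt  # Sonnet 2
import tensorflow as tf  # TF 2.x, eager by default

# Compose recurrent modules and plain ops into one single-timestep core.
model = snt.DeepRNN([
    snt.LSTM(100),
    tf.tanh,
    snt.LSTM(100),
    tf.tanh,
    snt.LSTM(10),
    tf.sigmoid,
])

batch_size, sequence_length, input_size = 2, 3, 4
# The unroll helpers expect time-major input: [time, batch, features].
sequence_input = tf.random.uniform([sequence_length, batch_size, input_size])

sequence_output, final_state = snt.dynamic_unroll(
    model, sequence_input, model.initial_state(batch_size))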
Related
I am writing a simple custom model and training loop in TensorFlow. My goal is to build a stateful LSTM-based model and be able to reset the states when I want to.
So far this is my custom model:
class ResNetModel(Model):
    def __init__(self, num_inputs, **kwargs):
        """
        The class initialiser should call the base class initialiser, passing any keyword
        arguments along. It should also create the layers of the network according to the
        above specification.
        """
        super(ResNetModel, self).__init__(**kwargs)
        self.lstm_1 = tf.keras.layers.LSTM(units=32, input_shape=(None, num_inputs), return_sequences=True)
        self.dense = tf.keras.layers.Dense(units=1, activation=None)

    def call(self, inputs, training=False):
        """
        This method should contain the code for calling the layer according to the above
        specification, using the layer objects set up in the initialiser.
        """
        x = self.lstm_1(inputs)
        y = self.dense(x)
        return y + inputs
And this is my custom training loop (I am omitting the rest of the code because it is quite big, but this function is self-contained for the purpose of my question):
def run_training(self, in_train, out_train, epoch_loss, epoch_error, n_skip, n_block):
    n_samples = in_train.shape[1]
    self.model.reset_states()             # clear existing state
    self.model(in_train[:, :n_skip, :])   # process some samples to build up state

    for n in range(n_skip, n_samples - n_block, n_block):
        # compute loss
        with tf.GradientTape() as tape:
            y_pred = self.model(in_train[:, n:n + n_block, :])
            loss = self.loss_func(out_train[:, n:n + n_block, :], y_pred)

        grads = tape.gradient(loss, self.model.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.model.trainable_variables))

        epoch_loss.update_state(loss)
        epoch_error.update_state(out_train[:, n:n + n_block, :], y_pred)
It trains fine, and the whole code works as expected.
Then I make predictions like this:
for i in range(0, math.floor(24000 / 4096)):
    predictions[i * 4096:(i + 1) * 4096] = np.array(
        residual_net.model(X_test[idx][i * 4096:(i + 1) * 4096].reshape(1, 4096, 1))).ravel()
So basically I am passing my test input to my model as residual_net.model(my_test_data) (the NumPy slicing etc. is just to make the input data consistent with the shape the network expects; it works fine).
However, when I make predictions with my trained network (for context, it works with audio data), the output audio is as expected (an input song processed by the network, which adds some distortion), but there are clicks in the output audio that are directly related to the input buffer size.
To make this point clearer: if I predict on chunks of 512 samples, I get clicks every 512 samples; if I predict on chunks of 4096 samples, I get clicks every 4096.
This behaviour is very similar to that of an IIR filter that does not carry its filter state across audio buffers, which got me thinking that my LSTM network is not behaving statefully as I expected.
So my question is:
does TensorFlow automatically reset the state of the network after each processed buffer (even in the case of custom training/prediction loops) if stateful=True is not specified on the LSTM layer?
I found no information about this, but I expected that behaviour for "standard" training (the .fit/.predict functions), not for custom training loops.
Does this also hold for the training step? (In that case I am messing up the training as well.)
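For reference, this is roughly what I mean by carrying the LSTM state across buffers explicitly. It is only a sketch with illustrative names (not my actual code), assuming return_state=True on the LSTM layer:

import tensorflow as tf

class StatefulResNetModel(tf.keras.Model):
    def __init__(self, num_inputs, **kwargs):
        super(StatefulResNetModel, self).__init__(**kwargs)
        # return_state=True so each call also returns the final hidden/cell states.
        self.lstm_1 = tf.keras.layers.LSTM(units=32, return_sequences=True, return_state=True)
        self.dense = tf.keras.layers.Dense(units=1, activation=None)
        self.h = None  # carried hidden state
        self.c = None  # carried cell state

    def reset_lstm_state(self):
        self.h, self.c = None, None

    def call(self, inputs, training=False):
        initial_state = None if self.h is None else [self.h, self.c]
        x, h, c = self.lstm_1(inputs, initial_state=initial_state)
        # Carry the state over to the next buffer (optionally wrap in
        # tf.stop_gradient(...) to truncate backprop across buffers).
        self.h, self.c = h, c
        y = self.dense(x)
        return y + inputs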
I have a very long time series I want to feed into an LSTM for classification per-frame.
My data is labeled per frame, and I know that some rare events occur which heavily influence the classification from the moment they happen onward.
Thus, I have to feed the entire sequence to get meaningful predictions.
It is known that simply feeding very long sequences into an LSTM is sub-optimal, since the gradients can vanish or explode just as in ordinary RNNs.
I wanted to use a simple technique of cutting the sequence to shorter (say, 100-long) sequences, and run the LSTM on each, then pass the final LSTM hidden and cell states as the start hidden and cell state of the next forward pass.
Here is an example I found of someone who did just that; there it is called "truncated backpropagation through time". I was not able to make the same approach work for me.
My attempt in PyTorch Lightning (stripped of irrelevant parts):
def __init__(self, config, n_classes, datamodule):
    ...
    self._criterion = nn.CrossEntropyLoss(
        reduction='mean',
    )

    num_layers = 1
    hidden_size = 50
    batch_size = 1

    self._lstm1 = nn.LSTM(input_size=len(self._in_features), hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)
    self._log_probs = nn.Linear(hidden_size, self._n_predicted_classes)

    self._last_h_n = torch.zeros((num_layers, batch_size, hidden_size), device='cuda',
                                 dtype=torch.double, requires_grad=False)
    self._last_c_n = torch.zeros((num_layers, batch_size, hidden_size), device='cuda',
                                 dtype=torch.double, requires_grad=False)


def training_step(self, batch, batch_index):
    orig_batch, label_batch = batch
    n_labels_in_batch = np.prod(label_batch.shape)

    lstm_out, (self._last_h_n, self._last_c_n) = self._lstm1(
        orig_batch, (self._last_h_n, self._last_c_n))
    log_probs = self._log_probs(lstm_out)
    loss = self._criterion(log_probs.view(n_labels_in_batch, -1),
                           label_batch.view(n_labels_in_batch))
    return loss
Running this code gives the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
The same happens if I add
def on_after_backward(self) -> None:
    self._last_h_n.detach()
    self._last_c_n.detach()
The error does not happen if I use
lstm_out, (self._last_h_n, self._last_c_n) = self._lstm1(orig_batch,)
But obviously this is useless, as the output from the current frame-batch is not forwarded to the next one.
What is causing this error? I thought detaching the output h_n and c_n should be enough.
How do I pass the output of a previous frame-batch to the next one and have torch back propagate each frame batch separately?
Apparently, I missed the trailing _ for detach():
Using
def on_after_backward(self) -> None:
    self._last_h_n.detach_()
    self._last_c_n.detach_()
works.
The problem was that self._last_h_n.detach() does not update the attribute to point at the new tensor returned by detach(), so the attribute still references the old tensor, which is part of the graph that backprop already went through.
The reference answer solved that by H = H.detach().
Cleaner (and probably faster) is self._last_h_n.detach_() which does the operation in place.
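For completeness, the equivalent rebind form (what H = H.detach() does in the referenced answer) would be:

def on_after_backward(self) -> None:
    # Rebinding the attributes to the detached tensors also works, because the
    # names now point at new tensors that are no longer part of the old graph.
    self._last_h_n = self._last_h_n.detach()
    self._last_c_n = self._last_c_n.detach()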
I am trying to modify the code found in the following link so that the Transformer model from the paper "Attention Is All You Need" keeps only the encoder part of the whole model. Furthermore, I would like to change the network's input from a sequence of text to a sequence of images (or rather, extracted features of images) coming from a video. In essence, I would like to figure out which frames of my input are related to each other and encode that information in an output embedding, in the same way the Transformer model does.
The project in the linked repository mainly performs sequence-to-sequence transformation: the input is text in one language and the output is text in another language. The model is mainly built in lines 386-463, where it is initialized and compiled. I would like to do something like:
#414-416
self.encoder = SelfAttention(d_model, d_inner_hid, n_head, layers, dropout)
#self.decoder = Decoder(d_model, d_inner_hid, n_head, layers, dropout)
#self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
#434-436
enc_output = self.encoder(src_emb, src_seq, active_layers=active_layers)
#dec_output = self.decoder(tgt_emb, tgt_seq, src_seq, enc_output, active_layers=active_layers)
#final_output = self.target_layer(dec_output)
Furthermore, I would like to combine the output of the encoder (the output of MultiHeadAttention and PositionwiseFeedForward) with an LSTM and Dense layers, which will tune the whole encoding procedure through a classification objective. Therefore, when I define my model I add the following layers:
self.lstm = LSTM(units = 256, input_shape = (None, 256), return_sequences = False, dropout = 0.5)
self.fc1 = Dense(64, activation='relu', name = "dense_one")
self.fc2 = Dense(6, activation='sigmoid', name = "dense_two")
and then pass the output of the encoder, in line 434 using the following code:
enc_output = self.lstm(enc_output)
enc_output = self.fc1(enc_output)
enc_output = self.fc2(enc_output)
Now, the video data I would like to use in place of the text data provided with the GitHub code has dimensionality N x 10 x 256, where N is the number of samples, 10 is the number of frames, and 256 is the number of features per frame. I have some difficulty understanding parts of the code well enough to modify it to my needs. I guess the Embedding layer is no longer necessary for me, since it is specific to text and NLP.
Furthermore, I need to modify the input in lines 419-420 to be something like:
src_seq_input = Input(shape=(None, 256,), dtype='float32') # source input related to video
tgt_seq_input = Input(shape=(6,), dtype='int32') # the target classification size (since I have 6 classes)
What other parts of the code do I need to skip or modify? What is the usefulness of the PosEncodingLayer that is used in the following line:
self.pos_emb = PosEncodingLayer(len_limit, d_emb) if self.src_loc_info else None
Is it needed in my case? Can I skip it?
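For context, my understanding is that PosEncodingLayer adds the standard sinusoidal positional encoding from the paper, which in plain NumPy looks roughly like this (a sketch, not the repo's exact code):

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, np.newaxis]                  # (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                    # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    angles = pos * angle_rates                               # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# For my data this would be added to the (10, 256) frame features, so the encoder
# can distinguish frame order, since self-attention itself is order-agnostic.
pe = positional_encoding(10, 256)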
After my modifications, I noticed that when I run the code I can check the loss from get_loss(y_pred, y_true); however, in my case it is crucial to define a loss for the classification task that also reports the accuracy. How can I do so with the provided code?
Edit:
I have to add that I treat my input as the output of the Embedding layer from the initial NLP code. Therefore, for me (in the version of code that functioned for me):
src_seq_input = Input(shape=(None, 256,), dtype='float32')
tgt_seq_input = Input(shape=(6,), dtype='int32')
src_seq = src_seq_input
#src_emb_ = self.i_word_emb(src_seq)
src_emb = src_seq
enc_output = self.encoder(src_emb, src_emb, active_layers=active_layers)
I treat src_emb as my input and completely ignore src_seq.
Edit:
The loss is calculated with the following code:
def get_loss(y_pred, y_true):
    y_true = tf.cast(y_true, 'int32')
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
    loss = K.mean(loss)
    return loss

loss = get_loss(enc_output, tgt_seq_input)
self.ppl = K.exp(loss)
Edit:
As it is, the loss function (sparse_softmax_cross_entropy_with_logits) returns only a loss score, even though the whole procedure is about classification. How can I further tune my system so that it also returns the accuracy?
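Something in the style of the existing get_loss is what I have in mind, e.g. an accuracy computed from the same logits and mask (just a sketch, assuming integer class labels like the loss uses):

def get_accuracy(y_pred, y_true):
    # Compare the argmax of the logits with the integer labels, using the same
    # padding mask as get_loss.
    y_true = tf.cast(y_true, 'int32')
    pred_classes = tf.cast(tf.argmax(y_pred, axis=-1), 'int32')
    correct = tf.cast(tf.equal(pred_classes, y_true), 'float32')
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    accuracy = tf.reduce_sum(correct * mask, -1) / tf.reduce_sum(mask, -1)
    return K.mean(accuracy)

accuracy = get_accuracy(enc_output, tgt_seq_input)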
I'm afraid this approach is not going to work.
Video data has massive dependence between adjacent frames, with each frame very similar to the last. There is also a weaker dependence on prior frames, because objects tend to continue to move relative to other objects in similar ways. Modern video formats use this redundancy to achieve high compression rates by modelling the motions.
This means that your network will have an extremely strong attention on the previous image. As you suggest, you could subsample frames several seconds apart to destroy much of the dependence on the previous frame, but if you did so, I really wonder whether you would find any structure at all in the result. Even if you feed it hand-coded features optimised for the purpose, there are few general rules about which features will be in motion and which will not, so what structure can your attention network learn?
The problem of handling video is just radically different from handling sentences. Video has very complex elements (pictures) that are largely static over time and have locally predictable motions over a few frames in very simple ways. Text has simple elements (words) in a complex sentence structure with complex dependence extending over many words. These differences mean they require fundamentally different approaches.
A bit of background:
I've implemented an NLP classification model using mostly the Keras functional-model parts of TensorFlow 2.0. The model architecture is a pretty straightforward LSTM network with an Attention layer added between the LSTM and the Dense output layer. The Attention layer comes from this Kaggle kernel (starting around line 51).
I wrapped the trained model in a simple Flask app and get reasonably accurate predictions. In addition to predicting a class for a specific input I also output the value of the attention weight vector "a" from the aforementioned Attention layer so I can visualize the weights applied to the input sequence.
My current method of extracting the attention weights variable works, but seems incredibly inefficient as I'm predicting the output class and then manually calculating the attention vector using an intermediate Keras model. In the Flask app, inference looks something like this:
# Load the trained model
model = tf.keras.models.load_model('saved_model.h5')
# Extract the trained weights and biases of the trained attention layer
attention_weights = model.get_layer('attention').get_weights()
# Create an intermediate model that outputs the activations of the LSTM layer
intermediate_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer('bi-lstm').output)
# Predict the output class using the trained model
model_score = model.predict(input)
# Obtain LSTM activations by predicting the output again using the intermediate model
lstm_activations = intermediate_model.predict(input)
# Use the intermediate LSTM activations and the trained model attention layer weights and biases to calculate the attention vector.
# Maths from the custom Attention Layer (heavily modified for the sake of brevity)
eij = tf.keras.backend.dot(lstm_activations, attention_weights)
a = tf.keras.backend.exp(eij)
attention_vector = a
I think I should be able to include the attention vector as part of the model output, but I'm struggling with figuring out how to accomplish this. Ideally I'd extract the attention vector from the custom attention layer in a single forward pass rather than extracting the various intermediate model values and calculating a second time.
For example:
model_score = model.predict(input)
model_score[0] # The predicted class label or probability
model_score[1] # The attention vector, a
I think I'm missing some basic knowledge around how Tensorflow/Keras throw variables around and when/how I can access those values to include as model output. Any advice would be appreciated.
After a little more research I've managed to cobble together a working solution. I'll summarize here for any future weary internet travelers that come across this post.
The first clues came from this github thread. The attention layer defined there seems to build on the attention layer in the previously mentioned Kaggle kernel. The github user adds a return_attention flag to the layer init which, when enabled, includes the attention vector in addition to the weighted RNN output vector in the layer output.
I also added a get_config function suggested by this user in the same github thread which enables us to save and reload trained models. I had to add the return_attention flag to get_config, otherwise TF would throw a list iteration error when trying to load a saved model with return_attention=True.
With those changes made, the model definition needed to be updated to capture the additional layer outputs.
inputs = Input(shape=(max_sequence_length,))
lstm = Bidirectional(LSTM(lstm1_units, return_sequences=True))(inputs)
# Added 'attention_vector' to capture the second layer output
attention, attention_vector = Attention(max_sequence_length, return_attention=True)(lstm)
x = Dense(dense_units, activation="softmax")(attention)
The final, and most important piece of the puzzle came from this Stackoverflow answer. The method described there allows us to output multiple results while only optimizing on one of them. The code changes are subtle, but very important. I've added comments below in the spots I made changes to implement this functionality.
model = Model(
    inputs=inputs,
    outputs=[x, attention_vector]  # Original value: outputs=x
)

model.compile(
    loss=['categorical_crossentropy', None],  # Original value: loss='categorical_crossentropy'
    optimizer=optimizer,
    metrics=[BinaryAccuracy(name='accuracy')])
With those changes in place, I retrained the model and voila! The output of model.predict() is now a list containing the score and its associated attention vector.
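For example, inference in the Flask app now looks roughly like this (variable names illustrative):

# predict() on the two-output model returns a list: [class_scores, attention_vectors]
scores, attention_vectors = model.predict(input_sequences)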
The results of the change were pretty dramatic. Running inference on 10k examples took about 20 minutes using this new method. The old method utilizing intermediate models took ~33 minutes to perform inference on the same dataset.
And for anyone that's interested, here is my modified Attention layer:
from tensorflow.python.keras.layers import Layer
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras import backend as K
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, return_attention=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        self.return_attention = return_attention
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a

        result = K.sum(weighted_input, axis=1)

        if self.return_attention:
            return [result, a]
        return result

    def compute_output_shape(self, input_shape):
        if self.return_attention:
            return [(input_shape[0], self.features_dim),
                    (input_shape[0], input_shape[1])]
        else:
            return input_shape[0], self.features_dim

    def get_config(self):
        config = {
            'step_dim': self.step_dim,
            'W_regularizer': regularizers.serialize(self.W_regularizer),
            'b_regularizer': regularizers.serialize(self.b_regularizer),
            'W_constraint': constraints.serialize(self.W_constraint),
            'b_constraint': constraints.serialize(self.b_constraint),
            'bias': self.bias,
            'return_attention': self.return_attention
        }
        base_config = super(Attention, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
I have several questions about best practice in using recurrent networks in pytorch for generation of sequences.
The first one: if I want to build a decoder net, should I use nn.GRU (or nn.LSTM) instead of nn.LSTMCell (nn.GRUCell)? From my experience, if I work with LSTMCell the speed of calculation is dramatically lower (up to 100 times) than with nn.LSTM. Maybe this is related to the cuDNN optimisation for the LSTM (and GRU) modules? Is there any way to speed up LSTMCell calculations?
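To make the comparison concrete, this is the kind of difference I mean (a rough sketch):

import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 100, 32, 64, 128
x = torch.randn(seq_len, batch, input_size)

# Fused module: the whole sequence is processed by one (cuDNN-backed) call.
lstm = nn.LSTM(input_size, hidden_size)
out_fused, _ = lstm(x)

# Cell version: an explicit Python loop over timesteps, one small kernel launch per step.
cell = nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
outputs = []
for t in range(seq_len):
    h, c = cell(x[t], (h, c))
    outputs.append(h)
out_loop = torch.stack(outputs)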
I am trying to build an autoencoder that accepts sequences of variable length. My autoencoder looks like:
class SimpleAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=3):
        super(SimpleAutoencoder, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.gru_encoder = nn.GRU(input_size, hidden_size, n_layers, batch_first=True)
        self.gru_decoder = nn.GRU(input_size, hidden_size, n_layers, batch_first=True)
        self.h2o = nn.Linear(hidden_size, input_size)  # Hidden to output

    def encode(self, input):
        output, hidden = self.gru_encoder(input, None)
        return output, hidden

    def decode(self, input, hidden):
        output, hidden = self.gru_decoder(input, hidden)
        return output, hidden

    def h2o_apply(self, input):
        return self.h2o(input)
My training loop looks like:
one_hot_batch = list(map(lambda x: Variable(torch.FloatTensor(x)), one_hot_batch))
packed_one_hot_batch = pack_padded_sequence(
    pad_sequence(one_hot_batch, batch_first=True).cuda(), batch_lens, batch_first=True)

_, latent = vae.encode(packed_one_hot_batch)
outputs, _ = vae.decode(packed_one_hot_batch, latent)
packed = pad_packed_sequence(outputs, batch_first=True)

for string, length, index in zip(*packed, range(batch_size)):
    decoded_string_without_sos_symbol = vae.h2o_apply(string[1:length])
    loss += criterion(decoded_string_without_sos_symbol, real_strings_batch[index][1:])
loss /= len(batch)
Training in this manner is, as I understand it, teacher forcing, because at the decoding stage the network is fed the real inputs (outputs, _ = vae.decode(packed_one_hot_batch, latent)). But for my task this leads to a situation where, at test time, the network generates sequences very well only if I feed it the real symbols (as in training mode); if I feed it the output of the previous step instead, it generates rubbish (just an infinite repetition of one specific symbol).
I tried another approach: I generated "fake" inputs (just ones) to make the model generate only from the hidden state.
one_hot_batch_fake = list(map(lambda x: torch.ones_like(x).cuda(), one_hot_batch))
packed_one_hot_batch_fake = pack_padded_sequence(
    pad_sequence(one_hot_batch_fake, batch_first=True).cuda(), batch_lens, batch_first=True)

_, latent = vae.encode(packed_one_hot_batch)
outputs, _ = vae.decode(packed_one_hot_batch_fake, latent)
packed = pad_packed_sequence(outputs, batch_first=True)
It works, but very poorly: the quality of reconstruction is very low. So my second question is: what is the right way to generate sequences from a latent representation?
I suppose a good idea is to apply teacher forcing with some probability, but for that, how can one use the nn.GRU layer so that the output of the previous step becomes the input for the next step? Something like the sketch below is what I have in mind.
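A rough sketch built on my autoencoder above (not tested; the helper name is just for illustration):

import random
import torch

def decode_step_by_step(vae, latent, target_seq, teacher_forcing_prob=0.5):
    # target_seq: (batch, seq_len, input_size) one-hot symbols; latent: decoder initial hidden.
    batch_size, seq_len, input_size = target_seq.shape
    hidden = latent
    inp = target_seq[:, 0:1, :]                      # start from the <sos> symbol
    outputs = []
    for t in range(1, seq_len):
        out, hidden = vae.gru_decoder(inp, hidden)   # out: (batch, 1, hidden_size)
        logits = vae.h2o_apply(out)                  # (batch, 1, input_size)
        outputs.append(logits)
        if random.random() < teacher_forcing_prob:
            inp = target_seq[:, t:t + 1, :]          # teacher forcing: feed the real symbol
        else:
            # free running: feed back the model's own prediction as a one-hot vector
            idx = logits.argmax(dim=-1, keepdim=True)            # (batch, 1, 1)
            inp = torch.zeros_like(inp).scatter_(-1, idx, 1.0)
    return torch.cat(outputs, dim=1), hidden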