My question surely has a simple answer, but I couldn't find it. I wish to apply MultiheadAttention to the same sequence without copying the sequence. My data is temporal data with dimensions (batch, time, channels). I treat the "channels" dimension as the embedding, and the time dimension as the sequence dimension. For example:
import torch
N, C, T = 2, 3, 5
n_heads = 7
X = torch.rand(N, T, C)
Now, I want to apply 7 different heads as self-attention to the same input X, but as far as I understand, this requires me to copy the data 7 times:
attn = torch.nn.MultiheadAttention(C * n_heads, n_heads, batch_first=True)
X_ = X.repeat(1, 1, n_heads)
attn(X_, X_, X_)
Is there any way to do this without copying the data 7 times?
Thanks!
I'm currently trying to implement an Encoder-Decoder architecture for text summarization based on Transformers, so I need to apply MultiHeadAttention on the decoder side of the model. Since I want to ensure that the model doesn't attend to unseen tokens of the target sequence, I need to use the 3D attention mask (attn_mask) argument.
According to the documentation (https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html), the shape of the mask must be (BATCH_SIZE * NUMBER_HEADS, SEQUENCE_LENGTH, SEQUENCE_LENGTH). That is fine, as it makes it possible to use different attention masks for different heads, which I don't need in my case ...
But the documentation doesn't state how the tensor needs to be filled along its first dimension, and I can't see/find in the implementation how it is actually used...
Is it:
[
[2D Attention for Batch 1 for Head 1]
[2D Attention for Batch 2 for Head 1]
...
[2D Attention for Batch 1 for Head 2]
[2D Attention for Batch 2 for Head 2]
...
[2D Attention for Batch n for Head n]
]
or
[
[2D Attention for Batch 1 for Head 1]
[2D Attention for Batch 1 for Head 2]
...
[2D Attention for Batch 2 for Head 1]
[2D Attention for Batch 2 for Head 2]
...
[2D Attention for Batch n for Head n]
]
Would be great if someone knows :)
I had the same question until I got to the link posted by cokeSchlimpf. Thanks for sharing this.
Overview:
If we want to set a different mask (src_mask) for each example/instance in a batch, then it's advised (here) to have the src_attention_mask in the shape (N*num_heads, T, S), where N is the batch size and num_heads is the number of heads in the MultiheadAttention module. Additionally, T is the target sequence length and S is the source sequence length.
Explanation of code at link:
Say mask is of shape (N, T, S). Then with torch.repeat_interleave(mask, num_heads, dim=0), each mask instance (there are N in total) is repeated num_heads times, forming a block of shape (num_heads, T, S). Stacking these blocks for all N masks, we finally get an array of shape:
[
[num_heads, T, S] # for example 1 in the batch
[num_heads, T, S] # for example 2 in the batch
.
.
.
[num_heads, T, S] # for example N in the batch
] = [N*num_heads, T, S] # after concatenating along dim=0
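To make the ordering concrete: as far as I can tell, this layout corresponds to the second option from the question above (all heads of example 1 first, then all heads of example 2, and so on). A minimal sketch with made-up sizes:
import torch

N, num_heads, T, S = 2, 8, 22, 22  # made-up sizes for illustration

# One (T, S) mask per batch element
mask = torch.zeros(N, T, S)

# Repeat each example's mask num_heads times along dim=0:
# row order is [ex0, ex0, ..., ex0, ex1, ex1, ..., ex1], i.e. grouped by batch element
mask_3d = torch.repeat_interleave(mask, num_heads, dim=0)
print(mask_3d.shape)  # torch.Size([16, 22, 22]) == (N * num_heads, T, S)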
The following is a small code snippet of the implementation (with torch==1.12.1+cu102).
import torch
import torch.nn as nn

class test(nn.Module):
    def __init__(self):
        super(test, self).__init__()
        enc_layer = torch.nn.TransformerEncoderLayer(d_model=16, nhead=8, batch_first=True)
        self.layer = torch.nn.TransformerEncoder(enc_layer, num_layers=1)

    def forward(self, x, src_mask, key_mask):
        return self.layer(x, mask=src_mask, src_key_padding_mask=key_mask)

mod = test()
mod.eval()

# src_mask has shape (N*num_heads, T, S) = (2*8, 22, 22);
# key_mask (src_key_padding_mask) has shape (N, S). Use all-False masks so that
# nothing is masked out (True marks a position to be ignored).
out = mod(x=torch.randn(2, 22, 16),
          src_mask=torch.zeros(8 * 2, 22, 22, dtype=torch.bool),
          key_mask=torch.zeros(2, 22, dtype=torch.bool))
print(out.shape)  # torch.Size([2, 22, 16])
Hope this helps!
I have a tensor of shape T x B x N (training data for an RNN; T is the max sequence length, B is the batch size, and N the number of features), and I'd like to flatten all the features across timesteps, so that I get a tensor of shape B x TN. I haven't been able to figure out how to do this...
You need to permute your axes before flattening, like so:
t = t.swapdims(0, 1)  # (T,B,N) -> (B,T,N)
t = t.reshape(B, -1)  # (B,T,N) -> (B,T*N) (equivalent to `t.reshape(B, T*N)`; use reshape rather than view, since swapdims makes the tensor non-contiguous)
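For example, a quick sanity check with small made-up sizes:
import torch

T, B, N = 4, 3, 2                  # small made-up sizes
t = torch.arange(T * B * N).reshape(T, B, N).float()

flat = t.swapdims(0, 1).reshape(B, -1)
print(flat.shape)                  # torch.Size([3, 8]) == (B, T*N)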
I have some data that is of shape 10000 x 1440 x 8 where 10000 is the number of days, 1440 the number of minutes and 8 is the number of features.
For each day, i.e. each submatrix of size 1440 x 8, I wish to train an autoencoder and extract the weights from the second layer, such that my output will be a matrix output = 10000 x 8.
I can do this in a loop with
import numpy as np
from keras.layers import Input, Dense
from keras import regularizers, models, optimizers
data = np.random.random(size=(10000,1440,8))
def AE(y, epochs=100, learning_rate=1e-4, regularization=5e-4):
    input = Input(shape=(y.shape[1],))
    encoded = Dense(1, activation='relu',
                    kernel_regularizer=regularizers.l2(regularization))(input)
    decoded = Dense(y.shape[1], activation='relu',
                    kernel_regularizer=regularizers.l2(regularization))(encoded)
    autoencoder = models.Model(input, decoded)
    autoencoder.compile(optimizer=optimizers.Adam(lr=learning_rate), loss='mean_squared_error')
    autoencoder.fit(y, y, epochs=epochs, batch_size=10, shuffle=False)
    (w1, b1, w2, b2) = autoencoder.get_weights()
    return (w1, b1, w2, b2)
lst = []
for i in range(data.shape[0]):
    y = data[i]
    (_, _, w2, _) = AE(y)
    lst.append(w2[0])

output = np.array(lst)
However, this feels very stupid, as surely I must be able to just pass the 3D data to the autoencoder and retrieve what I want. However, if I try to modify the shape of the input to input = Input(shape=(y.shape[1], y.shape[2]))
I get an error
ValueError: Dimensions must be equal, but are 1440 and 8 for '{{node
mean_squared_error/SquaredDifference}} =
SquaredDifference[T=DT_FLOAT](model_778/dense_1558/Relu,
IteratorGetNext:1)' with input shapes: [?,1440,1440], [?,1440,8].
Any pointers on how to get the shape right?
Simply reshape your data like so and call the function:
data = data.reshape(data.shape[0]*data.shape[1], -1)
(w1, b1, w2, b2) = AE(data)
print(w2.shape)
Your first layer of the NN is a Dense layer, which expects two-dimensional input: one dimension is the batch size and the other is the feature vector. The way you are using the data, you are treating each data point independently, which means you can join the first two axes together and just pass the result on to the NN. However, note that you would still need to modify the code so that you are not passing the entire dataset to the NN at once; you would need to split the data into batches and loop over them. Honestly, that is not far from what you are doing now, so your loop is not as bad as you think it is for what you are trying to do.
However, also note that you have time-series data, and treating each data point as an independent sample doesn't really make sense. You would need an LSTM layer or something similar to learn a time-series encoding.
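For example, a minimal sketch of what an LSTM-based autoencoder over the (1440, 8) sequences could look like (the layer size of 16 and the number of samples are arbitrary placeholders, not tuned values):
import numpy as np
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras import models

timesteps, n_features = 1440, 8

inputs = Input(shape=(timesteps, n_features))
encoded = LSTM(16)(inputs)                              # sequence -> fixed-size code
decoded = RepeatVector(timesteps)(encoded)              # code -> sequence of codes
decoded = LSTM(16, return_sequences=True)(decoded)
decoded = TimeDistributed(Dense(n_features))(decoded)   # back to 8 features per step

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

data = np.random.random(size=(100, timesteps, n_features)).astype('float32')
autoencoder.fit(data, data, epochs=1, batch_size=10)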
I have:
def __init__(self, feature_dim=15, hidden_size=5, num_layers=2):
    super(BaselineModel, self).__init__()
    self.num_layers = num_layers
    self.hidden_size = hidden_size
    self.lstm = nn.LSTM(input_size=feature_dim,
                        hidden_size=hidden_size, num_layers=num_layers)
and then I get an error:
RuntimeError: The size of tensor a (5) must match the size of tensor b (15) at non-singleton dimension 2
If I set the two sizes to be the same, then the error goes away. But I'm wondering if my input_size is some large number, say 15, and I want to reduce the number of hidden features to 5, why shouldn't that work?
It should work; the error probably comes from somewhere else.
This works, for example:
import numpy as np
import torch
import torch.nn as nn

feature_dim = 15
hidden_size = 5
num_layers = 2
seq_len = 5
batch_size = 3

lstm = nn.LSTM(input_size=feature_dim,
               hidden_size=hidden_size, num_layers=num_layers)
t1 = torch.from_numpy(np.random.uniform(0, 1, size=(seq_len, batch_size, feature_dim))).float()
output, states = lstm.forward(t1)
hidden_state, cell_state = states
print("output: ", output.size())
print("hidden_state: ", hidden_state.size())
print("cell_state: ", cell_state.size())
and returns:
output: torch.Size([5, 3, 5])
hidden_state: torch.Size([2, 3, 5])
cell_state: torch.Size([2, 3, 5])
Are you using the output somewhere after the LSTM? Did you notice that it has a size equal to the hidden dim, i.e. 5 on the last dim? It looks like you're using it afterwards thinking it has a size of 15 instead.
The short answer is: Yes, input_size can be different from hidden_size.
For a more elaborate answer, take a look at the LSTM formulae in the PyTorch documentation, for instance the input gate:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi)
This is the formula to compute i_t, the input gate activation at the t-th time step for one layer. Here the matrix W_ii has shape (hidden_size x input_size). Similarly, in the other formulae, the matrices W_if, W_ig, and W_io all have the same shape. These matrices project the input tensor into the same space as the hidden state, so that they can be added together.
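You can check this directly on an nn.LSTM module's parameters (PyTorch stacks the four input-hidden matrices into weight_ih_l0):
import torch.nn as nn

lstm = nn.LSTM(input_size=15, hidden_size=5, num_layers=2)

# weight_ih_l0 stacks W_ii, W_if, W_ig, W_io -> shape (4*hidden_size, input_size)
print(lstm.weight_ih_l0.shape)  # torch.Size([20, 15])
# weight_hh_l0 stacks W_hi, W_hf, W_hg, W_ho -> shape (4*hidden_size, hidden_size)
print(lstm.weight_hh_l0.shape)  # torch.Size([20, 5])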
Back to your specific problem, as the other answer pointed out, it's probably an error in another part of your code. Without looking at your forward implementation, it's hard to say exactly what the problem is.
Long story short, I have an RNN that is stacked on top of a CNN.
The CNN was created and trained separately. To clarify things, let's suppose the CNN takes input in the form of a [BATCH SIZE, H, W, C] placeholder (H = height, W = width, C = number of channels).
Now, when stacked on top of the RNN, the overall input to the combined network will have the shape: [BATCH SIZE, TIME SEQUENCE, H, W, C], i.e. each sample in the minibatch consists of TIME_SEQUENCE many images. Moreover, the time sequences are variable in length. There is a separate placeholder called sequence_lengths with shape [BATCH SIZE] that contains scalar values corresponding to the length of each sample in the minibatch. The value of TIME SEQUENCE corresponds to the maximum possible time sequence length, and for samples with smaller lengths, the remaining values are padded with zeros.
What I want to do
I want to accumulate the output from the CNN in a tensor of shape [BATCH SIZE, TIME SEQUENCE, 1] (the last dimension just contains the final score output by the CNN for each time sample for each batch element) so that I can forward this entire chunk of information to the RNN that is stacked on top of the CNN. The tricky thing is, I also want to be able to back-propagate the error from the RNN to the CNN (the CNN is already pre-trained, but I would like to fine-tune the weights a bit), so I have to stay inside the graph, i.e. I can't make any calls to session.run().
Option A:
The easiest way would be to just reshape the overall network input tensor to [BATCH SIZE * TIME SEQUENCE, H, W, C]. The problem with this is that BATCH SIZE * TIME SEQUENCE may be as large as 2000, so I'm bound to run out of memory when trying to feed a batch that big into my CNN. And the batch size is too large for training anyway. Also, a lot of sequences are just padded zeros, and it'd be a waste of computation.
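For reference, Option A amounts to a single reshape; a minimal sketch assuming TF 1.x placeholders (the H, W, C values here are made up, not from the question):
import tensorflow as tf  # assumes TF 1.x, as in the rest of the question

H, W, C = 64, 64, 3  # placeholder image dimensions

# [BATCH SIZE, TIME SEQUENCE, H, W, C] with unknown batch and time dims
input_image_sequence = tf.placeholder(tf.float32, [None, None, H, W, C])

# Collapse batch and time into one axis:
# [BATCH SIZE, TIME SEQUENCE, H, W, C] -> [BATCH SIZE * TIME SEQUENCE, H, W, C]
flat_input = tf.reshape(input_image_sequence, [-1, H, W, C])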
Option B:
Use tf.while_loop. My idea was to treat all the images along the time axis for a single minibatch element as a minibatch for the CNN. Essentially, the CNN would be processing batches of shape [TIME SEQUENCE, H, W, C] at each iteration (not exactly TIME SEQUENCE many images every time; the exact number depends on the sequence length). The code I have right now looks like this:
# The output tensor that I want populated
image_output_sequence = tf.Variable(tf.zeros([batch_size, max_sequence_length, 1], tf.float32))

# Counter for the loop. I'll process one batch element per iteration.
# One batch element contains a variable number of images for each time step.
# All these images will form a minibatch for the CNN.
loop_counter = tf.get_variable('loop_counter', dtype=tf.int32, initializer=0)

# Loop variables that will be passed to the body and cond methods
loop_vars = [input_image_sequence, sequence_lengths, image_output_sequence, loop_counter]
# input_image_sequence: [BATCH SIZE, TIME SEQUENCE, H, W, C]
# sequence_lengths: [BATCH SIZE]
# image_output_sequence: [BATCH SIZE, TIME SEQUENCE, 1]

# abbreviations for vars in loop_vars:
# iis --> input_image_sequence
# sl  --> sequence_lengths
# ios --> image_output_sequence
# lc  --> loop_counter

def cond(iis, sl, ios, lc):
    return tf.less(lc, batch_size)

def body(iis, sl, ios, lc):
    seq_len = sl[lc]  # the sequence length of the current batch element
    cnn_input_batch = iis[lc, :seq_len]  # extract the relevant portion (the rest are just padded zeros)

    # propagate this 'batch' through the CNN; one score per image -> [seq_len, 1]
    cnn_output_batch = my_cnn_model.process_input(cnn_input_batch)

    # Pad the remaining time steps back up to TIME SEQUENCE
    padding = [[0, max_sequence_length - seq_len], [0, 0]]
    padded_cnn_output = tf.pad(cnn_output_batch, paddings=padding, mode='CONSTANT', constant_values=0)

    # The problematic part: assign these processed values to the output tensor
    ios[lc].assign(padded_cnn_output)

    return [iis, sl, ios, lc + 1]

_, _, result, _ = tf.while_loop(cond, body, loop_vars, swap_memory=True)
Inside my_cnn_model.process_input, I'm just passing the input through a vanilla CNN. All the variables created in it use reuse=tf.AUTO_REUSE, so that should ensure that the while loop reuses the same weights for all the loop iterations.
The exact problem
image_output_sequence is a variable, but somehow when tf.while_loop calls the body method, it gets turned into a Tensor type object to which assignments can't be made. I get the error message: Sliced assignment is only supported for variables
This problem persists even if I use another format like using a tuple of BATCH SIZE Tensors each with dimensions [TIME SEQUENCE, H, W, C].
I'm open to a complete redesign of the code as well, as long as it gets the job done nicely.
The solution is to use an object of type TensorArray, which is specifically made to address such problems. The following line:
image_output_sequence = tf.Variable(tf.zeros([batch_size, max_sequence_length, 1], tf.float32))
is replaced by:
image_output_sequence = tf.TensorArray(size=batch_size, dtype=tf.float32, element_shape=[max_sequence_length, 1], infer_shape=True)
TensorArray doesn't actually require a fixed shape for each element, but for my case it is fixed, so it's better to enforce it.
Then inside the body function, replace this:
ios[lc].assign(padded_cnn_output)
with:
ios = ios.write(lc, padded_cnn_output)
Then after the tf.while_loop statement, the TensorArray can be stacked to form a regular Tensor for further processing:
stacked_tensor = result.stack()
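For completeness, here is a minimal, self-contained sketch of the whole pattern, assuming TF 1.x as in the question; a dummy "CNN" that just averages each image stands in for my_cnn_model, and all sizes are made up:
import tensorflow as tf  # assumes TF 1.x, matching the rest of the thread

batch_size, max_sequence_length, H, W, C = 4, 6, 8, 8, 3  # made-up sizes
input_image_sequence = tf.random_uniform([batch_size, max_sequence_length, H, W, C])
sequence_lengths = tf.constant([6, 3, 5, 2], dtype=tf.int32)

def dummy_cnn(images):
    # Stand-in for my_cnn_model.process_input: one score per image -> [seq_len, 1]
    return tf.reduce_mean(images, axis=[1, 2, 3])[:, None]

ios = tf.TensorArray(size=batch_size, dtype=tf.float32,
                     element_shape=[max_sequence_length, 1], infer_shape=True)

def cond(iis, sl, ios, lc):
    return tf.less(lc, batch_size)

def body(iis, sl, ios, lc):
    seq_len = sl[lc]
    scores = dummy_cnn(iis[lc, :seq_len])                               # [seq_len, 1]
    padded = tf.pad(scores, [[0, max_sequence_length - seq_len], [0, 0]])
    ios = ios.write(lc, padded)                                         # instead of sliced assignment
    return [iis, sl, ios, lc + 1]

_, _, result, _ = tf.while_loop(
    cond, body, [input_image_sequence, sequence_lengths, ios, tf.constant(0)])

stacked_tensor = result.stack()  # [batch_size, max_sequence_length, 1]

with tf.Session() as sess:
    print(sess.run(stacked_tensor).shape)  # (4, 6, 1)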