I'm a beginner with the PyTorch framework, and I'm trying to add multi-headed self-attention on top of another architecture (BERT). This is probably a simple question, but I'm not familiar with PyTorch:
UPDATE 1
import math

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x, seq_len=768, mask=None):
        pos_emb = self.pe[:, :seq_len]
        x = x * mask[:, :, None].float()
        x = x + pos_emb
        return x
My problem is how to add the transformer in the following class:
class CamemBERTQA(nn.Module):
    def __init__(self, bert_type, hidden_size, num_labels, num_inter_layers=1, heads=12, do_lower_case=True):
        super(CamemBERTQA, self).__init__()
        self.do_lower_case = do_lower_case
        self.bert_type = bert_type
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.num_inter_layers = num_inter_layers
        self.camembert = CamembertModel.from_pretrained(self.bert_type)

        # ---------------- Transformer ------------------------------------------
        self.d_model = self.hidden_size  # 768
        dropout = 0.1
        self.pos_emb = PositionalEncoding(d_model=self.d_model, dropout=dropout)
        self.transformer_inter = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=self.d_model, nhead=heads, dim_feedforward=2048, dropout=dropout)
             for _ in range(num_inter_layers)])
        # ---------------- Transformer ------------------------------------------

        self.qa_outputs = nn.Linear(self.hidden_size, self.num_labels)

    def forward(self, input_ids, mask=None):
        bert_output = self.camembert(input_ids=input_ids)  # input_ids is a tensor

        # ---------------- Transformer ------------------------------------------
        seq_len = self.hidden_size
        x = self.pos_emb(x=bert_output, seq_len=seq_len, mask=None)
        for i in range(self.num_inter_layers):
            x = self.transformer_inter[i](i, x, x, 1 - mask)  # all_tokens * max_tokens * dim
        output = self.layer_norm(x)
        # ---------------- Transformer ------------------------------------------

        sequence_output = output[0]
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        outputs = (start_logits, end_logits,)
        return x
Thank you so much.
So it seems that you're trying to add a Transformer network on top of the BERT component. Keep in mind that self-attention is only one part of a Transformer; a Transformer layer also has other components besides self-attention (feed-forward sublayers, residual connections, and layer normalization). I would recommend using a full Transformer encoder (which includes the self-attention component) that receives the BERT vectors and transforms them into another representation (in another space).
Try this instead of self.attention = MultiHeadAttention():
self.transformer_inter = nn.ModuleList(
    [TransformerEncoderLayer(d_model, heads, d_ff, dropout)
     for _ in range(num_inter_layers)])
and then in forward(), call self.transformer_inter in a loop, which will give you the representations produced by the Transformer architecture, like this:
def forward(self, bert_output, mask):
    batch_size, seq_len = bert_output.size(0), bert_output.size(1)

    # Transformer Encoder
    pos_emb = self.pos_emb.pe[:, :seq_len]
    x = bert_output * mask[:, :, None].float()
    x = x + pos_emb
    for i in range(self.num_inter_layers):
        x = self.transformer_inter[i](i, x, x, 1 - mask)  # all_tokens * max_tokens * dim
    x = self.layer_norm(x)  # the Transformer also normalizes the outputs from each layer
    # x holds the vectors encoded by the Transformer encoder
    return x
Then, using an nn.Linear(.) layer, do another transformation to map hidden_size to the number of labels for your task, which will give you the logits for each label. All of this should be done within the BERT class that you have posted.
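For example, here is a minimal sketch of such an output head, assuming num_labels = 2 as in your posted CamemBERTQA class (one logit for the answer start and one for the answer end), and assuming x is the encoder output from forward() above:

# x: (batch_size, seq_len, hidden_size) -> the Transformer encoder output
qa_outputs = nn.Linear(hidden_size, num_labels)   # num_labels = 2
logits = qa_outputs(x)                            # (batch_size, seq_len, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)           # (batch_size, seq_len)
end_logits = end_logits.squeeze(-1)               # (batch_size, seq_len)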
Note that TransformerEncoderLayer is a placeholder class that I used above, so you have to either implement it yourself or use an open-source package. As Transformers are quite well known, I think you won't have trouble finding an implementation.
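If you would rather not implement it, here is a minimal sketch using PyTorch's built-in nn.TransformerEncoderLayer. Note that its interface differs from the placeholder above: it takes only the input tensor (sequence-first by default) plus an optional key-padding mask, and it already applies layer normalization internally. The tensor sizes below are assumptions for illustration only:

import torch
import torch.nn as nn

d_model, heads, num_inter_layers = 768, 12, 1

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads,
                                           dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_inter_layers)

bert_output = torch.randn(8, 128, d_model)        # (batch, seq_len, d_model)
pad_mask = torch.zeros(8, 128, dtype=torch.bool)  # True marks padded positions to ignore

# nn.TransformerEncoder expects (seq_len, batch, d_model) by default
x = encoder(bert_output.transpose(0, 1), src_key_padding_mask=pad_mask)
x = x.transpose(0, 1)                             # back to (batch, seq_len, d_model)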
Related
I'm currently working on a graph neural network project to predict the results of soccer matches. I've trained the model on about 4000 matches, but its training loss only dropped by about 0.05.
[plot: train loss]
Have I made some mistakes when building my model?
I'm new to PyTorch, so please bear with me if there are some stupid mistakes, and let me know if you need any more details on my project for further inspection.
Any help appreciated!
class GNN(torch.nn.Module):
    def __init__(self, feature_size, model_params):
        super(GNN, self).__init__()
        embedding_size = model_params["model_embedding_size"]
        n_heads = model_params["model_attention_heads"]
        self.n_layers = model_params["model_layers"]
        dropout_rate = model_params["model_dropout_rate"]
        top_k_ratio = model_params["model_top_k_ratio"]
        self.top_k_every_n = model_params["model_top_k_every_n"]
        dense_neurons = model_params["model_dense_neurons"]
        # edge_dim = model_params["model_edge_dim"]

        self.conv_layers = ModuleList([])
        self.transf_layers = ModuleList([])
        self.pooling_layers = ModuleList([])
        self.bn_layers = ModuleList([])

        # Transformation layer: transform original node features to an embedding vector
        # (size: embedding_size, defined in config.py)
        self.conv1 = TransformerConv(feature_size,
                                     embedding_size,
                                     heads=n_heads,
                                     dropout=dropout_rate,
                                     # edge_dim=edge_dim,
                                     beta=True)
        self.transf1 = Linear(embedding_size * n_heads, embedding_size)
        self.bn1 = BatchNorm1d(embedding_size)

        # Other layers: message passing and pooling
        for i in range(self.n_layers):
            self.conv_layers.append(TransformerConv(embedding_size,
                                                    embedding_size,
                                                    heads=n_heads,
                                                    dropout=dropout_rate,
                                                    # edge_dim=edge_dim,
                                                    beta=True))
            # Map conv_layer output size back to embedding_size (embedding_size*n_heads -> embedding_size)
            self.transf_layers.append(Linear(embedding_size * n_heads, embedding_size))
            # Batch normalization
            self.bn_layers.append(BatchNorm1d(embedding_size))
            # Top-k pooling to reduce the size of the graph
            if i % self.top_k_every_n == 0:
                self.pooling_layers.append(TopKPooling(embedding_size, ratio=top_k_ratio))

        # Linear output layers: feed graph representation in & reduce until single value left
        self.linear1 = Linear(embedding_size * 2, dense_neurons)
        self.linear2 = Linear(dense_neurons, int(dense_neurons / 2))
        self.linear3 = Linear(int(dense_neurons / 2), 3)

    def forward(self, x, edge_index, batch_index):
        # Initial transformation
        x = self.conv1(x, edge_index)
        x = torch.relu(self.transf1(x))
        x = self.bn1(x)

        # Holds the intermediate graph representations
        global_representation = []

        for i in range(self.n_layers):
            x = self.conv_layers[i](x, edge_index)
            x = torch.relu(self.transf_layers[i](x))
            x = self.bn_layers[i](x)
            # Always aggregate last layer
            if i % self.top_k_every_n == 0 or i == self.n_layers:
                x, edge_index, edge_attr, batch_index, _, _ = self.pooling_layers[int(i / self.top_k_every_n)](
                    x, edge_index, None, batch_index)
                # Add current representation
                global_representation.append(torch.cat([gmp(x, batch_index), gap(x, batch_index)], dim=1))

        x = sum(global_representation)  ######

        # Output block
        x = torch.relu(self.linear1(x))
        x = F.dropout(x, p=0.8, training=self.training)
        x = torch.relu(self.linear2(x))
        x = F.dropout(x, p=0.8, training=self.training)
        x = self.linear3(x)
        return x
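For reference, a minimal sketch of how this class could be instantiated and called; the hyperparameter values and the random graph below are assumptions for illustration only, not the actual training setup, and the sketch assumes the imports from the original module (torch_geometric's TransformerConv, TopKPooling, gmp/gap, etc.) are in scope:

import torch
from torch_geometric.data import Data, Batch

# Assumed hyperparameter values, for illustration only
model_params = {
    "model_embedding_size": 64,
    "model_attention_heads": 3,
    "model_layers": 4,
    "model_dropout_rate": 0.2,
    "model_top_k_ratio": 0.5,
    "model_top_k_every_n": 1,
    "model_dense_neurons": 128,
}
model = GNN(feature_size=16, model_params=model_params)
model.eval()  # avoid batch-norm statistics issues on this tiny example graph

# One random graph with 10 nodes (16 features each) and 20 edges
graph = Data(x=torch.randn(10, 16), edge_index=torch.randint(0, 10, (2, 20)))
batch = Batch.from_data_list([graph])
out = model(batch.x, batch.edge_index, batch.batch)  # -> shape (1, 3), one score per class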
I need your kind help.
How can I design a CNN model to train on multiple labels for image classification?
My data is a bunch of spectra. Each spectrum has 4 labels. How can I build a CNN to classify those images?
How should I compose the forward function and the initialization function?
What kinds of layers do you recommend?
First, the metadata structure goes like this:
[image: metadata table]
Here is my Dataset
class OurDataset(Dataset):
    spectra_dir = f"./data/spectrograms_fm"
    metaData_path = f"./data/FMAudio/metaData.csv"

    def __init__(self):
        self.audio_labels = panda.read_csv(self.metaData_path)
        self.transform = torchvision.transforms.Compose([
            torchvision.transforms.Resize((201, 81)),
            torchvision.transforms.ToTensor()
        ])

    def __len__(self):
        return len(self.audio_labels)

    def __getitem__(self, idx):
        img = PILImage.open(f"./data/spectrograms_fm/mutiplelabels/{self.audio_labels.iloc[idx, 8]}.png").convert("RGB")
        img = self.transform(img)
        label_A = torch.tensor(self.audio_labels.iloc[idx, 4])
        label_Fw = torch.tensor(self.audio_labels.iloc[idx, 5])
        label_P = torch.tensor(self.audio_labels.iloc[idx, 6])
        label_Fi = torch.tensor(self.audio_labels.iloc[idx, 7])
        return img, label_A, label_Fw, label_P, label_Fi
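For context, a minimal sketch of how this dataset would typically be wrapped in a DataLoader and handed to the training function below (the batch size here is an assumption):

from torch.utils.data import DataLoader

dataset = OurDataset()
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
# each batch yields (img_tensors, Y_A, Y_Fw, Y_P, Y_Fi), matching the loop in train()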
I defined the training function
def train(dataloader, model, loss, optimizer):
    model.train()
    size = len(dataloader.dataset)
    for batch, (img_tensors, Y_A, Y_Fw, Y_P, Y_Fi) in enumerate(dataloader):
        optimizer.zero_grad()
        pred = model(img_tensors.float())
        loss_A = cost(pred, Y_A)
        loss_Fw = cost(pred, Y_Fw)
        loss_P = cost(pred, Y_P)
        loss_Fi = cost(pred, Y_Fi)
        loss = loss_A + loss_Fw + loss_P + loss_Fi
        loss.backward()
        optimizer.step()
I used the official CNN model from the Microsoft PyTorch tutorial on image classification, which is not suited to my case.
class CNNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(2112, 50)
        self.fc2 = nn.Linear(50, 4)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 4))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 4))
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = F.relu(self.fc2(x))
        return F.log_softmax(x, dim=1)
Can you write me a demo?
Cheers
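One common way to handle this kind of multi-label setup is a shared CNN trunk with one classification head per label, so that the four losses in the train() function above each get their own logits. A minimal sketch under that assumption (the number of classes per label is a placeholder, not taken from the actual metadata):

import torch.nn as nn
import torch.nn.functional as F

class MultiLabelCNNet(nn.Module):
    def __init__(self, n_classes_A=10, n_classes_Fw=10, n_classes_P=10, n_classes_Fi=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(2112, 64)  # 2112 matches the flattened size for 3x201x81 inputs
        self.head_A = nn.Linear(64, n_classes_A)
        self.head_Fw = nn.Linear(64, n_classes_Fw)
        self.head_P = nn.Linear(64, n_classes_P)
        self.head_Fi = nn.Linear(64, n_classes_Fi)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 4))
        x = F.relu(F.max_pool2d(self.conv2(x), 4))
        x = F.relu(self.fc1(self.flatten(x)))
        # One set of raw logits per label; pair each head with nn.CrossEntropyLoss
        return self.head_A(x), self.head_Fw(x), self.head_P(x), self.head_Fi(x)

With a model like this, train() would unpack the four outputs, e.g. pred_A, pred_Fw, pred_P, pred_Fi = model(img_tensors.float()), and compute loss_A = cost(pred_A, Y_A) and so on.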
I am trying to build a wake-word model for my AI assistant. I have 3 audio clips, each 1 second long, and I created the data from them. As an example, I have the MFCC features extracted from those 3 audio clips.
def test():
    # look at the values of the tensors after printing
    wwd = WakeWordData(data_json='../../data_json_files/test.json')
    print(wwd[0])
    arr = []
    arr.append(wwd[87])  # shape (1, 19, 40)
    arr.append(wwd[0])   # shape (1, 78, 40)
    arr.append(wwd[4])   # shape (1, 28, 40)
    mfccs, labels = collate_fn(arr)  # torch.Size([78, 3, 40])
    model_params = {
        "size_of_output": 1, "input_size": 40, "hidden_size": 1,
        "num_layers": 2, "dropout": 0.1, "bidirectional": True,
        "device": 'cpu'
    }
    lst_w = LSTM_WakeWord(**model_params)
    o = lst_w(mfccs)
    # print(str(o))
Here is my collate_fn below.
def collate_fn(data):
    mfccs = []
    labels = []
    for d in data:
        mfcc_tensor, label = d
        # mfcc_tensor -> (channel, time, n_mfcc)
        mfccs.append(mfcc_tensor.squeeze(0).transpose(0, 1))
        labels.append(label)
    mfccs = nn.utils.rnn.pad_sequence(mfccs, batch_first=True)  # batch, feature(n_mfcc), seq_len(time)
    print("collate_fn MFCCs->" + str(mfccs.shape))  # torch.Size([3, 78, 40])
    mfccs = mfccs.transpose(0, 1)  # torch.Size([78, 3, 40]) (feature(n_mfcc), batch, seq_len(time))
    labels = torch.Tensor(labels)
    return mfccs, labels
When I run this code with 3 MFCCs, after pad_sequence I get the data as (3, 78, 40), which I think is (batch, features(n_mfcc), seq_len(time)). Is that correct? Then I transpose it and get ([78, 3, 40]).
Then I try to give it to my LSTM. The LSTM takes its input as (seq_len, batch, feature). I can make the model work even though my (78, 3, 40) is (features(n_mfcc), batch, seq_len(time)). Should I set the shape exactly as the model wants, or is it fine as long as it's working?
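For reference, a small sketch of the shapes nn.LSTM expects and returns with the default batch_first=False; the numbers mirror the ones above:

import torch
import torch.nn as nn

seq_len, batch, n_mfcc = 78, 3, 40
x = torch.randn(seq_len, batch, n_mfcc)  # (seq_len, batch, feature) is the default layout

lstm = nn.LSTM(input_size=n_mfcc, hidden_size=1, num_layers=2, bidirectional=True)
out, (hn, cn) = lstm(x)
print(out.shape)  # torch.Size([78, 3, 2]) -> (seq_len, batch, hidden_size * num_directions)
print(hn.shape)   # torch.Size([4, 3, 1])  -> (num_layers * num_directions, batch, hidden_size)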
My model is below.
class LSTM_WakeWord(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout, bidirectional, size_of_output, device):
        super(LSTM_WakeWord, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.device = device
        self.bidirectional = bidirectional
        self.directions = 2 if bidirectional else 1
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            dropout=dropout,
                            bidirectional=bidirectional)
        self.layernorm = nn.LayerNorm(input_size)
        self.classifier = nn.Linear(hidden_size * self.directions, size_of_output)

    def _init_hidden(self, batch_size):
        n, d, hs = self.num_layers, self.directions, self.hidden_size
        return (torch.zeros(n * d, batch_size, hs).to(self.device),
                torch.zeros(n * d, batch_size, hs).to(self.device))

    def forward(self, x):
        # the values with e+xxx are gone, so it normalizes the values
        x = self.layernorm(x)
        # x shape -> feature(n_mfcc), batch, seq_len(time)
        hidden = self._init_hidden(x.size()[1])
        out, (hn, cn) = self.lstm(x, hidden)
        print("hn " + str(hn.shape))
        print("out " + str(out.shape))
        out = self.classifier(hn)
        return out
But then I get an error when I try to give the hidden state output to my linear dense layer (classifier). It is a shape error: mat1 and mat2 shapes cannot be multiplied (12x1 and 2x1).
Why is this happening?
I am new to PyTorch and I am trying to build a reinforcement learning system that uses OpenAI to try to predict whether or not a stock should be bought, and at what time.
class NeuronalNetwork(nn.Module):
    def __init__(self, stock_env: StockEnv):
        super(NeuronalNetwork, self).__init__()
        self.stock_env = stock_env
        input_size = len(self.stock_env.normalized_dataframe.columns)
        self.hidden_size = 128
        self.num_layers = 4
        self.kernel = 2
        output_size = self.stock_env.action_space.n
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=self.hidden_size, num_layers=self.num_layers, batch_first=True)
        self.output_layer = nn.Linear(self.hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=output_size)
        self.tanh = nn.Tanh()

    def forward(self, x, hidden=None):
        # N x T x D
        # N - the number of windows
        # T - the window size
        # D - the number of indicators and OHLCV in total
        if len(x.shape) > 2:
            batch_size = x.shape[0]
        else:
            batch_size = 1
        if hidden is None:
            hidden = (
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
            )
        D = len(self.stock_env.normalized_dataframe.columns)
        T = self.stock_env.window_size
        N = batch_size
        x = x.view(N, T, D).type(torch.FloatTensor).to(device)
        out, (ht, ct) = self.lstm(x, hidden)
        out = self.tanh(out)
        out = self.output_layer(out)
        return out
My x from forward represents my data in the form [Number_of_batches x Window_size x Features].
For the moment my out will have the shape Number_of_batches x Window_size x Actions, but what I want my model to learn is to predict the best action ONLY for the 250th element. So does anyone know what I can do in order to obtain an out with a shape of (batch_size x action), where the action is the one for the last element of the window_size dimension?
Example:
out.shape => (batch_size, window_size, features)
FOR b in all batch_size:
    batch = []
    FOR _ in all actions:
        batch.append(out[b][250])
    new_out.append(batch)
And in the end, I will have an out of shape batch_size x action, where the action is only going to be the action of the 250th element from the window_size.
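In plain PyTorch indexing, that selection would look something like the sketch below (assuming out has shape (batch_size, window_size, actions) and 250 is the window position of interest):

# out: (batch_size, window_size, actions)
new_out = out[:, 250, :]   # or out[:, -1, :] for the last element of the window
# new_out: (batch_size, actions)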
I'm not sure if my question makes sense to you; if it doesn't, just let me know and I will try to explain it differently.
I have been trying to implement the paper: SeER: An Explainable Deep Learning MIDI-based Hybrid Song Recommender System.
So, what I have been doing is this:
Model Code:
class HybridFactorization(tf.keras.layers.Layer):
    # embedding_size is also the number of lstm units
    # num_users, num_movies = input_shape
    # required_users: (batch_size, embedding_size)
    # songs_output: (batch_size, embedding_size)
    def __init__(self, embedding_size, num_users, num_tracks):
        super(HybridFactorization, self).__init__()
        self.embedding_size = embedding_size
        self.num_users = num_users
        self.num_tracks = num_tracks
        self.required_users = None
        self.U = self.add_weight("U",
                                 shape=[self.num_users, self.embedding_size],
                                 dtype=tf.float32,
                                 initializer=tf.initializers.GlorotUniform)
        self.lstm = tf.keras.layers.LSTM(self.embedding_size)

    def call(self, user_index, songs_batch):
        output_lstm = self.lstm(songs_batch)
        self.required_users = self.U.numpy()
        self.required_users = tf.convert_to_tensor(self.required_users[np.array(user_index)],
                                                   dtype=tf.float32)
        return tf.matmul(self.required_users, output_lstm, transpose_b=True)


class HybridRecommender(tf.keras.Model):
    def __init__(self, embedding_size, num_users, num_tracks):
        super(HybridRecommender, self).__init__()
        self.HybridFactorization = HybridFactorization(embedding_size,
                                                       num_users, num_tracks)

    def call(self, user_index, songs_batch):
        output = self.HybridFactorization(user_index, songs_batch)
        return output
Utility Functions and running the model:
def loss_fn(source, target):
    mse = tf.keras.losses.MeanSquaredError()
    return mse(source, target)

model = HybridRecommender(EMBEDDING_SIZE, num_users, num_tracks)
Xhat = model(user_index, songs_batch)
tf.keras.backend.clear_session()
optimizer = tf.keras.optimizers.Adam()
EPOCHS = 1

for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0
    for (batch, (input_batch, target_batch)) in enumerate(train_dataset):
        songs_batch = create_songs_batch(input_batch)
        user_index = input_batch[:, 0].numpy()
        X = create_pivot_batch(input_batch, target_batch)
        with tf.GradientTape() as tape:
            Xhat = model(user_index, songs_batch)
            batch_loss = loss_fn(X, Xhat)
        variables = model.trainable_variables
        gradients = tape.gradient(batch_loss, variables)
        optimizer.apply_gradients(zip(gradients, variables))
        total_loss += batch_loss
Now, various functions like create_songs_batch(input_batch) and create_pivot_batch(input_batch, target_batch) just provide data in the required format.
My model runs but I get the warning:
WARNING:tensorflow:Gradients do not exist for variables ['U:0'] when minimizing the loss.
Now, I can see why the variable U is not being updated, as there is no direct path from the loss to it.
I want to update the specific rows of U that are referenced by user_index in every batch call.
Is there a way to do it?
So, I was able to solve the problem. Rather than copying some rows of U and working on the copy, I used a temporary matrix that is a one-hot encoded form of user_index and multiplied it with U to get the desired rows, which also removed the warning.
The part of the code that needs to be modified:
def call(self, user_index, songs_batch):
    # output_lstm: (batch_size, emb_sz)
    # batch_encoding: (batch_size, num_users)
    # required_users: (batch_size, emb_sz)
    output_lstm = self.lstm(songs_batch)
    user_idx = np.array(user_index)
    batch_encoding = np.zeros((user_idx.size, self.num_users))
    batch_encoding[np.arange(user_idx.size), user_idx] = 1
    batch_encoding = tf.convert_to_tensor(batch_encoding, dtype=tf.float32)
    self.required_users = tf.matmul(batch_encoding, self.U)
    return tf.matmul(self.required_users, output_lstm, transpose_b=True)
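For what it's worth, the same one-hot batch encoding can also be built directly with tf.one_hot, which avoids the NumPy round trip; a sketch of just those lines (the rest of call() stays as above):

# (batch_size, num_users) one-hot rows selecting each user in the batch
batch_encoding = tf.one_hot(user_index, depth=self.num_users, dtype=tf.float32)
self.required_users = tf.matmul(batch_encoding, self.U)  # (batch_size, embedding_size)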