tensorflow autodiff slower than pytorch's counterpart

tensorflow autodiff slower than pytorch's counterpart - python

I am using tensorflow 2.0 and trying to evaluate gradients for backpropagating to a simple feedforward neural network. Here's how my model looks like:
def __init__(self, input_size, output_size):
inputs = tf.keras.Input(shape=(input_size,))
hidden_layer1 = tf.keras.layers.Dense(30, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(output_size)(hidden_layer1)
self.model = tf.keras.Model(inputs=inputs, outputs=outputs)
self.optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
self.loss_function = tf.keras.losses.Huber()
The forward pass to this network is fine but when I use gradient tape to train the model, it is at least 10x slower than PyTorch.
Training function:
def learn_modified_x(self, inputs, targets, actions):
with tf.GradientTape() as tape:
predictions = self.model(inputs)
predictions_for_action = gather_single_along_axis(predictions, actions)
loss = self.loss_function(targets, predictions_for_action)
grads = tape.gradient(loss, self.model.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
I tried commenting lines to find what is actually causing the problem. I discovered that tape.gradient is a significant contributor to this situation.
Any idea?
PyTorch implementation
def __init__(self, input_size, nb_action):
super(Network, self).__init__()
self.input_size = input_size
self.nb_action = nb_action
self.fc1 = nn.Linear(input_size, 30)
self.fc2 = nn.Linear(30, nb_action)
def forward(self, state):
x = F.relu(self.fc1(state))
q_values = self.fc2(x)
return q_values
def learn(self, batch_state, batch_next_state, batch_reward, batch_action):
outputs = self.model(batch_state).gather(1, batch_action.unsqueeze(1)).squeeze(1)
next_outputs = self.model(batch_next_state).detach().max(1)[0]
target = self.gamma*next_outputs + batch_reward
td_loss = F.smooth_l1_loss(outputs, target)
self.optimizer.zero_grad()
td_loss.backward(retain_variables = True)
self.optimizer.step()

def __init__(self,...):
...
self.model.call = tf.function(self.model.call)
...
you need use tf.function to wrap your model's call function.

Related

How to summarize pytorch model

Hello I am building a DQN model for reinforcement learning on cartpole and want to print my model summary like keras model.summary() function
Here is my model class.
class DQN():
''' Deep Q Neural Network class. '''
def __init__(self, state_dim, action_dim, hidden_dim=64, lr=0.05):
super(DQN, self).__init__()
self.criterion = torch.nn.MSELoss()
self.model = torch.nn.Sequential(
torch.nn.Linear(state_dim, hidden_dim),
torch.nn.ReLU(),
torch.nn.Linear(hidden_dim, hidden_dim*2),
torch.nn.ReLU(),
torch.nn.Linear(hidden_dim*2, action_dim)
)
self.optimizer = torch.optim.Adam(self.model.parameters(), lr)
def update(self, state, y):
"""Update the weights of the network given a training sample. """
y_pred = self.model(torch.Tensor(state))
loss = self.criterion(y_pred, Variable(torch.Tensor(y)))
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
def predict(self, state):
""" Compute Q values for all actions using the DQL. """
with torch.no_grad():
return self.model(torch.Tensor(state))
Here is the model instance with the parameters passed.
# Number of states = 4
n_state = env.observation_space.shape[0]
# Number of actions = 2
n_action = env.action_space.n
# Number of episodes
episodes = 150
# Number of hidden nodes in the DQN
n_hidden = 50
# Learning rate
lr = 0.001
simple_dqn = DQN(n_state, n_action, n_hidden, lr)
I tried using torchinfo summary but I get an AttributeError:
'DQN' object has no attribute 'named_parameters'
from torchinfo import summary
simple_dqn = DQN(n_state, n_action, n_hidden, lr)
summary(simple_dqn, input_size=(4, 2, 50))
Any help is appreciated.

Your DQN should be a subclass of nn.Module
class DQN(nn.Module):
def __init__(self, state_dim, action_dim, hidden_dim=64, lr=0.05):
...

Why does my function get good values for LSTM but not for GRU?

I'm trying to implement a program that compares LSTM's performance vs GRU's performance for word prediction. I am using the same parameters for both of them, however while I am getting good perplexity values for the LSTM, the GRU values I'm getting are absolutely terrible.
I recently attempted to debug the training function since it originally only ranfor the LSTM model but not for the GRU model. As I already said, both models should get similar values, however for now the LSTM models starts with around ~150 perplexity and converges to a normal value, when the GRU model starts with some random value that's in the 1000s that does not converge at all.
I am quite new for all the RNN, LSTM, and GRU stuff, so forgive me if there's something obvious that I am missing.
Any help would be appriciated!
I use the following two models:
class LSTM_Model(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout=0):
super(LSTM_Model, self).__init__()
self.embed = nn.Embedding(vocab_size, embed_size)
self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, dropout = dropout)
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, x, hidden_state):
x = self.embed(x)
out, (hidden_state, cell_state) = self.lstm(x, hidden_state)
out = out.reshape(out.size(0)*out.size(1), out.size(2)) # Reshape output to (batch_size*sequence_length, hidden_size)
out = self.fc(out)
return out, (hidden_state, cell_state)
class GRU_Model(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout=0):
super(GRU_Model, self).__init__()
self.embed = nn.Embedding(vocab_size, embed_size)
self.gru = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True, dropout = dropout)
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, x, hidden_state):
x = self.embed(x)
out, hidden_state = self.gru(x, hidden_state)
out = out.reshape(out.size(0)*out.size(1), out.size(2)) # Reshape output to (batch_size*sequence_length, hidden_size)
out = self.fc(out)
return out, hidden_state
Training function:
def run_model(model, epochs=epochs, learning_rate=learning_rate, clip=clip, momentum=momentum, LSTM=True, GRU=False, Dropout=False):
# Define loss criterion and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optimizer, step_size=step_size, gamma=decay_rate)
train_perplexity, test_perplexity, valid_perplexity = [], [], []
# Train the model
for e in range(epochs):
# Set all initial hidden and cell states to zeroes
train_states=init_states(LSTM, GRU, num_layers, batch_size, hidden_size)
test_states=init_states(LSTM, GRU, num_layers, batch_size, hidden_size)
valid_states=init_states(LSTM, GRU, num_layers, batch_size, hidden_size)
# RUN TRAINING SET #
model.train()
for i in range(0, ids.size(1) - seq_length, seq_length):
# Set train_inputs and train_targets
train_inputs = ids[:, i:i+seq_length].to(device)
train_targets = ids[:, (i+1):(i+1)+seq_length].to(device)
# Forward pass
model.zero_grad()
if(LSTM==True):
train_states = [state.detach() for state in train_states] # Detach the hidden state from how it was previously produced
if(GRU==True):
train_states = train_states.data #detach?
train_outputs, train_states = model(train_inputs, train_states)
train_loss = criterion(train_outputs, train_targets.reshape(-1))
# Backward and optimize
train_loss.backward()
clip_grad_norm_(model.parameters(), clip)
optimizer.step()
lr_scheduler.step()
model.eval()
with torch.no_grad():
#test and validation, removed to reduce length
model.train() # reset to train mode after iterating through validation data
train_perplexity.append(math.exp(train_loss.item()))
test_perplexity.append(np.exp(np.mean(test_losses)))
valid_perplexity.append(np.exp(np.mean(valid_losses)))
print('Epoch ' + str(e+1) + '/' + str(epochs) + ': ')
print('Train Perplexity - ' + str(train_perplexity[e]))
print('Test Perplexity - ' + str(test_perplexity[e]))
print('Validation Perplexity - ' + str(valid_perplexity[e]))
print("----------------------------------------------------")
return train_perplexity, test_perplexity, valid_perplexity
Hidden state initialization:
def init_states(LSTM, GRU, num_layers=num_layers, batch_size=batch_size, hidden_size=hidden_size):
if (LSTM==True):
return (torch.FloatTensor(num_layers, batch_size, hidden_size).uniform_(r1, r2).to(device),
torch.FloatTensor(num_layers, batch_size, hidden_size).uniform_(r1, r2).to(device))
if (GRU==True):
return torch.FloatTensor(num_layers, batch_size, hidden_size).uniform_(r1, r2).to(device)

Updating specific rows of a tensor matrix during gradient updation?

I have been trying to implement the paper: SeER: An Explainable Deep Learning MIDI-based Hybrid Song Recommender System.
So, what I have been doing is this:
Model Code:
class HybridFactorization(tf.keras.layers.Layer):
# embedding_size is also the number of lstm units
# num_users, num_movies = input_shape
# required_users: (batch_size, embedding_size)
# songs_output: (batch_size, embedding_size)
def __init__(self, embedding_size, num_users, num_tracks):
super(HybridFactorization, self).__init__()
self.embedding_size = embedding_size
self.num_users = num_users
self.num_tracks = num_tracks
self.required_users = None
self.U = self.add_weight("U",
shape=[self.num_users, self.embedding_size],
dtype=tf.float32,
initializer=tf.initializers.GlorotUniform)
self.lstm = tf.keras.layers.LSTM(self.embedding_size)
def call(self, user_index, songs_batch):
output_lstm = self.lstm(songs_batch)
self.required_users = self.U.numpy()
self.required_users = tf.convert_to_tensor(self.required_users[np.array(user_index)],
dtype=tf.float32)
return tf.matmul(self.required_users, output_lstm, transpose_b=True)
class HybridRecommender(tf.keras.Model):
def __init__(self, embedding_size, num_users, num_tracks):
super(HybridRecommender, self).__init__()
self.HybridFactorization = HybridFactorization(embedding_size,
num_users, num_tracks)
def call(self, user_index, songs_batch):
output = self.HybridFactorization(user_index, songs_batch)
return output
Utility Functions and running the model:
def loss_fn(source, target):
mse = tf.keras.losses.MeanSquaredError()
return mse(source, target)
model = HybridRecommender(EMBEDDING_SIZE, num_users, num_tracks)
Xhat = model(user_index, songs_batch)
tf.keras.backend.clear_session()
optimizer = tf.keras.optimizers.Adam()
EPOCHS = 1
for epoch in range(EPOCHS):
start = time.time()
total_loss = 0
for (batch, (input_batch, target_batch)) in enumerate(train_dataset):
songs_batch = create_songs_batch(input_batch)
user_index = input_batch[:, 0].numpy()
X = create_pivot_batch(input_batch, target_batch)
with tf.GradientTape() as tape:
Xhat = model(user_index, songs_batch)
batch_loss = loss_fn(X, Xhat)
variables = model.trainable_variables
gradients = tape.gradient(batch_loss, variables)
optimizer.apply_gradients(zip(gradients, variables))
total_loss += batch_loss
Now, various functions like create_songs_batch(input_batch) and create_pivot_batch(input_batch, target_batch) just provide data in the required format.
My model runs but I get the warning:
WARNING:tensorflow:Gradients do not exist for variables ['U:0'] when minimizing the loss.
Now, I can see why variable U is not being updated as there is no direct path to it.
I want to update some specific rows of U which are mentioned in user_index in every batch call.
Is there a way to do it?

So, I was able to solve the problem by rather than copying some rows of U and trying to solve it. Instead, I used a temporary matrix that is one hot encoded form of user_index and multiplied it with U to desired results and it also removed the results.
Part of code that needs to be modified:
def call(self, user_index, songs_batch):
# output_lstm: (batch_size, emb_sz)
# batch_encoding: (batch_size, num_users)
# required_users: (batch_size, emb_sz)
output_lstm = self.lstm(songs_batch)
user_idx = np.array(user_index)
batch_encoding = np.zeros((user_idx.size, self.num_users))
batch_encoding[np.arange(user_idx.size), user_idx] = 1
batch_encoding = tf.convert_to_tensor(batch_encoding, dtype=tf.float32)
self.required_users = tf.matmul(batch_encoding, self.U)
return tf.matmul(self.required_users, output_lstm, transpose_b=True)

pytorch, Using nn.DataParallel in LSTM

/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1266: UserWarning: RNN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
Hello. I am using pytorch.
I am trying to use DataParallel function in pytorch,
but the model is LSTM. I'm warned to flatten the model again,
but I don't know when and where to flatten.
Can you let me know?
This is my model
import torch.nn as nn
from torchvision import models
class ConvLstm(nn.Module):
def __init__(self, latent_dim, model, hidden_size, lstm_layers, bidirectional, n_class):
super(ConvLstm, self).__init__()
self.conv_model = Pretrained_conv(latent_dim, model)
self.Lstm = Lstm(latent_dim, hidden_size, lstm_layers, bidirectional)
self.output_layer = nn.Sequential(
nn.Linear(2 * hidden_size if bidirectional ==
True else hidden_size, n_class),
nn.Softmax(dim=-1)
)
def forward(self, x):
batch_size, timesteps, channel_x, h_x, w_x = x.shape
conv_input = x.view(batch_size * timesteps, channel_x, h_x, w_x)
conv_output = self.conv_model(conv_input)
lstm_input = conv_output.view(batch_size, timesteps, -1)
lstm_output = self.Lstm(lstm_input)
lstm_output = lstm_output[:, -1, :]
output = self.output_layer(lstm_output)
return output
class Pretrained_conv(nn.Module):
def __init__(self, latent_dim, model):
if model == 'resnet152':
super(Pretrained_conv, self).__init__()
self.conv_model = models.resnet152(pretrained=True)
# ====== freezing all of the layers ======
for param in self.conv_model.parameters():
param.requires_grad = False
# ====== changing the last FC layer to an output with the size we need. this layer is un freezed ======
self.conv_model.fc = nn.Linear(
self.conv_model.fc.in_features, latent_dim)
def forward(self, x):
return self.conv_model(x)
class Lstm(nn.Module):
def __init__(self, latent_dim, hidden_size, lstm_layers, bidirectional):
super(Lstm, self).__init__()
self.Lstm = nn.LSTM(latent_dim, hidden_size=hidden_size,
num_layers=lstm_layers, batch_first=True, bidirectional=bidirectional)
self.hidden_state = None
def reset_hidden_state(self):
self.hidden_state = None
def forward(self, x):
output, self.hidden_state = self.Lstm(x, self.hidden_state)
return output
Enter LSTM and execute the following code.
def foward_step(model, images, labels, criterion, mode=''):
model.module.Lstm.reset_hidden_state()
if mode == 'test':
with torch.no_grad():
output = model(images)
else:
output = model(images)
loss = criterion(output, labels)
# Accuracy calculation
predicted_labels = output.detach().argmax(dim=1)
acc = (predicted_labels == labels).cpu().numpy().sum()
return loss, acc, predicted_labels.cpu()
This is main
model = nn.DataParallel(model, device_ids=[0,1,2,3]).cuda()

Restoring TF Eager model without using training code

I am training (and saving) a very simple model in eager mode as follows:
import os
import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()
NUM_EXAMPLES = 2000
training_inputs = tf.random_normal([NUM_EXAMPLES])
noise = tf.random_normal([NUM_EXAMPLES])
outputs = training_inputs * 3 + 2 + noise
class Model(tf.keras.Model):
def __init__(self):
super(Model, self).__init__()
self.W = tfe.Variable(5., name="weight")
self.b = tfe.Variable(0., name="bias")
def predict(self, input):
return self.W * input + self.b
def loss(model, inputs, outputs):
error = model.predict(inputs) - outputs
return tf.reduce_mean(tf.square(error))
def grad(model, inputs, outputs):
with tf.GradientTape() as tape:
loss_value = loss(model, inputs, outputs)
return tape.gradient(loss_value, [model.W, model.b])
if __name__ == "__main__":
model = Model()
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
for i in range(300):
gradients = grad(model, training_inputs, outputs)
optimizer.apply_gradients(zip(gradients, [model.W, model.b]),
global_step=tf.train.get_or_create_global_step())
checkpoint_dir = './checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
root = tfe.Checkpoint(optimizer=optimizer,
model=model,
optimizer_step=tf.train.get_or_create_global_step())
root.save(file_prefix=checkpoint_prefix)
The only ways I found to save/restore (with Checkpoint or Saver) imply having access to the Model class to load it elsewhere, for instance:
model = Model()
checkpointer = tfe.Checkpoint(model=model)
checkpointer.restore(tf.train.latest_checkpoint('checkpoints/'))
print(model.predict(7))
The save method from tf.keras.Model doesn't seem to be implemented yet for Eager mode:
model.save("keras_model")
>>> NotImplementedError
Is there another way to save and load the model without having to instantiate a new Model object?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

tensorflow autodiff slower than pytorch's counterpart - python

def init(self,...): ... self.model.call = tf.function(self.model.call) ... you need use tf.function to wrap your model's call function.

Related

How to summarize pytorch model

Why does my function get good values for LSTM but not for GRU?

Updating specific rows of a tensor matrix during gradient updation?

pytorch, Using nn.DataParallel in LSTM

Restoring TF Eager model without using training code

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

tensorflow autodiff slower than pytorch's counterpart - python

def __init__(self,...): ... self.model.call = tf.function(self.model.call) ... you need use tf.function to wrap your model's call function.

Related

How to summarize pytorch model

Why does my function get good values for LSTM but not for GRU?

Updating specific rows of a tensor matrix during gradient updation?

pytorch, Using nn.DataParallel in LSTM

Restoring TF Eager model without using training code

Categories

Resources

def init(self,...): ... self.model.call = tf.function(self.model.call) ... you need use tf.function to wrap your model's call function.