Pytorch: weights not updating - python

I'm trying to train a model, and it doesn't work because weights aren't updating when I call the following:
self.optimizer = Adam(self.PPO.parameters(), lr=0.1, eps=epsilon)
total_loss = Variable(policy_loss + 0.5*value_loss - entropy_loss.mean() * 0.01, requires_grad=True)
self.optimizer.zero_grad()
(total_loss * 10).backward()
self.optimizer.step()
When I print the weights before and after an update, they're all the same (the loss isn't zero, and the learning rate is set to 0.1), and when I compare them (even with clone() called on each param) the comparison always returns True. total_loss does have a grad_fn attribute... The optimizer is created in the constructor of my agent class.
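For reference, a minimal way to verify whether an optimizer step actually changes the parameters; this is only an illustrative check reusing the names from the snippet above (self.PPO, total_loss, self.optimizer), not code from the repository:
import torch

# Snapshot the parameters, run one update, then compare against the snapshot.
before = [p.detach().clone() for p in self.PPO.parameters()]
self.optimizer.zero_grad()
(total_loss * 10).backward()
self.optimizer.step()
changed = any(not torch.equal(b, p.detach()) for b, p in zip(before, self.PPO.parameters()))
print("parameters changed:", changed)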
My code is based on this repository:
https://github.com/andreiliphd/tennis-ppo/blob/master/agent.py
This is my agent constructor:
def __init__(self, PPO, learning_rate, epsilon, discount_rate, entropy_coefficient, ppo_clip, gradient_clip,
             rollout_length, tau):
    self.PPO = PPO
    self.learning_rate = learning_rate
    self.epsilon = epsilon
    self.discount_rate = discount_rate
    self.entropy_coefficient = entropy_coefficient
    self.ppo_clip = 0.2
    self.gradient_clip = 5
    self.rollout_length = rollout_length
    self.tau = tau
    self.optimizer = Adam(self.PPO.actor.parameters(), lr=0.1, eps=epsilon)
    self.device = torch.device('cpu')
This is my PPO class, which creates two networks, each with a forward function and some hidden layers:
class PPO(nn.Module):
    def __init__(self, state_shape, action_num, mlp_layers, device=torch.device('cpu')):
        super(PPO, self).__init__()
        self.state_shape = state_shape
        self.action_num = action_num
        self.mlp_layers = mlp_layers
        self.device = torch.device('cpu')
        layer_dims = [np.prod(self.state_shape)] + self.mlp_layers
        self.actor = PPO_Network(state_shape, action_num, layer_dims, True)
        self.actor = self.actor.to(device)
        self.critic = PPO_Network(state_shape, 1, layer_dims, False)
        self.critic = self.critic.to(device)
        self.to(device)
Any indication of why this is happening, or of what I am overlooking, is very welcome. :)
I can give more info or code if needed.

I fixed this :)
It was a silly mistake: I was converting one of the values returned by my networks to NumPy and then converting it back to a tensor, which detaches it from the computation graph. It took me a while to notice because of some messy code.
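For anyone hitting the same symptom, here is a minimal sketch of that pitfall (illustrative only, not the original code): round-tripping a tensor through NumPy produces a new tensor with no grad_fn, so gradients can no longer flow back to the network.
import torch

# A tiny parameter and two versions of a "loss": one kept in the autograd graph,
# one round-tripped through NumPy (which requires detaching first) and back.
w = torch.nn.Parameter(torch.tensor([1.0]))
kept = (w * 2).sum()
round_tripped = torch.tensor((w * 2).detach().numpy()).sum()

print(kept.grad_fn is not None)           # True  -> backward() reaches w
print(round_tripped.grad_fn is not None)  # False -> backward() cannot update w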

Related

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)

I am trying to get to grips with PyTorch and I wanted to try to reproduce this code:
https://github.com/andy-psai/MountainCar_ActorCritic/blob/master/RL%20Blog%20FINAL%20MEDIUM%20code%2002_12_19.ipynb
in Pytorch.
The problem is that this error is being returned:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
A similar question said to use zero_grad() again after the optimizer step, but this hasn't resolved the issue.
I've included the entire code below, so hopefully it should be reproducible.
Any advice would be much appreciated.
import gym
import os
import os.path as osp
import time
import math
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal
env = gym.envs.make("MountainCarContinuous-v0")
# Value function
class Value(nn.Module):
    def __init__(self, dim_states):
        super(Value, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_states, 400),
            nn.ReLU(),
            nn.Linear(400, 400),
            nn.ReLU(),
            nn.Linear(400, 1)
        )
        self.optimizer = optim.Adam(self.parameters(), lr=1e-3)
        self.criterion = nn.MSELoss()

    def forward(self, state):
        return self.net(torch.from_numpy(state).float())

    def compute_return(self, output, target):
        self.optimizer.zero_grad()
        loss = self.criterion(output, target)
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
# Policy network
class Policy(nn.Module):
    def __init__(self, dim_states, env):
        super(Policy, self).__init__()
        self.hidden1 = nn.Linear(dim_states, 40)
        self.hidden2 = nn.Linear(40, 40)
        self.mu = nn.Linear(40, 1)
        self.sigma = nn.Linear(40, 1)
        self.env = env
        self.optimizer = optim.Adam(self.parameters(), lr=2e-5)

    def forward(self, state):
        state = torch.from_numpy(state).float()
        x = F.relu(self.hidden1(state))
        x = F.relu(self.hidden2(x))
        mu = self.mu(x)
        sigma = F.softmax(self.sigma(x), dim=-1)
        action_dist = Normal(mu, sigma)
        action_var = action_dist.rsample()
        action_var = torch.clip(action_var,
                                self.env.action_space.low[0],
                                self.env.action_space.high[0])
        return action_var, action_dist

    def compute_return(self, action, dist, td_error):
        self.optimizer.zero_grad()
        loss_actor = -dist.log_prob(action) * td_error
        loss_actor.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
# Normalise the state space
import sklearn
import sklearn.preprocessing
state_space_samples = np.array(
[env.observation_space.sample() for x in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(state_space_samples)
# Normaliser
def scale_state(state):
scaled = scaler.transform([state])
return scaled
##################################
# Parameters
lr_actor = 0.00002
lr_critic = 0.001
actor = Policy(2, env)
critic = Value(2)
# Training loop params
gamma = 0.99
num_episodes = 300
episode_history = []
for episode in range(num_episodes):
    # Receive initial state from E
    state = env.reset()
    reward_total = 0
    steps = 0
    done = False
    while not done:
        action, dist = actor(state)
        # print(np.squeeze(action))
        next_state, reward, done, _ = env.step(
            np.array([action.item()]))
        if episode % 50 == 0:
            env.render()
        steps += 1
        reward_total += reward
        # TD Target
        target = reward + gamma * np.squeeze(critic(next_state), axis=0)
        td_error = target - np.squeeze(critic(state), axis=0)
        # Update actor
        actor.compute_return(action, dist, td_error)
        # Update critic
        critic.compute_return(np.squeeze(critic(state), axis=0), target)
    episode_history.append(reward_total)
    print(f"Episode: {episode}, N Steps: {steps}, Cumulative reward {reward_total}")
    if np.mean(episode_history[-100:]) > 90 and len(episode_history) > 101:
        print("Solved")
        print(f"Mean cumulative reward over 100 episodes {np.mean(episode_history[-100:])}")
The problem lies in this snippet. When you create the target variable, there is a forward pass through the critic, which builds a computation graph; critic(next_state) is a node of that graph, so target becomes part of the graph as well (you can check this by printing target, which will show grad_fn=<AddBackward0>). The actor update then calls backward() through that graph, which frees its saved intermediate values. So when you finally call critic.compute_return(critic_out, target), passing target (which is still part of the previous, already-freed computation graph) causes the runtime error.
The solution is to call detach() on critic(next_state); target then no longer holds a reference to the old computation graph (again, check by printing target).
target = reward + gamma * np.squeeze(critic(next_state).detach(), axis=0)
td_error = target - np.squeeze(critic(state), axis=0)
# Update actor
actor.compute_return(action, dist, td_error)
# Update critic
critic_out = np.squeeze(critic(state), axis=0)
print(critic_out)
critic.compute_return(critic_out, target)

Pytorch Runtime Error about inplace operation

I am developing a policy gradient NN with PyTorch (version 1.10.1) and I am getting the following runtime error:
RuntimeError: one of the variables needed for gradient computation has been modified by an in-place operation: [torch.FloatTensor [1, 15]] is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I have read some similar discussions: people suggest avoiding operations like a += 1, some suggest downgrading PyTorch, and others suggest using clone() instead of modifying the tensor in place. I have tried them all, but I still get this error.
The error does not show up every time in the update() function; sometimes it works fine. Why does this happen?
My related code is below. There are some odd variable names such as pre_R; I used them to avoid in-place operations such as a = a + 1.
NN class:
class NN(nn.Module):
    """
    Feel free to change the architecture for different tasks!
    """
    def __init__(self, env):
        super(NN, self).__init__()
        # 15 in this case
        self.state_size = 15
        # 31 (1 and -1 for each task-server and void) (m*n*2 + 1)
        self.action_size = 31
        self.linear1 = nn.Linear(self.state_size, 128)
        self.linear2 = nn.Linear(128, 256)
        self.linear3 = nn.Linear(256, self.action_size)

    def forward(self, state):
        output1 = F.relu(self.linear1(state))
        output2 = F.relu(self.linear2(output1.clone()))
        output3 = self.linear3(output2.clone())
        # Note the conversion to Pytorch distribution.
        distribution = Categorical(F.softmax(output3.clone(), dim=-1))
        return distribution
Reinforcement Learning part related code:
class Agent():
    def __init__(self, env, lr, gamma):
        self.env = env
        self.NN = NN(env)
        self.lr = lr
        self.optim_NN = optim.Adam(self.NN.parameters(), lr=self.lr)
        self.gamma = gamma

    def update(self, log_probs, returns):
        with torch.autograd.set_detect_anomaly(True):
            print("updating")
            baselines = self.compute_baselines(returns.clone())
            loss = self.compute_loss(log_probs.clone(), returns, baselines)
            self.optim_NN.zero_grad()
            loss.backward()
            self.optim_NN.step()

    def compute_returns(self, rewards):
        R = 0
        returns = []
        for r in rewards[::-1]:
            pre_R = R
            R = r + self.gamma * pre_R
            returns.insert(0, R)
        returns = torch.tensor(returns)
        return returns

    def compute_baselines(self, returns):
        baselines = []
        baselines.append(returns[0])
        for v in returns:
            t = len(baselines)
            b = (baselines[t - 1] * t + v) / (t + 1)
            baselines.append(b)
        return baselines

    def compute_loss(self, log_probs, returns, baselines):
        with torch.autograd.set_detect_anomaly(True):
            loss = 0
            for i in range(0, len(returns)):
                l = log_probs[i].clone()
                r = returns[i].clone()
                b = baselines[i].clone()
                pre_loss = loss
                loss = pre_loss + (-l * (r - b))
                # losses.append(loss)
                # losses.append(-log_probs[i].clone()*(returns[i].clone()-baselines[i].clone()))
            policy_loss = loss
            return policy_loss
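For context, here is a minimal standalone sketch (not the code above) that reproduces this class of error: a tensor that the backward pass still needs gets modified in place between the forward pass and backward().
import torch

# The multiply saves x for the backward pass; editing x in place afterwards
# bumps its version counter, and backward() then raises the same kind of error.
w = torch.nn.Parameter(torch.ones(3))
x = torch.ones(3)
y = (w * x).sum()
x.add_(1)       # in-place modification of a tensor saved for backward
y.backward()    # RuntimeError: ... has been modified by an inplace operation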

DQN Pytorch Loss keeps increasing

I am implementing a simple DQN algorithm using PyTorch to solve the CartPole environment from gym. I have been debugging for a while now, and I can't figure out why the model is not learning.
Observations:
using SmoothL1Loss performs worse than MSELoss, but the loss increases for both
a smaller LR in Adam does not work; I have tested 0.0001, 0.00025, 0.0005 and the default
Notes:
I have debugged various parts of the algorithm individually, and can say with good confidence that the issue is in the learn function. I am wondering if this bug is due to me misunderstanding detach in PyTorch, or some other framework mistake I'm making.
I am trying to stick as close to the original paper as possible (linked above)
References:
example: GitHub gist
example: PyTorch official
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import gym
import numpy as np
class ReplayBuffer:
def __init__(self, mem_size, input_shape, output_shape):
self.mem_counter = 0
self.mem_size = mem_size
self.input_shape = input_shape
self.actions = np.zeros(mem_size)
self.states = np.zeros((mem_size, *input_shape))
self.states_ = np.zeros((mem_size, *input_shape))
self.rewards = np.zeros(mem_size)
self.terminals = np.zeros(mem_size)
def sample(self, batch_size):
indices = np.random.choice(self.mem_size, batch_size)
return self.actions[indices], self.states[indices], \
self.states_[indices], self.rewards[indices], \
self.terminals[indices]
def store(self, action, state, state_, reward, terminal):
index = self.mem_counter % self.mem_size
self.actions[index] = action
self.states[index] = state
self.states_[index] = state_
self.rewards[index] = reward
self.terminals[index] = terminal
self.mem_counter += 1
class DeepQN(nn.Module):
def __init__(self, input_shape, output_shape, hidden_layer_dims):
super(DeepQN, self).__init__()
self.input_shape = input_shape
self.output_shape = output_shape
layers = []
layers.append(nn.Linear(*input_shape, hidden_layer_dims[0]))
for index, dim in enumerate(hidden_layer_dims[1:]):
layers.append(nn.Linear(hidden_layer_dims[index], dim))
layers.append(nn.Linear(hidden_layer_dims[-1], *output_shape))
self.layers = nn.ModuleList(layers)
self.loss = nn.MSELoss()
self.optimizer = T.optim.Adam(self.parameters())
def forward(self, states):
for layer in self.layers[:-1]:
states = F.relu(layer(states))
return self.layers[-1](states)
def learn(self, predictions, targets):
self.optimizer.zero_grad()
loss = self.loss(input=predictions, target=targets)
loss.backward()
self.optimizer.step()
return loss
class Agent:
def __init__(self, epsilon, gamma, input_shape, output_shape):
self.input_shape = input_shape
self.output_shape = output_shape
self.epsilon = epsilon
self.gamma = gamma
self.q_eval = DeepQN(input_shape, output_shape, [64])
self.memory = ReplayBuffer(10000, input_shape, output_shape)
self.batch_size = 32
self.learn_step = 0
def move(self, state):
if np.random.random() < self.epsilon:
return np.random.choice(*self.output_shape)
else:
self.q_eval.eval()
state = T.tensor([state]).float()
action = self.q_eval(state).max(axis=1)[1]
return action.item()
def sample(self):
actions, states, states_, rewards, terminals = \
self.memory.sample(self.batch_size)
actions = T.tensor(actions).long()
states = T.tensor(states).float()
states_ = T.tensor(states_).float()
rewards = T.tensor(rewards).view(self.batch_size).float()
terminals = T.tensor(terminals).view(self.batch_size).long()
return actions, states, states_, rewards, terminals
def learn(self, state, action, state_, reward, done):
self.memory.store(action, state, state_, reward, done)
if self.memory.mem_counter < self.batch_size:
return
self.q_eval.train()
self.learn_step += 1
actions, states, states_, rewards, terminals = self.sample()
indices = np.arange(self.batch_size)
q_eval = self.q_eval(states)[indices, actions]
q_next = self.q_eval(states_).detach()
q_target = rewards + self.gamma * q_next.max(axis=1)[0] * (1 - terminals)
loss = self.q_eval.learn(q_eval, q_target)
self.epsilon *= 0.9 if self.epsilon > 0.1 else 1.0
return loss.item()
def learn(env, agent, episodes=500):
print('Episode: Mean Reward: Last Loss: Mean Step')
rewards = []
losses = [0]
steps = []
num_episodes = episodes
for episode in range(num_episodes):
done = False
state = env.reset()
total_reward = 0
n_steps = 0
while not done:
action = agent.move(state)
state_, reward, done, _ = env.step(action)
loss = agent.learn(state, action, state_, reward, done)
state = state_
total_reward += reward
n_steps += 1
if loss:
losses.append(loss)
rewards.append(total_reward)
steps.append(n_steps)
if episode % (episodes // 10) == 0 and episode != 0:
print(f'{episode:5d} : {np.mean(rewards):5.2f} '
f': {np.mean(losses):5.2f}: {np.mean(steps):5.2f}')
rewards = []
losses = [0]
steps = []
print(f'{episode:5d} : {np.mean(rewards):5.2f} '
f': {np.mean(losses):5.2f}: {np.mean(steps):5.2f}')
return losses, rewards
if __name__ == '__main__':
env = gym.make('CartPole-v1')
agent = Agent(1.0, 1.0,
env.observation_space.shape,
[env.action_space.n])
learn(env, agent, 500)
The main problem, I think, is the discount factor, gamma. You are setting it to 1.0, which means that you are giving the same weight to future rewards as to the current one. Usually in reinforcement learning we care more about the immediate reward than about future ones, so gamma should always be less than 1.
Just to give it a try, I set gamma = 0.99 and ran your code:
Episode: Mean Reward: Last Loss: Mean Step
100 : 34.80 : 0.34: 34.80
200 : 40.42 : 0.63: 40.42
300 : 65.58 : 1.78: 65.58
400 : 212.06 : 9.84: 212.06
500 : 407.79 : 19.49: 407.79
As you can see, the loss still increases (even if not as much as before), but so does the reward. You should keep in mind that the loss here is not a good metric for performance, because you have a moving target. You can reduce the instability of the target by using a target network. With additional parameter tuning and a target network, one could probably make the loss even more stable.
Also, note that in reinforcement learning the loss value is generally not as important as it is in supervised learning: a decrease in loss does not always imply an improvement in performance, and vice versa.
The problem is that the Q target is moving while the training steps happen; as the agent plays, predicting the correct sum of rewards gets extremely hard (e.g. more explored states and rewards mean higher reward variance), so the loss increases. This is even clearer in more complex environments (more states, more varied rewards, etc.).
At the same time, the Q network is getting better at approximating the Q-values for each action, so the rewards (could) increase.
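As a rough sketch of the target-network idea mentioned above, here is one possible way to bolt it onto the question's Agent class (the TargetAgent name and sync_every parameter are illustrative additions; the DeepQN, ReplayBuffer and Agent classes from the question are assumed):
import numpy as np

class TargetAgent(Agent):
    def __init__(self, epsilon, gamma, input_shape, output_shape, sync_every=500):
        super().__init__(epsilon, gamma, input_shape, output_shape)
        self.q_target = DeepQN(input_shape, output_shape, [64])
        self.q_target.load_state_dict(self.q_eval.state_dict())
        self.sync_every = sync_every

    def learn(self, state, action, state_, reward, done):
        self.memory.store(action, state, state_, reward, done)
        if self.memory.mem_counter < self.batch_size:
            return
        self.q_eval.train()
        self.learn_step += 1
        actions, states, states_, rewards, terminals = self.sample()
        indices = np.arange(self.batch_size)
        q_eval = self.q_eval(states)[indices, actions]
        # Bootstrap from the slowly-updated target network instead of q_eval.
        q_next = self.q_target(states_).detach()
        q_target = rewards + self.gamma * q_next.max(axis=1)[0] * (1 - terminals)
        loss = self.q_eval.learn(q_eval, q_target)
        self.epsilon *= 0.9 if self.epsilon > 0.1 else 1.0
        # Periodically copy the online weights into the target network.
        if self.learn_step % self.sync_every == 0:
            self.q_target.load_state_dict(self.q_eval.state_dict())
        return loss.item()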

Soft actor critic with discrete action space

I'm trying to implement the soft actor-critic algorithm for a discrete action space and I have trouble with the loss function.
Here is the link to SAC with continuous action spaces:
https://spinningup.openai.com/en/latest/algorithms/sac.html
I do not know what I'm doing wrong.
The problem is that the network does not learn anything on the CartPole environment.
The full code on github: https://github.com/tk2232/sac_discrete/blob/master/sac_discrete.py
Here is my idea of how to calculate the loss for discrete actions.
Value Network
class ValueNet:
def __init__(self, sess, state_size, hidden_dim, name):
self.sess = sess
with tf.variable_scope(name):
self.states = tf.placeholder(dtype=tf.float32, shape=[None, state_size], name='value_states')
self.targets = tf.placeholder(dtype=tf.float32, shape=[None, 1], name='value_targets')
x = Dense(units=hidden_dim, activation='relu')(self.states)
x = Dense(units=hidden_dim, activation='relu')(x)
self.values = Dense(units=1, activation=None)(x)
optimizer = tf.train.AdamOptimizer(0.001)
loss = 0.5 * tf.reduce_mean((self.values - tf.stop_gradient(self.targets)) ** 2)
self.train_op = optimizer.minimize(loss, var_list=_params(name))
def get_value(self, s):
return self.sess.run(self.values, feed_dict={self.states: s})
def update(self, s, targets):
self.sess.run(self.train_op, feed_dict={self.states: s, self.targets: targets})
In the Q network I gather the values corresponding to the collected actions.
Example
q_out = [[0.5533, 0.4444], [0.2222, 0.6666]]
collected_actions = [0, 1]
gather = [[0.5533], [0.6666]]
gather function
def gather_tensor(params, idx):
    idx = tf.stack([tf.range(tf.shape(idx)[0]), idx[:, 0]], axis=-1)
    params = tf.gather_nd(params, idx)
    return params
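For comparison, the same per-row selection can be written in PyTorch with torch.gather (illustrative only; the question's code is TensorFlow):
import torch

# Pick, for each row, the Q-value of the action that was actually taken.
q_out = torch.tensor([[0.5533, 0.4444], [0.2222, 0.6666]])
collected_actions = torch.tensor([[0], [1]])   # shape (batch, 1)
gathered = q_out.gather(1, collected_actions)  # tensor([[0.5533], [0.6666]])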
Q Network
class SoftQNetwork:
def __init__(self, sess, state_size, action_size, hidden_dim, name):
self.sess = sess
with tf.variable_scope(name):
self.states = tf.placeholder(dtype=tf.float32, shape=[None, state_size], name='q_states')
self.targets = tf.placeholder(dtype=tf.float32, shape=[None, 1], name='q_targets')
self.actions = tf.placeholder(dtype=tf.int32, shape=[None, 1], name='q_actions')
x = Dense(units=hidden_dim, activation='relu')(self.states)
x = Dense(units=hidden_dim, activation='relu')(x)
x = Dense(units=action_size, activation=None)(x)
self.q = tf.reshape(gather_tensor(x, self.actions), shape=(-1, 1))
optimizer = tf.train.AdamOptimizer(0.001)
loss = 0.5 * tf.reduce_mean((self.q - tf.stop_gradient(self.targets)) ** 2)
self.train_op = optimizer.minimize(loss, var_list=_params(name))
def update(self, s, a, target):
self.sess.run(self.train_op, feed_dict={self.states: s, self.actions: a, self.targets: target})
def get_q(self, s, a):
return self.sess.run(self.q, feed_dict={self.states: s, self.actions: a})
Policy Net
class PolicyNet:
def __init__(self, sess, state_size, action_size, hidden_dim):
self.sess = sess
with tf.variable_scope('policy_net'):
self.states = tf.placeholder(dtype=tf.float32, shape=[None, state_size], name='policy_states')
self.targets = tf.placeholder(dtype=tf.float32, shape=[None, 1], name='policy_targets')
self.actions = tf.placeholder(dtype=tf.int32, shape=[None, 1], name='policy_actions')
x = Dense(units=hidden_dim, activation='relu')(self.states)
x = Dense(units=hidden_dim, activation='relu')(x)
self.logits = Dense(units=action_size, activation=None)(x)
dist = Categorical(logits=self.logits)
optimizer = tf.train.AdamOptimizer(0.001)
# Get action
self.new_action = dist.sample()
self.new_log_prob = dist.log_prob(self.new_action)
# Calc loss
log_prob = dist.log_prob(tf.squeeze(self.actions))
loss = tf.reduce_mean(tf.squeeze(self.targets) - 0.2 * log_prob)
self.train_op = optimizer.minimize(loss, var_list=_params('policy_net'))
def get_action(self, s):
action = self.sess.run(self.new_action, feed_dict={self.states: s[np.newaxis, :]})
return action[0]
def get_next_action(self, s):
next_action, next_log_prob = self.sess.run([self.new_action, self.new_log_prob], feed_dict={self.states: s})
return next_action.reshape((-1, 1)), next_log_prob.reshape((-1, 1))
def update(self, s, a, target):
self.sess.run(self.train_op, feed_dict={self.states: s, self.actions: a, self.targets: target})
Update function
def soft_q_update(batch_size, frame_idx):
    gamma = 0.99
    alpha = 0.2
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)
    action = action.reshape((-1, 1))
    reward = reward.reshape((-1, 1))
    done = done.reshape((-1, 1))
    # Q_target
    v_ = value_net_target.get_value(next_state)
    q_target = reward + (1 - done) * gamma * v_
    # V_target
    next_action, next_log_prob = policy_net.get_next_action(state)
    q1 = soft_q_net_1.get_q(state, next_action)
    q2 = soft_q_net_2.get_q(state, next_action)
    q = np.minimum(q1, q2)
    v_target = q - alpha * next_log_prob
    # Policy_target
    q1 = soft_q_net_1.get_q(state, action)
    q2 = soft_q_net_2.get_q(state, action)
    policy_target = np.minimum(q1, q2)
Since the algorithm is generic to both discrete and continuous policies, the key idea is that we need a discrete distribution that is reparametrizable. The extension then involves minimal code modification from the continuous SAC: just change the policy distribution class.
There is one such distribution, the Gumbel-Softmax distribution. PyTorch does not have it built in, so I simply extend it from a close cousin that has the right rsample() and add a correct log-prob calculation method. With the ability to calculate a reparametrized action and its log prob, SAC works beautifully for discrete actions with minimal extra code, as seen below.
def calc_log_prob_action(self, action_pd, reparam=False):
    '''Calculate log_probs and actions with option to reparametrize from paper eq. 11'''
    samples = action_pd.rsample() if reparam else action_pd.sample()
    if self.body.is_discrete:  # this is straightforward using GumbelSoftmax
        actions = samples
        log_probs = action_pd.log_prob(actions)
    else:
        mus = samples
        actions = self.scale_action(torch.tanh(mus))
        # paper Appendix C. Enforcing Action Bounds for continuous actions
        log_probs = (action_pd.log_prob(mus) - torch.log(1 - actions.pow(2) + 1e-6).sum(1))
    return log_probs, actions
# ... for discrete action, GumbelSoftmax distribution
class GumbelSoftmax(distributions.RelaxedOneHotCategorical):
    '''
    A differentiable Categorical distribution using the reparametrization trick with Gumbel-Softmax.
    Explanation: http://amid.fish/assets/gumbel.html
    NOTE: use this in place of PyTorch's RelaxedOneHotCategorical distribution since its log_prob is not working right (returns positive values)
    Papers:
    [1] The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables (Maddison et al, 2017)
    [2] Categorical Reparametrization with Gumbel-Softmax (Jang et al, 2017)
    '''

    def sample(self, sample_shape=torch.Size()):
        '''Gumbel-softmax sampling. Note rsample is inherited from RelaxedOneHotCategorical'''
        u = torch.empty(self.logits.size(), device=self.logits.device, dtype=self.logits.dtype).uniform_(0, 1)
        noisy_logits = self.logits - torch.log(-torch.log(u))
        return torch.argmax(noisy_logits, dim=-1)

    def log_prob(self, value):
        '''value is one-hot or relaxed'''
        if value.shape != self.logits.shape:
            value = F.one_hot(value.long(), self.logits.shape[-1]).float()
        assert value.shape == self.logits.shape
        return - torch.sum(- value * F.log_softmax(self.logits, -1), -1)
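A quick usage sketch for the GumbelSoftmax class above (the temperature value is arbitrary; assumes torch and the class are in scope):
import torch

logits = torch.randn(4, 3)
dist = GumbelSoftmax(temperature=torch.tensor(1.0), logits=logits)
hard_actions = dist.sample()         # integer action indices, shape (4,)
relaxed = dist.rsample()             # differentiable relaxed one-hot samples, shape (4, 3)
log_p = dist.log_prob(hard_actions)  # log-probs via the one-hot conversion above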
And here are the LunarLander results: SAC learns to solve it really fast.
The full implementation code is in SLM Lab at https://github.com/kengz/SLM-Lab/blob/master/slm_lab/agent/algorithm/sac.py
The SAC benchmark results on Roboschool (continuous) and LunarLander (discrete) are shown here: https://github.com/kengz/SLM-Lab/pull/399
There is a paper about SAC with discrete action spaces. It says that SAC for discrete action spaces doesn't need reparametrization tricks like Gumbel-Softmax; instead, it needs some modifications. Please refer to the paper for more details.
Paper / Author's implementation (without code for Atari) / Reproduction (with code for Atari)
I hope it helps you.
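For a rough idea of what those modifications look like, here is a hedged sketch of the discrete-SAC value and policy objectives (function and argument names are illustrative, not the paper's code): because a discrete policy outputs probabilities for every action, the expectations are computed exactly instead of via reparametrized samples.
import torch
import torch.nn.functional as F

def discrete_sac_objectives(logits, q1, q2, alpha):
    # logits, q1, q2: shape (batch, n_actions); alpha: entropy temperature.
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    q_min = torch.min(q1, q2).detach()  # critics are not updated through the policy loss
    # Soft state value: V(s) = E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ]
    v_target = (probs * (q_min - alpha * log_probs)).sum(dim=-1)
    # Policy objective: minimise E_{a~pi}[ alpha * log pi(a|s) - Q(s,a) ]
    policy_loss = (probs * (alpha * log_probs - q_min)).sum(dim=-1).mean()
    return v_target.detach(), policy_loss  # detach the value target when regressing V(s)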
This repo may be helpful. The description says that the repo contains an implementation of SAC for a discrete action space in PyTorch. There is a file with the SAC algorithm for continuous action spaces and a file with SAC adapted for discrete action spaces.
PyTorch 1.8 has RelaxedOneHotCategorical; it supports reparameterized sampling using Gumbel-Softmax.
import torch
import torch.nn as nn
from torch.distributions import RelaxedOneHotCategorical

class Policy(nn.Module):
    def __init__(self, input_dims, hidden_dims, actions):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(input_dims, hidden_dims), nn.SELU(inplace=True),
                                 nn.Linear(hidden_dims, hidden_dims), nn.SELU(inplace=True),
                                 nn.Linear(hidden_dims, actions))

    def forward(self, state):
        logits = torch.log_softmax(self.mlp(state), dim=-1)
        return RelaxedOneHotCategorical(logits=logits, temperature=torch.ones(1) * 1.0)
>>> policy = Policy(4, 16, 2)
>>> a_dist = policy(torch.randn(8, 4))
>>> a_dist.rsample()
tensor([[0.0353, 0.9647],
[0.1348, 0.8652],
[0.1110, 0.8890],
[0.4956, 0.5044],
[0.6941, 0.3059],
[0.6126, 0.3874],
[0.2932, 0.7068],
[0.0498, 0.9502]], grad_fn=<ExpBackward>)

DQN algorithm does not converge on CartPole-v0

Short Description of my model
I am trying to write my own DQN algorithm in Python, using TensorFlow and following the paper (Mnih et al., 2015). In the train_DQN function I have defined the training procedure, and DQN_CartPole defines the function approximation (a simple 3-layer neural network). For the loss function, Huber loss or MSE is implemented, followed by gradient clipping (between -1 and 1). Then, I have implemented a soft-update method for the target network instead of a hard update that copies the weights of the main network.
Question
I am trying it on the CartPole environment (OpenAI gym), but the reward does not improve as it does in other people's algorithms, such as keras-rl. Any help would be appreciated.
(plot: reward over timestep)
If possible, could you have a look at the source code?
DQN model: https://github.com/Rowing0914/TF_RL/blob/master/agents/DQN_model.py
Training Script: https://github.com/Rowing0914/TF_RL/blob/master/agents/DQN_train.py
Reddit post: https://www.reddit.com/r/reinforcementlearning/comments/ba7o55/question_dqn_algorithm_does_not_work_well_on/?utm_source=share&utm_medium=web2x
class Parameters:
def __init__(self, mode=None):
assert mode != None
print("Loading Params for {} Environment".format(mode))
if mode == "Atari":
self.state_reshape = (1, 84, 84, 1)
self.num_frames = 1000000
self.memory_size = 10000
self.learning_start = 10000
self.sync_freq = 1000
self.batch_size = 32
self.gamma = 0.99
self.update_hard_or_soft = "soft"
self.soft_update_tau = 1e-2
self.epsilon_start = 1.0
self.epsilon_end = 0.01
self.decay_steps = 1000
self.prioritized_replay_alpha = 0.6
self.prioritized_replay_beta_start = 0.4
self.prioritized_replay_beta_end = 1.0
self.prioritized_replay_noise = 1e-6
elif mode == "CartPole":
self.state_reshape = (1, 4)
self.num_frames = 10000
self.memory_size = 20000
self.learning_start = 100
self.sync_freq = 100
self.batch_size = 32
self.gamma = 0.99
self.update_hard_or_soft = "soft"
self.soft_update_tau = 1e-2
self.epsilon_start = 1.0
self.epsilon_end = 0.01
self.decay_steps = 500
self.prioritized_replay_alpha = 0.6
self.prioritized_replay_beta_start = 0.4
self.prioritized_replay_beta_end = 1.0
self.prioritized_replay_noise = 1e-6
class _DQN:
"""
Boilerplate for DQN Agent
"""
def __init__(self):
"""
define the deep learning model here!
"""
pass
def predict(self, sess, state):
"""
predict q-values given a state
:param sess:
:param state:
:return:
"""
return sess.run(self.pred, feed_dict={self.state: state})
def update(self, sess, state, action, Y):
feed_dict = {self.state: state, self.action: action, self.Y: Y}
_, loss = sess.run([self.train_op, self.loss], feed_dict=feed_dict)
# print(action, Y, sess.run(self.idx_flattened, feed_dict=feed_dict))
return loss
class DQN_CartPole(_DQN):
"""
DQN Agent for CartPole game
"""
def __init__(self, scope, env, loss_fn ="MSE"):
self.scope = scope
self.num_action = env.action_space.n
with tf.variable_scope(scope):
self.state = tf.placeholder(shape=[None, 4], dtype=tf.float32, name="X")
self.Y = tf.placeholder(shape=[None], dtype=tf.float32, name="Y")
self.action = tf.placeholder(shape=[None], dtype=tf.int32, name="action")
fc1 = tf.keras.layers.Dense(16, activation=tf.nn.relu)(self.state)
fc2 = tf.keras.layers.Dense(16, activation=tf.nn.relu)(fc1)
fc3 = tf.keras.layers.Dense(16, activation=tf.nn.relu)(fc2)
self.pred = tf.keras.layers.Dense(self.num_action, activation=tf.nn.relu)(fc3)
# indices of the executed actions
self.idx_flattened = tf.range(0, tf.shape(self.pred)[0]) * tf.shape(self.pred)[1] + self.action
# passing [-1] to tf.reshape means flatten the array
# using tf.gather, associate Q-values with the executed actions
self.action_probs = tf.gather(tf.reshape(self.pred, [-1]), self.idx_flattened)
if loss_fn == "huber_loss":
# use huber loss
self.losses = tf.subtract(self.Y, self.action_probs)
self.loss = huber_loss(self.losses)
elif loss_fn == "MSE":
# use MSE
self.losses = tf.squared_difference(self.Y, self.action_probs)
self.loss = tf.reduce_mean(self.losses)
else:
assert False
# you can choose whatever you want for the optimiser
# self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)
self.optimizer = tf.train.AdamOptimizer()
# to apply Gradient Clipping, we have to directly operate on the optimiser
# check this: https://www.tensorflow.org/api_docs/python/tf/train/Optimizer#processing_gradients_before_applying_them
self.grads_and_vars = self.optimizer.compute_gradients(self.loss)
self.clipped_grads_and_vars = [(ClipIfNotNone(grad, -1., 1.), var) for grad, var in self.grads_and_vars]
self.train_op = self.optimizer.apply_gradients(self.clipped_grads_and_vars)
def train_DQN(main_model, target_model, env, replay_buffer, policy, params):
"""
Train DQN agent which defined above
:param main_model:
:param target_model:
:param env:
:param params:
:return:
"""
# log purpose
losses, all_rewards, cnt_action = [], [], []
episode_reward, index_episode = 0, 0
with tf.Session() as sess:
# initialise all variables used in the model
sess.run(tf.global_variables_initializer())
state = env.reset()
start = time.time()
for frame_idx in range(1, params.num_frames + 1):
action = policy.select_action(sess, target_model, state.reshape(params.state_reshape))
cnt_action.append(action)
next_state, reward, done, _ = env.step(action)
replay_buffer.add(state, action, reward, next_state, done)
state = next_state
episode_reward += reward
if done:
index_episode += 1
state = env.reset()
all_rewards.append(episode_reward)
if frame_idx > params.learning_start and len(replay_buffer) > params.batch_size:
states, actions, rewards, next_states, dones = replay_buffer.sample(params.batch_size)
next_Q = target_model.predict(sess, next_states)
Y = rewards + params.gamma * np.max(next_Q, axis=1) * np.logical_not(dones)
loss = main_model.update(sess, states, actions, Y)
# Logging and refreshing log purpose values
losses.append(np.mean(loss))
logging(frame_idx, params.num_frames, index_episode, time.time()-start, episode_reward, np.mean(loss), cnt_action)
episode_reward = 0
cnt_action = []
start = time.time()
if frame_idx > params.learning_start and frame_idx % params.sync_freq == 0:
# soft update means we partially add the original weights of target model instead of completely
# sharing the weights among main and target models
if params.update_hard_or_soft == "hard":
sync_main_target(sess, main_model, target_model)
elif params.update_hard_or_soft == "soft":
soft_target_model_update(sess, main_model, target_model, tau=params.soft_update_tau)
return all_rewards, losses
Modification
dones -> np.logical_not(dones)
np.argmax -> np.max
separating MSE from huber_loss
Briefly looking over, it seems that the dones variable is a binary vector where 1 denotes done, and 0 denotes not-done.
You then use dones here:
Y = rewards + params.gamma * np.argmax(next_Q, axis=1) * dones
So for all terminating transitions you keep the bootstrapped estimate of the future return (which should be zero there), and for all non-terminating transitions you drop it (where it should be kept). I think you mean to do this the other way around; perhaps swap dones in the above line of code with np.logical_not(dones)?
Also, now that I look at it, there seems to be another major problem with this line. np.argmax(next_Q, axis=1) returns the index of the maximum value in each row of next_Q, not the maximum value itself. You need np.max(next_Q, axis=1) to get the maximum predicted Q-value over the next state's actions.
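Putting both fixes together, the corrected target matches the modifications now listed in the question; shown here only for clarity:
# Bootstrap with the max predicted Q-value, and only for non-terminal transitions.
Y = rewards + params.gamma * np.max(next_Q, axis=1) * np.logical_not(dones)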
EDIT: The loss function is also strangely defined. You are mixing Huber loss with mean squared error. If you want to use either huber_loss or MSE, you just compute it on the difference between the expected and predicted values; you appear to be doing both, which is certainly not a commonly defined loss function. For example, your model's loss using Huber loss should just be:
self.loss = tf.reduce_mean(huber_loss(abs(self.Y - self.action_probs)))
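As a side note, TF 1.x also ships a built-in Huber loss that could replace the custom helper; a sketch, assuming the placeholders defined in the question:
# Equivalent built-in alternative to the custom huber_loss helper.
self.loss = tf.losses.huber_loss(labels=self.Y, predictions=self.action_probs)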
