Several questions regarding my implementation of PPO on Pytorch


I’ve been learning RL this summer, and this week I tried to write a PPO implementation in PyTorch with the help of some GitHub repositories that implement similar algorithms.
The code runs on OpenAI’s Lunar Lander, but I have several errors that I have not been able to fix. The biggest one is that the algorithm quickly converges to taking the same action regardless of the state. The other major problem is that even though I call backward() in only one place, I get an error asking me to set retain_graph to True.
Because of that, I see no improvement in the rewards obtained over 1000 steps, and I don’t know whether the algorithm simply needs more steps before an improvement shows up.
I’m really sorry if this kind of problem has no place in this forum; I just didn’t know where else to post it.
I’m also sorry for the messy code. It’s my first time writing this kind of algorithm, and I’m fairly new to PyTorch and machine learning in general.
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.distributions import Categorical
import gym

class actorCritic(nn.Module):
    def __init__(self):
        super(actorCritic, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(8, 16),
            nn.Linear(16, 32),
            nn.Linear(32, 64),
            nn.ReLU(inplace=True)
        )
        self.pi = nn.Linear(64, 4)
        self.value = nn.Linear(64, 1)

    def forward(self, x):
        x = self.fc(x)
        pi_1 = self.pi(x)
        pi_out = F.softmax(pi_1, dim=-1)
        value_out = self.value(x)
        return pi_out, value_out

def GAE(rewards, values, masks):
    gamma = 0.99
    lamb = 0.95
    advan_t = 0
    sizes = rewards.size()
    advantages = torch.zeros(1, sizes[1])
    for t in reversed(range(sizes[1])):
        delta = rewards[0][t] + gamma*values[0][t+1]*masks[0][t] - values[0][t]
        advan_t = delta + gamma*lamb*advan_t*mask[0][t]
        advantages[0][t] = advan_t
    real_values = values[:,:sizes[1]] + advantages
    return advantages, real_values

def plot_rewards(rewards):
    plt.figure(2)
    plt.clf()
    plt.plot(rewards)
    plt.pause(0.001)
    plt.savefig('TruePPO 500 steps.png')

def interact(times, states):
    rewards = torch.zeros(1, times)
    actions = torch.zeros(1, times)
    mask = torch.ones(1, times)
    for steps in range(times):
        action_probs, _ = network(states[steps])
        m = Categorical(action_probs)
        action = int(m.sample())
        obs, reward, done, _ = env.step(action)
        if done:
            obs = env.reset()
            mask[0][steps] = 0
        states[steps+1] = torch.from_numpy(obs).float()
        rewards[0][steps] = reward
        actions[0][steps] = action
    return states, rewards, actions, mask

#Parameters
total_steps = 1000
batch_size = 10
env = gym.make('LunarLander-v2')
network = actorCritic()
old_network = actorCritic()
optimizer = torch.optim.Adam(network.parameters(), lr = 0.001)
states = torch.zeros(batch_size+1, 8)
steps = 0
obs_ = env.reset()
obs = torch.from_numpy(obs_).float()
states[0] = obs
reward_means = []
nn_paramD = network.state_dict()
old_network.load_state_dict(nn_paramD)

while steps < total_steps:
    print (steps)
    states, rewards, actions, mask = interact(batch_size, states)
    #calculate values, GAE, normalize advantages, randomize, calculate loss, backprop,
    _, values = network(states)
    values = values.view(-1, batch_size+1)
    advantages, v_targ = GAE(rewards, values, mask)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-5)
    optimizer.zero_grad()
    for n in range(rewards.size()[1]):
        probabilities, _ = network(states[n])
        print (probabilities)
        m = Categorical(probabilities)
        action_prob = m.probs[int(actions[0][n])]
        entropia = m.entropy()
        old_probabilities, _ = old_network(states[n])
        m_old = Categorical(old_probabilities)
        old_action_prob = m.probs[actions[0][n].int()]
        old_action_prob.detach()
        ratio = action_prob / old_action_prob
        surr1 = ratio*advantages[0][n]
        surr2 = torch.clamp(ratio, min = (1.-0.2), max = (1.+0.2))
        policy_loss = -torch.min(surr1, surr2)
        value_loss = 0.5*(values[0][n]-v_targ[0][n])**2
        entropy_loss = -entropia
        total_loss = policy_loss + value_loss + 0.01*entropy_loss
        total_loss.backward(retain_graph = True)
    optimizer.step()
    reward_means.append(rewards.numpy().mean())
    old_network.load_state_dict(nn_paramD)
    nn_paramD = network.state_dict()
    steps += 1
plot_rewards(reward_means)
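For reference, the clipped surrogate objective that PPO optimizes can be sketched for a single sample as follows. This is a minimal, framework-free sketch rather than the code from the post: the function name `clipped_surrogate_loss` and the `clip_eps` parameter are assumptions made for illustration (the default of 0.2 matches the clipping range used in the post). In the usual formulation, both the unclipped and the clipped ratio are multiplied by the advantage before the minimum is taken.

```python
def clipped_surrogate_loss(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO policy loss (a quantity to minimize).

    ratio     -- pi_new(a|s) / pi_old(a|s) for the action that was taken
    advantage -- advantage estimate for that action (e.g. from GAE)
    clip_eps  -- hypothetical parameter name for the clipping range
    """
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - clip_eps, 1 + clip_eps] ...
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # ... and weight the clamped ratio by the advantage as well.
    clipped = clipped_ratio * advantage
    # PPO maximizes the minimum of the two terms, so the loss is its negative.
    return -min(unclipped, clipped)


if __name__ == "__main__":
    print(clipped_surrogate_loss(1.5, 1.0))   # ratio above the clip range, positive advantage
    print(clipped_surrogate_loss(0.5, -1.0))  # ratio below the clip range, negative advantage
```

The same expression carries over to tensors by swapping `max`/`min` for `torch.clamp` and `torch.min`.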

Related

Problem getting DQN to learn CartPole-v1 (PyTorch)

My DQN was training fine and solved the environment after ~65_000 iterations. However, I started working on something else, and now it's completely broken and doesn't get even close to the same level anymore.
Following advice from previous work, I tuned the hyperparameters but still couldn't reproduce those results.
import gym
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from torch import optim
from models import DQN
from memory import Memory
from utils import wrap_input, epsilon_greedy

def main() -> int:
    env = gym.make("CartPole-v1")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Online and offline model for learning
    model = DQN(env.observation_space, env.action_space, 24).to(device)
    target = DQN(env.observation_space, env.action_space, 24).to(device)
    target.eval()
    # Optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=.001)
    loss_fn = F.smooth_l1_loss
    memory = Memory(10_000)
    obs, info = env.reset()
    for it in range(65_000):
        # Do this for the batch norm
        model.eval()
        # Maybe explore
        if np.random.random() <= epsilon_greedy(1.0, .01, 15_000, it):
            state = wrap_input(obs, device).unsqueeze(0)
            action = model(state).argmax().item()
        else:
            action = env.action_space.sample()
        # Act in environment and store the memory
        next_state, reward, done, truncated, info = env.step(action)
        if truncated or done:
            next_state = np.zeros(env.observation_space.shape)
        memory.store([obs, action, reward, int(done), next_state])
        done = done or truncated
        if done:
            obs, info = env.reset()
        # Train
        if len(memory) > 32:
            model.train()
            states, actions, rewards, dones, next_states = memory.sample(32)
            # Wrap and move all values to the cpu
            states = wrap_input(states, device)
            actions = wrap_input(actions, device, torch.int64, reshape=True)
            next_states = wrap_input(next_states, device)
            rewards = wrap_input(rewards, device, reshape=True)
            dones = wrap_input(dones, device, reshape=True)
            # Get current q-values
            qs = model(states)
            qs = torch.gather(qs, dim=1, index=actions)
            # Compute target q-values
            with torch.no_grad():
                next_qs, _ = target(next_states).max(dim=1)
                next_qs = next_qs.reshape(-1, 1)
            target_qs = rewards + .9 * (1 - dones) * next_qs.reshape(-1, 1)
            # Compute loss
            loss = loss_fn(qs, target_qs)
            optimizer.zero_grad()
            loss.backward()
            # Clip gradients
            nn.utils.clip_grad_norm_(model.parameters(), 1)
            # Backprop
            optimizer.step()
            # soft update
            with torch.no_grad():
                for target_param, local_param in zip(target.parameters(), model.parameters()):
                    target_param.data.copy_(1e-2 * local_param.data + (1 - 1e-2) * target_param.data)
        if it % 200 == 0:
            target.load_state_dict(model.state_dict())

# models.py
class FlatExtractor(nn.Module):
    '''Does nothing but pass the input on'''
    def __init__(self, obs_space):
        super(FlatExtractor, self).__init__()
        self.n_flatten = obs_space.shape[0]

    def forward(self, obs):
        return obs

class DQN(nn.Module):
    def __init__(self, obs_space, act_space, layer_size):
        super(DQN, self).__init__()
        # Feature extractor
        if len(obs_space.shape) == 1:
            self.feature_extractor = FlatExtractor(obs_space)
        elif len(obs_space.shape) == 3:
            self.feature_extractor = NatureCnn(obs_space)
        else:
            raise NotImplementedError("This type of environment is not supported")
        # Neural network
        self.net = nn.Sequential(
            nn.Linear(self.feature_extractor.n_flatten, layer_size),
            nn.BatchNorm1d(layer_size),
            nn.ReLU(),
            nn.Linear(layer_size, layer_size),
            nn.BatchNorm1d(layer_size),
            nn.ReLU(),
            nn.Linear(layer_size, act_space.n),
        )

    def forward(self, obs):
        return self.net(self.feature_extractor(obs))

# memory.py
import random
from collections import deque

class Memory(object):
    def __init__(self, maxlen):
        self.memory = deque(maxlen=maxlen)

    def store(self, experience):
        self.memory.append(experience)

    def sample(self, n_samples):
        return zip(*random.sample(self.memory, n_samples))

    def __len__(self):
        return len(self.memory)

# utils.py
def wrap_input(arr, device, dtype=torch.float, reshape=False):
    output = torch.from_numpy(np.array(arr)).type(dtype).to(device)
    if reshape:
        output = output.reshape(-1, 1)
    return output

def epsilon_greedy(start, end, n_steps, it):
    return max(start - (start - end) * (it / n_steps), end)
Is there something major that I'm missing? I've tried training for longer, but it doesn't change anything. The biggest problem seems to be that the loss explodes, and even changing the tau for hard updates didn't fix this.

I had a lot of difficulty getting your code to run, so I had to comment several things out. I also commented out things that added unnecessary complexity while debugging; for instance, a simple environment like CartPole doesn't require a target network. Also, focus more on the total reward gained than on the loss.
A few major changes that I made were -
At the end of each iteration, next_state should become the current state -
    obs = next_state
I swapped your explore and exploit code. This is what you had -
    if np.random.random() <= epsilon_greedy(1.0, .01, 15_000, it):
        state = wrap_input(obs, device).unsqueeze(0)
        action = model(state).argmax().item()
    else:
        action = env.action_space.sample()
Your code basically starts off exploiting by taking the argmax and, once the epsilon value is low enough, starts sampling randomly. This needs to be swapped.
I replaced it with -
    if np.random.random() <= epsilon_greedy(1.0, .01, 15_000, it):
        action = env.action_space.sample()
    else:
        state = wrap_input(obs, device).unsqueeze(0)
        action = model(state).argmax().item()
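To see why the branch order matters, it helps to look at the schedule itself. The sketch below reuses `epsilon_greedy` from the question unchanged; `select_action` is a hypothetical helper added for illustration that works on a plain list of Q-values instead of a model, so the example runs without PyTorch or gym.

```python
import random


def epsilon_greedy(start, end, n_steps, it):
    # Linear decay from `start` to `end` over `n_steps` iterations (as in the question).
    return max(start - (start - end) * (it / n_steps), end)


def select_action(q_values, epsilon, rng=random):
    """Hypothetical helper: explore with probability epsilon, otherwise exploit."""
    if rng.random() <= epsilon:
        # Explore: uniformly random action index.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    for it in (0, 7_500, 15_000, 30_000):
        print(it, epsilon_greedy(1.0, 0.01, 15_000, it))
    print(select_action([0.1, 0.9], epsilon=0.0, rng=random.Random(0)))
```

Epsilon starts at 1.0 and decays to 0.01, so the branch guarded by `random() <= epsilon` is taken almost always at the start of training and almost never at the end. That branch therefore has to be the random one.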
I increased your batch size. A larger batch size in CartPole speeds up training considerably -
    states, actions, rewards, dones, next_states = memory.sample(128)
Also, it is a good idea to wait for your model to gather sufficient experience before starting training -
    if len(memory) > 500:
        model.train()
        states, actions, rewards, dones, next_states = memory.sample(128)
The other changes that I made were to make debugging easier.
I didn't see any use for class FlatExtractor(nn.Module), so I removed it and made the following change -
    if len(obs_space.shape) == 1:
        self.feature_extractor = env.observation_space.shape[0]
    def forward(self, obs):
        return self.net(obs)
I removed all instances of BatchNorm.
Replaced the loss with MSELoss and removed gradient clipping -
    loss_fn = nn.MSELoss()
Changed the learning rate to lr=.0001
Increased the width of your neural network -
    model = DQN(env.observation_space, env.action_space, 128).to(device)
Removed the target network and its corresponding soft updates.
Added a total reward counter to check whether the algorithm is learning -
    tot_rew = 0
    for it in range(65_000):
        next_state, reward, done, info = env.step(action)
        tot_rew += reward
        if done:
            print("tot_rew = ", tot_rew)
            obs = env.reset()
            tot_rew = 0
Here is the total reward I get at the end -
    tot_rew = 228.0
    tot_rew = 472.0
    tot_rew = 243.0
    tot_rew = 300.0
Here is the entire fixed code -
import gym
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from torch import optim

env = gym.make("CartPole-v1")

def main() -> int:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Online and offline model for learning
    model = DQN(env.observation_space, env.action_space, 128).to(device)
    target = DQN(env.observation_space, env.action_space, 24).to(device)
    # target.eval()
    # Optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=.0001)
    loss_fn = nn.MSELoss()
    memory = Memory(10_000)
    obs = env.reset()
    tot_rew = 0
    for it in range(65_000):
        # print("it = ", it)
        # Do this for the batch norm
        # model.eval()
        # Maybe explore
        if np.random.random() <= epsilon_greedy(1.0, .01, 15_000, it):
            action = env.action_space.sample()
        else:
            state = wrap_input(obs, device).unsqueeze(0)
            action = model(state).argmax().item()
        # print("epsilon_greedy(1.0, .01, 15_000, it) = ", epsilon_greedy(1.0, .01, 15_000, it))
        # print("check = ", model(state).detach().numpy())
        # print("action = ", action)
        # Act in environment and store the memory
        next_state, reward, done, info = env.step(action)
        tot_rew += reward
        if done:
            next_state = np.zeros(env.observation_space.shape)
        memory.store([obs, action, reward, int(done), next_state])
        done = done
        obs = next_state
        if done:
            print("tot_rew = ", tot_rew)
            obs = env.reset()
            tot_rew = 0
        # Train
        if len(memory) > 500:
            model.train()
            states, actions, rewards, dones, next_states = memory.sample(128)
            # Wrap and move all values to the device
            states = wrap_input(states, device)
            # print("states.shape = ",states.shape)
            actions = wrap_input(actions, device, torch.int64, reshape=True)
            next_states = wrap_input(next_states, device)
            rewards = wrap_input(rewards, device, reshape=True)
            dones = wrap_input(dones, device, reshape=True)
            # Get current q-values
            qs = model(states)
            # print("qs.shape = ", qs.shape)
            qs = torch.gather(qs, dim=1, index=actions)
            # Compute target q-values
            with torch.no_grad():
                next_qs, _ = model(next_states).max(dim=1)
                next_qs = next_qs.reshape(-1, 1)
            target_qs = rewards + .9 * (1 - dones) * next_qs.reshape(-1, 1)
            # Compute loss
            loss = loss_fn(qs, target_qs)
            # print("loss.shape = ", loss)
            optimizer.zero_grad()
            loss.backward()
            # Clip gradients
            # nn.utils.clip_grad_norm_(model.parameters(), 1)
            # Backprop
            optimizer.step()
            # soft update
            # with torch.no_grad():
            #     for target_param, local_param in zip(target.parameters(), model.parameters()):
            #         target_param.data.copy_(1e-2 * local_param.data + (1 - 1e-2) * target_param.data)
        # if it % 200 == 0:
        #     target.load_state_dict(model.state_dict())

# models.py
class FlatExtractor(nn.Module):
    '''Does nothing but pass the input on'''
    def __init__(self, obs_space):
        super(FlatExtractor, self).__init__()
        self.n_flatten = 1

    def forward(self, obs):
        return obs

class DQN(nn.Module):
    def __init__(self, obs_space, act_space, layer_size):
        super(DQN, self).__init__()
        # Feature extractor
        if len(obs_space.shape) == 1:
            self.feature_extractor = env.observation_space.shape[0]
        elif len(obs_space.shape) == 3:
            self.feature_extractor = NatureCnn(obs_space)
        else:
            raise NotImplementedError("This type of environment is not supported")
        # Neural network
        self.net = nn.Sequential(
            nn.Linear(self.feature_extractor, layer_size),
            nn.ReLU(),
            nn.Linear(layer_size, layer_size),
            nn.ReLU(),
            nn.Linear(layer_size, act_space.n),
        )

    def forward(self, obs):
        return self.net(obs)

# memory.py
import random
from collections import deque

class Memory(object):
    def __init__(self, maxlen):
        self.memory = deque(maxlen=maxlen)

    def store(self, experience):
        self.memory.append(experience)

    def sample(self, n_samples):
        return zip(*random.sample(self.memory, n_samples))

    def __len__(self):
        return len(self.memory)

# utils.py
def wrap_input(arr, device, dtype=torch.float, reshape=False):
    output = torch.from_numpy(np.array(arr)).type(dtype).to(device)
    if reshape:
        output = output.reshape(-1, 1)
    return output

def epsilon_greedy(start, end, n_steps, it):
    return max(start - (start - end) * (it / n_steps), end)

main()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)

I am trying to get to grips with PyTorch and I wanted to try to reproduce this code:
https://github.com/andy-psai/MountainCar_ActorCritic/blob/master/RL%20Blog%20FINAL%20MEDIUM%20code%2002_12_19.ipynb
in PyTorch.
The problem is that the following error is raised:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
A similar question suggested calling zero_grad() again after the optimizer step, but this hasn't resolved the issue.
I've included the entire code below, so it should be reproducible.
Any advice would be much appreciated.
import gym
import os
import os.path as osp
import time
import math
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal

env = gym.envs.make("MountainCarContinuous-v0")

# Value function
class Value(nn.Module):
    def __init__(self, dim_states):
        super(Value, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_states, 400),
            nn.ReLU(),
            nn.Linear(400,400),
            nn.ReLU(),
            nn.Linear(400, 1)
        )
        self.optimizer = optim.Adam(self.parameters(), lr = 1e-3)
        self.criterion = nn.MSELoss()

    def forward(self, state):
        return self.net(torch.from_numpy(state).float())

    def compute_return(self, output, target):
        self.optimizer.zero_grad()
        loss = self.criterion(output, target)
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()

# Policy network
class Policy(nn.Module):
    def __init__(self, dim_states, env):
        super(Policy, self).__init__()
        self.hidden1 = nn.Linear(dim_states, 40)
        self.hidden2 = nn.Linear(40, 40)
        self.mu = nn.Linear(40, 1)
        self.sigma = nn.Linear(40,1)
        self.env = env
        self.optimizer = optim.Adam(self.parameters(), lr = 2e-5)

    def forward(self, state):
        state = torch.from_numpy(state).float()
        x = F.relu(self.hidden1(state))
        x = F.relu(self.hidden2(x))
        mu = self.mu(x)
        sigma = F.softmax(self.sigma(x), dim=-1)
        action_dist = Normal(mu, sigma)
        action_var = action_dist.rsample()
        action_var = torch.clip(action_var,
                                self.env.action_space.low[0],
                                self.env.action_space.high[0])
        return action_var, action_dist

    def compute_return(self, action, dist, td_error):
        self.optimizer.zero_grad()
        loss_actor = -dist.log_prob(action)*td_error
        loss_actor.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()

# Normalise the state space
import sklearn
import sklearn.preprocessing
state_space_samples = np.array(
    [env.observation_space.sample() for x in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(state_space_samples)

# Normaliser
def scale_state(state):
    scaled = scaler.transform([state])
    return scaled

##################################
# Parameters
lr_actor = 0.00002
lr_critic = 0.001
actor = Policy(2, env)
critic = Value(2)
# Training loop params
gamma = 0.99
num_episodes = 300
episode_history = []

for episode in range(num_episodes):
    # Receive initial state from E
    state = env.reset()
    reward_total = 0
    steps = 0
    done = False
    while not done:
        action, dist = actor(state)
        # print(np.squeeze(action))
        next_state, reward, done, _ = env.step(
            np.array([action.item()]))
        if episode % 50 == 0:
            env.render()
        steps += 1
        reward_total += reward
        # TD Target
        target = reward + gamma * np.squeeze(critic(next_state), axis=0)
        td_error = target - np.squeeze(critic(state), axis=0)
        # Update actor
        actor.compute_return(action, dist, td_error)
        # Update critic
        critic.compute_return(np.squeeze(critic(state), axis=0), target)
    episode_history.append(reward_total)
    print(f"Episode: {episode}, N Steps: {steps}, Cumulative reward {reward_total}")
    if np.mean(episode_history[-100:]) > 90 and len(episode_history) > 101:
        print("Solved")
        print(f"Mean cumulative reward over 100 episodes {np.mean(episode_history[-100:])}")

The problem lies in the snippet that computes the TD target and updates the actor and critic. When you create the target variable, the forward pass through the critic builds a computation graph, and because target is computed from critic(next_state) it becomes part of that graph (you can check this by printing target, which will show grad_fn=<AddBackward0>). That graph is backpropagated through once when the actor is updated via td_error, which frees its saved tensors. When you then call critic.compute_return(critic_out, target), a new computation graph is built for critic_out, but the loss also backpropagates into target, which still belongs to the previous graph, and this causes the RuntimeError.
The solution is to call detach() on critic(next_state). This cuts target off from the computation graph, so nothing backpropagates through it anymore (again, you can check this by printing target).
    target = reward + gamma * np.squeeze(critic(next_state).detach(), axis=0)
    td_error = target - np.squeeze(critic(state), axis=0)
    # Update actor
    actor.compute_return(action, dist, td_error)
    # Update critic
    critic_out = np.squeeze(critic(state), axis=0)
    print(critic_out)
    critic.compute_return(critic_out, target)
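The effect of detach() can be reproduced in isolation. The following is a minimal sketch, not the code from the question: it assumes a tiny `nn.Linear` critic and random states, and the helper name `two_updates` is made up for illustration. The first backward call stands in for the actor update and the second for the critic update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
critic = nn.Linear(2, 1)          # stand-in for the Value network
state = torch.randn(1, 2)
next_state = torch.randn(1, 2)
reward, gamma = 1.0, 0.99


def two_updates(detach_target):
    """Return True if both backward passes succeed, False on a RuntimeError."""
    critic.zero_grad()
    v_next = critic(next_state)
    if detach_target:
        v_next = v_next.detach()  # cut the target out of the graph
    target = reward + gamma * v_next
    td_error = target - critic(state)
    try:
        # First backward pass (plays the role of the actor update).
        td_error.sum().backward()
        # Second backward pass (critic update) with a fresh forward pass;
        # it only reaches the old graph if `target` is still attached to it.
        loss = nn.functional.mse_loss(critic(state), target)
        loss.backward()
    except RuntimeError:
        return False
    return True


if __name__ == "__main__":
    print("without detach:", two_updates(detach_target=False))
    print("with detach:   ", two_updates(detach_target=True))
```

Wrapping the target computation in `torch.no_grad()` would be an equivalent way to keep it out of the graph.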

Keras Double DQN average reward decreases over time and is unable to converge

I am attempting to teach a Double DQN agent to run a gridworld in which one seeker (the agent) tries to collect all the hiders, which are spawned randomly. Every step has a path_cost of -0.1, and collecting a hider gives a reward of 1. The DQN net receives an array of shape (world_width,world_height,1) as the state, which is a complete top-down view of the environment where empty space is encoded as 0, the seeker as 2, and a hider as 3. The agent is then supposed to choose one action: left, up, right, or down. An example configuration of the environment is shown in the image below.
gridworld
However, when training my agent, the reward initially decreases as exploration decreases, so it can be assumed that the agent performs worse when it follows the DQN net than when it chooses actions randomly. Here are a few examples of the reward graphs I have received when training with different hyperparameters (y-axis is total steps where each episode is 100 steps unless it finishes).
Reward Graph
As seen, the agent becomes worse at solving the environment, and the curve stabilizes at approximately the point where epsilon becomes equal to my min_epsilon (meaning almost no exploration or random moves).
I have tried different hyperparameters without any apparent difference in results, and I would therefore appreciate it if someone could give me a pointer to where the problem might be.
The hyperparameters I have mostly been using are:
wandb.config.epsilon = 1.0
wandb.config.epsilon_decay = 0.99
wandb.config.batch_size = 32
wandb.config.learning_rate = 1e-3
wandb.config.gamma = 0.8
wandb.config.min_epsilon = 1e-1
wandb.config.buffersize = 10000
wandb.config.epochs = 1
wandb.config.reward_discount = 0.01
wandb.config.episodes = 1000
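As a quick sanity check on these values, the point at which epsilon reaches min_epsilon can be computed directly. This is a small sketch that assumes epsilon is multiplied by epsilon_decay once per episode and clamped at min_epsilon; the function name `episodes_until_min_epsilon` is made up for illustration.

```python
import math


def episodes_until_min_epsilon(epsilon=1.0, decay=0.99, min_epsilon=0.1):
    """Count per-episode decay steps until epsilon is clamped at min_epsilon."""
    episodes = 0
    while epsilon > min_epsilon:
        epsilon = max(decay * epsilon, min_epsilon)
        episodes += 1
    return episodes


# Closed form: smallest n with 1.0 * 0.99**n <= 0.1
closed_form = math.ceil(math.log(0.1 / 1.0) / math.log(0.99))

if __name__ == "__main__":
    print(episodes_until_min_epsilon(), closed_form)  # both give 230
```

Under that assumption, exploration bottoms out after roughly 230 of the 1000 episodes, so most of the run happens at the minimum exploration rate.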
And here is my code:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import Adam
from collections import deque
from termcolor import colored
import wandb
from wandb.keras import WandbCallback
import numpy as np
import copy, os, random
from argparse import ArgumentParser
from plotter import plotter
from HNS import HNS

tf.keras.backend.set_floatx('float64')
wandb.init(name=name, project=project)
wandb.env.name = "HNS"
wandb.env.world_size = (8, 8)
wandb.env.state_dim = (8, 8, 1)
wandb.env.hider_count = 2
wandb.env.action_dim = 4
wandb.env.random_spawn = True
wandb.env.max_steps = 100
wandb.config.node = node
wandb.config.epsilon = 1.0
wandb.config.epsilon_decay = 0.99
wandb.config.batch_size = 32
wandb.config.learning_rate = 1e-3
wandb.config.gamma = 0.8
wandb.config.min_epsilon = 1e-1
wandb.config.buffersize = 10000
wandb.config.epochs = 1
wandb.config.reward_discount = 0.01
wandb.config.episodes = 1000
wandb.config.conv1_kernel = (8,8)
wandb.config.conv1_filters = 16
wandb.config.conv1_strides = 4
wandb.config.conv1_activation = "relu"
wandb.config.conv1_padding = "same"
wandb.config.conv2_kernel = (4,4)
wandb.config.conv2_filters = 32
wandb.config.conv2_strides = 4
wandb.config.conv2_activation = "relu"
wandb.config.conv2_padding = "same"
wandb.config.dense1_neurons = 16
wandb.config.dense1_activation = "relu"
wandb.config.loss = "mse"

parser = ArgumentParser()
parser.add_argument('--hider_count', type=int, default=wandb.env.hider_count)
parser.add_argument('--max_steps', type=int, default=wandb.env.max_steps)
parser.add_argument('--epsilon_decay', type=float, default=wandb.config.epsilon_decay)
parser.add_argument('--min_epsilon', type=float, default=wandb.config.min_epsilon)
parser.add_argument('--learning_rate', type=float, default=wandb.config.learning_rate)
parser.add_argument('--gamma', type=float, default=wandb.config.gamma)
parser.add_argument('--reward_discount', type=float, default=wandb.config.reward_discount)
parser.add_argument('--episodes', type=int, default=wandb.config.episodes)
parser.add_argument('--batch_size', type=int, default=wandb.config.batch_size)
args, unknown = parser.parse_known_args()
wandb.config.update(args, allow_val_change=True)

class ReplayBuffer:
    def __init__(self):
        self.buffer = deque(maxlen=wandb.config.buffersize)

    def put(self, state, action, reward, next_state, done):
        self.buffer.append([state, action, reward, next_state, done])

    def sample(self):
        sample = random.sample(self.buffer, wandb.config.batch_size)
        states, actions, rewards, next_states, done = map(np.asarray, zip(*sample))
        return states, actions, rewards, next_states, done

    def size(self):
        return len(self.buffer)

class ActionStatemodel:
    def __init__(self):
        self.epsilon = wandb.config.epsilon
        self.model = self.create_model()

    def create_model(self):
        # Init model
        model = tf.keras.Sequential()
        # Set up layers
        model.add(Conv2D(filters=wandb.config.conv1_filters, kernel_size=wandb.config.conv1_kernel, activation=wandb.config.conv1_activation,
                         strides=wandb.config.conv1_strides, padding=wandb.config.conv1_padding, name="conv_1", input_shape=wandb.env.state_dim))
        model.add(Conv2D(filters=wandb.config.conv2_filters, kernel_size=wandb.config.conv2_kernel, activation=wandb.config.conv2_activation,
                         strides=wandb.config.conv2_strides, padding=wandb.config.conv2_padding, name="conv_2"))
        model.add(Flatten())
        model.add(Dense(units=wandb.config.dense1_neurons, activation=wandb.config.dense1_activation, name="dense_1"))
        model.add(Dense(wandb.env.action_dim, name="dense_2"))
        # Finalize model
        model.compile(loss=wandb.config.loss, optimizer=Adam(wandb.config.learning_rate))
        model.summary()
        return model

    # Get q-values from state
    def predict(self, state):
        return self.model.predict(state)

    # Get action from
    def get_action(self, state):
        # Predict action
        state = np.expand_dims(state, axis=0)
        q_value = self.predict(state)
        if np.random.random() < self.epsilon: return random.randint(0, wandb.env.action_dim - 1), 1
        else: return np.argmax(q_value), 0

    def train(self, states, targets):
        history = self.model.fit(states, targets, epochs=wandb.config.epochs, callbacks=[WandbCallback()], verbose=2, use_multiprocessing=True)
        return history.history["loss"][0]

class Agent:
    def __init__(self, env):
        self.env = env
        self.predict_net = ActionStatemodel()
        self.target_net = ActionStatemodel()
        self.target_update()
        self.buffer = ReplayBuffer()

    # Copy weights from model to target_model
    def target_update(self):
        weights = self.predict_net.model.get_weights()
        self.target_net.model.set_weights(weights)

    def replay(self):
        loss = 0
        for _ in range(5):
            states, actions, rewards, next_states, done = self.buffer.sample()
            # Collect predicted actions from predict_net
            predicted_q_values = self.predict_net.predict(next_states)
            predicted_actions = np.argmax(predicted_q_values, axis=1)
            # Get q values from target_net of above predicted actions
            target_q_values = self.target_net.predict(next_states)
            target_action_q_values = [np.take(target_q_values[i], predicted_actions[i]) for i in range(len(target_q_values))]
            # Create targets based on q values, reward and done
            targets = predicted_q_values.copy()
            targets[range(wandb.config.batch_size), actions] = rewards + (1 - done) * target_action_q_values * args.gamma
            loss += self.predict_net.train(states, targets)
        return loss

    def train(self):
        # Save weights for heatmap rendering
        # Main training loop
        for ep in range(wandb.config.episodes):
            # Initialization
            done, total_reward, step, loss, exploration = False, 0, 0, 0, 0
            state = self.env.reset()
            while not done and step < wandb.env.max_steps:
                # Predict and perform action
                action, e = self.predict_net.get_action(state)
                exploration += e
                next_state, reward, done, _ = self.env.step(action)
                self.buffer.put(state, action, reward * wandb.config.reward_discount, next_state, done)
                total_reward += reward
                if self.buffer.size() >= 1000 and step % 10 == 0:
                    loss = self.replay()
                state = next_state
                step += 1
            self.target_update()
            # Update epsilon
            self.predict_net.epsilon = max(wandb.config.epsilon_decay * self.predict_net.epsilon, wandb.config.min_epsilon)
            # Calculate weights change and log weights
            pre_weights = self.get_weights(self.predict_net.model.layers)
            tar_weights = self.get_weights(self.target_net.model.layers)
            # LOG
            print(colored("EP" + str(ep) + "-Reward: " + str(total_reward) + " Done: " + str(done), "green"))
            wandb.log({"episode" : ep,
                       "buffersize" : self.buffer.size(),
                       "EpReward" : total_reward,
                       "epsilon" : self.predict_net.epsilon,
                       "done" : int(done),
                       "Exploration" : exploration / _,
                       "loss" : loss,
                       "pre_weights" : pre_weights,
                       "tar_weights" : tar_weights
                       })
            # "weigthUpdate" : wandb.Image(neuron_map),

    # Get weights and names for every layer of nn model
    def get_weights(self, layers):
        weigths = []
        names = []
        for layer in layers:
            wb = layer.get_weights()
            if wb:
                weigths.append(wb[0].flatten())
                names.append(layer.name)
        return weigths, names

if __name__ == "__main__":
    env = HNS(random_spawn=wandb.env.random_spawn, world_size=wandb.env.world_size, hider_count=wandb.env.hider_count)
    agent = Agent(env=env)
    agent.train()
    agent.target_net.model.save(os.path.join(wandb.run.dir, "model.h5"))

Deep Q - Learning for Cartpole with Tensorflow in Python

I know there are many similar topics discussed on StackOverflow, but I have done quite a lot of research, both on StackOverflow and elsewhere on the Internet, and I couldn't find a solution.
I am trying to implement the classic Deep Q-Learning algorithm to solve OpenAI Gym's CartPole game:
OpenAI Gym Cartpole
First, I created an agent that generates random weights. The results are shown in the graph below:
Amazingly, the agent managed to reach 200 steps (which is the max) in many episodes by simply generating 4 random uniform weights [w1, w2, w3, w4] from (-1.0 to 1.0) in each episode.
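A random-weight agent of this kind can be sketched in a few lines. The post does not show how the four weights are turned into an action, so the rule below (push right when the weighted sum of the observation is positive, otherwise left) is an assumption, as are the names `random_weights` and `linear_policy`.

```python
import random


def random_weights(n=4, low=-1.0, high=1.0, rng=random):
    """Draw one weight per observation component, uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]


def linear_policy(weights, observation):
    """Assumed decision rule: action 1 (right) if w . obs > 0, else action 0 (left)."""
    score = sum(w * x for w, x in zip(weights, observation))
    return 1 if score > 0 else 0


if __name__ == "__main__":
    weights = random_weights(rng=random.Random(0))
    # Hypothetical CartPole-like observation: position, velocity, angle, angular velocity.
    observation = [0.0, 0.1, 0.02, -0.3]
    print(weights, linear_policy(weights, observation))
```

In each episode a fresh weight vector would be drawn once and `linear_policy` called at every step with the current observation.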
So, I decided to implement a simple DQN with only 4 weights and 2 biases and have the agent learn this game over time. The weights are initialized randomly at the beginning, and backpropagation is used to update them as the agent takes steps.
I used the epsilon-greedy strategy to make the agent explore at the beginning and exploit the Q-values later on. However, the results are disappointing compared to the random agent:
I have tried tuning a lot of parameters and different architectures, and the result doesn't change much. So, my question is the following:
Question:
Did I implement DQN incorrectly, or can a simple DQN not beat CartPole? What's your experience? It does reduce the loss (error), but that doesn't guarantee a good solution.
Thanks in advance.
import tensorflow as tf
import gym
import numpy as np
import random as rand
import matplotlib.pyplot as plt
# Cartpole's Observation:
# 4 Inputs
# 2 Actions (LEFT | RIGHT)
input_size = 4
output_size = 2
# Deep Q Network Class
class DQN:
    def __init__(self, var_names):
        self.var_names = var_names
        self._define_placeholders()
        self._add_layers()
        self._define_loss()
        self._choose_optimizer()
        self._initialize()
    # Placeholders:
    # Inputs: The place where we feed the Observations (States).
    # Targets: Q_target = R + gamma*Q(s', a*).
    def _define_placeholders(self):
        self.inputs = tf.placeholder(tf.float32, shape=(None, input_size), name='inputs')
        self.targets = tf.placeholder(tf.float32, shape=(None, output_size), name='targets')
    # Layers:
    # 4 Input Weights.
    # 2 Biases.
    # output = softmax(inputs*weights + biases).
    # Weights and biases are initialized randomly.
    def _add_layers(self):
        w = tf.get_variable(name=self.var_names[0], shape=(input_size, output_size),
                            initializer=tf.initializers.random_uniform(minval=-1.0, maxval=1.0))
        b = tf.get_variable(name=self.var_names[1], shape=(output_size,),
                            initializer=tf.initializers.random_uniform(minval=-1.0, maxval=1.0))
        self.outputs = tf.nn.softmax(tf.matmul(self.inputs, w) + b)
        self.prediction = tf.argmax(self.outputs, 1)
    # Loss = MSE / 2.
    def _define_loss(self):
        self.mean_loss = tf.losses.mean_squared_error(labels=self.targets, predictions=self.outputs) / 2
    # AdamOptimizer with starting learning rate: a = 0.005.
    def _choose_optimizer(self):
        self.optimizer = tf.train.AdamOptimizer(learning_rate=0.005).minimize(loss=self.mean_loss)
    # Initializes the DQN's weights.
    def _initialize(self):
        initializer = tf.global_variables_initializer()
        self.sess = tf.InteractiveSession()
        self.sess.run(initializer)
    # Gets the DQN's current weights.
    def get_weights(self):
        return [ self.sess.run( tf.trainable_variables(var) )[0] for var in self.var_names ]
    # Updates the weights of the DQN.
    def update_weights(self, new_weights):
        variables = [tf.trainable_variables(name)[0] for name in self.var_names]
        update = [ tf.assign(var, weight) for (var, weight) in zip(variables, new_weights) ]
        self.sess.run(update)
    # Predicts the best possible action from a state s.
    # a* = argmax( Q(s) )
    # Returns Q(s), a*
    def predict(self, states):
        Q, actions = self.sess.run( [self.outputs, self.prediction],
                                    feed_dict={self.inputs: states} )
        return Q, actions
    # Runs one optimization step on the given observations and targets.
    def partial_fit(self, states, targets):
        _, loss = self.sess.run( [self.optimizer, self.mean_loss],
                                 feed_dict={self.inputs: states, self.targets: targets} )
        return loss
# Replay Memory Buffer
# It stores experiences as (s,a,r,s') --> (State, Action, Reward, Next_State).
# It generates random mini-batches of experiences from the memory.
# If the memory is full, it deletes the oldest experiences. An experience is one step.
class ReplayMemory:
    def __init__(self, mem_size):
        self.mem_size = mem_size
        self.experiences = []
    def add_experience(self, xp):
        self.experiences.append(xp)
        if len(self.experiences) > self.mem_size:
            self.experiences.pop(0)
    def random_batch(self, batch_size):
        if len(self.experiences) < batch_size:
            return self.experiences
        else:
            return rand.sample(self.experiences, batch_size)
# The agent's class.
# It contains 2 DQNs: Online DQN for Predictions and Target DQN for the targets.
class Agent:
    def __init__(self, epsilon, epsilon_decay, min_epsilon, gamma, mem_size):
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.min_epsilon = min_epsilon
        self.gamma = gamma
        self.replay_mem = ReplayMemory(mem_size)
        self.online_dqn = DQN( var_names=['online_w', 'online_b'] )
        self.target_dqn = DQN( var_names=['target_w', 'target_b'] )
        self.state = None
    def set_epsilon(self, epsilon):
        self.epsilon = epsilon
    def reduce_epsilon(self):
        if self.epsilon > self.min_epsilon:
            self.epsilon -= self.epsilon_decay
    def update_state(self, state):
        self.state = state
    def update_memory(self, state, action, reward, next_state):
        experience = (state, action, reward, next_state)
        self.replay_mem.add_experience(experience)
    # It copies the online weights into the target network (called every N steps).
    def update_network(self):
        self.target_dqn.update_weights( self.online_dqn.get_weights() )
    # Randomly chooses an action from the environment.
    def explore(self, env):
        action = env.action_space.sample()
        return action
    # Predicts and chooses the best possible move from the current state.
    def exploit(self):
        _, action = self.online_dqn.predict(self.state)
        return action[0]
    # Uses Epsilon-Greedy to decide whether to explore or exploit.
    # Epsilon starts at 1 and is reduced over time.
    # After the agent makes a move, it returns: state, action, reward, next_state, done.
    def take_action(self, env):
        action = None
        p = rand.uniform(0.0, 1.0)
        if p < self.epsilon:
            action = self.explore(env)
        else:
            action = self.exploit()
        next_state, reward, done, _ = env.step(action)
        if done:
            next_state = None
        else:
            next_state = np.reshape( next_state, (1, input_size) )
        return self.state, action, reward, next_state, done
    # Trains the agent.
    # A random mini-batch is sampled from the memory and fed into the target DQN:
    #   Q(s) = Qtarget(s)
    #   Q(s'), a* = Qtarget(s'), argmax Q(s')
    # We set targets = Q(s).
    # For each action (a), reward (r), next_state (s') in the batch:
    #   If s' is None, the game is over, so we set target[i][a] = r.
    #   If s' is not None, then target[i][a] = r + gamma*Q(s', a*).
    # Then, the online DQN calculates the mean squared difference of r + gamma*Q(s', a*) - Q(s, a)
    # and uses back-propagation to update the weights.
    def train(self):
        mini_batch = self.replay_mem.random_batch(batch_size=256)
        batch_size = len(mini_batch)
        states = np.zeros( shape=(batch_size, input_size) )
        next_states = np.zeros( shape=(batch_size, input_size) )
        for i in range(batch_size):
            states[i] = mini_batch[i][0]
            next_states[i] = mini_batch[i][3]
        Q, _ = self.target_dqn.predict(states)
        next_Q, next_actions = self.target_dqn.predict(next_states)
        targets = Q
        for i in range(batch_size):
            action = mini_batch[i][1]
            reward = mini_batch[i][2]
            next_state = mini_batch[i][3]
            if next_state is None:
                targets[i][action] = reward
            else:
                targets[i][action] = reward + self.gamma * next_Q[i][ next_actions[i] ]
        loss = self.online_dqn.partial_fit(states, targets)
        return loss
def play(agent, env, episodes, N, render=False, train=True):
    ep = 0
    episode_steps = []
    steps = 0
    total_steps = 0
    loss = 0
    # Sets the current state as the initial one.
    # CartPole spawns the agent in a random state.
    agent.update_state( np.reshape( env.reset(), (1, input_size) ) )
    agent.update_network()
    while ep < episodes:
        if render:
            env.render()
        # The target DQN's weights are frozen.
        # The agent updates the target DQN's weights every N steps (N = 100 below).
        if train and total_steps % N == 0:
            agent.update_network()
            print('---Target network updated---')
        # Takes action.
        state, action, reward, next_state, done = agent.take_action(env)
        # Updates the memory and the current state.
        agent.update_memory(state, action, reward, next_state)
        agent.update_state(next_state)
        steps += 1
        total_steps += 1
        if train:
            loss = agent.train()
        if done:
            agent.update_state( np.reshape( env.reset(), (1, input_size) ) )
            episode_steps.append(steps)
            ep += 1
            if train:
                agent.reduce_epsilon()
                print('End of episode', ep, 'Training loss =', loss, 'Steps =', steps)
            steps = 0
    if render:
        env.close()
    return episode_steps
env = gym.make('CartPole-v0')
# Training the agent.
agent = Agent(epsilon=1, epsilon_decay = 0.01, min_epsilon = 0.05, gamma=0.9, mem_size=50000)
episodes = 1000
N = 100
episode_steps = play(agent, env, episodes, N)
# Plotting the results.
# After the training is done, the steps should be maximized (up to 200)
plt.plot(episode_steps)
plt.show()
# Testing the agent.
agent.set_epsilon(0)
episodes = 1
steps = play(agent, env, episodes, N, render=True, train=False)[0]
print('\nSteps =', steps)

The algorithm works quite well. When I decided to plot the data, I used as a metric:
Rewards / Episode
Most deep reinforcement learning frameworks (e.g. tf-agents) plot the mean reward (e.g. the mean reward per 10 episodes), which is why their plots look so smooth. If you look at the above plot, the agent manages to get a high score most of the time.
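To get a comparably smooth curve from per-episode values, a running mean can be applied before plotting. The sketch below is one possible way to do that with NumPy; the helper name `moving_average`, the sliding-window variant, and the sample numbers are illustrative assumptions, not what any particular framework does internally.

```python
import numpy as np

def moving_average(values, window=10):
    # Mean over a sliding window of `window` consecutive episodes.
    # With mode="valid" the result has len(values) - window + 1 entries.
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values.copy()
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Made-up episode lengths, only to show the call.
episode_steps = [10, 20, 30, 40, 50, 60]
smoothed = moving_average(episode_steps, window=3)
```

Passing `smoothed` instead of the raw `episode_steps` to `plt.plot` would give the smoother kind of plot.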
Also, I have decided to improve the speed of the algorithm by using numpy operations rather than "for" loops. You can check out my implementation here:
https://github.com/kochlisGit/Deep-Reinforcement-Learning/tree/master/Custom%20DQN
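As a rough idea of what replacing the per-sample loop with numpy operations can look like, here is a minimal sketch of a vectorized target computation. It is not taken from the linked repository; the function name `build_targets` and the boolean `dones` array (instead of storing `None` as the next state) are assumptions made for this example.

```python
import numpy as np

def build_targets(q, next_q, actions, rewards, dones, gamma):
    # q, next_q: (batch, n_actions) arrays; actions: int array; dones: bool array.
    targets = q.copy()
    rows = np.arange(len(actions))
    # Value of the greedy next action; zero for terminal transitions.
    best_next = np.where(dones, 0.0, next_q.max(axis=1))
    # Only the entry of the action actually taken is overwritten.
    targets[rows, actions] = rewards + gamma * best_next
    return targets

q = np.array([[0.2, 0.8], [0.5, 0.5]])
next_q = np.array([[0.1, 0.9], [0.3, 0.7]])
targets = build_targets(q, next_q,
                        actions=np.array([0, 1]),
                        rewards=np.array([1.0, 1.0]),
                        dones=np.array([False, True]),
                        gamma=0.9)
```

The first sample bootstraps from the next state, while the second one is terminal and keeps only its reward.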

DQN Atari with tensorflow: Training seems to stuck

I'm trying to train a Deep Q-Network (DQN) to play Atari Breakout in TensorFlow. The code runs without problems, but after 1000-1200 episodes the time to execute a single step always explodes to over 100 s.
Here is my DQN:
class DQNetwork():
    def __init__(self, scope, state_size=(84, 84, 4), num_outputs=4, gamma=0.9, learning_rate=0.001):
        self.scope = scope
        with tf.variable_scope(self.scope):
            # ---------------------
            # Basic Deep Q-Network
            # ---------------------
            self.x = tf.placeholder(tf.float32, shape=[None, *state_size], name="inputs")
            # Input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs = self.x,
                                          filters = 32,
                                          kernel_size = [8,8],
                                          strides = [4,4],
                                          padding = "VALID",
                                          name = "conv1",
                                          activation="relu")
            self.conv2 = tf.layers.conv2d(inputs = self.conv1,
                                          filters = 64,
                                          kernel_size = [4,4],
                                          strides = [2,2],
                                          padding = "VALID",
                                          name = "conv2",
                                          activation="relu")
            self.conv3 = tf.layers.conv2d(inputs = self.conv2,
                                          filters = 64,
                                          kernel_size = [3,3],
                                          strides = [1,1],
                                          padding = "VALID",
                                          name = "conv3",
                                          activation="relu")
            self.flatten = tf.layers.flatten(self.conv3)
            self.fc = tf.layers.dense(inputs = self.flatten,
                                      units = 512,
                                      activation = tf.nn.relu,
                                      name="fc1")
            self.logits = tf.layers.dense(inputs = self.fc,
                                          units = num_outputs,
                                          activation=None)
            self.best_action = tf.argmax(self.logits, name="best_action", axis=1)
            self.max_q = tf.reduce_max(self.logits, name="max_q", axis=1)
            if scope == 'Target':
                self.rewards = tf.placeholder(tf.float32, shape=None, name="rewards")
                self.gamma = tf.constant(gamma, name="Gamma")
                self.done = tf.placeholder(tf.int32, shape=None, name="done_values")
                # (1 - done) masks out the bootstrap term for terminal transitions.
                self.td_target = self.rewards + (self.gamma*self.max_q) * tf.cast( tf.abs(self.done -1 ), tf.float32)
            if scope == 'Q':
                self.target_placeholder = tf.placeholder(tf.float32, shape=None, name="target_placeholder_q")
                self.actions = tf.placeholder(tf.uint8, shape=None, name="AllActions")
                self.actions_onehot = tf.one_hot(self.actions, depth=num_outputs, name="One_Hot")
                # Sum over the action axis only, so Q holds one value per sample.
                # (Without axis=1 the sum collapses the whole batch into a single scalar.)
                self.Q = tf.reduce_sum(tf.multiply(self.actions_onehot, self.logits), axis=1)
                self.huber_loss = huber_loss(self.target_placeholder-self.Q)
                self.loss = tf.reduce_mean(self.huber_loss)
                self.optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate, epsilon=0.01)
                self.train = self.optimizer.minimize(self.loss, name="minimize")
Huber Loss Function:
def huber_loss(x, delta=1.0):
    """Reference: https://en.wikipedia.org/wiki/Huber_loss"""
    return tf.where(
        tf.abs(x) < delta,
        tf.square(x) * 0.5,
        delta * (tf.abs(x) - 0.5 * delta)
    )
Preprocess Frames
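# Imports assumed from the function names (not shown in the original post).
from skimage.color import rgb2gray
from skimage.transform import resize
def preprocess_frame(obs):
    # Convert to grayscale, resize to 84x84 and store as uint8 to save memory.
    processed_observe = np.uint8(
        resize(rgb2gray(obs), (84, 84), mode='constant') * 255)
    return processed_observe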
ReplayBuffer
# Imports assumed from the names used below (not shown in the original post).
from collections import deque
import random as r
class ReplayBuffer():
    def __init__(self, buffer_size):
        self.buffer = deque([], maxlen=buffer_size)
    def add(self, new_state):
        if len(new_state) != 5:
            raise Exception("States must have: state, action, reward, next_state, done")
        self.buffer.append(new_state)
    def sample(self, batch_size):
        return r.sample(self.buffer, batch_size)
Update Target Network
from operator import attrgetter  # assumed import, not shown in the original post
def get_update_target_ops(Q_network, Target_network):
    # Your code goes here
    # 1. get the trainable variables per network
    Q_trainable = tf.trainable_variables(scope=Q_network.scope)
    Target_trainable = tf.trainable_variables(scope=Target_network.scope)
    # 2. sort them with sorted(list, key=attrgetter())
    Q_trainable = sorted(Q_trainable, key=attrgetter("name"))
    Target_trainable = sorted(Target_trainable, key=attrgetter("name"))
    # 3. create a new list with all assign ops
    update_target_expr = []
    for q_var, t_var in zip(Q_trainable, Target_trainable):
        update_target_expr.append(t_var.assign(q_var))
    return update_target_expr
Greedy Action
# Note: n (presumably the number of actions) and obs_space (presumably the state shape)
# are globals that are not shown in this post.
def choose_egreedy_action(session, epsilon, network, state):
    state = np.float32(state / 255.0)
    if np.random.rand() <= epsilon:
        return np.random.randint(0, n)
    else:
        state_reshaped = np.reshape(state, (1, *obs_space))
        best_action = session.run(network.best_action, feed_dict={network.x:state_reshaped})[0]
        return best_action
Linear Schedule
class LinearSchedule():
    def __init__(self, start_epsilon, final_epsilon, pre_train_steps, decay):
        self.start_epsilon = start_epsilon
        self.final_epsilon = final_epsilon
        self.pre_train_steps = pre_train_steps
        self.decay = decay
        self.epsilon = start_epsilon
    def value(self, t):
        if t <= self.pre_train_steps:
            return self.start_epsilon
        else:
            # Clamp at final_epsilon so epsilon never drops below it (or goes negative).
            return max(self.final_epsilon,
                       self.start_epsilon - (t-self.pre_train_steps)*self.decay)
Train Step
def train(sess, Q, Target, buffer, batch_size):
    # Your code goes here
    # 1. Sample from the replay buffer
    mini_batches = buffer.sample(batch_size)
    observations, actions, rewards, next_observations, done, = map(list, zip(*mini_batches))
    observations = np.array(observations)/255.
    next_observations =np.array(next_observations)/255.
    # 2. Compute the TD targets with the target network
    td_targets = sess.run( Target.td_target, feed_dict={Target.x : next_observations, Target.rewards:rewards, Target.done:done})
    # 3. Run one optimization step on the Q network
    max_q, loss, _ = sess.run([Q.max_q, Q.loss, Q.train], feed_dict={Q.x : observations, Q.target_placeholder:td_targets, Q.actions:actions})
    return loss
Hyperparams
EPISODES = 50000
epsilon = 1.
epsilon_start, epsilon_end = 1.0, 0.1
exploration_steps = 1000000.
epsilon_decay_step = (epsilon_start - epsilon_end) / exploration_steps
batch_size = 32
train_start = 50000
update_target_rate = 10000
gamma = 0.99
buffer_size = 400000
no_op_steps = 30
global_steps = 10000000
Init networks
tf.reset_default_graph()
Q_network = DQNetwork(scope="Q", state_size=(84, 84, 4), num_outputs=4, gamma=gamma, learning_rate=0.00025)
T_network = DQNetwork(scope="Target",state_size=(84, 84, 4), num_outputs=4, gamma=gamma, learning_rate=0.00025)
update_target_network = get_update_target_ops(Q_network,T_network)
Train Loop
from tqdm import trange
import random
game = gym.make('BreakoutDeterministic-v4')
scores, episodes = [], [],
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    buffer = ReplayBuffer(buffer_size)
    sess.run(tf.global_variables_initializer())
    epsilon_schedule = LinearSchedule(epsilon_start, epsilon_end, train_start, epsilon_decay_step)
    done = False
    dead = False
    # 1 episode = 5 lives
    step, score, start_life = 0, 0, 5
    observe = game.reset()
    eps = 0
    # This is one of DeepMind's ideas:
    # take a random number of fixed steps at the start of an episode to avoid sub-optimal starts.
    for _ in range(random.randint(1, no_op_steps)):
        observe, _, _, _ = game.step(1)
    # At the start of an episode, there is no preceding frame,
    # so just copy the initial state to make the history.
    state = preprocess_frame(observe)
    history = np.stack((state, state, state, state), axis=2)
    history = np.reshape([history], (84, 84, 4))
    loss = 0
    for global_step in trange(global_steps):
        #global_step += 1
        step += 1
        # get action for the current history and go one step in environment
        action = choose_egreedy_action(sess, epsilon, Q_network, history)
        observe, reward, done, info = game.step(action)
        # pre-process the observation --> history
        next_state = preprocess_frame(observe)
        next_state = np.reshape([next_state], (84, 84, 1))
        #print(next_state.shape)
        next_history = np.append(next_state, history[ :, :, :3], axis=2)
        # if the agent missed the ball, the agent is dead --> episode is not over
        if start_life > info['ale.lives']:
            dead = True
            start_life = info['ale.lives']
        #if dead:
            #reward = -1
        #score += reward
        reward = np.clip(reward, -1., 1.)
        # save the sample <s, a, r, s'> to the replay memory
        buffer.add([history, action, reward, next_history, dead])
        epsilon = epsilon_schedule.value(global_step)
        if global_step > train_start:
            train(sess, Q_network, T_network, buffer, batch_size)
        # update the target model with model
        if global_step % update_target_rate == 0:
            #print("update networks")
            sess.run(update_target_network)
        score += reward
        # if the agent is dead, keep the current history instead of advancing it
        if dead:
            dead = False
        else:
            history = next_history
        # if done, record the score and reset the episode
        if done:
            if eps%100 == 0:
                print("episode:", eps, " score:", score, " global_step: ", global_step,
                      " epsilon: ", epsilon)
            scores.append(score)
            episodes.append(step)
            done = False
            dead = False
            # 1 episode = 5 lives
            step, score, start_life = 0, 0, 5
            observe = game.reset()
            eps += 1
            # This is one of DeepMind's ideas:
            # take a random number of fixed steps at the start of an episode to avoid sub-optimal starts.
            for _ in range(random.randint(1, no_op_steps)):
                observe, _, _, _ = game.step(1)
            # At the start of an episode, there is no preceding frame,
            # so just copy the initial state to make the history.
            state = preprocess_frame(observe)
            history = np.stack((state, state, state, state), axis=2)
            history = np.reshape([history], (84, 84, 4))
        if global_step % 5000 == 0:
            saver.save(sess, f'models/breakout/model_breakout.ckpt')
For the first 50,000 steps (no training) I get something like 400 it/s, which seems fine. After that it is about 50 it/s. As epsilon decreases, this number also decreases, because the network's best action has to be computed more often instead of picking a random one.
But after about 1000 episodes I get something like:
2%|▏ | 174582/10000000 [46:53<43:59:11, 100s/it] epsilon:0.88, score: 2
As you can see, the duration per iteration increases dramatically, so the training seems to be stuck.
I don't know whether this is a problem with my GPU or with the code.
Do you have any idea what to do?

I had the same experience while using an RL algorithm and training on the Atari Breakout environment from OpenAI Gym. It happened after my exploration rate dropped to a very low value. I found the solution in this post: OpenAI gym's breakout-v0 "pauses"
I knew my problem was similar to that post because when I rendered the game frames for each episode, I saw that after a life was lost in Atari Breakout, the ball disappeared (the game was paused).
The reason it didn't pause at the beginning of my training was that the exploration rate was high: the agent was taking random actions while paused, which eventually led it to choose the action that starts the game.
=================
I think the following may solve your problem, although I'm not sure it will (and it is usually not recommended in RL).
See this post:
OpenAI gym breakout-ram-v4 unable to learn
The question in the above post mentions that the algorithm sets the reward to -1 when the agent loses a life. It does this by using the system information (returned when .step() is called) to detect when a life is lost. My suggestion is similar: when you detect that a life is lost in Atari Breakout, hard-code the agent to choose, on the next step, the action that starts the game.
Again, this approach is not recommended in RL because the agent is only supposed to use the observation when deciding which action to take.
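A minimal sketch of that workaround is shown below. It assumes that FIRE is action index 1, which is what Breakout's action list usually reports (it can be checked with `env.unwrapped.get_action_meanings()`); the helper name `choose_action_with_restart` and the example values are made up for illustration.

```python
FIRE_ACTION = 1  # assumed index of FIRE; verify with env.unwrapped.get_action_meanings()

def choose_action_with_restart(agent_action, lives_before, lives_now, fire_action=FIRE_ACTION):
    # If a life was lost on the previous step, override the agent and press FIRE
    # so the ball is relaunched; otherwise keep the agent's own choice.
    if lives_now < lives_before:
        return fire_action
    return agent_action

# A life was just lost (5 -> 4): the agent's choice is replaced by FIRE.
action_after_life_lost = choose_action_with_restart(agent_action=3, lives_before=5, lives_now=4)
# No life lost: the agent's choice is kept.
action_normal = choose_action_with_restart(agent_action=3, lives_before=4, lives_now=4)
```

In the training loop from the question, `lives_now` would come from `info['ale.lives']` and `lives_before` from the value stored on the previous step.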
