I am trying to code my own DQN in Python, using PyTorch, and I am trying it on the CartPole environment.
Although the Q-loss converged, the model performed poorly.
A replay buffer (size 2000) was also used, along with double networks: I updated the target network every time the q_net network had been updated 100 times. (PS: q_net is the network used to decide which action to choose.)
I tried different network architectures and different combinations of hyper-parameters, but the model still performed poorly; it just kept swaying and never kept itself balanced.
I sincerely appreciate your help!
Here is the figure of the Q-loss result: [Q-loss figure] [a closer look]
Here is the code for action taking.
def take_action(self, state):
    if self.learn_count <= 10:
        self.eposilon = 0.1
    else:
        self.eposilon = 0.9
    decision = np.random.choice([0, 1], p=[1 - self.eposilon, self.eposilon])
    if decision == 1:
        # final_decision = self.q_net(state.cuda().detach()).argmax()
        final_decision = self.q_net(torch.Tensor(state).to(self.device))
        final_decision = torch.max(final_decision, 0)[1].data.numpy()
    else:
        final_decision = np.random.choice([i for i in range(self.action_space)])
    return final_decision.item()
Here is the code for training the network.
def update_nn(self):
    if self.learn_count % 100 == 0:
        self.synchronous_NN()
    for i in range(self.num_epoches):
        self.learn_count = self.learn_count + 1
        self.training_nn()
    return None

def training_nn(self):
    index = random.sample(range(self.replay_buffer.shape[0]), self.minibatch_size)
    chosen_sample = self.replay_buffer[index, :]
    last_state = copy.deepcopy(chosen_sample[np.isnan(chosen_sample[:, -1:]).squeeze(), :])
    not_last_state = copy.deepcopy(chosen_sample[~np.isnan(chosen_sample[:, -1:]).squeeze(), :])
    input_not_last_state = torch.FloatTensor(not_last_state[:, :4]).to(self.device)
    action_index = torch.LongTensor(not_last_state[:, 4].reshape(-1, 1)).to(self.device)
    action_value = self.q_net(input_not_last_state).gather(1, action_index)
    max_action_value = not_last_state[:, 5] + self.gamma * self.fixed_q_net(input_not_last_state).detach().max(1).values.numpy()
    last_state = np.nan_to_num(last_state)
    input_last_state = torch.FloatTensor(last_state[:, :4]).to(self.device)
    last_action_index = torch.LongTensor(last_state[:, 4].reshape(-1, 1)).to(self.device)
    last_action_value = self.q_net(input_last_state).gather(1, last_action_index)
    last_max_action_value = last_state[:, 5]
    X = torch.cat([action_value, last_action_value])
    y = torch.FloatTensor(np.hstack([max_action_value, last_max_action_value]).reshape(-1, 1)).detach()
    loss = self.loss(X, y)
    self.optimizer.zero_grad()  # reset the gradient to zero
    loss.backward()
    self.optimizer.step()  # execute back propagation for one step
    self.loss_curve.append(loss)
    return None
Here is the code for playing:
def start_to_play(self):
    agent = Agent()
    agent.initial_Q_network(2)
    agent.initial_replay_buffer()
    self.env = gym.make('CartPole-v0')
    self.env = self.env.unwrapped
    for i in range(self.episode):
        if i % 50 == 1:
            agent.save_model(i)
        step = 0
        state = self.env.reset()
        ep_r = 0
        while(True):
            action = agent.take_action(state)
            observation, reward, done, info = self.env.step(action)
            self.env.render()
            # next_state = self.capture_state()
            next_state = observation
            x, x_dot, theta, theta_dot = observation
            r1 = (self.env.x_threshold - abs(x)) / self.env.x_threshold - 0.8
            r2 = (self.env.theta_threshold_radians - abs(theta)) / self.env.theta_threshold_radians - 0.5
            reward = r1 + r2
            ep_r = reward + ep_r
            if done:
                reward = reward - 20
                state1_np = np.array(state)
                state2_np = np.array([np.nan, np.nan, np.nan, np.nan])
                agent.add_replay_buffer(np.hstack([state1_np, np.array([action, reward]), state2_np]))
                if agent.replay_buffer.shape[0] > 300:
                    agent.update_nn()
                print(i, step, round(ep_r, 2))
                break
            else:
                state1_np = np.array(state)
                state2_np = np.array(next_state)
                agent.add_replay_buffer(np.hstack([state1_np, np.array([action, reward]), state2_np]))
                if agent.replay_buffer.shape[0] > 300:
                    agent.update_nn()
            state = next_state
            step = step + 1
    self.plot_curve(agent)
    return None
Thanks for your time!!
Loss in reinforcement learning is not very important. In fact, you can see agents with very good performance but poor neural network losses.
I suggest you look at the hyperparameters of other implementations, since CartPole has been solved countless times with multiple algorithms.
The first thing that might be wrong is the experience replay buffer, which is too small. But then again, look at other DQN implementations for CartPole and try to use the same hyperparameters they use.
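For reference, here is a rough sketch of the kind of hyperparameter values typically seen in working DQN implementations for CartPole. These numbers are illustrative assumptions on my part, not taken from any specific implementation, so treat them only as a starting point:

# Illustrative DQN hyperparameters for CartPole (orders of magnitude only)
hyperparams = dict(
    replay_buffer_size=50_000,    # much larger than 2000
    warmup_steps=1_000,           # transitions collected before learning starts
    batch_size=64,
    learning_rate=1e-3,
    gamma=0.99,
    epsilon_start=1.0,
    epsilon_end=0.05,
    epsilon_decay_steps=10_000,   # anneal gradually rather than in one jump
    target_update_every=500,      # gradient steps between target-network syncs
)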
I am implementing an RL agent based on the A2C implementation of stable-baselines3 on a gym environment with MultiDiscrete observation and action spaces.
I get the following error during learning:
RuntimeError: Class values must be smaller than num_classes.
This is a typical PyTorch error, but I cannot find its origin.
Before the code, let me explain the idea. We train a custom environment with several machines (for now we train with only two), and we need to decide the production rate of each machine before it breaks. The action space also includes the decision to schedule maintenance at some time distance in the future and, for each machine, whether that machine should be maintained.
Hence, the observation space is the consumption state of each machine plus the time distance to the scheduled maintenance (which can also be "not scheduled"), whereas the action space is the production rate of each machine, a maintenance decision for each machine, and a call-to-schedule.
A reward is given when total production exceeds a threshold, and negative rewards are the costs of maintenance and scheduling.
Now, I know this is a big problem and we need to reduce these spaces, but the actual issue is this PyTorch error. I do not see where it comes from. A2C handles MultiDiscrete observation and action spaces, but I do not know the origin of this error. We set up an A2C algorithm with MlpPolicy and try to train the policy on this environment.
I attach the code.
from gym import Env
from gym.spaces import MultiDiscrete
import numpy as np
from numpy.random import poisson
import random
from functools import reduce
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Flatten
# from tensorflow.keras.optimizers import Adam
from stable_baselines3 import A2C
from stable_baselines3.common.env_checker import check_env


class MaintenanceEnv(Env):
    def __init__(self, max_machine_states_vec, production_rates_vec, production_threshold, scheduling_horizon, operations_horizon=100):
        """
        Returns:
            self.action_space is a vector with the maximum production rate for each machine, a binary call-to-maintenance and a binary call-to-schedule
        """
        num_machines = len(max_machine_states_vec)
        assert len(max_machine_states_vec) == len(production_rates_vec), "Machine states and production rates have different cardinality"
        # Actions we can take, down, stay, up
        self.action_space = MultiDiscrete(production_rates_vec + num_machines*[2] + [2])  ### Action space is the production rate from 0 to N and the choice of scheduling
        # Temperature array
        self.observation_space = MultiDiscrete(max_machine_states_vec + [scheduling_horizon+2])  ### Observation space is the 0,...,L for each machine + the scheduling state including "ns" (None = "ns")
        # Set start temp
        self.state = num_machines*[0] + [0]
        # Set shower length
        self.operations_horizon = operations_horizon
        self.time_to_finish = operations_horizon
        self.scheduling_horizon = scheduling_horizon
        self.max_states = max_machine_states_vec
        self.production_threshold = production_threshold

    def step(self, action):
        """
        Notes: Schedule state
        """
        num_machines = len(self.max_states)
        maintenance_distance_index = -1
        reward = 0
        done = False
        info = {}
        ### Cost parameters
        cost_setup_schedule = 5
        cost_preventive_maintenance = 10
        cost_corrective_maintenance = 50
        reward_excess_on_production = 5
        cost_production_deficit = 10
        cost_fixed_penalty = 10
        failure_reward = -10**6
        amount_produced = 0
        ### Errors
        if action[maintenance_distance_index] == 1 and self.state[-1] != self.scheduling_horizon + 1:  # Case when you set a reparation scheduled, but it is already scheduled. Not possible.
            reward = failure_reward  ### It should not be possible
            done = True
            return self.state, reward, done, info
        if self.state[-1] == 0:
            for pos in range(num_machines):
                if action[num_machines + pos] == 1 and self.state[maintenance_distance_index] > 0:  ### Case when maintenance is applied, but schedule is not involved yet. Not possible.
                    reward = failure_reward  ### It should not be possible
                    done = True
                    return self.state, reward, done, info
        for pos in range(num_machines):
            if self.state[pos] == self.max_states[pos] and action[pos] > 0:  # Case when machine is broken, but it is producing
                reward = failure_reward  ### It should not be possible
                done = True
                return self.state, reward, done, info
        if self.state[maintenance_distance_index] == 0:
            for pos in range(num_machines):
                if action[num_machines+pos] == 1 and action[pos] > 0:  ### Case when it is maintenance time but the machines to be maintained keep working. Not possible
                    reward = failure_reward  ### It should not be possible
                    done = True
                    return self.state, reward, done, info
        ### State update
        for pos in range(num_machines):
            if self.state[pos] < self.max_states[pos] and self.state[maintenance_distance_index] > 0:  ### The machine is in production, state update includes product amount
                # self.state[pos] = min(self.max_states[pos], self.state[pos] + poisson(action[pos] / self.action_space[pos]))  ### Temporary: for I delete from the state the result of a poisson distribution depending on the production rate, Poisson is temporary
                self.state[pos] = min(self.max_states[pos], self.state[pos] + action[pos])  ### Temporary: Consumption rate is deterministic
                amount_produced += action[pos]
        if amount_produced >= self.production_threshold:
            reward += reward_excess_on_production * (amount_produced - self.production_threshold)
        else:
            reward -= cost_production_deficit * (self.production_threshold - amount_produced)
            reward -= cost_fixed_penalty
        if action[maintenance_distance_index] == 1 and self.state[maintenance_distance_index] == self.scheduling_horizon + 1:  ### You call a schedule when the state is not scheduled
            self.state[maintenance_distance_index] = self.scheduling_horizon
            reward -= cost_setup_schedule
        elif self.state[maintenance_distance_index] > 0 and self.state[maintenance_distance_index] <= self.scheduling_horizon:  ### You reduced the distance from scheduled maintenance
            self.state[maintenance_distance_index] -= 1
        for pos in range(num_machines):  ### Case when we are repairing the machines and we need to pay the costs of repairment, and set them as new
            if action[num_machines+pos] == 1:
                if self.state[pos] < self.max_states[pos]:
                    reward -= cost_preventive_maintenance
                elif self.state[pos] == self.max_states[pos]:
                    reward -= cost_corrective_maintenance
                self.state[pos] = 0
        if self.state[maintenance_distance_index] == 0:  ### when maintenance has been performed, reset the scheduling state to "not scheduled"
            self.state[maintenance_distance_index] = self.scheduling_horizon + 1
        ### Time threshold
        if self.time_to_finish > 0:
            self.time_to_finish -= 1
        else:
            done = True
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Implement viz
        pass

    def reset(self):
        # Reset shower temperature
        num_machines = len(self.max_states)
        self.state = np.array(num_machines*[0] + [0])
        self.time_to_finish = self.operations_horizon
        return self.state


def build_model(states, actions):
    model = Sequential()
    model.add(Dense(24, activation='relu', input_shape=states))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model


if __name__ == "__main__":
    ### GLOBAL CONSTANTS AND PARAMETERS
    NUMBER_MACHINES = 2
    FAILURE_STATE_LIMIT = 8
    MAXIMUM_PRODUCTION_RATE = 5
    SCHEDULING_HORIZON = 4
    PRODUCTION_THRESHOLD = 20
    machine_states = NUMBER_MACHINES * [4]
    failure_states = NUMBER_MACHINES * [FAILURE_STATE_LIMIT]
    production_rates = NUMBER_MACHINES * [MAXIMUM_PRODUCTION_RATE]
    ### Setting environment
    env = MaintenanceEnv(failure_states, production_rates, PRODUCTION_THRESHOLD, SCHEDULING_HORIZON)
    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10000)

    obs = env.reset()
    for i in range(1000):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        # env.render()
        if done:
            obs = env.reset()
I have the feeling it is due to the MultiDiscrete spaces, but I would appreciate some help. Thanks :)
As per the docs (https://www.gymlibrary.dev/api/spaces/#discrete), Discrete(n) has values from 0 to n-1 by default; note that it does not go up to n.
The MultiDiscrete space is built on top of Discrete and has the same behavior. Hence, if an action or observation component reaches the value n, you get this error. A simple way to solve it is to ensure that the values never exceed the bounds (from 0 up to and including n-1) while taking a step.
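In the environment above, one place where this could happen (a guess from reading the code, not a verified diagnosis) is the machine states: step() lets self.state[pos] reach self.max_states[pos], while the observation space MultiDiscrete(max_machine_states_vec + [scheduling_horizon+2]) only covers values up to max_machine_states_vec[pos] - 1. A possible fix sketch in __init__ would be to widen each machine component by one:

# Sketch: make the "broken" value max_machine_states_vec[pos] a valid observation
# (assumes the rest of the environment stays exactly as in the question)
self.observation_space = MultiDiscrete(
    [m + 1 for m in max_machine_states_vec] + [scheduling_horizon + 2]
)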
I'm trying to implement a DQN. As a warm-up I want to solve CartPole-v0 with an MLP consisting of two hidden layers along with the input and output layers. The input is a 4-element array [cart position, cart velocity, pole angle, pole angular velocity] and the output is an action value for each action (left or right). I am not exactly implementing the DQN from the "Playing Atari with DRL" paper (no frame stacking for inputs, etc.). I also made a few non-standard choices, like putting done and the target network's prediction of the action value into the experience replay, but those choices shouldn't affect learning.
In any case, I'm having a lot of trouble getting it to work. No matter how long I train the agent, it keeps predicting a higher value for one action over the other, for example Q(s, Right) > Q(s, Left) for all states s. Below are my learning code, my network definition, and some results I get from training.
class DQN:
    def __init__(self, env, steps_per_episode=200):
        self.env = env
        self.agent_network = MlpPolicy(self.env)
        self.target_network = MlpPolicy(self.env)
        self.target_network.load_state_dict(self.agent_network.state_dict())
        self.target_network.eval()
        self.optimizer = torch.optim.RMSprop(
            self.agent_network.parameters(), lr=0.005, momentum=0.95
        )
        self.replay_memory = ReplayMemory()
        self.gamma = 0.99
        self.steps_per_episode = steps_per_episode
        self.random_policy_stop = 1000
        self.start_learning_time = 1000
        self.batch_size = 32

    def learn(self, episodes):
        time = 0
        for episode in tqdm(range(episodes)):
            state = self.env.reset()
            for step in range(self.steps_per_episode):
                if time < self.random_policy_stop:
                    action = self.env.action_space.sample()
                else:
                    action = select_action(self.env, time, state, self.agent_network)
                new_state, reward, done, _ = self.env.step(action)
                target_value_pred = predict_target_value(
                    new_state, reward, done, self.target_network, self.gamma
                )
                experience = Experience(
                    state, action, reward, new_state, done, target_value_pred
                )
                self.replay_memory.append(experience)
                if time > self.start_learning_time:  # learning step
                    experience_batch = self.replay_memory.sample(self.batch_size)
                    target_preds = extract_value_predictions(experience_batch)
                    agent_preds = agent_batch_preds(
                        experience_batch, self.agent_network
                    )
                    loss = torch.square(agent_preds - target_preds).sum()
                    self.optimizer.zero_grad()
                    loss.backward()
                    self.optimizer.step()
                if time % 1_000 == 0:  # how frequently to update target net
                    self.target_network.load_state_dict(self.agent_network.state_dict())
                    self.target_network.eval()
                state = new_state
                time += 1
                if done:
                    break


def agent_batch_preds(experience_batch: list, agent_network: MlpPolicy):
    """
    Calculate the agent action value estimates using the old states and the
    actual actions that the agent took at that step.
    """
    old_states = extract_old_states(experience_batch)
    actions = extract_actions(experience_batch)
    agent_preds = agent_network(old_states)
    experienced_action_values = agent_preds.index_select(1, actions).diag()
    return experienced_action_values


def extract_actions(experience_batch: list) -> list:
    """
    Extract the list of actions from experience replay batch and torchify
    """
    actions = [exp.action for exp in experience_batch]
    actions = torch.tensor(actions)
    return actions


class MlpPolicy(nn.Module):
    """
    This class implements the MLP which will be used as the Q network. I only
    intend to solve classic control problems with this.
    """
    def __init__(self, env):
        super(MlpPolicy, self).__init__()
        self.env = env
        self.input_dim = self.env.observation_space.shape[0]
        self.output_dim = self.env.action_space.n
        self.fc1 = nn.Linear(self.input_dim, 32)
        self.fc2 = nn.Linear(32, 128)
        self.fc3 = nn.Linear(128, 32)
        self.fc4 = nn.Linear(32, self.output_dim)

    def forward(self, x):
        if type(x) != torch.Tensor:
            x = torch.tensor(x).float()
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x
Learning results:
Here I'm seeing one action always valued over the others (Q(right, s) > Q(left, s)). It's also clear that the network is predicting the same action values for every state.
Does anyone have an idea about what's going on? I've done a lot of debugging and careful reading of the original papers (also thought about "normalizing" the observation space even though the velocities can be infinite) and could be missing something obvious at this point. I can include more code for the helper functions if that would be useful.
There was nothing wrong with the network definition. It turns out the learning rate was too high; reducing it to 0.00025 (as in the original Nature paper introducing the DQN) led to an agent that can solve CartPole-v0.
That said, the learning algorithm was incorrect. In particular, I was using the wrong target action-value predictions. Note that the algorithm laid out above does not use the most recent version of the target network to make predictions. This leads to poor results as training progresses because the agent learns from stale target data. The fix is to just put (s, a, r, s', done) into the replay memory and then make target predictions using the most up-to-date version of the target network when sampling a mini-batch. See the code below for an updated learning loop.
def learn(self, episodes):
    time = 0
    for episode in tqdm(range(episodes)):
        state = self.env.reset()
        for step in range(self.steps_per_episode):
            if time < self.random_policy_stop:
                action = self.env.action_space.sample()
            else:
                action = select_action(self.env, time, state, self.agent_network)
            new_state, reward, done, _ = self.env.step(action)
            experience = Experience(state, action, reward, new_state, done)
            self.replay_memory.append(experience)
            if time > self.start_learning_time:  # learning step.
                experience_batch = self.replay_memory.sample(self.batch_size)
                target_preds = target_batch_preds(
                    experience_batch, self.target_network, self.gamma
                )
                agent_preds = agent_batch_preds(
                    experience_batch, self.agent_network
                )
                loss = torch.square(agent_preds - target_preds).sum()
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
            if time % 1_000 == 0:  # how frequently to update target net
                self.target_network.load_state_dict(self.agent_network.state_dict())
                self.target_network.eval()
            state = new_state
            time += 1
            if done:
                break
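target_batch_preds is not shown above; a minimal sketch of what it could look like, assuming Experience is a namedtuple with the fields (state, action, reward, new_state, done) used in the loop (the field names are my assumption, not confirmed by the answer):

def target_batch_preds(experience_batch: list, target_network: MlpPolicy, gamma: float):
    """Sketch: compute r + gamma * max_a Q_target(s', a), dropping the bootstrap
    term on terminal transitions. Field names are assumptions."""
    new_states = torch.tensor([exp.new_state for exp in experience_batch]).float()
    rewards = torch.tensor([exp.reward for exp in experience_batch]).float()
    dones = torch.tensor([float(exp.done) for exp in experience_batch])
    with torch.no_grad():  # targets must not backpropagate into the target network
        next_values = target_network(new_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_values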
I am attempting to teach a Double DQN agent to run a gridworld where there is one seeker (the agent) who tries to collect all the hiders, which are randomly spawned. Every step has a path cost of -0.1, and if a hider is collected a reward of 1 is received. The DQN receives an array with shape (world_width, world_height, 1) as the state, which is a complete top-down representation of the environment where empty space is 0, the seeker is 2, and a hider is 3. The agent then chooses one action: left, up, right, or down. An example configuration of the environment is shown in the image below.
[gridworld]
However, when training my agent the reward initially decreases in correlation with the decreasing exploration, so it can be assumed that when the agent follows the DQN it performs worse than when choosing actions randomly. Here are a few examples of the reward graphs I have received when training with different hyperparameters (the y-axis is total steps, where each episode is 100 steps unless it finishes early).
[Reward Graph]
As seen, the agent becomes worse at solving the environment, and the curve stabilizes approximately when epsilon reaches my min_epsilon (meaning almost no exploration or random moves).
I have tried different hyperparameters without any apparent difference in results, and would therefore appreciate it if someone could give me a pointer to where the problem might be.
The hyperparameters I have mostly been using are:
wandb.config.epsilon = 1.0
wandb.config.epsilon_decay = 0.99
wandb.config.batch_size = 32
wandb.config.learning_rate = 1e-3
wandb.config.gamma = 0.8
wandb.config.min_epsilon = 1e-1
wandb.config.buffersize = 10000
wandb.config.epochs = 1
wandb.config.reward_discount = 0.01
wandb.config.episodes = 1000
And here is my code:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import Adam
from collections import deque
from termcolor import colored
import wandb
from wandb.keras import WandbCallback
import numpy as np
import copy, os, random
from argparse import ArgumentParser
from plotter import plotter
from HNS import HNS
tf.keras.backend.set_floatx('float64')
wandb.init(name=name, project=project)
wandb.env.name = "HNS"
wandb.env.world_size = (8, 8)
wandb.env.state_dim = (8, 8, 1)
wandb.env.hider_count = 2
wandb.env.action_dim = 4
wandb.env.random_spawn = True
wandb.env.max_steps = 100
wandb.config.node = node
wandb.config.epsilon = 1.0
wandb.config.epsilon_decay = 0.99
wandb.config.batch_size = 32
wandb.config.learning_rate = 1e-3
wandb.config.gamma = 0.8
wandb.config.min_epsilon = 1e-1
wandb.config.buffersize = 10000
wandb.config.epochs = 1
wandb.config.reward_discount = 0.01
wandb.config.episodes = 1000
wandb.config.conv1_kernel = (8,8)
wandb.config.conv1_filters = 16
wandb.config.conv1_strides = 4
wandb.config.conv1_activation = "relu"
wandb.config.conv1_padding = "same"
wandb.config.conv2_kernel = (4,4)
wandb.config.conv2_filters = 32
wandb.config.conv2_strides = 4
wandb.config.conv2_activation = "relu"
wandb.config.conv2_padding = "same"
wandb.config.dense1_neurons = 16
wandb.config.dense1_activation = "relu"
wandb.config.loss = "mse"
parser = ArgumentParser()
parser.add_argument('--hider_count', type=int, default=wandb.env.hider_count)
parser.add_argument('--max_steps', type=int, default=wandb.env.max_steps)
parser.add_argument('--epsilon_decay', type=float, default=wandb.config.epsilon_decay)
parser.add_argument('--min_epsilon', type=float, default=wandb.config.min_epsilon)
parser.add_argument('--learning_rate', type=float, default=wandb.config.learning_rate)
parser.add_argument('--gamma', type=float, default=wandb.config.gamma)
parser.add_argument('--reward_discount', type=float, default=wandb.config.reward_discount)
parser.add_argument('--episodes', type=int, default=wandb.config.episodes)
parser.add_argument('--batch_size', type=int, default=wandb.config.batch_size)
args, unknown = parser.parse_known_args()
wandb.config.update(args, allow_val_change=True)
class ReplayBuffer:
    def __init__(self):
        self.buffer = deque(maxlen=wandb.config.buffersize)

    def put(self, state, action, reward, next_state, done):
        self.buffer.append([state, action, reward, next_state, done])

    def sample(self):
        sample = random.sample(self.buffer, wandb.config.batch_size)
        states, actions, rewards, next_states, done = map(np.asarray, zip(*sample))
        return states, actions, rewards, next_states, done

    def size(self):
        return len(self.buffer)
class ActionStatemodel:
    def __init__(self):
        self.epsilon = wandb.config.epsilon
        self.model = self.create_model()

    def create_model(self):
        # Init model
        model = tf.keras.Sequential()
        # Set up layers
        model.add(Conv2D(filters=wandb.config.conv1_filters, kernel_size=wandb.config.conv1_kernel, activation=wandb.config.conv1_activation,
                         strides=wandb.config.conv1_strides, padding=wandb.config.conv1_padding, name="conv_1", input_shape=wandb.env.state_dim))
        model.add(Conv2D(filters=wandb.config.conv2_filters, kernel_size=wandb.config.conv2_kernel, activation=wandb.config.conv2_activation,
                         strides=wandb.config.conv2_strides, padding=wandb.config.conv2_padding, name="conv_2"))
        model.add(Flatten())
        model.add(Dense(units=wandb.config.dense1_neurons, activation=wandb.config.dense1_activation, name="dense_1"))
        model.add(Dense(wandb.env.action_dim, name="dense_2"))
        # Finalize model
        model.compile(loss=wandb.config.loss, optimizer=Adam(wandb.config.learning_rate))
        model.summary()
        return model

    # Get q-values from state
    def predict(self, state):
        return self.model.predict(state)

    # Get action from q-values (epsilon-greedy)
    def get_action(self, state):
        # Predict action
        state = np.expand_dims(state, axis=0)
        q_value = self.predict(state)
        if np.random.random() < self.epsilon: return random.randint(0, wandb.env.action_dim - 1), 1
        else: return np.argmax(q_value), 0

    def train(self, states, targets):
        history = self.model.fit(states, targets, epochs=wandb.config.epochs, callbacks=[WandbCallback()], verbose=2, use_multiprocessing=True)
        return history.history["loss"][0]
class Agent:
    def __init__(self, env):
        self.env = env
        self.predict_net = ActionStatemodel()
        self.target_net = ActionStatemodel()
        self.target_update()
        self.buffer = ReplayBuffer()

    # Copy weights from model to target_model
    def target_update(self):
        weights = self.predict_net.model.get_weights()
        self.target_net.model.set_weights(weights)

    def replay(self):
        loss = 0
        for _ in range(5):
            states, actions, rewards, next_states, done = self.buffer.sample()
            # Collect predicted actions from predict_net
            predicted_q_values = self.predict_net.predict(next_states)
            predicted_actions = np.argmax(predicted_q_values, axis=1)
            # Get q values from target_net of above predicted actions
            target_q_values = self.target_net.predict(next_states)
            target_action_q_values = [np.take(target_q_values[i], predicted_actions[i]) for i in range(len(target_q_values))]
            # Create targets based on q values, reward and done
            targets = predicted_q_values.copy()
            targets[range(wandb.config.batch_size), actions] = rewards + (1 - done) * target_action_q_values * args.gamma
            loss += self.predict_net.train(states, targets)
        return loss

    def train(self):
        # Save weights for heatmap rendering
        # Main training loop
        for ep in range(wandb.config.episodes):
            # Initialization
            done, total_reward, step, loss, exploration = False, 0, 0, 0, 0
            state = self.env.reset()
            while not done and step < wandb.env.max_steps:
                # Predict and perform action
                action, e = self.predict_net.get_action(state)
                exploration += e
                next_state, reward, done, _ = self.env.step(action)
                self.buffer.put(state, action, reward * wandb.config.reward_discount, next_state, done)
                total_reward += reward
                if self.buffer.size() >= 1000 and step % 10 == 0:
                    loss = self.replay()
                state = next_state
                step += 1
            self.target_update()
            # Update epsilon
            self.predict_net.epsilon = max(wandb.config.epsilon_decay * self.predict_net.epsilon, wandb.config.min_epsilon)
            # Calculate weights change and log weights
            pre_weights = self.get_weights(self.predict_net.model.layers)
            tar_weights = self.get_weights(self.target_net.model.layers)
            # LOG
            print(colored("EP" + str(ep) + "-Reward: " + str(total_reward) + " Done: " + str(done), "green"))
            wandb.log({"episode": ep,
                       "buffersize": self.buffer.size(),
                       "EpReward": total_reward,
                       "epsilon": self.predict_net.epsilon,
                       "done": int(done),
                       "Exploration": exploration / _,
                       "loss": loss,
                       "pre_weights": pre_weights,
                       "tar_weights": tar_weights
                       })
            # "weigthUpdate" : wandb.Image(neuron_map),

    # Get weights and names for every layer of nn model
    def get_weights(self, layers):
        weigths = []
        names = []
        for layer in layers:
            wb = layer.get_weights()
            if wb:
                weigths.append(wb[0].flatten())
                names.append(layer.name)
        return weigths, names


if __name__ == "__main__":
    env = HNS(random_spawn=wandb.env.random_spawn, world_size=wandb.env.world_size, hider_count=wandb.env.hider_count)
    agent = Agent(env=env)
    agent.train()
    agent.target_net.model.save(os.path.join(wandb.run.dir, "model.h5"))
I tried to code a neural network to solve OpenAI's CartPole environment with Tensorflow and Keras. The network uses prioritized experience replay and a separate target network which is updated every tenth episode. Here's the code:
import numpy as np
import gym
import matplotlib.pyplot as plt
from collections import deque
from tensorflow import keras


class agent:
    def __init__(self, inputs, outputs, gamma):
        self.gamma = gamma
        self.epsilon = 1.0
        self.epsilon_decay = 0.999
        self.epsilon_min = 0.01
        self.inputs = inputs
        self.outputs = outputs
        self.network = self._build_net()
        self.target_network = self._build_net()
        self.replay_memory = deque(maxlen=100000)
        self.target_network.set_weights(self.network.get_weights())

    def _build_net(self):
        model = keras.models.Sequential()
        model.add(keras.layers.Dense(10, activation='relu', input_dim=self.inputs))
        model.add(keras.layers.Dense(10, activation='relu'))
        model.add(keras.layers.Dense(10, activation='relu'))
        model.add(keras.layers.Dense(self.outputs, activation='sigmoid'))
        model.compile(optimizer='adam', loss='categorical_crossentropy')
        return model

    def act(self, state, testing=False):
        if not testing:
            if self.epsilon > np.random.rand():
                return np.random.randint(self.outputs)
            else:
                return np.argmax(self.network.predict(np.array([state])))
        else:
            return np.argmax(self.network.predict(np.array([state])))

    def remember(self, state, action, next_state, reward, win):
        self.replay_memory.append((state, action, next_state, reward, win))

    def get_batch(self):
        batch = []
        losses = []
        targets = []
        if len(self.replay_memory) >= 32:
            for state, action, next_state, reward, done in self.replay_memory:
                target_f = np.zeros([1, self.outputs])
                if done != False:
                    target = (reward + self.gamma * np.amax(self.target_network.predict(np.array([next_state]))[0]))
                    target_f[0][action] = target
                targets.append(target_f)
                loss = np.mean(self.network.predict(np.array([state])) - target_f)**2
                losses.append(loss)
            indexes = np.argsort(losses)[:32]
            for indx in indexes:
                batch.append((self.replay_memory[indx][0], targets[indx]))
        return batch

    def replay(self, batch):
        for state, target in batch:
            self.network.fit(np.array([state]), target, epochs=1, verbose=0)
        self.epsilon = max(self.epsilon_min, self.epsilon*self.epsilon_decay)

    def update_target_network(self):
        self.target_network.set_weights(self.network.get_weights())

    def save(self):
        self.network.save_weights('./DQN.model')

    def load(self):
        self.network.load_weights('./DQN.model')


env = gym.make('CartPole-v0')
episodes = 1000
agent = agent(env.observation_space.shape[0], env.action_space.n, 0.99)
rewards = []
state = env.reset()
for _ in range(500):
    action = agent.act(state, 1.0)
    next_state, reward, win, _ = env.step(action)
    agent.remember(state, action, next_state, reward, win)
    state = next_state
    if win:
        state = env.reset()
for e in range(episodes):
    print('episode:', e+1)
    batch = agent.get_batch()
    state = env.reset()
    for t in range(400):
        action = agent.act(state)
        next_state, reward, win, _ = env.step(action)
        agent.remember(state, action, next_state, reward, win)
        state = next_state
        if win:
            break
    agent.replay(batch)
    print('score:', t)
    print('epsilon:', agent.epsilon)
    print('')
    if e % 10 == 0:
        agent.update_target_network()
        agent.save()
    rewards.append(t)
plt.plot(list(range(e+1)), rewards)
plt.savefig('./reward.png')
The problem is that the agent gets worse as epsilon decreases, and when epsilon is at its lowest value, the agent only gets through 7-9 steps before the pole falls, as seen in the image below. Can someone tell me why my agent isn't learning anything and how to fix it?
Your network output should use a linear activation instead of sigmoid, since you are bootstrapping a sum of all future rewards. The loss should be changed to mse as well, according to the derivation of the TD error in the original paper.
Also, if the experience ended in a terminal state (done == True), the TD target for the action taken should be set equal to the reward (you are currently setting it to 0).
Try making the reward = -reward if done. It helps.
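Putting those suggestions together, a minimal sketch of the changed pieces might look like this. Only the output activation, the loss, and the target computation change; everything else is assumed to stay as in the question, and keeping the network's own predictions for the non-taken actions is an extra common adjustment that the answers above do not mention:

def _build_net(self):
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(10, activation='relu', input_dim=self.inputs))
    model.add(keras.layers.Dense(10, activation='relu'))
    model.add(keras.layers.Dense(10, activation='relu'))
    # Q-values are unbounded sums of discounted rewards, so use a linear output
    model.add(keras.layers.Dense(self.outputs, activation='linear'))
    # regress towards the TD target with mean squared error
    model.compile(optimizer='adam', loss='mse')
    return model

# Inside get_batch(), one way to build the TD target for each stored transition
# ('done' is the flag stored as 'win' in the question):
if done:
    reward = -reward                 # optional tip from the last answer
    target = reward                  # terminal: no bootstrapping from next_state
else:
    target = reward + self.gamma * np.amax(
        self.target_network.predict(np.array([next_state]))[0])
target_f = self.network.predict(np.array([state]))   # keep other actions' values
target_f[0][action] = target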