I'm trying to optimize the coordinates of the corners of an image. A similar technique works fine in Ceres Solver, but with torch.optim I'm having some issues: for some reason the optimizer does not change the parameters being optimized. I don't have much experience with PyTorch, so I'm pretty sure the error is trivial. Unfortunately, reading the documentation did not help me much.
Optimization model class:
class OptimizeCorners(torch.nn.Module):
def __init__(self, real_corners):
super().__init__()
self._real_corners = torch.nn.Parameter(real_corners)
def forward(self, real_image, synt_image, synt_corners, _threshold):
# Find homography
if visualize_warp_interpolate:
real_image_before_processing = real_image
synt_image_before_processing = synt_image
homography_matrix = kornia.geometry.homography.find_homography_dlt(synt_corners,
self._real_corners,
weights=None)
# Warp and resize synt image
synt_image = kornia.geometry.transform.warp_perspective(synt_image.float(),
homography_matrix,
dsize=(int(real_image.shape[2]),
int(real_image.shape[3])),
mode='bilinear',
padding_mode='zeros',
align_corners=True,
fill_value=torch.zeros(3))
# Interpolate images
real_image = torch.nn.functional.interpolate(real_image.float(),
scale_factor=5,
mode='bicubic',
align_corners=None,
recompute_scale_factor=None,
antialias=False)
synt_image = torch.nn.functional.interpolate(synt_image.float(),
scale_factor=5,
mode='bicubic',
align_corners=None,
recompute_scale_factor=None,
antialias=False)
# Calculate loss
loss_map = torch.sub(real_image, synt_image, alpha=1)
# if element > _threshold: element = 0
loss_map = torch.nn.Threshold(_threshold, 0)(loss_map)
cumulative_loss = torch.sqrt(torch.sum(torch.pow(loss_map, 2)) /
(loss_map.size(dim=2) * loss_map.size(dim=3)))
return torch.autograd.Variable(cumulative_loss.data, requires_grad=True)
Here is how I am trying to run the optimization:
# Convert corresponding images to PyTorch tensors
_image = kornia.utils.image_to_tensor(_image, keepdim=False)
_synt_image = kornia.utils.image_to_tensor(_synt_image, keepdim=False)
_corners = torch.from_numpy(_corners)
_synt_corners = torch.from_numpy(_synt_corners)
# Optimizer L-BFGS
n_iters = 100
h_lbfgs = []
lr = 1
optimize_corners = OptimizeCorners(_corners)
optimizer = torch.optim.LBFGS(optimize_corners.parameters(),
lr=lr)
for it in tqdm(range(n_iters), desc='Fitting corners',
leave=False, position=1):
loss = optimize_corners(_image, _synt_image, _synt_corners, _threshold)
optimizer.zero_grad()
loss.backward()
optimizer.step(lambda: optimize_corners(_image, _synt_image, _synt_corners, _threshold))
h_lbfgs.append(loss.item())
print(h_lbfgs)
Output from console:
[screenshot of the console output]
So, as you can see, the parameters being optimized do not change.
UPD:
I changed return torch.autograd.Variable(cumulative_loss.data, requires_grad=True) to return cumulative_loss.requires_grad_(), and it actually works, but now I get this error after a few iterations:
[screenshot of the console output with the error]
UPD: this happens because the parameters being optimized turn into NaN after a few iterations.
After some time spent with the debugger, I found out that the main problem is that after a few iterations the backward() method starts to calculate the gradient incorrectly and outputs NaNs. Thus, the parameters being optimized are also calculated as NaNs. I didn't have a chance to find out exactly why this happens, because all the traces (I used torch.autograd.set_detect_anomaly(True)) pointed to the error occurring on the side of the C++ Torch engine, in the POW and SVD functions.
In the end, in my case, the problem was solved by casting all parameters from float32 to float64 and reducing the learning rate.
Here is the final updated code:
# Convert corresponding images to PyTorch tensors
_image = kornia.utils.image_to_tensor(_image, keepdim=False).double()
_synt_image = kornia.utils.image_to_tensor(_synt_image, keepdim=False).double()
_corners = torch.from_numpy(_corners).double()
_synt_corners = torch.from_numpy(_synt_corners).double()
# Optimizer L-BFGS
optimize_corners = OptimizeCorners(_corners)
optimizer = torch.optim.LBFGS(optimize_corners.parameters(),
max_iter=20,
lr=0.01)
torch.autograd.set_detect_anomaly(True)
def closure():
optimizer.zero_grad()
loss = optimize_corners(_image, _synt_image, _synt_corners, _threshold)
loss.backward()
return loss
for it in tqdm(range(100), desc="Fitting corners", leave=False, position=1):
optimizer.step(closure)
def forward(self, real_image, synt_image, synt_corners, _threshold):
# Find homography
if visualize_warp_interpolate:
real_image_before_processing = real_image
synt_image_before_processing = synt_image
homography_matrix = kornia.geometry.homography.find_homography_dlt(synt_corners,
self._real_corners,
weights=None)
# Warp and resize synt image
synt_image = kornia.geometry.transform.warp_perspective(synt_image,
homography_matrix,
dsize=(int(real_image.shape[2]),
int(real_image.shape[3])),
mode='bilinear',
padding_mode='zeros',
align_corners=True,
fill_value=torch.zeros(3))
# Interpolate images
real_image = torch.nn.functional.interpolate(real_image,
scale_factor=10,
mode='bicubic',
align_corners=None,
recompute_scale_factor=None,
antialias=False)
synt_image = torch.nn.functional.interpolate(synt_image,
scale_factor=10,
mode='bicubic',
align_corners=None,
recompute_scale_factor=None,
antialias=False)
# Calculate loss
loss_map = torch.sub(real_image, synt_image, alpha=1)
# if element > _threshold: element = 0
loss_map = torch.nn.Threshold(_threshold, 0)(loss_map)
cumulative_loss = torch.sqrt(torch.sum(torch.pow(loss_map, 2)) /
(loss_map.size(dim=2) * loss_map.size(dim=3)))
return cumulative_loss.requires_grad_()
I am currently working on a change detection project for my university course, and I am stuck writing a custom loss function. I know I have to use a function closure to be able to use data from the layers of the model, but I don't have enough TensorFlow/Keras knowledge to write efficient code.
[image: the modified cross-entropy loss equation]
This is the modified cross-entropy loss equation that I'm trying to turn into code. The loss needs the matrix W, which I have to calculate using the inputs to the model, that is, X1 and X2. So at the moment I have this:
def cmg_loss(X1,X2):
def loss(y_true,y_pred):
print(X1)
if X1.shape[0] == None:
X1 = tf.reshape(X1,(224,224,3))
X2 = tf.reshape(X2,(224,224,3))
cmm = [get_cmm(X1,X2)]
else:
cmm = [get_cmm(X1[i],X2[i]) for i in range(X1.shape[0])]
N = tf.convert_to_tensor(y_true.shape[0],dtype=tf.float32)
N_val = y_true.shape[0]
loss = tf.convert_to_tensor(0.0)
if(N_val == None):
loss = get_cmgloss(y_true[0],y_pred[0],cmm[0])
loss = tf.math.multiply(-1.0,loss)
return tf.math.divide(loss,N)
else:
for i in range(N_val):
print(i)
print("CMM len ", len(cmm))
x = get_cmgloss(y_true[i],y_pred[i],cmm[i])
loss = tf.math.add(loss,get_cmgloss(y_true[i],y_pred[i],cmm[i]))
loss = tf.math.multiply(-1.0,loss)
return tf.math.divide(loss,N)
return loss
def get_cmgloss(y_true,y_pred,W):
y_true = tf.cast(y_true,dtype=tf.float32)
y_pred = tf.cast(y_pred, dtype=tf.float32)
betaminus = findbetaminus(y_pred)
betaplus = 1 - betaminus
betaminus = betaminus.astype('float32')
betaplus = betaplus.astype('float32')
loss = tf.convert_to_tensor(0.0)
N = tf.convert_to_tensor(y_true.shape[0] * y_true.shape[0],dtype=tf.float32)
betaminus_matrix = tf.fill((224,224), betaminus)
betaplus_matrix = tf.fill((224,224), betaplus)
one_matrix = tf.fill((224,224), 1.0)
first_term = tf.math.multiply(betaminus_matrix,tf.math.multiply(y_true,tf.math.log(y_pred)))
second_term = tf.math.multiply(betaplus_matrix,tf.math.multiply(tf.math.subtract(one_matrix,y_true), tf.math.log(tf.math.subtract(one_matrix,y_pred))))
sum_first_second = tf.math.add(first_term, second_term)
prod = tf.math.multiply(W,sum_first_second)
loss = tf.math.reduce_sum(prod)
#loss = K.sum(K.sum(betaminus_matrix * y_true * tf.math.log(y_pred),betaplus_matrix * (1 - y_true) * tf.math.log(1 - y_pred)))
loss = tf.math.multiply(-1.0, loss)
return tf.math.divide(loss,N)
def findbetaminus(gt):
count_1 = tf.math.count_nonzero(gt == 1)
size = gt.shape[0] * gt.shape[1]
return count_1 / size;
def get_cmm(x1,x2):
b1_diff_sq = tf.math.squared_difference(x1[:,:,0],x2[:,:,0])
b2_diff_sq = tf.math.squared_difference(x1[:,:,1],x2[:,:,1])
b3_diff_sq = tf.math.squared_difference(x1[:,:,2],x2[:,:,2])
sum_3bands = tf.math.add(tf.math.add(b1_diff_sq,b2_diff_sq),b3_diff_sq)
cmm = tf.math.sqrt(sum_3bands)
#print(cmm)
max_val = tf.reduce_max(cmm)
#print("MAX_VAL ", max_val)
max_val_matrix = tf.fill((224,224), max_val)
cmm_bar = tf.divide(cmm,max_val_matrix)
#print(cmm_bar)
mean_cmm_bar = tf.reduce_mean(cmm_bar)
#print("MEAN_VAL ", mean_cmm_bar)
mean_cmm_bar_matrix = tf.fill((224,224), mean_cmm_bar)
#print(mean_cmm_bar_matrix)
condition = tf.math.greater(mean_cmm_bar_matrix, cmm_bar)
return tf.where(condition, mean_cmm_bar_matrix, cmm_bar)
#print(weights)
It would be a great help if you could guide me on how to develop a loss function that makes use of data from other layers and also calls multiple functions in its computation.
If you want to use more advanced loss functions, you will have to use tf.GradientTape to run the training steps yourself instead of using the fit method.
You can find many examples on the web and in the TensorFlow documentation. This requires a little more work, but it is much more powerful, because you can essentially output a list of tensors from your custom Model in the call method, then calculate the target losses and choose which parameters are changed.
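As a starting point, here is a minimal sketch of such a training step; model, optimizer, and loss_fn are assumed to exist, and loss_fn stands in for the modified cross-entropy above. Because nothing here is wrapped in tf.function, the loss runs eagerly and can freely call helpers such as get_cmm on the raw inputs.
import tensorflow as tf

def train_step(model, optimizer, loss_fn, x1, x2, y_true):
    # One custom training step; loss_fn(y_true, y_pred, x1, x2) is assumed to
    # implement the modified cross-entropy loss from the question.
    with tf.GradientTape() as tape:
        y_pred = model([x1, x2], training=True)   # forward pass through the custom model
        loss = loss_fn(y_true, y_pred, x1, x2)    # the loss can use the raw inputs directly
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss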
I tried to implement the simplest Deep Q-Learning algorithm. I think I've implemented it correctly, and I know that Deep Q-Learning struggles with divergence, but the reward is declining very fast and the loss is diverging. I would be grateful if someone could help me by pointing out the right hyperparameters, or tell me whether I implemented the algorithm incorrectly. I've tried a lot of hyperparameter combinations and also changed the complexity of the QNet.
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import collections
import numpy as np
import matplotlib.pyplot as plt
import gym
from torch.nn.modules.linear import Linear
from torch.nn.modules.loss import MSELoss
class ReplayBuffer:
def __init__(self, max_replay_size, batch_size):
self.max_replay_size = max_replay_size
self.batch_size = batch_size
self.buffer = collections.deque()
def push(self, *transition):
if len(self.buffer) == self.max_replay_size:
self.buffer.popleft()
self.buffer.append(transition)
def sample_batch(self):
indices = np.random.choice(len(self.buffer), self.batch_size, replace = False)
batch = [self.buffer[index] for index in indices]
state, action, reward, next_state, done = zip(*batch)
state = np.array(state)
action = np.array(action)
reward = np.array(reward)
next_state = np.array(next_state)
done = np.array(done)
return state, action, reward, next_state, done
def __len__(self):
return len(self.buffer)
class QNet(nn.Module):
def __init__(self, state_dim, action_dim):
super(QNet, self).__init__()
self.linear1 = Linear(in_features = state_dim, out_features = 64)
self.linear2 = Linear(in_features = 64, out_features = action_dim)
def forward(self, x):
x = self.linear1(x)
x = F.relu(x)
x = self.linear2(x)
return x
def train(replay_buffer, model, target_model, discount_factor, mse, optimizer):
state, action, reward, next_state, _ = replay_buffer.sample_batch()
state, next_state = torch.tensor(state, dtype = torch.float), torch.tensor(next_state,
dtype = torch.float)
# Compute Q Value and Target Q Value
q_values = model(state).gather(1, torch.tensor(action, dtype = torch.int64).unsqueeze(-1))
with torch.no_grad():
max_next_q_values = target_model(next_state).detach().max(1)[0]
q_target_value = torch.tensor(reward, dtype = torch.float) + discount_factor * max_next_q_values
optimizer.zero_grad()
loss = mse(q_values, q_target_value.unsqueeze(1))
loss.backward()
optimizer.step()
return loss.item()
def main():
# Define Hyperparameters and Parameters
EPISODES = 10000
MAX_REPLAY_SIZE = 10000
BATCH_SIZE = 32
EPSILON = 1.0
MIN_EPSILON = 0.05
DISCOUNT_FACTOR = 0.95
DECAY_RATE = 0.99
LEARNING_RATE = 1e-3
SYNCHRONISATION = 33
EVALUATION = 32
# Initialize Environment, Model, Target-Model, Optimizer, Loss Function and Replay Buffer
env = gym.make("CartPole-v0")
model = QNet(state_dim = env.observation_space.shape[0], action_dim =
env.action_space.n)
target_model = QNet(state_dim = env.observation_space.shape[0], action_dim =
env.action_space.n)
target_model.load_state_dict(model.state_dict())
optimizer = optim.Adam(model.parameters(), lr = LEARNING_RATE)
mse = MSELoss()
replay_buffer = ReplayBuffer(max_replay_size = MAX_REPLAY_SIZE, batch_size = BATCH_SIZE)
while len(replay_buffer) != MAX_REPLAY_SIZE:
state = env.reset()
done = False
while done != True:
action = env.action_space.sample()
next_state, reward, done, _ = env.step(action)
replay_buffer.push(state, action, reward, next_state, done)
state = next_state
# Begin with the Main Loop where the QNet is trained
count_until_synchronisation = 0
count_until_evaluation = 0
history = {'Episode': [], 'Reward': [], 'Loss': []}
for episode in range(EPISODES):
total_reward = 0.0
total_loss = 0.0
state = env.reset()
iterations = 0
done = False
while done != True:
count_until_synchronisation += 1
count_until_evaluation += 1
# Take an action
if np.random.rand(1) < EPSILON:
action = env.action_space.sample()
else:
with torch.no_grad():
output = model(torch.tensor(state, dtype = torch.float)).numpy()
action = np.argmax(output)
# Observe new state and reward + store into replay_buffer
next_state, reward, done, _ = env.step(action)
total_reward += reward
replay_buffer.push(state, action, reward, next_state, done)
state = next_state
if count_until_synchronisation % SYNCHRONISATION == 0:
target_model.load_state_dict(model.state_dict())
if count_until_evaluation % EVALUATION == 0:
loss = train(replay_buffer = replay_buffer, model = model, target_model =
target_model, discount_factor = DISCOUNT_FACTOR,
mse = mse, optimizer = optimizer)
total_loss += loss
iterations += 1
print (f"Episode {episode} is concluded in {iterations} iterations with a total reward
of {total_reward}")
if EPSILON > MIN_EPSILON:
EPSILON *= DECAY_RATE
history['Episode'].append(episode)
history['Reward'].append(total_reward)
history['Loss'].append(total_loss)
# Plot the Loss + Reward per Episode
fig, ax = plt.subplots(figsize = (10, 6))
ax.plot(history['Episode'], history['Reward'], label = "Reward")
ax.set_xlabel('Episodes', fontsize = 15)
ax.set_ylabel('Total Reward per Episode', fontsize = 15)
plt.legend(prop = {'size': 15})
plt.show()
fig, ax = plt.subplots(figsize = (10, 6))
ax.plot(history['Episode'], history['Loss'], label = "Loss")
ax.set_xlabel('Episodes', fontsize = 15)
ax.set_ylabel('Total Loss per Episode', fontsize = 15)
plt.legend(prop = {'size': 15})
plt.show()
if __name__ == "__main__":
main()
Your code looks fine; I think your hyperparameters are not ideal. I would change two, potentially three, things:
If I'm not mistaken, you update your target net every 32 steps. That is way too low, I think. In the original paper by Mnih et al., they do a hard update every 10k steps. Think about it: the target net is used to calculate the loss, so you essentially change the loss function every 32 steps, which would be more than once per episode.
Your replay buffer size is pretty small. I would set it to 100k or 1M, even if that is longer than what you intend to train for. If the replay buffer is too small, you will lose the older transitions, which can cause your network to "forget" things it already learned. Not sure how dramatic this is for cartpole, but maybe worth trying...
The learning rate could also be lower; I am using 1e-4 with RMSProp. Generally, changing the optimizer can also yield different results.
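For illustration, a minimal sketch of a much less frequent hard update in PyTorch; the interval below is only the order of magnitude mentioned above, not a tuned value.
import torch

def maybe_sync_target(model: torch.nn.Module, target_model: torch.nn.Module,
                      total_steps: int, update_every: int = 10_000) -> None:
    # Copy the online net's weights into the target net every `update_every`
    # environment steps (the interval is illustrative, after Mnih et al.).
    if total_steps > 0 and total_steps % update_every == 0:
        target_model.load_state_dict(model.state_dict())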
Hope this helps, good luck :)
Your code looks fine and well written, and the hyperparams seem reasonable (except maybe for the update frequency, which may be too low), but I think the Q network is quite small with a single dense layer.
A deeper model could likely do better (probably not more than 3-4 layers though), but you said that you already tried different network sizes.
Another thing that comes to mind is the target update. You are doing a hard update every n steps; a soft update may help a bit, but I wouldn't count on it.
You can also try lowering the learning rate a bit, but I imagine you already did that.
My suggestions are:
try less frequent target updates
try a larger network (deeper, something like 2-3 dense layers with 32 nodes each), if you haven't already
look into soft target updates (Polyak averaging and so on); a sketch follows at the end of this answer
try your implementation in other simple gym environments and check if its behavior is still the same.
Sadly DQN is not ideal and won't converge for many problems, but it should be able to solve cartpole.
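For the soft-update suggestion above, here is a minimal sketch of Polyak averaging in PyTorch; the value of tau is illustrative, not tuned for this problem.
import torch

def soft_update(model: torch.nn.Module, target_model: torch.nn.Module,
                tau: float = 0.005) -> None:
    # Polyak averaging: target <- tau * online + (1 - tau) * target.
    with torch.no_grad():
        for param, target_param in zip(model.parameters(), target_model.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * param)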
This answer might be a little late for the OP, but the implementation does not take terminal states into account. In some environments (e.g. CartPole), the only signal that the agent is doing something wrong comes from the terminal state. In CartPole the agent gets a +1 reward for every step it takes, but it has no notion of time, so if you don't create zero-value targets for terminal states, the value tries to converge towards an infinite sum of discounted +1 rewards for every state.
The simplest way to include zero values for terminal states is to also sample the batch of done flags and use them to mask out the bootstrapped part of the value targets you calculated before.
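Concretely, assuming done is stored as 1.0 for terminal transitions and 0.0 otherwise, a minimal sketch of the masked target (adapted to the shapes used in the question's train() function; the discount factor is illustrative):
import torch

def q_targets(reward, max_next_q_values, done, discount_factor=0.95):
    # Zero out the bootstrapped term for terminal states so the target is just
    # the final reward; `done` is 1.0 for terminal transitions, 0.0 otherwise.
    reward = torch.as_tensor(reward, dtype=torch.float)
    done = torch.as_tensor(done, dtype=torch.float)
    next_values = torch.as_tensor(max_next_q_values, dtype=torch.float)
    return reward + discount_factor * next_values * (1.0 - done)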
I'm trying to create an extremely simple network with a GRUcell layer to perform the following task: a cue is given in one of two locations. After T timesteps, the agent must learn to take a particular action at the opposite location.
I get the following error when trying to compute backward gradients:
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation.
One problem is that I don't fully understand which piece of my code is performing the inplace operation.
I have read other posts on stackoverflow and on the pytorch forum, which all suggest using the .clone() operation. I've peppered it all over my code anywhere I think it could conceivably make a difference, but I haven't had success.
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.gru = nn.GRUCell(2,50) # GRU layer taking 2 inputs (L or R), has 50 units
self.actor = nn.Linear(50,2) # Linear Actor layer with 2 outputs, takes GRU as input
self.critic = nn.Linear(50,1) # Linear Critic layer with 1 output, takes GRU as input
def forward(self, s, h):
h = self.gru(s,h) # give the input and previous hidden state to the GRU layer
c = self.critic(h) # estimate the value of the current state
pi = F.softmax(self.actor(h),dim=1) # calculate the policy
return (h,c,pi)
def backward_rollout(self, gamma, R_t, c_t, t):
R_t[0,t] = gamma*R_t[0,t+1].clone()
# calculate the reward prediction error
Delta_t[0,t] = c_t[0,t].clone() - R_t[0,t].clone()
#calculate the loss for the critic
crit = c_t[0,t].clone()
ret = R_t[0,t].clone()
Value_l[0,t] = F.smooth_l1_loss(crit,ret)
###################################
# Run a trial
# parameters
N = 1 # number of trials to run
T = 10 # number of time-steps in a trial
gamma = 0.98 # temporal discount factor
# for each trial
for n in range(N):
sample = np.random.choice([0,1],1)[0] # pick the sample input for this trial
s_t = torch.zeros((1,2,T)) # state at each time step
h_0 = torch.zeros((1,50)) # initial hidden state
h_t = torch.zeros((1,50,T)) # hidden state at each time step
c_t = torch.zeros((1,T)) # critic at each time step
pi_t = torch.zeros((1,2,T)) # policy at each time step
R_t = torch.zeros((1,T)) # return at each time step
Delta_t = torch.zeros((1,T)) # difference between critic and true return at each step
Value_l = torch.zeros((1,T)) # value loss
# set the input (state) vector/tensor
s_t[0,sample,0] = 1.0 # set first time-step stimulus
s_t[0,0,-1] = 1.0 # set last time-step stimulus
s_t[0,1,-1] = 1.0 # set last time-step stimulus
# step through the trial
for t in range(T):
# run a forward step
state = s_t[:,:,t].clone()
if t is 0:
(hidden_state, critic, policy) = net(state, h_0)
else:
(hidden_state, critic, policy) = net(state, h_t[:,:,t-1])
h_t[:,:,t] = hidden_state.clone()
c_t[:,t] = critic.clone()
pi_t[:,:,t] = policy.clone()
# select an action using the policy
action = np.random.choice([0,1], 1, p = policy[0,:].detach().numpy())
#action = int(np.random.uniform() < pi[0,1])
# compare the action to the sample
if action is sample:
r = 0
print("WRONG!")
else:
r = 1
print("RIGHT!")
#h_t_old = h_t
#s_t_old = s_t
# step backwards through the trial to calculate gradients
R_t[0,-1] = r
Delta_t[0,-1] = c_t[0,-1].clone() - r
Value_l[0,-1] = F.smooth_l1_loss(c_t[0,-1],R_t[0,-1]).clone()
for t in np.arange(T-2,-1,-1): #backwards rollout
net.backward_rollout(gamma, R_t, c_t, t)
Vl = Value_l.clone().sum()#calculate total loss
Vl.backward() #calculate the derivatives
opt.step() #update the weights
opt.zero_grad() #zero gradients before next trial
You can try anomaly_detection to pinpoint the exact offending in-place operation: https://github.com/pytorch/pytorch/issues/15803
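For reference, anomaly detection can be switched on globally or used as a context manager around the forward and backward pass; a minimal usage sketch:
import torch

# Enable anomaly detection globally ...
torch.autograd.set_detect_anomaly(True)

# ... or wrap the forward and backward pass, so the traceback points at the
# operation that produced the bad gradient:
x = torch.randn(3, requires_grad=True)
with torch.autograd.detect_anomaly():
    loss = (x * 2).sum()
    loss.backward()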
Value_l[0,-1] = and similar are in-place operations. You can sidestep the check by doing Value_l.data[0,-1] =, but this is not stored in the computational graph and may be a bad idea. A relevant discussion is here: https://discuss.pytorch.org/t/how-to-get-around-in-place-operation-error-if-index-leaf-variable-for-gradient-update/14554
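One common way around this, though not the only one, is to avoid indexed writes into a pre-allocated tensor altogether: collect the per-timestep losses in a Python list and stack them at the end, so autograd never sees an in-place assignment. A minimal sketch of that pattern (the function and argument names are illustrative, not the poster's code):
import torch
import torch.nn.functional as F

def critic_loss(critic_values, returns):
    # `critic_values` and `returns` are lists of scalar tensors collected during
    # the rollout; building the loss this way avoids writes like Value_l[0, t] = ...
    losses = [F.smooth_l1_loss(c, r) for c, r in zip(critic_values, returns)]
    return torch.stack(losses).sum()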
I am trying to create a reinforcement learning agent that can buy, sell or hold stock positions. The issue I'm having is that even after over 2000 episodes, the agent still cannot learn when to buy, sell or hold. Here is an image from the 2100th episode showing what I mean: the agent will not take any action unless it is random.
The agent learns using a replay memory and I have double and triple checked that there are no errors. Here is the code for the agent:
import numpy as np
import tensorflow as tf
import random
from collections import deque
from .agent import Agent
class Agent(Agent):
def __init__(self, state_size = 7, window_size = 1, action_size = 3,
batch_size = 32, gamma=.95, epsilon=.95, epsilon_decay=.95, epsilon_min=.01,
learning_rate=.001, is_eval=False, model_name="", stock_name="", episode=1):
"""
state_size: Size of the state coming from the environment
action_size: How many decisions the algo will make in the end
gamma: Decay rate to discount future reward
epsilon: Rate of randomly decided action
epsilon_decay: Rate of decrease in epsilon
epsilon_min: The lowest epsilon can get (limit to the randomness)
learning_rate: Progress of neural net in each iteration
episodes: How many times data will be run through
"""
self.state_size = state_size
self.window_size = window_size
self.action_size = action_size
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.learning_rate = learning_rate
self.is_eval = is_eval
self.model_name = model_name
self.stock_name = stock_name
self.q_values = []
self.layers = [150, 150, 150]
tf.reset_default_graph()
self.sess = tf.Session(config=tf.ConfigProto(allow_soft_placement = True))
self.memory = deque()
if self.is_eval:
model_name = stock_name + "-" + str(episode)
self._model_init()
# "models/{}/{}/{}".format(stock_name, model_name, model_name + "-" + str(episode) + ".meta")
self.saver = tf.train.Saver()
self.saver.restore(self.sess, tf.train.latest_checkpoint("models/{}/{}".format(stock_name, model_name)))
# self.graph = tf.get_default_graph()
# names=[tensor.name for tensor in tf.get_default_graph().as_graph_def().node]
# self.X_input = self.graph.get_tensor_by_name("Inputs/Inputs:0")
# self.logits = self.graph.get_tensor_by_name("Output/Add:0")
else:
self._model_init()
self.sess.run(self.init)
self.saver = tf.train.Saver()
path = "models/{}/6".format(self.stock_name)
self.writer = tf.summary.FileWriter(path)
self.writer.add_graph(self.sess.graph)
def _model_init(self):
"""
Init tensorflow graph vars
"""
# (1,10,9)
with tf.device("/device:GPU:0"):
with tf.name_scope("Inputs"):
self.X_input = tf.placeholder(tf.float32, [None, self.state_size], name="Inputs")
self.Y_input = tf.placeholder(tf.float32, [None, self.action_size], name="Actions")
self.rewards = tf.placeholder(tf.float32, [None, ], name="Rewards")
# self.lstm_cells = [tf.contrib.rnn.GRUCell(num_units=layer)
# for layer in self.layers]
#lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)
#gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
# self.multi_cell = tf.contrib.rnn.MultiRNNCell(self.lstm_cells)
# self.outputs, self.states = tf.nn.dynamic_rnn(self.multi_cell, self.X_input, dtype=tf.float32)
# self.top_layer_h_state = self.states[-1]
# with tf.name_scope("Output"):
# self.out_weights=tf.Variable(tf.truncated_normal([self.layers[-1], self.action_size]))
# self.out_bias=tf.Variable(tf.zeros([self.action_size]))
# self.logits = tf.add(tf.matmul(self.top_layer_h_state,self.out_weights), self.out_bias)
fc1 = tf.contrib.layers.fully_connected(self.X_input, 512, activation_fn=tf.nn.relu)
fc2 = tf.contrib.layers.fully_connected(fc1, 512, activation_fn=tf.nn.relu)
fc3 = tf.contrib.layers.fully_connected(fc2, 512, activation_fn=tf.nn.relu)
fc4 = tf.contrib.layers.fully_connected(fc3, 512, activation_fn=tf.nn.relu)
self.logits = tf.contrib.layers.fully_connected(fc4, self.action_size, activation_fn=None)
with tf.name_scope("Cross_Entropy"):
self.loss_op = tf.losses.mean_squared_error(self.Y_input,self.logits)
self.optimizer = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate)
self.train_op = self.optimizer.minimize(self.loss_op)
# self.correct = tf.nn.in_top_k(self.logits, self.Y_input, 1)
# self.accuracy = tf.reduce_mean(tf.cast(self., tf.float32))
tf.summary.scalar("Reward", tf.reduce_mean(self.rewards))
tf.summary.scalar("MSE", self.loss_op)
# Merge all of the summaries
self.summ = tf.summary.merge_all()
self.init = tf.global_variables_initializer()
def remember(self, state, action, reward, next_state, done):
self.memory.append((state, action, reward, next_state, done))
def act(self, state):
if np.random.rand() <= self.epsilon and not self.is_eval:
prediction = random.randrange(self.action_size)
if prediction == 1 or prediction == 2:
print("Random")
return prediction
act_values = self.sess.run(self.logits, feed_dict={self.X_input: state.reshape((1, self.state_size))})
if np.argmax(act_values[0]) == 1 or np.argmax(act_values[0]) == 2:
pass
return np.argmax(act_values[0])
def replay(self, time, episode):
print("Replaying")
mini_batch = []
l = len(self.memory)
for i in range(l - self.batch_size + 1, l):
mini_batch.append(self.memory[i])
mean_reward = []
x = np.zeros((self.batch_size, self.state_size))
y = np.zeros((self.batch_size, self.action_size))
for i, (state, action, reward, next_state, done) in enumerate(mini_batch):
target = reward
if not done:
self.target = reward + self.gamma * np.amax(self.sess.run(self.logits, feed_dict = {self.X_input: next_state.reshape((1, self.state_size))})[0])
current_q = (self.sess.run(self.logits, feed_dict={self.X_input: state.reshape((1, self.state_size))}))
current_q[0][action] = self.target
x[i] = state
y[i] = current_q.reshape((self.action_size))
mean_reward.append(self.target)
#target_f = np.array(target_f).reshape(self.batch_size - 1, self.action_size)
#target_state = np.array(target_state).reshape(self.batch_size - 1, self.window_size, self.state_size)
_, c, s = self.sess.run([self.train_op, self.loss_op, self.summ], feed_dict={self.X_input: x, self.Y_input: y, self.rewards: mean_reward}) # Add self.summ into the sess.run for tensorboard
self.writer.add_summary(s, global_step=(episode+1)/(time+1))
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
Once the replay memory is greater than the batch size, it runs the replay function. The code might look a little messy since I have been messing with it for days now trying to figure this out. Here is a screenshot of the MSE from tensorboard.
As you can see, by the 200th episode the MSE dies out to 0 or almost 0. I'm stumped! I have no idea what is going on. Please help me figure this out. The code is posted here so you can see the whole thing, including the train and eval files. The main agent I have been working on is LSTM.py in the agents folder. Thanks!
As discussed in the comments of the question, this seems to be a problem of the high learning rate decay.
Essentially, with every episode you multiply your learning rate by some factor j, which means that your learning rate after n episodes/epochs will be equal to
lr = initial_lr * j^n. In our example, the decay is set to 0.95, which means that after only a few iterations, the learning rate will already drop significantly. Subsequently, the updates will only perform minute corrections, and not "learn" very significant changes anymore.
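To see how quickly this shrinks, a quick back-of-the-envelope check (0.001 is the learning_rate from the question, 0.95 the decay; the episode counts are just examples):
initial_lr = 0.001   # learning_rate from the question
decay = 0.95         # decay factor applied once per episode

for n in (10, 50, 100, 200):
    print(f"after {n:3d} episodes: lr = {initial_lr * decay ** n:.2e}")
# after  10 episodes: lr = 5.99e-04
# after  50 episodes: lr = 7.69e-05
# after 100 episodes: lr = 5.92e-06
# after 200 episodes: lr = 3.51e-08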
This leads to the question: why does decay make sense at all? Generally speaking, we want to reach a local optimum (potentially a very narrow one). To do so, we try to get "relatively close" to such a minimum, and then only make smaller increments that lead us to this optimum. If we just continued with the original learning rate, we might simply step over the optimal solution every time and never reach our goal.
Visually, the problem can be summed up by this graphic:
Another method besides decay is to simply decrease the learning rate by a certain amount once the algorithm does not reach any significant updates anymore. This avoids the problem of diminishing learning rates purely by having many episodes.
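A plain-Python sketch of that plateau-based alternative; all thresholds and factors here are illustrative.
def step_down_lr(lr, best_loss, current_loss, patience_counter,
                 patience=10, factor=0.5, min_delta=1e-4):
    # Halve the learning rate when the loss has not improved by `min_delta`
    # for `patience` consecutive checks; returns (lr, best_loss, counter).
    if current_loss < best_loss - min_delta:
        return lr, current_loss, 0            # improvement: reset the counter
    patience_counter += 1
    if patience_counter >= patience:
        return lr * factor, best_loss, 0      # plateau: reduce the learning rate
    return lr, best_loss, patience_counter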
In your case specifically, a higher value for the decay (i.e. a slower decay) already seems to have helped quite a lot.
The Q value in reinforcement learning does not represent the 'reward' but the 'return', which is the sum of the current reward and discounted future rewards. When your model enters this 'dead end' of all-zero actions, the rewards will be zero based on your setting. Then, after a period of time, your replay buffer will be full of memories of 'an action of zero gives a reward of zero', so no matter how you update your model it cannot get out of this dead end.
As @dennlinger said, you may increase your epsilon to let your model gather some fresh memories to update from; you could also use prioritized experience replay to train on 'useful' experiences.
However, I suggest you look at the environment itself first. Your model outputs zeros because there are no better choices; is that true? Since you said you are trading stocks, are you sure there is enough information to lead to a strategy with a reward larger than zero? I think you need to think this through before doing any tuning. For example, if the stock moves up or down with a purely random 50/50 chance, then you'll never find a strategy that makes the average reward larger than zero.
The reinforcement learning agent might have already found the best strategy, though it's not the one you want.