Maybe my question will seem stupid.
I'm studying the Q-learning algorithm. To understand it better, I'm trying to rewrite the TensorFlow code of this FrozenLake example in Keras.
My code:
import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
# create lists to contain total rewards and steps per episode
jList = []
rList = []
num_episodes = 2000

for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  # random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            # Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break

print("Percent of successful episodes: " + str(sum(rList)/num_episodes) + "%")
When I run it, the results are poor: Percent of successful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does much better: Percent of successful episodes: 0.352%
plt.plot(rList)
What have I done wrong?
Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is to start with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to only decrease the epsilon value IF you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your player can keep exploring the map randomly until it starts to converge on a good route, and only then is it a good idea to reduce the epsilon parameter.
I've actually implemented a similar model in Keras in this gist, using convolutional layers instead of Dense layers, and managed to get it to work in under 2000 episodes. It might be of some help to others :)
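To make the epsilon advice concrete, here is a minimal sketch of such a schedule against the question's training loop; update_epsilon is a hypothetical helper (not code from the gist), and the decay and floor values are only illustrative:
def update_epsilon(epsilon, reached_goal, decay=0.95, min_epsilon=0.05):
    # Decay exploration only after a successful episode, so the agent keeps
    # exploring randomly until it has actually found a route to the goal.
    if reached_goal:
        return max(min_epsilon, epsilon * decay)
    return epsilon

e = 0.75  # start higher than 0.1
# at the end of each episode (FrozenLake returns reward 1 only when the goal is reached):
# e = update_epsilon(e, reached_goal=(reward > 0))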
Related
I tried to implement the simplest Deep Q-Learning algorithm. I think I've implemented it correctly, and I know that Deep Q-Learning struggles with divergence, but the reward is declining very fast and the loss is diverging. I would be grateful if someone could point out the right hyperparameters, or tell me whether I implemented the algorithm incorrectly. I've tried a lot of hyperparameter combinations and also changed the complexity of the QNet.
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import collections
import numpy as np
import matplotlib.pyplot as plt
import gym

from torch.nn.modules.linear import Linear
from torch.nn.modules.loss import MSELoss


class ReplayBuffer:
    def __init__(self, max_replay_size, batch_size):
        self.max_replay_size = max_replay_size
        self.batch_size = batch_size
        self.buffer = collections.deque()

    def push(self, *transition):
        if len(self.buffer) == self.max_replay_size:
            self.buffer.popleft()
        self.buffer.append(transition)

    def sample_batch(self):
        indices = np.random.choice(len(self.buffer), self.batch_size, replace = False)
        batch = [self.buffer[index] for index in indices]
        state, action, reward, next_state, done = zip(*batch)
        state = np.array(state)
        action = np.array(action)
        reward = np.array(reward)
        next_state = np.array(next_state)
        done = np.array(done)
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)


class QNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNet, self).__init__()
        self.linear1 = Linear(in_features = state_dim, out_features = 64)
        self.linear2 = Linear(in_features = 64, out_features = action_dim)

    def forward(self, x):
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x


def train(replay_buffer, model, target_model, discount_factor, mse, optimizer):
    state, action, reward, next_state, _ = replay_buffer.sample_batch()
    state, next_state = torch.tensor(state, dtype = torch.float), torch.tensor(next_state, dtype = torch.float)

    # Compute Q Value and Target Q Value
    q_values = model(state).gather(1, torch.tensor(action, dtype = torch.int64).unsqueeze(-1))

    with torch.no_grad():
        max_next_q_values = target_model(next_state).detach().max(1)[0]

    q_target_value = torch.tensor(reward, dtype = torch.float) + discount_factor * max_next_q_values

    optimizer.zero_grad()
    loss = mse(q_values, q_target_value.unsqueeze(1))
    loss.backward()
    optimizer.step()
    return loss.item()


def main():
    # Define Hyperparameters and Parameters
    EPISODES = 10000
    MAX_REPLAY_SIZE = 10000
    BATCH_SIZE = 32
    EPSILON = 1.0
    MIN_EPSILON = 0.05
    DISCOUNT_FACTOR = 0.95
    DECAY_RATE = 0.99
    LEARNING_RATE = 1e-3
    SYNCHRONISATION = 33
    EVALUATION = 32

    # Initialize Environment, Model, Target-Model, Optimizer, Loss Function and Replay Buffer
    env = gym.make("CartPole-v0")

    model = QNet(state_dim = env.observation_space.shape[0], action_dim = env.action_space.n)
    target_model = QNet(state_dim = env.observation_space.shape[0], action_dim = env.action_space.n)
    target_model.load_state_dict(model.state_dict())

    optimizer = optim.Adam(model.parameters(), lr = LEARNING_RATE)
    mse = MSELoss()

    replay_buffer = ReplayBuffer(max_replay_size = MAX_REPLAY_SIZE, batch_size = BATCH_SIZE)

    while len(replay_buffer) != MAX_REPLAY_SIZE:
        state = env.reset()
        done = False
        while done != True:
            action = env.action_space.sample()
            next_state, reward, done, _ = env.step(action)
            replay_buffer.push(state, action, reward, next_state, done)
            state = next_state

    # Begin with the Main Loop where the QNet is trained
    count_until_synchronisation = 0
    count_until_evaluation = 0

    history = {'Episode': [], 'Reward': [], 'Loss': []}

    for episode in range(EPISODES):
        total_reward = 0.0
        total_loss = 0.0
        state = env.reset()
        iterations = 0
        done = False

        while done != True:
            count_until_synchronisation += 1
            count_until_evaluation += 1

            # Take an action
            if np.random.rand(1) < EPSILON:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    output = model(torch.tensor(state, dtype = torch.float)).numpy()
                action = np.argmax(output)

            # Observe new state and reward + store into replay_buffer
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            replay_buffer.push(state, action, reward, next_state, done)
            state = next_state

            if count_until_synchronisation % SYNCHRONISATION == 0:
                target_model.load_state_dict(model.state_dict())

            if count_until_evaluation % EVALUATION == 0:
                loss = train(replay_buffer = replay_buffer, model = model,
                             target_model = target_model, discount_factor = DISCOUNT_FACTOR,
                             mse = mse, optimizer = optimizer)
                total_loss += loss

            iterations += 1

        print(f"Episode {episode} is concluded in {iterations} iterations with a total reward of {total_reward}")

        if EPSILON > MIN_EPSILON:
            EPSILON *= DECAY_RATE

        history['Episode'].append(episode)
        history['Reward'].append(total_reward)
        history['Loss'].append(total_loss)

    # Plot the Loss + Reward per Episode
    fig, ax = plt.subplots(figsize = (10, 6))
    ax.plot(history['Episode'], history['Reward'], label = "Reward")
    ax.set_xlabel('Episodes', fontsize = 15)
    ax.set_ylabel('Total Reward per Episode', fontsize = 15)
    plt.legend(prop = {'size': 15})
    plt.show()

    fig, ax = plt.subplots(figsize = (10, 6))
    ax.plot(history['Episode'], history['Loss'], label = "Loss")
    ax.set_xlabel('Episodes', fontsize = 15)
    ax.set_ylabel('Total Loss per Episode', fontsize = 15)
    plt.legend(prop = {'size': 15})
    plt.show()


if __name__ == "__main__":
    main()
Your code looks fine; I think your hyperparameters are not ideal. I would change two, potentially three things:
If I'm not mistaken, you update your target net every 32 steps. That interval is far too short, I think. In the original paper by Mnih et al., they do a hard update every 10k steps. Think about it: the target net is used to calculate the loss, so you essentially change the loss function every 32 steps, which would be more than once per episode.
Your replay buffer size is pretty small. I would set it to 100k or 1M, even if that is more than you intend to train for. If the replay buffer is too small, you will lose the older transitions, which can cause your network to "forget" things it already learned. Not sure how dramatic this is for CartPole, but maybe worth trying...
The learning rate could also be lower; I am using 1e-4 with RMSProp. Generally, changing the optimizer can also yield different results.
Hope this helps, good luck :)
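For reference, a quick sketch of what those changes could look like against the question's code; the exact numbers are illustrative rather than tuned, and model refers to the QNet instance created in main():
import torch.optim as optim

SYNCHRONISATION = 10_000      # hard target update every 10k steps instead of every ~33
MAX_REPLAY_SIZE = 100_000     # much larger replay buffer
LEARNING_RATE = 1e-4          # lower learning rate

# swapping the optimizer is a one-line change
optimizer = optim.RMSprop(model.parameters(), lr = LEARNING_RATE)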
Your code looks fine and well written, and the hyperparameters seem reasonable (except maybe for the update frequency, which may be too low), but I think the Q-network is quite small with a single dense layer.
A deeper model could likely do better (probably not more than 3-4 layers though), but you said that you already tried different network sizes.
Another thing that comes to mind is the target update. You are doing a hard update every n steps; a soft update may help a bit, but I wouldn't count on it.
You can also try lowering the learning rate a bit, but I imagine you already did that.
My suggestions are:
try less frequent target updates
try a larger network (deeper, something like 2-3 dense layers with 32 nodes each), if you haven't already
look into soft target updates (Polyak averaging and so on); see the sketch below
try your implementation in other simple gym environments and check whether its behavior is still the same.
Sadly DQN is not ideal and won't converge for many problems, but it should be able to solve CartPole.
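A minimal sketch of a soft (Polyak) target update in PyTorch; the tau value is only illustrative, and target_model/model are the two QNet instances from the question:
import torch

@torch.no_grad()
def soft_update(target_model, model, tau = 0.005):
    # Polyak averaging: target <- tau * online + (1 - tau) * target
    for target_param, param in zip(target_model.parameters(), model.parameters()):
        target_param.data.mul_(1.0 - tau).add_(tau * param.data)

# called every training step instead of the periodic load_state_dict copy:
# soft_update(target_model, model)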
This answer might be a little late for the OP, but the implementation does not take terminal states into account. In some environments (e.g. CartPole), the only signal that the agent is doing something wrong comes from the terminal state. In CartPole the agent gets a +1 reward for every step it takes, but it has no idea about time, so if you don't create zero-value targets for terminal states then the value tries to converge towards an infinite sum of discounted +1 for every state.
The simplest way to include zero values for terminal states is to also sample the batch of done flags and multiply the bootstrapped next-state values by (1 - done), so that a terminal transition's target is just its immediate reward.
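As a sketch, the target computation in the question's train() function could be adjusted like this, assuming the done flags returned by sample_batch() are kept instead of being discarded:
state, action, reward, next_state, done = replay_buffer.sample_batch()

with torch.no_grad():
    max_next_q_values = target_model(torch.tensor(next_state, dtype = torch.float)).max(1)[0]

done_mask = torch.tensor(done, dtype = torch.float)
q_target_value = torch.tensor(reward, dtype = torch.float) \
                 + discount_factor * (1.0 - done_mask) * max_next_q_values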
I have a model trained with multiple LayerNormalization layers, and I am unsure if a simple weight transfer works properly when activating dropout for prediction. This is the code I am using:
from tensorflow.keras.models import load_model, Model
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization, Input
model0 = load_model(path + 'model0.h5')
OW = model0.get_weights()
inp = Input(shape=(10,))
D1 = Dense(760, activation='softplus')(inp)
DO1 = Dropout(0.29)(D1,training=True)
N1 = LayerNormalization()(DO1)
D2 = Dense(460,activation='softsign')(N1)
DO2 = Dropout(0.16)(D2,training=True)
N2 = LayerNormalization()(DO2)
D3 = Dense(664,activation='softsign')(N2)
DO3 = Dropout(0.09)(D3,training=True)
N3 = LayerNormalization()(DO3)
out = Dense(1,activation='linear')(N3)
mP = Model(inp,out)
mP.set_weights(OW)
mP.compile(loss='mse',optimizer='Adam')
mP.save(path + 'new_model.h5')
If I set training=False on the dropout layers, the model makes predictions identical to the original model. However, when the code is written as above, the mean prediction is not close to the original/deterministic prediction.
Previous models that I developed with dropout active at prediction time had mean probabilistic predictions nearly identical to the deterministic model. Is there something I am doing incorrectly, or is this an issue with using LayerNormalization together with active dropout? As far as I know, LayerNormalization has trainable parameters, so I didn't know whether active dropout interferes with them. If it does, I am not sure how to remedy this.
This segment of code is for running a quick test and plotting the results:
inputs = np.zeros(shape=(1,10),dtype='float32')
inputsP = np.zeros(shape=(1000,10),dtype='float32')
opD = mD.predict(inputs)[0,0]
opP = mP.predict(inputsP).reshape(1000)
print('Deterministic: %.4f Probabilistic: %.4f' % (opD,np.mean(opP)))
plt.scatter(0,opD,color='black',label='Det',zorder=3)
plt.scatter(0,np.mean(opP),color='red',label='Mean prob',zorder=2)
plt.errorbar(0,np.mean(opP),yerr=np.std(opP),color='red',zorder=2,markersize=0, capsize=20,label=r'$\sigma$ bounds')
plt.grid(axis='y',zorder=0)
plt.legend()
plt.tick_params(axis='x',labelsize=0,labelcolor='white',color='white',width=0,length=0)
The resulting output is shown below.
Deterministic: -0.9732 Probabilistic: -0.9011
Edit to my answer:
I think the problem is just under-sampling from the model. The standard deviation of the predictions is directly tied to the dropout rate, so the number of predictions you need to approximate the deterministic model goes up as well. If you run an extreme version of the test code below with dropout set to 0.7 for each dropout layer, 100,000 samples are no longer enough to approximate the deterministic mean to within 10^-3, and the standard deviation of the predictions gets much larger.
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Input
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
GPUs = tf.config.experimental.list_physical_devices('GPU')
for gpu in GPUs:
    tf.config.experimental.set_memory_growth(gpu, True)
inp = Input(shape=(10,))
D1 = Dense(760, activation='softplus')(inp)
D2 = Dense(460, activation='softsign')(D1)
D3 = Dense(664, activation='softsign')(D2)
out = Dense(1, activation='linear')(D3)
mP = Model(inp, out)
mP.compile(loss='mse', optimizer='Adam')
inp = Input(shape=(10,))
D1 = Dense(760, activation='softplus')(inp)
DO1 = Dropout(0.29)(D1,training=False)
D2 = Dense(460, activation='softsign')(DO1)
DO2 = Dropout(0.16)(D2,training=True)
D3 = Dense(664, activation='softsign')(DO2)
DO3 = Dropout(0.09)(D3,training=True)
out = Dense(1, activation='linear')(DO3)
mP2 = Model(inp, out)
mP2.set_weights(mP.get_weights())
mP2.compile(loss='mse', optimizer='Adam')
data = np.zeros(shape=(100000, 10),dtype='float32')
res = mP.predict(data).reshape(data.shape[0])
res2 = mP2.predict(data).reshape(data.shape[0])
print (np.abs(res[0] - res2.mean()))
I am currently trying to create a simple ANN learning environment for reinforcement learning. I have already used neural networks for fitting, substituting a neural network for a physical model. Now I would like to build a simple reinforcement learning model out of curiosity.
To create this model, I thought it would be a good option to change the loss function so that it does not compute the difference between expectation and model output, but instead runs a short simulation for a few rounds and scores the model against a specific target. In the example code below, the model controls a simple mass-spring-damper system that starts with a random excitation and speed, and the model can exert a force on it. Points are awarded based on the distance from the equilibrium. At the end I invert the points by dividing one by the number of points earned. I am not sure if this is the right approach, but I wanted to try it anyway for the sake of learning. Now I get the error message "No gradients provided for any variable", and I am not sure how to solve it.
Here is my code:
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Input, Dense, Conv2D, Reshape, concatenate, Flatten, UpSampling2D, AveragePooling2D, LayerNormalization
import random

# Physical parameters
m = 1     # kg
k = 1     # N/m
c = 0.01
dt = 0.01

opt = keras.optimizers.Adam(learning_rate=0.01)

def getnewstate(u, v, f):
    # Calculate new state of the mass-spring-damper system
    a = (f - v*c - k*u)/m
    v = v + a*dt
    u = u + v*dt
    return (u, v)

def generatemodel():
    # Generate simple keras model
    kernel_initializer = tf.keras.initializers.RandomNormal(stddev=0.01)
    bias_initializer = tf.keras.initializers.Zeros()
    InputLayer = Input(shape=(2))
    Outputlayer = Dense(1, activation='linear')(InputLayer)
    model = Model(inputs=InputLayer, outputs=Outputlayer)
    return model

def lossfunction(u, v, model):
    # Custom loss function
    loss = 0
    t = 0
    t_last = 0
    # do for 100 timesteps (to see if it runs at all)
    for j in range(100):
        x = []
        x.append(np.array([u, v]))
        x = np.array(x)
        f = model(x)
        f = f.numpy()[0][0]
        (u, v) = getnewstate(u, v, f)
        points = 1000/(abs(u)+1)
        loss = loss + 1/points
        t += dt
    return loss

def dotraining(model):
    # training loop
    for epoch in range(100):
        print("\nStart of epoch %d" % (epoch,))
        start_time = time.time()
        loss_value = 0
        # Iterate over the batches of the dataset.
        for step in range(100):
            with tf.GradientTape() as tape:
                loss_value = []
                for i in range(10):
                    # Randomize starting condition
                    u = random.random() - 0.5
                    v = random.random() - 0.5
                    x = []
                    x.append(np.array([u, v]))
                    x = np.array(x)
                    # feed model
                    logits = model(x, training=True)
                    # calculate loss
                    loss_value.append(lossfunction(u, v, model))
                print(step)
                print(loss_value)
                loss = loss_value
                loss = tf.convert_to_tensor(loss)
            grads = tape.gradient(loss, model.trainable_weights)
            opt.apply_gradients(zip(grads, model.trainable_weights))

            # Log every 200 batches.
            if step % 200 == 0:
                print(
                    "Training loss (for one batch) at step %d: %.4f"
                    % (step, float(loss_value))
                )
                print("Seen so far: %d samples" % ((step + 1) * 64))
        print("Time taken: %.2fs" % (time.time() - start_time))

model = generatemodel()

x = []
x.append(np.array([1.0, 2.0]))
print(np.shape(x))
f = model(np.array(x))

dotraining(model)
The problem is that when you cast f to numpy here:
f = f.numpy()[0][0]
it stops being a tensor, and TensorFlow no longer tracks its gradient.
For TensorFlow to compute gradients, you must get from the inputs to the loss using only tensor operations.
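As a minimal sketch of what "only tensor operations" means here, assuming the same m, k, c and dt constants from the question: keep u, v and f as tensors inside the taped loop and never call .numpy() on them.
import tensorflow as tf

def getnewstate_tf(u, v, f):
    # Same mass-spring-damper update, but it works on tensors and therefore
    # stays differentiable under tf.GradientTape.
    a = (f - v*c - k*u)/m
    v = v + a*dt
    u = u + v*dt
    return u, v

# inside lossfunction(), replace the numpy conversion with something like:
# x = tf.reshape(tf.stack([u, v]), (1, 2))   # build the input with tensor ops too
# f = model(x)[0, 0]                         # still a tensor, no .numpy()
# (u, v) = getnewstate_tf(u, v, f)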
I am trying to create a Convolutional Neural Network to classify what language a certain "word" is from. There are two files ("english_words.txt" and "spanish_words.txt") which each contain about 60,000 words each. I have converted each word into a 29-dimensional vector where each element is a number between 0 and 1. I am training the model for 500 epochs with the optimizer "adam". However, when I train the model, the loss tends to hover around 0.7 and the accuracy around 0.5, and no matter how long I train it for, these metrics will not improve. Here is the code:
import keras
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
import re

train_labels = []
train_data = []

with open("english_words.txt") as words:
    full_words = words.read()
    full_words = full_words.split("\n")

    # all of the labels are just 1.
    # we now need to encode them into 29 dimensional vectors.
    vector = []
    i = 0
    for word in full_words:
        train_labels.append([1,0])
        for letter in word:
            vector.append((ord(letter) - 96) * (1.0 / 26.0))
            i += 1
        if (i < 29):
            for x in range(0, 29 - i):
                vector.append(0)
        train_data.append(vector)
        vector = []
        i = 0

with open("spanish_words.txt") as words:
    full_words = words.read()
    full_words = full_words.replace(' ', '')
    full_words = full_words.replace('\n', ',')
    full_words = full_words.split(",")

    vector = []
    for word in full_words:
        train_labels.append([0,1])
        for letter in word:
            vector.append((ord(letter) - 96) * (1.0 / 26.0))
            i += 1
        if (i < 29):
            for x in range(0, 29 - i):
                vector.append(0)
        train_data.append(vector)
        vector = []
        i = 0

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = np.empty(a.shape, dtype=a.dtype)
    shuffled_b = np.empty(b.shape, dtype=b.dtype)
    permutation = np.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

train_data = np.asarray(train_data, dtype=np.float32)
train_labels = np.asarray(train_labels, dtype=np.float32)

train_data, train_labels = shuffle_in_unison(train_data, train_labels)
print(train_data.shape, train_labels.shape)

model = Sequential()
model.add(Dense(29, input_shape=(29,)))
model.add(Dense(60))
model.add(Dense(40))
model.add(Dense(25))
model.add(Dense(2))

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.summary()
model.fit(train_data, train_labels, epochs=500, batch_size=128)
model.save("language_predictor.model")
For some extra info, I am running Python 3.x with TensorFlow 1.15 and Keras 1.15 on Windows x64.
I can see several potential problems with your code.
You added several Dense layers one after another, but you really need to also include a non-linear activation function with the parameter activation= .... In the absence of any non-linear activation functions, all those fully-connected Dense layers will mathematically collapse into one single linear Dense layer incapable of learning a non-linear decision boundary.
In general, if you see your loss and accuracy not making any improvement or even getting worse, then the first thing to try is to reduce your learning rate.
You don't need to necessarily implement your own shuffling function. The Keras fit() function can do it if you use the shuffle=True parameter.
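A sketch of those suggestions applied to the question's model; the layer sizes are kept as-is, and the ReLU activations, the softmax output (so it pairs with categorical_crossentropy), the reduced Adam learning rate and shuffle=True are the changes being illustrated (train_data and train_labels are the arrays built in the question):
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(29, activation='relu', input_shape=(29,)))
model.add(Dense(60, activation='relu'))
model.add(Dense(40, activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(2, activation='softmax'))     # softmax output for the two classes

model.compile(optimizer=Adam(lr=1e-4),        # reduced learning rate (learning_rate= on newer Keras)
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Keras can shuffle for you; no custom shuffle_in_unison needed
model.fit(train_data, train_labels, epochs=500, batch_size=128, shuffle=True)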
In addition to the points mentioned by stackoverflowuser2010:
I find this a very good read and highly suggest checking the mentioned points: 37 Reasons why your Neural Network is not working
Center your input data: Compute a component-wise mean vector and subtract it from every input.
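For the centering point, a minimal sketch using the train_data array from the question:
feature_mean = train_data.mean(axis=0)          # component-wise mean over the 29 features
train_data_centered = train_data - feature_mean
# subtract the same feature_mean from any validation or test data as well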
I tried to write a Policy Gradient algorithm for the video game Pong.
Here's the Code:
import tensorflow as tf
import gym
import numpy as np
import matplotlib.pyplot as plt
from os import getcwd

num_episodes = 1000
learning_rate = 0.01
rewards = []

env_name = 'Pong-v0'
env = gym.make(env_name)

x = tf.placeholder(tf.float32, (None,) + env.observation_space.shape)
y = tf.placeholder(tf.float32, (None, env.action_space.n))

def net(x):
    layer1 = tf.layers.flatten(x)
    layer2 = tf.layers.dense(layer1, 200, activation=tf.nn.softmax)
    layer3 = tf.layers.dense(layer2, env.action_space.n, activation=tf.nn.softmax)
    return layer3

logits = net(x)
loss = tf.losses.sigmoid_cross_entropy(y, logits)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
sess = tf.Session()

with tf.device('/device:GPU:0'):
    sess.run(init)
    for episode in range(num_episodes):
        print('episode:', episode+1)
        total_reward = 0
        losses = []
        training_data = []
        observation = env.reset()
        while True:
            if max(0.1, (episode+1)/num_episodes) > np.random.uniform():
                probs = sess.run(logits, feed_dict={x: [observation]})[0]
                action = np.argmax(probs)
            else:
                action = env.action_space.sample()
            onehot = np.zeros(env.action_space.n)
            onehot[action] = 1
            training_data.append([observation, onehot])
            observation, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        if total_reward >= 0:
            learning_rate = 0.01
        else:
            learning_rate = -0.01
        for sample in training_data:
            l, _ = sess.run([loss, train], feed_dict={x: [sample[0]], y: [sample[1]]})
            losses.append(l)
            print('loss:', l)
        print('average loss:', sum(losses)/len(losses))
        saver.save(sess, getcwd()+'/model.ckpt')
        rewards.append(total_reward)

plt.plot(range(episode+1), rewards)
plt.ylabel('total reward')
plt.xlabel('episodes')
plt.savefig(getcwd()+'/reward_plot.png')
But after I trained my network, the plot the script produced seemed to suggest that the network got worse towards the end. Also, during the last episode the loss was the same for all training examples (~0.68), and when I test the network, the player's paddle just sits there motionless. Is there any way I can improve my code?
I would ask you to familiarize yourself with how to code neural networks using TensorFlow, because that is where the problem lies. You use activation=tf.nn.softmax in both network layers, but softmax should only be on the terminal layer (since you are trying to find the maximum action probability). You can change it to tf.nn.relu in the second layer. There is a bigger problem with the learning_rate:
if total_reward >= 0:
    learning_rate = 0.01
else:
    learning_rate = -0.01
A negative learning rate makes absolutely no sense. You want the learning rate to be positive (you can use a constant 0.01 for now).
Also, another comment: you have not mentioned the observation_space shape, but I am going to assume it is a 2D matrix. Then you could reshape it before feeding it into x, and you would not need the tf.layers.flatten call.
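A sketch of the corrected network and learning rate, based on the question's code; only the hidden-layer activation and the constant positive learning rate change:
learning_rate = 0.01   # keep it positive and constant

def net(x):
    layer1 = tf.layers.flatten(x)
    layer2 = tf.layers.dense(layer1, 200, activation=tf.nn.relu)                     # hidden layer
    layer3 = tf.layers.dense(layer2, env.action_space.n, activation=tf.nn.softmax)   # terminal layer only
    return layer3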