I am currently reading Reinforcement Learning: An Introduction (RL:AI) and am trying to reproduce the first example, an n-armed bandit with simple reward averaging.
Averaging
new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)
To reproduce the graph from the PDF, I generate 2000 bandit tasks, let each agent play every task for 1000 steps (as described in the PDF), and then average the reward and the percentage of optimal actions over the tasks.
In the PDF, the result looks like this:
However, I am not able to reproduce this. With simple averaging, all the agents with exploration (epsilon > 0) actually play worse than an agent without exploration. This is weird, because exploration should allow agents to leave a local optimum more often and find better actions.
As you can see below, this is not the case for my implementation. Also note that I have added agents that use weighted averaging. These work, but even in that case raising epsilon degrades the agents' performance.
Any ideas what's wrong in my code?
The code (MVP)
from abc import ABC
from typing import List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from multiprocessing.pool import Pool


class Strategy(ABC):
    def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
        raise NotImplementedError()


class Averaging(Strategy):
    def __str__(self):
        return 'avg'

    def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
        current = estimates[action]
        return current + 1.0 / step * (reward - current)


class WeightedAveraging(Strategy):
    def __init__(self, alpha):
        self.alpha = alpha

    def __str__(self):
        return 'weighted-avg_alpha=%.2f' % self.alpha

    def update_estimates(self, step: int, estimates: List[float], action: int, reward: float):
        current = estimates[action]
        return current + self.alpha * (reward - current)


class Agent:
    def __init__(self, nb_actions, epsilon, strategy: Strategy):
        self.nb_actions = nb_actions
        self.epsilon = epsilon
        self.estimates = np.zeros(self.nb_actions)
        self.strategy = strategy

    def __str__(self):
        return ','.join(['eps=%.2f' % self.epsilon, str(self.strategy)])

    def get_action(self):
        best_known = np.argmax(self.estimates)
        if np.random.rand() < self.epsilon and len(self.estimates) > 1:
            explore = best_known
            while explore == best_known:
                explore = np.random.randint(0, len(self.estimates))
            return explore
        return best_known

    def update_estimates(self, step, action, reward):
        self.estimates[action] = self.strategy.update_estimates(step, self.estimates, action, reward)

    def reset(self):
        self.estimates = np.zeros(self.nb_actions)


def play_bandit(agent, nb_arms, nb_steps):
    agent.reset()
    bandit_rewards = np.random.normal(0, 1, nb_arms)
    rewards = list()
    optimal_actions = list()
    for step in range(1, nb_steps + 1):
        action = agent.get_action()
        reward = bandit_rewards[action] + np.random.normal(0, 1)
        agent.update_estimates(step, action, reward)
        rewards.append(reward)
        optimal_actions.append(np.argmax(bandit_rewards) == action)
    return pd.DataFrame(dict(
        optimal_actions=optimal_actions,
        rewards=rewards
    ))


def main():
    nb_tasks = 2000
    nb_steps = 1000
    nb_arms = 10
    fig, (ax_rewards, ax_optimal) = plt.subplots(2, 1, sharex='col', figsize=(8, 9))
    pool = Pool()
    agents = [
        Agent(nb_actions=nb_arms, epsilon=0.00, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.01, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.10, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.00, strategy=WeightedAveraging(0.5)),
        Agent(nb_actions=nb_arms, epsilon=0.01, strategy=WeightedAveraging(0.5)),
        Agent(nb_actions=nb_arms, epsilon=0.10, strategy=WeightedAveraging(0.5)),
    ]
    for agent in agents:
        print('Agent: %s' % str(agent))
        args = [(agent, nb_arms, nb_steps) for _ in range(nb_tasks)]
        results = pool.starmap(play_bandit, args)
        df_result = sum(results) / nb_tasks
        df_result.rewards.plot(ax=ax_rewards, label=str(agent))
        df_result.optimal_actions.plot(ax=ax_optimal)
    ax_rewards.set_title('Rewards')
    ax_rewards.set_ylabel('Average reward')
    ax_rewards.legend()
    ax_optimal.set_title('Optimal action')
    ax_optimal.set_ylabel('% optimal action')
    ax_optimal.set_xlabel('steps')
    plt.xlim([0, nb_steps])
    plt.show()


if __name__ == '__main__':
    main()
In the formula for the update rule
new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)
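the parameter step should be the number of times that the particular action has been taken, not the overall step number of the simulation. With the overall step number, the step size 1.0 / step is too small whenever an action has been taken fewer times than there have been steps, so the estimates of rarely tried actions barely move. You therefore need to store a per-action count alongside the action values and use it for the update.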
This can also be seen from the pseudo-code box at the end of chapter 2.4 Incremental Implementation:
(source: Richard S. Sutton and Andrew G. Barto: Reinforcement Learning - An Introduction, second edition, 2018, Chapter 2.4 Incremental Implementation)
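As a rough illustration of the fix described above, here is a minimal sketch that keeps one counter per action and uses it as the step-size denominator. It uses only the standard library; the names CountingAgent and play_bandit_once are made up for this example and are not from the question's code. Exploration here picks uniformly among all actions, which is an assumption that differs slightly from the question's get_action.

```python
import random


class CountingAgent:
    """Epsilon-greedy agent that keeps one visit counter per action."""

    def __init__(self, nb_actions, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.estimates = [0.0] * nb_actions
        self.counts = [0] * nb_actions  # N(a): how often each action was taken

    def get_action(self):
        # Explore uniformly over all actions with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.estimates))
        # Otherwise pick the action with the highest current estimate.
        return max(range(len(self.estimates)), key=self.estimates.__getitem__)

    def update(self, action, reward):
        # The step size is 1 / N(a), not 1 / (global step number).
        self.counts[action] += 1
        step_size = 1.0 / self.counts[action]
        self.estimates[action] += step_size * (reward - self.estimates[action])


def play_bandit_once(agent, nb_arms, nb_steps, rng):
    """Play one randomly generated bandit; return the fraction of optimal actions."""
    true_values = [rng.gauss(0, 1) for _ in range(nb_arms)]
    best = max(range(nb_arms), key=true_values.__getitem__)
    optimal = 0
    for _ in range(nb_steps):
        action = agent.get_action()
        reward = true_values[action] + rng.gauss(0, 1)
        agent.update(action, reward)
        optimal += action == best
    return optimal / nb_steps


if __name__ == "__main__":
    rng = random.Random(0)
    agent = CountingAgent(nb_actions=10, epsilon=0.1, rng=rng)
    print(play_bandit_once(agent, nb_arms=10, nb_steps=1000, rng=rng))
```

With this update, each estimate is exactly the sample mean of the rewards observed for that action, which is what the incremental formula is meant to compute.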
Related
I have set up zipline locally in PyCharm. The simulations work, and I have access to premium data from Quandl (which was updated automatically when I entered my API key). My question now is: how do I make a pipeline locally using zipline?
Zipline's documentation is challenging, and Zipline.io is also down (as of 2021-04-05). Fortunately, Blueshift has documentation and sample code that show how to make a pipeline that can be run locally:
The Blueshift sample pipeline code and the pipelines library it relies on are linked under Sources in the code below.
The Zipline documentation, challenging as it is, is still useful and can be accessed from MLTrading (archived documentation here).
The full pipeline sample code from Blueshift, modified to run locally through PyCharm, is below. Please note, as I'm sure you're already aware, that this is a bad strategy and you shouldn't trade on it. It does, however, show how to instantiate pipelines locally.
"""
Title: Classic (Pedersen) time-series momentum (equal weights)
Description: This strategy uses past returns and go long (short)
the positive (negative) n-percentile
Style tags: Momentum
Asset class: Equities, Futures, ETFs, Currencies
Dataset: All
"""
"""
Sources:
Overall Algorithm here:
https://github.com/QuantInsti/blueshift-demo-strategies/blob/master/factors/time_series_momentum.py
Custom (Ave Vol Filter, Period Returns) Functions Here:
https://github.com/QuantInsti/blueshift-demo-strategies/blob/master/library/pipelines/pipelines.py
"""
import numpy as np
from zipline.pipeline import CustomFilter, CustomFactor, Pipeline
from zipline.pipeline.data import EquityPricing
from zipline.api import (
    order_target_percent,
    schedule_function,
    date_rules,
    time_rules,
    attach_pipeline,
    pipeline_output,
)


def average_volume_filter(lookback, amount):
    """
    Returns a custom filter object for volume-based filtering.

    Args:
        lookback (int): lookback window size
        amount (int): amount to filter (high-pass)

    Returns:
        A custom filter object

    Examples::

        # from library.pipelines.pipelines import average_volume_filter
        pipe = Pipeline()
        volume_filter = average_volume_filter(200, 1000000)
        pipe.set_screen(volume_filter)
    """
    class AvgDailyDollarVolumeTraded(CustomFilter):
        inputs = [EquityPricing.close, EquityPricing.volume]

        def compute(self, today, assets, out, close_price, volume):
            dollar_volume = np.mean(close_price * volume, axis=0)
            high_volume = dollar_volume > amount
            out[:] = high_volume

    return AvgDailyDollarVolumeTraded(window_length=lookback)


def period_returns(lookback):
    """
    Returns a custom factor object for computing simple returns over
    period.

    Args:
        lookback (int): lookback window size

    Returns:
        A custom factor object.

    Examples::

        # from library.pipelines.pipelines import period_returns
        pipe = Pipeline()
        momentum = period_returns(200)
        pipe.add(momentum,'momentum')
    """
    class SignalPeriodReturns(CustomFactor):
        inputs = [EquityPricing.close]

        def compute(self, today, assets, out, close_price):
            start_price = close_price[0]
            end_price = close_price[-1]
            returns = end_price / start_price - 1
            out[:] = returns

    return SignalPeriodReturns(window_length=lookback)


def initialize(context):
    '''
    A function to define things to do at the start of the strategy
    '''
    # The context variables can be accessed by other methods
    context.params = {'lookback': 12,
                      'percentile': 0.1,
                      'min_volume': 1E7
                      }

    # Call rebalance function on the first trading day of each month
    schedule_function(strategy, date_rules.month_start(),
                      time_rules.market_close(minutes=1))

    # Set up the pipe-lines for strategies
    attach_pipeline(make_strategy_pipeline(context),
                    name='strategy_pipeline')


def strategy(context, data):
    generate_signals(context, data)
    rebalance(context, data)


def make_strategy_pipeline(context):
    pipe = Pipeline()

    # get the strategy parameters
    lookback = context.params['lookback'] * 21
    v = context.params['min_volume']

    # Set the volume filter
    volume_filter = average_volume_filter(lookback, v)

    # compute past returns
    momentum = period_returns(lookback)
    pipe.add(momentum, 'momentum')
    pipe.set_screen(volume_filter)

    return pipe


def generate_signals(context, data):
    try:
        pipeline_results = pipeline_output('strategy_pipeline')
    except Exception:
        # No pipeline output available: trade nothing this period
        context.long_securities = []
        context.short_securities = []
        return

    p = context.params['percentile']
    momentum = pipeline_results

    long_candidates = momentum[momentum > 0].dropna().sort_values('momentum')
    short_candidates = momentum[momentum < 0].dropna().sort_values('momentum')

    n_long = len(long_candidates)
    n_short = len(short_candidates)
    n = int(min(n_long, n_short) * p)

    if n == 0:
        print("{}, no signals".format(data.current_dt))
        context.long_securities = []
        context.short_securities = []
        # Return here: with n == 0, index[-n:] below would select every candidate
        return

    context.long_securities = long_candidates.index[-n:]
    context.short_securities = short_candidates.index[:n]


def rebalance(context, data):
    # weighing function
    n = len(context.long_securities)
    if n < 1:
        return

    weight = 0.5 / n

    # square off old positions if any
    for security in context.portfolio.positions:
        if security not in context.long_securities and \
                security not in context.short_securities:
            order_target_percent(security, 0)

    # Place orders for the new portfolio
    for security in context.long_securities:
        order_target_percent(security, weight)
    for security in context.short_securities:
        order_target_percent(security, -weight)
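One detail in generate_signals is easy to miss: the long and short baskets are taken with the slices [-n:] and [:n], and when n is 0 the slice [-n:] returns the whole sequence rather than an empty one, because -0 is just 0. The small sketch below shows the behaviour on a plain list; top_and_bottom is a hypothetical helper written for this illustration, not part of zipline or Blueshift.

```python
def top_and_bottom(sorted_items, n):
    """Return (bottom_n, top_n) from an ascending-sorted list; both empty when n <= 0."""
    if n <= 0:
        return [], []
    return sorted_items[:n], sorted_items[-n:]


ranked = ["A", "B", "C", "D"]

print(ranked[-0:])                # ['A', 'B', 'C', 'D'] -- the whole list, not []
print(ranked[:0])                 # []
print(top_and_bottom(ranked, 0))  # ([], [])
print(top_and_bottom(ranked, 1))  # (['A'], ['D'])
```

The same slicing rule applies to a pandas Index, so a guard for n == 0 before slicing keeps an empty signal from turning into a position in every candidate.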
I want to solve a multi-objective optimization problem using DEAP, a Python-based framework. Because the evaluation is time-consuming, I need to use all of my CPU power for the computation. So I used the multiprocessing library as suggested in the DEAP documentation and this example, but it results in PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed.
My full code is too long to write down here, but the following code is similar to it and results in the same error. Can you please tell me where I am making a mistake?
Thanks in advance.
import multiprocessing
from deap import creator, base, tools, algorithms
import random
import numpy  # needed for the statistics registered in main()
import matplotlib.pyplot as plt


def TEST(dec_var):
    return dec_var[0]**2+dec_var[1]**2,(dec_var[0]-2)**2+dec_var[1]**2


def feasible(dec_var):
    if all(i>0 for i in dec_var):
        return True
    return False


creator.create("FitnessMin", base.Fitness, weights=(-1.0,-1.0))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox=base.Toolbox()
toolbox.register("uniform", random.uniform, 0.0, 7.0)
toolbox.register("individual",tools.initRepeat,creator.Individual,toolbox.uniform ,n=2)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.1)
toolbox.register("select", tools.selNSGA2)
toolbox.register("evaluate", TEST)
toolbox.decorate("evaluate", tools.DeltaPenalty(feasible,(1000,1000)))


def main(seed=None):
    random.seed(seed)

    NGEN = 250
    MU = 100
    CXPB = 0.9

    stats_func1 = tools.Statistics(key=lambda ind: ind.fitness.values[0])
    stats_func2 = tools.Statistics(key=lambda ind: ind.fitness.values[1])
    stats = tools.MultiStatistics(func1=stats_func1, func2=stats_func2)
    stats.register("avg", numpy.mean, axis=0)
    stats.register("std", numpy.std, axis=0)
    stats.register("min", numpy.min, axis=0)
    stats.register("max", numpy.max, axis=0)

    logbook = tools.Logbook()
    logbook.header = "gen", "evals", "func1","func2"
    logbook.chapters["func1"].header = "min", "max"
    logbook.chapters["func2"].header = "min", "max"

    pop = toolbox.population(n=MU)

    invalid_ind = [ind for ind in pop if not ind.fitness.valid]
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
    for ind, fit in zip(invalid_ind, fitnesses):
        ind.fitness.values = fit

    pop = toolbox.select(pop, len(pop))

    record = stats.compile(pop)
    logbook.record(gen=0, evals=len(invalid_ind), **record)
    print(logbook.stream)

    for gen in range(1, NGEN):
        offspring = tools.selTournamentDCD(pop, len(pop))
        offspring = [toolbox.clone(ind) for ind in offspring]

        for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
            if random.random() <= CXPB:
                toolbox.mate(ind1, ind2)

            toolbox.mutate(ind1)
            toolbox.mutate(ind2)
            del ind1.fitness.values, ind2.fitness.values

        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit

        pop = toolbox.select(pop + offspring, MU)
        record = stats.compile(pop)
        logbook.record(gen=gen, evals=len(invalid_ind), **record)
        print(logbook.stream)

    return pop, logbook


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    toolbox.register("map", pool.map)
    pop, stats = main()
    pool.close()
    print pop
This is noted on the documentation page you linked:
The pickling of lambda function is not yet available in Python.
You are using lambda functions in stats_func1 and stats_func2 – just move them to global functions and try again:
def stats_key_1(ind):
    return ind.fitness.values[0]


def stats_key_2(ind):
    return ind.fitness.values[1]


# ... snip ...

def main(seed=None):
    # ... snip ...
    stats_func1 = tools.Statistics(key=stats_key_1)
    stats_func2 = tools.Statistics(key=stats_key_2)
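To check which objects the standard pickle module accepts before handing them to a process pool, a small probe like the following can help. It is a sketch in Python 3 (the traceback in the question comes from Python 2, where lambdas are equally unpicklable); is_picklable and first_fitness_value are names invented for this example.

```python
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialise obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def first_fitness_value(values):
    # A module-level function is pickled by reference to its importable name.
    return values[0]


if __name__ == "__main__":
    # A lambda has no importable name, so pickle rejects it.
    print(is_picklable(lambda values: values[0]))
    # A named function defined at module level can be looked up again on unpickling.
    print(is_picklable(first_fitness_value))
```

Running the probe on every callable registered in the toolbox is a quick way to find out which one the pool cannot ship to its workers.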
I've been learning RL this summer, and this week I've tried to write a PPO implementation in PyTorch with the help of some GitHub repositories with similar algorithms.
The code runs OpenAI's Lunar Lander, but I have several errors that I have not been able to fix. The biggest one is that the algorithm quickly converges to taking the same action regardless of the state. The other major problem is that even though I'm calling backward() only once, I get an error asking me to set retain_graph to True.
Because of that, I see no improvement in the rewards obtained over 1000 steps. I don't know whether the algorithm needs more steps before an improvement shows.
I'm really sorry if this kind of problem has no place in this forum; I just didn't know where else to post it.
Also, I'm sorry for the messy code. It's my first time writing this kind of algorithm, and I'm fairly new to PyTorch and machine learning in general.
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.distributions import Categorical
import gym


class actorCritic(nn.Module):
    def __init__(self):
        super(actorCritic, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(8, 16),
            nn.Linear(16, 32),
            nn.Linear(32, 64),
            nn.ReLU(inplace=True)
        )
        self.pi = nn.Linear(64, 4)
        self.value = nn.Linear(64, 1)

    def forward(self, x):
        x = self.fc(x)
        pi_1 = self.pi(x)
        pi_out = F.softmax(pi_1, dim=-1)
        value_out = self.value(x)
        return pi_out, value_out


def GAE(rewards, values, masks):
    gamma = 0.99
    lamb = 0.95
    advan_t = 0
    sizes = rewards.size()
    advantages = torch.zeros(1, sizes[1])
    for t in reversed(range(sizes[1])):
        delta = rewards[0][t] + gamma*values[0][t+1]*masks[0][t] - values[0][t]
        advan_t = delta + gamma*lamb*advan_t*mask[0][t]
        advantages[0][t] = advan_t
    real_values = values[:,:sizes[1]] + advantages
    return advantages, real_values


def plot_rewards(rewards):
    plt.figure(2)
    plt.clf()
    plt.plot(rewards)
    plt.pause(0.001)
    plt.savefig('TruePPO 500 steps.png')


def interact(times, states):
    rewards = torch.zeros(1, times)
    actions = torch.zeros(1, times)
    mask = torch.ones(1, times)
    for steps in range(times):
        action_probs, _ = network(states[steps])
        m = Categorical(action_probs)
        action = int(m.sample())
        obs, reward, done, _ = env.step(action)
        if done:
            obs = env.reset()
            mask[0][steps] = 0
        states[steps+1] = torch.from_numpy(obs).float()
        rewards[0][steps] = reward
        actions[0][steps] = action
    return states, rewards, actions, mask


#Parameters
total_steps = 1000
batch_size = 10
env = gym.make('LunarLander-v2')
network = actorCritic()
old_network = actorCritic()
optimizer = torch.optim.Adam(network.parameters(), lr = 0.001)
states = torch.zeros(batch_size+1, 8)
steps = 0
obs_ = env.reset()
obs = torch.from_numpy(obs_).float()
states[0] = obs
reward_means = []
nn_paramD = network.state_dict()
old_network.load_state_dict(nn_paramD)

while steps < total_steps:
    print (steps)
    states, rewards, actions, mask = interact(batch_size, states)
    #calculate values, GAE, normalize advantages, randomize, calculate loss, backprop,
    _, values = network(states)
    values = values.view(-1, batch_size+1)
    advantages, v_targ = GAE(rewards, values, mask)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-5)
    optimizer.zero_grad()
    for n in range(rewards.size()[1]):
        probabilities, _ = network(states[n])
        print (probabilities)
        m = Categorical(probabilities)
        action_prob = m.probs[int(actions[0][n])]
        entropia = m.entropy()
        old_probabilities, _ = old_network(states[n])
        m_old = Categorical(old_probabilities)
        old_action_prob = m.probs[actions[0][n].int()]
        old_action_prob.detach()
        ratio = action_prob / old_action_prob
        surr1 = ratio*advantages[0][n]
        surr2 = torch.clamp(ratio, min = (1.-0.2), max = (1.+0.2))
        policy_loss = -torch.min(surr1, surr2)
        value_loss = 0.5*(values[0][n]-v_targ[0][n])**2
        entropy_loss = -entropia
        total_loss = policy_loss + value_loss + 0.01*entropy_loss
        total_loss.backward(retain_graph = True)
    optimizer.step()
    reward_means.append(rewards.numpy().mean())
    old_network.load_state_dict(nn_paramD)
    nn_paramD = network.state_dict()
    steps += 1
    plot_rewards(reward_means)
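A few things in the update loop above are worth checking against the PPO objective. old_action_prob is read from m (the current policy) rather than m_old, so ratio is always 1; the result of old_action_prob.detach() is discarded, since detach() returns a new tensor instead of modifying the old one; and surr2 is the clipped ratio without the advantage factor. In addition, values is computed once before the loop and reused in every value_loss, so each backward() call walks the same graph again, which is most likely what triggers the retain_graph request. As a reference point, here is a plain-Python sketch of the clipped surrogate for a single sample; the function name and the scalar interface are illustrative only and are not taken from the post.

```python
def clipped_surrogate_loss(new_prob, old_prob, advantage, clip_eps=0.2):
    """PPO clipped policy loss for one (state, action) sample.

    new_prob:  probability of the taken action under the current policy
    old_prob:  probability of the same action under the policy that collected the data
    advantage: advantage estimate for that action
    """
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    surr1 = ratio * advantage
    surr2 = clipped_ratio * advantage  # the advantage multiplies the clipped ratio too
    # The objective is maximised, so the loss is its negative.
    return -min(surr1, surr2)


if __name__ == "__main__":
    # Ratio 2.0 with a positive advantage is clipped to 1.2.
    print(clipped_surrogate_loss(0.6, 0.3, advantage=1.0))
    # With a negative advantage the unclipped term is the smaller one.
    print(clipped_surrogate_loss(0.6, 0.3, advantage=-1.0))
```

In a tensor version the same structure applies, with the old probabilities taken from the old network and detached from the graph.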
I'm trying to solve the 'BipedalWalker-v2' problem from OpenAI using Python and TensorFlow. To solve it I'm implementing an episodic policy gradient algorithm. Because the 'BipedalWalker-v2' actions are continuous, my policy is approximated by a multivariate Gaussian distribution. The mean of this distribution is approximated by a fully connected neural network with the following layers: [input:24,hidden:5,hidden:5,output:4].
My problem is that when I train the agent, the training process gets slower and slower until it almost freezes. My guess is that I'm misusing sess.run and not feeding the batches efficiently, but that is just a guess. My question is: is my guess correct? If it is, how can I improve it? And if it is something else, what is it? I'm not looking for a literal solution; I just want some pointers on how to improve the training.
Thanks in advance.
My computer is an Inspiron 15 7000 Gaming with an NVIDIA GeForce GTX 1050, 8 GB of RAM, and an i5 CPU.
My code:
Libraries:
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np
import gym
import matplotlib.pyplot as plt
Agent class:
class agent_episodic_continuous_action():
    def __init__(self, lr, s_size,a_size,batch_size,dist_type):
        self.stuck = False
        self.gamma = 0.99
        self.dist_type = dist_type
        self.is_brain_present = False
        self.s_size = s_size
        self.batch_size=batch_size
        self.state_in= tf.placeholder(shape=[None,s_size],dtype=tf.float32)
        self.a_size=a_size
        self.reward_holder = tf.placeholder(shape=[None],dtype=tf.float32)
        self.cov = tf.eye(a_size)
        self.reduction = 0.01
        if a_size > 1:
            self.action_holder = tf.placeholder(shape=[None,a_size],dtype=tf.float32)
        else:
            self.action_holder = tf.placeholder(shape=[None],dtype=tf.float32)
        self.gradient_holders = []
        self.optimizer = tf.train.AdamOptimizer(learning_rate=lr)

    def save_model(self,path,sess):
        self.saver.save(sess, path)

    def load_model(self,path,sess):
        self.saver.restore(sess, path)

    def create_brain(self,hidd_layer,hidd_layer_act_fn,output_act_fn):
        self.is_brain_present = True
        hidden_output=slim.stack(self.state_in,slim.fully_connected,hidd_layer,activation_fn=hidd_layer_act_fn)
        self.output = slim.fully_connected(hidden_output,self.a_size,activation_fn=output_act_fn,biases_initializer=None)

    def create_pi_dist(self):
        if self.dist_type == "normal":
            # amplify= tf.pow(slim.fully_connected(self.output,1,activation_fn=None,biases_initializer=None),2)
            mean= self.output
            #cov =tf.eye(self.a_size,batch_shape=[self.batch_size])*amplify
            normal = tf.contrib.distributions.MultivariateNormalFullCovariance(
                loc=mean,
                covariance_matrix=self.cov*self.reduction)
            self.dist = normal

    def create_loss(self):
        self.loss = -tf.reduce_mean(tf.log(self.dist.prob(self.action_holder))*self.reward_holder)

    def get_gradients_holder(self):
        for idx,var in enumerate(self.tvars):
            placeholder = tf.placeholder(tf.float32,name=str(idx)+'_holder')
            self.gradient_holders.append(placeholder)

    def sample_action(self,sess,state):
        sample_action= sess.run(self.dist.sample(),feed_dict={self.state_in:state})
        return sample_action

    def calculate_loss_gradient(self):
        self.gradients = tf.gradients(self.loss,self.tvars)

    def update_weights(self):
        self.update_batch = self.optimizer.apply_gradients(zip(self.gradients,self.tvars))
        return self.update_batch

    def memorize_data(self,episode,first):
        if first:
            self.episode_history = episode
            self.stuck = False
        else:
            self.episode_history = np.vstack((self.episode_history,episode))

    def shuffle_memories(self):
        np.random.shuffle(self.episode_history)

    def create_graph_connections(self):
        if self.is_brain_present:
            self.create_pi_dist()
            self.create_loss()
            self.tvars = tf.trainable_variables()
            self.calculate_loss_gradient()
            self.saver = tf.train.Saver()
            self.update_weights()
        else:
            print("initialize brain first")
        self.init = tf.global_variables_initializer()

    def memory_batch_generator(self):
        total=self.episode_history.shape[0]
        amount_of_batches= int(total/self.batch_size)
        for i in range(amount_of_batches+1):
            if i < amount_of_batches:
                top=(i+1)*self.batch_size
                bottom =i*self.batch_size
                yield (self.episode_history[bottom:top,0:self.s_size],
                       self.episode_history[bottom:top,self.s_size:self.s_size+self.a_size],
                       self.episode_history[bottom:top,self.s_size+self.a_size:self.s_size+self.a_size+1],
                       self.episode_history[bottom:top,self.s_size+self.a_size+1:])
            else:
                yield (self.episode_history[top:,0:self.s_size],
                       self.episode_history[top:,self.s_size:self.s_size+self.a_size],
                       self.episode_history[top:,self.s_size+self.a_size:self.s_size+self.a_size+1],
                       self.episode_history[top:,self.s_size+self.a_size+1:])

    def train_with_current_memories(self,sess):
        self.sess = sess
        for step_sample_batch in self.memory_batch_generator():
            sess.run(self.update_weights(), feed_dict={self.state_in:step_sample_batch[0],self.action_holder:step_sample_batch[1],self.reward_holder:step_sample_batch[2].reshape([step_sample_batch[2].shape[0]])})

    def get_returns(self):
        self.episode_history[:,self.s_size+self.a_size:self.s_size+self.a_size+1] = self.discount_rewards(self.episode_history[:,self.s_size+self.a_size:self.s_size+self.a_size+1])

    def discount_rewards(self,r):
        """ take 1D float array of rewards and compute discounted reward """
        discounted_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(0, r.size)):
            running_add = running_add * self.gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r

    def prob_action(self,sess,action,state):
        prob = sess.run(self.dist.prob(action),feed_dict={self.state_in:state})
        return prob

    def check_movement(self):
        ep_back = 5
        jump = 3
        threshold = 3
        if len(self.episode_history) > ep_back*2:
            difference = sum(abs(self.episode_history[-ep_back:-1,:]-self.episode_history[-ep_back-jump:-1-jump,:]).flatten())
            print(difference)
            if difference < threshold:
                self.stuck = True

    def print_last_n_returns(self,n):
        if len(self.episode_history[:,self.s_size+self.a_size:self.s_size+self.a_size+1])>n:
            n_returns = sum(self.episode_history[-n:,self.s_size+self.a_size:self.s_size+self.a_size+1])/float(n)
            print(n_returns)
            return n_returns
Training loops:
tf.reset_default_graph()
agent_2= agent_episodic_continuous_action(1e-2,s_size=24,a_size=4,batch_size=30,dist_type="normal")
agent_2.create_brain([5,5],tf.nn.relu,None)
agent_2.create_graph_connections()
env = gym.make('BipedalWalker-v2')
with tf.Session() as sess:
    sess.run(agent_2.init)
    for i in range(200):
        s = env.reset()
        d = False
        a=agent_2.sample_action(sess,[s])[0]
        print(a)
        if None in a:
            print("None in a! inside for")
            print(s)
        s1,r,d,_ = env.step(a)
        episode = np.hstack((s,a,r,s1))
        agent_2.memorize_data(episode=episode,first=True)
        count = 0
        while not d:
            count = count + 1
            s = s1
            a=agent_2.sample_action(sess,[s])[0]
            s1,r,d,_ = env.step(a)
            episode = np.hstack((s,a,r,s1))
            # env.render()
            agent_2.memorize_data(episode=episode,first=False)
            # print(s1)
            if count % 5 == 0 :
                agent_2.check_movement()
                if agent_2.stuck:
                    d = True
        agent_2.get_returns()
        agent_2.print_last_n_returns(20)
        agent_2.shuffle_memories()
        agent_2.train_with_current_memories(sess)
    env.close()
For each batch of 30 samples I execute Agent.update_weights():
def update_weights(self):
    self.update_batch = self.optimizer.apply_gradients(zip(self.gradients,self.tvars))
This happens when I execute:
def train_with_current_memories(self,sess):
    self.sess = sess
    for step_sample_batch in self.memory_batch_generator():
        sess.run(self.update_weights(), feed_dict={self.state_in:step_sample_batch[0],self.action_holder:step_sample_batch[1],self.reward_holder:step_sample_batch[2].reshape([step_sample_batch[2].shape[0]])})
Or maybe this sluggishness is expected behavior.
The code was slowing down after each iteration because the graph was getting bigger at each iteration. This is because I was creating new graph elements inside the iteration loop.
During each iteration the following function was being called:
def update_weights(self):
    self.update_batch = self.optimizer.apply_gradients(zip(self.gradients,self.tvars))
    return self.update_batch
Each call to this function added a new element to the graph.
The best way to guard against this kind of "graph leak" is to add the line
sess.graph.finalize()
as soon as you create your session. That way, if anything tries to add to the graph afterwards, TensorFlow raises an exception.
I'm trying to use a one_vs_one composition of decision trees for multiclass classification. The problem is that when I pass different object weights to a classifier, the result stays the same.
Do I misunderstand something about weights, or do they just work incorrectly?
Thanks for your replies!
Here is my code:
from math import exp, log


class AdaLearner(object):
    def __init__(self, in_base_type, in_multi_type):
        self.base_type = in_base_type
        self.multi_type = in_multi_type

    def train(self, in_features, in_labels):
        model = AdaBoost(self.base_type, self.multi_type)
        model.learn(in_features, in_labels)
        return model


class AdaBoost(object):
    CLASSIFIERS_NUM = 100

    def __init__(self, in_base_type, in_multi_type):
        self.base_type = in_base_type
        self.multi_type = in_multi_type
        self.classifiers = []
        self.weights = []

    def learn(self, in_features, in_labels):
        labels_number = len(set(in_labels))
        self.weights = self.get_initial_weights(in_labels)

        for iteration in xrange(AdaBoost.CLASSIFIERS_NUM):
            classifier = self.multi_type(self.base_type())
            self.classifiers.append(classifier.train(in_features,
                                                     in_labels,
                                                     weights=self.weights))
            answers = []
            for obj in in_features:
                answers.append(self.classifiers[-1].apply(obj))
            err = self.compute_weighted_error(in_labels, answers)
            print err
            if abs(err - 0.) < 1e-6:
                break

            alpha = 0.5 * log((1 - err)/err)

            self.update_weights(in_labels, answers, alpha)
            self.normalize_weights()

    def apply(self, in_features):
        answers = {}
        for classifier in self.classifiers:
            answer = classifier.apply(in_features)
            if answer in answers:
                answers[answer] += 1
            else:
                answers[answer] = 1
        ranked_answers = sorted(answers.iteritems(),
                                key=lambda (k,v): (v,k),
                                reverse=True)
        return ranked_answers[0][0]

    def compute_weighted_error(self, in_labels, in_answers):
        error = 0.
        w_sum = sum(self.weights)
        for ind in xrange(len(in_labels)):
            error += (in_answers[ind] != in_labels[ind]) * self.weights[ind] / w_sum
        return error

    def update_weights(self, in_labels, in_answers, in_alpha):
        for ind in xrange(len(in_labels)):
            self.weights[ind] *= exp(in_alpha * (in_answers[ind] != in_labels[ind]))

    def normalize_weights(self):
        w_sum = sum(self.weights)
        for ind in xrange(len(self.weights)):
            self.weights[ind] /= w_sum

    def get_initial_weights(self, in_labels):
        weight = 1 / float(len(in_labels))
        result = []
        for i in xrange(len(in_labels)):
            result.append(weight)
        return result
As you can see, it is just a simple AdaBoost (I instantiated it with in_base_type = tree_learner, in_multi_type = one_against_one), and it worked the same way no matter how many base classifiers were engaged. It just acted as a single multiclass decision tree.
Then I made a hack: on each iteration I drew a random sample of objects according to their weights and trained the classifiers on that random subset without any weights. That worked as it was supposed to.
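The resampling workaround described above can be sketched roughly as follows, assuming Python 3.6+ and plain lists for features and labels. The name weighted_resample is hypothetical, and the base learner itself is left out because its API is not shown in the post.

```python
import random


def weighted_resample(features, labels, weights, rng):
    """Draw len(labels) samples with replacement, each with probability proportional to its weight."""
    indices = rng.choices(range(len(labels)), weights=weights, k=len(labels))
    return [features[i] for i in indices], [labels[i] for i in indices]


if __name__ == "__main__":
    rng = random.Random(42)
    features = [[0.1], [0.4], [0.9]]
    labels = ["a", "a", "b"]
    weights = [0.1, 0.1, 0.8]  # the third object was misclassified most often
    sample_features, sample_labels = weighted_resample(features, labels, weights, rng)
    print(sample_labels)
```

The resampled set is then passed to an unweighted learner, so heavily weighted objects influence the tree simply by appearing more often.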
The default tree criterion, namely information gain, does not take the weights into account. If you know of a formula that would do it, I'll implement it.
In the meantime, using neg_z1_loss will do it correctly. By the way, there was a slight bug in that implementation, so you will need to use the most current GitHub master.
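On the question of a weight-aware formula: one common convention is to replace class counts by sums of object weights when estimating class probabilities, and to weight each branch by its share of the total weight. The sketch below illustrates that convention for a binary split. It is an assumption about how such a criterion could look, not a description of what the library does, and the function names are made up for this example.

```python
import math
from collections import defaultdict


def weighted_entropy(labels, weights):
    """Entropy in bits, with class probabilities taken as shares of total weight."""
    class_weight = defaultdict(float)
    for label, weight in zip(labels, weights):
        class_weight[label] += weight
    total = sum(class_weight.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for w in class_weight.values():
        if w > 0:
            p = w / total
            entropy -= p * math.log2(p)
    return entropy


def weighted_information_gain(labels, weights, goes_left):
    """Gain of a binary split; goes_left[i] is True when object i falls in the left branch."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    gain = weighted_entropy(labels, weights)
    for side in (True, False):
        branch = [(l, w) for l, w, g in zip(labels, weights, goes_left) if g == side]
        if branch:
            branch_labels, branch_weights = zip(*branch)
            share = sum(branch_weights) / total
            gain -= share * weighted_entropy(branch_labels, branch_weights)
    return gain
```

With all weights equal this reduces to the usual information gain, so the unweighted behaviour is recovered as a special case.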