Why does TensorFlow synchronous distributed training block? - python

Here is my code:
cluster = tf.train.ClusterSpec({"ps": self.train_args['ps_hosts'], "worker": self.train_args['worker_hosts']})
server = tf.train.Server(cluster, job_name=self.job_name, task_index=self.task_index)
if self.job_name == "ps":
    log.info('server is running..............')
    server.join()
else:
    log.info('worker: {} is running..............'.format(self.task_index))
    is_chief = self.task_index == 0
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:{}".format(self.task_index), cluster=cluster)):
        self.cnn = self.init_model_class()
        self.global_step = tf.Variable(0, name="global_step", trainable=False)
        lr = tf.train.exponential_decay(learning_rate=0.005, global_step=self.global_step,
                                        decay_steps=self.train_args['lr_decay_steps'],
                                        decay_rate=0.96, staircase=True, name='learn_rate')
        optimizer = tf.train.AdamOptimizer(lr)
        grads_and_vars = optimizer.compute_gradients(self.cnn.loss)
        if self.train_args['is_sync']:
            rep_op = tf.train.SyncReplicasOptimizer(optimizer,
                                                    replicas_to_aggregate=len(self.train_args['worker_hosts']),
                                                    total_num_replicas=len(self.train_args['worker_hosts']),
                                                    use_locking=False)
            self.train_op = rep_op.apply_gradients(grads_and_vars, global_step=self.global_step)
            if is_chief:
                init_token_op = rep_op.get_init_tokens_op()
                chief_queue_runner = rep_op.get_chief_queue_runner()
        else:
            self.train_op = optimizer.apply_gradients(grads_and_vars, global_step=self.global_step)
        saver = tf.train.Saver(tf.global_variables(), save_relative_paths=True, max_to_keep=10)
        all_summary_op = tf.summary.merge_all()
        init_op = tf.global_variables_initializer()
    if not os.path.exists(self.checkpoint_dir):
        os.makedirs(self.checkpoint_dir)
    sv = tf.train.Supervisor(is_chief=(self.task_index == 0),
                             logdir=self.checkpoint_dir,
                             init_op=init_op,
                             summary_op=all_summary_op,
                             saver=saver,
                             global_step=self.global_step,
                             save_model_secs=60)
    session_conf = tf.ConfigProto(allow_soft_placement=self.train_args['allow_soft_placement'],
                                  log_device_placement=self.train_args['log_device_placement'],
                                  device_filters=['/job:ps', '/job:worker/task:{}'.format(self.task_index)])
    with sv.managed_session(server.target, config=session_conf) as self.sess:
        if self.task_index == 0 and self.train_args['is_sync']:
            sv.start_queue_runners(self.sess, [chief_queue_runner])
            self.sess.run(init_token_op)
        batches = data_process.batch_iter_train(self.train_args['batch_size'], self.train_args['num_epochs'])
        local_step = 0
        for x_batch, y_batch in batches:
            local_step += 1
            current_step = self.train_step(x_batch, y_batch)
            if current_step % self.train_args['evaluate_every'] == 0:
                log.info('worker: {} global step: {} local step: {}'.format(self.task_index, current_step, local_step))
                x_dev, y_dev = data_process.batch_iter_dev(self.train_args['dev_batch_size'])
                self.dev_step(x_dev, y_dev, current_step)
        sv.stop()
        log.info('****************************finish training***********************************')
I use 4 machines: 3 workers (identical hardware, 24 CPUs each) and 1 parameter server.
Questions:
When I use asynchronous distributed training, workers 0 and 1 use about 50% CPU, but worker 2 only uses about 15% CPU. Why?
When I use synchronous distributed training, worker 0 blocks after training one step, and the other workers block too, doing nothing. I tried starting workers 1 and 2 before worker 0, but nothing changed. Why?
Are all checkpoints stored on worker 0? With my code, why is only one checkpoint kept even though I set max_to_keep=10?
saver = tf.train.Saver(tf.global_variables(), save_relative_paths=True, max_to_keep=10)
What about summaries? Are they stored on different machines or only on worker 0? I find that all 3 workers have summaries.
Can anyone give me some advice? Thanks very much! Am I using it the wrong way?

Related

DQN Atari with tensorflow: Training seems to be stuck

I'm trying to train a Deep Q-Network to play Atari Breakout in TensorFlow. The code runs without problems, but after 1000-1200 episodes the time for executing one step always explodes to over 100 s.
Here is my DQN:
class DQNetwork():
    def __init__(self, scope, state_size=(84, 84, 4), num_outputs=4, gamma=0.9, learning_rate=0.001):
        self.scope = scope
        with tf.variable_scope(self.scope):
            # ---------------------
            # Basic Deep Q-Network
            # ---------------------
            self.x = tf.placeholder(tf.float32, shape=[None, *state_size], name="inputs")
            # Input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs=self.x,
                                          filters=32,
                                          kernel_size=[8, 8],
                                          strides=[4, 4],
                                          padding="VALID",
                                          name="conv1",
                                          activation="relu")
            self.conv2 = tf.layers.conv2d(inputs=self.conv1,
                                          filters=64,
                                          kernel_size=[4, 4],
                                          strides=[2, 2],
                                          padding="VALID",
                                          name="conv2",
                                          activation="relu")
            self.conv3 = tf.layers.conv2d(inputs=self.conv2,
                                          filters=64,
                                          kernel_size=[3, 3],
                                          strides=[1, 1],
                                          padding="VALID",
                                          name="conv3",
                                          activation="relu")
            self.flatten = tf.layers.flatten(self.conv3)
            self.fc = tf.layers.dense(inputs=self.flatten,
                                      units=512,
                                      activation=tf.nn.relu,
                                      name="fc1")
            self.logits = tf.layers.dense(inputs=self.fc,
                                          units=num_outputs,
                                          activation=None)
            self.best_action = tf.argmax(self.logits, name="best_action", axis=1)
            self.max_q = tf.reduce_max(self.logits, name="max_q", axis=1)
            if scope == 'Target':
                self.rewards = tf.placeholder(tf.float32, shape=None, name="rewards")
                self.gamma = tf.constant(gamma, name="Gamma")
                self.done = tf.placeholder(tf.int32, shape=None, name="done_values")
                self.td_target = self.rewards + (self.gamma * self.max_q) * tf.cast(tf.abs(self.done - 1), tf.float32)
            if scope == 'Q':
                self.target_placeholder = tf.placeholder(tf.float32, shape=None, name="target_placeholder_q")
                self.actions = tf.placeholder(tf.uint8, shape=None, name="AllActions")
                self.actions_onehot = tf.one_hot(self.actions, depth=num_outputs, name="One_Hot")
                self.Q = tf.reduce_sum(tf.multiply(self.actions_onehot, self.logits))
                self.huber_loss = huber_loss(self.target_placeholder - self.Q)
                self.loss = tf.reduce_mean(self.huber_loss)
                self.optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate, epsilon=0.01)
                self.train = self.optimizer.minimize(self.loss, name="minimize")
Huber Loss Function:
def huber_loss(x, delta=1.0):
    """Reference: https://en.wikipedia.org/wiki/Huber_loss"""
    return tf.where(
        tf.abs(x) < delta,
        tf.square(x) * 0.5,
        delta * (tf.abs(x) - 0.5 * delta)
    )
Preprocess Frames
def preprocess_frame(obs):
    processed_observe = np.uint8(
        resize(rgb2gray(obs), (84, 84), mode='constant') * 255)
    return processed_observe
ReplayBuffer
class ReplayBuffer():
    def __init__(self, buffer_size):
        self.buffer = deque([], maxlen=buffer_size)

    def add(self, new_state):
        if len(new_state) != 5:
            raise Exception("States must have: state, action, reward, next_state, done")
        self.buffer.append(new_state)

    def sample(self, batch_size):
        return r.sample(self.buffer, batch_size)
Update Target Network
def get_update_target_ops(Q_network, Target_network):
    # Your code comes here
    # 1. get the trainable variables per network
    Q_trainable = tf.trainable_variables(scope=Q_network.scope)
    Target_trainable = tf.trainable_variables(scope=Target_network.scope)
    # 2. sort them with sorted(list, key=attrgetter())
    Q_trainable = sorted(Q_trainable, key=attrgetter("name"))
    Target_trainable = sorted(Target_trainable, key=attrgetter("name"))
    # 3. create a new list with all assign ops
    update_target_expr = []
    for q_var, t_var in zip(Q_trainable, Target_trainable):
        update_target_expr.append(t_var.assign(q_var))
    return update_target_expr
Greedy Action
def choose_egreedy_action(session, epsilon, network, state):
    state = np.float32(state / 255.0)
    if np.random.rand() <= epsilon:
        return np.random.randint(0, n)
    else:
        state_reshaped = np.reshape(state, (1, *obs_space))
        best_action = session.run(network.best_action, feed_dict={network.x: state_reshaped})[0]
        return best_action
Linear Schedule
class LinearSchedule():
    def __init__(self, start_epsilon, final_epsilon, pre_train_steps, decay):
        self.start_epsilon = start_epsilon
        self.final_epsilon = final_epsilon
        self.pre_train_steps = pre_train_steps
        self.decay = decay
        self.epsilon = start_epsilon

    def value(self, t):
        if t <= self.pre_train_steps:
            return self.start_epsilon
        else:
            # clamp so epsilon never decays below final_epsilon
            return max(self.final_epsilon, self.start_epsilon - (t - self.pre_train_steps) * self.decay)
Train Step
def train(sess, Q, Target, buffer, batch_size):
    # Your code comes here
    # 1. Sample from the replay buffer
    mini_batches = buffer.sample(batch_size)
    observations, actions, rewards, next_observations, done = map(list, zip(*mini_batches))
    observations = np.array(observations) / 255.
    next_observations = np.array(next_observations) / 255.
    td_targets = sess.run(Target.td_target, feed_dict={Target.x: next_observations, Target.rewards: rewards, Target.done: done})
    max_q, loss, _ = sess.run([Q.max_q, Q.loss, Q.train], feed_dict={Q.x: observations, Q.target_placeholder: td_targets, Q.actions: actions})
    return loss
Hyperparams
EPISODES = 50000
epsilon = 1.
epsilon_start, epsilon_end = 1.0, 0.1
exploration_steps = 1000000.
epsilon_decay_step = (epsilon_start - epsilon_end) / exploration_steps
batch_size = 32
train_start = 50000
update_target_rate = 10000
gamma = 0.99
buffer_size = 400000
no_op_steps = 30
global_steps = 10000000
Init networks
tf.reset_default_graph()
Q_network = DQNetwork(scope="Q", state_size=(84, 84, 4), num_outputs=4, gamma=gamma, learning_rate=0.00025)
T_network = DQNetwork(scope="Target",state_size=(84, 84, 4), num_outputs=4, gamma=gamma, learning_rate=0.00025)
update_target_network = get_update_target_ops(Q_network,T_network)
Train Loop
from tqdm import trange
import random

game = gym.make('BreakoutDeterministic-v4')
scores, episodes = [], []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    buffer = ReplayBuffer(buffer_size)
    sess.run(tf.global_variables_initializer())
    epsilon_schedule = LinearSchedule(epsilon_start, epsilon_end, train_start, epsilon_decay_step)
    done = False
    dead = False
    # 1 episode = 5 lives
    step, score, start_life = 0, 0, 5
    observe = game.reset()
    eps = 0
    # this is one of DeepMind's ideas:
    # just do nothing at the start of an episode to avoid sub-optimal starts
    for _ in range(random.randint(1, no_op_steps)):
        observe, _, _, _ = game.step(1)
    # At the start of an episode there is no preceding frame,
    # so just copy the initial state to build the history
    state = preprocess_frame(observe)
    history = np.stack((state, state, state, state), axis=2)
    history = np.reshape([history], (84, 84, 4))
    loss = 0
    for global_step in trange(global_steps):
        # global_step += 1
        step += 1
        # get the action for the current history and take one step in the environment
        action = choose_egreedy_action(sess, epsilon, Q_network, history)
        observe, reward, done, info = game.step(action)
        # pre-process the observation --> history
        next_state = preprocess_frame(observe)
        next_state = np.reshape([next_state], (84, 84, 1))
        # print(next_state.shape)
        next_history = np.append(next_state, history[:, :, :3], axis=2)
        # if the agent missed the ball, the agent is dead --> but the episode is not over
        if start_life > info['ale.lives']:
            dead = True
            start_life = info['ale.lives']
        # if dead:
        #     reward = -1
        # score += reward
        reward = np.clip(reward, -1., 1.)
        # save the sample <s, a, r, s'> to the replay memory
        buffer.add([history, action, reward, next_history, dead])
        epsilon = epsilon_schedule.value(global_step)
        if global_step > train_start:
            train(sess, Q_network, T_network, buffer, batch_size)
            # update the target model with the online model
            if global_step % update_target_rate == 0:
                # print("update networks")
                sess.run(update_target_network)
        score += reward
        # if the agent is dead, reset the history
        if dead:
            dead = False
        else:
            history = next_history
        # if done, log the score over episodes
        if done:
            if eps % 100 == 0:
                print("episode:", eps, " score:", score, " global_step: ", global_step,
                      " epsilon: ", epsilon)
            scores.append(score)
            episodes.append(step)
            done = False
            dead = False
            # 1 episode = 5 lives
            step, score, start_life = 0, 0, 5
            observe = game.reset()
            eps += 1
            # this is one of DeepMind's ideas:
            # just do nothing at the start of an episode to avoid sub-optimal starts
            for _ in range(random.randint(1, no_op_steps)):
                observe, _, _, _ = game.step(1)
            # At the start of an episode there is no preceding frame,
            # so just copy the initial state to build the history
            state = preprocess_frame(observe)
            history = np.stack((state, state, state, state), axis=2)
            history = np.reshape([history], (84, 84, 4))
        if global_step % 5000 == 0:
            saver.save(sess, f'models/breakout/model_breakout.ckpt')
For the first 50,000 steps (no training) I get something like 400 it/s, which seems fine. After that it is about 50 it/s. As soon as epsilon decreases, this number also decreases, because we more often have to use the best action instead of a random one.
But after about 1000 episodes I get something like:
2%|▏ | 174582/10000000 [46:53<43:59:11, 100s/it] epsilon: 0.88, score: 2
As you can see, the duration per iteration increases drastically, so the training seems to be stuck.
I don't know whether this is a problem with my GPU or with the code.
Do you have any idea what to do?
I had the same experience while using an RL algorithm and training on the Atari Breakout environment from OpenAI Gym. It happened after my exploration rate dropped to a very low value. I found the solution in this post: OpenAI gym's breakout-v0 "pauses"
I knew my problem was similar to that post because when I rendered the game frames for each episode, I saw that after a life was lost in Atari Breakout, the ball disappeared (the game was paused).
The reason it didn't pause at the beginning of my training was that the exploration rate was high, so random actions were taken while paused, which would have led to an action that starts the game.
=================
I think this may solve your problem; I'm not sure it will (and it is usually not recommended in RL).
See this post:
OpenAI gym breakout-ram-v4 unable to learn
The question in the above post mentions that the algorithm sets the reward to -1 when the agent loses a life. It does this by using the system information (returned when .step() is called) to detect when a life is lost. I'm suggesting that when you detect a lost life in Atari Breakout, you hard-code that the agent must choose the action (for the next step) that restarts the game.
Again, this approach is not recommended in RL, because the agent is only supposed to use the observation when deciding which action to take.
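For illustration, here is a minimal sketch of that workaround (my own example, not from the original posts; the names game, dead, history, next_history and the FIRE action index 1 follow the question's train loop and BreakoutDeterministic-v4, so treat them as assumptions):

# Sketch only: replaces the question's "if dead: ... else: ..." block so a lost
# life immediately serves the next ball instead of waiting for a random FIRE.
if dead:
    dead = False
    observe, _, _, info = game.step(1)            # action 1 == FIRE in Breakout
    state = preprocess_frame(observe)             # rebuild the frame stack from the fresh frame
    history = np.stack((state, state, state, state), axis=2)
    history = np.reshape([history], (84, 84, 4))
else:
    history = next_history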

PyTorch is not using the GPU even though it detects the GPU

I set up my Windows 10 Jupyter notebook as a server and am running some training on it.
I've installed CUDA 9.0 and cuDNN properly, and Python detects the GPU. This is what I get at the Anaconda prompt:
>>> torch.cuda.get_device_name(0)
'GeForce GTX 1070'
I also placed my model and tensors on CUDA by calling .cuda():
model = LogPPredictor(1, 58, 64, 128, 1, 'gsc')
if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.DoubleTensor)
    model.cuda()
else:
    torch.set_default_tensor_type(torch.FloatTensor)

list_train_loss = list()
list_val_loss = list()
acc = 0
mse = 0
optimizer = args.optim(model.parameters(),
                       lr=args.lr,
                       weight_decay=args.l2_coef)
data_train = DataLoader(args.dict_partition['train'],
                        batch_size=args.batch_size,
                        pin_memory=True,
                        shuffle=args.shuffle)
data_val = DataLoader(args.dict_partition['val'],
                      batch_size=args.batch_size,
                      pin_memory=True,
                      shuffle=args.shuffle)
for epoch in tqdm_notebook(range(args.epoch), desc='Epoch'):
    model.train()
    epoch_train_loss = 0
    for i, batch in enumerate(data_train):
        list_feature = torch.tensor(batch[0]).cuda()
        list_adj = torch.tensor(batch[1]).cuda()
        list_logP = torch.tensor(batch[2]).cuda()
        list_logP = list_logP.view(-1, 1)
        optimizer.zero_grad()
        list_pred_logP = model(list_feature, list_adj)
        list_pred_logP.require_grad = False
        train_loss = args.criterion(list_pred_logP, list_logP)
        epoch_train_loss += train_loss.item()
        train_loss.backward()
        optimizer.step()
    list_train_loss.append(epoch_train_loss / len(data_train))
    model.eval()
    epoch_val_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(data_val):
            list_feature = torch.tensor(batch[0]).cuda()
            list_adj = torch.tensor(batch[1]).cuda()
            list_logP = torch.tensor(batch[2]).cuda()
            list_logP = list_logP.view(-1, 1)
            list_pred_logP = model(list_feature, list_adj)
            val_loss = args.criterion(list_pred_logP, list_logP)
            epoch_val_loss += val_loss.item()
    list_val_loss.append(epoch_val_loss / len(data_val))

data_test = DataLoader(args.dict_partition['test'],
                       batch_size=args.batch_size,
                       pin_memory=True,
                       shuffle=args.shuffle)
model.eval()
with torch.no_grad():
    logP_total = list()
    pred_logP_total = list()
    for i, batch in enumerate(data_val):
        list_feature = torch.tensor(batch[0]).cuda()
        list_adj = torch.tensor(batch[1]).cuda()
        list_logP = torch.tensor(batch[2]).cuda()
        logP_total += list_logP.tolist()
        list_logP = list_logP.view(-1, 1)
        list_pred_logP = model(list_feature, list_adj)
        pred_logP_total += list_pred_logP.tolist()
    mse = mean_squared_error(logP_total, pred_logP_total)
But in the Windows Task Manager, whenever I start training, only the CPU usage goes up to about 25% while the GPU usage stays at 0%. How can I fix this?
I had a similar problem using PyTorch on CUDA. After looking for possible solutions, I found the following post by Soumith himself, which I found very helpful.
https://discuss.pytorch.org/t/gpu-supposed-to-be-used-but-isnt/2883
The bottom line is that, at least in my case, I could not put enough load on the GPU; there was a bottleneck in my application. Try another example, or increase the batch size; it should be OK.
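As a quick, self-contained sanity check along those lines (my sketch, not part of the original answer; the layer size is a placeholder for the question's LogPPredictor), you can confirm that the model and a batch really live on the GPU, then try a larger batch so the card gets more work per step:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(58, 1).to(device)              # stand-in for the real model
print(next(model.parameters()).device)           # expect: cuda:0

batch = torch.randn(4096, 58, device=device)     # a larger batch keeps the GPU busier
out = model(batch)
print(out.device, out.shape)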

TensorFlow distribution across cards

I am currently playing around with and learning about distributed TensorFlow.
I recently created a cluster with one GPU server (two cards) and one CPU server.
I was browsing through various articles, and in the TensorFlow distributed guide I saw that distribution across cards happens by explicitly referring to them by name:
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py
but there no cluster is being created.
Can I create a TensorFlow cluster and then specify which card the code should run on?
If yes, does the code below look correct?
In a GitHub issue whose link I don't have right now, but in code like the one below, the card is specified under with tf.device(replica_device_setter); but when I try to do that, my code throws an error stating "Cannot assign a device for operation 'dummy_queue_Close_1': Could not satisfy explicit device specification '/job:ps/task:0/device:GPU:0' because no supported kernel for GPU devices is available."
Is this because I am assigning tasks that were supposed to happen on a CPU, but because I wrote with tf.device('/gpu:0/'), it throws the error?
Also, I can't share my official code, but it looks very similar to the code below, which I took as a reference.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy
import tensorflow as tf

tf.app.flags.DEFINE_string("ps_hosts", "localhost:2222", "...")
tf.app.flags.DEFINE_string("worker_hosts", "localhost:2223", "...")
tf.app.flags.DEFINE_string("job_name", "", "...")
tf.app.flags.DEFINE_integer("task_index", 0, "...")
tf.app.flags.DEFINE_integer('gpu_cards', 4, 'Number of GPU cards in a machine to use.')
FLAGS = tf.app.flags.FLAGS

def dense_to_one_hot(labels_dense, num_classes=10):
    """Convert class labels from scalars to one-hot vectors."""
    num_labels = labels_dense.shape[0]
    index_offset = numpy.arange(num_labels) * num_classes
    labels_one_hot = numpy.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot

def run_training(server, cluster_spec, num_workers):
    is_chief = (FLAGS.task_index == 0)
    with tf.Graph().as_default():
        with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
            with tf.device('/cpu:0'):
                global_step = tf.get_variable('global_step', [],
                                              initializer=tf.constant_initializer(0), trainable=False)
            with tf.device('/gpu:%d' % (FLAGS.task_index % FLAGS.gpu_cards)):
                # Create the model
                x = tf.placeholder("float", [None, 784])
                W = tf.Variable(tf.zeros([784, 10]))
                b = tf.Variable(tf.zeros([10]))
                y = tf.nn.softmax(tf.matmul(x, W) + b)
                # Define loss and optimizer
                y_ = tf.placeholder("float", [None, 10])
                cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
                opt = tf.train.GradientDescentOptimizer(0.01)
                opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=num_workers,
                                                     replica_id=FLAGS.task_index, total_num_replicas=num_workers)
                train_step = opt.minimize(cross_entropy, global_step=global_step)
                # Test trained model
                correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
                accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
            init_token_op = opt.get_init_tokens_op()
            chief_queue_runner = opt.get_chief_queue_runner()
        init = tf.initialize_all_variables()
        sv = tf.train.Supervisor(is_chief=is_chief,
                                 init_op=init,
                                 global_step=global_step)
        # Create a session for running Ops on the Graph.
        config = tf.ConfigProto(allow_soft_placement=True)
        sess = sv.prepare_or_wait_for_session(server.target, config=config)
        if is_chief:
            sv.start_queue_runners(sess, [chief_queue_runner])
            sess.run(init_token_op)
        for i in range(100000):
            source_data = numpy.random.normal(loc=0.0, scale=1.0, size=(100, 784))
            labels_dense = numpy.clip(numpy.sum(source_data, axis=1) / 5 + 5, 0, 9).astype(int)
            labels_one_hot = dense_to_one_hot(labels_dense)
            _, cost, acc, step = sess.run([train_step, cross_entropy, accuracy, global_step],
                                          feed_dict={x: source_data, y_: labels_one_hot})
            print("[%d]: cost=%.2f, accuracy=%.2f" % (step, cost, acc))

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    num_workers = len(worker_hosts)
    print("gpu_cards=%d; num_workers=%d" % (FLAGS.gpu_cards, num_workers))
    cluster_spec = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    server = tf.train.Server(cluster_spec, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        run_training(server, cluster_spec, num_workers)

if __name__ == '__main__':
    tf.app.run()
I found a way to do this; it sounds very simple, and it is very simple.
I created a TensorFlow cluster in the same way, passed the n_workers parameter to the cluster, and called different instances of the code with an extra parameter, CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES is an environment variable that can be used to restrict which cards TensorFlow (or any DL framework) can see.
Its value is a comma-separated list of GPU indices from 0 to n-1 (where n is the number of GPUs); setting it to -1 (or an empty string) hides all cards, and setting it to a single index i exposes only the i-th card.
I hope someone who is looking for a similar answer finds this useful.
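A minimal sketch of that launch pattern (my example; trainer.py, the host:port strings, and the GPU indices are placeholders rather than the answerer's actual command line):

import os
import subprocess

worker_hosts = "localhost:2223,localhost:2224"
for task_index, gpu in enumerate(["0", "1"]):            # one card per worker process
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)     # "" or "-1" would hide all GPUs
    subprocess.Popen(["python", "trainer.py",
                      "--job_name=worker",
                      "--task_index={}".format(task_index),
                      "--worker_hosts={}".format(worker_hosts)],
                     env=env)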

TensorFlow prediction runs for deleted images also

We are using the tensorflow library for face recognition. Our code works fine for single images, but when we run it as an API, the prediction time increases with every subsequent request. This happens because it also iterates over previously predicted images, which should ideally not happen. Please find the code I am using below.
def train():
    with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
        test_set = _get_test_data(input_directory)
        images, labels = _load_images_and_labels(test_set, image_size=160, batch_size=batch_size,
                                                 num_threads=4, num_epochs=1)
        _load_model(model_filepath=model_path)
        init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
        sess.run(init_op)
        images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
        embedding_layer = tf.get_default_graph().get_tensor_by_name("embeddings:0")
        phase_train_placeholder = tf.get_default_graph().get_tensor_by_name("phase_train:0")
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        emb_array, label_array = _create_embeddings(embedding_layer, images, labels, images_placeholder,
                                                    phase_train_placeholder, sess)
        classifier_filename = classifier_output_path
        class_name, prob = _evaluate_classifier(emb_array, label_array, classifier_filename)
        coord.request_stop()
        coord.join(threads)

def _create_embeddings(embedding_layer, images, labels, images_placeholder, phase_train_placeholder, sess):
    emb_array = None
    label_array = None
    try:
        i = 0
        while True:
            print("batch images")
            batch_images, batch_labels = sess.run([images, labels])
            print('Processing iteration {} batch of size: {}'.format(i, len(batch_labels)))
            emb = sess.run(embedding_layer,
                           feed_dict={images_placeholder: batch_images, phase_train_placeholder: False})
            emb_array = np.concatenate([emb_array, emb]) if emb_array is not None else emb
            label_array = np.concatenate([label_array, batch_labels]) if label_array is not None else batch_labels
            i += 1
    except tf.errors.OutOfRangeError:
        pass
    return emb_array, label_array
It picks up previously predicted images at
`batch_images, batch_labels = sess.run([images, labels])`
in the _create_embeddings function. I think this is a problem of some unclosed threads, because of which sess.run runs for all queued threads. Can anyone help me with this?
During debugging I found that information about previously predicted images was in the default graph, which is picked up during execution of the following lines:
images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
embedding_layer = tf.get_default_graph().get_tensor_by_name("embeddings:0")
So resetting the graph before starting the session solves the problem of scanning previously predicted images:
tf.reset_default_graph()
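To illustrate the effect (a self-contained toy sketch, not the face-recognition code itself): without the reset, every call keeps adding ops to the same default graph, whereas with it each request starts from an empty graph.

import tensorflow as tf

def run_once(value):
    tf.reset_default_graph()                 # start each request with a clean graph
    x = tf.constant(value, name="input")
    y = x * 2.0
    with tf.Session() as sess:
        # with the reset in place, the op count stays constant across calls
        print(len(tf.get_default_graph().get_operations()), sess.run(y))

run_once(1.0)
run_once(2.0)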

Understanding how a TensorFlow model as a class and a TensorFlow Session interact

I have been using TensorFlow for a reasonable length of time now, and I believed I had a thorough understanding of how a TensorFlow graph works and executes within a session. However, I have written all of my TensorFlow models in a script-like fashion, like this:
import tensorflow as tf
import DataWorker
import Constants
x = tf.placeholder(tf.float32, [None, Constants.sequenceLength, DataWorker.numFeatures])
y = tf.placeholder(tf.float32, [None, 1])
xTensors = tf.unstack(x, axis=1) # [seqLength tensors of shape (batchSize, numFeatures)]
W = tf.Variable(tf.random_normal([Constants.numHidden, 1])) # Weighted matrix
b = tf.Variable(tf.random_normal([1])) # Bias
cell = tf.contrib.rnn.BasicLSTMCell(Constants.numHidden, forget_bias=Constants.forgetBias)
outputs, finalState = tf.nn.static_rnn(cell, xTensors, dtype=tf.float32)
# predictions = [tf.add(tf.matmul(output, W), b) for output in outputs] # List of predictions after each time step
prediction = tf.add(tf.matmul(outputs[-1], W), b) # Prediction after final time step
prediction = tf.tanh(prediction) # Activation
mse = tf.losses.mean_squared_error(predictions=prediction, labels=y) # Mean loss over entire batch
accuracy = tf.reduce_mean(1 - (tf.abs(y - prediction) / DataWorker.labelRange)) # Accuracy over entire batch
optimiser = tf.train.AdamOptimizer(Constants.learningRate).minimize(mse) # Backpropagation
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    # #############################################
    # TRAINING
    # #############################################
    for epoch in range(Constants.numEpochs):
        print("***** EPOCH:", epoch + 1, "*****\n")
        IDPointer, TSPointer = 0, 0  # Pointers to current ID and timestamp
        epochComplete = False
        batchNum = 0
        while not epochComplete:
            batchNum += 1
            batchX, batchY, IDPointer, TSPointer, epochComplete = DataWorker.generateBatch(IDPointer, TSPointer, isTraining=True)
            dict = {x: batchX, y: batchY}
            session.run(optimiser, dict)
            if batchNum % 1000 == 0 or epochComplete:
                batchLoss = session.run(mse, dict)
                batchAccuracy = session.run(accuracy, dict)
                print("Iteration:", batchNum)
                print(batchLoss)
                print(str("%.2f" % (batchAccuracy * 100) + "%\n"))
    # #############################################
    # TESTING
    # #############################################
    testX, testY, _, _, _ = DataWorker.generateBatch(0, 0, isTraining=False)
    testAccuracy = session.run(accuracy, {x: testX, y: testY})
    print("Testing Accuracy:", str("%.2f" % (testAccuracy * 100) + "%"))
But now, for practicality and readability, I want to implement my model as a class, and I have encountered many problems with initializing my variables, etc.
This is the closest I have got to implementing the above example using my own LSTM class:
Model.py
import tensorflow as tf
import Constants
import DataWorker  # Remove this dependency


class LSTM():
    """docstring."""

    def __init__(self,
                 inputDimensionList,
                 outputDimensionList,
                 numLayers=Constants.numLayers,
                 numHidden=Constants.numHidden,
                 learningRate=Constants.learningRate,
                 forgetBias=Constants.forgetBias
                 ):
        """docstring."""
        self.batchInputs = tf.placeholder(tf.float32, [None] + inputDimensionList)
        self.batchLabels = tf.placeholder(tf.float32, [None] + outputDimensionList)
        self.weightedMatrix = tf.Variable(tf.random_normal([numHidden] + outputDimensionList))
        self.biasMatrix = tf.Variable(tf.random_normal(outputDimensionList))
        self.cell = tf.contrib.rnn.BasicLSTMCell(numHidden, forget_bias=forgetBias)
        self.numLayers = numLayers
        self.numHidden = numHidden
        self.learningRate = learningRate
        self.forgetBias = forgetBias
        self.batchDict = {}
        self.batchInputTensors = None
        self.batchOutputs = None  # All needed as instance variables?
        self.batchFinalStates = None
        self.batchPredictions = None
        self.batchLoss = None
        self.batchAccuracy = None
        self.initialised = False
        self.session = tf.Session()
        # Take in activation, loss and optimiser FUNCTIONS as args

    def execute(self, command):
        """docstring."""
        return self.session.run(command, self.batchDict)

    def setBatchDict(self, inputs, labels):
        """docstring."""
        self.batchDict = {self.batchInputs: inputs, self.batchLabels: labels}
        self.batchInputTensors = tf.unstack(self.batchInputs, axis=1)

    def processBatch(self):
        """docstring."""
        self.batchOutputs, self.batchFinalState = tf.nn.static_rnn(self.cell, self.batchInputTensors, dtype=tf.float32)
        pred = tf.tanh(tf.add(tf.matmul(self.batchOutputs[-1], self.weightedMatrix), self.biasMatrix))
        mse = tf.losses.mean_squared_error(predictions=pred, labels=self.batchLabels)
        optimiser = tf.train.AdamOptimizer(self.learningRate).minimize(mse)
        if not self.initialised:
            self.session.run(tf.global_variables_initializer())
            self.initialised = True
        with tf.variable_scope("model") as scope:
            if self.initialised:
                scope.reuse_variables()
            self.execute(optimiser)
            self.batchPredictions = self.execute(pred)
            self.batchLoss = self.execute(tf.losses.mean_squared_error(predictions=self.batchPredictions, labels=self.batchLabels))
            self.batchAccuracy = self.execute(tf.reduce_mean(1 - (tf.abs(self.batchLabels - self.batchPredictions) / DataWorker.labelRange)))
            return self.batchPredictions, self.batchLabels, self.batchLoss, self.batchAccuracy

    def kill(self):
        """docstring."""
        self.session.close()
This class is quite messy, especially processBatch(), as I have just been trying to get it to work before refining it.
I then run my model here:
Main.py
import DataWorker
import Constants
from Model import LSTM

inputDim = [Constants.sequenceLength, DataWorker.numFeatures]
outputDim = [1]

lstm = LSTM(inputDimensionList=inputDim, outputDimensionList=outputDim)

# #############################################
# TRAINING
# #############################################
for epoch in range(Constants.numEpochs):
    print("***** EPOCH:", epoch + 1, "*****\n")
    IDPointer, TSPointer = 0, 0  # Pointers to current ID and timestamp
    epochComplete = False
    batchNum = 0
    while not epochComplete:
        batchNum += 1
        batchX, batchY, IDPointer, TSPointer, epochComplete = DataWorker.generateBatch(IDPointer, TSPointer, isTraining=True)
        lstm.setBatchDict(batchX, batchY)
        batchPredictions, batchLabels, batchLoss, batchAccuracy = lstm.processBatch()
        if batchNum % 1000 == 0 or epochComplete:
            print("Iteration:", batchNum)
            print("Pred:", batchPredictions[-1], "\tLabel:", batchLabels[-1])
            print("Loss:", batchLoss)
            print("Accuracy:", str("%.2f" % (batchAccuracy * 100) + "%\n"))

# #############################################
# TESTING
# #############################################
testX, testY, _, _, _ = DataWorker.generateBatch(0, 0, isTraining=False)
lstm.setBatchDict(testX, testY)
_, _, _, testAccuracy = lstm.processBatch()
print("Testing Accuracy:", str("%.2f" % (testAccuracy * 100) + "%"))

lstm.kill()
A single pass through the graph executes fine when all the variables are initialized, but on the second iteration I get the error
ValueError: Variable rnn/basic_lstm_cell/kernel/Adam/ already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:
optimiser = tf.train.AdamOptimizer(self.learningRate).minimize(mse)
I Googled this problem and learned that using scope.reuse_variables() should stop it from trying to initialize the AdamOptimizer a second time, but clearly this isn't working the way I have implemented it. How can I fix this issue?
As a side note, is my method of creating the TensorFlow session as an instance variable within my LSTM class acceptable, or should I create the session in Main and then pass it into the LSTM instance?
In general, I wrap anything that creates variables under the hood with tf.make_template when doing object-oriented model building.
However, you should avoid adding ops to the graph in a training loop, which is what appears to be happening here. They will build up and cause problems, and likely give you incorrect results. Instead, define the graph (with inputs from tf.data, placeholders, or queues) once and only loop over a session.run call. Even better, structure your code as an Estimator and this will be enforced.
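A minimal sketch of that pattern (my example, not the asker's LSTM): the graph, loss and optimiser are built once in __init__, variable-creating code is wrapped in tf.make_template, and the training loop only calls session.run:

import tensorflow as tf

class TinyModel(object):
    def __init__(self):
        # the graph is built exactly once, here
        self.x = tf.placeholder(tf.float32, [None, 4])
        self.y = tf.placeholder(tf.float32, [None, 1])
        net = tf.make_template("net", self._net)      # variables created on first call, reused afterwards
        pred = net(self.x)
        self.loss = tf.losses.mean_squared_error(labels=self.y, predictions=pred)
        self.train_op = tf.train.AdamOptimizer(0.01).minimize(self.loss)  # optimiser built once, not per batch
        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

    def _net(self, x):
        return tf.layers.dense(x, 1)

    def train_step(self, batch_x, batch_y):
        # the loop body only runs existing ops; no new nodes are added to the graph
        _, loss = self.session.run([self.train_op, self.loss],
                                   {self.x: batch_x, self.y: batch_y})
        return loss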
