Tensorflow FailedPreconditionError, but all variables have been initialized - python

EDIT: After trying several things, I have added the following to my code:
with tf.Session(graph=self.graph) as session:
    session.run(tf.initialize_all_variables())
    try:
        session.run(tf.assert_variables_initialized())
    except tf.errors.FailedPreconditionError:
        raise RuntimeError("Not all variables initialized!")
Now, occasionally this fails, i.e. tf.assert_variables_initialized() will raise FailedPreconditionError, even though immediately before it, tf.initialize_all_variables() was executed. Does anyone have any idea how this can happen?
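For reference, a small diagnostic sketch in the same spirit (it assumes the TF version in use provides tf.report_uninitialized_variables) that reports which variables the session still considers uninitialized instead of only asserting:
with tf.Session(graph=self.graph) as session:
    session.run(tf.initialize_all_variables())
    # returns the names of variables the session still considers
    # uninitialized (an empty array if everything is initialized)
    uninitialized = session.run(tf.report_uninitialized_variables())
    if uninitialized.size > 0:
        raise RuntimeError("Not all variables initialized: %s" % uninitialized)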
Original question:
Background
I'm running cross-validated (CV) hyperparameter search on a basic neural net created through Tensorflow, with GradientDescentOptimizer. At seemingly random moments I'm getting a FailedPreconditionError, for different Variables. For example (full stack trace at end of post):
FailedPreconditionError: Attempting to use uninitialized value Variable_5
[[Node: Variable_5/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_5"], _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_5)]]
Some runs fail fairly fast, others not -- one has been running for 15 hours now without problems. I'm running this in parallel on multiple GPUs - not the optimization itself, but each CV fold.
What I've checked
From this and this post I understand that this error occurs when attempting to use Variables that haven't been initialized using tf.initialize_all_variables(). However, I am 99% certain that I'm doing this (and if not, I'd expect it to always fail) - I'll post code below.
The API doc says that
This exception is most commonly raised when running an operation that
reads a tf.Variable before it has been initialized.
"Most commonly" suggests that it can also be raised in different scenarios. So, for now the main question:
Question:
Are there other scenarios under which this exception may be raised, and what are they?
Code
MLP class:
class MLP(object):
    def __init__(self, n_in, hidden_config, n_out, optimizer, f_transfer=tf.nn.tanh, f_loss=mean_squared_error,
                 f_out=tf.identity, seed=None, global_step=None, graph=None, dropout_keep_ratio=1):
        self.graph = tf.Graph() if graph is None else graph
        # all variables defined below
        with self.graph.as_default():
            self.X = tf.placeholder(tf.float32, shape=(None, n_in))
            self.y = tf.placeholder(tf.float32, shape=(None, n_out))
            self._init_weights(n_in, hidden_config, n_out, seed)
            self._init_computations(f_transfer, f_loss, f_out)
            self._init_optimizer(optimizer, global_step)

    def fit_validate(self, X, y, val_X, val_y, val_f, iters=100, val_step=1):
        [snip]
        with tf.Session(graph=self.graph) as session:
            tf.initialize_all_variables().run()  # <-- VAR INIT HERE
            for i in xrange(iters):
                [snip: get minibatch here]
                _, l = session.run([self.optimizer, self.loss], feed_dict={self.X: X_batch, self.y: y_batch})
                # validate
                if i % val_step == 0:
                    val_yhat = self.validation_yhat.eval(feed_dict=val_feed_dict, session=session)
As you can see, tf.initialize_all_variables().run() is always called before anything else is done. The net is initialized as:
def estimator_getter(params):
    [snip]
    graph = tf.Graph()
    with graph.as_default():
        global_step = tf.Variable(0, trainable=False)
        learning_rate = tf.train.exponential_decay(params.get('learning_rate', 0.1), global_step, decay_steps, decay_rate)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        net = MLP(config_num_inputs[config_id], hidden, 1, optimizer, seed=params.get('seed', 100), global_step=global_step, graph=graph, dropout_keep_ratio=dropout)
Full example stack trace:
FailedPreconditionError: Attempting to use uninitialized value Variable_5
[[Node: Variable_5/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_5"], _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_5)]]
Caused by op u'Variable_5/read', defined at:
File "tf_paramsearch.py", line 373, in <module>
randomized_search_params(int(sys.argv[1]))
File "tf_paramsearch.py", line 356, in randomized_search_params
hypersearch.fit()
File "/home/centos/ODQ/main/python/odq/cv.py", line 430, in fit
return self._fit(sampled_params)
File "/home/centos/ODQ/main/python/odq/cv.py", line 190, in _fit
for train_key, test_key in self.cv)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 766, in __call__
n_jobs = self._initialize_pool()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 537, in _initialize_pool
self._pool = MemmapingPool(n_jobs, **poolargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/pool.py", line 580, in __init__
super(MemmapingPool, self).__init__(**poolargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/pool.py", line 418, in __init__
super(PicklingPool, self).__init__(**poolargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 159, in __init__
self._repopulate_pool()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
w.start()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/forking.py", line 126, in __init__
code = process_obj._bootstrap()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 113, in worker
result = (True, func(*args, **kwds))
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 130, in __call__
return self.func(*args, **kwargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/centos/ODQ/main/python/odq/cv.py", line 131, in _fold_runner
estimator = estimator_getter(parameters)
File "tf_paramsearch.py", line 264, in estimator_getter
net = MLP(config_num_inputs[config_id], hidden, 1, optimizer, seed=params.get('seed',100), global_step=global_step, graph=graph, dropout_keep_ratio=dropout)
File "tf_paramsearch.py", line 86, in __init__
self._init_weights(n_in, hidden_config, n_out, seed)
File "tf_paramsearch.py", line 105, in _init_weights
self.out_weights = tf.Variable(tf.truncated_normal([hidden_config[-1], n_out], stddev=stdev))
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 206, in __init__
dtype=dtype)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 275, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 523, in identity
return _op_def_lib.apply_op("Identity", input=input, name=name)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2117, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()

Ok, I've found the problem. There was a rare condition in my code that resulted in one of the hidden layers being created with shape (0, N), i.e. no inputs. In this case, Tensorflow apparently fails to initialize the variables pertaining to that layer.
While this makes sense, it might be useful for Tensorflow to log a warning message in such cases (btw, I also tried to set Tensorflow logging to debug mode, but couldn't find how -- tf.logging.set_verbosity() didn't seem to have an effect).
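For anyone hitting the same failure mode, here is a minimal defensive check (a hypothetical helper; it assumes the layer sizes are available as plain Python ints before the Variables are created):
def _check_layer_sizes(n_in, hidden_config, n_out):
    # Guard against accidentally creating a weight matrix with shape (0, N),
    # which is what silently led to the uninitialized-variable error above.
    sizes = [n_in] + list(hidden_config) + [n_out]
    for position, size in enumerate(sizes):
        if size <= 0:
            raise ValueError("layer %d has non-positive size %d" % (position, size))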

BTW, for efficiency and fewer bugs, you could follow the following pattern:
tf.reset_default_graph()
a = tf.constant(1)
# ... add more operations to your graph ...
b = tf.Variable(1)
compute_op = tf.add(a, b)          # example op to run later
init_op = tf.initialize_all_variables()
tf.get_default_graph().finalize()  # no new ops can be added after this point
sess = tf.InteractiveSession()
sess.run(init_op)
sess.run(compute_op)
The finalize call prevents you from modifying the graph between runs, which is slow in the current version. Also, because there is one session and one graph, you don't need with blocks.

For me the solution was:
with sess.as_default():
    result = compute_fn([seed_input, 1])
Check FailedPreconditionError: Attempting to use uninitialized in Tensorflow for other options and my explanation.
Strangely, session.run() is not the same as running a function under sess.as_default(); I tried both.
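To illustrate the difference, here is a small self-contained sketch (a hypothetical toy graph, not the code from this answer): Tensor.eval() and similar callables look up the default session, which a bare session.run() never registers, so they only work inside sess.as_default() or with an explicit session argument.
import tensorflow as tf

x = tf.placeholder(tf.float32)
y = x * 2.0

sess = tf.Session()
with sess.as_default():
    # works: eval() finds sess as the default session
    print(y.eval(feed_dict={x: 3.0}))

# also works: the session is passed explicitly
print(y.eval(feed_dict={x: 3.0}, session=sess))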

Related

Tensorflow-1 distributed with low-level code and Ray

I'm trying to distribute the training of a Deep Reinforcement Learning system that I have built using Ray and Tensorflow 1. While I am using Ray because I have a lot of code that parallelizes logic not directly related to the training, I would like to parallelize the training itself (namely, the gradient reduction over different workers on different GPUs) using the tf.distribute utilities, mainly because they can use the NCCL communication library, which I expect will boost training speed compared with other approaches.
The problem is that I don't want to refactor my tensorflow code (written in old Tensorflow 1 at a low level, with custom training loops; I am not using any API like Keras), but I can't figure out how to use tf.distribute, namely the MirroredStrategy, to distribute the training using Tensorflow 1 code.
I have found this guide about tf.distribute in Tensorflow 1, but even in the custom loop they are using the Keras API for building the model and the optimizer. I am trying to follow this guide as far as possible in order to build a simple example that uses the libraries/API that I am using in my main project, but I cannot make it work.
The example code is this:
import numpy as np
import tensorflow.compat.v1 as tf
import ray

tf.disable_v2_behavior()


@ray.remote(num_cpus=1, num_gpus=0)
class Trainer:
    def __init__(self, local_data):
        tf.disable_v2_behavior()
        self.current_w = 1.0
        self.local_data = local_data
        self.strategy = tf.distribute.MirroredStrategy()
        with self.strategy.scope():
            self.w = tf.Variable(1.0, dtype=tf.float32)
            self.x = tf.placeholder(shape=(None, 1), dtype=tf.float32)
            self.y = self.w * self.x
            self.grad = tf.gradients(self.y, [self.w])

            def train_step_opt():
                def grad_fn():
                    grad = tf.gradients(self.y, [self.w])
                    return grad
                per_replica_grad = self.strategy.experimental_run_v2(grad_fn)
                red_grad = self.strategy.reduce(
                    tf.distribute.ReduceOp.SUM, per_replica_grad, axis=None)
                minimize = self.w.assign_sub(red_grad[0])
                return minimize

            self.minimize = self.strategy.experimental_run_v2(train_step_opt)

    def train_step(self):
        with self.strategy.scope():
            with tf.Session() as sess:
                sess.run(self.minimize, feed_dict={self.x: self.local_data})
                self.current_w = sess.run(self.w)
        return self.current_w


ray.init()

data = np.arange(4) + 1
data = data.reshape((-1, 1))
data_w = [data[None, i] for i in range(4)]

trainers = [Trainer.remote(d) for d in data_w]
W = ray.get([t.train_step.remote() for t in trainers])[0]
print(W)
It is supposed to simply compute the derivative of a linear function in different processes, reduce all the derivatives into a single value, and apply it to the single parameter "w".
When I run it I get the following error:
Traceback (most recent call last):
File "dtfray.py", line 49, in <module>
r = ray.get([t.train_step.remote() for t in trainers])
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::Trainer.train_step() (pid=25401, ip=10.128.0.46)
File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "dtfray.py", line 32, in __init__
self.minimize = self.strategy.experimental_run_v2(train_step_opt)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 957, in experimental_run_v2
return self.run(fn, args=args, kwargs=kwargs, options=options)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 770, in _call_for_each_replica
fn, args, kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 201, in _call_for_each_replica
coord.join(threads)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 998, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 282, in wrapper
return func(*args, **kwargs)
File "dtfray.py", line 22, in train_step_opt
tf.distribute.get_replica_context().merge_call()
TypeError: merge_call() missing 1 required positional argument: 'merge_fn'
As highlighted in the following section of source code:
def train_step_opt():
    def grad_fn():
        grad = tf.gradients(self.y, [self.w])
        return grad
    per_replica_grad = self.strategy.experimental_run_v2(grad_fn)
    red_grad = self.strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_grad, axis=None)
    minimize = self.w.assign_sub(red_grad[0])
    return minimize

self.minimize = self.strategy.experimental_run_v2(train_step_opt)
you have to reduce the results from tf.distribute.MirroredStrategy() one more time for train_step_opt after the line self.minimize = self.strategy.experimental_run_v2(train_step_opt), because you are calling self.strategy.experimental_run_v2() twice: once on train_step_opt and then again on grad_fn.
Also, you can review the following section, from lines 178 to 188 of the mirrored_run.py file in the TF GitHub repo, as get_replica_context() is triggered for the cross-replica context:
When fn starts should_run event is set on _MirroredReplicaThread (MRT) threads. The execution waits until MRT.has_paused is set, which indicates that either fn is complete or a get_replica_context().merge_call() is called. If fn is complete, then MRT.done is set to True. Otherwise, arguments of get_replica_context().merge_call from all paused threads are grouped and the merge_fn is performed. Results of the get_replica_context().merge_call are then set to MRT.merge_result. Each such get_replica_context().merge_call call returns the MRT.merge_result for that thread when MRT.should_run event is reset again. Execution of fn resumes.
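A minimal sketch of the restructured step this description points to (using the same deprecated experimental_run_v2 API as the question; names are illustrative and this is not a verified fix): run only the per-replica gradient function inside experimental_run_v2, and perform the reduce and the variable update once, in the cross-replica context.
def grad_fn():
    # per-replica work only: compute the gradient of y with respect to w
    return tf.gradients(self.y, [self.w])[0]

with self.strategy.scope():
    # a single experimental_run_v2 call; no nested call inside it
    per_replica_grad = self.strategy.experimental_run_v2(grad_fn)
    # reduce once, in the cross-replica context
    red_grad = self.strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_grad, axis=None)
    self.minimize = self.w.assign_sub(red_grad)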

How can I view tensor values that cause a TensorFlow crash?

I am trying to get names of Tensors using a Tensor hashtable, but I keep getting the default value because a key was not found in the table. I have no idea why the key I am looking for is not in the table. The problem is that I have no idea how to actually see which key is giving the error.
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(tensor_ids, indexs), -1)

pids_toget = []
for i in range(FLAGS.batch_size):
    name = tf.gather(filenames, i)
    index_toget = table.lookup(name)
    pids_toget.append(index_toget)

tensor_vals = tf.Variable(np_clinical, dtype=tf.float32)
clinic_features_tensor = tf.gather(tensor_vals, pids_toget)
The error I get
Traceback (most recent call last):
File "/home/ubuntu/scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 50, in <module>
tf.app.run()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/ubuntu/scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 46, in main
inception_train.train(dataset, pids, np_clinical)
File "/home/ubuntu/scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/inception_train.py", line 373, in train
_, loss_value = sess.run([train_op, loss])
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: indices[348] = -1 is not in [0, 891)
[[node GatherV2_400 (defined at /scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/inception_train.py:253) ]]
[[tower_7/mixed_35x35x288a/branch_pool/Conv/BatchNorm/AssignMovingAvg_1/AssignSub/_6375]]
(1) Invalid argument: indices[348] = -1 is not in [0, 891)
[[node GatherV2_400 (defined at /scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/inception_train.py:253) ]]
Input Source operations connected to node GatherV2_400:
Variable/read (defined at /scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/inception_train.py:251)
Input Source operations connected to node GatherV2_400:
Variable/read (defined at /scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/inception_train.py:251)
Original stack trace for 'GatherV2_400':
File "/scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 50, in <module>
tf.app.run()
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 46, in main
inception_train.train(dataset, pids, np_clinical)
File "/scipts/01_training/xClasses/bazel-bin/inception/imagenet_train.runfiles/inception/inception/inception_train.py", line 253, in train
clinic_features_tensor = tf.gather(tensor_vals, pids_toget)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 3475, in gather
return gen_array_ops.gather_v2(params, indices, axis, name=name)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4097, in gather_v2
batch_dims=batch_dims, name=name)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
I run the session here. I have tried using printnames = tf.Print(filenames) and printing the filenames for each batch, but it does not work for the batch that crashes.
# Build an initialization operation to run below.
init = tf.global_variables_initializer()
tbinit = tf.initializers.tables_initializer()

# Start running operations on the Graph. allow_soft_placement must be set to
# True to build towers on GPU, as some of the ops do not have GPU
# implementations.
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=FLAGS.log_device_placement))
sess.run(init)
sess.run(tbinit)

if FLAGS.pretrained_model_checkpoint_path:
    try:
        assert tf.gfile.Exists(FLAGS.pretrained_model_checkpoint_path)
        variables_to_restore = slim.variables.get_variables_to_retore_except_logits()
        restorer = tf.train.Saver(variables_to_restore)
        restorer.restore(sess, FLAGS.pretrained_model_checkpoint_path)
        print('%s: Pre-trained model restored from %s' %
              (datetime.now(), FLAGS.pretrained_model_checkpoint_path))
    except:
        #restorer = tf.train.import_meta_graph(FLAGS.pretrained_model_checkpoint_path + '.meta')
        variables_to_restore = slim.variables.get_variables_to_retore_except_logits()
        restorer = tf.train.Saver(variables_to_restore)
        restorer.restore(sess, FLAGS.pretrained_model_checkpoint_path)

# Start the queue runners.
tf.train.start_queue_runners(sess=sess)

summary_writer = tf.summary.FileWriter(
    FLAGS.train_dir,
    graph=sess.graph)

for step in range(FLAGS.max_steps):
    start_time = time.time()
    p2, _, loss_value = sess.run([printnames, train_op, loss])
    print(p2)
    duration = time.time() - start_time
I have tried using try/except statements, but I cannot get those to produce the names I need either. Is it even possible? It seems like sess.run just crashes.
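One way to surface the offending key is to attach the print directly to the lookup result rather than to the whole batch of filenames. A hedged sketch (it assumes the filenames tensor and the table from the snippet above, a batched lookup instead of the per-element loop, and the TF 1.x API):
indices = table.lookup(filenames)                  # shape [batch_size]
missing = tf.reshape(tf.where(tf.equal(indices, -1)), [-1])

# If any lookup returned the default value (-1), print the filenames at
# those positions; otherwise pass the indices through unchanged.
indices = tf.cond(
    tf.size(missing) > 0,
    lambda: tf.Print(indices, [tf.gather(filenames, missing)],
                     message="keys missing from table: ", summarize=32),
    lambda: tf.identity(indices))

clinic_features_tensor = tf.gather(tensor_vals, indices)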

TypeError when using keras functional model

I used the Keras functional API (Keras version 2.2) to define a model, but when I try to fit the data to the model I get some form of error. I am currently using Python 2.7 and the code is running on Ubuntu 18.04.
The following is the code for the model:
class Model:
    def __init__(self, config):
        self.hidden_layers = config["hidden_layers"]
        self.loss = config["loss"]
        self.optimizer = config["optimizer"]
        self.batch_normalization = config["batch_normalization"]
        self.model = self._build_model()

    def _build_model(self):
        input = Input(shape=(32,))
        hidden_layers = []
        if self.batch_normalization:
            hidden_layers.append(Dense(self.hidden_layers[0], bias_initializer=Orthogonal)(input))
            hidden_layers.append(BatchNormalization()(hidden_layers[-1]))
            hidden_layers.append(Activation("relu")(hidden_layers[-1]))
        else:
            hidden_layers.append(Dense(self.hidden_layers[0], bias_initializer=Orthogonal, activation='relu')(input))
        for i in self.hidden_layers[1:]:
            if self.batch_normalization:
                hidden_layers.append(Dense(i, bias_initializer=Orthogonal)(hidden_layers[-1]))
                hidden_layers.append(BatchNormalization()(hidden_layers[-1]))
                hidden_layers.append(Activation("relu")(hidden_layers[-1]))
            else:
                hidden_layers.append(Dense(i, bias_initializer=Orthogonal, activation='relu')(hidden_layers[-1]))
        output_layer = Dense(2, activation="softmax")(hidden_layers[-1])
        model = Model(input=input, output=output_layer)
        model.compile(optimizer=self.optimizer, loss=self.loss, metrics=["accuracy"])
        return model
The following is the command that I use and the error message I get once I run the fit method:
model.fit(x=X_train,y=Y_train, epochs=20)
File "/home/project/main.py", line 69, in <module>
main(config)
File "/home/project/main.py", line 62, in main
model = Model(config, logging).model
File "/home/project/model.py", line 18, in __init__
self.model = self._build_model()
File "/home/project/model.py", line 34, in _build_model
hidden_layers.append(Dense(self.hidden_layers[0], bias_initializer= Orthogonal, activation='relu')(input))
File "/home/project/venv/local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 431, in __call__
self.build(unpack_singleton(input_shapes))
File "/home/project/venv/local/lib/python2.7/site-packages/keras/layers/core.py", line 872, in build
constraint=self.bias_constraint)
File "/home/project/venv/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/project/venv/local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 252, in add_weight
constraint=constraint)
File "/home/project/venv/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 402, in variable
v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
File "/home/project/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 183, in __call__
return cls._variable_v1_call(*args, **kwargs)
...
...
File "/home/project/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 1329, in __init__
constraint=constraint)
File "/home/project/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 1437, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
TypeError: __call__() takes at least 2 arguments (1 given)
I really don't understand what this TypeError is. I am not sure how to change my model definition to avoid this error.
It seems the error happens with the bias initializer. You're passing a class Orthogonal when you should be passing an instance of that class, like bias_initializer=Orthogonal().
Now, I strongly recommend you never use the same names as Keras for your classes. Don't create a class Model, create anything else, like class AnyNameOtherThanModel.
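A minimal sketch of both fixes together (illustrative layer sizes; the wrapper class is simply dropped here so nothing shadows keras.models.Model):
from keras.layers import Input, Dense
from keras.initializers import Orthogonal
from keras.models import Model

inputs = Input(shape=(32,))
# pass an *instance* of the initializer, not the class itself
x = Dense(64, bias_initializer=Orthogonal(), activation='relu')(inputs)
outputs = Dense(2, activation='softmax')(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])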

Use glstm(Group LSTM) cell to build bidirectional rnn in tensorflow

I'm using a CNN + LSTM + CTC network (based on https://arxiv.org/pdf/1507.05717.pdf) to do Chinese scene text recognition. With a large number of classes (3500+), the network is very hard to train. I heard that using Group LSTM (https://arxiv.org/abs/1703.10722, O. Kuchaiev and B. Ginsburg, "Factorization Tricks for LSTM Networks", ICLR 2017 workshop) can reduce the number of parameters and accelerate the training, so I've tried to use it in my code.
I use a two-layer bidirectional LSTM. This is the original code using tf.contrib.rnn.LSTMCell:
rnn_outputs, _, _ = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    [tf.contrib.rnn.LSTMCell(num_units=self.num_hidden, state_is_tuple=True) for _ in range(self.num_layers)],
    [tf.contrib.rnn.LSTMCell(num_units=self.num_hidden, state_is_tuple=True) for _ in range(self.num_layers)],
    self.rnn_inputs, dtype=tf.float32, sequence_length=self.rnn_seq_len, scope='BDDLSTM')
The training is very slow. After 100 hours, the prediction accuracy on the test set is still 39%.
Now I want to use tf.contrib.rnn.GLSTMCell. When I replace the LSTMCell with GLSTMCell like this:
rnn_outputs, _, _ = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    [tf.contrib.rnn.GLSTMCell(num_units=self.num_hidden, num_proj=self.num_proj, number_of_groups=4) for _ in range(self.num_layers)],
    [tf.contrib.rnn.GLSTMCell(num_units=self.num_hidden, num_proj=self.num_proj, number_of_groups=4) for _ in range(self.num_layers)],
    self.rnn_inputs, dtype=tf.float32, sequence_length=self.rnn_seq_len, scope='BDDLSTM')
I get the following error
/home/frisasz/miniconda2/envs/dl/bin/python "/media/frisasz/DATA/FSZ_Work/deep learning/IDOCR_/work/train.py"
Traceback (most recent call last):
File "/media/frisasz/DATA/FSZ_Work/deep learning/IDOCR_/work/train.py", line 171, in <module>
train(train_dir='/media/frisasz/Windows/40T/', val_dir='../../0000/40V/')
File "/media/frisasz/DATA/FSZ_Work/deep learning/IDOCR_/work/train.py", line 41, in train
FLAGS.momentum)
File "/media/frisasz/DATA/FSZ_Work/deep learning/IDOCR_/work/model.py", line 61, in __init__
self.logits = self.rnn_net()
File "/media/frisasz/DATA/FSZ_Work/deep learning/IDOCR_/work/model.py", line 278, in rnn_net
self.rnn_inputs, dtype=tf.float32, sequence_length=self.rnn_seq_len, scope='BDDLSTM')
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/contrib/rnn/python/ops/rnn.py", line 220, in stack_bidirectional_dynamic_rnn
dtype=dtype)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 375, in bidirectional_dynamic_rnn
time_major=time_major, scope=fw_scope)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 574, in dynamic_rnn
dtype=dtype)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 737, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2770, in while_loop
result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2599, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2549, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 720, in _time_step
skip_conditionals=True)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 206, in _rnn_step
new_output, new_state = call_cell()
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 708, in <lambda>
call_cell = lambda: cell(input_t, state)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 180, in __call__
return super(RNNCell, self).__call__(inputs, state)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 441, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/contrib/rnn/python/ops/rnn_cell.py", line 2054, in call
R_k = _linear(x_g_id, 4 * self._group_shape[1], bias=False)
File "/home/frisasz/miniconda2/envs/dl/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 1005, in _linear
"but saw %s" % (shape, shape[1]))
ValueError: linear expects shape[1] to be provided for shape (?, ?), but saw ?
Process finished with exit code 1
I'm not sure if GLSTMCell can simply replace LSTMCell in tf.contrib.rnn.stack_bidirectional_dynamic_rnn() (or other functions that help to build an RNN). I didn't find any examples of the use of GLSTMCell. Does anybody know the right way to build a bidirectional RNN with GLSTMCell?
I got the exact same error trying to build bidirectional GLSTM using bidirectional_dynamic_rnn.
In my case, the problem came from the fact that GLSTM can only be used when defined in a static way: when the graph is built, you can't have undefined shape parameters (such as batch_size, for instance).
So, try to define in the graph all the shapes that will end up at some point in the GLSTM cell, and it should work fine.
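For example, here is a sketch of what "static" means in practice (hypothetical sizes; the point is that every dimension feeding the GLSTM cell, including the batch size, is a concrete integer):
# Fully defined shapes: no None dimensions are left for GLSTMCell's
# internal linear layer to choke on.
batch_size, max_time, feat_dim = 32, 100, 512

rnn_inputs = tf.placeholder(tf.float32, shape=(batch_size, max_time, feat_dim))
rnn_seq_len = tf.fill([batch_size], max_time)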
