I'm trying to distribute the training of a Deep Reinforcement Learning system that I have built using Ray and TensorFlow 1. While I am using Ray because I have a lot of code that parallelizes logic not directly related to the training, I would like to parallelize the training itself (namely the gradient reduction over different workers on different GPUs) using the tf.distribute utilities, mainly because they can use the NCCL communication library, which I expect will boost the training speed compared with other approaches.
The problem is that I don't want to refactor my TensorFlow code (written in old TensorFlow 1 at a low level, with custom training loops; I am not using any API like Keras), but I can't figure out how to use tf.distribute, namely MirroredStrategy, to distribute the training using TensorFlow 1 code.
I have found this guide about tf.distribute in TensorFlow 1, but even in the custom-loop example it uses the Keras API to build the model and the optimizer. I am trying to follow this guide as far as possible in order to build a simple example that uses the libraries/APIs that I am using in my main project, but I cannot make it work.
The example code is this:
import numpy as np
import tensorflow.compat.v1 as tf
import ray

tf.disable_v2_behavior()


@ray.remote(num_cpus=1, num_gpus=0)
class Trainer:
    def __init__(self, local_data):
        tf.disable_v2_behavior()
        self.current_w = 1.0
        self.local_data = local_data
        self.strategy = tf.distribute.MirroredStrategy()
        with self.strategy.scope():
            self.w = tf.Variable(1.0, dtype=tf.float32)
            self.x = tf.placeholder(shape=(None, 1), dtype=tf.float32)
            self.y = self.w * self.x
            self.grad = tf.gradients(self.y, [self.w])

            def train_step_opt():
                def grad_fn():
                    grad = tf.gradients(self.y, [self.w])
                    return grad
                per_replica_grad = self.strategy.experimental_run_v2(grad_fn)
                red_grad = self.strategy.reduce(
                    tf.distribute.ReduceOp.SUM, per_replica_grad, axis=None)
                minimize = self.w.assign_sub(red_grad[0])
                return minimize

            self.minimize = self.strategy.experimental_run_v2(train_step_opt)

    def train_step(self):
        with self.strategy.scope():
            with tf.Session() as sess:
                sess.run(self.minimize, feed_dict={self.x: self.local_data})
                self.current_w = sess.run(self.w)
        return self.current_w


ray.init()

data = np.arange(4) + 1
data = data.reshape((-1, 1))
data_w = [data[None, i] for i in range(4)]

trainers = [Trainer.remote(d) for d in data_w]
W = ray.get([t.train_step.remote() for t in trainers])[0]
print(W)
It is supposed to simply compute the derivative of a linear function in different processes, reduce all the derivatives to a single value and apply it to the single parameter "w".
When I run it I get the following error:
Traceback (most recent call last):
File "dtfray.py", line 49, in <module>
r = ray.get([t.train_step.remote() for t in trainers])
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::Trainer.train_step() (pid=25401, ip=10.128.0.46)
File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "dtfray.py", line 32, in __init__
self.minimize = self.strategy.experimental_run_v2(train_step_opt)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 957, in experimental_run_v2
return self.run(fn, args=args, kwargs=kwargs, options=options)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 770, in _call_for_each_replica
fn, args, kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 201, in _call_for_each_replica
coord.join(threads)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 998, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/Adrian/miniconda3/envs/sukan_env/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 282, in wrapper
return func(*args, **kwargs)
File "dtfray.py", line 22, in train_step_opt
tf.distribute.get_replica_context().merge_call()
TypeError: merge_call() missing 1 required positional argument: 'merge_fn'
As highlighted in the following section of your code:
def train_step_opt():
    def grad_fn():
        grad = tf.gradients(self.y, [self.w])
        return grad
    per_replica_grad = self.strategy.experimental_run_v2(grad_fn)
    red_grad = self.strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_grad, axis=None)
    minimize = self.w.assign_sub(red_grad[0])
    return minimize

self.minimize = self.strategy.experimental_run_v2(train_step_opt)
you have to reduce the results from tf.distribute.MirroredStrategy() one more time for train_step_opt after the line self.minimize = self.strategy.experimental_run_v2(train_step_opt), because you are calling self.strategy.experimental_run_v2() twice: once on train_step_opt and then again on grad_fn inside it.
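For illustration, here is a minimal sketch of the pattern where experimental_run_v2 is called only once per step and the reduction and update happen back in cross-replica context. This is not your exact code; the variable names, the single placeholder feed, and the plain assign_sub update are assumptions, so treat it as a structural sketch rather than a drop-in replacement:

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    w = tf.Variable(1.0, dtype=tf.float32)
    x = tf.placeholder(shape=(None, 1), dtype=tf.float32)

    def grad_fn():
        # Runs in replica context: build the per-replica output and its gradient.
        y = w * x
        return tf.gradients(y, [w])[0]

    # Single call to experimental_run_v2 per step; no nesting.
    per_replica_grad = strategy.experimental_run_v2(grad_fn)
    # Back in cross-replica context: reduce across replicas, then apply the update.
    summed_grad = strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_grad, axis=None)
    train_op = w.assign_sub(summed_grad)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={x: [[1.0], [2.0]]})
    print(sess.run(w))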
Also, you can review lines 178 to 188 of the mirrored_run.py file in the TF GitHub repo, as get_replica_context() is triggered for cross-replica context:
When fn starts, a should_run event is set on the _MirroredReplicaThread (MRT) threads. The execution waits until MRT.has_paused is set, which indicates that either fn is complete or a get_replica_context().merge_call() is called. If fn is complete, then MRT.done is set to True. Otherwise, the arguments of get_replica_context().merge_call from all paused threads are grouped and the merge_fn is performed. The results of the get_replica_context().merge_call are then set to MRT.merge_result. Each such get_replica_context().merge_call call returns the MRT.merge_result for that thread when the MRT.should_run event is reset again. Execution of fn resumes.
Related
I know there is a very similar question and answer on Stack Overflow (here), but this seems to be distinctly different. I am using statsmodels v0.13.2, and I am using an ARIMA model as opposed to a SARIMAX model.
I am trying to fit a list of time series data sets with an ARIMA model. The offending piece of my code is here:
import math

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

items = np.log(og_items)
items['count'] = items['count'].apply(lambda x: 0 if math.isnan(x) or math.isinf(x) else x)

model = ARIMA(items, order=(14, 0, 7))
trained = model.fit()
items is a dataframe containing a date index and a single column, count.
I apply the lambda in the apply(...) call because some counts can be 0, resulting in negative infinity after the log is applied. The final product going into the ARIMA does not contain any NaNs or infinite numbers. However, when I try this without using the log function, I do not get the error. This only occurs on certain series, but there does not seem to be any rhyme or reason to which ones are affected. One series had about half of its values as zero after applying the lambda, while another did not have a single zero. Here is the error:
Traceback (most recent call last):
File "item_pipeline.py", line 267, in <module>
main()
File "item_pipeline.py", line 234, in main
restaurant_predictions = make_predictions(restaurant_data=restaurant_data, models=models,
File "item_pipeline.py", line 138, in make_predictions
predictions = model(*data_tuple[:2], min_date=min_date, max_date=max_date,
File "/Users/rob/Projects/5out-ml/models/item_level/items/predict_arima.py", line 127, in predict_daily_arima
predict_date_arima(prediction_dict, item_dict, prediction_date, x_days_out=x_days_out, log_vals=log_vals,
File "/Users/rob/Projects/5out-ml/models/item_level/items/predict_arima.py", line 51, in predict_date_arima
raise e
File "/Users/rob/Projects/5out-ml/models/item_level/items/predict_arima.py", line 47, in predict_date_arima
fitted = model.fit()
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/tsa/arima/model.py", line 390, in fit
res = super().fit(
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 704, in fit
mlefit = super(MLEModel, self).fit(start_params, method=method,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/base/model.py", line 563, in fit
xopt, retvals, optim_settings = optimizer._fit(f, score, start_params,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/base/optimizer.py", line 241, in _fit
xopt, retvals = func(objective, gradient, start_params, fargs, kwargs,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/base/optimizer.py", line 651, in _fit_lbfgs
retvals = optimize.fmin_l_bfgs_b(func, start_params, maxiter=maxiter,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_lbfgsb_py.py", line 199, in fmin_l_bfgs_b
res = _minimize_lbfgsb(fun, x0, args=args, jac=jac, bounds=bounds,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_lbfgsb_py.py", line 362, in _minimize_lbfgsb
f, g = func_and_grad(x)
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_differentiable_functions.py", line 286, in fun_and_grad
self._update_grad()
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_differentiable_functions.py", line 256, in _update_grad
self._update_grad_impl()
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_differentiable_functions.py", line 173, in update_grad
self.g = approx_derivative(fun_wrapped, self.x, f0=self.f,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_numdiff.py", line 505, in approx_derivative
return _dense_difference(fun_wrapped, x0, f0, h,
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_numdiff.py", line 576, in _dense_difference
df = fun(x) - f0
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_numdiff.py", line 456, in fun_wrapped
f = np.atleast_1d(fun(x, *args, **kwargs))
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in fun_wrapped
fx = fun(np.copy(x), *args)
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/base/model.py", line 531, in f
return -self.loglike(params, *args) / nobs
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 939, in loglike
loglike = self.ssm.loglike(complex_step=complex_step, **kwargs)
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/tsa/statespace/kalman_filter.py", line 983, in loglike
kfilter = self._filter(**kwargs)
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/tsa/statespace/kalman_filter.py", line 903, in _filter
self._initialize_state(prefix=prefix, complex_step=complex_step)
File "/Users/rob/Projects/5out-ml/venv/lib/python3.8/site-packages/statsmodels/tsa/statespace/representation.py", line 983, in _initialize_state
self._statespaces[prefix].initialize(self.initialization,
File "statsmodels/tsa/statespace/_representation.pyx", line 1362, in statsmodels.tsa.statespace._representation.dStatespace.initialize
File "statsmodels/tsa/statespace/_initialization.pyx", line 288, in statsmodels.tsa.statespace._initialization.dInitialization.initialize
File "statsmodels/tsa/statespace/_initialization.pyx", line 406, in statsmodels.tsa.statespace._initialization.dInitialization.initialize_stationary_stationary_cov
File "statsmodels/tsa/statespace/_tools.pyx", line 1206, in statsmodels.tsa.statespace._tools._dsolve_discrete_lyapunov
numpy.linalg.LinAlgError: LU decomposition error.
The solution in the other Stack Overflow post was to initialize the statespace differently. It looks like the statespace is involved, if you look at the last few lines of the error. However, it does not seem that that workflow is exposed in the newer version of statsmodels. Is it? If not, what else can I try to circumvent this error?
So far, I have tried manually initializing the model to approximate diffuse, and manually setting the initialize property to approximate diffuse. Neither seems to be valid in the new statsmodels code.
Turns out there's a new way to initialize. The second line below is the operative line.
model = ARIMA(items, order=(14, 0, 7))
model.initialize_approximate_diffuse() # this line
trained = model.fit()
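For context, here is a self-contained sketch of the same fix on synthetic data; the column name, date range, and ARIMA order mirror the question and are illustrative only, not taken from the original pipeline:

import math

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily counts, including some zeros so that log() produces -inf.
idx = pd.date_range("2022-01-01", periods=200, freq="D")
og_items = pd.DataFrame({"count": np.random.poisson(2, size=200)}, index=idx)

items = np.log(og_items)
items["count"] = items["count"].apply(
    lambda x: 0 if math.isnan(x) or math.isinf(x) else x)

model = ARIMA(items, order=(14, 0, 7))
model.initialize_approximate_diffuse()  # the new initialization call
trained = model.fit()
print(trained.summary())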
I used the Keras functional API (Keras version 2.2) to define a model, but when I try to fit the data to the model I get some form of error. I am currently using Python 2.7 and the code is running on Ubuntu 18.04.
The following is the code for the model:
from keras.models import Model
from keras.layers import Input, Dense, BatchNormalization, Activation
from keras.initializers import Orthogonal


class Model:
    def __init__(self, config):
        self.hidden_layers = config["hidden_layers"]
        self.loss = config["loss"]
        self.optimizer = config["optimizer"]
        self.batch_normalization = config["batch_normalization"]
        self.model = self._build_model()

    def _build_model(self):
        input = Input(shape=(32,))
        hidden_layers = []
        if self.batch_normalization:
            hidden_layers.append(Dense(self.hidden_layers[0], bias_initializer=Orthogonal)(input))
            hidden_layers.append(BatchNormalization()(hidden_layers[-1]))
            hidden_layers.append(Activation("relu")(hidden_layers[-1]))
        else:
            hidden_layers.append(Dense(self.hidden_layers[0], bias_initializer=Orthogonal, activation='relu')(input))
        for i in self.hidden_layers[1:]:
            if self.batch_normalization:
                hidden_layers.append(Dense(i, bias_initializer=Orthogonal)(hidden_layers[-1]))
                hidden_layers.append(BatchNormalization()(hidden_layers[-1]))
                hidden_layers.append(Activation("relu")(hidden_layers[-1]))
            else:
                hidden_layers.append(Dense(i, bias_initializer=Orthogonal, activation='relu')(hidden_layers[-1]))
        output_layer = Dense(2, activation="softmax")(hidden_layers[-1])
        model = Model(input=input, output=output_layer)
        model.compile(optimizer=self.optimizer, loss=self.loss, metrics=["accuracy"])
        return model
The following is the command that I use and the error message I get once I run the fit method:
model.fit(x=X_train,y=Y_train, epochs=20)
File "/home/project/main.py", line 69, in <module>
main(config)
File "/home/project/main.py", line 62, in main
model = Model(config, logging).model
File "/home/project/model.py", line 18, in __init__
self.model = self._build_model()
File "/home/project/model.py", line 34, in _build_model
hidden_layers.append(Dense(self.hidden_layers[0], bias_initializer= Orthogonal, activation='relu')(input))
File "/home/project/venv/local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 431, in __call__
self.build(unpack_singleton(input_shapes))
File "/home/project/venv/local/lib/python2.7/site-packages/keras/layers/core.py", line 872, in build
constraint=self.bias_constraint)
File "/home/project/venv/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/project/venv/local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 252, in add_weight
constraint=constraint)
File "/home/project/venv/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 402, in variable
v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
File "/home/project/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 183, in __call__
return cls._variable_v1_call(*args, **kwargs)
...
...
File "/home/project/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 1329, in __init__
constraint=constraint)
File "/home/project/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 1437, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
TypeError: __call__() takes at least 2 arguments (1 given)
I really don't understand what this TypeError is. I am not sure how to change my model definition to avoid this error.
It seems the error happens in the bias initializer: you're passing the class Orthogonal when you should be passing an instance of that class, like bias_initializer=Orthogonal().
Also, I strongly recommend you never use the same names as Keras for your own classes. Don't create a class Model; call it anything else, like class AnyNameOtherThanModel.
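A minimal sketch of both fixes together; the class name, layer sizes, optimizer and loss below are made up for illustration:

from keras.models import Model
from keras.layers import Input, Dense
from keras.initializers import Orthogonal


class MLPBuilder(object):
    def __init__(self, hidden_layers, optimizer, loss):
        self.hidden_layers = hidden_layers
        self.optimizer = optimizer
        self.loss = loss
        self.model = self._build_model()

    def _build_model(self):
        inputs = Input(shape=(32,))
        x = inputs
        for units in self.hidden_layers:
            # Orthogonal() is an instance of the initializer, not the class itself.
            x = Dense(units, bias_initializer=Orthogonal(), activation="relu")(x)
        outputs = Dense(2, activation="softmax")(x)
        model = Model(inputs=inputs, outputs=outputs)
        model.compile(optimizer=self.optimizer, loss=self.loss, metrics=["accuracy"])
        return model


builder = MLPBuilder(hidden_layers=[64, 32], optimizer="adam", loss="categorical_crossentropy")
builder.model.summary()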
opt = SolverFactory("glpk")
opt.options["mipgap"] = 0.05
opt.options["FeasibilityTol"] = 1e-05
solver_manager = SolverManagerFactory("serial")
# results = solver_manager.solve(instance, opt=opt, tee=True,timelimit=None, mipgap=0.1)
results = solver_manager.solve(model, opt=opt, tee=True, timelimit=None)
# sends results to stdout
# results.write()
def pyomo_save_results(options=None, instance=None, results=None):
    OUTPUT = open(r'Results_generic_hub.txt', 'w')
    print(results, file=OUTPUT)
    OUTPUT.close()
It generates the following error. GLPK is installed, and glpsol --help works from any directory. Is this a problem with the GLPK module, or with the model itself? Environment: Conda, Mac OS Yosemite.
File "<ipython-input-7-ba156f9322b2>", line 7, in <module>
results = solver_manager.solve(model, opt=opt, tee=True,timelimit=None)
File "/anaconda/lib/python3.6/site-
packages/pyomo/opt/parallel/async_solver.py", line 34, in solve
return self.execute(*args, **kwds)
File "/anaconda/lib/python3.6/site-
packages/pyomo/opt/parallel/manager.py", line 107, in execute
ah = self.queue(*args, **kwds)
File "/anaconda/lib/python3.6/site-
packages/pyomo/opt/parallel/manager.py", line 122, in queue
return self._perform_queue(ah, *args, **kwds)
File "/anaconda/lib/python3.6/site-
packages/pyomo/opt/parallel/local.py", line 59, in _perform_queue
results = opt.solve(*args, **kwds)
File "/anaconda/lib/python3.6/site-packages/pyomo/opt/base/solvers.py", line 582, in solve
self._presolve(*args, **kwds)
File "/anaconda/lib/python3.6/site-packages/pyomo/opt/solver/shellcmd.py", line 196, in _presolve
OptSolver._presolve(self, *args, **kwds)
File "/anaconda/lib/python3.6/site-packages/pyomo/opt/base/solvers.py", line 661, in _presolve
**kwds)
File "/anaconda/lib/python3.6/site-packages/pyomo/opt/base/solvers.py", line 729, in _convert_problem
**kwds)
File "/anaconda/lib/python3.6/site-packages/pyomo/opt/base/convert.py", line 110, in convert_problem
problem_files, symbol_map = converter.apply(*tmp, **tmpkw)
File "/anaconda/lib/python3.6/site-packages/pyomo/solvers/plugins/converter/model.py", line 86, in apply
io_options=io_options)
File "/anaconda/lib/python3.6/site-packages/pyomo/core/base/block.py", line 1646, in write
io_options)
File "/anaconda/lib/python3.6/site-packages/pyomo/repn/plugins/cpxlp.py", line 163, in __call__
include_all_variable_bounds=include_all_variable_bounds)
File "/anaconda/lib/python3.6/site-packages/pyomo/repn/plugins/cpxlp.py", line 575, in _print_model_LP
" cannot write legal LP file" % str(model.name))
ValueError: ERROR: No objectives defined for input model 'unknown'; cannot write legal LP file
The error you are seeing:
"ERROR: No objectives defined for input model 'unknown'; cannot write legal LP file"
indicates that Pyomo cannot find an active Objective component on your model (either you never added one to the model, or the Objective component(s) were all deactivated). Either way, valid LP files (which is how Pyomo interfaces with GLPK) require an objective. Fixing your model by adding an Objective should resolve this error.
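As a point of reference, here is a minimal sketch of a model with an active Objective that GLPK can solve; the variables and the constraint are made up for illustration:

from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, minimize, SolverFactory)

model = ConcreteModel()
model.x = Var(domain=NonNegativeReals)
model.y = Var(domain=NonNegativeReals)

# Without an active Objective like this one, the LP writer raises the same error.
model.cost = Objective(expr=2 * model.x + 3 * model.y, sense=minimize)
model.demand = Constraint(expr=model.x + model.y >= 10)

opt = SolverFactory("glpk")
results = opt.solve(model, tee=True)
model.display()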
Try this code at the end of the script:

instance = model.create()
instance.pprint()
opt = SolverFactory("glpk")
results = opt.solve(instance)
print(results)
EDIT: After trying several things, I have added the following to my code:
with tf.Session(graph=self.graph) as session:
    session.run(tf.initialize_all_variables())
    try:
        session.run(tf.assert_variables_initialized())
    except tf.errors.FailedPreconditionError:
        raise RuntimeError("Not all variables initialized!")
Now, occasionally this fails, i.e. tf.assert_variables_initialized() will raise FailedPreconditionError, even though immediately before it, tf.initialize_all_variables() was executed. Does anyone have any idea how this can happen?
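For reference, here is a small diagnostic sketch using the same TF1 session API; its placement inside the block above is assumed, and it reports which variables are still uninitialized instead of only asserting:

# Assumes `session` and `tf` from the snippet above.
uninitialized = session.run(tf.report_uninitialized_variables())
if uninitialized.size > 0:
    raise RuntimeError("Uninitialized variables: %s" % uninitialized)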
Original question:
Background
I'm running cross-validated (CV) hyperparameter search on a basic neural net created through Tensorflow, with GradientDescentOptimizer. At seemingly random moments I'm getting a FailedPreconditionError, for different Variables. For example (full stack trace at end of post):
FailedPreconditionError: Attempting to use uninitialized value Variable_5
[[Node: Variable_5/read = Identity[T=DT_FLOAT, _class=["loc:#Variable_5"], _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_5)]]
Some runs fail fairly fast, others not -- one has been running for 15 hours now without problems. I'm running this in parallel on multiple GPUs - not the optimization itself, but each CV fold.
What I've checked
From this and this post I understand that this error occurs when attempting to use Variables that haven't been initialized using tf.initialize_all_variables(). However, I am 99% certain that I'm doing this (and if not, I'd expect it to always fail) - I'll post code below.
The API doc says that "This exception is most commonly raised when running an operation that reads a tf.Variable before it has been initialized."
"Most commonly" suggests that it can also be raised in different scenarios. So, for now the main question:
Question:
are there other scenarios under which this exception may be raised, and what are they?
Code
MLP class:
class MLP(object):
    def __init__(self, n_in, hidden_config, n_out, optimizer, f_transfer=tf.nn.tanh, f_loss=mean_squared_error,
                 f_out=tf.identity, seed=None, global_step=None, graph=None, dropout_keep_ratio=1):
        self.graph = tf.Graph() if graph is None else graph
        # all variables defined below
        with self.graph.as_default():
            self.X = tf.placeholder(tf.float32, shape=(None, n_in))
            self.y = tf.placeholder(tf.float32, shape=(None, n_out))
            self._init_weights(n_in, hidden_config, n_out, seed)
            self._init_computations(f_transfer, f_loss, f_out)
            self._init_optimizer(optimizer, global_step)

    def fit_validate(self, X, y, val_X, val_y, val_f, iters=100, val_step=1):
        [snip]
        with tf.Session(graph=self.graph) as session:
            tf.initialize_all_variables().run()  # <-- VAR INIT HERE
            for i in xrange(iters):
                [snip: get minibatch here]
                _, l = session.run([self.optimizer, self.loss], feed_dict={self.X: X_batch, self.y: y_batch})
                # validate
                if i % val_step == 0:
                    val_yhat = self.validation_yhat.eval(feed_dict=val_feed_dict, session=session)
As you can see, tf.initialize_all_variables().run() is always called before anything else is done. The net is initialized as:
def estimator_getter(params):
    [snip]
    graph = tf.Graph()
    with graph.as_default():
        global_step = tf.Variable(0, trainable=False)
        learning_rate = tf.train.exponential_decay(params.get('learning_rate', 0.1), global_step, decay_steps, decay_rate)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        net = MLP(config_num_inputs[config_id], hidden, 1, optimizer, seed=params.get('seed', 100), global_step=global_step, graph=graph, dropout_keep_ratio=dropout)
Full example stack trace:
FailedPreconditionError: Attempting to use uninitialized value Variable_5
[[Node: Variable_5/read = Identity[T=DT_FLOAT, _class=["loc:#Variable_5"], _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_5)]]
Caused by op u'Variable_5/read', defined at:
File "tf_paramsearch.py", line 373, in <module>
randomized_search_params(int(sys.argv[1]))
File "tf_paramsearch.py", line 356, in randomized_search_params
hypersearch.fit()
File "/home/centos/ODQ/main/python/odq/cv.py", line 430, in fit
return self._fit(sampled_params)
File "/home/centos/ODQ/main/python/odq/cv.py", line 190, in _fit
for train_key, test_key in self.cv)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 766, in __call__
n_jobs = self._initialize_pool()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 537, in _initialize_pool
self._pool = MemmapingPool(n_jobs, **poolargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/pool.py", line 580, in __init__
super(MemmapingPool, self).__init__(**poolargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/pool.py", line 418, in __init__
super(PicklingPool, self).__init__(**poolargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 159, in __init__
self._repopulate_pool()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
w.start()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/forking.py", line 126, in __init__
code = process_obj._bootstrap()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/multiprocessing/pool.py", line 113, in worker
result = (True, func(*args, **kwds))
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 130, in __call__
return self.func(*args, **kwargs)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/centos/ODQ/main/python/odq/cv.py", line 131, in _fold_runner
estimator = estimator_getter(parameters)
File "tf_paramsearch.py", line 264, in estimator_getter
net = MLP(config_num_inputs[config_id], hidden, 1, optimizer, seed=params.get('seed',100), global_step=global_step, graph=graph, dropout_keep_ratio=dropout)
File "tf_paramsearch.py", line 86, in __init__
self._init_weights(n_in, hidden_config, n_out, seed)
File "tf_paramsearch.py", line 105, in _init_weights
self.out_weights = tf.Variable(tf.truncated_normal([hidden_config[-1], n_out], stddev=stdev))
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 206, in __init__
dtype=dtype)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 275, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 523, in identity
return _op_def_lib.apply_op("Identity", input=input, name=name)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2117, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/centos/miniconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
Ok, I've found the problem. There was a rare condition in my code that resulted in one of the hidden layers being created with shape (0, N), i.e. no inputs. In this case, TensorFlow apparently fails to initialize the variables pertaining to that layer.
While this makes sense, it might be useful for TensorFlow to log a warning message in such cases (by the way, I also tried to set TensorFlow logging to debug mode, but couldn't find how; tf.logging.set_verbosity() didn't seem to have an effect).
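For reference, the usual TF1 incantation for this is the snippet below; whether it surfaces anything useful for the initialization issue is another matter:

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.DEBUG)  # tf.logging.INFO is the more common choice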
By the way, for efficiency and fewer bugs, you could follow the following pattern:
tf.reset_default_graph()
a = tf.constant(1)
# ... add more operations to your graph ...
b = tf.Variable(1)
init_op = tf.initialize_all_variables()
tf.get_default_graph().finalize()

sess = tf.InteractiveSession()
sess.run(init_op)
sess.run(compute_op)  # compute_op stands for whatever op you want to evaluate
The finalize call prevents you from modifying the graph between runs, which is slow in the current version. Also, because there is one session and one graph, you don't need with blocks.
For me the solution was:

with sess.as_default():
    result = compute_fn([seed_input, 1])
Check FailedPreconditionError: Attempting to use uninitialized in Tensorflow for other options and my explanation.
Strangely, session.run() is not the same as running a function with sess.as_default(); I tried both.