Change Logdir of Ray RLlib Training instead of ~/ray_results - python

I'm using Ray & RLlib to train RL agents on an Ubuntu system. Tensorboard is used to monitor the training progress by pointing it to ~/ray_results where all the log files for all runs are stored. Ray Tune is not being used.
For example, on starting a new Ray/RLlib training run, a new directory will be created at
~/ray_results/DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1
To visualize the training progress, we need to start Tensorboard using
tensorboard --logdir=~/ray_results
Question: Is it possible to configure Ray/RLlib to change the output directory of the log files from ~/ray_results to another location?
Additionally, instead of logging to a directory named something like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can this directory name be set by ourselves?
Failed Attempt: Tried setting
os.environ['TUNE_RESULT_DIR'] = '~/another_dir'
before running ray.init(), but the result log files were still being written to ~/ray_results.

Without using Tune, you can change the logdir using RLlib's Trainer. The Trainer class takes an optional logger_creator argument if you want to specify where to save the logs (see here).
A concrete example:
Define your customized logger creator (you can simply modify from the default one):
import os
import tempfile
from datetime import datetime

from ray.tune.logger import UnifiedLogger

def custom_log_creator(custom_path, custom_str):
    timestr = datetime.today().strftime("%Y-%m-%d_%H-%M-%S")
    logdir_prefix = "{}_{}".format(custom_str, timestr)

    def logger_creator(config):
        if not os.path.exists(custom_path):
            os.makedirs(custom_path)
        logdir = tempfile.mkdtemp(prefix=logdir_prefix, dir=custom_path)
        return UnifiedLogger(config, logdir, loggers=None)

    return logger_creator
Pass this logger_creator to the trainer, and start training:
from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(config=config, env='CartPole-v0',
                     logger_creator=custom_log_creator(os.path.expanduser("~/another_ray_results/subdir"), 'custom_dir'))

for i in range(ITER_NUM):
    result = trainer.train()
You will find the training results (i.e., TensorBoard events file, params, model, ...) saved under "~/another_ray_results/subdir" with your specified naming convention.

Is it possible to configure Ray/RLlib to change the output directory of the log files from ~/ray_results to another location?
There is currently no way to configure this using the RLlib CLI tool (rllib).
If you're okay with the Python API, then, as described in the documentation, the local_dir parameter of tune.run is responsible for specifying the output directory; the default is ~/ray_results.
Additionally, instead of logging to a directory named something like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can this directory name be set by ourselves?
This is governed by the trial_name_creator parameter of tune.run. It must be a function that accepts a trial object and formats it into a string, like so:
def trial_name_id(trial):
    return f"{trial.trainable_name}_{trial.trial_id}"

tune.run(..., trial_name_creator=trial_name_id)
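Putting the two together, a minimal sketch could look like the following (the trainable name, environment, and stopping criterion are placeholders, not taken from the original post):

import os
from ray import tune

def trial_name_id(trial):
    return f"{trial.trainable_name}_{trial.trial_id}"

# local_dir redirects output away from ~/ray_results; trial_name_creator
# controls how each trial's directory is named inside that location.
tune.run(
    "DQN",                                          # placeholder trainable
    config={"env": "CartPole-v0"},                  # placeholder config
    stop={"training_iteration": 10},                # placeholder stopping criterion
    local_dir=os.path.expanduser("~/another_dir"),
    trial_name_creator=trial_name_id,
)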

Just for anyone who bumps into this problem with Ray Tune.
You can specify local_dir for run_config within tune.Tuner:
# This logs to 2 different trial folders:
# ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
# Only trial_name is autogenerated.
tuner = tune.Tuner(trainable,
                   tune_config=tune.TuneConfig(num_samples=2),
                   run_config=air.RunConfig(local_dir="./results", name="test_experiment"))
results = tuner.fit()
Please see this link for more info.

Related

How to inject the information about load version into Kedro node?

I need to run a Kedro (v0.17.4) pipeline with a node that is supposed to process data with different logic depending on the load version of the input.
As a simple and crude example, assume there is a catalog.yml file with this entry:
test_data_set:
  type: pandas.CSVDataSet
  filepath: data/01_raw/test.csv
  versioned: true
and there are multiple versions of test.csv (say '1' and '2') and I want to use the Catalog from the config file and run the following node/pipeline:
from kedro.config import ConfigLoader
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

conf_loader = ConfigLoader(['conf/base'])
conf_catalog = conf_loader.get('catalog*', 'catalog/**')
io = DataCatalog.from_config(conf_catalog)

def my_node(my_data_set):
    # if version_of_my_data_set == '1':  # how to do this?
    #     print("do something with version 1")
    # ... do something else
    return

my_pipeline = Pipeline([node(func=my_node, inputs="test_data_set", outputs=None, name="process_versioned_data")])
SequentialRunner().run(my_pipeline, catalog=io)
I understand that runtime parameters or the load version are supposed to be separated from the logic in a node by design, but in my specific case it would still be useful to find a way to do this.
In general the pipeline will be executed via the API but also via the command line with the --load_version flag.
Solutions that I have considered but discarded:
- store the load version somehow in the Kedro session and access it within the node via "get_current_session" (how?)
- add load_version as a required input parameter for the node (would probably break compatibility with some upstream pipeline)
In short:
Is there a good way to pass the information of the user-specified load version of a dataset to a Kedro node?

Transformer library cache path is not changing

I have tried this but it's not working for me. I am using this Git repo. I am building a desktop app and don't want users to download the model; I want to ship the models with the build. I know the transformers library looks for models in cache/torch/transformers and downloads them if they're not there. I also know you can pass the cache_dir parameter to from_pretrained.
I am trying this.
cache = os.path.join(os.path.abspath(os.getcwd()),
                     'Transformation/Annotators/New Sentiment Analysis/transformers')
os.environ['TRANSFORMERS_CACHE'] = cache

if args.model_name_or_path is None:
    args.model_name_or_path = 'barissayil/bert-sentiment-analysis-sst'

# Configuration for the desired transformer model
config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=cache)
I have tried the solution in the above-mentioned question and tried cache_dir as well. The transformers folder is in the same directory as analyze.py. The whole repo and the transformers folder are in the New Sentiment Analysis directory.
You actually haven't shown the code which is not working, but I assume you did something like the following:
from transformers import AutoConfig
import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'
config = AutoConfig.from_pretrained('barissayil/bert-sentiment-analysis-sst')
os.path.isdir('/blabla/cache/')
Output:
False
This will not create a new default location for caching, because you imported the transformers library before you set the environment variable (I modified your linked question to make it clearer). The proper way to modify the default caching directory is to set the environment variable before importing the transformers library:
import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'
from transformers import AutoConfig
config = AutoConfig.from_pretrained('barissayil/bert-sentiment-analysis-sst')
os.path.isdir('/blabla/cache/')
Output:
True
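Applied to the snippet from the question, that means the cache path has to be set before anything from transformers is imported. A rough sketch of how the questioner's analyze.py could be reordered (the argparse handling is omitted and the model name is hard-coded here for brevity):

import os

# Set the cache location first, before transformers is imported anywhere.
cache = os.path.join(os.path.abspath(os.getcwd()),
                     'Transformation/Annotators/New Sentiment Analysis/transformers')
os.environ['TRANSFORMERS_CACHE'] = cache

from transformers import AutoConfig  # imported only after the env var is set

# Passing cache_dir explicitly as well makes the intended location unambiguous.
config = AutoConfig.from_pretrained('barissayil/bert-sentiment-analysis-sst',
                                    cache_dir=cache)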

fmi2GetFMUState/fmi2SetFMUState supported for Matlab/OpenModelica generated FMUs?

I am trying to test a simple fmu to save and restore the states.
For example, OpenModelica:
model modelicatest
  input Real In1;
  output Real Out1(start=0, fixed=true);
equation
  der(Out1) = In1;
end modelicatest;
Also for Simulink (an equivalent model, shown as an image in the original post).
I am using FMPy to simulate the generated FMUs.
But for OpenModelica v1.14.1 generated FMU, I get the following error when I call getFMUState from FMPy:
Exception: fmi2GetFMUstate failed with status 3
For the Simulink (2019b) generated FMU using the built-in exporter, the FMU state (i.e. the output value) does not reset when I run setFMUState.
Just wondering, are these functions supported for OpenModelica and Simulink generated FMUs? Or is it an FMPy issue?
With respect to fmi2GetFMUstate/fmi2SetFMUstate, the FMI Specification, section 2.1.8, states:
These functions are only supported by the FMU, if the optional capability flag <fmiModelDescription> <ModelExchange / CoSimulation canGetAndSetFMUstate = "true"> in the XML file is explicitly set to true (see sections 3.3.1 and 4.3.1).
You can unzip the fmu file and take a look at the modelDescription.xml file to find out if the flag is set: if it is false or not set at all, the get and set functions are not supported.
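If you want to script that check, a small sketch along these lines (standard library only; "model.fmu" is a placeholder filename) reads the flag directly from modelDescription.xml:

import zipfile
import xml.etree.ElementTree as ET

# "model.fmu" is a placeholder for your exported FMU.
with zipfile.ZipFile("model.fmu") as fmu:
    root = ET.fromstring(fmu.read("modelDescription.xml"))

for tag in ("CoSimulation", "ModelExchange"):
    element = root.find(tag)
    if element is not None:
        flag = element.get("canGetAndSetFMUstate", "false")
        print(f"{tag}: canGetAndSetFMUstate = {flag}")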

Update from learn_runner.run to tf.estimator.train_and_evaluate on GCMLE

I am trying to make sure I cover everything when updating to tf.estimator.train_and_evaluate() instead of learn_runner.run().
I am looking to base things off this GCMLE custom estimator sample, which used to be:
learn_runner.run(
    generate_experiment_fn(
        min_eval_frequency=args.min_eval_frequency,
        eval_delay_secs=args.eval_delay_secs,
        train_steps=args.train_steps,
        eval_steps=args.eval_steps,
        export_strategies=[saved_model_export_utils.make_export_strategy(
            model.SERVING_FUNCTIONS[args.export_format],
            exports_to_keep=1
        )]
    ),
    run_config=tf.contrib.learn.RunConfig(model_dir=args.job_dir),
    hparams=hparam.HParams(**args.__dict__)
)
export_strategies:
Previously, the export_strategies would place the final model binaries in the $job_dir/export/Servo/$timestamp. However, when trying to convert to use tf.estimator.train_and_evaluate I cannot see how to replicate this behavior.
Following this newer custom estimator example, I have passed
exporter = tf.estimator.FinalExporter('saved-model', SERVING_FUNCTIONS[hparams.export_format])
into the EvalSpec via exporters=[exporter], but it doesn't work as the final export strategy did previously.
run_config
Previously, run_config was passed as an additional argument to learn_runner.run(). Now my approach within my run_experiment() function is to pass the run_config directly to tf.estimator.Estimator's config parameter. Is there any functionality that I am missing with this?
Example:
run_config = tf.estimator.RunConfig(model_dir=hparams.job_dir,
                                    save_checkpoints_steps=hparams.save_checkpoint_steps,
                                    save_summary_steps=hparams.save_summary_steps)

estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   model_dir=hparams.job_dir,
                                   config=run_config,
                                   params=hparams)
Is there anything with the new run_config implementation that I am missing out on from the old implementation?
The issue was coming from the stopping condition of num_epochs vs. num_steps -- it appears that tf.estimator.Estimator does not play well with num_epochs, so you should have your stopping condition be num_steps if you want an export folder.
Separately, it's worth noting that if you specify model_dir within tf.estimator.Estimator() directly and also specify model_dir within run_config = tf.estimator.RunConfig(), then these names must match. The TF documentation suggests that it should be okay to specify the model_dir in both places if they are equal: "model_dir: Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. If None, the model_dir in config will be used if set. If both are set, they must be same. If both are None, a temporary directory will be used."
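For reference, a minimal sketch of how the pieces can fit together under that constraint (train_input_fn, eval_input_fn, and the hparams fields are placeholders standing in for the sample's own code):

exporter = tf.estimator.FinalExporter('saved-model',
                                      SERVING_FUNCTIONS[hparams.export_format])

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                    max_steps=hparams.train_steps)  # step-based stop
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=hparams.eval_steps,
                                  exporters=[exporter])

# With a step-based stopping condition, the final export should land under
# $job_dir/export/saved-model/$timestamp.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)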

Tensorflow session returns as 'closed'

I have successfully ported the CIFAR-10 ConvNet tutorial code for my own images and am able to train on my data and generate Tensorboard outputs etc.
My next step was to implement an evaluation of new data against the model I built. I am trying now to use cifar10_eval.py as a starting point however am running into some difficulty.
I should point out that the original tutorial code runs entirely without a problem, including cifar10_eval.py. However, when moving this particular code to my application, I get the following error message (last line).
RuntimeError: Attempted to use a closed Session.
I found that this error is thrown by TF's session.py:
# Check session.
if self._closed:
    raise RuntimeError('Attempted to use a closed Session.')
I have checked the directories in which all files should reside and be created, and all seems exactly as it should (they mirror perfectly those created by running the original tutorial code). They include a train, eval and data folders, containing checkpoints/events files, events file, and data binaries respectively.
I wonder if you could help point out how I can debug this, as I'm sure there may be something in the data flow that got disrupted when transitioning the code. Unfortunately, despite digging deep and comparing to the original, I can't find the source, as they are essentially similar with trivial changes in file names and destination directories only.
EDIT_01:
Debugging step by step, it seems the line that actually throws the error is #106 in the original cifar10_eval.py:
def eval_once(args etc):
    ...
    with tf.Session() as sess:
        ...
        summary = tf.Summary()
        summary.ParseFromString(sess.run(summary_op))  # <========== line 106
summary_op is created in def evaluate of this same script and passed as an arg to def eval_once.
summary_op = tf.merge_all_summaries()
...
while True:
    eval_once(saver, summary_writer, top_k_op, summary_op)
From the documentation on Session, a session can be closed with the .close() method or by using it through a context manager in a with block. I ran find tensorflow/models/image/cifar10 | xargs grep "sess" and I don't see any sess.close, so it must be the latter.
I.e., you'll get this error if you do something like this:
with tf.Session() as sess:
    sess.run(..)
sess.run(...)  # Attempted to use a closed Session.
It was a simple (but humbling) error in indentation.
summary = tf.Summary()
summary.ParseFromString(sess.run(summary_op))
summary.value.add(tag='Precision @ 1', simple_value=precision)
summary_writer.add_summary(summary, global_step)
was outside of the try: block, and of course, no session could be found.
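In other words, the fix is to move those lines back inside the with tf.Session() block, where the session is still open. A structural sketch of the corrected shape (eliding the rest of eval_once from cifar10_eval.py, so not runnable on its own):

def eval_once(saver, summary_writer, top_k_op, summary_op):
    with tf.Session() as sess:
        ...
        try:
            ...
            summary = tf.Summary()
            summary.ParseFromString(sess.run(summary_op))  # sess is still open here
            summary.value.add(tag='Precision @ 1', simple_value=precision)
            summary_writer.add_summary(summary, global_step)
        except Exception as e:
            coord.request_stop(e)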
Sigh.
