Hello, I am using the Stable Baselines package (https://stable-baselines.readthedocs.io/), specifically PPO2, and I am not sure how to properly save my model. I trained it for 6 virtual days and got my average return to around 300, then decided that this was not enough, so I trained the model for another 6 days. But when I looked at the training statistics, the second training's return per episode started at around 30. This suggests that it did not save all parameters.
This is how I use the package to save the model:
def make_env_init(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    """
    def env_init():
        # Important: use a different seed for each environment
        env = gym.make(env_id, connection=blt.DIRECT)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return env_init
envs = VecNormalize(SubprocVecEnv([make_env_init(f'envs:{env_name}', i) for i in range(processes)]),
                    norm_reward=False)

if os.path.exists(folder / 'model_dump.zip'):
    model = PPO2.load(folder / 'model_dump.zip', envs, **ppo_kwards)
else:
    model = PPO2(MlpPolicy, envs, **ppo_kwards)

model.learn(total_timesteps=total_timesteps, callback=callback)
model.save(folder / 'model_dump.zip')
The way you saved the model is correct. Training is not a monotonic process: it can also show much worse results after further training.
What you can do, first of all, is write logs of the progress:
model = PPO2(MlpPolicy, envs, tensorboard_log="./logs/progress_tensorboard/")
In order to see the log, run in terminal:
tensorboard --port 6004 --logdir ./logs/progress_tensorboard/
It will give you a link to the board, which you can then open in a browser (e.g. http://pc0259:6004/).
Secondly, you can make snapshots of the model every X steps:
from stable_baselines.common.callbacks import CheckpointCallback
checkpoint_callback = CheckpointCallback(save_freq=1e4, save_path='./model_checkpoints/')
model.learn(total_timesteps=total_timesteps, callback=[callback, checkpoint_callback])
Combining it with the log, you can pick up the model which performed best!
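For completeness, here is a minimal sketch of how you might load one of those checkpoints once you have identified the best-performing one in TensorBoard. CheckpointCallback names files "<name_prefix>_<num_timesteps>_steps.zip" (the default prefix is "rl_model"); the step count below is just an example, not a value from your run:
from stable_baselines import PPO2

# Hypothetical example: pick the checkpoint that looked best in TensorBoard.
best_model = PPO2.load('./model_checkpoints/rl_model_500000_steps.zip', envs)
best_model.learn(total_timesteps=total_timesteps)  # or use it for evaluation only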
I trained a model using the Tensorflow Object Detection API, using Faster-RCNN with a Resnet architecture. I am using tensorflow 1.13.1, cudnn 7.6.5, protobuf 3.11.4, python 3.7.7, numpy 1.18.1, and I cannot upgrade the versions at the moment. I need to evaluate the accuracy (AP/mAP) of the trained model on the validation set for IOU=0.3. I am using the legacy/eval.py script on purpose, since it calculates AP/mAP for IOU=0.5 only (instead of mAP:0.5:0.95):
python legacy/eval.py --logtostderr --pipeline_config_path=training/faster_rcnn_resnet152_coco.config --checkpoint_dir=training/ --eval_dir=eval/
I tried several things, including updating the pipeline config file to set min_score_threshold=0.3:
eval_config: {
  num_examples: 60
  min_score_threshold: 0.3
  ..
I also updated the default value in the protos/eval.proto file and recompiled the proto file to generate a new version of eval_pb2.py:
// Minimum score threshold for a detected object box to be visualized
optional float min_score_threshold = 13 [default = 0.3];
However, eval.py still calculates/shows AP/mAP with IOU=0.5.
The above configuration only helped to show objects with confidence level < 0.5 in the eval.py output images, but this is not what I need.
Does anybody know how to evaluate the model with IOU=0.3?
I finally managed to solve it by modifying the hardcoded matching_iou_threshold=0.5 argument value in multiple method signatures (especially def __init__) in ../object_detection/utils/object_detection_evaluation.py.
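As an illustration of where that threshold lives, the evaluator classes in that module take matching_iou_threshold as a constructor argument, so a sketch like the following (the category list is illustrative) computes PASCAL-style AP at IOU=0.3 when the evaluator is driven directly rather than through eval.py:
from object_detection.utils import object_detection_evaluation

# Illustrative categories; replace with the entries from your label map.
categories = [{'id': 1, 'name': 'cat'}, {'id': 2, 'name': 'dog'}]

# This is the same threshold that is hardcoded to 0.5 in several places used by eval.py.
evaluator = object_detection_evaluation.PascalDetectionEvaluator(
    categories, matching_iou_threshold=0.3)
# evaluator.add_single_ground_truth_image_info(...) / add_single_detected_image_info(...)
# followed by evaluator.evaluate() then gives AP/mAP at IOU=0.3.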
I'm using Weights & Biases cloud-based sweeps with Keras.
So first I create a new sweep within a W&B project with a config like the following:
description: LSTM Model
method: random
metric:
  goal: maximize
  name: val_accuracy
name: LSTM-Sweep
parameters:
  batch_size:
    distribution: int_uniform
    max: 128
    min: 32
  epochs:
    distribution: constant
    value: 200
  node_size1:
    distribution: categorical
    values:
      - 64
      - 128
      - 256
  node_size2:
    distribution: categorical
    values:
      - 64
      - 128
      - 256
  node_size3:
    distribution: categorical
    values:
      - 64
      - 128
      - 256
  node_size4:
    distribution: categorical
    values:
      - 64
      - 128
      - 256
  node_size5:
    distribution: categorical
    values:
      - 64
      - 128
      - 256
  num_layers:
    distribution: categorical
    values:
      - 1
      - 2
      - 3
  optimizer:
    distribution: categorical
    values:
      - Adam
      - Adamax
      - Adagrad
  path:
    distribution: constant
    value: "./path/to/data/"
program: sweep.py
project: SLR
My sweep.py file looks something like this:
# imports
init = wandb.init(project="my-project", reinit=True)
config = wandb.config

def main():
    skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
    cvscores = []
    group_id = wandb.util.generate_id()
    X, y = # load data
    i = 0
    for train, test in skfold.split(X, y):
        i = i + 1
        run = wandb.init(group=group_id, reinit=True, name=group_id + "#" + str(i))
        model = # build model
        model.fit([...], WandBCallback())
        cvscores.append([...])
        wandb.join()

if __name__ == "__main__":
    main()
I start this with the wandb agent command from the folder containing sweep.py.
What I experienced with this setup is that the first wandb.init() call initializes a new run. Okay, I could just remove that. But when calling wandb.init() for the second time, it seems to lose track of the sweep it is running in. Online, an empty run is listed in the sweep (because of the first wandb.init() call); all other runs are listed inside the project, but not in the sweep.
My goal is to have a run for each fold of the k-fold cross-validation. At least I thought this would be the right way of doing it.
Is there a different approach to combine sweeps with keras k-fold cross validation?
We put together an example of how to accomplish k-fold cross validation:
https://github.com/wandb/examples/tree/master/examples/wandb-sweeps/sweeps-cross-validation
The solution requires some contortions for the wandb library to spawn multiple jobs on behalf of a launched sweep job.
The basic idea is:
The agent requests a new set of parameters from the cloud-hosted parameter server. This is the run called sweep_run in the main function.
It sends information about which folds should be processed over a multiprocessing queue to waiting processes.
Each spawned process logs to its own run, organized with group and job_type to enable auto-grouping in the UI.
When a process is finished, it sends the primary metric over a queue to the parent sweep run.
The sweep run reads the metrics from the child runs and logs them to the sweep run, so that the sweep can use that result to inform future parameter choices and/or hyperband early-termination optimizations.
Example visualizations of the sweep and k-fold grouping can be seen here:
Sweep: https://app.wandb.ai/jeffr/examples-sweeps-cross-validation/sweeps/vp0fsvku
K-fold Grouping: https://app.wandb.ai/jeffr/examples-sweeps-cross-validation/groups/vp0fsvku
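For reference, here is a minimal sketch of that parent/child-run pattern, with the fold training stubbed out. Function names, the fold count, and the placeholder metric are illustrative, not the exact code from the linked example (which also does some additional bookkeeping between parent and child runs):
import multiprocessing as mp
import wandb

def train_fold(config, fold, group, results_q):
    # Each fold logs to its own run; group/job_type enable auto-grouping in the UI.
    run = wandb.init(group=group, job_type="fold",
                     name=f"{group}-fold{fold}", config=config, reinit=True)
    val_accuracy = 0.0  # placeholder: build, train, and evaluate the model for this fold
    run.log({"val_accuracy": val_accuracy})
    run.finish()
    results_q.put(val_accuracy)

def main():
    # The "sweep run": receives the hyperparameters chosen by the sweep server.
    sweep_run = wandb.init()
    config = dict(sweep_run.config)
    group = sweep_run.id

    results_q = mp.Queue()
    workers = [mp.Process(target=train_fold, args=(config, k, group, results_q))
               for k in range(5)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    scores = [results_q.get() for _ in workers]

    # Log the aggregated metric on the sweep run so the sweep can optimize it.
    sweep_run.log({"val_accuracy": sum(scores) / len(scores)})

if __name__ == "__main__":
    main()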
Tensorflow version: 1.14
Our current setup uses TensorFlow estimators to do live NER, i.e. perform inference one document at a time. We have 30 different fields to extract, and we run one model per field, so we have 30 models in total.
The setup uses Python multiprocessing to run the inferences in parallel (the inference is done on CPUs). This approach reloads the model weights each time a prediction is made.
Using the approach mentioned here, we exported the estimator models as tf.saved_model. This works as expected in that it does not reload the weights for each request. It also works fine for single-field inference in one process, but it doesn't work with multiprocessing: all the processes hang when we make the predict function (predict_fn in the linked post) call.
This post is related, but I am not sure how to adapt it for the saved model.
Importing tensorflow individually for each of the predictors did not work either:
class SavedModelPredictor():

    def __init__(self, model_path):
        import tensorflow as tf
        self.predictor_fn = tf.contrib.predictor.from_saved_model(model_path)

    def predictor_fn(self, input_dict):
        return self.predictor_fn(input_dict)
How to make tf.saved_model work with multiprocessing?
Ray Serve, Ray's model serving solution, also supports offline batching. You can wrap your model in a Ray Serve backend and scale it to the number of replicas you want.
import ray
from ray import serve

client = serve.start()

class MyTFModel:
    def __init__(self, model_path):
        self.model = ...  # load model

    @serve.accept_batch
    def __call__(self, input_batch):
        assert isinstance(input_batch, list)

        # forward pass
        self.model([item.data for item in input_batch])

        # return a list of responses
        return [...]

client.create_backend("tf", MyTFModel,
    # configure resources
    ray_actor_options={"num_cpus": 2, "num_gpus": 1},
    # configure replicas
    config={
        "num_replicas": 2,
        "max_batch_size": 24,
        "batch_wait_timeout": 0.5
    }
)
client.create_endpoint("tf", backend="tf")
handle = serve.get_handle("tf")

# perform inference on a list of inputs
futures = [handle.remote(data) for data in fields]
result = ray.get(futures)
Try it out with the nightly wheel and here's the tutorial: https://docs.ray.io/en/master/serve/tutorials/batch.html
Edit: updated the code sample for Ray 1.0
OK, so the approach outlined in this answer with Ray worked.
I built a class like this, which loads the model on init and exposes a run function to perform the prediction:
import tensorflow as tf
import ray

ray.init()

@ray.remote
class MyModel(object):

    def __init__(self, field, saved_model_path):
        self.field = field
        # load the model once in the constructor
        self.predictor_fn = tf.contrib.predictor.from_saved_model(saved_model_path)

    def run(self, df_feature, *args):
        # ...
        # code to perform prediction using self.predictor_fn
        # ...
        return self.field, list_pred_string, list_pred_proba
Then used the above in the main module as:
from pathlib import Path

# form a dictionary with key 'field' and value MyModel
model_dict = {}
for field in fields:
    export_dir = f"saved_model/{field}"
    subdirs = [x for x in Path(export_dir).iterdir()
               if x.is_dir() and 'temp' not in str(x)]
    latest = str(sorted(subdirs)[-1])
    model_dict[field] = MyModel.remote(field, latest)
Then used the above model dictionary to do predictions like this:
results = ray.get([model_dict[field].run.remote(df_feature) for field in fields])
Update:
While this approach works, we found that running estimators in parallel with multiprocessing is faster than running predictors in parallel with Ray. This is especially true for large document sizes. It looks like the predictor approach might work well for a small number of dimensions and when the input data is not large. Maybe an approach like the one mentioned here might be better for our use case.
I'm trying to run a neural network multiple times with different parameters in order to calibrate the network's parameters (dropout probabilities, learning rate, etc.). However, I have the problem that, even while keeping the parameters the same, running the network in a loop as follows still gives me a different solution each time:
filename = create_results_file()
for i in range(3):
    g = tf.Graph()
    with g.as_default():
        accuracy_result, average_error = network.train_network(
            parameters, inputHeight, inputWidth, inputChannels, outputClasses)
    f, w = get_csv_writer(filename)
    w.writerow([accuracy_result, "did run %d" % i, average_error])
    f.close()
I am using the following code at the start of my train_network function before setting up the layers and error function of my network:
np.random.seed(1)
tf.set_random_seed(1)
I have also tried adding this code before the TensorFlow graph creation, but I keep getting different solutions in my results output.
I am using an AdamOptimizer and am initializing network weights using tf.truncated_normal. Additionally I am using np.random.permutation to shuffle the incoming images for each epoch.
Setting the current TensorFlow random seed affects the current default graph only. Since you are creating a new graph for your training and setting it as default (with g.as_default():), you must set the random seed within the scope of that with block.
For example, your loop should look like the following:
for i in range(3):
    g = tf.Graph()
    with g.as_default():
        tf.set_random_seed(1)
        accuracy_result, average_error = network.train_network(
            parameters, inputHeight, inputWidth, inputChannels, outputClasses)
Note that this will use the same random seed for each iteration of the outer for loop. If you want to use a different—but still deterministic—seed in each iteration, you can use tf.set_random_seed(i + 1).
Backend Setup: cuda:10.1, cudnn: 7, tensorflow-gpu: 2.1.0, keras: 2.2.4-tf, and vgg19 customized model
After looking into the issue of unstable results for tensorflow backend with GPU training and large neural network models based on keras, I was finally able to get reproducible (stable) results as follows:
Import only those libraries that are required for setting seeds, and initialize a seed value:
import tensorflow as tf
import os
import numpy as np
import random
SEED = 0
Function to initialize seeds for all libraries which might have stochastic behavior
def set_seeds(seed=SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
Activate Tensorflow deterministic behavior
def set_global_determinism(seed=SEED):
    set_seeds(seed=seed)

    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)

# Call the above function with the seed value
set_global_determinism(seed=SEED)
Important notes:
Please call the above code before executing any other code.
Model training might become slower since the code is deterministic, so there's a tradeoff.
I experimented several times with varying numbers of epochs and different settings (including model.fit() with shuffle=True), and the above code gives me reproducible results.
References:
https://suneeta-mall.github.io/2019/12/22/Reproducible-ml-tensorflow.html
https://keras.io/getting_started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads
https://www.tensorflow.org/api_docs/python/tf/random/set_seed?version=nightly
Deterministic behaviour can be obtained either by supplying a graph-level or an operation-level seed. Both worked for me. A graph-level seed can be set with tf.set_random_seed. An operation-level seed can be set, e.g., in a variable initializer, as in:
myvar = tf.Variable(tf.truncated_normal((10, 10), stddev=0.1, seed=0))
Tensorflow 2.0 Compatible Answer: For Tensorflow versions 2.0 and above, if we want to set the global random seed, the command used is tf.random.set_seed.
If we are migrating from Tensorflow version 1.x to 2.x, we can use the command
tf.compat.v2.random.set_seed.
Note that tf.function acts like a re-run of a program in this case.
To set the operation-level seed (as answered above), we can use the command tf.random.uniform([1], seed=1).
For more details, refer to this Tensorflow page.
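As a small illustrative sketch of the two seed levels in TF 2.x (the seed values here are arbitrary):
import tensorflow as tf

tf.random.set_seed(1234)            # global (graph-level) seed
a = tf.random.uniform([1])          # depends on the global seed and op creation order
b = tf.random.uniform([1], seed=1)  # global seed combined with an operation-level seed
print(a.numpy(), b.numpy())         # same values every time the program is re-run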
It seems as if none of these answers will work due to underlying implementation issues in CuDNN.
You can get a bit more determinism by adding an extra flag
os.environ['PYTHONHASHSEED']=str(SEED)
os.environ['TF_CUDNN_DETERMINISTIC'] = '1' # new flag present in tf 2.0+
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)
But this still won't be entirely deterministic. To get an even more exact solution, you need to use the procedure outlined in this nvidia repo.
Please add all random seed functions before your code:
tf.reset_default_graph()
tf.random.set_seed(0)
random.seed(0)
np.random.seed(0)
I think some models in TensorFlow use numpy or the Python random module.
I'm using TensorFlow 2 (2.2.0) and I'm running code in JupyterLab. I've tested this on macOS Catalina and in Google Colab with the same results. I'll add a few things to Tensorflow Support's answer.
When I do some training using the model.fit() method, I do it in a cell; I do some other stuff in other cells. This is the code I run in the cell mentioned:
# To always get the same results, place this at the top of the cell
tf.random.set_seed(1)
(x_train, y_train), (x_test, y_test) = load_data_onehot_grayscale()
model = get_mlp_model_compiled()  # Creates the model, compiles it and returns it
history = model.fit(x=x_train, y=y_train,
                    epochs=30,
                    callbacks=get_mlp_model_callbacks(),
                    validation_split=.1)
This is what I understand:
TensorFlow has some random processes that happen at different stages (initializing, shuffling, ...); every time one of those processes happens, TensorFlow uses a random function. When you set the seed using tf.random.set_seed(1), you make those processes use it, and if the seed is set and the processes don't change, the results will be the same.
Now, in the code above, if I move tf.random.set_seed(1) below the line model = get_mlp_model_compiled(), my results change; I believe this is because get_mlp_model_compiled() uses randomness and isn't using the seed I want.
Caveat about point 2: if I run the cell 3 times in a row, I do get the same results. I believe this happens because, in run nº1, get_mlp_model_compiled() isn't using TensorFlow's internal counter with my seed. In run nº2 it will be using the seed, and all subsequent runs will be using it too, so after run nº2 the results will be the same.
I might have some information wrong so feel free to correct me.
To understand what's going on, you should read the docs; they're not that long and are fairly easy to understand.
This answer is an addition to Luke's answer and for TF v2.2.0
import numpy as np
import os
import random
import tensorflow as tf # 2.2.0
SEED = 42
os.environ['PYTHONHASHSEED']=str(SEED)
os.environ['TF_CUDNN_DETERMINISTIC'] = '1' # TF 2.1
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
I launch TensorBoard with tensorboard --logdir=/home/vagrant/notebook
At tensorboard:6006 > graph, it says: "No graph definition files were found. To store a graph, create a tf.python.training.summary_io.SummaryWriter and pass the graph either via the constructor, or by calling its add_graph() method."
import tensorflow as tf
sess = tf.Session()
writer = tf.python.training.summary_io.SummaryWriter("/home/vagrant/notebook", sess.graph_def)
However, the page is still empty. How can I start playing with TensorBoard?
[Screenshot: current TensorBoard]
Result wanted: an empty, editable graph to which I can add nodes.
Update
It seems like TensorBoard is unable to create a graph where I can add nodes, drag, and edit, etc. (I am confused by the official video).
Running https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/tutorials/mnist/fully_connected_feed.py and then tensorboard --logdir=/home/vagrant/notebook/data lets me view the graph.
However, it seems like TensorFlow only provides the ability to view summaries, which is not much different and does not make it stand out.
TensorBoard is a tool for visualizing the TensorFlow graph and analyzing recorded metrics during training and inference. The graph is created using the Python API, then written out using the tf.train.SummaryWriter.add_graph() method. When you load the file written by the SummaryWriter into TensorBoard, you can see the graph that was saved, and interactively explore it.
However, TensorBoard is not a tool for building the graph itself. It does not have any support for adding nodes to the graph.
Starting from the following Code Example, I can add one line as shown below:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession() #define a session
# Create 100 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(100).astype("float32")
y_data = x_data * 0.1 + 0.3
# Try to find values for W and b that compute y_data = W * x_data + b
# (We know that W should be 0.1 and b 0.3, but Tensorflow will
# figure that out for us.)
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b
# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)
# Before starting, initialize the variables. We will 'run' this first.
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
#### ----> ADD THIS LINE <---- ####
writer = tf.train.SummaryWriter("/tmp/test", sess.graph)
# Fit the line.
for step in xrange(201):
    sess.run(train)
    if step % 20 == 0:
        print(step, sess.run(W), sess.run(b))

# Learns best fit is W: [0.1], b: [0.3]
And then run tensorboard from the command line, pointing to the appropriate directory. This shows a complete call for the SummaryWriter. It is important to note the following things:
SummaryWriter is passed a Session, and so must happen after the Session (or InteractiveSession) is created
That Session may be created early in the program, but when the Session is passed to the SummaryWriter, the graph as it exists at that point is written to the file that the TensorBoard will use.
On this page, there is a very simple piece of code that you can use to test your installation: http://tensorflow.org/get_started
I included this line right after "sess.run(init)":
tf.train.write_graph(sess.graph_def, '/home/daniel/Documents/Projetos/Prorum/ProgramasEmPython/TestingTensorFlow/fileGraph', 'graph.pbtxt')
This will generate a file that you then have to load into TensorBoard.
In order to open TensorBoard, supposing it is installed on your computer (it must be if you installed TensorFlow with pip), I used the Ubuntu terminal and wrote:
tensorboard --logdir nameOfDirectory
Then you should open your browser on port 6006:
http://localhost:6006/
This will open TensorBoard. I went to the "Graph" menu and loaded the file. It generated the figure below:
So what I have done is transfer the model I created in Python to TensorBoard. I believe it is possible to create an empty one if no model is created (only the session is initiated). However, I am not sure whether you are able to change this directly in TensorBoard.
I have answered this question before in Portuguese, with more details for Brazilian users. Maybe it can be useful for other people: http://prorum.com/index.php/1843/recentemente-plataforma-aprendizagem-primeira-impressao
I solved it on Windows with:
file_writer = tf.summary.FileWriter("output", sess.graph)
for the directory "output". I opened a command prompt on Windows and typed:
tensorboard --logdir="C:\Users\kiran\machine Learning\output"
My mistake was on that line.
The graphs in TensorBoard do not show up if you are using Firefox. You have to install Chrome.
Regarding the "result wanted: an empty graph that can add nodes, editable": I think you will find the Orange tool useful. It allows you to drag and drop various nodes and implement algorithms via a GUI.
I had to use
python -m tensorflow.tensorboard --logdir="C:\tmp\tensorflow\.."
Somehow plain tensorboard --logdir didn't work for me.
My environment
OS: Windows 7, Python 3.5, and Tensorflow 1.1.0