I am training a deep RL agent with TensorFlow on a custom environment and have noticed that calling the environment's step method inside my training loop is much slower than calling it outside the loop.
Here is pseudocode that illustrates the setup:
class Agent:
    ...
    def train(self):
        for i in range(batch_size):
            start = time.time()
            nextstate, reward, done = env.step()
            self.timecounter += time.time() - start   # accumulate wall time spent in env.step
Inside the agent's training loop, env.step accounts for roughly 0.5 seconds of the roughly 0.8 seconds of wall time per batch. If I instead call the same code outside the training loop, I don't see this overhead:
start = time.time()
for i in range(batch_size):
    out = env.step()
print(time.time() - start)
Run this way, an equivalent number of env.step calls takes only ~0.01 seconds. What is causing the extra overhead? It is making training very slow.
I tried gym's CartPole environment and it does not have this issue. My custom environment has several dictionary attributes; could that be what's causing the problem? I don't see any other differences between my code and gym's CartPole.
Full code to reproduce on your machine is available at https://github.com/ronan-keane/havsim/tree/DL3
in scripts/meng assignments/control 1/traintest.py
I managed to answer my own question. The issue was that the output of the neural net being passed into the custom Python code was a tf.float32 tensor. This caused the Python code to convert everything it touched to tensors, so all of the math in the otherwise pure Python code was operating on newly created tensors, which added the huge ~50x overhead.
The issue was fixed by simply converting the output of the NN to a regular Python float.
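For anyone hitting the same thing, a minimal sketch of the fix (the model, state, and env.step signature here are placeholders for my actual code):

action_tensor = model(state)                # NN output is a tf.float32 tensor
action = float(action_tensor.numpy())       # convert to a plain Python float before it touches the env
nextstate, reward, done = env.step(action)  # the pure-Python env math now stays in Python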
Related
I am using TensorFlow 2.6, and my code requires the setting below at startup because I use symbolic Keras tensors in a partial loss in my model:
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
At the same time I also want to train on multiple GPUs, so I used MirroredStrategy, but the issue is that MirroredStrategy requires eager execution, which conflicts with disabling it as above (a sketch of the conflicting setup is shown below). Please help me if there is another way of training on multiple GPUs.
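Roughly, the combination that fails looks like this (a sketch; build_model, partial_loss, and train_ds are placeholders for my actual code):

import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution

disable_eager_execution()                    # required for the symbolic-tensor partial loss

strategy = tf.distribute.MirroredStrategy()  # intended multi-GPU training
with strategy.scope():
    model = build_model()                               # placeholder model definition
    model.compile(optimizer="adam", loss=partial_loss)  # partial_loss uses symbolic Keras tensors
model.fit(train_ds, epochs=10)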
I have tried running my code with the line below, but it got stuck and printed a warning about significant overhead:
tf.config.run_functions_eagerly(True)
But I believe this is wrong, since, as I mentioned, I need eager execution to stay disabled.
Sorry if there are any mistakes in this question. I come from a PyTorch background, but I need to use TFRecordDataset in order to read from TFRecords. Currently, this looks like the following:
class TFRecordReader:
    def __iter__(self):
        dataset = tf.data.TFRecordDataset(
            [self.tfrecord_path], compression_type="GZIP"
        )
        dataset = dataset.map(self._parse_example)
        self._tfr_iter = iter(dataset)
        return self

    def __next__(self):
        return next(self._tfr_iter)
However, I need to create multiple TFRecordReaders per PyTorch worker to do batch balancing. This results in 4 TFRecordDatasets (4 buckets to balance over) per worker per GPU, so I end up with 4 * 4 * 4 = 64 TFRecordDatasets in memory. I have enough CPU memory for this, but the problem is that the memory is not released by the TFRecordDatasets: it keeps increasing steadily over the course of training. I believe the issue is that the computation graph keeps growing (every time a new TFRecord is read, a new TFRecordDataset is created for it) but is never released.
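For context, the readers are created roughly like this inside a PyTorch IterableDataset (a sketch; the TFRecordReader constructor, bucket paths, and the round-robin balancing shown here are simplified placeholders):

import torch

class BalancedTFRecordDataset(torch.utils.data.IterableDataset):
    def __init__(self, bucket_paths):
        self.bucket_paths = bucket_paths                        # 4 bucket paths per worker

    def __iter__(self):
        # one TFRecordReader per bucket, created inside every DataLoader worker
        readers = [iter(TFRecordReader(path)) for path in self.bucket_paths]
        while True:
            for reader in readers:                              # simplified round-robin balancing
                try:
                    yield next(reader)
                except StopIteration:
                    return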
How can I make sure that the memory used by the TFRecordDataset is released after I finish iterating through a single TFRecord?
I tried:
def __iter__(self):
    with tf.Graph().as_default() as g:
        dataset = tf.data.TFRecordDataset(
            [self.tfrecord_path], compression_type="GZIP"
        )
        dataset = dataset.map(self._parse_example)
        tf.compat.v1.enable_eager_execution()
        self._tfr_iter = iter(dataset)
        while True:
            try:
                example_dict = next(self._tfr_iter)
                # ...
However, I get an error that:
RuntimeError: __iter__() is only supported inside of tf.function or when eager execution is enabled.
I would really appreciate any advice on how to make sure the memory does not keep growing. I am using TensorFlow 2.5, for reference.
The issue turned out to be the PyTorch profiler being used together with PyTorch Lightning; the issue was not with TensorFlow.
See relevant issue here
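If it helps anyone else, making sure the profiler is off in Lightning looks roughly like this (a sketch; all other Trainer arguments are omitted):

import pytorch_lightning as pl

# profiler=None (the default) keeps profiling disabled; passing a profiler
# (e.g. "simple" or "advanced") was what led to the steadily growing memory here
trainer = pl.Trainer(profiler=None)
trainer.fit(model)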
The problem
I have a (very) small and fast model saved in the SavedModel format which I can load and run with the following code:
model = tf.keras.models.load_model("./<SavedModelDir>")
outputs = model(inputs, training=False)
This prediction runs in 0.05 seconds for a batch of 5 inputs (on an Nvidia GPU).
If, however, I use model.predict_on_batch(inputs) or model.predict(inputs), performance drops significantly to 0.65-0.80 seconds for a batch of 5. This is consistent with the documentation, which states that using model() (__call__) is usually faster for smaller inputs.
The problem I am having is that I am trying to port my model to a C(++) program, and using TF_SessionRun() with the C API or model_bundle.GetSession()->Run() with the C++ API, I get performance similar to the "slow" Python inference methods.
What I have tried
Another (very) small model with small batch, same result.
I tried disabling optimizations with tf.config.optimizer.set_experimental_options({'disable_meta_optimizer': False}) to make sure the optimizer was not negatively impacting performance, but this made things even slower.
I also tried converting the SavedModel to a TensorRT SavedModel. This increases the performance of the model() (__call__) method even further, but all the other methods stop working in Python, and with the downloaded precompiled TensorFlow C GPU API (2.5.0) and the C++ API compiled with Tensorflow_CC I get an error about an operation not being found (TensorRT does not seem to work).
All the performance numbers given were measured after a few warm-up runs.
Performance was measured with both the TensorFlow profiler and Python's time.time.
I checked that model() (__call__) produces correct output, and it does.
My question(s)
Is there a way to get model() (__call__) performance with the Tensorflow C(++) API?
The problem seems to lie somewhere in TensorFlow's optimization for larger batch sizes, which hurts performance on smaller batches. Is there another API that allows faster inference on small batches out of the box (the TensorRT C++ API?)?
I think I figured it out by accident by doing the following for something else I was trying:
I added tf.compat.v1.disable_v2_behavior() at the top of the script and then called print(len(outputs)) right after getting the outputs. This gives the following error: TypeError: len is not well defined for symbolic Tensors.
By Googling I found out that symbolic tensors are tensors that do not directly hold values; the values are filled in later.
This means that model() (__call__) runs its computation asynchronously, so timing the call by itself gives a false value. It can be "fixed" by stopping the timer only after printing/using every output, or by just using the predict() method, which avoids the issue completely.
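A minimal timing sketch of what I mean (assumes normal eager mode, a loaded Keras model, and an inputs batch; the names are placeholders):

import time
import numpy as np

start = time.time()
outputs = model(inputs, training=False)  # may return before the GPU has actually finished
_ = np.array(outputs)                    # forces the output values to be materialized
print("true __call__ time:", time.time() - start)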
I'm currently struggling with a Capsule Network (the Keras implementation, CapsNet).
Each time I run more than 2-5 predictions in a row (inside a loop), the results vary a lot. I have tried to change so many things, including switching the optimizer from Adam to SGD, but I simply can't make it 100% stable and thereby be able to reproduce a given run.
How can I make CapsNet 100% reproducible every run?
The answer to this is long and involved. There's a blog post that goes into much more detail than I can here, but I'll try to capture the high level points.
Set the PYTHONHASHSEED environment variable to 0 before running your Python program.
If you're running calculations on the GPU, that can produce non-reproducible results due to floating-point rounding. You can disable the GPU and run all operations on the CPU by setting the CUDA_VISIBLE_DEVICES environment variable to an empty string, in the same manner as before.
CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 python your_program.py
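Beyond the environment variables, seeding the random number generators is also commonly needed; a sketch (the exact calls depend on your TF/Keras version):

import random
import numpy as np
import tensorflow as tf

SEED = 0
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy RNG
tf.random.set_seed(SEED)  # TensorFlow RNG (tf.set_random_seed(SEED) on TF 1.x)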
I have a problem regarding prediction performance. I repeatedly call the test_predictions op in a Python loop and collect all of its return values into a list. The code looks like this:
from tqdm import trange

predictions = []
for _ in trange(args.num_batches):
    predictions.extend(sess.run(model.test_predictions))
When I look at the performance statistics, my GPU is idle for more than 2/3 of the time, probably because of continual switching between Python and TF code. I cannot make the batch size bigger because it won't fit in memory. Is there any better solution I can implement?
There is no such thing as "switching between Python and TF code" overhead here. If the GPU is idle a lot, that means fetching the data (images?) to run the predictions on takes a long time, and the GPU has to wait for the data to arrive.
Try implementing pre-fetching.
Alternatively, if you have enough memory, just read all images in at once and feed your network that way.
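A minimal prefetching sketch with tf.data (assumes a TF 1.x-style session; image_paths, load_image, args.batch_size, and model.inputs are placeholders for your own pipeline):

dataset = (tf.data.Dataset.from_tensor_slices(image_paths)  # list of input file paths
           .map(load_image, num_parallel_calls=4)           # decode/preprocess on background threads
           .batch(args.batch_size)
           .prefetch(2))                                     # keep a couple of batches ready for the GPU
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_batch = iterator.get_next()

predictions = []
for _ in trange(args.num_batches):
    batch = sess.run(next_batch)                             # returns quickly; prefetching overlaps I/O with compute
    predictions.extend(sess.run(model.test_predictions, feed_dict={model.inputs: batch}))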