Keras not using full CPU cores for training

I am training an LSTM model on a very large dataset on my machine, using Keras with the TensorFlow backend. My machine has 16 cores. While training the model, I noticed that the load on every core stays below 40%.
I have gone through different sources looking for a solution and have tried specifying the number of cores to use in the backend as follows:
config = tf.ConfigProto(device_count={"CPU": 16})
backend.tensorflow_backend.set_session(tf.Session(config=config))
Even after that, the load is still the same.
Is this because the model is very small? An epoch takes around 5 minutes. If training used all the cores, it should be faster.
How can I tell Keras or TensorFlow to use all 16 available cores to train the model?
I have gone through these Stack Overflow questions and tried the solutions mentioned there, but they didn't help:
Limit number of cores used in Keras

How exactly are you training the model? You might want to look into using model.fit_generator(), but with a Keras Sequence object instead of a custom generator. This lets you use multiprocessing safely and should result in all cores being used.
You can check out the Keras docs for an example.
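For illustration, here is a minimal sketch of such a Sequence. The class name ArraySequence, the array shapes and the batch size are made up for this example, and it assumes the data fits in NumPy arrays. How the worker settings are passed depends on the Keras version, so the fit call is only shown as a comment.

```python
import math

import numpy as np
import tensorflow as tf


class ArraySequence(tf.keras.utils.Sequence):
    """Serves (x, y) batches by index so Keras workers can fetch them in parallel."""

    def __init__(self, x, y, batch_size):
        super().__init__()
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller).
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]


# Hypothetical data: 1000 sequences of 20 timesteps with 8 features each.
x = np.random.rand(1000, 20, 8).astype("float32")
y = np.random.rand(1000, 1).astype("float32")
train_seq = ArraySequence(x, y, batch_size=64)

# With older Keras / TF releases the worker settings go on the fit call, e.g.
#   model.fit(train_seq, epochs=10, workers=8, use_multiprocessing=True)
# Newer Keras releases take `workers` and `use_multiprocessing` in the
# Sequence constructor instead; check the docs for your installed version.
```

Because each batch is addressed by index, several worker processes can prepare different batches at the same time without yielding duplicates, which is the problem with a plain Python generator.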

Related

How to train a TF model that is larger than GPU memory?

I want to train a large object detection model using TF2, preferably the EfficientDet D7 network. With my Tesla P100 card, which has 16 GB of memory, I am running into an "out of memory" exception, i.e. not enough memory can be allocated on the graphics card.
So I am wondering what my options are in this case. Is it correct that if I had multiple GPUs, the TF model would be split so that it fills the memory of both cards? So in my case, with a second Tesla card that also has 16 GB, would I have 32 GB in total during training? If so, would that also be true for a cloud provider, where I could utilize multiple GPUs?
Moreover, if I am wrong and a model cannot be split across multiple GPUs during training, what other approach would work for training a large network that does not fit into my GPU memory?
PS: I know that I could reduce the batch_size to 1, but unfortunately that still does not solve my issue for the really large models ...

You can use multiple GPUs in GCP (Google Cloud Platform) at least; I'm not too sure about other cloud providers. And yes, once you do that, you can train with a larger batch size (the exact number depends on the GPU, its memory, and how many GPUs you have running in your VM).
You can check this link for the list of all GPUs available in GCP.
If you're using the object detection API, you can check this post regarding training with multiple GPUs.
Alternatively, if you want to stay with a single GPU, one clever trick is gradient accumulation, which lets you virtually increase your batch size without using much extra GPU memory. It is discussed in this post.
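To make the gradient accumulation idea concrete, here is a small framework-agnostic sketch in NumPy rather than TensorFlow. The linear model, the data shapes and the learning rate are assumptions chosen only for the illustration. It shows that summing gradients over several micro-batches and applying one update gives the same gradient as one large batch.

```python
import numpy as np


def mse_grad(w, x, y):
    """Gradient of sum((x @ w - y) ** 2) with respect to w (a sum, not a mean)."""
    return 2.0 * x.T @ (x @ w - y)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))   # hypothetical "large" batch of 32 samples
y = rng.normal(size=(32,))
w = np.zeros(4)

# Reference: gradient of the mean loss over the full batch in one pass.
full_grad = mse_grad(w, x, y) / len(x)

# Accumulation: process 4 micro-batches of 8 samples, sum their gradients,
# and only then normalise and apply a single weight update.
accum = np.zeros_like(w)
for start in range(0, len(x), 8):
    xb, yb = x[start:start + 8], y[start:start + 8]
    accum += mse_grad(w, xb, yb)
accum_grad = accum / len(x)

learning_rate = 0.1
w = w - learning_rate * accum_grad  # one optimizer step per 4 micro-batches
```

Note that this only lowers the memory needed per forward/backward pass. If a single sample already does not fit (batch_size of 1 still runs out of memory), accumulation alone will not help.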

Best practice for allocating GPU and CPU resources in TensorFlow

I'm wondering what is the correct way to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API? I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on GPUs only, but then I saw in the TensorFlow documentation for tf.keras.utils.multi_gpu_model, they suggest explicitly instantiating the model on the CPU:
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
I did this, and now when I train I see my CPU usage go way up, with all 8 cores at about 70% usage each, and my GPU memory is maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create the model on the CPU and use a tf.device context to train it on the GPU?

Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
TensorFlow will only allocate memory and place operations on visible physical devices, as otherwise no LogicalDevice will be created on them. By default all discovered devices are marked as visible.
GPU utilization also depends on the batch_size, so it may change as you vary the batch_size.
You can also compare your current results (time taken and utilization) with a model built as in Example 3 of the multi_gpu_model documentation.
Also, if you follow the link, it states:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
You should see improved performance and GPU utilization with tf.distribute.MirroredStrategy. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.
For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
I would recommend going through the Custom training with tf.distribute.Strategy tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops. It trains a simple CNN model on the Fashion MNIST dataset.
Hope this answers your question. Happy Learning.
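As a rough sketch of what the switch to tf.distribute.MirroredStrategy can look like with the Keras API: the layer sizes and the per-replica batch size of 64 below are placeholders, not values from the question, and the fit call is left as a comment because the training data is not shown.

```python
import tensorflow as tf

# Uses every visible GPU; if none is found it runs on the CPU.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables must be created inside the scope so that they are mirrored
# across replicas; this replaces the explicit tf.device placement.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

# The batch passed to fit is the global batch; it is split across replicas.
global_batch_size = 64 * strategy.num_replicas_in_sync
# model.fit(x_train, y_train, batch_size=global_batch_size, epochs=...)
```

With this approach there is no separate "template" model on the CPU; the strategy decides where the variables live.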

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also minimal, though noticeable, if I decrease the parameter size by 100x, but I'd rather not change the parameter size by that much anyway. In addition, the predict function is very slow the first time it runs (~0.2 s), though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly the 10 samples can be evaluated very quickly. All I want is to predict the samples at different times rather than all at once, since I need to update the tree search before making a new prediction. Should I perhaps work with TensorFlow directly instead?

The batch size controls parallelism when predicting, so it is expected that increasing the batch size gives better performance, as you can use more cores and use the GPU more efficiently.
There is nothing to really work around here: a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are run on a GPU, to minimize the overhead due to data transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed depends on the network architecture.

One change that gave me a speedup was switching from model.predict(x) to
model.predict_on_batch(x)
while making sure that x has 1 as its first dimension.
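A minimal sketch of the difference, using a tiny stand-in CNN (the architecture and the 8x8x3 input are invented for the example, not taken from the question). Calling the model directly is a third option that the TensorFlow docs suggest for small inputs that fit in one batch.

```python
import numpy as np
import tensorflow as tf

# Tiny placeholder network standing in for the rollout policy.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

state = np.random.rand(8, 8, 3).astype("float32")
x = state[np.newaxis, ...]  # shape (1, 8, 8, 3): batch dimension comes first

p_predict = model.predict(x, verbose=0)    # general path with per-call setup
p_on_batch = model.predict_on_batch(x)     # one forward pass on a single batch
p_call = model(x, training=False).numpy()  # direct call, also without batching logic
```

All three return the same probabilities; only the per-call overhead differs, so it is worth timing them on your own model before committing to one.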

I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to stay with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is with pip. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image (input_image is your preprocessed input array)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

AWS Sagemaker | Why multiple instances training taking time multiplied to instance number

I am using AWS SageMaker for model training and deployment. This is a sample of the training code:
from sagemaker.estimator import Estimator
hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)
estimator.fit(data_location)
The Docker image mentioned here is a TensorFlow system.
Suppose it takes 1000 seconds to train the model on one instance. If I increase the instance count to 5, the training time increases fivefold, i.e. to 5000 seconds. My understanding was that the training job would be distributed across the 5 machines, so ideally it would take about 200 seconds per machine, but it seems to be running a separate training on each machine. Can someone explain how this works on a distributed system in general, or with TensorFlow specifically?
I tried to find the answer in this documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf, but it does not seem to describe how training works across distributed machines.

Are you using TensorFlow estimator APIs in your script? If yes, I think you should run the script by wrapping it in sagemaker.tensorflow.TensorFlow class as described in the documentation here. If you run training that way, parallelization and communication between instances should work out-of-the-box.
But note that scaling will not be linear when you increase the number of instances. Communicating between instances takes time and there could be non-parallelizable bottlenecks in your script like loading data to memory.
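To get a feeling for why scaling is not linear, here is a back-of-the-envelope sketch using Amdahl's law. This is a standard simplified model that I am bringing in for illustration, not something SageMaker reports. The 90% parallel fraction is purely an assumption, and communication overhead between instances is ignored, so real numbers will be worse.

```python
def amdahl_speedup(parallel_fraction, n_instances):
    """Ideal speedup when only `parallel_fraction` of the work can be distributed."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_instances)


single_instance_seconds = 1000.0  # figure from the question
parallel_fraction = 0.9           # assumed: 10% is data loading and other serial work

for n in (1, 2, 5, 10):
    speedup = amdahl_speedup(parallel_fraction, n)
    print(f"{n} instances: {speedup:.2f}x, about {single_instance_seconds / speedup:.0f} s")
```

Under these assumptions, 5 instances give roughly a 3.6x speedup, i.e. about 280 seconds rather than the ideal 200.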

Keras with Tensorflow backend - Run predict on CPU but fit on GPU

I am using keras-rl to train my network with the D-DQN algorithm. I am running my training on the GPU with the model.fit_generator() function to allow data to be sent to the GPU while it is doing backprops. I suspect the generation of data to be too slow compared to the speed of processing data by the GPU.
In the generation of data, as instructed in the D-DQN algorithm, I must first predict Q-values with my models and then use these values for the backpropagation. And if the GPU is used to run these predictions, it means that they are breaking the flow of my data (I want backprops to run as often as possible).
Is there a way I can specify on which device to run specific operations? In a way that I could run the predictions on the CPU and the backprops on the GPU.

Maybe you can save the model at the end of training. Then start another Python file and set os.environ["CUDA_VISIBLE_DEVICES"] = "-1" before you import anything from Keras or TensorFlow. You should then be able to load the model and make predictions on the CPU.
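A minimal sketch of that second file could look like the following. The file name model.h5 and the variable states are placeholders, so the loading and prediction lines are left as comments.

```python
import os

# Must be set before TensorFlow is imported; "-1" hides all CUDA devices
# from this process only, so the training process keeps its GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # noqa: E402  (deliberately imported after the env var)

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to this process:", gpus)

# "model.h5" is a placeholder for the model saved by the training process.
# model = tf.keras.models.load_model("model.h5")
# q_values = model.predict(states)
```

Since the environment variable is per process, this only works if prediction runs in a separate process from training, as suggested above.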

It's hard to properly answer your question without seeing your code.
The code below shows how you can list the available devices and force tensorflow to use a specific device.
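import tensorflow as tf
from tensorflow.python.client import device_lib
def get_available_devices():
    # Return the names of all devices TensorFlow can see
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]
print(get_available_devices())
with tf.device('/gpu:0'):
    pass  # Do GPU stuff here
with tf.device('/cpu:0'):
    pass  # Do CPU stuff here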
