Keras seems slower than tensorflow while feeding training data

Keras seems slower than tensorflow while feeding training data - python

I am currently converting a project from tensorflow to keras.
Everything seems fine and I am very impressed by how easy it is to build models with keras. However, training is much slower with Keras, where my GPU is significantly less utilized.
I am using a Tensorflow generator Dataset to load my training data. Luckily keras seems to accept that with no problems.
The problem
However while using tensorflow to train on the dataset I archieve an average GPU utilization of ~70%. When I am training the same network with the same dataset generator using Keras I only archieve ~35% GPU utilization
The problem seems to be that I have a very simple network, hence I need to feed data to the GPU as fast as possible since much time is spent here, compared to actually doing backpropagation.
Using tensorflow the key here seemed to be to not use feed-dicts but instead use the tensor from my dataset as input to the graph. Basically, this can be reduced to
x, y = iterator.get_next() # Get the dataset tensors
loss = tf.reduce_sum(tf.square(y - model_out)) # Use the y tensor directly for loss
# Use x as the input layer in my model <- Implememntation omitted
I would like to achieve the same thing with keras, hence I did something like this, where i set x as the input and y as the target tensor. (Can I somehow get rid of having to put y in a list for the target tensor?)
x, y = iterator.get_next() # Get the dataset tensors
model_input = keras.Input(tensor=x)
# Build model with model_input as input layer and something as output layer. <- Implememntation omitted
model = tf.keras.Model(inputs=model_input, outputs=something) # Insert the dataset tensor directly as input
model.compile(loss='mean_squared_error',
optimizer=#something,
metrics=['accuracy'],
target_tensors=[y]) # Input the dataset y tensor directly for use in the loss calculation
Basically that should set x as the input tensor and y as the tensor used directly for loss, just like in the tensorflow version. I can now cal model.fit without providing x and y arguments explicitly since they are used directly in the graph
model.fit(validation_data=validation_iterator,
steps_per_epoch=5000,
validation_steps=1)
To me it seems like I am doing the same thing now with keras and tensorflow, however, keras is way slower with about half the GPU utilization of the pure tensorflow implementation
Am I doing something wrong here, or should i just accept this slowdown if I want to use keras?

I experienced the same issue on TensorFlow 1.13 and solved it by upgrading to TensorFlow 1.14 / 2.0.0.
For a sanity check, I wrapped the TensorFlow graph (as is) as a Keras model and trained the model using model.fit(). When using TensorFlow 1.13, I got a slowdown in throughput of 50% relative to the throughput of training the pure TensorFlow implementation. In both cases, I used the same tf.data.dataset input pipeline.
Using TensorFlow version 1.14 solved the issue (now I get the ~same throughput for both cases mentioned above). Later I migrated to TensorFlow 2.0.0 (alpha) and also got the same throughput for both cases.

Related

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
as you can see I can predict 10 samples at about the same speed as 1 sample. The change is also very minimal though noticeable if I decrease parameter size by 100X but I'd rather not change parameter size by that much anyway. In addition, the predict function is very slow the first time run through (~0.2s) though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround because clearly the 10 samples can be evaluated very quickly, all I want to be able to do is predict the samples at different times and not all at once since I need to update the Tree Search before making a new prediction. Perhaps should I work with tensorflow instead?

The batch size controls parallelism when predicting, so it is expected that increasing the batch size will have better performance, as you can use more cores and use GPU more efficiently.
You cannot really workaround, there is nothing really to work around, using a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are done in a GPU, to minimize overhead due to transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed completely depends on the network architecture.

One way that gave me a speed up was switching from model.predict(x) to,
model.predict_on_batch(x)
making sure your x shape has 1 as the first dimension.

I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to remain with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Using datasets larger than 2 Gb with Keras

TensorFlow has a long-standing limitation of 2 Gb on a single tensor. It means that you can't train your model on more than 2 Gb of data at one time without jumping through hoops. See Initializing tensorflow Variable with an array larger than 2GB ; Use large dataset in Tensorflow
The standard solution referenced in those posts is to use a placeholder and to pass it to the "session" through feed_dict:
my_graph = tf.Graph()
sess = tf.Session(graph=my_graph)
X_init = tf.placeholder(tf.float32, shape=(m_input, n_input))
X = tf.Variable(X_init)
sess.run(tf.global_variables_initializer(), feed_dict={X_init: data_for_X})
However, this only works when I use the "old" API (tf.Session(), etc.) The recommended approach nowadays is to use Keras (all the tutorials on tensorflow.org use it). And, with Keras, there's no tf.Graph(), no tf.Session(), and no run() (at least none that are readily visible to the user.)
How do I adapt the above code to work with Keras?

In Keras, you'd not load your entire dataset in a tensor. You load it in numpy arrays.
If the entire data can be in a single numpy array:
Thanks to #sebrockm's comment.
The most trivial usage of Keras is simply loading your dataset in a numpy array (not a tf tensor) and call model.fit(arrayWithInputs, arrayWithoutputs, ...)
If the entire data doesn't fit a numpy array:
You'd create a generator or a keras.utils.Sequence to load batches one by one and then train the model with model.fit_generator(generatorOrSequence, ...)
The limitation becomes the batch size, but you'd hardly ever hit 2GB in a single batch.
So, go for it:
keras.utils.Sequence
model.fit_generator

Keras doesn't have a 2GB limitation for datasets, I've trained much larger datasets with Keras with no issues.
The limitation could come from TensorFlow constants, which do have a 2GB limit, but in any case you should NOT store datasets as constants, as these are saved as part of the graph, and that is not the idea of storing a model.
Keras has the model.fit_generator function that you can use to pass a generator function which loads data on the fly, and makes batches. This allows you to load a large dataset on the fly, and you usually adjust the batch size so you maximize performance with acceptable RAM usage. TensorFlow doesn't have a similar API, you have to implement it manually as you say with feed_dict.

Use multiple GPUs for inception_v3 model in TF slim

I am trying to train a slim model using 3 GPUs.
I specifically telling TF to use the second GPU to allocate the model:
with tf.device('device:GPU:1'):
logits, end_points = inception_v3(inputs)
However, I'm getting an OOM error on that GPU everytime I run my code. I've tried to reduce the batch_size so the model fits in memory, but the net is ruinned.
I own 3 GPUS so, is there a way to tell TF to use my third GPU when second is full? I've tried not telling TF to use any GPU and allowing soft placemente, but it is not working either.

This statement with tf.device('device:GPU:1') tells tensorflow specifically to use GPU-1, so it won't attempt to use any other device you have.
When the model is too big, the recommended way is to use model parallelism via manually splitting your graph into different GPUs. The complication in your case is that the model definition is in the library, so you can't insert tf.device statements for different layers unless you patch tensorflow.
But there is a workaround
You can define and place variables before invoking inception_v3 builder. This way inception_v3 will reuse these variables and not change its placement. Example:
with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
with tf.device('device:GPU:1'):
tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/biases", shape=[1000])
tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/weights", shape=[1, 1, 2048, 1000])
with tf.device('device:GPU:0'):
logits, end_points = inception_v3(inputs)
Upon running, you'll see that all variables except Conv2d_1c_1x1 are placed onto GPU-0, while Conv2d_1c_1x1 layer is on GPU-1.
The drawback is that you need to know the shape of each variable you want to replace. But it is doable and at least can get your model running.

How to obtain the Tensorflow code version of a NN built in Keras?

I have been working with Keras for a week or so. I know that Keras can use either TensorFlow or Theano as a backend. In my case, I am using TensorFlow.
So I'm wondering: is there a way to write a NN in Keras, and then print out the equivalent version in TensorFlow?
MVE
For instance suppose I write
#create seq model
model = Sequential()
# add layers
model.add(Dense(100, input_dim = (10,), activation = 'relu'))
model.add(Dense(1, activation = 'linear'))
# compile model
model.compile(optimizer = 'adam', loss = 'mse')
# fit
model.fit(Xtrain, ytrain, epochs = 100, batch_size = 32)
# predict
ypred = model.predict(Xtest, batch_size = 32)
# evaluate
result = model.evaluate(Xtest)
This code might be wrong, since I just started, but I think you get the idea.
What I want to do is write down this code, run it (or not even, maybe!) and then have a function or something that will produce the TensorFlow code that Keras has written to do all these calculations.

First, let's clarify some of the language in the question. TensorFlow (and Theano) use computational graphs to perform tensor computations. So, when you ask if there is a way to "print out the equivalent version" in Tensorflow, or "produce TensorFlow code," what you're really asking is, how do you export a TensorFlow graph from a Keras model?
As the Keras author states in this thread,
When you are using the TensorFlow backend, your Keras code is actually building a TF graph. You can just grab this graph.
Keras only uses one graph and one session.
However, he links to a tutorial whose details are now outdated. But the basic concept has not changed.
We just need to:
Get the TensorFlow session
Export the computation graph from the TensorFlow session
Do it with Keras
The keras_to_tensorflow repository contains a short example of how to export a model from Keras for use in TensorFlow in an iPython notebook. This is basically using TensorFlow. It isn't a clearly-written example, but throwing it out there as a resource.
Do it with TensorFlow
It turns out we can actually get the TensorFlow session that Keras is using from TensorFlow itself, using the tf.contrib.keras.backend.get_session() function. It's pretty simple to do - just import and call. This returns the TensorFlow session.
Once you have the TensorFlow session variable, you can use the SavedModelBuilder to save your computational graph (guide + example to using SavedModelBuilder in the TensorFlow docs). If you're wondering how the SavedModelBuilder works and what it actually gives you, the SavedModelBuilder Readme in the Github repo is a good guide.
P.S. - If you are planning on heavy usage of TensorFlow + Keras in combination, have a look at the other modules available in tf.contrib.keras

So you want to use instead of WX+b a different function for your neurons. Well in tensorflow you explicitly calculate this product, so for example you do
y_ = tf.matmul(X, W)
you simply have to write your formula and let the network learn. It should not be difficult to implement.
In addition what you are trying to do (according to the paper you link) is called batch normalization and is relatively standard. The idea being you normalize your intermediate steps (in the different layers). Check for example https://www.google.ch/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwikh-HM7PnWAhXDXRQKHZJhD9EQFggyMAE&url=https%3A%2F%2Farxiv.org%2Fabs%2F1502.03167&usg=AOvVaw1nGzrGnhPhNGEczNwcn6WK or https://www.google.ch/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0ahUKEwikh-HM7PnWAhXDXRQKHZJhD9EQFghCMAM&url=https%3A%2F%2Fbcourses.berkeley.edu%2Ffiles%2F66022277%2Fdownload%3Fdownload_frd%3D1%26verifier%3DoaU8pqXDDwZ1zidoDBTgLzR8CPSkWe6MCBKUYan7&usg=AOvVaw0AHLwD_0pUr1BSsiiRoIFc
Hope that helps,
Umberto

Keras VGG16 predict speed slow

I'm working on a feature extractor for this transfer learning personal project, and the predict function of Kera's VGG16 model seems pretty slow (31 seconds for a batch of 4 images). I do expect it to be slow, but not sure if the prediction function is slower than it should be.
data = DataGenerator()
data = data.from_csv(csv_path=csv_file,
img_dir=img_folder,
batch_size=batch)
#####################################################
conv_base = VGG16(include_top=False,
weights='imagenet',
input_shape=(480, 640, 3))
model = Sequential()
model.add(conv_base)
model.add(MaxPooling2D(pool_size=(3, 4)))
model.add(Flatten())
######################################################
for inputs, y in data:
feature_batch = model.predict(inputs)
yield feature_batch, y
So, my hunch is that it is slow for these reasons:
my input data is a bit large (loading in (480, 640, 3) size images)
I am running on a weak CPU (M3-6Y30 # 0.90GHz)
I have a flatten operation at the end of the feature extractor.
Things I've tried:
Other StackOverFlow posts suggested adding a max pooling layer to
reduce the feature size / remove the extraneous zero's. I made I
think a pretty large max pool window (thus reducing the feature size
significantly, but my prediction time increased.
Batch processing doesn't improve time which is probably obvious due
to the use of my M3 CPU). A batch size of 1 image takes 8 seconds, a
batch size of 4 takes 32.
Are there any ideas on how to speed up the prediction function? I need to run this through at least 10,000 images, and due to the nature of the project I would like to retain as much of the raw data as possible before going into the model (will be comparing it with other feature extraction models)
All my image files are saved locally, but I can try to setup a cloud computer and move my code over there to run with GPU support.
Is the issue simply I am running the VGG16 model on a dinky CPU?
Guidance would be much appreciated.

There are many issues with your model. The main issue is of course really slow machine, but as you cannot change that here I will state some pieces of advice on how you could speed up your computations:
VGG16 is relatively old architecture. The main issue here is that the so-called volume of tensors (area of feature maps times number of features) is decreased really slowly. I would advise you to use more modern architectures like e.g. ResNet50 or Inception v3 as they have the so-called stem which is making inside tensors much smaller really fast. Your speed should benefit thanks to that. There is also a really light architecture called MobileNet which seems perfect for your task.
Downsample your images - with a size of (480, 640) your image is 6 times bigger than default VGG input. This makes all computations 6 times slower. You could try to first downsample images and then use a feature extractor.

VGG16 is a very big model. The same accuracy could be reached with modern smaller models such as MobileNetV3 or EfficientNet.
However, if you have to use your model you could try OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
Here are some performance benchmarks for various models and CPUs. Your processor (M3-6Y30) is 6th generation so it should be supported.
It's rather straightforward to convert the Keras model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.