Background
I have a single-layer, 256-hidden-unit RNN that I've trained with Keras and now want to deploy. Ideally, I would like to deploy multiple instances of this RNN onto a GPU. However, when I load the model with keras.models.load_model(), it seems to be using 11 GB of my available 12 GB of GPU memory.
Questions
Why is my network, which is quite small, taking up so much memory? I only want to predict, not train. Am I loading the model the wrong way?
Is there some way I can generally understand the mapping of my RNN structure to the amount of GPU memory it will use?
Given this understanding, how do I reduce the amount of memory consumed by my RNN?
Current Understanding
My current estimate of how much memory my network should use comes from counting its parameters:
256 input weights
256 output weights
256x256 recurrent weights
256 hidden units
256 hidden unit biases
Total: 32 bits/parameter x (4 x 256 + 256 x 256) parameters ≈ 2.1e6 bits (about 0.27 MB)
This is significantly less than what I'm currently seeing. So my hypothesis is that Keras thinks I'm still training my model and is therefore trying to cache batch error sizes. But how else am I supposed to load my model?
No, it's just TensorFlow's GPU memory allocation strategy. Keras generally runs on top of TensorFlow, and by default TensorFlow maps nearly all of your free GPU memory in order to avoid dynamic memory allocation, regardless of how much memory you will really use.
You can configure it as shown below:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3 # or any valid options.
set_session(tf.Session(config=config))
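If you prefer TensorFlow to grow its allocation on demand rather than reserving a fixed fraction up front, you can set allow_growth instead (a sketch for the same TF1-style setup as above; in TF 2.x the rough equivalent is tf.config.experimental.set_memory_growth):

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grab GPU memory only as it is actually needed
set_session(tf.Session(config=config))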
Related
If I have a single GPU with 8GB RAM and I have a TensorFlow model (excluding training/validation data) that is 10GB, can TensorFlow train the model?
If yes, how does TensorFlow do this?
Notes:
I'm not looking for distributed GPU training. I want to know about single GPU case.
I'm not concerned about the training/validation data sizes.
No, you cannot train a model larger than your GPU's memory. (There may be some ways with dropout that I am not aware of, but in general it is not advised.) Furthermore, you would need even more memory than the parameters alone, because your GPU needs to retain the parameters along with the derivatives for each step to do back-prop.
Not to mention the smaller batch size this would require, as there is less space left for the dataset.
I think it's a pretty common message for PyTorch users with low GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate 😊 MiB (GPU 😊; 😊 GiB total capacity; 😊 GiB already allocated; 😊 MiB free; 😊 cached)
I tried to process an image by loading each layer to GPU and then loading it back:
for m in self.children():
    m.cuda()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()
But it doesn't seem to be very effective. I'm wondering whether there are any tips and tricks to train large deep learning models while using little GPU memory.
Although
import torch
torch.cuda.empty_cache()
provides a good way to clear the occupied CUDA memory, and we can also manually clear variables that are no longer in use with
import gc
del variables
gc.collect()
the error might still appear after using these commands, because PyTorch doesn't actually free the memory; it only clears the references to the memory occupied by the variables.
So reducing the batch_size after restarting the kernel and finding the optimum batch_size is the best possible option (but sometimes not a very feasible one).
Another way to get deeper insight into GPU memory allocation is to use:
torch.cuda.memory_summary(device=None, abbreviated=False)
Both arguments are optional. This gives a readable summary of memory allocation and lets you figure out why CUDA is running out of memory, then restart the kernel to avoid the error from happening again (just like I did in my case).
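For a quicker numeric check, you can also query the allocator counters directly (a small sketch; it assumes GPU 0 and a reasonably recent PyTorch where memory_reserved is available):

import torch

print(torch.cuda.memory_allocated(0) / 1024**2, "MiB currently used by tensors")
print(torch.cuda.memory_reserved(0) / 1024**2, "MiB reserved by the caching allocator")
print(torch.cuda.memory_summary(device=0, abbreviated=True))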
Passing the data iteratively might help, but changing the size of your network's layers or breaking them down would also prove effective (since the model itself can occupy significant memory, for example when doing transfer learning).
Just reduce the batch size, and it will work.
While I was training, it gave the following error:

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)

And I was using a batch size of 32. So I just changed it to 15 and it worked for me.
Send the batches to CUDA iteratively, and make small batch sizes. Don't send all your data to CUDA at once in the beginning. Rather, do it as follows:
for e in range(epochs):
    for images, labels in train_loader:
        if torch.cuda.is_available():
            images, labels = images.cuda(), labels.cuda()
        # blablabla
You can also use dtypes that use less memory. For instance, torch.float16 or torch.half.
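A minimal half-precision inference sketch (model and images are placeholders for your own network and inputs; note that full FP16 training usually needs loss scaling, which the mixed-precision answer further down covers):

import torch

model = model.half().cuda()                # convert the weights to float16
with torch.no_grad():
    outputs = model(images.half().cuda())  # inputs must match the weight dtype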
Try not to drag your grads too far.
I got the same error when I tried to sum up loss in all batches.
loss = self.criterion(pred, label)
total_loss += loss
Then I used loss.item() instead of loss, which requires grads, and that solved the problem:
loss = self.criterion(pred, label)
total_loss += loss.item()
The solution below is credited to yuval reina in the Kaggle question:
This error is related to the GPU memory and not the general memory, so #cjinny's comment might not work.
Do you use TensorFlow/Keras or Pytorch?
Try using a smaller batch size.
If you use Keras, try to decrease some of the hidden layer sizes.
If you use Pytorch:
Do you keep all the training data on the GPU all the time?
Make sure you don't drag the grads too far.
Check the sizes of your hidden layers.
Most things are covered above; still, I will add a little.
If torch gives an error such as "Tried to allocate 2 MiB", it is a misleading message. Actually, CUDA runs out of the total memory required to train the model. You can reduce the batch size. If even a batch size of 1 is not working (which happens when you train NLP models with massive sequences), try to pass less data; this will help you confirm that your GPU does not have enough memory to train the model.
Also, the garbage collection and cache-clearing steps have to be done again if you want to re-train the model.
Follow these steps:
Reduce the train, val, and test data
Reduce the batch size (e.g. 16 or 32)
Reduce the number of model parameters (e.g. less than a million)
In my case, when I was training the Common Voice dataset in Kaggle kernels, the same error was raised. I dealt with it by reducing the training dataset to 20,000 samples, the batch size to 16, and the model parameters to 112K.
If you are done training and just want to test with an image, make sure to add with torch.no_grad() and m.eval() at the beginning:
with torch.no_grad():
    for m in self.children():
        m.cuda()
        m.eval()
        x = m(x)
        m.cpu()
        torch.cuda.empty_cache()
This may seem obvious, but it worked in my case. I was trying to use BERT to transform sentences into an embedding representation. As BERT is a pre-trained model, I didn't need to save all the gradients, and they were consuming all the GPU's memory.
There are ways to avoid it, but it certainly depends on your GPU's memory size:
Load the data to the GPU as you unpack it iteratively:

for features, labels in batch:
    features, labels = features.to(device), labels.to(device)

Use FP16 (half-precision) float dtypes.
Reduce the batch size if you run out of memory.
Use the .detach() method to remove tensors from the GPU when they are no longer needed (see the sketch after this list).
If all of the above are used properly, the PyTorch library is already highly optimized and efficient.
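As a small sketch of the .detach() tip (model and train_loader are placeholders for your own objects):

predictions = []
for images, labels in train_loader:
    outputs = model(images.cuda())
    # keep a CPU copy without the autograd graph, so the GPU memory can be freed
    predictions.append(outputs.detach().cpu())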
Implementation:
Feed the images to the GPU batch by batch.
Use a small batch size during training or inference.
Resize the input images to a smaller size.
Technically:
Most networks are over-parameterized, which means they are too large for the learning task. So finding an appropriate network structure can help:
a. Compact your network with techniques like model compression, network pruning, and quantization.
b. Directly use a more compact network structure like MobileNet v1/v2/v3 (see the sketch after this list).
c. Network architecture search (NAS).
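As a sketch of option (b) with torchvision (mobilenet_v2 comes from torchvision.models; older torchvision versions use pretrained=True, newer ones use the weights argument):

import torchvision.models as models

# swap a large, over-parameterized backbone for a compact one (~3.5M parameters)
compact_net = models.mobilenet_v2(pretrained=True).cuda().eval()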
I had the same error but fixed it by resizing my images from ~600 to 100 using these lines:

import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((100, 100)),
    transforms.ToTensor()
])
Although this seems bizarre, what I found is that there are many sessions running in the background in Colab, even if we factory reset the runtime or close the tab. I got around this by clicking "Runtime" in the menu and then selecting "Manage Sessions". I terminated all the unwanted sessions and was good to go.
I would recommend using mixed precision training with PyTorch. It can make training way faster and consume less memory.
Take a look at https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam.
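For reference, a minimal torch.cuda.amp training sketch (PyTorch 1.6+); model, criterion, optimizer, and train_loader are placeholders for your own objects:

import torch

scaler = torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # run the forward pass in FP16 where it is safe
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()      # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)             # unscale the gradients and step the optimizer
    scaler.update()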
There is now a pretty awesome library which makes this very simple: https://github.com/rentruewang/koila
pip install koila
in your code, simply wrap the input with lazy:
from koila import lazy
input = lazy(input, batch=0)
As long as you don't cross a batch size of 32, you will be fine. Just remember to refresh or restart the runtime; otherwise, even if you reduce the batch size, you will encounter the same error.
I set my batch size to 16; it reduces zero gradients from occurring during my training, and the model matches the true function much better than with a batch size of 4 or 8, which causes the training loss to fluctuate.
I met the same error; my GPU is a GTX 1650 with 4 GB of video memory and 16 GB of RAM. It worked for me when I reduced the batch_size to 3.
Hope this can help you.
I faced the same problem and resolved it by downgrading the PyTorch version from 1.10.1 to 1.8.1 with CUDA 11.3.
In my case, I am using GPU RTX 3060, which works only with Cuda version 11.3 or above, and when I installed Cuda 11.3, it came with PyTorch 1.10.1. So I degraded the PyTorch version, and now it is working fine.
$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
You can also check by reducing the train batch size.
If you are working with images, just reduce the input image shape. For example, if you are using 512x512, try 256x256. It worked for me!
The best way would be to lower the batch size. Usually it works. Otherwise try this:
import gc
del variable #delete unnecessary variables
gc.collect()
I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also very minimal, though noticeable, if I decrease the parameter size by 100x, but I'd rather not change the parameter size by that much anyway. In addition, the predict function is very slow the first time it runs (~0.2 s), though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly the 10 samples can be evaluated very quickly. All I want to be able to do is predict the samples at different times, not all at once, since I need to update the tree search before making a new prediction. Perhaps I should work with TensorFlow instead?
The batch size controls parallelism when predicting, so it is expected that increasing the batch size will give better performance, as you can use more cores and use the GPU more efficiently.
You cannot really work around this; there is nothing to work around, since using a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are done on a GPU, to minimize overhead due to transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed completely depends on the network architecture.
One way that gave me a speedup was switching from model.predict(x) to
model.predict_on_batch(x)
making sure your x shape has 1 as the first dimension.
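For example (a sketch; x is assumed to be a single sample without a batch axis):

import numpy as np

x_batch = np.expand_dims(x, axis=0)           # shape (1, ...) so Keras sees a batch of one
prediction = model.predict_on_batch(x_batch)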
I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to remain with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
from openvino.runtime import Core  # the OpenVINO 2022.x Python runtime

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
TensorFlow has a long-standing limitation of 2 GB on a single tensor. It means that you can't train your model on more than 2 GB of data at one time without jumping through hoops. See "Initializing tensorflow Variable with an array larger than 2GB" and "Use large dataset in Tensorflow".
The standard solution referenced in those posts is to use a placeholder and to pass it to the "session" through feed_dict:
my_graph = tf.Graph()
sess = tf.Session(graph=my_graph)
X_init = tf.placeholder(tf.float32, shape=(m_input, n_input))
X = tf.Variable(X_init)
sess.run(tf.global_variables_initializer(), feed_dict={X_init: data_for_X})
However, this only works when I use the "old" API (tf.Session(), etc.) The recommended approach nowadays is to use Keras (all the tutorials on tensorflow.org use it). And, with Keras, there's no tf.Graph(), no tf.Session(), and no run() (at least none that are readily visible to the user.)
How do I adapt the above code to work with Keras?
In Keras, you wouldn't load your entire dataset in a tensor. You load it in numpy arrays.
If the entire data can be in a single numpy array:
Thanks to #sebrockm's comment.
The most trivial usage of Keras is simply loading your dataset into a numpy array (not a tf tensor) and calling model.fit(arrayWithInputs, arrayWithOutputs, ...)
If the entire data doesn't fit a numpy array:
You'd create a generator or a keras.utils.Sequence to load batches one by one and then train the model with model.fit_generator(generatorOrSequence, ...)
The limitation becomes the batch size, but you'd hardly ever hit 2GB in a single batch.
So, go for it:
keras.utils.Sequence
model.fit_generator
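A minimal keras.utils.Sequence sketch (x_paths, y_values, and load_sample are placeholders for however you store and read your samples on disk):

import numpy as np
from tensorflow import keras  # or `import keras`, depending on your setup

class LargeDatasetSequence(keras.utils.Sequence):
    def __init__(self, x_paths, y_values, batch_size=32):
        self.x_paths, self.y_values = x_paths, y_values
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x_paths) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch_x = np.array([load_sample(p) for p in self.x_paths[sl]])
        batch_y = np.array(self.y_values[sl])
        return batch_x, batch_y

# model.fit_generator(LargeDatasetSequence(x_paths, y_values), epochs=10)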
Keras doesn't have a 2GB limitation for datasets, I've trained much larger datasets with Keras with no issues.
The limitation could come from TensorFlow constants, which do have a 2GB limit, but in any case you should NOT store datasets as constants, as these are saved as part of the graph, and that is not the idea of storing a model.
Keras has the model.fit_generator function that you can use to pass a generator function which loads data on the fly and makes batches. This allows you to load a large dataset on the fly, and you usually adjust the batch size to maximize performance with acceptable RAM usage. TensorFlow doesn't have a similar API; you have to implement it manually, as you say, with feed_dict.
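As a sketch, such a generator can be as simple as the following (x_paths, y_values, and load_sample are again placeholders for your own data handling):

import numpy as np

def batch_generator(x_paths, y_values, batch_size=32):
    while True:  # fit_generator expects the generator to yield indefinitely
        for i in range(0, len(x_paths), batch_size):
            batch_x = np.array([load_sample(p) for p in x_paths[i:i + batch_size]])
            batch_y = np.array(y_values[i:i + batch_size])
            yield batch_x, batch_y

# model.fit_generator(batch_generator(x_paths, y_values), steps_per_epoch=..., epochs=...)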
I'm working on a feature extractor for this transfer learning personal project, and the predict function of Keras's VGG16 model seems pretty slow (31 seconds for a batch of 4 images). I do expect it to be slow, but I'm not sure whether the prediction function is slower than it should be.
data = DataGenerator()
data = data.from_csv(csv_path=csv_file,
                     img_dir=img_folder,
                     batch_size=batch)

#####################################################

conv_base = VGG16(include_top=False,
                  weights='imagenet',
                  input_shape=(480, 640, 3))

model = Sequential()
model.add(conv_base)
model.add(MaxPooling2D(pool_size=(3, 4)))
model.add(Flatten())

######################################################

for inputs, y in data:
    feature_batch = model.predict(inputs)
    yield feature_batch, y
So, my hunch is that it is slow for these reasons:
my input data is a bit large (loading in (480, 640, 3) size images)
I am running on a weak CPU (M3-6Y30 @ 0.90 GHz)
I have a flatten operation at the end of the feature extractor.
Things I've tried:
Other StackOverflow posts suggested adding a max pooling layer to reduce the feature size / remove the extraneous zeros. I made what I think is a pretty large max pool window (thus reducing the feature size significantly), but my prediction time increased.
Batch processing doesn't improve time, which is probably obvious due to the use of my M3 CPU. A batch size of 1 image takes 8 seconds, a batch size of 4 takes 32.
Are there any ideas on how to speed up the prediction function? I need to run this through at least 10,000 images, and due to the nature of the project I would like to retain as much of the raw data as possible before going into the model (I will be comparing it with other feature extraction models).
All my image files are saved locally, but I can try to set up a cloud computer and move my code over there to run with GPU support.
Is the issue simply I am running the VGG16 model on a dinky CPU?
Guidance would be much appreciated.
There are many issues with your model. The main issue is, of course, a really slow machine, but as you cannot change that, here is some advice on how you could speed up your computations:
VGG16 is a relatively old architecture. The main issue here is that the so-called volume of tensors (area of the feature maps times the number of features) decreases really slowly. I would advise you to use more modern architectures such as ResNet50 or Inception v3, as they have a so-called stem which makes the inner tensors much smaller very quickly. Your speed should benefit from that. There is also a really light architecture called MobileNet, which seems perfect for your task.
Downsample your images: with a size of (480, 640), your image is 6 times bigger than the default VGG input. This makes all computations 6 times slower. You could try to first downsample the images and then use the feature extractor, as in the sketch below.
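For instance, a sketch of downsampling on load with Keras' image utilities ('some_image.jpg' is a hypothetical path; 224x224 is VGG16's default input size, and model stands for the feature extractor from the question):

import numpy as np
from keras.preprocessing import image

img = image.load_img('some_image.jpg', target_size=(224, 224))  # resized while loading
x = np.expand_dims(image.img_to_array(img), axis=0)             # add the batch axis
features = model.predict(x)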
VGG16 is a very big model. The same accuracy could be reached with modern smaller models such as MobileNetV3 or EfficientNet.
However, if you have to use your model you could try OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
Here are some performance benchmarks for various models and CPUs. Your processor (M3-6Y30) is 6th generation so it should be supported.
It's rather straightforward to convert the Keras model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core  # the OpenVINO 2022.x Python runtime

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.