OOM when allocating tensor with shape[1,144,144,144,128] - python

My computer has a GTX 1080 with 8 GB of GPU memory and 32 GB of RAM, but the array data may be too large for it to hold, and the computer reports a resource exhausted error. Is there any way to solve this problem, or to estimate how much GPU memory I need for such a large numpy array, so I can buy a better machine for the computation? By the way, the batch_size I use is 1, so I have already reduced memory use to the minimum. Should I instead reduce the rows, columns, and height of my numpy array? I think that would affect the resolution of my results, but that would be okay.
If anyone can answer my question, thanks.

The tensor you are using is big, but not that big for an 8 GB GPU. 144 * 144 * 144 * 128 is ~380 million elements, so even with 32-bit items it requires about 1.5 GiB. I have a GeForce GTX 1070 with 8 GB (same size as yours), and here's my TensorFlow experiment:
import numpy as np
import tensorflow as tf

# TF 1.x: feed a zero-filled array of the full shape through a placeholder
X = tf.placeholder(dtype=tf.int32, shape=(1, 144, 144, 144, 128))
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    value = session.run([X], feed_dict={X: np.zeros(shape=(1, 144, 144, 144, 128))})
    print(np.array(value).shape)
The output:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7465
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 4.14GiB
2017-08-17 20:05:54.312424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-08-17 20:05:54.312430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-08-17 20:05:54.312444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
(1, 1, 144, 144, 144, 128)
Note that free memory is much lower than 8 GB because I use two UHD monitors. So this might be the first thing to check in your case: other processes can consume a lot of GPU memory.
Next, you didn't provide your neural network architecture, but if you are using, for instance, a deep convolutional neural network, note that the first layers consume a lot of memory for parameters and gradients. You might want to read this helpful page for details. If this is the case, you might need to plug in another GPU and split the graph across all available GPUs (here's how you can do it). There are 12 GB GPUs available from NVIDIA.
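For illustration, here is a minimal TF 1.x sketch of splitting a graph across two GPUs with tf.device; the layer sizes are made up, not taken from the question:
import tensorflow as tf

# Hypothetical example: place the memory-hungry early layers on one GPU
# and the later layers on another (TF 1.x graph mode).
x = tf.placeholder(tf.float32, shape=(1, 144, 144, 144, 1))
with tf.device('/gpu:0'):
    conv1 = tf.layers.conv3d(x, filters=32, kernel_size=3, padding='same')
with tf.device('/gpu:1'):
    conv2 = tf.layers.conv3d(conv1, filters=64, kernel_size=3, padding='same')

# allow_soft_placement falls back gracefully if a device is unavailable
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())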
Finally, you can always consider reducing the floating-point precision, tf.float64 -> tf.float32 -> tf.float16, for all your variables. Going from float64 to float16 cuts memory use by 4x, which is sometimes just enough to fit on a GPU.
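As a small sketch of what that looks like (the placeholder shape is just the one from the question):
import tensorflow as tf

# float64 -> float16 shrinks each element from 8 bytes to 2 bytes (4x less memory)
x64 = tf.placeholder(tf.float64, shape=(1, 144, 144, 144, 128))
x16 = tf.cast(x64, tf.float16)  # same values at a quarter of the memory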

Related

Not understanding CUDA resources and keep running out of memory

I think I'm running PyTorch.
I am new to Python and am using it to experiment with convolutional neural networks and larger images, but I keep running into this error even when I request smaller image outputs. I just signed up for Colab Pro; while it is certainly faster, it still errors out with CUDA out of memory. I would reallocate memory if I knew how, but I don't. Is there any other way to access/manage GPU memory?
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line
255, in backward torch.autograd.backward(self, gradient, retain_graph,
create_graph, inputs=inputs) File
"/usr/local/lib/python3.7/dist-packages/torch/autograd/init.py",
line 149, in backward allow_unreachable=True, accumulate_grad=True) #
allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to
allocate 114.00 MiB (GPU 0; 15.78 GiB total capacity; 13.27 GiB
already allocated; 4.75 MiB free; 14.43 GiB reserved in total by
PyTorch) VGG-19 Architecture Detected Successfully loaded
models/vgg19-d01eb7cb.pth conv1_1: 64 3 3 3 conv1_2: 64 64 3 3
conv2_1: 128 64 3 3 conv2_2:
I have shown below some ways to manage GPU memory in PyTorch, but these are often not the recommended ways to deal with CUDA errors like yours.
The reason you get this error has nothing to do with the size of your output but with the size of your input. Either the images coming into your network are far too big, in which case you may need to use transforms.Resize(), or your batch size is way too big, so you are requesting a huge parallel computation and need to lower that number in the DataLoader, as sketched below.
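As a rough sketch of both fixes (the dataset path, target resolution, and batch size here are placeholders, not values from the question):
import torch
from torchvision import datasets, transforms

# Shrink images before they reach the network and keep the batch small.
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # placeholder target size
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder('path/to/images', transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)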
A tensor can be removed from GPU memory with:
a = torch.tensor(1, device='cuda')
del a
# Not usually suggested and rarely needs to be called explicitly
torch.cuda.empty_cache()
To allocate a tensor in CUDA memory, simply move it to the device:
a = torch.tensor(1)
a = a.cuda()
# OR
device = torch.device("cuda")
a = a.to(device)
Sarthak Jain

PyTorch list slicing on GPU slower than on CPU

I would like to optimize ML code (SSD in PyTorch) on an NVIDIA Jetson Xavier NX (development kit). One of the bottlenecks seems to be list slicing on PyTorch (1.6.0) tensors on the GPU.
The same problem occurred on an NVIDIA GeForce GTX 1050 Ti (GP107), where the CPU was ~2 times faster.
Let me create the variables first
import torch
from time import time
cuda0 = torch.device('cuda:0')
probs = torch.ones([3000], dtype=torch.float64, device=cuda0)
mask = torch.ones([3000], dtype=torch.bool, device=cuda0)
probs_cpu = probs.cpu()
mask_cpu = mask.cpu()
Then run the logic (approximately the same results occurred on every run):
before = time()
probs[mask]
print(f'GPU {time() - before:.5f}') # output: GPU 0.00263
before = time()
probs_cpu[mask_cpu]
print(f'CPU {time() - before:.5f}') # output: CPU 0.00066
Why is the list slicing ~4 times slower on the GPU than on the CPU with PyTorch 1.6.0 on the NVIDIA Jetson Xavier NX Developer Kit, according to the code above? How can I speed it up?
Code details: see line 51 in predictor.py, which is part of an SSD implementation in PyTorch.
Run it on the CPU? The whole algorithm would not be faster on the CPU, since transferring the data from the GPU takes too long (~0.00805 s).
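One caveat worth adding (not part of the original question): CUDA kernels launch asynchronously, so wall-clock timings of single GPU ops are only meaningful if you synchronize first. A minimal sketch of the same measurement with synchronization:
import torch
from time import time

cuda0 = torch.device('cuda:0')
probs = torch.ones([3000], dtype=torch.float64, device=cuda0)
mask = torch.ones([3000], dtype=torch.bool, device=cuda0)

torch.cuda.synchronize()   # finish any pending GPU work before starting the clock
before = time()
probs[mask]
torch.cuda.synchronize()   # wait for the indexing kernel itself to complete
print(f'GPU {time() - before:.5f}')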

rtx 2070s failed to allocate gpu memory from device:CUDA_ERROR_OUT_OF_MEMORY: out of memory

tf 2.0.0-gpu
CUDA 10.0
RTX2070super
Hi, I have a problem with allocating GPU memory. The initial allocation is about 7 GB, like this:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6994 MB memory)
2020-01-11 22:19:22.983048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-11 22:19:23.786225: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 2.78G (2989634304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-11 22:19:24.159338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Limit: 7333884724
InUse: 5888382720
MaxInUse: 6255411968
NumAllocs: 1264
MaxAllocSize: 2372141056
But I can only use about 5900 MB of memory, and allocating the rest always fails.
My guess is that the whole GPU memory of the RTX 2070 Super gets used because I use two data types (float16 and float32), so I set a mixed-precision policy with this code:
opt = tf.keras.optimizers.Adam(1e-4)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
Still, the allocation always fails.
TensorFlow memory management can be frustrating.
Main takeaway: whenever you see OOM, there is actually not enough memory, and you have to reduce either your model size or your batch size. TF throws OOM when it tries to allocate sufficient memory and fails, regardless of how much memory has been allocated before.
At startup, TF tries to allocate a reasonably large chunk of memory, equivalent to about 90-98% of the total memory available - 5900 MB in your case. Then, when the actual data starts to take more than that, TF additionally tries to allocate a sufficient amount of memory or a bit more - 2.78 GB here. And if that does not fit, it throws OOM, as in your case. Your GPU could not fit 5.9 + 2.8 GB. The last chunk of 2.78 GB might actually be a little more than TF needs, but it would be used later anyway if you have multiple training steps, because the maximum required memory can fluctuate a bit between identical Session.run's.
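For reference, a minimal TF 2.0 sketch of switching to on-demand allocation instead of the big up-front grab; note this only changes when memory is claimed, not how much the model ultimately needs:
import tensorflow as tf

# Ask TF 2.x to grow GPU memory on demand rather than reserving most of it at startup.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)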

TensorFlow GPU: No Performance increase in HelloWorld code

Background:
I am a Python Developer new to TensorFlow.
System Spec:
i5-7200U CPU @ 2.50GHz × 4
GeForce 940MX 4GB
Ubuntu 18
I am running TensorFlow in Docker (I found installing the CUDA stack too complicated and time-consuming; maybe I messed something up).
Basically I am running a kind of HelloWorld program on the GPU and the CPU to check what difference it makes, and to my surprise there is hardly any!
docker-compose.yml
version: '2.3'
services:
  tensorflow:
    # image: tensorflow/tensorflow:latest-gpu-py3
    image: tensorflow/tensorflow:latest-py3
    runtime: nvidia
    volumes:
      - ./:/notebooks/TensorTest1
    ports:
      - 8888:8888
When I run with image: tensorflow/tensorflow:latest-py3 I get approx 5 seconds.
root@e7dc71acfa59:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:37:24.288321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TIME: 4.900559186935425
result: [3. 3. 3. ... 3. 3. 3.]
When I run with image: tensorflow/tensorflow:latest-gpu-py3 I again get approx 5 seconds.
root@baf68fc71921:/notebooks/TensorTest1# python3 hello1.py
2018-11-18 14:39:39.811575: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-18 14:39:39.877483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 14:39:39.878122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.56GiB
2018-11-18 14:39:39.878148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-18 14:44:17.101263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-18 14:44:17.101303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-18 14:44:17.101313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-18 14:44:17.101540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3259 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
TIME: 5.82940673828125
result: [3. 3. 3. ... 3. 3. 3.]
My Code
import tensorflow as tf
import time

with tf.Session():
    start_time = time.time()
    input1 = tf.constant([1.0, 1.0, 1.0, 1.0] * 100 * 100 * 100)
    input2 = tf.constant([2.0, 2.0, 2.0, 2.0] * 100 * 100 * 100)
    output = tf.add(input1, input2)
    result = output.eval()
    duration = time.time() - start_time
    print("TIME:", duration)
    print("result: ", result)
Am I doing something wrong here? Based on the prints, it seems to be using the GPU correctly.
I followed the steps at "Can I measure the execution time of individual operations with TensorFlow?" and I got this
A GPU is an "external" processor, there's overhead involved in compiling a program for it, running it, sending it data, and retrieving the results. GPUs also have different performance tradeoffs from CPUs. While GPUs are frequently faster for large and complex number-crunching tasks, your "hello world" is too simple. It doesn't do very much with each data item between loading it and saving it (just pairwise addition), and it doesn't do very much at all — a million operations is nothing. That makes any setup/teardown overhead relatively more noticeable. So while the GPU is slower for this program it's still likely to be faster for more useful programs.
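To see the GPU pull ahead, the workload has to be heavy enough to amortize that overhead. A rough sketch in the same TF 1.x style as the question (the matrix size is arbitrary, and the first run is excluded as warm-up):
import tensorflow as tf
import time

with tf.Session() as sess:
    a = tf.random_normal([4000, 4000])
    b = tf.random_normal([4000, 4000])
    c = tf.matmul(a, b)
    sess.run(c)  # warm-up: graph setup, memory allocation, kernel launch overhead
    start_time = time.time()
    sess.run(c)
    print("TIME:", time.time() - start_time)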

Keras seems to hang after call to fit_generator

I am trying to fit the Keras implementation of the SqueezeDet model to a new dataset. After making the appropriate changes to my config file, I tried to run the train script, but it seems to hang after the call to fit_generator(), as I get the following output:
/anaconda/envs/py35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Number of images: 536
Number of epochs: 100
Number of batches: 53
Batch size: 10
2018-07-04 14:18:49.711606: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-04 14:18:54.080912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 52a9:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-07-04 14:18:54.080958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-04 14:18:54.333214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-04 14:18:54.333270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-04 14:18:54.333290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-04 14:18:54.333559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 52a9:00:00.0, compute capability: 3.7)
Learning rate: 0.01
Weights initialized by name from ../main/model/imagenet.h5
Using single GPU
Backend Qt5Agg is interactive backend. Turning interactive mode on.
Epoch 1/100
And then nothing happens, even if I leave it alone for a day. The call it seems to freeze on is:
squeeze.model.fit_generator(train_generator, epochs=EPOCHS, verbose=1,
steps_per_epoch=nbatches_train, callbacks=cb)
Where the parameters are:
train_generator = generator_from_data_path(img_names, gt_names, config=cfg)
EPOCHS = 100
nbatches_train = 53
callbacks = [# TensorBoard object, ReduceLROnPlateau object, ModelCheckpoint object #]
My versions:
Python 3.5.4 :: Anaconda custom (64-bit)
tensorflow-gpu : 1.8.0
tensorflow : 1.8.0
Keras : 2.2.0
Formatting the conversation from the comments into an answer.
The culprit was train_generator.
I looked into the sources of model.fit_generator in Keras some time ago. It just retrieves some data from the generator and submits it to the backend; nothing magical :)
So my hypothesis was that it cannot retrieve data from the generator because the generator does not generate anything.
@Barker has confirmed it, stating that the call to next(train_generator) hangs.
I personally have moved to keras.utils.Sequence, which supports indexing and length and is much more convenient than ordinary generators (a minimal sketch is below), though this note is not directly related to the current problem.
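For completeness, a minimal sketch of a keras.utils.Sequence; the array names and shapes are placeholders, not the SqueezeDet pipeline from the question:
import numpy as np
from keras.utils import Sequence

class DataSequence(Sequence):
    def __init__(self, images, labels, batch_size):
        self.images, self.labels, self.batch_size = images, labels, batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        # return one batch by index; Keras can shuffle and prefetch safely
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.images[sl], self.labels[sl]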
