My system has two NUMA nodes and two GTX 1080 Ti attached to NUMA node 1 (XEON E5).
The NN models are trained via single-machine multi-GPU data parallelism using Keras' multi_gpu_model.
How can TF be instructed to allocate memory and execute the TF workers (merging weights) only on NUMA node 1? For performance reasons I'd like to prevent accessing memory through the QPI.
tf.device():
1) Does tf.device('/cpu:0') refer to a physical CPU, to a physical core, or is it simply a 'logical device' (a thread or thread pool?) that can be moved between whichever physical cores are online?
2) How can the TF scheduler be influenced to map the logical device to a set of physical cores?
3) In the case of memory allocation on NUMA systems - does TF support allocating memory on specific nodes? Or do I have to fall back to set_mempolicy()/numactl (Linux)?
I'm using numactl --cpunodebind=1 --membind=1 - binds execution and memory allocation to NUMA node 1.
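For example (the script name here is hypothetical, not from the original post), the entire training process can be launched pinned to node 1 like this:
numactl --cpunodebind=1 --membind=1 python3 train_multi_gpu.py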
I am running a deep learning script, but I am not an expert. My task is to train on data with multiple GPUs, but I have trouble specifying which GPUs to use. Here are the steps where I get confused.
I select multiple GPUs with:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]= "5,6"
print("# of GPU available:", len(tf.config.list_physical_devices('GPU')))
# of GPU available: 2
When I start creating the model, I receive this error, which I did not get when using only ONE GPU:
tf.random.set_seed(42)
model_unet = binary_unet(256,256,6)
ResourceExhaustedError: OOM when allocating tensor with shape[3,3,64,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TruncatedNormal]
I thought I could first set CUDA_VISIBLE_DEVICES to the ONE GPU I wanted (for step 2), and then set it to multiple GPUs after step 2. But then TensorFlow no longer recognizes multiple GPUs. For example:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]= "5,6"
print("# of GPU available:", len(tf.config.list_physical_devices('GPU')))
# of GPU available: 1
Note that the number of available GPUs becomes 1, and it stays that way unless I restart the kernel and clear the output. In fact, I need to restart the kernel and clear the output between step 1 and step 2 as well, to make sure only one GPU is visible so step 2 doesn't fail. But I can't just restart and clear everything, because I need the previous outputs to run the epochs.
I believe some potential solutions are: 1) make step 2 (creating the U-Net model) run with multiple GPUs, or 2) somehow reset TensorFlow's device state without restarting the kernel, so that I can create the model with 1 GPU but train the data/run the epochs with multiple GPUs. But I have no idea how to do either. Could someone help?
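A sketch of one possible approach (not from the original post): build the model inside a tf.distribute.MirroredStrategy scope, so the same code works whether one or several GPUs are visible. binary_unet is the question's own model-building function and is assumed to be importable.
import os

# Select the GPUs before TensorFlow initializes its CUDA devices.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across the visible GPUs,
    # so the same code also runs when only one GPU is visible.
    tf.random.set_seed(42)
    model_unet = binary_unet(256, 256, 6)  # the question's own model function
    model_unet.compile(optimizer="adam", loss="binary_crossentropy")
Model.fit can then be called as usual; the strategy splits each batch across the replicas and merges gradients automatically.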
I'm trying to run a RetinaNet model on Google Colab with GPU support, but after starting training it quickly completes the 1000 steps of epoch 1 without training properly and then stops without any warning.
Here is the terminal output I get after running the train command:
!keras_retinanet/bin/train.py --tensorboard-dir /content/TrainingOutput --snapshot-path /content/TrainingOutput/snapshots --random-transform --steps 1000 pascal /content/PlumsVOC
Creating model, this may take a second...
2021-08-19 03:38:20.717241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.725782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.726450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.727359: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-19 03:38:20.727598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.728167: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.728749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.263376: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.264133: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.264721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.265247: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-19 03:38:21.265304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13839 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:356: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
"The `lr` argument is deprecated, use `learning_rate` instead.")
Model: "retinanet"
__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2021-08-19 03:38:24.467332: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-19 03:38:24.467379: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-08-19 03:38:24.467435: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-08-19 03:38:24.588819: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-19 03:38:24.589029: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1972: UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
warnings.warn('`Model.fit_generator` is deprecated and '
/usr/local/lib/python3.7/dist-packages/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
category=CustomMaskWarning)
2021-08-19 03:38:25.187697: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/50
2021-08-19 03:38:32.881842: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8004
1/1000 [..............................] - ETA: 3:31:30 - loss: 3.8681 - regression_loss: 2.7375 - classification_loss: 1.13062021-08-19 03:38:38.104179: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-19 03:38:38.104232: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2/1000 [..............................] - ETA: 17:05 - loss: 3.8988 - regression_loss: 2.7693 - classification_loss: 1.1295 2021-08-19 03:38:38.938537: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2021-08-19 03:38:38.940902: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2021-08-19 03:38:39.134281: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673] GpuTracer has collected 3251 callback api events and 3247 activity events.
2021-08-19 03:38:39.192167: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-19 03:38:39.289977: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
2021-08-19 03:38:39.355897: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.trace.json.gz
2021-08-19 03:38:39.455150: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
2021-08-19 03:38:39.462678: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.memory_profile.json.gz
2021-08-19 03:38:39.466401: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
Dumped tool data for xplane.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.xplane.pb
Dumped tool data for overview_page.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.overview_page.pb
Dumped tool data for input_pipeline.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.kernel_stats.pb
11/1000 [..............................] - ETA: 6:57 - loss: 3.9632 - regression_loss: 2.8365 - classification_loss: 1.1267WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 50000 batches). You may need to use the repeat() function when building your dataset.
1000/1000 [==============================] - 17s 4ms/step - loss: 3.9632 - regression_loss: 2.8365 - classification_loss: 1.1267
Running network: 100% (4 of 4) |##########| Elapsed Time: 0:00:02 Time: 0:00:02
Parsing annotations: 100% (4 of 4) |######| Elapsed Time: 0:00:00 Time: 0:00:00
32 instances of class redPlum with average precision: 0.0000
0 instances of class greenPlum with average precision: 0.0000
mAP: 0.0000
Epoch 00001: saving model to /content/TrainingOutput/snapshots/resnet50_pascal_01.h5
/usr/local/lib/python3.7/dist-packages/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
category=CustomMaskWarning)
It saves the model weights, but the model isn't detecting any objects in the test images.
What's happening? How can I fix this and train the model normally for the specified number of epochs? Any help on this would be very welcome, thanks.
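Not an answer from the thread, but the warning logged at step 11 points at the likely cause: the generator produced fewer batches than steps_per_epoch * epochs. A hedged adjustment, assuming the default batch size of 1 and a training split as small as the 4-image evaluation split shown above, would be to lower --steps to roughly the number of training images, for example:
!keras_retinanet/bin/train.py --tensorboard-dir /content/TrainingOutput --snapshot-path /content/TrainingOutput/snapshots --random-transform --steps 4 pascal /content/PlumsVOC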
tf 2.0.0-gpu
CUDA 10.0
RTX2070super
Hi, I have a problem with GPU memory allocation. The initial memory allocation is about 7 GB, like this:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6994 MB memory)
2020-01-11 22:19:22.983048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-11 22:19:23.786225: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 2.78G (2989634304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-11 22:19:24.159338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Limit: 7333884724
InUse: 5888382720
MaxInUse: 6255411968
NumAllocs: 1264
MaxAllocSize: 2372141056
But I can only use about 5900 MB of memory; allocating the rest always fails.
My guess was that the whole GPU memory of the RTX 2070 Super gets used because I mix two data types (float16 and float32), so I enabled a mixed-precision policy with this code:
opt = tf.keras.optimizers.Adam(1e-4)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
Still, the allocation always fails.
TensorFlow memory management can be frustrating.
Main takeaway: whenever you see OOM, there really is not enough memory, and you have to either reduce your model size or your batch size. TF throws OOM when it fails to allocate the memory it needs, regardless of how much memory has already been allocated before.
On startup, TF tries to allocate a reasonably large chunk of memory, equivalent to about 90-98% of the total memory available - 5900 MB in your case. Then, when the actual data starts to take up more than that, TF additionally tries to allocate a sufficient amount of memory, or a bit more - the 2.78 GB here. And if that does not fit, it throws OOM, as in your case. Your GPU could not fit 5.9 + 2.8 GB. The last chunk of 2.78 GB may actually be a little more than TF needs, but it would be used later anyway if you have multiple training steps, because the maximum required memory can fluctuate a bit between identical Session.run calls.
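As a side note (not part of the original answer), TF 2.x can be asked to grow GPU allocations on demand instead of reserving ~90% up front. This is only a sketch of when the OOM surfaces, not a fix - it does not create more memory:
import tensorflow as tf

# Must run before any op touches the GPU for the first time.
# The real remedy for OOM remains a smaller model or batch size.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)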
I followed this tutorial to export my own trained TensorFlow model to C++, and I got errors when I call freeze_graph:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
...
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/Const_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
[[Node: save/Const_1 = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/device:GPU:0"]()]]
Caused by op u'save/Const_1', defined at:
...
GPU:0 is detected and usable by TensorFlow, so I don't understand where the error comes from.
Any idea?
The error means that op save/Const_1 is trying to get placed on the GPU, and there is no GPU implementation of that node. In fact, Const nodes are CPU-only and are stored as part of the Graph object, so this one can't be placed on the GPU. One workaround is to run with allow_soft_placement=True; another is to open the pbtxt file and manually remove the device line for that node.
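A minimal sketch of the first workaround, assuming the TF1-style Session API used in the question:
import tensorflow as tf

# Let ops without a GPU kernel (such as the string Const created by the Saver)
# fall back to the CPU instead of failing the explicit /device:GPU:0 placement.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    pass  # build/restore the graph and run the freeze/export step here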
How do I interpret the TensorFlow output for building and executing computational graphs on GPGPUs?
Given the following command that executes an arbitrary TensorFlow script using the Python API:
python3 tensorflow_test.py > out
The first part, stream_executor, seems to be loading dependencies:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
What is a NUMA node?
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I assume this is where it finds the available GPU:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB
Some GPU initialization? What is DMA?
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)
Why does it throw an error (the line tagged E)?
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 11.15G (11976531968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Great answer to what the pool_allocator does: https://stackoverflow.com/a/35166985/4233809
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3160 get requests, put_count=2958 evicted_count=1000 eviction_rate=0.338066 and unsatisfied allocation rate=0.412025
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1743 get requests, put_count=1970 evicted_count=1000 eviction_rate=0.507614 and unsatisfied allocation rate=0.456684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1986 get requests, put_count=2519 evicted_count=1000 eviction_rate=0.396983 and unsatisfied allocation rate=0.264854
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 655 to 720
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 28728 get requests, put_count=28680 evicted_count=1000 eviction_rate=0.0348675 and unsatisfied allocation rate=0.0418407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 1694 to 1863
About NUMA -- https://software.intel.com/en-us/articles/optimizing-applications-for-numa
Roughly speaking, if you have a dual-socket CPU, each socket has its own memory and has to access the other processor's memory through a slower QPI link. So each CPU plus its local memory is a NUMA node.
Potentially you could treat two different NUMA nodes as two different devices and structure your network to optimize for the different within-node/between-node bandwidths.
However, I don't think there's enough wiring in TF to do this right now. The detection doesn't work either -- I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.
DMA = Direct Memory Access. You could potentially copy things from one GPU to another GPU without involving the CPU (i.e., through NVLink). NVLink integration isn't there yet.
As for the error: TensorFlow tries to allocate close to the GPU's maximum memory, so it sounds like some of your GPU memory has already been allocated to something else and the allocation failed.
You can do something like the following to avoid allocating so much memory:
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction=0.3 # don't hog all vRAM
config.operation_timeout_in_ms=15000 # terminate on long hangs
sess = tf.InteractiveSession("", config=config)
successfully opened CUDA library xxx locally means that the library was loaded, but it does not mean that it will be used.
successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero means that your kernel does not have NUMA support. You can read about NUMA here and here.
Found device 0 with properties: you have 1 GPU which you can use. It lists the properties of this GPU.
DMA is direct memory access. More information on Wikipedia.
failed to allocate 11.15G: the error clearly states what happened, but it is hard to tell why you need so much memory without looking at your code.
The pool allocator messages are explained in this answer.