I am trying to enable multi-GPU training for my model. The code is as follows:
with tf.device('/cpu:0'):
    # creating a model
    ...
multi_gpu_model = keras.utils.multi_gpu_model(model, gpus=2)
multi_gpu_model.compile(loss='cosine_proximity', optimizer='nadam', metrics=['accuracy'])
try:
    multi_gpu_model.fit_generator(
        sequential_generator('/home/jindal/notebooks/witter_dataset_sequences_20m', batch_size, total_lines),
        steps_per_epoch=steps_per_epoch, epochs=epochs, callbacks=[WeightsSaver(model, 200)]
    )
except Exception as e:
    print(e)
Now when I try to run it, I get an error as follows:
creating a partition for /device:CPU:3 which doesn't exist in the list of available devices. Available devices: /device:CPU:0,/device:XLA_CPU:0,/device:XLA_GPU:0,/device:GPU:0,/device:GPU:1,/device:GPU:2,/device:GPU:3.
I think I have read the documentation correctly, and my Keras package is up to date. I have opened an issue on GitHub as well. What am I doing wrong?
Note: I am using Jupyter Notebook and have 4 GPUs available, each with 12 GB of memory. I have limited the notebook to 2 GPUs (GPU numbers 2 and 3) using the commands
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=2, 3
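For comparison, the same restriction can be applied from Python rather than through the %env magic. This is only a sketch of something worth checking, not a confirmed fix for the error above: it assumes the variables need to be set before TensorFlow or Keras is first imported in the kernel, and it writes the device list without a space after the comma. The helper name restrict_visible_gpus is hypothetical.

```python
import os

def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU ids to CUDA (hypothetical helper)."""
    # Order devices by PCI bus id so the ids match what nvidia-smi reports.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Comma-separated list with no whitespace, e.g. "2,3".
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Assumption: this runs before tensorflow or keras is imported in the notebook.
visible = restrict_visible_gpus([2, 3])
print(visible)
```

When the restriction takes effect, CUDA renumbers the visible devices from zero, so the framework should report only GPU:0 and GPU:1. The error message above lists four GPUs, which suggests the setting was not picked up in that session.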
I want to implement model parallelism with the Trainer in PyTorch Lightning.
My environment is a single machine with two 16 GB GPUs.
A single GPU cannot hold my whole model, because the model consists of several layers plus one large layer (almost 14 GB).
I ran the following code.
import pytorch_lightning as pl
# from pytorch_lightning.strategies import DeepSpeedStrategy
from mymodel import Net
from mydatamodule import DataModule
net = Net()
dm = DataModule()
trainer = pl.Trainer(
    accelerator='gpu',
    strategy='ddp_shard',
    # strategy=DeepSpeedStrategy(
    #     stage=2,
    #     config={
    #         "autotuning": {
    #             "mp_size": 2
    #         }
    #     }
    # ),
    num_nodes=1,
    precision=16,
    check_val_every_n_epoch=10,
)
trainer.fit(net, dm)
My understanding is that model parallelism in Fairscale/DeepSpeed distributes the layers of a model across GPUs so that each GPU stays within its memory limit, as in the figures in these articles:
https://towardsdatascience.com/distributed-parallel-training-model-parallel-training-a768058aa02a https://huggingface.co/transformers/v4.9.2/parallelism.html#naive-model-parallel-vertical-and-pipeline-parallel
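The layer-placement idea described above can be sketched in plain Python. This illustrates only the concept of naive (vertical) model parallelism; it is not what Fairscale or DeepSpeed run internally. The layer sizes are hypothetical, and the sketch counts weights only, ignoring activations, gradients, and optimizer state.

```python
def assign_layers(layer_sizes_gb, capacity_gb, num_devices):
    """Greedily place consecutive layers on devices without exceeding capacity.

    Returns a list with the device index chosen for each layer.
    """
    placement = []
    device = 0
    used = 0.0
    for size in layer_sizes_gb:
        if size > capacity_gb:
            raise ValueError(f"a {size} GB layer cannot fit on any single device")
        # Move on to the next device once the current one is full.
        if used + size > capacity_gb:
            device += 1
            used = 0.0
            if device >= num_devices:
                raise ValueError("model does not fit on the available devices")
        placement.append(device)
        used += size
    return placement

# Hypothetical sizes: one ~14 GB layer plus a few small ones, on two 16 GB GPUs.
placement = assign_layers([1.0, 14.0, 1.5, 1.0], capacity_gb=16.0, num_devices=2)
print(placement)
```

As far as I understand, the sharded strategies listed in this question (ZeRO-style sharding in Fairscale and DeepSpeed) work differently: they split optimizer state, gradients, and at stage 3 the parameters across data-parallel ranks, while every rank still executes every layer.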
The first validation pass (controlled by check_val_every_n_epoch) works.
However, in the training phase the code above behaves like Distributed Data Parallel and raises an OOM error.
When the OOM error occurs, both GPUs are holding more than 14 GB.
The code also reports "DeadLock detected from rank: 0" before the OOM error.
I do not understand the cause of this behavior.
I tried the strategies "ddp_sharded", "deepspeed-stage-1", "deepspeed-stage-2", "deepspeed-stage-3", and "fsdp",
but the results are all the same.
I would like answers to the two questions below.
1. Is my understanding of model parallelism wrong?
2. How do I fix my code?
System Details
OS: Ubuntu 20.04
Python: 3.8.10
PyTorch: 1.12.1+cu113
PyTorch Lightning: 1.7.7
Fairscale: 0.5.12
DeepSpeed: 0.7.4
I am following this tutorial (https://colab.research.google.com/github/khanhlvg/tflite_raspberry_pi/blob/main/object_detection/Train_custom_model_tutorial.ipynb) from Colab, but running it on my own Windows machine.
When I debug my script, it throws this error:
The size of the train_data (0) couldn't be smaller than batch_size (4). To solve this problem, set the batch_size smaller or increase the size of the train_data.
on this line of my code:
model = object_detector.create(train_data, model_spec=spec, batch_size=4, train_whole_model=True, epochs=20, validation_data=val_data)
My own training data contains 101 images, while the Colab example has only 62 in its training folder.
I understand it is complaining that the training data can't be smaller than the batch size, but I don't understand why it is raised in the first place, since my training data is not empty.
On my own machine I have TensorFlow 2.8.0, the same as in the Colab.
I've tried batch sizes all the way from 0 to over 100, but it still gives me the same error.
I've tried dropping one sample so there are 100 images and setting the sample size to 2, 4, etc., but it still throws the error.
I'm led to the conclusion that the data is not being loaded correctly, but why?
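One way to test the "data is not being loaded" hypothesis without the training library is to count what is actually on disk. The sketch below assumes a Pascal VOC-style layout in which every image has a same-named .xml annotation in the same folder; that layout, the suffix list, and the helper name summarize_dataset are assumptions, not details taken from the post.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def summarize_dataset(folder):
    """Count images and XML annotations, and list images with no matching XML."""
    folder = Path(folder)
    images = [p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES]
    annotations = {p.stem for p in folder.iterdir() if p.suffix.lower() == ".xml"}
    missing = sorted(p.name for p in images if p.stem not in annotations)
    return {"images": len(images), "annotations": len(annotations), "missing": missing}

# Demo on a throwaway folder: two images, only one of which is annotated.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "a.xml", "b.png"):
        (Path(tmp) / name).touch()
    report = summarize_dataset(tmp)
print(report)
```

If the counts for the real training folder look right, the files are present and the problem is more likely in how they are parsed than in what is on disk.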
For anybody running into the same issue, here is my solution.
The reason this happens is a difference in Python versions.
I was trying to run this locally with Python 3.8.10,
while Colab runs 3.7.12.
I ran everything on Colab with Python 3.7.12 and trained my model with no further issues.
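A small guard at the top of a script can make this kind of mismatch visible early. This is a minimal sketch; the helper name matches_minor is hypothetical, and the versions used are the ones quoted in this answer.

```python
import sys

def matches_minor(expected, actual=None):
    """Return True when the (major, minor) version equals `expected`."""
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual[:2])
    return actual == tuple(expected)

# Versions quoted above: 3.8.10 locally, 3.7.12 on Colab.
local_ok = matches_minor((3, 7), actual=(3, 8, 10))
colab_ok = matches_minor((3, 7), actual=(3, 7, 12))
print(local_ok, colab_ok)
```

Calling matches_minor((3, 7)) with no second argument checks the interpreter that is currently running.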
I want to train a network using multiple GPUs (2x NVIDIA RTX A6000) on a Windows 11 machine.
I copied the multi-GPU and distributed training code from https://keras.io/guides/distributed_training/
However, I see that GPU 0 is utilized just fine, while GPU 1 is barely utilized.
Here is a picture of the utilization:
[Image: GPUs utilization]
When using
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)
I can even see huge gaps in the utilization of GPU 1, as seen in:
[Image: GPUs utilization]
meaning that for several epochs the second GPU was not utilized at all.
The only differences between the example code and mine are that I set epochs to 20 and I use:
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
Since running without HierarchicalCopyAllReduce() results in an error:
InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node Adam/NcclAllReduce}} with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU]
Registered kernels:
<no registered kernels>
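The choice of all-reduce implementation can be made explicit per platform. This sketch assumes, as the error above suggests for this machine, that no NCCL kernel is registered on Windows builds; the helper names are hypothetical, and TensorFlow is imported lazily so the selection logic can be checked on its own.

```python
import platform

def all_reduce_name(system=None):
    """Pick a tf.distribute cross-device op class name for the given OS name."""
    system = platform.system() if system is None else system
    # Assumption: NCCL kernels are unavailable on Windows, as the error above suggests.
    return "HierarchicalCopyAllReduce" if system == "Windows" else "NcclAllReduce"

def make_strategy(system=None):
    """Build a MirroredStrategy with the selected all-reduce implementation."""
    import tensorflow as tf  # imported lazily; only needed when a strategy is built
    cross_device_ops = getattr(tf.distribute, all_reduce_name(system))()
    return tf.distribute.MirroredStrategy(cross_device_ops=cross_device_ops)

print(all_reduce_name("Windows"))
print(all_reduce_name("Linux"))
```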
Increasing the batch size to 512 seems to help a lot, and the second GPU is utilized.
[Image: GPUs utilization using 512 batch size]
I also tried running the code with strategy.experimental_distribute_dataset, again with a batch size of 512 since that batch size utilized both GPUs well. However, doing so causes the second GPU to go unused, as seen in the picture below.
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
train_dataset = strategy.experimental_distribute_dataset(train_dataset)
val_dataset = strategy.experimental_distribute_dataset(val_dataset)
test_dataset = strategy.experimental_distribute_dataset(test_dataset)
#model.fit(train_dataset, epochs=20, validation_data=val_dataset)
model.fit(train_dataset, epochs=20, validation_data=val_dataset, steps_per_epoch=98, validation_steps=98)
And again I see that the GPU utilization vanishes:
[Image: GPUs utilization using experimental_distribute_dataset]
My questions are:
1. Why is the second GPU hardly utilized? Isn't the batch split equally between the GPUs, i.e. if the batch size is 128, each GPU receives 64? I assumed that the same model runs on both GPUs and each gets half the batch to process, after which the reduce happens.
2. If the batch is split that way, wouldn't both GPUs be similarly utilized even with a small batch size?
3. Also, why does distributing the dataset using the strategy make utilization worse?
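The batch arithmetic assumed in the first question can be written out as follows. It is a sketch under the assumption that the batch size given to the dataset is the global batch size and that it is divided evenly across replicas. The 50,000-sample dataset size is hypothetical and does not come from the question.

```python
import math

def per_replica_batch(global_batch_size, num_replicas):
    """Samples each replica receives per step when the global batch is split evenly."""
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size is not divisible by the number of replicas")
    return global_batch_size // num_replicas

def steps_per_epoch(num_samples, global_batch_size):
    """Number of steps needed to see every sample once."""
    return math.ceil(num_samples / global_batch_size)

NUM_REPLICAS = 2      # two GPUs, as in the question
NUM_SAMPLES = 50_000  # hypothetical dataset size

for global_batch in (128, 512):
    print(global_batch,
          per_replica_batch(global_batch, NUM_REPLICAS),
          steps_per_epoch(NUM_SAMPLES, global_batch))
```

With these numbers, a global batch of 128 gives 64 samples per GPU and a global batch of 512 gives 256 per GPU. For a 50,000-sample dataset, the 512 case works out to 98 steps per epoch, the same value passed to fit above.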
I'm a noob when it comes to Python and machine learning. I'm trying to run two different projects that have to do with something called Deep Image Matting:
https://github.com/Joker316701882/Deep-Image-Matting with TensorFlow
https://github.com/huochaitiantang/pytorch-deep-image-matting with PyTorch
I'm just trying to run the tests in these projects, but I run into various problems. Can I run these on a machine without a GPU? I thought a GPU only speeds up processing, and I just want to see these run before getting a machine with a GPU.
I apologize in advance, as I know I'm a total noob at this.
When I try the TensorFlow project:
I get an error on the line gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = args.gpu_fraction), probably because I was on TF2 and this requires TF1.
After I downgraded to TF1, running the test gives me W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
and InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'MaxPoolWithArgmax' with these attrs. Registered devices: [CPU], Registered kernels:
<no registered kernels>
Now I'm stuck, because I have no clue what this means.
When I try the PyTorch project:
First I get this error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
So I added map_location=torch.device('cpu') where the model is loaded, but now I get RuntimeError: Error(s) in loading state_dict for VGG16:
size mismatch for conv6_1.weight: copying a param with shape torch.Size([512, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
And I'm stuck again.
Can someone help?
Thank you in advance!
For the PyTorch one, there were two problems, and it looks like you've solved the first one on your own with map_location. The second problem is that the weights in your checkpoint and the weights in your model don't have the same shape. Taking a quick detour to the GitHub repo, look at net.py in core, lines 26 to 28:
# model released before 2019.09.09 should use kernel_size=1 & padding=0
# self.conv6_1 = nn.Conv2d(512, 512, kernel_size=1, padding=0,bias=True)
self.conv6_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1,bias=True)
I'm guessing the checkpoint holds weights where conv6_1 has a kernel size of 1 rather than 3, as in the commented-out line of code. So try uncommenting the line with kernel_size=1 and commenting out the line with kernel_size=3.
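To see every mismatched parameter at once instead of one error at a time, the shapes on both sides can be compared before loading. The sketch below works on plain dictionaries of shapes so it runs without PyTorch, and the helper name find_shape_mismatches is hypothetical. With real objects, such dictionaries could be built as {k: tuple(v.shape) for k, v in state_dict.items()}, assuming the checkpoint stores a plain state_dict.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    return {
        name: (tuple(shape), tuple(model_shapes[name]))
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and tuple(model_shapes[name]) != tuple(shape)
    }

# Shapes taken from the error message in the question.
checkpoint_shapes = {"conv6_1.weight": (512, 512, 1, 1)}
model_shapes = {"conv6_1.weight": (512, 512, 3, 3)}
mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
print(mismatches)
```

Here the last two dimensions are the kernel height and width, so (1, 1) in the checkpoint against (3, 3) in the model corresponds to the kernel_size=1 versus kernel_size=3 lines in net.py.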
I am trying to perform some hyperparameter tuning of a convolutional neural network written in Tensorflow 2.0 with GPU extension.
My systems settings are:
Windows 10 64bit
GeForce RTX2070, 8GB
Tensorflow 2.0-beta
CUDA 10.0 properly installed (I hope, deviceQuery.exe and bandwidthTest.exe passed positively)
My neural network has 75,572,574 parameters and I am training it on 3777 samples. In a single run, I have no problems training the CNN.
As a next step, I wanted to tune two hyperparameters of the CNN. To this end, I created a for loop (iterating over 20 steps) in which I build and compile a new model each time, changing the hyperparameters at every iteration.
The gist of the code (this is not an MWE) is the following:
import numpy as np
import tensorflow as tf
from tensorflow import keras
def build_model(input_shape, output_shape, lr=0.01, dropout=0, summary=True):
    model = keras.models.Sequential(name="CNN")
    model.add(keras.layers.Conv2D(32, (7, 7), activation='relu', input_shape=input_shape, padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Conv2D(128, (3, 3), activation='relu', padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(1024, activation='relu'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(output_shape, activation='linear'))
    model.build()
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse",
                  metrics=[RMSE])
    if summary:
        print(model.summary())
    return model
...
for run_id in range(25):
    lr = learning_rate.max_value + (learning_rate.min_value - learning_rate.max_value) * np.random.rand(1)
    # Use a separate name so the `dropout` range object is not overwritten between iterations.
    dropout_rate = dropout.min_value + (dropout.max_value -
                                        dropout.min_value) * np.random.rand(1)
    print("%=== Run #{0}".format(run_id))
    run_dir = hparamdir + "\\run{0}".format(run_id)
    model0 = build_model(IMG_SHAPE, Ytrain.shape[1], lr=lr, dropout=dropout_rate)
    model0_history = model0.fit(Xtrain,
                                Ytrain,
                                validation_split=0.3,
                                epochs=2,
                                verbose=2)
The problem I encountered is that, after a few (6) loops, the program halts returning the error
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[73728,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: dense_12/kernel/Initializer/random_uniform/
Process finished with exit code 1.
I believe the problem is that the GPU does not release the memory in between each iteration of the for loop and, after a while, it saturates and crashes.
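For a sense of scale, the tensor named in the OOM message can be sized with a few lines of arithmetic. This sketch assumes float32 storage at 4 bytes per element; the helper name tensor_mib is hypothetical.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (float32 uses 4 bytes per element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20

kernel_shape = (73728, 1024)  # shape reported in the OOM message
kernel_params = kernel_shape[0] * kernel_shape[1]
kernel_mib = tensor_mib(kernel_shape)
share_of_model = kernel_params / 75_572_574  # total parameter count quoted above

print(kernel_params, kernel_mib, round(share_of_model, 4))
```

That single dense kernel holds 75,497,472 of the model's 75,572,574 parameters and takes 288 MiB for the weights alone. Adam keeps two additional values per weight, so each model left in GPU memory would cost roughly three times that, assuming nothing is released between iterations.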
I have dug around and tried different solutions suggested in similar posts (post1, post2):
1. Releasing the memory using the Keras backend at the end of every iteration of the for loop, using
from keras import backend as K
K.clear_session()
2. Clearing the GPU using Numba and CUDA with
from numba import cuda
cuda.select_device(0)
cuda.close()
3. Deleting the graph using del model0, but that did not work either.
I couldn't try tf.reset_default_graph(), since the programming style of TF2.0 doesn't have a default graph anymore (AFAIK), and thus I have not found a way to kill/delete a graph at runtime.
Solutions 1 and 3 returned the same out-of-memory error, while solution 2 returned the following error during the second iteration of the for loop, while building the model in the build_model() call:
2019-07-24 19:51:31.909377: F .\tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid resource handle
Process finished with exit code -1073740791 (0xC0000409)
I tried to look around and I don't really understand the last error; my guess is that the GPU has not been closed properly, is occupied, or cannot be seen by Python anymore.
In any case, I could not find any solution to this issue, except for running the training by hand for every hyperparameter to be tested.
Does anybody have any idea how to solve this problem?
Or a workaround for hyperparameter tuning?
Should I open an issue in the TF2.0 GitHub issue tracker (it does not appear to be a TensorFlow issue per se, since they state that they deliberately do not free GPU memory, to avoid fragmentation problems)?
This is due to how TF handles memory.
If you monitor your system while iteratively training TF models, you will observe a linear increase in memory consumption. Additionally, if you run watch -n 0.1 nvidia-smi, you will notice that the PID of the process remains constant while iterating. TF does not fully release the memory it has used until the process (PID) that owns it is killed. Also, the Numba documentation notes that cuda.close() is not useful if you want to reset the GPU (though I definitely spent a while trying to make it work when I discovered it!).
The easiest solution is to iterate using the Ray Python package, with something like the following:
import ray
@ray.remote(
    num_gpus=1  # or however many you want to use (e.g., 0.5, 1, 2)
)
class RayNetWrapper:
    def __init__(self, net):
        self.net = net
    def train(self):
        return self.net.train()
ray.init()
actors = [RayNetWrapper.remote(model) for _ in range(25)]
results = ray.get([actor.train.remote() for actor in actors])
You should then notice that GPU processes cycle on and off with new PIDs each time, and your system memory no longer increases. Alternatively, you can put your model training code in a separate Python script and call it iteratively using Python's subprocess module. You will also notice some latency when models shut down and new models boot up, but this is expected because the GPU is restarting. Ray also has an experimental asynchronous framework that I've had some success with, which enables fractional sharing of GPUs (model size permitting).
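The subprocess alternative can be sketched with the standard library alone. In this sketch the child process is a stand-in that only reports its PID and the learning rate it was given. A real setup would launch a training script instead, and both that script and the run_trial helper are hypothetical here.

```python
import os
import subprocess
import sys

# Stand-in for a real training script: prints its own PID and the learning rate it received.
CHILD_CODE = "import os, sys; print(os.getpid(), sys.argv[1])"

def run_trial(lr):
    """Run one trial in a fresh Python process and return (pid, lr) as reported by the child."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(lr)],
        capture_output=True, text=True, check=True,
    )
    pid, echoed_lr = result.stdout.split()
    return int(pid), float(echoed_lr)

parent_pid = os.getpid()
trials = [run_trial(lr) for lr in (0.01, 0.001, 0.0001)]
print(parent_pid, trials)
```

Each trial runs under a PID different from the parent's, and whatever that process allocated is released when it exits, which is the behavior described above for GPU memory.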
You can place these lines at the top of your code:
import tensorflow as tf
from tensorflow.python.framework.config import set_memory_growth
tf.compat.v1.disable_v2_behavior()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
That works for me.