model parallelism with large model doesn't distribute layers to gpus - python

I want to implement model parallelism with the Trainer in PyTorch Lightning.
My environment is a single machine with 2 GPUs of 16 GB each.
A single GPU cannot load my whole model, because the model contains several layers plus one very large layer (almost 14 GB).
I run the following code:
import pytorch_lightning as pl
# from pytorch_lightning.strategies import DeepSpeedStrategy
from mymodel import Net
from mydatamodule import DataModule

net = Net()
dm = DataModule()
trainer = pl.Trainer(
    accelerator='gpu',
    strategy='ddp_sharded',
    # strategy=DeepSpeedStrategy(
    #     stage=2,
    #     config={
    #         "autotuning": {
    #             "mp_size": 2
    #         }
    #     }
    # ),
    num_nodes=1,
    precision=16,
    check_val_every_n_epoch=10,
)
trainer.fit(net, dm)
I think the model parallelism of Fairscale/DeepSpeed distributes the layers of the model across GPUs so that each GPU stays within its memory limit, as illustrated in these articles:
https://towardsdatascience.com/distributed-parallel-training-model-parallel-training-a768058aa02a
https://huggingface.co/transformers/v4.9.2/parallelism.html#naive-model-parallel-vertical-and-pipeline-parallel
The first validation run triggered by check_val_every_n_epoch works.
But the code above behaves like Distributed Data Parallel during the training phase and raises an OOM error.
When the OOM error occurred, both GPUs held more than 14 GB.
This code also raises DeadLock detected from rank: 0 before the OOM error.
I don't understand the cause of this behavior.
I tried the strategies "ddp_sharded", "deepspeed_stage_1", "deepspeed_stage_2", "deepspeed_stage_3", and "fsdp".
But the results are all the same.
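For concreteness, the deepspeed_stage_3 attempt only swapped the strategy argument; it looked roughly like this (a sketch; devices=2 is an assumption about how I select both GPUs, everything else is as above):
trainer = pl.Trainer(
    accelerator='gpu',
    devices=2,  # assumption: use both GPUs of the single machine
    strategy='deepspeed_stage_3',
    num_nodes=1,
    precision=16,
    check_val_every_n_epoch=10,
)
trainer.fit(net, dm)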
I want to resolve the two questions below.
Is my understanding of Model Parallelism wrong?
How do I fix my code?
System Details
OS: Ubuntu 20.04
Python: 3.8.10
PyTorch: 1.12.1+cu113
PyTorch Lightning: 1.7.7
Fairscale: 0.5.12
DeepSpeed: 0.7.4

Related

Why, using the Huggingface Trainer, is single GPU training faster than 2 GPUs?

I have a VM with 2 V100s and I am training gpt2-like models (same architecture, fewer layers) using the really nice Trainer API from Huggingface. I am using the pytorch back-end.
I am observing that when I train the exact same model (6 layers, ~82M parameters) with exactly the same data and TrainingArguments, training on a single GPU is significantly faster than on 2 GPUs: ~5 hrs vs ~6.5 hrs.
How would one debug this kind of issue to understand what's causing the slowdown?
Extra notes:
the 2 gpus are both being used (watching nvidia-smi output)
I am using fp16 precision
My TrainingArguments values are:
{
    "optim": "adamw_torch",
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "fp16": true,
    "gradient_checkpointing": true,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "dataloader_num_workers": 4,
    "dataloader_pin_memory": true,
    "gradient_accumulation_steps": 1,
    "num_train_epochs": 5
}
The output of nvidia-smi topo -m is:
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-11            N/A
GPU1    SYS      X      0-11            N/A
I understand that without NVLink inter-GPU communication is not as fast as it could be, but can that be the only cause of a slowdown like the one I'm observing? And if so, is there anything I can do, or will I always have slower training times on 2 GPUs (thus making multi-GPU training essentially useless)?
Keeping this here for reference. The cause was "gradient_checkpointing": true. The slowdown induced by gradient checkpointing appears to be larger on 2 GPUs than on a single GPU. I don't really know the cause of this issue; if anyone knows, I would really appreciate an explanation.
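For anyone who wants to check this on their own setup, the only change I made was flipping that flag; a sketch with everything else as in the arguments above (output_dir is a placeholder):
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder
    optim="adamw_torch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=False,      # disabling this removed the multi-GPU slowdown for me
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
)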

Speeding-up inference of T5-like model

I am currently using a model called T0pp (https://huggingface.co/bigscience/T0pp) in production and would like to speed up inference.
I am running the following code on an on-demand EC2 g4dn.12xlarge instance (4 Nvidia T4 GPUs):
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")
model.parallelize()

# generation_input comes from my own request-handling code
input_dict = tokenizer(generation_input.inputs, return_tensors="pt", padding=True)
inputs = input_dict.input_ids.to("cuda:0")
attention_mask = input_dict.attention_mask.to("cuda:0")
with torch.no_grad():
    outputs = model.generate(inputs, attention_mask=attention_mask)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
I wanted to know which alternative you would try in order to speed up inference, and whether you know good tutorials for doing so. The main alternatives I see to speed up inference would be to use the underlying PyTorch models with:
ONNX
DeepSpeed
or using fp16 instead of fp32 parameters (with the main drawback of losing some quality); see the sketch below
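For the fp16 option, I assume the loading would look roughly like this (a sketch; torch_dtype support depends on the transformers version):
import torch
from transformers import AutoModelForSeq2SeqLM

# Load the weights directly in half precision instead of fp32
model = AutoModelForSeq2SeqLM.from_pretrained(
    "bigscience/T0pp", torch_dtype=torch.float16
)
model.parallelize()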
Would someone have experience using these tools and know which is the best / simplest option?
All this is quite new for me, and I must admit I've been a bit lost in ONNX and Deepspeed tutorials.
PS:
I already tried SageMaker, but it does not work for huge models like T0pp (40 GB).
Batching speeds things up, going from 1-2 seconds for batch size 1 to 16 seconds for batch size 32. In an ideal world, even batch size 32 would be under 1 or 2 seconds.
Maybe you could try OpenVINO? It allows you to convert your model into an Intermediate Representation (IR) and then run it on the CPU with FP16 support. OpenVINO is optimized for Intel hardware, but it should work with any processor. I cannot guarantee your model will be faster on a CPU than on an Nvidia GPU, but it's worth a try. Some NLP models are fast enough (like this BERT).
You can find a full tutorial on how to convert the PyTorch model here (FastSeg) and here (BERT). Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can work with an ONNX model. This sample code assumes the model is for computer vision.
import torch

# IMAGE_HEIGHT and IMAGE_WIDTH are the input dimensions your model expects
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to OV format (aka IR), which is the default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). The accuracy drop, in most cases, is insignificant. Run in the command line:
mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
It's worth mentioning that the Runtime can process the ONNX model directly. In that case, just skip the conversion (Model Optimizer) step and give the ONNX path to the read_model function.
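A minimal sketch of that path, reusing the Core object from the snippet above:
# Skip Model Optimizer and load the ONNX file directly
model_onnx = ie.read_model(model="model.onnx")
compiled_model_onnx = ie.compile_model(model=model_onnx, device_name="CPU")
result_onnx = compiled_model_onnx([input_image])[compiled_model_onnx.output(0)]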
Disclaimer: I work on OpenVINO.

Tensorflow2.0: GPU runs out of memory during hyperparameter tuning loop

I am trying to perform some hyperparameter tuning of a convolutional neural network written in Tensorflow 2.0 with GPU extension.
My system settings are:
Windows 10 64bit
GeForce RTX2070, 8GB
Tensorflow 2.0-beta
CUDA 10.0 properly installed (I hope, deviceQuery.exe and bandwidthTest.exe passed positively)
My neural network has 75,572,574 parameters and I am training it on 3777 samples. In a single run, I have no problems training the CNN.
As a next step, I wanted to tune two hyperparameters of the CNN. To this end, I created a for loop (iterating over 20 steps) in which I build and compile a new model each time, changing the hyperparameters at every loop iteration.
The gist of the code (this is not an MWE) is the following
import tensorflow as tf
from tensorflow import keras

def build_model(input_shape, output_shape, lr=0.01, dropout=0, summary=True):
    model = keras.models.Sequential(name="CNN")
    model.add(keras.layers.Conv2D(32, (7, 7), activation='relu', input_shape=input_shape, padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Conv2D(128, (3, 3), activation='relu', padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(1024, activation='relu'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(output_shape, activation='linear'))
    model.build()
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse",
                  metrics=[RMSE])  # RMSE is a custom metric defined elsewhere (not shown)
    if summary:
        print(model.summary())
    return model
...
for run_id in range(25):
    lr = learning_rate.max_value + (learning_rate.min_value - learning_rate.max_value) * np.random.rand(1)
    dropout = dropout.min_value + (dropout.max_value - dropout.min_value) * np.random.rand(1)
    print("%=== Run #{0}".format(run_id))
    run_dir = hparamdir + "\\run{0}".format(run_id)
    model0 = build_model(IMG_SHAPE, Ytrain.shape[1], lr=lr, dropout=dropout)
    model0_history = model0.fit(Xtrain,
                                Ytrain,
                                validation_split=0.3,
                                epochs=2,
                                verbose=2)
The problem I encountered is that, after a few (6) loop iterations, the program halts and returns the error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[73728,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: dense_12/kernel/Initializer/random_uniform/
Process finished with exit code 1.
I believe the problem is that the GPU does not release the memory in between each iteration of the for loop and, after a while, it saturates and crashes.
I have dug around and tried different solutions suggested in similar posts (post1, post2):
1. Releasing the memory using the Keras backend at the end of every iteration of the for loop:
from keras import backend as K
K.clear_session()
2. Clearing the GPU using Numba and CUDA:
from numba import cuda
cuda.select_device(0)
cuda.close()
3. Deleting the graph using del model0, but that did not work either.
I couldn't try using tf.reset_default_graph() since the programming style of TF2.0 doesn't have a default graph anymore (AFAIK) and thus I have not found a way to kill/delete a graph at runtime.
Solutions 1. and 3. returned the same out-of-memory error, while solution 2. returned the following error during the second iteration of the for loop, while building the model in the build_model() call:
2019-07-24 19:51:31.909377: F .\tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid resource handle
Process finished with exit code -1073740791 (0xC0000409)
I looked around and I don't really understand the last error; I would guess the GPU has not been closed properly / is occupied / cannot be seen by Python anymore.
In any case, I could not find any solution to this issue, except for running the training by hand for every hyperparameter to be tested.
Does anybody have any idea how to solve this problem?
Or a workaround for hyperparameter tuning?
Should I open an issue in the TF2.0 GitHub issue tracker (it does not appear to be a TensorFlow issue per se, since they state that they deliberately do not free the GPU memory, to avoid fragmentation problems)?
This is due to how TF handles memory.
If you monitor your system while iteratively training TF models, you will observe a linear increase in memory consumption. Additionally, if you watch -n 0.1 nvidia-smi you will notice that the PID for the process remains constant while iterating. TF does not fully release utilized memory until the PID controlling the memory is killed. Also, the Numba documentation notes that cuda.close() is not useful if you want to reset the GPU (though I definitely spent a while trying to make it work when I discovered it!).
The easiest solution is to iterate using the Ray python package and something like the following:
import ray

@ray.remote(
    num_gpus=1  # or however many you want to use (e.g., 0.5, 1, 2)
)
class RayNetWrapper:
    def __init__(self, net):
        self.net = net

    def train(self):
        return self.net.train()

ray.init()
actors = [RayNetWrapper.remote(model) for _ in range(25)]
results = ray.get([actor.train.remote() for actor in actors])
You should then notice that GPU processes cycle on/off with new PIDs each time and your system memory no longer increases. Alternatively, you can put your model training code in a separate python script and call it iteratively using python's subprocess module (see the sketch below). You will also notice some latency when models shut down and new models boot up, but this is expected because the GPU is restarting. Ray also has an experimental asynchronous framework that I've had some success with, and it enables fractional sharing of GPUs (model size permitting).
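A minimal sketch of that subprocess alternative, assuming a hypothetical train_one_run.py script that takes the run id on the command line:
import subprocess
import sys

for run_id in range(25):
    # Each run gets a fresh Python process, so the GPU memory is
    # released when that process exits.
    subprocess.run(
        [sys.executable, "train_one_run.py", "--run-id", str(run_id)],
        check=True,
    )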
You can place the following lines at the top of your code:
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Let TF allocate GPU memory gradually instead of grabbing it all at once
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
That works for me.

Tensorflow / keras multi_gpu_model is not split across more than one gpu

I've encountered the problem that I cannot successfully split my training batches across more than one GPU. If multi_gpu_model from tensorflow.keras.utils is used, tensorflow allocates the full memory on all available GPUs (for example 2), but only the first one (gpu[0]) is utilized at 100% when watching nvidia-smi.
I'm using tensorflow 1.12 right now.
Test on single device
model = getSimpleCNN(... some parameters)
model.compile()
model.fit()
As expected, the data is loaded by the CPU and the model runs on gpu[0] with 97%-100% GPU utilization.
Create a multi_gpu model
As described in the tensorflow api for multi_gpu_model here, the device scope for model definition is not changed.
from tensorflow.keras.utils import multi_gpu_model
model = getSimpleCNN(... some parameters)
parallel_model = multi_gpu_model(model, gpus=2, cpu_merge=False) # weights merge on GPU (recommended for NV-link)
parallel_model.compile()
parallel_model.fit()
As seen in the timeline, the CPU now not only loads the data but also performs some other calculations. Notice that the second GPU is doing nearly nothing.
The question
The effect worsens even more as soon as four GPUs are used. Utilization of the first one goes up to 100%, but for the rest there are only short peaks.
Is there any solution to fix this? How to properly train on multiple gpus?
Is there any difference between tensorflow.keras.utils and keras.utils which causes the unexpected behavior?
I just ran into the same issue.
In my case, the problem came from the use of a build_model(... parameters) function that returned the model.
Be careful with your getSimpleCNN() function; since I don't know what is in it, my best advice is to build the model sequentially in your code without using this function.

Keras with Tensorflow backend - Run predict on CPU but fit on GPU

I am using keras-rl to train my network with the D-DQN algorithm. I am running my training on the GPU with the model.fit_generator() function to allow data to be sent to the GPU while it is doing backprops. I suspect the generation of data to be too slow compared to the speed of processing data by the GPU.
In the generation of data, as instructed in the D-DQN algorithm, I must first predict Q-values with my models and then use these values for the backpropagation. And if the GPU is used to run these predictions, it means that they are breaking the flow of my data (I want backprops to run as often as possible).
Is there a way I can specify on which device to run specific operations? In a way that I could run the predictions on the CPU and the backprops on the GPU.
Maybe you can save the model at the end of the training. Then start another python file and write os.environ["CUDA_VISIBLE_DEVICES"] = "-1" before you import any keras or tensorflow stuff. Now you should be able to load the model and make predictions with your CPU.
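A minimal sketch of such a prediction script, assuming the model was saved as model.h5 (the path and the input are placeholders):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide the GPU before importing keras/tf

import numpy as np
from keras.models import load_model  # imported only after the env var is set

model = load_model("model.h5")       # placeholder path to the saved model
x = np.zeros((1, 4))                 # placeholder input shaped like your observations
print(model.predict(x))              # runs on the CPU only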
It's hard to properly answer your question without seeing your code.
The code below shows how you can list the available devices and force tensorflow to use a specific device.
import tensorflow as tf
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

get_available_devices()

with tf.device('/gpu:0'):
    # Do GPU stuff here
    ...
with tf.device('/cpu:0'):
    # Do CPU stuff here
    ...
