CUDA_OUT_OF_MEMORY in PyTorch head2head model - python

I am running the head2head model from the GitHub repo here.
When I run the code with the following command:
./scripts/train/train_on_target.sh Obama head2headDataset
with contents of the train_on_target.sh file as:
target_name=$1
dataset_name=$2
python train.py --checkpoints_dir checkpoints/$dataset_name \
--target_name $target_name \
--name head2head_$target_name \
--dataroot datasets/$dataset_name/dataset \
--serial_batches
I then get the following error:
Traceback (most recent call last):
File "train.py", line 108, in <module>
flow_ref, conf_ref, t_scales, n_frames_D)
File "/home/nitin/head2head/util/util.py", line 48, in get_skipped_flows
flow_ref_skipped[s], conf_ref_skipped[s] = flowNet(real_B[s][:,1:], real_B[s][:,:-1])
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/head2head/models/flownet.py", line 38, in forward
flow, conf = self.compute_flow_and_conf(input_A, input_B)
File "/home/nitin/head2head/models/flownet.py", line 55, in compute_flow_and_conf
flow1 = self.flowNet(data1)
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/head2head/models/flownet2_pytorch/models.py", line 156, in forward
flownetfusion_flow = self.flownetfusion(concat3)
File "/home/nitin/anaconda3/envs/head2head/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/nitin/head2head/models/flownet2_pytorch/networks/FlowNetFusion.py", line 62, in forward
concat0 = torch.cat((out_conv0,out_deconv0,flow1_up),1)
RuntimeError: CUDA out of memory. Tried to allocate 82.00 MiB (GPU 0; 5.80 GiB total capacity; 4.77 GiB already allocated; 73.56 MiB free; 4.88 GiB reserved in total by PyTorch)
I have checked the batch size in options/base_options.py; it is already set to 1. How can I resolve this exception? My system has a 6 GB NVIDIA GTX 1660 Super GPU.

Data management:
You can try reducing the dataset used for training to check whether it is a hardware limitation.
Moreover, if it is an image dataset, you can reduce the memory footprint by lowering the image resolution (pixel dimensions).
Model parameters management:
Another approach is to reduce the number of parameters of your model. The first suggestion would be to shrink the largest layer sizes, and then the other network hyperparameters.
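For the image-resolution suggestion, here is a minimal sketch (generic PyTorch, not taken from the head2head code; the tensor name and sizes are assumptions) of downscaling frames before they reach the network:
import torch
import torch.nn.functional as F

# Hypothetical batch of video frames with shape (B, C, H, W).
frames = torch.randn(1, 3, 512, 512)

# Halve the spatial resolution before the forward pass; activation memory
# scales roughly with H * W, so this cuts it to about a quarter.
frames_small = F.interpolate(frames, scale_factor=0.5, mode='bilinear', align_corners=False)
print(frames_small.shape)  # torch.Size([1, 3, 256, 256])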

Related

TensorFlow: out of memory depends on data size

I have a dummy model (a linear autoencoder). Training on a dataset of 1,000 records works, but on a dataset three orders of magnitude larger it runs out of GPU memory, even though the batch size is fixed and the computer has enough RAM to hold the data.
Am I doing something silly?
Note: it works fine on TF 2.5, but crashes on TF 2.6-2.9. It always works if training on CPU.
The model is:
def get_model(n_inputs: int) -> models.Model:
    inp = layers.Input(shape=(n_inputs,))
    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m
I am feeding the data through the tf.data API:
def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=len(data), reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    autoencoder = batched.map(lambda x: (x, x)).prefetch(5)
    return autoencoder
The full reproducing script is here. Running python benchmark.py works, but python benchmark.py --big doesn't.
I am using Python 3.9 on Fedora 36. The GPU is an NVIDIA RTX 2070 with 8 GiB of memory. The driver version is 515.48.07 and the CUDA version is 11.7. nvidia-smi reports that most of the memory is available between runs, and the small version requires less than 800 MiB.
The full traceback is:
2022-09-05 15:29:37.525261: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:29:54.002629: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
2022-09-05 15:29:54.002987: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_******____________________________________________________________________________________________
Traceback (most recent call last):
File "/home/david/[path]/benchmark.py", line 49, in <module>
main(parser.parse_args().big)
File "/home/david/[path]/benchmark.py", line 40, in main
train_data_iterator = wrap_data(train_data)
File "/home/david/[path]/benchmark.py", line 33, in wrap_data
dataset = tf.data.Dataset.from_tensor_slices(data)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
element = structure.normalize_element(element)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
And specifying the GPU memory allocator, as suggested, doesn't help:
2022-09-05 15:33:19.542433: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:33:25.973935: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:288] gpu_async_0 cuMemAllocAsync failed to allocate 16384000000 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
Reported by CUDA: Free memory/Total memory: 1115357184/8369799168
2022-09-05 15:33:25.973961: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:293] Stats: Limit: 6221922304
InUse: 67126312
MaxInUse: 201327628
NumAllocs: 13
MaxAllocSize: 67108864
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-09-05 15:33:25.973970: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:56] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
2022-09-05 15:33:25.973974: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4, 5
2022-09-05 15:33:25.973976: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 8, 2
2022-09-05 15:33:25.973979: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1028, 1
2022-09-05 15:33:25.973982: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 16384, 1
2022-09-05 15:33:25.973985: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 67108864, 1
Traceback (most recent call last):
File "/home/david/[path]/benchmark.py", line 48, in <module>
main(parser.parse_args().big)
File "/home/david/[path]/benchmark.py", line 40, in main
train_data_iterator = wrap_data(train_data)
File "/home/david/[path]/benchmark.py", line 33, in wrap_data
dataset = tf.data.Dataset.from_tensor_slices(data)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
element = structure.normalize_element(element)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
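For what it's worth, the traceback shows the failure inside tf.data.Dataset.from_tensor_slices: in eager mode the whole NumPy array is first converted into a single constant on the default (GPU) device, which is why the error scales with dataset size rather than batch size. A minimal sketch of one common workaround, assuming the same wrap_data signature as above, is to build the dataset under a CPU device scope so that only batches get copied to the GPU:
import numpy as np
import tensorflow as tf

def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    # Pin the source tensor to host memory; from_tensor_slices would otherwise
    # materialise the full array as one eager constant on the GPU.
    with tf.device('/CPU:0'):
        dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=len(data), reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    return batched.map(lambda x: (x, x)).prefetch(5)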

I am using PyTorch DataParallel with 2 GPUs. Why are my model's state_dicts empty on one GPU and missing keys on the other?

I have a problem with this GitHub project: https://github.com/researchmm/TTSR
If I use it on one GPU only, everything runs smoothly. Once I turn on the second GPU and use torch.nn.DataParallel, I get "Missing key(s) in state_dict":
[2021-08-03 09:01:00,829] - [trainer.py file line:70] - INFO: Current epoch learning rate: 1.000000e-04
Traceback (most recent call last):
File "/rwthfs/rz/cluster/home/ps815691/git/TTSR/main.py", line 53, in <module>
t.train(current_epoch=epoch, is_init=False)
File "/rwthfs/rz/cluster/home/ps815691/git/TTSR/trainer.py", line 126, in train
sr_lv1, sr_lv2, sr_lv3 = self.model(sr=sr)
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/rwthfs/rz/cluster/home/ps815691/git/TTSR/model/TTSR.py", line 32, in forward
self.LTE_copy.load_state_dict(self.LTE.state_dict())#, strict=False)
File "/home/ps815691/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LTE:
Missing key(s) in state_dict: "slice1.0.weight", "slice1.0.bias", "slice2.2.weight", "slice2.2.bias", "slice2.5.weight", "slice2.5.bias", "slice3.7.weight", "slice3.7.bias", "slice3.10.weight", "slice3.10.bias".
I printed the state_dicts for the "LTE" and "LTE_copy":
LTE GPU1 odict_keys([])
LTE GPU0 odict_keys(['sub_mean.weight', 'sub_mean.bias'])
LTE_Copy GPU1 odict_keys([])
LTE_Copy GPU0 odict_keys(['slice1.0.weight', 'slice1.0.bias', 'slice2.2.weight', 'slice2.2.bias', 'slice2.5.weight', 'slice2.5.bias', 'slice3.7.weight', 'slice3.7.bias', 'slice3.10.weight', 'slice3.10.bias', 'sub_mean.weight', 'sub_mean.bias'])
I do not get why that happens. Let me give you a quick introduction to the code:
The code starts in main.py. First, the model gets initialized from model/ttsr.py. This TTSR model is composed of several submodels, two of which are "LTE" and "LTE_copy". Then that model is wrapped in nn.DataParallel, and the trainer (trainer.py) is initialized with that model. t.train starts the training:
_model = TTSR.TTSR(args).to(device)
_model = nn.DataParallel(_model, list(range(args.num_gpu)))
t = Trainer(args, _logger, _dataloader, _model, _loss_all)
t.train(current_epoch=epoch, is_init=True)
In the train function, after a batch has been fed through the model, the model's output is fed back into the model to compute some parts of the loss function (trainer.py line 97). The model then executes this code in ttsr.py:
### used in transferal perceptual loss
self.LTE_copy.load_state_dict(self.LTE.state_dict())
sr_lv1, sr_lv2, sr_lv3 = self.LTE_copy((sr + 1.) / 2.)
return sr_lv1, sr_lv2, sr_lv3
Does anyone have a clue why the error message above gets thrown? It does not appear if I use load_state_dict(..., strict=False), but doesn't that just ignore the underlying problem? There does not seem to be any LTE state_dict in GPU1's memory, for example.
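One detail that may be relevant: nn.DataParallel re-replicates the wrapped module onto every device at each forward call, and updates made inside a replica's forward (such as the load_state_dict above) are discarded along with the replicas; the replicas' state_dicts can also come back incomplete, which would match the empty dicts printed above. A hedged sketch of one way to sidestep this, assuming trainer.py has access to the wrapped model (names follow the question's code, not a tested fix), is to synchronise the LTE weights on the underlying module before the DataParallel forward runs:
# Hypothetical rearrangement for trainer.py.
base_model = self.model.module if isinstance(self.model, torch.nn.DataParallel) else self.model

# Copy LTE -> LTE_copy on the real (persistent) module, not inside TTSR.forward.
base_model.LTE_copy.load_state_dict(base_model.LTE.state_dict())

# Then run the parallel forward as before; each replica starts from the updated weights.
sr_lv1, sr_lv2, sr_lv3 = self.model(sr=sr)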

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

I am training a model and have put the dataset inside the data folder. The structure looks like this:
--data
-----mars
---------bbox_train
---------bbox_test
---------info
Many developers have said that this is a label problem, but I am not sure, because the labels are in the right place.
Args:Namespace(arch='resnet50graphpoolparthyper', concat=False, dataset='mars', dropout=0.1, eval_step=100, evaluate=False, gamma=0.1, gpu_devices='0', height=256, htri_only=False, lr=0.0003, margin=0.3, max_epoch=800, nheads=8, nhid=512, num_instances=4, part1=4, part2=8, part3=2, pool='avg', pretrained_model='/home/jiyang/Workspace/Works/video-person-reid/3dconv-person-reid/pretrained_models/resnet-50-kinetics.pth', print_freq=80, save_dir='log_hypergraphsagepart', seed=1, seq_len=8, start_epoch=0, stepsize=200, test_batch=1, train_batch=32, use_cpu=False, warmup=True, weight_decay=0.0005, width=128, workers=4, xent_only=False)
==========
Currently using GPU 0
Initializing dataset mars
=> MARS loaded
Dataset statistics:
------------------------------
subset | # ids | # tracklets
------------------------------
train | 625 | 8298
query | 626 | 1980
gallery | 622 | 9330
------------------------------
total | 1251 | 19608
number of images per tracklet: 2 ~ 920, average 59.5
------------------------------
Initializing model: resnet50graphpoolparthyper
Model size: 44.17957M
==> Epoch 1/800 lr:1.785e-05
Traceback (most recent call last):
File "main_video_person_reid_hypergraphsage_part.py", line 357, in <module>
main()
File "main_video_person_reid_hypergraphsage_part.py", line 220, in main
train(model, criterion_xent, criterion_htri, optimizer, trainloader, use_gpu)
File "main_video_person_reid_hypergraphsage_part.py", line 257, in train
outputs, features = model(imgs)
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/khawar/HDD_Khawar1/hypergraph_reid/models/ResNet_hypergraphsage_part.py", line 621, in forward
x = self.base(x)
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/khawar/HDD_Khawar1/hypergraph_reid/models/resnet.py", line 213, in forward
x = self.conv1(x)
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Installing torch 1.8 with CUDA 11.1, using the following command, fixed the initial issue:
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
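Independently of the install, a quick smoke test (plain PyTorch, nothing project-specific) can help confirm that the wheel, driver and cuDNN actually work together:
import torch

# Versions the installed wheel was built against, and whether a GPU is visible.
print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))

# A tiny convolution on the GPU; a broken build/driver combination typically
# fails here with the same CUDNN_STATUS_NOT_INITIALIZED error.
x = torch.randn(1, 3, 32, 32, device='cuda')
conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
print(conv(x).shape)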

CPU version of "torch._C._nn.nll_loss" function

Is there a version of torch._C._nn.nll_loss that takes a CPU input? I don't have enough GPU memory to run my function, so I'm trying to run everything on the CPU.
This is my specific error (note the anaconda paths in the traceback):
Traceback (most recent call last):
File "plot_parametric_pytorch.py", line 395, in <module>
val_result = validate(val_loader, model, criterion, 0)
File "plot_parametric_pytorch.py", line 228, in validate
training=False, optimizer=None)
File "plot_parametric_pytorch.py", line 169, in forward
loss = criterion(output, target_var)
File "/home/klee/anaconda3/envs/sharpenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/klee/anaconda3/envs/sharpenv/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 932, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/home/klee/anaconda3/envs/sharpenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/home/klee/anaconda3/envs/sharpenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2115, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _thnn_nll_loss_forward
nll_loss works on both CPU and GPU, but the input and the target need to be on the same device. Yours are on different devices: the first one (output) is on the CPU, but the second (target_var) is on the GPU.
You need to put target_var onto the CPU.
loss = criterion(output, target_var.cpu())
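Since the goal is to run everything on the CPU, it may be cleaner to keep the model and all tensors on the CPU from the start, so no per-batch .cpu() calls are needed. A minimal sketch with generic names (not the question's script):
import torch
import torch.nn as nn

model = nn.Linear(10, 5)            # call model.cpu() first if it was moved to the GPU
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(4, 10)         # CPU tensors by default
targets = torch.randint(0, 5, (4,))

output = model(inputs)
loss = criterion(output, targets)   # both arguments on the CPU, so no device mismatch
print(loss.item())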

Multi-GPU training of AllenNLP coreference resolution

I'm trying to replicate (or come close to) the results obtained by the End-to-end Neural Coreference Resolution paper on the CoNLL-2012 shared task. I intend to do some enhancements on top of this, so I decided to use AllenNLP's CoreferenceResolver. This is how I'm initialising and training the model:
import torch
from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import ConllCorefReader
from allennlp.data.dataset_readers.dataset_utils import Ontonotes
from allennlp.data.iterators import BasicIterator, MultiprocessIterator
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.models import CoreferenceResolver
from allennlp.modules import Embedding, FeedForward
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import TokenCharactersEncoder
from allennlp.training import Trainer
from allennlp.training.learning_rate_schedulers import LearningRateScheduler
from torch.nn import LSTM, ReLU
from torch.optim import Adam
def read_data(directory_path):
    data = []
    for file_path in Ontonotes().dataset_path_iterator(directory_path):
        data += dataset_reader.read(file_path)
    return data
INPUT_FILE_PATH_TEMPLATE = "data/CoNLL-2012/v4/data/%s"
dataset_reader = ConllCorefReader(10, {"tokens": SingleIdTokenIndexer(),
"token_characters": TokenCharactersIndexer()})
training_data = read_data(INPUT_FILE_PATH_TEMPLATE % "train")
validation_data = read_data(INPUT_FILE_PATH_TEMPLATE % "development")
vocabulary = Vocabulary.from_instances(training_data + validation_data)
model = CoreferenceResolver(vocab=vocabulary,
text_field_embedder=BasicTextFieldEmbedder({"tokens": Embedding.from_params(vocabulary, Params({"embedding_dim": embeddings_dimension, "pretrained_file": "glove.840B.300d.txt"})),
"token_characters": TokenCharactersEncoder(embedding=Embedding(num_embeddings=vocabulary.get_vocab_size("token_characters"), embedding_dim=8, vocab_namespace="token_characters"),
encoder=CnnEncoder(embedding_dim=8, num_filters=50, ngram_filter_sizes=(3, 4, 5), output_dim=100))}),
context_layer=PytorchSeq2SeqWrapper(LSTM(input_size=400, hidden_size=200, num_layers=1, dropout=0.2, bidirectional=True, batch_first=True)),
mention_feedforward=FeedForward(input_dim=1220, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
antecedent_feedforward=FeedForward(input_dim=3680, num_layers=2, hidden_dims=[150, 150], activations=[ReLU(), ReLU()], dropout=[0.2, 0.2]),
feature_size=20,
max_span_width=10,
spans_per_word=0.4,
max_antecedents=250,
lexical_dropout=0.5)
if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1
iterator = BasicIterator(batch_size=1)
iterator.index_with(vocabulary)
optimiser = Adam(model.parameters(), weight_decay=0.1)
Trainer(model=model,
train_dataset=training_data,
validation_dataset=validation_data,
optimizer=optimiser,
learning_rate_scheduler=LearningRateScheduler.from_params(optimiser, Params({"type": "step", "step_size": 100})),
iterator=iterator,
num_epochs=150,
patience=1,
cuda_device=cuda_device).train()
After reading the data, I trained the model but ran out of GPU memory: RuntimeError: CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 11.17 GiB total capacity; 3.96 GiB already allocated; 3.40 GiB free; 3.47 GiB cached). Therefore, I attempted to use multiple GPUs to train this model. I'm using Tesla K80s (which have 12 GiB of memory each).
I've tried making use of AllenNLP's MultiprocessIterator, by initialising the iterator as MultiprocessIterator(BasicIterator(batch_size=1), num_workers=torch.cuda.device_count()). However, only 1 GPU is used (verified by monitoring memory usage with the nvidia-smi command) and I get the error below. I also tried fiddling with its parameters (increasing num_workers or decreasing output_queue_size) and the ulimit (as mentioned in this PyTorch issue), to no avail.
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
output_queue.put(tensor_dict)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.local/lib/python3.6/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
output_queue.put(tensor_dict)
File "<string>", line 2, in put
File "<string>", line 2, in put
File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/managers.py", line 228, in serve_client
request = recv()
File "/usr/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
File "/home/user/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
---------------------------------------------------------------------------
I also tried achieving this through PyTorch's DataParallel, by wrapping the model's context_layer, mention_feedforward and antecedent_feedforward with a custom DataParallelWrapper (to provide compatibility with the class functions AllenNLP expects). Still, only 1 GPU is used and it eventually runs out of memory as before.
class DataParallelWrapper(DataParallel):
    def __init__(self, module):
        super().__init__(module)

    def get_output_dim(self):
        return self.module.get_output_dim()

    def get_input_dim(self):
        return self.module.get_input_dim()

    def forward(self, *inputs):
        return self.module.forward(inputs)
After some digging through the code, I found out that AllenNLP does this under the hood directly through its Trainer. cuda_device can either be a single int (for single-GPU training) or a list of ints (for multi-GPU training):
cuda_device : Union[int, List[int]], optional (default = -1)
An integer or list of integers specifying the CUDA device(s) to use. If -1, the CPU is used.
So all required GPU device ids should be passed instead:
if torch.cuda.is_available():
    cuda_device = list(range(torch.cuda.device_count()))
    model = model.cuda(cuda_device[0])
else:
    cuda_device = -1
Note that the model still has to be manually moved to the first GPU (via model.cuda(...)), as it would otherwise stay on the CPU.
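Putting this together with the Trainer call from the question (all other arguments unchanged, same variables as above):
import torch

if torch.cuda.is_available():
    cuda_device = list(range(torch.cuda.device_count()))
    model = model.cuda(cuda_device[0])   # the model itself starts on the first GPU
else:
    cuda_device = -1

Trainer(model=model,
        train_dataset=training_data,
        validation_dataset=validation_data,
        optimizer=optimiser,
        learning_rate_scheduler=LearningRateScheduler.from_params(optimiser, Params({"type": "step", "step_size": 100})),
        iterator=iterator,
        num_epochs=150,
        patience=1,
        cuda_device=cuda_device).train()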
