Loading large pytorch objects on CPU

I generate some data on a GPU and save it using
joblib.dump(results, "./results.sav")
I use joblib rather than pickle because the latter gives memory errors.
I subsequently need to read the results file on a machine without a GPU:
res = torch.load('./results.sav', map_location=torch.device('cpu'))
This however gives the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I was able to address the equivalent problem when using pickles with:
import io
import pickle
import torch

class CPU_Unpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        else:
            return super().find_class(module, name)

res = CPU_Unpickler(open('./results.pkl', "rb")).load()
Does anyone have any advice on how to save large GPU-generated PyTorch objects and then load them on a CPU?
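For what it's worth, one common pattern (a sketch, not from the original post; it assumes results is a nested structure of tensors, and ./results.pt is a placeholder path) is to move everything to the CPU before saving with torch.save, then load with map_location:

import torch

def to_cpu(obj):
    # Recursively move any tensors inside (nested) dicts, lists and tuples to the CPU.
    if torch.is_tensor(obj):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(v) for v in obj)
    return obj

# On the GPU machine: save a CPU-only copy of the results object.
torch.save(to_cpu(results), './results.pt')

# On the CPU-only machine: map_location keeps any remaining CUDA storages on the CPU.
results = torch.load('./results.pt', map_location='cpu')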

Related

How to free GPU memory in PyTorch

I have a list of sentences I'm trying to calculate perplexity for, using several models, with this code:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np

model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def score(model, tokenizer, sentence):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input.cuda(), labels=labels.cuda()).loss
    return np.exp(loss.item())

print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer))
# 4.541251105675365
Most models work well, but some sentences seem to throw an error:
RuntimeError: CUDA out of memory. Tried to allocate 10.34 GiB (GPU 0; 23.69 GiB total capacity; 10.97 GiB already allocated; 6.94 GiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This makes sense, because some sentences are very long. So what I did was wrap the call in something like try / except RuntimeError: pass.
This seemed to work until around 210 sentences, and then it just output the error:
CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I found this, which has a lot of discussion and ideas; some of it concerns potentially faulty GPUs, but I know my GPU works because this exact code runs fine with other models. There's also talk about batch size there, which is why I thought the problem might relate to freeing up memory.
I tried running torch.cuda.empty_cache() every few iterations to free the memory, as suggested here, but it didn't work (it threw the same error).
Update:
I filtered out sentences with length over 550 and this seems to remove the "CUDA error: an illegal memory access was encountered" error.
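For reference, a minimal sketch of the guarded scoring loop described in the question (not from the original post; sentences, score, model and tokenizer are the names used above), combining the try/except with the cleanup the answers below suggest:

import gc
import torch

perplexities = []
for sentence in sentences:  # 'sentences' is the list referred to in the question
    try:
        perplexities.append(score(sentence=sentence, model=model, tokenizer=tokenizer))
    except RuntimeError:
        # Out of memory on this sentence: record a placeholder and clean up before
        # moving on, so the failed allocation does not linger in the caching allocator.
        perplexities.append(float('nan'))
        gc.collect()
        torch.cuda.empty_cache()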
You need to call gc.collect() before torch.cuda.empty_cache().
I also move the model to the CPU and then delete the model and its checkpoint. Try what works for you:
import gc
model.cpu()
del model, checkpoint
gc.collect()
torch.cuda.empty_cache()
I don't have an exact answer, but I can share some troubleshooting techniques I adopted in similar situations... hope they may be helpful.
First, CUDA errors are unfortunately vague sometimes, so you should consider running your code on the CPU to see if there is actually something else going on (see here).
If the problem is about memory, here are two custom utils I use:
from torch import cuda

def get_less_used_gpu(gpus=None, debug=False):
    """Inspect cached/reserved and allocated memory on specified gpus and return the id of the less used device"""
    warn = None  # set below whenever the gpus argument needs correcting
    if gpus is None:
        warn = 'Falling back to default: all gpus'
        gpus = range(cuda.device_count())
    elif isinstance(gpus, str):
        gpus = [int(el) for el in gpus.split(',')]

    # check gpus arg VS available gpus
    sys_gpus = list(range(cuda.device_count()))
    if len(gpus) > len(sys_gpus):
        gpus = sys_gpus
        warn = f'WARNING: Specified {len(gpus)} gpus, but only {cuda.device_count()} available. Falling back to default: all gpus.\nIDs:\t{list(gpus)}'
    elif set(gpus).difference(sys_gpus):
        # take correctly specified and add as many bad specifications as unused system gpus
        available_gpus = set(gpus).intersection(sys_gpus)
        unavailable_gpus = set(gpus).difference(sys_gpus)
        unused_gpus = set(sys_gpus).difference(gpus)
        gpus = list(available_gpus) + list(unused_gpus)[:len(unavailable_gpus)]
        warn = f'GPU ids {unavailable_gpus} not available. Falling back to {len(gpus)} device(s).\nIDs:\t{list(gpus)}'

    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in gpus:
        cur_allocated_mem[i] = cuda.memory_allocated(i)
        cur_cached_mem[i] = cuda.memory_reserved(i)
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    min_allocated = min(cur_allocated_mem, key=cur_allocated_mem.get)
    if debug:
        if warn:
            print(warn)
        print('Current allocated memory:', {f'cuda:{k}': v for k, v in cur_allocated_mem.items()})
        print('Current reserved memory:', {f'cuda:{k}': v for k, v in cur_cached_mem.items()})
        print('Maximum allocated memory:', {f'cuda:{k}': v for k, v in max_allocated_mem.items()})
        print('Maximum reserved memory:', {f'cuda:{k}': v for k, v in max_cached_mem.items()})
        print('Suggested GPU:', min_allocated)
    return min_allocated

def free_memory(to_delete: list, debug=False):
    import gc
    import inspect
    calling_namespace = inspect.currentframe().f_back
    if debug:
        print('Before:')
        get_less_used_gpu(debug=True)

    for _var in to_delete:
        calling_namespace.f_locals.pop(_var, None)
        gc.collect()
        cuda.empty_cache()
    if debug:
        print('After:')
        get_less_used_gpu(debug=True)
2.1 free_memory allows you to combine gc.collect and cuda.empty_cache to delete some desired objects from the namespace and free their memory (you can pass a list of variable names as the to_delete argument; a usage sketch follows these notes). This is useful since you may have unused objects occupying memory. For example, imagine you loop through 3 models: the first one may still take up some GPU memory when you get to the second iteration (I don't know why, but I've experienced this in notebooks, and the only solution I could find was to either restart the notebook or explicitly free memory). However, I have to say this is not always practical, as you need to know which variables are holding GPU memory, and that's not always the case, especially when you have a lot of gradients internally associated with the model (see here for more info). One thing you could also try is to use with torch.no_grad(): instead of with torch.inference_mode():; they should be equivalent, but it may be worth a try.
2.2 In case you have a multi-GPU environment, you could consider switching to the least used GPU with the other util, get_less_used_gpu.
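A hypothetical usage of the two utils (not part of the original answer; variable and model names are taken from the question above):

from transformers import AutoModelForMaskedLM

# Free the model and intermediate tensors by name, printing memory stats before/after.
free_memory(['model', 'masked_input', 'labels'], debug=True)

# In a multi-GPU setup, pick the least used device for the next model.
best_gpu = get_less_used_gpu(debug=True)
model = AutoModelForMaskedLM.from_pretrained('cointegrated/rubert-tiny').to(f'cuda:{best_gpu}')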
Also, you can try to track GPU usage to see when the error happens and debug from there. The best/simplest way I can suggest is using nvtop if you are on a Linux platform.
Hope something turns out to be useful :)

How to run tflite on CPU only

I have a tflite model that runs on a Coral USB accelerator, but I want it to run on the CPU as well (as an alternative, to pass some tests when the Coral USB is not physically available).
I found this very similar question but the answers given are not useful.
My code looks like this:
class CoralObjectDetector(object):
    def __init__(self, model_path: str, label_path: str):
        """
        CoralObjectDetector, this object allows to pre-process images and perform object detection.
        :param model_path: path to the .tflite file with the model
        :param label_path: path to the file with labels
        """
        self.label_path = label_path
        self.model_path = model_path
        self.labels = dict()  # type: Dict[int, str]
        self.load_labels()
        self.interpreter = tflite.Interpreter(model_path,
                                              experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
        # more code and operations
Where model and labels are downloaded from here.
I would like to load an alternative version of the same model that lets me execute it without the Coral USB accelerator (i.e. on the CPU only). My goal is something as follows:
class CoralObjectDetector(object):
    def __init__(self, model_path: str, label_path: str, run_in_coral: bool):
        """
        CoralObjectDetector, this object allows to pre-process images and perform object detection.
        :param model_path: path to the .tflite file with the model
        :param label_path: path to the file with labels
        :param run_in_coral: whether or not to run it on coral (use CPU otherwise)
        """
        self.label_path = label_path
        self.model_path = model_path
        self.labels = dict()  # type: Dict[int, str]
        self.load_labels()
        if run_in_coral:
            self.interpreter = tflite.Interpreter(model_path,
                                                  experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
        else:
            # I expect something like this
            self.interpreter = tflite.CPUInterpreter(model_path)
        # more code and operations
I'm not sure if I need just this or something else in the inference/prediction methods.
When you compile a Coral model, it maps all the operations it can to a single TPU custom op (edgetpu-custom-op).
This means that this model will only work on the TPU. That being said, your TFLite interpreter can run CPU models too (all we did was add the experimental delegate to handle that edgetpu-custom-op). To run the CPU version, simply pass the CPU version of the model (before it was compiled).
For your object detection, if you use one of the models we provide in test_data, you'll see we provide the CPU and TPU version (for example for MNv1 SSD we have CPU and TPU versions). If you plugged these into any of our code, you'd see both work.
I'd simply check to see if a Coral TPU is attached when picking which model you use.
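A sketch of that check (not from the original answer; the paths and the tflite_runtime import are assumptions, so adjust to however you load TFLite in your project): try to load the Edge TPU delegate and fall back to the plain CPU model if it isn't available.

import tflite_runtime.interpreter as tflite

def make_interpreter(cpu_model_path: str, edgetpu_model_path: str) -> tflite.Interpreter:
    try:
        # Succeeds only if the Edge TPU runtime is installed and a device is attached.
        delegate = tflite.load_delegate('libedgetpu.so.1')
        return tflite.Interpreter(model_path=edgetpu_model_path,
                                  experimental_delegates=[delegate])
    except ValueError:
        # No Coral TPU available: use the uncompiled CPU version of the model.
        return tflite.Interpreter(model_path=cpu_model_path)

interpreter = make_interpreter('model_cpu.tflite', 'model_edgetpu.tflite')
interpreter.allocate_tensors()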

Tensorflow 2: how to switch execution from GPU to CPU and back?

In tensorflow 1.X with standalone keras 2.X, I used to switch between training on GPU and running inference on CPU (much faster for some reason for my RNN models) with the following snippet:
from multiprocessing import cpu_count

import tensorflow as tf
import keras
from keras import backend as k

keras.backend.clear_session()

def set_session(gpus: int = 0):
    num_cores = cpu_count()
    config = tf.ConfigProto(
        intra_op_parallelism_threads=num_cores,
        inter_op_parallelism_threads=num_cores,
        allow_soft_placement=True,
        device_count={"CPU": 1, "GPU": gpus},
    )
    session = tf.Session(config=config)
    k.set_session(session)
This ConfigProto functionality is no longer available in tensorflow 2.0 (there I'm using the integrated tensorflow.keras). In the beginning, it is possible to run tf.config.experimental.set_visible_devices() in order to e.g. disable the GPU, but any subsequent calls to set_visible_devices result in RuntimeError: Visible devices cannot be modified after being initialized. Is there a way of re-initializing the visible devices or is there another way of switching the devices available?
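For context (a side note, not one of the answers below): set_visible_devices only takes effect if it runs once, before TensorFlow initializes the devices, which is why later calls raise the RuntimeError. A minimal sketch of that one-shot usage, which hides the GPU for the whole process but cannot be reverted later:

import tensorflow as tf

# Must run before any model or tensor is created; afterwards the visible-device
# set is frozen, which is what raises the RuntimeError mentioned above.
tf.config.experimental.set_visible_devices([], 'GPU')  # CPU-only for this process

print(tf.config.experimental.list_logical_devices('GPU'))  # -> [] once hidden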
You can use tf.device to explicitly set which device you want to use. For example:
import tensorflow as tf

model = tf.keras.Model(...)

# Run training on GPU
with tf.device('/gpu:0'):
    model.fit(...)

# Run inference on CPU
with tf.device('/cpu:0'):
    model.predict(...)
If you only have one CPU and one GPU, the names used above should work. Otherwise, device_lib.list_local_devices() can give you a list of your devices. This post gives a nice function for listing just the names, which I adapt here to also show CPUs:
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU' or x.device_type == 'CPU']
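As a side note (not part of the original answer), TF 2.x can also list devices without device_lib; the logical device names are the ones accepted by tf.device (on older 2.x releases the function lives under tf.config.experimental):

import tensorflow as tf

# e.g. ['/device:CPU:0', '/device:GPU:0']
print([d.name for d in tf.config.list_logical_devices()])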
Could using tf.device help you?
With that, you can set whether specific operations run on the CPU or on the GPU.
I would just restart the kernel; this worked for me.

RuntimeError: Attempting to deserialize object on a CUDA device

I encounter a RuntimeError while trying to run the code on my machine's CPU instead of the GPU. The code is originally from this GitHub project - IBD: Interpretable Basis Decomposition for Visual Explanation. This is for a research project. I tried setting the GPU flag to False and looked at other solutions on this website.
GPU = False # running on GPU is highly suggested
CLEAN = False # set to "True" if you want to clean the temporary large files after generating result
APP = "classification" # Do not change! mode choide: "classification", "imagecap", "vqa". Currently "imagecap" and "vqa" are not supported.
CATAGORIES = ["object", "part"] # Do not change! concept categories that are chosen to detect: "object", "part", "scene", "material", "texture", "color"
CAM_THRESHOLD = 0.5 # the threshold used for CAM visualization
FONT_PATH = "components/font.ttc" # font file path
FONT_SIZE = 26 # font size
SEG_RESOLUTION = 7 # the resolution of cam map
BASIS_NUM = 7 # In decomposition, this is to decide how many concepts are used to interpret the weight vector of a class.
Here is the error:
Traceback (most recent call last):
  File "test.py", line 22, in <module>
    model = loadmodel()
  File "/home/joshuayun/Desktop/IBD/loader/model_loader.py", line 48, in loadmodel
    checkpoint = torch.load(settings.MODEL_FILE)
  File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 574, in _load
    result = unpickler.load()
  File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 537, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
    result = fn(storage, location)
  File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 95, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 79, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
If you don't have a GPU, then use map_location=torch.device('cpu') when loading the model:
my_model = net.load_state_dict(torch.load('classifier.pt', map_location=torch.device('cpu')))
Just giving a shorter answer. To solve this, you could change the parameters of the function named load() in the serialization.py file, which is stored in ./site-packages/torch/serialization.py
Write:
def load(f, map_location='cpu', pickle_module=pickle, **pickle_load_args):
instead of:
def load(f, map_location=None, pickle_module=pickle, **pickle_load_args):
Hope it helps.
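A less invasive variant of the same idea (my own sketch, not from the answer) is to wrap torch.load instead of editing site-packages:

import functools
import torch

# A drop-in loader that defaults map_location to 'cpu' without touching serialization.py.
load_cpu = functools.partial(torch.load, map_location='cpu')

checkpoint = load_cpu('model.pth')  # placeholder path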
"If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."
model = torch.load('model/pytorch_resnet50.pth',map_location ='cpu')
I tried adding map_location='cpu' to the load function, but it didn't work for me. If you use a model trained on a GPU on a CPU-only computer, then you may hit this bug, and you can try this solution:
class CPU_Unpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        else:
            return super().find_class(module, name)

contents = CPU_Unpickler(f).load()
You can remap the Tensor location at load time using the map_location argument to torch.load.
On the following repository, in the file "test.py", model = loadmodel() calls the model_loader.py file to load the model with torch.load().
To map the storages from GPU 0 to the CPU, add the map_location argument:
torch.load(settings.MODEL_FILE, map_location={'cuda:0': 'cpu'})
In the model_loader.py file, add map_location={'cuda:0': 'cpu'} wherever the torch.load() function is called.
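For reference, map_location accepts several equivalent forms (a sketch with a placeholder path, not from the original answer):

import torch

ckpt = torch.load('checkpoint.pth', map_location='cpu')                         # device string
ckpt = torch.load('checkpoint.pth', map_location=torch.device('cpu'))           # device object
ckpt = torch.load('checkpoint.pth', map_location={'cuda:0': 'cpu'})             # per-device remapping
ckpt = torch.load('checkpoint.pth', map_location=lambda storage, loc: storage)  # callable; keeps storages on CPU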
As you state, the problem hints that you are trying to use a CUDA model on a non-CUDA machine. Pay attention to the details of the error message: please use torch.load with map_location='cpu' to map your storages to the CPU. I had a similar problem when I tried to load (from a checkpoint) a pre-trained model on my CPU-only machine. The model was trained on a CUDA machine, so it couldn't be properly loaded. Once I added the map_location='cpu' argument to the load method, everything worked.
I faced the same problem. Instead of modifying the existing code, which was running fine yesterday, I first checked whether my GPU was free by running
nvidia-smi
I could see that it was under-utilized, so as a traditional solution I shut down the laptop, restarted it, and it got working.
(One thing I kept in mind: it was working earlier and I hadn't changed anything in the code, so it should work once I restarted it; it did, and I was able to use the GPU.)
For some reason, this also happens with Portainer, even though your machines have GPUs. A crude solution is to just restart it. It usually happens if you fiddle with the state of the container after it has been deployed (e.g. you change the restart policies while the container is running), which makes me think it's some Portainer issue.
Nothing worked for me. My pickle was a custom object, in a script file with the line
device = torch.device("cuda")
Finally, I managed to take Spikes' solution and adapt it to my needs with a simple open(path, "rb"), so for any other unfortunate developers:
class CPU_Unpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        else:
            return super().find_class(module, name)

contents = CPU_Unpickler(open(path, "rb")).load()
There is a much easier way. Just add map_location='cpu' to torch.load:
def load_checkpoint(path) -> 'LanguageModel':
    checkpoint = torch.load(path, map_location='cpu')
    model = LanguageModel(
        number_of_tokens=checkpoint['number_of_tokens'],
        max_sequence_length=checkpoint['max_sequence_length'],
        embedding_dimension=checkpoint['embedding_dimension'],
        number_of_layers=checkpoint['number_of_layers'],
        number_of_heads=checkpoint['number_of_heads'],
        feed_forward_dimension=checkpoint['feed_forward_dimension'],
        dropout_rate=checkpoint['dropout_rate']
    ).to(get_device())
    model.load_state_dict(checkpoint['model_state_dict'])
    return model.to(get_device())

Convert CudaNdarraySharedVariable to TensorVariable

I'm trying to convert a pylearn2 GPU model to a CPU-compatible version for prediction on a remote server. How can I convert CudaNdarraySharedVariables to TensorVariables to avoid an error calling CUDA code on a GPU-less machine? The experimental theano flag unpickle_gpu_to_cpu seems to have left a few CudaNdarraySharedVariables hanging around (specifically model.layers[n].transformer._W).
For a plain CudaNdarray variable, something like this should work:
x = CudaNdarray...  # the existing GPU variable
x_new = theano.tensor.TensorVariable(CudaNdarrayType([False] * tensor_dim))
f = theano.function([x_new], x_new)
converted_x = f(x)
