RuntimeError: Attempting to deserialize object on a CUDA device - python

I encounter a RunTimeError while I am trying to run the code in my machine's CPU instead of GPU. The code is originally from this GitHub project - IBD: Interpretable Basis Decomposition for Visual Explanation. This is for a research project. I tried putting the CUDA as false and looked at other solutions on this website.
GPU = False # running on GPU is highly suggested
CLEAN = False # set to "True" if you want to clean the temporary large files after generating result
APP = "classification" # Do not change! mode choide: "classification", "imagecap", "vqa". Currently "imagecap" and "vqa" are not supported.
CATAGORIES = ["object", "part"] # Do not change! concept categories that are chosen to detect: "object", "part", "scene", "material", "texture", "color"
CAM_THRESHOLD = 0.5 # the threshold used for CAM visualization
FONT_PATH = "components/font.ttc" # font file path
FONT_SIZE = 26 # font size
SEG_RESOLUTION = 7 # the resolution of cam map
BASIS_NUM = 7 # In decomposition, this is to decide how many concepts are used to interpret the weight vector of a class.
Here is the error:
Traceback (most recent call last):
File "test.py", line 22, in <module>
model = loadmodel()
File "/home/joshuayun/Desktop/IBD/loader/model_loader.py", line 48, in loadmodel
checkpoint = torch.load(settings.MODEL_FILE)
File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 574, in _load
result = unpickler.load()
File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 537, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
result = fn(storage, location)
File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 95, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/joshuayun/.local/lib/python3.6/site-packages/torch/serialization.py", line 79, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but
torch.cuda.is_available() is False. If you are running on a CPU-only machine,
please use torch.load with map_location='cpu' to map your storages to the CPU.

If you don't have gpu then use map_location=torch.device('cpu') with load model.load()
my_model = net.load_state_dict(torch.load('classifier.pt', map_location=torch.device('cpu')))

Just giving a smaller answer. To solve this, you could change the parameters of the function named load() in the serialization.py file. This is stored in: ./site-package/torch/serialization.py
Write:
def load(f, map_location='cpu', pickle_module=pickle, **pickle_load_args):
instead of:
def load(f, map_location=None, pickle_module=pickle, **pickle_load_args):
Hope it helps.

"If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."
model = torch.load('model/pytorch_resnet50.pth',map_location ='cpu')

I have tried add "map_location='cpu'" in load function, but it doesn't work for me.
If you use a model trained by GPU on a CPU only computer, then you may meet this bug. And you can try this solution.
solution
class CPU_Unpickler(pickle.Unpickler):
def find_class(self, module, name):
if module == 'torch.storage' and name == '_load_from_bytes':
return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
else: return super().find_class(module, name)
contents = CPU_Unpickler(f).load()

You can remap the Tensor location at load time using the map_location argument to torch.load.
On the following repository,in file "test.py", model = loadmodel() calls the model_loader.py file to load the model with torch.load().
While this will only map storages from GPU0, add the map_location:
torch.load(settings.MODEL_FILE, map_location={'cuda:0': 'cpu'})
In the model_loader.py file, add, map_location={'cuda:0': 'cpu'} whereever, torch.load() function is called.

As you state the problem hints you are trying to use a cuda-model on non-cuda machine. Pay attention to the details of the error message - please use torch.load with map_location='cpu' to map your storages to the CPU. I've had similar problem when I tried to load (from a checkpoint) pre-trained model on my cpu-only machine. The model was trained on a cuda machine so it couldn't be properly loaded. Once I added the map_location='cpu' argument to the load method everything worked.

I faced the same problem, Instead of modifying the existing code, which was running good yesterday, First I checked whether my GPU is free or not running
nvidia-smi
I could see that, its under utilized, therefore as traditional solution, I shutdown the laptop and restarted it and it got working.
(One thing I kept in mind that, earlier it was working and I haven't changed anything in code therefore it should work once I restart it and it got working and I was able to use the GPU)

For some reason, this also happens with portainer, even though your machines have GPUs. A crude solution would be to just restart it. It usually happens if you fiddle with the state of the container after it has been deployed (e.g. you change the restart policies while the container is running), which makes me think it's some portainer issue.

nothing worked for me-
my pickle was a custom object- in a script file with the line
device = torch.device("cuda")
finally, I managed to take Spikes solution, and adapt it to my needs with simple open(path,"rb"), so for any other unfortunate developers:
class CPU_Unpickler(pickle.Unpickler):
def find_class(self, module, name):
if module == 'torch.storage' and name == '_load_from_bytes':
return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
else: return super().find_class(module, name)
contents = CPU_Unpickler(open(path,"rb")).load()

There is much easier way. Just add map_location to torch.load(path, map_location='cpu') as map_location='cpu':
def load_checkpoint(path) -> 'LanguageModel':
checkpoint = torch.load(path, map_location='cpu')
model = LanguageModel(
number_of_tokens=checkpoint['number_of_tokens'],
max_sequence_length=checkpoint['max_sequence_length'],
embedding_dimension=checkpoint['embedding_dimension'],
number_of_layers=checkpoint['number_of_layers'],
number_of_heads=checkpoint['number_of_heads'],
feed_forward_dimension=checkpoint['feed_forward_dimension'],
dropout_rate=checkpoint['dropout_rate']
).to(get_device())
model.load_state_dict(checkpoint['model_state_dict'])
return model.to(get_device())

Related

Loading large pytorch objects on CPU

I generate some data on a gpu and save it using
joblib.dump(results, "./results.sav")
I use joblib rather than pickle as the latter gives memory errors
I subsequently need to read the results file on a machine without a gpu
res = torch.load('./results.sav', map_location=torch.device('cpu'))
This however gives the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I was able to address the equivalent problem when using pickles with:
class CPU_Unpickler(pickle.Unpickler):
def find_class(self, module, name):
if module == 'torch.storage' and name == '_load_from_bytes':
return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
else: return super().find_class(module, name)
res = sm.CPU_Unpickler(open(f'./results.pkl',"rb")).load()
Does anyone have any advice on how to save large gpu generated pytorch objects and then load them on a cpu?

Why does the Dequantize node fail to prepare?

Background
I'm playing around with MediaPipe for hand tracking and found this useful wrapper for loading MediaPipe's hand_landmark.tflite model. It works without any problems for me on Ubuntu 18.04 with Tensorflow 1.14.0.
However, when I try use a newer recently released model, I run into the following error:
INFO: Initialized TensorFlow Lite runtime.
Traceback (most recent call last):
File "/home/user/code/.../repo/models/test_model.py", line 12, in <module>
use_mediapipe_model()
File "/home/user/code/.../repo/models/test_model.py", line 8, in use_mediapipe_model
interp_joint.allocate_tensors()
File "/home/user/code/env/lib/python3.6/site-packages/tensorflow/lite/python/interpreter.py", line 95, in allocate_tensors
return self._interpreter.AllocateTensors()
File "/home/user/code/env/lib/python3.6/site-packages/tensorflow/lite/python/interpreter_wrapper/tensorflow_wrap_interpreter_wrapper.py", line 106, in AllocateTensors
return _tensorflow_wrap_interpreter_wrapper.InterpreterWrapper_AllocateTensors(self)
RuntimeError: tensorflow/lite/kernels/dequantize.cc:62 op_context.input->type == kTfLiteUInt8 || op_context.input->type == kTfLiteInt8 was not true.Node number 0 (DEQUANTIZE) failed to prepare.
When looking at the two models in Netron, I can see that the newer model uses nodes of the type Dequantize which seem to cause the problem. As I'm a beginner when it comes to Tensorflow I don't really know where to go from here.
Code to reproduce the error
from pathlib import Path
import tensorflow as tf
def use_mediapipe_model():
interp_joint = tf.lite.Interpreter(
f"{Path(__file__).parent}/hand_landmark.tflite") # path to model
interp_joint.allocate_tensors()
if __name__ == "__main__":
use_mediapipe_model()
Question
Is the problem related to the version of Tensorflow that I'm using or am I doing something wrong when it comes to loading the .tflite models?
Doesn't work in TF 1.14.0. You need at least 1.15.2

Error in loading model in PyTorch

I Have the following code snippet
from train import predict
import random
import torch
ann=torch.load('ann.pt') #importing trained model
while True:
k=raw_input("User:")
intent,top_value,top_index = predict(str(k),ann)
print(intent)
when I run the script it is throwing the error as below:
Traceback (most recent call last):
File "test.py", line 6, in <module>
ann=torch.load('ann.pt') #importing trained model
File "/home/local/ZOHOCORP/raghav-5305/miniconda2/lib/python2.7/site-packages/torch/serialization.py", line 261, in load
return _load(f, map_location, pickle_module)
File "/home/local/ZOHOCORP/raghav-5305/miniconda2/lib/python2.7/site-packages/torch/serialization.py", line 409, in _load
result = unpickler.load()
AttributeError: 'module' object has no attribute 'ANN'
I have ann.pt file in the same folder as my script is.
Kindly help me identify fix the error and load the model.
Thanks in advance.
When trying to save both parameters and model, pytorch pickles the parameters but only store path the model Class. For instance, changing tree structure or refactoring can break loading.
Therefore as the documentation points out, it is not recommended, prefer only save/load parameters:
...the serialized data is bound to the specific classes and the exact directory structure used, so it can break in various ways when used in other projects, or after some serious refactors.
For more help, it'll be useful to show your saving code.

Unable to load a pretrained model

After training my model for almost 2 days 3 files were generated:
best_model.ckpt.data-00000-of-00001
best_model.ckpt.index
best_model.ckpt.meta
where best_model is my model name.
When I try to import my model using the following command
with tf.Session() as sess:
saver = tf.train.import_meta_graph('best_model.ckpt.meta')
saver.restore(sess, "best_model.ckpt")
I get the following error
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/shreyash/.local/lib/python2.7/site-
packages/tensorflow/python/training/saver.py", line 1577, in
import_meta_graph
**kwargs)
File "/home/shreyash/.local/lib/python2.7/site-
packages/tensorflow/python/framework/meta_graph.py", line 498, in import_scoped_meta_graph
producer_op_list=producer_op_list)
File "/home/shreyash/.local/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 259, in import_graph_def
raise ValueError('No op named %s in defined operations.' % node.op)
ValueError: No op named attn_add_fun_f32f32f32 in defined operations.
How to fix this?
I have referred this post: TensorFlow, why there are 3 files after saving the model?
Tensorflow version 1.0.0 installed using pip
Linux version 16.04
python 2.7
The importer can't find a very specific function in your graph, namely attn_add_fun_f32f32f32, which is likely to be one of attention functions.
Probably you've stepped into this issue. However, they say it's bundled in tensorflow 1.0. Double check that installed tensorflow version contains attention_decoder_fn.py (or, if you are using another library, check that it's there).
If it's there, here are your options:
Rename this operation, if possible. You might want to read this discussion for workarounds.
Duplicate your graph definition, so that you won't have to call import_meta_graph, but restore the model into the current graph.

Code for gensim Word2vec as an HTTP service 'KeyedVectors' Attribute error

I am using the w2v_server_googlenews code from the word2vec HTTP server running at https://rare-technologies.com/word2vec-tutorial/#bonus_app. I changed the loaded file to a file of vectors trained with the original C version of word2vec. I load the file with
gensim.models.KeyedVectors.load_word2vec_format(fname, binary=True)
and it seems to load without problems. But when I test the HTTP service with, let's say
curl 'http://127.0.0.1/most_similar?positive%5B%5D=woman&positive%5B%5D=king&negative%5B%5D=man'
I got an empty result with only the execution time.
{"taken": 0.0003361701965332031, "similars": [], "success": 1}
I put a traceback.print_exc() on the except part of the related method, which is in this case def most_similar(self, *args, **kwargs): and I got:
Traceback (most recent call last):
File "./w2v_server.py", line 114, in most_similar
topn=5)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 304, in most_similar
self.init_sims()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 817, in init_sims
self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
AttributeError: 'KeyedVectors' object has no attribute 'syn0'
Any idea on why this might happens?
Note: I use python 2.7 and I installed gensim using pip, which gave me gensim 2.1.0.
FYI that demo code was baed on gensim 0.12.3 (from 2015, as listed in its requirements.txt), and would need updating to work with the latest gensim.
It might be sufficient to add a line to w2v_server.py at line 70 (just after the load_word2vec_format()), to force the creation of the needed syn0norm property (which in older gensims was auto-created on load), before deleting the raw syn0 values. Specifically:
self.model.init_sims(replace=True)
(You would leave out the replace=True if you were going to be doing operations other than most_similar(), that might require raw vectors.)
If this works to fix the problem for you, a pull-request to the w2v_server_googlenews repo would be favorably received!

Categories

Resources