I have a list of sentences for which I'm trying to calculate perplexity with several models, using this code:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
def score(model, tokenizer, sentence):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input.cuda(), labels=labels.cuda()).loss
    return np.exp(loss.item())
print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer))
# 4.541251105675365
Most models work well, but some sentences seem to throw an error:
RuntimeError: CUDA out of memory. Tried to allocate 10.34 GiB (GPU 0; 23.69 GiB total capacity; 10.97 GiB already allocated; 6.94 GiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This makes sense because some sentences are very long. So I wrapped the call in something like try / except RuntimeError: pass.
This seemed to work until around 210 sentences, and then it just outputs the error:
CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I found this, which had a lot of discussion and ideas; some were about potentially faulty GPUs. But I know that my GPU works, as this exact code works for other models. There's also talk about batch size here, which is why I thought it might relate to freeing up memory.
I tried running torch.cuda.empty_cache() to free the memory, like in here, every few iterations, but it didn't work (it threw the same error).
Update:
I filtered out sentences with length over 550, and this seems to remove the "CUDA error: an illegal memory access was encountered" error.
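The out-of-memory part of the problem comes from score building one masked copy of the sentence per token and sending all of them through the model in a single batch, so memory grows quickly with sentence length. A minimal sketch of one way around that, assuming the copies can be processed in chunks: masked_chunks and score_chunked are hypothetical helper names (not from transformers), and chunk_size=32 is an arbitrary starting value. Since every row has exactly one masked token, weighting each chunk's loss by its row count should reproduce the mean loss of the single-batch version.

```python
import math
import torch

def masked_chunks(input_ids, mask_token_id, chunk_size=32):
    """Yield (masked_input, labels) pairs, masking one inner token per row.

    input_ids has shape (1, n); the first and last positions (special tokens)
    are never masked, matching the original score() function.
    """
    n = input_ids.size(-1)
    positions = torch.arange(1, n - 1)
    for start in range(0, len(positions), chunk_size):
        pos = positions[start:start + chunk_size]
        rows = torch.arange(len(pos))
        masked = input_ids.repeat(len(pos), 1)
        labels = torch.full_like(masked, -100)   # -100 is ignored by the loss
        labels[rows, pos] = masked[rows, pos]    # keep only the masked token as target
        masked[rows, pos] = mask_token_id
        yield masked, labels

def score_chunked(model, tokenizer, sentence, chunk_size=32, device="cuda"):
    """Hypothetical chunked variant of score(); not run here, needs a model."""
    input_ids = tokenizer.encode(sentence, return_tensors="pt")
    total_loss, total_rows = 0.0, 0
    with torch.inference_mode():
        for masked, labels in masked_chunks(input_ids, tokenizer.mask_token_id, chunk_size):
            loss = model(masked.to(device), labels=labels.to(device)).loss
            total_loss += loss.item() * masked.size(0)
            total_rows += masked.size(0)
    return math.exp(total_loss / total_rows)

# Small check of the masking logic with made-up token ids (103 stands in for [MASK])
ids = torch.tensor([[101, 5, 6, 7, 102]])
chunks = list(masked_chunks(ids, mask_token_id=103, chunk_size=2))
print(len(chunks), chunks[0][0].tolist())
```

Chunking only addresses memory. It does not help with inputs that are longer than the model supports, which may be what the length filter above is working around.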
You need to call gc.collect() before torch.cuda.empty_cache().
I also move the model to the CPU and then delete the model and its checkpoint. Try what works for you:
import gc
model.cpu()
del model, checkpoint
gc.collect()
torch.cuda.empty_cache()
I don't have an exact answer but I can share some troubleshooting techniques I adopted in similar situations...hope it may be helpful.
First, CUDA errors are unfortunately vague sometimes, so you should consider running your code on the CPU to see whether something else is actually going on (see here).
If the problem is about memory, here are two custom utils I use:
from torch import cuda
def get_less_used_gpu(gpus=None, debug=False):
    """Inspect cached/reserved and allocated memory on specified gpus and return the id of the less used device"""
    warn = ''  # stays empty when every requested gpu is valid
    if gpus is None:
        warn = 'Falling back to default: all gpus'
        gpus = range(cuda.device_count())
    elif isinstance(gpus, str):
        gpus = [int(el) for el in gpus.split(',')]

    # check gpus arg VS available gpus
    sys_gpus = list(range(cuda.device_count()))
    if len(gpus) > len(sys_gpus):
        # build the message before overwriting gpus so it reports the requested count
        warn = f'WARNING: Specified {len(gpus)} gpus, but only {cuda.device_count()} available. Falling back to default: all gpus.\nIDs:\t{sys_gpus}'
        gpus = sys_gpus
    elif set(gpus).difference(sys_gpus):
        # take correctly specified and add as much bad specifications as unused system gpus
        available_gpus = set(gpus).intersection(sys_gpus)
        unavailable_gpus = set(gpus).difference(sys_gpus)
        unused_gpus = set(sys_gpus).difference(gpus)
        gpus = list(available_gpus) + list(unused_gpus)[:len(unavailable_gpus)]
        warn = f'GPU ids {unavailable_gpus} not available. Falling back to {len(gpus)} device(s).\nIDs:\t{list(gpus)}'

    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in gpus:
        cur_allocated_mem[i] = cuda.memory_allocated(i)
        cur_cached_mem[i] = cuda.memory_reserved(i)
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    min_allocated = min(cur_allocated_mem, key=cur_allocated_mem.get)
    if debug:
        if warn:
            print(warn)
        print('Current allocated memory:', {f'cuda:{k}': v for k, v in cur_allocated_mem.items()})
        print('Current reserved memory:', {f'cuda:{k}': v for k, v in cur_cached_mem.items()})
        print('Maximum allocated memory:', {f'cuda:{k}': v for k, v in max_allocated_mem.items()})
        print('Maximum reserved memory:', {f'cuda:{k}': v for k, v in max_cached_mem.items()})
        print('Suggested GPU:', min_allocated)
    return min_allocated
def free_memory(to_delete: list, debug=False):
    import gc
    import inspect
    calling_namespace = inspect.currentframe().f_back
    if debug:
        print('Before:')
        get_less_used_gpu(debug=True)

    for _var in to_delete:
        # drop the name from the caller's namespace; this is most reliable when
        # called from module or notebook top level rather than inside a function
        calling_namespace.f_locals.pop(_var, None)
    gc.collect()
    cuda.empty_cache()
    if debug:
        print('After:')
        get_less_used_gpu(debug=True)
2.1 free_memory combines gc.collect and cuda.empty_cache so that you can delete chosen objects from the namespace and free their memory (pass a list of variable names as the to_delete argument). This is useful since you may have unused objects occupying memory. For example, if you loop through 3 models, the first one may still hold some GPU memory when you get to the second iteration (I don't know why, but I've experienced this in notebooks, and the only solutions I could find were to restart the notebook or to free memory explicitly). However, this is not always practical, as you need to know which variables are holding GPU memory, and that's not always the case, especially when a lot of gradients are internally associated with the model (see here for more info). One thing you could also try is to use with torch.no_grad(): instead of with torch.inference_mode():; both disable gradient tracking, so the results should be the same, but it may be worth a try.
2.2 In a multi-GPU environment, you could consider switching to the least used GPU with the other util, get_less_used_gpu.
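On the no_grad versus inference_mode suggestion in 2.1: the two are close but not identical. As a small sketch of the difference (using only standard torch calls), tensors created under inference_mode are flagged as inference tensors, while tensors created under no_grad are ordinary tensors that simply do not require gradients:

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2   # ordinary tensor, detached from the graph

with torch.inference_mode():
    z = x * 2   # inference tensor, cannot be used in autograd later

print(y.requires_grad, y.is_inference())
print(z.requires_grad, z.is_inference())
```

For a pure scoring loop like the one in the question, either context manager should give the same loss values.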
Also, you can try to track GPU usage to see when the error happens and debug from there. The simplest way I can suggest is nvtop, if you are on a Linux platform.
Hope something turns out to be useful :)
Related
I am facing a memory leak when iteratively updating tensors in PyTorch on my Mac M1 GPU using the PyTorch mps interface. The following is a minimal reproducible example that replicates the behavior:
import torch
def leak_example(p1, device):
    t1 = torch.rand_like(p1, device=device)  # torch.cat((torch.diff(ubar.detach(), dim=0).detach().clone(), torch.zeros_like(ubar.detach()[:1,:,:,:], dtype = torch.float32)), dim = 0)
    u1 = p1.detach() + 2 * (t1.detach())
    B = torch.rand_like(u1, device=device)
    mask = u1 < B
    a1 = u1.detach().clone()
    a1[~mask] = torch.rand_like(a1)[~mask]
    return a1

if torch.cuda.is_available():  # cuda gpus
    device = torch.device("cuda")
elif torch.backends.mps.is_available():  # mac gpus
    device = torch.device("mps")
else:  # fallback so that device is always defined
    device = torch.device("cpu")

torch.set_grad_enabled(False)
p1 = torch.rand(5, 5, 224, 224, device=device)
for i in range(10000):
    p1 = leak_example(p1, device)
My Mac's GPU memory steadily grows when I execute this loop. I have tried running it on a CUDA GPU in Google Colab and it seems to be behaving similarly, with the GPU's Active memory, Non-releasable memory, and Allocated memory increasing as the loop progresses.
I have tried detaching and cloning the tensors and using weakrefs, to no avail. Interestingly, if I don't reassign the output of leak_example to p1, the behavior disappears, so it really seems related to the recursive assignment. Does anyone have any idea how I could resolve this?
I think I found the cause of the leak: it was the masked assignment. Replacing it with an equivalent torch.where() statement makes the leak disappear. I imagine this is related to masked_scatter not being implemented for MPS in PyTorch (yet)?
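For reference, the replacement can be sketched as follows. The function names are made up for the comparison, and fresh stands in for the torch.rand_like(a1) tensor from the question so that both versions can be checked against the same values:

```python
import torch

def update_with_indexing(u1, B, fresh):
    # Original pattern: clone, then overwrite the entries where the mask is False
    mask = u1 < B
    a1 = u1.clone()
    a1[~mask] = fresh[~mask]
    return a1

def update_with_where(u1, B, fresh):
    # Same result without boolean-mask assignment:
    # keep u1 where u1 < B, otherwise take the value from fresh
    return torch.where(u1 < B, u1, fresh)

torch.manual_seed(0)
u1 = torch.rand(4, 4) * 3
B = torch.rand(4, 4)
fresh = torch.rand(4, 4)

same = torch.equal(update_with_indexing(u1, B, fresh), update_with_where(u1, B, fresh))
print(same)
```

In leak_example this would mean returning torch.where(mask, u1, torch.rand_like(u1)) instead of building a1 and assigning into it.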
I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)

while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (VIRT column), and its memory use seems wildly excessive (details on the commented lines above). From previous experience in my original program if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the increase in expected memory from maxlen=1500 to maxlen=3000 to be:
4 * 84 * 84 * 15000 / (1024**2) == 403MB.
But we see an increase of 6GB.
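The estimate can be reproduced with a few lines of plain Python. This is only a worked version of the arithmetic above, with the array shapes taken from the example: note that the 403MB figure corresponds to 15000 extra frames, while the 1500 extra frames between maxlen=1500 and maxlen=3000 come to roughly 40 MiB. Either way, the expected growth is far below the observed 6GB.

```python
import math

def mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / 1024**2

# One deque entry: uint8 array of shape (4, 84, 84), 1 byte per element
frame_bytes = math.prod((4, 84, 84)) * 1

# Temporary float32 batch of shape (32, 4, 84, 84), 4 bytes per element
batch_bytes = math.prod((32, 4, 84, 84)) * 4

print(frame_bytes)                           # bytes per stored frame
print(round(mib(1500 * frame_bytes), 1))     # growth for 1500 extra frames
print(round(mib(15000 * frame_bytes), 1))    # growth for 15000 extra frames
print(round(mib(batch_bytes), 2))            # size of one temporary batch
```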
There seems to be some sort of interaction between the deque and the tensor allocation, as commenting out either one brings memory use down to what I would expect; e.g. commenting out the tensor line leads to a total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), so the graph just keeps getting bigger and bigger.
Can you try to free the memory with del tens (deleting the reference to the variable after use) and let me know how it works? (Found in the PyTorch FAQ here: https://pytorch.org/docs/stable/notes/faq.html)
Firstly, I would like to give the memory information before the processes.
pmem(rss=288796672, vms=4105973760, shared=107503616, text=2039808, lib=0, data=771235840, dirty=0)
I built a model using Keras and assign it to the model variable. Then, I sent the model object to a class constructor for cloning:
from tensorflow.python.keras.models import clone_model
from tensorflow.python.keras import backend as K
class Source:
    def __init__(self, model):
        config = Config()
        self.model = clone_model(model)
        # breakpoint to read memory
        self.model.compile(optimizer=config.optimizer, loss=config.loss, metrics=config.metrics)
The memory information after the cloning process is shown below:
pmem(rss=289615872, vms=4333002752, shared=107843584, text=2039808, lib=0, data=797331456, dirty=0)
So far, so good. When I try to delete the self.model object by del self.model, memory is not decreased.
pmem(rss=289615872, vms=4333002752, shared=107843584, text=2039808, lib=0, data=797331456, dirty=0)
Then I tried to execute garbage collector by gc.collect(), but the result was the same, nothing has changed.
pmem(rss=289615872, vms=4333002752, shared=107843584, text=2039808, lib=0, data=797331456, dirty=0)
Lastly, I tried to clear the session by using K.clear_session(). Nothing has changed again.
pmem(rss=289615872, vms=4333002752, shared=107843584, text=2039808, lib=0, data=797331456, dirty=0)
Keras version: 2.1.6 (downgraded from the latest version to try to solve this problem, but it did not work)
Tensorflow version: 2.0.0-alpha0
TF never releases memory that it grabbed before. That's probably a rather good thing because it helps avoid mem fragmentation.
Note that this is independent of which part of this allocated mem is available for use - that TF controls it does not mean the mem is actively used. clear_session should make most/all of the controlled mem available again (but again: only to TF, not other processes).
You'll need to find out what part of the controlled mem is actually available, used, or needed.
I am using the python API of TensorFlow to train a variant of an LSTM.
For that purpose I use the tf.while_loop function to iterate over the time steps.
When running my script on the cpu, it does not produce any error messages, but on the gpu python crashes due to:
...tensorflow/tensorflow/core/framework/tensor.cc:885] Check failed: nullptr != b.buf_ (nullptr vs. 00...)
The part of my code, that causes this failure (when commenting it out, it works) is in the body of the while loop:
...
h_gathered = h_ta.gather(tf.range(time))
h_gathered = tf.transpose(h_gathered, [1, 0, 2])
syn_t = self.syntactic_weights_ta.read(time)[:, :time]
syn_t = tf.expand_dims(syn_t, 1)
syn_state_t = tf.squeeze(tf.tanh(tf.matmul(syn_t, h_gathered)), 1)
...
where time is zero based and incremented after each step, h_ta is a TensorArray
h_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    clear_after_read=False,
    element_shape=[batch_size, num_hidden],
    tensor_array_name="fw_output")
and self.syntactic_weights_ta is also a TensorArray
self.syntactic_weights_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    tensor_array_name="fw_syntactic_weights")
self.syntactic_weights_ta = self.syntactic_weights_ta.unstack(syntactic_weights)
What I am trying to achieve in the code snippet is basically a weighted sum over the past outputs, stored in h_ta.
In the end I train the network with tf.train.AdamOptimizer.
I have tested the script again, but this time with swap_memory parameter in the while loop set to False and it works on GPU as well, though I'd really like to know why it does not work with swap_memory=True.
This looks like a bug in the way that TensorArray's tensor storage mechanisms interact with the allocation magic that is performed by while_loop when swap_memory=True.
Can you open an issue on TF's github? Please also include:
A full stack trace (TF built with -c dbg preferable)
A minimal code example to reproduce
Describe whether the issue requires you to be calling backprop.
Whether this is reproducible in TF 1.2 / nightlies / master branch.
And respond here with the link to the github issue?
I have a Python algorithm that takes two strings as input and does various tests on each's characters to return a score.
This often involves hundreds of pairs of strings, and since it doesn't involve writing to shared memory, concurrency problems shouldn't be an issue.
The thing is, from my (little) GPU programming experience, I recall that when coding for the GPU (OpenGL shaders) you have to write simple loops and give every array a fixed length, which is annoying because strings are effectively arrays of variable length.
I could turn the Python strings into C-like char arrays, but that seems like a tedious solution, and it doesn't solve the problem of making the loops simple.
My question is this: is there any way to achieve large performance gains by parallelizing Python code like this on a GPU? Is it even possible?
def evaluator(baseStr, listOfStr):
    scoreList = []
    for word in listOfStr:  # PARALLELIZE THIS
        scoreList.append(evaluateTwoWords(baseStr, word))
    return scoreList

def evaluateTwoWords(baseStr, otherStr):
    # SOME WORD-WISE COMPARISON
    i = 0
    j = 0
    while i < len(baseStr) and j < len(otherStr):
        ...
    return someScore
For the code provided above, yes, you could achieve a significant speedup on a GPU if every thread/worker on the GPU is assigned one string comparison.
But there are a few constraints with a GPU.
1) If the string list to be loaded into device memory is huge, then a lot of system bandwidth is used to copy it from host to device memory. This transfer overhead is one of the biggest drawbacks of using a GPU.
2) A GPU is most effective on algorithms with strong SIMD (Single Instruction, Multiple Data) characteristics. See https://en.wikipedia.org/wiki/SIMD for more info. The more you deviate from SIMD, the greater the penalty on speedup.
Below is a sample PyCUDA version of your program.
I've used PyCUDA, but there are OpenCL Python bindings that do the job as well. I haven't tested the GPU code below due to hardware constraints, but I've coded it primarily from these examples: http://wiki.tiker.net/PyCuda/Examples.
This is what the code does.
1) Copy the string list to GPU device memory
2) Copy the base string to device memory
3) Call the kernel function to compute a result per string
4) Finally, reduce the returned values using summation or another reduce function of your choice
The code below is an ideal example of SIMD, where the result of one thread is independent of the result of any other thread. But that's just the ideal case; you will have to decide whether a given algorithm is a good candidate for a GPU or not.
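import numpy
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

string_list = ['Apple', 'Microsoft', 'Google', 'Facebook', 'Twitter']
baseStr = "Seagate"

# Kernels need a fixed layout, so pad every string with NUL bytes to the same width
width = max(len(s) for s in string_list + [baseStr]) + 1
string_list_lines = numpy.array([s.encode('ascii') for s in string_list], dtype=f'S{width}')
base_line = numpy.array([baseStr.encode('ascii')], dtype=f'S{width}')
scores = numpy.zeros(len(string_list), dtype=numpy.int32)

# Allocate memory for the list of strings on the GPU device, then copy it over
string_list_linesGPU = cuda.mem_alloc(string_list_lines.nbytes)
cuda.memcpy_htod(string_list_linesGPU, string_list_lines)
# Same process for the base string, plus an output buffer for the scores
baseStrGPU = cuda.mem_alloc(base_line.nbytes)
cuda.memcpy_htod(baseStrGPU, base_line)
scoresGPU = cuda.mem_alloc(scores.nbytes)

# Num of blocks
blocks = len(string_list)
# Threads per block
threadsPerBlock = 1

# Write the actual kernel function. A kernel must return void, so each thread
# writes its score into an output array instead of returning it.
mod = SourceModule("""
__global__ void evaluateTwoWords(const char *string1, const char *string2,
                                 int width, int n, int *scores)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    const char *word = string2 + idx * width;
    int score = 0;
    // Placeholder comparison (counts matching positions);
    // you could follow up with your own algorithm here
    for (int i = 0; i < width && string1[i] != 0 && word[i] != 0; ++i)
        if (string1[i] == word[i]) ++score;
    scores[idx] = score;
}
""")

# Run the kernel
gpusin = mod.get_function("evaluateTwoWords")
gpusin(baseStrGPU, string_list_linesGPU, numpy.int32(width), numpy.int32(len(string_list)),
       scoresGPU, grid=(blocks, 1), block=(threadsPerBlock, 1, 1))

# Copy the per-string scores back and reduce them on the host
cuda.memcpy_dtoh(scores, scoresGPU)
result = int(scores.sum())
print(result)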
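The fixed-length constraint mentioned in the question can also be tried out on the CPU before involving a GPU at all. The following is a rough NumPy sketch, not part of the PyCUDA program: pack_fixed_width and matching_positions are hypothetical helper names, and counting matching positions is only a stand-in score, since the actual word-wise comparison is not specified. The point is that once the strings are padded into one rectangular array, the comparison becomes a single data-parallel operation with no per-string loop.

```python
import numpy as np

def pack_fixed_width(strings, width):
    """Encode ASCII strings as an (n, width) uint8 matrix, zero-padded on the right."""
    packed = np.zeros((len(strings), width), dtype=np.uint8)
    for row, s in enumerate(strings):
        data = s.encode("ascii")[:width]
        packed[row, :len(data)] = list(data)
    return packed

def matching_positions(base, words):
    """Stand-in score: number of positions where each word agrees with base."""
    width = max(len(s) for s in words + [base])
    base_row = pack_fixed_width([base], width)    # shape (1, width)
    word_rows = pack_fixed_width(words, width)    # shape (n, width)
    # Broadcasting compares every word against the base string at once;
    # padding bytes (0) never count as a match.
    same = (word_rows == base_row) & (word_rows != 0)
    return same.sum(axis=1)

scores = matching_positions("abc", ["abd", "xbc", "a"])
print(scores.tolist())
```

The same (n, width) layout is what a GPU kernel would index into, with one row per thread.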
Hope this helps !