In machine translation, we always need to slice out the first timestep (the SOS token) in the annotation and prediction.
When using batch_first=False, slicing out the first timestep still keeps the tensor contiguous.
import torch
batch_size = 128
seq_len = 12
embedding = 50
# Making a dummy output that is `batch_first=False`
batch_not_first = torch.randn((seq_len,batch_size,embedding))
batch_not_first = batch_not_first[1:].view(-1, embedding) # slicing out the first timestep
However, if we use batch_first=True, the tensor is no longer contiguous after slicing. We need to call .contiguous() on it before we can apply operations such as .view().
batch_first = torch.randn((batch_size,seq_len,embedding))
batch_first[:,1:].view(-1, embedding) # slicing out the first time step
output>>>
"""
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-a9bd590a1679> in <module>
----> 1 batch_first[:,1:].view(-1, embedding) # slicing out the first time step
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
"""
Does that mean batch_first=False is better, at least in the context of machine translation, since it saves us the contiguous() step? Are there any cases where batch_first=True works better?
Performance
There doesn't seem to be a considerable difference between batch_first=True and batch_first=False. Please see the script below:
import time
import torch
def time_measure(batch_first: bool):
    torch.cuda.synchronize()
    layer = torch.nn.RNN(10, 20, batch_first=batch_first).cuda()
    if batch_first:
        inputs = torch.randn(100000, 7, 10).cuda()
    else:
        inputs = torch.randn(7, 100000, 10).cuda()
    start = time.perf_counter()
    for chunk in torch.chunk(inputs, 100000 // 64, dim=0 if batch_first else 1):
        _, last = layer(chunk)
    return time.perf_counter() - start
print(f"Time taken for batch_first=False: {time_measure(False)}")
print(f"Time taken for batch_first=True: {time_measure(True)}")
On my device (GTX 1050 Ti), with PyTorch 1.6.0 and CUDA 11.0, here are the results:
Time taken for batch_first=False: 0.3275816479999776
Time taken for batch_first=True: 0.3159054920001836
(and it varies either way so nothing conclusive).
Code readability
batch_first=True is simpler when you want to use other PyTorch layers that require the batch to be the 0th dimension (which is the case for almost all torch.nn layers, such as torch.nn.Linear).
In that case you would have to permute the returned tensor anyway if batch_first=False were specified.
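For a concrete sketch (shapes made up for illustration): with batch_first=False you typically end up permuting the RNN output before handing it to a batch-first consumer, whereas with batch_first=True it already has the expected layout.
import torch

rnn = torch.nn.RNN(10, 20, batch_first=False)
out, _ = rnn(torch.randn(7, 3, 10))        # out: (seq_len=7, batch=3, hidden=20)
out = out.permute(1, 0, 2)                 # (batch, seq_len, hidden) for batch-first layers

rnn_bf = torch.nn.RNN(10, 20, batch_first=True)
out_bf, _ = rnn_bf(torch.randn(3, 7, 10))  # already (batch, seq_len, hidden), no permute needed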
Machine translation
Here batch_first=False should be better, as the tensor stays contiguous the whole time and no copy of the data has to be made. It also looks cleaner to slice using [1:] instead of [:, 1:].
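To make that concrete, here is a small sketch (same dummy shapes as in the question) showing the contiguity check and the extra step that batch_first=True forces on you:
import torch

batch_size, seq_len, embedding = 128, 12, 50

seq_major = torch.randn(seq_len, batch_size, embedding)     # batch_first=False layout
print(seq_major[1:].is_contiguous())                        # True -> .view() works directly

batch_major = torch.randn(batch_size, seq_len, embedding)   # batch_first=True layout
print(batch_major[:, 1:].is_contiguous())                   # False
flat = batch_major[:, 1:].contiguous().view(-1, embedding)  # or simply .reshape(-1, embedding)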
Related
Some context: I am trying to convert the Whisper decoder to TensorFlow Lite, and so far everything works, but the result is slow. My decoder looks as follows:
import numpy as np
import tensorflow as tf
class TfliteDecoder:
    def __init__(self, decoder_model_path):
        # load the TFLite model and allocate tensors
        self.interpreter = tf.lite.Interpreter(model_path=decoder_model_path)
        self.interpreter.allocate_tensors()
        # get input and output details
        self.input_tensor_index_1 = self.interpreter.get_input_details()[0]['index']
        self.input_tensor_index_2 = self.interpreter.get_input_details()[1]['index']
        self.output_tensor_index = self.interpreter.get_output_details()[0]['index']

    def decode_tokens(self, encoder_output_data, debug=False):
        # init output tensor
        self.interpreter.allocate_tensors()
        self.interpreter.set_tensor(self.input_tensor_index_1, encoder_output_data)
        # init tokens
        tokens = tf.constant([50258, 50266, 50358, 50363], dtype=tf.int64, shape=(1, 4))  # <|startoftranscript|><|ja|><|translate|><|notimestamps|>
        last_token = 50363
        i = 0
        import time
        st = time.time()
        while (last_token != 50257) and (i < 448):
            # adjust size for input -> allocate memory for input
            self.interpreter.resize_tensor_input(self.input_tensor_index_2, tokens.shape)
            self.interpreter.allocate_tensors()
            self.interpreter.set_tensor(self.input_tensor_index_2, tokens)
            # invoke -> get output
            self.interpreter.invoke()
            output_data = self.interpreter.get_tensor(self.output_tensor_index)
            last_token = np.argmax(output_data, axis=-1)[0, -1]
            # update tokens array
            tokens = tf.concat((tokens, np.array([[last_token]])), axis=1)
            i = i + 1
        print("->", round(time.time() - st, 3))
        return tokens
As you can see, my next prediction always depends on the previous prediction, hence the while loop: on every step I resize the input tensor, allocate memory, and invoke the interpreter. This is obviously resource-consuming and slow, as expected, even though the resulting TensorFlow Lite model is much smaller than the original PyTorch one.
Using a TensorFlow while loop did not speed things up, and fixing the input vector size made the predictor get stuck.
I tried an alternative loop using beam search but that was even slower.
My question is: is there a way to make this a bit faster? Any hints are appreciated.
This question already has answers here:
Pytorch - RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed
Upon running the code snippet (PyTorch 1.7.1; Python 3.8),
import numpy as np
import torch
def batch_matrix(vector_pairs, factor=2):
    baselen = len(vector_pairs[0]) // factor
    split_batch = []
    for j in range(factor):
        for i in range(factor):
            start_j = j * baselen
            end_j = (j+1) * baselen if j != factor - 1 else None
            start_i = i * baselen
            end_i = (i+1) * baselen if i != factor - 1 else None
            mini_pairs = vector_pairs[start_j:end_j, start_i:end_i, :]
            split_batch.append(mini_pairs)
    return split_batch

def concat_matrix(vectors_):
    vectors = vectors_.clone()
    seq_len, dim_vec = vectors.shape
    project_x = vectors.repeat((1, 1, seq_len)).reshape(seq_len, seq_len, dim_vec)
    project_y = project_x.permute(1, 0, 2)
    matrix = torch.cat((project_x, project_y), dim=-1)
    matrix_ = matrix.clone()
    return matrix_

if __name__ == "__main__":
    vector_list = []
    for i in range(10):
        vector_list.append(torch.randn((5,), requires_grad=True))
    vectors = torch.stack(vector_list, dim=0)
    pmatrix = concat_matrix(vectors)
    factor = np.ceil(vectors.shape[0]/6).astype(int)
    batched_feats = batch_matrix(pmatrix, factor=factor)
    for i in batched_feats:
        i = i + 5
        print(i.shape)
        summed = torch.sum(i)
        summed.backward()
I get the output and error as below:
torch.Size([5, 5, 10])
torch.Size([5, 5, 10])
Traceback (most recent call last):
File "/home/user/PycharmProjects/project/run.py", line 43, in <module>
summed.backward()
File "/home/user/anaconda3/envs/diff/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/user/anaconda3/envs/diff/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
I have read all the existing posts on the issue and could not resolve it myself. Passing retain_graph=True to backward() fixes the issue in the provided snippet; however, the snippet is only an oversimplified version of a larger network, where retain_graph=True changes the error to the following:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3000, 512]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
I tried setting torch.autograd.set_detect_anomaly(True) and determining the point of failure, but all that I tried failed and the error persisted.
I suspect that if I can understand the cause of error in the current situation then it will help me resolve this error in actual codebase.
Therefore, I want to understand why backward() works fine for the first two tensors in batched_feats but fails for the third one. I would really appreciate it if someone could help me see where an intermediate result that has been freed is being reused.
Thanks a lot!
After backpropagation, the leaf nodes' gradients are stored in their Tensor.grad attributes. The gradients of non-leaf nodes (i.e. the intermediate results to which the error refers) are freed by default, as PyTorch assumes you won't need them. In your example, your leaf nodes are those in vector_list created from torch.randn().
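A tiny sketch of that distinction (my own toy example, not your code):
import torch

x = torch.randn(3, requires_grad=True)  # leaf node
y = x * 2                               # non-leaf (intermediate) result
y.sum().backward()
print(x.is_leaf, x.grad)                # True, tensor([2., 2., 2.])
print(y.is_leaf, y.grad)                # False, None (non-leaf grads are not kept by default)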
Calling backward() multiple times consecutively accumulates gradients via summation by default (this is useful for recurrent neural networks). This becomes a problem when a later call to backward() involves some of the same leaf nodes and intermediate results as a previous call, because those shared intermediate results were freed by the earlier pass while the leaf nodes' gradients were kept. That is what you are facing here: your batches are slices of the same underlying tensor, so their backward passes run through the same portion of the graph, and while you are not zeroing the leaf gradients between calls, the shared intermediate results are implicitly freed after the first pass.
If you wish to accumulate gradients in the leaf nodes via summation, simply call backward like so: summed.backward(retain_graph = True).
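For example (a minimal sketch, not your network):
import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()
loss.backward(retain_graph=True)        # keep the intermediate results alive
loss.backward()                         # second pass accumulates into x.grad
print(torch.allclose(x.grad, 4 * x))    # True: 2x from each backward pass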
However, if you wish to compute gradients with respect to your batches independently (rather than w.r.t. the leaf nodes in vector_list), then you can just detach your batches at the beginning of each iteration. This will prevent gradients from propagating through them all the way to their common leaf nodes in vector_list (i.e. they become leaf nodes themselves in their own graphs). Detaching a tensor disables gradients for it, so you'll have to re-enable them manually:
for i in batched_feats:
    i = i.detach()
    i.requires_grad = True
    j = i + 5
    print(j.shape)
    summed = torch.sum(j)
    summed.backward()
    print(i.grad)  # Prints the gradients stored in i
This is how some data loaders work; they load the data from disk, convert them to tensors, perform augmentation / other preprocessing, and then detach them so that they can serve as leaf nodes in a fresh computational graph. If the application developer wants to compute gradients w.r.t. the data tensors, they do not have to save intermediate results since the data tensors have been detached and thus serve as leaf nodes.
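A rough sketch of that pattern (the "preprocessing" here is made up purely for illustration):
import torch

raw = torch.randn(4, 5)           # pretend this was just loaded from disk
batch = (raw * 2.0).detach()      # preprocess, then detach -> batch is a fresh leaf
batch.requires_grad = True        # re-enable gradients on the new leaf

loss = (batch ** 2).sum()
loss.backward()
print(batch.grad.shape)           # gradients w.r.t. the batch itself: torch.Size([4, 5])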
I am working with the REINFORCE algorithm in PyTorch. I noticed that the batch inference/predictions of my simple network with Softmax don't sum to 1 (not even close to 1). I am attaching a minimal working example so that you can reproduce it. What am I missing here?
import numpy as np
import torch
obs_size = 9
HIDDEN_SIZE = 9
n_actions = 2
np.random.seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(obs_size, HIDDEN_SIZE),
    torch.nn.ReLU(),
    torch.nn.Linear(HIDDEN_SIZE, n_actions),
    torch.nn.Softmax(dim=0)
)
state_transitions = np.random.rand(3, obs_size)
state_batch = torch.Tensor(state_transitions)
pred_batch = model(state_batch) # WRONG PREDICTIONS!
print('wrong predictions:\n', *pred_batch.detach().numpy())
# [0.34072137 0.34721774] [0.30972624 0.30191955] [0.3495524 0.3508627]
# DOES NOT SUM TO 1 !!!
pred_batch = [model(s).detach().numpy() for s in state_batch] # CORRECT PREDICTIONS
print('correct predictions:\n', *pred_batch)
# [0.5955179 0.40448207] [0.6574412 0.34255883] [0.624833 0.37516695]
# DOES SUM TO 1 AS EXPECTED
Although PyTorch lets us get away with it, we don’t actually provide an input with the right dimensionality. We have a model that takes one input and produces one output, but PyTorch nn.Module and its subclasses are designed to do so on multiple samples at the same time. To accommodate multiple samples, modules expect the zeroth dimension of the input to be the number of samples in the batch.
Deep Learning with PyTorch
That your model works on each individual sample is an implementation nicety. You have incorrectly specified the dimension for the softmax (across batches instead of across the variables), and hence when given a batch dimension it is computing the softmax across samples instead of within samples:
nn.Softmax requires us to specify the dimension along which the softmax function is applied:
softmax = nn.Softmax(dim=1)
In this case, we have two input vectors in two rows (just like when we work with batches), so we initialize nn.Softmax to operate along dimension 1.
Change torch.nn.Softmax(dim=0) to torch.nn.Softmax(dim=1) to get appropriate results.
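A quick way to verify the fix, reusing the toy setup from the question with only the softmax dimension changed:
import numpy as np
import torch

obs_size, HIDDEN_SIZE, n_actions = 9, 9, 2
np.random.seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(obs_size, HIDDEN_SIZE),
    torch.nn.ReLU(),
    torch.nn.Linear(HIDDEN_SIZE, n_actions),
    torch.nn.Softmax(dim=1)        # softmax across actions, within each sample
)

state_batch = torch.Tensor(np.random.rand(3, obs_size))
pred_batch = model(state_batch)
print(pred_batch.sum(dim=1))       # each row now sums to 1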
I have looked at the cifar10 multi-GPU implementation to draw inspiration for parallelizing my own GPU-trained model.
My model consumes data from TFRecords, which are iterated through the tf.data.Iterator class. So given 2 GPUs, what I am trying to do is call iterator.get_next() on the CPU once per GPU (twice in this example), do some preprocessing, embedding lookup, and other CPU-related work, and then feed the two batches to the GPUs.
Pseudo code:
with tf.device('/cpu:0'):
    batches = []
    for gpu in multiple_gpus:
        single_gpu_batch = cpu_function(iterator.get_next())
        batches.append(single_gpu_batch)
....................
for gpu, batch in zip(multiple_gpus, batches):
    with tf.device('/device:GPU:{}'.format(gpu.id)):
        single_gpu_loss = inference_and_loss(batch)
        tower_losses.append(single_gpu_loss)
...........
...........
total_loss = average_loss(tower_losses)
The problem is that if there is only one example (or none) left to be drawn from the data and I call iterator.get_next() twice, a tf.errors.OutOfRangeError will be raised, and the data from the first iterator.get_next() call (which actually didn't fail, only the second one did) will never be passed through the GPU.
I thought about drawing the data in one iterator.get_next() call and splitting it later, but tf.split fails if the batch size is not divisible by the number of GPUs.
What is the right way to implement consuming from iterator in a multi-GPU setup?
I think the second suggestion is the easiest way to go. To avoid the splitting problem on the last batch, you can use the drop_remainder option of dataset.batch. Or, if you need to see all of the data, one possible solution is to set the split sizes explicitly based on the size of the drawn batch, so that the splitting operation never fails:
dataset = dataset.batch(batch_size * multiple_gpus)
iterator = dataset.make_one_shot_iterator()
batches = iterator.get_next()
split_dims = [0] * multiple_gpus
drawn_batch_size = tf.shape(batches)[0]
Either in a greedy manner, i.e., fit batch_size samples on each device until the data runs out:
#### Solution 1 [Greedy]:
for i in range(multiple_gpus):
    split_dims[i] = tf.maximum(0, tf.minimum(batch_size, drawn_batch_size))
    drawn_batch_size -= batch_size
or in a more spread manner, to ensure that each device gets at least one sample (assuming multiple_gpus < drawn_batch_size):
### Solution 2 [Spread]
drawn_batch_size -= multiple_gpus  # reserve one sample per device
for i in range(multiple_gpus):
    split_dims[i] = tf.maximum(0, tf.minimum(batch_size - 1, drawn_batch_size)) + 1
    drawn_batch_size -= batch_size - 1  # account for the reserved sample added above
## Split batches
batches = tf.split(batches, split_dims)
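For completeness, a sketch of the simpler drop_remainder route mentioned above, continuing the same snippet; it assumes a TF version whose dataset.batch accepts drop_remainder and that discarding the final partial batch is acceptable:
dataset = dataset.batch(batch_size * multiple_gpus, drop_remainder=True)
iterator = dataset.make_one_shot_iterator()
batches = iterator.get_next()
# the leading dimension is now always batch_size * multiple_gpus, so the split never fails
per_gpu_batches = tf.split(batches, multiple_gpus)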
I'm trying to save a list of tensors of different lengths to a TFRecords file so that they can be easily loaded later on. The tensors in question are 1-dimensional arrays of integers.
The reason for this is that the tensors are the result of processing a large text file. This file is very large and processing it is slow, so I don't want to have to repeat that step every time I want to run my algorithms. I originally thought of loading the text file into regular Python lists or numpy arrays and then pickling those, but the conversion from those lists to tensors itself takes a very long time, so I don't want to have to wait for that every time I run my script, either. It seems that tensors cannot be pickled directly, and even if there is some workaround for this I am under the impression that TFRecords is the "correct" way to save tensor data.
However, I am not sure how to properly save the tensors to a TFRecords file and then load them back in as tensors. I did go through the TensorFlow tutorial in which MNIST data is saved to TFRecords files and then loaded, but there are a few differences between that and my use case.
The following is a block of code intended to replicate the issues I'm having in a simpler case.
import tensorflow as tf
def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
filename = "/Users/me/tensorflow/test.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([2,3])}))
writer.write(example.SerializeToString())
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([8,5,7])}))
writer.write(example.SerializeToString())
writer.close()
The first few lines are standard. I write two 1-D tensors to a TFRecords file, one with length 2 and one with length 3.
def read_my_file(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={'datalist': tf.VarLenFeature(tf.int64)})
    datalist = features['datalist']
    return datalist
The helper function that it seems you are supposed to use. I am not 100% sure why this is necessary, but I couldn't get it to work without writing this, and all of the examples have something like this. In my case, the data is unlabeled so I don't have a labels variable.
filename_queue = tf.train.string_input_producer([filename], 2)
datalists = read_my_file(filename_queue)
datalists_batch = tf.train.batch([datalists], batch_size=2)
More "boilerplate"-style code from the examples. Batch size is 2 because I only have 2 examples in this code.
datalists_batch will now be a sparse tensor that contains both my vectors, [2, 3] and [8, 5, 7], the first on top of the second. Therefore, I want to split them back into individual tensors. At this point, I was already concerned that the runtime of this might be pretty long too, because in my real code there are over 100,000 individual tensors that will be split.
split_list = tf.sparse_split(0, 2, datalists_batch)
sp0 = split_list[0]
sp1 = split_list[1]
sp0_dense = tf.sparse_tensor_to_dense(sp0)
sp1_dense = tf.sparse_tensor_to_dense(sp1)
sp0_dense = tf.squeeze(sp0_dense)
sp1_dense = tf.squeeze(sp1_dense)
split_list is now a list of the individual tensors, still in sparse format (and all having a length equal to the length of the longest tensor, which in this case is 3; they are also 2-dimensional with the other dimension 1, since the datalists_batch tensor was 2D). I must now manipulate the tensors to get them into the proper format. In the real code, I would of course use a for-loop, but there are only 2 examples in this case. First, I convert them to dense tensors. However, in the case of sp0 this fills in the last position with a 0, since the dense shape has length 3 while sp0 only holds 2 real values. (This issue is discussed below.) Then, I "squeeze" them so that they are actually considered tensors with length 3 instead of 1x3.
Finally, I need to remove the trailing zero from sp0. This part gave me difficulty. I don't know programmatically how many trailing zeros a particular tensor has. It is equal to the length of the longest tensor minus the length of this tensor, but I don't know the "real" lengths of the tensors without looking at the sparse indices, and I cannot access those without evaluating the tensors (since the indices are themselves tensors).
indices_0 = sp0.indices
indices_1 = sp1.indices
indices0_size = tf.shape(indices_0)
indices1_size = tf.shape(indices_1)
These are necessary for the aforementioned slicing operations.
sess = tf.Session()
init_op = tf.initialize_all_variables()
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
Initializations.
sp0_fixed = tf.slice(sp0_dense, [0], [sess.run(indices0_size[0])])
sp1_fixed = tf.slice(sp1_dense, [0], [sess.run(indices1_size[0])])
sess.run(sp0_fixed)
sess.run(sp1_fixed)
This is how I would do it. The problem is that I get strange errors when running these last three commands. I surmise that the problem is that I am creating new ops after sess.run has already been called (in the sp0_fixed line), so the graph is being run simultaneously. I think I should only run sess.run once. However, this makes it impossible for me to figure out the proper indices at which to slice each tensor (to remove the trailing zeros). Thus, I don't know what to do next.
Surprisingly, I have found nothing at all helpful on how to do something like this (save and load variable-length tensors to/from files) on Google, in the TensorFlow documentation, or on StackOverflow. I am quite sure that I'm going about this the wrong way; even if there is a workaround to rewrite the last four lines so that the program behaves correctly, the code overall seems excessively complicated to perform a very basic piece of functionality.
I would greatly appreciate any suggestions and feedback.
I don't have too much experience with TFRecords, but here's one way to store and retrieve variable-length arrays with TFRecords.
Writing a TFRecord
# creating a default session; we'll use it later
sess = tf.InteractiveSession()

def get_writable(arr):
    """
    This function returns a serialized string
    for an input array of integers.
    arr : input array
    """
    arr = tf.train.Int64List(value=arr)
    arr = tf.train.Feature(int64_list=arr)
    arr = tf.train.Features(feature={'seqs': arr})
    arr = tf.train.Example(features=arr)
    return arr.SerializeToString()
filename = "./test2.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
#writing 3 different sized arrays
writer.write( get_writable([1,3,5,9]))
writer.write( get_writable([2,7,9]))
writer.write( get_writable([3,4,6,5,9]))
writer.close()
We have written the arrays into 'test2.tfrecords'.
Reading the file(s)
## Reading from the tf_record file
## creating the reader and a filename queue
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['test2.tfrecords'])
## getting the serialized example from the reader
_, ser_ex = reader.read(filename_queue)
## features that you want to extract
read_features = {
    'seqs': tf.VarLenFeature(dtype=tf.int64)
}
batchSize = 2
# before parsing examples you must wrap them in tf.train.batch to get the desired batch size
batch = tf.train.batch([ser_ex], batch_size=batchSize, capacity=10)
read_data = tf.parse_example(batch, features=read_features)
tf.train.start_queue_runners(sess)
# starting the queue runners is required before reading the data
Now we're ready to read the contents of the TFRecord file:
batches = 3
for _ in range(batches):
    # get the next sparse tensor of shape (batchSize x elements in the largest array)
    # every time you evaluate read_data
    sparse_tensor = (list(read_data.values())[0]).eval()
    # as the batch size is larger than 1, you'd want the separate lists
    # that you fed at the time of writing the tf_record file
    for i in tf.sparse_split(axis=0, num_split=batchSize, sp_input=sparse_tensor):
        i = i.eval()
        # getting individual shapes of the different sparse tensors
        shape = [1, i.indices.shape[0]]
        # converting them into dense tensors
        tens = tf.sparse_to_dense(sparse_indices=i.indices, sparse_values=i.values, output_shape=shape)
        # evaluating the final dense tensor
        print(tens.eval())
Check out this post; it's a great explanation to get started with TFRecords.