Some context: I am trying to convert the Whisper decoder to TensorFlow Lite, and so far everything works, but the result is slow. My decoder looks as follows:
import numpy as np
import tensorflow as tf
import time


class TfliteDecoder:
    def __init__(self, decoder_model_path):
        # load the TFLite model and allocate tensors
        self.interpreter = tf.lite.Interpreter(model_path=decoder_model_path)
        self.interpreter.allocate_tensors()
        # get input and output details
        self.input_tensor_index_1 = self.interpreter.get_input_details()[0]['index']
        self.input_tensor_index_2 = self.interpreter.get_input_details()[1]['index']
        self.output_tensor_index = self.interpreter.get_output_details()[0]['index']

    def decode_tokens(self, encoder_output_data, debug=False):
        # init output tensor
        self.interpreter.allocate_tensors()
        self.interpreter.set_tensor(self.input_tensor_index_1, encoder_output_data)
        # init tokens
        tokens = tf.constant([50258, 50266, 50358, 50363], dtype=tf.int64, shape=(1, 4))  # <|startoftranscript|><|ja|><|translate|><|notimestamps|>
        last_token = 50363
        i = 0
        st = time.time()
        while (last_token != 50257) and (i < 448):
            # adjust size for input -> allocate memory for input
            self.interpreter.resize_tensor_input(self.input_tensor_index_2, tokens.shape)
            self.interpreter.allocate_tensors()
            self.interpreter.set_tensor(self.input_tensor_index_2, tokens)
            # invoke -> get output
            self.interpreter.invoke()
            output_data = self.interpreter.get_tensor(self.output_tensor_index)
            last_token = np.argmax(output_data, axis=-1)[0, -1]
            # update tokens array
            tokens = tf.concat((tokens, np.array([[last_token]])), axis=1)
            i = i + 1
        print("->", round(time.time() - st, 3))
        return tokens
As you can see, each prediction depends on the previous one, hence the while loop, and on every iteration I resize the input tensor, reallocate memory, and invoke. This is obviously resource-consuming and slow, as expected, even though my resulting TensorFlow Lite model is much smaller than the original PyTorch one.
Using a TensorFlow while loop did not speed things up, and fixing the input vector size made the predictor get stuck.
I tried an alternative loop using beam search but that was even slower.
My question is:
Is there a way to make this a bit faster? Any hints are appreciated.
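For reference, a rough per-step breakdown (a small sketch that reuses only the interpreter calls from the class above; the decoder instance and tokens tensor are assumed to come from decode_tokens) can show whether the resize/allocate step or the invoke step dominates each iteration:

import time

def profile_step(decoder, tokens):
    # Assumes the encoder output has already been set on input 1, as in
    # decode_tokens above; this only measures where the per-step time goes.
    t0 = time.time()
    decoder.interpreter.resize_tensor_input(decoder.input_tensor_index_2, tokens.shape)
    decoder.interpreter.allocate_tensors()
    t1 = time.time()
    decoder.interpreter.set_tensor(decoder.input_tensor_index_2, tokens)
    decoder.interpreter.invoke()
    t2 = time.time()
    return t1 - t0, t2 - t1  # (resize + allocate seconds, set + invoke seconds)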
In machine translation, we always need to slice out the first timestep (the SOS token) in the annotation and prediction.
When using batch_first=False, slicing out the first timestep still keeps the tensor contiguous.
import torch
batch_size = 128
seq_len = 12
embedding = 50
# Making a dummy output that is `batch_first=False`
batch_not_first = torch.randn((seq_len, batch_size, embedding))
batch_not_first = batch_not_first[1:].view(-1, embedding) # slicing out the first time step
However, if we use batch_first=True, after slicing, the tensor is no longer contiguous. We need to make it contiguous before we can do different operations such as view.
batch_first = torch.randn((batch_size, seq_len, embedding))
batch_first[:, 1:].view(-1, embedding) # slicing out the first time step
output>>>
"""
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-a9bd590a1679> in <module>
----> 1 batch_first[:,1:].view(-1, embedding) # slicing out the first time step
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
"""
Does that mean batch_first=False is better, at least in the context of machine translation, since it saves us the contiguous() step? Are there any cases where batch_first=True works better?
Performance
There doesn't seem to be a considerable difference between batch_first=True and batch_first=False. Please see the script below:
import time
import torch
def time_measure(batch_first: bool):
    torch.cuda.synchronize()
    layer = torch.nn.RNN(10, 20, batch_first=batch_first).cuda()
    if batch_first:
        inputs = torch.randn(100000, 7, 10).cuda()
    else:
        inputs = torch.randn(7, 100000, 10).cuda()

    start = time.perf_counter()
    for chunk in torch.chunk(inputs, 100000 // 64, dim=0 if batch_first else 1):
        _, last = layer(chunk)
    return time.perf_counter() - start

print(f"Time taken for batch_first=False: {time_measure(False)}")
print(f"Time taken for batch_first=True: {time_measure(True)}")
On my device (GTX 1050 Ti), with PyTorch 1.6.0 and CUDA 11.0, here are the results:
Time taken for batch_first=False: 0.3275816479999776
Time taken for batch_first=True: 0.3159054920001836
(and it varies either way so nothing conclusive).
Code readability
batch_first=True is simpler when you want to use other PyTorch layers which require batch as 0th dimension (which is the case for almost all torch.nn layers like torch.nn.Linear).
In this case you would have to permute the returned tensor anyway if batch_first=False was specified.
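For instance, a minimal sketch of what that permute looks like (shapes are illustrative):

import torch

rnn = torch.nn.RNN(10, 20, batch_first=False)
x = torch.randn(7, 32, 10)               # (seq_len, batch, features)
out, _ = rnn(x)                           # (seq_len, batch, hidden)

# Downstream code that expects the batch in dim 0 needs a permute first:
out_batch_first = out.permute(1, 0, 2)    # (batch, seq_len, hidden)

# With batch_first=True the output already has batch in dim 0:
rnn_bf = torch.nn.RNN(10, 20, batch_first=True)
out_bf, _ = rnn_bf(x.permute(1, 0, 2))    # input (batch, seq_len, features)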
Machine translation
Here batch_first=False should be better, as the tensor stays contiguous the whole time and no copy of the data has to be made. It also looks cleaner to slice using [1:] instead of [:, 1:].
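A quick check (a small sketch mirroring the snippets from the question) confirms the contiguity difference:

import torch

batch_size, seq_len, embedding = 128, 12, 50

seq_first = torch.randn(seq_len, batch_size, embedding)
bat_first = torch.randn(batch_size, seq_len, embedding)

print(seq_first[1:].is_contiguous())     # True  -> .view() works directly
print(bat_first[:, 1:].is_contiguous())  # False -> needs .contiguous() or .reshape()

flat = bat_first[:, 1:].reshape(-1, embedding)  # reshape copies only when it has to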
In TensorFlow 2.0, I want to train a skip-gram model with NCE loss. tf.data.Dataset.from_tensor_slices() is not suitable because the input file is really huge. So I wrote a dataset generator class like this:
import tensorflow as tf
import tqdm


class DataSet:
    """"""
    def __init__(self, args, vocab):
        self.args = args
        self.vocab = vocab

    def generator(self):
        """a generator function, it will return skip-gram sample or cbow sample"""
        with open(self.args.input) as f_input:
            for line in tqdm.tqdm(f_input.readlines()):
                tokens = line.strip().split()
                tokens_indices = self.vocab.indices(tokens)
                for index, target_word in enumerate(tokens_indices):
                    context_words = list()
                    begin = index - self.args.window_size if index - self.args.window_size > 0 else 0
                    end = index + 1 + self.args.window_size if index + self.args.window_size + 1 < len(tokens_indices) else len(tokens_indices)
                    context_words.extend(tokens_indices[begin:index])
                    context_words.extend(tokens_indices[index + 1:end])
                    if self.args.cbow > 0:
                        yield context_words, target_word
                    else:
                        for i in range(len(context_words)):
                            yield target_word, context_words[i]

    def dataset(self):
        """Using tf.data.Dataset.from_generator() to return sample"""
        if self.args.cbow:
            dataset = tf.data.Dataset.from_generator(
                self.generator,
                (tf.int32, tf.int32),
                (tf.TensorShape([None]), tf.TensorShape([]))
            )
        else:
            dataset = tf.data.Dataset.from_generator(
                self.generator,
                (tf.int32, tf.int32),
                (tf.TensorShape([]), tf.TensorShape([]))
            )
        return dataset
Then I test my code as follows:
dataset = DataSet(args, vocab).dataset()
iterator = dataset.make_one_shot_iterator()
for batch, (x, y) in enumerate(dataset.batch(128)):
    pass
print(batch, x.shape, y.shape)
But it costs a lot of time to iterate over all the lines (about 10 minutes for 15,000 lines on a 2012 MacBook Pro). Are there any methods that can speed up the code?
If you are working with large datasets then TFRecord is a suitable option. It uses a binary file format for storing your data and can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to their much lower read/write performance compared with SSDs.
However, pure performance isn't the only advantage of the TFRecord file format. It is optimized for use with TensorFlow in multiple ways. To start with, it makes it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library. Especially for datasets that are too large to be stored fully in memory this is an advantage, as only the data that is required at the time (e.g. a batch) is loaded from disk and then processed. Another major advantage of TFRecords is that it is possible to store sequence data, for instance a time series or word encodings, in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data.
I would recommend going through this official link for a glimpse of TFRecord. You can also go through this link on how to build a TFRecord pipeline.
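Applied to your case, a minimal (untested) sketch could dump the (target, context) index pairs produced by your generator into a TFRecord file once, and then read them back through a TFRecordDataset; DataSet, args and vocab are the names from your question:

import tensorflow as tf

def write_skipgram_tfrecord(sample_generator, path="skipgram.tf_record"):
    # serialize each (target, context) pair as an Example with two int64 features
    with tf.io.TFRecordWriter(path) as writer:
        for target, context in sample_generator:
            example = tf.train.Example(features=tf.train.Features(feature={
                'target': tf.train.Feature(int64_list=tf.train.Int64List(value=[target])),
                'context': tf.train.Feature(int64_list=tf.train.Int64List(value=[context])),
            }))
            writer.write(example.SerializeToString())

def parse_pair(serialized):
    features = tf.io.parse_single_example(serialized, {
        'target': tf.io.FixedLenFeature([], tf.int64),
        'context': tf.io.FixedLenFeature([], tf.int64),
    })
    return features['target'], features['context']

# write once, then read many times:
# write_skipgram_tfrecord(DataSet(args, vocab).generator())
# ds = (tf.data.TFRecordDataset("skipgram.tf_record")
#       .map(parse_pair, num_parallel_calls=tf.data.experimental.AUTOTUNE)
#       .batch(128)
#       .prefetch(tf.data.experimental.AUTOTUNE))

This covers the skip-gram branch; the CBOW branch yields variable-length context lists, which would need an Int64List holding all context indices on the write side and a FixedLenSequenceFeature on the read side.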
Here is a simple example of writing the serialized record using TFRecordWriter and then loading it in a TFRecordDataset:
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

def write_date_tfrecord():
    # writes 10 dummy values to replicate the issue
    Output = [20191221 + x for x in range(0, 10)]
    print("Writing Output - ", Output)
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                'Output': tf.train.Feature(float_list=tf.train.FloatList(value=Output))
            }
        ))
    # use a context manager so the record is flushed and the file is closed
    with tf.io.TFRecordWriter("Output.tf_record") as writer:
        writer.write(example.SerializeToString())

def parse_function(serialized_example):
    features = {
        'Output': tf.io.FixedLenSequenceFeature([], tf.float32, allow_missing=True)
    }
    features = tf.io.parse_single_example(serialized=serialized_example, features=features)
    Output = features['Output']
    return Output

def dataset_generator():
    trRecordDataset = tf.data.TFRecordDataset("Output.tf_record")
    trRecordDataset = trRecordDataset.map(parse_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return trRecordDataset

if __name__ == '__main__':
    write_date_tfrecord()
    generator = dataset_generator()
    for Output in generator:
        print(Output)
Output -
2.2.0
Writing Output - [20191221, 20191222, 20191223, 20191224, 20191225, 20191226, 20191227, 20191228, 20191229, 20191230]
tf.Tensor(
[20191220. 20191222. 20191224. 20191224. 20191224. 20191226. 20191228.
20191228. 20191228. 20191230.], shape=(10,), dtype=float32)
Hope this answers your question. Happy Learning.
I'm trying to create a Dataset object in TensorFlow 1.14 (I have some legacy code that I can't change for this specific project) starting from numpy arrays, but every time I try, everything gets copied into my graph, and for this reason when I create an event log file it is huge (719 MB in this case).
Originally I tried using tf.data.Dataset.from_tensor_slices(), but it didn't work; then I read that this is a common problem and someone suggested I try generators, so I tried the following code, but again I got a huge event file (719 MB again):
def fetch_batch(x, y, batch):
    i = 0
    while i < batch:
        yield (x[i, :, :, :], y[i])
        i += 1

train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255

training_dataset = tf.data.Dataset.from_generator(fetch_batch,
    args=[images, np.int32(labels), batch_size], output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape)))

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
I know in this case I could use the tensorflow_datasets API and it would be easier, but this is a more general question about how to create datasets in general, not only with the MNIST one.
Could you explain to me what I am doing wrong? Thank you.
I guess it's because you are using args in from_generator. This will surely put the provided args in the graph.
What you could do is define a function that will return a generator that will iterate through your set, something like (haven't tested):
def data_generator(images, labels):
    def fetch_examples():
        i = 0
        while True:
            example = (images[i], labels[i])
            i += 1
            i %= len(labels)
            yield example
    return fetch_examples
This would give in your example:
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255
training_dataset = tf.data.Dataset.from_generator(data_generator(images, labels),
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape))).batch(batch_size)
file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
Note that I changed fetch_batch to fetch_examples since you probably want to batch using the dataset utilities (.batch).
I encounter a memory leak and decreasing performance when looping over a Keras model predict function when using a tf.data.Dataset to feed the model, but not when feeding it with a numpy array.
Does anyone understand what is causing this and/or how to resolve the issue?
Minimal reproducible code snippet (copy/paste runnable):
import tensorflow as tf
import numpy as np
import time

SIZE = 5000

inp = tf.keras.layers.Input(shape=(SIZE,), dtype='float32')
x = tf.keras.layers.Dense(units=SIZE)(inp)
model = tf.keras.Model(inputs=inp, outputs=x)

np_data = np.random.rand(1, SIZE)
ds = tf.data.Dataset.from_tensor_slices(np_data).batch(1).repeat()

debug_time = time.time()
while True:
    model.predict(x=ds, steps=1)
    print('Processing {:.2f}'.format(time.time() - debug_time))
    debug_time = time.time()
Result: Predict loop timing starts around 0.04s per iteration, within a minute or two it's up to about 0.5s and process memory continues to increase from a few hundred MB to close to a GB.
Swap out the tf.data.Dataset for an equivalent numpy array and runtime is ~0.01s consistently.
Working case code snippet (copy/paste runnable):
import tensorflow as tf
import numpy as np
import time
SIZE = 5000
inp = tf.keras.layers.Input(shape=(SIZE,), dtype='float32')
x = tf.keras.layers.Dense(units=SIZE)(inp)
model = tf.keras.Model(inputs=inp, outputs=x)
np_data = np.random.rand(1, SIZE)
debug_time = time.time()
while True:
model.predict(x=np_data) # using numpy array directly
print('Processing {:.2f}'.format(time.time() - debug_time))
debug_time = time.time()
Related discussions:
Memory leak tf.data + Keras - Doesn't seem to address the core issue, but the question appears similar.
https://github.com/tensorflow/tensorflow/issues/22098 - Possibly an open issue in Keras/GitHub, but I can't confirm it; changing inter_op_parallelism as suggested in that thread has no impact on the results posted here.
Additional info:
I can reduce the rate of performance degradation by around 10x by passing in an iterator instead of a dataset object. I noticed in training_utils.py:1314 that the Keras code creates an iterator on each call to predict.
TF 1.14.0
The root of the problem appears to be that Keras is creating dataset operations in each predict loop. Notice at training_utils.py:1314 that a dataset iterator is created in each predict loop.
The problem can be reduced in severity by passing in an iterator, and is solved entirely by passing in the iterator's get_next() tensor.
I have posted the issue on the Tensorflow Github page: https://github.com/tensorflow/tensorflow/issues/30448
Here is the solution; this example runs in constant time using the TF dataset, you just can't pass in the dataset object itself:
import tensorflow as tf
import numpy as np
import time

SIZE = 5000

inp = tf.keras.layers.Input(shape=(SIZE,), dtype='float32')
x = tf.keras.layers.Dense(units=SIZE)(inp)
model = tf.keras.Model(inputs=inp, outputs=x)

np_data = np.random.rand(1, SIZE)
ds = tf.data.Dataset.from_tensor_slices(np_data).batch(1).repeat()
it = tf.data.make_one_shot_iterator(ds)
tensor = it.get_next()

debug_time = time.time()
while True:
    model.predict(x=tensor, steps=1)
    print('Processing {:.2f}'.format(time.time() - debug_time))
    debug_time = time.time()
I am new to TensorFlow and deep learning, and I am struggling with the Dataset class. I tried a lot of things and I can't find a good solution.
What I am trying
I have a large number of images (500k+) to train my DNN with. This is a denoising autoencoder, so I have a pair for each image. I am using the Dataset class of TF to manage the data, but I think I use it really badly.
Here is how I load the filenames in a dataset:
class Data:
    def __init__(self, in_path, out_path):
        self.nb_images = 512
        self.test_ratio = 0.2
        self.batch_size = 8

        # load filenames in input and outputs
        inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)
        self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
        self.size_test = int(self.nb_images * self.test_ratio)

        # split arrays in training / validation
        test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
        test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)

        # transform array to tf.data.Dataset
        self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
        self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))
I have a function to call at each epoch that will prepare the dataset. It shuffles the filenames, transforms the filenames into images, and batches the data.
def get_batched_data(self, seed, batch_size):
    nb_batch = int(self.size_training / batch_size)

    def img_to_tensor(path_in, path_out):
        img_string_in = tf.read_file(path_in)
        img_string_out = tf.read_file(path_out)
        im_in = tf.image.decode_jpeg(img_string_in, channels=1)
        im_out = tf.image.decode_jpeg(img_string_out, channels=1)
        return im_in, im_out

    t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
    t_datas = t_datas.map(img_to_tensor)
    t_datas = t_datas.batch(batch_size)
    return t_datas
Now during the training, at each epoch we call the get_batched_data function, make an iterator, and run it for each batch, then feed the array to the optimizer operation.
for epoch in range(nb_epoch):
    sess_iter_in = tf.Session()
    sess_iter_out = tf.Session()

    batched_train = data.get_batched_data(epoch)
    iterator_train = batched_train.make_one_shot_iterator()
    in_data, out_data = iterator_train.get_next()

    total_batch = int(data.size_training / batch_size)
    for batch in range(total_batch):
        print(f"{batch + 1} / {total_batch}")
        in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
        out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
        sess.run(optimizer, feed_dict={inputs: in_images,
                                       outputs: out_images})
What do I need?
I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.
Questions and problems
First question: am I using the Dataset class in a good way? I saw very different things on the internet; for example, in this blog post the dataset is used with a placeholder and fed during learning with the data. It seems strange because the data is all in an array, so already loaded in memory. I don't see the point of using tf.data.Dataset in this case.
I found a solution by using repeat(epoch) on the dataset, like this, but the shuffle will not be different for each epoch in that case.
The second problem with my implementation is that I get an OutOfRangeError in some cases. With a small amount of data (512 like in the example) it works fine, but with a bigger amount of data the error occurs. I thought it was due to a bad calculation of the number of batches because of rounding, or because the last batch has fewer samples, but it happens at batch 32 out of 115... Is there any way to know the number of batches created after a batch(n) call on a dataset?
Sorry for this loooonng question, but I've been struggling with this for a few days.
As far as I know, the official Performance Guide is the best teaching material for building input pipelines.
I want to shuffle the dataset in a different way for each epoch.
Using shuffle() and repeat(), you can get a different shuffle pattern for each epoch. You can confirm it with the following code:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)

iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(x))
You can also use tf.contrib.data.shuffle_and_repeat, as mentioned on the official page above.
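For example, a small sketch of the fused version (assuming a TF 1.x release where tf.contrib.data.shuffle_and_repeat is still available):

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
# buffer_size=4, count=3: one fused transformation instead of shuffle(4).repeat(3)
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=4, count=3))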
There are some problems in your code outside of creating data pipelines. You confuse graph construction with graph execution. You are repeatedly creating the data input pipeline, so there are as many redundant input pipelines as there are epochs. You can observe the redundant pipelines with TensorBoard.
You should place your graph construction code outside of the loop, as in the following code (pseudo code):
batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

for epoch in range(nb_epoch):
    # reset iterator's state
    sess.run(iterator.initializer)
    try:
        while True:
            in_images = sess.run(in_data).reshape((-1, 64, 64, 1))
            out_images = sess.run(out_data).reshape((-1, 64, 64, 1))
            sess.run(optimizer, feed_dict={inputs: in_images,
                                           outputs: out_images})
    except tf.errors.OutOfRangeError:
        pass
Moreover, there is some minor inefficiency in the code. You loaded a list of file paths with from_tensor_slices(), so the list is embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for details.)
You would be better off using prefetch, and reducing the number of sess.run calls by combining your graph, as sketched below.
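A rough sketch (untested) of both ideas: feed the filename lists through placeholders so they are not embedded in the graph, and build the model directly on the dataset tensors so a single sess.run(optimizer) per step replaces the three runs. img_to_tensor, training_data_in/out, nb_epoch, batch_size and optimizer are the names from your question:

in_paths_ph = tf.placeholder(tf.string, shape=[None])
out_paths_ph = tf.placeholder(tf.string, shape=[None])

dataset = (tf.data.Dataset.from_tensor_slices((in_paths_ph, out_paths_ph))
           .shuffle(buffer_size=10000)
           .map(img_to_tensor, num_parallel_calls=4)
           .batch(batch_size)
           .prefetch(1))

iterator = dataset.make_initializable_iterator()
in_data, out_data = iterator.get_next()

# build the model on in_data / out_data here instead of on feed_dict placeholders,
# so one sess.run(optimizer) per step both fetches a batch and applies the update
for epoch in range(nb_epoch):
    sess.run(iterator.initializer,
             feed_dict={in_paths_ph: training_data_in, out_paths_ph: training_data_out})
    try:
        while True:
            sess.run(optimizer)
    except tf.errors.OutOfRangeError:
        pass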