Parallel threads with TensorFlow Dataset API and flat_map

Parallel threads with TensorFlow Dataset API and flat_map - python

I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could specify the num_threads argument to the tf.train.shuffle_batch queue. However, the only way to control the amount of threads in the Dataset API seems to be in the map function using the num_parallel_calls argument. However, I'm using the flat_map function instead, which doesn't have such an argument.
Question: Is there a way to control the number of threads/processes for the flat_map function? Or is there are way to use map in combination with flat_map and still specify the number of parallel calls?
Note that it is of crucial importance to run multiple threads in parallel, as I intend to run heavy pre-processing on the CPU before data enters the queue.
There are two (here and here) related posts on GitHub, but I don't think they answer this question.
Here is a minimal code example of my use-case for illustration:
with tf.Graph().as_default():
data = tf.ones(shape=(10, 512), dtype=tf.float32, name="data")
input_tensors = (data,)
def pre_processing_func(data_):
# normally I would do data-augmentation here
results = (tf.expand_dims(data_, axis=0),)
return tf.data.Dataset.from_tensor_slices(results)
dataset_source = tf.data.Dataset.from_tensor_slices(input_tensors)
dataset = dataset_source.flat_map(pre_processing_func)
# do something with 'dataset'

To the best of my knowledge, at the moment flat_map does not offer parallelism options.
Given that the bulk of the computation is done in pre_processing_func, what you might use as a workaround is a parallel map call followed by some buffering, and then using a flat_map call with an identity lambda function that takes care of flattening the output.
In code:
NUM_THREADS = 5
BUFFER_SIZE = 1000
def pre_processing_func(data_):
# data-augmentation here
# generate new samples starting from the sample `data_`
artificial_samples = generate_from_sample(data_)
return atificial_samples
dataset_source = (tf.data.Dataset.from_tensor_slices(input_tensors).
map(pre_processing_func, num_parallel_calls=NUM_THREADS).
prefetch(BUFFER_SIZE).
flat_map(lambda *x : tf.data.Dataset.from_tensor_slices(x)).
shuffle(BUFFER_SIZE)) # my addition, probably necessary though
Note (to myself and whoever will try to understand the pipeline):
Since pre_processing_func generates an arbitrary number of new samples starting from the initial sample (organised in matrices of shape (?, 512)), the flat_map call is necessary to turn all the generated matrices into Datasets containing single samples (hence the tf.data.Dataset.from_tensor_slices(x) in the lambda) and then flatten all these datasets into one big Dataset containing individual samples.
It's probably a good idea to .shuffle() that dataset, or generated samples will be packed together.

Related

forecasting tensorflow confused with usage of repeat

I came across this notebook that covers forecasting. I got it through this article.
I am confused about the 2nd and 4th line from below
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.cache().shuffle(buffer_size).batch(batch_size).repeat()
val_data = tf.data.Dataset.from_tensor_slices((x_vali, y_vali))
val_data = val_data.batch(batch_size).repeat()
I understand that we are trying to shuffle our data as we dont want to feed data to our model in the serial order. On additional reading I realized that it is better to have buffer_size same as the size of the dataset. But I am not sure what repeat is doing in this case. Could someone explain what is being done here and what is the function of repeat?
I also looked at this page and saw below text but still not clear.
The following methods in tf.Dataset :
repeat( count=0 ) The method repeats the dataset count number of times.
shuffle( buffer_size, seed=None, reshuffle_each_iteration=None) The method shuffles the samples in the dataset. The buffer_size is the number of samples which are randomized and returned as tf.Dataset.
batch(batch_size,drop_remainder=False) Creates batches of the dataset with batch size given as batch_size which is also the length of the batches.

The repeat call with nothing passed to the count param makes this dataset repeat infinitely.
In python terms, Datasets are a subclass of python iterables. If you have an object ds of type tf.data.Dataset, then you can execute iter(ds). If the dataset was generated by repeat(), then it will never run out of items, i.e., it will never throw a StopIteration exception.
In the notebook you referenced, the call to tf.keras.Model.fit() is passed an argument of 100 to the param steps_per_epoch. This means that the dataset should be infinitely repeating, and Keras will pause training to run validation every 100 steps.
tldr: leave it in.
https://github.com/tensorflow/tensorflow/blob/3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/python/data/ops/dataset_ops.py#L134-L3445
https://docs.python.org/3/library/exceptions.html

Tensorflow - Training on condition

I am training a neural network with tensorflow (1.12) in a supervised fashion. I'd like to only train on specific examples. The examples are created on the fly by cutting out subsequences, hence I want to do the conditioning within tensorflow.
This is my original part of code:
train_step, gvs = minimize_clipped(optimizer, loss,
clip_value=FLAGS.gradient_clip,
return_gvs=True)
gradients = [g for (g,v) in gvs]
gradient_norm = tf.global_norm(gradients)
tf.summary.scalar('gradients/norm', gradient_norm)
eval_losses = {'loss1': loss1,
'loss2': loss2}
The training step is later executed as:
batch_eval, _ = sess.run([eval_losses, train_step])
I was thinking about inserting something like
train_step_fake = ????
eval_losses_fake = tf.zeros_like(tensor)
train_step_new = tf.cond(my_cond, train_step, train_step_fake)
eval_losses_new = tf.cond(my_cond, eval_losses, eval_losses_fake)
and then doing
batch_eval, _ = sess.run([eval_losses, train_step])
However, I am not sure how to create a fake train_step.
Also, is this a good idea in general or is there a smoother way of doing this? I am using a tfrecords pipeline, but no other high-level modules (like keras, tf.estimator, eager execution etc.).
Any help is obviously greatly appreciated!

Answering the specific question first. It's certainly possible to only perform your training step based on the tf.cond outcome. Note that the 2nd and 3rd params are lambdas though so more something like:
train_step_new = tf.cond(my_cond, lambda: train_step, lambda: train_step_fake)
eval_losses_new = tf.cond(my_cond, lambda: eval_losses, lambda: eval_losses_fake)
Your instinct that this may not be the right thing to do is correct though.
It's much more preferable (both in terms of efficiency and in terms of reading and reasoning about your code) to filter out the data you want to ignore before it gets to your model in the first place.
This is something you could achieve using the Dataset API. which has a really useful filter() method you could use. If you are using the dataset api to read your TFRecords right now then this should be as simple as adding something along the lines of:
dataset = dataset.filter(lambda x: {whatever op you were going to use in tf.cond})
If you are not yet using the dataset API, now is probably the time to have a little read up on it and consider it rather than butchering the model with that tf.cond() to act as a filter.

Tensorflow: concatenating multiple tf.Dataset very slow

I am on Tensorflow 1.10
Right now I am not sure if this is a bug.
I have been trying to concatenate about 100 Datasets which I generated from multiple tf.data.Dataset.from_generator.
for i in range(1, 100):
dataset = dataset.concatenate(
tf.data.Dataset.from_generator(gens[i], (tf.int8, tf.int32), output_shapes=(
(256, 256), (1))))
print(i)
print("before iterator")
iterator = dataset.make_one_shot_iterator()
print("after iterator")
running the make_one_shot_iterator() takes really long.
Anyone knows a fix?
EDIT:
It looks like that _make_dataset.add_to_graph(ops.get_default_graph())
seems to get called over and over again resulting in a few million calls of the function.
(https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/data/ops/dataset_ops.py function make_one_shot_iterator line 162)

Running concatenateis actually not the best thing to do for multiple tensors or generators like this.
A better way is to use flat_map https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map . I did updated the Example a while a go to show how you can use this for multiple tensors or files.

Python Parallelize a Function with Multiple Inputs

New to python. Working with IPython.
I want to do some calculation on a pandas dataframe with a rolling window. The process looks like this:
def calculate_avg_ret_t(return_matrix, rolling_window, t):
ret_t = return_matrix.iloc[ np.arange((t-rolling_window+1),t+1,1), ]
avg_ret_t = ret_t.mean().mean() # much more complicated in reality
return avg_ret_t
return_matrix = pd.DataFrame( np.random.randn(10000, 10000) )
rolling_window = 21
avg_ret_ts = []
for t in np.arange(rolling_window-1,10001,1):
%time avg_ret_t = calculate_avg_ret_t(return_matrix, rolling_window, t)
avg_ret_ts.append(avg_ret_t)
The actual function executed within each for loop is much more complicated and time-consuming, hence the need for parallelization. Can this process be parallized, and if so, what's the most user-friendly module to do that?
I realized the potential problem is that the function has to call the gigantic input return_matrix in each loop. Should I first transform that matrix to a R-list like object, depending on rolling_window?

If the function is only dependent on the data in a given slice, then this would be easily parallelized. I would do the following:
1) Split the data set into N sets where N is the number of processors. The sets should overlap sufficiently.
2) Each processor compute the quantities on its own data subset.
You may want to look at using mpi4py in ipython. See for example https://ipython.org/ipython-doc/3/parallel/parallel_mpi.html. This would allow you to develop and debug parallel code quite easily.

How to use tf.cond in combination with batching operations / queue runners

Situation
I want to train a specific network architecture (a GAN) that needs inputs from different sources during training.
One input source is examples loaded from disk. The other source is a generator sub-network creating examples.
To choose which kind of input to feed to the network I use tf.cond. There is one caveat though that has already been explained: tf.cond evaluates the inputs to both conditional branches even though only one of those will ultimately be used.
Enough setup, here is a minimal working example:
import numpy as np
import tensorflow as tf
BATCH_SIZE = 32
def load_input_data():
# Normally this data would be read from disk
data = tf.reshape(np.arange(10 * BATCH_SIZE, dtype=np.float32), shape=(10 * BATCH_SIZE, 1))
return tf.train.batch([data], BATCH_SIZE, enqueue_many=True)
def generate_input_data():
# Normally this data would be generated by a much bigger sub-network
return tf.random_uniform(shape=[BATCH_SIZE, 1])
def main():
# A bool to choose between loaded or generated inputs
load_inputs_pred = tf.placeholder(dtype=tf.bool, shape=[])
# Variant 1: Call "load_input_data" inside tf.cond
data_batch = tf.cond(load_inputs_pred, load_input_data, generate_input_data)
# Variant 2: Call "load_input_data" outside tf.cond
#loaded_data = load_input_data()
#data_batch = tf.cond(load_inputs_pred, lambda: loaded_data, generate_input_data)
init_op = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
print(threads)
# Get generated input data
data_batch_values = sess.run(data_batch, feed_dict={load_inputs_pred: False})
print(data_batch_values)
# Get input data loaded from disk
data_batch_values = sess.run(data_batch, feed_dict={load_inputs_pred: True})
print(data_batch_values)
if __name__ == '__main__':
main()
Problem
Variant 1 does not work at all since the queue runner threads don't seem to run. print(threads) outputs something like [<Thread(Thread-1, stopped daemon 140165838264064)>, ...].
Variant 2 does work and print(threads) outputs something like [<Thread(Thread-1, started daemon 140361854863104)>, ...]. But since load_input_data() has been called outside of tf.cond, batches of data will be loaded from disk even when load_inputs_pred is False.
Is it possible to make Variant 1 work, so that input data is only loaded when load_inputs_pred is True and not for every call to session.run()?

If you're using a queue when loading your data and follow it up with a batch input then this shouldn't be a problem as you can specify the max amount to have loaded or stored in the queue.
input = tf.WholeFileReader(somefilelist) # or another way to load data
return tf.train.batch(input,batch_size=10,capacity=100)
See here for more details:
https://www.tensorflow.org/versions/r0.10/api_docs/python/io_ops.html#batch
Also there's an alternative approach that skips the tf.cond completely. Just define two losses one that follows the data through the autoencoder and discrimator and the other that follows the data through just the discriminator.
Then it just becomes a matter of calling
sess.run(auto_loss,feed_dict)
or
sess.run(real_img_loss,feed_dict)
In this way the graph will only run through which ever loss was called upon. Let me know if this needs more explanation.
Lastly I think to make variant one work you need to do something like this if you're using preloaded data.
https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#preloaded-data
Otherwise I'm not sure what the issue is to be honest.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parallel threads with TensorFlow Dataset API and flat_map - python

Related

forecasting tensorflow confused with usage of repeat

Tensorflow - Training on condition

Tensorflow: concatenating multiple tf.Dataset very slow

Python Parallelize a Function with Multiple Inputs

How to use tf.cond in combination with batching operations / queue runners

Categories

Resources