Interleaving files with subprocess.Popen in tf.data - python

I have multiple files that I'd like to consume in tiny chunks until EOF with tf.data instead of using tf.read_file once per file (as some files are much bigger than others).
I don't know how to consume piped subprocesses as a TensorFlow op (tf.py_func somehow?), and the dataset element from list_files is only known during graph execution so the following doesn't work:
def stream(path, bytesize=2048):
args = f'my_program {path}'
with subprocess.Popen(args, stdout=subprocess.PIPE) as pipe:
while True:
buffer = pipe.stdout.read(bytesize)
yield np.frombuffer(buffer)
if len(buffer) < bytesize:
break
def map_func(path):
generator = functools.partial(stream, path)
dataset = tf.data.Dataset.from_generator(generator, tf.float32)
return dataset
dataset = (
tf.data.Dataset
.list_files('data/*')
.interleave(map_func, batch_size)
.batch(batch_size)
)
Is there some way of getting a dataset element's value into the iterable expected by tf.data.Dataset.from_generator or am I going about this the wrong way?
Related: Can the map function supplied to `tf.data.Dataset.from_generator(...)` resolve a tensor object?

TensorFlow just got support for parameterised generators in tf.data!
def map_func(path):
dataset = tf.data.Dataset.from_generator(stream, tf.float32, args=(path,))
return dataset
pip install tf-nightly or tf-nightly-gpu to try the above out.

Related

Tensorflow iterator fails to iterate

I am working on a project related to instance segmentation. I am trying to train a SegNet with my own image dataset which comprises a set of images and their corresponding masks, and I have successfully used tf.Dataset to load my data. But every time I use the feedable iterator to feed the dataset to SegNet, my program is always terminated without any error or warning. My code is shown below.
load_satellite_image() is used to read filename for images and dataset() is used to load images with tf.Dataset. It seems that the iterator fails to update the input pipeline.
train_path = "data_example/train.txt"
val_path = "data_example/test.txt"
config_file = 'config.json'
with open(config_file) as f:
config = json.load(f)
train_img, train_mask = load_satellite_image(train_path)
val_img, val_mask = load_satellite_image(val_path)
train_dataset = dataset(train_img, train_mask, config, True, 0, 1)
val_dataset = dataset(val_img, val_mask, config, True, 0, 1)
train_iter = train_dataset.make_initializable_iterator()
validation_iter = val_dataset.make_initializable_iterator()
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle,
train_dataset.output_types,train_dataset.output_shapes)
next_element = iterator.get_next()
with tf.Session() as Sess:
sess.run(train_iter.initializer)
sess.run(validation_iter.initializer)
train_iter_handle = sess.run(train_iter.string_handle())
val_iter_handle = sess.run(validation_iter.string_handle())
for i in range(2):
print("1")
try:
while True:
for i in range(5):
print(sess.run(next_element,feed_dict={handle:train_iter_handle}))
print('----------------------------','\n')
for i in range(2):
print(sess.run(next_element,feed_dict={handle:val_iter_handle}))
except tf.errors.OutOfRangeError:
pass
After running the code above, I got:
In [2]: runfile('D:/python_code/tensorflow_study/SegNet/load_data.py',
wdir='D:/python_code/tensorflow_study/SegNet')
(tf.float32, tf.int32)
(TensorShape([Dimension(360), Dimension(480), Dimension(3)]), TensorShape([Dimension(360),
Dimension(480), Dimension(1)]))
(tf.float32, tf.int32)
(TensorShape([Dimension(360), Dimension(480), Dimension(3)]), TensorShape([Dimension(360),
Dimension(480), Dimension(1)]))
WARNING:tensorflow:From D:\Anaconda\envs\tensorflow-gpu\lib\site-
packages\tensorflow\python\data\ops\dataset_ops.py:1419: colocate_with (from
tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
In [1]:
I am confused that my code is terminated without any reason. As you can see, I can get the shape and datatype of training/ validation images and masks, which means the problem has nothing to do with my dataset. However, the for loop in the tf.Session() is not executed and I cannot get the result of print("1"). The iterator is not executed by sess.run() as well. Anyone have met this problem before?
Thanks!!!
Problem solved. It's a stupid mistake that wastes me a lot of time.
The reason why my program is terminated without error message is that I am using stupid Spyder to write my code, and I don't know why it doesn't show the error message. Actually, there exists an error message produced by TensorFlow. By coincidence, I ran my code via the command window of Anaconda and I got this error message:
2020-04-30 17:31:03.591207: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Invalid argument: NewRandomAccessFile failed to Create/Open: D:\Study\PhD\python_code\tensorflow_study\SegNet\data_example\trainannot\ges_517405_679839_21.jpg
The iterator doesn't work because Tensorflow cannot find mask locations. The image and mask locations are stored in a text file like this:
data_example\train\ges_517404_679750_21.jpg,data_example\trainannot\ges_517404_679750_21.jpg
data_example\train\ges_517411_679762_21.jpg,data_example\trainannot\ges_517411_679762_21.jpg
The left side is the locations of raw images and the right side is the locations of their masks. In the beginning, I used split(",") to get the location of images and masks separately, but it seems that there is something wrong with the locations of masks. So I checked the code that is used to generate the text file:
file.writelines([Train_path[i],',',TrainAnnot_path[i],'\n'])
Each line in the text file ends with \n, and this is why Tensorflow cannot get the location of the masks. So I replaced file.writelines([Train_path[i],',',TrainAnnot_path[i],'\n'])with file.writelines([Train_path[i],' ',TrainAnnot_path[i],'\n']), and used strip().split(" ") rather than split(" "). That solves the problem.

Tensorflow: concatenating multiple tf.Dataset very slow

I am on Tensorflow 1.10
Right now I am not sure if this is a bug.
I have been trying to concatenate about 100 Datasets which I generated from multiple tf.data.Dataset.from_generator.
for i in range(1, 100):
dataset = dataset.concatenate(
tf.data.Dataset.from_generator(gens[i], (tf.int8, tf.int32), output_shapes=(
(256, 256), (1))))
print(i)
print("before iterator")
iterator = dataset.make_one_shot_iterator()
print("after iterator")
running the make_one_shot_iterator() takes really long.
Anyone knows a fix?
EDIT:
It looks like that _make_dataset.add_to_graph(ops.get_default_graph())
seems to get called over and over again resulting in a few million calls of the function.
(https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/data/ops/dataset_ops.py function make_one_shot_iterator line 162)
Running concatenateis actually not the best thing to do for multiple tensors or generators like this.
A better way is to use flat_map https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map . I did updated the Example a while a go to show how you can use this for multiple tensors or files.

Parallel threads with TensorFlow Dataset API and flat_map

I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could specify the num_threads argument to the tf.train.shuffle_batch queue. However, the only way to control the amount of threads in the Dataset API seems to be in the map function using the num_parallel_calls argument. However, I'm using the flat_map function instead, which doesn't have such an argument.
Question: Is there a way to control the number of threads/processes for the flat_map function? Or is there are way to use map in combination with flat_map and still specify the number of parallel calls?
Note that it is of crucial importance to run multiple threads in parallel, as I intend to run heavy pre-processing on the CPU before data enters the queue.
There are two (here and here) related posts on GitHub, but I don't think they answer this question.
Here is a minimal code example of my use-case for illustration:
with tf.Graph().as_default():
data = tf.ones(shape=(10, 512), dtype=tf.float32, name="data")
input_tensors = (data,)
def pre_processing_func(data_):
# normally I would do data-augmentation here
results = (tf.expand_dims(data_, axis=0),)
return tf.data.Dataset.from_tensor_slices(results)
dataset_source = tf.data.Dataset.from_tensor_slices(input_tensors)
dataset = dataset_source.flat_map(pre_processing_func)
# do something with 'dataset'
To the best of my knowledge, at the moment flat_map does not offer parallelism options.
Given that the bulk of the computation is done in pre_processing_func, what you might use as a workaround is a parallel map call followed by some buffering, and then using a flat_map call with an identity lambda function that takes care of flattening the output.
In code:
NUM_THREADS = 5
BUFFER_SIZE = 1000
def pre_processing_func(data_):
# data-augmentation here
# generate new samples starting from the sample `data_`
artificial_samples = generate_from_sample(data_)
return atificial_samples
dataset_source = (tf.data.Dataset.from_tensor_slices(input_tensors).
map(pre_processing_func, num_parallel_calls=NUM_THREADS).
prefetch(BUFFER_SIZE).
flat_map(lambda *x : tf.data.Dataset.from_tensor_slices(x)).
shuffle(BUFFER_SIZE)) # my addition, probably necessary though
Note (to myself and whoever will try to understand the pipeline):
Since pre_processing_func generates an arbitrary number of new samples starting from the initial sample (organised in matrices of shape (?, 512)), the flat_map call is necessary to turn all the generated matrices into Datasets containing single samples (hence the tf.data.Dataset.from_tensor_slices(x) in the lambda) and then flatten all these datasets into one big Dataset containing individual samples.
It's probably a good idea to .shuffle() that dataset, or generated samples will be packed together.

Tensorflow string_input_producer stuck in queue

By following the mnist example, I was able to build a custom network and use the inputs function of the example to load my dataset (previously encoded as a TFRecord). Just to recap it, the inputs function looks like:
def inputs(train_dir, train, batch_size, num_epochs, one_hot_labels=False):
if not num_epochs: num_epochs = None
filename = os.path.join(train_dir,
TRAIN_FILE if train else VALIDATION_FILE)
with tf.name_scope('input'):
filename_queue = tf.train.string_input_producer(
[filename], num_epochs=num_epochs)
# Even when reading in multiple threads, share the filename
# queue.
image, label = read_and_decode(filename_queue)
# Shuffle the examples and collect them into batch_size batches.
# (Internally uses a RandomShuffleQueue.)
# We run this in two threads to avoid being a bottleneck.
images, sparse_labels = tf.train.shuffle_batch(
[image, label], batch_size=batch_size, num_threads=2,
capacity=1000 + 3 * batch_size,
# Ensures a minimum amount of shuffling of examples.
min_after_dequeue=1000)
return images, sparse_labels
Then, during the training I declare the training operator and run everything, and everything goes smoothly.
Now, I am trying to use the very same function to train a different network on the same data, the only (major) difference is that instead of just calling the slim.learning.train function on some train_operator, I do the training manually (by manually evaluating the losses and updating the parameters). The architecture is more complex and I'm forced to do so.
When I try to use the data generated by the inputs function, the program gets stuck, setting a queue timeout indeed shows that it's stuck on the producer's queue.
This leads me to believe that I'm probably missing something about the use of producers in tensorflow, I have read the tutorials but I couldn't figure out the issue. Is there some kind of initialization that calling slim.learning.train does and that I need to replicate by hand if I do my training manually? Why exactly isn't the producer producing?
For example, doing something like:
imgs, labels = inputs(...)
print imgs
prints
<tf.Tensor 'input/shuffle_batch:0' shape=(1, 128, 384, 6) dtype=float32>
which is the correct (symbolic?) tensor but if I then try to get the actual data with a imgs.eval() it's stuck indefinitely.
You need to start the queue runners, or the queues will be empty and reading from them will hang. See the documentation on queue runners.

How to use tf.cond in combination with batching operations / queue runners

Situation
I want to train a specific network architecture (a GAN) that needs inputs from different sources during training.
One input source is examples loaded from disk. The other source is a generator sub-network creating examples.
To choose which kind of input to feed to the network I use tf.cond. There is one caveat though that has already been explained: tf.cond evaluates the inputs to both conditional branches even though only one of those will ultimately be used.
Enough setup, here is a minimal working example:
import numpy as np
import tensorflow as tf
BATCH_SIZE = 32
def load_input_data():
# Normally this data would be read from disk
data = tf.reshape(np.arange(10 * BATCH_SIZE, dtype=np.float32), shape=(10 * BATCH_SIZE, 1))
return tf.train.batch([data], BATCH_SIZE, enqueue_many=True)
def generate_input_data():
# Normally this data would be generated by a much bigger sub-network
return tf.random_uniform(shape=[BATCH_SIZE, 1])
def main():
# A bool to choose between loaded or generated inputs
load_inputs_pred = tf.placeholder(dtype=tf.bool, shape=[])
# Variant 1: Call "load_input_data" inside tf.cond
data_batch = tf.cond(load_inputs_pred, load_input_data, generate_input_data)
# Variant 2: Call "load_input_data" outside tf.cond
#loaded_data = load_input_data()
#data_batch = tf.cond(load_inputs_pred, lambda: loaded_data, generate_input_data)
init_op = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
print(threads)
# Get generated input data
data_batch_values = sess.run(data_batch, feed_dict={load_inputs_pred: False})
print(data_batch_values)
# Get input data loaded from disk
data_batch_values = sess.run(data_batch, feed_dict={load_inputs_pred: True})
print(data_batch_values)
if __name__ == '__main__':
main()
Problem
Variant 1 does not work at all since the queue runner threads don't seem to run. print(threads) outputs something like [<Thread(Thread-1, stopped daemon 140165838264064)>, ...].
Variant 2 does work and print(threads) outputs something like [<Thread(Thread-1, started daemon 140361854863104)>, ...]. But since load_input_data() has been called outside of tf.cond, batches of data will be loaded from disk even when load_inputs_pred is False.
Is it possible to make Variant 1 work, so that input data is only loaded when load_inputs_pred is True and not for every call to session.run()?
If you're using a queue when loading your data and follow it up with a batch input then this shouldn't be a problem as you can specify the max amount to have loaded or stored in the queue.
input = tf.WholeFileReader(somefilelist) # or another way to load data
return tf.train.batch(input,batch_size=10,capacity=100)
See here for more details:
https://www.tensorflow.org/versions/r0.10/api_docs/python/io_ops.html#batch
Also there's an alternative approach that skips the tf.cond completely. Just define two losses one that follows the data through the autoencoder and discrimator and the other that follows the data through just the discriminator.
Then it just becomes a matter of calling
sess.run(auto_loss,feed_dict)
or
sess.run(real_img_loss,feed_dict)
In this way the graph will only run through which ever loss was called upon. Let me know if this needs more explanation.
Lastly I think to make variant one work you need to do something like this if you're using preloaded data.
https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#preloaded-data
Otherwise I'm not sure what the issue is to be honest.

Categories

Resources