Shuffling input files with tensorflow Datasets

Shuffling input files with tensorflow Datasets - python

With the old input-pipeline API I can do:
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
and then pass the filenames to other queue, for example:
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)
How can I achieve similar behaviour with the Dataset -API?
The tf.data.TFRecordDataset() expects tensor of file-names in fixed order.

Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
EDIT:
The input pipeline of this question gave me an idea on how to implement filenames shuffling with the Dataset API:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset
This will put all the data of one file before the one of the next and so on. Files are shuffled, but the data inside them will be produced in the same order.
You can alternatively replace dataset.flat_map with interleave to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave does not actually run in multiple threads, it's a round-robin operation. For true parallel processing see parallel_interleave

The current Tensorflow version (v1.5 in 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple work around using numpy:
import numpy as np
import tensorflow as tf
myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()
dataset = tf.data.TFRecordDataset(myShuffledFileList)

Related

How to load Fashion MNIST dataset in Tensorflow Fedarated?

I am working on a project with Tensorflow federated. I have managed to use the libraries provided by TensorFlow Federated Learning simulations in order to load, train, and test some datasets.
For example, i load the emnist dataset
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()
and it got the data sets returned by load_data() as instances of tff.simulation.ClientData. This is an interface that allows me to iterate over client ids and allow me to select subsets of the data for simulations.
len(emnist_train.client_ids)
3383
emnist_train.element_type_structure
OrderedDict([('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None)), ('label', TensorSpec(shape=(), dtype=tf.int32, name=None))])
example_dataset = emnist_train.create_tf_dataset_for_client(
emnist_train.client_ids[0])
I am trying to load the fashion_mnist dataset with Keras to perform some federated operations:
fashion_train,fashion_test=tf.keras.datasets.fashion_mnist.load_data()
but I get this error
AttributeError: 'tuple' object has no attribute 'element_spec'
because Keras returns a Tuple of Numpy arrays instead of a tff.simulation.ClientData like before:
def tff_model_fn() -> tff.learning.Model:
return tff.learning.from_keras_model(
keras_model=factory.retrieve_model(True),
input_spec=fashion_test.element_spec,
loss=loss_builder(),
metrics=metrics_builder())
iterative_process = tff.learning.build_federated_averaging_process(
tff_model_fn, Parameters.server_adam_optimizer_fn, Parameters.client_adam_optimizer_fn)
server_state = iterative_process.initialize()
To sum up,
Is any way to create tuple elements of tff.simulation.ClientData from Keras Tuple Numpy arrays?
Another solution that comes to my mind is to use the
tff.simulation.HDF5ClientData and load
manually the appropriate files in aHDF5format (train.h5, test.h5) in order to get the tff.simulation.ClientData, but my problem is that i cant find the url for fashion_mnist HDF5 file format i mean something like that for both train and test:
fileprefix = 'fed_emnist_digitsonly'
sha256 = '55333deb8546765427c385710ca5e7301e16f4ed8b60c1dc5ae224b42bd5b14b'
filename = fileprefix + '.tar.bz2'
path = tf.keras.utils.get_file(
filename,
origin='https://storage.googleapis.com/tff-datasets-public/' + filename,
file_hash=sha256,
hash_algorithm='sha256',
extract=True,
archive_format='tar',
cache_dir=cache_dir)
dir_path = os.path.dirname(path)
train_client_data = hdf5_client_data.HDF5ClientData(
os.path.join(dir_path, fileprefix + '_train.h5'))
test_client_data = hdf5_client_data.HDF5ClientData(
os.path.join(dir_path, fileprefix + '_test.h5'))
return train_client_data, test_client_data
My final goal is to make the fashion_mnist dataset work with the TensorFlow federated learning.

You're on the right track. To recap: the datasets returned by tff.simulation.dataset APIs are tff.simulation.ClientData objects. The object returned by tf.keras.datasets.fashion_mnist.load_data is a tuple of numpy arrays.
So what is needed is to implement a tff.simulation.ClientData to wrap the dataset returned by tf.keras.datasets.fashion_mnist.load_data. Some previous questions about implementing ClientData objects:
Federated learning : convert my own image dataset into tff simulation Clientdata
How define tff.simulation.ClientData.from_clients_and_fn Function?
Is there a reasonable way to create tff clients datat sets?
This does require answering an important question: how should the Fashion MNIST data be split into individual users? The dataset doesn't include features that that could be used for partitioning. Researchers have come up with a few ways to synthetically partition the data, e.g. randomly sampling some labels for each participant, but this will have a great effect on model training and is useful to invest some thought here.

Tensorflow Dataset using many compressed numpy files

I have a large dataset that I would like to use for training in Tensorflow.
The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.
Currently I use a Keras Sequence based generator object to train, but I'd like to move entirely to Tensorflow without Keras.
I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.
My first idea was this
import glob
import tensorflow as tf
import numpy as np
def get_data_from_filename(filename):
npdata = np.load(open(filename))
return npdata['features'],npdata['labels']
# get files
filelist = glob.glob('*.npz')
# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_from_filename)
However, this passes a TF Tensor placeholder to a real numpy function and numpy is expecting a standard string. This results in the error:
File "test.py", line 6, in get_data_from_filename
npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found
The other option I'm considering (but seems messy) is to create a Dataset object built on TF placeholders which I then fill during my epoch-batch loop from my numpy files.
Any suggestions?

You can define a wrapper and use pyfunc like this:
def get_data_from_filename(filename):
npdata = np.load(filename)
return npdata['features'], npdata['labels']
def get_data_wrapper(filename):
# Assuming here that both your data and label is float type.
features, labels = tf.py_func(
get_data_from_filename, [filename], (tf.float32, tf.float32))
return tf.data.Dataset.from_tensor_slices((features, labels))
# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_wrapper)
If your dataset is very large and you have memory issues, you can consider using a combination of interleave or parallel_interleave and from_generator methods instead. The from_generator method uses py_func internally so you can directly read your np file and then define your generator in python.

Tensorflow Dataset API shuffle hurts performance by 9x

I'm using the Tensorflow Dataset API to take a bunch of filenames; shuffle the filenames; perform a python function to load the image files, preprocess them, and turn them into tensors; and then cache, repeat, and batch them. So far, so good.
When I add a shuffle() to the tensors, performance degrades by a factor of 9x. Likewise, when I do self.dataset.apply(tf.data.experimental.shuffle_and_repeat(16384)).
Why does shuffle hurt performance so badly, and how can I fix it?
Code:
filenames = tf.data.Dataset.list_files(self.FILE_PATTERN).shuffle(buffer_size=16384)
dataset = filenames.map(lambda filename: self.pp(filename),
num_parallel_calls=self.N_CPUS)
dataset = dataset.cache("./cachefile")
# The line below (shuffle_and_repeat) made performance very bad (1s/step without, 9s/step with)
# dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(16384))
# This too:
# dataset = dataset.repeat().shuffle(16384)
# This works fine, but doesn't shuffle:
dataset = dataset.repeat()
dataset = dataset.batch(self.BATCH_SIZE)
dataset = dataset.prefetch(4)

try changing the prefetch parameter buffer_size=2
dataset = dataset.prefetch(2)
prefetch is a performance flag, read next number of datasets in background for the next iterations. If prefetch's buffer_size is large, then it creates lots of datasets for iterations and may slow down due to low memory.

Tensorflow dataset data preprocessing is done once for the whole dataset or for each call to iterator.next()?

Hi I am studying the dataset API in tensorflow now and I have a question regarding to the dataset.map() function which performs data preprocessing.
file_name = ["image1.jpg", "image2.jpg", ......]
im_dataset = tf.data.Dataset.from_tensor_slices(file_names)
im_dataset = im_dataset.map(lambda image:tuple(tf.py_func(image_parser(), [image], [tf.float32, tf.float32, tf.float32])))
im_dataset = im_dataset.batch(batch_size)
iterator = im_dataset.make_initializable_iterator()
The dataset takes in image names and parse them into 3 tensors (3 infos about the image).
If I have a very larger number of images in my training folder, preprocessing them is gonna take a long time.
My question is that, since Dataset API is said to be designed for efficient input pipeline, the preprocessing is done for the whole dataset before I feed them to my workers (let's say GPUs), or it only preprocess one batch of image each time I call iterator.get_next()?

If your preprocessing pipeline is very long and the output is small, the processed data should fit in memory. If this is the case, you can use tf.data.Dataset.cache to cache the processed data in memory or in a file.
From the official performance guide:
The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. If the user-defined function passed into the map transformation is expensive, apply the cache transformation after the map transformation as long as the resulting dataset can still fit into memory or local storage. If the user-defined function increases the space required to store the dataset beyond the cache capacity, consider pre-processing your data before your training job to reduce resource usage.
Example use of cache in memory
Here is an example where each pre-processing takes a lot of time (0.5s). The second epoch on the dataset will be much faster than the first
def my_fn(x):
time.sleep(0.5)
return x
def parse_fn(x):
return tf.py_func(my_fn, [x], tf.int64)
dataset = tf.data.Dataset.range(5)
dataset = dataset.map(parse_fn)
dataset = dataset.cache() # cache the processed dataset, so every input will be processed once
dataset = dataset.repeat(2) # repeat for multiple epochs
res = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
for i in range(10):
# First 5 iterations will take 0.5s each, last 5 will not
print(sess.run(res))
Caching to a file
If you want to write the cached data to a file, you can provide an argument to cache():
dataset = dataset.cache('/tmp/cache') # will write cached data to a file
This will allow you to only process the dataset once, and run multiple experiments on the data without reprocessing it again.
Warning: You have to be careful when caching to a file. If you change your data, but keep the /tmp/cache.* files, it will still read the old data that was cached. For instance, if we use the data from above and change the range of the data to be in [10, 15], we will still obtain data in [0, 5]:
dataset = tf.data.Dataset.range(10, 15)
dataset = dataset.map(parse_fn)
dataset = dataset.cache('/tmp/cache')
dataset = dataset.repeat(2) # repeat for multiple epochs
res = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
for i in range(10):
print(sess.run(res)) # will still be in [0, 5]...
Always delete the cached files whenever the data that you want to cache changes.
Another issue that may arise is if you interrupt the script before all the data is cached. You will receive an error like this:
AlreadyExistsError (see above for traceback): There appears to be a concurrent caching iterator running - cache lockfile already exists ('/tmp/cache.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator.
Make sure that you let the whole dataset be processed to have an entire cache file.

tensorflow get size of string_input_producer

How do I check the number of filenames string_input_producer has read? Different operations will be performed depending on size in input data so I need to know how many images will be read or have been read.
Code below not telling me how much images I have read or am about to read.
import tensorflow as tf
import matplotlib.pyplot as plt
# Make a queue of file names including all the JPEG images files in the relative image directory.
filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once("./MNIST_data/*.png"))
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
image = tf.image.decode_png(value) # use png or jpg decoder based on your files.
num_preprocess_threads = 1
min_queue_examples = 256
batch_size=2;
images = tf.train.shuffle_batch([image], batch_size, min_queue_examples + 3 * batch_size, num_threads=num_preprocess_threads, min_after_dequeue=min_queue_examples)
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
t_image = image.eval() #here is your image Tensor :)
fig = plt.figure()
plt.imshow(t_image)
plt.show()
coord.request_stop()
coord.join(threads)

Functions like string_input_producer will add a queue to current graph which can only dequeue only one example each time. Usually the output tensors will be feed to functions like tf.train.shuffle_batch which is what you want. The argument batch_size of this function can control how many examples each time dequeued as the input of of your model
UPDATE:
if you want to check whether your input data is correct, you can run it out with sess.run(my_img) which will give you a numpy.array tensor. You can directly look at the element of this tensor or just plot it with matplotlib.
make sure you have already started queue runners before sess.run or your program will hang forever

string_input_producer returns you back a standard FIFOQueue (it returns you an input_producer and it returns you a queue.
A FIFOQueue does not have information about the number of elements it has read, only the number of elements are currently in a queue (q.size()). If you want to know how many element has been read you need to manually add a counter which you will increment each time you read an element.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Shuffling input files with tensorflow Datasets - python

Related

How to load Fashion MNIST dataset in Tensorflow Fedarated?

Tensorflow Dataset using many compressed numpy files

Tensorflow Dataset API shuffle hurts performance by 9x

Tensorflow dataset data preprocessing is done once for the whole dataset or for each call to iterator.next()?

tensorflow get size of string_input_producer

Categories

Resources