Tensorflow concatenate tf.data.Dataset.list_files - python

I'm running my current code on a data set which has the following naming conventions:
Training files: training-??-??, where the ?? are wildcards (placeholders for any range).
The same convention applies to the validation and test files (e.g. validation-??-??).
In my code I create the file pattern like this:
training_file_pattern = os.path.join(config['data_dir'], "training-??-of-??")
Now I want to train my model on the training and validation sets together, but I'm having trouble figuring out how to take both datasets. For training I would do:
tf_data_files = tf.data.Dataset.list_files(training_file_pattern, seed=1234, shuffle=self.shuffle)
I thought to do the same with the validation set and concatenate it like this:
tf_data_files = tf.concat(tf_data_files, tf.data.Dataset.list_files(validation_file_pattern, seed=1234, shuffle=self.shuffle))
But it doesn't work correctly. What would be the correct way to do it?
I also tried to define the file pattern differently so that it covers the validation files as well, but I don't know how to do that without also picking up the test set (they are all in the same folder). So I cannot do this:
training_and_validation_file_pattern = os.path.join(config['data_dir'], "?-??-of-??")
Because this would also match the test set, right?
Any help would be greatly appreciated.

If I got your point, you can simply do
dataset = tf.data.Dataset.list_files([os.path.join('path', f) for f in os.listdir('path')])
dataset = tf.data.TextLineDataset(dataset)
The Dataset API also has a concatenate method:
dataset = dataset_1.concatenate(dataset_2)
but it's not completely clear whether you need it.
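Applied to the question's setup, a minimal sketch (reusing config['data_dir'] and the seed from the question; config is assumed to be defined as there) would be:
import os
import tensorflow as tf

training_file_pattern = os.path.join(config['data_dir'], "training-??-of-??")
validation_file_pattern = os.path.join(config['data_dir'], "validation-??-of-??")

# one filename dataset per split (seed/shuffle as in the question), chained together
train_files = tf.data.Dataset.list_files(training_file_pattern, seed=1234, shuffle=True)
val_files = tf.data.Dataset.list_files(validation_file_pattern, seed=1234, shuffle=True)
tf_data_files = train_files.concatenate(val_files)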
Edit:
list_files will create dataset with filenames
dataset = tf.data.Dataset.list_files(['f1.csv', 'f2.csv'])
for i in dataset:
    print(i)  # outputs e.g. 'f1.csv'
I'm using TF 2.0, just for clarity.
On the other hand, tf.data.TextLineDataset() outputs the actual values from the text file, like
tf.Tensor(b'0.7079635943784122,0.9659163071487907'
So using just list_files will create a dataset of filenames, not their contents, and you will need to apply an additional parsing function to the dataset.
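For example, a minimal sketch of such a parsing step (assuming each line holds two float columns, as in the sample output above):
# turn the filenames into lines, then parse each CSV line into floats
dataset = tf.data.Dataset.list_files(['f1.csv', 'f2.csv'])
dataset = dataset.interleave(tf.data.TextLineDataset, cycle_length=2)

def parse_line(line):
    # record_defaults assumes two float columns per line
    return tf.io.decode_csv(line, record_defaults=[[0.0], [0.0]])

dataset = dataset.map(parse_line)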

Related

How to get the CK+ Dataset into a dataframe where one column is the image and the second column is the emotion (Is that even necessary?)

I want to build a model for emotion classification and tbh I am struggling with the dataset. I am using CK+ since I read it's an industry standard. I don't know how to format it the right way so I can start working.
The dataset is formatted in the following way:
Anger (folder)
    File 1
    File 2
    ...
Contempt (folder)
    File 3
    File 4
    ...
I need the folder names as labels for the files inside each folder, but I don't really know how to get there.
You can load all your data into a tf.data.Dataset using the tf.keras.utils.image_dataset_from_directory function. Assuming that your Anger and Contempt folders are located in a directory named Parent, you can do it like this:
import tensorflow as tf
dataset = tf.keras.utils.image_dataset_from_directory('Parent')
You can then access the images and labels directly from the Dataset, for example like this:
iterator = dataset.as_numpy_iterator()
print(iterator.next())
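Since the labels are inferred from the folder names, you can also inspect the label mapping via the dataset's class_names attribute, for example:
# class_names lists the folder-derived label names in label-index order
print(dataset.class_names)  # e.g. ['Anger', 'Contempt', ...]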

New hdf5 from a group in a bigger hdf5

I have created a huge hdf5 dataset in the following form:
group1/raw
group1/preprocessed
group1/postprocessed
group2/raw
group2/preprocessed
group2/postprocessed
....
group10/raw
group10/preprocessed
group10/postprocessed
However, I realized that for portability I would like to have 10 different hdf5 files, one for each group. Is there a function in Python to achieve this without looping through all the data and scanning the entire original hdf5 tree?
something like:
import h5py

file_path = 'path/to/data.hdf5'
hf = h5py.File(file_path, 'r')
print(hf.keys())
for group in hf.keys():
    # create a new file for the group
    hf_tmp = h5py.File(group + '.h5', 'w')
    # get the data from hf[group] and dump it into the new file
    # something like
    # hf_tmp = hf[group]
    # hf_tmp.dump()
    hf_tmp.close()
hf.close()
You have the right idea. There are several questions and answers on SO that show how to do this.
Start with this one. It shows how to loop over the keys and determine whether each one is a group or a dataset: h5py: how to use keys() loop over HDF5 Groups and Datasets
Then look at these. Each shows a slightly different approach to the problem.
This shows one way. Extracting datasets from 1 HDF5 file to multiple files
Also, here is an earlier post I wrote: How to copy a dataset object to a different hdf5 file using pytables or h5py?
This does the opposite (copies datasets from different files to 1 file). It's useful because it demonstrates how to use the .copy() method: How can I combine multiple .h5 file?
Finally, you should review visititems() method to recursively search all Groups and Datasets. Take a look at this answer for details: is there a way to get datasets in all groups at once in h5py?
That should answer your questions.
Below is some pseudo-code that pulls all of these ideas together. It works for your schema, where all datasets are in root-level groups. It will not work for the more general case with datasets at multiple group levels; use visititems() for that (see the sketch after the pseudo-code below).
Pseudo-code below:
with h5py.File(file_path, 'r') as hf:
    print(hf.keys())
    # loop on group names at root level
    for group in hf.keys():
        hf_tmp = h5py.File(group + '.h5', 'w')
        # loop on dataset names in the group
        for dset in hf[group].keys():
            # copy dataset to the new group file
            hf.copy(group + '/' + dset, hf_tmp)
        hf_tmp.close()
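For the more general case, a minimal visititems() sketch (adapt as needed) that simply collects every dataset path at any nesting depth looks like this:
import h5py

def collect_datasets(h5obj):
    # gather the full internal path of every dataset, regardless of nesting depth
    paths = []
    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            paths.append(name)
    h5obj.visititems(visitor)
    return paths

with h5py.File(file_path, 'r') as hf:  # file_path as defined in the question
    print(collect_datasets(hf))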

Set up data pipeline for video processing

I am a newbie to Tensorflow so I would appreciate any constructive help.
I am trying to build a feature extraction and data pipeline with Tensorflow for video processing, where multiple folders hold video files from multiple classes (the JHMDB database), but I'm kind of stuck.
I have extracted the features to one folder, currently as separate *.npz compressed arrays, and the class name is stored in each filename.
First Attempt
First I thought I would use this code from the TF tutorials site, which simply reads files from a folder:
jhmdb_path = Path('...')
# Process files in folder
list_ds = tf.data.Dataset.list_files(str(jhmdb_path/'*.npz'))
for f in list_ds.take(5):
    print(f.numpy())

def process_path(file_path):
    labels = tf.strings.split(file_path, '_')[-1]
    features = np.load(file_path)
    return features, labels

labeled_ds = list_ds.map(process_path)
for a, b in labeled_ds.take(5):
    print(a, b)
TypeError: expected str, bytes or os.PathLike object, not Tensor
...but this is not working.
Second Attempt
Then I thought, OK, I will use generators:
# using a generator
jhmdb_path = Path('...')

def generator():
    for item in jhmdb_path.glob("*.npz"):
        features = np.load(item)
        print(features.files)
        print(features['PAFs'].shape)
        features = features['PAFs']
        yield features

dataset = tf.data.Dataset.from_generator(generator, (tf.uint8))
iter(next(dataset))
TypeError: 'FlatMapDataset' object is not an iterator...not working.
In the first case, the path somehow comes through as a bytes/Tensor object, and I could not convert it to str to be able to load it with np.load(). (Strangely, if I point np.load(direct_path) at a file directly, it works.)
In the second case... I am not sure what is wrong.
I have looked for hours for a solution for building an iterable dataset from a large number of relatively big 'npz' or 'npy' files, but this does not seem to be covered anywhere (or maybe it is just too trivial).
Also, as I could not test the model so far, I am not sure if this is the right way to go, i.e. feeding the model with hundreds of files in this way, or just building one huge 3.5 GB npz (that would still fit in memory) and using that instead. Or using TFRecords, but those look more complicated than the usual examples.
What is really annoying here is that the TF tutorials, and tutorials in general, only cover how to load a ready-made dataset directly, or how to load np arrays, images, text, or dataframe objects, but there are no real examples of how to process big collections of data files, e.g. the case of extracting features from audio or video files.
So any suggestions or solutions would be highly appreciated and I would be really, really grateful to have something finally working! :)
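One possible direction (a minimal sketch, assuming the 'PAFs' key and the uint8 dtype used in the attempts above) is to wrap np.load in tf.py_function so that the filename arrives inside a plain Python function as a concrete value:
import numpy as np
import tensorflow as tf
from pathlib import Path

jhmdb_path = Path('...')  # same placeholder path as above

def load_npz(path):
    # runs eagerly inside tf.py_function, so .numpy() yields a concrete filename
    filename = path.numpy().decode()
    features = np.load(filename)['PAFs']   # 'PAFs' key assumed, as in the attempts above
    label = filename.split('_')[-1]
    return features, label

list_ds = tf.data.Dataset.list_files(str(jhmdb_path / '*.npz'))
labeled_ds = list_ds.map(
    lambda p: tf.py_function(load_npz, [p], [tf.uint8, tf.string]))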

TensorFlow - tf.data.Dataset reading large HDF5 files

I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable length, stored as collections of compressed JPG images (to make the size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file using custom Python logic is quite easy. For example:
def read_examples_hdf5(filename, label):
    with h5py.File(filename, 'r') as hf:
        # read frames from HDF5 and decode them from JPG
        return frames, label

filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels = [0]*len(filenames)  # ... can we do this more elegantly?

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        read_examples_hdf5, [filename, label], [tf.uint8, tf.int64]))
)
dataset = dataset.shuffle(1000 + 3 * BATCH_SIZE)
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
This example works, however the problem is that it seems like tf.py_func can only handle one example at a time. As my HDF5 container stores 100 examples, this limitation causes significant overhead, as the files constantly need to be opened, read, closed and reopened. It would be much more efficient to read all 100 video examples into the dataset object and then move on to the next HDF5 file (preferably in multiple threads, each thread dealing with its own collection of HDF5 files).
So, what I would like is a number of threads running in the background, reading video frames from the HDF5 files, decoding them from JPG and then feeding them into the dataset object. Prior to the introduction of the tf.data.Dataset pipeline, this was quite easy using the RandomShuffleQueue and enqueue_many ops, but it seems like there is currently no elegant way of doing this (or the documentation is lacking).
Does anyone know what would be the best way of achieving my goal? I have also looked into (and implemented) a pipeline using tfrecord files, but taking a random sample of video frames stored in a tfrecord file seems quite impossible (see here). Additionally, I have looked at the from_generator() input for tf.data.Dataset, but it seems that is definitely not going to run in multiple threads. Any suggestions are more than welcome.
I stumbled across this question while dealing with a similar issue. I came up with a solution based on using a Python generator, together with the TF dataset construction method from_generator. Because we use a generator, the HDF5 file should be opened for reading only once and kept open as long as there are entries to read. So it will not be opened, read, and then closed for every single call to get the next data element.
Generator definition
To allow the user to pass in the HDF5 filename as an argument, I created a class that has a __call__ method, since from_generator specifies that the generator has to be callable. This is the generator:
import h5py
import tensorflow as tf

class generator:
    def __init__(self, file):
        self.file = file

    def __call__(self):
        with h5py.File(self.file, 'r') as hf:
            for im in hf["train_img"]:
                yield im
By using a generator, the code should pick up from where it left off at each call from the last time it returned a result, instead of running everything from the beginning again. In this case it is on the next iteration of the inner for loop. So this should skip opening the file again for reading, keeping it open as long as there is data to yield. For more on generators, see this excellent Q&A.
Of course, you will have to replace anything inside the with block to match how your dataset is constructed and what outputs you want to obtain.
Usage example
ds = tf.data.Dataset.from_generator(
    generator(hdf5_path),
    tf.uint8,
    tf.TensorShape([427, 561, 3]))

value = ds.make_one_shot_iterator().get_next()

# Example on how to read elements
while True:
    try:
        data = sess.run(value)
        print(data.shape)
    except tf.errors.OutOfRangeError:
        print('done.')
        break
Again, in my case I had stored uint8 images of height 427, width 561, and 3 color channels in my dataset, so you will need to modify these in the above call to match your use case.
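As an aside, on recent TensorFlow 2 versions (not the TF 1.x style used in this answer) a roughly equivalent sketch would pass an output_signature and iterate eagerly instead of using make_one_shot_iterator and sess.run:
# TF 2.x sketch: describe the element with a TensorSpec and iterate eagerly
ds = tf.data.Dataset.from_generator(
    generator(hdf5_path),
    output_signature=tf.TensorSpec(shape=(427, 561, 3), dtype=tf.uint8))

for data in ds:
    print(data.shape)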
Handling multiple files
I have a proposed solution for handling multiple HDF5 files. The basic idea is to construct a Dataset from the filenames as usual, and then use the interleave method to process many input files concurrently, getting samples from each of them to form a batch, for example.
The idea is as follows:
ds = tf.data.Dataset.from_tensor_slices(filenames)
# You might want to shuffle() the filenames here depending on the application
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
        generator(filename),
        tf.uint8,
        tf.TensorShape([427, 561, 3])),
    cycle_length, block_length)
What this does is open cycle_length files concurrently, and produce block_length items from each before moving to the next file - see interleave documentation for details. You can set the values here to match what is appropriate for your application: e.g., do you need to process one file at a time or several concurrently, do you only want to have a single sample at a time from each file, and so on.
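For instance, with illustrative values only (make_hdf5_dataset standing in for the lambda above):
# read from 4 files at a time, drawing 16 consecutive samples from each before cycling on
ds = ds.interleave(make_hdf5_dataset, cycle_length=4, block_length=16)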
Edit: for a parallel version, take a look at tf.contrib.data.parallel_interleave!
Possible caveats
Be aware of the peculiarities of using from_generator if you decide to go with the solution. For Tensorflow 1.6.0, the documentation of from_generator mentions these two notes.
It may be challenging to apply this across different environments or with distributed training:
NOTE: The current implementation of Dataset.from_generator() uses tf.py_func and inherits the same constraints. In particular, it requires the Dataset- and Iterator-related operations to be placed on a device in the same process as the Python program that called Dataset.from_generator(). The body of generator will not be serialized in a GraphDef, and you should not use this method if you need to serialize your model and restore it in a different environment.
Be careful if the generator depends on external state:
NOTE: If generator depends on mutable global variables or other external state, be aware that the runtime may invoke generator multiple times (in order to support repeating the Dataset) and at any time between the call to Dataset.from_generator() and the production of the first element from the generator. Mutating global variables or external state can cause undefined behavior, and we recommend that you explicitly cache any external state in generator before calling Dataset.from_generator().
It took me a while to figure this out, so I thought I should record it here. Based on mikkola's answer, this is how to handle multiple files:
import h5py
import tensorflow as tf

class generator:
    def __call__(self, file):
        with h5py.File(file, 'r') as hf:
            for im in hf["train_img"]:
                yield im

ds = tf.data.Dataset.from_tensor_slices(filenames)
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
        generator(),
        tf.uint8,
        tf.TensorShape([427, 561, 3]),
        args=(filename,)),
    cycle_length, block_length)
The key is that you can't pass filename directly to the generator, since it's a Tensor. You have to pass it through args, which TensorFlow evaluates and converts to a regular Python value.

Google Cloud ML Engine + Tensorflow perform preprocessing/tokenization in input_fn()

I want to perform basic preprocessing and tokenization within my input function. My data is contained in CSVs in a Google Cloud Storage bucket (gs://) that I cannot modify. Further, I want to perform any modifications to the input text within my ml-engine package so that the behavior can be replicated at serving time.
My input function follows the basic structure below:
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults = [[""],[""]])
# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text) # splits based on white space
Are there any options that avoid performing this preprocessing on the entire data set in advance? This post suggests that tf.py_func() can be used to make these transformations, however it also notes that "The drawback is that as it is not saved in the graph, I cannot restore my saved model", so I am not convinced that this will be useful at serving time. If I define my own tf.py_func() to do preprocessing and it is defined in the trainer package that I am uploading to the cloud, will I run into any issues? Are there any alternative options that I am not considering?
Best practice is to write a function that you call from both the training/eval input_fn and from your serving input_fn.
For example:
def add_engineered(features):
    text = features['text']
    features['words'] = tf.string_split(text)
    return features
Then, in your input_fn, wrap the features you return with a call to add_engineered:
def input_fn():
    features = ...
    label = ...
    return add_engineered(features), label
and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:
def serving_input_fn():
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )
Your model would use 'words'. However, your JSON input at prediction time would only need to contain 'text' i.e. the raw values.
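For illustration (the field name 'text' comes from the example above; the structure is the standard instances format for online prediction), a prediction request body would carry only the raw value, something like:
# each instance only needs the raw 'text'; 'words' is derived inside the graph by add_engineered()
request_body = {"instances": [{"text": "some raw input sentence"}]}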
Here's a complete working example:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L107
