I'm trying to read the CIFAR-10 dataset from 6 .bin files and then create an initializable_iterator. This is the site I downloaded the data from, and it also contains a description of the structure of the binary files. Each file contains 2500 images. The resulting iterator, however, only generates one tensor for each file, a tensor of size (2500, 3073). Here is my code:
import tensorflow as tf
filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
iter_ = image_dataset.make_initializable_iterator()
next_file_data = iter_.get_next()
next_file_data = tf.reshape(next_file_data, [-1,3073])
next_file_img_data, next_file_labels = next_file_data[:,:-1], next_file_data[:,-1]
next_file_img_data = tf.reshape(next_file_img_data, [-1,32,32,3])
init_op = iter_.initializer
with tf.Session() as sess:
    sess.run(init_op)
    print(next_file_img_data.eval().shape)
Output:
>> (2500, 32, 32, 3)
The first two lines are based on this answer. I would like to be able to specify the number of images returned by get_next() using batch(), rather than having it fixed at the number of images in each .bin file, which here is 2500.
There has already been a question about flattening a dataset here, but the answer is not clear to me. In particular, the question seems to contain a code snippet from a class function which is defined elsewhere, and I am not sure how to implement it.
I have also tried creating the dataset with tf.data.Dataset.from_tensor_slices(), replacing the first line above with
import os
filenames = [os.path.join('cifar-10-batches-bin',f) for f in os.listdir("cifar-10-batches-bin") if f.endswith('.bin')]
filename_dataset = tf.data.Dataset.from_tensor_slices(filenames)
but this didn't solve the problem.
Any help would be very much appreciated. Thanks.
I am not sure how your .bin files are structured. I am assuming each image contributes 32*32*3 = 3072 values, so the data in each file is a multiple of 3072. For any other structure the operations would be similar, so this can still serve as a guide.
You could do a series of mapping operations:
import tensorflow as tf
filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
image_dataset = image_dataset.map(lambda x: tf.reshape(x, [-1, 32, 32, 3]))  # Reshape your data to get (2500, 32, 32, 3)
image_dataset = image_dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x)) # This operation would give you tensors of shape 32,32,3 and put them all together.
image_dataset = image_dataset.batch(batch_size) # Now you can define your batchsize
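As a quick check, here is a minimal sketch of how the batched dataset could then be consumed (assuming batch_size = 64 and following the same structural assumption as above, i.e. ignoring any label bytes):
batch_size = 64  # assumed value for illustration; define it before the .batch() call above
iter_ = image_dataset.make_initializable_iterator()
next_batch = iter_.get_next()
with tf.Session() as sess:
    sess.run(iter_.initializer)
    print(sess.run(next_batch).shape)  # expected: (64, 32, 32, 3)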
Related
First of all, I would like to say that this is my first question on Stack Overflow, so I hope that the question as a whole respects the rules. I realize that the question is a bit long, but I would like to provide as much background and detail as possible.
I am currently developing a real-time image binary classification system based on Tensorflow 2.8.0 and I am quite new at it. Here are some of the peculiarities of the data that I have for the mentioned project:
Too big to fit in memory: I have more than 200 GB of data. Keep in mind that I have labeled only a small portion of it, but I want to write code that could manage the whole dataset in the future.
Some files are not directly compatible with TensorFlow: I have .FITS and .FIT files that cannot be opened directly with TensorFlow. Due to this issue, I use a library called Astropy to open these files.
The classes are very unbalanced.
After reading the official documentation and tutorials, I thought that, in order to load, preprocess and feed data to my CNN, the best option was to build an input pipeline using the tf.data.Dataset class due to the ease of opening FITS files. My general procedure follows this idea:
Get a list of file paths and split it into train, val and test partitions if desired.
Create a tf.data.Dataset with the from_tensor_slices() method
Shuffle the data (before the heavier reading and image processing operations)
Read and process every path with map()
Batch and prefetch
Here are some code fragments in case they help to understand my goal:
(...)
import config as cfg # Custom .py file
import tensorflow as tf
# x_train, x_val and x_test are previously split file paths lists
train_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_train])
val_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_val])
test_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_test])
train_ds = configure_tf_ds(train_ds)
val_ds = configure_tf_ds(val_ds)
test_ds = configure_tf_ds(test_ds)
def configure_tf_ds(self, tf_ds, buf_size):
    # reshuffle_each_iteration=True ensures that data is shuffled each time it is iterated
    tf_ds = tf_ds.shuffle(buffer_size=cfg.SHUFFLE_BUF_SIZE, seed=cfg.seed, reshuffle_each_iteration=True)
    tf_ds = tf_ds.map(lambda x: tf.py_function(self.process_path, [x], [self.img_dtype, self.label_dtype]))
    tf_ds = tf_ds.batch(self.batch_size)
    tf_ds = tf_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return tf_ds
def process_path(self, file_path):
    # Labels are extracted from the file path, not relevant for my problem
    label = get_label(file_path)
    path = bytes.decode(file_path.numpy()).lower()
    img = None
    # Open and process images depending on their file paths' extension: FITS, FIT, JPG
    if "fit" in path:
        img = tf.py_function(func=self.decode_fits, inp=[file_path], Tout=self.img_dtype)
    else:
        img = tf.py_function(func=self.decode_img, inp=[file_path], Tout=self.img_dtype)
    return img, label
model.fit(train_ds, epochs=50, validation_data=val_ds)
# Then, I would like to obtain predictions, plot results, and so on but knowing which file paths I am working with
(...)
Following the previous idea, I have successfully created and tested different types of pipelines for different types of partitions of my dataset: unlabeled (remember that only a portion of the data is labeled), labeled and weighted labeled (I wanted to see if my models improve by specifying class weights when training).
However, in order to monitor results and make proper adjustments to my model, I would like to retrieve the usual predictions, true labels and images alongside their file paths, while preserving the ability to shuffle the data after every epoch.
I have managed to achieve this if I do not shuffle the data with .shuffle(reshuffle_each_iteration=True), but according to several sources a model's performance is supposed to improve if the data is shuffled after each epoch.
I have read different posts in stackOverflow related to my question. I will list those posts next to the problems that I have found for my particular use case:
Solution 1: My dataset cannot be fed to the model as X, y because it is a tf.data.Dataset
Solution 2: I want to obtain the image and the label too.
Solution 3: This works, but it would not respect the expected tf.data.Dataset format in the future .fit() call as stated here:
A tf.data dataset. Should return a tuple of either (inputs, targets)
or (inputs, targets, sample_weights)
I have also tried to keep a separate tf.data.Dataset with only the file paths but if I call the shuffle method with the reshuffle_each_iteration=True option in both tf.data.Dataset instances, the order of their elements does not match even if I set the same seed.
In short, is it possible to achieve what I want? If so, how should I proceed?
Thank you very much in advance.
Preprocess your data into three TFRecord files, one each for training, testing, and validation. Then you can shuffle freely and never mix records between the sets. This also speeds up data loading, and it can be done once and reused many times while experimenting with hyperparameters.
Here is an example of how you can preprocess and split your data. Your actual data will have a different structure; this example uses "encdata", a 2048-wide vector of vggface2 face encoding data. It assumes you have a single directory of data, with subdirectories named for each class and containing all the files for that class.
import tensorflow as tf
import numpy as np
import pickle
import sys
import os
# 80% to training, 10% to testing, 10% to validation
validation_portion = 1
testing_portion = 1
training_portion = 8
file_cycle_total = validation_portion + testing_portion + training_portion
# Where to store the TFRecord files
training_tfrecord_path = '/var/tmp/xtraining_tfrecords.tfr'
testing_tfrecord_path = '/var/tmp/xtesting_tfrecords.tfr'
validation_tfrecord_path = '/var/tmp/xvalidation_tfrecords.tfr'
# Where we keep the encodings
FACELIB_DIR='/aimiassd/Datasets/LabeledAstroFaces'
# Get list of all classes from all facelib dirs
classNames = sorted([x for x in os.listdir(FACELIB_DIR) if os.path.isdir(os.path.join(FACELIB_DIR,x)) and not x.startswith('.')])
classStrToInt = dict([(x,i) for i,x in enumerate(classNames)])
print('Found %d different classNames for labels\n' % len(classNames))
# Create our record writers
train_file_writer = tf.io.TFRecordWriter(training_tfrecord_path)
test_file_writer = tf.io.TFRecordWriter(testing_tfrecord_path)
val_file_writer = tf.io.TFRecordWriter(validation_tfrecord_path)
# Create a dataset of filenames of every enc2048 file in the facelibraries
cnt_records_written = [0,0,0]
for CN in classNames:
    class_int = classStrToInt[CN]
    # Get a list of all the encoding files
    encfiles = sorted(filter((lambda x: x.endswith('.enc2048')), os.listdir(os.path.join(FACELIB_DIR, CN))))
    # For each encoding file, read the encoding data and write it to the various tfrecords
    for i, F in enumerate(encfiles):
        file_path = os.path.join(FACELIB_DIR, CN, F)
        with open(file_path, 'rb') as fin:
            encdata, _ = pickle.loads(fin.read())  # encodings, source_image_name
        # Turn encdata into a tf.train.Example and serialize it for writing
        record_bytes = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(float_list=tf.train.FloatList(value=encdata)),
            "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_int])),
        })).SerializeToString()
        # Write it out with the appropriate record writer
        remainder = i % file_cycle_total
        if remainder < validation_portion:
            val_file_writer.write(record_bytes)
            cnt_records_written[2] += 1
        elif remainder < validation_portion + testing_portion:
            test_file_writer.write(record_bytes)
            cnt_records_written[1] += 1
        else:
            train_file_writer.write(record_bytes)
            cnt_records_written[0] += 1
print('Writing records done.')
print('Wrote %d training, %d testing, %d validation records' %
(cnt_records_written[0], cnt_records_written[1], cnt_records_written[2]) )
train_file_writer.close()
test_file_writer.close()
val_file_writer.close()
print('Reading data back out...')
# Function to turn a serialized TFRecord back into a tf.train.Example
def decode_fn(record_bytes):
    return tf.io.parse_single_example(
        # Data
        record_bytes,
        # Schema
        {"x": tf.io.FixedLenFeature([2048], dtype=tf.float32),
         "y": tf.io.FixedLenFeature([], dtype=tf.int64)}
    )
# Read and deserialize the datasets
train_ds = tf.data.TFRecordDataset([training_tfrecord_path]).map(decode_fn)
test_ds = tf.data.TFRecordDataset([ testing_tfrecord_path]).map(decode_fn)
validation_ds = tf.data.TFRecordDataset([validation_tfrecord_path]).map(decode_fn)
# Use a dataset
count = 0
for batch in tf.data.TFRecordDataset([training_tfrecord_path]).map(decode_fn):
    print(batch)
    count += 1
    if count > 4:
        sys.exit(0)
print('Done.')
Note how, as the data is being processed into TFRecords, it is alternately written into the three datasets. Validation and testing entries are written first, to ensure that classes with very small numbers of samples still get something into the validation and testing datasets. This is controlled by the variables at the top, validation_portion, testing_portion, and training_portion; adjust them to your preferences.
Finally, at the end, the TFRecords are re-read and used to build three new tf.data.Dataset objects, which can be fed to model.fit() and friends. The example code just prints a few records to show the data comes back with the correct, original shape.
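As a rough sketch of that last step (the model, buffer size and batch size are placeholders, not part of the example above), the parsed feature dictionaries can be mapped to (inputs, targets) tuples and then shuffled and batched before fitting:
def to_xy(example):
    # Turn the parsed feature dict into the (inputs, targets) tuple model.fit() expects
    return example["x"], example["y"]

train_ds = (tf.data.TFRecordDataset([training_tfrecord_path])
            .map(decode_fn)
            .map(to_xy)
            .shuffle(10000, reshuffle_each_iteration=True)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

# model.fit(train_ds, epochs=50, validation_data=validation_ds)  # model defined elsewhere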
I am using the VGG19 architecture to extract features from my images. Here is my code to do so:
import glob
import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing import image

model = VGG19(include_top=False)
image_paths = glob.glob('train/*/*')

def extract_features(model, path):
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features

for path in image_paths:
    features = extract_features(model, path)
I want to save each feature in a format where I can later use it for deep learning, in torch or tf. Normally, I would just append each feature to a list and save that list as a csv, but I've run into issues feeding lists into deep learning models. How do I save this data in a proper format?
I have two suggestions:
Save the features per file, e.g. for cat.png save the feature as cat.npy; then, as you go over your list of files (cat.png, dog.png, snake.png), first check whether the feature has already been created and, if so, load the .npy file directly.
The second approach uses a dictionary data structure, where you use the index of the sample as the key and the extracted feature as the value. Example:
index = 123
feature = extract_feature(...)
dictionary[index] = feature
You can save this dictionary with pickle. Then load it the next time and pull the features for an index directly from the dictionary.
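A minimal sketch of the dictionary approach, reusing model, image_paths and extract_features from the question (assuming extract_features returns the feature array) and a hypothetical features.pkl file name; the per-file approach from the first suggestion works analogously with np.save/np.load:
import pickle

features = {}
for index, path in enumerate(image_paths):
    features[index] = extract_features(model, path)

# Save the features once
with open('features.pkl', 'wb') as f:
    pickle.dump(features, f)

# Later: load them back and pull features by index, without re-running the model
with open('features.pkl', 'rb') as f:
    features = pickle.load(f)
feature_123 = features[123]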
I wish to create a pipeline to provide non-standard files to the neural network (for example with extension *.xxx).
Currently I have structured my code as follows:
1) I define a list of paths where to find training files
2) I define an instance of the tf.data.Dataset object containing these paths
3) I map over the Dataset a Python function that takes each path and returns the associated numpy array (loaded from a folder on the pc); this array is a matrix with dimensions [256, 256, 192].
4) I define an initializable iterator and then use it during network training.
My doubt lies in the size of the batches I provide to the network. I would like to have batches of size 64 supplied to the network. How could I do that?
For example, if I use the function train_data.batch(b_size) with b_size = 1, the result is that, when iterated, the iterator gives one element of shape [256, 256, 192]; what if I wanted to feed the neural net with just 64 slices of this array?
This is an extract of my code:
with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)
    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))
    train_data = train_data.shuffle(buffer_size=len(list_of_files_train))
    train_data = train_data.batch(b_size)
    iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)
    input_data = iterator.get_next()
    train_init = iterator.make_initializer(train_data)
[...]
with tf.Session() as sess:
    sess.run(train_init)
    _ = sess.run([self.train_op])
Thanks in advance
----------
I posted a solution to my problem in the comments below. I would still be happy to receive any comment or suggestion on possible improvements. Thank you ;)
It's been a long time, but I'll post a possible solution to batch the dataset with a custom shape in TensorFlow, in case someone may need it.
The tf.data module offers the method unbatch() to unwrap the content of each dataset element. One can first unbatch and then batch the dataset object again in the desired way. Oftentimes it is also a good idea to shuffle the unbatched dataset before batching it again (so that we have random slices from random elements in each batch):
with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)
    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))
    # un-batch first, then batch the data
    train_data = train_data.apply(tf.data.experimental.unbatch())
    train_data = train_data.shuffle(buffer_size=BSIZE)
    train_data = train_data.batch(b_size)
# [...]
If I understand your question correctly, you can try to slice the array into the shape you want inside your self._parse_xxx_data function.
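For example, here is a rough sketch of that idea (load_volume is a hypothetical loader that returns the [256, 256, 192] numpy array; the parse method keeps only the first 64 slices along the last axis):
import numpy as np

def _parse_xxx_data(self, filename):
    # load_volume is a placeholder for whatever reads the *.xxx file into a numpy array
    volume = load_volume(filename.decode())      # shape [256, 256, 192]
    return volume[:, :, :64].astype(np.float32)  # keep only 64 slices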
I need to read in many 'images' from .txt files and want to generate a TensorFlow dataset from them. Currently, I read in every single matrix with numpy.loadtxt and create an array of shape [N_matrices, height, width, N_channels], and a similar array with the label for every matrix.
I create a tensorflow dataset from these two arrays by using
inputs = tf.convert_to_tensor(x_train, dtype=tf.float32)
labels = tf.convert_to_tensor(y_train, dtype=tf.float32)
dataset = tf.data.Dataset.from_tensor_slices( {"image": inputs,"label": labels})
I now want to make use of the following function to create batches from this dataset (as done here):
def load_batch(dataset, batch_size=BATCH_SIZE, height=LENGTH_INPUT, width=LENGTH_INPUT):
    data_provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
    image, label = data_provider.get(['image', 'label'])
    images, labels = tf.train.batch(
        [image, label],
        batch_size=batch_size,
        allow_smaller_final_batch=True)
    return images, labels
However, this gives me the following error:
data_provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
File "/home/.local/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py", line 85, in init
dataset.data_sources,
AttributeError: 'TensorSliceDataset' object has no attribute 'data_sources'
Why am I getting this error, and how can I fix it? I also suppose there are much better ways for handling input from txt files to tensorflow (or tensorflow-slim) but I've found very little information on this. How could I generate my Datasets in a better way?
I'm trying to save a list of tensors of different lengths to a TFRecords file so that they can be easily loaded later on. The tensors in question are 1-dimensional arrays of integers.
The reason for this is that the tensors are the result of processing a large text file. This file is very large and processing it is slow, so I don't want to have to repeat that step every time I want to run my algorithms. I originally thought of loading the text file into regular Python lists or numpy arrays and then pickling those, but the conversion from those lists to tensors itself takes a very long time, so I don't want to have to wait for that every time I run my script, either. It seems that tensors cannot be pickled directly, and even if there is some workaround for this I am under the impression that TFRecords is the "correct" way to save tensor data.
However, I am not sure how to properly save the tensors to a TFRecords file and then load them back in as tensors. I did go through the TensorFlow tutorial in which MNIST data is saved to TFRecords files and then loaded, but there are a few differences between that and my use case.
The following is a block of code intended to replicate the issues I'm having in a simpler case.
import tensorflow as tf
def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
filename = "/Users/me/tensorflow/test.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([2,3])}))
writer.write(example.SerializeToString())
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([8,5,7])}))
writer.write(example.SerializeToString())
writer.close()
The first few lines are standard. I write two 1-D tensors to a TFRecords file, one with length 2 and one with length 3.
def read_my_file(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example, features={'datalist': tf.VarLenFeature(tf.int64), })
    datalist = features['datalist']
    return datalist
This is the helper function that it seems you are supposed to use. I am not 100% sure why it is necessary, but I couldn't get things to work without it, and all of the examples have something like this. In my case, the data is unlabeled, so I don't have a labels variable.
filename_queue = tf.train.string_input_producer([filename], 2)
datalists = read_my_file(filename_queue)
datalists_batch = tf.train.batch([datalists], batch_size=2)
More "boilerplate"-style code from the examples. Batch size is 2 because I only have 2 examples in this code.
datalists_batch will now be a sparse tensor that contains both my vectors, [2, 3] and [8, 5, 7], the first on top of the second. Therefore, I want to split them back into individual tensors. At this point, I was already concerned that the runtime of this might be pretty long too, because in my real code there are over 100,000 individual tensors that will be split.
split_list = tf.sparse_split(0, 2, datalists_batch)
sp0 = split_list[0]
sp1 = split_list[1]
sp0_dense = tf.sparse_tensor_to_dense(sp0)
sp1_dense = tf.sparse_tensor_to_dense(sp1)
sp0_dense = tf.squeeze(sp0_dense)
sp1_dense = tf.squeeze(sp1_dense)
split_list is now a list of the individual tensors, still in sparse format (and all having a length equal to the length of the longest tensor, which is in this case 3. They are also 2-dimensional with the other dimension 1, since the datalists_batch tensor was 2D). I must now manipulate the tensors to get them into proper format. In the real code, I would of course use a for-loop, but there are only 2 examples in this case. First, I convert them to dense tensors. However, in the case of sp0 this fills in the last index with a 0, since this tensor has length 3. (This issue is discussed below.) Then, I "squeeze" them so that they are actually considered tensors with length 3 instead of 1x3.
Finally, I need to remove the trailing zero from sp0. This part gave me difficulty. I don't know programmatically how many trailing zeros a particular tensor has. It is equal to the length of the longest tensor minus the length of this tensor, but I don't know the "real" lengths of the tensors without looking at the sparse indices, and I cannot access those without evaluating them (since the indices are themselves tensors).
indices_0 = sp0.indices
indices_1 = sp1.indices
indices0_size = tf.shape(indices_0)
indices1_size = tf.shape(indices_1)
These are necessary for the aforementioned slicing operations.
sess = tf.Session()
init_op = tf.initialize_all_variables()
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
Initializations.
sp0_fixed = tf.slice(sp0_dense, [0], [sess.run(indices0_size[0])])
sp1_fixed = tf.slice(sp1_dense, [0], [sess.run(indices1_size[0])])
sess.run(sp0_fixed)
sess.run(sp1_fixed)
This is how I would do it. The problem is, I get strange errors when running these last three commands. I surmise that the problem is that I am creating new ops after sess.run has already been called (in the sp0_fixed line), so the graph is being run simultaneously. I think I should only run sess.run once. However, this makes it impossible for me to figure out the proper indices at which to slice each tensor (to remove trailing zeros). Thus, I don't know what to do next.
Surprisingly, I have found nothing at all helpful on how to do something like this (save and load variable-length tensors to/from files) on Google, in the TensorFlow documentation, or on StackOverflow. I am quite sure that I'm going about this the wrong way; even if there is a workaround to rewrite the last four lines so that the program behaves correctly, the code overall seems excessively complicated for such basic functionality.
I would greatly appreciate any suggestions and feedback.
I don't have too much experience with TFRecords, but here's one way to store and retrieve variable-length arrays with TFRecords.
Writing a TFRecord file
# creating a default session; we'll use it later
sess = tf.InteractiveSession( )
def get_writable(arr):
    """
    This function returns a serialized string
    for an input array of integers.
    arr : input array
    """
    arr = tf.train.Int64List(value=arr)
    arr = tf.train.Feature(int64_list=arr)
    arr = tf.train.Features(feature={'seqs': arr})
    arr = tf.train.Example(features=arr)
    return arr.SerializeToString()
filename = "./test2.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
#writing 3 different sized arrays
writer.write( get_writable([1,3,5,9]))
writer.write( get_writable([2,7,9]))
writer.write( get_writable([3,4,6,5,9]))
writer.close()
This writes the arrays into 'test2.tfrecords'.
Reading the file(s)
##Reading from the tf_record file
## creating the reader
reader = tf.TFRecordReader()
## creating a filename queue
filename_queue = tf.train.string_input_producer(['test2.tfrecords'])
## reading a serialized example from the queue
_, ser_ex = reader.read(filename_queue)
##features that you want to extract
read_features = {
'seqs' : tf.VarLenFeature(dtype = tf.int64)
}
batchSize = 2
# before parsing examples you must wrap them in tf.train.batch to get the desired batch size
batch = tf.train.batch([ser_ex], batch_size= batchSize , capacity=10)
read_data = tf.parse_example( batch, features= read_features )
tf.train.start_queue_runners(sess)
# starting the queue runners is required before reading the data
Now we're ready to read the contents of the TFRecord file:
batches = 3
for _ in range(batches):
    # get the next sparse tensor of shape (batchSize x elements in the largest array)
    # every time you invoke read_data.values()
    sparse_tensor = (list(read_data.values())[0]).eval()
    # as the batch size is larger than 1,
    # you'd want the separate lists that you fed
    # at the time of writing the tf_record file
    for i in tf.sparse_split(axis=0, num_split=batchSize, sp_input=sparse_tensor):
        i = i.eval()
        # getting the individual shapes of the different sparse tensors
        shape = [1, i.indices.shape[0]]
        # converting them into dense tensors
        tens = tf.sparse_to_dense(sparse_indices=i.indices, sparse_values=i.values, output_shape=shape)
        # evaluating the final dense tensor
        print(tens.eval())
Check out this post; it gives a great explanation to get started with TFRecords.
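As an aside, here is a minimal sketch of reading the same 'test2.tfrecords' file back with the tf.data API instead of queue runners (assuming TF 2.x with eager execution); each array comes back at its original length without padding:
dataset = tf.data.TFRecordDataset(['test2.tfrecords'])
dataset = dataset.map(lambda ex: tf.io.parse_single_example(
    ex, {'seqs': tf.io.VarLenFeature(dtype=tf.int64)}))
for parsed in dataset:
    # each element is a SparseTensor holding one variable-length array
    print(tf.sparse.to_dense(parsed['seqs']).numpy())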