Custom data generator - python

I have a standard directory structure of train, validation, and test folders, each containing class subdirectories.
|class A
|class B
I want to use the flow_from_directory API, but all I can find is an ImageDataGenerator, and the files I have are raw numpy arrays (generated with arr.tofile(...)).
Is there an easy way to use ImageDataGenerator with a custom file loader?
I'm aware of flow_from_dataframe, but that doesn't seem to accomplish what I want either; it's for reading images with more custom organization. I want a simple way to load raw binary files instead of having to re-encode 100,000s of files into jpgs with some precision loss along the way (and wasted time, etc.).

TensorFlow is an entire ecosystem with I/O capabilities, and ImageDataGenerator is one of its least flexible approaches. See the TensorFlow tutorial on how to load NumPy data.
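import tensorflow as tf
import numpy as np

# DATA_URL must be set to the URL of the mnist.npz archive before this runs.
path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
    train_examples = data['x_train']
    train_labels = data['y_train']
    test_examples = data['x_test']
    test_labels = data['y_test']

train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))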
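For the raw files in the question, a plain Python generator can stand in for the custom file loader. The sketch below is one possible approach, not a Keras or TensorFlow API: it walks the class subdirectories, reads each file with np.fromfile, and yields (array, label) pairs. The name iter_raw_arrays and the shape and dtype arguments are assumptions; arr.tofile stores neither shape nor dtype, so both have to be known in advance.

```python
import os

import numpy as np


def iter_raw_arrays(root, shape, dtype=np.float32):
    """Yield (array, label) pairs from root/<class name>/<file>.

    Hypothetical helper: `shape` and `dtype` must match what was written,
    because arr.tofile() saves raw bytes with no header.
    """
    # Sorted class folder names give stable integer labels (0, 1, ...).
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            flat = np.fromfile(os.path.join(class_dir, file_name), dtype=dtype)
            yield flat.reshape(shape), label
```

A generator like this can be consumed in a plain training loop, or wrapped in a callable and passed to tf.data.Dataset.from_generator with matching output types.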


TensorFlow, what does the tensorflow_datasets.load() exactly return?

I am following a tutorial of TensorFlow ML and I am new to Python. I come from a background of languages like Java. Here is the link to the tutorial.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras import layers

# Download the Flowers Dataset using TensorFlow Datasets
(training_set, validation_set), dataset_info = tfds.load(
    'tf_flowers',
    split=['train[:70%]', 'train[70%:]'],
    with_info=True,
    as_supervised=True,
)

num_training_examples = 0
for example in training_set:
    num_training_examples += 1

# Reformat Images and Create Batches
# (IMAGE_RES and BATCH_SIZE are constants defined earlier in the tutorial)
def format_image(image, label):
    image = tf.image.resize(image, (IMAGE_RES, IMAGE_RES))/255.0
    return image, label

train_batches = training_set.shuffle(num_training_examples//4).map(format_image).batch(BATCH_SIZE).prefetch(1)
validation_batches = validation_set.map(format_image).batch(BATCH_SIZE).prefetch(1)
I don't understand how this code operates: (training_set, validation_set), dataset_info = tfds.load. The function tfds.load downloads images of flowers. How come that training_set is iterable like some sort of array, when it should be a folder perhaps?
for example in training_set:
    num_training_examples += 1
Also how come each element in it is used in the following line as two arguments to the function format_image(image, label) in this line:
train_batches = training_set.shuffle(num_training_examples//4).map(format_image).batch(BATCH_SIZE).prefetch(1)
What is training_set exactly? Why is it not a folder that contains the following structure:
file1, file2, file3 ... etc
file1, file2, file3 ... etc
file1, file2, file3 ... etc
etc ...
Instead, it's some sort of array with each element containing an image and its label. For a beginner in Python such as me, the documentation does not make clear what is happening.
Like the name suggests, TensorFlow exists to "make the tensors flow". It's an entire ecosystem with data loading, preprocessing, and machine learning capabilities, so it isn't built to feel like an intuitive library that works on NumPy arrays. TensorFlow doesn't keep everything in memory, so what TFDS returns is literally a "TensorFlow Dataset" (a tf.data.Dataset object), not a folder, and you need to manipulate it as such. It is an iterable that produces examples lazily, which also means that basic information such as the number of examples is not always directly available the way it would be for an array. The tutorial therefore counts the examples by iterating over the whole dataset, as in the lines you gave:
for example in training_set:
    num_training_examples += 1
This passes over all the samples and counts them. For this part:
(training_set, validation_set), dataset_info = tfds.load...
It loads the "TensorFlow Dataset" as supervised, meaning that each element is a 2-tuple of (image, label); map unpacks that tuple into the two arguments of format_image. If you remove as_supervised=True, each element is a dictionary instead, and you access the fields with example['image'] and example['label'].
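For readers coming from Java, it may help to think of training_set as an object that implements an iterator interface rather than as a folder. The following is a toy, pure-Python analogy and not how tf.data is implemented: ToyDataset and its methods are hypothetical names that only mimic the shape of the real API. It shows why a for loop works on the dataset and why map can call a two-argument function on each (image, label) element.

```python
class ToyDataset:
    """Toy stand-in for a dataset: lazily produces (image, label) tuples."""

    def __init__(self, make_iterator):
        # make_iterator is a zero-argument callable returning a fresh
        # iterator, so the dataset can be iterated more than once.
        self._make_iterator = make_iterator

    def __iter__(self):
        # Defining __iter__ is what makes `for example in dataset` work.
        return self._make_iterator()

    def map(self, fn):
        # Each element is a tuple, so it is unpacked into fn's arguments,
        # which is why format_image can take (image, label).
        return ToyDataset(lambda: (fn(*element) for element in self))

    def batch(self, size):
        def batches():
            chunk = []
            for element in self:
                chunk.append(element)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:  # final, possibly smaller batch
                yield chunk
        return ToyDataset(batches)


def format_image(image, label):
    # Stand-in for the tutorial's resize-and-scale step.
    return image / 255.0, label


samples = [(255.0, 0), (127.5, 1), (0.0, 2)]
training_set = ToyDataset(lambda: iter(samples))

num_training_examples = 0
for example in training_set:
    num_training_examples += 1

train_batches = training_set.map(format_image).batch(2)
```

Unlike this toy, the real batch method stacks the elements into tensors instead of returning Python lists, but the chaining pattern is the same.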
Let me know if you want me to explain anything else.

How to create a Pytorch Dataset from .pt files?

I have transformed MNIST images saved as .pt files in a folder in Google drive. I'm writing my Pytorch code in Colab.
I would like to use these files, and create a Dataset that stores these images as Tensors. How can I do this?
Transforming images during training took too long, so I transformed them in advance and saved them all as .pt files. I just want to load them back as a dataset and use them in my model.
The approach you are following to save images is indeed a good idea. In such a case, you can simply write your own Dataset class to load the images.
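import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data import RandomSampler

class ReaderDataset(Dataset):
    def __init__(self, filename):
        # load the images from file
        # (assumes the .pt file holds an indexable collection of samples)
        self.data = torch.load(filename)

    def __len__(self):
        # return total dataset size
        return len(self.data)

    def __getitem__(self, index):
        # return a single element; the DataLoader assembles the batches
        return self.data[index]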
Then you can create a DataLoader as follows.
train_dataset = ReaderDataset(filepath)
train_sampler = RandomSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=args['batch_size'],  # key name is an assumption
    collate_fn=batchify,
)
# args is a dictionary containing parameters
# batchify is a custom function that prepares each mini-batch

Tensorflow Dataset using many compressed numpy files

I have a large dataset that I would like to use for training in Tensorflow.
The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.
Currently I use a Keras Sequence based generator object to train, but I'd like to move entirely to Tensorflow without Keras.
I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.
My first idea was this
import glob
import tensorflow as tf
import numpy as np
def get_data_from_filename(filename):
    npdata = np.load(open(filename))
    return npdata['features'],npdata['labels']

# get files
filelist = glob.glob('*.npz')
# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_from_filename)
However, this passes a TF Tensor placeholder to a real numpy function and numpy is expecting a standard string. This results in the error:
File "", line 6, in get_data_from_filename
npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found
The other option I'm considering (but seems messy) is to create a Dataset object built on TF placeholders which I then fill during my epoch-batch loop from my numpy files.
Any suggestions?
You can define a wrapper and use py_func like this:
def get_data_from_filename(filename):
    npdata = np.load(filename)
    return npdata['features'], npdata['labels']

def get_data_wrapper(filename):
    # Assuming here that both your data and label is float type.
    # (tf.py_func is the TF 1.x API; TF 2.x provides tf.numpy_function.)
    features, labels = tf.py_func(
        get_data_from_filename, [filename], (tf.float32, tf.float32))
    return tf.data.Dataset.from_tensor_slices((features, labels))

# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_wrapper)
If your dataset is very large and you run into memory issues, consider combining interleave (or parallel_interleave) with from_generator instead. The from_generator method uses py_func internally, so you can read your .npz files directly in a generator defined in plain Python.
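The generator route could look roughly like the following. This is a sketch under the question's assumptions (each .npz file holds 'features' and 'labels' arrays with matching first dimensions), and iter_npz_samples is a hypothetical name. It opens one file at a time and yields individual samples, so files with different numbers of images are handled without loading the whole dataset into memory.

```python
import glob

import numpy as np


def iter_npz_samples(pattern):
    """Yield (feature, label) pairs one at a time from every matching .npz file."""
    for filename in sorted(glob.glob(pattern)):
        # np.load on an .npz archive returns a lazy NpzFile; the context
        # manager closes the file handle once both arrays have been read.
        with np.load(filename) as npdata:
            features = npdata['features']
            labels = npdata['labels']
        for feature, label in zip(features, labels):
            yield feature, label
```

Because from_generator expects a callable, the generator would be passed as something like lambda: iter_npz_samples('*.npz') together with the expected dtypes, after which shuffle, batch and prefetch apply as usual.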

Shuffling input files with tensorflow Datasets

With the old input-pipeline API I can do:
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
and then pass the filenames to other queue, for example:
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)
How can I achieve similar behaviour with the Dataset -API?
tf.data.TFRecordDataset expects a tensor of file names in a fixed order.
Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
The input pipeline of this question gave me an idea on how to implement filenames shuffling with the Dataset API:
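dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here (decode_example is a placeholder)
# further processing of the dataset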
This will put all the data of one file before the one of the next and so on. Files are shuffled, but the data inside them will be produced in the same order.
You can alternatively replace dataset.flat_map with interleave to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave does not actually run in multiple threads, it's a round-robin operation. For true parallel processing see parallel_interleave
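To make the difference in output order concrete, here is a toy pure-Python model of the two strategies. It only illustrates the ordering described above and is not TensorFlow's implementation: read_records, flat_map and interleave are hypothetical stand-ins, and each "file" simply yields labelled strings.

```python
import itertools
import random

_DONE = object()  # sentinel marking an exhausted file


def read_records(filename, count=2):
    # Stand-in for reading a record file: yields `count` fake records.
    for i in range(count):
        yield f"{filename}:{i}"


def flat_map(filenames, read):
    # All records of one file, then all records of the next.
    for filename in filenames:
        yield from read(filename)


def interleave(filenames, read, cycle_length):
    # Round-robin over `cycle_length` open files; when one is exhausted,
    # the next unread file takes over its slot.
    pending = iter(filenames)
    slots = [read(name) for name in itertools.islice(pending, cycle_length)]
    while slots:
        still_open = []
        for records in slots:
            item = next(records, _DONE)
            while item is _DONE:
                name = next(pending, None)
                if name is None:
                    break
                records = read(name)
                item = next(records, _DONE)
            if item is _DONE:
                continue  # no files left to fill this slot
            yield item
            still_open.append(records)
        slots = still_open


filenames = ["a", "b", "c"]
shuffled = random.sample(filenames, k=len(filenames))  # shuffles file order only

sequential = list(flat_map(filenames, read_records))
# ['a:0', 'a:1', 'b:0', 'b:1', 'c:0', 'c:1']
mixed = list(interleave(filenames, read_records, cycle_length=2))
# ['a:0', 'b:0', 'a:1', 'b:1', 'c:0', 'c:1']
```

In both cases the records inside a file keep their order; only a later shuffle of the records themselves would change that.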
The current Tensorflow version (v1.5 in 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple work around using numpy:
import numpy as np
import tensorflow as tf
myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()
dataset = tf.data.TFRecordDataset(myShuffledFileList)

Python: how to save training datasets

I have got training datasets, which are xtrain, ytrain, xtest and ytest. They are all numpy arrays. I want to save them together into a file, so that I can load them into workspace as done in keras for mnist.load_data:
(xtrain, ytrain), (xtest, ytest) = mnist.load_data(filepath)
In Python, is there any way to save my training datasets into such a single file? Or are there any other appropriate methods to save them?
You have a number of options:
Keras provides an option to save models to HDF5. Also, note that out of the three, it's the only interoperable format.
Pickle is a good way to go:
import pickle as pkl
# to save it (binary mode is required for pickle in Python 3)
with open("train.pkl", "wb") as f:
    pkl.dump([train_x, train_y], f)
# to load it
with open("train.pkl", "rb") as f:
    train_x, train_y = pkl.load(f)
If your dataset is huge, I would recommend checking out hdf5, as @Lukasz Tracewski mentioned.
I find hickle is a very nice way to save them all together into a dict:
import hickle as hkl
data = {'xtrain': xtrain, 'xtest': xtest, 'ytrain': ytrain, 'ytest': ytest}
hkl.dump(data, 'data.hkl')  # the file name is arbitrary
data = hkl.load('data.hkl')
You could simply use np.save('xtrain.npy', xtrain)
or in a human readable format
np.savetxt('xtrain.txt', xtrain)
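For the mnist.load_data-style call in the question, NumPy's own np.savez (or np.savez_compressed) can store all four arrays in a single .npz file. The sketch below is one way to wrap it; save_data, load_data and the file path are hypothetical names chosen for illustration.

```python
import numpy as np


def save_data(path, xtrain, ytrain, xtest, ytest):
    # Keyword names become the keys inside the .npz archive.
    np.savez_compressed(path, xtrain=xtrain, ytrain=ytrain,
                        xtest=xtest, ytest=ytest)


def load_data(path):
    # Returns the same nested-tuple layout as keras' mnist.load_data().
    with np.load(path) as data:
        return (data['xtrain'], data['ytrain']), (data['xtest'], data['ytest'])
```

With these helpers, (xtrain, ytrain), (xtest, ytest) = load_data('dataset.npz') mirrors the Keras call, and no precision is lost because the arrays are stored in binary form.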

