Tensorflow Dataset using many compressed numpy files - python

I have a large dataset that I would like to use for training in Tensorflow.
The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.
Currently I use a Keras Sequence based generator object to train, but I'd like to move entirely to Tensorflow without Keras.
I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.
My first idea was this
import glob
import tensorflow as tf
import numpy as np
def get_data_from_filename(filename):
npdata = np.load(open(filename))
return npdata['features'],npdata['labels']
# get files
filelist = glob.glob('*.npz')
# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_from_filename)
However, this passes a TF Tensor placeholder to a real numpy function and numpy is expecting a standard string. This results in the error:
File "test.py", line 6, in get_data_from_filename
npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found
The other option I'm considering (but seems messy) is to create a Dataset object built on TF placeholders which I then fill during my epoch-batch loop from my numpy files.
Any suggestions?

You can define a wrapper and use pyfunc like this:
def get_data_from_filename(filename):
npdata = np.load(filename)
return npdata['features'], npdata['labels']
def get_data_wrapper(filename):
# Assuming here that both your data and label is float type.
features, labels = tf.py_func(
get_data_from_filename, [filename], (tf.float32, tf.float32))
return tf.data.Dataset.from_tensor_slices((features, labels))
# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_wrapper)
If your dataset is very large and you have memory issues, you can consider using a combination of interleave or parallel_interleave and from_generator methods instead. The from_generator method uses py_func internally so you can directly read your np file and then define your generator in python.

Related

Store original data (e.g., text, image) along with tensor data in Pytorch Dataloader

Currently, I am using TensorDataset followed by DataLoader to load my dataset like below:
tensor_loader = TensorDataset(x_input_ids,x_seg_ids,x_atten_masks,y)
data_loader = DataLoader(tensor_loader, shuffle=True, batch_size=batch_size)
I now want to also store original (text) data along with the tensor data in the data_loader like below:
tensor_loader = TensorDataset(x_input_ids,x_seg_ids,x_atten_masks,y, x_input_strs)
Note: x_input_strs is text data corresponding to x_input_ids but it fails since TensorDataset allows only tensors. I also tried something like this:
tensor_loader = Dataset(x_input_ids,x_seg_ids,x_atten_masks,y, x_input_strs)
But it gives the following error:
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
Any suggestions are appreciated.

What to do when the data is too big to be stored in memory

I want to train a neural network, I work with Python (3.6.9) and Tensorflow (2.4.0) and my problem is that my dataset is too big to be stored in memory.
A bit of context :
My network takes in input a small complex matrix of dimension 64 by 32.
My dataset is stored in the form of a very large ".mat" file generated by a matlab code.
In the mat file, the samples are stored in a large cell array.
I use the h5py library to open the mat file.
Example of python code to load only one sample from the file :
f = h5py.File('dataset.mat', 'r')
refs = f['data'] # array of reference of each sample
sample = f[refs[0]][()].view(np.complex) # load the first sample
Currently, I load only a small part of the dataset that I store in a tensorflow dataset (ds = tf.data.Dataset.from_tensor_slices(datas)).
I would like to take advantage of the possibility offered by the h5py library to be able to load each example individually to load the examples on the fly during network training.
I tried the following approach:
f = h5py.File('dataset.mat', 'r')
refs = f['data'] # array of reference of each sample
ds_index = tf.data.Dataset.range(len(refs))
ds = ds_index.map(lambda i : f[refs[i]][()].view(np.complex))
but, I have the following error :
NotImplementedError: in user code:
<ipython-input-66-6cf802c8359a>:15 __call__ *
return self._f[self._rs[i]]['channel'][()].view(np.complex).astype(np.complex64).T
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:855 __array__
" a NumPy call, which is not supported".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (args_0:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
Do you know how to fix this error or can it be a better way to load examples on the fly ?

How to load Fashion MNIST dataset in Tensorflow Fedarated?

I am working on a project with Tensorflow federated. I have managed to use the libraries provided by TensorFlow Federated Learning simulations in order to load, train, and test some datasets.
For example, i load the emnist dataset
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()
and it got the data sets returned by load_data() as instances of tff.simulation.ClientData. This is an interface that allows me to iterate over client ids and allow me to select subsets of the data for simulations.
len(emnist_train.client_ids)
3383
emnist_train.element_type_structure
OrderedDict([('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None)), ('label', TensorSpec(shape=(), dtype=tf.int32, name=None))])
example_dataset = emnist_train.create_tf_dataset_for_client(
emnist_train.client_ids[0])
I am trying to load the fashion_mnist dataset with Keras to perform some federated operations:
fashion_train,fashion_test=tf.keras.datasets.fashion_mnist.load_data()
but I get this error
AttributeError: 'tuple' object has no attribute 'element_spec'
because Keras returns a Tuple of Numpy arrays instead of a tff.simulation.ClientData like before:
def tff_model_fn() -> tff.learning.Model:
return tff.learning.from_keras_model(
keras_model=factory.retrieve_model(True),
input_spec=fashion_test.element_spec,
loss=loss_builder(),
metrics=metrics_builder())
iterative_process = tff.learning.build_federated_averaging_process(
tff_model_fn, Parameters.server_adam_optimizer_fn, Parameters.client_adam_optimizer_fn)
server_state = iterative_process.initialize()
To sum up,
Is any way to create tuple elements of tff.simulation.ClientData from Keras Tuple Numpy arrays?
Another solution that comes to my mind is to use the
tff.simulation.HDF5ClientData and load
manually the appropriate files in aHDF5format (train.h5, test.h5) in order to get the tff.simulation.ClientData, but my problem is that i cant find the url for fashion_mnist HDF5 file format i mean something like that for both train and test:
fileprefix = 'fed_emnist_digitsonly'
sha256 = '55333deb8546765427c385710ca5e7301e16f4ed8b60c1dc5ae224b42bd5b14b'
filename = fileprefix + '.tar.bz2'
path = tf.keras.utils.get_file(
filename,
origin='https://storage.googleapis.com/tff-datasets-public/' + filename,
file_hash=sha256,
hash_algorithm='sha256',
extract=True,
archive_format='tar',
cache_dir=cache_dir)
dir_path = os.path.dirname(path)
train_client_data = hdf5_client_data.HDF5ClientData(
os.path.join(dir_path, fileprefix + '_train.h5'))
test_client_data = hdf5_client_data.HDF5ClientData(
os.path.join(dir_path, fileprefix + '_test.h5'))
return train_client_data, test_client_data
My final goal is to make the fashion_mnist dataset work with the TensorFlow federated learning.
You're on the right track. To recap: the datasets returned by tff.simulation.dataset APIs are tff.simulation.ClientData objects. The object returned by tf.keras.datasets.fashion_mnist.load_data is a tuple of numpy arrays.
So what is needed is to implement a tff.simulation.ClientData to wrap the dataset returned by tf.keras.datasets.fashion_mnist.load_data. Some previous questions about implementing ClientData objects:
Federated learning : convert my own image dataset into tff simulation Clientdata
How define tff.simulation.ClientData.from_clients_and_fn Function?
Is there a reasonable way to create tff clients datat sets?
This does require answering an important question: how should the Fashion MNIST data be split into individual users? The dataset doesn't include features that that could be used for partitioning. Researchers have come up with a few ways to synthetically partition the data, e.g. randomly sampling some labels for each participant, but this will have a great effect on model training and is useful to invest some thought here.

Converting image folder to numpy array is consuming the entire RAM

I am trying to convert the celebA dataset(https://www.kaggle.com/jessicali9530/celeba-dataset) images folder into a numpy array for later to be converted into a .pkl file(for using the data as simply as mnist or cifar).
I am willing to find a better way of converting since this method is absolutely consuming the whole RAM.
from PIL import Image
import pickle
from glob import glob
import numpy as np
TARGET_IMAGES = "img_align_celeba/*.jpg"
def generate_dataset(glob_files):
dataset = []
for _, file_name in enumerate(sorted(glob(glob_files))):
img = Image.open(file_name)
pixels = list(img.getdata())
dataset.append(pixels)
return np.array(dataset)
celebAdata = generate_dataset(TARGET_IMAGES)
I am rather curious on how the mnist authors did this themselves but any approach that works is welcome.
You can transform any kind of data on the fly in Keras and load in memory one batch at the time during training.
See documentation, search for 'Example of using .flow_from_directory(directory)'.

Shuffling input files with tensorflow Datasets

With the old input-pipeline API I can do:
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
and then pass the filenames to other queue, for example:
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)
How can I achieve similar behaviour with the Dataset -API?
The tf.data.TFRecordDataset() expects tensor of file-names in fixed order.
Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
EDIT:
The input pipeline of this question gave me an idea on how to implement filenames shuffling with the Dataset API:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset
This will put all the data of one file before the one of the next and so on. Files are shuffled, but the data inside them will be produced in the same order.
You can alternatively replace dataset.flat_map with interleave to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave does not actually run in multiple threads, it's a round-robin operation. For true parallel processing see parallel_interleave
The current Tensorflow version (v1.5 in 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple work around using numpy:
import numpy as np
import tensorflow as tf
myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()
dataset = tf.data.TFRecordDataset(myShuffledFileList)

Categories

Resources