Using Datasets from large numpy arrays in Tensorflow

Using Datasets from large numpy arrays in Tensorflow - python

I'm trying to load a dataset, stored in two .npy files (for features and ground truth) on my drive, and use it to train a neural network.
print("loading features...")
data = np.load("[...]/features.npy")
print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
throws a tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized. error when calling the from_tensor_slices() method.
The ground truth's file is larger than 2.44GB and thus I encounter problems when creating a Dataset with it (see warnings here and here).
Possible solutions I found were either for TensorFlow 1.x (here and here, while I am running version 2.6) or to use numpy's memmap (here), which I unfortunately don't get to run, plus I wonder whether that slows down the computation?
I'd appreciate your help, thanks!

You need some kind of data generator, because your data is way too big to fit directly into tf.data.Dataset.from_tensor_slices. I don't have your dataset, but here's an example of how you could get data batches and train your model inside a custom training loop. The data is an NPZ NumPy archive from here:
import numpy as np
def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
dataset_zip = np.load(file, encoding='latin1')
images = dataset_zip['imgs']
latents_classes = dataset_zip['latents_classes']
return images, latents_classes
def get_batch(indices, train_images, train_categories):
shapes_as_categories = np.array([train_categories[i][1] for i in indices])
images = np.array([train_images[i] for i in indices])
return [images.reshape((images.shape[0], 64, 64, 1)).astype('float32'), shapes_as_categories.reshape(
shapes_as_categories.shape[0], 1).astype('float32')]
# Load your data once
train_images, train_categories = load_data()
indices = list(range(train_images.shape[0]))
random.shuffle(indices)
epochs = 2000
batch_size = 256
total_batch = train_images.shape[0] // batch_size
for epoch in range(epochs):
for i in range(total_batch):
batch_indices = indices[batch_size * i: batch_size * (i + 1)]
batch = get_batch(batch_indices, train_images, train_categories)
...
...
# Train your model with this batch.

The accepted answer (https://stackoverflow.com/a/69994287) loads the whole data into memory with the load_data function. That's why our RAM is completely full.
You can try to make a npz file where each feature is its own npy file, then create a generator that loads this and use this generator like 1 to use it with tf.data.Dataset or build a data generator with keras like 2
or use the mmap method of numpy load while loading to stick to your one npy feature file like 3

Related

How to use properly Tensorflow Dataset with batch?

I am new to Tensorflow and deep learning, and I am struggling with the Dataset class. I tried a lot of things and I can’t find a good solution.
What I am trying
I have a large amount of images (500k+) to train my DNN with. This is a denoising autoencoder so I have a pair of each image. I am using the dataset class of TF to manage the data, but I think I use it really badly.
Here is how I load the filenames in a dataset:
class Data:
def __init__(self, in_path, out_path):
self.nb_images = 512
self.test_ratio = 0.2
self.batch_size = 8
# load filenames in input and outputs
inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)
self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
self.size_test = int(self.nb_images * self.test_ratio)
# split arrays in training / validation
test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)
# transform array to tf.data.Dataset
self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))
I have a function to call at each epoch that will prepare the dataset. It shuffles the filenames, and transforms filenames to images and batch data.
def get_batched_data(self, seed, batch_size):
nb_batch = int(self.size_training / batch_size)
def img_to_tensor(path_in, path_out):
img_string_in = tf.read_file(path_in)
img_string_out = tf.read_file(path_out)
im_in = tf.image.decode_jpeg(img_string_in, channels=1)
im_out = tf.image.decode_jpeg(img_string_out, channels=1)
return im_in, im_out
t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
t_datas = t_datas.map(img_to_tensor)
t_datas = t_datas.batch(batch_size)
return t_datas
Now during the training, at each epoch we call the get_batched_data function, make an iterator, and run it for each batch, then feed the array to the optimizer operation.
for epoch in range(nb_epoch):
sess_iter_in = tf.Session()
sess_iter_out = tf.Session()
batched_train = data.get_batched_data(epoch)
iterator_train = batched_train.make_one_shot_iterator()
in_data, out_data = iterator_train.get_next()
total_batch = int(data.size_training / batch_size)
for batch in range(total_batch):
print(f"{batch + 1} / {total_batch}")
in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
sess.run(optimizer, feed_dict={inputs: in_images,
outputs: out_images})
What do I need ?
I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.
Questions and problems
First question, am I using the Dataset class in a good way? I saw very different things on the internet, for example in this blog post the dataset is used with a placeholder and fed during the learning with the datas. It seems strange because the data are all in an array, so loaded in memory. I don't see the point of using tf.data.dataset in this case.
I found solution by using repeat(epoch) on the dataset, like this, but the shuffle will not be different for each epoch in this case.
The second problem with my implementation is that I have an OutOfRangeError in some cases. With a small amount of data (512 like in the exemple) it works fine, but with a bigger amount of data, the error occurs. I thought it was because of a bad calculation of the number of batch due to bad rounding, or when the last batch has a smaller amount of data, but it happens in batch 32 out of 115... Is there any way to know the number of batch created after a batch(n) call on dataset?
Sorry for this loooonng question, but I've been struggling with this for a few days.

As far as I know, Official Performance Guideline is the best teaching material to make input pipelines.
I want to shuffle the dataset in a different way for each epoch.
Using shuffle() and repeat(), you can get different shuffle pattern for each epochs. You can confirm it with the following code
dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)
iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()
with tf.Session() as sess:
for i in range(10):
print(sess.run(x))
You can also use tf.contrib.data.shuffle_and_repeat as the mentioned by the above official page.
There are some problems in your code outside of creating data pipelines. You confuse graph construction with graph execution. You are repeating to create data input pipeline, so there are many redundant input pipelines as many as epochs. You can observe the redundant pipelines by Tensorboard.
You should place your graph construction code outside of loop as the following code (pseudo code)
batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator_train.get_next()
for epoch in range(nb_epoch):
# reset iterator's state
sess.run(iterator.initializer)
try:
while True:
in_images = sess.run(in_data).reshape((-1, 64, 64, 1))
out_images = sess.run(out_data).reshape((-1, 64, 64, 1))
sess.run(optimizer, feed_dict={inputs: in_images,
outputs: out_images})
except tf.errors.OutOfRangeError:
pass
Moreover there are some unimportant inefficient code. You loaded a list of file path with from_tensor_slices(), so the list was embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for detail)
You would be better off using prefetch, and decreasing sess.run call by combining your graph.

How to correctly map a python function and then batch the Dataset in Tensorflow

I wish to create a pipeline to provide non-standard files to the neural network (for example with extension *.xxx).
Currently I have structured my code as follows:
  1) I define a list of paths where to find training files
  2) I define an instance of the tf.data.Dataset object containing these paths
  3) I map to the Dataset a python function that takes each path and returns the associated numpy array (loaded from the folder on the pc); this array is a matrix with dimensions [256, 256, 192].
  4) I define an initializable iterator and then use it during network training.
My doubt lies in the size of the batch I provide to the network. I would like to have batches of size 64 supplied to the network. How could I do?
For example, if I use the function train_data.batch(b_size) with b_size = 1 the result is that when iterated, the iterator gives one element of shape [256, 256, 192]; what if I wanted to feed the neural net with just 64 slices of this array?
This is an extract of my code:
with tf.name_scope('data'):
train_filenames = tf.constant(list_of_files_train)
train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
train_data = train_data.map(lambda filename: tf.py_func(
self._parse_xxx_data, [filename], [tf.float32]))
train_data.shuffle(buffer_size=len(list_of_files_train))
train_data.batch(b_size)
iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)
input_data = iterator.get_next()
train_init = iterator.make_initializer(train_data)
[...]
with tf.Session() as sess:
sess.run(train_init)
_ = sess.run([self.train_op])
Thanks in advance
----------
I posted a solution to my problem in the comments below. I would still be happy to receive any comment or suggestion on possible improvements. Thank you ;)

It's been a long time but I'll post a possible solution to batch the dataset with custom shape in TensorFlow, in case someone may need it.
The module tf.data offers the method unbatch() to unwrap the content of each dataset element. One can first unbatch and than batch again the dataset object in the desired way. Oftentimes, a good idea may also be shuffling the unbatched dataset before batching it again (so that we have random slices from random elements in each batch):
with tf.name_scope('data'):
train_filenames = tf.constant(list_of_files_train)
train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
train_data = train_data.map(lambda filename: tf.py_func(
self._parse_xxx_data, [filename], [tf.float32]))
# un-batch first, then batch the data
train_data = train_data.apply(tf.data.experimental.unbatch())
train_data.shuffle(buffer_size=BSIZE)
train_data.batch(b_size)
# [...]

If I clearly understand you question, you can try to slice the array into the shape you want in your self._parse_xxx_data function.

How to limit RAM usage while batch training in tensorflow?

I am training a deep neural network with a large image dataset in mini-batches of size 40. My dataset is in .mat format (which I can easily change to any other format e.g. .npy format if necessitates) and before training, loaded as a 4-D numpy array. My problem is that while training, cpu-RAM (not GPU RAM) is very quickly exhausting and starts using almost half of my Swap memory.
My training code has the following pattern:
batch_size = 40
...
with h5py.File('traindata.mat', 'r') as _data:
train_imgs = np.array(_data['train_imgs'])
# I can replace above with below loading, if necessary
# train_imgs = np.load('traindata.npy')
...
shape_4d = train_imgs.shape
for epoch_i in range(max_epochs):
for iter in range(shape_4d[0] // batch_size):
y_ = train_imgs[iter*batch_size:(iter+1)*batch_size]
...
...
This seems like the initial loading of the full training data is itself becoming the bottle-neck (taking over 12 GB cpu RAM before I abort).
What is the best efficient way to tackle this bottle-neck?
Thanks in advance.

Loading a big dataset in memory is not a good idea. I suggest you to use something different for loading the datasets, take a look to the dataset API in TensorFlow: https://www.tensorflow.org/programmers_guide/datasets
You might need to convert your data into other format, but if you have a CSV or TXT file with a example per line you can use TextLineDataset and feed the model with it:
filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.TextLineDataset(filenames)
def _parse_py_fun(text_line):
... your custom code here, return np arrays
def _map_fun(text_line):
result = tf.py_func(_parse_py_fun, [text_line], [tf.uint8])
... other tensorlow code here
return result
dataset = dataset.map(_map_fun)
dataset = dataset.batch(4)
iterator = dataset.make_one_shot_iterator()
input_data_of_your_model = iterator.get_next()
output = build_model_fn(input_data_of_your_model)
sess.run([output]) # the input was assigned directly when creating the model

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10
#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)
#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
#convert label to one-hot encoding
one_hot = tf.one_hot(label, stages)
#read image file
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_image(image_string, channels=3)
image = tf.cast(image_decoded, tf.float32)
return image, one_hot
#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
for _ in range(10):
dataset = dataset.shuffle(buffer_size = 100)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
train_step.run(feed_dict={x: images, y_:labels})
Somehow using a higher batch_sizes will break python. What I'm trying to do is to train my neural network with new batches on each iteration. That's why Im also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I will get a numpy array. I will then shuffle the array with numpy.random.shuffle(images) and then pick up some the first elements to train it.
e.g.
for _ in range(1000):
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
#shuffle
np.random.shuffle(images)
np.random.shuffle(labels)
train_step.run(feed_dict={x: images[0:train_size], y_:labels[0:train_size]})
But then here comes the problem that I can't batch the my whole dataset. It looks like that the data is too big for python to work with.
How should I solve this differently?
Since I'm not using the MNIST database there isn't a function like mnist.train.next_batch(100) which comes handy for me.

Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that the first iteration, the batch size is, say [32 3600]. The next iteration, the elements of this shape are batched again, to [32 32 3600], and so on.
There's a great tutorial on the TF website where you can find out more how Datasets work, but here are a few suggestions how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset so your batches will have a good mixture of examples. Also increase the buffer_size argument, this works in a different way than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset -- the shuffled part of dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels, or the read images and labels -- but the latter will have more work to do since the dataset is larger by that time.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB

Ive been getting through a lot of problems with creating tensorflow datasets. So I decided to use OpenCV to import images.
import opencv as cv
imgDataset = []
for i in range(len(files)):
imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
the shape of imgDataset is (num_img, height, width, col_channels). Getting the i-th image should be imgDataset[i].
shuffling the dataset and getting only batches of it can be done like this:
from sklearn.utils import shuffle
X,y = shuffle(X, y)
X_feed = X[batch_size]
y_feed = y[batch_size]
Then you feed X_feed and y_feed into your model

how to read batches in one hdf5 data file for training?

I have a hdf5 training dataset with size (21760, 1, 33, 33). 21760 is the whole number of training samples. I want to use the mini-batch training data with the size 128 to train the network.
I want to ask:
How to feed 128 mini-batch training data from the whole dataset with tensorflow each time?

If your data set is so large that it can't be imported into memory like keveman suggested, you can use the h5py object directly:
import h5py
import tensorflow as tf
data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128
sess = tf.Session()
train_op = # tf.something_useful()
input = # tf.placeholder or something
for i in range(0, data_size, batch_size):
current_data = data['data_set'][position:position+batch_size]
sess.run(train_op, feed_dict={input: current_data})
You can also run through a huge number of iterations and randomly select a batch if you want to:
import random
for i in range(iterations):
pos = random.randint(0, int(data_size/batch_size)-1) * batch_size
current_data = data['data_set'][pos:pos+batch_size]
sess.run(train_op, feed_dict={inputs=current_data})
Or sequentially:
for i in range(iterations):
pos = (i % int(data_size / batch_size)) * batch_size
current_data = data['data_set'][pos:pos+batch_size]
sess.run(train_op, feed_dict={inputs=current_data})
You probably want to write some more sophisticated code that goes through all data randomly, but keeps track of which batches have been used, so you don't use any batch more often than others. Once you've done a full run through the training set you enable all batches again and repeat.

You can read the hdf5 dataset into a numpy array, and feed slices of the numpy array to the TensorFlow model. Pseudo code like the following would work :
import numpy, h5py
f = h5py.File('somefile.h5','r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)
for i in range(0, 21760, 128):
sess.run(train_op, feed_dict={input:data_as_array[i:i+128, :, :, :]})

alkamen's approach seems logically right but I have not gotten any positive results using it. My best guess is this: Using code sample 1 above, in every iteration, the network trains afresh, forgetting all that has been learned in the previous loop. So if we are fetching at 30 samples or batches per iteration, at every loop/iteration, only 30 data samples are being used, then at the next loop, everything is overwritten.
Find below a screenshot of this approach
As can be seen, the loss and accuracy always start afresh. I will be happy if anyone could share a possible way around this, please.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.