tf.data.Iterator with a multi GPU setup

tf.data.Iterator with a multi GPU setup - python

I have looked at the cifar10 multi-GPU implementation to draw inspiration for parallelizing my own GPU trained model.
My model consumes data from TFRecords, which are iterated through the tf.data.Iterator class. So given 2 GPUs what I am trying to do is call iterator.get_next() on the CPU one time for each GPU (twice for example) do some preprocessing ,embedding lookup and other CPU related stuff and then feed the two batches into the GPUs.
Pseudo code:
with tf.device('/cpu:0'):
batches = []
for gpu in multiple_gpus:
single_gpu_batch = cpu_function(iterator.get_next())
batches.append(single_gpu_batch)
....................
for gpu, batch in zip(multiple_gpus, batches):
with tf.device('/device:GPU:{}'.format(gpu.id):
single_gpu_loss = inference_and_loss(batch)
tower_losses.append(single_gpu_loss)
...........
...........
total_loss = average_loss(tower_losses)
The problem is, that if there is only 1 or less examples to be drawn from the data and I call iterator.get_next() twice a tf.errors.OutOfRange exception will be raised and the data of the first call of iterator.get_next() (which actually didn't fail, only the second one) will never be passed through the GPU.
I thought about drawing the data in one iterator.get_next() call and splitting it later, but tf.split fails of the batch size is not dividable by the number of GPUs.
What is the right way to implement consuming from iterator in a multi-GPU setup?

I think the second suggestion is the easiest way to go. In order to avoid the splitting problem on the last batch, you can use the drop_remainder option in dataset.batch; Or if you need to see all data, then one possible solution is to explicitly set the dimensions based on the size of the drawn batch, so that the splitting operation never fails:
dataset = dataset.batch(batch_size * multiple_gpus)
iterator = dataset.make_one_shot_iterator()
batches = iterator.get_next()
split_dims = [0] * multiple_gpus
drawn_batch_size = tf.shape(batches)[0]
Either in a greedy manner, i.e., fits batch_size tensors on each device until runs out
#### Solution 1 [Greedy]:
for i in range(multiple_gpus):
split_dims[i] = tf.maximum(0, tf.minimum(batch_size, drawn_batch_size))
drawn_batch_size -= batch_size
or in a more spread manner to ensure that each device get at least one sample (assuming multiple_gpus < drawn_batch_size)
### Solution 2 [Spread]
drawn_batch_size -= - multiple_gpus
for i in range(multiple_gpus):
split_dims[i] = tf.maximum(0, tf.minimum(batch_size - 1, drawn_batch_size)) + 1
drawn_batch_size -= batch_size
## Split batches
batches = tf.split(batches, split_dims)

Related

How to use properly Tensorflow Dataset with batch?

I am new to Tensorflow and deep learning, and I am struggling with the Dataset class. I tried a lot of things and I can’t find a good solution.
What I am trying
I have a large amount of images (500k+) to train my DNN with. This is a denoising autoencoder so I have a pair of each image. I am using the dataset class of TF to manage the data, but I think I use it really badly.
Here is how I load the filenames in a dataset:
class Data:
def __init__(self, in_path, out_path):
self.nb_images = 512
self.test_ratio = 0.2
self.batch_size = 8
# load filenames in input and outputs
inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)
self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
self.size_test = int(self.nb_images * self.test_ratio)
# split arrays in training / validation
test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)
# transform array to tf.data.Dataset
self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))
I have a function to call at each epoch that will prepare the dataset. It shuffles the filenames, and transforms filenames to images and batch data.
def get_batched_data(self, seed, batch_size):
nb_batch = int(self.size_training / batch_size)
def img_to_tensor(path_in, path_out):
img_string_in = tf.read_file(path_in)
img_string_out = tf.read_file(path_out)
im_in = tf.image.decode_jpeg(img_string_in, channels=1)
im_out = tf.image.decode_jpeg(img_string_out, channels=1)
return im_in, im_out
t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
t_datas = t_datas.map(img_to_tensor)
t_datas = t_datas.batch(batch_size)
return t_datas
Now during the training, at each epoch we call the get_batched_data function, make an iterator, and run it for each batch, then feed the array to the optimizer operation.
for epoch in range(nb_epoch):
sess_iter_in = tf.Session()
sess_iter_out = tf.Session()
batched_train = data.get_batched_data(epoch)
iterator_train = batched_train.make_one_shot_iterator()
in_data, out_data = iterator_train.get_next()
total_batch = int(data.size_training / batch_size)
for batch in range(total_batch):
print(f"{batch + 1} / {total_batch}")
in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
sess.run(optimizer, feed_dict={inputs: in_images,
outputs: out_images})
What do I need ?
I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.
Questions and problems
First question, am I using the Dataset class in a good way? I saw very different things on the internet, for example in this blog post the dataset is used with a placeholder and fed during the learning with the datas. It seems strange because the data are all in an array, so loaded in memory. I don't see the point of using tf.data.dataset in this case.
I found solution by using repeat(epoch) on the dataset, like this, but the shuffle will not be different for each epoch in this case.
The second problem with my implementation is that I have an OutOfRangeError in some cases. With a small amount of data (512 like in the exemple) it works fine, but with a bigger amount of data, the error occurs. I thought it was because of a bad calculation of the number of batch due to bad rounding, or when the last batch has a smaller amount of data, but it happens in batch 32 out of 115... Is there any way to know the number of batch created after a batch(n) call on dataset?
Sorry for this loooonng question, but I've been struggling with this for a few days.

As far as I know, Official Performance Guideline is the best teaching material to make input pipelines.
I want to shuffle the dataset in a different way for each epoch.
Using shuffle() and repeat(), you can get different shuffle pattern for each epochs. You can confirm it with the following code
dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)
iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()
with tf.Session() as sess:
for i in range(10):
print(sess.run(x))
You can also use tf.contrib.data.shuffle_and_repeat as the mentioned by the above official page.
There are some problems in your code outside of creating data pipelines. You confuse graph construction with graph execution. You are repeating to create data input pipeline, so there are many redundant input pipelines as many as epochs. You can observe the redundant pipelines by Tensorboard.
You should place your graph construction code outside of loop as the following code (pseudo code)
batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator_train.get_next()
for epoch in range(nb_epoch):
# reset iterator's state
sess.run(iterator.initializer)
try:
while True:
in_images = sess.run(in_data).reshape((-1, 64, 64, 1))
out_images = sess.run(out_data).reshape((-1, 64, 64, 1))
sess.run(optimizer, feed_dict={inputs: in_images,
outputs: out_images})
except tf.errors.OutOfRangeError:
pass
Moreover there are some unimportant inefficient code. You loaded a list of file path with from_tensor_slices(), so the list was embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for detail)
You would be better off using prefetch, and decreasing sess.run call by combining your graph.

Why is TensorFlow's `tf.data` package slowing down my code?

I'm just learning to use TensorFlow's tf.data API, and I've found that it is slowing my code down a lot, measured in time per epoch. This is the opposite of what it's supposed to do, I thought. I wrote a simple linear regression program to test it out.
Tl;Dr: With 100,000 training data, tf.data slows time per epoch down by about a factor of ten, if you're using full batch training. Worse if you use smaller batches. The opposite is true with 500 training data.
My question: What is going on? Is my implementation flawed? Other sources I've read have tf.data improving speeds by about 30%.
import tensorflow as tf
import numpy as np
import timeit
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)
n_epochs = 10
input_dimensions_list = [10]
def function_to_approximate(x):
return np.dot(x, random_covector).astype(np.float32) + np.float32(.01) * np.random.randn(1,1).astype(np.float32)
def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
tf.reset_default_graph()
weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
X = tf.placeholder(tf.float32, shape=(None, input_dimension), name='X')
Y = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
prediction = tf.matmul(X,weights)
loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for _ in range(n_epochs):
sess.run(loss_op, feed_dict={X: training_inputs, Y:training_labels})
def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
tf.reset_default_graph()
weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
X,Y = data_set.make_one_shot_iterator().get_next()
prediction = tf.matmul(X, weights)
loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
while True:
try:
sess.run(loss_op)
except tf.errors.OutOfRangeError:
break
for input_dimension in input_dimensions_list:
for data_size in [500, 100000]:
training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
training_labels = function_to_approximate(training_inputs)
print("Not using tf.data, with data size "
"{}, input dimension {} and training with "
"a full batch, it took an average of "
"{} seconds to run {} epochs.\n".
format(
data_size,
input_dimension,
timeit.timeit(
lambda: regress_without_tfData(
n_epochs, input_dimension,
training_inputs, training_labels
),
number=3
),
n_epochs))
for input_dimension in input_dimensions_list:
for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:
training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
training_labels = function_to_approximate(training_inputs)
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
print("Using tf.data, with data size "
"{}, and input dimension {}, and training with "
"batch size {}, it took an average of {} seconds "
"to run {} epochs.\n".
format(
data_size,
input_dimension,
batch_size,
timeit.timeit(
lambda: regress_with_tfData(
n_epochs, input_dimension,
training_inputs, training_labels,
batch_size
),
number=3
)/3,
n_epochs
))
This outputs for me:
Not using tf.data, with data size 500, input dimension 10 and training
with a full batch, it took an average of 0.20243382899980134 seconds
to run 10 epochs.
Not using tf.data, with data size 100000, input dimension 10 and
training with a full batch, it took an average of 0.2431719040000644
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 50, it took an average of 0.09512088866661846
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 500, it took an average of
0.07286913600000844 seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 50, it took an average of 4.421892363666605
seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 100000, it took an average of
2.2555197536667038 seconds to run 10 epochs.
Edit: Fixed an important issue that Fred Guth pointed out. It didn't much affect the results, though.

I wanted to test the dataset API which seems to be really convenient for processing data. I did a lot of time testing about this API in CPU, GPU and multi-GPU way for small and large NN with different type of data.
First thing, It seems to me that your code is ok. But I need to point that your NN is just one simple layer.
Now, the dataset API is not suitable for your type of NN but for NN with a lot more complexity. Why ? For several reasons that I explain below (founded in my quest of understanding the dataset API).
Firstly, in one hand the dataset API processes data each batch whereas in the other hand data are preprocessed. Therefore, if it fits your RAM, you can save time by preprocessing the data. Here your data are just to "simple". If you want to test what i am saying, try to find a really really big dataset to process. Nevertheless, the dataset API can be tuned with prefetching data. You can take a look to this tutorial that explain really well why it is good to process data with prefetch.
Secondly, in my quest of dataset API for Multi-GPU training, I discovered that as far as i know the old pre-processing way is faster than dataset API for small Neural Network. You can verify that by creating a simple stackable RNN which take a sequence in input. You can try different size of stack (i have tested 1, 2, 10 and 20). You will see that, using the dataset API, on 1-GPU or on 4-GPUs, the time did not differ for small RNN stacks (1, 2 and 5).
To summarize, the dataset API is suitable for Neural Network that have data that can't be pre-process. Depending on your task, it may be more convenient to pre-process data, for example if you want to tweak your NN in order to improve it. I agree that the dataset API is really cool for batch, padding and also convenient for shuffling large amount of data but it's also not suitable for multi-GPU training.

First:
You are recreating the dataset unnecessarily.
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
Create the dataset prior to the loop and change the regress_with_tfData input signature to use dataset instead of training_inputs and training_labels.
Second:
The problem here is that minibatches of size 50 or even 500 are too small to compensate the cost of td.data building latency. You should increase the minibatch size. Interestingly you did so with a minibatch of size 100000, but then maybe it is too big ( I am not certain of this, I think it would need more tests).
There are a couple of things you could try:
1) Increase the minibatch size to something like 10000 and see if you get an improvement
2) Change your pipeline to use an iterator, example:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
iterator = data_set.make_one_shot_iterator()
....
next_element = iterator.get_next()

That is because you are comparing apples with bananas.
On one hand, when using placeholders, you are providing a monolithic tensor as is. On the other hand, when using Dataset, you are slicing the tensor into individual samples. This is very different.
The equivalent of providing a monolothic placeholder tensor with the Dataset pipeline is by using tf.data.Dataset.from_tensors. When I use from_tensors in your example, I get similar (actually smaller) computation times than with placeholders.
If you want to compare a more sophisticated pipeline using from_tensor_slices, you should use a fair comparison with placeholders. For example, shuffle your data. Add some preprocessing on your slices. I have no doubt you will observe the performance gain that makes people switch to this pipeline.

One possible thing you are missing is a prefetch. Add a prefetch of 1 at the end of your data pipeline like so:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size).prefetch(1)
Adding a prefetch of 1 at the end of your dataset pipeline means you try and fetch 1 batch of data while training is happening. This way you wont be waiting around while the batch is prepared, it should be ready to go as soon as each train iteration is done.

The accepted answer doesn't help longer valid, as the TF behavior has changed. Per documentation:
from_tensors produces a dataset containing only a single element. To
slice the input tensor into multiple elements, use from_tensor_slices
instead.
This means you cannot batch it
X = np.arange(10)
data = tf.data.Dataset.from_tensors( X )
data = data.batch(2)
for t in data.as_numpy_iterator():
print(t)
# only one row, whereas expected 5 !!!
The documentation recommends from_tensor_slices. But this has quite some overhead when compared to numpy slicing. Slow slicing is an open issue https://github.com/tensorflow/tensorflow/issues/39750
Essentially, slicing in TF is slow and impacts input-bound or light models such as small networks (regression, word2vec).

Streaming large training and test files into Tensorflow's DNNClassifier

I have a huge training CSV file (709M) and a large testing CSV file (125M) that I want to send into a DNNClassifier in the context of using the high-level Tensorflow API.
It appears that the input_fn param accepted by fit and evaluate must hold all feature and label data in memory, but I currently would like to run this on my local machine, and thus expect it to run out of memory rather quickly if I read these files into memory and then process them.
I skimmed the doc on streamed-reading of data, but the sample code for reading CSVs appears to be for the low-level Tensorflow API.
And - if you'll forgive a bit of whining - it seems overly-complex for the trivial use case of sending well-prepared files of training and test data into an Estimator ... although, perhaps that level of complexity is actually required for training and testing large volumes of data in Tensorflow?
In any case, I'd really appreciate an example of using that approach with the high-level API, if it's even possible, which I'm beginning to doubt.
After poking around, I did manage to find DNNClassifier#partial_fit, and will attempt to use it for training.
Examples of how to use this method would save me some time, though hopefully I'll stumble into the correct usage in the next few hours.
However, there doesn't seem to be a corresponding DNNClassifier#partial_evaluate ... though I suspect that I could break-up the testing data into smaller pieces and run DNNClassifier#evaluate successively on each batch, which might actually be a great way to do it since I could segment the testing data into cohorts, and thereby obtain per-cohort accuracy.
==== Update ====
Short version:
DomJack's recommendation should be the accepted answer.
However, my Mac's 16GB of RAM enough for it to hold the entire 709Mb training data set in memory without crashing. So, while I will use the DataSets feature when I eventually deploy the app, I'm not using it yet for local dev work.
Longer version:
I started by using the partial_fit API as described above, but upon every use it emitted a warning.
So, I went to look at the source for the method here, and discovered that its complete implementation looks like this:
logging.warning('The current implementation of partial_fit is not optimized'
' for use in a loop. Consider using fit() instead.')
return self.fit(x=x, y=y, input_fn=input_fn, steps=steps,
batch_size=batch_size, monitors=monitors)
... which reminds me of this scene from Hitchhiker's Guide:
Arthur Dent: What happens if I press this button?
Ford Prefect: I wouldn't-
Arthur Dent: Oh.
Ford Prefect: What happened?
Arthur Dent: A sign lit up, saying 'Please do not press this button again'.
Which is to say: partial_fit seems to exist for the sole purpose of telling you not to use it.
Furthermore, the model generated by using partial_fit iteratively on training file chunks was much smaller than the one generated by using fit on the whole training file, which strongly suggests that only the last partial_fit training chunk actually "took".

Check out the tf.data.Dataset API. There are a number of ways to create a dataset. I'll outline four - but you'll only have to implement one.
I assume each row of your csv files is n_features float values followed by a single int value.
Creating a tf.data.Dataset
Wrap a python generator with Dataset.from_generator
The easiest way to get started is to wrap a native python generator. This can have performance issues, but may be fine for your purposes.
def read_csv(filename):
with open(filename, 'r') as f:
for line in f.readlines():
record = line.rstrip().split(',')
features = [float(n) for n in record[:-1]]
label = int(record[-1])
yield features, label
def get_dataset():
filename = 'my_train_dataset.csv'
generator = lambda: read_csv(filename)
return tf.data.Dataset.from_generator(
generator, (tf.float32, tf.int32), ((n_features,), ()))
This approach is highly versatile and allows you to test your generator function (read_csv) independently of TensorFlow.
Use Tensorflow Datasets API
Supporting tensorflow versions 1.12+, tensorflow datasets is my new favourite way of creating datasets. It automatically serializes your data, collects statistics and makes other meta-data available to you via info and builder objects. It can also handle automatic downloading and extracting making collaboration simple.
import tensorflow_datasets as tfds
class MyCsvDatasetBuilder(tfds.core.GeneratorBasedBuilder):
VERSION = tfds.core.Version("0.0.1")
def _info(self):
return tfds.core.DatasetInfo(
builder=self,
description=(
"My dataset"),
features=tfds.features.FeaturesDict({
"features": tfds.features.Tensor(
shape=(FEATURE_SIZE,), dtype=tf.float32),
"label": tfds.features.ClassLabel(
names=CLASS_NAMES),
"index": tfds.features.Tensor(shape=(), dtype=tf.float32)
}),
supervised_keys=("features", "label"),
)
def _split_generators(self, dl_manager):
paths = dict(
train='/path/to/train.csv',
test='/path/to/test.csv',
)
# better yet, if the csv files were originally downloaded, use
# urls = dict(train=train_url, test=test_url)
# paths = dl_manager.download(urls)
return [
tfds.core.SplitGenerator(
name=tfds.Split.TRAIN,
num_shards=10,
gen_kwargs=dict(path=paths['train'])),
tfds.core.SplitGenerator(
name=tfds.Split.TEST,
num_shards=2,
gen_kwargs=dict(cvs_path=paths['test']))
]
def _generate_examples(self, csv_path):
with open(csv_path, 'r') as f:
for i, line in enumerate(f.readlines()):
record = line.rstrip().split(',')
features = [float(n) for n in record[:-1]]
label = int(record[-1])
yield dict(features=features, label=label, index=i)
Usage:
builder = MyCsvDatasetBuilder()
builder.download_and_prepare() # will only take time to run first time
# as_supervised makes output (features, label) - good for model.fit
datasets = builder.as_dataset(as_supervised=True)
train_ds = datasets['train']
test_ds = datasets['test']
Wrap an index-based python function
One of the downsides of the above is shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).
def get_record(i):
# load the ith record using standard python, return numpy arrays
return features, labels
def get_inputs(batch_size, is_training):
def tf_map_fn(index):
features, labels = tf.py_func(
get_record, (index,), (tf.float32, tf.int32), stateful=False)
features.set_shape((n_features,))
labels.set_shape(())
# do data augmentation here
return features, labels
epoch_size = get_epoch_size()
dataset = tf.data.Dataset.from_tensor_slices((tf.range(epoch_size,))
if is_training:
dataset = dataset.repeat().shuffle(epoch_size)
dataset = dataset.map(tf_map_fn, (tf.float32, tf.int32), num_parallel_calls=8)
dataset = dataset.batch(batch_size)
# prefetch data to CPU while GPU processes previous batch
dataset = dataset.prefetch(1)
# Also possible
# dataset = dataset.apply(
# tf.contrib.data.prefetch_to_device('/gpu:0'))
features, labels = dataset.make_one_shot_iterator().get_next()
return features, labels
In short, we create a dataset just of the record indices (or any small record ID which we can load entirely into memory). We then do shuffling/repeating operations on this minimal dataset, then map the index to the actual data via tf.data.Dataset.map and tf.py_func. See the Using with Estimators and Testing in isolation sections below for usage. Note this requires your data to be accessible by row, so you may need to convert from csv to some other format.
TextLineDataset
You can also read the csv file directly using a tf.data.TextLineDataset.
def get_record_defaults():
zf = tf.zeros(shape=(1,), dtype=tf.float32)
zi = tf.ones(shape=(1,), dtype=tf.int32)
return [zf]*n_features + [zi]
def parse_row(tf_string):
data = tf.decode_csv(
tf.expand_dims(tf_string, axis=0), get_record_defaults())
features = data[:-1]
features = tf.stack(features, axis=-1)
label = data[-1]
features = tf.squeeze(features, axis=0)
label = tf.squeeze(label, axis=0)
return features, label
def get_dataset():
dataset = tf.data.TextLineDataset(['data.csv'])
return dataset.map(parse_row, num_parallel_calls=8)
The parse_row function is a little convoluted since tf.decode_csv expects a batch. You can make it slightly simpler if you batch the dataset before parsing.
def parse_batch(tf_string):
data = tf.decode_csv(tf_string, get_record_defaults())
features = data[:-1]
labels = data[-1]
features = tf.stack(features, axis=-1)
return features, labels
def get_batched_dataset(batch_size):
dataset = tf.data.TextLineDataset(['data.csv'])
dataset = dataset.batch(batch_size)
dataset = dataset.map(parse_batch)
return dataset
TFRecordDataset
Alternatively you can convert the csv files to TFRecord files and use a TFRecordDataset. There's a thorough tutorial here.
Step 1: Convert the csv data to TFRecords data. Example code below (see read_csv from from_generator example above).
with tf.python_io.TFRecordWriter("my_train_dataset.tfrecords") as writer:
for features, labels in read_csv('my_train_dataset.csv'):
example = tf.train.Example()
example.features.feature[
"features"].float_list.value.extend(features)
example.features.feature[
"label"].int64_list.value.append(label)
writer.write(example.SerializeToString())
This only needs to be run once.
Step 2: Write a dataset that decodes these record files.
def parse_function(example_proto):
features = {
'features': tf.FixedLenFeature((n_features,), tf.float32),
'label': tf.FixedLenFeature((), tf.int64)
}
parsed_features = tf.parse_single_example(example_proto, features)
return parsed_features['features'], parsed_features['label']
def get_dataset():
dataset = tf.data.TFRecordDataset(['data.tfrecords'])
dataset = dataset.map(parse_function)
return dataset
Using the dataset with estimators
def get_inputs(batch_size, shuffle_size):
dataset = get_dataset() # one of the above implementations
dataset = dataset.shuffle(shuffle_size)
dataset = dataset.repeat() # repeat indefinitely
dataset = dataset.batch(batch_size)
# prefetch data to CPU while GPU processes previous batch
dataset = dataset.prefetch(1)
# Also possible
# dataset = dataset.apply(
# tf.contrib.data.prefetch_to_device('/gpu:0'))
features, label = dataset.make_one_shot_iterator().get_next()
estimator.train(lambda: get_inputs(32, 1000), max_steps=1e7)
Testing the dataset in isolation
I'd strongly encourage you to test your dataset independently of your estimator. Using the above get_inputs, it should be as simple as
batch_size = 4
shuffle_size = 100
features, labels = get_inputs(batch_size, shuffle_size)
with tf.Session() as sess:
f_data, l_data = sess.run([features, labels])
print(f_data, l_data) # or some better visualization function
Performance
Assuming your using a GPU to run your network, unless each row of your csv file is enormous and your network is tiny you probably won't notice a difference in performance. This is because the Estimator implementation forces data loading/preprocessing to be performed on the CPU, and prefetch means the next batch can be prepared on the CPU as the current batch is training on the GPU. The only exception to this is if you have a massive shuffle size on a dataset with a large amount of data per record, which will take some time to load in a number of examples initially before running anything through the GPU.

I agree with DomJack about using the Dataset API, except the need to read the whole csv file and then convert to TfRecord. I am hereby proposing to emply TextLineDataset - a sub-class of the Dataset API to directly load data into a TensorFlow program. An intuitive tutorial can be found here.
The code below is used for the MNIST classification problem for illustration and hopefully, answer the question of the OP. The csv file has 784 columns, and the number of classes is 10. The classifier I used in this example is a 1-hidden-layer neural network with 16 relu units.
Firstly, load libraries and define some constants:
# load libraries
import tensorflow as tf
import os
# some constants
n_x = 784
n_h = 16
n_y = 10
# path to the folder containing the train and test csv files
# You only need to change PATH, rest is platform independent
PATH = os.getcwd() + '/'
# create a list of feature names
feature_names = ['pixel' + str(i) for i in range(n_x)]
Secondly, we create an input function reading a file using the Dataset API, then provide the results to the Estimator API. The return value must be a two-element tuple organized as follows: the first element must be a dict in which each input feature is a key, and then a list of values for the training batch, and the second element is a list of labels for the training batch.
def my_input_fn(file_path, batch_size=32, buffer_size=256,\
perform_shuffle=False, repeat_count=1):
'''
Args:
- file_path: the path of the input file
- perform_shuffle: whether the data is shuffled or not
- repeat_count: The number of times to iterate over the records in the dataset.
For example, if we specify 1, then each record is read once.
If we specify None, iteration will continue forever.
Output is two-element tuple organized as follows:
- The first element must be a dict in which each input feature is a key,
and then a list of values for the training batch.
- The second element is a list of labels for the training batch.
'''
def decode_csv(line):
record_defaults = [[0.]]*n_x # n_x features
record_defaults.insert(0, [0]) # the first element is the label (int)
parsed_line = tf.decode_csv(records=line,\
record_defaults=record_defaults)
label = parsed_line[0] # First element is the label
del parsed_line[0] # Delete first element
features = parsed_line # Everything but first elements are the features
d = dict(zip(feature_names, features)), label
return d
dataset = (tf.data.TextLineDataset(file_path) # Read text file
.skip(1) # Skip header row
.map(decode_csv)) # Transform each elem by applying decode_csv fn
if perform_shuffle:
# Randomizes input using a window of 256 elements (read into memory)
dataset = dataset.shuffle(buffer_size=buffer_size)
dataset = dataset.repeat(repeat_count) # Repeats dataset this # times
dataset = dataset.batch(batch_size) # Batch size to use
iterator = dataset.make_one_shot_iterator()
batch_features, batch_labels = iterator.get_next()
return batch_features, batch_labels
Then, the mini-batch can be computed as
next_batch = my_input_fn(file_path=PATH+'train1.csv',\
batch_size=batch_size,\
perform_shuffle=True) # return 512 random elements
Next, we define the feature columns are numeric
feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]
Thirdly, we create an estimator DNNClassifier:
classifier = tf.estimator.DNNClassifier(
feature_columns=feature_columns, # The input features to our model
hidden_units=[n_h], # One layer
n_classes=n_y,
model_dir=None)
Finally, the DNN is trained using the test csv file, while the evaluation is performed on the test file. Please change the repeat_count and steps to ensure that the training meets the required number of epochs in your code.
# train the DNN
classifier.train(
input_fn=lambda: my_input_fn(file_path=PATH+'train1.csv',\
perform_shuffle=True,\
repeat_count=1),\
steps=None)
# evaluate using the test csv file
evaluate_result = classifier.evaluate(
input_fn=lambda: my_input_fn(file_path=PATH+'test1.csv',\
perform_shuffle=False))
print("Evaluation results")
for key in evaluate_result:
print(" {}, was: {}".format(key, evaluate_result[key]))

Train model using queue Tensorflow

I designed a neural network in tensorflow for my regression problem by following and adapting the tensorflow tutorial. However, due to the structure of my problem (~300.000 data points and use of the costful FTRLOptimizer), my problem took too long to execute even with my 32 CPUs machine (I don't have GPUs).
According to this comment and a quick confirmation via htop, it appears that I have some single-threaded operations and it should be feed_dict.
Therefore, as adviced here, I tried to use queues for multi-threading my program.
I wrote a simple code file with queue to train a model as following:
import numpy as np
import tensorflow as tf
import threading
#Function for enqueueing in parallel my data
def enqueue_thread():
sess.run(enqueue_op, feed_dict={x_batch_enqueue: x, y_batch_enqueue: y})
#Set the number of couples (x, y) I use for "training" my model
BATCH_SIZE = 5
#Generate my data where y=x+1+little_noise
x = np.random.randn(10, 1).astype('float32')
y = x+1+np.random.randn(10, 1)/100
#Create the variables for my model y = x*W+b, then W and b should both converge to 1.
W = tf.get_variable('W', shape=[1, 1], dtype='float32')
b = tf.get_variable('b', shape=[1, 1], dtype='float32')
#Prepare the placeholdeers for enqueueing
x_batch_enqueue = tf.placeholder(tf.float32, shape=[None, 1])
y_batch_enqueue = tf.placeholder(tf.float32, shape=[None, 1])
#Create the queue
q = tf.RandomShuffleQueue(capacity=2**20, min_after_dequeue=BATCH_SIZE, dtypes=[tf.float32, tf.float32], seed=12, shapes=[[1], [1]])
#Enqueue operation
enqueue_op = q.enqueue_many([x_batch_enqueue, y_batch_enqueue])
#Dequeue operation
x_batch, y_batch = q.dequeue_many(BATCH_SIZE)
#Prediction with linear model + bias
y_pred=tf.add(tf.mul(x_batch, W), b)
#MAE cost function
cost = tf.reduce_mean(tf.abs(y_batch-y_pred))
learning_rate = 1e-3
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
available_threads = 1024
#Feed the queue
for i in range(available_threads):
threading.Thread(target=enqueue_thread).start()
#Train the model
for step in range(1000):
_, cost_step = sess.run([train_op, cost])
print(cost_step)
Wf=sess.run(W)
bf=sess.run(b)
This code doesn't work because each time I call x_batch, one y_batch is also dequeued and vice versa. Then, I do not compare the features with the corresponding "result".
Is there an easy way to avoid this problem ?

My mistake, everything worked fine.
I was misled because I estimated at each step of the algorithm my performance on different batches and also because my model was too complicated for a dummy one (I should had something like y=W*x or y=x+b).
Then, when I tried to print in the console, I exucuted several times sess.run on different variables and got obviously non-consistent results.

Nonetheless your problem is solved, wanted to show you a small inefficiency in your code. When you created your RandomShuffleQueue you specified capacity=2**20. In all the queues capacity:
The upper bound on the number of elements that may be stored in this
queue.
The queue will try to put as many elements as possible in the queue till it will hit this limit. All these elements are eating your RAM. If each element consists of only 1byte, your queue will eat 1Mb of your data. If you will have 10Kb images in your queue you will eat 10Gb of RAM.
This is very wasteful, especially because you never need so many elements in the queue. All you need to make sure is that your queue is never empty. So find a reasonable capacity of the queue and do not use huge numbers.

how to read batches in one hdf5 data file for training?

I have a hdf5 training dataset with size (21760, 1, 33, 33). 21760 is the whole number of training samples. I want to use the mini-batch training data with the size 128 to train the network.
I want to ask:
How to feed 128 mini-batch training data from the whole dataset with tensorflow each time?

If your data set is so large that it can't be imported into memory like keveman suggested, you can use the h5py object directly:
import h5py
import tensorflow as tf
data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128
sess = tf.Session()
train_op = # tf.something_useful()
input = # tf.placeholder or something
for i in range(0, data_size, batch_size):
current_data = data['data_set'][position:position+batch_size]
sess.run(train_op, feed_dict={input: current_data})
You can also run through a huge number of iterations and randomly select a batch if you want to:
import random
for i in range(iterations):
pos = random.randint(0, int(data_size/batch_size)-1) * batch_size
current_data = data['data_set'][pos:pos+batch_size]
sess.run(train_op, feed_dict={inputs=current_data})
Or sequentially:
for i in range(iterations):
pos = (i % int(data_size / batch_size)) * batch_size
current_data = data['data_set'][pos:pos+batch_size]
sess.run(train_op, feed_dict={inputs=current_data})
You probably want to write some more sophisticated code that goes through all data randomly, but keeps track of which batches have been used, so you don't use any batch more often than others. Once you've done a full run through the training set you enable all batches again and repeat.

You can read the hdf5 dataset into a numpy array, and feed slices of the numpy array to the TensorFlow model. Pseudo code like the following would work :
import numpy, h5py
f = h5py.File('somefile.h5','r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)
for i in range(0, 21760, 128):
sess.run(train_op, feed_dict={input:data_as_array[i:i+128, :, :, :]})

alkamen's approach seems logically right but I have not gotten any positive results using it. My best guess is this: Using code sample 1 above, in every iteration, the network trains afresh, forgetting all that has been learned in the previous loop. So if we are fetching at 30 samples or batches per iteration, at every loop/iteration, only 30 data samples are being used, then at the next loop, everything is overwritten.
Find below a screenshot of this approach
As can be seen, the loss and accuracy always start afresh. I will be happy if anyone could share a possible way around this, please.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.