I used a GridSearchCV pipeline for training several different image classifiers in scikit-learn. The pipeline has two stages, a scaler and a classifier. The training ran successfully, and this is what turned out to be the best hyper-parameter setting:
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier',
                 ExtraTreesClassifier(criterion='log_loss', max_depth=30,
                                      min_samples_leaf=5, min_samples_split=7,
                                      n_estimators=50, random_state=42))],
         verbose=True)
Now I want to use this trained pipeline to test it on a lot of images. Therefore, I'm reading my test images from disk (150x150 px) and storing them in an HDF5 file, where each image is represented as a row vector (150*150 = 22500 px), and all images are stacked upon each other in an np.array:
X_test.shape -> (n_imgs,22500)
Then I'm predicting the labels y_preds with
y_preds = model.predict(X_test)
So far, so good, as long as I'm only predicting some images.
But when n_imgs grows (e.g. to 1 million images), the data doesn't fit into memory anymore. So I googled around and found some solutions that, unfortunately, didn't work.
I'm currently trying to use multiprocessing.pool.Pool. Now my problem: I want to call multiprocessing's Pool.map(), like so:
n_cores = 10
with Pool(n_cores) as pool:
    results = pool.map(model.predict, X_test, chunksize=22500)
but all the workers immediately fail with the same error, without further details, no matter what chunksize I use.
So I tried reshaping X_test so that each image is stored as a square block instead:
X_reshaped = np.reshape(X_test,(n_imgs,150,150))
Now chunksize picks out whole images, but as my model has been trained on 1x22500 arrays, not square ones, I get the error:
ValueError: X_test has 150 features, but MinMaxScaler is expecting 22500 features as input.
I'd need to reshape the images back to 1x22500 before predict runs on the chunks. But that requires a function with several inputs, which pool.map() doesn't allow (it passes only one argument to the given function).
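(In hindsight, functools.partial is a standard way around this single-argument limit: bind the constant arguments up front so the mapped function takes just one. A minimal sketch, with a hypothetical predict_chunk helper:)

from functools import partial

def predict_chunk(model, X_chunk):
    # hypothetical two-argument helper: flatten each image back to 1x22500
    return model.predict(X_chunk.reshape(len(X_chunk), -1))

# pool.map() then only has to supply the second argument:
# results = pool.map(partial(predict_chunk, model), chunks)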
So I followed Jason Brownlee's post: Multiprocessing Pool map() Multiple Arguments
and packed several variables into a tuple, which I then unpacked in a wrapper function, before calling model.predict():
n_imgs = X_test.shape[0]
X_reshaped = np.reshape(X_test, (n_imgs, 150, 150))  # reshape each row to a 150x150px image
input_tuple = (model, X_reshaped)  # pack model and data into a tuple as input for the wrapper

with Pool(n_cores) as pool:
    results = pool.map(predict_wrapper, input_tuple, chunksize=22500)
and the wrapper function:
def predict_wrapper(input_tuple):
    model, X = input_tuple  # unpack the input tuple
    n_imgs = X.shape[0]
    X_mod = np.reshape(X, (n_imgs, 150*150))  # reshape back
    y_preds = model.predict(X_mod)
    return y_preds
But: input_tuple doesn't get unpacked correctly in the wrapper function:
As you can see: instead of assigning model to model and X_test to X, it splits my pipeline and assigns the scaler to model and the classifier to X. 🤯
So, long story short:
does anybody have a solution for how I can use my trained scikit-learn pipeline to do prediction on a plethora of images? I'm not bound to multiprocessing.pool.Pool, but I haven't found any other solution so far...
Many thanks in advance!
When you call pool.map() on a numpy array, the array is broken up along its first dimension.
So if you called pool.map(my_func, X_test), my_func would be called n_imgs times, each time with a 1-dimensional array of size 22500.
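You can verify this splitting behaviour with a toy sketch (the shapes here are assumptions matching your setup):

import numpy as np
from multiprocessing import Pool

def show_shape(row):
    return row.shape

if __name__ == '__main__':
    X = np.zeros((4, 22500))  # 4 toy "images", one per row
    with Pool(2) as pool:
        print(pool.map(show_shape, X))  # [(22500,), (22500,), (22500,), (22500,)]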
You have already mentioned that X_test is too big to fit into memory. It might make sense to have each subprocess read a range of images on its own from the database, process those, and send you back the results, rather than you sending the images to it.
import multiprocessing as mp

def process_image_ranges(image_range):
    start, end = image_range
    # read images start (inclusive) to end (exclusive) and process them
    ...

if __name__ == '__main__':
    image_count = 1_000_000  # or whatever the count is
    image_batching = 1024    # or whatever you want your batch size to be
    image_ranges = [(i, min(i + image_batching, image_count))
                    for i in range(0, image_count, image_batching)]
    with mp.Pool() as pool:
        result = pool.map(process_image_ranges, image_ranges)
OK, now I finally got this working! Thanks to Frank Yellin's answer here, I realized that my problem was the chunksize I passed explicitly. I thought that by doing so I could force pool.map() to take a certain number of images per chunk, but it behaved differently and complained about the wrong dimensions of the given chunks.
But inspired by Frank's answer, I instead defined the chunks before the call to pool.map() and then passed them to it. Now the images are passed chunkwise to the individual workers.
Seems I could not see the forest for the trees...
So in the end it looks like this:
from multiprocessing import Pool
import h5py
import joblib
import numpy as np

def main_prediction_batch():
    # --- load model ---
    model_URL = "<path to model.pkl>"
    with open(model_URL, 'rb') as model_file:
        model = joblib.load(model_file)
    # --- load image and label file ---
    hdf5_file_URL = "<path to hdf5 file with images and labels.hdf5>"
    with h5py.File(hdf5_file_URL, mode='r') as hdf5_file:
        X_test = hdf5_file["Images"][:]          # (n_imgs, 150*150)
        y_test = hdf5_file["Labels"].asstr()[:]  # (n_imgs,)

    n_imgs = X_test.shape[0]
    n_cores = 10
    image_batching = 10000  # or whatever you want your batch size to be;
                            # doesn't have to be a multiple of n_cores!
    chunk_ranges = [(i, min(i + image_batching, n_imgs))
                    for i in range(0, n_imgs, image_batching)]
    # define chunks of several images
    chunks = [X_test[start:end, :] for start, end in chunk_ranges]
    with Pool(n_cores) as pool:
        results = pool.map(model.predict, chunks)
    # stack the predictions to get a final row vector
    y_preds = np.hstack(results)  # can now be compared with y_test
    return y_preds
# --------------------------
# MAIN
# --------------------------
if __name__ == '__main__':
y_preds = main_prediction_batch()
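(If the test set ever grows so large that even X_test itself no longer fits in memory, Frank's suggestion of letting each worker read its own range could be combined with this approach. A sketch, assuming model and hdf5_file_URL are module-level globals so that forked workers inherit them:)

def predict_range(image_range):
    start, end = image_range
    with h5py.File(hdf5_file_URL, mode='r') as f:
        X_chunk = f["Images"][start:end, :]  # reads only this slice from disk
    return model.predict(X_chunk)

# with Pool(n_cores) as pool:
#     results = pool.map(predict_range, chunk_ranges)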
When I look at it now... it was complicated to describe, but the final solution was quite simple... thanks a lot for enlightening me!
I'm trying to generate batches for triplet loss where there are always pairs in the batch. The code below achieves this but it's very, very slow. In particular the choose_from_datasets method seems to be the source of the slowness.
Is there something wrong with my code that's creating the slowdown? Or is there a smarter way to do this?
I tried switching to sample_from_datasets instead, but this didn't help.
def batch_pairs3(dataset, num_classes, shuffle=True, num_classes_per_batch=10, num_images_per_class=2):
    # Isolate each class into its own dataset
    datasets = []
    for cl in range(num_classes):
        this_dataset = dataset.filter(lambda xx, yy: tf.equal(tf.reshape(yy, []), cl))
        if shuffle:
            this_dataset = this_dataset.shuffle(100)
        datasets += [this_dataset]

    # if shuffle:
    #     random.shuffle(datasets)

    selector = tf.contrib.data.Counter().map(
        lambda x: generator3(x, num_classes, num_classes_per_batch, num_images_per_class))
    selector = selector.apply(tf.contrib.data.unbatch())
    dataset = tf.contrib.data.choose_from_datasets(datasets, selector)

    # Batch
    batch_size = num_classes_per_batch * num_images_per_class
    return dataset.batch(batch_size)
The tf.data pipeline does not handle applications like this well, where you process your data on the fly by iterating through it, unless you can independently map every data point through such processing. For what you are doing, you may be better off pre-processing and storing your data in something like TFRecord format, and then using the data pipeline to read it in an optimized way.
Refer to this official example, which works on a similar problem involving triplet loss: Time Contrastive Networks, the data provider.
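A minimal sketch of that pre-processing idea: compute the pair-per-class batch order offline with plain Python, then write the examples in exactly that order to a TFRecord file so that the pipeline only has to read sequentially (images, labels and batch_order are assumed names, not from the original code):

import tensorflow as tf

def write_batch_order(images, labels, batch_order, path="triplet_batches.tfrecord"):
    # batch_order: a precomputed index list realizing the
    # num_classes_per_batch / num_images_per_class sampling scheme
    with tf.python_io.TFRecordWriter(path) as writer:
        for idx in batch_order:
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[images[idx].tostring()])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(labels[idx])])),
            }))
            writer.write(example.SerializeToString())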
I am new to Tensorflow and deep learning, and I am struggling with the Dataset class. I tried a lot of things and I can't find a good solution.
What I am trying
I have a large number of images (500k+) to train my DNN with. This is a denoising autoencoder, so I have a pair for each image. I am using the Dataset class of TF to manage the data, but I think I'm using it really badly.
Here is how I load the filenames in a dataset:
class Data:
    def __init__(self, in_path, out_path):
        self.nb_images = 512
        self.test_ratio = 0.2
        self.batch_size = 8

        # load filenames in input and outputs
        inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)
        self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
        self.size_test = int(self.nb_images * self.test_ratio)

        # split arrays in training / validation
        test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
        test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)

        # transform array to tf.data.Dataset
        self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
        self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))
I have a function that is called at each epoch to prepare the dataset. It shuffles the filenames, transforms the filenames into images, and batches the data.
def get_batched_data(self, seed, batch_size):
    nb_batch = int(self.size_training / batch_size)

    def img_to_tensor(path_in, path_out):
        img_string_in = tf.read_file(path_in)
        img_string_out = tf.read_file(path_out)
        im_in = tf.image.decode_jpeg(img_string_in, channels=1)
        im_out = tf.image.decode_jpeg(img_string_out, channels=1)
        return im_in, im_out

    t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
    t_datas = t_datas.map(img_to_tensor)
    t_datas = t_datas.batch(batch_size)
    return t_datas
Now during the training, at each epoch we call the get_batched_data function, make an iterator, and run it for each batch, then feed the array to the optimizer operation.
for epoch in range(nb_epoch):
    sess_iter_in = tf.Session()
    sess_iter_out = tf.Session()

    batched_train = data.get_batched_data(epoch)
    iterator_train = batched_train.make_one_shot_iterator()
    in_data, out_data = iterator_train.get_next()

    total_batch = int(data.size_training / batch_size)
    for batch in range(total_batch):
        print(f"{batch + 1} / {total_batch}")
        in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
        out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
        sess.run(optimizer, feed_dict={inputs: in_images,
                                       outputs: out_images})
What do I need?
I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.
Questions and problems
First question: am I using the Dataset class in a good way? I've seen very different things on the internet; for example, in this blog post the dataset is used with a placeholder that is fed with the data during learning. It seems strange, because the data are all in an array, and therefore already loaded in memory. I don't see the point of using tf.data.Dataset in that case.
I found a solution using repeat(epoch) on the dataset, like this, but the shuffle will not be different for each epoch in that case.
The second problem with my implementation is that I get an OutOfRangeError in some cases. With a small amount of data (512, as in the example) it works fine, but with more data the error occurs. I thought it was a miscalculation of the number of batches due to bad rounding, or because the last batch has fewer examples, but it happens at batch 32 out of 115... Is there any way to know the number of batches created after a batch(n) call on a dataset?
Sorry for this loooonng question, but I've been struggling with this for a few days.
As far as I know, the official Performance Guideline is the best teaching material for building input pipelines.
I want to shuffle the dataset in a different way for each epoch.
Using shuffle() and repeat(), you can get a different shuffle pattern for each epoch. You can confirm this with the following code:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)

iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(x))
You can also use tf.contrib.data.shuffle_and_repeat, as mentioned on the official page above.
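For reference, the fused version of the snippet above would look like this (contrib-era API):

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=4, count=3))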
There are some problems in your code outside of creating the data pipeline: you confuse graph construction with graph execution. You repeatedly create the data input pipeline, so there are as many redundant input pipelines as there are epochs. You can observe the redundant pipelines in Tensorboard.
You should place your graph construction code outside of the loop, as in the following (pseudo) code:
batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

for epoch in range(nb_epoch):
    # reset the iterator's state
    sess.run(iterator.initializer)
    try:
        while True:
            in_images = sess.run(in_data).reshape((-1, 64, 64, 1))
            out_images = sess.run(out_data).reshape((-1, 64, 64, 1))
            sess.run(optimizer, feed_dict={inputs: in_images,
                                           outputs: out_images})
    except tf.errors.OutOfRangeError:
        pass
Moreover, there is some minor inefficiency in your code. You loaded a list of file paths with from_tensor_slices(), so the list was embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for details.)
You would be better off using prefetch, and decreasing the number of sess.run calls by combining your graph.
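A sketch of those two points combined (assuming the model is rebuilt on the iterator tensors instead of placeholders):

batched_train = data.get_batched_data().prefetch(1)  # prepare the next batch while training
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()
# build the model and `optimizer` directly on in_data/out_data,
# so each training step becomes a single sess.run(optimizer) call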
I've been trying to feed my own images into some TensorFlow code to see how it would react to my own images instead of the MNIST set. I've been able to import images (I think) into TensorFlow, but I have two placeholders that should get my image data and label data. I tried to use feed_dict (which still seems right to me) to be able to use my data in the rest of my code, but it won't accept any data I feed it. I know I can't feed it a Tensor, and apparently not a batch either, but the only way I can think of to make this work is to feed it a list. I saw that feed_dict can take numpy arrays, but I'm not sure how I should approach converting my data to a numpy array.
I'm new to TensorFlow and Python, so please forgive any mistakes I made; I'm still learning how everything works.
with tf.name_scope('Image_Data_Input'):
    def read_labeled_image_list(image_list_file):
        print('read_labeled_image_list function opened')
        f = open(image_list_file, 'r')
        print('image_list_file opened')
        filenames = []
        labels = []
        print('Arrays formed')
        for line in f:
            filename, label = line[:-1].split(' ')
            filenames.append(filename)
            labels.append(label)
        print('Lines deconstructed')
        return filenames, labels

    def read_image(input_queue):
        label = input_queue[1]
        file_contents = tf.read_file(input_queue[0])
        decoded_image = tf.image.decode_jpeg(file_contents, channels=3)
        print('Image decoded from JPEG')
        decoded_image.set_shape([2560, 1440, 3])
        decoded_image = tf.image.resize_images(decoded_image, [128, 128])
        return decoded_image, label

    image_list, label_list = read_labeled_image_list(image_list_file)

    images = tf.convert_to_tensor(image_list, dtype=tf.string)
    labels = tf.convert_to_tensor(label_list, dtype=tf.string)

    input_queue = tf.train.slice_input_producer([images, labels], num_epochs=None, shuffle=True)
    image, label = read_image(input_queue)
The indentation behaved a little weirdly when I pasted my code, so I'm not sure everything is properly placed.
Well, now I have these placeholders:
with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    y_ = tf.placeholder(tf.float32, shape=[None, 10])
And I've seen code routing data to those placeholders this way:
batch_x, batch_y = tf.train.batch([image, label], batchsize)
#_, summary = sess.run([train_writer, summary_op], feed_dict={x: batch_x, y_: batch_y})
But I can't seem to make this work.
Does anyone have any idea how I could make this work?
Again sorry for any mistakes and thanks in advance.
As the error says, you can't feed tensors into a placeholder; batch_x and batch_y are tensors. The new tf.data.Dataset API is the preferred way to get data into a model (guide here). I think Dataset.from_tensor_slices would require minimal rewriting. Short of that, build the graph so that batch_x and batch_y flow directly into the model you're using. Then you don't need placeholders at all.
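A rough sketch of that rewrite, reusing your image_list/label_list and the decoding logic from read_image (a starting point, not a drop-in replacement):

dataset = tf.data.Dataset.from_tensor_slices((image_list, label_list))

def load_image(path, label):
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [128, 128])
    return image, label

dataset = dataset.shuffle(100).map(load_image).batch(batch_size)
image_batch, label_batch = dataset.make_one_shot_iterator().get_next()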
I don't recommend this, but for completeness I want to mention another method. You could:
numpy_batch_x, numpy_batch_y = sess.run([batch_x, batch_y])
_, summary = sess.run([train_writer, summary_op],
                      feed_dict={x: numpy_batch_x, y_: numpy_batch_y})
PS: If train_writer is a tf.summary.FileWriter, I think you want to:
summary = sess.run(summary_op, ...)
train_writer.add_summary(summary)
EDIT: In response to confusion on the dataset API, I am going to show how to handle this with a Dataset. I am going to use TFRecords. It may not be the simplest solution, but it's one way.
import numpy as np
import tensorflow as tf
from scipy.misc import imread  # There are others that would work here.
from cv2 import resize  # Again, others to choose from.

def read_labeled_image_list(...):
    # See question
    return filenames, labels

def make_tfr(tfr_dir="/YOUR/PREFERRED/TFR/DIR"):

    def _int64_list_feature(a_list):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=a_list))

    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    writer = tf.python_io.TFRecordWriter(tfr_dir)
    all_image_paths, all_labels = read_labeled_image_list(...)
    for path, label in zip(all_image_paths, all_labels):
        disk_im = imread(path)
        resized_im = resize(disk_im, (128, 128))  # the function imported above
        raw_im = resized_im.tostring()
        # Construct an example proto-obj,
        example = tf.train.Example(
            # which wants a Features proto-obj,
            features=tf.train.Features(
                # which wants a dict.
                feature={
                    'image_raw': _bytes_feature(raw_im),
                    'label': _int64_list_feature([int(label)])  # wrap the scalar in a list
                }))  # close your example object
        serialized = example.SerializeToString()
        writer.write(serialized)

make_tfr()  # After you've done it successfully once, comment out.
def input_pipeline(batch_size, epochs, tfr_dir="/YOUR/PREFERRED/TFR/DIR"):
    # with tf.name_scope("Input"): maybe you like to scope as much as I do?
    dataset = tf.data.TFRecordDataset(tfr_dir)

    def parse_protocol_buffer(example_proto):
        features = {'image_raw': tf.FixedLenFeature((), tf.string),
                    'label': tf.FixedLenFeature((), tf.int64)}
        parsed_features = tf.parse_single_example(
            example_proto, features)
        return parsed_features['image_raw'], parsed_features['label']

    dataset = dataset.map(parse_protocol_buffer)

    def convert_parsed_proto_to_input(image_string, label):
        image_decoded = tf.decode_raw(image_string, tf.uint8)
        image_resized = tf.reshape(image_decoded, (128, 128, 3))
        image = tf.cast(image_resized, tf.float32)
        # I usually put my image elements in [-1, 1]
        return image * (2. / 255) - 1, label

    dataset = dataset.map(convert_parsed_proto_to_input)
    dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.repeat(batch_size * epochs)
    dataset = dataset.batch(batch_size)  # batch so the model sees (batch_size, 128, 128, 3)
    return dataset
def model(image_tensor):
    ...
    # However you want to do this.
    return predictions

def loss(predictions, labels):
    ...
    return some_loss

def train(some_loss):
    ...
    return train_op

batch_size = 50
iterations = 10000

train_dataset = input_pipeline(batch_size, iterations)
train_iterator = train_dataset.make_initializable_iterator()
image, label = train_iterator.get_next()

predictions = model(image)
loss_op = loss(predictions, label)  # (predictions, labels) order, matching the signature
train_op = train(loss_op)

summary_op = tf.summary.merge_all()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter("/YOUR/LOGDIR", sess.graph)
    sess.run(train_iterator.initializer)
    for epoch in range(iterations + 1):
        _, summary = sess.run([train_op, summary_op])
        train_writer.add_summary(summary, epoch)
You say you're new to TensorFlow. I hope this doesn't intimidate you. I was new to TensorFlow not long ago, and it was a pain to figure out how to make a good input pipeline. Learning TFRecords seemed impossible. You also say you're new to Python, so I'll warn you that cv2 has a reputation for difficult installs. You may want to look into other ways to resize an image (though I'd advise against PIL, which is probably even more confusing and difficult at first).
Basically, I'm posting this code because the documentation on writing TFRecords is confusing (Exhibit A vs. a blog post that helped me figure it out), but TFRecords is the way I know best to make a Dataset. Even if you don't go the TFRecords route, this could help you with the map function for datasets, e.g. notice how I pass label through convert... even though it's not used. Making a dataset (specifically from TFRecords) takes a lot of lines of code, but Dataset is the preferred way to construct an input pipeline and it's designed to replace the old queue method you're using.
As a side note, the purpose of the queue strategy was to read data from memory directly into the graph without placeholders. Placeholders are slow and memory-intensive compared to the queue strategy, but Datasets are even better when implemented correctly.
I see in your comment that you want to see the placeholder name scope get connected to your graph. The dataset way, you'll see some dataset nodes on the graph. If you scope them with what I commented out, it should be apparent that everything's hooked up right. Your way, you're actually adding this queue-and-preprocess structure onto the graph. Since you'll have to de-tensor-ify the images to pass them into a placeholder, it won't be apparent that your data is flowing correctly.
Now, as I mentioned in the original post, you can just pass batch_x and batch_y into your model and forget the placeholder and dataset altogether. You'll see everything hooked up right from the preprocessing stage, if the queue is implemented right. Still, your images are large before reshaping them. Reading them will be an intensive task. I'd recommend going the hard route of learning to use Datasets and TFRecords.
I hope this helps you implement a Dataset in your code. I hope this helps you get TensorBoard running. And I hope this helps you figure out TFRecords if you decide to go that route.
PS: On the topic of using TensorBoard to validate that the model is working, you could attach a tf.summary.image(img) as the first line of model(...). Then check out the image dashboard and see if it's what you expect.
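For instance (the summary name here is made up):

def model(image_tensor):
    # visually confirm the pipeline output in TensorBoard's image dashboard
    tf.summary.image('input_images', image_tensor, max_outputs=3)
    # ... rest of the model as before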
EDIT 2: example = tf.train.Example(features=tf.train.Features(feature={}))
I have a huge training CSV file (709M) and a large testing CSV file (125M) that I want to send into a DNNClassifier in the context of using the high-level Tensorflow API.
It appears that the input_fn param accepted by fit and evaluate must hold all feature and label data in memory, but I currently would like to run this on my local machine, and thus expect it to run out of memory rather quickly if I read these files into memory and then process them.
I skimmed the doc on streamed-reading of data, but the sample code for reading CSVs appears to be for the low-level Tensorflow API.
And - if you'll forgive a bit of whining - it seems overly-complex for the trivial use case of sending well-prepared files of training and test data into an Estimator ... although, perhaps that level of complexity is actually required for training and testing large volumes of data in Tensorflow?
In any case, I'd really appreciate an example of using that approach with the high-level API, if it's even possible, which I'm beginning to doubt.
After poking around, I did manage to find DNNClassifier#partial_fit, and will attempt to use it for training.
Examples of how to use this method would save me some time, though hopefully I'll stumble into the correct usage in the next few hours.
However, there doesn't seem to be a corresponding DNNClassifier#partial_evaluate ... though I suspect that I could break-up the testing data into smaller pieces and run DNNClassifier#evaluate successively on each batch, which might actually be a great way to do it since I could segment the testing data into cohorts, and thereby obtain per-cohort accuracy.
==== Update ====
Short version:
DomJack's recommendation should be the accepted answer.
However, my Mac's 16GB of RAM is enough for it to hold the entire 709MB training data set in memory without crashing. So, while I will use the Datasets feature when I eventually deploy the app, I'm not using it yet for local dev work.
Longer version:
I started by using the partial_fit API as described above, but upon every use it emitted a warning.
So, I went to look at the source for the method here, and discovered that its complete implementation looks like this:
logging.warning('The current implementation of partial_fit is not optimized'
                ' for use in a loop. Consider using fit() instead.')
return self.fit(x=x, y=y, input_fn=input_fn, steps=steps,
                batch_size=batch_size, monitors=monitors)
... which reminds me of this scene from Hitchhiker's Guide:
Arthur Dent: What happens if I press this button?
Ford Prefect: I wouldn't-
Arthur Dent: Oh.
Ford Prefect: What happened?
Arthur Dent: A sign lit up, saying 'Please do not press this button again'.
Which is to say: partial_fit seems to exist for the sole purpose of telling you not to use it.
Furthermore, the model generated by using partial_fit iteratively on training file chunks was much smaller than the one generated by using fit on the whole training file, which strongly suggests that only the last partial_fit training chunk actually "took".
Check out the tf.data.Dataset API. There are a number of ways to create a dataset. I'll outline four - but you'll only have to implement one.
I assume each row of your csv files is n_features float values followed by a single int value.
Creating a tf.data.Dataset
Wrap a python generator with Dataset.from_generator
The easiest way to get started is to wrap a native python generator. This can have performance issues, but may be fine for your purposes.
def read_csv(filename):
    with open(filename, 'r') as f:
        for line in f.readlines():
            record = line.rstrip().split(',')
            features = [float(n) for n in record[:-1]]
            label = int(record[-1])
            yield features, label

def get_dataset():
    filename = 'my_train_dataset.csv'
    generator = lambda: read_csv(filename)
    return tf.data.Dataset.from_generator(
        generator, (tf.float32, tf.int32), ((n_features,), ()))
This approach is highly versatile and allows you to test your generator function (read_csv) independently of TensorFlow.
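For example, a quick sanity check involves no TensorFlow at all:

for features, label in read_csv('my_train_dataset.csv'):
    print(len(features), label)  # inspect the first record
    break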
Use Tensorflow Datasets API
Supporting tensorflow versions 1.12+, tensorflow datasets is my new favourite way of creating datasets. It automatically serializes your data, collects statistics and makes other meta-data available to you via info and builder objects. It can also handle automatic downloading and extracting, making collaboration simple.
import tensorflow_datasets as tfds

class MyCsvDatasetBuilder(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version("0.0.1")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            description=("My dataset"),
            features=tfds.features.FeaturesDict({
                "features": tfds.features.Tensor(
                    shape=(FEATURE_SIZE,), dtype=tf.float32),
                "label": tfds.features.ClassLabel(names=CLASS_NAMES),
                "index": tfds.features.Tensor(shape=(), dtype=tf.float32)
            }),
            supervised_keys=("features", "label"),
        )

    def _split_generators(self, dl_manager):
        paths = dict(
            train='/path/to/train.csv',
            test='/path/to/test.csv',
        )
        # better yet, if the csv files were originally downloaded, use
        # urls = dict(train=train_url, test=test_url)
        # paths = dl_manager.download(urls)
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                num_shards=10,
                gen_kwargs=dict(csv_path=paths['train'])),
            tfds.core.SplitGenerator(
                name=tfds.Split.TEST,
                num_shards=2,
                gen_kwargs=dict(csv_path=paths['test']))
        ]

    def _generate_examples(self, csv_path):
        with open(csv_path, 'r') as f:
            for i, line in enumerate(f.readlines()):
                record = line.rstrip().split(',')
                features = [float(n) for n in record[:-1]]
                label = int(record[-1])
                yield dict(features=features, label=label, index=i)
Usage:
builder = MyCsvDatasetBuilder()
builder.download_and_prepare() # will only take time to run first time
# as_supervised makes output (features, label) - good for model.fit
datasets = builder.as_dataset(as_supervised=True)
train_ds = datasets['train']
test_ds = datasets['test']
Wrap an index-based python function
One downside of the above is that shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).
def get_record(i):
    # load the ith record using standard python, return numpy arrays
    return features, labels

def get_inputs(batch_size, is_training):

    def tf_map_fn(index):
        features, labels = tf.py_func(
            get_record, (index,), (tf.float32, tf.int32), stateful=False)
        features.set_shape((n_features,))
        labels.set_shape(())
        # do data augmentation here
        return features, labels

    epoch_size = get_epoch_size()
    dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))
    if is_training:
        dataset = dataset.repeat().shuffle(epoch_size)
    dataset = dataset.map(tf_map_fn, num_parallel_calls=8)
    dataset = dataset.batch(batch_size)
    # prefetch data to CPU while GPU processes previous batch
    dataset = dataset.prefetch(1)
    # Also possible
    # dataset = dataset.apply(
    #     tf.contrib.data.prefetch_to_device('/gpu:0'))
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels
In short, we create a dataset just of the record indices (or of any small record IDs which we can load entirely into memory). We then do the shuffling/repeating operations on this minimal dataset, and map the indices to the actual data via tf.data.Dataset.map and tf.py_func. See the Using with Estimators and Testing in isolation sections below for usage. Note that this requires your data to be accessible by row, so you may need to convert from csv to some other format, as sketched below.
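One hypothetical way to make the rows randomly accessible is to convert the csv once to a .npy file and memory-map it:

import numpy as np

# one-off conversion
data = np.loadtxt('my_train_dataset.csv', delimiter=',', dtype=np.float32)
np.save('my_train_dataset.npy', data)

# at load time the file is memory-mapped, so get_record(i) touches only row i
mmap_data = np.load('my_train_dataset.npy', mmap_mode='r')

def get_record(i):
    row = mmap_data[i]
    return row[:-1], np.int32(row[-1])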
TextLineDataset
You can also read the csv file directly using a tf.data.TextLineDataset.
def get_record_defaults():
    zf = tf.zeros(shape=(1,), dtype=tf.float32)
    zi = tf.ones(shape=(1,), dtype=tf.int32)
    return [zf]*n_features + [zi]

def parse_row(tf_string):
    data = tf.decode_csv(
        tf.expand_dims(tf_string, axis=0), get_record_defaults())
    features = data[:-1]
    features = tf.stack(features, axis=-1)
    label = data[-1]
    features = tf.squeeze(features, axis=0)
    label = tf.squeeze(label, axis=0)
    return features, label

def get_dataset():
    dataset = tf.data.TextLineDataset(['data.csv'])
    return dataset.map(parse_row, num_parallel_calls=8)
The parse_row function is a little convoluted since tf.decode_csv expects a batch. You can make it slightly simpler if you batch the dataset before parsing.
def parse_batch(tf_string):
    data = tf.decode_csv(tf_string, get_record_defaults())
    features = data[:-1]
    labels = data[-1]
    features = tf.stack(features, axis=-1)
    return features, labels

def get_batched_dataset(batch_size):
    dataset = tf.data.TextLineDataset(['data.csv'])
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parse_batch)
    return dataset
TFRecordDataset
Alternatively you can convert the csv files to TFRecord files and use a TFRecordDataset. There's a thorough tutorial here.
Step 1: Convert the csv data to TFRecords data. Example code below (see read_csv from the from_generator example above).
with tf.python_io.TFRecordWriter("my_train_dataset.tfrecords") as writer:
    for features, label in read_csv('my_train_dataset.csv'):
        example = tf.train.Example()
        example.features.feature[
            "features"].float_list.value.extend(features)
        example.features.feature[
            "label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
This only needs to be run once.
Step 2: Write a dataset that decodes these record files.
def parse_function(example_proto):
    features = {
        'features': tf.FixedLenFeature((n_features,), tf.float32),
        'label': tf.FixedLenFeature((), tf.int64)
    }
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features['features'], parsed_features['label']

def get_dataset():
    dataset = tf.data.TFRecordDataset(['data.tfrecords'])
    dataset = dataset.map(parse_function)
    return dataset
Using the dataset with estimators
def get_inputs(batch_size, shuffle_size):
    dataset = get_dataset()  # one of the above implementations
    dataset = dataset.shuffle(shuffle_size)
    dataset = dataset.repeat()  # repeat indefinitely
    dataset = dataset.batch(batch_size)
    # prefetch data to CPU while GPU processes previous batch
    dataset = dataset.prefetch(1)
    # Also possible
    # dataset = dataset.apply(
    #     tf.contrib.data.prefetch_to_device('/gpu:0'))
    features, label = dataset.make_one_shot_iterator().get_next()
    return features, label  # an input_fn must return the input tensors

estimator.train(lambda: get_inputs(32, 1000), max_steps=1e7)
Testing the dataset in isolation
I'd strongly encourage you to test your dataset independently of your estimator. Using the above get_inputs, it should be as simple as
batch_size = 4
shuffle_size = 100
features, labels = get_inputs(batch_size, shuffle_size)
with tf.Session() as sess:
    f_data, l_data = sess.run([features, labels])
    print(f_data, l_data)  # or some better visualization function
Performance
Assuming you're using a GPU to run your network, unless each row of your csv file is enormous and your network is tiny, you probably won't notice a difference in performance. This is because the Estimator implementation forces data loading/preprocessing to be performed on the CPU, and prefetch means the next batch can be prepared on the CPU while the current batch is training on the GPU. The only exception to this is if you have a massive shuffle size on a dataset with a large amount of data per record, which will take some time to load a number of examples initially before anything runs through the GPU.
I agree with DomJack about using the Dataset API, except for the need to read the whole csv file and then convert to TFRecord. I am hereby proposing to employ TextLineDataset, a sub-class of the Dataset API, to directly load data into a TensorFlow program. An intuitive tutorial can be found here.
The code below is used for the MNIST classification problem for illustration and will, hopefully, answer the question of the OP. The csv file has 784 columns, and the number of classes is 10. The classifier I used in this example is a 1-hidden-layer neural network with 16 ReLU units.
Firstly, load libraries and define some constants:
# load libraries
import tensorflow as tf
import os
# some constants
n_x = 784
n_h = 16
n_y = 10
# path to the folder containing the train and test csv files
# You only need to change PATH, rest is platform independent
PATH = os.getcwd() + '/'
# create a list of feature names
feature_names = ['pixel' + str(i) for i in range(n_x)]
Secondly, we create an input function that reads a file using the Dataset API and then provides the results to the Estimator API. The return value must be a two-element tuple organized as follows: the first element must be a dict in which each input feature is a key, mapping to a list of values for the training batch; the second element is a list of labels for the training batch.
def my_input_fn(file_path, batch_size=32, buffer_size=256,
                perform_shuffle=False, repeat_count=1):
    '''
    Args:
        - file_path: the path of the input file
        - perform_shuffle: whether the data is shuffled or not
        - repeat_count: the number of times to iterate over the records in the dataset.
          For example, if we specify 1, then each record is read once.
          If we specify None, iteration will continue forever.
    Output is a two-element tuple organized as follows:
        - The first element must be a dict in which each input feature is a key,
          mapping to a list of values for the training batch.
        - The second element is a list of labels for the training batch.
    '''
    def decode_csv(line):
        record_defaults = [[0.]]*n_x  # n_x features
        record_defaults.insert(0, [0])  # the first element is the label (int)
        parsed_line = tf.decode_csv(records=line,
                                    record_defaults=record_defaults)
        label = parsed_line[0]  # first element is the label
        del parsed_line[0]  # delete the first element
        features = parsed_line  # everything but the first element are the features
        d = dict(zip(feature_names, features)), label
        return d

    dataset = (tf.data.TextLineDataset(file_path)  # read the text file
               .skip(1)  # skip the header row
               .map(decode_csv))  # transform each element by applying decode_csv
    if perform_shuffle:
        # randomizes input using a window of 256 elements (read into memory)
        dataset = dataset.shuffle(buffer_size=buffer_size)
    dataset = dataset.repeat(repeat_count)  # repeats the dataset this many times
    dataset = dataset.batch(batch_size)  # batch size to use
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
Then, the mini-batch can be computed as
next_batch = my_input_fn(file_path=PATH+'train1.csv',
                         batch_size=batch_size,
                         perform_shuffle=True)  # returns a random mini-batch
Next, we define the feature columns as numeric:
feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]
Thirdly, we create an estimator DNNClassifier:
classifier = tf.estimator.DNNClassifier(
feature_columns=feature_columns, # The input features to our model
hidden_units=[n_h], # One layer
n_classes=n_y,
model_dir=None)
Finally, the DNN is trained using the train csv file, while the evaluation is performed on the test csv file. Please change the repeat_count and steps to ensure that the training meets the required number of epochs in your code.
# train the DNN
classifier.train(
    input_fn=lambda: my_input_fn(file_path=PATH+'train1.csv',
                                 perform_shuffle=True,
                                 repeat_count=1),
    steps=None)

# evaluate using the test csv file
evaluate_result = classifier.evaluate(
    input_fn=lambda: my_input_fn(file_path=PATH+'test1.csv',
                                 perform_shuffle=False))
print("Evaluation results")
for key in evaluate_result:
    print("   {}, was: {}".format(key, evaluate_result[key]))