I am currently implementing a machine learning model which uses a rather heavy representation of data.
My dataset is composed of images. Each image is encoded into a (224, 224, 103) array, which makes the entire dataset very heavy. I store these arrays on disk and load them during training.
Right now I use mini-batches of 8 images and load the .npy files for those 8 images from disk throughout the training process. This is slow, but it works.
Is there a more efficient way to do it using Keras/Tensorflow (which is what I'm using to code my model)?
I unfortunately couldn't find much about a dataloader that would allow me to do this.
Thanks in advance.
You have several options to do this.
I will assume that the transformations you apply to the images to get the final (224, 224, 103) array are very expensive, and that it's not desirable to do the pre-processing at data-loading time. If this is not the case, you might benefit from reading the tutorial relevant to image processing.
I suggest you use a Python generator to read the data and tf.data to create a pipeline that feeds these .npy files to your model. The basic idea is simple: you use a wrapper (tf.data.Dataset.from_generator) to ingest data from a generator that reads the files as needed. The relevant documentation and examples are here.
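A minimal sketch of that idea, assuming TensorFlow 2.4 or later and one integer label per file (the question does not say what the targets look like). The names npy_example_generator and make_dataset, and the paths and labels arguments, are made up for illustration.

```python
import numpy as np


def npy_example_generator(paths, labels):
    """Yield one (array, label) pair at a time, reading each .npy file lazily."""
    for path, label in zip(paths, labels):
        # Only the file for the current example is read from disk.
        yield np.load(path).astype(np.float32), label


def make_dataset(paths, labels, shape=(224, 224, 103), batch_size=8):
    """Wrap the generator in a tf.data pipeline (assumes TensorFlow >= 2.4)."""
    # Imported here so the generator above can be used without TensorFlow.
    import tensorflow as tf

    dataset = tf.data.Dataset.from_generator(
        lambda: npy_example_generator(paths, labels),
        output_signature=(
            tf.TensorSpec(shape=shape, dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.int32),  # assumed: integer labels
        ),
    )
    # prefetch lets the next batch load while the current one is being trained on.
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

The resulting dataset can be passed directly to model.fit in place of in-memory arrays.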
Now, once you get that working, I think it would be a good idea to optimize your pipeline, especially if you're planning to train on multiple GPUs or multiple machines.
I have been studying transfer learning with models like inception_v4 and inception_resnet_v2. I found some projects that use bottlenecks and some that use tfrecords to store the training images. When retraining the inception_v4 model on the same data with those two methods, bottlenecks gave 95% accuracy and tfrecords only gave 75%. However, all the new projects seem to use tfrecords for data and the .ckpt format to store the model. Can someone explain what the difference is and which one is better in which case?
If you are working with large datasets, using a binary file format to store your data can have a significant impact on the performance of your import pipeline, and therefore on the training time of your model.
TFRecords can also store sequence data, for example a time series. In addition, they make it easy to combine multiple datasets, and they integrate seamlessly with the data import and preprocessing functionality provided by the library.
For more information about TFRecords, please refer to this link.
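To make the "binary file format" point concrete, here is a rough sketch of writing and reading a TFRecord file, assuming TensorFlow 2.x, float32 image arrays of a known fixed shape, and integer labels. The feature names "image" and "label" and all function names are my own choices, not anything mandated by the format.

```python
import numpy as np


def array_to_bytes(array):
    """Serialize an array to raw float32 bytes (the shape is stored separately)."""
    return np.asarray(array, dtype=np.float32).tobytes()


def bytes_to_array(raw, shape):
    """Inverse of array_to_bytes; tf.io.decode_raw plays this role inside TF."""
    return np.frombuffer(raw, dtype=np.float32).reshape(shape)


def write_tfrecord(path, arrays, labels):
    """Write (array, label) pairs as tf.train.Example records."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x

    with tf.io.TFRecordWriter(path) as writer:
        for array, label in zip(arrays, labels):
            feature = {
                "image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[array_to_bytes(array)])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def read_tfrecord(path, shape):
    """Return a tf.data.Dataset of (image, label) pairs parsed from the file."""
    import tensorflow as tf

    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(record):
        parsed = tf.io.parse_single_example(record, spec)
        image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.float32), shape)
        return image, parsed["label"]

    return tf.data.TFRecordDataset(path).map(parse)
```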
I am training a model that takes a video as input and makes per-frame predictions (object tracking, to be specific).
TensorFlow provides the data.Dataset.prefetch() operation to load images asynchronously during training. This works well because my training sequences are only a few frames long. However, I would also like to load images asynchronously during testing, where the sequences can be thousands of frames long.
Is there any convenient way to achieve this in TensorFlow? I suppose I could define each video as a Dataset object? This seems kind of ugly because I need to create one operation in the graph for each video in my dataset?
Thank you!
I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N images becomes 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each image, only use the crop_0 and crop_1 images and average the softmax outputs for classification.
My question is - what is the best and most efficient way of training on such a dataset? I'm willing to change my approach if this will make training more efficient, and it seems that the simplest thing to do would have been to have separate tfrecords files for each version (crop & flip/flop images) and interleave the files into one dataset, but I do not want to have a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach that you'll come to loathe later (I did it this way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write the results of your preprocessing steps to disk. Do that kind of preprocessing on the CPU while you train. The tensorflow Dataset preprocessing pipeline makes this easy and modular, and provides the parallelization you need to take advantage of multiple cores at no extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to tensorflow using feed_dict. Instead, tensorflow will just invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply some set of transformations and use the num_parallel_calls argument to distribute the operations across multiple cores. If your preprocessing can be done in tensorflow code, great; if not, you'll need to use tf.py_func to call python preprocessing code.
The guide I linked to above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". This will allow you to get a string ID from each of the 2 datasets (train and test) and pass that string to tensorflow via feed_dict, indicating which of the two datasets tensorflow should pull samples from.
So here is my question:
I want to make my very own dataset using a motion capture camera system to get the ground truth poses and one RGB camera to get images, and then using this as input to my network, train/test a convNet.
I have looked around at other datasets for tensorflow, caffe and Matlab. I have viewed the MNIST, Cats/Dogs, Iris, LSP, HumanEva, HumanEva3.6, FLIC, etc. datasets and have tried to understand their data as best as I can. I have also looked at people online trying to make their own datasets. The one thing is that, usually, when you use their datasets as an example, you download a .txt file that already contains the labels.
If anyone could please explain to me how to use the image data with the labels to feed it into my network, it would be a tremendous help. I have written code before using tensorflow to input a .txt file into the network and get the correct predicted output. But my brain is missing something to understand how to input an image with a label. How do I create that dataset?
Your input images and your labels are two separate variables. You will be writing separate bits of code to import them. The videos typically need to be converted to JPG files (it's a royal pain to read video files directly, mostly because you can't randomly skip around the video easily).
Probably the easiest way to structure your data is via a CSV that contains filename, poseinfoA, poseinfoB, etc., where the filename refers to the JPG image on disk.
To get started on the basics, I suggest looking at the Aymeric Damien tutorial examples. I haven't found tutorials anywhere that were as clear and concise.
https://github.com/aymericdamien/TensorFlow-Examples
Those examples don't go into detail on the data input pipeline though. To set up a good data input pipeline in tensorflow I suggest you use the new (as of TF 1.4) Dataset object. It will force you into a good data input pipeline workflow, and it's the way all data input is going in tensorflow, so it's worth learning. It's also easy to test and debug when you write it this way. Here's the guide you want to follow.
https://www.tensorflow.org/programmers_guide/datasets
You can start your Dataset object from the CSV, and use dataset.map() to load the images using tf.read_file and tf.image.decode_jpeg.
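A rough sketch of the CSV route, written against the TensorFlow 2.x API (where tf.read_file has become tf.io.read_file). It assumes a CSV with no header row whose rows look like "filename, pose_0, pose_1, ...", and frames that all have the same size so they can be batched. The function names and the column layout are assumptions for illustration.

```python
import csv


def read_pose_csv(csv_file):
    """Parse rows of `filename, pose_0, pose_1, ...` into filenames and float poses."""
    filenames, poses = [], []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        filenames.append(row[0].strip())
        poses.append([float(value) for value in row[1:]])
    return filenames, poses


def make_pose_dataset(filenames, poses, batch_size=32):
    """Build a tf.data pipeline yielding (image, pose) batches (assumes TF 2.x)."""
    import tensorflow as tf  # imported lazily so the CSV parsing works without TF

    def load_example(filename, pose):
        image = tf.image.decode_jpeg(tf.io.read_file(filename), channels=3)
        # Convert uint8 pixels to floats in [0, 1].
        image = tf.image.convert_image_dtype(image, tf.float32)
        return image, pose

    dataset = tf.data.Dataset.from_tensor_slices((filenames, poses))
    dataset = dataset.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    # Batching assumes every JPG has the same height and width.
    return dataset.batch(batch_size)
```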
Since you're doing pose estimation I'll also suggest a nice blog I came across recently that will probably interest you. The topic is segmentation, but pose estimation is quite related.
http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review
I have a very large image dataset (>50G, single images in a folder) for training. To make loading the images more efficient, I first load part of the images into RAM and then send small batches to the GPU for training.
I want to further speed up the data preparation process before feeding the images to the GPU and was thinking about multi-processing. But I'm not sure how I should do it. Any ideas?
For speed I would advise using HDF5 or LMDB:
I have successfully used ml-pyxis for creating deep learning datasets using LMDBs.
It lets you create binary blobs (LMDB) that can be read quite fast.
The link above comes with some simple examples of how to create and read the data, including python generators/iterators.
For multi-processing:
I personally work with Keras, and by using a python generator it is possible to train with multiprocessing for data loading using the fit_generator method.
fit_generator(self, generator, samples_per_epoch,
nb_epoch, verbose=1, callbacks=[],
validation_data=None, nb_val_samples=None,
class_weight={}, max_q_size=10, nb_worker=1,
pickle_safe=False)
Fits the model on data generated batch-by-batch by a Python generator. The generator is run in parallel to the model, for efficiency. For instance, this allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU. You can find the source code here, and the documentation here.
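To give some intuition for "the generator is run in parallel to the model", here is a small standard-library sketch of a bounded prefetch queue. It is not how Keras implements fit_generator; it only illustrates the idea behind a setting like max_q_size, and the name prefetch is made up.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(generator, max_queue_size=10):
    """Run `generator` in a background thread, buffering up to max_queue_size items."""
    buffer = queue.Queue(maxsize=max_queue_size)

    def producer():
        try:
            for item in generator:
                # Blocks when the buffer is full, so memory use stays bounded.
                buffer.put(item)
        finally:
            # Always signal the end, even if the generator raised.
            buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item
```

Wrapping a slow batch generator in prefetch(...) lets the next batches load from disk while the training step for the current batch is still running.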
I don't know whether you prefer tensorflow, keras, torch, caffe, or something else.
Multiprocessing is simply Using Multiple GPUs
Basically you are trying to leverage more hardware by delegating or spawning one child process for every GPU and letting them do their magic. The example above is for Logistic Regression.
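One common way to get "one child process per GPU" is to launch each worker with its own CUDA_VISIBLE_DEVICES value, so that each process only sees a single device. A minimal sketch, using only the standard library; the worker here just prints the variable, and in practice it would be your own training script. The names WORKER_CODE and launch_one_process_per_gpu are made up.

```python
import os
import subprocess
import sys

# Placeholder worker: a real one would build and train a model on its GPU.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def launch_one_process_per_gpu(gpu_ids, worker_code=WORKER_CODE):
    """Start one child Python process per GPU id and collect each child's output."""
    processes = []
    for gpu_id in gpu_ids:
        # Each child sees only its own GPU through CUDA_VISIBLE_DEVICES.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        processes.append(subprocess.Popen(
            [sys.executable, "-c", worker_code],
            env=env, stdout=subprocess.PIPE, text=True))
    # All children are already running; now wait for each one in turn.
    return [process.communicate()[0].strip() for process in processes]
```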
Of course you would be more keen on looking into Convnets -
This LSU Material (Pgs 48-52[Slides 11-14]) builds some intuition
Keras is yet to officially provide support but you can "proceed at your own risk"
For multiprocessing, tensorflow is a better way to go about this (my opinion)
In fact they have some good documentation on it too