What is the difference between tfrecord and bottleneck

What is the difference between tfrecord and bottleneck - python

I have been studying transfer learning with models like inception_v4 and inception_resnet_v2. Found some projects that uses bottleneck and some uses tfrecords to store the training images. When retraining the inception_v4 model with the same data using those two methods bottleneck gave 95% accuracy and tfrecord only gave 75%. But, all the new projects seems to use tfrecords for data and .ckpt format to store the model. Can someone explain me whats the difference and which one is better in which case

If you are working with large datasets, using a binary file format for storage of your data can have a significant impact on the performance of your import pipeline. Hence, it will affect your training time of the model.
By using TFRecords, it is possible to store sequence data. For e.g, a series of data. Besides, it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library.
For more information about TFrecords, please refer this link.

Related

Efficiently load large .npy files (>20GB) with Keras/Tensorflow dataloader

I am currently implementing a machine learning model which uses a rather heavy representation of data.
My dataset is composed of images. Each of these images is encoded into a (224, 224, 103) matrix, making the entire dataset very heavy. I store these matrixes on the disk and load them during the training.
What I am currently doing right now is using mini-batches of 8 images and loading the .npy files for these 8 images from the disk during the entire training process. This is slow but it works.
Is there a more efficient way to do it using Keras/Tensorflow (which is what I'm using to code my model)?
I unfortunately couldn't find much about a dataloader that would allow me to do this.
Thanks in advance.

You have several options to do this.
I will assume that the transformations you are doing to the images to get the final (224, 224, 103) matrix is very expensive, and that it's not desirable to do the pre-processing on the data loading. If this is not the case, you might benefit from reading the tutorial relevant to image processing.
I suggest you use a python generator to read the data, and to use tf.data to create a data pipeline to feed these .npy files to your model. The basic idea is very simple. you use a wrapper to ingest data from a generator that will read the files as needed. The relevant documentation and examples are here.
Now, once you get that working, I think it would be a good idea for you to optimize your pipeline, especially if you're planning to train in multiple GPUs or multiple computers.

MxNet: Good ways to infer on large image datasets

I have millions of images to infer on. I know how to write my own code to create batches and forward the batches to a trained network using MxNet Module API in order to get the predictions. However, creating the batches leads to a lot of data manipulation that is not especially optimized.
Before doing any optimisation myself, I would like to know if there are some recommended approaches for batch predictions/inferences. More specifically, since this is a common use case, I was wondering if there is an interface/api that can do the usual image pre-processing, batch creation, and inference given a trained model (i.e. symbole file & epoch checkpoint)?

If you are using a standard pretrained model, I would highly recommend to take a look into gluoncv project - a toolkit for Computer Vision based on Apache MXNet.
They have really nice implementations of state of the art models, sometimes even beating the original results that are published in scientific papers. What is cool is that they also provide the data preprocessing code - as far as I understand, this is what you are looking for. (see gluoncv.data.transforms.presets package).
I don't know which inference you want to do, like image classification, segmentation, etc, but take a look to the list of tutorials and most probably you will find one you need.
Other than that, optimization for the fast wall clock time requires you to make sure that your GPU is 100% utilized. You may find useful to watch this video to learn more about tips and tricks on optimizing performance. It discusses training, but the same techniques applies to inference.

splitting tfrecords dataset with multiple features

I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N images becomes 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each image, only use the crop_0 and crop_1 images and average the softmax outputs for classification.
My question is - what is the best and most efficient way of training such a dataset? I'm willing to change my approach if this will make training more inefficient, and it seems that the simplest thing to do would have been to have separate tfrecords files for each version (crop & flip/flop images) and interleave the files into one dataset, but I do not want to have a whole bunch of files to deal with if I can help it.

Writing the dataset to disk with 4N images is an approach that you'll come to loath later (I did it this way originally and loath that code now). The better way is to keep your original dataset on disk as-is, don't write your preprocessing steps to disk. Do that kind of preprocessing in the CPU while you train. The tensorflow Dataset preprocessing pipeline makes this easy, modular, and provides the parallelization you need to take advantage of multiple cores at not extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to tensorflow using feed_dict, instead, tensorflow will just invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply some set of transformations and use the property num_parallel_calls to distribute the operations across multiple cores. If your preprocessing can be done in tensorflow code, great, if not you'll need to use tf.py_func to use python preprocessing code.
The guide I linked to above describes all of this very well. You will want to us a feedable iterator described in the section called "Creating an iterator". This will allow you to get a string ID from each of the 2 datasets (train and test) and pass that string to tensorflow via feed_dict indicating which of the two datasets tensorflow should pull samples from.

How to create my own dataset to train/test a convolutional neural network

So here is my question:
I want to make my very own dataset using a motion capture camera system to get the ground truth poses and one RGB camera to get images, and then using this as input to my network, train/test a convNet.
I have looked around at other datasets for tensorflow, caffe and Matlab. I have viewed the MNIST, Cats/Dogs, Iris, LSP, HumanEva, HumanEva3.6, FLIC, etc. datasets and have viewed and tried to understand their data as best as I can. I have viewed online people trying to make their own datasets. The one thing is usually when you use their datasets as an example, you download a .txt file that already contains the labels.
If anyone could please explain to me how to use the image data with the labels to feed it into my network, it would be a tremendous help. I have made code before using tensorflow to input a .txt file into the network and get the correct predicted output. But, my brain is missing something to understand how to input an image with a label. How to I create that dataset?

Your input images and your labels are two separate variables. You will be writing separate bits of code to import them. The videos typically need to be converted to JPG files (it's a royal pain to read video files directly, mostly because you can't randomly skip around the video easily).
Probably the easiest way to structure you data is via a CSV that contains filename, poseinfoA, poseinfoB, etc. And the filename refers to the JPG image on disk.
To get started on the basics, I suggest looking at the Aymericdamen tutorial examples, I haven't found tutorials anywhere that were as clear and concise.
https://github.com/aymericdamien/TensorFlow-Examples
Those examples don't go into detail on the data input pipeline though. To set up a good data input pipeline in tensorflow I suggest you use the new (as of TF 1.4) Dataset object. It will force you into a good data input pipline workflow, and it's the way all data input is going in tensorflow, so it's worth learning. It's also easy to test and debug when you write it this way. Here's the guide you want to follow.
https://www.tensorflow.org/programmers_guide/datasets
You can start your Dataset object from the CSV, and use a dataset.map_fn() to load the images using tf.image.decode_jpeg
Since you're doing pose estimation I'll also suggest a nice blog I came across recently that will probably interest you. The topic is segmentation, but pose estimation is quite related.
http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review

how to do cross validation and hyper parameter tuning for huge dataset?

I have a csv file of 10+gb ,i used "chunksize" parameter available in the pandas.read_csv() to read and pre-process the data,for training the model want to use one of the online learning algo.
normally cross-validation and hyper-parameter tuning is done on the entire training data set and train the model using the best hyper-parameter,but in the case of the huge data, if i do the same on the chunk of the training data how to choose the hyper-parameter?

I believe you are looking for online learning algorithms like the ones mentioned on this link Scaling Strategies for large datasets. You should use algorithms that support partial_fit parameter to load these large datasets in chunks. You can also look at the following links to see which one helps you the best, since you haven't specified the exact problem or the algorithm that you are working on:
Numpy save partial results in RAM
Scalling Computationally - Sklearn
Using Large Datasets in Sklearn
Comparision of various Online Sovers -Sklearn
EDIT : If you want to solve the class imbalance problem, you can try this : imabalanced-learn library in Python

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.