I have a very large image dataset (>50 GB, single images in a folder) for training. To make loading the images more efficient, I first load part of the images into RAM and then send small batches to the GPU for training.
I want to further speed up the data preparation process before feeding the images to the GPU and was thinking about multiprocessing. But I'm not sure how I should do it, any ideas?
For speed I would advise using HDF5 or LMDB:
I have successfully used ml-pyxis for creating deep learning datasets using LMDBs.
It allows you to create binary blobs (LMDB), and they can be read quite fast.
The link above comes with some simple examples on how to create and read the data, including Python generators/iterators.
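For illustration, here is a minimal sketch of writing and reading image arrays with the raw lmdb package (not ml-pyxis itself); the map_size, the key scheme, and the placeholder images are arbitrary choices:
import pickle
import lmdb
import numpy as np

# Sketch only: write 100 dummy images into an LMDB, keyed by a zero-padded index.
env = lmdb.open('images.lmdb', map_size=10 * 1024 ** 3)  # reserve ~10 GB of address space

with env.begin(write=True) as txn:
    for i in range(100):
        img = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder image
        txn.put(f'{i:08d}'.encode(), pickle.dumps(img))

# Reading a single sample back by key is a cheap random access.
with env.begin() as txn:
    img = pickle.loads(txn.get(b'00000042'))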
For multi-processing:
I personally work with Keras, and by using a Python generator it is possible to train with multiprocessing for the data using the fit_generator method.
fit_generator(self, generator, samples_per_epoch, nb_epoch,
              verbose=1, callbacks=[], validation_data=None,
              nb_val_samples=None, class_weight={}, max_q_size=10,
              nb_worker=1, pickle_safe=False)
Fits the model on data generated batch-by-batch by a Python generator. The generator is run in parallel to the model for efficiency. For instance, this allows you to do real-time data augmentation on images on the CPU in parallel to training your model on the GPU. You can find the source code here, and the documentation here.
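As a minimal sketch, a generator feeding fit_generator with several worker processes might look like this; the generator below yields random placeholder batches, and the argument names follow the old signature quoted above (newer Keras versions use workers= and use_multiprocessing= on fit instead):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# The generator yields random placeholder batches; in practice you would read
# image batches from RAM / LMDB here.
def batch_generator(batch_size=32):
    while True:
        x = np.random.rand(batch_size, 100)
        y = np.random.randint(0, 2, (batch_size, 1))
        yield x, y

model = Sequential([Dense(1, activation='sigmoid', input_dim=100)])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# nb_worker=4 / pickle_safe=True turn on process-based parallelism in this old API;
# newer Keras versions use model.fit(..., workers=4, use_multiprocessing=True).
model.fit_generator(batch_generator(), samples_per_epoch=3200, nb_epoch=5,
                    max_q_size=10, nb_worker=4, pickle_safe=True)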
I don't know whether you prefer TensorFlow, Keras, Torch, Caffe, or something else.
In this context, multiprocessing essentially means using multiple GPUs.
Basically you are trying to leverage more hardware by delegating or spawning one child process for every GPU and letting them do their magic. The example above is for logistic regression.
Of course you would be more keen on looking into convnets.
This LSU material (pages 48-52 [slides 11-14]) builds some intuition.
Keras is yet to officially provide support, but you can "proceed at your own risk".
For multiprocessing, TensorFlow is a better way to go about this (in my opinion).
In fact, they have some good documentation on it too.
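As a rough sketch of the "one child process per GPU" idea (train_on_shard and the data split below are hypothetical placeholders, not from any of the linked examples):
import multiprocessing as mp
import os

def train_on_shard(gpu_id, shard):
    # Restrict this child process to a single GPU before the framework is imported.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    import tensorflow as tf  # framework import happens per process
    # ... build the model here and train it on `shard` ...

if __name__ == '__main__':
    num_gpus = 4  # assumed number of GPUs
    shards = [list(range(i, 1000, num_gpus)) for i in range(num_gpus)]
    procs = [mp.Process(target=train_on_shard, args=(gpu, shard))
             for gpu, shard in enumerate(shards)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()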
Related
I am surprised that I could not find a concise way to run a tf.data pipeline on the GPU. I understand that data pipelines can run on the CPU so that they happen in parallel (with prefetching), allowing the GPU to run the actual model and train it.
However, my preprocessing is extremely parallel and computationally intensive. While I could technically write the preprocessing as the first layer in my model, I would really prefer not to do this, to prevent training data leakage into my model.
Any pointers are appreciated. The closest I found was https://towardsdatascience.com/overcoming-data-preprocessing-bottlenecks-with-tensorflow-data-service-nvidia-dali-and-other-d6321917f851, which involves using the NVIDIA DALI framework.
Here are some critical points (a rough sketch of what I have tried follows this list):
I have already tried enforcing device placement with tf.device('...').
I don't want to prefetch the data onto the device, but rather run the whole data pipeline on the GPU.
Preferably, if my computation is heavy, I want to save my dataset as TFRecords so that I can load it directly. This can be done with tf.data.experimental.save for now, but again it uses the CPU!
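Roughly, what I have tried looks like this (heavy_preprocess is a stand-in for my actual preprocessing):
import tensorflow as tf

# heavy_preprocess is a hypothetical stand-in for the real, computationally
# intensive preprocessing; the outer product is just placeholder compute.
def heavy_preprocess(x):
    return tf.einsum('i,j->ij', x, x)

ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((1024, 128)))

# Attempt: wrap the pipeline definition in a device scope -- the map still
# ends up running on the CPU.
with tf.device('/GPU:0'):
    ds = ds.map(heavy_preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Attempt: materialize the preprocessed dataset to disk for later reuse --
# this also runs on the CPU.
tf.data.experimental.save(ds, '/tmp/preprocessed_ds')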
I have a training dataset of shape (90000, 50) and I am trying to fit a Gaussian process regression model to it. This errors out with a memory error. I do understand the computational cost, but is there a way to pass data in batches using scikit-learn? I am using the scikit-learn implementation of the GPR algorithm.
Keras has generators because you can create checkpoints and resume from where you left off with neural networks. However, not all trainable algorithms have this property. Take a look at incremental learning in the scikit-learn API docs.
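For example, an estimator that supports partial_fit can be fed in chunks like this (SGDRegressor is just one such estimator; GaussianProcessRegressor does not support it):
import numpy as np
from sklearn.linear_model import SGDRegressor

# Placeholder data with the shape mentioned in the question.
X = np.random.rand(90000, 50)
y = np.random.rand(90000)

model = SGDRegressor()
batch_size = 1000
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # partial_fit updates the model with one chunk at a time, so the full
    # dataset never has to fit into the solver's working memory at once.
    model.partial_fit(X[start:stop], y[start:stop])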
The Gaussian process implementation (regression/classification) from scikit-learn isn't capable of handling big datasets. It can only run on up to about 15000 rows of data, so I decided to use a different algorithm instead, as this seems to be a limitation of the algorithm.
I have millions of images to run inference on. I know how to write my own code to create batches and forward them to a trained network using the MXNet Module API in order to get the predictions. However, creating the batches leads to a lot of data manipulation that is not especially optimized.
Before doing any optimization myself, I would like to know if there are some recommended approaches for batch predictions/inference. More specifically, since this is a common use case, I was wondering if there is an interface/API that can do the usual image preprocessing, batch creation, and inference given a trained model (i.e. symbol file & epoch checkpoint)?
If you are using a standard pretrained model, I would highly recommend taking a look at the GluonCV project, a toolkit for computer vision based on Apache MXNet.
They have really nice implementations of state-of-the-art models, sometimes even beating the original results published in scientific papers. What is cool is that they also provide the data preprocessing code, which, as far as I understand, is what you are looking for (see the gluoncv.data.transforms.presets package).
I don't know which kind of inference you want to do (image classification, segmentation, etc.), but take a look at the list of tutorials and most probably you will find the one you need.
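As a rough sketch of what that looks like for classification (the model name, file list, and batch handling below are illustrative choices, not the only way to do it):
import mxnet as mx
from gluoncv import model_zoo, data

ctx = mx.cpu()  # switch to mx.gpu(0) if a GPU is available
net = model_zoo.get_model('resnet50_v1b', pretrained=True, ctx=ctx)

filenames = ['img_0001.jpg', 'img_0002.jpg']  # placeholder file list
batch = []
for fname in filenames:
    img = mx.image.imread(fname)
    # transform_eval resizes, center-crops and normalizes to the ImageNet preset
    # and adds the batch dimension.
    batch.append(data.transforms.presets.imagenet.transform_eval(img))
batch = mx.nd.concat(*batch, dim=0).as_in_context(ctx)

probs = mx.nd.softmax(net(batch))
top1 = probs.argmax(axis=1)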
Other than that, optimizing for wall-clock time requires you to make sure that your GPU is 100% utilized. You may find it useful to watch this video to learn more about tips and tricks for optimizing performance. It discusses training, but the same techniques apply to inference.
I have been studying transfer learning with models like inception_v4 and inception_resnet_v2. I found some projects that use bottlenecks and some that use TFRecords to store the training images. When retraining the inception_v4 model with the same data using those two methods, bottlenecks gave 95% accuracy and TFRecords only gave 75%. However, all the new projects seem to use TFRecords for the data and the .ckpt format to store the model. Can someone explain the difference and which one is better in which case?
If you are working with large datasets, using a binary file format to store your data can have a significant impact on the performance of your import pipeline and, hence, on the training time of your model.
By using TFRecords, it is possible to store sequence data, for example a series of data points. Besides, it is easy to combine multiple datasets, and TFRecords integrate seamlessly with the data import and preprocessing functionality provided by the library.
For more information about TFRecords, please refer to this link.
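As a minimal illustration of the write/read round trip (the feature keys, file name, and dummy payload below are placeholders):
import tensorflow as tf

# Write a few (image, label) examples; the payload here is a dummy byte string.
def serialize_example(image_bytes, label):
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

with tf.io.TFRecordWriter('train.tfrecord') as writer:
    for i in range(10):
        writer.write(serialize_example(b'\x00' * 100, i % 2))

# Read the records back as a tf.data pipeline.
def parse(record):
    spec = {'image': tf.io.FixedLenFeature([], tf.string),
            'label': tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(record, spec)

dataset = tf.data.TFRecordDataset('train.tfrecord').map(parse).batch(4)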
I am reading this performance guide on best practices for optimizing TensorFlow code for GPUs. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the custom Estimator census sample provided here.
The article suggests wrapping the preprocessing operations in with tf.device('/cpu:0'). However, when I look at the custom estimator, the 'preprocessing' appears to be done in multiple steps:
Line 152/153: inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped these two lines in with tf.device('/cpu:0'), would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294: there are also a generate_input_fn and a parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well, or would that basically be forced by having inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by placing the preprocessing on the CPU, the GPU would automatically be used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment. Would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU, and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so that each worker independently iterates through the data (and passes gradients asynchronously back to the PS), which suggests to me that no further modifications from the single-GPU case above would be needed if you train this way. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two pieces of code you reference are actually two different parts of the training. Line 282/294 is, in my opinion, the so-called "preprocessing" part, since it parses raw input data into tensors; these operations are not suitable for GPU acceleration, so it is sufficient to allocate them on the CPU.
Line 152/153 is part of the training model, since it processes the raw features into different types of features.
'cpu:0' means the operations in this section will be allocated on the CPU, but not bound to a specific core. Operations allocated on the CPU run multi-threaded and can use multiple cores.
If your machine has GPUs, TensorFlow will prefer to allocate operations on the GPUs when the device is not specified.
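For example, a TF 1.x-style sketch of pinning the input parsing to the CPU inside the input function could look like this (the column names and defaults are placeholders, not the actual census columns):
import tensorflow as tf

# Hypothetical column names and defaults, standing in for the census sample's.
CSV_COLUMNS = ['age', 'education_num', 'income_bracket']
CSV_DEFAULTS = [[0.0], [0.0], ['']]

def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=CSV_DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop('income_bracket')
    return features, label

def generate_input_fn(filenames, batch_size=64):
    def _input_fn():
        # Everything created under this scope (file reading, CSV parsing,
        # batching) is pinned to the CPU, leaving the GPU free for the model.
        with tf.device('/cpu:0'):
            dataset = (tf.data.TextLineDataset(filenames)
                       .map(parse_csv, num_parallel_calls=4)
                       .batch(batch_size)
                       .prefetch(1))
            return dataset.make_one_shot_iterator().get_next()
    return _input_fn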
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.