In my face recognition project, a face is represented as a 128-dimensional embedding (face_descriptor), as used in FaceNet.
I can generate an embedding from an image in two ways.
Using the TensorFlow ResNet model v1:
emb_array = sess.run(embedding_layer,
{images_placeholder: images_array, phase_train_placeholder: False})
An array of images can be passed and a list of embeddings is obtained.
This is a bit slow: it took about 1.6 s (though the time is almost constant for a large number of images).
Note: GPU not available
The other method uses dlib:
dlib.face_recognition_model_v1.compute_face_descriptor(image, shape)
This gives a fast result: about 0.05 seconds.
But only one image can be passed at a time, so the total time increases with the number of images.
Is there any way to pass an array of images to calculate embeddings in dlib, or any other way to improve dlib's speed?
Or is there any other, faster method to generate 128-dimensional face embeddings?
Update:
I concatenated multiple images into a single image and passed it to dlib:
dlib.face_recognition_model_v1.compute_face_descriptor(big_image, shapes)
i.e. I converted multiple images, each containing a single face, into a single image containing multiple faces.
Still, the time is proportional to the number of images (i.e. the number of faces) concatenated; it is almost the same as iterating over the individual images.
One of the more important aspects of this question is that you have no GPU available. I'm putting this here so that anyone who reads this answer will have a better understanding of the context.
There are two major parts to the time consumed by inference. First is the setup time. TensorFlow takes its sweet, sweet time to set itself up when you first run the network, so your measurement of 1.6 seconds is probably 99.9999% setup time and 0.0001% processing your image. Then comes the actual inference calculation, which is probably tiny for one image compared to the setup. A better measurement would be running 1,000 images through it, then 2,000 images, and dividing the difference by 1,000 to get how much time each image takes to infer.
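As a rough sketch of that measurement (run_batch is a placeholder for your own code, e.g. a sess.run call or a dlib loop over n images):

import time

def per_image_seconds(run_batch, n_small=1000, n_large=2000):
    # run_batch(n) is assumed to push n images through the network.
    # Differencing the two runs cancels the fixed setup cost, assuming both
    # runs pay the same one (e.g. each uses a fresh session).
    def timed(n):
        start = time.perf_counter()
        run_batch(n)
        return time.perf_counter() - start
    return (timed(n_large) - timed(n_small)) / (n_large - n_small)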
From the look of it, Dlib doesn't spend much time setting up on the first run, but it would still be a better benchmark to do the same as outlined above.
I suspect Tensorflow and Dlib should be fairly similar in terms of execution speed on a CPU because both use optimized linear algebra libraries (BLAS, LAPACK) and there is only so much optimization one can do for matrix multiplication.
There is another thing you might want to give a try though. Most networks use 32 bit floating point calculations for training and inference, but research shows that in most cases, switching over to 8 bit integers for inference doesn't degrade accuracy too much but speeds up inference by a lot.
It is generally better to train a network with later quantization in mind, which is not the case here because you're using a pre-trained model, but you can probably still benefit a lot from quantization. You can quantize your model basically by running a command that's included in TensorFlow (with the surprising name quantize_graph), but there is a little bit more to it. There is a nice quantization tutorial to follow, but keep in mind that the script is now in tensorflow/tools/quantization and no longer in contrib, as written in the tutorial.
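If you end up on TF 2.x, a simpler route than quantize_graph is post-training dynamic-range quantization via tf.lite.TFLiteConverter; a minimal sketch (the SavedModel path is a placeholder):

import tensorflow as tf

# Post-training dynamic-range quantization: weights are stored as 8-bit integers.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)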
Related
I am training a machine learning model on my own dataset, generated from another program. I have designed the model to work on 128x128 images; however, as I approach 100,000 images I start to run into issues with training crashing without any informative output (the kernel dies). I am assuming this is caused by memory limits, since it only occurs as the number of images increases.
To mitigate the memory usage, I realized that all of the pixels in the input images are either 0 or 255, meaning that after normalization they are 0 or 1. Is there a way to exploit this in PyTorch to reduce memory usage? Or are there other benefits you can take advantage of when the input images only contain binary values?
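A rough illustration of the idea being asked about (the tensor and helper below are made up, not a confirmed answer): keep the dataset as uint8, one byte per pixel instead of four for float32, and only cast the current mini-batch to float.

import torch

# Stand-in data: every pixel is 0 or 255, stored as uint8.
images_u8 = (torch.rand(1_000, 1, 128, 128) > 0.5).to(torch.uint8) * 255

def get_batch(indices):
    # Cast and normalize to {0, 1} lazily, one mini-batch at a time.
    return images_u8[indices].float() / 255.0

batch = get_batch(torch.arange(64))  # float32 exists only for these 64 images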
I've built a network (in PyTorch) that performs well for image restoration purposes. I'm using an autoencoder with a ResNet50 encoder backbone; however, I am only using a batch size of 1. I'm experimenting with some frequency-domain stuff that only allows me to process one image at a time.
I have found that my network performs reasonably well, but only if I remove all batch normalization from it. Of course batch norm is useless for a batch size of 1, so I switched over to group norm, which is designed for this purpose. However, even with group norm, my gradients explode. Training can go very well for 20-100 epochs and then it's game over. Sometimes it recovers and explodes again.
I should also say that in training, every new image fed in is given a wildly different amount of noise, so that the network is trained on random noise levels. This has been done before, but perhaps coupled with a batch size of 1 it could be problematic.
I'm scratching my head at this one and I'm wondering if anyone has suggestions. I've dialed in my learning rate and clipped the max gradients but this isn't really solving the actual issue. I can post some code but I'm not sure where to start and hoping someone could give me a theory. Any ideas? Thanks!
To answer my own question, my network was unstable in training because a batch size of 1 makes the data too different from batch to batch. Or as the papers like to put it, too high an internal covariate shift.
Not only were my images drawn from a very large, varied dataset, but they were also rotated and flipped randomly. On top of that, a random amount of Gaussian noise between 0 and 30 was chosen for each image, so one image may have little to no noise while the next may be barely distinguishable in some cases.
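For reference, the noise part of that augmentation looks roughly like this (a sketch; it assumes the 0-30 range is a standard deviation on a 0-255 pixel scale and that images are already scaled to [0, 1]):

import torch

def add_random_noise(img, max_sigma=30.0):
    # A different noise level is drawn for every image, as described above.
    sigma = torch.rand(1).item() * max_sigma / 255.0
    return (img + torch.randn_like(img) * sigma).clamp(0.0, 1.0)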
In the above question I mentioned group norm - my network is complex and some of the code is adapted from other work. There were still batch norm functions hidden in my code that I missed. I removed them. I'm still not sure why BN made things worse.
Following this, I reimplemented group norm with groups of size 32, and things are training much more nicely now.
In short, removing the extra BN and adding group norm helped.
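A minimal sketch of that fix (the helper name is made up; it assumes channel counts divisible by the resulting group count, which holds for ResNet-style layers):

import torch.nn as nn

def bn_to_gn(module, group_size=32):
    # Recursively swap every BatchNorm2d for GroupNorm with groups of ~group_size channels.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            c = child.num_features
            setattr(module, name, nn.GroupNorm(max(1, c // group_size), c))
        else:
            bn_to_gn(child, group_size)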
I am trying to measure the FLOPs of a TFLite model in TF2.
I know that TensorFlow 1.x had tf.profiler, which was awesome for measuring FLOPs, but it doesn't seem to work well with tf.keras.
Could anybody please describe how to measure the FLOPs of a TFLite model in TF2? I can't seem to find an answer online.
Thank you all so much for your time.
Edit: The link commented below does not help with tflite.
I encountered the same problem and wrote a simple Python package to roughly calculate the FLOPs:
https://github.com/lisosia/tflite-flops
Only Conv and DepthwiseConv layers are considered, but it was sufficient for my use case.
Unfortunately, there's no direct way to calculate the FLOPs of a TFLite model. However, you can estimate the value indirectly by following these 3 steps:
Use the official TFLite performance tool to measure how long your model takes (in ms) to perform a single inference.
Use some benchmark app (such as xOPS) to estimate how many floating-point operations per second (FLOPS) your target device can run.
Use the results you got from steps 1 and 2 to estimate the number of floating-point operations your model performs during a single inference.
The final result will probably be a rough approximation, but it still can bring some value to your performance analysis.
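Putting the three steps together is simple arithmetic; the numbers below are placeholders for whatever the benchmark tool and xOPS report on your device:

latency_ms = 35.0       # step 1: single-inference latency from the TFLite benchmark tool
device_gflops = 12.0    # step 2: sustained throughput of the device, in GFLOPS

# step 3: operations per inference ~= throughput (ops/s) * latency (s)
flops_per_inference = device_gflops * 1e9 * (latency_ms / 1000.0)
print(f"~{flops_per_inference / 1e9:.1f} GFLOPs per inference")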
I have millions of images to run inference on. I know how to write my own code to create batches and forward them through a trained network using the MXNet Module API in order to get the predictions. However, creating the batches leads to a lot of data manipulation that is not especially optimized.
Before doing any optimisation myself, I would like to know if there are some recommended approaches for batch predictions/inference. More specifically, since this is a common use case, I was wondering if there is an interface/API that can do the usual image pre-processing, batch creation, and inference given a trained model (i.e. a symbol file & epoch checkpoint)?
If you are using a standard pretrained model, I would highly recommend taking a look at the GluonCV project - a toolkit for computer vision based on Apache MXNet.
They have really nice implementations of state-of-the-art models, sometimes even beating the original results published in scientific papers. What is cool is that they also provide the data preprocessing code - as far as I understand, this is what you are looking for (see the gluoncv.data.transforms.presets package).
I don't know which kind of inference you want to do (image classification, segmentation, etc.), but take a look at the list of tutorials and you will most probably find the one you need.
Other than that, optimizing for fast wall-clock time requires making sure your GPU is 100% utilized. You may find it useful to watch this video to learn more tips and tricks on optimizing performance. It discusses training, but the same techniques apply to inference.
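A rough sketch of such a batched pipeline (model name, image folder, and batch size are placeholders; you can swap the hand-written transform for the matching preset in gluoncv.data.transforms.presets):

from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import ImageFolderDataset, transforms
from gluoncv import model_zoo

net = model_zoo.get_model('ResNet50_v1', pretrained=True)  # placeholder model

# Standard ImageNet-style eval preprocessing.
tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

dataset = ImageFolderDataset('/path/to/images').transform_first(tfm)  # placeholder path
loader = DataLoader(dataset, batch_size=64, num_workers=4)

for batch, _ in loader:
    preds = net(batch).argmax(axis=1)  # one forward pass per batch; store preds as needed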
It is a common practice to normalize input values (to a neural network) to speed up the learning process, especially if features have very large scales.
In theory, normalization is easy to understand. But I wonder how it is done in practice when the training data set is very large, say 1 million training examples. If the number of features per training example is large as well (say, 100 features per example), two problems suddenly pop up:
- It will take some time to normalize all training samples
- Normalized training examples need to be saved somewhere, so we need roughly double the disk space (especially if we do not want to overwrite the original data).
How is input normalization solved in practice, especially if the data set is very large?
One option might be to normalize the inputs dynamically in memory, per mini-batch, while training. But the normalization results will then change from one mini-batch to another. Would that be tolerable?
Maybe someone on this platform has hands-on experience with this question. I would really appreciate it if you could share your experience.
Thank you in advance.
A large number of features actually makes it easier to parallelize the normalization of the dataset, so that is not really an issue. Normalization of large datasets can easily be GPU-accelerated, and it is quite fast even for datasets of the size you describe. One of the frameworks I have written can normalize the entire MNIST dataset in under 10 seconds on a 4-core, 4-thread CPU; a GPU could easily do it in under 2 seconds. Computation is not the problem.

While for smaller datasets you can hold the entire normalized dataset in memory, for larger datasets, like the one you mention, you will need to swap out to disk if you normalize the entire dataset up front. However, if you use reasonably large batch sizes, about 128 or higher, your minimums and maximums will not fluctuate that much, depending on the dataset. This allows you to normalize each mini-batch right before you train the network on it, but again this depends on the network.

I would recommend experimenting with your datasets and choosing the best method.
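A minimal sketch of that approach (NumPy, framework-agnostic; the helpers are made up): compute the global statistics in one streaming pass so no normalized copy of the dataset is ever written to disk, then normalize each mini-batch right before it is fed to the network.

import numpy as np

def streaming_mean_std(batch_iter):
    # One pass over the data; batch_iter yields arrays of shape (batch_size, n_features).
    n, s, sq = 0, 0.0, 0.0
    for batch in batch_iter:
        n += batch.shape[0]
        s += batch.sum(axis=0)
        sq += np.square(batch).sum(axis=0)
    mean = s / n
    std = np.sqrt(sq / n - mean ** 2)
    return mean, std

def normalize_batch(batch, mean, std):
    # Applied on the fly; the original data stays untouched on disk.
    return (batch - mean) / (std + 1e-8)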