How to improve Face matching time - python

I'm working on a project that detects each person's face on entering a public space and stores the entry time and the person's image (in array format) in Elasticsearch. When a face is detected on exit, I loop over the Elasticsearch index of people who entered that day, pass two images to my model (the detected exiting face and each face stored in Elasticsearch), match the two faces, and return the entry time, exit time and total duration.
For face matching / face re-identification I'm using a VGG model that takes ~1 s to compare two faces.
The model takes two images as parameters and returns a value between 0 and 1.
I loop over all stored faces, append each returned value to a list, and take the face with the minimum value as the match.
So if 100 people have entered that day, finding one face takes more than 100 s, but in my use case the program needs to run in real time.
Any suggestions?
This is a screenshot of my code where I'm calling the model:
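Roughly, the loop looks like this (the model call, the query result and the field names below are placeholders, not my exact code):

# Hypothetical reconstruction of the matching loop described above.
# verify_faces stands in for the VGG comparison (returns a value in [0, 1],
# lower = more similar); entries stands in for the documents fetched from
# the Elasticsearch index for the current day.
def find_best_match(exit_face, entries, verify_faces):
    scores = []
    for entry in entries:
        stored_face = entry["face_array"]                     # image stored at entry time (placeholder field)
        scores.append(verify_faces(exit_face, stored_face))   # ~1 s per call with VGG
    best_index = scores.index(min(scores))                    # smallest distance = best match
    return entries[best_index], scores[best_index]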

In case you have too many images, I would suggest looking at a method like FAISS. It is more efficient than computing distances between the new image and every saved image one by one. You can also try a small 4-layer conv net or EfficientNet instead of VGG (but check for accuracy degradation), as VGG is computationally expensive.
Another approach: if the list of people is fixed, you can compute the embeddings of all saved images once and store them. At inference time you run the feature extractor on the new face and compare the resulting embedding with all stored embeddings; this will definitely save time for you.
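A rough sketch of that second suggestion combined with FAISS (extract_embedding, stored_embeddings and the embedding dimension are placeholders for whatever backbone and storage you end up using):

import numpy as np
import faiss  # pip install faiss-cpu

# Assume embeddings for everyone who entered today were computed once, at
# entry time, with some feature extractor (placeholder names below).
d = 512                                                    # embedding dimension (depends on the backbone)
stored = np.vstack(stored_embeddings).astype("float32")    # shape (n_people, d)

index = faiss.IndexFlatL2(d)               # exact L2 search; fine for a few thousand faces
index.add(stored)

# At exit time: one forward pass + one index lookup instead of n model calls.
query = extract_embedding(exit_face).astype("float32").reshape(1, -1)
distances, ids = index.search(query, 1)    # nearest stored embedding
best_match = ids[0][0]                     # row into whatever metadata you keep alongside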

Adding to #Rambo_john - here is a nice image search demo that uses VGG and a managed Faiss service.

Related

Optimize too many Azure Face API calls during enrollment and verification of data

I have a face dataset on which I am using Azure Face Service for identification of people.
The first step is to detect faces in the data using face.detect, then enroll them, if a face is present, using face lists.
The second step is to train the enrolled set.
The third step is inference: detect faces on the same dataset using face.detect and then find out which faces in the database match using the find similar method (e.g. images 1, 2, 3 are enrolled; 1,2 is a match; 1,3 is a match; 2,1 is a match; 3,1 is a match, and so on).
Now, the problem is that for a large number of images the API calls become very expensive, and both the first and third steps use the face detect API call, which returns a face ID (with other properties) that expires in 24 hours. Is there any way to minimize the API calls in this scenario (enrolling the entire dataset and then verifying against the enrolled dataset itself)?
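For illustration only, one way to reuse a face ID within its 24-hour lifetime instead of calling detect again (detect_fn is a placeholder for the actual Face API call, not the real SDK signature):

import time

FACE_ID_TTL = 24 * 3600        # face IDs expire after 24 hours
_cache = {}                    # image path -> (timestamp, face_ids)

def cached_detect(image_path, detect_fn):
    # detect_fn is a placeholder for the billable Face API detect call.
    now = time.time()
    if image_path in _cache:
        ts, face_ids = _cache[image_path]
        if now - ts < FACE_ID_TTL:
            return face_ids            # reuse the IDs from enrollment, no new API call
    face_ids = detect_fn(image_path)   # one billable call
    _cache[image_path] = (now, face_ids)
    return face_ids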

Get characters out of an image with Python

I want to detect the characters in an image like this with Python:
In this case the code should return the result '6010001'.
How can I get the result out of this image? What do I need?
For your information, if the solution is an AI solution, there are about 20,000 labeled images.
Thank you in advance :)
Question: are all the pictures of a similar nature?
Meaning, are the numbers stamped into a similar material, or are they random pictures with numbers produced by different techniques (e.g. pen drawn, stamped, etc.)?
If they are all quite similar (nice contrast as in the sample pic), I would recommend writing your "own" AI; otherwise use an existing neural network / library (as I assume you may want to avoid the pain of creating your own neural network - and tagging a lot of pictures).
If the pics are quite "similar", here is the suggested approach (a rough sketch of steps 3-4 follows the list):
1. Greyscale the image and increase the contrast.
2. Define a box (greater than a digit), scan it over the image and count the 0s (dark pixels); define a valid count range by trial and error to detect a digit, and avoid overlaps.
3. For each hit, take the area, split it into sectors, e.g. 6x4, and count the 0s per sector.
4. Build a little knowledge base (CSV file) of the counts per sector for each number from 0-9 (e.g. as a string); you will end up with multiple valid strings per number in the database, just ensure they are unique (otherwise redefine steps 1-3).
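A rough sketch of steps 3-4 (the grid size, the threshold and the fingerprint format are just example choices to tune):

import numpy as np
import cv2

def digit_fingerprint(digit_img, rows=6, cols=4, threshold=128):
    # Binarize the cropped digit, split it into rows x cols sectors and
    # count the dark (0) pixels per sector, as described above.
    grey = cv2.cvtColor(digit_img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(grey, threshold, 255, cv2.THRESH_BINARY)
    h, w = binary.shape
    counts = []
    for r in range(rows):
        for c in range(cols):
            sector = binary[r * h // rows:(r + 1) * h // rows,
                            c * w // cols:(c + 1) * w // cols]
            counts.append(int(np.sum(sector == 0)))       # dark pixels in this sector
    return "-".join(str(n) for n in counts)               # one line of the CSV knowledge base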
In addition, I recommend making yourself a smart knowledge database: if a digit could not be identified, save the digit picture and the result. Then write a little review program that shows you the undefined digits and the result string, so you can manually add them to your knowledge database under the respective number.
Hope it helps. I used the same approach to read a lot of different data from screen pictures and store it in a database. Works like a charm.
#better do it yourself than using a standard neural network :)
You can use opencv-python and pytesseract:

import cv2
import pytesseract

# Read the image and run Tesseract OCR on it
img = cv2.imread('img3.jpeg')
text = pytesseract.image_to_string(img)
print(text)

It doesn't work for all images with text, but it works for most.
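If the images only ever contain digits, it may also help to pass a Tesseract config that whitelists digits and treats the image as a single text line (exact behaviour can vary by Tesseract version):

# Restrict recognition to digits on a single text line.
text = pytesseract.image_to_string(
    img,
    config="--psm 7 -c tessedit_char_whitelist=0123456789",
)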

Object recognition with CNN, what is the best way to train my model: photos or videos?

I aim to design an app that recognizes a certain type of object (let's say, a book) and can say whether the input is effectively a book or not (binary classification).
For a better user experience, I would like the input to be a video rather than a picture: that way, the user won't have to deal with issues such as sharpness or centering of the object. They'll just have to make a "scan" of the object, without much consideration for the quality of a single image.
And here comes my problem: as I intend to create my training dataset from scratch (the object I want to detect is absent from existing datasets such as ImageNet), I was wondering whether videos are irrelevant for this type of binary classification and whether I should rather ask the user to take a good picture of the object.
On one hand, videos have the advantage of constituting a larger dataset than one created only from photos (though I can expand my picture dataset with data augmentation), as it is easier to take a 10 s video of an object than to take 10x24 (more or less…) pictures of it.
But on the other hand, I fear the result will be less precise, as in a video many frames are redundant and the average quality might not be as good as in a single, proper image.
Moreover, I do not intend to use the temporal dimension of the video (in a scan the temporality is useless) but rather to work one frame at a time (as depicted in this article).
What is the proper way of constituting my dataset? I would really like to keep this "scan" for the user's comfort, so if images are more precise than videos for this kind of classification, is it possible to automatically extract a single image from a "scan" and work directly on it?
Good question! The answer is: you should train your model on how you plan to use it. So if you ask the user to take photos, train it on photos. If you ask the user to film the object, train on frames extracted from video.
The images might seem blurry to you, but they won't be for a computer. It will just learn to detect "blurry books", but that's OK, that's what you want.
Of course this is not always the case. The image might become so blurry that the information whether or not there is a book in the frame is no longer there. Where is the line? A general rule of thumb: if you can see it's a book, the computer will also see it. As I think blurry images of books will still be recognizable as books, I think you could totally do it.
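If you go the video route, pulling training frames out of the clips is straightforward with OpenCV; a minimal sketch (the frame stride is an arbitrary choice you would tune):

import cv2

def extract_frames(video_path, every_n=5):
    # Grab every n-th frame so consecutive, near-duplicate frames
    # don't dominate the training set.
    frames = []
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames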
Creating "photos (single image, sharp)" from "scan (more blurry, frames from video)" can be done, it's called super-resolution. But those models are pretty beefy, not something you would want to run on a mobile device.
On a completely unrelated note: try googling Transfer Learning! It will benefit you for sure :D.
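For example, a minimal transfer-learning setup in tf.keras (MobileNetV2 is just one possible pretrained base; the input size and the classification head are assumptions):

import tensorflow as tf

# Frozen pretrained base + small binary head for the "book / not book" case.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                      # freeze the pretrained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # book vs. not-book
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # your frame dataset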

Feature extraction for keyword spotting on long form audio using a CNN

I've built a simple CNN word detector that is accurately able to predict a given word when using a 1-second .wav as input. As seems to be the standard, I'm using the MFCC of the audio files as input for the CNN.
However, my goal is to apply this to longer audio files with multiple words being spoken, and to have the model predict if and when a given word is spoken. I've been searching online for the best approach, but I seem to be hitting a wall, and I truly apologize if the answer could have been easily found through Google.
My first thought is to cut the audio file into several windows of 1-second length that overlap each other,
then convert each window into an MFCC and use these as input for the model prediction.
My second thought is to instead use onset detection to try to isolate each word, pad the word if it is shorter than 1 second, and then feed these as input for the model prediction.
Am I way off here? Any references or recommendations would be hugely appreciated. Thank you.
Cutting the audio up into analysis windows is the way to go, and it is common to use some overlap. The MFCC features can be calculated first and the split then done using the integer number of frames that gets you closest to the window length you want (1 s).
See How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)? for example code
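A minimal sketch of that windowing with librosa (the file name, sample rate, hop length and the 50% overlap are all placeholder choices):

import numpy as np
import librosa

# Compute MFCCs for the whole recording, then slide a ~1 s window
# (with 50% overlap here) over the MFCC frames.
y, sr = librosa.load("long_recording.wav", sr=16000)   # placeholder file name
hop_length = 512
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)  # (13, n_frames)

frames_per_window = int(round(1.0 * sr / hop_length))  # number of frames closest to 1 s
step = frames_per_window // 2                          # 50% overlap

windows = [
    mfcc[:, start:start + frames_per_window]
    for start in range(0, mfcc.shape[1] - frames_per_window + 1, step)
]
# Each window now has the same shape as the 1 s training inputs and can be fed
# to the CNN one at a time; the window index tells you when the word occurred.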

How to count objects detected by trained classifiers using TensorFlow and OpenCV for traffic counts?

New to TensorFlow and its Object Detection API, but looking at improving a project I am working on. So far I've trained two classifiers (person and car) and I am able to count the number of times these two classes show up in a given image. However, my main goal is to count pedestrian and car traffic at a given intersection using a video. I have a working prototype that reads a video with OpenCV and counts the objects (classes) detected by the trained model, but it only counts the objects in the current frame. It is also really finicky and loses the tracked object easily because it runs really slowly. I am using Faster R-CNN because I want a precise count, but I feel this is slowing down the video. So I am having two problems:
1. Passing a video to the script runs it at 2 fps, so how can I improve that? I'm thinking of a queue pipeline.
2. How can I count unique instances of each object, not just the objects shown in the current frame? Right now I can only count objects in the frame, not the total number of objects that have passed by.
I've looked at using a region-of-interest approach, but the implementation I've seen working is bi-directional, and I am looking for something more versatile and entirely machine-driven.
I might be thinking about this entirely wrong, so any guidance is greatly appreciated.
