How to count objects in videos - Python

In general, is there any "best practice" for using videos as input to deep learning models? How can we annotate video most efficiently?
Also, I have some videos of ducks walking through a passage. I want to count the number of grey ducks and the number of yellow ducks passing through it. A duck can pass straight through (the easiest case), stay in the passage for a while before passing through, or go halfway through and turn back the other way (in which case it should not be counted).
I plan to use Mask-RCNN to segment the ducks in each frame, then compare the masks from frame i with the masks from frame i+1 and apply rules to count the number of distinct ducks that truly pass through the passage.
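For concreteness, the frame-to-frame matching I have in mind would look roughly like this (just a sketch; I am assuming the masks come out of Mask-RCNN as boolean NumPy arrays of the same shape):

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks of equal shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 0.0

def match_masks(masks_prev, masks_curr, iou_threshold=0.5):
    """Greedily pair masks from frame i with masks from frame i+1 by IoU."""
    matches, used = [], set()
    for i, prev in enumerate(masks_prev):
        best_j, best_iou = None, iou_threshold
        for j, curr in enumerate(masks_curr):
            if j not in used:
                iou = mask_iou(prev, curr)
                if iou > best_iou:
                    best_j, best_iou = j, iou
        if best_j is not None:
            used.add(best_j)      # each current mask is matched at most once
            matches.append((i, best_j))
    return matches
```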
This does not seem optimal to me.
Any ideas/help/hints?

I guess it depends on the video, but a good approach was to:

1. Annotate some not-too-similar frames with VIA: http://www.robots.ox.ac.uk/~vgg/software/via/
2. Use a model like YOLO or Mask-RCNN to find a bounding box over each object and classify it.

Optical flow is also an option instead of deep learning, but I eventually decided against it because several failure modes made it, in my view, less automatic:

* an object that moves, stops, and starts moving again would require special attention;
* an object of one dominant colour might be split into two pieces (its middle pixels might be seen as not moving);
* a group of objects passing together would probably be seen as a single object.
Then, using a tracking algorithm, you can give a specific ID to each object and count it when it crosses a certain line; a sketch follows.
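A minimal sketch of that idea, assuming you already have per-frame detections as bounding boxes; it uses naive nearest-centroid matching, whereas in practice you would probably reach for a proper tracker such as SORT:

```python
import math

class LineCounter:
    """Assign IDs to detections by nearest centroid; count line crossings."""

    def __init__(self, line_x, max_dist=50.0):
        self.line_x = line_x        # vertical counting line (x coordinate)
        self.max_dist = max_dist    # max centroid jump between frames
        self.tracks = {}            # id -> last centroid (x, y)
        self.next_id = 0
        self.count = 0

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2) detections for the current frame."""
        centroids = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
        new_tracks = {}
        for c in centroids:
            # match to the closest existing track, if close enough
            best_id, best_d = None, self.max_dist
            for tid, prev in self.tracks.items():
                d = math.dist(c, prev)
                if d < best_d:
                    best_id, best_d = tid, d
            if best_id is None:
                best_id = self.next_id
                self.next_id += 1
            else:
                prev = self.tracks.pop(best_id)
                # count only when the centroid crosses the line left-to-right
                if prev[0] < self.line_x <= c[0]:
                    self.count += 1
            new_tracks[best_id] = c
        self.tracks = new_tracks
        return self.count
```

Crossings in the opposite direction are simply ignored here, so an object that turns back halfway through is never counted.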

Related

Object recognition with CNN: what is the best way to train my model, photos or videos?

I aim to design an app that recognizes a certain type of object (let's say, a book) and can tell whether the input is actually a book or not (binary classification).
For a better user experience, I would like the input to be a video rather than a picture: that way, the user won't have to deal with issues such as sharpness or centering of the object. They'll just have to make a "scan" of the object, without much concern for the quality of any single image.
And here comes my problem: as I intend to create my training dataset from scratch (the object I want to detect being absent from existing datasets such as ImageNet), I was wondering whether videos are unsuitable for this type of binary classification and whether I should rather ask the user to take a good picture of the object.
On one hand, videos yield a larger dataset than photos alone (though I can expand my picture dataset with data augmentation), as it is easier to take a 10-second video of an object than to take 10×24 (more or less…) pictures of it.
But on the other hand, I fear the result will be less precise, as many frames in a video are redundant and the average quality might not be as good as that of a single, properly taken image.
Moreover, I do not intend to use the temporal dimension of the video (for a scan, temporality is useless) but rather to work one frame at a time (as depicted in this article).
What is the proper way to build my dataset? I would really like to keep this "scan" for the user's comfort, so if images are more precise than videos for such a classification, would it be possible to automatically extract a single good image from a "scan" and work directly on it?
Good question! The answer is: train your model the way you plan to use it. If you ask the user to take photos, train it on photos. If you ask the user to film the object, train it on frames extracted from video.
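A sketch of that frame-extraction step with OpenCV (book_scan.mp4 and the dataset/ folder are placeholders; every_n thins out the redundant frames you mention):

```python
import cv2

def extract_frames(video_path, every_n=5):
    """Yield every n-th frame of a video as a BGR image."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield frame
        index += 1
    cap.release()

# Example: save frames as training images (dataset/ must exist beforehand)
for i, frame in enumerate(extract_frames("book_scan.mp4", every_n=5)):
    cv2.imwrite(f"dataset/book_{i:04d}.jpg", frame)
```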
The images might seem blurry to you, but they won't be a problem for the computer. It will just learn to detect "blurry books", and that's OK: that's what you want.
Of course, this is not always the case. An image can become so blurry that the information about whether there is a book in the frame is simply no longer there. Where is the line? A general rule of thumb: if you can see it's a book, the computer will also be able to see it. Since I think blurry images of books are still recognizable as books, I think you can totally do it.
Creating "photos" (a single, sharp image) from a "scan" (blurrier frames from video) can be done; it's called super-resolution. But those models are pretty beefy, not something you would want to run on a mobile device.
On a completely unrelated note: try googling Transfer Learning! It will benefit you for sure :D.

Which data type should I choose for a unit selection in an RTS game?

What is a good data type for a unit collection in an RTS?
I'm contributing to an API that lets you write bots for the strategy game StarCraft 2 in Python.
Right now there is a units class that inherits from list. Every frame, a new units object gets created, and then selections of these units are made, creating new units objects, for example by filtering for all enemy units or all flying units.
We use these selections to find the closest enemy, to control our units, to select the units that can attack right now or need a different order, and so on.
But this also means we do a lot of filtering by the attributes of each unit in every frame, which takes a lot of time. Initializing one units object alone takes 2e-5 to 5e-5 seconds, and we do it millions of times per game, which can slow down the bot and the tests a lot, on top of the filtering itself, which loops over each unit in the units object.
Is there a better data type for this?
Maybe something that does not need to be recreated for each selection within a frame, but instead starts from the initial list of all units we get from the protocol buffer, and then lets selections and filters be applied without recreating the object? What would be a good way to implement this so that filtering multiple times per frame is not slow and/or complicated?
This doesn't sound like an abstract data type problem at all; it sounds like inefficient programming. It is impossible for us to tell you exactly what to construct to achieve what you're going for.
What you should probably investigate is how to construct a UnitView, if you don't actually need to modify the unit data. Consider something similar to how dictionaries return views in Python 3.
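A sketch of what such a view could look like (UnitView and the example predicates are hypothetical names, not part of the API; the point is that filters stack lazily and the underlying list is never copied):

```python
class UnitView:
    """A lazy, non-copying view over a units collection.

    Filters are stacked as predicates; units are only walked on iteration.
    """

    def __init__(self, units, predicates=()):
        self._units = units              # the one list built per frame
        self._predicates = predicates

    def filter(self, predicate):
        # returns a new lightweight view; the underlying list is shared
        return UnitView(self._units, self._predicates + (predicate,))

    def __iter__(self):
        for unit in self._units:
            if all(p(unit) for p in self._predicates):
                yield unit

# Hypothetical usage: one allocation per frame, cheap chained selections.
# all_units = UnitView(raw_units_from_protobuf)
# flying_enemies = all_units.filter(lambda u: u.is_enemy) \
#                           .filter(lambda u: u.is_flying)
```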

Tracking one or more arbitrary, previously unknown objects with opencv-python

As my question title says, does anyone know a way to track arbitrary, previously unknown objects in a video feed with opencv-python?

Let's say we have a table with different objects on it, and suppose the objects begin to move as if by magic. Is there a way to keep track of each object?

I have read a little about background subtraction, which highlights objects in the foreground, and about optical flow, which tells you which pixels have changed compared to the previous frame.

I am a beginner in image processing and I am looking for a way to put it all together and track arbitrary, previously unknown objects. I don't want to use a neural network to find and classify the objects in advance; I want to track the objects as soon as they begin to move. You can assume the camera is in a fixed position with a fixed perspective onto the table. Is there a way to track an object that has just begun to move? All I want to do is draw a rectangle around each object and keep track of it.

I don't want to classify the objects. I am not interested in the type of an object, only in its movement.

Can someone point me in the right direction? Code examples would be very welcome.
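For what it's worth, here is roughly what I have pieced together so far from the background-subtraction tutorials (a sketch only; table.mp4 and the area threshold are placeholders, and I am using the OpenCV 4 findContours signature):

```python
import cv2

cap = cv2.VideoCapture("table.mp4")              # placeholder video file
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # clean up noise before looking for moving objects
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:             # ignore tiny blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:             # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```

This draws a rectangle around each moving blob, but it does not keep identities from one frame to the next, which is the part I am stuck on.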

How to count trained classifier detections using TensorFlow and OpenCV for traffic counting?

I am new to TensorFlow and its Object Detection API, but I am looking to improve a project I am working on. So far I have trained two classes (person and car), and I can count the number of times these two classes show up in a given image. However, my main goal is to count the pedestrian and car traffic at a given intersection from a video. I have a working prototype that reads a video with OpenCV and counts the detected objects (classes) from the trained model, but it only counts the objects in the current frame. It is also really finicky and easily loses the tracked object because it runs so slowly. I am using Faster R-CNN because I want a precise count, but I suspect this is what is slowing the video down. So I have two problems:
Passing a video to the script runs it at 2 fps; how can I improve that? I am thinking of a queue pipeline (see the sketch below).
How can I count unique instances of each object, not just the objects visible in the current frame? Right now I can only count objects in the frame, not the total number of objects that have passed by.
I've looked at using a region-of-interest approach, but the implementation I've seen working is bi-directional; I am looking for something more versatile and entirely machine-driven.
I might be thinking about this entirely wrong, so any guidance is greatly appreciated.
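For reference, this is roughly what I mean by a queue pipeline (a sketch; intersection.mp4 is a placeholder and detect() stands in for my Faster R-CNN inference call):

```python
import queue
import threading
import cv2

frames = queue.Queue(maxsize=64)

def reader(path):
    """Read and decode frames on a separate thread."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.put(frame)            # blocks when the queue is full
    frames.put(None)                 # sentinel: end of video
    cap.release()

threading.Thread(target=reader, args=("intersection.mp4",), daemon=True).start()

while True:
    frame = frames.get()
    if frame is None:
        break
    # detections = detect(frame)    # placeholder Faster R-CNN inference call
    # ... count / draw here ...
```

That at least keeps video decoding off the inference thread, though I suspect the real bottleneck is Faster R-CNN itself.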

Count the number of people in the video

I am working on an image processing and computer vision project: counting the number of people entering a conference room. This needs to be done in Python with OpenCV.
I have already tried the Haar cascade for the upper body that ships with OpenCV: Detect upper body portion using OpenCV. However, it does not meet the requirement. The videos are available at the following link:
https://drive.google.com/open?id=0B3LatSCwKo2benZyVXhKLXV6R0U
If you view the sample1 file, at 0:16 a person enters the room, and that is always how people will enter. The camera is mounted above the door.
Identifying People from this Aerial Video Stream
I think there is a simple way of approaching this problem. Background subtraction methods for detecting moving objects are just what you need because the video you provided seems to only have one moving object at any point: the person walking through the door. Thus, if you follow this tutorial in Python, you should be able to implement a satisfying solution for your problem.
Counting People Entering / Exiting
Now, the first question that pops into my mind is: what do I do to count people who walk through the door at separate time intervals (one person walks in 10 seconds into the video and a second person walks in 20 seconds in)? Here's the simplest solution to this that I can think of. Once you've detected a blob via background subtraction, you only have to track it until it goes out of frame. Once it leaves the frame, the next blob you detect must be a new person entering the room, so you can keep counting. If you aren't familiar with tracking objects once they have been detected, give this tutorial a read. In this manner, you avoid counting the same blob (i.e., the same person) more than once.
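A sketch of that counting rule, under the assumption that at most one person is visible at a time (get_blobs is a placeholder for whatever your background-subtraction step returns, e.g. foreground contours above a minimum area):

```python
def count_entries(frames, get_blobs):
    """Count entries: a blob appearing after an empty frame is a new person.

    `frames` is an iterable of video frames; `get_blobs(frame)` is a
    placeholder returning the foreground blobs found by background
    subtraction for that frame.
    """
    count = 0
    occupied = False                  # was someone visible in the last frame?
    for frame in frames:
        blobs = get_blobs(frame)
        if blobs and not occupied:    # empty -> occupied transition
            count += 1
        occupied = bool(blobs)
    return count
```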
The Difficulties in Processing Complex Dynamic Environments
If there is a high level of traffic through that doorway, the problem becomes much more difficult. In that case there may not be much stationary background to subtract at any given moment, and there may be a lot of overlap between detected blobs. There is a lot of active research in autonomous pedestrian tracking and identification, so, in short, it's a hard problem without a straightforward, easy-to-implement solution. However, if you're interested in reading about potential approaches to these more challenging pedestrian-detection problems from an aerial view, I recommend reading the answers to this question.
I hope this helps, good luck coding!
