The current version of the TensorFlow Object Detection API only supports axis-aligned bounding boxes (no angle), represented by xmin, ymin, xmax, ymax.
I am looking for ideas to represent (and predict) bounding boxes with an angle/orientation.
Like this:
Use backpropagation to identify the pixels contributing most strongly to the activation, then apply a reasonable threshold to decide which pixels belong to the object.
The default algorithm does this and then computes an axis-aligned bounding box of the selected pixels (because it's really simple). You would need to run a different bounding-box algorithm that allows for arbitrary orientation. Wikipedia has some ideas (link).
To see how the relevant pixels are obtained, you can look inside the TensorFlow code.
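For illustration, here is a rough sketch of the oriented-box step, assuming you already have a binary mask of the selected pixels (the mask below is a dummy blob, and OpenCV's minAreaRect is just one possible rotated-box fit):

import cv2
import numpy as np

# mask: binary (H, W) array where non-zero pixels belong to the object
# (in practice this would come from your backprop/thresholding step)
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:200, 150:400] = 255  # dummy blob for the example

points = cv2.findNonZero(mask)        # N x 1 x 2 array of (x, y) coordinates
rect = cv2.minAreaRect(points)        # ((cx, cy), (w, h), angle)
corners = cv2.boxPoints(rect)         # 4 corners of the oriented box
print(rect)
print(corners.astype(int))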
Oriented bounding boxes are a very interesting topic; they have been somewhat ignored by deep-learning-based object detection approaches, and datasets for them are hard to find.
A recent paper/dataset/challenge which I found very interesting (especially because they pay attention to oriented boxes) can be found here:
http://captain.whu.edu.cn/DOTAweb/index.html
They don't share the code (nor give many details in the paper) of their modification of Faster R-CNN to work with oriented bounding boxes, but the dataset itself and the representation discussion are quite useful.
I'm using YOLOv8 to detect subjects in pictures. It's working well and can create quite precise masks over subjects.
from ultralytics import YOLO
model = YOLO('yolov8x-seg.pt')
for output in model('image.jpg', return_outputs=True):
    for segment in output['segment']:
        print(segment)
The code above works and generates a series of "segments", each of which is a list of points that defines the shape of a subject in my image. Those shapes are not convex (horses, for example).
I need to figure out if a random coordinate on the image falls within these segments, and I'm not sure how to do it.
My first approach was to build an image mask using PIL. That roughly worked, but not reliably - it fails depending on the shape of the segments. I also thought about using shapely, but it has restrictions on its Polygon classes which I think will be a problem in some cases.
In any case, this really feels like a problem that could easily be solved with the tools I'm already using (yolo, pytorch, numpy...), but to be honest I'm too new to all this to figure out how to do it properly.
Any suggestion is appreciated :)
You should be able to get a segmentation mask from your model: imagine a binary image where black (zeros) represents the background and white (or any other non-zero value) represents an instance of a segmentation class.
Once you have the binary image, you can use OpenCV's findContours function to get the largest outer contour.
Once you have that path you can use pointPolygonTest() to check if a point is inside that contour or not.
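Roughly, with OpenCV that could look like this (the mask and the query point are placeholders; in practice you would take the mask the model outputs, or rasterize the YOLO segment polygon into one):

import cv2
import numpy as np

# binary mask: non-zero pixels belong to the detected subject
mask = np.zeros((480, 640), dtype=np.uint8)
cv2.circle(mask, (320, 240), 100, 255, -1)   # dummy subject for the example

# OpenCV 4 returns (contours, hierarchy)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)

point = (300.0, 250.0)                        # random coordinate to test
# result > 0: inside, == 0: on the contour, < 0: outside
result = cv2.pointPolygonTest(largest, point, False)
print("inside" if result >= 0 else "outside")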
As I understand it, anchor-based methods use multiple predefined boxes at once to predict a bounding box close to the ground truth.
1. Is it correct?
2. And what is anchor-free?
3. What is the difference between anchor-based and anchor-free (methods, pros, cons,...)?
I'm new and thanks for any answer!
The following paper provides a quick overview which you might find interesting: https://ieeexplore.ieee.org/document/9233610
What I understood is that there are a few approaches to finding bounding boxes. These are categorized as:
Sliding window: Consider all possible bounding boxes
Anchor-based: Find prior knowledge on which widths and heights are most suitable for each class type (basically the same as learning common aspect ratios for each class). Then tile those boxes over the image and just predict the probability of each tile (a small illustrative sketch follows below).
YOLOv5 uses clustering to estimate anchor boxes before training and saves them. That said, this approach has its disadvantages: first, you must learn anchor boxes for each class; second, your accuracy may depend on how well those anchor boxes were estimated.
Anchor-free: Instead of using prior knowledge or considering all possibilities, they predict two points (top-left and bottom-right) for every object directly.
YOLOv3, YOLOv4 and YOLOv5 use anchors, but YOLOX and CornerNet don't.
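To make the "tile prior boxes over the image" idea a bit more concrete, here is a small illustrative sketch; the image size, stride, scales and aspect ratios are made up, not taken from any particular detector:

import numpy as np

def generate_anchors(img_size=416, stride=32, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    # Tile anchor boxes (cx, cy, w, h) over a regular grid of cell centers.
    anchors = []
    for cy in np.arange(stride / 2, img_size, stride):
        for cx in np.arange(stride / 2, img_size, stride):
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(generate_anchors().shape)   # (num_grid_cells * num_scales * num_ratios, 4)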
Though not a complete explanation, I think you get the point.
References:
Anchor Boxes for Object Detection
A fully convolutional anchor-free object detector
Forget the hassles of Anchor boxes with FCOS: Fully Convolutional One-Stage Object Detection
I'm trying to understand how YOLOv3 works, and one thing still confuses me: YOLO can determine the bounding box (its coordinates and dimensions), so why doesn't it output these values directly instead of using them to adjust the anchor boxes?
Most object detection algorithms compute offsets (x, y, width, height) for bounding boxes relative to fixed anchors.
Anchors are generally generated to follow a fixed grid: for each location on the grid, a set of anchors with different aspect ratios and different areas is created.
It's much easier for the learning algorithm to output an offset from a fixed anchor, from which it can deduce the overall coordinates, than to predict the overall coordinates directly, because the offset is a local and position-invariant quantity.
It means that if there is a dog with a slightly off-center bounding box in the top-left of the picture, the algorithm is asked to output the same offset as it would if the dog were in the bottom-right of the picture; this makes the prediction robust to shifts and does not require the network to learn the global position of the object in the image.
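As a rough illustration of that idea, here is a YOLO-style decoding sketch; the exact parameterization varies between papers, this just shows how a predicted offset plus a fixed anchor gives the final box:

import numpy as np

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    # Turn predicted offsets into absolute box coordinates (YOLO-style).
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = (cell_x + sigmoid(tx)) * stride   # box center x in pixels
    by = (cell_y + sigmoid(ty)) * stride   # box center y in pixels
    bw = anchor_w * np.exp(tw)             # width rescales the anchor width
    bh = anchor_h * np.exp(th)             # height rescales the anchor height
    return bx, by, bw, bh

# the same offsets produce the same box relative to the cell, wherever the cell is
print(decode_box(0.2, -0.1, 0.3, 0.0, cell_x=3, cell_y=4, anchor_w=60, anchor_h=90, stride=32))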
There are several packages and methods for segmentation in Python. However, if I know a priori that certain pixels (and no others) correspond to a particular object, how can I use that to segment other objects?
Which methods implemented in Python would lend themselves to this approach?
Thanks.
You'll want to take a look at semi-automated image segmentation. Semi-automated means that you know beforehand which class certain pixels belong to - either foreground or background. Given this a priori information, the goal is to minimize an energy function that best segments the remaining pixels into foreground and background.
The best two methods that I know of are Graph Cuts and Random Walks. If you want to study the fundamentals of both of them, you should read the canonical papers by Boykov (Graph Cuts) and Grady (Random Walks) respectively:
Graph Cuts - Boykov: http://www.csd.uwo.ca/~yuri/Papers/ijcv06.pdf
Random Walks - Grady: http://webdocs.cs.ualberta.ca/~nray1/CMPUT615/MRF/grady2006random.pdf
For Graph Cuts, OpenCV uses the GrabCut algorithm, which is an extension of the original Graph Cuts algorithm: http://en.wikipedia.org/wiki/GrabCut. Essentially, you draw a box around the object you want segmented, Gaussian mixture models are used to model the foreground and background, and the object is segmented from the background inside this box. Additionally, you can add foreground and background markers inside the box to further constrain the solution and ensure a good result.
Take a look at this official OpenCV tutorial for more details: http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_grabcut/py_grabcut.html
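For reference, a minimal GrabCut call looks roughly like this (the image path and the rectangle are placeholders you would replace with your own):

import cv2
import numpy as np

img = cv2.imread('image.jpg')                 # placeholder image path
mask = np.zeros(img.shape[:2], dtype=np.uint8)
bgd_model = np.zeros((1, 65), dtype=np.float64)
fgd_model = np.zeros((1, 65), dtype=np.float64)

rect = (50, 50, 300, 400)                     # (x, y, w, h) box around the object
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# keep pixels labelled as definite or probable foreground
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
segmented = img * fg[:, :, None]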
For Random Walks, this is implemented in the scikit-image library and here's a great tutorial on how to get the segmentation up and running off of their official website: http://scikit-image.org/docs/dev/auto_examples/plot_random_walker_segmentation.html
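And a minimal random-walker example with scikit-image; the marker positions here are made up, but in your case they would be the pixels you already know belong to the object or the background:

import numpy as np
from skimage.segmentation import random_walker

# toy image: a bright square on a dark, noisy background
img = np.zeros((100, 100))
img[30:70, 30:70] = 1.0
img += 0.2 * np.random.randn(100, 100)

# markers: 0 = unknown, 1 = known background, 2 = known object
markers = np.zeros(img.shape, dtype=np.uint8)
markers[5, 5] = 1        # a pixel known to be background
markers[50, 50] = 2      # a pixel known to be the object

labels = random_walker(img, markers, beta=10)   # same shape as img, values 1 or 2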
Good luck!
I have written a program in Python which automatically reads score sheets like this one
At the moment I am using the following basic strategy:
Deskew the image using ImageMagick
Read into Python using PIL, converting the image to B&W
Calculate the sums of pixels along the rows and the columns
Find peaks in these sums (a rough sketch of these two steps is given after this list)
Check the intersections implied by these peaks for fill.
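To illustrate, a rough numpy/scipy sketch of those two projection/peak steps (the binary array here is a dummy stand-in for the actual thresholded scan):

import numpy as np
from scipy.signal import find_peaks

# binary: 2-D array from the B&W image, 1 marking inked pixels
binary = np.zeros((200, 300), dtype=np.uint8)
binary[::20, :] = 1    # dummy horizontal grid lines for the example
binary[:, ::30] = 1    # dummy vertical grid lines

row_sums = binary.sum(axis=1)    # one value per row
col_sums = binary.sum(axis=0)    # one value per column

# peaks in the projections give candidate grid-line positions
row_peaks, _ = find_peaks(row_sums, height=0.5 * row_sums.max())
col_peaks, _ = find_peaks(col_sums, height=0.5 * col_sums.max())
print(row_peaks, col_peaks)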
The result of running the program is shown in this image:
You can see the peak plots below and to the right of the image shown in the top left. The lines in the top left image are the positions of the columns and the red dots show the identified scores. The histogram bottom right shows the fill levels of each circle, and the classification line.
The problem with this method is that it requires careful tuning and is sensitive to differences in scanning settings. Is there a more robust way of recognising the grid, one that requires less a priori information (at the moment I am using knowledge of how many dots there are) and is more robust to people drawing other shapes on the sheets? I believe it may be possible using a 2D Fourier transform, but I'm not sure how.
I am using the EPD, so I have quite a few libraries at my disposal.
First of all, I find your initial method quite sound and I would have probably tried the same way (I especially appreciate the row/column projection followed by histogramming, which is an underrated method that is usually quite efficient in real applications).
However, since you want to go for a more robust processing pipeline, here is a proposal that can probably be fully automated (also removing at the same time the deskewing via ImageMagick):
Feature extraction: extract the circles via a generalized Hough transform. As suggested in other answers, you can use OpenCV's Python wrapper for that (a small sketch of the call is given at the end of this answer). The detector may miss some circles, but this is not important.
Apply a robust alignment detector using the circle centers. You can use Desolneux's parameter-less detector described here. Don't be put off by the math; the procedure is quite simple to implement (and you can find example implementations online).
Get rid of diagonal lines by filtering on orientation.
Find the intersections of the lines to get the dots. You can use these coordinates for deskewing by assuming ideal fixed positions for these intersections.
This pipeline may be a bit CPU-intensive (especially step 2, which performs a kind of greedy search), but it should be quite robust and automatic.
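For step 1, the circle detection could look roughly like this; the file name and every parameter (blur size, dp, minDist, param1/param2, radii) are placeholders that would need tuning to the scan resolution:

import cv2
import numpy as np

gray = cv2.imread('sheet.png', cv2.IMREAD_GRAYSCALE)   # placeholder file name
gray = cv2.medianBlur(gray, 5)

circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=50, param2=30, minRadius=5, maxRadius=25)
if circles is not None:
    centers = np.round(circles[0, :, :2]).astype(int)  # (x, y) of each detected circle
    print(len(centers), "circles found")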
The correct way to do this is to use connected-component analysis on the image to segment it into "objects". Then you can use higher-level algorithms (e.g. a Hough transform on the component centroids) to detect the grid, and determine for each cell whether it is on or off by looking at the number of active pixels it contains.
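A quick sketch of that idea with OpenCV; the file name and the thresholding choice are placeholders:

import cv2
import numpy as np

gray = cv2.imread('sheet.png', cv2.IMREAD_GRAYSCALE)   # placeholder file name
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# label connected components; stats holds x, y, w, h, area for each component
n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

for i in range(1, n):                         # label 0 is the background
    area = stats[i, cv2.CC_STAT_AREA]
    cx, cy = centroids[i]
    # the centroids can feed a Hough transform / grid fit, and the
    # number of active pixels (area) indicates whether a cell is filled
    print(i, area, (cx, cy))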