I have 3 images of different objects: a smartphone, a shirt, and a packet of pasta.
I want to recognize each of these objects in any image that contains one of them.
For example, if the same phone appears in a picture, I want to see the phone highlighted with a bounding box drawn in that picture. If the phone is a different one, nothing should be drawn.
I first tried to perform object recognition using a neural network like Mask R-CNN with Python and TensorFlow. But I realized that I don't have a huge training dataset, only my 3 images. Neural network algorithms seem suited to recognizing concepts like dog, smartphone, or landscape, but not a particular dog, a specific smartphone, or a specific landscape.
To get to the point: given any input picture that contains the same smartphone, the same shirt, or the same packet of pasta, I want the program to detect that.
What algorithms are best suited to perform this recognition?
Try using the COCO dataset. Since the COCO weights have already been trained on thousands of items and images, you should just be able to run Mask R-CNN's splash feature to help with detection.
Worst case, if you want to train on your own dataset, just find a lot of photos online related to the objects you want to detect, annotate them, then train.
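For instance, here is a hedged sketch of running inference with pre-trained COCO weights, assuming the matterport/Mask_RCNN package (the weights path and image path are assumptions):

```python
import skimage.io
import mrcnn.model as modellib
from mrcnn.config import Config

# Minimal inference config: 80 COCO classes + background
class InferenceConfig(Config):
    NAME = "coco_inference"
    NUM_CLASSES = 1 + 80
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)  # pre-trained COCO weights

image = skimage.io.imread("photo.jpg")
results = model.detect([image], verbose=0)
r = results[0]  # dict with 'rois' (boxes), 'masks', 'class_ids', 'scores'
print(r["class_ids"], r["scores"])
```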
As an example, I have two pictures with a particular piece of clothing of a certain brand.
I can download a lot of different images of this same piece of clothing, in the same color.
I want to create a model that can recognize the item based on a picture.
I tried to do it using this example:
https://www.tensorflow.org/tutorials/keras/classification
This can recognize the type of clothing (e.g. shirt, shoe, trousers, etc.), but not a specific item and color.
My goal is to have a model that can tell me that the person in my first picture is wearing the item from my second picture.
As mentioned, I can upload a few variations of this same item to train my model, if that would be the best approach.
I also tried to use https://pillow.readthedocs.io
This can do something with color recognition but does not solve my initial goal.
I don't think a CNN can help with your problem; take a look at the SIFT technique (see this for more details). It is used for image matching, and I think it fits your case better. If you're not looking to get into too much detail, OpenCV is a Python (and C++, I think) library that has easy-to-use image matching functions (more details).
As mentioned by @nadji mansouri, I would use the SIFT technique as it suits your need. But I just want to correct something: a CNN is also an option in this case. That said, I wouldn't tackle the problem as a classification problem, but rather use Distance Metric Learning, i.e., train a model to generate embeddings that are close in the embedding space when the inputs are similar, and distant otherwise. But to do this you need a large representative dataset.
In short, I suggest starting with SIFT, using OpenCV or open-source implementations on GitHub, playing around with the parameters to see what fits your case best, and then deciding whether it's really necessary to switch to a neural network, in which case you would tackle the problem as a metric learning task, maybe with something like siamese networks.
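As a concrete starting point, here is a minimal SIFT matching sketch with OpenCV (the file names and the match threshold are assumptions you will need to adapt to your images):

```python
import cv2

# Load the reference object and the scene to search, in grayscale
ref = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)
scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors for both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(ref, None)
kp2, des2 = sift.detectAndCompute(scene, None)

# Match each reference descriptor to its 2 nearest neighbors in the scene,
# then keep only matches that pass Lowe's ratio test
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Simple decision rule: enough good matches means "same object"
MIN_MATCHES = 10  # threshold to tune on your own images
print("Same object" if len(good) >= MIN_MATCHES else "No match")
```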
Some definitions:
Metric learning is an approach based directly on a distance metric that aims to establish similarity or dissimilarity between data (images in your case). Deep Metric Learning, on the other hand, uses neural networks to automatically learn discriminative features from the data and then compute the metric. source.
The Scale-Invariant Feature Transform (SIFT) is a method used in computer vision to detect and describe local features in images. The algorithm is invariant to image scale and rotation, and robust to changes in illumination and affine distortion. SIFT features are represented by local image gradients, which are calculated at various scales and orientations, and are used to identify keypoints in an image. These keypoints and their associated descriptor vectors can then be used for tasks such as image matching, object recognition, and structure from motion. source, with modification.
I have a dataset of images split into two folders: test and training. I need to do object detection using OpenCV and YOLO.
Thus, I need to create my own YOLO model for the street objects.
For the training folder: [screenshot of the training folder]
Example training image: [example training image]
For the test folder: [screenshot of the test folder]
I have a classes .txt file which includes the id, name, and classification (warning, indication, or mandatory).
Example:
0 = animal crossing (warning)
1 = soft verges (warning)
2 = road narrows (warning)
Here, each line gives the id (the number used in the training folder), the name, and the classification.
My goal is to create a YOLO model from these training images. I have checked some papers and articles, but in their case they label the full image using labelImg, whereas my training images are so small that they don't need any labeling.
Thus, I'm confused about how to do this. Could you please give me some ideas?
Labeling images is a must with YOLO; that's how it computes its loss function. To detect objects it uses something called intersection over union (IoU), which measures how well a predicted bounding box overlaps a labeled one.
An easier way to label images is to use the Roboflow site.
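For illustration, here is a minimal sketch of how IoU can be computed for two boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```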
I would refer to this image that describes the different types of computer vision tasks.
I think what you want to do is a classification task. YOLO is for object detection tasks, where you usually want to detect more than one object per image.
For classification tasks it can be easier, because you don't need to make separate label files: the names of the folders are the labels. Here is an example of a classification model that you can use: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
If you really want to use YOLO, you will need to make label files. If you are going to classify the whole image, then the annotation format is easy. It would be something like this:
`0 0.5 0.5 1 1`
The first column is the class number (0, 1, 2, 3, etc.); the remaining columns are the box center x, center y, width, and height, normalized to the image size, so a whole-image box is centered at (0.5, 0.5) with width and height 1. You will need to make one .txt file for each image, named after the image.
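If your images are already grouped into folders named by class id, a small script can generate these whole-image label files (a sketch; the folder layout and .jpg extension are assumptions):

```python
from pathlib import Path

images_root = Path("images")  # assumed layout: images/<class_id>/<name>.jpg
labels_root = Path("labels")

for img in images_root.rglob("*.jpg"):
    class_id = img.parent.name  # the folder name is the class id, e.g. "0"
    label_file = labels_root / class_id / (img.stem + ".txt")
    label_file.parent.mkdir(parents=True, exist_ok=True)
    # class x_center y_center width height (normalized): the whole image
    label_file.write_text(f"{class_id} 0.5 0.5 1 1\n")
```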
Does this help you?
I'm new to the computer vision world, and I'm trying to create a script whose objective is to gather data from a dataset of images.
I'm interested in what kinds of objects are in those images, and I want a summary of them in a JSON file for every image.
I've checked out some YOLO implementations, but the ones I've seen are almost always based on COCO with 80 classes, or trained on a custom dataset.
I've seen that there are algorithms like InceptionV3, etc., which are capable of classifying 1000 classes. But as I understand it, object classification is different from object recognition.
Is there a way to use those big-dataset classification algorithms for object detection?
Or any other suggestion?
Unfortunately, I do not know where the breaking point is, and of course, it will depend on acceptable evaluation metrics and training data size.
From a technical point of view, there is no hard limit, and if you go to extremes there could be Core ML model size issues and memory issues during inference. However, that will only happen for an extremely large number of classes.
From a modeling perspective (which is a problem that will happen much earlier than the technical limitation) it is not as clear. As you increase the number of classes, you increase the risk of making classification mistakes. Although, the severity of a lot of the mistakes should simultaneously go down as you will have more and more classes that are naturally similar (breeds of dogs, etc.). The original YOLO9000 paper (https://arxiv.org/pdf/1612.08242.pdf) trained a model using 9000+ classes with reasonable results (lots of mistakes of course, but still impressive). They trained it on a combination of detection and classification data, so if they actually had detection data for all 9000, then results would presumably be even better.
In your experiment, it sounds like 50-60 was OK (thanks for giving us a sample point!). Anything below 100 is definitely tried and true, as long as you have the data. However, will 300 do OK? Will 1000 do OK? Theoretically, I would say yes, if you are able to provide enough training data and you adjust your expectation of what a good evaluation metric is since you know you'll make more mistakes. For instance, for classification with 1000 classes, it is common to report top-5 accuracy (that is, the correct label is in your top-5 classes for a sample).
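For reference, here is a minimal sketch of how top-k accuracy can be computed (the function and variable names are illustrative):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (n_samples, n_classes) array of class scores
    labels: (n_samples,) array of integer true labels
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)   # is the true label among them?
    return hits.mean()
```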
Here is a useful link - https://github.com/apple/turicreate/issues/968
First, to level set on terminology.
Image Classification based neural networks, such as Inception and ResNet, classify an entire image based upon the classes the network was trained on. So if the image has a dog, then the classifier will most likely return the class dog with a higher confidence score as compared to the other classes the network was trained on. To train a network such as this, it's simple enough to group the same-class images (all images with a dog) into folders as inputs. ImageNet and Pascal VOC are examples of public labeled datasets for Image Classification.
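For example, here is a minimal sketch of this folder-based setup, assuming PyTorch/torchvision and a hypothetical data/train layout:

```python
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Expects a layout like: data/train/cat/*.jpg, data/train/dog/*.jpg
train_set = datasets.ImageFolder("data/train", transform=transform)
print(train_set.classes)  # e.g. ['cat', 'dog'] - labels inferred from folder names
```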
Object Detection based neural networks, on the other hand, such as SSD and YOLO, return a set of coordinates that indicate a bounding box and a confidence score for each class (object) that is detected, based upon what the network was trained with. To train a network such as this, each object in an image must be annotated with a set of coordinates that correspond to the bounding box of the class (object). The COCO dataset, for example, is an annotated dataset of 80 classes (objects) with coordinates corresponding to the bounding box around each object. Another popular dataset is Objects365, which contains 365 classes.
Another important type of neural network that the COCO dataset provides annotations for is Instance Segmentation models, such as Mask RCNN. These models provide pixel-level classification and are extremely compute-intensive, but critical for use cases such as self-driving cars. If you search for Detectron2 tutorials, you will find several great learning examples of training a Mask RCNN network on the COCO dataset.
So, to answer your question: yes, you can use the COCO dataset (among many other options available publicly on the web) for object detection, or you can create your own dataset with a little effort by annotating your own images with bounding boxes around the object classes you want to train. Try Googling 'using coco to train ssd model' to get some easy-to-follow tutorials. SSD stands for single-shot detector and is an alternative neural network architecture to YOLO.
I'm just a beginner in machine learning. I've only learnt supervised machine learning so far, with some basic image classification and regression problems. I've just done an image classification problem with sklearn's load_digits(), which has about 1800 images of the characters 0-9 (description of the dataset). What I want to do is make my own dataset instead of loading it from sklearn like:
from sklearn.datasets import load_digits
I want to use my own dataset. So can someone guide me on how to make my own dataset, in CSV or any other format, so that I can use it in my supervised machine learning workflow?
The first thing would be to understand your use case: there is a difference between OCR and image classification tasks. Let's look at both scenarios.
Image Classification: The task is similar to the standard supervised tasks you might have seen in ML, only in this case we classify an image instead of a row in a sheet. Data curation is one of the major tasks involved in image classification, and the final accuracy depends heavily on how you processed your data. Let's say that, given an image, you want to identify whether it's a dog or a cat. This would require you to collect at least 500 images each of different types of dogs and cats. You can also artificially create images by taking a picture of a dog and using the Python OpenCV library to add some noise or rotation and save the updated image; this way you can collect more images in a short span of time. Once you have images for all the categories you want to classify (dogs and cats), you can move on to model selection. CNNs (Convolutional Neural Networks) are considered best for image classification tasks, but creating and tuning them from scratch can take a long time. My advice would be to use the TensorFlow Object Detection API, which provides a good framework for beginners to build their own image classifier or object detector, with many pre-trained models to choose from: https://github.com/tensorflow/models/tree/master/research/object_detection
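For instance, here is a minimal augmentation sketch with OpenCV (the file names are illustrative):

```python
import cv2
import numpy as np

img = cv2.imread("dog.jpg")  # hypothetical input image

# Rotate 15 degrees around the image center
h, w = img.shape[:2]
M = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
rotated = cv2.warpAffine(img, M, (w, h))

# Add Gaussian noise and clip back to the valid 0-255 range
noise = np.random.normal(0, 10, img.shape).astype(np.int16)
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)

cv2.imwrite("dog_rotated.jpg", rotated)
cv2.imwrite("dog_noisy.jpg", noisy)
```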
OCR: OCR is one of the more complex applications of image classification, and it's not that easy to build from scratch. The example you mentioned in your question looks like OCR, but it's more or less an image classification task, since you have a single image of each character that you are trying to classify. Real-world OCR involves handwritten notes and extracting the text written in them, which is a complicated process. There are some prebuilt libraries, like Tesseract, that specialize in OCR: they take an input image with text written on it and return the text present in the image as a string. However, these libraries struggle with handwritten text, as it is much more difficult to read. If you are interested in building an OCR system from scratch, it will require a great deal of image processing. Let's say you have an image with a phone number written on it. Your OCR system would first have to detect each digit separately by drawing detection boxes around each number in the image (you can use the TensorFlow Object Detection API mentioned above). But if the image contains alphabets, numbers, and symbols, it becomes a complex task to first collect individual images of each alphabet, number, and symbol. My advice again would be to use APIs that are free and also much more accurate. I used the Microsoft Cognitive Services Vision API, which has an OCR function to detect any type of text in an image. This would reduce your effort to just properly cleaning the image.
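For reference, here is a minimal sketch using the pytesseract wrapper around Tesseract (the image path is illustrative; the Tesseract binary must be installed separately):

```python
import pytesseract
from PIL import Image

# Returns the text found in the image as a plain string
text = pytesseract.image_to_string(Image.open("note.png"))
print(text)
```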
I am new to TensorFlow. I want to write a neural network that gets noisy images from one file and uncorrupted images from another.
I then want to correct the noisy images based on the other images.
What you are talking about is a denoising autoencoder.
This is not my code. It ranked very high on a Google search and has several GitHub stars and forks, all of which are good indicators that it is a working and supported implementation.
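For orientation, here is a minimal denoising autoencoder sketch in Keras (not the linked implementation; the 64x64 grayscale shape and the training data names are assumptions):

```python
from tensorflow.keras import layers, models

# Assumes pairs of (noisy, clean) images, e.g. 64x64 grayscale scaled to [0, 1]
inputs = layers.Input(shape=(64, 64, 1))
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2, padding="same")(x)          # encoder: compress
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                          # decoder: expand back
outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(noisy_images, clean_images, epochs=10, batch_size=32)
```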
Actually, I'm trying to train a NN that gets corrupted images and, based on the corresponding ground truth, removes the noise from those images. It must be a Network-in-Network, in other words pixel-independent.