How to generate embeddings of object images detected with YOLOv5? - python

I want to run real time object detection using YOLOv5 on a camera and then generate vector embeddings for cropped images of detected objects.
I currently generate image embeddings using this function below for locally saved images:
def generate_img_embedding(img_file_path):
    images = [
        Image.open(img_file_path)
    ]
    # Encoding a single image takes ~20 ms
    embeddings = embedding_model.encode(images)
    return embeddings
I also start YOLOv5 object detection with image cropping as follows:
def start_camera(productid):
    print("Attempting to start camera")
    # productid = "11011"
    try:
        command = "python ./yolov5/detect.py --source 0 --save-crop --name " + productid + " --project ./cropped_images"
        os.system(command)
        print("Camera running")
    except Exception as e:
        print("error starting camera!", e)
How can I modify the YOLOv5 model to pass the cropped images into my embedding function in real time?

Just take a look at the detect.py supplied with yolov5, the file you are running. The implementation is pretty short (~150 SLOC), I would recommend re-implementing it or modifying for your use case.
Key points, omitting a lot of (important, but standard and easily understandable) data transforms and parameter parsing, are as follows:
device = select_device(device)
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data)
# Code selecting FP16/FP32 omitted here
model.warmup(imgsz=(1 if pt else bs, 3, *imgsz), half=half)
for path, im, im0s, vid_cap, s in dataset:
    im = torch.from_numpy(im).to(device)
    # Image transforms omitted
    pred = model(im, augment=augment, visualize=visualize)  # stage 1
    pred = non_max_suppression(pred, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)  # stage 2
    for i, det in enumerate(pred):
        if len(det):
            # Rescale boxes from img_size to im0 size
            det[:, :4] = scale_coords(im.shape[2:], det[:, :4], im0.shape).round()
            # --> This is where you would access detections in real time! <--
Most of the code's logic is handling the I/O (in particular, dataset loading is handled by either LoadStreams or LoadImages from yolov5's utils), the rest is just rescaling input images, loading a torch model, and running detection and NMS. No rocket science here.
The least effort path for you would be just copying the entire thing and implementing your embeddings under
for *xyxy, conf, cls in reversed(det):
Instead of saving to file, you would take the box corners (x1, y1, x2, y2) from xyxy and crop the image using e.g. Pillow's Image.crop() or by slicing the numpy array directly. Which one works for you depends on the implementation of your embedding_model.encode.
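As a rough sketch of that last step, assuming `det` has already been converted to a NumPy array of rows `[x1, y1, x2, y2, conf, cls]` (e.g. via `det.cpu().numpy()`) and `im0` is the original H x W x 3 frame, the cropping could look like this. `crop_detections` and `embed_fn` are hypothetical names; `embed_fn` stands in for whatever wraps your `embedding_model.encode`.

```python
import numpy as np

def crop_detections(im0, det, embed_fn=None):
    """Crop each detected box out of the frame and optionally embed it."""
    h, w = im0.shape[:2]
    results = []
    for x1, y1, x2, y2, conf, cls in det:
        # Clamp to the frame and convert to integer pixel indices
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), w), min(int(y2), h)
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        crop = im0[y1:y2, x1:x2]
        embedding = embed_fn(crop) if embed_fn is not None else None
        results.append((crop, float(conf), int(cls), embedding))
    return results

frame = np.zeros((480, 640, 3), dtype=np.uint8)
dets = np.array([[10.0, 20.0, 110.0, 220.0, 0.9, 0.0]])
crops = crop_detections(frame, dets)
print(crops[0][0].shape)  # (200, 100, 3)
```

If your encoder expects PIL images rather than arrays, you would probably need to convert each crop first, keeping in mind that OpenCV frames are BGR while PIL assumes RGB.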

Related

How to convert YoloV5 Model Result to Pandas then back to Model Result

tl; don't want to read
How do I convert YoloV5 model results into results.pandas(), sort it, then convert it back into results so I can access the useful methods like results.render() or results.crop()?
Context:
I've recently learned how to load and do inference with a YoloV5 model:
# Load model
model = torch.hub.load('./yolov5', 'custom', path='/content/drive/MyDrive/models/best.pt', source='local') # local repo
# Import Image
im1 = 'https://ultralytics.com/images/zidane.jpg'
im2 = 'https://ultralytics.com/images/bus.jpg'
# Do Inference
results = model([im1, im2])
I also learned that this results object returned from inference has really useful methods for getting the result in different formats:
imgs = results.render() # gives image results with bounding boxes
crops = results.crop(save=True) # cropped detections dictionary
df = results.pandas().xyxy[0] # Pandas DataFrame of 1 image
n_df = results.pandas().xyxyn[0] # Pandas DataFrame of 1 image with normalized coordinates
My use-case here was to sort it, then get the top 20 in terms of confidence.
top_20 = results.pandas().xyxy[0].sort_values('confidence', ascending=False).head(20) # get top 20 sorted by confidence
Now I'm not sure how to turn it back to just results, so I can also access the same utility methods like .render() and .crop()
I think I could also create my own render and crop functions with OpenCV using my sorted dataframes as args, but I was just wondering if there was a more intuitive way to just reuse those utility methods.
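A minimal sketch of the "write my own crop function" route mentioned above, assuming the sorted DataFrame uses the column names that `results.pandas().xyxy[0]` produces (`xmin`, `ymin`, `xmax`, `ymax`, `confidence`, `name`) and is passed in as a list of row dicts, e.g. `top_20.to_dict("records")`. `crop_rows` is a hypothetical helper, not part of YOLOv5.

```python
import numpy as np

def crop_rows(image, rows, k=20):
    """Return (name, confidence, crop) for the k most confident rows."""
    ordered = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:k]
    crops = []
    for r in ordered:
        x1, y1 = int(r["xmin"]), int(r["ymin"])
        x2, y2 = int(r["xmax"]), int(r["ymax"])
        crops.append((r["name"], r["confidence"], image[y1:y2, x1:x2]))
    return crops

image = np.zeros((100, 100, 3), dtype=np.uint8)
rows = [
    {"xmin": 0, "ymin": 0, "xmax": 10, "ymax": 20, "confidence": 0.4, "name": "tie"},
    {"xmin": 5, "ymin": 5, "xmax": 50, "ymax": 60, "confidence": 0.9, "name": "person"},
]
for name, conf, crop in crop_rows(image, rows, k=1):
    print(name, conf, crop.shape)  # person 0.9 (55, 45, 3)
```

This does not give back a results object, so `.render()` is not available; it only covers the cropping half with plain array slicing.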

extracting coordinates from computer vision inference

I converted this computer vision model (7.x) to an ONNX-type model that can be used with the OpenVINO toolkit. From what I have read about its use in other applications, the model has the characteristics I am after.
I think my question is fairly basic and comes from not understanding computer vision well enough. Can someone give me some tips on how to loop through the model output to get "bounding boxes" that I can draw with OpenCV?
I am using this on CPU with pip-installed OpenVINO:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from openvino.runtime import Core
model_path = (
    f"./yolov7.xml"
)
ie_core = Core()
def model_init(model_path):
    model = ie_core.read_model(model=model_path)
    compiled_model = ie_core.compile_model(model=model, device_name="CPU")
    input_keys = compiled_model.input(0)
    output_keys = compiled_model.output(0)
    return input_keys, output_keys, compiled_model
input_key, output_keys, compiled_model = model_init(model_path)
# resize the image so it works with the model dimensions
image = cv2.resize(image, (width, height))
image = image.transpose((2,0,1))
image = image.reshape(1,3, height,width)
# Run inference on image, trying .output(1) first
boxes = compiled_model([image])[compiled_model.output(1)]
The code works and outputs an array, but what does this data contain? I expected a confidence value that I could use to filter out bad predictions, as well as bounding box coordinates.
If I print(compiled_model), the output is, I think, the model architecture:
<CompiledModel:
inputs[
<ConstOutput: names[input.1] shape{1,3,640,640} type: f32>
]
outputs[
<ConstOutput: names[812] shape{1,25200,85} type: f32>,
<ConstOutput: names[588] shape{1,3,80,80,85} type: f32>,
<ConstOutput: names[669] shape{1,3,40,40,85} type: f32>,
<ConstOutput: names[750] shape{1,3,20,20,85} type: f32>
]>
Does this tell me anything about the model output, like what the data would contain? Or does boxes.shape, which returns:
(1, 3, 80, 80, 85)
for box in boxes:
    print(box)
This just prints numpy arrays with lots of float data. Can anyone help me understand, at a high level, what I need to learn to draw bounding boxes around features inside the image?
When I tried to replicate this, your code failed with "NameError: name 'image' is not defined". In your output, each ConstOutput only represents a port/node of your model. To make sure your model works, run your yolov7.xml file with the OpenVINO Benchmark Python Tool; you should not receive any errors.
To learn the OpenVINO Inference Engine API usage for creating bounding boxes and handling the model, you may refer to the source code of the Object Detection Python Demo in the OpenVINO samples. Here is another example of creating bounding boxes:
for box in boxes:
    # Pick the confidence factor from the last place in the array.
    conf = box[-1]
    if conf > threshold:
        # Convert float to int and multiply the corner position of each box by the x and y ratio.
        # If the bounding box is found at the top of the image,
        # position the upper box bar a little lower to make it visible on the image.
        (x_min, y_min, x_max, y_max) = [
            int(max(corner_position * ratio_y, 10)) if idx % 2
            else int(corner_position * ratio_x)
            for idx, corner_position in enumerate(box[:-1])
        ]
        # Draw a box based on the position; parameters of the rectangle function are: image, start_point, end_point, color, thickness.
        rgb_image = cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max),
                                  colors["green"], 3)
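For the specific model in the question, the `{1,25200,85}` output is usually the easiest one to work with. As a hedged sketch, assuming the common YOLO export layout where each of the 25200 rows is `[cx, cy, w, h, objectness, 80 class scores]` in input-image pixels (check this against your own export), filtering by confidence could look like the following. `decode_predictions` and `conf_threshold` are illustrative names, and non-maximum suppression (e.g. `cv2.dnn.NMSBoxes`) would still be needed afterwards.

```python
import numpy as np

def decode_predictions(output, conf_threshold=0.25):
    """Turn a (1, N, 85) YOLO-style output into boxes, scores and class ids."""
    preds = output[0]                         # (N, 85)
    objectness = preds[:, 4]
    class_scores = preds[:, 5:]
    class_ids = class_scores.argmax(axis=1)
    # Final confidence = objectness * best class score
    scores = objectness * class_scores[np.arange(len(preds)), class_ids]
    keep = scores > conf_threshold
    cx, cy, w, h = preds[keep, :4].T
    # Convert centre/size to corner coordinates (x1, y1, x2, y2)
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    return boxes, scores[keep], class_ids[keep]

# Tiny fake output: two candidate rows, only the first one is confident
fake = np.zeros((1, 2, 85), dtype=np.float32)
fake[0, 0, :5] = [320, 320, 100, 50, 0.9]
fake[0, 0, 5 + 2] = 0.8
boxes, scores, class_ids = decode_predictions(fake)
print(boxes, scores, class_ids)
```

The other three outputs (`{1,3,80,80,85}` and so on) are per-scale feature maps, which is why looping over `.output(1)` only shows large blocks of floats.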

How to customize object detection using cv2.dnn_detectionModel method

Is there any way to customize object detection from my script? If yes, how do I do it, and do I need to install anything? Please provide a step-by-step or video guide.
I'm using a Raspberry Pi, so ideally the solution is free, does not need a GPU, and can be done on the Raspberry Pi itself.
The script below works for me, but I need to detect specific things that are not included in "coco.names" or "ssd_mobilenet".
Example: I want to detect "SKII Toner", so that the label shown is "SKII Toner" instead of "bottle".
import cv2
import numpy as np
#Threshold setup
thres = 0.3 # Threshold to detect object
nms_threshold = 0.2
#camera setup
cap = cv2.VideoCapture(0)
#cap.set(3,1080)
#cap.set(4,1920)
#cap.set(10,300)
#standard configuration setting up
classFile = "coco.names"
classNames = []
with open(classFile,"rt") as f:
    classNames = f.read().rstrip("\n").split("\n")
#print(classNames)
configPath = "/home/pi/darknet/ssd_mobilenet_v3_large_coco_2020_01_14.pbtxt"
weightPath = "/home/pi/darknet/frozen_inference_graph.pb"
net = cv2.dnn_DetectionModel(configPath,weightPath)
net.setInputSize(320,320)
net.setInputScale(1.0/ 120)
net.setInputMean((120, 120, 120))
net.setInputSwapRB(True)
while True:
    success,img = cap.read()
    img = cv2.flip(img, 0)
    classIds, confs, bbox = net.detect(img,confThreshold=thres)
    bbox = list(bbox)
    confs = list(np.array(confs).reshape(1,-1)[0])
    confs = list(map(float,confs))
    #print(type(confs[0]))
    #print(confs)
    indices = cv2.dnn.NMSBoxes(bbox,confs,thres,nms_threshold)
    #print(indices)
    for i in indices:
        i = i[0]
        box = bbox[i]
        x,y,w,h = box[0], box[1], box[2], box[3]
        cv2.rectangle(img, (x,y),(x+w,h+y), color=(0,255,0),thickness=1)
        cv2.putText(img,classNames[classIds[i][0]-1].upper(),(box[0]+10,box[1]+30),
                    cv2.FONT_HERSHEY_COMPLEX,1,(0,255,0),1)
    cv2.imshow("Output",img)
    cv2.waitKey(1)
Since, based on the comments, you want to detect new classes, you have two options: either take a pre-trained detection model that already detects the desired class (if one exists) and see whether its accuracy suits your needs, or, even better, take a model pre-trained on a big dataset (e.g. COCO) and fine-tune it on a dataset labeled for your class of interest. You will need a dataset for this; depending on your class, you may find something already available on the net, or you will have to collect one. A good starting point could be the TensorFlow Object Detection API, which provides models pre-trained on COCO and relatively easy-to-use APIs for fine-tuning on a new dataset.

augmentation on images to look like real example python

I have generated many number plate images like the one below.
Now I want to convert all such images so that they look like real-world vehicle number plate images.
For example:
How can I apply this type of augmentation and save all the augmented images in another folder?
Solution
Check out the library albumentations. Try to answer the question: "What is the difference between the image you have and the image you want?" For instance, the real image:
is more pixelated,
is grainy,
has lower resolution,
could have nails/fastening screws on it,
may have something else written under or over the main number,
may have shadows on it,
may be unevenly bright in places, etc.
Albumentations helps you come up with many types of image augmentations. Please try to break down this problem as I suggested, and then find out which augmentations you need from albumentations.
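To make a few of the differences listed above concrete, here is a minimal NumPy-only sketch (deliberately not using albumentations) that lowers the resolution, adds grain and applies an uneven brightness ramp. The function name `degrade_plate` and all parameter values are illustrative assumptions, not tuned settings.

```python
import numpy as np

def degrade_plate(image, factor=4, noise_std=12.0, seed=None):
    """Roughly mimic a low-quality photo of a plate (uint8 H x W x 3 input)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    # Pixelate: keep every `factor`-th pixel, then repeat it back up
    small = image[::factor, ::factor]
    out = small.repeat(factor, axis=0).repeat(factor, axis=1)[:h, :w].astype(np.float32)
    # Grain: additive Gaussian noise
    out += rng.normal(0.0, noise_std, size=out.shape)
    # Uneven lighting: brightness ramps from 70% on the left to 100% on the right
    ramp = np.linspace(0.7, 1.0, w, dtype=np.float32)[None, :, None]
    out *= ramp
    return np.clip(out, 0, 255).astype(np.uint8)

plate = np.full((60, 200, 3), 255, dtype=np.uint8)
rough = degrade_plate(plate, seed=0)
print(rough.shape, rough.dtype)  # (60, 200, 3) uint8
```

Each of these steps has a ready-made, more configurable counterpart in albumentations, which is the reason to reach for the library once you know which effects you need.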
Example of image augmentation using albumentations
The following code block (source) shows you how to apply albumentations for image augmentation. In case you had an image and a mask, both of them will undergo identical transformations.
Another example from kaggle: Image Augmentation Demo with albumentation
from albumentations import (
    HorizontalFlip, IAAPerspective, ShiftScaleRotate, CLAHE, RandomRotate90,
    Transpose, Blur, OpticalDistortion, GridDistortion, HueSaturationValue,
    IAAAdditiveGaussianNoise, GaussNoise, MotionBlur, MedianBlur, IAAPiecewiseAffine,
    IAASharpen, IAAEmboss, RandomBrightnessContrast, Flip, OneOf, Compose
)
import numpy as np
def strong_aug(p=0.5):
    return Compose([
        RandomRotate90(),
        Flip(),
        Transpose(),
        OneOf([
            IAAAdditiveGaussianNoise(),
            GaussNoise(),
        ], p=0.2),
        OneOf([
            MotionBlur(p=0.2),
            MedianBlur(blur_limit=3, p=0.1),
            Blur(blur_limit=3, p=0.1),
        ], p=0.2),
        ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.2, rotate_limit=45, p=0.2),
        OneOf([
            OpticalDistortion(p=0.3),
            GridDistortion(p=0.1),
            IAAPiecewiseAffine(p=0.3),
        ], p=0.2),
        OneOf([
            CLAHE(clip_limit=2),
            IAASharpen(),
            IAAEmboss(),
            RandomBrightnessContrast(),
        ], p=0.3),
        HueSaturationValue(p=0.3),
    ], p=p)
image = np.ones((300, 300, 3), dtype=np.uint8)
mask = np.ones((300, 300), dtype=np.uint8)
whatever_data = "my name"
augmentation = strong_aug(p=0.9)
data = {"image": image, "mask": mask, "whatever_data": whatever_data, "additional": "hello"}
augmented = augmentation(**data)
image, mask, whatever_data, additional = augmented["image"], augmented["mask"], augmented["whatever_data"], augmented["additional"]
Strategy
First tone down the number of augmentations to a bare minimum
Save a single augmented-image
Save a few images post augmentation.
Now test and update your augmentation pipeline to suit your requirements of mimicking the ground-truth scenario.
finalize your pipeline and run it on a larger number of images.
Time it: how long this takes for how many images.
Then finally run it on all the images: this time you can have a time estimate on how long it is going to take to run it.
NOTE: every time an image passes through the augmentation pipeline, only a single instance of augmented image comes out of it. So, say you want 10 different augmented versions of each image, you will need to pass each image through the augmentation pipeline 10 times, before moving on to the next image.
# this will not be what you end up using
# but you can begin to understand what
# you need to do with it.
def simple_aug(p=0.5):
    return Compose([
        RandomRotate90(),
        # Flip(),
        # Transpose(),
        OneOf([
            IAAAdditiveGaussianNoise(),
            GaussNoise(),
        ], p=0.2),
    ], p=p)
# for a single image: check first
image = ... # write your code to read in your image here
augmentation = simple_aug(p=0.5)
augmented = augmentation(image=image) # albumentations expects named arguments; see the docs
# SAVE the image
# If you are using imageio or PIL, saving an image
# is rather straightforward, and I will let you
# figure that out.
# save the content of the variable: augmented['image']
For multiple images
Assuming each image passes through the augmentation pipeline 10 times, your code could look as follows:
import os
# I assume you have a way of loading your
# images from the filesystem, and they come
# out of `images` (an iterator)
NUM_AUG_REPEAT = 10
AUG_SAVE_DIR = 'data/augmented'
# create directory if not present already
if not os.path.isdir(AUG_SAVE_DIR):
    os.makedirs(AUG_SAVE_DIR)
# This will create augmentation ids for the same image
# example: '00', '01', '02', ..., '08', '09' for
# - NUM_AUG_REPEAT = 10
aug_id = lambda x: str(x).zfill(len(str(NUM_AUG_REPEAT)))
for img_idx, image in enumerate(images):
    for i in range(NUM_AUG_REPEAT):
        data = {'image': image}
        augmented = augmentation(**data)
        # I assume you have a function: save_image(image_path, image)
        # You need to write this function with
        # whatever logic necessary. (Hint: use imageio or PIL.Image)
        # The image index keeps different source images from overwriting each other
        image_filename = f'image_{img_idx}_{aug_id(i)}.png'
        save_image(os.path.join(AUG_SAVE_DIR, image_filename), augmented['image'])

X.shape[1] size doesn't fit the expected value

I'm currently working on my final degree project in robotics, and I decided to create an open-source robot capable of replicating human emotions. The robot is all set up and ready to receive orders, but I'm still busy coding it. I'm currently basing my code off this method. The idea is to extract 68 facial landmarks from a low FPS video feed (using RPi Camera V2), feed those landmarks to a trained SVM classifier and have it return a numeral from 0-6 depending on the expression it detected (Angry, Disgust, Fear, Happy, Sad, Surprise and Neutral). I'm testing out the capabilities of my model with some pictures I took using the RPi Camera, and this is what I've managed to put together so far in terms of code:
# import the necessary packages
from imutils import face_utils
import dlib
import cv2
import numpy as np
import time
import argparse
import os
import sys
if sys.version_info >= (3, 0):
    import _pickle as cPickle
else:
    import cPickle
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from data_loader import load_data
from parameters import DATASET, TRAINING, HYPERPARAMS
def get_landmarks(image, rects):
    if len(rects) > 1:
        raise BaseException("TooManyFaces")
    if len(rects) == 0:
        raise BaseException("NoFaces")
    return np.matrix([[p.x, p.y] for p in predictor(image, rects[0]).parts()])
# initialize dlib's face detector (HOG-based) and then create
# the facial landmark predictor
print("Initializing variables...")
p = "shape_predictor_68_face_landmarks.dat"
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(p)
# path to pretrained model
path = "saved_model.bin"
# load pretrained model
print("Loading model...")
model = cPickle.load(open(path, 'rb'))
# initialize final image height & width
height = 48
width = 48
# initialize landmarks variable as empty array
landmarks = []
# load the input image and convert it to grayscale
print("Loading image...")
gray = cv2.imread("foo.jpg")
# detect faces in the grayscale image
print("Detecting faces in loaded image...")
rects = detector(gray, 0)
# loop over the face detections
print("Looping over detections...")
for (i, rect) in enumerate(rects):
    # determine the facial landmarks for the face region, then
    # convert the facial landmark (x, y)-coordinates to a NumPy
    # array
    shape = predictor(gray, rect)
    shape = face_utils.shape_to_np(shape)
    # loop over the (x, y)-coordinates for the facial landmarks
    # and draw them on the image
    for (x, y) in shape:
        cv2.circle(gray, (x, y), 2, (0, 255, 0), -1)
# show the output image with the face detections + facial landmarks
print("Storing saved image...")
cv2.imwrite("output.jpg", gray)
print("Image stored as /'output.jpg/'")
# arrange landmarks in array
print("Collecting and arranging landmarks...")
# scipy.misc.imsave('temp.jpg', image)
# image2 = cv2.imread('temp.jpg')
face_rects = [dlib.rectangle(left=1, top=1, right=47, bottom=47)]
landmarks = get_landmarks(gray, face_rects)
# load data
print("Loading collected data into predictor...")
print("Extracted landmarks: ", landmarks)
landmarks = np.array(landmarks.flatten())
# predict expression
print("Making prediction")
predicted = model.predict(landmarks)
However, after running the code everything seems to be fine up until this point:
Making prediction
Traceback (most recent call last):
File "face.py", line 97, in <module>
predicted = model.predict(landmarks)
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 576, in predict
y = super(BaseSVC, self).predict(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 325, in predict
X = self._validate_for_predict(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 478, in _validate_for_predict
(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 136 should be equal to 2728, the number of features at training time
I searched for similar issues on this website, but being such a specific purpose I didn't quite find what I needed. I've been working on the design and research for quite some time, but finding all the snippets needed to make the code work has taken the most time out of me, and I'd love to polish this concept as soon as possible since the presentation date is approaching quickly. Any and all contributions are greatly welcomed!
Here's the trained model I'm currently using, by the way.
I am probably being silly, but it looks like you define path after you use it to load your model.
Also path seems like a very bad name for a variable containing a file location, perhaps modelFileLocation is less likely to already be defined.
Solved it! Turns out my model was trained using a combination of HOG features and Dlib landmarks, however I was only feeding the landmarks to the predictor, which resulted in the size discrepancy.
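The numbers in the traceback are consistent with that explanation: 68 landmarks give 68 x 2 = 136 values, so the remaining 2728 - 136 = 2592 features would have to come from the HOG descriptor. Below is a minimal sketch of assembling the input, assuming the HOG vector is computed elsewhere and that the training code concatenated landmarks first and HOG second. Both the order and the `build_features` name are assumptions to check against the training script.

```python
import numpy as np

N_LANDMARK_FEATURES = 68 * 2      # 136, as reported in the traceback
N_EXPECTED_FEATURES = 2728        # feature count the SVM was trained with

def build_features(landmarks, hog_features, n_expected=N_EXPECTED_FEATURES):
    """Concatenate flattened landmarks and HOG features into one (1, n) sample."""
    landmarks = np.asarray(landmarks, dtype=np.float64).reshape(-1)
    hog_features = np.asarray(hog_features, dtype=np.float64).reshape(-1)
    sample = np.concatenate([landmarks, hog_features]).reshape(1, -1)
    if sample.shape[1] != n_expected:
        raise ValueError(
            f"got {sample.shape[1]} features, model expects {n_expected}"
        )
    return sample

landmarks = np.zeros((68, 2))                               # placeholder landmarks
hog = np.zeros(N_EXPECTED_FEATURES - N_LANDMARK_FEATURES)   # placeholder HOG vector
X = build_features(landmarks, hog)
print(X.shape)  # (1, 2728)
```

Checking the length before calling `model.predict(X)` turns the scikit-learn error into an earlier, more specific one.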
