Is there any way to customize object detection from my script? If yes, how do I do it, and do I need to install anything? Please provide a step-by-step or video guide.
I'm using a Raspberry Pi for this, so ideally it should rely on a free GPU (if any training is needed) and the result should be able to run on the Raspberry Pi itself.
The script below works for me; I just need to detect specific things that are not included in "coco.names" / "ssd_mobilenet".
Example: I want to detect "SKII Toner", so instead of showing "bottle" it should show "SKII Toner".
import cv2
import numpy as np

# Threshold setup
thres = 0.3  # Threshold to detect object
nms_threshold = 0.2

# Camera setup
cap = cv2.VideoCapture(0)
#cap.set(3,1080)
#cap.set(4,1920)
#cap.set(10,300)

# Standard configuration setup
classFile = "coco.names"
classNames = []
with open(classFile, "rt") as f:
    classNames = f.read().rstrip("\n").split("\n")
#print(classNames)

configPath = "/home/pi/darknet/ssd_mobilenet_v3_large_coco_2020_01_14.pbtxt"
weightPath = "/home/pi/darknet/frozen_inference_graph.pb"

net = cv2.dnn_DetectionModel(configPath, weightPath)
net.setInputSize(320, 320)
net.setInputScale(1.0 / 120)
net.setInputMean((120, 120, 120))
net.setInputSwapRB(True)

while True:
    success, img = cap.read()
    img = cv2.flip(img, 0)
    classIds, confs, bbox = net.detect(img, confThreshold=thres)
    bbox = list(bbox)
    confs = list(np.array(confs).reshape(1, -1)[0])
    confs = list(map(float, confs))
    #print(type(confs[0]))
    #print(confs)

    indices = cv2.dnn.NMSBoxes(bbox, confs, thres, nms_threshold)
    #print(indices)

    for i in indices:
        i = i[0]
        box = bbox[i]
        x, y, w, h = box[0], box[1], box[2], box[3]
        cv2.rectangle(img, (x, y), (x + w, y + h), color=(0, 255, 0), thickness=1)
        cv2.putText(img, classNames[classIds[i][0] - 1].upper(), (box[0] + 10, box[1] + 30),
                    cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 0), 1)

    cv2.imshow("Output", img)
    cv2.waitKey(1)
Since, based on the comments, you want to detect new classes, the only way is either to take a pre-trained detection model that already detects the desired class (if one exists) and see if its accuracy suits your needs, or, even better, to take a model pre-trained on a big dataset (e.g. COCO) and fine-tune it on a dataset labeled for your class of interest. You will need a dataset for this; depending on your class you may find one already available on the net, or you will have to collect it yourself. A good starting point could be the TensorFlow Object Detection API, which provides models pre-trained on COCO and a relatively easy-to-use API for fine-tuning on a new dataset.
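For example, once you have fine-tuned and exported such a model with the TensorFlow Object Detection API, the inference side could look roughly like the sketch below. Everything in it is illustrative: the export path, the class id and the "SKII Toner" label map are placeholders, and on a Raspberry Pi you would probably want to convert the exported model to a lighter format (e.g. TensorFlow Lite) for acceptable speed.
import cv2
import numpy as np
import tensorflow as tf

# Placeholder path and label map -- adjust to your own export
detect_fn = tf.saved_model.load("exported_model/saved_model")
labels = {1: "SKII Toner"}

img = cv2.imread("test.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
input_tensor = tf.convert_to_tensor(rgb)[tf.newaxis, ...]

detections = detect_fn(input_tensor)
boxes = detections["detection_boxes"][0].numpy()      # normalized [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(int)

h, w = img.shape[:2]
for box, score, cls in zip(boxes, scores, classes):
    if score < 0.5:
        continue
    y1, x1, y2, x2 = (box * [h, w, h, w]).astype(int)
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 1)
    cv2.putText(img, labels.get(cls, str(cls)), (x1 + 10, y1 + 30),
                cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 0), 1)

cv2.imshow("Output", img)
cv2.waitKey(0)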
I want to run real-time object detection using YOLOv5 on a camera and then generate vector embeddings for cropped images of the detected objects.
I currently generate image embeddings using this function below for locally saved images:
def generate_img_embedding(img_file_path):
    images = [
        Image.open(img_file_path)
    ]
    # Encoding a single image takes ~20 ms
    embeddings = embedding_model.encode(images)
    return embeddings
I also start the YOLOv5 object detection with image cropping as follows:
def start_camera(productid):
    print("Attempting to start camera")
    # productid = "11011"
    try:
        command = "python ./yolov5/detect.py --source 0 --save-crop --name " + productid + " --project ./cropped_images"
        os.system(command)
        print("Camera running")
    except Exception as e:
        print("error starting camera!", e)
How can I modify the YOLOv5 model to pass the cropped images into my embedding function in real time?
Just take a look at detect.py supplied with yolov5, the file you are running. The implementation is pretty short (~150 SLOC); I would recommend re-implementing it or modifying it for your use case.
Key points, omitting a lot of (important, but standard and easily understandable) data transforms and parameter parsing, are as follows:
device = select_device(device)
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data)
# Code selecting FP16/FP32 omitted here
model.warmup(imgsz=(1 if pt else bs, 3, *imgsz), half=half)

for path, im, im0s, vid_cap, s in dataset:
    im = torch.from_numpy(im).to(device)
    # Image transforms omitted
    pred = model(im, augment=augment, visualize=visualize)  # stage 1
    pred = non_max_suppression(pred, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)  # stage 2

    for i, det in enumerate(pred):
        if len(det):
            # Rescale boxes from img_size to im0 size
            det[:, :4] = scale_coords(im.shape[2:], det[:, :4], im0.shape).round()
            # --> This is where you would access detections in real time! <--
Most of the code's logic is handling the I/O (in particular, dataset loading is handled by either LoadStreams or LoadImages from yolov5's utils), the rest is just rescaling input images, loading a torch model, and running detection and NMS. No rocket science here.
The least effort path for you would be just copying the entire thing and implementing your embeddings under
for *xyxy, conf, cls in reversed(det):
Instead of saving to file, you would take the box coordinates (the xyxy corners) and crop the image using e.g. Pillow's Image.crop() or by slicing the numpy array directly. Whichever works best for you depends on the implementation of your embedding_model.encode.
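As a minimal sketch of what the body of that loop could look like (embedding_model and its .encode() call are assumptions carried over from your question, not part of yolov5; im0 is detect.py's name for the original-resolution frame):
from PIL import Image

for *xyxy, conf, cls in reversed(det):
    x1, y1, x2, y2 = (int(v) for v in xyxy)       # detection corners in im0 coordinates
    crop = im0[y1:y2, x1:x2]                      # numpy slice of the original frame (BGR)
    pil_crop = Image.fromarray(crop[:, :, ::-1])  # BGR -> RGB if your encoder expects a PIL image
    embedding = embedding_model.encode(pil_crop)  # pass the crop straight to your encoder
    # ... store or index `embedding` here instead of saving the crop to disk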
I converted this YOLOv7 computer vision model to an ONNX model that can be used with the OpenVINO toolkit. This model has the characteristics I am after, based on how it is used in other applications I have read about.
I think my question is quite basic and comes from not understanding computer vision well enough; I am curious whether someone can give me some tips on the basics of how to loop through the model output for bounding boxes and draw them with OpenCV.
I am using this on CPU with pip-installed OpenVINO:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from openvino.runtime import Core
model_path = "./yolov7.xml"

ie_core = Core()

def model_init(model_path):
    model = ie_core.read_model(model=model_path)
    compiled_model = ie_core.compile_model(model=model, device_name="CPU")
    input_keys = compiled_model.input(0)
    output_keys = compiled_model.output(0)
    return input_keys, output_keys, compiled_model
input_key, output_keys, compiled_model = model_init(model_path)
# resize the image so it works with the model dimensions
image = cv2.resize(image, (width, height))
image = image.transpose((2,0,1))
image = image.reshape(1,3, height,width)
# Run inference on image, trying .output(1) first
boxes = compiled_model([image])[compiled_model.output(1)]
The code works and outputs an array, but what does this data contain? I thought there would be a confidence value I could use to filter out bad predictions, as well as bounding box coordinates.
If I print(compiled_model), it outputs what I think is the model architecture:
<CompiledModel:
inputs[
<ConstOutput: names[input.1] shape{1,3,640,640} type: f32>
]
outputs[
<ConstOutput: names[812] shape{1,25200,85} type: f32>,
<ConstOutput: names[588] shape{1,3,80,80,85} type: f32>,
<ConstOutput: names[669] shape{1,3,40,40,85} type: f32>,
<ConstOutput: names[750] shape{1,3,20,20,85} type: f32>
]>
Does this tell me anything about the model output, i.e. what the data contains? Or what about boxes.shape, which returns:
(1, 3, 80, 80, 85)
for box in boxes:
    print(box)
This just prints numpy arrays full of float data. I am curious whether anyone can help me understand, at a high level, what I need to learn in order to draw bounding boxes around features inside the image.
From my replication, your code does not run as posted; it fails with a NameError: name 'image' is not defined. In your output, each ConstOutput entry only represents a port/node of your model. To make sure the model itself works, run your yolov7.xml file with the OpenVINO Benchmark Python Tool; you should not receive any errors.
Among the OpenVINO samples, you may refer to the Object Detection Python Demo source code to learn how the OpenVINO Inference Engine API is used to create bounding boxes and how to handle the model. Here is another example of creating bounding boxes:
for box in boxes:
    # Pick the confidence factor from the last place in the array.
    conf = box[-1]
    if conf > threshold:
        # Convert floats to ints and multiply the corner positions of each box by the x and y ratios.
        # If the bounding box is found at the top of the image,
        # position the upper box bar a little lower to make it visible on the image.
        (x_min, y_min, x_max, y_max) = [
            int(max(corner_position * ratio_y, 10)) if idx % 2
            else int(corner_position * ratio_x)
            for idx, corner_position in enumerate(box[:-1])
        ]
        # Draw a box based on the position. Parameters of the rectangle function are:
        # image, start_point, end_point, color, thickness.
        rgb_image = cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max),
                                  colors["green"], 3)
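If you want to interpret the raw tensors yourself: in typical YOLOv5/YOLOv7 ONNX exports, the fused (1, 25200, 85) output (the first one in your list) holds one candidate detection per row, laid out as [center_x, center_y, width, height, objectness, 80 class scores], while the (1, 3, 80, 80, 85)-style outputs are the raw per-scale heads. Please verify that layout against your own export before relying on it; assuming it holds, a minimal sketch for filtering and drawing (without non-maximum suppression) could look like this:
import cv2
import numpy as np

def draw_raw_yolo_output(preds, image, conf_threshold=0.4):
    # preds: the (1, 25200, 85) tensor from the first model output
    # image: the 640x640 image that was fed to the network
    for row in preds.reshape(-1, 85):
        cx, cy, w, h, objectness = row[:5]
        if objectness < conf_threshold:
            continue
        class_scores = row[5:]
        class_id = int(np.argmax(class_scores))
        score = float(objectness * class_scores[class_id])
        # Box coordinates are in the 640x640 network-input space; if you draw on the
        # original image instead, rescale them (and undo any letterboxing) first.
        x_min, y_min = int(cx - w / 2), int(cy - h / 2)
        x_max, y_max = int(cx + w / 2), int(cy + h / 2)
        cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
        cv2.putText(image, "%d: %.2f" % (class_id, score), (x_min, y_min - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image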
I'm trying to build a program to count the number of pipes in a given image.
https://drive.google.com/drive/folders/1iw2W7dUg3ICGRt3hxOynUJCf-KQj2Pka?usp=sharing contains some example test images.
I've tried Canny edge detection and the Hough transform, but neither of them seems to come even close to counting the pipes properly. What approach should I go for?
There are several things to consider:
It is better to first find the region containing the pipes with a suitable method (search the net for options).
In the second step, use a perspective transform to eliminate any rotation of the image and pass only the pipe region on to the next step.
From there, you can use the following algorithm.
This is not a complete algorithm and will not work for all test cases. You will have to spend time reading about and combining different image processing methods, or maybe even machine learning, and tuning the parameters to get a better result.
Another point: try to keep the environmental conditions constant.
import sys
import cv2
import numpy as np
# Load image
im = cv2.imread(sys.path[0]+'/im.jpeg')
H, W = im.shape[:2]
# Make a copy from image
out = im.copy()
# Make a grayscale version
gry = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
# Make a clean black and white version of that picture
bw = cv2.adaptiveThreshold(
gry, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 12)
bw = cv2.medianBlur(bw, 3)
bw = cv2.erode(bw, np.ones((5, 5)))
bw = cv2.medianBlur(bw, 9)
bw = cv2.dilate(bw, np.ones((5, 5)))
# Draw a rectangle around image to eliminate errors
cv2.rectangle(bw, (0, 0), (W, H), 0, thickness=17)
# Count number of pipes
cnts, _ = cv2.findContours(bw, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
# Change channels of black/white image
bw = cv2.cvtColor(bw, cv2.COLOR_GRAY2BGR)
# Draw position of pipes
c = 0
for cnt in cnts:
    c += 1
    x, y, w, h = cv2.boundingRect(cnt)
    cv2.circle(out, (x+w//2, y+h//2), max(w, h)//2, (c, 220, 255-c), 2)
    if c >= 255:
        c = 0
# Count and print number of pipes
print(len(cnts))
# Save output images
cv2.imwrite(sys.path[0]+'/im_out.jpg', np.hstack((bw, out)))
Although you may get by with some ad-hoc algorithm, I think you would spend more time tuning it than you would spend training and tuning a model:
The pipes do not seem to have a very complicated texture, so I would create a set of synthetic images by cropping some of the pipes from the images and applying multiple transformations to them; for instance, you can use the imgaug library (https://imgaug.readthedocs.io/en/latest/) in Python. Then you can use HOG, Haar features, LBP, etc. to extract features and train a model such as AdaBoost with trees or a linear SVM. The object detection pipeline from dlib should be good enough and is easy to use (a rough sketch follows below). In general, you need a framework with sliding windows for the detection part.
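As a rough illustration of the dlib route (the annotation file pipes_training.xml is a placeholder you would create yourself, e.g. with dlib's imglab tool, and the training parameters are untuned guesses):
import dlib
import cv2

# train a HOG + linear SVM detector from an imglab-style annotation file
options = dlib.simple_object_detector_training_options()
options.add_left_right_image_flips = True   # pipe ends are roughly symmetric
options.C = 5                               # SVM regularization, tune on validation data
dlib.train_simple_object_detector("pipes_training.xml", "pipe_detector.svm", options)

# run the trained detector on a new image and count the hits
detector = dlib.simple_object_detector("pipe_detector.svm")
img = cv2.cvtColor(cv2.imread("pipes.jpg"), cv2.COLOR_BGR2RGB)
rects = detector(img)
print("pipes found:", len(rects))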
Another option is to gather a set of background images, apply transformations to the cropped pipe images, and generate multiple composite images with bounding-box annotations; you could use a tool like this one for that: https://github.com/tylerhutcherson/synthetic-images.
Then you can use a deep learning model like YOLOv4, RetinaNet, or Faster R-CNN. Notice that you may need to change the strides of the output layers, since the objects may be too small.
All in all, if I were you I would start with the object detector from dlib; it should not take you too long to get everything ready.
One thing: given the type of features used for training, the model may detect circles that are not really pipes, but if you expect many pipes in the image, those false positives should be easy to remove.
Hope it helps, and good luck.
I am trying to align an RGB image with an IR image (single channel).
The goal is to create a 4 channel image R,G,B,IR.
In order to do this, I am using cv2.findTransformECC as described in this very neat guide. The code is unchanged for now, except for line 13, where the motion model is set to Euclidean because I want to handle rotations in the future. I am using Python.
In order to verify the workings of the software, I used the images from the guide. It worked well so I wanted to correlate satellite images from multiple spectra as described above. Unfortunately, I ran into problems here.
Sometimes the algorithm converges (after ages), sometimes it immediately crashes because it can't converge, and other times it "finds" a solution that is clearly wrong. Attached you'll find two images that, from a human perspective, are easy to match, but on which the algorithm fails. The images are not rotated in any way; they are just not the exact same image (check the borders), so a translational motion is expected. The images are of Lake Neusiedlersee in Austria; the source is Sentinelhub.
Edit: With "sometimes" I refer to using different images from Sentinel. One pair of images has consistently the same outcome.
I know that ECC is not feature-based which might pose a problem here.
I have also read that it is somewhat dependent on the initial warp matrix.
My questions are:
Am I using cv2.findTransformECC wrong?
Is there a better way to do this?
Should I try to "Monte-Carlo" the initial matrices until it converges? (This feels wrong)
Do you suggest using a feature-based algorithm?
If so, is there one available or would I have to implement this myself?
Thanks for the help!
Do you suggest using a feature-based algorithm?
Sure.
There are many feature detection algorithms.
I generally choose SIFT because it provides good matching results and its runtime is feasibly fast.
import cv2 as cv
import numpy as np

# read the images
ir = cv.imread('ir.jpg', cv.IMREAD_GRAYSCALE)
rgb = cv.imread('rgb.jpg', cv.IMREAD_COLOR)

descriptor = cv.SIFT.create()
matcher = cv.FlannBasedMatcher()

# get features from images
kps_ir, desc_ir = descriptor.detectAndCompute(ir, mask=None)
gray = cv.cvtColor(rgb, cv.COLOR_BGR2GRAY)
kps_color, desc_color = descriptor.detectAndCompute(gray, mask=None)

# find the corresponding point pairs
if (desc_ir is not None and desc_color is not None and len(desc_ir) >= 2 and len(desc_color) >= 2):
    rawMatch = matcher.knnMatch(desc_color, desc_ir, k=2)
matches = []
# ensure the distance is within a certain ratio of each other (i.e. Lowe's ratio test)
ratio = 0.75
for m in rawMatch:
    if len(m) == 2 and m[0].distance < m[1].distance * ratio:
        matches.append((m[0].trainIdx, m[0].queryIdx))

# convert keypoints to points
pts_ir, pts_color = [], []
for id_ir, id_color in matches:
    pts_ir.append(kps_ir[id_ir].pt)
    pts_color.append(kps_color[id_color].pt)
pts_ir = np.array(pts_ir, dtype=np.float32)
pts_color = np.array(pts_color, dtype=np.float32)

# compute homography
if len(matches) > 4:
    H, status = cv.findHomography(pts_ir, pts_color, cv.RANSAC)
    warped = cv.warpPerspective(ir, H, (rgb.shape[1], rgb.shape[0]))
    warped = cv.cvtColor(warped, cv.COLOR_GRAY2BGR)

# visualize the result
winname = 'result'
cv.namedWindow(winname, cv.WINDOW_KEEPRATIO)
alpha = 5
# res = cv.addWeighted(rgb, 0.5, warped, 0.5, 0)
res = None

def onChange(alpha):
    global rgb, warped, res, winname
    res = cv.addWeighted(rgb, alpha/10, warped, 1 - alpha/10, 0)
    cv.imshow(winname, res)

onChange(alpha)
cv.createTrackbar('alpha', winname, alpha, 10, onChange)
cv.imshow(winname, res)
cv.waitKey()
cv.destroyWindow(winname)
Result (alpha=8)
Edit: It seems like SIFT is not the best option, as it fails for some other examples. Example images are in another question.
In this case, I suggest using SURF.
It is a patented algorithm, so it does not come with the latest OpenCV PIP installations.
You can install previous versions of OpenCV or build it from source.
descriptor = cv.xfeatures2d.SURF_create()
Result (alpha=8)
Edit 2: It is now clear that the key to achieving this task is choosing the correct feature descriptor. As a final note, I suggest also choosing the appropriate motion model; an affine transform fits better than a homography in this case.
H, _ = cv.estimateAffine2D(pts_ir, pts_color)
H = np.vstack((H, [0, 0, 1]))
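With H promoted to a 3x3 matrix this way, it can be plugged into the same cv.warpPerspective call used earlier (a small usage sketch reusing the variable names from the code above):
# warp the IR image into the RGB frame using the 3x3 matrix built from the affine estimate
warped = cv.warpPerspective(ir, H, (rgb.shape[1], rgb.shape[0]))
warped = cv.cvtColor(warped, cv.COLOR_GRAY2BGR)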
Affine transform result:
I'm trying to create a simple Eigenfaces face recognition app using Python and OpenCV. Unfortunately, when I run the app I get this result:
(-1, '\n', 1.7976931348623157e+308), where -1 stands for "not found" and the confidence is... quite high...
Could someone post the most basic OpenCV implementation of Eigenfaces?
Here is my approach to the problem. I use Python 2, as is suggested in the official documentation (due to some problems with Python 3).
import cv2 as cv
import numpy as np
import os
num_components = 10
threshold = 10.0
faceRecognizer = cv.face_EigenFaceRecognizer.create(num_components, threshold)
images = []
labels = []
textLabels = ["Person1", "Person2", "Person3"]
destinedIm = cv.imread("images/set1/1.jpg", cv.IMREAD_GRAYSCALE)
destinedSize = destinedIm.shape
#Person1
img = cv.imread("images/set1/1.jpg", cv.IMREAD_GRAYSCALE)
imResized = cv.resize(img, destinedSize)
images.append(imResized)
labels.append(0)
#In a similar way I read a total of 8 images from set1 and 6 images from set2 (2 different people, with labels 0 and 1 respectively)
cv.imwrite("images/set2/resized.jpg", imResized) #this doesn't work
numpyImages = np.array(images)
numpyLabels = np.array(labels)
# cv.face_FaceRecognizer.train(self=faceRecognizer, src=images, labels=labels)
faceRecognizer.train(src=images, labels=numpyLabels)
testImage = cv.imread("images/set1/testIm.jpg", cv.IMREAD_GRAYSCALE)
# cv.face_FaceRecognizer.predict()
resultLabel, resultConfidence = faceRecognizer.predict(testImage)
print (resultLabel, "\n" ,resultConfidence)
testImage is another image of the person with label 0.
I would look at the sizing of testImage. Also, I used a different sizing method than the one you used and got it working:
face_resized = cv2.resize(img, (299, 299))
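For Eigenfaces, every image passed to predict() must have exactly the same dimensions (and therefore the same flattened vector length) as the training images, so resizing the test image the same way is the usual fix. A minimal sketch reusing the variable names from the question (note that destinedSize comes from .shape, which is (height, width), while cv2.resize expects (width, height)):
import cv2 as cv

# resize the test image to the same size the recognizer was trained on
testImage = cv.imread("images/set1/testIm.jpg", cv.IMREAD_GRAYSCALE)
h, w = destinedSize                       # destinedSize is (height, width) from .shape
testImage = cv.resize(testImage, (w, h))  # cv.resize takes (width, height)

resultLabel, resultConfidence = faceRecognizer.predict(testImage)
print(resultLabel, resultConfidence)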