Extracting coordinates from computer vision inference - Python

I converted a YOLOv7 computer vision model to an ONNX model that can be used with the OpenVINO toolkit. This model has the characteristics I am after, based on how it is used in other applications I have read about.
My question is probably very basic and comes down to not understanding computer vision well enough: can someone give me some tips on the basics of how to loop through the model output to get the bounding boxes and draw them with OpenCV?
I am running this on CPU with a pip-installed OpenVINO:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from openvino.runtime import Core
model_path = "./yolov7.xml"

ie_core = Core()

def model_init(model_path):
    model = ie_core.read_model(model=model_path)
    compiled_model = ie_core.compile_model(model=model, device_name="CPU")
    input_keys = compiled_model.input(0)
    output_keys = compiled_model.output(0)
    return input_keys, output_keys, compiled_model

input_key, output_keys, compiled_model = model_init(model_path)

# Resize the image so it matches the model's input dimensions
image = cv2.resize(image, (width, height))
image = image.transpose((2, 0, 1))
image = image.reshape(1, 3, height, width)

# Run inference on the image, trying .output(1) first
boxes = compiled_model([image])[compiled_model.output(1)]
The code runs and outputs an array, but what does this data contain? I thought there would be a confidence score I could use to filter out bad predictions, as well as bounding box coordinates.
If I print(compiled_model), it outputs what I think is the model architecture:
<CompiledModel:
inputs[
<ConstOutput: names[input.1] shape{1,3,640,640} type: f32>
]
outputs[
<ConstOutput: names[812] shape{1,25200,85} type: f32>,
<ConstOutput: names[588] shape{1,3,80,80,85} type: f32>,
<ConstOutput: names[669] shape{1,3,40,40,85} type: f32>,
<ConstOutput: names[750] shape{1,3,20,20,85} type: f32>
]>
Does this tell me anything about the model output, like what the data would contain? Or boxes.shape, which returns:
(1, 3, 80, 80, 85)
for box in boxes:
    print(box)
This just prints NumPy arrays full of float data. Can anyone help me understand, at a high level, what I need to learn in order to draw bounding boxes around features in the image?

When I replicated your code, it failed with a "NameError: name 'image' is not defined" error. In your printed output, each ConstOutput only represents a port/node of your model. To make sure your model works, run your yolov7.xml file with the OpenVINO Benchmark Python Tool; you should not receive any errors.
Among the OpenVINO samples, you can refer to the Object Detection Python Demo source code to learn how the OpenVINO Inference Engine API is used to create bounding boxes and how to handle the model. Here is another example of creating bounding boxes:
for box in boxes:
    # Pick the confidence factor from the last place in the array.
    conf = box[-1]
    if conf > threshold:
        # Convert float to int and multiply the corner position of each box by the x and y ratio.
        # If the bounding box is found at the top of the image,
        # position the upper box bar a little lower to make it visible on the image.
        (x_min, y_min, x_max, y_max) = [
            int(max(corner_position * ratio_y, 10)) if idx % 2
            else int(corner_position * ratio_x)
            for idx, corner_position in enumerate(box[:-1])
        ]
        # Draw a box based on the position; the parameters of the rectangle function are:
        # image, start_point, end_point, color, thickness.
        rgb_image = cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max),
                                  colors["green"], 3)

Related

How to crop out the annotation box and everything within it?

After running YOLOv8, the algorithm annotated the following picture: Density-Area
My goal is to crop out a large number of these pictures for use in further analysis, so I want everything within the bounding box saved and everything outside of it removed.
I tried using torch, numpy, cv2, and PIL but haven't been successful.
import torch
import torchvision
from PIL import Image

# Load the image
image = Image.open("path to .jpg")

# Define the model and download the pre-trained weights
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True, weights=None)

# Set the model to evaluation mode
model.eval()

# Transform the image to a tensor
transform = torchvision.transforms.ToTensor()
image_tensor = transform(image)

# Make predictions on the image using the model
predictions = model([image_tensor])

# Extract the bounding boxes and object labels from the predictions
boxes = predictions[0]['boxes'].tolist()
labels = predictions[0]['labels'].tolist()

# Crop the image for each object detected
for i in range(len(boxes)):
    bbox = tuple(boxes[i])
    object_label = labels[i]
    object_image = image.crop(bbox)
    object_image.save(f"image_save_{i}.jpg")
The image is just an nd-array, so just use array indexing to perform the cropping operation you want.
For example, I assume your bounding boxes are of the form [xmin, ymin, xmax, ymax].
import cv2

for i in range(len(boxes)):
    object_label = labels[i]
    # unpack the box corners and round them to integer pixel indices
    xmin, ymin, xmax, ymax = [int(v) for v in boxes[i]]
    object_image = image_tensor
    crop = object_image[:, ymin:ymax, xmin:xmax]
    # permute the color dimension last
    crop = crop.permute(1, 2, 0)
    # convert from tensor to numpy array
    crop = crop.data.numpy()
    # swap from RGB to BGR (per opencv convention)
    crop = crop[:, :, ::-1]
    # rescale the [0, 1] floats produced by ToTensor back to 8-bit values
    crop = (crop * 255).astype("uint8")
    # save each crop under its own name
    cv2.imwrite(f"output_image_{i}.jpg", crop)
I'm sure you could accomplish this working directly with the PIL image objects as well, but more generally, in response to your comment: no, you cannot crop an image without providing the coordinates of the cropping bounding box.
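For completeness, the PIL route mentioned above could look something like this minimal sketch, reusing image and predictions from the question (the crop file names are hypothetical):

for i, box in enumerate(predictions[0]['boxes'].tolist()):
    # Image.crop expects a (left, upper, right, lower) tuple in pixel coordinates
    left, upper, right, lower = [int(v) for v in box]
    crop = image.crop((left, upper, right, lower))
    crop.save(f"pil_crop_{i}.jpg")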

How to get orientation of text from an image?

I'm trying to get the orientation of text from an image.
I have 8 types of images with different orientations; see all the types in the next image (I will put a link to a repository where you can get all the input images):
I was using these libraries to detect the orientation of the text in an image.
import pytesseract as tess
from PIL import Image
my_image = Image.open(temp_image_path)
osd = tess.image_to_osd(my_image)
print(osd)
Output:
This is what I got:
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 2.77
Script: Cyrillic
Script confidence: 2.88
However, I don't understand why a vertical plan with vertical text (type II in my image) sometimes gives an output like this:
Rotate: 90 or Rotate: 270.
I used OpenCV and TensorFlow; they helped me find similarities but not identify whether my text has a different orientation.
This is the repository on GitHub:
Click here to view the repository with the inputs
Following #stateMachine's recommendation, detecting the footer position and aspect ratio is a good idea. You can try to do so by detecting squares in the image, which should be fairly easy to do with OpenCV; see the sketch below for one way.
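A minimal sketch of the square-detection idea with OpenCV contours, assuming the plan is scanned on a light background ("plan.png" is a placeholder file name):

import cv2

img = cv2.imread("plan.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# invert + Otsu so the dark drawing lines become foreground
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

contours, _ = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    # approximate each contour; a footer box should come out as a 4-point polygon
    approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
    if len(approx) == 4 and cv2.contourArea(approx) > 1000:
        x, y, w, h = cv2.boundingRect(approx)
        # the footer's position and aspect ratio hint at the page orientation
        print(x, y, w / float(h))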
If you have some labeled images, you can also try #StereoMatching's idea. In that case, using a very simple HOG descriptor as the image representation plus a Support Vector Machine for the classification should do the trick. You can use OpenCV's HOG descriptor implementation and sklearn's SVC.
Assuming you have a nice load() function for your (small) dataset, you can do something like this:
import cv2
from sklearn import svm
from sklearn.model_selection import cross_val_score

### Load the dataset
path_images, labels = load(dataset)

### HOGD Options ###
winSize = (112, 112)
blockSize = (16, 16)
blockStride = (8, 8)
cellSize = (8, 8)
nbins = 9
derivAperture = 1
winSigma = 4.
histogramNormType = 0
L2HysThreshold = 2.0000000000000001e-01
gammaCorrection = 0
nlevels = 64
hog = cv2.HOGDescriptor(winSize, blockSize, blockStride, cellSize, nbins, derivAperture, winSigma,
                        histogramNormType, L2HysThreshold, gammaCorrection, nlevels)

### Get the dataset representation
### (resize every image to winSize so all descriptors have the same length)
hogds = list(map(lambda p: hog.compute(cv2.resize(cv2.imread(p), winSize)).flatten(), path_images))

### Get a sense of the performance
clf = svm.SVC(class_weight='balanced')
print(cross_val_score(clf, hogds, labels, cv=5))

Python Code containing OpenCV is not detecting object in my image

I am very new to OpenCV and Python. So, for my first project to recognize objects, I am using a small test file in Python. This is the test file:
import cv2
from matplotlib import pyplot as plt

# Opening image
img = cv2.imread("image.jpg")

# OpenCV opens images as BGR,
# but we want it as RGB. We'll
# also need a grayscale version.
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Use minSize so we don't bother with
# extra-small dots that would look like STOP signs
stop_data = cv2.CascadeClassifier('cascade.xml')
found = stop_data.detectMultiScale(img_gray, minSize=(20, 20))
amount_found = len(found)

if amount_found != 0:
    # There may be more than one
    # sign in the image
    for (x, y, width, height) in found:
        # We draw a green rectangle around
        # every recognized sign
        cv2.rectangle(img_rgb, (x, y),
                      (x + width, y + height),
                      (0, 255, 0), 5)

# Creates the environment of
# the picture and shows it
plt.subplot(1, 1, 1)
plt.imshow(img_rgb)
plt.show()
The tutorial I followed told me to use a premade cascade.xml file. However, I wanted to train my own cascade file to recognize a simple apple logo, which I cropped from a photo. I copy-pasted the same image multiple times into one folder titled p. For the negative images, I used a folder titled n. After this, I used the GUI cascade trainer and trained a cascade file. However, when I use the same image (uncropped) in the Python program, there is no output.
This is the image which I used to train:
This is the original image which I put in the Python script.
The original image is of size 422×612 and the cropped image is of size 285×354. I had set the size in the trainer to height=20 and width=15.
These are my settings:
Folder p:
Output of program:
Cascade GUI log end
NEG count : acceptanceRatio 0 : 0
Required leaf false alarm rate achieved. Branch training terminated.
Could someone tell me where I am going wrong? Please ask if I need to provide any more information. Is there some ratio relation between the positive image and the actual image that I am missing?
UPDATE: I added some more negative images and am now getting this output:
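Before retraining, a quick sanity check worth running (a minimal sketch, reusing the file names from the script above) is to confirm that the generated cascade.xml actually loads and contains stages, and to loosen the detectMultiScale parameters so you can see whether the cascade fires at all:

import cv2

stop_data = cv2.CascadeClassifier('cascade.xml')
if stop_data.empty():
    # the file is missing or contains no trained stages
    raise IOError("cascade.xml could not be loaded")

img_gray = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2GRAY)
# a smaller scaleFactor and minNeighbors make detection more permissive while debugging
found = stop_data.detectMultiScale(img_gray, scaleFactor=1.05, minNeighbors=1, minSize=(15, 20))
print(len(found), "candidate detections")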

X.shape[1] size doesn't fit the expected value

I'm currently working on my final degree project in robotics, and I decided to create an open-source robot capable of replicating human emotions. The robot is all set up and ready to receive orders, but I'm still busy coding it. I'm currently basing my code off this method. The idea is to extract 68 facial landmarks from
a low FPS video feed (using RPi Camera V2), feed those landmarks to a trained SVM classifier and have it return a numeral from 0-6 depending on the expression it detected (Angry, Disgust, Fear, Happy, Sad, Surprise and Neutral). I'm testing out the capabilities of my model with some pictures I took using the RPi Camera, and this is what I've managed to put together so far in terms of code:
# import the necessary packages
from imutils import face_utils
import dlib
import cv2
import numpy as np
import time
import argparse
import os
import sys
if sys.version_info >= (3, 0):
    import _pickle as cPickle
else:
    import cPickle
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from data_loader import load_data
from parameters import DATASET, TRAINING, HYPERPARAMS
def get_landmarks(image, rects):
    if len(rects) > 1:
        raise BaseException("TooManyFaces")
    if len(rects) == 0:
        raise BaseException("NoFaces")
    return np.matrix([[p.x, p.y] for p in predictor(image, rects[0]).parts()])
# initialize dlib's face detector (HOG-based) and then create
# the facial landmark predictor
print("Initializing variables...")
p = "shape_predictor_68_face_landmarks.dat"
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(p)
# path to pretrained model
path = "saved_model.bin"
# load pretrained model
print("Loading model...")
model = cPickle.load(open(path, 'rb'))
# initialize final image height & width
height = 48
width = 48
# initialize landmarks variable as empty array
landmarks = []
# load the input image and convert it to grayscale
print("Loading image...")
gray = cv2.imread("foo.jpg")
# detect faces in the grayscale image
print("Detecting faces in loaded image...")
rects = detector(gray, 0)
# loop over the face detections
print("Looping over detections...")
for (i, rect) in enumerate(rects):
    # determine the facial landmarks for the face region, then
    # convert the facial landmark (x, y)-coordinates to a NumPy
    # array
    shape = predictor(gray, rect)
    shape = face_utils.shape_to_np(shape)
    # loop over the (x, y)-coordinates for the facial landmarks
    # and draw them on the image
    for (x, y) in shape:
        cv2.circle(gray, (x, y), 2, (0, 255, 0), -1)
# show the output image with the face detections + facial landmarks
print("Storing saved image...")
cv2.imwrite("output.jpg", gray)
print("Image stored as /'output.jpg/'")
# arrange landmarks in array
print("Collecting and arranging landmarks...")
# scipy.misc.imsave('temp.jpg', image)
# image2 = cv2.imread('temp.jpg')
face_rects = [dlib.rectangle(left=1, top=1, right=47, bottom=47)]
landmarks = get_landmarks(gray, face_rects)
# load data
print("Loading collected data into predictor...")
print("Extracted landmarks: ", landmarks)
landmarks = np.array(landmarks.flatten())
# predict expression
print("Making prediction")
predicted = model.predict(landmarks)
However, after running the code everything seems to be fine up until this point:
Making prediction
Traceback (most recent call last):
File "face.py", line 97, in <module>
predicted = model.predict(landmarks)
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 576, in predict
y = super(BaseSVC, self).predict(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 325, in predict
X = self._validate_for_predict(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 478, in _validate_for_predict
(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 136 should be equal to 2728, the number of features at training time
I searched for similar issues on this website, but since my use case is so specific I didn't quite find what I needed. I've been working on the design and research for quite some time, but finding all the snippets needed to make the code work has taken most of my time, and I'd love to polish this concept as soon as possible since the presentation date is approaching quickly. Any and all contributions are greatly welcome!
Here's the trained model I'm currently using, by the way.
I am probably being silly, but it looks like you define path after you use it to load your model.
Also path seems like a very bad name for a variable containing a file location, perhaps modelFileLocation is less likely to already be defined.
Solved it! It turns out my model was trained using a combination of HOG features and dlib landmarks; however, I was only feeding the landmarks to the predictor, which caused the size discrepancy.
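For anyone hitting the same error: the fix amounts to feeding the classifier the same concatenation of features it was trained on. A rough sketch of what that could look like, continuing the script above (np, cv2, gray, landmarks, and model are already defined there); the HOG parameters and the concatenation order are placeholders and must match whatever was used at training time, otherwise the feature length still will not equal 2728.

from skimage.feature import hog

# convert to a 48x48 grayscale patch ("gray" above is loaded with imread and is still BGR)
face_48 = cv2.cvtColor(cv2.resize(gray, (48, 48)), cv2.COLOR_BGR2GRAY)

# placeholder HOG parameters - these must match the training pipeline
hog_features = hog(face_48, orientations=8, pixels_per_cell=(16, 16),
                   cells_per_block=(1, 1))

# concatenate HOG features and the flattened landmarks, then predict on a 2D array
features = np.concatenate((hog_features.ravel(), np.asarray(landmarks).ravel()))
predicted = model.predict(features.reshape(1, -1))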

Uniformity of color and texture in image

I am new to the field of deep learning and have a problem determining whether two images have uniform color and texture. For example, I have a
Master image -
Now, with respect to this image, I need to determine whether the following images have uniform texture and color distributions -
image 1 -
image 2 -
image 3 -
I need to develop an algorithm that will evaluate these 3 images against the master image. The algorithm should approve image 1, reject image 2 because of its color, and reject image 3 because of both color and texture uniformity.
My approach to the problem was to directly analyze the image for texture. I found that the Local Binary Patterns method was good among the texture recognition methods (but I am not sure). I used its skimage implementation with OpenCV in Python and found that the method worked.
from skimage import feature
import numpy as np
import cv2
import matplotlib.pyplot as plt

class LocalBinaryPatterns:
    def __init__(self, numPoints, radius):
        # store the number of points and radius
        self.numPoints = numPoints
        self.radius = radius

    def describe(self, image, eps=1e-7):
        # compute the Local Binary Pattern representation
        # of the image, and then use the LBP representation
        # to build the histogram of patterns
        lbp = feature.local_binary_pattern(image, self.numPoints,
                                           self.radius, method="uniform")
        (hist, _) = np.histogram(lbp.ravel(),
                                 bins=np.arange(0, self.numPoints + 3),
                                 range=(0, self.numPoints + 2))
        # normalize the histogram
        hist = hist.astype("float")
        hist /= (hist.sum() + eps)
        # return the histogram of Local Binary Patterns
        return hist

desc = LocalBinaryPatterns(24, 8)
image = cv2.imread("main.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
hist = desc.describe(gray)
plt.plot(hist, 'b-')
plt.ylabel('Feature Vectors')
plt.show()
It extracted the features and built a histogram of feature vectors. I plotted the histograms using matplotlib and clearly saw that the texture features of image 1 and image 2 were almost identical to those of the master image, while the texture features of image 3 did not match.
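To go beyond visually comparing the plots, the LBP histograms can also be compared numerically, for example with a chi-squared distance; a minimal sketch, continuing the script above (the test file names are hypothetical):

def chi2_distance(histA, histB, eps=1e-10):
    # a small chi-squared distance means similar texture distributions
    return 0.5 * np.sum(((histA - histB) ** 2) / (histA + histB + eps))

master_hist = desc.describe(cv2.cvtColor(cv2.imread("main.png"), cv2.COLOR_BGR2GRAY))
for name in ["test1.jpg", "test2.jpg", "test3.jpg"]:
    test_hist = desc.describe(cv2.cvtColor(cv2.imread(name), cv2.COLOR_BGR2GRAY))
    print(name, chi2_distance(master_hist, test_hist))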
Then I started analyzing the images for their color. I plotted the color histograms using OpenCV as follows:
import cv2
from matplotlib import pyplot as plt

def draw_image_histogram(image, channels, color='k'):
    hist = cv2.calcHist([image], channels, None, [256], [0, 256])
    plt.plot(hist, color=color)
    plt.xlim([0, 256])

def show_color_histogram(image):
    for i, col in enumerate(['b', 'g', 'r']):
        draw_image_histogram(image, [i], color=col)
    plt.show()

show_color_histogram(cv2.imread("test1.jpg"))
I found that the color histogram of image 1 matched the master image, while the color histograms of images 2 and 3 did not. In this way I figured out that image 1 was matching and images 2 and 3 were not.
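Beyond plotting, OpenCV can score the color similarity directly with cv2.compareHist; a minimal sketch, with hypothetical file names for the master and test images:

import cv2

def color_hist(path):
    img = cv2.imread(path)
    # 3D BGR histogram with 8 bins per channel, normalized so image size does not matter
    hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

master = color_hist("main.png")
for name in ["test1.jpg", "test2.jpg", "test3.jpg"]:
    score = cv2.compareHist(master, color_hist(name), cv2.HISTCMP_CORREL)
    # a correlation close to 1.0 means a similar color distribution
    print(name, score)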
But this is a pretty simple approach, and I have no idea how many false positives it will produce. Moreover, I don't know whether this approach is the best one for the problem.
I would also like this to be done by a single, robust algorithm such as a CNN (but it should not be computationally too expensive). However, I have no experience with CNNs. Should I train a CNN with master images? Please point me in the right direction. I also came across LBCNNs; can they solve the problem? And what other, better approaches are there?
Thank you so much for the help
CNNs are good at capturing the underlying features and distribution of a dataset, but they need a lot of data (hundreds of thousands of examples) to learn and extract those features, which is a very expensive task. Also, for high-resolution images a CNN needs more parameters to extract those features, which further increases the demand for data.
If you have a large dataset, you can prefer a CNN, which can capture fine-grained information such as these subtle textures. Otherwise, the classical methods (the ones you have already tried) also work well.
There is also a method called transfer learning, where we take a pre-trained model (trained on a similar dataset) and fine-tune it on our small dataset. If you can find such a model, that can be another option; a rough sketch follows.
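One common way to apply that idea, sketched below under the assumption of an ImageNet-pretrained torchvision backbone (the images and labels lists stand in for your own small dataset): use the pretrained network as a fixed feature extractor and train a lightweight classifier such as an SVM on top.

import torch
import torchvision
from sklearn.svm import SVC

# load an ImageNet-pretrained ResNet-18 and drop its classification head
weights = torchvision.models.ResNet18_Weights.DEFAULT
backbone = torchvision.models.resnet18(weights=weights)
backbone.fc = torch.nn.Identity()
backbone.eval()
preprocess = weights.transforms()  # the resize/normalize pipeline the weights expect

def embed(pil_image):
    # one 512-dimensional feature vector per image
    with torch.no_grad():
        return backbone(preprocess(pil_image).unsqueeze(0)).squeeze(0).numpy()

# images: list of PIL images; labels: 1 for "uniform like the master", 0 otherwise
features = [embed(img) for img in images]
clf = SVC(class_weight="balanced").fit(features, labels)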
