I want to use my Raspberry Pi 4 to detect license plates in real time without any hardware add-ons. I know it doesn't sound very feasible, but hear me out.
I am using two ways of optimizing the network for this purpose: stripping down the neurons of YOLOv5 and using ONNX Runtime for inference, which from what I understand is well suited to the RPi architecture. I also apply quantization to my ONNX network. Stripping down the network works well, probably because the task is not very complicated and a relatively simple network is good enough for it. I've reduced the number of weights to 200k and reached about 10 FPS. However, when I reduced the number down to 27k as an experiment, I only got an increase of 4 FPS, up to 14 FPS. I realize that inference time does not depend linearly on the number of weights, but I wonder if there is any way to squeeze more FPS out of it. I don't care about precision as much.
Here's the stripped-down model YAML I am using:
# YOLOv5 🚀 by Ultralytics, GPL-3.0 license
# Parameters
nc: 80 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32
# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [16, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [32, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [32]],
   [-1, 1, Conv, [64, 3, 2]],  # 3-P3/8
   [-1, 3, C3, [64]],
   [-1, 1, Conv, [64, 3, 2]],  # 5-P4/16
   [-1, 5, C3, [128]],
   [-1, 1, Conv, [128, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [128]],
   [-1, 1, SPPF, [128, 5]],  # 9
  ]
# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [32, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [32, False]],  # 13
   [-1, 1, Conv, [32, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [64, False]],  # 17 (P3/8-small)
   [-1, 1, Conv, [64, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [128, False]],  # 20 (P4/16-medium)
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [128, False]],  # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
To quantize my model I use:
from onnxruntime.quantization import quantize_dynamic, QuantType
model_fp32 = 'yolo200k.onnx'
model_quant = 'yolo200k_quant.onnx'
quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)
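A quick sanity check (a sketch, not part of the original pipeline) to confirm that the quantized model still loads and produces output of the expected shape; the 1x3x416x416 dummy input shape is assumed from the inference script below:
import numpy as np
import onnxruntime as rt

dummy = np.random.rand(1, 3, 416, 416).astype('float32')
for path in (model_fp32, model_quant):
    sess = rt.InferenceSession(path, providers=['CPUExecutionProvider'])
    out = sess.run(None, {sess.get_inputs()[0].name: dummy})[0]
    print(path, out.shape)  # both models should report the same output shape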
Here's my inference script:
import cv2
import time
import numpy as np
from PIL import Image
import onnxruntime as rt
#Initialize inference
sess = rt.InferenceSession('yolo200k_quant.onnx', providers = rt.get_available_providers())
input_name = sess.get_inputs()[0].name
cap = cv2.VideoCapture(-1)
print(cap.isOpened())
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
while True:
    start = time.time()
    ret, frame = cap.read()
    image = frame[0:416, 0:416]
    frame_cropped = image
    image = np.expand_dims(image, axis=3)
    image = np.transpose(image / 256, (3, 2, 0, 1))
    start_inference = time.time()
    pred_onnx = sess.run(None, {input_name: image.astype('float32')})
    print("Inference time:" + str(time.time() - start_inference))
    output = pred_onnx[0]
    size = pred_onnx[0].size
    dimensions = pred_onnx[0].shape[2]
    rows = size / dimensions
    confidenceIndex = 4
    labelStartIndex = 5
    modelWidth = 416.0
    modelHeight = 416.0
    xGain = modelWidth / image.shape[2]
    yGain = modelHeight / image.shape[3]
    frame = np.array(cap)
    location = [0] * 4
    locations = []
    for i in range(int(rows)):
        # index = i * dimensions
        if output[0][i][confidenceIndex] <= 0.4:
            continue
        for j in range(labelStartIndex, dimensions):
            output[0][i][j] = output[0][i][j] * output[0][i][confidenceIndex]
        for k in range(labelStartIndex, dimensions):
            if output[0][i][k] <= 0.5:
                continue
            location[0] = (output[0][i][0] - output[0][i][2] / 2) / xGain  # top left x
            location[1] = (output[0][i][1] - output[0][i][3] / 2) / yGain  # top left y
            location[2] = (output[0][i][0] + output[0][i][2] / 2) / xGain  # bottom right x
            location[3] = (output[0][i][1] + output[0][i][3] / 2) / yGain  # bottom right y
            locations.append(location)
            cv2.rectangle(frame_cropped, (int(location[0]), int(location[1])),
                          (int(location[2]), int(location[3])), (10, 255, 0), 2)
    end = time.time()
    fps = 1 / (end - start)
    cv2.imshow('frame', frame_cropped)
    print(fps)
    key = cv2.waitKey(1)
    if key == ord("q"):
        break
cap.release()
I use a very inefficient way of cropping a frame for inference and later finding and drawing the bounding boxes; however, I judge the FPS from the inference time alone, which is independent of those steps. The actual FPS is lower and will require further optimization. Right now I want to decrease the inference time itself so that I can theoretically achieve real time. Are there any techniques to achieve that other than the ones I've already discussed, or is there a way to use them more effectively than I do? I am wondering if there is a way to use ONNX Runtime more efficiently.
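One session-level knob that might be worth trying, sketched here under the assumption of a 4-core Pi 4 and using only standard onnxruntime options (the thread counts are assumptions and need measuring):
import onnxruntime as rt

opts = rt.SessionOptions()
opts.intra_op_num_threads = 4  # one thread per core on the Pi 4 (assumption)
opts.inter_op_num_threads = 1
opts.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL

sess = rt.InferenceSession('yolo200k_quant.onnx',
                           sess_options=opts,
                           providers=['CPUExecutionProvider'])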
Related
I'm training a network to reconstruct coordinates of specific structures in an image. Until now, my loss function contains three 2D vectors (i.e. 6 variables) for the coordinates, which are learned via MSE, and three corresponding classifiers learned via SigmoidFocalCrossEntropy, indicating whether 0, 1, 2 or 3 of these structures are present. I thought it might be beneficial to give TensorFlow the information that the order in which the vectors are reconstructed does not matter, as long as the classifiers are still correct. Simple example:
loss(tf.constant([[30, 20, 15, 7, 0, 0, 1, 1, 0]], dtype=tf.float32),
tf.constant([[0, 0, 15, 7, 30, 20, 0, 1, 1]], dtype=tf.float32)) == 0
To implement this I used tf.argsort on the magnitude of each vector:
def sort(tensor):
    x = tf.unstack(tensor, axis=-1)
    squ = []
    for i in range(len(x) // 2):
        i *= 2
        squ.append(x[i] ** 2 + x[i + 1] ** 2)
    new = tf.stack(squ, axis=-1)
    return tf.argsort(new, axis=-1, direction='ASCENDING',
                      stable=False, name=None)
and then permuted the tensor accordingly:
def permute_tensor_structure(tensor, indices):
    c = indices + 6
    x = indices * 2
    y = indices * 2 + 1
    v = tf.transpose([x, y])
    v = tf.reshape(v, [indices.shape[0], -1])
    perm = tf.concat([v, c], axis=-1)
    return tf.gather(tensor, perm, batch_dims=1, name=None, axis=-1)
I did the same for my ground truth and got the network up and running.
Minimal example extracted from my code:
import tensorflow as tf
from tensorflow_addons.losses import SigmoidFocalCrossEntropy
def compute_permute_loss(truth, predict):
    l2 = tf.keras.losses.MeanSquaredError()
    ce = SigmoidFocalCrossEntropy()
    indices = sort(predict[:, 0:6])
    indices2 = sort(truth[:, 0:6])
    predict = permute_tensor_structure(predict, indices)
    truth = permute_tensor_structure(truth, indices2)
    L2 = l2(predict[:, 0:6], truth[:, 0:6])
    BCE = tf.reduce_sum(ce(truth[:, 6:], predict[:, 6:]))
    return 3 * L2 + BCE
class Test(tf.test.TestCase):
    def test_permutation_loss(self):
        tensor1 = tf.constant(
            [[30, 20, 15, 7, 0, 0, 1, 1, 0]],
            dtype=tf.float32)
        tensor2 = permute_tensor_structure(tensor1, tf.constant([[2, 1, 0]]))
        loss = compute_permute_loss(tensor1, tensor2)
        self.assertEqual(loss, 0,
                         msg="loss for only permuted tensors is not zero")
        tensor3 = tf.constant(
            [[29, 22, 15, 7, 0, 0, 1, 1, 0]],
            dtype=tf.float32)
        loss = compute_permute_loss(tensor3, tensor2)
        self.assertAllClose(loss, (1.0 + 4.0) / 2.0,
                            msg="loss for values is not rmse")

if __name__ == "__main__":
    tf.test.main()
# [ RUN ] Test.test_permutation_loss
# [ OK ] Test.test_permutation_loss
However, I'm afraid the permutation in the loss could backfire and impair TensorFlow's backpropagation. Maybe somebody has already faced a similar problem? Or has deeper knowledge of TensorFlow's graph building and backpropagation? I would be grateful for every suggestion or input.
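One quick way to probe this is a tf.GradientTape sanity check: tf.argsort produces integer indices (no gradient flows through the sort itself), but tf.gather does route gradients to the gathered values, so the loss should still produce finite gradients with respect to the prediction. A minimal sketch reusing the functions above:
import tensorflow as tf

truth = tf.constant([[30, 20, 15, 7, 0, 0, 1, 1, 0]], dtype=tf.float32)
predict = tf.Variable([[14, 8, 31, 19, 1, 1, 1, 1, 0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    loss = compute_permute_loss(truth, predict)

grads = tape.gradient(loss, predict)
print(grads)  # should not be None and should contain only finite values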
I would like to know if there is an efficient method to get sub-arrays from a larger numpy array.
What I have is an application of np.where. I iterate 'manually' over x and y as offsets and apply where with a kernel to each rectangle extracted from the larger array with proper dimensions.
But is there a more direct approach in numpy's collection of methods?
import numpy as np
example = np.arange(20).reshape((5, 4))
# e.g. a cross kernel
a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
np.where(a_kernel, example[1:4, 1:4], 0)
# returns
# array([[ 0, 6, 0],
# [ 9, 10, 11],
# [ 0, 14, 0]])
def arrays_from_kernel(a, a_kernel):
    width, height = a_kernel.shape
    y_max, x_max = a.shape
    return [np.where(a_kernel, a[y:(y + height), x:(x + width)], 0)
            for y in range(y_max - height + 1)
            for x in range(x_max - width + 1)]

sub_arrays = arrays_from_kernel(example, a_kernel)
This returns the arrays I need for further processing.
# [array([[0, 1, 0],
# [4, 5, 6],
# [0, 9, 0]]),
# array([[ 0, 2, 0],
# [ 5, 6, 7],
# [ 0, 10, 0]]),
# ...
# array([[ 0, 9, 0],
# [12, 13, 14],
# [ 0, 17, 0]]),
# array([[ 0, 10, 0],
# [13, 14, 15],
# [ 0, 18, 0]])]
The context: similar to 2D convolution I would like to apply a custom function on each of the subarrays (e.g. product of squared numbers).
At the moment, you're manually advancing a sliding window over the data - stride tricks to the rescue! (And no, I didn't just make that up - there's actually a submodule called stride_tricks in numpy!) Instead of manually building windows into the data, and calling np.where() on them, if you had the windows in an array, you could call np.where() just once. Stride tricks allow you to create such an array without even having to copy the data.
Let me explain. Normal slices in numpy create views into the original data instead of copies. This is done by referring to the original data, but changing the strides used to access the data (ie. how much to jump between two elements or two rows, and so on). Stride tricks allow you to modify those strides more freely than just slicing and reshaping does, so you can eg. iterate over the same data more than once, which is useful here.
Let me demonstrate:
import numpy as np
example = np.arange(20).reshape((5, 4))
a_kernel = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
def sliding_window(data, win_shape, **kwargs):
    assert data.ndim == len(win_shape)
    shape = tuple(dn - wn + 1 for dn, wn in zip(data.shape, win_shape)) + win_shape
    strides = data.strides * 2
    return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides, **kwargs)

def arrays_from_kernel(a, a_kernel):
    windows = sliding_window(a, a_kernel.shape)
    return np.where(a_kernel, windows, 0)

sub_arrays = arrays_from_kernel(example, a_kernel)
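On NumPy 1.20 and newer, the same windows can be obtained from the built-in sliding_window_view, which spares you from computing shapes and strides by hand; a sketch under that version assumption:
import numpy as np

example = np.arange(20).reshape((5, 4))
a_kernel = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

# Requires NumPy >= 1.20; yields a (3, 2, 3, 3) view containing every 3x3 window.
windows = np.lib.stride_tricks.sliding_window_view(example, a_kernel.shape)
sub_arrays = np.where(a_kernel, windows, 0)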
The scipy.ndimage module offers a number of filters -- one of which might meet your needs. If none of those filters do what you want, you could use ndimage.generic_filter
to call a custom function on each subarray. ndimage.generic_filter is not as fast as the other ndimage filters, however.
For example,
import numpy as np
example = np.arange(20).reshape((5, 4))
a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
# def arrays_from_kernel(a, a_kernel):
# width, height = a_kernel.shape
# y_max, x_max = a.shape
# return [np.where(a_kernel, a[y:(y + height), x:(x + width)], 0)
# for y in range(y_max - height + 1)
# for x in range(x_max - width + 1)]
# sub_arrays = arrays_from_kernel(example, a_kernel)
# for arr in sub_arrays:
# print(arr)
# print('-'*80)
import scipy.ndimage as ndimage

def func(x):
    # reject subarrays that extend beyond the border of the `example` array
    if not np.isnan(x).any():
        y = np.zeros_like(a_kernel, dtype=example.dtype)
        np.put(y, np.flatnonzero(a_kernel), x)
        print(y)
    # Instead of returning 0, you can perform your desired computation on the subarray here.
    # Note that you may not need the 2D array y; often, you only need the values in the 1D array x.
    return 0

result = ndimage.generic_filter(example, func, footprint=a_kernel, mode='constant', cval=np.nan)
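For the product-of-squared-numbers example mentioned in the question, the callback could return the actual result instead of 0; a sketch along the same lines (still rejecting windows that reach past the border):
import numpy as np
import scipy.ndimage as ndimage

example = np.arange(20).reshape((5, 4)).astype(float)
a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

def prod_of_squares(x):
    # x holds only the values under the footprint, as a 1D array
    if np.isnan(x).any():
        return np.nan  # the window sticks out of the array
    return np.prod(x ** 2)

result = ndimage.generic_filter(example, prod_of_squares,
                                footprint=a_kernel, mode='constant', cval=np.nan)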
For the particular problem of computing the product of squares for each subarray, you
could convert the product into a sum by taking advantage of the fact that A * B = exp(log(A)+log(B)). This would allow you to express the computation as a normal convolution. Now using ndimage.convolve can improve performance a lot. The amount of the improvement depends on the size of example:
import numpy as np
import scipy.ndimage as ndimage
import perfplot
a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
def orig(example, a_kernel=a_kernel):
    def arrays_from_kernel(a, a_kernel):
        width, height = a_kernel.shape
        y_max, x_max = a.shape
        return [
            np.where(a_kernel, a[y:(y + height), x:(x + width)], 1)
            for y in range(y_max - height + 1)
            for x in range(x_max - width + 1)
        ]

    return [np.prod(x) ** 2 for x in arrays_from_kernel(example, a_kernel)]

def alt(example, a_kernel=a_kernel):
    logged = np.log(example)
    result = ndimage.convolve(logged, a_kernel, mode="constant", cval=0)[1:-1, 1:-1]
    return (np.exp(result) ** 2).ravel()

def make_example(N):
    return np.random.random(size=(N, N))

def check(A, B):
    return np.allclose(A, B)

perfplot.show(
    setup=make_example,
    kernels=[orig, alt],
    n_range=[2 ** k for k in range(2, 11)],
    logx=True,
    logy=True,
    xlabel="len(example)",
    equality_check=check,
)
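As a quick sanity check of the log/exp trick independent of perfplot, both implementations can be compared on a small random input; they should agree up to floating-point tolerance:
example = make_example(6)
print(np.allclose(orig(example), alt(example)))  # expected: True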
This is a sample of my code:
def normalize_3D(input):
    for i in range(input.shape[0]):
        s = tf.concat([tf.reshape(input[i, 9, 0], shape=[1, 1]),
                       tf.reshape(input[i, 9, 1], shape=[1, 1]),
                       tf.reshape(input[i, 9, 2], shape=[1, 1])], axis=1)
        output = input[i, :, :] - s
        output2 = output / tf.sqrt(tf.square(input[i, 9, 0] - input[i, 0, 0]) +
                                   tf.square(input[i, 9, 1] - input[i, 0, 1]) +
                                   tf.square(input[i, 9, 2] - input[i, 0, 2]))
        output2 = tf.reshape(output2, [1, input.shape[1], input.shape[2]])
        if i == 0:
            output3 = output2
        else:
            output3 = tf.concat([output3, output2], axis=0)
    return output3
As in this sample, I use 'for' loops many times to process data that has only a small batch size.
However, while writing my code I noticed that it uses a lot of memory and an error message came up.
Some of my predictions just show 'nan', and after that the program gets stuck.
Is there any way to reduce this kind of memory abuse when I process batched data?
Your function can be expressed in a simpler and more efficient way like this:
import tensorflow as tf
def normalize_3D(input):
    shift = input[:, 9]
    scale = tf.norm(input[:, 9] - input[:, 0], axis=1, keepdims=True)
    output = (input - tf.expand_dims(shift, 1)) / tf.expand_dims(scale, 1)
    return output
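A quick way to convince yourself that the vectorized version matches the original loop is to run both on a small random batch. A sketch, assuming the loop-based function from the question has been kept around under the hypothetical name normalize_3D_loop to avoid the name clash:
import numpy as np
import tensorflow as tf

# Batch of 4 point sets, each with 12 points of 3 coordinates (shapes are assumptions).
x = tf.constant(np.random.rand(4, 12, 3), dtype=tf.float32)

a = normalize_3D_loop(x)  # loop-based version from the question (hypothetical rename)
b = normalize_3D(x)       # vectorized version above
print(np.allclose(a.numpy(), b.numpy(), atol=1e-5))  # expected: True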
I am currently learning about computer vision and machine learning through Stanford's free online course CS131. I came across some heavy math formulas and was wondering if anyone could explain how one would go about implementing a naive convolution with 4 nested for loops, knowing only the image height and width and the kernel height and width. I was able to come up with this solution by researching online.
image_padded = np.zeros((image.shape[0] + 2, image.shape[1] + 2))
image_padded[1:-1, 1:-1] = image
for x in range(image.shape[1]):  # Loop over every pixel of the image
    for y in range(image.shape[0]):
        # element-wise multiplication of the kernel and the image
        out[y, x] = (kernel * image_padded[y:y + 3, x:x + 3]).sum()
I was able to understand this based on some website examples that use this type of algorithm; however, I can't seem to grasp how 4 nested for loops would do it. And if you could, please break down the formula into something more digestible than the mathematical equations found online.
Edit:
Just to clarify: while the code snippet I left works to a certain degree, I'm trying to come up with a solution that's a bit less optimized and a bit more beginner friendly, such as what this code skeleton is asking for:
def conv_nested(image, kernel):
    """A naive implementation of convolution filter.

    This is a naive implementation of convolution using 4 nested for-loops.
    This function computes convolution of an image with a kernel and outputs
    the result that has the same shape as the input image.

    Args:
        image: numpy array of shape (Hi, Wi)
        kernel: numpy array of shape (Hk, Wk)

    Returns:
        out: numpy array of shape (Hi, Wi)
    """
    Hi, Wi = image.shape
    Hk, Wk = kernel.shape
    out = np.zeros((Hi, Wi))
    ### YOUR CODE HERE
    ### END YOUR CODE
    return out
For this task scipy.signal.correlate2d is your friend.
Demo
I wrapped your code in a function named naive_correlation:
import numpy as np
def naive_correlation(image, kernel):
    image_padded = np.zeros((image.shape[0] + 2, image.shape[1] + 2))
    image_padded[1:-1, 1:-1] = image
    out = np.zeros_like(image)
    for x in range(image.shape[1]):
        for y in range(image.shape[0]):
            out[y, x] = (kernel * image_padded[y:y + 3, x:x + 3]).sum()
    return out
Notice that your snippet throws an error because out is not initialized.
In [67]: from scipy.signal import correlate2d
In [68]: img = np.array([[3, 9, 5, 9],
...: [1, 7, 4, 3],
...: [2, 1, 6, 5]])
...:
In [69]: kernel = np.array([[0, 1, 0],
...: [0, 0, 0],
...: [0, -1, 0]])
...:
In [70]: res1 = correlate2d(img, kernel, mode='same')
In [71]: res1
Out[71]:
array([[-1, -7, -4, -3],
[ 1, 8, -1, 4],
[ 1, 7, 4, 3]])
In [72]: res2 = naive_correlation(img, kernel)
In [73]: np.array_equal(res1, res2)
Out[73]: True
If you wish to perform convolution rather than correlation you could use convolve2d.
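Convolution is just correlation with the kernel flipped along both axes, which is easy to verify with the img and kernel arrays above; a small sketch:
from scipy.signal import convolve2d, correlate2d

res_conv = convolve2d(img, kernel, mode='same')
res_corr_flipped = correlate2d(img, kernel[::-1, ::-1], mode='same')
print(np.array_equal(res_conv, res_corr_flipped))  # True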
Edit
Is this what you are looking for?
def explicit_correlation(image, kernel):
    hi, wi = image.shape
    hk, wk = kernel.shape
    image_padded = np.zeros(shape=(hi + hk - 1, wi + wk - 1))
    # note the parentheses in -(hk // 2); for odd-sized kernels this leaves
    # exactly hi x wi cells for the original image
    image_padded[hk // 2:-(hk // 2), wk // 2:-(wk // 2)] = image
    out = np.zeros(shape=image.shape)
    for row in range(hi):
        for col in range(wi):
            for i in range(hk):
                for j in range(wk):
                    out[row, col] += image_padded[row + i, col + j] * kernel[i, j]
    return out
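In formula form, what the four loops compute is the cross-correlation
out[row, col] = sum over i, j of image_padded[row + i, col + j] * kernel[i, j],
with the zero padding making the output the same shape as the input; for true convolution you would index the kernel as kernel[hk - 1 - i, wk - 1 - j] instead.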
I'm trying to apply the Sobel filter on an image to detect edges using scipy. I'm using Python 3.2 (64 bit) and scipy 0.9.0 on Windows 7 Ultimate (64 bit). Currently my code is as follows:
import scipy
from scipy import ndimage
im = scipy.misc.imread('bike.jpg')
processed = ndimage.sobel(im, 0)
scipy.misc.imsave('sobel.jpg', processed)
I don't know what I'm doing wrong, but the processed image does not look anything like what it should. The image, 'bike.jpg' is a greyscale (mode 'L' not 'RGB') image so each pixel has only one value associated with it.
Unfortunately I can't post the images here yet (don't have enough reputation) but I've provided links below:
Original Image (bike.jpg):
http://s2.postimage.org/64q8w613j/bike.jpg
Scipy Filtered (sobel.jpg):
http://s2.postimage.org/64qajpdlb/sobel.jpg
Expected Output:
http://s1.postimage.org/5vexz7kdr/normal_sobel.jpg
I'm obviously going wrong somewhere! Can someone please tell me where. Thanks.
1) Use a higher precision. 2) You are only calculating the approximation of the derivative along the zero axis. The 2D Sobel operator is explained on Wikipedia. Try this code:
import numpy
import scipy
from scipy import ndimage
im = scipy.misc.imread('bike.jpg')
im = im.astype('int32')
dx = ndimage.sobel(im, 0) # horizontal derivative
dy = ndimage.sobel(im, 1) # vertical derivative
mag = numpy.hypot(dx, dy) # magnitude
mag *= 255.0 / numpy.max(mag) # normalize (Q&D)
scipy.misc.imsave('sobel.jpg', mag)
I couldn't comment on cgohlke's answer, so I repeated his answer with a correction. Parameter 0 is used for the vertical derivative and 1 for the horizontal derivative (the first axis of an image array is the y/vertical direction, i.e. rows, and the second axis is the x/horizontal direction, i.e. columns). Just wanted to warn other users, because I lost an hour searching for the mistake in the wrong places.
import numpy
import scipy
from scipy import ndimage
im = scipy.misc.imread('bike.jpg')
im = im.astype('int32')
dx = ndimage.sobel(im, 1) # horizontal derivative
dy = ndimage.sobel(im, 0) # vertical derivative
mag = numpy.hypot(dx, dy) # magnitude
mag *= 255.0 / numpy.max(mag) # normalize (Q&D)
scipy.misc.imsave('sobel.jpg', mag)
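On recent SciPy versions scipy.misc.imread and scipy.misc.imsave no longer exist, so the snippets above need a different image I/O library. A sketch of the same pipeline using imageio (assumed to be installed):
import numpy as np
import imageio.v2 as imageio
from scipy import ndimage

im = imageio.imread('bike.jpg').astype('int32')
dx = ndimage.sobel(im, 1)   # horizontal derivative (second axis = x)
dy = ndimage.sobel(im, 0)   # vertical derivative (first axis = y)
mag = np.hypot(dx, dy)      # gradient magnitude
mag *= 255.0 / np.max(mag)  # quick-and-dirty normalization
imageio.imwrite('sobel.jpg', mag.astype(np.uint8))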
Or you can use:
import numpy as np
from scipy import signal

def sobel_filter(im, k_size):
    # accepts an (H, W) grayscale or an (H, W, 3) color image
    im = im.astype(float)
    if im.ndim == 3 and im.shape[2] > 1:
        # convert to luminance
        img = 0.2126 * im[:, :, 0] + 0.7152 * im[:, :, 1] + 0.0722 * im[:, :, 2]
    else:
        img = np.squeeze(im)
    assert k_size == 3 or k_size == 5
    if k_size == 3:
        kh = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        kv = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)
    else:
        kh = np.array([[-1, -2, 0, 2, 1],
                       [-4, -8, 0, 8, 4],
                       [-6, -12, 0, 12, 6],
                       [-4, -8, 0, 8, 4],
                       [-1, -2, 0, 2, 1]], dtype=float)
        kv = np.array([[1, 4, 6, 4, 1],
                       [2, 8, 12, 8, 2],
                       [0, 0, 0, 0, 0],
                       [-2, -8, -12, -8, -2],
                       [-1, -4, -6, -4, -1]], dtype=float)
    gx = signal.convolve2d(img, kh, mode='same', boundary='symm', fillvalue=0)
    gy = signal.convolve2d(img, kv, mode='same', boundary='symm', fillvalue=0)
    g = np.sqrt(gx * gx + gy * gy)
    g *= 255.0 / np.max(g)
    # plt.figure()
    # plt.imshow(g, cmap=plt.cm.gray)
    return g
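A brief usage sketch, reusing an im array loaded as in the earlier snippets:
edges3 = sobel_filter(im, 3)  # 3x3 Sobel
edges5 = sobel_filter(im, 5)  # 5x5 Sobel (smoother, larger support)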