Allow or prohibit permutation in tensorflow loss function? - python

I'm training a network to reconstruct coordinates of specific structures in an image. Until now, my loss function contains three 2D vectors (i.e. 6 variables) for the coordinates, which are learned via MSE, and three corresponding classifiers, learned via SigmoidFocalCrossEntropy, indicating whether 0, 1, 2 or 3 of these structures are present. I thought it might be beneficial to give TensorFlow the information that the order in which the vectors are reconstructed does not matter, as long as the classifiers still match. Simple example:
loss(tf.constant([[30, 20, 15, 7, 0, 0, 1, 1, 0]], dtype=tf.float32),
     tf.constant([[0, 0, 15, 7, 30, 20, 0, 1, 1]], dtype=tf.float32)) == 0
To implement this, I used tf.argsort on the squared magnitude of each vector:
def sort(tensor):
    # rank the three (x, y) vectors by their squared magnitude
    x = tf.unstack(tensor, axis=-1)
    squ = []
    for i in range(0, len(x), 2):
        squ.append(x[i] ** 2 + x[i + 1] ** 2)
    new = tf.stack(squ, axis=-1)
    return tf.argsort(new, axis=-1, direction='ASCENDING',
                      stable=False, name=None)
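For example, for the coordinate part of the tensor above, the squared magnitudes are 1300, 274 and 0, so the ascending order is [2, 1, 0] (a quick check of the helper):

idx = sort(tf.constant([[30, 20, 15, 7, 0, 0]], dtype=tf.float32))
print(idx)  # tf.Tensor([[2 1 0]], shape=(1, 3), dtype=int32)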
Then I permuted the tensor accordingly:
def permute_tensor_structure(tensor, indices):
    # turn the vector order `indices` into a permutation of the
    # interleaved x/y columns (0..5) and the classifier columns (6..8)
    c = indices + 6
    x = indices * 2
    y = indices * 2 + 1
    v = tf.transpose([x, y])
    v = tf.reshape(v, [indices.shape[0], -1])
    perm = tf.concat([v, c], axis=-1)
    return tf.gather(tensor, perm, batch_dims=1, name=None, axis=-1)
I did the same for my ground truth and got the network up and running.
Minimal example extracted from my code:
import tensorflow as tf
from tensorflow_addons.losses import SigmoidFocalCrossEntropy

def compute_permute_loss(truth, predict):
    l2 = tf.keras.losses.MeanSquaredError()
    ce = SigmoidFocalCrossEntropy()
    indices = sort(predict[:, 0:6])
    indices2 = sort(truth[:, 0:6])
    predict = permute_tensor_structure(predict, indices)
    truth = permute_tensor_structure(truth, indices2)
    L2 = l2(predict[:, 0:6], truth[:, 0:6])
    BCE = tf.reduce_sum(ce(truth[:, 6:], predict[:, 6:]))
    return 3 * L2 + BCE
class Test(tf.test.TestCase):
    def test_permutation_loss(self):
        tensor1 = tf.constant(
            [[30, 20, 15, 7, 0, 0, 1, 1, 0]],
            dtype=tf.float32)
        tensor2 = permute_tensor_structure(tensor1, tf.constant([[2, 1, 0]]))
        loss = compute_permute_loss(tensor1, tensor2)
        self.assertEqual(loss, 0,
                         msg="loss for only permuted tensors is not zero")
        tensor3 = tf.constant(
            [[29, 22, 15, 7, 0, 0, 1, 1, 0]],
            dtype=tf.float32)
        loss = compute_permute_loss(tensor3, tensor2)
        self.assertAllClose(loss, 3.0 * (1.0 + 4.0) / 6.0,
                            msg="loss does not match the expected weighted MSE")

if __name__ == "__main__":
    tf.test.main()
# [ RUN ] Test.test_permutation_loss
# [ OK ] Test.test_permutation_loss
However, I'm afraid the permutation in the loss could backfire and impair TensorFlow's backpropagation. Has somebody already faced a similar problem? Or does anyone have deeper knowledge of TensorFlow's graph building and backpropagation? I would be grateful for every suggestion or input.
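A quick way to check that gradients survive the permutation (a sketch reusing compute_permute_loss from above; tf.argsort only produces integer indices and contributes no gradient itself, while tf.gather routes gradients back to the gathered tensor):

predict = tf.Variable([[29.0, 22, 15, 7, 0, 0, 1, 1, 0]])
truth = tf.constant([[0, 0, 15, 7, 30, 20, 0, 1, 1]], dtype=tf.float32)
with tf.GradientTape() as tape:
    loss = compute_permute_loss(truth, predict)
# if backpropagation works through the permutation, this prints
# finite numbers rather than None
print(tape.gradient(loss, predict))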

Related

Stripping down/modifying Yolov5 for real-time on RPi4

I want to use my Raspberry Pi 4 to detect license plates in real time without any hardware add-ons. I know it doesn't sound very feasible but hear me out.
I am using two ways of optimizing the network for this purpose: stripping down the neurons of Yolov5 and using ONNX Runtime for inference, which from what I understand is well suited to the RPi architecture. I also apply quantization to my ONNX network. Stripping down the network works well, probably because the task is not very complicated and a relatively simple network is good enough for me. I've reduced the number of weights to 200k and reached about 10 FPS. However, when I reduced the number down to 27k as an experiment, I only got an increase of 4 FPS, up to 14 FPS. I realize that inference time does not depend linearly on the number of weights, but I wonder if there is any way to squeeze more FPS out of it. I don't care about precision as much.
Here's the stripped down model yaml I am using:
# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [16, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [32, 3, 2]],     # 1-P2/4
   [-1, 3, C3, [32]],
   [-1, 1, Conv, [64, 3, 2]],     # 3-P3/8
   [-1, 3, C3, [64]],
   [-1, 1, Conv, [64, 3, 2]],     # 5-P4/16
   [-1, 5, C3, [128]],
   [-1, 1, Conv, [128, 3, 2]],    # 7-P5/32
   [-1, 3, C3, [128]],
   [-1, 1, SPPF, [128, 5]],       # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [32, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],     # cat backbone P4
   [-1, 3, C3, [32, False]],      # 13
   [-1, 1, Conv, [32, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],     # cat backbone P3
   [-1, 3, C3, [64, False]],      # 17 (P3/8-small)
   [-1, 1, Conv, [64, 3, 2]],
   [[-1, 14], 1, Concat, [1]],    # cat head P4
   [-1, 3, C3, [128, False]],     # 20 (P4/16-medium)
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 10], 1, Concat, [1]],    # cat head P5
   [-1, 3, C3, [128, False]],     # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
To quantize my model I use:
from onnxruntime.quantization import quantize_dynamic, QuantType
model_fp32 = 'yolo200k.onnx'
model_quant = 'yolo200k_quant.onnx'
quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)
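Static quantization may be worth a try as well; it is often faster at inference than dynamic quantization for CNNs because activation scales are computed ahead of time. A sketch, assuming onnxruntime's quantize_static API; the input name 'images' and the random calibration batches are placeholders, and real calibration needs representative frames:

from onnxruntime.quantization import quantize_static, CalibrationDataReader
import numpy as np

class DummyReader(CalibrationDataReader):
    # hypothetical calibration reader; feed representative camera
    # frames here instead of random noise
    def __init__(self, input_name, n=16):
        self.batches = iter(
            {input_name: np.random.rand(1, 3, 416, 416).astype('float32')}
            for _ in range(n))

    def get_next(self):
        return next(self.batches, None)

quantize_static(model_fp32, 'yolo200k_static_quant.onnx',
                DummyReader(input_name='images'))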
Here's my inference script:
import cv2
import time
import numpy as np
import onnxruntime as rt

# initialize inference
sess = rt.InferenceSession('yolo200k_quant.onnx', providers=rt.get_available_providers())
input_name = sess.get_inputs()[0].name

cap = cv2.VideoCapture(-1)
print(cap.isOpened())
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

while True:
    start = time.time()
    ret, frame = cap.read()
    # crop to the 416x416 model input and bring it into NCHW order
    image = frame[0:416, 0:416]
    frame_cropped = image
    image = np.expand_dims(image, axis=3)
    image = np.transpose(image / 256, (3, 2, 0, 1))

    start_inference = time.time()
    pred_onnx = sess.run(None, {input_name: image.astype('float32')})
    print("Inference time:" + str(time.time() - start_inference))

    output = pred_onnx[0]
    size = pred_onnx[0].size
    dimensions = pred_onnx[0].shape[2]
    rows = size / dimensions
    confidenceIndex = 4
    labelStartIndex = 5
    modelWidth = 416.0
    modelHeight = 416.0
    xGain = modelWidth / image.shape[2]
    yGain = modelHeight / image.shape[3]

    locations = []
    for i in range(int(rows)):
        if output[0][i][confidenceIndex] <= 0.4:
            continue
        for j in range(labelStartIndex, dimensions):
            output[0][i][j] = output[0][i][j] * output[0][i][confidenceIndex]
        for k in range(labelStartIndex, dimensions):
            if output[0][i][k] <= 0.5:
                continue
            # use a fresh list per detection so the entries of
            # `locations` don't all point to the same object
            location = [0] * 4
            location[0] = (output[0][i][0] - output[0][i][2] / 2) / xGain  # top left x
            location[1] = (output[0][i][1] - output[0][i][3] / 2) / yGain  # top left y
            location[2] = (output[0][i][0] + output[0][i][2] / 2) / xGain  # bottom right x
            location[3] = (output[0][i][1] + output[0][i][3] / 2) / yGain  # bottom right y
            locations.append(location)
            cv2.rectangle(frame_cropped, (int(location[0]), int(location[1])),
                          (int(location[2]), int(location[3])), (10, 255, 0), 2)

    end = time.time()
    fps = 1 / (end - start)
    cv2.imshow('frame', frame_cropped)
    print(fps)

    key = cv2.waitKey(1)
    if key == ord("q"):
        break

cap.release()
I use a very inefficient way of cropping the frame for inference and later finding and drawing the bounding boxes; however, I judge the FPS from the inference time alone, which is independent of those steps. The actual FPS is lower and will require further optimization. Right now I want to decrease the inference time itself so that I can theoretically achieve real time. Are there any techniques to achieve that other than the ones I've already discussed, or is there a way to use them more effectively than I do? I am wondering in particular if there is a way to use ONNX more efficiently.
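One ONNX-side knob that sometimes helps on the Pi (a sketch, assuming onnxruntime's standard SessionOptions API; not a guaranteed speed-up) is to pin the thread count to the four cores and enable all graph optimizations:

import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.intra_op_num_threads = 4  # the RPi4 has 4 cores
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = rt.InferenceSession('yolo200k_quant.onnx', sess_options,
                           providers=rt.get_available_providers())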

Numerical errors in Keras vs Numpy

In order to really understand convolutional layers, I have reimplemented the forward method of a single Keras Conv2D layer in basic numpy. The outputs of both seem almost identical, but there are some minor differences.
Getting the keras output:
inp = K.constant(test_x)
true_output = model.layers[0].call(inp).numpy()
My output:
def relu(x):
    return np.maximum(0, x)

def forward(inp, filter_weights, filter_biases):
    # inp: (1, 64, 64, 1), zero-padded to 66x66 for 'same' padding
    result = np.zeros((1, 64, 64, 32))
    inp_with_padding = np.zeros((1, 66, 66, 1))
    inp_with_padding[0, 1:65, 1:65, :] = inp
    for filter_num in range(32):
        single_filter_weights = filter_weights[:, :, 0, filter_num]
        for i in range(64):
            for j in range(64):
                prod = single_filter_weights * inp_with_padding[0, i:i+3, j:j+3, 0]
                filter_sum = np.sum(prod) + filter_biases[filter_num]
                result[0, i, j, filter_num] = relu(filter_sum)
    return result

my_output = forward(test_x, filter_weights, biases_weights)
The results are largely the same, but here are some examples of differences:
Mine: 2.6608338356018066
Keras: 2.660834312438965
Mine: 1.7892705202102661
Keras: 1.7892701625823975
Mine: 0.007190803997218609
Keras: 0.007190565578639507
Mine: 4.970898151397705
Keras: 4.970897197723389
I've tried converting everything to float32, but that does not solve it. Any ideas?
Edit:
I plotted the distribution of the errors, and it might give some insight into what is happening. The errors cluster into four groups of very similar values; however, they are not exactly these four values, but almost all unique values around these four peaks.
I am very interested in how to get my implementation to exactly match the Keras one. Unfortunately, the errors seem to grow exponentially when stacking multiple layers. Any insight would help me out a lot!
Given how small the differences are, I would say that they are rounding errors.
I recommend using np.isclose (or math.isclose) to check whether floats are "equal".
Floating-point operations are not associative, so the order in which they are carried out changes the result. Here is an example:
In [19]: 1.2 - 1.0 - 0.2
Out[19]: -5.551115123125783e-17
In [21]: 1.2 - 0.2 - 1.0
Out[21]: 0.0
So if you want completely identical results, it is not enough to do the same computations analytically: you also need to do them in exactly the same order, with the same datatypes and the same rounding behavior.
To debug this, start with the Keras code and change it line by line towards your code until you see a difference.
The first thing to check is the padding: you seem to be using padding='same' in your implementation. If the Keras layer uses any other type of padding, including the default padding='valid', there will be a difference.
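One quick way to confirm which padding the layer uses (assuming model is the Keras model from the question):

print(model.layers[0].get_config()['padding'])  # 'same' or 'valid'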
Another possibility is that you may be accumulating errors because of the triple loop of little sums. You could do the whole computation at once and see whether the result differs. Compare this implementation with your own, for instance:
def forward2(inp, filter_weights, filter_biases):
    # inp: (batch, 64, 64, in)
    # w:   (3, 3, in, out)
    # b:   (out,)
    padded_input = np.pad(inp, ((0,0), (1,1), (1,1), (0,0)))  # (batch, 66, 66, in)
    stacked_input = np.stack([
        padded_input[:, :-2],
        padded_input[:, 1:-1],
        padded_input[:, 2: ]], axis=1)  # (batch, 3, 64, 64, in)
    stacked_input = np.stack([
        stacked_input[:, :, :, :-2],
        stacked_input[:, :, :, 1:-1],
        stacked_input[:, :, :, 2: ]], axis=2)  # (batch, 3, 3, 64, 64, in)
    stacked_input = stacked_input.reshape((-1, 3, 3, 64, 64, 1, 1))
    w = filter_weights.reshape((1, 3, 3, 1, 1, 1, 32))
    b = filter_biases.reshape((1, 1, 1, 32))
    result = stacked_input * w            # (-1, 3, 3, 64, 64, 1, 32)
    result = result.sum(axis=(1, 2, -2))  # (-1, 64, 64, 32)
    result += b
    result = relu(result)
    return result
A third possibility is to check whether you're using a GPU, and to switch everything to the CPU for the test. Some GPU algorithms are even non-deterministic.
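For example (a sketch for TensorFlow 2.x; it must run before any op touches the GPU):

import tensorflow as tf

# hide all GPUs so every op runs on the CPU for the comparison
tf.config.set_visible_devices([], 'GPU')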
For any kernel size:
def forward3(inp, filter_weights, filter_biases):
    inShape = inp.shape            # (batch, imgX, imgY, ins)
    wShape = filter_weights.shape  # (wx, wy, ins, out)
    bShape = filter_biases.shape   # (out,)
    ins = inShape[-1]
    out = wShape[-1]
    wx = wShape[0]
    wy = wShape[1]
    imgX = inShape[1]
    imgY = inShape[2]
    assert imgX >= wx
    assert imgY >= wy
    assert inShape[-1] == wShape[-2]
    assert bShape[-1] == wShape[-1]

    # you may need to invert this padding, exchange L with R
    loseX = wx - 1
    padXL = loseX // 2
    padXR = padXL + (1 if loseX % 2 > 0 else 0)
    loseY = wy - 1
    padYL = loseY // 2
    padYR = padYL + (1 if loseY % 2 > 0 else 0)

    padded_input = np.pad(inp, ((0,0), (padXL,padXR), (padYL,padYR), (0,0)))
    # (batch, paddedX, paddedY, in)
    stacked_input = np.stack([padded_input[:, i:imgX + i] for i in range(wx)],
                             axis=1)  # (batch, wx, imgX, imgY, in)
    stacked_input = np.stack([stacked_input[:, :, :, i:imgY + i] for i in range(wy)],
                             axis=2)  # (batch, wx, wy, imgX, imgY, in)
    stacked_input = stacked_input.reshape((-1, wx, wy, imgX, imgY, ins, 1))
    w = filter_weights.reshape((1, wx, wy, 1, 1, ins, out))
    b = filter_biases.reshape((1, 1, 1, out))

    result = stacked_input * w
    result = result.sum(axis=(1, 2, -2))
    result += b
    result = relu(result)
    return result

Extract sub arrays based on kernel in numpy

I would like to know if there is an efficient method to get sub-arrays from a larger numpy array.
What I have is an application of np.where: I iterate 'manually' over x and y offsets and apply np.where with a kernel to each rectangle extracted from the larger array with the proper dimensions.
But is there a more direct approach in numpy's collection of methods?
import numpy as np

example = np.arange(20).reshape((5, 4))

# e.g. a cross kernel
a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
np.where(a_kernel, example[1:4, 1:4], 0)
# returns
# array([[ 0,  6,  0],
#        [ 9, 10, 11],
#        [ 0, 14,  0]])

def arrays_from_kernel(a, a_kernel):
    width, height = a_kernel.shape
    y_max, x_max = a.shape
    return [np.where(a_kernel, a[y:(y + height), x:(x + width)], 0)
            for y in range(y_max - height + 1)
            for x in range(x_max - width + 1)]

sub_arrays = arrays_from_kernel(example, a_kernel)
This returns the arrays I need for further processing.
# [array([[0, 1, 0],
#         [4, 5, 6],
#         [0, 9, 0]]),
#  array([[ 0,  2,  0],
#         [ 5,  6,  7],
#         [ 0, 10,  0]]),
#  ...
#  array([[ 0,  9,  0],
#         [12, 13, 14],
#         [ 0, 17,  0]]),
#  array([[ 0, 10,  0],
#         [13, 14, 15],
#         [ 0, 18,  0]])]
The context: similar to 2D convolution, I would like to apply a custom function to each of the subarrays (e.g. the product of squared numbers).
At the moment, you're manually advancing a sliding window over the data - stride tricks to the rescue! (And no, I didn't just make that up - there's actually a submodule called stride_tricks in numpy!) Instead of manually building windows into the data and calling np.where() on them, if you had the windows in an array, you could call np.where() just once. Stride tricks allow you to create such an array without even having to copy the data.
Let me explain. Normal slices in numpy create views into the original data instead of copies. This is done by referring to the original data but changing the strides used to access it (i.e. how much to jump between two elements or two rows, and so on). Stride tricks allow you to modify those strides more freely than just slicing and reshaping does, so you can e.g. iterate over the same data more than once, which is useful here.
Let me demonstrate:
import numpy as np

example = np.arange(20).reshape((5, 4))
a_kernel = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

def sliding_window(data, win_shape, **kwargs):
    assert data.ndim == len(win_shape)
    shape = tuple(dn - wn + 1 for dn, wn in zip(data.shape, win_shape)) + win_shape
    strides = data.strides * 2
    return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides, **kwargs)

def arrays_from_kernel(a, a_kernel):
    windows = sliding_window(a, a_kernel.shape)
    return np.where(a_kernel, windows, 0)

sub_arrays = arrays_from_kernel(example, a_kernel)
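On newer NumPy (1.20 and later) the same windowing is available out of the box as sliding_window_view, which avoids computing strides by hand (an alternative to the helper above, not part of the original answer):

from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(example, a_kernel.shape)  # shape (3, 2, 3, 3)
sub_arrays = np.where(a_kernel, windows, 0)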
The scipy.ndimage module offers a number of filters, one of which might meet your needs. If none of those filters do what you want, you can use ndimage.generic_filter to call a custom function on each subarray. ndimage.generic_filter is not as fast as the other ndimage filters, however.
For example:
import numpy as np
import scipy.ndimage as ndimage

example = np.arange(20).reshape((5, 4))
a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

def func(x):
    # reject subarrays that extend beyond the border of the `example` array
    if not np.isnan(x).any():
        y = np.zeros_like(a_kernel, dtype=example.dtype)
        np.put(y, np.flatnonzero(a_kernel), x)
        print(y)
    # Instead of returning 0, you can perform your desired computation on the subarray here.
    # Note that you may not need the 2D array y; often, you only need the values in the 1D array x.
    return 0

result = ndimage.generic_filter(example, func, footprint=a_kernel, mode='constant', cval=np.nan)
For the particular problem of computing the product of squares for each subarray, you could convert the product into a sum by taking advantage of the fact that A * B = exp(log(A) + log(B)). This allows you to express the computation as a normal convolution, and using ndimage.convolve can then improve performance a lot. The amount of the improvement depends on the size of example:
import numpy as np
import scipy.ndimage as ndimage
import perfplot

a_kernel = np.asarray([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

def orig(example, a_kernel=a_kernel):
    def arrays_from_kernel(a, a_kernel):
        width, height = a_kernel.shape
        y_max, x_max = a.shape
        return [
            np.where(a_kernel, a[y : (y + height), x : (x + width)], 1)
            for y in range(y_max - height + 1)
            for x in range(x_max - width + 1)
        ]

    return [np.prod(x) ** 2 for x in arrays_from_kernel(example, a_kernel)]

def alt(example, a_kernel=a_kernel):
    logged = np.log(example)
    result = ndimage.convolve(logged, a_kernel, mode="constant", cval=0)[1:-1, 1:-1]
    return (np.exp(result) ** 2).ravel()

def make_example(N):
    return np.random.random(size=(N, N))

def check(A, B):
    return np.allclose(A, B)

perfplot.show(
    setup=make_example,
    kernels=[orig, alt],
    n_range=[2 ** k for k in range(2, 11)],
    logx=True,
    logy=True,
    xlabel="len(example)",
    equality_check=check,
)

How to reduce usage of memory?

This is a sample of my code:
def normalize_3D(input):
    for i in range(input.shape[0]):
        s = tf.concat([tf.reshape(input[i, 9, 0], shape=[1, 1]),
                       tf.reshape(input[i, 9, 1], shape=[1, 1]),
                       tf.reshape(input[i, 9, 2], shape=[1, 1])], axis=1)
        output = input[i, :, :] - s
        output2 = output / tf.sqrt(tf.square(input[i, 9, 0] - input[i, 0, 0]) +
                                   tf.square(input[i, 9, 1] - input[i, 0, 1]) +
                                   tf.square(input[i, 9, 2] - input[i, 0, 2]))
        output2 = tf.reshape(output2, [1, input.shape[1], input.shape[2]])
        if i == 0:
            output3 = output2
        else:
            output3 = tf.concat([output3, output2], axis=0)
    return output3
As in this sample, I used 'for' statements many times to process data that has just a few batches.
However, while writing my code, I noticed that it uses a lot of memory and an error message appeared. Some of my predictions just show 'nan', and after that the program gets stuck.
Is there any way to reduce this kind of memory abuse when I process batched data?
Your function can be expressed in a simpler and more efficient way like this:
import tensorflow as tf

def normalize_3D(input):
    shift = input[:, 9]
    scale = tf.norm(input[:, 9] - input[:, 0], axis=1, keepdims=True)
    output = (input - tf.expand_dims(shift, 1)) / tf.expand_dims(scale, 1)
    return output
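A quick equivalence check (a sketch; it assumes the original loop-based version has been renamed to normalize_3D_loop and that the input has shape (batch, points, 3) with point 9 as the reference):

x = tf.random.uniform([4, 20, 3])
diff = tf.reduce_max(tf.abs(normalize_3D(x) - normalize_3D_loop(x)))
print(diff.numpy())  # should be close to 0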

Unable to print variables in Python when using def function

I am trying to implement a simple neural net. I want to print the initial pattern, weights, and activation. I then want it to print the learning process (i.e. every pattern it goes through as it learns). I am as yet unable to do this: it returns the initial and final pattern (when I put print p in appropriate places), but nothing else. Hints and tips appreciated - I'm a complete newbie to Python!
#!/usr/bin/python
import random

p = [[1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1],
     [0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]  # pattern I want the net to learn

n = 5
alpha = 0.01
activation = []  # unit activations
weights = []     # weights
output = []      # output

def initWeights(n):  # set weights to zero, n is the number of units
    global weights
    weights = [[[0]*n]*n]  # initialised to zero

def initNetwork(p):  # initialises units to activation
    global activation
    activation = p

def updateNetwork(k):  # pick unit at random and update k times
    for l in range(k):
        unit = random.randint(0, n-1)
        activation[unit] = 0
        for i in range(n):
            activation[unit] += output[i] * weights[unit][i]
        output[unit] = 1 if activation[unit] > 0 else -1

def learn(p):
    for i in range(n):
        for j in range(n):
            weights += alpha * p[i] * p[j]
You have a problem with the line:
weights = [[[0]*n]*n]
When you use *, you multiply object references, so you are reusing the same n-element list of zeroes every time. This will cause:
>>> weights[0][1][0] = 8
>>> weights
[[[8, 0, 0], [8, 0, 0], [8, 0, 0]]]
The first item of all the sublists is 8, because they are one and the same list. You stored the same reference multiple times, so modifying the n-th item of any of them alters all of them.
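A common fix (a sketch, not from the original post) is to build each row independently with a list comprehension so that no reference is shared:

n = 3
weights = [[[0] * n for _ in range(n)]]
weights[0][1][0] = 8
print(weights)  # [[[0, 0, 0], [8, 0, 0], [0, 0, 0]]]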
This line is where you get "IndexError: list index out of range":
output[unit] = 1 if activation[unit] > 0 else -1
because output = [] is empty; you should fill it first, e.g. with output.append(...), or initialise it with n elements.
