I'm trying to use the eval_mus_track function of the museval package to evaluate my audio source separation model. The model I'm evaluating was trained to predict vocals, and its predictions look similar to the actual vocals, yet evaluation metrics such as SDR come out negative.
Below is my function for generating the metrics:
def estimate_and_evaluate(track):
    # track.audio is stereo, therefore we predict each channel separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file, vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file, acompaniment_predicted_channel_2.wav_file])).T

    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)
The metric values I get are:
vocals ==> SDR: -3.776 SIR: 4.621 ISR: -0.005 SAR: -30.538
accompaniment ==> SDR: -0.590 SIR: 1.704 ISR: -0.006 SAR: -16.613
The above result doesn't make sense to me: first of all, the accompaniment prediction is pure noise (the model was trained only for vocals), yet it gets a higher SDR. Second, the predicted vocals have a waveform very similar to the actual ones but still get a negative SDR value!
In the following plots (one pair per channel), the top waveform is the actual source and the bottom one is the predicted source:
[Waveform plots for Channel 1 and Channel 2]
I tried to shift the predicted vocals as mentioned here but the result got worse.
Any idea what's causing this issue?
This is the link to the actual vocals stereo numpy array, and this one to the predicted stereo vocals numpy array. You can load and inspect them with np.load.
Thanks for your time
The signal-to-distortion ratio (SDR) is actually the logarithm of a ratio. See equation (12) of this article:
https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf
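For reference, the BSS Eval SDR has the following form, where s_target is the part of the estimate explained by the true source and the e terms are the interference, noise and artifact error components (this is the standard BSS Eval definition; see the linked paper for the exact notation used in its equation (12)):

$$\mathrm{SDR} := 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \rVert^2}$$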
So, an SDR of 0 means that the signal and the distortion have equal power. An SDR below 0 means there is more distortion than signal. If the audio doesn't sound like there is more distortion than signal, the cause is often a sample-alignment problem.
When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment between the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even by listening, whether the samples are misaligned. A zoomed-in plot where you can see individual samples can help you confirm that the ground truth and the prediction line up exactly. If the prediction is shifted by even a single sample, the computed SDR will not reflect the actual separation quality.
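As a rough way to check this, you could cross-correlate the prediction against the ground truth and look at where the peak lands (a sketch, not part of museval; the file names are placeholders for the arrays you linked, and only one channel is used):

```python
import numpy as np
from scipy.signal import correlate

# Placeholder file names for the arrays linked in the question.
reference = np.load("actual_vocals.npy")[:, 0]    # ground-truth vocals, channel 1
estimate = np.load("predicted_vocals.npy")[:, 0]  # predicted vocals, channel 1

# The cross-correlation peak gives the lag (in samples) of the estimate vs. the reference.
corr = correlate(estimate, reference, mode="full", method="fft")
lag = int(np.argmax(corr)) - (len(reference) - 1)
print("estimated lag:", lag, "samples")

# A non-zero lag suggests the estimate should be shifted before evaluation
# (np.roll wraps around, so trim the edges or pad with zeros for a proper fix).
aligned = np.roll(estimate, -lag)
```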
I am trying to apply a simple optimization using gradient descent. In particular, I want to calculate the vector of parameters (theta) that minimizes the cost function (mean squared error).
The gradient descent function looks like this:
import numpy as np

eta = 0.1           # learning rate
n_iterations = 1000
m = 100             # number of training samples

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # partial derivatives of the MSE cost function
    theta = theta - eta * gradients
Where X_b and y are respectively the input matrix and the target vector.
Now, if I look at the final theta, it is always equal to [[nan], [nan]], while it should be [[85.4575313], [0.11802224]] (obtained with both np.linalg and scikit-learn's LinearRegression). To get a numeric result at all, I have to reduce the learning rate to 0.00001 and the number of iterations to 500, but with these changes the results end up far away from the real theta.
My data, both X_b and y, are scaled using a StandardScaler.
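For context, the closed-form reference values I mention can be computed along these lines (a standalone sketch on synthetic data, not my actual dataset):

```python
import numpy as np

# Normal-equation reference for linear regression on synthetic data
# (stand-ins for my real X_b and y).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]                 # add the bias column

theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best)                                 # close to [[4.], [3.]]
```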
If I try to print out theta at each iteration, I get the following (these are only a few of the results):
...
[[2.09755838e+297]
[7.26731496e+299]]
[[-3.54990719e+300]
[-1.22992017e+303]]
[[6.00786188e+303]
[ inf]]
[[-inf]
[ nan]]
...
How can I solve this problem? Is it because of the domain of the function?
Thanks
I've found an error in the code. For the benefit of all readers: the error came from the feature-scaling step, which isn't shown in the code above.
The initial theta (randomly assigned) had a completely different scale from the dataset, and this made it impossible to find valid parameters for the regression.
So, using correctly scaled inputs and targets, the function does its job and converges to the values that I know are correct, as reported in my question.
As Kuedsha suggested, I also tried applying a learning schedule to reduce the learning rate at each iteration, even though it is not necessary in this specific case. It works, but of course it takes more iterations to converge. I think this could potentially be useful in a stochastic gradient descent algorithm.
Thanks for your support
In my personal experience, this is probably due to the learning rate you are using. If your result blows up to infinity, the learning rate is likely too large. Also, be sure to decrease the learning rate (eta in your code) at each iteration, as this helps make sure that your solution converges. I am not sure what the optimal way to do it would be for your particular problem, but you could try something like:
eta=initial_eta/(iteration+1)
or
eta=initial_eta/sqrt(iteration+1)
Edit: in fact, as you can see in your results, the value of your parameters flips between negative and positive on each iteration while always growing in modulus.
I think this is because, in the first iteration, eta * gradient is so large that the update overshoots to a value of the opposite sign with a larger modulus. In the second iteration the gradient is even larger, so eta * gradient is also larger, which again gives a value of the opposite sign with a still larger modulus. This continues until you reach infinity.
This is the reason why you normally have to be careful when tuning the learning rate and why you decrease it over the iterations.
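For concreteness, here is a minimal sketch of the same kind of loop with a decaying learning rate, on synthetic data standing in for X_b and y:

```python
import numpy as np

# Batch gradient descent with a decaying learning rate (synthetic data).
rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

initial_eta = 0.3
theta = rng.standard_normal((2, 1))
for iteration in range(1000):
    eta = initial_eta / np.sqrt(iteration + 1)        # decaying schedule
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)  # approaches [[4.], [3.]]; slower than a well-chosen constant step, as noted above
```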
I'm training an autoencoder using tensorflow, and the starter code provides a way to calculate mean squared error as the loss function.
self.mse_loss = tf.reduce_mean(tf.square(self.x - self.x_))
Note here, self.x is the tensor containing the input data (MNIST, with 784 features) and self.x_ is the reconstruction produced by the decoder on the other side of the network.
I wanted to use MSE to find some optimum values for input parameters (namely the number of clusters to find in this unsupervised problem I'm working on), but MSE doesn't differentiate enough between the different runs to attempt the elbow method. Instead I thought I could try a different metric, like Mean Squared Logarithmic Error. The formula for this metric can be found HERE.
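For reference, the usual definition of MSLE is:

$$\mathrm{MSLE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( \log(1 + y_i) - \log(1 + \hat{y}_i) \bigr)^2$$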
Initially I tried the following code:
self.msle_loss = tf.reduce_mean(tf.square(tf.log(1 + self.x) - tf.log(1 + self.x_ )))
However, whenever I run this it returns nan. I think this is something to do with tf.log() being unable to deal with zeros.
So here are some solutions I've tried that produce values (I'm just not sure which is best):
Use tf.clip_by_value()
self.msle_loss = tf.reduce_mean(tf.square(tf.math.log(tf.clip_by_value(1 + self.x,1e-10,1e10)) - tf.math.log(tf.clip_by_value(1 + self.x_,1e-10,1e10))))
This will run and return values, but I don't think it is correct because they are pretty large, ca. 240
Adding a small constant
self.msle_loss = tf.reduce_mean(tf.square(tf.log(1 + (self.x + 1e-4)) - tf.log(1 + (self.x_ + 1e-4))))
This produces valid values that are smaller than those from solution 1), ca. 12 (so an order of magnitude smaller). This made me worry that the two methods are not interchangeable, which begs the question: which is the correct method here? Note that when I initially came across the suggestion to add a small constant, the suggestion was to add one that was much smaller (1e-10), but I kept getting nan until I made the constant as large as 1e-4.
Use tf.where()
I found a solution that aims to catch zeros.
self.msle_loss = tf.reduce_mean(tf.square(tf.log(1. + tf.where(tf.equal(self.x, 0.), tf.ones_like(self.x), self.x)) - tf.log(1.0 + tf.where(tf.equal(self.x_, 0.), tf.ones_like(self.x_), self.x_))))
However I don't think I'm implementing it correctly because I still get nan with this method.
If anyone is able to suggest the best way of doing this without biasing the values I'm getting, I'd really appreciate it. Thanks.
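To compare the candidates on an equal footing, a standalone MSLE helper like the following could be used (a sketch built on tf.math.log1p; the tf.maximum clamp assumes the nans come from reconstruction values below -1, which I haven't confirmed):

```python
import tensorflow as tf

def msle(y_true, y_pred):
    # Clamp predictions so log1p never sees values at or below -1
    # (assumes the nan comes from negative reconstructions, not zeros).
    y_pred = tf.maximum(y_pred, 0.0)
    return tf.reduce_mean(tf.square(tf.math.log1p(y_true) - tf.math.log1p(y_pred)))

# Toy usage with values in [0, 1], like normalized MNIST pixels.
x = tf.constant([[0.0, 0.5, 1.0]])
x_hat = tf.constant([[0.1, 0.4, -0.2]])
print(msle(x, x_hat))
```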
I am trying to fit a weighted linear SVC to the "noisy circles" dataset. For some reason, the weighted version finds a decision function that is very very very bad. Yet, libsvm reports that the fit was successful. My weights are not totally strange, so I'm not sure why the algorithm fails. Worse, I'm not sure how to predict under what circumstances the algorithm will fail, or what to do about it.
Here is the offending code:
import numpy as np
import sklearn.datasets
import sklearn.svm
## GET THE NOISY CIRCLES DATASET
n = 200
noise=0.04
factor = 0.3
SEED = 1
np.random.seed(SEED)
noisy_circles, c = sklearn.datasets.make_circles(n_samples=n, factor=factor, noise=noise)
## HARDCODED WEIGHTS 4 STACKOVERFLOW
weights = np.array([0.93301464, 0.92261151, 0.93367401, 0.38632274, 0.35437395,
0.43346701, 1.09297683, 1.19747184, 0.96349809, 0.32426173,
0.29397037, 1.03628304, 1.05908521, 1.10653401, 0.37677232,
0.35153446, 0.24747971, 0.90887151, 0.24463193, 0.85877582,
0.89405636, 1.03921294, 0.87729103, 1.1589434 , 0.93196245,
0.22982046, 0.82391095, 0.95794411, 0.39876209, 0.96383222,
0.91290011, 0.24322639, 0.41364025, 0.32605574, 0.3712862 ,
1.13075687, 0.33799184, 0.94422961, 0.96021123, 0.29392899,
0.40880845, 0.37780868, 0.4861022 , 1.06077845, 0.89866461,
1.07030338, 0.34269111, 0.86699042, 0.39481626, 0.33021158,
1.17056528, 0.24180542, 0.2446189 , 0.87293221, 0.91510412,
0.32998597, 0.37407169, 0.41486528, 0.42505555, 0.20065111,
0.38846804, 0.92251402, 0.99049091, 0.90580681, 0.97491595,
1.08819797, 0.26700098, 0.42487132, 0.93167479, 1.02463133,
0.89980578, 1.1096191 , 0.37254448, 0.2359968 , 0.28334117,
0.33311215, 1.08758973, 0.32901317, 1.13315268, 0.29888742,
0.14581565, 1.07038078, 1.03316864, 0.35451779, 0.45098287,
1.12772454, 1.08896868, 0.28236812, 0.46117373, 0.83258909,
1.174982 , 0.89901124, 0.12965322, 0.41543288, 0.17358532,
0.45842307, 0.42685333, 0.42375945, 0.210712 , 0.377017 ,
1.03517938, 0.9891231 , 1.07126936, 0.19820075, 1.1002386 ,
0.93338903, 1.1061464 , 0.20301447, 1.08130118, 0.34030289,
1.16104716, 0.15868522, 1.07481773, 0.94876721, 0.93468891,
0.3231601 , 1.04994012, 0.32166893, 0.90920628, 0.90999114,
1.03839278, 1.14232502, 0.18056755, 0.2639544 , 0.16631772,
1.10689008, 0.36852231, 0.20091628, 0.28666013, 1.05392917,
0.91207713, 1.13049957, 0.40367044, 0.33333911, 0.3380625 ,
1.0615807 , 0.30797683, 1.08206638, 0.39374589, 0.40647774,
0.23565583, 0.22030266, 0.33806818, 0.44739648, 0.94079254,
1.03878309, 0.84132066, 0.2772951 , 0.40448219, 1.14960352,
0.89091529, 0.97398981, 1.00992373, 0.87505294, 0.98439767,
1.13634672, 0.2694606 , 0.89735526, 0.21407159, 0.31951442,
0.37647624, 0.90387395, 0.36897273, 0.32483939, 0.42423936,
1.14167808, 0.88631001, 0.34304598, 1.12320881, 0.91640671,
1.0111603 , 0.8649317 , 0.97180267, 1.17381377, 0.4581278 ,
0.15286761, 1.14522941, 1.17181889, 1.02299728, 0.91620512,
0.18773065, 0.2600077 , 0.23665254, 0.20477831, 0.16430318,
0.38680433, 1.0352136 , 0.31850732, 1.02505276, 0.24500125,
1.01564276, 0.20866012, 0.2194238 , 0.37527691, 1.05327402,
0.18154061, 0.25013442, 0.99024356, 0.15072547, 0.87641354])
## MODEL SETUP AND TRAINING
model = sklearn.svm.SVC(C=30.,kernel="linear")
model.fit(noisy_circles, c, sample_weight=weights)
print(model.coef_, model.intercept_, model.fit_status_)
Note that fit_status_ reports success. However, the fitted model parameters are total nonsense. To see this, here is the plot of the data (with the size of each dot scaled by the weight of the point):
Here is the fitted line along the same range in x:
Whatever is happening here seems to be driving the decision surface off to infinity. At first I thought that such a large C was simply overpowering the part of the SVM that was trying to learn anything, but reducing C to 0.0001 does not change anything.
What is going on with the algorithm that produces this counter-intuitive behavior? Under what circumstances should I expect the algorithm to fail in this way?
UPDATE: The nightly build of sklearn supports sample weights for LinearSVC. Switching over to LinearSVC, I am witnessing the same behavior when the loss is set to "hinge", but not for this particular set of weights. This causes me to suspect that there is some kind of ill-conditioning in the problem somewhere. I'm still not sure exactly what is happening, but possibly this sheds some light on the problem.
The problem doesn't lie in sample_weight or C; it lies in the linear nature of the kernel. You are trying to learn a non-linear (circular, in this case) decision boundary using a function that simply cannot express anything but a linear decision boundary. This applies to both SVC(kernel="linear") and LinearSVC. In my experiments, simply switching to a non-linear kernel like rbf completely solved it.
All SVMs in fact learn a linear boundary. So why does something like rbf perform well? The answer lies in something called the "kernel trick". Put simply, rbf transforms the dataset (technically, it projects it into a higher-dimensional space) so that linearly separating the classes in that transformed space results in a non-linear boundary in our original space. Here is a more detailed explanation of it.
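A minimal sketch of that swap, reusing the data and sample weights exactly as defined in the question (gamma is left at its library default):

```python
import sklearn.svm

# Same data (noisy_circles, c) and sample weights as in the question;
# only the kernel changes from "linear" to "rbf".
model = sklearn.svm.SVC(C=30.0, kernel="rbf")  # gamma defaults to "scale" in recent sklearn
model.fit(noisy_circles, c, sample_weight=weights)
print(model.score(noisy_circles, c))           # training accuracy should be close to 1.0 here
```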
Update: As for how the weights contribute to the failure with the linear kernel, the answer most likely lies in the fact that the average weight assigned to the two classes is imbalanced; in particular, the average weight assigned to class 0 is about 3 times higher than that of class 1. Here are a few observations that point to this conclusion:
The linear kernel learns a "reasonable" boundary (meaning a boundary that lies within the input space of the samples) when the weights are all 1.0 or randomly generated.
It is also reasonable if we balance the class weights by passing class_weight={0: (w[y==1].sum()/w[y==0].sum()), 1: 1} to the constructor, as sketched below.
If we create a weight imbalance in some other way, for example by using uniform sample weights but different class weights, by giving class 1 weights 3 times higher than class 0, or by removing the weights altogether and simply making one class a third as frequent as the other, the problem reappears.
This imbalance seems to push the coefficients towards zero, although the rbf kernel doesn't seem to be affected by it. As for why libsvm doesn't report the failure, that unfortunately I do not know.
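And here is a sketch of the class-weight balancing mentioned above, again reusing the question's variables (with w standing for the sample-weight array and y for the labels c):

```python
import sklearn.svm

# Rescale class 0 so that the average effective weight per class is balanced.
w, y = weights, c
balance = {0: w[y == 1].sum() / w[y == 0].sum(), 1: 1.0}
model = sklearn.svm.SVC(C=30.0, kernel="linear", class_weight=balance)
model.fit(noisy_circles, y, sample_weight=w)
print(model.coef_, model.intercept_)           # now a "reasonable" linear boundary
```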