I'm trying to use the eval_mus_track function of the museval package to evaluate my audio source separation model. The model I'm evaluating was trained to predict vocals, and its predictions look similar to the actual vocals, yet evaluation metrics such as SDR come out negative.
Below is my function for generating the metrics:
def estimate_and_evaluate(track):
    # track.audio is stereo, therefore we predict each channel separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file, vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file, acompaniment_predicted_channel_2.wav_file])).T

    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)
The metric values I get are:
vocals ==> SDR: -3.776 SIR: 4.621 ISR: -0.005 SAR: -30.538
accompaniment ==> SDR: -0.590 SIR: 1.704 ISR: -0.006 SAR: -16.613
The above result doesn't make sense to me: first of all, the accompaniment prediction is pure noise (the model was trained only for vocals), yet it gets a higher SDR. Second, the predicted vocals have a waveform very similar to the actual ones but still get a negative SDR value!
In the following plots (one pair per channel), the top waveform is the actual source and the bottom one is the predicted source:
[Waveform plots for Channel 1 and Channel 2]
I tried to shift the predicted vocals as mentioned here but the result got worse.
Any idea what's causing this issue?
This is the link to the actual vocals stereo numpy array, and this one to the predicted stereo vocals numpy array. You can load and inspect them with np.load.
Thanks for your time
The signal-to-distortion ratio (SDR) is actually the logarithm of a ratio. See equation (12) of this article:
https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf
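For reference, the BSS Eval SDR has the following form, where s_target is the part of the estimate explained by the true source and the e terms are the interference, noise and artifact error components (this is the standard BSS Eval definition; see the linked paper for the exact notation used in its equation (12)):

$$\mathrm{SDR} := 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \rVert^2}$$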
So, an SDR of 0 means that the signal and the distortion have equal power. An SDR below 0 means there is more distortion than signal. If the audio doesn't sound like there is more distortion than signal, the cause is often a sample-alignment problem.
When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment between the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even by listening, whether the samples are misaligned. A zoomed-in plot where you can see individual samples can help you confirm that the ground truth and the prediction line up exactly. If the prediction is shifted by even a single sample, the computed SDR will not reflect the actual separation quality.
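As a rough way to check this, you could cross-correlate the prediction against the ground truth and look at where the peak lands (a sketch, not part of museval; the file names are placeholders for the arrays you linked, and only one channel is used):

```python
import numpy as np
from scipy.signal import correlate

# Placeholder file names for the arrays linked in the question.
reference = np.load("actual_vocals.npy")[:, 0]    # ground-truth vocals, channel 1
estimate = np.load("predicted_vocals.npy")[:, 0]  # predicted vocals, channel 1

# The cross-correlation peak gives the lag (in samples) of the estimate vs. the reference.
corr = correlate(estimate, reference, mode="full", method="fft")
lag = int(np.argmax(corr)) - (len(reference) - 1)
print("estimated lag:", lag, "samples")

# A non-zero lag suggests the estimate should be shifted before evaluation
# (np.roll wraps around, so trim the edges or pad with zeros for a proper fix).
aligned = np.roll(estimate, -lag)
```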
I am trying to apply a simple optimization using gradient descent. In particular, I want to calculate the vector of parameters (theta) that minimizes the cost function (mean squared error).
The gradient descent function looks like this:
import numpy as np

eta = 0.1           # learning rate
n_iterations = 1000
m = 100             # number of training samples

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # partial derivatives of the MSE cost function
    theta = theta - eta * gradients
Where X_b and y are respectively the input matrix and the target vector.
Now, if I look at the final theta, it is always equal to [[nan], [nan]], while it should be [[85.4575313], [0.11802224]] (obtained with both np.linalg and scikit-learn's LinearRegression). To get a numeric result at all, I have to reduce the learning rate to 0.00001 and the number of iterations to 500, but with these changes the results end up far away from the real theta.
My data, both X_b and y, are scaled using a StandardScaler.
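For context, the closed-form reference values I mention can be computed along these lines (a standalone sketch on synthetic data, not my actual dataset):

```python
import numpy as np

# Normal-equation reference for linear regression on synthetic data
# (stand-ins for my real X_b and y).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]                 # add the bias column

theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best)                                 # close to [[4.], [3.]]
```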
If I try to print out theta at each iteration, I get the following (these are only a few of the results):
...
[[2.09755838e+297]
[7.26731496e+299]]
[[-3.54990719e+300]
[-1.22992017e+303]]
[[6.00786188e+303]
[ inf]]
[[-inf]
[ nan]]
...
How can I solve this problem? Is it because of the domain of the function?
Thanks
I've found an error in the code. For the benefit of all readers: the error came from the feature-scaling step, which isn't shown in the code above.
The initial theta (randomly assigned) had a completely different scale from the dataset, and this made it impossible to find valid parameters for the regression.
So, using correctly scaled inputs and targets, the function does its job and converges to the values that I know are correct, as reported in my question.
As Kuedsha suggested, I also tried applying a learning schedule to reduce the learning rate at each iteration, even though it is not necessary in this specific case. It works, but of course it takes more iterations to converge. I think this could potentially be useful in a stochastic gradient descent algorithm.
Thanks for your support
In my personal experience, this is probably due to the learning rate you are using. If your result blows up to infinity, the learning rate is likely too large. Also, be sure to decrease the learning rate (eta in your code) at each iteration, as this helps make sure that your solution converges. I am not sure what the optimal way to do it would be for your particular problem, but you could try something like:
eta=initial_eta/(iteration+1)
or
eta=initial_eta/sqrt(iteration+1)
Edit: in fact, as you can see in your results, the value of your parameters flips between negative and positive on each iteration while always growing in modulus.
I think this is because, in the first iteration, eta * gradient is so large that the update overshoots to a value of the opposite sign with a larger modulus. In the second iteration the gradient is even larger, so eta * gradient is also larger, which again gives a value of the opposite sign with a still larger modulus. This continues until you reach infinity.
This is the reason why you normally have to be careful when tuning the learning rate and why you decrease it over the iterations.
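For concreteness, here is a minimal sketch of the same kind of loop with a decaying learning rate, on synthetic data standing in for X_b and y:

```python
import numpy as np

# Batch gradient descent with a decaying learning rate (synthetic data).
rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

initial_eta = 0.3
theta = rng.standard_normal((2, 1))
for iteration in range(1000):
    eta = initial_eta / np.sqrt(iteration + 1)        # decaying schedule
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)  # approaches [[4.], [3.]]; slower than a well-chosen constant step, as noted above
```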
I'm training an autoencoder using tensorflow, and the starter code provides a way to calculate mean squared error as the loss function.
self.mse_loss = tf.reduce_mean(tf.square(self.x - self.x_))
Note here, self.x is the tensor containing the input data (MNIST, with 784 features) and self.x_ is the reconstruction produced by the decoder on the other side of the network.
I wanted to use MSE to find some optimum values for input parameters (namely the number of clusters to find in this unsupervised problem I'm working on), but MSE doesn't differentiate enough between the different runs to attempt the elbow method. Instead I thought I could try a different metric, like Mean Squared Logarithmic Error. The formula for this metric can be found HERE.
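For reference, the usual definition of MSLE is:

$$\mathrm{MSLE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( \log(1 + y_i) - \log(1 + \hat{y}_i) \bigr)^2$$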
Initially I tried the following code:
self.msle_loss = tf.reduce_mean(tf.square(tf.log(1 + self.x) - tf.log(1 + self.x_ )))
However, whenever I run this it returns nan. I think this is something to do with tf.log() being unable to deal with zeros.
So here are some solutions I've tried that produce values (I'm just not sure which is best):
Use tf.clip_by_value()
self.msle_loss = tf.reduce_mean(tf.square(tf.math.log(tf.clip_by_value(1 + self.x,1e-10,1e10)) - tf.math.log(tf.clip_by_value(1 + self.x_,1e-10,1e10))))
This will run and return values, but I don't think it is correct because they are pretty large, ca. 240
Adding a small constant
self.msle_loss = tf.reduce_mean(tf.square(tf.log(1 + (self.x + 1e-4)) - tf.log(1 + (self.x_ + 1e-4))))
This produces valid values that are smaller than those from solution 1), ca. 12 (so an order of magnitude smaller). This made me worry that the two methods are not interchangeable, which begs the question: which is the correct method here? Note that when I initially came across the suggestion to add a small constant, the suggestion was to add one that was much smaller (1e-10), but I kept getting nan until I made the constant as large as 1e-4.
Use tf.where()
I found a solution that aims to catch zeros.
self.msle_loss = tf.reduce_mean(tf.square(tf.log(1. + tf.where(tf.equal(self.x, 0.), tf.ones_like(self.x), self.x)) - tf.log(1.0 + tf.where(tf.equal(self.x_, 0.), tf.ones_like(self.x_), self.x_))))
However I don't think I'm implementing it correctly because I still get nan with this method.
If anyone is able to suggest the best way of doing this without biasing the values I'm getting, I'd really appreciate it. Thanks.
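To compare the candidates on an equal footing, a standalone MSLE helper like the following could be used (a sketch built on tf.math.log1p; the tf.maximum clamp assumes the nans come from reconstruction values below -1, which I haven't confirmed):

```python
import tensorflow as tf

def msle(y_true, y_pred):
    # Clamp predictions so log1p never sees values at or below -1
    # (assumes the nan comes from negative reconstructions, not zeros).
    y_pred = tf.maximum(y_pred, 0.0)
    return tf.reduce_mean(tf.square(tf.math.log1p(y_true) - tf.math.log1p(y_pred)))

# Toy usage with values in [0, 1], like normalized MNIST pixels.
x = tf.constant([[0.0, 0.5, 1.0]])
x_hat = tf.constant([[0.1, 0.4, -0.2]])
print(msle(x, x_hat))
```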
I am trying to fit a weighted linear SVC to the "noisy circles" dataset. For some reason, the weighted version finds a decision function that is very very very bad. Yet, libsvm reports that the fit was successful. My weights are not totally strange, so I'm not sure why the algorithm fails. Worse, I'm not sure how to predict under what circumstances the algorithm will fail, or what to do about it.
Here is the offending code:
import numpy as np
import sklearn.datasets
import sklearn.svm
## GET THE NOISY CIRCLES DATASET
n = 200
noise=0.04
factor = 0.3
SEED = 1
np.random.seed(SEED)
noisy_circles, c = sklearn.datasets.make_circles(n_samples=n, factor=factor, noise=noise)
## HARDCODED WEIGHTS 4 STACKOVERFLOW
weights = np.array([0.93301464, 0.92261151, 0.93367401, 0.38632274, 0.35437395,
0.43346701, 1.09297683, 1.19747184, 0.96349809, 0.32426173,
0.29397037, 1.03628304, 1.05908521, 1.10653401, 0.37677232,
0.35153446, 0.24747971, 0.90887151, 0.24463193, 0.85877582,
0.89405636, 1.03921294, 0.87729103, 1.1589434 , 0.93196245,
0.22982046, 0.82391095, 0.95794411, 0.39876209, 0.96383222,
0.91290011, 0.24322639, 0.41364025, 0.32605574, 0.3712862 ,
1.13075687, 0.33799184, 0.94422961, 0.96021123, 0.29392899,
0.40880845, 0.37780868, 0.4861022 , 1.06077845, 0.89866461,
1.07030338, 0.34269111, 0.86699042, 0.39481626, 0.33021158,
1.17056528, 0.24180542, 0.2446189 , 0.87293221, 0.91510412,
0.32998597, 0.37407169, 0.41486528, 0.42505555, 0.20065111,
0.38846804, 0.92251402, 0.99049091, 0.90580681, 0.97491595,
1.08819797, 0.26700098, 0.42487132, 0.93167479, 1.02463133,
0.89980578, 1.1096191 , 0.37254448, 0.2359968 , 0.28334117,
0.33311215, 1.08758973, 0.32901317, 1.13315268, 0.29888742,
0.14581565, 1.07038078, 1.03316864, 0.35451779, 0.45098287,
1.12772454, 1.08896868, 0.28236812, 0.46117373, 0.83258909,
1.174982 , 0.89901124, 0.12965322, 0.41543288, 0.17358532,
0.45842307, 0.42685333, 0.42375945, 0.210712 , 0.377017 ,
1.03517938, 0.9891231 , 1.07126936, 0.19820075, 1.1002386 ,
0.93338903, 1.1061464 , 0.20301447, 1.08130118, 0.34030289,
1.16104716, 0.15868522, 1.07481773, 0.94876721, 0.93468891,
0.3231601 , 1.04994012, 0.32166893, 0.90920628, 0.90999114,
1.03839278, 1.14232502, 0.18056755, 0.2639544 , 0.16631772,
1.10689008, 0.36852231, 0.20091628, 0.28666013, 1.05392917,
0.91207713, 1.13049957, 0.40367044, 0.33333911, 0.3380625 ,
1.0615807 , 0.30797683, 1.08206638, 0.39374589, 0.40647774,
0.23565583, 0.22030266, 0.33806818, 0.44739648, 0.94079254,
1.03878309, 0.84132066, 0.2772951 , 0.40448219, 1.14960352,
0.89091529, 0.97398981, 1.00992373, 0.87505294, 0.98439767,
1.13634672, 0.2694606 , 0.89735526, 0.21407159, 0.31951442,
0.37647624, 0.90387395, 0.36897273, 0.32483939, 0.42423936,
1.14167808, 0.88631001, 0.34304598, 1.12320881, 0.91640671,
1.0111603 , 0.8649317 , 0.97180267, 1.17381377, 0.4581278 ,
0.15286761, 1.14522941, 1.17181889, 1.02299728, 0.91620512,
0.18773065, 0.2600077 , 0.23665254, 0.20477831, 0.16430318,
0.38680433, 1.0352136 , 0.31850732, 1.02505276, 0.24500125,
1.01564276, 0.20866012, 0.2194238 , 0.37527691, 1.05327402,
0.18154061, 0.25013442, 0.99024356, 0.15072547, 0.87641354])
## MODEL SETUP AND TRAINING
model = sklearn.svm.SVC(C=30.,kernel="linear")
model.fit(noisy_circles, c, sample_weight=weights)
print(model.coef_, model.intercept_, model.fit_status_)
Note that fit_status_ reports success. However, the fitted model parameters are total nonsense. To see this, here is the plot of the data (with the size of each dot scaled by the weight of the point):
Here is the fitted line along the same range in x:
Whatever is happening here seems to be driving the decision surface off to infinity. At first I thought that such a large C was simply overpowering the part of the SVM that was trying to learn anything, but reducing C to 0.0001 does not change anything.
What is going on with the algorithm that produces this counter-intuitive behavior? Under what circumstances should I expect the algorithm to fail in this way?
UPDATE: The nightly build of sklearn supports sample weights for LinearSVC. Switching over to LinearSVC, I am witnessing the same behavior when the loss is set to "hinge", but not for this particular set of weights. This causes me to suspect that there is some kind of ill-conditioning in the problem somewhere. I'm still not sure exactly what is happening, but possibly this sheds some light on the problem.
The problem doesn't lie in sample_weight or C; it lies in the linear nature of the kernel. You are trying to learn a non-linear (circular, in this case) decision boundary using a function that simply cannot express anything but a linear decision boundary. This applies to both SVC(kernel="linear") and LinearSVC. In my experiments, simply switching to a non-linear kernel like rbf completely solved it.
All SVMs in fact learn a linear boundary. So why does something like rbf perform well? The answer lies in something called the "kernel trick". Put simply, rbf transforms the dataset (technically, it projects it into a higher-dimensional space) so that linearly separating the classes in that transformed space results in a non-linear boundary in our original space. Here is a more detailed explanation of it.
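A minimal sketch of that swap, reusing the data and sample weights exactly as defined in the question (gamma is left at its library default):

```python
import sklearn.svm

# Same data (noisy_circles, c) and sample weights as in the question;
# only the kernel changes from "linear" to "rbf".
model = sklearn.svm.SVC(C=30.0, kernel="rbf")  # gamma defaults to "scale" in recent sklearn
model.fit(noisy_circles, c, sample_weight=weights)
print(model.score(noisy_circles, c))           # training accuracy should be close to 1.0 here
```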
Update: As for how the weights contribute to the failure with the linear kernel, the answer most likely lies in the fact that the average weight assigned to the two classes is imbalanced; in particular, the average weight assigned to class 0 is about 3 times higher than that of class 1. Here are a few observations that point to this conclusion:
The linear kernel learns a "reasonable" boundary (meaning a boundary that lies within the input space of the samples) when the weights are all 1.0 or randomly generated.
It is also reasonable if we balance the class weights by passing class_weight={0: (w[y==1].sum()/w[y==0].sum()), 1: 1} to the constructor, as sketched below.
If we create a weight imbalance in some other way, for example by using uniform sample weights but different class weights, by giving class 1 weights 3 times higher than class 0, or by removing the weights altogether and simply making one class a third as frequent as the other, the problem reappears.
This imbalance seems to push the coefficients towards zero, although the rbf kernel doesn't seem to be affected by it. As for why libsvm doesn't report the failure, that unfortunately I do not know.
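And here is a sketch of the class-weight balancing mentioned above, again reusing the question's variables (with w standing for the sample-weight array and y for the labels c):

```python
import sklearn.svm

# Rescale class 0 so that the average effective weight per class is balanced.
w, y = weights, c
balance = {0: w[y == 1].sum() / w[y == 0].sum(), 1: 1.0}
model = sklearn.svm.SVC(C=30.0, kernel="linear", class_weight=balance)
model.fit(noisy_circles, y, sample_weight=w)
print(model.coef_, model.intercept_)           # now a "reasonable" linear boundary
```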