Negative SDR result for evaluating audio source separation - python

I'm trying to use the eval_mus_track function of the museval package to evaluate my audio source separation model. The model was trained to predict vocals, and its output looks very similar to the actual vocals, yet the evaluation metrics such as SDR come out negative.
Below is my function for generating the metrics:
import numpy as np
import museval

def estimate_and_evaluate(track):
    # track.audio is stereo, therefore we predict each channel separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file, vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file, acompaniment_predicted_channel_2.wav_file])).T

    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)
The metric values I get are:
vocals ==> SDR: -3.776 SIR: 4.621 ISR: -0.005 SAR: -30.538
accompaniment ==> SDR: -0.590 SIR: 1.704 ISR: -0.006 SAR: -16.613
This result doesn't make sense to me. First, the accompaniment prediction is essentially pure noise, since the model was trained for vocals, yet it gets a higher SDR. Second, the predicted vocals look very similar to the actual ones in the waveform plots, yet they still get a negative SDR.
In the following plots, the top waveform is the actual source and the bottom one is the predicted source:
[Waveform plot: Channel 1]
[Waveform plot: Channel 2]
I tried to shift the predicted vocals as mentioned here but the result got worse.
Any idea what's causing this issue?
This is the link to the actual vocals stereo numpy array,
and this one to the predicted stereo vocals numpy array. You can load and inspect them with np.load.
Thanks for your time

The signal-to-distortion ratio (SDR) is actually the logarithm of a ratio. See equation (12) of this article:
https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf
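For reference, the standard BSS Eval definition, which that equation builds on, decomposes the estimated source into a target component plus interference, noise and artifact error terms, and computes

SDR = 10 * log10( ||s_target||^2 / ||e_interf + e_noise + e_artif||^2 )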
So, an SDR of 0 means the signal and the distortion have equal energy, and an SDR below 0 means there is more distortion than signal. If the audio doesn't sound like there is more distortion than signal, the cause is often a sample-alignment problem.
When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment between the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even by listening, whether the samples are misaligned. A zoomed-in plot where you can see each individual sample will help you verify that the ground truth and the prediction are exactly lined up; if the prediction is shifted by even a single sample, the computed SDR will not reflect the actual separation quality.
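As a rough check (a minimal sketch; the .npy file names below are hypothetical stand-ins for the arrays linked above), you can estimate the lag between the ground truth and the prediction with cross-correlation and shift the prediction before evaluating:

import numpy as np

truth = np.load('actual_vocals.npy')[:, 0]     # hypothetical file names; one channel of the stereo arrays
pred = np.load('predicted_vocals.npy')[:, 0]

n = min(len(truth), len(pred))
truth, pred = truth[:n], pred[:n]

# full cross-correlation; for long signals, scipy.signal.correlate(truth, pred, method='fft') is much faster
corr = np.correlate(truth, pred, mode='full')
lag = int(np.argmax(corr)) - (n - 1)           # offset (in samples) that best aligns the two signals
print('estimated lag in samples:', lag)

aligned_pred = np.roll(pred, lag)              # crude alignment; in practice, trim the wrapped-around edge samples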

Related

Correct way of normalizing and scaling the MNIST dataset

I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
import torchvision
import torchvision.transforms as transforms

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True)

print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean: 0.13062754273414612
Scaled Std: 0.30810779333114624
However, this clearly does neither of the above! The resulting data (1) will not be in [0, 1], and (2) will not have mean 0 or std 1. In fact, this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision transforms, the following code can be used to normalize the MNIST dataset.
import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)  # batch size and shuffle chosen arbitrarily to complete the call
Usually, transforms.ToTensor() is used to turn input data in the range [0, 255] into a 3-dimensional tensor, and it automatically scales the data to the range [0, 1]. (This is the scaling-to-[0, 1] step.)
Therefore, the mean and std passed to transforms.Normalize(...) are 0.1307 and 0.3081, i.e. the statistics of the already-scaled data. (This is the zero-mean, unit-standard-deviation step.)
Please refer to the link below for better explanation.
https://pytorch.org/vision/stable/transforms.html
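As a quick sanity check (a minimal sketch; the './data' path is only an assumption), you can verify that ToTensor followed by Normalize((0.1307,), (0.3081,)) produces data with roughly zero mean and unit standard deviation:

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                      # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))  # then subtract the mean and divide by the std of the [0, 1] data
])

trainset = datasets.MNIST('./data', train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(trainset, batch_size=len(trainset))
images, _ = next(iter(loader))

print(images.mean().item(), images.std().item())  # roughly 0.0 and 1.0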
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
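To make the two alternatives concrete, here is a minimal NumPy sketch (the data array is a hypothetical stand-in for the MNIST pixels):

import numpy as np

data = np.random.randint(0, 256, size=(1000, 28, 28)).astype(np.float32)  # hypothetical stand-in for MNIST

# Option 1: range scaling to [0, 1]
scaled = data / 255.0

# Option 2: standardization to mean 0, std 1
standardized = (data - data.mean()) / data.std()

print(scaled.min(), scaled.max())               # ~0.0 and ~1.0
print(standardized.mean(), standardized.std())  # ~0.0 and ~1.0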
Does that clarify it for you, or have I entirely missed your point of confusion?
Purpose
Two of the most important reasons for features scaling are:
You scale features to make them all of the same magnitude (i.e. importance or weight).
Example:
Example: a dataset with two features, Age and Weight, where the ages are in years and the weights in grams. A person in their twenties who weighs 60 kg becomes the vector [20, 60000], and so on for the whole dataset. The Weight attribute will dominate the training process. How much it dominates depends on the algorithm; some are more sensitive than others. For example, in a neural network the gradient-descent updates are affected by the magnitude of the weights, which in turn vary with the magnitude of the inputs (the features) during training, so feature scaling also improves convergence. Another example is the K-Means clustering algorithm, which requires features of comparable magnitude because it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all the matrix multiplications and parameter summations run faster with small numbers than with very large ones (or with the very large numbers produced by multiplying large features by other parameters, etc.).
Types
The most popular types of Feature Scalers can be summarized as follows:
StandardScaler: usually your first option and very commonly used. It standardizes the data (i.e. centers it) so that it has std = 1 and mean = 0. It is affected by outliers and should only be used if your data have a Gaussian-like distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers simply because it uses the range of the data.
RobustScaler: It's "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
MaxAbsScaler: Mainly used for sparse data.
Unit Normalization: It basically scales the vector for each sample to have unit norm, independently of the distribution of the samples.
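As an illustration only (a minimal sketch with a tiny, hypothetical Age/Weight feature matrix), this is how the scalers above are used in scikit-learn:

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   RobustScaler, MaxAbsScaler, Normalizer)

# hypothetical Age / Weight(grams) feature matrix
X = np.array([[20, 60000.0],
              [35, 82000.0],
              [50, 74000.0]])

print(StandardScaler().fit_transform(X))   # zero mean, unit std per column
print(MinMaxScaler().fit_transform(X))     # each column mapped to [0, 1]
print(RobustScaler().fit_transform(X))     # centered on the median, scaled by the IQR
print(MaxAbsScaler().fit_transform(X))     # divided by the maximum absolute value per column
print(Normalizer().fit_transform(X))       # each row scaled to unit norm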
Which One & How Many
You need to get to know your dataset first. As per mentioned above, there are things you need to look at before, such as: the Distribution of the Data, the Existence of Outliers, and the Algorithm being utilized.
Anyhow, you need one scaler per dataset, unless there is a specific requirement, for example an algorithm that only works if the data are both within a certain range and have zero mean and unit standard deviation at the same time. I have never come across such a case, though.
Key Takeaways
There are different types of Feature Scalers that are used based on some rules of thumb mentioned above.
You pick one Scaler based on the requirements, not randomly.
You scale data for a purpose, for example, in the Random Forest Algorithm you do NOT usually need to scale.
Well, the data get scaled to [0, 1] by torchvision.transforms.ToTensor(), and then the normalization with mean 0.1307 and std 0.3081 is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.

YOLO custom loss using keras

I am trying to solve a problem similar to YOLO, except that my boxes are rotated, so I need to add a 6th parameter to account for theta (the rotation). I only have one class. My tensors are 16x16x6.
This implementation does not seem to work: my loss is always the same regardless of network architecture or epoch. After many changes, I am deeply convinced my loss calculation is wrong.
from keras import backend as K  # or: from tensorflow.keras import backend as K

params = {"w_l": [5.0, 0.5]}  # weighting factors for positive / negative cells

def my_loss(y_true, y_pred):
    # channel layout: 0 = prob, 1 = x, 2 = y, 3 = a, 4 = b, 5 = theta
    # earlier theta preprocessing (commented out):
    # m1 = labels[..., 5] > 0
    # m2 = labels[..., 5] < 0
    # labels[m1][..., 5] /= 360
    # labels[m2][..., 5] = (labels[m2][..., 5] + 360) / 360
    lambda_coord = y_true[..., 0] * params["w_l"][0]
    lambda_noobj = -(y_true[..., 0] - 1.0) * params["w_l"][1]
    ly = K.sum(K.square(y_pred[..., 1] - y_true[..., 1]) * lambda_coord)
    lx = K.sum(K.square(y_pred[..., 2] - y_true[..., 2]) * lambda_coord)
    lt = K.sum(K.square(y_pred[..., 5] - y_true[..., 5]) * lambda_coord)
    la = K.sum(K.square(K.sqrt(y_pred[..., 3]) - K.sqrt(y_true[..., 3])) * lambda_coord)
    lb = K.sum(K.square(K.sqrt(y_pred[..., 4]) - K.sqrt(y_true[..., 4])) * lambda_coord)
    lp = K.sum(K.square(y_true[..., 0] - y_pred[..., 0]) * lambda_noobj)
    return ly + lx + lt + la + lb + lp
My label array contains probability/y/x/a/b/theta.
The losses coming from positives (coordinates/size/theta) are multiplied by a factor to reinforce their contribution to the total loss, while the probability loss for negatives is reduced by a factor to lower its influence.
Any advice for improvement, or a hint at what is wrong?
Thanks
JC
I edited my question so that my answer makes sense. The issue with my loss calculation was not in the implementation but in how I was setting up my labels.
A box rotated by 90 degrees and one rotated by -90 degrees are essentially the same box due to symmetry. Both answers are correct, which made the theta term of my loss never decrease.
Since the range of theta coming out of OpenCV's rotated boxes was eventually [-90, 90], rescaling my labels as labels = (labels + 90) / 180 solved my convergence issues.
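A minimal sketch of that relabelling (the tensor shape and theta channel index follow the layout in my question; the values here are placeholders):

import numpy as np

labels = np.zeros((16, 16, 6), dtype=np.float32)             # prob / y / x / a / b / theta
labels[..., 5] = np.random.uniform(-90, 90, size=(16, 16))   # theta in degrees, as returned by OpenCV's rotated boxes

# rescale theta from [-90, 90] to [0, 1], matching the fix described above
labels[..., 5] = (labels[..., 5] + 90.0) / 180.0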
hope this helps.

How to do classification after an FFT

I have a spectrum that I obtained with an FFT, and I want to use this data for learning with scikit-learn. However, I don't know what to use as explanatory variables: the frequencies, the amplitudes, or the phases. It also seems there are specific methods for processing this kind of data. If you have any ideas, thank you.
For example, here are measurements made on two species.
Measurements for species 1:
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2806130.78600507 -79.781679752725
234.24463948875 1913786.60902507 17.7111789273704
351.366959233125 808519.710937228 116.444676921222
468.4892789775 122095.42475935 25.5770279979328
585.520239658112 607116.287067349 142.264887989957
702.642559402487 604818.747928879 -112.469849617122
819.764879146862 277750.38203791 -15.0000950192717
936.887198891237 118608.971696726 -74.5121366118222
1054.00951863561 344484.145698282 -6.21161038546633
1171.13183837999 327156.097365635 97.0304114077862
1288.25415812436 133294.989030519 -42.5375933954097
1405.37647786874 112216.937121264 78.5147573168857
1522.49879761311 231245.476714294 -25.4436913705878
1639.62111735749 201337.057689481 -24.3659638609968
1756.6520780381 77785.2190703514 29.0468023773855
1873.77439778247 103345.482912432 -13.8433556624336
1990.89671752685 164252.685204496 32.0091367478569
2108.01903727122 131507.600569796 3.20717282723705
2225.1413570156 62446.6053497028 17.6656168494324
2342.26367675998 92615.8137781526 -2.92386499550556
Measurements for species 2:
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2786323.45338023 -78.5559125894388
234.24463948875 1915479.67743241 20.1586403367551
351.366959233125 830370.792189816 120.081294764269
468.4892789775 94486.3308071095 28.1762359863422
585.611598721875 590794.892175599 137.070646192436
702.642559402487 610017.558439343 -99.8603287979889
819.764879146862 300481.494163747 -7.0350571153689
936.887198891237 93989.1090623071 -52.6686900337389
1054.00951863561 332194.292343295 4.40278213901234
1171.13183837999 335166.932956212 92.5972261483014
1288.25415812436 154686.81104112 -64.5940556800747
1405.37647786874 91910.7647280088 82.3509804545009
1522.49879761311 223229.665336525 -64.4186985300827
1639.62111735749 211038.25587802 12.6057366375093
1756.74343710186 93456.4477333818 25.3398315513138
1873.77439778247 87937.8620001563 15.3447294063444
1990.89671752685 160213.112972346 7.41647669351739
2108.01903727122 141354.896010814 -48.4341201110724
2225.1413570156 69137.6327300227 39.9238718439715
2342.26367675998 82097.0663259956 -28.9291500313113
OP is asking how to classify this. I've explained it to him in comments and will break it down more here:
Each "specie" represents a row, or a sample. Each sample, thus, has 60 features (20x3)
He is doing a binary classification problem
Re-cast the output of the FFT to give Freq1,Amp1,Phase1....etc as a numerical input set for a training algorithm
Use something like a Support Vector Machine or Decision Tree Classifier out of scikit-learn and train over the dataset
Evaluate and measure accuracy
Caveats: 60 features over 1000 samples is potentially going to be quite hard to separate and liable to over-fitting, so the OP needs to be careful. I haven't spent much time understanding the features themselves, but I suspect 20 of them are redundant (the frequencies appear to be identical across samples).
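A hedged end-to-end sketch of that recipe (the feature matrix X below is random placeholder data standing in for the flattened frequency/amplitude/phase triplets):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# hypothetical data: each row is the flattened FFT table of one measurement
# [freq1, amp1, phase1, freq2, amp2, phase2, ...] -> 20 peaks * 3 values = 60 features
n_samples = 200
X = np.random.rand(n_samples, 60)
y = np.random.randint(0, 2, size=n_samples)  # species 1 vs species 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, clf.predict(X_test)))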

sklearn.mixture.GMM to fit multiple Gaussian curves into a histogram, an EM algorithm error

I'm using sklearn.mixture.GMM to fit two Gaussian curves to an array of data and then overlay the result on the data histogram (the data distribution is a mixture of 2 Gaussian curves).
My data is a list of floats, and here are the lines of code I am using:
from sklearn import mixture

clf = mixture.GMM(n_components=1, covariance_type='diag')
clf.fit(listOffValues)
If I set n_components to 1, I get the following error:
"(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.
and if I set n_components to 2, the error is:
(self.n_components, X.shape[0]))
ValueError: GMM estimation with 2 components, but got only 1 samples.
For the first error, I tried changing all init parameters of GMM, but it didn't make any difference.
I tried an array of random numbers and the code works perfectly fine.
I can't figure out what the issue could possibly be.
Is there an implementation issue I'm overlooking?
Thank you for your help.
If I understood you correctly, you would like to fit your data distribution with Gaussians, and you have only one feature per element. Then you should reshape your vector to be a column vector:
listOffValues = np.reshape(listOffValues, (-1, 1))
Otherwise, if your listOffValues corresponds to some curve that you want to fit with several Gaussians, you should use curve_fit instead. See Gaussian fit for Python.
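For the first case, a minimal sketch (using the current sklearn.mixture.GaussianMixture class, which replaced the deprecated mixture.GMM; the synthetic data is only for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture  # successor of the deprecated sklearn.mixture.GMM

# synthetic data drawn from a mixture of two Gaussians
listOffValues = np.concatenate([np.random.normal(0, 1, 500),
                                np.random.normal(5, 1, 500)])

X = np.reshape(listOffValues, (-1, 1))  # one feature per sample -> column vector of shape (n, 1)

clf = GaussianMixture(n_components=2, covariance_type='diag')
clf.fit(X)
print(clf.means_.ravel(), clf.covariances_.ravel())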

Active Shape Models' fitting procedure doesn't converge with Statistical Model fitting function

I followed the Active Shape Models approach described by Tim Cootes in his textbook and original paper. So far everything went well (Procrustes analysis, principal component analysis, preprocessing of the images for contrast and noise). Only the fitting procedure itself does not seem to converge.
I use the statistical model of the grey-level structure, as described in the textbook (p. 13), to create a fitting function for each of the 40 landmarks of each of the 8 incisors (so 320 fitting functions in total), by sampling 5 (= k) points on either side of the landmark along the profile normal to the boundary. Those functions are equal to the Mahalanobis distance (textbook p. 14).
During the fitting procedure I sample 10 (= m > k) points on either side along the profile normal to the boundary through each of the 40 landmarks of the current approximation of the tooth. That way I have to evaluate 2(m-k)+1 candidate samples with the corresponding fitting function.
Each of those candidate samples contains the gradient values of 2k+1 points. The sample that minimizes the function is chosen and the corresponding landmark is moved to the middle point of those 2k+1 points. This is done for each of the 40 landmarks and results in a new (not yet validated) approximation of the tooth.
This approximation in the image coordinate frame is aligned with the model of the tooth in the image coordinate frame. Then the coefficients (bi) of the principal component analysis are calculated and checked against |bi| < 3*sqrt(eigenvalue_i), in order not to deviate too much from the shape of the model. The coefficients (bi) are limited if necessary, and we transform back to the image coordinate frame and start a new iteration.
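A minimal sketch of that coefficient-limiting step (names are hypothetical; mean_shape, P and eigenvalues are assumed to come from the PCA of the aligned training shapes):

import numpy as np

def constrain_shape(x, mean_shape, P, eigenvalues):
    # project the candidate shape onto the PCA model: b = P^T (x - mean)
    b = P.T @ (x - mean_shape)
    # clamp each coefficient to |b_i| <= 3 * sqrt(lambda_i)
    limit = 3.0 * np.sqrt(eigenvalues)
    b = np.clip(b, -limit, limit)
    # transform back to shape space
    return mean_shape + P @ b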
[Image: the input image in which we want to find the upper-left incisor.]
[Image: the gradient image with the approximation of the tooth in the image coordinate frame at iteration 19 (red: before validation, green: after validation).]
As one can see, we have somewhat diverged from the optimal solution.
import cv2

def create_gradient(img):
    # Scharr derivative in x, then Scharr derivative in y of that result
    temp = cv2.Scharr(img, ddepth=-1, dx=1, dy=0)
    return cv2.Scharr(temp, ddepth=-1, dx=0, dy=1)
[Image: the approximation of the tooth in the model coordinate frame at iteration 19 (blue: model, red: before validation, green: after validation).] As one can see, we are still close to the shape of the model.
[Image: the approximation of the tooth in the model coordinate frame over 19 iterations (blue: model, red: before validation, green: after validation).] As one can see, we stay close to the shape of the model during all of these iterations.
So we stay close to the shape (guarded by principal component analysis), but not close to the intensity behaviour (guarded by the fitting function) around the landmarks.
The gradient image is wrong, or rather not of any use, because you need to take the derivative along the profile normals instead of along the horizontal and vertical directions.
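A hedged sketch of one way to do that (function and variable names are hypothetical): sample the grey levels along the profile normal at each landmark and differentiate that 1-D profile, instead of chaining Scharr in x and then y:

import numpy as np
import cv2

def profile_gradient(img, point, normal, k=5):
    # sample 2k+1 grey values along the unit normal through `point` (x, y)
    normal = normal / np.linalg.norm(normal)
    offsets = np.arange(-k, k + 1)
    xs = (point[0] + offsets * normal[0]).astype(np.float32).reshape(1, -1)
    ys = (point[1] + offsets * normal[1]).astype(np.float32).reshape(1, -1)
    samples = cv2.remap(img.astype(np.float32), xs, ys,
                        interpolation=cv2.INTER_LINEAR).ravel()
    # derivative of the 1-D profile (the vector to normalize and compare
    # against the grey-level model via the Mahalanobis distance)
    return np.gradient(samples)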
