I have a spectrum and I compute its FFT, and I would like to use this data for learning with scikit-learn. However, I don't know what to take as explanatory variables: the frequencies, the amplitudes, or the phases. It also seems there are specific methods for processing this kind of data. If you have ideas, thank you.
For example, measurements made on two species:
Measurements for species 1
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2806130.78600507 -79.781679752725
234.24463948875 1913786.60902507 17.7111789273704
351.366959233125 808519.710937228 116.444676921222
468.4892789775 122095.42475935 25.5770279979328
585.520239658112 607116.287067349 142.264887989957
702.642559402487 604818.747928879 -112.469849617122
819.764879146862 277750.38203791 -15.0000950192717
936.887198891237 118608.971696726 -74.5121366118222
1054.00951863561 344484.145698282 -6.21161038546633
1171.13183837999 327156.097365635 97.0304114077862
1288.25415812436 133294.989030519 -42.5375933954097
1405.37647786874 112216.937121264 78.5147573168857
1522.49879761311 231245.476714294 -25.4436913705878
1639.62111735749 201337.057689481 -24.3659638609968
1756.6520780381 77785.2190703514 29.0468023773855
1873.77439778247 103345.482912432 -13.8433556624336
1990.89671752685 164252.685204496 32.0091367478569
2108.01903727122 131507.600569796 3.20717282723705
2225.1413570156 62446.6053497028 17.6656168494324
2342.26367675998 92615.8137781526 -2.92386499550556
Measurements for species 2
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2786323.45338023 -78.5559125894388
234.24463948875 1915479.67743241 20.1586403367551
351.366959233125 830370.792189816 120.081294764269
468.4892789775 94486.3308071095 28.1762359863422
585.611598721875 590794.892175599 137.070646192436
702.642559402487 610017.558439343 -99.8603287979889
819.764879146862 300481.494163747 -7.0350571153689
936.887198891237 93989.1090623071 -52.6686900337389
1054.00951863561 332194.292343295 4.40278213901234
1171.13183837999 335166.932956212 92.5972261483014
1288.25415812436 154686.81104112 -64.5940556800747
1405.37647786874 91910.7647280088 82.3509804545009
1522.49879761311 223229.665336525 -64.4186985300827
1639.62111735749 211038.25587802 12.6057366375093
1756.74343710186 93456.4477333818 25.3398315513138
1873.77439778247 87937.8620001563 15.3447294063444
1990.89671752685 160213.112972346 7.41647669351739
2108.01903727122 141354.896010814 -48.4341201110724
2225.1413570156 69137.6327300227 39.9238718439715
2342.26367675998 82097.0663259956 -28.9291500313113
OP is asking how to classify this. I've explained it to him in comments and will break it down more here:
Each "specie" represents a row, or a sample. Each sample, thus, has 60 features (20x3)
He is doing a binary classification problem
Re-cast the output of the FFT to give Freq1,Amp1,Phase1....etc as a numerical input set for a training algorithm
Use something like a Support Vector Machine or Decision Tree Classifier out of scikit-learn and train over the dataset
Evaluate and measure accuracy
Caveats: 60 features over 1000 samples is potentially going to be quite hard to separate and liable to over-fitting. OP needs to be careful. I havent spent much time understanding the features themselves, but I suspect 20 of those features are redundant (the frequencies always seem to be the same between samples)
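A minimal sketch of that pipeline, assuming each sample's FFT output is a 20 x 3 table like the ones above. The spectra and labels here are random placeholders, and the choice of an RBF SVC is illustrative, not prescriptive:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder data: 100 samples per species, each a (20, 3) FFT table
# (frequency, peak amplitude, phase) like the ones above.
spectra = [rng.normal(size=(20, 3)) for _ in range(200)]
labels = [0] * 100 + [1] * 100

def to_feature_vector(spectrum):
    # Flatten to Freq1, Amp1, Phase1, Freq2, Amp2, Phase2, ... (60 features).
    # If the frequency bins are identical for every sample, drop that column first.
    return np.asarray(spectrum).ravel()

X = np.array([to_feature_vector(s) for s in spectra])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

Because the amplitudes span several orders of magnitude while the phases stay in [-180, 180], scaling the features (e.g. with StandardScaler) before the SVM would likely help.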
I'm trying to use eval_mus_track function of the museval package to evaluate my audio source separation model. The model I'm evaluating was trained to predict vocals and the results are similar to the actual vocals but the evaluation metrics such as SDR are negative.
Below is my function for generating the metrics:
import numpy as np
import museval

def estimate_and_evaluate(track):
    # track.audio is stereo, therefore we predict each channel separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    # stack the two mono estimates back into stereo (samples, channels) arrays
    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file,
                                  vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file,
                                         acompaniment_predicted_channel_2.wav_file])).T

    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)
The metric values I get are:
vocals ==> SDR: -3.776 SIR: 4.621 ISR: -0.005 SAR: -30.538
accompaniment ==> SDR: -0.590 SIR: 1.704 ISR: -0.006 SAR: -16.613
The above result doesn't make sense: first, the accompaniment prediction is pure noise (this model was trained for vocals), yet it gets the higher SDR; second, the predicted vocals have a very similar graph to the actual ones but still get a negative SDR value!
In the following graphs (one per channel), the top trace is the actual sound and the bottom one is the predicted source: [waveform plots for Channel 1 and Channel 2 omitted]
I tried to shift the predicted vocals as mentioned here but the result got worse.
Any idea what's causing this issue?
This is the link to the actual vocals stereo numpy array, and this one to the predicted stereo vocals numpy array. You can load and manipulate them using np.load.
Thanks for your time.
The signal to distortion ratio is actually the logarithm of a ratio. See equation (12) of this article:
https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf
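For reference, the standard BSS Eval form of that ratio (my paraphrase; see the paper's equation (12) for the exact notation) is

\mathrm{SDR} = 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \rVert^2}

so the value goes negative whenever the error terms carry more energy than the target term.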
So, an SDR of 0 means that the signal is equal to the distortion, and an SDR of less than 0 means that there is more distortion than signal. If the audio doesn't sound like there is more distortion than signal, the cause is often a sample-alignment problem.
When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment between the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even from listening, whether the samples are misaligned, but a zoomed-in plot where you can see each individual sample can help you make sure the ground truth and the prediction are exactly lined up. If the estimate is shifted by even a single sample, the SDR calculation will not reflect the actual SDR.
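One way to sanity-check the alignment is to cross-correlate the predicted and reference signals and look at the lag of the correlation peak; anything other than zero means the estimate is shifted. A rough sketch, where the file names are placeholders for the arrays linked above:

import numpy as np
from scipy.signal import correlate

# Placeholder paths for the ground-truth and predicted vocal arrays linked above.
reference = np.load("actual_vocals.npy")[:, 0]   # channel 1
estimate = np.load("predicted_vocals.npy")[:, 0]

# Truncate to a common length before correlating.
n = min(len(reference), len(estimate))
reference, estimate = reference[:n], estimate[:n]

# Lag of the cross-correlation peak: 0 means the samples line up exactly.
corr = correlate(estimate, reference, mode="full")
lag = np.argmax(corr) - (n - 1)
print("estimated lag in samples:", lag)

# If lag != 0, shift the estimate accordingly before calling museval.eval_mus_track.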
I am exploring a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one: distance from the goal. During data exploration I decided to investigate the relationship between distance and whether a goal was scored. I did this graphically by splitting the data into equal-size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin against the probability of scoring. I did this in Python:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2, figsize=(11, 5))

# first we want to create bins to calc our probability
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)

# now we want to find the mean of the Goal column (our probability) for each bin
# and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins', as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins', as_index=False)['Distance'].mean()['Distance']

dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")
[Plot: Probability of Scoring Based on Distance]
So my question is: why would we go through the process of creating a logistic regression model when I could simply fit a curve to the plot in the image? Would that not provide a function that predicts the probability for a shot at distance x?
I guess the problem is that we are reducing about 40,000 data points into 50, but I'm not entirely sure why this would be a problem for predicting future shots. Could we increase the number of bins, or would that just add variability? Is this a case of the bias-variance trade-off? I'm just a little confused about why this would not be as good as a logistic model.
The binning method is a bit more finicky than the logistic regression since you need to try different types of plots to fit the curve (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities for each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is more useful a predictor than "Angle". However, logistic regression does that for you because it can adjust the weights.
Regarding your binning method, if you use fewer bins you might see more bias, since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. If you use more bins, that would not significantly increase variance, assuming you fit the curve without varying its order. If you change the order of the curve you fit, then yes, it will increase variance. That said, your data seems amenable to a very simple fit if you go with this method.
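For comparison, fitting the logistic regression directly on the raw (unbinned) shots is only a few lines. The column names follow the question, but the DataFrame below is a synthetic placeholder rather than the real shot data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the shots DataFrame from the question.
df = pd.DataFrame({
    "Distance": rng.uniform(1, 40, size=5000),
    "Angle": rng.uniform(0, 90, size=5000),
})
df["Goal"] = (rng.random(5000) < 1 / (1 + np.exp(0.15 * df["Distance"] - 2))).astype(int)

# Fit on every shot rather than on 50 binned averages; both predictors get a weight.
model = LogisticRegression()
model.fit(df[["Distance", "Angle"]], df["Goal"])

# Predicted probability of scoring for a shot 10 units out at a 30 degree angle.
new_shot = pd.DataFrame({"Distance": [10], "Angle": [30]})
print(model.predict_proba(new_shot)[0, 1])

The point about multiple predictors is visible here: the model learns one weight per feature instead of requiring a separate binned curve for each predictor and an ad hoc way of combining them.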
I have an EEG dataset with 8 features taken using an 8-channel EEG headset. Each row represents readings taken at a 250 ms interval. The values are all floating point, representing voltages in microvolts. If I plot individual features, I can see that they form a continuous wave. The target has 3 categories: 0, 1, and 2, and for a duration of time the target doesn't change, because each sample spans multiple rows. I would appreciate any guidance on how to pre-process the dataset, since using it as-is gives me very low accuracy (80%), and according to Wikipedia the P300 signal can be detected with 95% accuracy. Please note that I have almost zero knowledge about signal processing and analysing waveforms.
I did try making a 3D array where each row represented a single target and the value of each feature was the list of values that originally spanned multiple rows, but I get an error saying the estimator expected an array with <= 2 dimensions. I'm not sure if this was the right approach, but it didn't work anyway.
Here, have a look at my feature set:
-1.2198,-0.32769,-1.22,2.4115,0.057031,-2.6568,7.372,-0.2789
-1.4262,-4.19,-5.6546,-7.7161,-5.4359,-9.4553,-3.6705,-5.4851
-1.3152,-6.8708,-8.5599,-14.739,-9.1808,-14.268,-11.632,-8.929
-0.53987,-7.5156,-8.9646,-16.656,-10.119,-15.791,-14.616,-9.4095
Their corresponding targets:
0
0
0
0
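On the dimensionality error mentioned above: scikit-learn estimators expect a 2-D feature matrix, so a 3-D (epochs x channels x timepoints) array has to be flattened to one feature vector per epoch first. A rough sketch with made-up shapes (the epoch length and counts below are assumptions, not taken from the dataset):

import numpy as np

rng = np.random.default_rng(0)

# Placeholder: 120 epochs, 8 channels, 100 time points per epoch
# (i.e. each epoch's rows from the original table stacked together).
epochs = rng.normal(size=(120, 8, 100))
targets = rng.integers(0, 3, size=120)  # one label per epoch, not per row

# Flatten each epoch to a single feature vector so the array is 2-D:
# shape becomes (n_epochs, n_channels * n_timepoints).
X = epochs.reshape(len(epochs), -1)
print(X.shape)  # (120, 800) -- a shape scikit-learn estimators will accept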
I have empirical, somewhat noisy data about two classes of objects: 0 and 1.
I have a hypothesis that class 0's data follows a sine-wave pattern while class 1's does not, or not as much.
Problem: how do I test this hypothesis?
Dataset of one sample: df = pd.DataFrame.load('path-name'): https://www.dropbox.com/s/zbgnivgcww49b7w/sindrink.pkl?dl=0
I tried a goodness-of-fit approach, optimizing an error function (the distance between the data and a prediction of the form a*sin(x/b + c)), but that leads to an incorrect result because of the imperfectness of the data: the frequency and amplitude are not perfectly constant.
So I need some sort of algorithm and metric that would confirm (or reject) that one sample is following the sine-wave pattern while another sample is not.
I had an idea to try a Fourier transform to fit a few sine waves and then calculate the goodness of fit, but I have failed to do that so far.
Any ideas/suggestions?
A more sinusoidal signal will look like a sharp "spike" in the Fourier-amplitude domain. You could try taking the FFT amplitude of your signal (subtracting the mean first) and measuring the ratio of the maximum to the mean. That will at least give a number that corresponds to the "sinusoidalness" of your signal.
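A rough sketch of that idea; the test signals here are synthetic and any decision threshold would still need to be chosen from your own data:

import numpy as np

def peakiness(signal):
    # Ratio of the largest FFT amplitude to the mean amplitude, with the mean removed
    # first so the DC bin doesn't dominate.
    amplitude = np.abs(np.fft.rfft(signal - np.mean(signal)))
    return amplitude.max() / amplitude.mean()

rng = np.random.default_rng(0)
x = np.linspace(0, 20 * np.pi, 2000)

sine_like = np.sin(x) + 0.2 * rng.normal(size=x.size)   # noisy sine
noise_like = rng.normal(size=x.size)                     # no dominant frequency

print(peakiness(sine_like))   # large: one bin dominates the spectrum
print(peakiness(noise_like))  # small: energy is spread over many bins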
I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
I.e., I have the expected labels for the data points that are clustered by kmeans. Now, once I get the generated clusters, how do I evaluate their quality with respect to the expected labels?
I was doing this very thing at the time with Spark's KMeans. I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the less, the better).
K-means attempts to minimise the sum of squared distances to cluster centers. I would compare this sum for the k-means clusters with the same sum computed for the clusters you get by grouping the points by their expected labels.
There are two possibilities. If the k-means sum of squares is larger than that of the expected-label clustering, then your k-means implementation is buggy or did not start from a good set of initial cluster assignments, and you could think about increasing the number of random starts you are using, or debugging it. If the k-means sum of squares is smaller than the expected-label sum of squares, and the k-means clusters are not very similar to the expected-label clustering (that is, two points chosen at random are often in the same expected-label cluster but not in the same k-means cluster, or vice versa), then the sum of squares from cluster centers is not a good way of splitting your points up into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
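A small sketch of that comparison, computing the within-cluster sum of squares once for the k-means labels and once for the expected labels (the data here is synthetic, and the centroids are taken as each group's mean):

import numpy as np
from scipy.cluster.vq import kmeans2

def within_cluster_ss(points, labels):
    # Sum of squared distances of each point to the centroid of its own group.
    total = 0.0
    for label in np.unique(labels):
        group = points[labels == label]
        total += ((group - group.mean(axis=0)) ** 2).sum()
    return total

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
expected = np.repeat([0, 1], 100)

_, kmeans_labels = kmeans2(points, 2, minit='points', seed=0)

print("k-means WCSS:       ", within_cluster_ss(points, kmeans_labels))
print("expected-label WCSS:", within_cluster_ss(points, expected))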
In your case, when you do have the samples' true labels, validation is very easy.
First, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive from it all the relevant measures: true positives, false negatives, false positives and true negatives. From those you can find the precision, recall, miss rate, etc.
Make sure you understand the meaning of all above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
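A short sketch of the external-validation route with scikit-learn; since k-means cluster IDs are arbitrary, a pair-counting score such as the adjusted Rand index is often more convenient than the raw confusion matrix. The data below is synthetic, standing in for points with known expected labels:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Synthetic data with known labels, standing in for your expected labels.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Confusion matrix: rows are true labels, columns are (arbitrarily numbered) clusters.
print(confusion_matrix(y_true, y_pred))

# Adjusted Rand index: 1.0 means perfect agreement, around 0.0 means chance level.
print(adjusted_rand_score(y_true, y_pred))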
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona, "An extensive comparative study of cluster validity indices".
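Several of the internal indices listed above are available directly in scikit-learn; a minimal sketch, again on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:        ", silhouette_score(X, labels))         # higher is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))     # lower is better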