Gaussian Mixture Model with discrete data - python

I have 136 numbers that come from an overlapping mixture of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian component. Can you find any mistakes in my code?
file = open("1.txt", 'r')  # data in 1.txt looks like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y = [int(i) for i in file.read().split(',')]  # the data values as a list of ints
x = list(range(1, len(y) + 1))  # the x values 1..136
z = list(zip(x, y))  # z elements are (1, 0), (2, 0), ...
So, through the process above, I obtained a list z of the 136 points (x, y) on the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian component's mean and variance, under the basic assumption that the given data consists of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code to print the 8 means and variances... Can anyone help me?

You can use this; I have made some sample code for your visualization:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50  # for zooming in/out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
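To print the 8 means and variances the question asks for, you can read them straight off the fitted model's attributes. A minimal sketch (not part of the answer above), assuming data is the 136 values reshaped to (-1, 1) and n_components=8 as in the question:
model = GaussianMixture(n_components=8).fit(data)
means = model.means_.flatten()            # one mean per component
variances = model.covariances_.flatten()  # one variance per component (1-D data)
for k in range(8):
    print('component', k, 'mean =', means[k], 'variance =', variances[k])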


Count data points for each K-means cluster

I have a dataset for banknotes wavelet data of genuine and forged banknotes with 2 features which are:
X axis: Variance of Wavelet Transformed image
Y axis: Skewness of Wavelet Transformed image
I run on this dataset K-means to identify 2 clusters of the data which are basically genuine and forged banknotes.
Now I have 3 questions:
How can I count the data points of each cluster?
How can I set the color of each data point based on its cluster?
How do I know, without another feature in the data, whether a data point is genuine or forged? I know the dataset has a "Class" column with 1 and 2 for genuine and forged, but can I identify this without the "Class" feature?
My code:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches
import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('Banknote-authentication-dataset-all.csv')
V1 = data['V1']
V2 = data['V2']
bn_class = data['Class']
V1_min = np.min(V1)
V1_max = np.max(V1)
V2_min = np.min(V2)
V2_max = np.max(V2)
normed_V1 = (V1 - V1_min)/(V1_max - V1_min)
normed_V2 = (V2 - V2_min)/(V2_max - V2_min)
V1_mean = normed_V1.mean()
V2_mean = normed_V2.mean()
V1_std_dev = np.std(normed_V1)
V2_std_dev = np.std(normed_V2)
ellipse = patches.Ellipse([V1_mean, V2_mean], V1_std_dev*2, V2_std_dev*2, alpha=0.4)
V1_V2 = np.column_stack((normed_V1, normed_V2))
km_res = KMeans(n_clusters=2).fit(V1_V2)
clusters = km_res.cluster_centers_
plt.xlabel('Variance of Wavelet Transformed image')
plt.ylabel('Skewness of Wavelet Transformed image')
scatter = plt.scatter(normed_V1,normed_V2, s=10, c=bn_class, cmap='coolwarm')
#plt.scatter(V1_std_dev, V2_std_dev, s=400, alpha=0.5)
plt.scatter(V1_mean, V2_mean, s=400, alpha=0.8, c='lightblue')
plt.scatter(clusters[:,0], clusters[:,1], s=3000, c='orange', alpha=0.8)
unique = list(set(bn_class))
plt.text(1.1, 0, 'Kmeans cluster centers', bbox=dict(facecolor='orange'))
plt.text(1.1, 0.11, 'Arithmetic Mean', bbox=dict(facecolor='lightblue'))
plt.text(1.1, 0.33, 'Class 1 - Genuine Notes',color='white', bbox=dict(facecolor='blue'))
plt.text(1.1, 0.22, 'Class 2 - Forged Notes', bbox=dict(facecolor='red'))
plt.savefig('figure.png',bbox_inches='tight')
plt.show()
[Appendix image for better visibility]
How to count the data points of each cluster
You can do this easily by using fit_predict instead of fit, or calling predict on your training data after fitting it.
Here's a working example:
labels = KMeans(...).fit_predict(V1_V2)
clusterCount = np.bincount(labels)
clusterCount will now hold the number of points in each cluster. You can just as easily do this with fit followed by predict, though fit_predict above avoids the extra pass over the data:
kM = KMeans(...).fit(V1_V2)
labels = kM.predict(V1_V2)
clusterCount = np.bincount(labels)
To set its color, use kM.labels_ or the output of kM.predict() as a coloring index.
labels = kM.predict(V1_V2)
plt.scatter(normed_V1, normed_V2, s=10, c=labels, cmap='coolwarm') # instead of c=bn_class
For a new data point, notice how the KMeans fit quite nicely separates the majority of the two classes. This separability means you can actually use your KMeans clusters as predictors: simply call predict on the fitted model.
predictedClass = kM.predict([newDataPoint])  # newDataPoint is a [v1, v2] pair on the same normalized scale
Each cluster is then assigned the class that makes up the majority of its members (or even a percentage chance).
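For completeness, here is a minimal sketch (not from the original answer) of assigning each cluster its majority class and that percentage, reusing labels and bn_class from the code above:
import numpy as np
# Map each k-means cluster to the class that makes up most of its members.
for c in np.unique(labels):
    members = bn_class[labels == c]                      # true classes of the points in cluster c
    values, counts = np.unique(members, return_counts=True)
    majority = values[np.argmax(counts)]
    share = counts.max() / counts.sum()                  # the "percentage chance"
    print('cluster', c, '-> class', majority, '({:.1%} of members)'.format(share))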

Python find peaks of distribution

In a dataset like this (y is angle, x is the data point index):
How can I find the weighted average of each "band" (in this case roughly 0.1 and -90), while ignoring occasional random points?
I was thinking of an FFT, but that might not be the right approach.
Perhaps transform the data into something like a normal distribution and find the peaks?
Solve Using KMeans
Step 1. Generate Data
from random import randint, choice
from numpy import random
import numpy as np
from matplotlib import pyplot as plt
def gen_pts(mean_, std_, n):
    """Generate Gaussian-distributed random data
    mean: mean_
    standard deviation: std_
    number of points: n
    """
    return np.random.normal(loc=mean_, scale=std_, size=n)
# Number of groups of horizontal blobs
n_groups = 20
# Generate random count for each group
counts = [randint(100, 200) for _ in range(n_groups)]
# Generate random mean for each group (i.e. 0 or -90)
means = [random.choice([0, -90]) for _ in range(n_groups)]
# All the groups
data = [gen_pts(mean_, 5, n) for mean_, n in zip(means, counts)]
# Concatenate groups into 1D array
X = np.concatenate(data, axis=0)
# Show Data
plt.plot(X)
plt.show()
Step 2. Find Cluster Centers
from sklearn.cluster import KMeans
# Reshape 1D data so it's suitable for the kmeans model
X = X.reshape(-1,1)
# Get model for two clusters
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
# Fit Data to model
pred_y = kmeans.fit_predict(X)
# Cluster Centers
centers = kmeans.cluster_centers_
print(*centers)
# Output: [-89.79165334] [-0.07875314]
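If you also want the per-band counts and means computed directly from the labels rather than from cluster_centers_, a short sketch (not part of the answer above) using pred_y and X from the code above; for this 1-D data the means coincide with the printed centers:
# Per-cluster counts and means computed from the predicted labels
for c in range(2):
    members = X[pred_y == c]
    print('cluster', c, ':', len(members), 'points, mean =', members.mean())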

QQplot for discrete distribution

I have a set whose samples are discrete values (specifically, the size of a queue over time). Now I'd like to find what distribution they follow. To achieve this I'd proceed the same way I did for the other quantities, i.e. plotting a qqplot by running
import statsmodels.api as sm
sm.qqplot(df, dist = 'geom', sparams = (.5,), line ='s', alpha = 0.3, marker ='.')
This works when dist is not a discrete random variable (e.g. 'exp' or 'norm'), and indeed I got some results, but when the distribution is discrete (say, 'geom'), I get
AttributeError: 'geom_gen' object has no attribute 'fit'
I searched the Internet for how to make a qqplot (or something similar) to identify which distribution my samples come from, but found nothing. This is what I have so far:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def discreteQQ(x_sample):
    p_test = np.array([])
    for i in range(0, 1001):
        p_test = np.append(p_test, i / 1000)
    x_sample = np.sort(x_sample)
    ecdf_sample = np.arange(1, len(x_sample) + 1) / (len(x_sample) + 1)
    x_theor = stats.geom.ppf(ecdf_sample, p=0.5)
    for p in p_test:
        plt.scatter(np.quantile(x_theor, p), np.quantile(x_sample, p), c='blue')
    plt.xlabel('Theoretical quantiles')
    plt.ylabel('Sample quantiles')
    plt.show()
Generate a theoretical geometric distribution using scipy.stats.geom, convert the sample and theoretical data using statsmodels' ProbPlot and pass these to statsmodels' qqplot_2samples.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.graphics.gofplots import qqplot_2samples
p_theor = 1/4 # The probability we check for
p_sample = 1/5 # The true probability of the sample distribution
# The experimental data
x_sample = stats.geom.rvs(p_sample, size=50)
# The model data
x_theor = stats.geom.rvs(p_theor, size=100)
qqplot_2samples(ProbPlot(x_sample), ProbPlot(x_theor), line='45')
plt.show()

How to Extract the following Frequency-domain Features in Python?

Please feel free to point out any errors/improvements in the existing code
So this is a very basic question and I only have a beginner-level understanding of signal processing. I have 1.02 seconds of accelerometer data sampled at 32000 Hz. I am looking to extract the following frequency-domain features after having performed an FFT in Python:
Mean Freq, Median Freq, Power Spectrum Deformation, Spectrum energy, Spectral Kurtosis, Spectral Skewness, Spectral Entropy, RMSF (Root Mean Square Freq.), RVF (Root Variance Frequency), Power Cepstrum.
More specifically, I am looking for plots of these features as a final output.
The csv file containing the data has four columns: Time, X Axis Value, Y Axis Value, Z Axis Value (the accelerometer is a triaxial one). So far in Python, I have been able to visualize the time-domain data, apply a convolution filter to it, apply an FFT, and generate a spectrogram that shows an interesting shock.
To Visualize Data
#Importing pandas and plotting modules
import numpy as np
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
#Reading Data
data = pd.read_csv('HelicalStage_Aug1.csv', index_col=0)
data = data[['X Value', 'Y Value', 'Z Value']]
date_rng = pd.date_range(start='1/8/2018', end='11/20/2018', freq='s')
#Plot the entire time series data and show gridlines
data.plot(grid=True)
Denoising
# Applying Convolution Filter
# Small worked example of a moving average first
mylist = [1, 2, 3, 4, 5, 6, 7]
N = 3
cumsum, moving_aves = [0], []
for i, x in enumerate(mylist, 1):
    cumsum.append(cumsum[i-1] + x)
    if i >= N:
        moving_ave = (cumsum[i] - cumsum[i-N]) / N
        # can do stuff with moving_ave here
        moving_aves.append(moving_ave)
# the same moving average via convolution
np.convolve(mylist, np.ones((N,)) / N, mode='valid')
result_X = np.convolve(data[["X Value"]].values[:,0], np.ones((20001,))/20001, mode='valid')
result_Y = np.convolve(data[["Y Value"]].values[:,0], np.ones((20001,))/20001, mode='valid')
result_Z = np.convolve(data[["Z Value"]].values[:,0], np.ones((20001,))/20001, mode='valid')
plt.plot(result_X-np.mean(result_X))
plt.plot(result_Y-np.mean(result_Y))
plt.plot(result_Z-np.mean(result_Z))
FFT and Spectogram
import numpy as np
import scipy as sp
import scipy.fftpack
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('HelicalStage_Aug1.csv')
df = df.drop(columns="Time")
df.plot()
plt.title('Sensor Data as Time Series')
signal = df[['Y Value']]
signal = np.squeeze(signal)
Y = np.fft.fftshift(np.abs(np.fft.fft(signal)))
Y = Y[int(len(Y)/2):]
Y = Y[10:]
plt.figure()
plt.plot(Y)
plt.figure()
powerSpectrum, freqenciesFound, time, imageAxis = plt.specgram(signal, Fs= 32000)
plt.show()
If my code is correct and the generated FFT and spectrogram are good, how can I then compute and plot the previously mentioned frequency-domain features?
I have tried doing the following for MFCC -
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
from scipy.io import wavfile
from python_speech_features import mfcc
from python_speech_features import logfbank
# Extract MFCC and filter bank features
Fs = 32000  # sample rate of the accelerometer data, as stated above
mfcc_features = mfcc(signal, Fs)
filterbank_features = logfbank(signal, Fs)
# Printing parameters to see how many windows were generated
print('\nMFCC:\nNumber of windows =', mfcc_features.shape[0])
print('Length of each feature =', mfcc_features.shape[1])
print('\nFilter bank:\nNumber of windows =', filterbank_features.shape[0])
print('Length of each feature =', filterbank_features.shape[1])
Visualizing filter bank features
#Matrix needs to be transformed in order to have horizontal time domain
mfcc_features = mfcc_features.T
plt.matshow(mfcc_features)
plt.title('MFCC')
I think your FFT procedure is not quite right: the FFT output usually shows a peak, and when you take the absolute value it should be a single peak, so you should probably change Y = np.fft.fftshift(np.abs(np.fft.fft(signal))) to Y = np.abs(np.fft.fftshift(np.fft.fft(signal))).
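None of the above actually computes the requested features. As a starting point, here is a minimal, hedged sketch (not from the answer) of a few of them - mean frequency, median frequency, RMSF, RVF and spectral entropy - computed from a one-sided power spectrum, assuming signal is the Y-axis series and the 32000 Hz sample rate stated in the question:
import numpy as np

fs = 32000                                   # sample rate from the question
sig = np.asarray(signal, dtype=float)
power = np.abs(np.fft.rfft(sig))**2          # one-sided power spectrum
freqs = np.fft.rfftfreq(len(sig), d=1/fs)    # matching frequency axis

p = power / power.sum()                      # normalise to a distribution over frequency
mean_freq = np.sum(freqs * p)
median_freq = freqs[np.searchsorted(np.cumsum(p), 0.5)]
rmsf = np.sqrt(np.sum(freqs**2 * p))                    # root mean square frequency
rvf = np.sqrt(np.sum((freqs - mean_freq)**2 * p))       # root variance frequency
spectral_entropy = -np.sum(p * np.log2(p + 1e-12))      # entropy of the normalised spectrum

print(mean_freq, median_freq, rmsf, rvf, spectral_entropy)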

How to validate the downsampling is as intended

How can I validate that the down-sampled output is correct? For example, I made a small example below, but I am not sure whether the output is correct or not.
Any ideas on how to validate it?
Code
import numpy as np
import matplotlib.pyplot as plt # For ploting
from scipy import signal
import mne
fs = 100 # sample rate
rsample=50 # downsample frequency
fTwo=400 # frequency of the signal
x = np.arange(fs)
y = [ np.sin(2*np.pi*fTwo * (i/fs)) for i in x]
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure(1)
plt.subplot(211)
plt.stem(x, y)
plt.subplot(212)
plt.stem(xnew, f_res, 'r')
plt.show()
Plotting the data is a good first take at verification. Here I made a regular plot with the points connected by lines. The lines are useful since they give a guide for where you expect the down-sampled data to lie, and they also emphasize what the down-sampled data is missing. (It would also work to only show lines for the original data, but the vertical lines of a stem plot are too confusing, imho.)
import numpy as np
import matplotlib.pyplot as plt # For ploting
from scipy import signal
fs = 100 # sample rate
rsample=43 # downsample frequency
fTwo=13 # frequency of the signal
x = np.arange(fs, dtype=float)
y = np.sin(2*np.pi*fTwo * (x/fs))
print(y)
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xnew, f_res, 'or')
plt.show()
A few notes:
If you're trying to make a general algorithm, use non-rounded numbers, otherwise you could easily introduce bugs that don't show up when things are even multiples. Similarly, if you need to zoom in to verify, go to a few random places, not, for example, only the start.
Note that I changed fTwo to be significantly less than the number of samples: you need more than one data point per oscillation if you want to make sense of the signal.
I also removed the loop for calculating y: in general, you should try to vectorize calculations when using numpy.
The spectrum of the resampled signal should have a tone at the same frequency as the input signal, just within a smaller Nyquist bandwidth.
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
import scipy.fftpack as fft
fs = 100 # sample rate
rsample=50 # downsample frequency
fTwo=10 # frequency of the signal
n = np.arange(1024)
y = np.sin(2*np.pi*fTwo/fs*n)
y_res = signal.resample(y, len(n)//2)
Y = fft.fftshift(fft.fft(y))
f = -fs*np.arange(-512, 512)/1024
Y_res = fft.fftshift(fft.fft(y_res, 1024))
f_res = -fs/2*np.arange(-512, 512)/1024
plt.figure(1)
plt.subplot(211)
plt.stem(f, abs(Y))
plt.subplot(212)
plt.stem(f_res, abs(Y_res))
plt.show()
The tone is still at 10.
If you down-sample a signal, both signals will still have exactly the same value at a given time, so just loop through "time" and check that the values are the same. In your case you go from a sample rate of 100 to 50. Assuming you have 1 second's worth of data from building your x from fs, just loop from t = 0 to t = 1 in increments of 1/50 and make sure that Yd(t) = Ys(t), where Yd is the down-sampled signal and Ys is the originally sampled signal. Or, to say it simply, Yd(n) = Ys(2n) for n = 0, 1, 2, ..., total_samples/2 - 1.
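A minimal sketch of that check (an assumption-laden example, using the corrected parameters fs = 100, fTwo = 10 from the earlier answer and scipy's Fourier-based resample):
import numpy as np
from scipy import signal

fs, rsample, fTwo = 100, 50, 10
x = np.arange(fs, dtype=float)
y = np.sin(2 * np.pi * fTwo * (x / fs))
y_down = signal.resample(y, rsample)

# Yd(n) should match Ys(2n): compare against every second original sample.
# For this pure tone the Fourier resampling is essentially exact; for general
# signals compare with a looser tolerance.
print(np.allclose(y_down, y[::2], atol=1e-8))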
