Generate a GMM Dataset by using multivariate_normal from scipy.stats - python

How can I use multivariate_normal from scipy.stats to generate data?
Specifically, I want to create a GMM dataset that contains 3 columns (features) and a label column (0 or 1).
So I am basically looking to see a 3D plot that contains 6 different Gaussians (3 per class).
Thanks a lot!
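For what it's worth, here is a minimal sketch of one way to do this with multivariate_normal.rvs; the component means, covariances and the per-component sample count are arbitrary placeholders, so replace them with whatever you want your Gaussians to look like:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

np.random.seed(0)
n_per_component = 100   # samples per Gaussian (placeholder)

data, labels = [], []
# 3 Gaussian components per class, 2 classes -> 6 Gaussians in total;
# the means and covariances below are random placeholders
for label in (0, 1):
    for _ in range(3):
        mean = np.random.uniform(-5, 5, size=3)        # random 3-d mean
        cov = np.eye(3) * np.random.uniform(0.5, 2.0)  # spherical covariance
        samples = multivariate_normal.rvs(mean=mean, cov=cov, size=n_per_component)
        data.append(samples)
        labels.append(np.full(n_per_component, label))

X = np.vstack(data)                 # shape (600, 3): the three features
y = np.concatenate(labels)          # shape (600,): the 0/1 class label
dataset = np.column_stack([X, y])   # 4 columns: feature_1..3, label

# quick 3-d scatter, coloured by class label
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
plt.show()
Shuffling the rows of dataset afterwards (e.g. with np.random.permutation) gives a more realistically mixed dataset.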

Related

plot clusters of kmeans of sparse matrix

I have a Python script which does clustering over a data file in svmlight format.
I use the function sklearn.datasets.load_svmlight_file to load the data from the file.
I know that this function returns a sparse matrix.
I need to scatter plot the clusters; can anybody help me please?
This is what I have done:
import sys
import sklearn.datasets
from sklearn.cluster import KMeans

# load the svmlight-format file given on the command line (returns a sparse matrix)
dataFilename = sys.argv[1]
X, y = sklearn.datasets.load_svmlight_file(dataFilename)

# cluster into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_
print(labels)
centroids = kmeans.cluster_centers_
Without having the dataset, I would suggest the following:
Since load_svmlight_file() returns a sparse matrix, turn X into a NumPy array using samples = X.toarray() prior to fitting the model.
Plot two features (for example) of the dataset using:
plt.scatter(samples[:,0], samples[:,1], c=labels). This colours the clusters by their predicted labels.
Follow this with plt.scatter(centroids[:,0], centroids[:,1], marker='D') to see the location of the centroids with diamonds.
Note that samples[:,n] represents an array containing the sample values for the nth feature of the dataset.
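Putting those steps together with your labels and centroids, a minimal sketch might look like this (it plots only the first two features and the corresponding two centroid coordinates):
import matplotlib.pyplot as plt

samples = X.toarray()   # densify the sparse matrix for plotting
plt.scatter(samples[:, 0], samples[:, 1], c=labels)                  # points coloured by cluster
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D', c='red')   # centroids as diamonds
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()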
I hope this helps. If not, please let me know.

Clustering data- Poor results, feature extraction

I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represent a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
Scale the data to the range [0, 1]
Perform PCA (reduce from 15 to 5)
Cluster using DBSCAN, since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) for DBSCAN:
# find the optimal epsilon (distance) via the k-nearest-neighbour distance plot
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
# sort the distance to each point's nearest neighbour; the knee of this curve is a reasonable eps
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances, color="#0F215A")
plt.grid(True)
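(For reference, the clustering step itself then looks roughly like this; the eps value below is only a placeholder to be read off the knee of the plot above:)
from sklearn.cluster import DBSCAN
import numpy as np

eps_value = 0.5   # placeholder: use the knee of the k-distance plot above
db = DBSCAN(eps=eps_value, min_samples=5).fit(X_pca)
cluster_labels = db.labels_   # -1 marks points DBSCAN treats as noise
print(np.unique(cluster_labels, return_counts=True))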
The results so far do not give any clear indication that the data is changing over time.
Of course, it could simply be that the data is not changing over these data points. However, what are some other things I could try? It is kind of an open question, but I am running out of ideas.
First of all, with KMeans, if the dataset is not naturally partitioned you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all of your numeric variables, keep the target variable aside to check the clusters afterwards, and let the machine do the lifting for you. Here is a simple example using the canonical Iris dataset. You can EASILY modify this to fit your specific dataset: just change the 'X' variables (all but the target variable) and the 'y' variable (the single target variable). Try that and report back.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # take all four features
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# build a DataFrame with named columns so seaborn can plot the species by name
iris_df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                           "petal_length", "petal_width"])
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)
# use seaborn to make a scatter plot showing the species for each sample
sns.FacetGrid(iris_df, hue="species", height=4) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend()
plt.show()
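To actually check whether the cluster assignments line up with the true species labels (the comparison the comments above allude to), a quick cross-tabulation or an agreement score such as the adjusted Rand index can be added; this is just one option:
from sklearn.metrics import adjusted_rand_score

# cross-tabulate cluster ids against the true species labels
print(pd.crosstab(y, y_cluster_kmeans, rownames=['species'], colnames=['cluster']))
# single-number agreement score (1.0 = perfect match, around 0 = random labelling)
print(adjusted_rand_score(y, y_cluster_kmeans))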

Genextreme fit not working for some datasets

I'm trying to fit a GEV distribution to temperature data to help identify extreme values. I have data sets for different regions - for some regions the fit works fine but for others it breaks down. It appears that it is setting the location parameter close to the maximum of the distribution range. All data sets are large, of the same size, complete and have no particularly strange values.
Could you please suggest what might be happening or how I can investigate the genextreme function process to work out what the problem is?
Here are the relevant bits of code (the values are read in from NetCDF without any problem):
import pandas as pd
import numpy as np
import netCDF4 as nc
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev
# calculate GEV fit
fit = gev.fit(season_temp)
# GEV parameters from fit
c, loc, scale = fit
fit_mean = loc
min_extreme, max_extreme = gev.interval(0.99, c, loc, scale)
# evenly spread x-axis values for the pdf plot
x = np.linspace(min(season_temp), max(season_temp), 200)
# plot distribution
fig, ax = plt.subplots(1, 1)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(season_temp, 30, density=True, alpha=0.3)
And here are two examples of outputs from different regions, successful and not:
Successful fit
Unsuccessful fit
The successful fit has a location parameter of 1.066, compared to a data mean of 2.395. The failed fit has a location parameter of 12.202, compared to a data mean of 2.138.
Thanks in advance for your help!
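Not a full answer, but one way to investigate what the optimizer is doing is to rerun fit with explicit starting guesses and, for comparison, with the shape parameter fixed, then compare the resulting parameters. This is only a sketch using the season_temp array from the question; the starting values are rough moment-based guesses:
import numpy as np
from scipy.stats import genextreme as gev

# default fit (scipy picks its own starting point)
fit_default = gev.fit(season_temp)
# same fit, but starting from moment-based guesses for loc and scale
fit_guess = gev.fit(season_temp, loc=np.mean(season_temp), scale=np.std(season_temp))
# fit with the shape parameter fixed at 0 (a Gumbel-type fit) for comparison
fit_gumbel = gev.fit(season_temp, fc=0)

print(fit_default)
print(fit_guess)
print(fit_gumbel)
If the fit with starting guesses lands somewhere sensible while the default one does not, the problem is likely the optimizer's starting point rather than the data itself.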

How classify new entries in python having classified knowledge base [duplicate]

This question already has answers here:
How to assign an new observation to existing Kmeans clusters based on nearest cluster centriod logic in python?
(3 answers)
Closed 5 years ago.
I have a set of vectors, in python, composing my knowledge base, for example:
KB=[[1,2,3,4],[1,2,2,1],[4,3,1,2],[5,4,3,5]]
Now I have computed the clusters for KB using:
from sklearn.cluster import KMeans
model=KMeans(n_clusters=3)
model.fit(KB)
Now I have a new entry (I could have more than one),
A=[3,2,1,3]
and I would like to know which of the clusters computed above best fits A, thereby exploiting the KB.
Could you help me?
Thanks in advance
Here you are:
KB=[[1,2,3,4],[1,2,2,1],[4,3,1,2],[5,4,3,5]]
from sklearn.cluster import KMeans
model=KMeans(n_clusters=3).fit(KB)
A=[3,2,1,3]
l = model.predict([A])
print(model.labels_, l)
centers = model.cluster_centers_.copy()
print(centers)
In order for your model to be fitted, I joined two lines.
I then use the predict method to ... predict.
I also print the labels of the examples that were used to fit the model.
Edit: add a plot
import matplotlib.pyplot as plt
import numpy
# compute the Euclidean distance from each vector in KB to each cluster centre
d = numpy.array([[numpy.linalg.norm(numpy.array(KBi) - cj) for KBi in KB] for cj in centers])
print(d)
# distances to clusters 0 and 1
plt.scatter(d[0], d[1])
plt.pause(10)

How do I use my dataset in Sklearn clustering?

I am trying to adapt the sklearn example here to use my own dataset, which is a 1000-row, 4-column matrix of integers. I cannot see how to replace one of the sklearn datasets with mine, i.e. what do I replace
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
noise=.05)
with?
The datasets.make_circles function creates a toy dataset with a very clear pattern. The data it returns is a tuple containing an X array of features (n x 2 dimensions) and a y array of labels (length n).
To pass your data into the clustering script, you just need to put it into a similar format and use that in place of the value returned by make_circles.
Load your data as a 2 dimensional numpy array. Read the documentation of numpy and scipy to learn how to do so depending on the file format you have at hand.
Before running the clustering algorithm you might want to preprocess the data with a one-hot encoder if the integers represent category assignments rather than quantities.
If they represent quantities, you might want to preprocess with StandardScaler.
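As a concrete but hypothetical example, assuming the 1000 x 4 matrix is stored in a CSV file (the filename below is a placeholder), something along these lines could be dropped into the example script in place of one of the toy datasets:
import numpy as np
from sklearn.preprocessing import StandardScaler

# load your own 1000 x 4 integer matrix; 'my_data.csv' is a placeholder filename
X = np.loadtxt('my_data.csv', delimiter=',')
# scale if the columns are quantities (use one-hot encoding instead if they are categories)
X = StandardScaler().fit_transform(X)
# the example works on (data, labels) tuples; labels are not needed for clustering, so None will do
my_dataset = (X, None)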
