I have a Python script which does clustering over a data file in svmlight format.
I use the function sklearn.datasets.load_svmlight_file to load the data from the data file.
I know that this function returns a sparse matrix.
I need to scatter plot the clusters; can anybody help me, please?
This is what I have done:
import sys
import sklearn.datasets
from sklearn.cluster import KMeans
# load the svmlight-format data file passed on the command line
dataFilename = sys.argv[1]
X, y = sklearn.datasets.load_svmlight_file(dataFilename)
# cluster into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_
print(labels)
centroids = kmeans.cluster_centers_
Without having the dataset, I would suggest the following:
Since load_svmlight_file() returns a sparse matrix, turn X into a NumPy array using samples = X.toarray() prior to fitting the model.
Plot two features (for example) of the dataset using:
plt.scatter(samples[:,0], samples[:,1], c=labels). This colours the points according to their predicted cluster labels.
Follow this with plt.scatter(centroids[:,0], centroids[:,1], marker='D') to see the location of the centroids with diamonds.
Note that samples[:,n] represents an array containing the sample values for the nth feature of the dataset.
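Putting those pieces together, a minimal sketch (using the X, labels and centroids variables from your snippet, and assuming the first two features are the ones worth plotting):
import matplotlib.pyplot as plt
# densify the sparse matrix so we can index its columns directly
samples = X.toarray()
# colour each sample by its predicted cluster label
plt.scatter(samples[:, 0], samples[:, 1], c=labels)
# mark the cluster centroids with diamonds
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D', c='red')
plt.show()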
I hope this helps. If not, please let me know.
I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represents a measurement taken on a single day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
Scale the data to the range [0, 1]
Perform PCA (reduce from 15 dimensions to 5)
Cluster using DBSCAN, since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) for DBSCAN (a sketch of the full pipeline follows the snippet below):
# optimal epsilon (distance) via the k-nearest-neighbour elbow method:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
# sort the distance to each point's nearest neighbour and look for the "elbow"
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances, color="#0F215A")
plt.grid(True)
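For context, a sketch of the scaling, PCA and DBSCAN steps listed above (variable names are illustrative; df stands for the 423 x 15 feature matrix and the eps value has to be read off the elbow plot):
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
# scale each feature to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(df)
# reduce from 15 to 5 principal components
X_pca = PCA(n_components=5).fit_transform(X_scaled)
# cluster with DBSCAN; eps comes from the elbow plot, min_samples is a guess
db = DBSCAN(eps=0.3, min_samples=5).fit(X_pca)
labels = db.labels_   # -1 marks noise points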
The results so far are not giving any clear indication that the data is changing over time.
Of course, it could be that the data simply is not changing over these data points. However, what are some other things I could try? It is kind of an open question, but I am running out of ideas.
First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all kinds of numeric variables (no target variable is needed for the clustering itself) and let the machine do the heavy lifting for you. Here is a simple example using the canonical Iris dataset. You can easily modify this to fit your specific dataset: just change the 'X' variables (all but the target variable) and the 'y' variable (a single target variable, used here only to check the clusters). Try that and report back.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # we take all four features.
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make a scatter plot showing the species for each sample
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
sns.FacetGrid(iris_df, hue="species", height=4) \
    .map(plt.scatter, "sepal length (cm)", "sepal width (cm)") \
    .add_legend();
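To actually check how well the clusters match the species labels in y (not part of the original snippet), one option is the adjusted Rand index or a simple cross-tabulation:
from sklearn.metrics import adjusted_rand_score
# 1.0 means the clustering matches the species labels perfectly,
# ~0.0 means it is no better than random assignment
print(adjusted_rand_score(y, y_cluster_kmeans))
# contingency table of species vs. predicted cluster
print(pd.crosstab(y, y_cluster_kmeans, rownames=['species'], colnames=['cluster']))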
From my research, only single-linkage hierarchical clustering can obtain optimal clusters. This is also known as SLINK. The implementation was originally published in C++ and is now available in Python/R.
So far, following the steps in the documentation, I have managed to come up with:
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
## generating 100 random integers from 20 to 90 and storing them in a dataframe; this is 1-dimensional data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(20, 90, size=(100, 1)), columns=list('A'))
df = df.sort_values(by=['A'])
df = df.values
## getting the condensed distance matrix
d = pdist(df, metric='euclidean')
## running the SLINK algorithm
Z = linkage(d, 'single')
I understand that Z is a 'hierarchical clustering encoded as a linkage matrix' (as written in the documentation), but I am wondering how I go back to my original data set and work out which cluster each point was assigned to by this result.
I could get a clustering result with Scikit-Learn, but I think the Scikit-Learn clustering algorithms are not optimal, hence I turned to this SLINK algorithm. I would much appreciate it if someone could help me with this.
From scipy.cluster.hierarchy.linkage you get back how the clusters are merged at each iteration.
Normally this information is not so useful on its own, so let's look at the clustering first:
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt
plt.figure()
dn = scipy.cluster.hierarchy.dendrogram(Z)
If we want to get the three clusters, we can do:
labels = scipy.cluster.hierarchy.fcluster(Z,3,'maxclust')
If you want to get it by distance between the data points:
scipy.cluster.hierarchy.fcluster(Z,2,'distance')
This gives about the same result as asking for 3 clusters, because there are not many ways to cut this example dataset.
If you look at your example, the next point at which you can cut it is at a height of about 1.5, which gives 16 clusters. So if you try scipy.cluster.hierarchy.fcluster(Z,5,'maxclust'), you get the same result as for 3 clusters. If you have a more spread-out dataset, it will work:
np.random.seed(111)
df = np.random.normal(0,1,(50,3))
## getting condensed distance matrix
d = pdist(df, metric='euclidean')
Z = linkage(d, 'single')
dn = scipy.cluster.hierarchy.dendrogram(Z,above_threshold_color='black',color_threshold=1.1)
Then this works:
scipy.cluster.hierarchy.fcluster(Z,5,'maxclust')
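Coming back to your question about recovering which cluster each original data point belongs to: fcluster returns one label per observation, in the same order as the rows passed to pdist, so you can attach the labels directly (a sketch using the Z and df from your first, 1-D example):
labels = scipy.cluster.hierarchy.fcluster(Z, 3, 'maxclust')
# labels[i] is the cluster of the i-th row of the data behind pdist/linkage
clustered = pd.DataFrame({'A': df[:, 0], 'cluster': labels})
print(clustered.groupby('cluster').size())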
I used shap to determine the feature importance for multiple regression with correlated features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import shap
boston = load_boston()
regr = pd.DataFrame(boston.data)
regr.columns = boston.feature_names
regr['MEDV'] = boston.target
X = regr.drop('MEDV', axis = 1)
Y = regr['MEDV']
fit = LinearRegression().fit(X, Y)
explainer = shap.LinearExplainer(fit, X, feature_dependence='independent')
# I used 'independent' because the result is consistent with the ordinary
# Shapley values, whereas 'correlated' is not
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type = 'bar')
shap offers a chart of the SHAP values. Is there also a statistic available? I am interested in the exact SHAP values. I read the GitHub repository and the documentation, but I found nothing on this topic.
When we look at shap_values, we see that it contains some positive and negative numbers and that its dimensions equal those of the Boston feature matrix. Linear regression is an ML algorithm that finds the optimal y = wx + b, where y is MEDV, x is the feature vector and w is the vector of weights. For a linear model, shap_values stores, for each sample, the contribution of each feature: the feature's weight from the regression multiplied by the feature's deviation from its average value.
So to calculate the desired statistic, I first took the absolute values and then averaged over them; the order is important! Next I used the original column names and sorted from the biggest effect to the smallest. With this, I hope I have answered your question! :)
from matplotlib import pyplot as plt
# retaining only the size of the effect
shap_values_abs = np.absolute(shap_values)
# mean absolute SHAP value per feature
means_norm = shap_values_abs.mean(axis=0)
# sorting values and names
idx = np.argsort(means_norm)
means = np.array(means_norm)[idx]
names = np.array(boston.feature_names)[idx]
# plotting
plt.figure(figsize=(10, 10))
plt.barh(names, means)
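If you want the numbers themselves rather than just the bar chart, a pandas Series keeps the feature names attached (a small extra step on top of the snippet above):
# mean absolute SHAP value per feature, sorted from largest to smallest effect
shap_importance = pd.Series(np.abs(shap_values).mean(axis=0),
                            index=boston.feature_names).sort_values(ascending=False)
print(shap_importance)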
I have a matrix of dimension (nw, ny, nx), where nx and ny are the dimensions of an image (photon counts) and for each pixel I have a spectral profile of nw wavelength points.
I have applied K-means clustering from the scikit-learn Python package with the number of clusters equal to ncl = 5.
dat = dat1.reshape(nw, nx*ny)
# fit once and reuse the same fitted object for both labels and centers
km = KMeans(n_clusters=ncl).fit(np.transpose(dat))
mm[:] = km.labels_
x = km.cluster_centers_
and then, by plotting x[i,:] (where i is the cluster label), I can see the 5 different average spectral profiles generated by K-means.
Now my question is the following: I would like to use these 5 cluster_centers on a different dataset of the same dimensions (nw, ny, nx) to retrieve the labels that I have called mm here. How can I do it?
Thank you in advance for your time.
As @sascha pointed out, you need to persist the fitted KMeans object to predict on future data:
dat = dat1.reshape(nw, nx*ny)
clusterer = KMeans(n_clusters=ncl).fit(np.transpose(dat))
dat2 = dat2.reshape(nw, nx*ny)
dat2_labels = clusterer.predict(np.transpose(dat2))
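If you then want to look at the predicted labels as an image again, you can reshape them back to the spatial grid (a sketch, assuming ny and nx are the same as in your question and that the original reshape used the default C order):
import matplotlib.pyplot as plt
# one label per pixel, arranged back into the image grid
label_map = dat2_labels.reshape(ny, nx)
plt.imshow(label_map)
plt.colorbar()
plt.show()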
I have an RGB image of shape (3L, 5L, 5L), i.e. a 5 by 5 pixel image with 3 channels (R, G and B). I want to cluster it using the DBSCAN algorithm as follows, but I get the error ValueError: Found array with dim 3. Expected <= 2. Can I not use DBSCAN for my 3D image?
import numpy as np
from sklearn.cluster import DBSCAN
from collections import Counter
data = np.random.rand(3, 5, 5)
print(np.shape(data))
print(data)
db = DBSCAN(eps=0.12, min_samples=3).fit(data)
print(db)
DBSCAN(algorithm='auto', eps=0.12, leaf_size=30, metric='euclidean',
min_samples=1, p=None, random_state=None)
labels = db.labels_
print(Counter(labels))
To cluster, you need to define what the distance between two points is. DBSCAN does not work on a raw 3D image array; it works on a 2D array of feature vectors. You need to represent each pixel as a feature vector, so that the distances between pixels are meaningful.
The features could just be the RGB values, in which case similar colours are clustered together. Or the features could also include the x, y coordinates, which would mean spatial distances are considered as well.
If you want to consider spatial distances, I'd suggest you take a look at scikit-image's segmentation module, which contains a couple of popular image segmentation methods.
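For the colour-only case, a minimal sketch of what that reshaping looks like (eps here is just a placeholder; the x, y columns are optional and would need sensible scaling relative to the colour values):
import numpy as np
from sklearn.cluster import DBSCAN
data = np.random.rand(3, 5, 5)          # (channels, height, width)
# one row per pixel, one column per channel: shape (25, 3)
pixels = data.reshape(3, -1).T
# optionally append x, y coordinates so spatial distance matters too
yy, xx = np.mgrid[0:5, 0:5]
features = np.column_stack([pixels, yy.ravel(), xx.ravel()])
db = DBSCAN(eps=0.5, min_samples=3).fit(pixels)   # or .fit(features)
labels = db.labels_.reshape(5, 5)                 # back to image shape
print(labels)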