I wrote this code:
import numpy
import matplotlib.pyplot as plt
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
ks = KShape(n_clusters=3, n_init=10, verbose=True, random_state=seed)
y_pred = ks.fit_predict(data)
plt.figure(figsize=(16, 9))
for yi in range(3):
    plt.subplot(3, 1, 1 + yi)
    for xx in stack_data[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.title("Cluster %d" % (yi + 1))
plt.tight_layout()
plt.show()
I want to divide my data using KShape clustering. The plot is shown, but I cannot tell which data ended up in each of the 3 clusters.
The data is ordered by kind (A, B, C, D), so I want to show the labels on the plot, or otherwise get the result of the clustering per kind. I searched the KShape documentation (http://tslearn.readthedocs.io/en/latest/auto_examples/plot_kshape.html), but I could not find how to do this. How should I do it?
Why there are no perfect solutions
K-Shape is a randomized algorithm: unless you fix the seed for every run, you may get different clusters and centroids each time. There is no deterministic way to know a priori whether a given class is completely described by a given centroid, but you can proceed offline, in a fuzzy way, by checking to which centroid the elements of a given class are mostly assigned.
Also, any given class, A for instance, could contain elements that fall into two different clusters in the feature space you are considering.
Suppose you have 3 classes but your dataset is best described (for example, by maximal average density) by 4 clusters: you would surely have some points of at least one class that end up in the 4th cluster.
Alternatively, suppose your classes do not line up with the centroids produced by the distance metric you are using. Consider an obvious example: you have 3 classes, the numbers from 0 to 100, from 100 to 1000 and from 1000 to 1100, but your dataset only contains numbers from 0 to 150 and from 950 to 1100. A clustering algorithm would find its optimum at 2 clusters and split the points of the middle class between the two.
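To make this concrete, here is a minimal sketch (not part of the answer itself) that reproduces the situation with scikit-learn's KMeans as a stand-in clustering algorithm on 1-D data:
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Nominal classes span 0-100, 100-1000 and 1000-1100, but the data only
# covers 0-150 and 950-1100, i.e. two dense groups in feature space
values = np.concatenate([rng.uniform(0, 150, 50), rng.uniform(950, 1100, 50)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(values.reshape(-1, 1))
# the middle class (100-1000) is split between the two clusters, so no
# cluster corresponds exactly to one class
print(np.unique(labels, return_counts=True))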
Once you have determined that, for example, class A goes mostly to cluster 1, class B to cluster 2 and so on, you can proceed to assign each cluster to its corresponding class.
A possible fuzzy approach
We will determine the clusters' classes by assigning to each cluster the best-fitting class, i.e. the class most of whose points it contains:
Simple example: classes that actually fit clusters
For this example we use one of the tslearn.datasets. The code is partially taken from this K-Shape example on tslearn.
import numpy as np
import matplotlib.pyplot as plt
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from seaborn import heatmap
We set the seed, for code reproducibility:
seed = 0
np.random.seed(seed)
Firstly we prepare the dataset, selecting the first classes_number=3 classes:
classes_number = 3
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
mask = y_train <= classes_number
X_train, y_train = X_train[mask], y_train[mask] # Keep first 3 classes
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train)  # Rescale each series to zero mean and unit variance
sz = X_train.shape[1]
Now we find the clusters, with clusters_number=3:
# k-Shape clustering
clusters_number = 3
ks = KShape(n_clusters=clusters_number, verbose=False, random_state=seed)
y_pred = ks.fit_predict(X_train)
We now count, for each class, how many of its elements are assigned to each cluster, and add 0 padding where no element of a given class was assigned to a given cluster (surely there is a more pythonic way to do this, but I've yet to find it):
data = [np.unique(y_pred[y_train==i+1], return_counts=True) for i in range(classes_number)]
>>>[(array([2]), array([26])),
(array([0]), array([21])),
(array([1]), array([22]))]
Adding the padding:
padded_data = np.array([[
    data[j][1][data[j][0] == i][0] if np.any(data[j][0] == i) else 0
    for i in range(clusters_number)
] for j in range(classes_number)])
>>> array([[ 0, 0, 26],
[21, 0, 0],
[ 0, 22, 0]])
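As a side note, a more compact way to build the same class-by-cluster count matrix (a sketch of mine, reusing y_train, y_pred, classes_number and clusters_number from above) is to accumulate the counts directly, zeros included:
# counts[j, i] = number of series of class j+1 assigned to cluster i
counts = np.zeros((classes_number, clusters_number), dtype=int)
np.add.at(counts, ((y_train - 1).astype(int), y_pred), 1)
# counts should match padded_data above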
Normalising the obtained matrix:
normalized_data = padded_data / np.sum(padded_data, axis=-1)[:, np.newaxis]
>>> array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
We can visualise the obtained matrix using seaborn.heatmap:
xticklabels = ["Cluster n. %s" % (1+i) for i in range(clusters_number)]
yticklabels = ["Class n. %s" % (1+i) for i in range(classes_number)]
heatmap(
    normalized_data,
    cbar=False,
    square=True,
    annot=True,
    cmap="YlGnBu",
    xticklabels=xticklabels,
    yticklabels=yticklabels)
plt.yticks(rotation=0)
Obtaining:
In this optimal situation every cluster contains exactly one class, so we obtain, with perfect precision:
classes_clusters = np.argmax(normalized_data, axis=1)
>>> array([2, 0, 1])
Second example: classes that do not overlap with clusters
For simplicity's sake, to simulate classes that do not overlap completely with the clusters, I am just going to shuffle part of the labels; but there is a vast range of real examples: most clustering problems end up with classes that do not exactly coincide with a cluster.
tmp = y_train[:20]
np.random.shuffle(tmp)
y_train[:20] = tmp
Now, when we execute the script again we get quite a different matrix:
But we are still able to determine the classes' clusters:
classes_clusters = np.argmax(normalized_data, axis=1)
>>> array([2, 0, 1])
Third example: classes that do not exist in the dataset
Suppose we were led to believe that the dataset contained 4 classes; after running with different values of k, we would find that the best number of clusters for our current dataset is k=3. How would we then assign the classes to the clusters? Which class should be thrown away?
We simulate such a situation by arbitrarily assigning a fourth class to part of our labels:
y_train[:20] = 4
Running again our script we would obtain:
Clearly the 4th class has got to go. We can proceed by thresholding on the mean variance:
threshold = np.mean(np.var(normalized_data, axis=1))
result = np.argmax(normalized_data[np.var(normalized_data, axis=1)>threshold], axis=1)
And we obtain yet again:
array([2, 0, 1])
I hope this explanation has cleared most of your doubts!
Related
I've seen that it's common practice to delete input features that demonstrate collinearity (and keep only one of them).
However, I've just completed a course explaining that a linear regression model assigns different weights to different features, and I thought that the model might do better than us by giving a low weight to less useful features instead of our deleting them completely.
To try to resolve this doubt myself, I created a small dataset resembling an x-squared function and applied two linear regression models using Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features and should instead let the model decide the best weights. However, I would like to ask the community whether the rationale of my exercise is right, and whether you have encountered this question elsewhere.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces this:
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15,5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
    # Initiate model, fit and make predictions
    lr = LinearRegression()
    lr.fit(df[model], df["y"])
    predicted = lr.predict(df[model])
    # Calculate mean squared error of the model
    mse = mean_squared_error(all_Y, predicted)
    # Create a subplot for each model
    plt.subplot(1, 2, i)
    plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
    plt.scatter(df["x"], df["y"], c="red", label="y")
    plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
    plt.legend()
# Display results
plt.show()
which generates this:
What do you think about this issue? The difference in mean squared error can be of high importance in certain contexts.
x and x^2 are not linearly related, which is why deleting one of them does not help the model. The general rule for regression is to delete features that are highly collinear with each other (that is, highly correlated).
So x_2 and y are highly correlated, and you are trying to predict y with x_2? A high correlation between a predictor variable and the response variable is usually a good thing, and since x and y are practically uncorrelated, keeping x is likely to "dilute" your model and thereby worsen its performance.
(Multi-)collinearity between the predictor variables themselves would be more problematic.
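If you want to check these claims yourself, here is a small sketch (not part of either answer) that computes the Pearson correlations on the question's own toy data:
import numpy as np
import pandas as pd

all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
df = pd.DataFrame({"x": all_X, "x_2": np.square(all_X), "y": all_Y})

# x_2 vs y is strongly correlated, x vs y only weakly, and x vs x_2 is
# exactly zero on this symmetric grid, so x and x_2 are not collinear here
print(df.corr())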
I have a notebook where I want to analyze the clusters from sklearn.cluster.KMeans. When I run the code, the clusters are the same, but the labels applied can vary. This makes it impossible for me to refer to a cluster by label in the markdown sections of the notebook. I am wondering why this occurs even when setting the random_state. It appears that random_state is only allowing for the clustering to be the same, but why does it not also apply the same label values each time? The code below will replicate the issue and the plot shows how the labels can vary.
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2)
x = np.random.normal(size=(1800, 2))
x[:700, 0] += 3
x[:700, 1] += 3
x[700:1200, 0] -= 0.5
x[700:1200, 1] -= 0.5
x[1200:, 0] += 3
x[1200:, 1] -= 3
np.random.shuffle(x)
first = None
while True:  # it typically only takes a few iterations for a difference to occur
    km = KMeans(n_clusters=3, random_state=10)
    km.fit(x)
    pred = km.predict(x)
    if first is None:
        first = pred
    elif not np.array_equal(first, pred):
        print(first)
        print(pred)
        fig, ax = plt.subplots(1, 2)
        for label in range(3):
            clusters = x[first == label]
            cluster = x[pred == label]
            ax[0].scatter(clusters[:, 0], clusters[:, 1], label=label)
            ax[1].scatter(cluster[:, 0], cluster[:, 1], label=label)
        break
ax[0].legend()
ax[1].legend()
plt.show()
[0 1 1 ... 2 0 0] # labels for first run
[0 2 2 ... 1 0 0] # different labels for later run
Furthermore, I am confused as to why the verbose output is not exactly the same when using the same random_state.
I have noticed a couple of things. First, np.random.seed(1) does not trigger this problem, so it appears to be data-dependent. Second, with n_jobs=1 this does not seem to occur, but the default n_jobs=None gives different results (both labels and verbose output). Is the parallelization causing this to happen?
It would be good to know whether this is a bug that I should report to the scikit-learn devs, or an issue specific to my case that will require a workaround.
It appears that this is due to rounding errors. The best labels are updated whenever the current inertia is better than the previous best; due to rounding errors, however, an inertia value that is actually equal to the current best can appear slightly better, so the labels get updated unnecessarily. There is currently a PR addressing this issue.
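The answer above explains the cause; if you just need stable labels for the markdown in your notebook, one possible workaround (a sketch of mine, not from the answer) is to re-map the labels according to the sorted cluster centers, so that the same cluster always receives the same label regardless of the internal ordering:
import numpy as np
from sklearn.cluster import KMeans

def fit_with_stable_labels(X, n_clusters, random_state=10):
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
    # sort the clusters lexicographically by their center coordinates ...
    order = np.lexsort(km.cluster_centers_.T[::-1])
    # ... and build a mapping old_label -> new_label from that ordering
    remap = np.empty(n_clusters, dtype=int)
    remap[order] = np.arange(n_clusters)
    return remap[km.labels_], km.cluster_centers_[order]

# e.g. pred, centers = fit_with_stable_labels(x, 3)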
I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method, and I came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers, while the last 20 are outliers. When you call fit_predict on X, each point in y_pred is marked as either an outlier (-1) or an inlier (1). So to get the predicted outliers, you need to take the entries where y_pred is -1 and select the corresponding rows of X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into one sequence and check whether y is -1; if so, I collect the corresponding X values.
However, there are eight errors in the predictions (8 out of 220): -1 values in y_pred[:200] and 1 values in y_pred[201:220]. Please be aware of these errors as well.
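Since X here is a NumPy array, the same selection can also be written with a boolean mask (a shorter equivalent sketch):
X_pred_outliers = X[y_pred == -1]            # rows predicted as outliers
outlier_indices = np.where(y_pred == -1)[0]  # their positions in X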
I have written Python code using Sklearn to cluster my dataset:
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
I am exploring the use of query-by-clustering, so I form an initial training dataset by:
td_title =[]
td_abstract = []
td_y= []
for each in centers:
    td_title.append(title[each])
    td_abstract.append(abstract[each])
    td_y.append(y[each])
I then train my model (an SVM) on it by:
clf = svm.SVC()
clf.fit(X, data_y)
I wish to write a function that, given the centres, the model, the X values and the Y values, will append the 5 data points the model is most unsure about, i.e. the data points closest to the hyperplane. How can I do this?
The first steps of your process aren't entirely clear to me, but here's a suggestion for "Select(ing) 5 data points closest to SVM hyperplane". The scikit documentation defines decision_function as the distance of the samples to the separating hyperplane. The method returns an array which can be sorted with argsort to find the "top/bottom N samples".
Following this basic scikit example, define a function closestN to return the samples closest to the hyperplane.
import numpy as np
def closestN(X_array, n):
    # array of sample distances to the hyperplane (uses the fitted clf from the scikit example)
    dists = clf.decision_function(X_array)
    # absolute distance to hyperplane
    absdists = np.abs(dists)
    return absdists.argsort()[:n]
Add these two lines to the scikit example to see the function implemented:
closest_samples = closestN(X, 5)
plt.scatter(X[closest_samples][:, 0], X[closest_samples][:, 1], color='yellow')
[Plot: original samples]
[Plot: closest samples highlighted]
If you need to append the samples to some list, you could somelist.append(closestN(X, 5)). If you needed the sample values you could do something like somelist.append(X[closestN(X, 5)]).
closestN(X, 5)
array([ 1, 20, 14, 31, 24])
X[closestN(X, 5)]
array([[-1.02126202, 0.2408932 ],
[ 0.95144703, 0.57998206],
[-0.46722079, -0.53064123],
[ 1.18685372, 0.2737174 ],
[ 0.38610215, 1.78725972]])
I have the MovieLens dataset and I want to apply PCA to it, but the sklearn PCA function does not seem to do it correctly.
I have a 718*8913 matrix whose rows are the users and whose columns are the movies.
Here is my Python code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
Load movie names and movie ratings
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)
def replace_name(x):
    return movies[movies['movieId'] == x].title.values[0]
ratings.movieId = ratings.movieId.map(replace_name)
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.replace(np.nan, 0, regex=True)
Standardizing
X_std = StandardScaler().fit_transform(df1)
Apply PCA
pca = PCA()
result = pca.fit_transform(X_std)
print(result.shape)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
I didn't set any number of components, so I expected PCA to return a 718*8913 matrix in the new dimensions, but the result size is 718*718, pca.explained_variance_ratio_ has size 718, and the sum of all its members is 1. How is this possible?
I have 8913 features, yet it returns only 718, and the sum of their explained variance is equal to 1. Can anyone explain what is wrong here?
Here is the resulting plot:
As you can see in the picture above, it contains just 718 components and their sum is 1, but I have 8913 features. Where did they go?
Test with a smaller example
I even tried the scikit-learn PCA example from the PCA documentation page (here is the link). I changed the example and just increased the number of features:
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
X = np.array([[-1, -1,3,4,-1, -1,3,4], [-2, -1,5,-1, -1,3,4,2], [-3, -2,1,-1, -1,3,4,1],
[1, 1,4,-1, -1,3,4,2], [2, 1,0,-1, -1,3,4,2], [3, 2,10,-1, -1,3,4,10]])
ipca = PCA(n_components = 7)
print (X.shape)
ipca.fit(X)
result = ipca.transform(X)
print (result.shape);
In this example we have 6 samples and 8 features. I set n_components to 7, but the result size is 6*6.
I think that when the number of features is bigger than the number of samples, the maximum number of components scikit-learn PCA will return is equal to the number of samples.
See the documentation on PCA.
Because you did not pass an n_components parameter to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components, which is why you get a reduced feature set equal to n_samples.
I believe your variance is equal to 1 because you didn't set the n_components, from the documentation:
If n_components is not set then all components are stored and the sum
of explained variances is equal to 1.0.
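As a quick sanity check of this behaviour (my addition, not part of the answer), you can run PCA on random data with more features than samples:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_small = rng.randn(6, 8)                    # 6 samples, 8 features
pca = PCA().fit(X_small)                     # n_components not set

print(pca.n_components_)                     # 6, i.e. min(n_samples, n_features)
print(pca.explained_variance_ratio_.sum())   # ~1.0, all variance retained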