Clustering data - poor results, feature extraction - Python

I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represent a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
1. Scale the data to the range [0, 1].
2. Perform PCA (reduce from 15 features to 5 components).
3. Cluster using DBSCAN, since I do not know the number of clusters. I am using the code below to find a suitable epsilon (eps) for DBSCAN (a sketch of the full pipeline follows the snippet):
# optimal epsilon (distance): sort and plot each point's nearest-neighbour distance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]  # column 0 is each point's distance to itself
plt.plot(distances, color="#0F215A")
plt.grid(True)
plt.show()
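For reference, here is a minimal sketch of the whole pipeline described above (scaling to [0, 1], PCA down to 5 components, then DBSCAN). The feature matrix and the eps value are placeholders; eps should come from the knee of the k-distance plot.
# Minimal sketch of the pipeline described above. The feature matrix and
# eps=0.3 are placeholders; read eps off the knee of the k-distance plot.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

features = np.random.rand(423, 15)                    # stand-in for the real 423 x 15 feature matrix

X_scaled = MinMaxScaler().fit_transform(features)     # scale each feature to [0, 1]
X_pca = PCA(n_components=5).fit_transform(X_scaled)   # reduce 15 features to 5 components

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_pca)
print(np.unique(labels, return_counts=True))          # label -1 marks noise points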
The results so far are not giving any clear indication that the data is changing over time:
Of course, it could simply be that the data is not changing over these data points. However, what are some other things I could try? It is kind of an open question, but I am running out of ideas.

First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all kinds of numeric variables and let the machine do the heavy lifting for you; the target variable is only used afterwards to check how well the clusters line up with it. Here is a simple example using the canonical Iris dataset. You can easily modify this to fit your specific dataset: just change the 'X' variables (everything but the target variable) and the 'y' variable (the single target variable). Try that and give feedback.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # use all four features
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make a scatter plot showing the species for each sample;
# first build a small dataframe with named feature columns and species labels
iris_df = pd.DataFrame(iris.data[:, 0:2], columns=["sepal_length", "sepal_width"])
iris_df["species"] = [iris.target_names[i] for i in iris.target]
sns.FacetGrid(iris_df, hue="species", height=4) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend()

Related

How to give more importance to some features in sklearn Isolation Forest

I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way:
We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I would have to change the function _generate_bagging_indices in _bagging.py, but I am not sure.
You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features you wish to give more weight to. If a feature appears twice, the trees will use it twice to split your data, which in practice amounts to doubling the weight of that feature.
In addition to this, you can also reduce the number of features used by each tree in the isolation forest. This is controlled by the max_features argument. The default value of 1.0 means every feature is used for every tree; by reducing it, more trees will be trained without the less heavily duplicated features in your input.
Illustration
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
    df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
As you can see visually, the second option has used the X-axis more intensively to determine which are the outliers.

Clustering geospatial data on coordinates AND non spatial feature

Say I have the following dataframe stored in a variable called coordinates, where the first few rows look like this:
   business_lat  business_lng  business_rating
0     19.111841     72.910729               5.
1     19.111342     72.908387               5.
2     19.111342     72.908387               4.
3     19.137815     72.914085               5.
4     19.119677     72.905081               2.
5     19.119677     72.905081               2.
..          ...           ...              ...
As you can see, this data is geospatial (it has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the lat/lng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially I need geospatial clustering with the additional requirement that the clustering must consider the rating column.
I've looked online and can't really find much addressing approaches for this: only things for strictly geospatial clustering (where the only features to cluster on are lat/lng) or non-spatial clustering.
I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data or using a different approach all together?
The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. This is straightforward when there is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within each domain and then determine some interaction between the domains. This could be done using scaling options like the MinMaxScaler mentioned in the other answer; however, I think that is a bit heavy-handed, and we can use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly, as it has a real-world meaning for which we can precalculate distances. A gap measured in metres corresponds directly to how far apart two businesses actually are.
You could use the scaling option mentioned in the other answer, but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than on the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat/lon values themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
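If you do want to bring the distances onto a common scale before combining them with ratings, a minimal sketch that scales the distance matrix itself (not the raw lat/lon) might look like this:
# Sketch: scale the precomputed distance matrix, not the raw coordinates.
from sklearn.preprocessing import MinMaxScaler

scaled_distances = MinMaxScaler().fit_transform(
    distances_in_km.reshape(-1, 1)      # flatten so all pairwise distances share one scale
).reshape(distances_in_km.shape)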
Adding in Rating
We can think about the problem by asking a couple of questions that relate rating to distance: how different must ratings be to separate observations in the same place? What is the ratio of metre difference to rating difference? With an idea of that ratio, we can calculate another distance matrix for the rating difference between all observations and use it to scale, or simply add to, the original location distance matrix. This location-plus-rating-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation, forming clusters that either blend ratings together because the penalty is not high enough or separate into single-rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1, 4) for _ in range(len(X)//2)]
                   + [random.randint(2, 5) for _ in range(len(X)//2)]).reshape(-1, 1)

distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)

def build_clusters(multiplier, eps):
    rating_addition = (rating_distances ** 2) * multiplier
    distance_matrix = rating_addition + distances_in_km
    clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
    clustering.fit(distance_matrix)
    return clustering.labels_
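One possible way to use this helper (the multipliers and eps below are arbitrary choices, not tuned values) is to sweep a few penalty strengths and compare the clusterings visually:
# Sketch: try a few rating-penalty multipliers and eyeball the resulting clusters.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, multiplier in zip(axes, [0.0, 1.0, 5.0]):     # arbitrary penalty strengths
    labels = build_clusters(multiplier=multiplier, eps=1.0)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=15)
    ax.set_title("multiplier = {}".format(multiplier))
plt.show()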
Using the DBSCAN methodology, we can calculate the distance between points (the Euclidean distance or some other distance) and look for points which are far away from others. You may want to consider using the MinMaxScaler to normalize values, so one feature doesn't overwhelm other features.
Where is your code and what are your final results? Without an actual code sample, I can only guess what you are doing.
I hacked together some sample code for you. You can see the results below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv
df = pd.read_csv('C:\\your_path_here\\business.csv')
X=df.loc[:,['review_count','latitude','longitude']]
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
# visualize the elbow curve
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
kmeans = KMeans(n_clusters = 3, init ='k-means++')
kmeans.fit(X[X.columns[0:2]]) # Compute k-means clustering.
X['cluster_label'] = kmeans.fit_predict(X[X.columns[0:2]])
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
labels = kmeans.predict(X[X.columns[0:2]]) # Labels of each point
X.head(10)
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
from scipy.stats import zscore
df["zscore"] = zscore(df["review_count"])
df["outlier"] = df["zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
df[df["outlier"]]
df_cord = df[["latitude", "longitude"]]
df_cord.plot.scatter(x = "latitude", y = "latitude")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cord = scaler.fit_transform(df_cord)
df_cord = pd.DataFrame(df_cord, columns = ["latitude", "longitude"])
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
eps = 0.5,
metric="euclidean",
min_samples = 3,
n_jobs = -1)
clusters = outlier_detection.fit_predict(df_cord)
clusters
from matplotlib import cm
cmap = cm.get_cmap('Accent')
df_cord.plot.scatter(
x = "latitude",
y = "longitude",
c = clusters,
cmap = cmap,
colorbar = False
)
The final result looks a little weird, to tell you the truth. Remember, not everything is clusterable.

plot clusters of kmeans of sparse matrix

I have a Python script that does clustering on a data file in svmlight format.
I use the function sklearn.datasets.load_svmlight_file to load the data from the data file.
I know that this function returns a sparse matrix.
I need to scatter plot the clusters; can anybody help me, please?
This is what I have done:
import sklearn.datasets
import sys
from sklearn.cluster import KMeans
dataFilename = sys.argv[1]
X, y = sklearn.datasets.load_svmlight_file(dataFilename)
kmeans = KMeans(n_clusters = 3)
kmeans.fit(X)
labels = kmeans.labels_
print(labels)
centroids = kmeans.cluster_centers_
Without having the dataset, I would suggest the following:
Since load_svmlight_file() returns a sparse matrix, turn X into a NumPy array using samples = X.toarray() prior to fitting the model.
Plot two features (for example) of the dataset using:
plt.scatter(samples[:,0], samples[:,1], c=labels). This colours the clusters by their predicted labels.
Follow this with plt.scatter(centroids[:,0], centroids[:,1], marker='D') to see the location of the centroids with diamonds.
Note that samples[:,n] represents an array containing the sample values for the nth feature of the dataset.
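Putting those steps together, a minimal sketch might look like the following (it assumes X from your script, and plotting the first two features is an arbitrary choice):
# Sketch combining the steps above: densify the sparse matrix, refit KMeans,
# then plot the first two features coloured by cluster, with centroids as diamonds.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

samples = X.toarray()                      # sparse matrix -> dense NumPy array
kmeans = KMeans(n_clusters=3)
kmeans.fit(samples)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

plt.scatter(samples[:, 0], samples[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D', s=80)
plt.xlabel("feature 0")
plt.ylabel("feature 1")
plt.show()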
I hope this helps. If not, please let me know.

PCA on sklearn - how to interpret pca.components_

I ran PCA on a data frame with 10 features using this simple code:
pca = PCA()
fit = pca.fit(dfPca)
The result of pca.explained_variance_ratio_ shows:
array([ 5.01173322e-01, 2.98421951e-01, 1.00968655e-01,
4.28813755e-02, 2.46887288e-02, 1.40976609e-02,
1.24905823e-02, 3.43255532e-03, 1.84516942e-03,
4.50314168e-16])
I believe that means the first PC explains about 50% of the variance, the second component explains about 30%, and so on...
What I don't understand is the output of pca.components_. If I do the following:
df = pd.DataFrame(pca.components_, columns=list(dfPca.columns))
I get the data frame below, where each line is a principal component.
What I'd like to understand is how to interpret that table. I know that if I square all the features on each component and sum them I get 1, but what does the -0.56 on PC1 mean? Does it tell us something about "Feature E", since it has the highest magnitude on a component that explains about 50% of the variance?
Thanks
Terminology: First of all, the results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
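To make the terminology concrete in sklearn terms, here is a small sketch (it assumes dfPca from your question):
# Sketch: scores vs. loadings in sklearn terms, assuming dfPca from the question.
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(dfPca)

scores = pca.transform(dfPca)    # component (factor) scores: one row per data point
loadings = pca.components_.T     # weights of each original variable on each PC
                                 # (rows = original features, columns = PCs)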
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
PART 1:
In your case, the value -0.56 for Feature E is the loading (weight) of this feature on PC1. This value tells us 'how much' the feature influences the PC (in our case, PC1).
So the higher the absolute value, the higher the influence on the principal component.
After performing the PCA analysis, people usually plot the well-known 'biplot' to see the transformed samples in N dimensions (2 in our case) together with the original variables (features).
I wrote a function to plot this.
Example using iris data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    plt.scatter(xs, ys, c=y)                      # colour the samples by class
    for i in range(n):
        # draw an arrow for each original variable and label it
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color='g', ha='center', va='center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function: scores of the first two PCs and the weights of each
# variable on those two PCs (hence the transpose).
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
Results
PART 2:
The important features are the ones that influence more the components and thus, have a large absolute value on the component.
TO get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
     0  1
0  PC0  e
1  PC1  d
So on the first principal component the feature named e is the most important, and on the second it is d.
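If instead you want the full table of weights for every feature on every PC (not just the single most important one), a small variant reusing model, n_pcs and initial_feature_names from above:
# Full table of component weights: one row per original feature, one column per PC.
loadings_df = pd.DataFrame(
    model.components_.T,
    index=initial_feature_names,
    columns=['PC{}'.format(i) for i in range(n_pcs)]
)
print(loadings_df)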
Basic Idea
The principal component breakdown by features that you have there basically tells you the "direction" each principal component points in, in terms of the original features.
In each principal component, features with a greater absolute weight "pull" that principal component more towards their direction.
For example, we can say that in PC1, since Feature A, Feature B, Feature I, and Feature J have relatively low weights (in absolute value), PC1 does not point much in the direction of those features in feature space. PC1 points most strongly in the direction of Feature E relative to the other features.
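As a toy illustration of this "direction" reading (the data below is synthetic, not related to your table):
# Toy illustration: when the data varies almost entirely along one feature,
# the first PC's weight vector points almost entirely at that feature.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(0, 5.0, 500),    # high-variance feature
                         rng.normal(0, 0.5, 500)])   # low-variance feature

pca_toy = PCA().fit(X_toy)
print(pca_toy.components_[0])   # close to +/-[1, 0]: PC1 "points at" the high-variance feature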
Visualization in Lower Dimensions
For a visualization of this, look at the following figures taken from here and here:
The following shows an example of running PCA on correlated data.
We can visually see that both eigenvectors derived from PCA are being "pulled" in both the Feature 1 and Feature 2 directions. Thus, if we were to make a principal component breakdown table like the one you made, we would expect to see some weightage from both Feature 1 and Feature 2 explaining PC1 and PC2.
Next, we have an example with uncorrelated data.
Let us call the green principal component PC1 and the pink one PC2. It's clear that PC1 is not pulled in the direction of feature x', and neither is PC2 pulled in the direction of feature y'.
Thus, in our table, we must have a weightage of 0 for feature x' in PC1 and a weightage of 0 for feature y' in PC2.
I hope this gives an idea of what you're seeing in your table.

Genextreme fit not working for some datasets

I'm trying to fit a GEV distribution to temperature data to help identify extreme values. I have data sets for different regions - for some regions the fit works fine but for others it breaks down. It appears that it is setting the location parameter close to the maximum of the distribution range. All data sets are large, of the same size, complete and have no particularly strange values.
Could you please suggest what might be happening or how I can investigate the genextreme function process to work out what the problem is?
Here are the relevant bits of code (values are read in from NetCDF without any problem):
import pandas as pd
import numpy as np
import netCDF4 as nc
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev
# calculate GEV fit
fit = gev.fit(season_temp)
# GEV parameters from fit
c, loc, scale = fit
fit_mean= loc
min_extreme,max_extreme = gev.interval(0.99,c,loc,scale)
# evenly spread x axis values for pdf plot
x = np.linspace(min(season_temp),max(season_temp),200)
# plot distribution
fig,ax = plt.subplots(1, 1)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(season_temp, 30, density=True, alpha=0.3)
And here are two examples of outputs from different regions, successful and not:
Successful fit
Unsuccessful fit
The successfully fitted distribution has a location parameter of 1.066 compared to a data mean of 2.395. The one that failed has a location parameter of 12.202 compared to a data mean of 2.138.
Thanks in advance for your help!
