I’ve been trying to determine the silhouette scores for each sample in a data set, which contains two different classes. However, the distribution and sample values change depending on how I’ve sorted my data ahead of time. For example, if I sort my dataframe by the class labels (0 & 1) in ascending vs descending order prior to calling silhouette_samples(), the silhouette scores change.
Can someone help me figure out what’s going on? I’d like to know:
Is there a bug in my code that I’m not aware of?
Is this normal behavior of the sklearn silhouette_samples function that I’m ignorant of?
Or is this a bug in the sklearn silhouette_samples?
The effect occurs with the following code:
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# df_myfeatures: data frame containing features and class labels
'''data frame sorted by output labels in ascending order'''
df1 = df_myfeatures.copy().sort_values(['output_label'], ascending = True)
'''data frame sorted by output labels in descending order'''
df2 = df_myfeatures.copy().sort_values(['output_label'], ascending = False)
# standardize the features ahead of time since they’re on different scales
# the X matrix has 26k rows (observations) and 9 columns (features)
standard_scaler = StandardScaler()
X1 = standard_scaler.fit_transform(df1[cols]) #cols is just a list of columns for fitting
X2 = standard_scaler.fit_transform(df2[cols])
y1 = df1['output_label']
y2 = df2['output_label']
'''find the silhouette scores'''
ss1 = silhouette_samples(X1,y1)
ss2 = silhouette_samples(X2,y2)
'''plot the distribution'''
plt.hist(ss1, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted ascending')
plt.hist(ss2, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted descending')
plt.title('distribution of silhouette scores')
plt.legend()
plt.show()
Which generates the following distributions of scores:
[Figure: histogram of silhouette scores]
As you can see, the distribution of scores changes depending on the order of the data. I’ve verified that StandardScaler does not produce different results depending on the order, and that pandas sorting does not somehow misalign rows across columns. I'm completely stumped: as far as I was aware, there shouldn't be any order effects in the calculation of silhouette scores.
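The expectation that row order should not matter can be checked on synthetic data. The following is only a minimal sketch: make_blobs is a stand-in for the real data set, and plain NumPy arrays are used throughout so that pandas index alignment cannot play a role.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 300 rows, 9 features, 2 classes.
X_raw, y = make_blobs(n_samples=300, n_features=9, centers=2, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Scores for the rows in their original order.
ss_original = silhouette_samples(X, y)

# Scores for the same rows after sorting by class label (descending).
order = np.argsort(-y, kind="stable")
ss_sorted = silhouette_samples(X[order], y[order])

# Undo the sort so that position i refers to the same sample in both arrays.
ss_realigned = np.empty_like(ss_sorted)
ss_realigned[order] = ss_sorted

print(np.allclose(ss_original, ss_realigned))
```

If the same comparison fails on the real data, the difference is more likely to come from how the inputs are passed than from the silhouette definition itself.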
Please, help me understand this behavior! Thanks!
Note: I’m running on a Windows 10 machine, using Anaconda 4.0 (64 bit), Python v3.5.1, sklearn v0.17.1 and Pandas v0.18.0
I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way:
We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I have to change the function _generate_bagging_indices in _bagging.py, but I'm not sure.
You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features whose weight you wish to increase. If a feature appears twice, it is roughly twice as likely to be drawn when a tree picks a feature to split on, which in practice has much the same effect as doubling the weight of that feature.
In addition to this, you can also choose to reduce the number of features used by your isolation forest in each tree. This is controlled by the argument max_features. The default value of 1.0 makes every feature available to each tree. By reducing it, more trees will be trained without the features that appear less often in your input.
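To make the effect of duplication concrete, here is a small arithmetic sketch. It assumes a single uniform draw over all columns and ignores the extra subsampling done by max_features; the helper selection_probability is made up for illustration and is not part of sklearn.

```python
from fractions import Fraction

def selection_probability(n_original, n_copies):
    """Chance that one uniform draw over all columns hits the weighted feature.

    n_original: number of distinct features before duplication
    n_copies:   number of extra copies appended for the weighted feature
    """
    total_columns = n_original + n_copies
    columns_carrying_feature = 1 + n_copies
    return Fraction(columns_carrying_feature, total_columns)

# Iris has 4 features; appending 10 copies of one of them gives 14 columns.
p_weighted = selection_probability(n_original=4, n_copies=10)
p_other = Fraction(1, 4 + 10)

print(p_weighted, float(p_weighted))  # 11/14, about 0.786
print(p_other, float(p_other))        # 1/14, about 0.071
```

Under these assumptions the duplicated feature is picked eleven times as often as any single other feature.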
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
# the forest has to be fitted before it can predict
preds = IF.fit_predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
    df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
# fit on the widened data frame, then predict
preds1 = IF1.fit_predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
As you can see visually, the second option has used the X-axis more intensively to determine which are the outliers.
I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represents a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
Scale the data between 0 and 1
Perform PCA (reduce from 15 to 5)
Cluster using db scan since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) in dbscan:
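# optimal Epsilon (distance):
import numpy as np
from sklearn.neighbors import NearestNeighbors
X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
# column 0 is each point's distance to itself, column 1 the nearest other point
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:,1]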
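For context, the sorted distances are normally turned into an eps value and passed to DBSCAN. The sketch below shows one way this could look. It runs on synthetic data from make_blobs, and both the 95th-percentile rule for eps and min_samples=2 are assumptions made for illustration, not values taken from the real analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the 423 x 5 PCA output.
X_demo, _ = make_blobs(n_samples=423, n_features=5, centers=2, random_state=0)

# Same k-distance computation as above: distance to the nearest other point.
nbrs = NearestNeighbors(n_neighbors=2).fit(X_demo)
distances, _ = nbrs.kneighbors(X_demo)
k_distances = np.sort(distances[:, 1])

# Assumption: use the 95th percentile as a crude substitute for reading the
# elbow off the sorted k-distance plot.
eps = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X_demo)

# -1 marks points DBSCAN treats as noise. With chronologically ordered rows,
# the label sequence would show whether cluster membership shifts over time.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("eps:", round(eps, 3), "clusters:", n_clusters)
print("labels of first 10 rows:", labels[:10])
```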
The results so far do not give any clear indication that the data is changing over time.
Of course, it could be that the data is not changing over these data points. However, what are some other things I could try? Kind of an open question, but I am running out of ideas.
First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all kinds of numeric variables, set the number of clusters, and let the machine do the heavy lifting for you. Here is a simple example using the canonical Iris dataset. You can EASILY modify this to fit your specific dataset. Just change the 'X' variables (all but the target variable) and the 'y' variable (just one target variable, used here only to check the clusters). Try that and report back.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # take all four features
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# the scaler must be fitted before it can transform
X_scaled_array = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
# fit the model and predict the cluster for each data point
y_cluster_kmeans = km.fit_predict(X_scaled)
# use seaborn to make scatter plot showing species for each sample
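# note: `data` is not defined in this snippet; it is assumed to be a DataFrame
# with "species", "sepal_length" and "sepal_width" columns
# (`height` was called `size` in seaborn < 0.9)
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width")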
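The comparison between clusters and species mentioned in the comments above can also be done numerically. This is a minimal, self-contained sketch; the cross-tabulation and the adjusted Rand index are my choice of summary, not something the original code computes.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

# n_init is set explicitly so the result does not depend on the library default.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(X_scaled)

# Rows: true species, columns: cluster ids (the ids themselves are arbitrary).
table = pd.crosstab(pd.Series(iris.target_names[y], name="species"),
                    pd.Series(clusters, name="cluster"))
print(table)

# 1.0 means perfect agreement with the species; values near 0 mean chance level.
ari = adjusted_rand_score(y, clusters)
print("adjusted Rand index:", ari)
```

For a dataset without known labels, as in the turbine case, only the cluster assignments are available, so this kind of check is limited to toy data like Iris.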
In PyMC3, single new observations passed via set_data() are currently not handled correctly by sample_posterior_predictive(), which in such cases predicts the training data instead (see #3640). Therefore, I decided to add a second artificial row, which is identical to the first one, to my input data in order to bypass this behavior.
Now, I stumbled across something that I currently fail to make sense of: the predictions for the first and second row are different. With a constant random_seed, I would have expected the two predictions to be identical. Can anyone please (i) affirm that this is intended behavior rather than a bug and, if so, (ii) explain why sample_posterior_predictive() creates different results for one and the same input data?
Here's a reproducible example based on the iris dataset, where petal width and length serve as predictor and response, respectively, and everything but the last row is used for training. The model is subsequently tested against the last row. pd.concat() is used to duplicate the first row of the test data frame to circumvent the above bug.
import seaborn as sns
import pymc3 as pm
import pandas as pd
import numpy as np
### . training ----
dat = sns.load_dataset('iris')
trn = dat.iloc[:-1]
with pm.Model() as model:
    s_data = pm.Data('s_data', trn['petal_width'])
    outcome = pm.glm.GLM(x = s_data, y = trn['petal_length'], labels = 'petal_width')
    trace = pm.sample(500, cores = 1, random_seed = 1899)
### . testing ----
tst = dat.iloc[-1:]
tst = pd.concat([tst, tst], axis = 0, ignore_index = True)
with model:
    pm.set_data({'s_data': tst['petal_width']})
    ppc = pm.sample_posterior_predictive(trace, random_seed = 1900)
np.mean(ppc['y'], axis = 0)
# array([5.09585088, 5.08377112]) # mean predicted value for [first, second] row
I don't think it's a bug and I also don't find it troubling. Since PyMC3 doesn't check whether the points being predicted are identical, it treats them separately and each one results in a random draw from the model. While each PPC draw (row in ppc['y']) is using the same random parameter settings for the GLM taken from the trace, the model is still stochastic (i.e., there is always measurement error). I think this explains the difference.
If you increase the number of draws in the PPC, you will see that the difference in the means decreases, which is consistent with this just being a difference in sampling.
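The same effect can be reproduced without PyMC3. The sketch below is a plain NumPy simulation under stated assumptions: the parameter distributions are made-up stand-ins for a trace, and ppc_means is a hypothetical helper. Both rows share the parameter draws but each receives its own noise draw.

```python
import numpy as np

rng = np.random.default_rng(1900)

def ppc_means(n_draws, x_new=2.0):
    # Hypothetical posterior draws for a linear model y = a + b*x + noise.
    intercept = rng.normal(1.1, 0.07, size=n_draws)
    slope = rng.normal(2.2, 0.05, size=n_draws)
    sigma = np.abs(rng.normal(0.5, 0.03, size=n_draws))

    # Both rows share the same parameter draws and the same predictor value.
    mu = intercept + slope * x_new

    # Each row still gets its own independent noise draw.
    y_rep = rng.normal(mu[:, None], sigma[:, None], size=(n_draws, 2))
    return y_rep.mean(axis=0)

for n_draws in (500, 50_000):
    first, second = ppc_means(n_draws)
    print(n_draws, first, second, abs(first - second))
```

With more draws the two column means should move closer together, since the gap is sampling noise that shrinks roughly with the square root of the number of draws.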
I ran PCA on a data frame with 10 features using this simple code:
from sklearn.decomposition import PCA
pca = PCA()
fit = pca.fit(dfPca)
The result of pca.explained_variance_ratio_ shows:
array([ 5.01173322e-01, 2.98421951e-01, 1.00968655e-01,
4.28813755e-02, 2.46887288e-02, 1.40976609e-02,
1.24905823e-02, 3.43255532e-03, 1.84516942e-03,
I believe that means that the first PC explains about 50% of the variance, the second component explains about 30%, and so on...
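To read the ratios as percentages, they can be formatted and accumulated. This sketch uses only the nine values visible in the output above; the tenth is cut off there and is left out rather than guessed.

```python
import numpy as np

# The nine ratios visible in the output above (the tenth is cut off there).
ratios = np.array([5.01173322e-01, 2.98421951e-01, 1.00968655e-01,
                   4.28813755e-02, 2.46887288e-02, 1.40976609e-02,
                   1.24905823e-02, 3.43255532e-03, 1.84516942e-03])

cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:6.1%} of the variance, {c:6.1%} cumulative")
```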
What I don't understand is the output of pca.components_. If I do the following:
df = pd.DataFrame(pca.components_, columns=list(dfPca.columns))
I get the data frame below where each line is a principal component.
What I'd like to understand is how to interpret that table. I know that if I square all the features on each component and sum them I get 1, but what does the -0.56 on PC1 mean? Does it tell something about "Feature E" since it is the highest magnitude on a component that explains about 50% of the variance?
Terminology: First of all, the results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
In your case, the value -0.56 for Feature E is the loading (weight) of this feature on PC1. This value tells us 'how much' the feature influences the PC (in our case PC1).
So the higher the value in absolute value, the higher the influence on the principal component.
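The relationship between the weights and the component scores can be verified directly. The following is a small sketch on the iris data; the variable name weights is mine and refers to the rows of pca.components_, which the text above calls loadings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

weights = pca.components_           # shape (n_components, n_features)
scores = pca.transform(X)           # shape (n_samples, n_components)

# Each row of components_ is a unit-length direction in feature space.
unit_length = np.allclose((weights ** 2).sum(axis=1), 1.0)
print(unit_length)

# A score is the centred data projected onto that direction.
manual_scores = (X - pca.mean_) @ weights.T
print(np.allclose(scores, manual_scores))

# Index of the feature with the largest absolute weight on the first component.
top_feature = int(np.abs(weights[0]).argmax())
print(top_feature)
```

A weight of large magnitude therefore means that a change in that feature moves the score on that component more than the same change in a feature with a small weight.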
After performing the PCA analysis, people usually plot the well-known 'biplot' to see the transformed features in the N dimensions (2 in our case) together with the original variables (features).
I wrote a function to plot this.
Example using iris data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
# In general it is a good idea to scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
pca = PCA()
# fit the PCA and project the data in one step
x_new = pca.fit_transform(X)
def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    plt.scatter(xs ,ys, c = y) #without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlabel("PC1")
    plt.ylabel("PC2")
# Call the function with the first two PCs. coeff needs shape (n_features, 2),
# so the first two rows of components_ are transposed.
myplot(x_new[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
The important features are the ones that influence the components more and thus have a large absolute value on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe, use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
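# build the dataframe
df = pd.DataFrame(list(dic.items()))
print(df)
This prints (the exact letters vary from run to run because the data is random):
     0  1
0  PC0  e
1  PC1  d
So in this run, on the first component (PC0 in the table) the feature named e is the most important, and on the second (PC1) it is d.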
Basic Idea
The principal component breakdown by features that you have there basically tells you the "direction" each principal component points to in terms of the direction of the features.
In each principal component, features that have a greater absolute weight "pull" the principal component more to that feature's direction.
For example, we can say that in PC1, since Feature A, Feature B, Feature I, and Feature J have relatively low weights (in absolute value), PC1 does not point much in the direction of these features in the feature space. PC1 points most in the direction of Feature E relative to other directions.
Visualization in Lower Dimensions
For a visualization of this, look at the following figures taken from here and here:
The following shows an example of running PCA on correlated data.
We can visually see that both eigenvectors derived from PCA are being "pulled" in both the Feature 1 and Feature 2 directions. Thus, if we were to make a principal component breakdown table like you made, we would expect to see some weight from both Feature 1 and Feature 2 explaining PC1 and PC2.
Next, we have an example with uncorrelated data.
Let us call the green principal component PC1 and the pink one PC2. It's clear that PC1 is not pulled in the direction of feature x', and neither is PC2 in the direction of feature y'.
Thus, in our table, we must have a weight of 0 for feature x' in PC1 and a weight of 0 for feature y' in PC2.
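Both cases can be reproduced numerically. This is a sketch on simulated data, so the noise level, the spreads, and the sample size are assumptions chosen for illustration and are unrelated to the figures referenced above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 5000

# Correlated case: feature 2 is feature 1 plus a little noise.
f1 = rng.normal(size=n)
f2 = f1 + 0.3 * rng.normal(size=n)
correlated = np.column_stack([f1, f2])

# Uncorrelated case: independent features, with y' spread wider than x'.
x_prime = rng.normal(scale=1.0, size=n)
y_prime = rng.normal(scale=3.0, size=n)
uncorrelated = np.column_stack([x_prime, y_prime])

pc_correlated = PCA().fit(correlated).components_
pc_uncorrelated = PCA().fit(uncorrelated).components_

# Rows are components, columns are the weights on each original feature.
print(np.round(pc_correlated, 2))
print(np.round(pc_uncorrelated, 2))
```

In the correlated case both features carry a sizeable weight on every component, while in the uncorrelated case each component is dominated by a single feature and the other weight is close to zero.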
I hope this gives an idea of what you're seeing in your table.
I would like to run kmeans clustering with more than 3 features. I've tried with two features and am wondering how to provide more than 3 features to sklearn.cluster KMeans.
Here's my code and the dataframe from which I'd like to select features. I have multiple dataframes as input and I have to provide them as features.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# currently two features are selected
# I'd like to combine more than 3 features and provide them to dataset
df_features = pd.merge(df_max[['id', 'max']],
                       df_var[['id', 'variance']], on='id', how='left')
cols = list(df_features.loc[:,'max':'variance'])
# as_matrix() was removed in pandas 1.0; df_features[cols].to_numpy() replaces it there
X = df_features.as_matrix(columns=cols)
kmeans = KMeans(n_clusters=3)
# the model must be fitted before cluster_centers_ and labels_ exist
kmeans.fit(X)
centroid = kmeans.cluster_centers_
labels = kmeans.labels_
colors = ["g.","r.","c."]
for i in range(len(X)):
    print ("coordinate:" , X[i], "label:", labels[i])
plt.scatter(centroid[:,0],centroid[:,1], marker = "x", s=150, linewidths = 5, zorder =10)
Generally you wouldn't want id to be a feature because, unless you have good reason to believe otherwise, it does not correlate with anything.
As long as you feed a valid matrix X to kmeans.fit(X), it will run the KMeans algorithm for you regardless of the number of features in X. However, if you have a huge number of features, it may take longer to finish.
The problem is then how to construct X. As you have shown in your example, you can simply merge dataframes, select the wanted columns, and extract the feature matrix with a .as_matrix() call. If you have more dataframes and columns, just merge more and select more.
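As a rough sketch of "merge more and select more", the following builds X from three dataframes. The data values are invented, df_mean and its mean column are hypothetical additions to the df_max and df_var from the question, and to_numpy() is used in place of as_matrix() because the latter no longer exists in current pandas.

```python
import pandas as pd
from functools import reduce
from sklearn.cluster import KMeans

# Hypothetical per-id summary tables; df_mean and its column are made up here.
df_max = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                       "max": [9.0, 8.5, 1.2, 1.0, 5.1, 5.3]})
df_var = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                       "variance": [2.1, 2.0, 0.2, 0.3, 1.0, 1.1]})
df_mean = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                        "mean": [4.0, 4.2, 0.5, 0.6, 2.4, 2.6]})

# Merge any number of frames on the shared id column.
frames = [df_max, df_var, df_mean]
df_features = reduce(lambda left, right: pd.merge(left, right, on="id", how="left"),
                     frames)

# Use every column except id as a feature.
cols = [c for c in df_features.columns if c != "id"]
X = df_features[cols].to_numpy()

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(X.shape)            # (6, 3)
print(kmeans.labels_)
```

Adding a fourth or fifth feature only means appending another frame to the list; the KMeans call stays the same.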
Feature selection and dimensionality reduction may come in handy once you have more than enough features in your dataset. Read more about them when you have time.
P.S. Why scipy in the title?