Clustering geospatial data on coordinates AND a non-spatial feature - Python

Say I have the following dataframe stored in a variable called coordinates, where the first few rows look like:
   business_lat  business_lng  business_rating
0     19.111841     72.910729              5.0
1     19.111342     72.908387              5.0
2     19.111342     72.908387              4.0
3     19.137815     72.914085              5.0
4     19.119677     72.905081              2.0
5     19.119677     72.905081              2.0
..          ...           ...              ...
As you can see this data is geospatial (it has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the lat/lng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially I need a geospatial clustering with the additional requirement that the clustering must consider the rating column.
I've looked online and can't really find much addressing this: only approaches for strictly geospatial clustering (where the only features are lat/lng) or purely non-spatial clustering.
I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data or using a different approach all together?

The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. It's simple when it is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within a domain and then determine some interaction between the domains. This could be done using scaling options like the MinMaxScaler mentioned in another answer; however, I think that is a bit heavy-handed, and we can use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly, as it has a real-world meaning for which we can precalculate distances. A difference in meters maps directly onto what we want to capture.
You could use the scaling option mentioned in the previous answer, but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaler would give more importance to variation on the thin axis than on the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat/lon values themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# haversine_distances expects radians; multiply by the Earth's radius to get km
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
Adding in Rating
We can think about the problem by asking a couple of questions that relate rating to distance. How different must ratings be to separate observations in the same place? What is the ratio of meters of distance to one unit of rating difference? With an idea of that ratio, we can calculate another distance matrix for the rating difference between all observations and use it to scale, or simply add to, the original location distance matrix (alternatively, we could increase the distance for every gap in rating). This location-plus-rating-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation, forming clusters that either blend ratings together (when the penalty is not high enough) or separate into single-rating groups (when it is). It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1,4) for _ in range(len(X)//2)] \
+[random.randint(2,5) for _ in range(len(X)//2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
    rating_addition = (rating_distances ** 2) * multiplier
    distance_matrix = rating_addition + distances_in_km
    clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
    clustering.fit(distance_matrix)
    return clustering.labels_
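For example, a quick way to compare a few penalty multipliers and eps values (the values below are arbitrary and only illustrate how the helper is meant to be called) is to look at how many clusters come out and how the labels are distributed:
from collections import Counter

# arbitrary sweep values; tune them against your own data
for multiplier in (0.5, 1, 2):
    for eps in (0.5, 1, 2):
        labels = build_clusters(multiplier, eps)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(multiplier, eps, n_clusters, Counter(labels).most_common(3))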

Using the DBSCAN methodology, we can calculate the distance between points (the Euclidean distance or some other distance) and look for points which are far away from others. You may want to consider using the MinMaxScaler to normalize values, so one feature doesn't overwhelm other features.
Where is your code and what are your final results? Without an actual code sample, I can only guess what you are doing.
I hacked together some sample code for you. You can see the results below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv
df = pd.read_csv('C:\\your_path_here\\business.csv')
X=df.loc[:,['review_count','latitude','longitude']]
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
kmeans = KMeans(n_clusters = 3, init ='k-means++')
kmeans.fit(X[X.columns[0:2]]) # Compute k-means clustering.
X['cluster_label'] = kmeans.fit_predict(X[X.columns[0:2]])
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
labels = kmeans.predict(X[X.columns[0:2]]) # Labels of each point
X.head(10)
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
from scipy.stats import zscore
df["zscore"] = zscore(df["review_count"])
df["outlier"] = df["zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
df[df["outlier"]]
df_cord = df[["latitude", "longitude"]]
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cord = scaler.fit_transform(df_cord)
df_cord = pd.DataFrame(df_cord, columns = ["latitude", "longitude"])
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps = 0.5,
    metric = "euclidean",
    min_samples = 3,
    n_jobs = -1)
clusters = outlier_detection.fit_predict(df_cord)
clusters
from matplotlib import cm
cmap = cm.get_cmap('Accent')
df_cord.plot.scatter(
    x = "latitude",
    y = "longitude",
    c = clusters,
    cmap = cmap,
    colorbar = False
)
The final result looks a little weird, to tell you the truth. Remember, not everything is clusterable.
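If you also want the non-spatial column to influence the clusters, as in the original question, a minimal variation (a sketch, assuming the same df with latitude, longitude and review_count columns, and an eps you would still need to tune) is to scale all three columns together before running DBSCAN, so the rating-like feature acts as a third dimension:
# scale location and the non-spatial feature together so no single column dominates
features = df[["latitude", "longitude", "review_count"]]
features_scaled = MinMaxScaler().fit_transform(features)
clusters_3d = DBSCAN(eps=0.1, min_samples=3).fit_predict(features_scaled)  # eps here is a guess
pd.Series(clusters_3d).value_counts()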

Related

Defining clusters from a matrix using a value threshold and named by cluster size in Python

I have a matrix of pairwise differences between samples. I would like to label each sample as being part of a cluster, named by cluster size, where clusters are defined by an absolute cutoff in the matrix values, e.g. all those clusters with a difference of zero from each other.
Mock data:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
## Generate fake data
# matrix
d = {'sample_A': [0,2,0,1,1,2,2,1], 'sample_B': [2,0,2,3,3,0,0,3], 'sample_C': [0,2,0,1,1,2,2,1], 'sample_D': [1,3,1,0,2,3,3,1],
'sample_E': [1,3,1,2,0,3,3,1], 'sample_F': [2,0,2,3,3,0,0,3], 'sample_G': [2,0,2,3,3,0,0,3], 'sample_H': [1,3,1,1,1,3,3,0]}
idx = ["sample_A","sample_B","sample_C","sample_D","sample_E", "sample_F", "sample_G", "sample_H"]
df = pd.DataFrame(data=d,index=idx)
df
# Visualise heatmap (this isn't directly needed for this output)
g = sns.clustermap(df, cmap="coolwarm_r")
g
# Desired output
d = {'cluster_zero': [2,1,2,"NA","NA",1,1,"NA"]}
df3 = pd.DataFrame(data=d,index=idx)
df3
So the output labels each sample as belonging to a cluster defined as having zero pairwise difference in the matrix, and names the cluster in order of size from largest to smallest. In this case, samples B, F and G all have zero differences, so get put in cluster 1. Samples A and C also have zero differences from each other, and as that cluster is smaller than B/F/G they are cluster 2. There are no other samples with zero differences in this case, so the other samples don't get a cluster.
Ideally, I would like to be able to control the threshold of difference I used to define clusters, e.g. run the script again but using a threshold of <1 or <2 rather than zero.
There are various questions similar to this (e.g. Extracting clusters from seaborn clustermap), but they seem to use metrics of calculating distance rather than the absolute count in the matrix. Another similar question is: generating numerical clusters from matrix values of a minimal size but this counts the size of each cluster, which is different to the output I want.
Thanks for your help.
I found the answer to my question in this blog:
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
The solution I've done is:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

df_arr = np.asarray(df)
Z = linkage(df_arr, 'single')
plt.figure()
dn = dendrogram(Z)
from scipy.cluster.hierarchy import fcluster
# Set maximum threshold for difference e.g. 1
max_d = 1
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Then I turn the clusters array into a dataframe and pd.concat it onto the distances dataframe, then extract the list of sample names with their cluster. Lastly, I take only clusters with e.g. >2 samples in each cluster:
result = result.groupby('cluster').filter(lambda x : len(x)>2)
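For completeness, a sketch of those intermediate steps (the variable names are mine, and the final renaming-by-size step mirrors the desired output in the question rather than anything fcluster does for you):
# attach the fcluster labels to the sample names
result = pd.DataFrame({"cluster": clusters}, index=idx)
# keep only clusters with more than 2 samples, as above
result = result.groupby("cluster").filter(lambda x: len(x) > 2)
# rename the surviving clusters 1, 2, ... from largest to smallest
sizes = result["cluster"].value_counts()
rename = {old: new for new, old in enumerate(sizes.index, start=1)}
result["cluster_by_size"] = result["cluster"].map(rename)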

Clustering data - poor results, feature extraction

I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represents a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
1. Scale the data between 0 and 1
2. Perform PCA (reduce from 15 features to 5 components)
3. Cluster using DBSCAN, since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) for DBSCAN:
# optimal Epsilon (distance):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances, color="#0F215A")
plt.grid(True)
The results so far do not give any clear indication that the data is changing over time:
Of course, it could be the case that the data is not changing over these data points. However, what are some other things I could try? Kind of an open question, but I am running out of ideas.
First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all kinds of numeric variables and let the machine do the lifting for you (the target variable is only used afterwards to check the clusters). Here is a simple example using the canonical Iris dataset. You can EASILY modify this to fit your specific dataset. Just change the 'X' variables (all but the target variable) and the 'y' variable (just one target variable). Try that and report back.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # take all four features.
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make a scatter plot showing the species for each sample
# (build a labelled dataframe first, since the snippet above only defines X and y)
iris_df = pd.DataFrame(X, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
iris_df["species"] = [iris.target_names[label] for label in y]
sns.FacetGrid(iris_df, hue="species", height=4) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend();
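To quantify how well the clusters line up with the true species (the check described in the comments above), a quick sketch using the variables already defined:
from sklearn.metrics import adjusted_rand_score

# cross-tabulate cluster ids against true labels and compute a single agreement score
print(pd.crosstab(y, y_cluster_kmeans, rownames=['species'], colnames=['cluster']))
print("Adjusted Rand index:", adjusted_rand_score(y, y_cluster_kmeans))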

Separating Clusters after using SLINK in Python/R

From my research, only Single-Linkage Hierarchical Clustering can obtain optimal clusters. This is also known as SLINK. Implementations were originally published in C++ and are now available in Python/R.
So far, following the steps in the documentations, I managed to come up with:
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

## generating random numbers from 20 to 90 and storing them in a dataframe. This is 1-dimensional data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(20, 90, size=(100, 1)), columns=list('A'))
df = df.sort_values(by=['A'])
df = df.values
df[:, 0].sort()

## getting the condensed distance matrix
d = pdist(df, metric='euclidean')

## running the SLINK algorithm
Z = linkage(d, 'single')
I understand that Z is a 'hierarchical clustering encoded as a linkage matrix' (as written in the documentation), but I am wondering how do I go back to my original data set and distinguish the cluster calculated by this result?
I could achieve clustering result by Scikit-Learn clustering, but I think Scikit-Learn clustering algorithms are not optimal and hence I turned to this SLINK algorithm. Would be much appreciated if someone could help me with this.
From scipy.cluster.hierarchy.linkage you get back how clusters are formed with each iteration.
Normally this information is not so useful, so we can look at the clustering first:
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt

plt.figure()
dn = scipy.cluster.hierarchy.dendrogram(Z)
If we want to get the three clusters, we can do:
labels = scipy.cluster.hierarchy.fcluster(Z,3,'maxclust')
If you want to get it by distance between the data points:
scipy.cluster.hierarchy.fcluster(Z,2,'distance')
This gives about the same result as calling for 3 clusters, because there are not many ways to cut this example dataset.
If you look at the example you have, the next point at which you can cut it is at a height of ~1.5, which gives 16 clusters. So if you try scipy.cluster.hierarchy.fcluster(Z, 5, 'maxclust'), you get the same results as for 3 clusters. If you have a more spread-out dataset, it will work:
np.random.seed(111)
df = np.random.normal(0,1,(50,3))
## getting condensed distance matrix
d = pdist(df, metric='euclidean')
Z = linkage(d, 'single')
dn = scipy.cluster.hierarchy.dendrogram(Z,above_threshold_color='black',color_threshold=1.1)
Then this works:
scipy.cluster.hierarchy.fcluster(Z,5,'maxclust')
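To relate the labels back to the original observations (which was the actual question), a small sketch, assuming the 50x3 array df from the snippet above (the column names are placeholders):
import pandas as pd

labels = scipy.cluster.hierarchy.fcluster(Z, 5, 'maxclust')
clustered = pd.DataFrame(df, columns=['x', 'y', 'z'])  # placeholder column names
clustered['cluster'] = labels           # one cluster label per original row
print(clustered.groupby('cluster').size())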

How do I force clustering of data in a specific evident pattern?

I have a large set of 'Vehicle speed vs Engine RPM' values for a vehicle. I'm trying to predict the time spent by the vehicle in each gear.
I ran K-Means clustering on the dataset and got the following result:
Clearly, my algorithm has failed to capture the evident pattern. I want to force K-Means (or any other clustering algorithm, for that matter) to cluster data along the six sloped lines. Snippet of relevant code:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('speedRpm.csv')
print(data.shape)
data.head()
# Getting the data points
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))
# Number of clusters
k = 5
kmeans = KMeans(n_clusters=k)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
labeled_array = {i: X[np.where(kmeans.labels_ == i)] for i in range(kmeans.n_clusters)}
colors = ['r', 'g', 'b', 'y', 'c']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if kmeans.labels_[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='#050505')
plt.show()
How do I make sure the clustering algorithm captures the right pattern, even though it possibly isn't the most efficient?
Thanks!
EDIT:
Ran the same set of points using DBSCAN this time. After playing around with the eps and min_samples values for some time, I got the following result:
Although still not perfect, and with way too many outliers, the algorithm is beginning to capture the linear trend.
Code:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('speedRpm.csv')
print(data.shape)
data.head()
# Getting the values and plotting it
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))
# DBSCAN
# Compute DBSCAN
db = DBSCAN(eps=1.1, min_samples=3).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters:", n_clusters_)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
High Level
There are two major options here:
1. Transform your data so that k-means-style clustering algorithms succeed
2. Pick a different algorithm
Minor option:
- Tweak k-means by forcing the initialization to be smarter
Option 2
Python has a good description of several clustering algorithms here. From the link, a (crudely cropped) helpful graphic:
This row looks similar to your dataset; have you tried a Gaussian mixture model? A GMM has few well-known theoretical properties, but it works by assigning probabilities that points belong to each cluster center, based on a posterior calculated from the data. You can often initialize it with k-means, which sklearn does for you.
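As a sketch of that suggestion (assuming X is the (n_samples, 2) rpm/speed array from your code and that six gears means six components, both of which are assumptions about your data):
from sklearn.mixture import GaussianMixture

# full covariances let each component stretch along its own gear line
gmm = GaussianMixture(n_components=6, covariance_type='full', random_state=0)
gmm_labels = gmm.fit_predict(X)
print(gmm.covariances_.shape)  # (6, 2, 2): one 2x2 covariance per component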
Similarly, density-based clustering algorithms (e.g. DBSCAN) seem like a logical choice. Your data segments nicely into dense clusters, and this seems like a good topological property to filter for. In the image on the linked Wikipedia page:
they offer the caption:
DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means
which seems to speak to your troubles.
More on your troubles
K-means is an extremely versatile algorithm, but it is not globally optimal and suffers from a lot of weak points. Here is some dense reading on the topic.
In addition to problems like the mickey mouse problem, k-means tries to minimize simple Euclidean distance to the centroids. While this makes a lot of sense for a lot of problems, it doesn't make sense in yours, where the skew of the clusters means that this isn't quite the correct measure. Notice that other algorithms, like the agglomerative/hierarchical clustering shown above, that use similar measures have similar shortcomings.
I haven't covered transforming your data or tweaking k-means. The latter requires actually hacking into (or writing your own) clustering algorithm, which I don't recommend for a simple exploratory problem given the coverage of sklearn and similar packages, and the former seems like a local solution that is sensitive to your exact data. ICA might be a decent start, but there are a lot of options for that task.
k-means (and the other clustering algorithms quoted in the #en-knight answer) is meant for multi-dimensional data that tends to form groups of data points that are 'close' to each other (in terms of Euclidean distance) but spatially separated.
In your case, if the data is considered in your unprocessed input space (rpm vs velocity), the 'clusters' that form are very elongated and largely overlap in the region near (0,0), so most if not all methods based on Euclidean distance are bound to fail.
Your data isn't really 6 groups of 2-dimensional points that are spatially separated. Instead, it is actually a mix of 6 possible linear trends.
Therefore, the grouping should be based on x/y (the gear ratio). This is 1-dimensional: each (rpm, velocity) pair corresponds to a single (rpm/velocity) value, and you want to group those values.
I don't know if k-means (or the other algorithms) can take a 1-D data set, but if it cannot, you can create a new array with pairs like [0, rpm/vel] and run that through it; see the sketch below.
You may want to look for a 1-D algorithm that's more efficient than the multi-dimensional generic ones.
This will make the graph labelling a bit more involved, because the grouping is computed on a derived data set with a different shape (1 x samples) than the original data (2 x samples), but mapping them isn't difficult.
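A minimal sketch of that idea, assuming the f1 (rpm) and f2 (speed) arrays from your code, six gears, and that speed is never exactly zero (filter idle points first if it is); note that sklearn's KMeans does accept 1-D data as long as it is shaped (n_samples, 1):
from sklearn.cluster import KMeans

ratio = (f1 / f2).reshape(-1, 1)               # rpm / speed, one value per gear line
gear_labels = KMeans(n_clusters=6, random_state=0).fit_predict(ratio)
# map the 1-D labels back onto the original 2-D points for plotting
plt.scatter(f1, f2, c=gear_labels, s=7)
plt.show()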
You could multiply your y-values by a factor of 10 or more, so they spread out along that axis. Make sure you keep track of whether you're working with the real values or the multiplied values.
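A quick sketch of that, assuming the X array from your code (second column = speed) and a stretch factor you would have to tune by eye:
import numpy as np
from sklearn.cluster import DBSCAN

X_stretched = X * np.array([1, 10])   # hypothetical factor: stretch the speed axis only
db = DBSCAN(eps=1.1, min_samples=3).fit(X_stretched)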

PCA on sklearn - how to interpret pca.components_

I ran PCA on a data frame with 10 features using this simple code:
pca = PCA()
fit = pca.fit(dfPca)
The result of pca.explained_variance_ratio_ shows:
array([5.01173322e-01, 2.98421951e-01, 1.00968655e-01, 4.28813755e-02,
       2.46887288e-02, 1.40976609e-02, 1.24905823e-02, 3.43255532e-03,
       1.84516942e-03, 4.50314168e-16])
I believe that means the first PC explains roughly 50% of the variance, the second component explains about 30%, and so on...
What I don't understand is the output of pca.components_. If I do the following:
df = pd.DataFrame(pca.components_, columns=list(dfPca.columns))
I get the data frame below, where each line is a principal component.
What I'd like to understand is how to interpret that table. I know that if I square all the features on each component and sum them I get 1, but what does the -0.56 on PC1 mean? Does it tell us something about "Feature E", since it is the highest-magnitude entry on a component that explains about half of the variance?
Thanks
Terminology: First of all, the results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
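In sklearn terms, a small sketch of how those two quantities map onto the objects in your code (assuming the fitted pca and the original dfPca from the question; the scaling shown for loadings is one common convention):
import numpy as np

scores = pca.transform(dfPca)        # component/factor scores: transformed values per data point
weights = pca.components_            # rows = components, columns = original features
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # one common definition of loadings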
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
Summary in an article: Python compact guide: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
PART 1:
In your case, the value -0.56 for Feature E is the loading of this feature on PC1. This value tells us 'how much' the feature influences the PC (in our case, PC1).
So the higher the value in absolute value, the higher the influence on the principal component.
After performing the PCA analysis, people usually plot the well-known 'biplot' to see the transformed features in N dimensions (2 in our case) together with the original variables (features).
I wrote a function to plot this.
Example using iris data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    plt.scatter(xs, ys, c = y) #without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_)
plt.show()
Results
PART 2:
The important features are the ones that influence more the components and thus, have a large absolute value on the component.
To get the most important features on the PCs, with names, and save them into a pandas dataframe, use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
     0  1
0  PC0  e
1  PC1  d
So on the first principal component (PC0 in the output) the feature named e is the most important, and on the second (PC1) it is d.
Basic Idea
The principal component breakdown by features that you have there basically tells you the "direction" each principal component points to in terms of the features.
In each principal component, features that have a greater absolute weight "pull" the principal component more in that feature's direction.
For example, we can say that in PC1, since Feature A, Feature B, Feature I, and Feature J have relatively low weights (in absolute value), PC1 is not pointing as much in the direction of these features in feature space. PC1 will be pointing most in the direction of Feature E relative to the other directions.
Visualization in Lower Dimensions
For a visualization of this, look at the following figures taken from here and here:
The following shows an example of running PCA on correlated data.
We can visually see that both eigenvectors derived from PCA are being "pulled" in both the Feature 1 and Feature 2 directions. Thus, if we were to make a principal component breakdown table like you did, we would expect to see some weightage from both Feature 1 and Feature 2 explaining PC1 and PC2.
Next, we have an example with uncorrelated data.
Let us call the green principal component PC1 and the pink one PC2. It's clear that PC1 is not pulled in the direction of feature x', and neither is PC2 in the direction of feature y'.
Thus, in our table, we must have a weightage of 0 for feature x' in PC1 and a weightage of 0 for feature y' in PC2.
I hope this gives an idea of what you're seeing in your table.
