Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features has to each of the clusters?
What I want to be able to say is that for cluster k1, features 1, 4, 6 were the primary features, whereas cluster k2's primary features were 2, 5, 7.
This is the basic setup of what I am using:
from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use
Principal Component Analysis (PCA)
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding vectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, explaining 25 % of the overall variance.
the eigenvectors are the components' linear combinations. They give the weights for the features, so you can see which features have high or low impact.
use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude.
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute the empirical covariance matrix (e.g. np.cov) or correlation matrix (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might also want to check the scikit-learn decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, d)
        Original observations (n observations, d features)
    d : int
        Number of principal components (default is ``0`` => all components).
    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    Always all eigenvalues and eigenvectors are returned,
    independently of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, m or d)
        Reduced data matrix
    e_values : array, (m)
        The eigenvalues, sorted in descending manner.
    e_vectors : array, (n, m)
        The eigenvectors, sorted corresponding to eigenvalues.
    """
    # Center to average
    X_ = X - X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors
    e_values, e_vectors = sp.linalg.eigh(CO)
    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Get the number of desired dimensions
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    else:
        d = None
    # Map principal components to original data
    LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
    return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script, which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run, because the starting clusters are initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
                                color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principal Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component, as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component; all its observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first sight that the second feature is rather unimportant (for this component). The first and fourth features are the highest weighted, and the third one has a negative score. This means that, since all red vertices have a rather high score on the first PC, these vertices will have high values in the first and last feature, while at the same time having low scores on the third feature.
Concerning the second feature we can have a look at the second PC. However, note that the overall impact is far smaller as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this,
from sklearn.cluster import KMeans

estimator = KMeans()
estimator.fit(X)
res = estimator.__dict__
print(res['cluster_centers_'])
You will get a matrix of cluster centers and feature weights; from that you can conclude that the features with larger weights contribute more to the cluster.
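For example, to read off the most heavily weighted features per cluster center, a rough sketch (it assumes the features were scaled to comparable ranges first, and that data_features is the array from the question):
import numpy as np
from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)   # data_features: the (n_samples, 10) array from the question

# rank the features of each cluster center by absolute magnitude
for k, center in enumerate(k_means.cluster_centers_):
    top = np.argsort(np.abs(center))[::-1][:3]
    print("cluster {}: primary features {}".format(k, top))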
I assume that by saying "a primary feature" you mean one that had the biggest impact on the class. A nice exploration you can do is to look at the coordinates of the cluster centers. For example, plot for each feature its coordinate in each of the K centers.
Of course, any features that are on a large scale will have a much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
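A minimal sketch of that exploration, assuming a fitted scikit-learn KMeans called k_means and scaled data:
import matplotlib.pyplot as plt

centers = k_means.cluster_centers_            # shape (n_clusters, n_features)
n_clusters, n_features = centers.shape

plt.figure()
for k in range(n_clusters):
    # one line per cluster: its center's coordinate on every feature
    plt.plot(range(n_features), centers[k], marker='o', label='cluster {}'.format(k))
plt.xlabel('feature index')
plt.ylabel('center coordinate')
plt.legend()
plt.show()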
A method I came up with is calculating the standard deviation of each feature in relation to the distribution - basically, how the data is spread across each feature.
The smaller the spread, the more characteristic the feature is for that cluster. Basically:
1 - (std(x) / (max(x) - min(x)))
I wrote an article and a class to maintain it
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/#guylouzon/creating-clustering-feature-importance-c97ba8133c37
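A minimal sketch of that spread score, computed per feature within each cluster (here X is the scaled feature matrix and labels the cluster assignments; this helper is illustrative, not the class from the linked repository):
import numpy as np

def feature_spread_score(X, labels):
    """1 - std/(max - min) per feature within each cluster; higher means tighter."""
    scores = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        rng = Xk.max(axis=0) - Xk.min(axis=0)
        rng[rng == 0] = 1.0   # avoid division by zero for constant features
        scores[k] = 1 - Xk.std(axis=0) / rng
    return scores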
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.
For this goal, a very simple method is described as follows. Note that the Euclidean distance between two cluster centers is a sum of the squared differences between the individual features. We can then just use the squared difference as the weight for each feature.
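A sketch of that idea with scikit-learn cluster centers (assuming k_means has been fitted as in the question):
import numpy as np
from itertools import combinations

centers = k_means.cluster_centers_
weights = np.zeros(centers.shape[1])
# accumulate, per feature, the squared differences over all pairs of cluster centers
for i, j in combinations(range(len(centers)), 2):
    weights += (centers[i] - centers[j]) ** 2
weights /= weights.sum()   # normalise so the weights sum to 1
print(weights)             # a larger weight means the feature does more to separate the clusters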
Related
Problem Statement and Implementation:
I have a boolean matrix that has the data of users and items. If a user has bought the item then the value is 1, if not it is 0. I have about 10,000 users (rows) and 20,000 items (columns). I need to group similar users who have bought similar items. I have implemented this in K-modes, but the silhouette score is low; that is, the users are not grouped well enough.
Recommended solution:
So, I was recommended to use Boolean Matrix Factorization (BMF), because k-means/k-modes is only usable if clusters have convex shapes and every point belongs to exactly one cluster. BMF can compute clusters with overlap (which might not necessarily be important for your application), but it also identifies the features which are important for creating one cluster, together with the data points belonging to that cluster. Especially if you have a high-dimensional feature space, it is unlikely that the similarity between points is expressed over the whole feature space. BMF identifies the feature space in which the points form a cluster.
Question:
I have used the BMF packages from python. But I am not sure how to proceed with the results of the algorithm. Could somebody please suggest how I can proceed to forming clusters, name the clusters, and check which users are in which cluster using the Boolean Matrix Factorization?
K-modes implementation
from kmodes.kmodes import KModes
import pandas as pd

# method 1: Cao initialisation, with 2 clusters
km_cao = KModes(n_clusters=2, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(pivot_table)
fitClusters_cao
clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)
clusterCentroidsDf.columns = pivot_table.columns
clusterCentroidsDf
# method 2: Huang initialization, with 2 clusters
km_huang = KModes(n_clusters=2, init = "Huang", n_init = 1, verbose=1)
fitClusters_huang = km_huang.fit_predict(pivot_table)
fitClusters_huang
Boolean Matrix Factorization tried so far
Pivot_table has the boolean matrix.
import scipy.sparse as sp   # sp.isspmatrix lives in scipy.sparse
import nimfa

def __fact_factor(X):
    """
    Return dense factorization factor, so that output is printed nicely if the factor is sparse.
    :param X: Factorization factor.
    :type X: :class:`scipy.sparse` of format csr, csc, coo, bsr, dok, lil, dia or :class:`numpy.matrix`
    """
    return X.todense() if sp.isspmatrix(X) else X
bmf = nimfa.Bmf(pivot_table.values, seed="nndsvd", rank=10, max_iter=12, initialize_only=True, lambda_w=1.1, lambda_h=1.1)
bmf_fit = bmf()
print("=================================================================================================")
print("Factorization method:", bmf_fit.fit)
print("Initialization method:", bmf_fit.fit.seed)
print("Basis matrix W: ")
print(__fact_factor(bmf_fit.basis()))
print("Actual number of iterations: ", bmf_fit.summary('none')['n_iter'])
# We can access sparseness measure directly through fitted model.
# fit.fit.sparseness()
print("Sparseness basis: %7.4f, Sparseness mixture: %7.4f" % (bmf_fit.summary('none')['sparseness'][0], bmf_fit.summary('none')['sparseness'][1]))
# We can access explained variance directly through fitted model.
# fit.fit.evar()
print("Explained variance: ", bmf_fit.summary('none')['evar'])
# We can access residual sum of squares directly through fitted model.
# fit.fit.rss()
print("Residual sum of squares: ", bmf_fit.summary('none')['rss'])
# There are many more ... but just cannot print out everything =] and some measures need additional data or more runs
# e.g. entropy, predict, purity, coph_cor, consensus, select_features, score_features, connectivity
print("=================================================================================================")
The output for the above code is :
=================================================================================================
Factorization method: bmf
Initialization method: nndsvd
Basis matrix W:
[[1.66625229e+00 1.39154537e+00 1.49951683e+00 1.28801329e+00
1.30824274e+00 1.49929241e+00 1.49805989e+00 1.45501586e+00
1.46423781e+00 1.46044597e+00]
[1.39418572e-03 1.56074288e+00 1.63243924e+00 1.32752612e+00
1.33121227e+00 1.36615553e+00 1.40162330e+00 1.33175641e+00
1.35845014e+00 1.37581247e+00].......
.....
Actual number of iterations: 12
Sparseness basis: 0.1438, Sparseness mixture: 0.2611
Explained variance: 0.7680056334943757
Residual sum of squares: 1426.7653540095896
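One plausible way to proceed from these factors (a sketch, not a definitive recipe: it assumes the basis matrix W has one column per latent cluster, and that coef() returns the item-loading matrix H as in nimfa's examples):
import numpy as np

W = np.asarray(__fact_factor(bmf_fit.basis()))   # users x rank
H = np.asarray(__fact_factor(bmf_fit.coef()))    # rank x items

# assign each user to the latent factor on which it loads most heavily
user_cluster = W.argmax(axis=1)

# for each factor: how many users fall into it, and which items load highest on it
for r in range(H.shape[0]):
    top_items = np.argsort(H[r])[::-1][:10]
    print("factor {}: {} users, top items {}".format(r, np.sum(user_cluster == r), top_items))
Each user then gets the factor on which it loads most heavily, and each factor can be "named" by its top-loading items.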
Hi, I am trying to generate correlated data as close to the first table as possible (first three rows shown out of a total of 13). The correlation matrix for the relevant columns is also shown (corr_total).
I am trying the following code, which shows the error:
"LinAlgError: 4-th leading minor not positive definite"
import numpy as np
from scipy.linalg import cholesky

# Correlation matrix
# Compute the (upper) Cholesky decomposition matrix
upper_chol = cholesky(corr_total)

# What should be here? The mu and sigma of one row of a table?
rnd = np.random.normal(2.57, 0.78, size=(10, 7))

# Finally, compute the inner product of upper_chol and rnd
ans = rnd @ upper_chol
My question is what goes into the values of the mu and sigma, and how to resolve the error shown above.
Thanks!
P.S. I have edited the question to show the original table. It shows data for four patients. I basically want to make synthetic data for more cases that replicates the patterns found in these patients.
Thank you for answering my question about what data you have access to. The error you received was generated when you called cholesky: cholesky requires that your matrix be positive definite. One way to check whether a matrix is positive definite is to see if all of its eigenvalues are greater than zero. One of the eigenvalues of your correlation/covariance matrix is nearly zero, so I think cholesky is just being fussy. You can use scipy.linalg.sqrtm as an alternative decomposition.
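For example, that check is a one-liner with numpy (corr_total being the matrix from the question):
import numpy as np

eigenvalues = np.linalg.eigvalsh(corr_total)                      # eigenvalues of the symmetric matrix
print(eigenvalues)
print("positive semidefinite:", np.all(eigenvalues >= -1e-10))    # small tolerance for round-off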
For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. a mean of 0 and a width of 1. Numpy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
from scipy.linalg import cholesky, sqrtm
relavant_columns = ['Affecting homelife',
'Affecting mobility',
'Affecting social life/hobbies',
'Affecting work',
'Mood',
'Pain Score',
'Range of motion in Doc']
# df is a pandas dataframe containing the data frame from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10
# generate using affine transformation
#c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)
# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)
I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf
Specifically this part:
To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest.
I am using the same set of word vectors as the authors (Google News Corpus, 300 dimensions), which I load into word2vec.
The 'ten gender pair difference vectors' the authors refer to are computed from the following word pairs:
I've computed the differences between each normalized vector in the following way:
import gensim
import numpy as np

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims()
pairs = [('she', 'he'),
('her', 'his'),
('woman', 'man'),
('Mary', 'John'),
('herself', 'himself'),
('daughter', 'son'),
('mother', 'father'),
('gal', 'guy'),
('girl', 'boy'),
('female', 'male')]
difference_matrix = np.array([model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True) for a in pairs])
I then perform PCA on the resulting matrix, with 10 components, as per the paper:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(difference_matrix)
However I get very different results when I look at pca.explained_variance_ratio_ :
array([ 2.83391436e-01, 2.48616155e-01, 1.90642492e-01,
9.98411858e-02, 5.61260498e-02, 5.29706681e-02,
2.75670634e-02, 2.21957722e-02, 1.86491774e-02,
1.99108478e-32])
or with a chart:
The first component accounts for less than 30% of the variance when it should be above 60%!
The results I get are similar to what I get when I try to do the PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.
Note: I've tried without normalizing the vectors, but I get the same results.
They released the code for the paper on github: https://github.com/tolga-b/debiaswe
Specifically, you can see their code for creating the PCA plot in this file.
Here is the relevant snippet of code from that file:
def doPCA(pairs, embedding, num_components=10):
    matrix = []
    for a, b in pairs:
        center = (embedding.v(a) + embedding.v(b)) / 2
        matrix.append(embedding.v(a) - center)
        matrix.append(embedding.v(b) - center)
    matrix = np.array(matrix)
    pca = PCA(n_components=num_components)
    pca.fit(matrix)
    # bar(range(num_components), pca.explained_variance_ratio_)
    return pca
Based on the code, it looks like they are taking the difference between each word in a pair and the average vector of the pair. To me, it's not clear this is what they meant in the paper. However, I ran this code with their pairs and was able to recreate the graph from the paper.
To expand on oregano's answer:
For each pair, a and b, they calculate the center, c = (a + b) / 2 and then include vectors pointing in both directions, a - c and b - c.
The reason this is critical is that PCA gives you the vector along which the most variance occurs. All of your vectors point in the same direction, so there is very little variance in precisely the direction you are trying to reveal.
Their set includes vectors pointing in both directions in the gender subspace, so PCA clearly reveals gender variation.
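Adapted to the gensim setup from the question, the centering trick might look like this (a sketch; model and pairs as defined above):
import numpy as np
from sklearn.decomposition import PCA

matrix = []
for a, b in pairs:
    va = model.word_vec(a, use_norm=True)
    vb = model.word_vec(b, use_norm=True)
    center = (va + vb) / 2
    # keep both directions so the gender axis actually carries the variance
    matrix.append(va - center)
    matrix.append(vb - center)

pca = PCA(n_components=10)
pca.fit(np.array(matrix))
print(pca.explained_variance_ratio_)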
I am using PCA to reduce the dimensionality of a N-dimensional dataset, but I want to build in robustness to large outliers, so I've been looking into Robust PCA codes.
For traditional PCA, I'm using python's sklearn.decomposition.PCA which nicely returns the principal components as vectors, onto which I can then project my data (to be clear, I've also coded my own versions using SVD so I know how the method works). I found a few pre-coded RPCA python codes out there (like https://github.com/dganguli/robust-pca and https://github.com/jkarnows/rpcaADMM).
The 1st code is based on the Candes et al. (2009) method, and returns low rank L and sparse S matrices for a dataset D. The 2nd code uses the ADMM method of matrix decomposition (Parikh, N., & Boyd, S. 2013) and returns X_1, X_2, X_3 matrices. I must admit, I'm having a very hard time figuring out how to connect these to the principal axes that are returned by a standard PCM algorithm. Can anyone provide any guidance?
Specifically, in one dataset X, I have a cloud of N 3-D points. I run it through PCA:
pca = sklearn.decomposition.PCA(n_components=3)
pca.fit(X)
comps = pca.components_
and these 3 components are 3-D vectors that define the new basis onto which I project all my points. With Robust PCA, I get matrices L + S = X. Does one then run pca.fit(L)? I would have thought that RPCA would give me back the eigenvectors, but with internal steps to throw out outliers as part of building the covariance matrix or performing SVD.
Maybe what I think of as "Robust PCA" isn't how other people are using/coding it?
The robust-pca code factors the data matrix D into two matrices, L and S which are "low-rank" and "sparse" matrices (see the paper for details). L is what's mostly constant between the various observations, while S is what varies. Figures 2 and 3 in the paper give a really nice example from a couple of security cameras, picking out the static background (L) and variability such as passing people (S).
If you just want the eigenvectors, treat the S as junk (the "large outliers" you're wanting to clip out) and do an eigenanalysis on the L matrix.
Here's an example using the robust-pca code:
L, S = RPCA(data).fit()
rcomp, revals, revecs = pca(L)
print("Normalised robust eigenvalues: %s" % (revals/np.sum(revals),))
Here, the pca function is:
def pca(data, numComponents=None):
    """Principal Components Analysis

    From: http://stackoverflow.com/a/13224592/834250

    Parameters
    ----------
    data : `numpy.ndarray`
        numpy array of data to analyse
    numComponents : `int`
        number of principal components to use

    Returns
    -------
    comps : `numpy.ndarray`
        Principal components
    evals : `numpy.ndarray`
        Eigenvalues
    evecs : `numpy.ndarray`
        Eigenvectors
    """
    m, n = data.shape
    data -= data.mean(axis=0)
    R = np.cov(data, rowvar=False)
    # use 'eigh' rather than 'eig' since R is symmetric,
    # the performance gain is substantial
    evals, evecs = np.linalg.eigh(R)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:, idx]
    evals = evals[idx]
    if numComponents is not None:
        evecs = evecs[:, :numComponents]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return np.dot(evecs.T, data.T).T, evals, evecs
I am trying to find planes in a 3D point cloud, using the regression formula Z = aX + bY + C.
I implemented least squares and RANSAC solutions, but the three-parameter equation limits the plane fitting to 2.5D: the formula cannot be applied to planes parallel to the Z-axis.
My question is how can I generalize the plane fitting to full 3D?
I want to add a fourth parameter in order to get the full equation
aX + bY + cZ + d = 0
How can I avoid the trivial (0, 0, 0, 0) solution?
Thanks!
The Code I'm using:
from sklearn import linear_model
def local_regression_plane_ransac(neighborhood):
    """
    Computes parameters for a local regression plane using RANSAC
    """
    XY = neighborhood[:, :2]
    Z = neighborhood[:, 2]
    ransac = linear_model.RANSACRegressor(
        linear_model.LinearRegression(),
        residual_threshold=0.1
    )
    ransac.fit(XY, Z)

    inlier_mask = ransac.inlier_mask_
    coeff = ransac.estimator_.coef_
    intercept = ransac.estimator_.intercept_
Update
This functionality is now integrated in https://github.com/daavoo/pyntcloud and makes the plane-fitting process much simpler:
Given a point cloud:
You just need to add a scalar field like this:
is_floor = cloud.add_scalar_field("plane_fit")
Which will add a new column with value 1 for the points of the fitted plane.
You can visualize the scalar field:
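For context, a minimal end-to-end sketch (hedged: the exact return value of add_scalar_field and the column handling below are assumptions, so check the pyntcloud docs; points is an (n, 3) array):
import numpy as np
import pandas as pd
from pyntcloud import PyntCloud

# pyntcloud expects the points as a DataFrame with x, y, z columns
cloud = PyntCloud(pd.DataFrame(points, columns=['x', 'y', 'z']))

is_floor = cloud.add_scalar_field("plane_fit")             # assumed to return the new column's name
plane_points = cloud.points[cloud.points[is_floor] == 1]   # points belonging to the fitted plane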
Old answer
I think that you could easily use PCA to fit the plane to the 3D points instead of regression.
Here is a simple PCA implementation:
import numpy as np

def PCA(data, correlation=False, sort=True):
    """ Applies Principal Component Analysis to the data

    Parameters
    ----------
    data: array
        The array containing the data. The array must have NxM dimensions, where each
        of the N rows represents a different individual record and each of the M columns
        represents a different variable recorded for that individual record.
            array([
                [V11, ... , V1m],
                ...,
                [Vn1, ... , Vnm]])

    correlation(Optional) : bool
        Set the type of matrix to be computed (see Notes):
            If True compute the correlation matrix.
            If False(Default) compute the covariance matrix.

    sort(Optional) : bool
        Set the order that the eigenvalues/vectors will have
            If True(Default) they will be sorted (from higher value to less).
            If False they won't.

    Returns
    -------
    eigenvalues: (1,M) array
        The eigenvalues of the corresponding matrix.

    eigenvector: (M,M) array
        The eigenvectors of the corresponding matrix.

    Notes
    -----
    The correlation matrix is a better choice when there are different magnitudes
    representing the M variables. Use covariance matrix in other cases.
    """
    mean = np.mean(data, axis=0)
    data_adjust = data - mean

    #: the data is transposed due to np.cov/corrcoef syntax
    if correlation:
        matrix = np.corrcoef(data_adjust.T)
    else:
        matrix = np.cov(data_adjust.T)

    eigenvalues, eigenvectors = np.linalg.eig(matrix)

    if sort:
        #: sort eigenvalues and eigenvectors
        sort = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[sort]
        eigenvectors = eigenvectors[:, sort]

    return eigenvalues, eigenvectors
And here is how you could fit the points to a plane:
def best_fitting_plane(points, equation=False):
    """ Computes the best fitting plane of the given points

    Parameters
    ----------
    points: array
        The x,y,z coordinates corresponding to the points from which we want
        to define the best fitting plane. Expected format:
            array([
                [x1,y1,z1],
                ...,
                [xn,yn,zn]])

    equation(Optional) : bool
        Set the output plane format:
            If True return the a,b,c,d coefficients of the plane.
            If False(Default) return 1 Point and 1 Normal vector.

    Returns
    -------
    a, b, c, d : float
        The coefficients solving the plane equation.

    or

    point, normal: array
        The plane defined by 1 Point and 1 Normal vector. With format:
        array([Px,Py,Pz]), array([Nx,Ny,Nz])
    """
    w, v = PCA(points)

    #: the normal of the plane is the last eigenvector
    normal = v[:, 2]

    #: get a point from the plane
    point = np.mean(points, axis=0)

    if equation:
        a, b, c = normal
        d = -(np.dot(normal, point))
        return a, b, c, d
    else:
        return point, normal
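A quick usage sketch with synthetic, roughly planar points:
import numpy as np

# synthetic points close to the plane z = 0.1*x - 0.2*y + 3
rng = np.random.RandomState(0)
xy = rng.uniform(-5, 5, size=(200, 2))
z = 0.1 * xy[:, 0] - 0.2 * xy[:, 1] + 3 + rng.normal(scale=0.01, size=200)
points = np.column_stack([xy, z])

a, b, c, d = best_fitting_plane(points, equation=True)
print(a, b, c, d)   # proportional to (0.1, -0.2, -1, 3), up to sign and normalisation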
However, as this method is sensitive to outliers, you could use RANSAC to make the fit robust to them.
There is a Python implementation of ransac here.
And you should only need to define a Plane Model class in order to use it for fitting planes to 3D points.
In any case, if you can clean the 3D points of outliers (maybe you could use a KD-Tree S.O.R. filter for that), you should get pretty good results with PCA.
Here is an implementation of an S.O.R:
import numpy as np
from scipy import stats

def statistical_outilier_removal(kdtree, k=8, z_max=2):
    """ Compute a Statistical Outlier Removal filter on the given KDTree.

    Parameters
    ----------
    kdtree: scipy's KDTree instance
        The KDTree's structure which will be used to
        compute the filter.

    k(Optional): int
        The number of nearest neighbors which will be used to estimate the
        mean distance from each point to its nearest neighbors.
        Default : 8

    z_max(Optional): int
        The maximum Z score which determines if the point is an outlier or
        not.

    Returns
    -------
    sor_filter : boolean array
        The boolean mask indicating whether a point should be kept or not.
        The size of the boolean mask will be the same as the number of points
        in the KDTree.

    Notes
    -----
    The 2 optional parameters (k and z_max) should be used in order to adjust
    the filter to the desired result.
    A HIGHER 'k' value will result (normally) in a HIGHER number of points trimmed.
    A LOWER 'z_max' value will result (normally) in a HIGHER number of points trimmed.
    """
    distances, i = kdtree.query(kdtree.data, k=k, n_jobs=-1)

    z_distances = stats.zscore(np.mean(distances, axis=1))

    sor_filter = abs(z_distances) < z_max

    return sor_filter
You could feed the function with a KDTree of your 3D points, computed maybe using this implementation; for example:
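(A sketch assuming scipy's cKDTree; note that newer SciPy versions rename the n_jobs argument of query to workers.)
import numpy as np
from scipy.spatial import cKDTree

kdtree = cKDTree(points)                                   # points: (n, 3) array, as above
mask = statistical_outilier_removal(kdtree, k=8, z_max=2)
clean_points = points[mask]                                # points kept by the filter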
import pcl
cloud = pcl.PointCloud()
cloud.from_array(points)
seg = cloud.make_segmenter_normals(ksearch=50)
seg.set_optimize_coefficients(True)
seg.set_model_type(pcl.SACMODEL_PLANE)
seg.set_normal_distance_weight(0.05)
seg.set_method_type(pcl.SAC_RANSAC)
seg.set_max_iterations(100)
seg.set_distance_threshold(0.005)
inliers, model = seg.segment()
You need to install python-pcl first. Feel free to play with the parameters. Here, points is an n×3 numpy array of n 3D points, and model will be [a, b, c, d] such that ax + by + cz + d = 0.