How to apply Boolean Matrix Factorization to clustering problems? - python

Problem Statement and Implementation:
I have a boolean matrix that has the data of users and items. If a user has bought the item then the value is 1, if not it is 0. I have about 10,000 users(rows) and 20,000 items(columns). I need to group similar users who have bought similar items. I have implemented this in K-modes. But, the silhouette score is less. That is, the users are not grouped well enough.
Recommended solution:
So, I was recommended to use Boolean Matrix factorization(BMF) because k-means/modes is only usable if clusters have convex shapes and every point belongs to exactly one cluster. BMF can compute clusters with overlap (might not be necessarily important for your application) but it also identifies the features which are important for creating one cluster together with the data points belonging to one cluster. Especially if you have a high-dimensional feature space, it is unlikely that the similarity between points is expressed over the whole feature space. BMF identifies the feature space in which the points form a cluster.
Question:
I have used the BMF packages from python. But I am not sure how to proceed with the results of the algorithm. Could somebody please suggest how I can proceed to forming clusters, name the clusters, and check which users are in which cluster using the Boolean Matrix Factorization?
K-modes implementation
#method 1: CAO initialisation, with clusters 2
km_cao = KModes(n_clusters=2, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(pivot_table)
fitClusters_cao
clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)
clusterCentroidsDf.columns = pivot_table.columns
clusterCentroidsDf
#method 2: Huang initialization, with clusters 2
km_huang = KModes(n_clusters=2, init = "Huang", n_init = 1, verbose=1)
fitClusters_huang = km_huang.fit_predict(pivot_table)
fitClusters_huang
Boolean Matrix Factorization tried so far
Pivot_table has the boolean matrix.
def __fact_factor(X):
"""
Return dense factorization factor, so that output is printed nice if factor is sparse.
:param X: Factorization factor.
:type X: :class:`scipy.sparse` of format csr, csc, coo, bsr, dok, lil, dia or :class:`numpy.matrix`
"""
return X.todense() if sp.isspmatrix(X) else X
bmf = nimfa.Bmf(pivot_table.values, seed="nndsvd", rank=10, max_iter=12, initialize_only=True, lambda_w=1.1, lambda_h=1.1)
bmf_fit = bmf()
print("=================================================================================================")
print("Factorization method:", bmf_fit.fit)
print("Initialization method:", bmf_fit.fit.seed)
print("Basis matrix W: ")
print(__fact_factor(bmf_fit.basis()))
print("Actual number of iterations: ", bmf_fit.summary('none')['n_iter'])
# We can access sparseness measure directly through fitted model.
# fit.fit.sparseness()
print("Sparseness basis: %7.4f, Sparseness mixture: %7.4f" % (bmf_fit.summary('none')['sparseness'][0], bmf_fit.summary('none')['sparseness'][1]))
# We can access explained variance directly through fitted model.
# fit.fit.evar()
print("Explained variance: ", bmf_fit.summary('none')['evar'])
# We can access residual sum of squares directly through fitted model.
# fit.fit.rss()
print("Residual sum of squares: ", bmf_fit.summary('none')['rss'])
# There are many more ... but just cannot print out everything =] and some measures need additional data or more runs
# e.g. entropy, predict, purity, coph_cor, consensus, select_features, score_features, connectivity
print("=================================================================================================")
The output for the above code is :
=================================================================================================
Factorization method: bmf
Initialization method: nndsvd
Basis matrix W:
[[1.66625229e+00 1.39154537e+00 1.49951683e+00 1.28801329e+00
1.30824274e+00 1.49929241e+00 1.49805989e+00 1.45501586e+00
1.46423781e+00 1.46044597e+00]
[1.39418572e-03 1.56074288e+00 1.63243924e+00 1.32752612e+00
1.33121227e+00 1.36615553e+00 1.40162330e+00 1.33175641e+00
1.35845014e+00 1.37581247e+00].......
.....
Actual number of iterations: 12
Sparseness basis: 0.1438, Sparseness mixture: 0.2611
Explained variance: 0.7680056334943757
Residual sum of squares: 1426.7653540095896

Related

partially define initial centroid for scikit-learn K-Means clustering

Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass array of 10 by 6 shape, i.e. I want 6 dimensions of centroid of be predefined by me, but 7th dimension to be iterated freely using k-mean++.(In another word, I do not want to specify initial centroid, but rather control 6 dimension and only leave one dimension to vary for initial cluster)
I tried to pass 10x6 dimension, in hope it would work, but it just throw up the error.
Sklearn does not allow you to perform this kind of fine operations.
The only possibility is to provide a 7th feature value that is random, or similar to what Kmeans++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans
nb_clust = 10
# your data
X = np.random.randn(7*1000).reshape( (1000,7) )
# your 6col centroids
cent_6cols = np.random.randn(6*nb_clust).reshape( (nb_clust,6) )
# artificially fix your centroids
km = KMeans( n_clusters=10 )
km.cluster_centers_ = cent_6cols
# find the points laying on each cluster given your initialization
initial_prediction = km.predict(X[:,0:6])
# For the 7th column you'll provide the average value
# of the points laying on the cluster given by your partial centroids
cent_7cols = np.zeros( (nb_clust,7) )
cent_7cols[:,0:6] = cent_6cols
for i in range(nb_clust):
init_7th = X[ np.where( initial_prediction == i ), 6].mean()
cent_7cols[i,6] = init_7th
# now you have initialized the 7th column with a Kmeans ++ alike
# So now you can use the cent_7cols as your centroids
truekm = KMeans( n_clusters=10, init=cent_7cols )
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code, and modify k-means. Initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way - but it's very hard to design an efficient API to allow such customizations through trivial parameters - use the source code to customize at his level.

Multiple linear regression: appending an array on ones to Matrix of features (Python)

I'm currently learning basics of Data Science online. In one of the session on Multiple Linear Regression using Python, the tutor executed below step to add an array on ones to the Matrix of features ; I did not understand why it is being added. From online forums, it is mentioned that it is added so that model (equation) have a constant offset. But why 1 and not any other values. Does the number of independent variables (3) have any impact on this value
X -> Matrix of features ; number of rows in data set : 50 ; Number of
X = np.append(arr = np.ones([50,1]).astype(int), values = X,axis=1)
To better explain, let's imagine you have only 1 feature stored, and let's say 3
training examples.
Then, your parameters are:
And your input variables are:
If you want to realize a linear classification, you must compute the cost function for each training example i:
And if you need to vectorize the calculus (for efficiency and code readability), you want to compute the following matricial product:
However, by definition of the matricial product, the number of columns of matrix X should be the same than the number of rows of matrix Theta. Thus, to compute the product but leave the result unchanged, you add a column of ones to the left of matrix X:
Then, the result for each sample i is the following:
TLDR: You need to append a column of ones to X for the matricial product X*Theta to be defined. If you were adding any other coefficient c instead of 1, then your constant offset theta_0 would be multiplied by your coefficient c.
I think the cost function is the summation of errors between the predicted label and the actual label which we want to minimise. The J function given above is the hypothesis function.

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But mypca function always returns the same value i.e. Xpca irrespective of value of comp.
The value it returns is correct for first value of comp I send from the loop i.e. Xpca value which it sends each time is correct for comp = 10 in my case.
What should I do in order to find best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2 dimensional array, the minimum n_comp is 10, so the PCA try to find the 10 best dimension for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?
What I want to be able to say is that for cluster k1, features 1,4,6 were the primary features where as cluster k2's primary features were 2,5,7.
This is the basic setup of what I am using:
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
You can use
Principle Component Analysis (PCA)
PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
Some essential points:
the eigenvalues reflect the portion of variance explained by the corresponding component. Say, we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresp. vectors. The second value belongs to the first principle component as it explains 50 % off the overall variance and the last value belongs to the second principle component explaining 25 % of the overall variance.
the eigenvectors are the component's linear combinations. The give the weights for the features so that you can know, which feature as high/low impact.
use PCA based on correlation matrix instead of empiric covariance matrix, if the eigenvalues strongly differ (magnitudes).
Sample approach
do PCA on entire dataset (that's what the function below does)
take matrix with observations and features
center it to its average (average of feature values among all observations)
compute empiric covariance matrix (e.g. np.cov) or correlation (see above)
perform decomposition
sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
use components on original data
examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance
Sample function
You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might want to check also the scikit decomposition module.
PCA is performed on a data matrix with observations (objects) in rows and features in columns.
def dim_red_pca(X, d=0, corr=False):
r"""
Performs principal component analysis.
Parameters
----------
X : array, (n, d)
Original observations (n observations, d features)
d : int
Number of principal components (default is ``0`` => all components).
corr : bool
If true, the PCA is performed based on the correlation matrix.
Notes
-----
Always all eigenvalues and eigenvectors are returned,
independently of the desired number of components ``d``.
Returns
-------
Xred : array, (n, m or d)
Reduced data matrix
e_values : array, (m)
The eigenvalues, sorted in descending manner.
e_vectors : array, (n, m)
The eigenvectors, sorted corresponding to eigenvalues.
"""
# Center to average
X_ = X-X.mean(0)
# Compute correlation / covarianz matrix
if corr:
CO = np.corrcoef(X_.T)
else:
CO = np.cov(X_.T)
# Compute eigenvalues and eigenvectors
e_values, e_vectors = sp.linalg.eigh(CO)
# Sort the eigenvalues and the eigenvectors descending
idx = np.argsort(e_values)[::-1]
e_vectors = e_vectors[:, idx]
e_values = e_values[idx]
# Get the number of desired dimensions
d_e_vecs = e_vectors
if d > 0:
d_e_vecs = e_vectors[:, :d]
else:
d = None
# Map principal components to original data
LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
return LIN[:, :d], e_values, e_vectors
Sample usage
Here's a sample script, which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run. This is due to the starting clusters a initialized randomly.
import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt
SN = np.array([ [1.325, 1.000, 1.825, 1.750],
[2.000, 1.250, 2.675, 1.750],
[3.000, 3.250, 3.000, 2.750],
[1.075, 2.000, 1.675, 1.000],
[3.425, 2.000, 3.250, 2.750],
[1.900, 2.000, 2.400, 2.750],
[3.325, 2.500, 3.000, 2.000],
[3.000, 2.750, 3.075, 2.250],
[2.075, 1.250, 2.000, 2.250],
[2.500, 3.250, 3.075, 2.250],
[1.675, 2.500, 2.675, 1.250],
[2.075, 1.750, 1.900, 1.500],
[1.750, 2.000, 1.150, 1.250],
[2.500, 2.250, 2.425, 2.500],
[1.675, 2.750, 2.000, 1.250],
[3.675, 3.000, 3.325, 2.500],
[1.250, 1.500, 1.150, 1.000]], dtype=float)
clust,labels_ = kmeans2(SN,3) # cluster with 3 random initial clusters
# PCA on orig. dataset
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)
xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1],
color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principle Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
Output
The eigenvectors show the weighting of each feature for the component
Short Interpretation
Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component. All observations yield rather high values. What does it mean? Now looking at the linear combination of the first component we see on first sight, that the second feature is rather unimportant (for this component). The first and fourth feature are the highest weighted and the third one has a negative score. This means, that - as all red vertices have a rather high score on the first PC - these vertices will have high values in the first and last feature, while at the same time they have low scores concerning the third feature.
Concerning the second feature we can have a look at the second PC. However, note that the overall impact is far smaller as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
You can do it this way:
>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110, 89])
>>> k3
array([ 1, 2, 7, 12])
Try this,
estimator=KMeans()
estimator.fit(X)
res=estimator.__dict__
print res['cluster_centers_']
You will get matrix of cluster and feature_weights, from that you can conclude, the feature having more weight takes major part to contribute cluster.
I assume that by saying "a primary feature" you mean - had the biggest impact on the class. A nice exploration you can do is look at the coordinates of the cluster centers . For example, plot for each feature it's coordinate in each of the K centers.
Of course that any features that are on large scale will have much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
a method I came up with is calculating the standard deviation of each feature in relation to the distribution - basically how is the data is spread across each feature
the lesser the spread, the better the feature of each cluster basically:
1 - (std(x) / (max(x) - min(x))
I wrote an article and a class to maintain it
https://github.com/GuyLou/python-stuff/blob/main/pluster.py
https://medium.com/#guylouzon/creating-clustering-feature-importance-c97ba8133c37
It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.
For this goal, a very simple method is described as follow. Note that the Euclidean distance between two cluster centers is a sum of square difference between individual features. We can then just use the square difference as the weight for each feature.

How to compute the shannon entropy and mutual information of N variables

I need to compute the mutual information, and so the shannon entropy of N variables.
I wrote a code that compute shannon entropy of certain distribution.
Let's say that I have a variable x, array of numbers.
Following the definition of shannon entropy I need to compute the probability density function normalized, so using the numpy.histogram is easy to get it.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
x=binedg[:-1]
g=-p*log2(p)
g[isnan(g)]=0.
return scint.simps(g,x=x)
Choosing inserting x, and carefully the bin number this function works.
But this function is very dependent on the bin number: choosing different values of this parameter I got different values.
Particularly if my input is an array of values constant:
x=[0,0,0,....,0,0,0]
the entropy of this variables obviously has to be 0, but if I choose the bin number equal to 1 I got the right answer, if I choose different values I got strange non sense (negative) answers.. what I am feeling is that numpy.histogram have the arguments normed=True or density= True that (as said in the official documentation) they should give back the histogram normalized, and probably I do some error in the moment that I swich from the probability density function (output of numpy.histogram) to the probability mass function (input of shannon entropy), I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems, I would like to have an efficient method to compute the shannon entropy independent of the bin number.
I wrote a function to compute the shannon entropy of a distribution of more variables, but I got the same error.
The code is this, where the input of the function shannon_entropydd is the array where at each position there is each variable that has to be involved in the statistical computation
def intNd(c,axes):
assert len(c.shape) == len(axes)
assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
if len(axes) == 1:
return scint.simps(c,axes[0])
else:
return intNd(scint.simps(c,axes[-1]),axes[:-1])
def shannon_entropydd(c,bins=30):
hist,ax=histogramdd(c,bins,normed=True)
for i in range(len(ax)):
ax[i]=ax[i][:-1]
p=-hist*log2(hist)
p[isnan(p)]=0
return intNd(p,ax)
I need these quantities in order to be able to compute the mutual information between certain set of variables:
M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)
where H(x) is the shannon entropy of the variable x
I have to find a way to compute these quantities so if some one has a completely different kind of code that works I can switch on it, I don't need to repair this code but find a right way to compute this statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose an number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should take such that bins >= 2. If the values that can take your variable are in {0,1,2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np
x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]
def entropy(c):
c_normalized = c/float(np.sum(c))
c_normalized = c_normalized[np.nonzero(c_normalized)]
h = -sum(c_normalized * np.log(c_normalized))
return h
hx = entropy(cx)

Categories

Resources