I am working on text clustering. I would need to plot the data using different colours.
I used kmeans method for clustering and tf-idf for similarity.
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])
Currently, my output looks like this (there are only a few points because it is a test).
I need to add labels (they are strings) and to distinguish the dots by cluster: each cluster should have its own colour so that the chart is easy to read.
Could you please tell me how to change my code to include both labels and colours? Any example would be great.
A sample of my dataset is (the output above was generated from a different sample):
Sentences
Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.
We can use an example dataset:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA  # needed for the 2-D projection below
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc','sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix returned by todense()
X = pipeline.fit_transform(X_train).toarray()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
And do KMeans like you did, obtaining the clusters and centers, so just adding a name for the cluster:
kmeans =KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=kmeans.labels_
cluster_name = ["Cluster"+str(i) for i in set(labels)]
You can add the colors by passing the cluster labels to "c=" and choosing a colormap from cm or defining your own map:
plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
You can also consider using seaborn:
sns.scatterplot(x=data2D[:,0], y=data2D[:,1], hue=labels, legend='full', palette="Set1")
Picking up on your code try the following:
import numpy as np
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
kmeans = KMeans(n_clusters=3)  # the question's snippet never defines `kmeans`
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_
# One colour and one legend label per cluster id
cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}
fig, ax = plt.subplots()
for g in np.unique(group):
    ix = np.where(group == g)
    ax.scatter(data2D[:,0][ix], data2D[:,1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()
I'm assuming your kmeans has n_clusters=3. The cdict and ldict need to be set up accordingly with the number of clusters. In this case cluster 0 will be red with label label_1, cluster 1 will be blue with label label_2 and so on.
EDIT: I changed the keys of cdict to start from 0.
EDIT 2: Added labels.
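If the string labels should appear next to the individual points (as the question asks), one possible approach is to annotate each point with its text. The following is a minimal sketch rather than a definitive solution: the helper name plot_labelled_clusters and its max_chars parameter are made up for illustration, the sentences come from the sample in the question, and the colours are drawn from the "tab10" colormap so that no cdict has to be maintained by hand.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences taken from the question; any list of strings works.
sentences = [
    "Where do we do list them?",
    "Make me a list of the things we would need and I'll take you into town.",
    "Do you have a list yet?",
    "The first was a list for Howie.",
    "You're not on my list tonight.",
    "I'm gonna print this list on my computer.",
]


def plot_labelled_clusters(texts, n_clusters=3, max_chars=20):
    """Hypothetical helper: TF-IDF + KMeans, one colour per cluster, text next to each point."""
    X = TfidfVectorizer().fit_transform(texts).toarray()
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    data2D = PCA(n_components=2).fit_transform(X)

    cmap = plt.get_cmap("tab10")  # qualitative colormap with 10 distinct colours
    fig, ax = plt.subplots()
    for g in np.unique(kmeans.labels_):
        ix = kmeans.labels_ == g
        ax.scatter(data2D[ix, 0], data2D[ix, 1], color=cmap(int(g)),
                   label=f"Cluster {g}", s=100)
    # Write a shortened version of each sentence next to its point.
    for (x, y), txt in zip(data2D, texts):
        ax.annotate(txt[:max_chars], (x, y), xytext=(5, 5), textcoords="offset points")
    ax.legend()
    return fig, ax, kmeans.labels_


fig, ax, cluster_labels = plot_labelled_clusters(sentences)
```

Call plt.show() afterwards to display the figure. With many points the annotations will overlap, so in practice you may want to annotate only a subset.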
I have a dataset with ~13k features and I want to select the features that are contributing the most to the classification of a specific label.
I am using the sklearn.svm.LinearSVC class on single cell data.
The coef_ attribute should provide this information (as far as I understand), but when I exclude the top 10-100 features according to coef_, the accuracy / multi-class F1 score does not decrease.
Does anybody know how to extract this information from a trained model?
I have provided example code below that does the same thing with an open-source dataset.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
data = load_iris(return_X_y=True, as_frame=True)
print(data[1].unique()) # [0 1 2] -> three classes
svc = LinearSVC()
svc.fit(data[0], data[1])
score = svc.score(data[0], data[1])
print(svc.coef_.shape) # (3, 4)
fig, axs = plt.subplots(1, 3, figsize=(15, 7))
for label, ax in enumerate(axs.flatten()):
    args = np.argsort(-svc.coef_[label])
    vals = [svc.coef_[label][arg] for arg in args]
    ax.bar(args, vals)
    ax.title.set_text(f"{label}")
plt.tight_layout()
if __name__ == '__main__':
    plt.show()
There are many ways to visualize a data set. I want to collect those methods here, and I have chosen the iris data set to demonstrate them.
I would use either pandas' visualization tools or seaborn's.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import pandas as pd
# Parallel Coordinates
# Load the data set
iris = sns.load_dataset("iris")
parallel_coordinates(iris, 'species', color=('#556270', '#4ECDC4', '#C7F464'))
plt.show()
The result is as follows:
from pandas.plotting import andrews_curves
# Andrew Curves
a_c = andrews_curves(iris, 'species')
a_c.plot()
plt.show()
and its plot is shown below:
from seaborn import pairplot
# Pair Plot
pairplot(iris, hue='species')
plt.show()
which would plot the following fig:
Another plot, which I think is the least used but the most important, is the following one:
from plotly.express import scatter_3d
# Plot in 3D with plotly.express; the figure can be zoomed, re-oriented and rotated
# The keys of color_discrete_map must be the values of the "species" column
scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_length', size="petal_width",
           color="species",
           color_discrete_map={"setosa": "blue", "versicolor": "violet", "virginica": "pink"})\
    .show()
This one opens in your browser and requires HTML5; you can explore it as you wish. The next figure shows it. Remember that it is a scatter plot and that the size of each ball encodes petal_width, so all four features appear in a single plot.
Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification
problems. It is called Naive Bayes because the calculations of the probabilities for each class are
simplified to make their calculations tractable. Rather than attempting to calculate the
probabilities of each attribute value, they are assumed to be conditionally independent given the
class value. This is a very strong assumption that is most unlikely in real data, i.e. that the
attributes do not interact. Nevertheless, the approach performs surprisingly well on data where
this assumption does not hold.
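To make the independence assumption concrete, here is a small sketch that computes the Gaussian Naive Bayes posterior by hand and compares it with scikit-learn's GaussianNB. It assumes the iris data from sklearn.datasets and Gaussian per-feature likelihoods; the 1e-9 variance smoothing term is an assumption meant to mirror scikit-learn's default var_smoothing. The key line is the sum over features of the per-feature log-likelihoods, which is exactly where the attributes are treated as conditionally independent given the class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class prior, per-class feature means and per-class feature variances.
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
# Assumption: mirror scikit-learn's default smoothing (var_smoothing=1e-9).
variances += 1e-9 * X.var(axis=0).max()

# log P(class) + sum_j log N(x_j | mean_cj, var_cj)
# Summing over features j is the "naive" conditional-independence step.
diff = X[:, None, :] - means[None, :, :]          # (n_samples, n_classes, n_features)
log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                  + diff ** 2 / variances[None, :, :]).sum(axis=2)
log_joint = np.log(priors)[None, :] + log_lik

# Normalise the joint into posterior probabilities (shifted for numerical stability).
log_joint -= log_joint.max(axis=1, keepdims=True)
manual_proba = np.exp(log_joint)
manual_proba /= manual_proba.sum(axis=1, keepdims=True)

sk_proba = GaussianNB().fit(X, y).predict_proba(X)
print("max abs difference:", np.abs(manual_proba - sk_proba).max())
```

The printed difference should be negligible, which suggests the library is doing nothing more than this factorised computation.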
Here is an example of developing a model to predict the labels of this data set. You can use it as a template for other models, because it covers the basics.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# Load the data set
iris = sns.load_dataset("iris")
iris = iris.rename(index=str, columns={'sepal_length': '1_sepal_length', 'sepal_width': '2_sepal_width',
                                       'petal_length': '3_petal_length', 'petal_width': '4_petal_width'})
# df1 and df2 were undefined in the original snippet; these definitions are assumed from how they are used below
df1 = iris[['1_sepal_length', '2_sepal_width', 'species']]  # the two plotted features plus the label
df2 = iris  # all four features plus the label
# Setup X and y data
X_data_plot = df1.iloc[:, 0:2]
y_labels_plot = df1.iloc[:, 2].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
x_train, x_test, y_train, y_test = train_test_split(df2.iloc[:, 0:4], y_labels_plot, test_size=0.25,
                                                    random_state=42)  # This is for the model
# Fit model
model_sk_plot = GaussianNB(priors=None)
nb_model = GaussianNB(priors=None)
model_sk_plot.fit(X_data_plot.values, y_labels_plot)  # plain arrays, so grid points can be predicted directly
nb_model.fit(x_train, y_train)
# Our 2-dimensional classifier will be over variables X and Y
N_plot = 100
X_plot = np.linspace(4, 8, N_plot)
Y_plot = np.linspace(1.5, 5, N_plot)
X_plot, Y_plot = np.meshgrid(X_plot, Y_plot)
# `height` replaces the old `size` argument of FacetGrid
plot = sns.FacetGrid(iris, hue="species", height=5, palette='husl').map(plt.scatter, "1_sepal_length",
                                                                       "2_sepal_width").add_legend()
my_ax = plot.ax
# Computing the predicted class for every point on the grid in a single call
zz = model_sk_plot.predict(np.c_[np.ravel(X_plot), np.ravel(Y_plot)])
# Reshaping the predicted class into the meshgrid shape
Z = zz.reshape(X_plot.shape)
# Plot the filled and boundary contours
my_ax.contourf(X_plot, Y_plot, Z, 2, alpha=.1, colors=('blue', 'green', 'red'))
my_ax.contour(X_plot, Y_plot, Z, 2, alpha=1, colors=('blue', 'green', 'red'))
# Add axis and title
my_ax.set_xlabel('Sepal length')
my_ax.set_ylabel('Sepal width')
my_ax.set_title('Gaussian Naive Bayes decision boundaries')
plt.show()
Add whatever you think is necessary to this; for example, decision boundaries in 3D are something I have not done before.
My question mainly comes from this post: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
In that post, the author plotted the direction and length of a vector for each variable. As I understand it, all we get after performing PCA are the eigenvectors and eigenvalues. For a dataset of dimension M x N, each eigenvector should be a 1 x N vector. So my question is: maybe the length of the vector is the eigenvalue, but how do I find the direction of the vector for each variable mathematically? And what is the physical meaning of the length of the vector?
Also, if possible, can I do similar work with the scikit-learn PCA function in Python?
Thanks!
This plot is called a biplot, and it is very useful for understanding PCA results. The coordinates of each vector are simply the values that the feature/variable has on each principal component, also known as the PCA loadings.
Example:
These loadings are accessible through print(pca.components_). Using the Iris dataset, the loadings are:
[[ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654],
 [ 0.37741762,  0.92329566,  0.02449161,  0.06694199],
 [-0.71956635,  0.24438178,  0.14212637,  0.63427274],
 [-0.26128628,  0.12350962,  0.80144925, -0.52359713]]
Here, each row is one PC and each column corresponds to one variable/feature. So feature/variable 1 has a value of 0.52106591 on PC1 and 0.37741762 on PC2. These are the values used to plot the vectors that you saw in the biplot: the coordinates of Var1 in the plot are exactly those values.
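To see where the direction and length of each arrow come from, the sketch below prints the PC1/PC2 coordinates of every feature together with the resulting arrow length and angle. It assumes the standardized iris data and scikit-learn's PCA; note that the sign of a component may be flipped depending on the library version, which mirrors an arrow without changing the interpretation. The eigenvalues the question asks about are available separately as pca.explained_variance_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X)

# Column j of components_[:2] gives the arrow of feature j in the PC1/PC2 plane.
arrows = pca.components_[:2].T            # shape (n_features, 2)
lengths = np.hypot(arrows[:, 0], arrows[:, 1])
angles = np.degrees(np.arctan2(arrows[:, 1], arrows[:, 0]))

for name, (pc1, pc2), length, angle in zip(iris.feature_names, arrows, lengths, angles):
    print(f"{name:20s} PC1={pc1:+.3f} PC2={pc2:+.3f} "
          f"length={length:.3f} angle={angle:+.1f} deg")

# Each row of components_ is a unit-length eigenvector of the covariance matrix;
# the matching eigenvalues are stored in explained_variance_.
row_norms = np.linalg.norm(pca.components_, axis=1)
print("row norms:  ", row_norms.round(6))
print("eigenvalues:", pca.explained_variance_.round(4))
```

An arrow close to length 1 means the feature is represented almost entirely by the first two components; a short arrow means most of that feature lies along the components that are not plotted.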
Finally, to create this plot in Python with sklearn you can use:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    plt.scatter(xs ,ys, c = y) #without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_.T)
plt.show()
See also this post: https://stackoverflow.com/a/50845697/5025009
and
https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
Try the pca library. This will plot the explained variance, and create a biplot.
pip install pca
A small example:
from pca import pca
# Initialize to reduce the data to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)
# Or reduce the data towards 2 PCs
model = pca(n_components=2)
# Load example dataset
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
X = pd.DataFrame(data=load_iris().data, columns=load_iris().feature_names, index=load_iris().target)
# Fit transform
results = model.fit_transform(X)
# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)
The results object is a dict containing many statistics of the PCs, the loadings, etc.
I am using Python for k-means clustering on the MNIST database (http://yann.lecun.com/exdb/mnist/). I am able to cluster the data successfully but unable to label the clusters; that is, I cannot see which cluster number holds which digit. For example, cluster 5 might hold digit 7.
I need to write code that correctly labels the clusters after the k-means clustering has been done. I also need to add a legend to the plot.
from __future__ import division, print_function, absolute_import
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D #only needed for 3D plots
#scikit learn
from sklearn.cluster import KMeans
#pandas to read excel file
import pandas
import xlrd
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
Links:
[MNIST Dataset] http://yann.lecun.com/exdb/mnist/
df = pandas.read_csv('test_encoded_with_label.csv',header=None,
delim_whitespace=True)
#df = pandas.read_excel('test_encoded_with_label.xls')
#print column names
print(df.columns)
df1 = df.iloc[:,0:2] #0 and 1, the last index is not used for iloc
labels = df.iloc[:,2]
labels = labels.values
dataset = df1.values
#train indices - depends how many samples
trainidx = np.arange(0,9999)
testidx = np.arange(0,9999)
train_data = dataset[trainidx,:]
test_data = dataset[testidx,:]
train_labels = labels[trainidx] #just 1D, no :
tpredct_labels = labels[testidx]
kmeans = KMeans(n_clusters=10, random_state=0).fit(train_data)
kmeans.labels_
#print(kmeans.labels_.shape)
plt.scatter(train_data[:,0],train_data[:,1], c=kmeans.labels_)
predct_labels = kmeans.predict(train_data)
print(predct_labels)
print('actual label', tpredct_labels)
centers = kmeans.cluster_centers_
print(centers)
plt.show()
To mark the clusters on the plot, you can use the annotate method.
Here is sample code run on the sklearn digits dataset in which I mark the centroids of the resulting clustering. Note that I label the clusters 0-9 for illustrative purposes only:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
np.random.seed(42)
digits = load_digits()
data = scale(digits.data)
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target
h = .02
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)
centroids = kmeans.cluster_centers_
# plt.get_cmap replaces plt.cm.get_cmap, which was removed in recent matplotlib versions
plt_data = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=kmeans.labels_, cmap=plt.get_cmap('Spectral', 10))
plt.colorbar()
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x')
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with a cross')
plt.xlabel('component 1')
plt.ylabel('component 2')
labels = ['{0}'.format(i) for i in range(10)]
for i in range(10):
    xy = (centroids[i, 0], centroids[i, 1])
    plt.annotate(labels[i], xy, horizontalalignment='right', verticalalignment='top')
plt.show()
This is the result you get:
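To add the legend, try:
scatter = plt.scatter(train_data[:,0], train_data[:,1], c=kmeans.labels_)
# legend_elements(num=None) builds one legend entry per distinct value passed to c=
plt.legend(*scatter.legend_elements(num=None), title="Cluster")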
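The 0-9 annotations in the sample above are only cluster ids, not digits. When the true labels are available, one common way to find out which digit each cluster holds is a majority vote over the members of the cluster. The sketch below assumes the sklearn digits dataset rather than the CSV file from the question, and the names cluster_to_digit and predicted_digits are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# For every cluster, pick the digit that occurs most often among its members.
cluster_to_digit = {}
for cluster in range(kmeans.n_clusters):
    members = digits.target[kmeans.labels_ == cluster]
    cluster_to_digit[cluster] = int(np.bincount(members, minlength=10).argmax())

# Translate cluster ids into digit names and check how often they match the truth.
predicted_digits = np.array([cluster_to_digit[c] for c in kmeans.labels_])
accuracy = np.mean(predicted_digits == digits.target)

print(cluster_to_digit)
print(f"agreement with true labels: {accuracy:.2f}")
```

Two clusters can end up mapped to the same digit, because k-means is unsupervised and does not know there should be exactly one cluster per digit. The mapped names can then be used as annotation text or legend labels in place of the raw cluster ids.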
I am using the following code to perform PCA on the iris dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# get iris data to a dataframe:
from sklearn import datasets
iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf['Species'] = [iris.target_names[a] for a in iris.target]
# perform pca:
from sklearn.decomposition import PCA
model = PCA(n_components=2)
scores = model.fit_transform(irisdf.iloc[:,0:4])
loadings = model.components_
# plot results:
scoredf = pd.DataFrame(data=scores, columns=['PC1','PC2'])
scoredf['Grp'] = irisdf.Species
sns.lmplot(fit_reg=False, x="PC1", y='PC2', hue='Grp', data=scoredf) # plot point;
loadings = loadings.T
for e, pt in enumerate(loadings):
    plt.plot([0,pt[0]], [0,pt[1]], '--b')
    plt.text(x=pt[0], y=pt[1], s=varnames[e], color='b')
plt.show()
I am getting the following plot:
However, when I compare it with plots from other sites (e.g. http://marcoplebani.com/pca/), my plot does not look correct. The following differences seem to be present:
Petal length and petal width lines should have similar lengths.
Sepal length line should be closer to petal length and petal width lines rather than closer to sepal width line.
All 4 lines should be on the same side of x-axis.
Why is my plot not correct? Where is the error, and how can it be corrected?
It depends on whether you scale the variance or not. The "other site" uses scale=TRUE. If you want to do this with sklearn, add StandardScaler before fitting the model and fit the model with scaled data, like this:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(irisdf.iloc[:,0:4])
scores = model.fit_transform(X)
Edit: Difference between StandardScaler and normalize
Here is an answer which points out a key difference (row vs column). Even if you use normalize here, you might want to consider X = normalize(X.T).T. The following code shows some differences after transformation:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
fig, ax = plt.subplots(2, 2, figsize=(16, 12))
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf.plot(kind='kde', title='Raw data', ax=ax[0][0])
irisdf_std = pd.DataFrame(data=StandardScaler().fit_transform(irisdf), columns=varnames)
irisdf_std.plot(kind='kde', title='StandardScaler', ax=ax[0][1])
irisdf_norm = pd.DataFrame(data=normalize(irisdf), columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize (per row)', ax=ax[1][0])
irisdf_norm = pd.DataFrame(data=normalize(irisdf.T).T, columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize (per column)', ax=ax[1][1])
plt.show()
I'm not sure how deep I can go into the algorithm/math. The point of StandardScaler is to give every feature the same mean and variance. The assumption is that variables with large measurement units are not necessarily (and should not be) dominant in PCA. In other words, StandardScaler makes features contribute equally to PCA. As you can see, normalize won't give a consistent mean or variance.
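The row-versus-column difference can also be checked numerically. The sketch below assumes the iris data from sklearn.datasets: StandardScaler works per feature (column) and yields mean 0 and standard deviation 1 for each one, while normalize works per sample (row) and only guarantees that every row has unit L2 norm.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, normalize

X = load_iris().data

X_std = StandardScaler().fit_transform(X)   # column-wise: one mean/std per feature
X_norm = normalize(X)                       # row-wise: each sample scaled to unit L2 norm

row_norms = np.linalg.norm(X_norm, axis=1)

print("StandardScaler column means:", X_std.mean(axis=0).round(3))
print("StandardScaler column stds: ", X_std.std(axis=0).round(3))
print("normalize row norms (first 3):", row_norms[:3].round(3))
print("normalize column stds:        ", X_norm.std(axis=0).round(3))
```

The column standard deviations after normalize still differ from feature to feature, which is why it does not equalise the contribution of each variable to PCA.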