I am working on a clustering analysis problem. My goal is to write a nested for loop that varies the number of clusters (three different values) and, for each value, cycles through the three linkage types, then plots all of the results as subplots on the same figure.
I am hoping to achieve a 3x3 grid of subplots, where each number of clusters runs along the x-axis and each linkage type for that number of clusters is displayed down the y-axis.
The csv file I am working with is simply two columns with x1 and x2 values. I excluded the code where I import and read the csv file. The code I have thus far is as follows:
X1 = input_data.X1.values
X2 = input_data.X2.values
X = np.column_stack((X1, X2))
clusters = 4
Y_Kmeans = KMeans(n_clusters = clusters)
Y_Kmeans.fit(X)
Y_Kmeans_labels = Y_Kmeans.labels_
Y_Kmeans_silhouette = metrics.silhouette_score(X, Y_Kmeans_labels, metric='sqeuclidean')
linkage_types = ['ward', 'average', 'complete']
Y_hierarchy = AgglomerativeClustering(linkage=linkage_types[0], n_clusters=clusters)
Y_hierarchy.fit(X)
Y_hierarchy_labels = Y_hierarchy.labels_
Y_hierarchy_silhouette = metrics.silhouette_score(X, Y_hierarchy_labels, metric='sqeuclidean')
I have tried this and am not getting the desired results:
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)
cluster = [4, 7, 10]
link = [0, 1, 2]
for i in cluster:
    for j in link:
        plt.scatter(X[:, 0], X[:, 1], c=colormap[Y_hierarchy_labels])
This is the output:
I see two problems:
You have to do the calculations inside the for-loops, using i and j in KMeans(n_clusters=i) and AgglomerativeClustering(linkage=linkage_types[j], n_clusters=i).
You have to enumerate() cluster and link in the for-loops to get ax = axs[number_cluster, number_link], and then draw with ax.scatter() instead of plt.scatter().
Minimal working code with random data.
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)
cluster = [4, 7, 10]
link = [0, 1, 2]
for number_cluster, i in enumerate(cluster):
    # Y_Kmeans = KMeans(n_clusters=i)
    # ... code ...
    for number_link, j in enumerate(link):
        # Y_hierarchy = AgglomerativeClustering(linkage=linkage_types[j], n_clusters=i)
        # ... code ...
        X = np.random.rand(3+j, 3+i)  # random stand-in data for this subplot
        print(X[:, 0], X[:, 1])
        ax = axs[number_cluster, number_link]  # pick the subplot for this pair
        ax.scatter(X[:, 0], X[:, 1])
        ax.set_title(f'cluster: {i}, link: {j}')
plt.show()
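Putting both points together with real clustering calls, the loop might look like the sketch below. It is only an illustration under assumptions: the helper name plot_linkage_grid and the random X are made up here as a stand-in for the csv data, and the grid is indexed as axs[row, col] with linkage types down the rows and cluster counts across the columns, which is the layout the question asks for (the snippet above puts clusters on the rows instead).

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def plot_linkage_grid(X, cluster_counts=(4, 7, 10),
                      linkage_types=("ward", "average", "complete")):
    """Hypothetical helper: one subplot per (linkage, n_clusters) pair."""
    fig, axs = plt.subplots(nrows=len(linkage_types), ncols=len(cluster_counts),
                            figsize=(15, 12), squeeze=False)
    fig.subplots_adjust(hspace=0.5)
    for col, n_clusters in enumerate(cluster_counts):
        for row, linkage in enumerate(linkage_types):
            # Refit inside the loop so every panel gets its own labels.
            model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
            labels = model.fit_predict(X)
            ax = axs[row, col]
            # Numeric labels are mapped to colors through the colormap.
            ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=10)
            ax.set_title(f"{linkage}, {n_clusters} clusters")
    return fig, axs


# Random stand-in for the two-column csv data (an assumption).
rng = np.random.default_rng(0)
X = rng.random((200, 2))
fig, axs = plot_linkage_grid(X)
```

Call plt.show() afterwards to display the figure. Returning fig and axs keeps the helper usable in scripts that want to save the figure instead.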
Related
How do I add a legend to the plots in my scenario? The text parameter is text = tfidf.transform(document), and the clusters parameter holds the unsupervised cluster labels, ranging from 0 to 19, each with its own bag of words. At the moment it is impossible to tell which color corresponds to which cluster.
def plot_tsne_pca(data, labels):
    max_label = max(labels)
    max_items = np.random.choice(range(data.shape[0]), size=3000, replace=False)
    pca = PCA(n_components=2).fit_transform(data[max_items,:].todense())
    tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(data[max_items,:].todense()))
    idx = np.random.choice(range(pca.shape[0]), size=3000, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i/max_label) for i in label_subset[idx]]
    f, ax = plt.subplots(1, 2, figsize=(20, 6))
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster Plot')
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')
plot_tsne_pca(text, clusters)
Here is the full example of the code: https://pastebin.com/3PABg7xh
You can use legend_elements() to automatically get the lists of artists and labels (or a subset thereof) needed to create a legend. See Automated legend creation for more details.
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets)
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
X_tsne = tsne.fit_transform(X)
fig, ax = plt.subplots()
sc = ax.scatter(X_tsne[:,0], X_tsne[:,1], c=y, cmap='tab10')
ax.legend(*sc.legend_elements(), title='clusters')
EDIT
In your particular case, the code was not working because legend_elements() is meant to be used when you have a mapping between a numeric c= list and a colormap. Instead, you were passing a list of colors that you constructed by hand (label_subset = [cm.hsv(i/max_label) for i in label_subset[idx]]). If you remove that line, keep a numeric label_subset, and map it to colors using cmap=, then everything works as expected:
def plot_tsne_pca(data, labels, sizelist, cmap='tab10'):
    max_label = max(labels)
    max_items = np.random.choice(range(data.shape[0]), sizelist, replace=False)
    pca = PCA(n_components=2).fit_transform(data[max_items, :].todense())
    tsne = TSNE().fit_transform(PCA(n_components=1).fit_transform(data[max_items, :].todense()))
    idx = np.random.choice(range(pca.shape[0]), sizelist, replace=False)
    label_subset = labels[max_items]
    #label_subset = [cm.hsv(i / max_label) for i in label_subset[idx]]
    f, ax = plt.subplots(1, 2, figsize=(20, 6))
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset, cmap=cmap)
    ax[0].set_title('PCA Cluster Plot')
    sc = ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset, cmap=cmap)
    ax[1].set_title('TSNE Cluster Plot')
    ax[1].legend(*sc.legend_elements(), title='clusters')
plot_tsne_pca(text, clusters, sizelist)
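One thing to watch with 20 clusters: by default legend_elements() uses num='auto', which may show only a subset of the values once there are many distinct labels. Passing num=None asks for one entry per unique label. The sketch below uses made-up random points and labels 0 to 19 as a stand-in for the t-SNE output, and the tab20 colormap is just one choice with enough distinct colors.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for the projected data: 400 points, labels 0..19.
rng = np.random.default_rng(0)
points = rng.normal(size=(400, 2))
labels = np.arange(400) % 20  # guarantees every label appears

fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab20", s=15)

# num=None: create one legend entry for every unique value in c=.
handles, texts = sc.legend_elements(num=None)
ax.legend(handles, texts, title="clusters", ncol=2,
          loc="center left", bbox_to_anchor=(1.0, 0.5))
```

Placing the legend outside the axes with bbox_to_anchor keeps the 20 entries from covering the points.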
Given the implicit function below, where 'A' is an n*2 matrix:
0 = np.dot(A, (x, y))
0 = x*A11 + y*A12
0 = x*A21 + y*A22
...
0 = x*An1 + y*An2
Is it possible, via matplotlib or other means, to plot all the lines on the same plot without a large loop?
Given an n*2 matrix A, each row i defines a line by A[i,0]*x + A[i,1]*y == 0. This means (0,0) always lies on the line, as does the point x=A[i,1], y=-A[i,0]. Multiplying that point by any value, e.g. when normalizing, again gives a point on the line.
The following code shows 3 ways to visualize these lines:
Some line segments cut by a circle, together with x=A[i,1],y=-A[i,0] and x=-A[i,1],y=A[i,0].
The same segments extended till the plot's border.
Just some end points on a circle.
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import norm
from matplotlib.collections import LineCollection
n = 10
radius = 20
A = np.random.uniform(-10, 10, (n, 2))
B = A / norm(A, axis=1, keepdims=True) * radius # normalize and put on a circle with given radius
lines = np.dstack([B[:, 1], -B[:, 0], -B[:, 1], B[:, 0]]).reshape(-1, 2, 2)
fig, axes = plt.subplots(ncols=3, figsize=(14, 4))
for ax in axes:
    ax.set_aspect('equal')
for ax in axes[:2]:
    lc = LineCollection(lines, colors='blue', linewidths=2)
    ax.add_collection(lc)
    if ax == axes[0]:
        ax.scatter(A[:, 1], -A[:, 0], color='crimson')
        ax.scatter(-A[:, 1], A[:, 0], color='crimson')
    elif ax == axes[1]:
        ax.set_xlim(-radius / 2, radius / 2)
        ax.set_ylim(-radius / 2, radius / 2)
for k in range(2):
    axes[2].scatter(lines[:, k, 0], lines[:, k, 1], color='crimson')
axes[0].set_title('lines in circle and dots')
axes[1].set_title('lines till border')
axes[2].set_title('dots on circle')
plt.show()
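The claim that (A[i,1], -A[i,0]) and any multiple of it lie on line i can be checked numerically without a loop. This is a small sketch with a random A; the variable names and the scale factor 3.5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-10, 10, (5, 2))

# One candidate point per row: (x, y) = (A[i, 1], -A[i, 0]).
points = np.column_stack([A[:, 1], -A[:, 0]])

# Row-wise dot product A[i, 0]*x + A[i, 1]*y for all rows at once.
residuals = np.einsum("ij,ij->i", A, points)

# Scaling the points keeps them on their lines.
scaled_residuals = np.einsum("ij,ij->i", A, 3.5 * points)
```

Both residual arrays come out as zeros (up to floating point), which is why normalizing to a fixed radius in the code above still yields points on each line.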
This is a follow-up to my previous couple of questions. Here's the code I'm playing with:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
dictOne = {'Name':['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh', 'Eighth', 'Ninth'],
"A":[1, 2, -3, 4, 5, np.nan, 7, np.nan, 9],
"B":[4, 5, 6, 5, 3, np.nan, 2, 9, 5],
"C":[7, np.nan, 10, 5, 8, 6, 8, 2, 4]}
df2 = pd.DataFrame(dictOne)
column = 'B'
df2[df2[column] > -999].hist(column, alpha = 0.5)
param = stats.norm.fit(df2[column].dropna()) # Fit a normal distribution to the data
print(param)
pdf_fitted = stats.norm.pdf(df2[column], *param)
plt.plot(pdf_fitted, color = 'r')
I'm trying to make a histogram of the numbers in a single column in the dataframe -- I can do this -- but with an overlaid normal curve...something like the last graph on here. I'm trying to get it working on this toy example so that I can apply it to my much larger dataset for real. The code I've pasted above gives me this graph:
Why doesn't pdf_fitted match the data in this graph? How can I overlay the proper PDF?
You should plot the histogram with density=True if you hope to compare it to a true PDF. Otherwise your normalization (amplitude) will be off.
Also, you need to specify the x-values (as an ordered array) when you plot the pdf:
fig, ax = plt.subplots()
df2[df2[column] > -999].hist(column, alpha = 0.5, density=True, ax=ax)
param = stats.norm.fit(df2[column].dropna())
x = np.linspace(*df2[column].agg(['min', 'max']), 100) # x-values
plt.plot(x, stats.norm.pdf(x, *param), color = 'r')
plt.show()
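To see why the normalization matters, the sketch below compares the total bar area of a count histogram with that of a density histogram, using the non-missing values of column B from the question. The bin count of 10 is an assumption (it is the default of both np.histogram and DataFrame.hist). As an alternative to density=True, the last lines scale the PDF up to the count histogram instead.

```python
import numpy as np
from scipy import stats

# Non-missing values of column 'B' from the question.
values = np.array([4, 5, 6, 5, 3, 2, 9, 5], dtype=float)

counts, edges = np.histogram(values, bins=10)
density, _ = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)

area_counts = (counts * widths).sum()    # n * bin_width, not 1
area_density = (density * widths).sum()  # integrates to 1, like a PDF

# Maximum-likelihood fit: mu is the mean, sigma the population std.
mu, sigma = stats.norm.fit(values)

# Alternative: keep raw counts and scale the PDF by n * bin_width.
x = np.linspace(values.min(), values.max(), 100)
scaled_pdf = stats.norm.pdf(x, mu, sigma) * len(values) * widths[0]
```

Either approach puts the curve and the bars on the same vertical scale; density=True is usually simpler when the bins have equal width or not.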
As an aside, using a histogram to compare a continuous variable with a distribution isn't always the best choice. (Your sample data are discrete, but the link uses a continuous variable.) The choice of bins can distort the apparent shape of the histogram, which may lead to incorrect inference. Instead, the ECDF is a much better (choice-free) illustration of the distribution of a continuous variable:
def ECDF(data):
    n = sum(data.notnull())
    x = np.sort(data.dropna())
    y = np.arange(1, n+1) / n
    return x,y
fig, ax = plt.subplots()
plt.plot(*ECDF(df2.loc[df2[column] > -999, 'B']), marker='o')
param = stats.norm.fit(df2[column].dropna())
x = np.linspace(*df2[column].agg(['min', 'max']), 100) # x-values
plt.plot(x, stats.norm.cdf(x, *param), color = 'r')
plt.show()
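For a feel of what the ECDF function returns, here is a NumPy-only variant run on a tiny made-up input. The name ecdf and the sample values are illustrative, not part of the code above.

```python
import numpy as np


def ecdf(values):
    """Return sorted non-missing values and their cumulative fractions."""
    values = np.asarray(values, dtype=float)
    x = np.sort(values[~np.isnan(values)])  # drop NaN, then sort
    y = np.arange(1, x.size + 1) / x.size   # fraction of points <= x[k]
    return x, y


x, y = ecdf([3, 1, np.nan, 2])
# x is [1., 2., 3.] and y is [1/3, 2/3, 1.]
```

Each y value is the share of observations at or below the matching x, so no bin width has to be chosen.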
I have 2 lists, each has 128 elements
x = [1,2,3,...,128]
y = [y1,y2,...,y128]
How should I use matplotlib to plot (x,y) with x axis appearing as shown in this screenshot?
To replicate the graph, I have (1) created 2 additional lists from the original lists, and (2) used set_xticklabels:
f, ax1 = plt.subplots(1,1,figsize=(16,7))
x1 = [1, 2, 4, 8, 16, 32, 64, 128]
y1 = [y[0],y[1],y[3],y[7],y[15],y[31],y[63],y[127]]
line1 = ax1.plot(x1,y1,label="Performance",color='b',linestyle="-")
ax1.set_xticklabels([0,1,2,4,8,16,32,64,128])
ax1.set_xlabel('Time Period',fontsize=15)
ax1.set_ylabel("Value",color='b',fontsize=15)
The problem with this approach is that only 8 pairs of values are plotted, and the other 120 pairs are omitted.
If my comments aren't clear enough, please, ask. :)
from matplotlib import pyplot as plt
# Instantiating my lists...
f = lambda x:x**2
x = [nb for nb in range(1, 129)]
y = [f(nb) for nb in x]
# New values you want to plot, with linear spacing.
indexes_to_keep = [1, 2, 4, 8, 16, 32, 64, 128]
y_to_use = [y[nb - 1] for nb in indexes_to_keep]
# First plot that shows the 128 points as a whole.
fig = plt.figure(figsize=(10, 5.4))
ax1 = fig.add_subplot(121)
ax1.plot(x, y)
ax1.set_title('Former values')
# Second plot that shows only the indexes you wish to keep.
ax2 = fig.add_subplot(122)
# my_ticks = [0, 1, 2, 3, 4, 5, 6, 7]
# meaning: my_ticks holds evenly spaced (linear) tick positions.
my_ticks = [i for i in range(len(indexes_to_keep))]
# We set the ticks we want to show, meaning : all our list
# instead of some linear spacing matplotlib will show by default
ax2.set_xticks(my_ticks)
# Then, we manually change the name of the X ticks.
ax2.set_xticklabels(indexes_to_keep)
# We will then, plot the LINEAR x axis,
# but with respect to the y-axis values pre-processed.
ax2.plot(my_ticks, y_to_use)
ax2.set_title('New selected values with linear spacing')
plt.show()
Showing...
What you are looking for is a logarithmic scale with base 2. matplotlib provides logarithmic scales and you can define any base you want:
from matplotlib import pyplot as plt
from matplotlib.ticker import ScalarFormatter
#sample data
x = list(range(1, 130))
y = list(range(3, 260, 2))
f, ax1 = plt.subplots(1,1,figsize=(16,7))
x1 = [ 1, 2, 4, 8, 16, 32, 64, 128]
y1 = [y[0],y[1],y[3],y[7],y[15],y[31],y[63],y[127]]
#just the points, where the ticks are
ax1.plot(x1, y1,"bo-", label = "Performance")
#all other points to contrast this
ax1.plot(x, [270 - i for i in y], "rx-", label = "anti-Performance")
#transform x axis into logarithmic scale with base 2
plt.xscale("log", base = 2)  # matplotlib >= 3.3; older releases used basex = 2
#modify x axis ticks from exponential representation to float
ax1.get_xaxis().set_major_formatter(ScalarFormatter())
ax1.set_xlabel('Time Period',fontsize=15)
ax1.set_ylabel("Value",color='b',fontsize=15)
plt.legend()
plt.show()
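Since the original complaint was that 120 of the 128 pairs were dropped, here is a sketch that plots every point on the base-2 axis and only restricts the ticks. It assumes matplotlib 3.3 or newer (for the base keyword), and the square-root y values are made up because the real data is not given.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter

# Made-up sample data: 128 x values and some y for each.
x = list(range(1, 129))
y = [value ** 0.5 for value in x]

fig, ax = plt.subplots(figsize=(16, 7))
ax.plot(x, y, "b-", label="Performance")  # all 128 pairs are drawn

# Set the scale first: changing the scale resets locators and formatters.
ax.set_xscale("log", base=2)
ax.set_xticks([1, 2, 4, 8, 16, 32, 64, 128])
ax.xaxis.set_major_formatter(ScalarFormatter())  # 1, 2, 4 instead of 2^0, 2^1

ax.set_xlabel("Time Period", fontsize=15)
ax.set_ylabel("Value", color="b", fontsize=15)
ax.legend()
```

Call plt.show() to display it. The ticks are evenly spaced because the axis itself is logarithmic, so no relabeling trick is needed.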
Output:
I've written a function that reads data from a csv file and plots it. Now I need to add a subplot with another part of the data from the same file, so I've tried to write a function that calls the first function and adds a subplot. When I do this, I get the two to show up as different figures. How can I suppress this and make both of them show in the same figure?
Here is a mockup of my code:
def timex(h_ratio = [3, 1]):
    import matplotlib.pyplot as plt
    import numpy as np
    import matplotlib.gridspec as gridspec
    total_height = h_ratio[0] + h_ratio[1]
    gs = gridspec.GridSpec(total_height, 1)
    time = [1, 2, 3, 4, 5]
    x = [1, 2, 3, 4, 5]
    y = [1, 1, 1, 1, 1]
    ax1 = plt.subplot(gs[:h_ratio[0], :])
    plt.plot(time, x)
    plot = plt.gcf
    plt.show()
    return time, x, y, plot, gs, h_ratio
def timeyx():
    import matplotlib.pyplot as plt
    import matplotlib.gridspec as gridspec
    time, x, y, plot, gs, h_ratio = timex(h_ratio = [3, 1])
    ax2 = plt.subplot(gs[h_ratio[1], :])
    plt.plot(time, y)
    plt.show()
timeyx()
I realize that I have two plt.show() statements, but if I remove one that figure will not show at all.
I am not sure whether you need to use matplotlib.gridspec specifically or not, but you can use subplot2grid to make the job easy.
import matplotlib.pyplot as plt
def timex():
    time = [1, 2, 3, 4, 5]
    x = [1, 2, 3, 4, 5]
    y = [1, 1, 1, 1, 1]
    ax1 = plt.subplot2grid((1,2), (0,0))
    ax1.plot(time, x)
    return time, x, y
def timeyx():
    time, x, y = timex()
    ax2 = plt.subplot2grid((1,2), (0,1))
    ax2.plot(time, y)
timeyx()
plt.show()
This produces one figure shown below with two subplots:
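If you want to keep the 3:1 height ratio from the question, another option (a sketch, not the approach used above) is to create the figure once and pass each Axes into the plotting functions, so no function calls plt.show() on its own. The names plot_x, plot_y and make_figure are hypothetical.

```python
import matplotlib.pyplot as plt


def plot_x(ax, time, x):
    """Draw the first series on the Axes it is given."""
    ax.plot(time, x)
    ax.set_ylabel("x")


def plot_y(ax, time, y):
    """Draw the second series on the Axes it is given."""
    ax.plot(time, y)
    ax.set_ylabel("y")


def make_figure(h_ratio=(3, 1)):
    time = [1, 2, 3, 4, 5]
    x = [1, 2, 3, 4, 5]
    y = [1, 1, 1, 1, 1]
    # One figure, two stacked subplots with a 3:1 height ratio.
    fig, (ax1, ax2) = plt.subplots(
        nrows=2, ncols=1, sharex=True,
        gridspec_kw={"height_ratios": list(h_ratio)})
    plot_x(ax1, time, x)
    plot_y(ax2, time, y)
    return fig, ax1, ax2


fig, ax1, ax2 = make_figure()
```

A single plt.show() after make_figure() then displays both subplots in the same window.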