What is SpectralEmbedding in sklearn?


I am using Affinity Propagation to cluster my similarity matrix sims. My code is as follows. Following an answer to my previous question, I am using SpectralEmbedding to plot the data points of the similarity matrix sims.
import sklearn.cluster
from sklearn.manifold import SpectralEmbedding
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
sims = np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)
cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)
se = SpectralEmbedding(n_components=2, affinity='precomputed')
X = se.fit_transform(sims)
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
However, I do not understand what exactly SpectralEmbedding does. Could you explain it? And is it correct to use SpectralEmbedding to plot similarity values?

Related

What is plotted when string data is passed to the matplotlib API?

# first, some imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let's say I want to make a scatter plot, using this data:
np.random.seed(42)
x=np.arange(0,50)
y=np.random.normal(loc=3000,scale=1,size=50)
Plot via:
plt.scatter(x,y)
I get this answer:
Ok, let's create a dataframe first:
df=pd.DataFrame.from_dict({'x':x,'y':y.astype(str)})
(I am aware that I am storing y as str - this is a reproducible example, and I do this to reflect the real use case.)
Then, if I do:
plt.scatter(df.x,df.y)
I get:
What am I seeing in this second plot? I thought that the second plot must be showing the x column plotted against the y column, which are converted to float. This is clearly not the case.

Matplotlib doesn't automatically convert str values to numbers, so your y values are treated as categorical. As far as Matplotlib is concerned, the difference between '1.0' and '0.9' is no different from the difference between '1.0' and '100.0': they are simply distinct labels.
So the y-axis positions on the plot are the same as range(len(y)) (since all categorical values are equally spaced), with the tick labels taken from the categorical values.
Since your x equals range(50), and your y positions now also equal range(50), the plot shows x = y, with the y tick labels set to the respective str values.
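To make the point above concrete, here is a minimal sketch that reads back the positions matplotlib actually used for string values, and then converts the strings to float to get a true numeric axis. The Agg backend is my own assumption so that the snippet runs without a display, and the small data set is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = [1, 2, 3, 4]
y_str = ["5", "25", "10", "1"]  # numbers stored as strings

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(8, 3))

# Strings are treated as categories: each new string gets the next integer slot.
sc_cat = ax_cat.scatter(x, y_str)
cat_positions = np.asarray(sc_cat.get_offsets())[:, 1]

# Converting to float first gives a real numeric axis.
y_num = np.asarray(y_str, dtype=float)
sc_num = ax_num.scatter(x, y_num)
num_positions = np.asarray(sc_num.get_offsets())[:, 1]

print(cat_positions)  # slot numbers, in order of first appearance
print(num_positions)  # the actual numeric values
```

In the original question the equivalent fix would be to plot df.y.astype(float) instead of df.y.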

As per the excellent answer by dm2, when you pass y as a string, y is simply being treated as arbitrary string labels, and being plotted one after the other in the order in which they appear. To demonstrate, here's an even simpler example.
from matplotlib import pyplot as plt
x = [1, 2, 3, 4]
y = [5, 25, 10, 1] # these are ints
plt.scatter(x, y)
So far so good. Now, different string y values.
y = list("abcd")
plt.scatter(x, y)
You can see how it takes the y labels and drops them on the axis one after another.
Finally,
y = ["5", "25", "10", "1"]
plt.scatter(x, y)
Compare this with the previous results and now it should become obvious what's going on.

It's more obvious once the tick labels and locations are extracted: the API plots the strings as labels, and the axis locations are zero-indexed integers, one per distinct category.
.get_xticks() and .get_yticks() extract a list of the numeric locations.
.get_xticklabels() and .get_yticklabels() extract a list of matplotlib.text.Text, Text(x, y, text).
There are fewer numbers in the list for the y axis because there were duplicate values as a result of rounding.
This applies to any API that uses matplotlib as the backend, such as seaborn or pandas.
sns.scatterplot(data=df, x='x_num', y='y', ax=ax1)
ax1.scatter(data=df, x='x_num', y='y')
ax1.plot('x_num', 'y', 'o', data=df)
Labels, Locs, and Text
print(x_nums_loc)
print(y_nums_loc)
print(x_lets_loc)
print(y_lets_loc)
print(x_lets_labels)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Text(0, 0, 'A'), Text(1, 0, 'B'), Text(2, 0, 'C'), Text(3, 0, 'D'), Text(4, 0, 'E'),
Text(5, 0, 'F'), Text(6, 0, 'G'), Text(7, 0, 'H'), Text(8, 0, 'I'), Text(9, 0, 'J'),
Text(10, 0, 'K'), Text(11, 0, 'L'), Text(12, 0, 'M'), Text(13, 0, 'N'), Text(14, 0, 'O'),
Text(15, 0, 'P'), Text(16, 0, 'Q'), Text(17, 0, 'R'), Text(18, 0, 'S'), Text(19, 0, 'T'),
Text(20, 0, 'U'), Text(21, 0, 'V'), Text(22, 0, 'W'), Text(23, 0, 'X'), Text(24, 0, 'Y'),
Text(25, 0, 'Z')]
Imports, Data, and Plotting
import numpy as np
import string
import pandas as pd
import matplotlib.pyplot as plt
# sample data
np.random.seed(45)
x_numbers = np.arange(100, 126)
x_letters = list(string.ascii_uppercase)
y= np.random.normal(loc=3000, scale=1, size=26).round(2)
df = pd.DataFrame.from_dict({'x_num': x_numbers, 'x_let': x_letters, 'y': y}).astype(str)
# plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3.5))
df.plot(kind='scatter', x='x_num', y='y', ax=ax1, title='X Numbers', rot=90)
df.plot(kind='scatter', x='x_let', y='y', ax=ax2, title='X Letters')
x_nums_loc = ax1.get_xticks()
y_nums_loc = ax1.get_yticks()
x_lets_loc = ax2.get_xticks()
y_lets_loc = ax2.get_yticks()
x_lets_labels = ax2.get_xticklabels()
fig.tight_layout()
plt.show()
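A smaller sketch of the same idea, with made-up data, showing why duplicate strings give fewer tick locations than data points: each distinct string gets exactly one location. The Agg backend and the explicit draw call are my own assumptions, added so the snippet runs headless and the tick labels are populated on older matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = ["a", "b", "a", "c"]  # "a" appears twice

fig, ax = plt.subplots()
ax.scatter(x, y)
fig.canvas.draw()  # make sure the tick labels are populated

# One location per distinct string, numbered in order of first appearance.
tick_locs = [int(loc) for loc in ax.get_yticks()]
tick_labels = [t.get_text() for t in ax.get_yticklabels()]

print(tick_locs)
print(tick_labels)
```

Four points are plotted, but only three tick locations exist because "a" is a single category.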

How to generate a histogram with the list below?

[[0, 0, 0, 19, 7], [0, 0, 0, 21, 7], [0, 0, 0, 21, 7], [0, 0, 0, 29, 0]]
Explaining the list: [0, 0, 0, 19, 7]
First value = repetition average between 0-20
Second value = repetition average between 20-40
Third value = repetition average between 40-60
Fourth value = repetition average between 60-80
Fifth value = repetition average between 80-100
The number of sublists within the list can grow exponentially. I would like the sublists to be spaced apart from each other, to make the graph easier to interpret.
What I have achieved so far:
import numpy as np
import matplotlib.pyplot as plt
result = [[[0, 0, 0, 19, 7], [0, 0, 0, 21, 7], [0, 0, 0, 21, 7], [0, 0, 0, 29, 0]]]
fig, ax = plt.subplots(figsize=(10,6))
for i in range(len(result)):
    data = np.array(result[i])
    x=np.arange(len(data)) + i*6
    # draw means
    ax.bar(x-0.2, data[:,0], color='blue', width=0.4)
    ax.bar(x+0.2, data[:,1], color='green', width=0.4)
    ax.bar(x-0.2, data[:,2], color='yellow', width=0.4)
    ax.bar(x+0.2, data[:,3], color='orange', width=0.4)
    ax.bar(x+0.2, data[:,4], color='red', width=0.4)
# separation line
ax.axvline(4.75)
# turn off xticks
ax.set_xticks([])
ax.legend(labels=['0-20', '20-40', '40-60', '60-80', '80-100'])
leg = ax.get_legend()
leg.legendHandles[0].set_color('blue')
leg.legendHandles[1].set_color('green')
leg.legendHandles[2].set_color('yellow')
leg.legendHandles[3].set_color('orange')
leg.legendHandles[4].set_color('red')
plt.title("Histogram")
plt.ylabel('Consume')
plt.xlabel('Percent')
plt.show()
Any suggestions?

Here is an approach to draw the described plot. Note that matplotlib normally creates only one legend entry for a complete bar graph. To get an entry for each individual bar, a label needs to be set on each of them explicitly. In the code below, such a label is added to each bar in the first set.
(Note that I left out one set of square brackets for result, as in the original post it is a 3D list. If such a 3D list were necessary, you could write the loop as for i, data in enumerate(result[0]).)
import numpy as np
import matplotlib.pyplot as plt
result = [[0, 0, 0, 19, 7], [0, 0, 0, 21, 7], [0, 0, 0, 21, 7], [0, 0, 0, 29, 0]]
colors = ['blue', 'green', 'yellow', 'orange', 'red']
labels = ['0-20', '20-40', '40-60', '60-80', '80-100']
fig, ax = plt.subplots(figsize=(10, 6))
for i, data in enumerate(result):
    x = np.arange(len(data)) + i*6
    bars = ax.bar(x, data, color=colors, width=0.4)
    if i == 0:
        for bar, label in zip(bars, labels):
            bar.set_label(label)
    if i < len(result) - 1:
        # separation line after each part, but not after the last
        ax.axvline(4.75 + i*6, color='black', linestyle=':')
ax.set_xticks([])
ax.legend()
ax.set_title("Histogram")
ax.set_ylabel('Consume')
ax.set_xlabel('Percent')
plt.show()
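As an alternative to labelling the bars of the first set, the legend can be built from proxy artists. This is only a sketch of a different technique, not what the answer above does: it uses matplotlib.patches.Patch objects that are never drawn on the axes and exist only to feed the legend. The Agg backend is my own assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch

result = [[0, 0, 0, 19, 7], [0, 0, 0, 21, 7], [0, 0, 0, 21, 7], [0, 0, 0, 29, 0]]
colors = ['blue', 'green', 'yellow', 'orange', 'red']
labels = ['0-20', '20-40', '40-60', '60-80', '80-100']

fig, ax = plt.subplots(figsize=(10, 6))
for i, data in enumerate(result):
    x = np.arange(len(data)) + i * 6  # shift each group by 6 to leave a gap
    ax.bar(x, data, color=colors, width=0.4)

# One proxy patch per colour; these are not drawn, they only feed the legend.
handles = [Patch(color=c, label=l) for c, l in zip(colors, labels)]
legend = ax.legend(handles=handles)
legend_texts = [t.get_text() for t in legend.get_texts()]
print(legend_texts)
```

The advantage of this variant is that the legend does not depend on which bar set happens to carry the labels.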

Plot specific values on y axis instead of increasing scale from dataframe

When plotting 2 columns from a dataframe into a line plot, is it possible to, instead of a consistently increasing scale, have fixed values on your y axis (and keep the distances between the numbers on the axis constant)? For example, instead of 0, 100, 200, 300, ... to have 0, 21, 53, 124, 287, depending on the values from your dataset? So basically to have on the axis all your possible values fixed instead of an increasing scale?

Yes, you can use: ax.set_yticks()
Example:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yticks(y)
plt.show()
Or, if the values are very far apart from each other, you can use ax.set_yscale('log').
Example:
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6], [20, 300]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yscale('log', base=2)  # 'basex' is not valid for the y-axis; matplotlib >= 3.3 uses base=2 (older versions: basey=2)
ax.yaxis.set_ticks(y)
ax.yaxis.set_ticklabels(y)
plt.show()

What you need to do is:
1. get all distinct y values and sort them
2. set their y positions on the plot according to their place in the ordered list
3. set the y labels according to the distinct ordered values
The code below does this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame([[13, 1], [14, 1.8], [16, 2], [15, 1.5], [17, 2], [18, 3 ],
[19, 200],[20, 3.6], ], columns = ['A','B'])
x = df['A']
y = df['B']
y_keys = np.sort(y.unique())
y_values = range(len(y_keys))
y_dict = dict(zip(y_keys,y_values))
fig, ax = plt.subplots()
ax.plot(x,[y_dict[k] for k in y],'o-')
ax.set_yticks(y_values)
ax.set_yticklabels(y_keys)
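The mapping step in the code above can also be done in a single NumPy call. The sketch below uses np.unique with return_inverse=True, which returns the sorted distinct values together with the index of each original value in that sorted list. The helper name rank_positions is hypothetical, chosen only for this illustration.

```python
import numpy as np

def rank_positions(values):
    """Hypothetical helper: map each value to its rank among the distinct values."""
    keys, positions = np.unique(values, return_inverse=True)
    return keys, positions.ravel()

# Same y values as in the answer above.
y = np.array([1, 1.8, 2, 1.5, 2, 3, 200, 3.6])
keys, positions = rank_positions(y)

print(keys)       # sorted distinct values, used as tick labels
print(positions)  # evenly spaced plot position for each original value
```

Here positions plays the role of [y_dict[k] for k in y], and keys plays the role of y_keys, so the outlier 200 sits just one step above 3.6.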

How can I draw 3D plane using PCA In python?

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[24,13,38],[8,3,17],[21,6,40],[1,14,-9],[9,3,21],[7,1,14],[8,7,11],[10,16,3],[1,3,2],
              [15,2,30],[4,6,1],[12,10,18],[1,9,-4],[7,3,19],[5,1,13],[1,12,-6],[21,9,34],[8,8,7],
              [1,18,-18],[15,8,25],[16,10,29],[7,0,17],[14,2,31],[3,7,0],[5,6,7]])
pca = PCA(n_components=1)
pca.fit(X)
a = pca.components_[0][0] # a
b = pca.components_[0][1] # b
c = pca.components_[0][2] # c

def average(values):
    # return None for an empty sequence instead of dividing by zero
    if len(values) == 0:
        return None
    return sum(values, 0.0) / len(values)

# x, y, z were not defined in the original post; assumed to be the columns of X
x, y, z = X.T
x_mean = average(x) # For an approximation
y_mean = average(y)
z_mean = average(z)
d = -(a * x_mean + b * y_mean + c * z_mean)
So the plane would be -0.375978766054x + 0.10612154283y - 0.920531469111z + 15.1366572005 = 0
Actually, I'm not sure it is right.
I want to draw a plane in this situation using matplotlib library.
How can I code this?

Each principal component defines a vector in the feature space. PCA orders those vectors by the variance of the data in each direction, so the first vector points along the direction of maximum variance and the last along the direction of minimum variance. Assuming the data are distributed around a plane, the third vector should be perpendicular to that plane. Here's the code:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
X = np.array([[24,13,38],[8,3,17],[21,6,40],[1,14,-9],[9,3,21],[7,1,14],[8,7,11],[10,16,3],[1,3,2],
[15,2,30],[4,6,1],[12,10,18],[1,9,-4],[7,3,19],[5,1,13],[1,12,-6],[21,9,34],[8,8,7],
[1,18,-18],[15,8,25],[16,10,29],[7,0,17],[14,2,31],[3,7,0],[5,6,7]])
pca = PCA(n_components=3)
pca.fit(X)
eig_vec = pca.components_
print(pca.explained_variance_ratio_)
# [0.90946569 0.08816839 0.00236591]
# The last vector explains less than 0.3% of the variance
# This is the normal vector of minimum variance
normal = eig_vec[2, :] # (a, b, c)
centroid = np.mean(X, axis=0)
# Every point (x, y, z) on the plane should satisfy a*x + b*y + c*z + d = 0
# Taking centroid as a point on the plane
d = -centroid.dot(normal)
# Draw plane
xx, yy = np.meshgrid(np.arange(np.min(X[:, 0]), np.max(X[:, 0])), np.arange(np.min(X[:, 1]), np.max(X[:, 1])))
z = (-normal[0] * xx - normal[1] * yy - d) * 1. / normal[2]
# plot the surface
plt3d = plt.figure().add_subplot(projection='3d')  # Figure.gca(projection=...) was removed in newer matplotlib
plt3d.plot_surface(xx, yy, z)
plt3d.scatter(*(X.T))
plt.show()
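To check a plane fitted this way, the signed distance of each point to the plane can be computed. The sketch below is not the code from the answer: it swaps sklearn's PCA for NumPy's SVD of the centred data, which gives the same principal directions up to sign, so that it depends on NumPy only. The variable names are my own.

```python
import numpy as np

X = np.array([[24, 13, 38], [8, 3, 17], [21, 6, 40], [1, 14, -9], [9, 3, 21],
              [7, 1, 14], [8, 7, 11], [10, 16, 3], [1, 3, 2], [15, 2, 30],
              [4, 6, 1], [12, 10, 18], [1, 9, -4], [7, 3, 19], [5, 1, 13],
              [1, 12, -6], [21, 9, 34], [8, 8, 7], [1, 18, -18], [15, 8, 25],
              [16, 10, 29], [7, 0, 17], [14, 2, 31], [3, 7, 0], [5, 6, 7]])

centroid = X.mean(axis=0)

# SVD of the centred data: the rows of vt are the principal directions,
# ordered from largest to smallest variance.
_, s, vt = np.linalg.svd(X - centroid, full_matrices=False)
variance_ratio = s**2 / np.sum(s**2)

normal = vt[-1]            # direction of least variance, used as plane normal
d = -centroid.dot(normal)  # plane equation: normal . p + d = 0

# Signed distance of every data point to the fitted plane.
distances = X.dot(normal) + d

print(variance_ratio)
print(np.abs(distances).max())
```

Small distances relative to the spread of the data suggest the points really do lie close to a plane; the distances average to zero because the plane passes through the centroid.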

The first principal component doesn't define a plane, it defines a vector in three dimensions. Here's how to visualize it in 3D: the code starts out with yours, and then has the plotting steps:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[24, 13, 38], [8, 3, 17], [21, 6, 40], [1, 14, -9], [9, 3, 21], [7, 1, 14],
[8, 7, 11], [10, 16, 3], [1, 3, 2], [15, 2, 30], [4, 6, 1], [12, 10, 18], [1, 9, -4],
[7, 3, 19], [5, 1, 13], [1, 12, -6], [21, 9, 34], [8, 8, 7], [1, 18, -18],
[15, 8, 25], [16, 10, 29], [7, 0, 17], [14, 2, 31], [3, 7, 0], [5, 6, 7]])
pca = PCA(n_components=1)
pca.fit(X)
## New code below
p = pca.components_
centroid = np.mean(X, 0)
segments = np.arange(-40, 40)[:, np.newaxis] * p
import matplotlib
matplotlib.use('TkAgg') # might not be necessary for you
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
plt.ion()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
scatterplot = ax.scatter(*(X.T))
lineplot = ax.plot(*(centroid + segments).T, color="red")
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('result.png', dpi=150)
(Note the above code was auto-formatted with yapf, which I highly recommend.) Resulting figure:

How to place lines below markers in Python?

I have to plot multiple lines and their curve-fit lines on a single plot. All of these are plotted in a for loop. Because they are plotted in a loop, the curve-fit line of each step is drawn over the markers of the preceding step, as shown in the figure.
The reproducible code:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
y = np.array([[4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24],
[6, 5.2, 8.5, 9.1, 13.4, 15.1, 16.1, 18.3, 20.4, 22.1, 23.7]])
m, n = x.shape
figure = plt.figure(figsize=(5.15, 5.15))
figure.clf()
plot = plt.subplot(111)
for i in range(m):
    poly = np.polyfit(x[i, :], y[i, :], deg =1)
    plt.plot(poly[0] * x[i, :] + poly[1], linestyle = '-')
    plt.plot(x[i, :], y[i, :], linestyle = '', marker = 'o', markersize = 20)
plot.set_ylabel('Y', labelpad = 6)
plot.set_xlabel('X', labelpad = 6)
plt.show()
I can fix this using another loop as:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
y = np.array([[4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24],
[6, 5.2, 8.5, 9.1, 13.4, 15.1, 16.1, 18.3, 20.4, 22.1, 23.7]])
m, n = x.shape
figure = plt.figure(figsize=(5.15, 5.15))
figure.clf()
plot = plt.subplot(111)
for i in range(m):
    poly = np.polyfit(x[i, :], y[i, :], deg =1)
    plt.plot(poly[0] * x[i, :] + poly[1], linestyle = '-')
for i in range(m):
    plt.plot(x[i, :], y[i, :], linestyle = '', marker = 'o', markersize = 20)
plot.set_ylabel('Y', labelpad = 6)
plot.set_xlabel('X', labelpad = 6)
plt.show()
which gives me all the fit lines below the markers.
But is there any built-in function in Python/matplotlib to do this without using two loops?
Update
I used only two data sets as an example; there can be more than two, i.e. the loop would run more times.
Update 2 after answer
Can I do this for the same line also? As an example:
plt.plot(x[i, :], y[i, :], linestyle = ':', marker = 'o', markersize = 20)
Can I give the linestyle a zorder = 1 and the markers a zorder = 3?

Editing just your plotting lines:
plt.plot(poly[0] * x[i, :] + poly[1], linestyle = '-',
         zorder=-1)
plt.plot(x[i, :], y[i, :], linestyle = '', marker = 'o', markersize = 20,
         zorder=3)
Now the markers are all in front of the lines, though within the marker and line groups they are still drawn in order of plotting.
Update answer
No. One call to plot, one zorder argument.
If you want to match the color and style of markers and line in each pass through the loop, set up an iterator or generator for colors and get current_color on each pass, then use that as an argument for plot calls.
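Putting the two suggestions together, here is a sketch of a single loop that uses zorder for layering and a colour iterator so that each fit line matches its markers. Cycling over matplotlib's default colour list with itertools.cycle is my own choice of iterator, and the Agg backend is an assumption so that the snippet runs without a display.

```python
import itertools
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
y = np.array([[4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24],
              [6, 5.2, 8.5, 9.1, 13.4, 15.1, 16.1, 18.3, 20.4, 22.1, 23.7]])

# Cycle through matplotlib's default colours, one colour per data set.
color_cycle = itertools.cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

fig, ax = plt.subplots(figsize=(5.15, 5.15))
for i in range(x.shape[0]):
    current_color = next(color_cycle)
    slope, intercept = np.polyfit(x[i], y[i], deg=1)
    # Fit line: the low zorder keeps it underneath every marker.
    ax.plot(x[i], slope * x[i] + intercept, linestyle='-',
            color=current_color, zorder=1)
    # Markers: the higher zorder draws them on top of all fit lines.
    ax.plot(x[i], y[i], linestyle='', marker='o', markersize=20,
            color=current_color, zorder=3)
ax.set_xlabel('X', labelpad=6)
ax.set_ylabel('Y', labelpad=6)
```

This keeps one loop, and because layering is decided by zorder rather than by plotting order, it works for any number of data sets.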
