Weighted network between cluster centroids - Python

The following code plots vectors derived from two sets of data points. I also compute and plot the centroids of these points using k-means clustering.
I'm hoping to compute some form of adjacency matrix so I can plot the network between the clusters, where each edge is weighted by the number of vectors running between the corresponding pair of clusters.
I was thinking the diagonal values of the adjacency matrix could indicate the number of vectors that start and end in the same cluster, while the off-diagonal values could indicate the number of vectors between different clusters, taking direction into account.
I'm hoping to produce an output similar to the one below, where the nodes are the cluster centroids, the diameter of each node indicates the number of vectors within that cluster, and the line thickness indicates the number of vectors between the two clusters.
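For what I have in mind, a rough sketch might look like this (my own illustration, assuming a single KMeans fit assigns a cluster label to each vector's start point and to its end point; directed_adjacency is a hypothetical helper, not part of the code below):

import numpy as np

def directed_adjacency(start_labels, end_labels, n_clusters):
    # adj[i, j] counts vectors whose tail is in cluster i and whose head is in cluster j;
    # the diagonal therefore counts vectors that stay within a single cluster
    adj = np.zeros((n_clusters, n_clusters), dtype=int)
    for s, e in zip(start_labels, end_labels):
        adj[s, e] += 1
    return adj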
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
fig, ax = plt.subplots(figsize = (6,6))
df = pd.DataFrame(np.random.randint(-80,80,size=(500, 4)), columns=list('ABCD'))
A = df['A']
B = df['B']
C = df['C']
D = df['D']
Y_sklearn = df[['A','B','C','D']].values
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
model = KMeans(n_clusters = 20)
model.fit(Y_sklearn)
cluster_centers = model.cluster_centers_

plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
            color = 'black', s = 100,
            alpha = 0.7, zorder = 2)
plt.scatter(Y_sklearn[:,0], Y_sklearn[:,1], color = 'blue', alpha = 0.2)
plt.scatter(Y_sklearn[:,2], Y_sklearn[:,3], color = 'red', alpha = 0.2)
Edit 2:
If I fix the data to get the intended network below, the following plots a total of 12 vectors. Two groups of 5 overlap, while two are unique.
df = pd.DataFrame({
    'A' : [5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 5, 4],
    'B' : [5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 5, 2],
    'C' : [7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 7],
    'D' : [7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 7],
})
fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
A = df['A']
B = df['B']
C = df['C']
D = df['D']
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
If I just plot the scatter with cluster centroids, it should look like the following:
Y_sklearn = df[['A','B','C','D']].values

model = KMeans(n_clusters = 4)
model.fit(Y_sklearn)
cluster_centers = model.cluster_centers_

plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
            color = 'black', s = 100,
            alpha = 0.7, zorder = 2)
plt.scatter(cluster_centers[:, 2], cluster_centers[:, 3],
            color = 'black', s = 100,
            alpha = 0.7, zorder = 2)
This all works fine. The next step is where I'm having trouble. If I plot the network manually, it should look something like this: the thicker lines represent 5 vectors between centroids, while the thinner lines represent 1 vector.
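For reference, the expected weights can be tallied directly from the toy DataFrame above (a quick check I added for clarity, using df as defined in the edit):

from collections import Counter

tails = list(zip(df['A'].tolist(), df['B'].tolist()))
heads = list(zip(df['C'].tolist(), df['D'].tolist()))
expected = Counter(zip(tails, heads))
# -> {((5, 5), (7, 7)): 5, ((4, 2), (3, 4)): 5, ((5, 5), (3, 4)): 1, ((4, 2), (7, 7)): 1}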
The updated code below produces the following network. The (5,5) to (7,7) line is correct, but I'm not getting the other lines that would replicate the network above.
from collections import Counter
import networkx as nx

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

def kmeans(arr, num_clusters):
    model = KMeans(n_clusters = num_clusters)
    model.fit(arr)
    cluster_centers = model.cluster_centers_
    all_labels = model.labels_
    mem_count = Counter(all_labels)
    return cluster_centers, all_labels, mem_count

nclusters_1, nclusters_2 = 2, 2
points = df[['A','B','C','D']].values

cluster_one = kmeans(points[:,:2], nclusters_1)
cluster_two = kmeans(points[:,2:], nclusters_2)

# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]

num_connections = {}
for item in all_combs:
    l1, l2 = cluster_one[1], cluster_two[1]
    mask1 = np.where(l1 == item[0])[0]
    mask2 = np.where(l2 == item[1])[0]
    num_common = len(set(mask1).intersection(mask2))
    num_connections[(item[0], item[1] + nclusters_1)] = num_common

G = nx.Graph()
node_sizes = {}
node_colors = {}
for k, v in num_connections.items():
    # the number of points in the two clusters
    s1, s2 = cluster_one[2][k[0]], cluster_two[2][k[1] - nclusters_1]
    G.add_node(k[0], pos=points[:,:2][k[0]])
    G.add_node(k[1], pos=points[:,2:][k[1]])
    G.add_edge(k[0], k[1], color='k', weight=v/3)
    node_sizes[k[0]] = s1; node_sizes[k[1]] = s2
    node_colors[k[0]] = 'k'; node_colors[k[1]] = 'k'

edges = G.edges()
d = dict(G.degree)
pos = nx.get_node_attributes(G, 'pos')
weights = [G[u][v]['weight'] for u, v in edges]

nx.draw(G, pos,
        edgelist=edges,  # 'edgelist' is the keyword nx.draw understands (not 'edges')
        node_color=[node_colors[v] for v in d.keys()],
        nodelist=d.keys(),
        width=weights,
        node_size=[node_sizes[v]*20 for v in d.keys()])

Before solving the problem: I think the KMeans clustering should be performed on the first two columns of the df and on the other two columns separately. If you apply KMeans clustering to the entire df at once, the connections between the (A,B) clusters and the (C,D) clusters will be trivial.
If it is possible to use networkx in your project, here is how you can achieve what you are looking for. First, prepare the data
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans(arr, num_clusters):
    model = KMeans(n_clusters = num_clusters)
    model.fit(arr)
    cluster_centers = model.cluster_centers_
    all_labels = model.labels_
    mem_count = Counter(all_labels)
    return cluster_centers, all_labels, mem_count

df = pd.DataFrame(np.random.randint(-80, 80, size=(200, 4)), columns=list('ABCD'))

# tail
A, B = df['A'], df['B']
# end
C, D = df['C'], df['D']

nclusters_1, nclusters_2 = 20, 20
points = df[['A','B','C','D']].values

cluster_one = kmeans(points[:,:2], nclusters_1)
cluster_two = kmeans(points[:,2:], nclusters_2)

# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]

num_connections = {}
for item in all_combs:
    l1, l2 = cluster_one[1], cluster_two[1]
    mask1 = np.where(l1 == item[0])[0]
    mask2 = np.where(l2 == item[1])[0]
    num_common = len(set(mask1).intersection(mask2))
    num_connections[(item[0], item[1] + nclusters_1)] = num_common
As you can see in the above code, I performed KMeans clustering on (A,B) and (C,D) separately, and you can use different numbers of clusters for the two sets of points via nclusters_1 and nclusters_2. After the data is prepared, we can visualize it using networkx:
# plot the graph
import networkx as nx

G = nx.Graph()
node_sizes = {}
node_colors = {}
for k, v in num_connections.items():
    # the number of points in the two clusters
    s1, s2 = cluster_one[2][k[0]], cluster_two[2][k[1] - nclusters_1]
    G.add_node(k[0], pos=cluster_one[0][k[0]])
    G.add_node(k[1], pos=cluster_two[0][k[1] - nclusters_1])
    G.add_edge(k[0], k[1], color='k', weight=v/3)
    node_sizes[k[0]] = s1; node_sizes[k[1]] = s2
    node_colors[k[0]] = 'navy'; node_colors[k[1]] = 'darkviolet'

edges = G.edges()
d = dict(G.degree)
pos = nx.get_node_attributes(G, 'pos')
weights = [G[u][v]['weight'] for u, v in edges]

nx.draw(G, pos,
        edgelist=edges,  # 'edgelist' is the keyword nx.draw understands (not 'edges')
        node_color=[node_colors[v] for v in d.keys()],
        nodelist=d.keys(),
        width=weights,
        node_size=[node_sizes[v]*20 for v in d.keys()])
The output figure looks like this:
In this figure, the actual coordinates (the cluster centroids) are used as node positions; the (A,B) clusters are colored navy and the (C,D) clusters are colored violet. If you want to scale up the node sizes, just use a number other than 20 in this line
node_size=[node_sizes[v]*20 for v in d.keys()]
To adjust the width of the edges, you can use a number other than 3 in this line
G.add_edge(k[0],k[1],color='k',weight=v/3)
UPDATE
In the script above, each key of num_connections represents a pair of graph nodes, and the corresponding value is the number of connections between them. To extract the adjacency matrix from num_connections, you can try:
adj_mat = np.zeros((nclusters_1, nclusters_2))
for k, v in num_connections.items():
    entry = (k[0], k[1] - nclusters_1)
    adj_mat[entry] = v
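As a quick sanity check (my addition, assuming the variables from the snippets above): every point belongs to exactly one cluster in each clustering, so the row sums of adj_mat should equal the member counts of the (A,B) clusters and the column sums those of the (C,D) clusters:

row_counts = np.array([cluster_one[2][i] for i in range(nclusters_1)])
col_counts = np.array([cluster_two[2][j] for j in range(nclusters_2)])
assert np.array_equal(adj_mat.sum(axis=1), row_counts)
assert np.array_equal(adj_mat.sum(axis=0), col_counts)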
UPDATE 2
To resolve the issue in the OP's updated post:
- add pos = nx.get_node_attributes(G,'pos') to fix the node positions, and
- change G.add_node(k[0],pos=points[:,:2][k[0]]) and G.add_node(k[1],pos=points[:,2:][k[1]]) to G.add_node(k[0],pos=cluster_one[0][k[0]]) and G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1]).
With these two modifications, the script becomes:
df = pd.DataFrame({
    'A' : [5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 5, 4],
    'B' : [5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 5, 2],
    'C' : [7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 7],
    'D' : [7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 7],
})

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

A = df['A']
B = df['B']
C = df['C']
D = df['D']

def kmeans(arr, num_clusters):
    model = KMeans(n_clusters = num_clusters)
    model.fit(arr)
    cluster_centers = model.cluster_centers_
    all_labels = model.labels_
    mem_count = Counter(all_labels)
    return cluster_centers, all_labels, mem_count

nclusters_1, nclusters_2 = 2, 2
points = df[['A','B','C','D']].values

cluster_one = kmeans(points[:,:2], nclusters_1)
cluster_two = kmeans(points[:,2:], nclusters_2)

# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]

num_connections = {}
for item in all_combs:
    l1, l2 = cluster_one[1], cluster_two[1]
    mask1 = np.where(l1 == item[0])[0]
    mask2 = np.where(l2 == item[1])[0]
    num_common = len(set(mask1).intersection(mask2))
    num_connections[(item[0], item[1] + nclusters_1)] = num_common

G = nx.Graph()
node_sizes = {}
node_colors = {}
for k, v in num_connections.items():
    # the number of points in the two clusters
    s1, s2 = cluster_one[2][k[0]], cluster_two[2][k[1] - nclusters_1]
    G.add_node(k[0], pos=cluster_one[0][k[0]])
    G.add_node(k[1], pos=cluster_two[0][k[1] - nclusters_1])
    G.add_edge(k[0], k[1], color='k', weight=v/3)
    node_sizes[k[0]] = s1; node_sizes[k[1]] = s2
    node_colors[k[0]] = 'k'; node_colors[k[1]] = 'k'

edges = G.edges()
d = dict(G.degree)
pos = nx.get_node_attributes(G, 'pos')
weights = [G[u][v]['weight'] for u, v in edges]

nx.draw(G, pos,
        edgelist=edges,
        node_color=[node_colors[v] for v in d.keys()],
        nodelist=d.keys(),
        width=weights,
        node_size=[node_sizes[v]*20 for v in d.keys()])

Related

Add list to Matplotlib

I am making a randomly generated world and I'm starting off with basic graphing, trying to get something similar to Perlin noise. I did everything, but the last thing that I've written (the important one) didn't work.
import math
import random
import matplotlib.pyplot as plt

print('ur seed')
a = input()
seed = int(a)
b = (math.cos(seed) * 100)
c = round(b)
# print(c)

for i in range(10):
    z = (random.randint(-1, 2))
    change = (z + c)
    gener = []
    gener.append(change)
    time = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    #print(gener)
    #print(change)

plt.ylabel('generated')
plt.xlabel('time')
# Here I wanna add them to the graph and it is Erroring a lot
plt.scatter(time, gener)
plt.title('graph')
plt.show()
The problem is that you're setting gener to [] inside the loop instead of outside it. Also, you don't need to rebuild the time list inside the loop either.
Change

for i in range(10):
    z = (random.randint(-1, 2))
    change = (z + c)
    gener = []
    gener.append(change)
    time = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
to

gener = []
for i in range(10):
    z = (random.randint(-1, 2))
    change = (z + c)
    gener.append(change)
time = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
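Equivalently, once c has been computed from the seed, the loop can be collapsed into a list comprehension (a small sketch under the same assumptions as the snippet above):

gener = [c + random.randint(-1, 2) for _ in range(10)]
time = list(range(10))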

MatPlotLib: Scatter with multiple y values to one x value, and regression lines

I would like to create a scatter plot in matplotlib to measure the performance of my algorithm.
An example of my data is as follows:
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3] # corresponding to x = 1
y2 = [4, 5, 6] # corresponding to x = 2
y3 = [7, 8, 9] # corresponding to x = 3
y4 = [10, 11, 12] # corresponding to x = 4
y5 = [13, 14, 15] # corresponding to x = 5
What data type would be best to represent multiple y values with one x value?
In my example the relation is exponential. Is there a way to plot an exponential regression line in matplotlib?
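One convenient way to hold the data (an editorial sketch, not taken from the answer below) is a dict that maps each x to its list of y values; an exponential trend y = a*e^(b*x) can then be fitted by running np.polyfit on log(y):

import numpy as np
import matplotlib.pyplot as plt

# the y-lists from the question, keyed by their x value
data = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9], 4: [10, 11, 12], 5: [13, 14, 15]}
xs = np.concatenate([[x] * len(ys) for x, ys in data.items()])
ys = np.concatenate([np.asarray(v, dtype=float) for v in data.values()])
plt.scatter(xs, ys)

# exponential regression y = a*exp(b*x): fit a straight line to log(y)
b, log_a = np.polyfit(xs, np.log(ys), 1)
xf = np.linspace(xs.min(), xs.max(), 100)
plt.plot(xf, np.exp(log_a) * np.exp(b * xf), 'r-')
plt.show()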
I think this is related to how you analyze the data. If I understand correctly, you want to compare the time efficiency of every test, but each test run should use the same test environment (the same machine, the same input data, etc.). As a suggestion, you could use each test's average run time as a reference value when showing your test results. Here is some code you can use.
import numpy as np
import matplotlib.pyplot as plt

data_dim = 4       # number of tests
data_points = 100  # number of data points per test
data_set = np.random.rand(data_dim, data_points)
time = [list(range(len(i))) for i in data_set]

norm = np.full((data_dim, data_points), 1)
aver = []  # get each test's average value
for ndx, i in enumerate(norm):
    aver.append(i * sum(data_set[ndx]) / data_points)  # average of test ndx, broadcast to a flat line

fig = plt.figure(figsize=(10, 10))
ndx = 1
for i in range(0, 2):
    for j in range(0, 2):
        ax = fig.add_subplot(2, 2, ndx)
        ax.plot(time[ndx-1], data_set[ndx-1], 'ko')
        ax.plot(time[ndx-1], aver[ndx-1], 'r')
        ax.set_ylim(-1, 2)
        ndx += 1
plt.show()
The following is the run result. Note that the red solid line is the average of each test, which gives you a sense of each test's overall level.

Python - draw a graph with node positions

I'm trying to create a graph with the following information.
n = 6  # number of nodes
V = range(n)  # list of vertices
print("vertices", V)

# Create n random points
random.seed(1)
points = []
pos = {i: (random.randint(0, 50), random.randint(0, 100)) for i in V}
print("pos =", pos)
This gives my positions as
pos = {0: (8, 72), 1: (48, 8), 2: (16, 15), 3: (31, 97), 4: (28, 60), 5: (41, 48)}
I want to draw a graph with these nodes and some edges (which are obtained in another calculation) using matplotlib in Python. I've tried it as follows, but it didn't work.
G_1 = nx.Graph()
nx.set_node_attributes(G_1, 'pos', pos)
G_1.add_nodes_from(V)  # V is the set of nodes and V = range(6)
for (u, v) in tempedgelist:
    G_1.add_edge(v, u, capacity=1)  # tempedgelist contains my edges, e.g. tempedgelist = [[0, 2], [0, 3], [1, 2], [1, 4], [5, 3]]
nx.draw(G_1, pos, edge_labels=True)
plt.show()
Can someone please help me with this...
You only need pos for nx.draw(). You can add both the nodes and the edges using add_edges_from().
import networkx as nx
import random
G_1 = nx.Graph()
tempedgelist = [[0, 2], [0, 3], [1, 2], [1, 4], [5, 3]]
G_1.add_edges_from(tempedgelist)
n_nodes = 6
pos = {i:(random.randint(0,50),random.randint(0,100)) for i in range(n_nodes)}
nx.draw(G_1, pos, edge_labels=True)
Note: If you need to track points and positions separately, write into lists from pos:
points = []
positions = []
for i in pos:
    points.append(pos[i])
    positions.append(i)
    positions.append(pos[i])
I don't have a proper IDE right now, but one issue I spot in your code is that pos should be a dictionary; see the networkx docs on setting node attributes and on drawing.
Try this
import networkx as nx
import matplotlib.pyplot as plt
g= nx.Graph()
pos = {0:(0,0), 1:(1,2)}
g.add_nodes_from(range(2))
nx.set_node_attributes(g, 'pos', pos)
g.add_edge(0, 1)
nx.draw(g, pos, edge_labels=True)
plt.show()
Let me know if it works.
You must transform your list of positions into a dictionary:
pos = dict(zip(pos[::2],pos[1::2]))
Incidentally, you can build the graph directly from the edge list (the nodes are added automatically):
G1 = nx.Graph(tempedgelist)
nx.set_node_attributes(G_1,'capacity',1)
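A side note (my addition, not part of the original answers): in networkx 2.0 and later the argument order of set_node_attributes changed to (G, values, name), so with the pos dictionary from the question the calls above would be written as:

import networkx as nx

G_1 = nx.Graph([[0, 2], [0, 3], [1, 2], [1, 4], [5, 3]])  # nodes are added automatically
nx.set_node_attributes(G_1, pos, 'pos')     # values first, then the attribute name
nx.set_node_attributes(G_1, 1, 'capacity')  # a non-dict value is applied to every node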

Setting the colorbar after plotting data inside a loop in matplotlib

I want to plot some finite element results using tricontourf in matplotlib. To do so, a for loop needs to be created that accounts for the data at the nodes of each element (something like the fill command in MATLAB). This works nicely, but the problem lies in the display of the colorbar. It only shows the values of the last element (after the loop finishes) and it does not include the values of the whole domain.
Therefore, is there a way to 'accumulate' the data of the tricontourf so that the colorbar includes the values of the whole domain, or maybe a way to manipulate the colorbar to do so? I tried to append the data using contour_sets.append() and then use this as input to colorbar, but it does not work.
The code looks like this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# nodes and coordinates
IEN = np.array([[1,2,5,6],[2,3,4,5],[5,4,9,8],[6,5,8,7]])  # nodes vs elements
xx = np.array([1, 2, 3, 3, 2, 1, 1, 2, 3])                 # x coordinates
yy = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])                 # y coordinates
zz = np.array([50, 40, 30, 45, 20, 30, 30, 15, 10])        # magnitude

nen, nfe = np.shape(IEN)  # nodes per element, number of elements

#contour_sets = []
fig = plt.figure(1)

for e in range(nfe):
    idx = np.squeeze(IEN[:, e] - 1)  # take the element nodes
    x = xx[idx]  # x coordinates of element e
    y = yy[idx]  # y coordinates of element e
    z = zz[idx]  # magnitude values of element e

    # additional node in the middle to be able to use tricontourf
    x5 = np.mean(x)  # x coordinate
    y5 = np.mean(y)  # y coordinate
    z5 = np.mean(z)  # magnitude value

    # set up the triangle nodes (rectangular element as two triangles)
    triangles = np.array([[2, 4, 1],
                          [2, 3, 4]]) - 1

    # insert the additional middle node to the coordinates
    xt = np.insert(x, nen, [x5]).squeeze()
    yt = np.insert(y, nen, [y5]).squeeze()
    zt = np.insert(z, nen, [z5]).squeeze()

    # plot the data
    fil = plt.tricontourf(xt, yt, triangles, zt,
                          norm=plt.Normalize(vmax=zz.max(), vmin=zz.min()),
                          cmap=cm.pink)
    #contour_sets.append(fil)

# organize the plot
cb = plt.colorbar(fil, shrink=0.95, aspect=15, pad=0.05)
cb.ax.yaxis.set_offset_position('left')
cb.update_ticks()
plt.xticks(np.arange(1, 3.1, 0.5))
plt.yticks(np.arange(1, 3.1, 0.5))
plt.xlim(1, 3)
plt.ylim(1, 3)
plt.gca().set_aspect('equal', adjustable='box')
plt.tight_layout()
which gives this figure (edited for clarification purposes):
The colorbar only shows the values of zz associated with element 4.
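One possible fix (an editorial sketch, not from the original thread, reusing the zz array above): since every tricontourf call already shares the same normalization over zz, the colorbar can be driven by a ScalarMappable that carries that shared norm instead of by the last contour set:

from matplotlib import cm
import matplotlib.pyplot as plt

norm = plt.Normalize(vmin=zz.min(), vmax=zz.max())
# ... inside the loop, call plt.tricontourf(..., norm=norm, cmap=cm.pink);
#     passing a common levels=np.linspace(zz.min(), zz.max(), 11) also keeps the bands consistent ...
sm = cm.ScalarMappable(norm=norm, cmap=cm.pink)
sm.set_array([])  # needed on older matplotlib versions
cb = plt.colorbar(sm, ax=plt.gca(), shrink=0.95, aspect=15, pad=0.05)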

Unexpected result with polyfit in python

Permit me first to introduce the background of the question:
I came across a quiz which gave me a data set and a logistic equation, y = 3000 / (1 + a*e^(-k*t)).
It then asked whether that model can be linearized and, if so, to use the linear model to estimate the values of a and k.
I tried to linearize it as log(3000/y - 1) = log(a) - k*t.
And coded it in Python:
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([43.65, 109.86, 187.21, 312.67, 496.58, 707.65, 960.25, 1238.75, 1560, 1824.29, 2199, 2438.89, 2737.71])
yAss = np.log(3000/y - 1)
cof = np.polyfit(t, yAss, deg = 1)
a = math.e**(cof[0]);
k = -cof[1];
yAfter = 3000 / (1 + a*math.e**(-k*t))
sizeScalar = 10
fig = plt.figure(figsize = (sizeScalar*1.1, sizeScalar))
plt.plot(t, y, 'o', markersize = sizeScalar*0.75)
plt.plot(t, yAfter, 'r-')
plt.grid(True)
plt.show()
And got this, which is obviously incorrect.
Then, by coincidence, I changed some parts of the code:
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([43.65, 109.86, 187.21, 312.67, 496.58, 707.65, 960.25, 1238.75, 1560, 1824.29, 2199, 2438.89, 2737.71])
yAss = np.log(3000/y - 1)
cof = np.polyfit(t, yAss, deg = 1)
a = math.e**(-cof[1]); #<<<===============here. Before: a = math.e**(cof[0])
k = cof[0]; #<<<==========================and here, Before: k = -cof[1]
temp = 3000 / (1 + a*math.e**(-k*t))
yAfter = []
for itera in temp: #<<<=======================add this
    yAfter.append(3000 - itera)
sizeScalar = 10
fig = plt.figure(figsize = (sizeScalar*1.1, sizeScalar))
plt.plot(t, y, 'o', markersize = sizeScalar*0.75)
plt.plot(t, yAfter, 'r-')
plt.grid(True)
plt.show()
And received a sequence that seems to be right?
But how could that be? I thought cof[0] was beta and cof[1] was -k; if that's the case, my previous code should just have had a conceptual mistake. But by changing the order and sign of the coefficients, I got a result that fits well?! Is it purely a coincidence?
And what might be the right answer to the quiz?
np.polyfit returns the highest degree first: https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html#numpy.polyfit
Use:
k = -cof[0]
a = exp(cof[1])
Also, you can use NumPy's exponential:
yAfter = 3000/(1 + a*np.exp(-k*t))
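Putting the answer's correction back into the original script gives something like this (a sketch using the data from the question):

import numpy as np
import matplotlib.pyplot as plt

t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([43.65, 109.86, 187.21, 312.67, 496.58, 707.65, 960.25,
              1238.75, 1560, 1824.29, 2199, 2438.89, 2737.71])

# linearize y = 3000 / (1 + a*exp(-k*t))  ->  log(3000/y - 1) = log(a) - k*t
yAss = np.log(3000 / y - 1)
cof = np.polyfit(t, yAss, deg=1)  # cof[0] is the slope, cof[1] the intercept
k = -cof[0]
a = np.exp(cof[1])
yAfter = 3000 / (1 + a * np.exp(-k * t))

plt.plot(t, y, 'o')
plt.plot(t, yAfter, 'r-')
plt.grid(True)
plt.show()

As for the apparent coincidence: the second attempt recovers 1/a and -k, and 3000 - 3000/(1 + e^(k*t)/a) simplifies back to 3000/(1 + a*e^(-k*t)), so it is algebraically the same curve rather than luck.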
