Related
The following plots vectors derived from two sets of data points. I also measure and plot the centroid of these points using k-means clustering.
I'm hoping to measure some form of adjacency matrix to plot the network between each cluster based on the number of vectors, which also accounts for the amount of vectors between each cluster. So displaying the weight.
I was thinking the diagonal values of the adjacency matrix could indicate the number of vectors in the same cluster, while the non-diagonal values could indicate the number of vectors between different clusters, while considering the direction?
I'm hoping to produce an output to the one below. Where the nodes are the centroid of the cluster. The diameter of the node should indicate the number of vectors in the same cluster and the line thickness is the number of vectors between the two clusters.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
fig, ax = plt.subplots(figsize = (6,6))
df = pd.DataFrame(np.random.randint(-80,80,size=(500, 4)), columns=list('ABCD'))
A = df['A']
B = df['B']
C = df['C']
D = df['D']
Y_sklearn = df[['A','B','C','D']].values
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
model = KMeans(n_clusters = 20)
model.fit(Y_sklearn)
model.cluster_centers_
cluster_centers = model.cluster_centers_
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
color = 'black', s = 100,
alpha = 0.7, zorder = 2)
plt.scatter(Y_sklearn[:,0], Y_sklearn[:,1], color = 'blue', alpha = 0.2);
plt.scatter(Y_sklearn[:,2], Y_sklearn[:,3], color = 'red', alpha = 0.2);
Edit 2:
If if fix the data to get the intended network below, the following plots a total of 12 vectors. Two groups of 5 are overlapping, while two are unique.
df = pd.DataFrame({
'A' : [5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 5, 4],
'B' : [5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 5, 2],
'C' : [7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 7],
'D' : [7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 7],
})
fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
A = df['A']
B = df['B']
C = df['C']
D = df['D']
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
If I just plot the scatter with cluster centroids, it should look like the following:
Y_sklearn = df[['A','B','C','D']].values
model = KMeans(n_clusters = 4)
model.fit(Y_sklearn)
model.cluster_centers_
cluster_centers = model.cluster_centers_
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
color = 'black', s = 100,
alpha = 0.7, zorder = 2)
plt.scatter(cluster_centers[:, 2], cluster_centers[:, 3],
color = 'black', s = 100,
alpha = 0.7, zorder = 2)
This all works fine. The next step is where I'm having trouble. If I plot the network manually, it should look something like this. The thicker lines display 5 vectors between centroids, while the thinner lines display 1 vector.
The updated code produces the following network. The 5,5 - 7,7 line is correct, but I'm not getting the other lines that should replicate something similar to the network above.
fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
def kmeans(arr,num_clusters):
model = KMeans(n_clusters = num_clusters)
model.fit(arr)
model.cluster_centers_
cluster_centers = model.cluster_centers_
all_labels = model.labels_
mem_count = Counter(all_labels)
return cluster_centers,all_labels,mem_count
nclusters_1,nclusters_2 = 2,2
points= df[['A','B','C','D']].values
cluster_one = kmeans(points[:,:2],nclusters_1)
cluster_two = kmeans(points[:,2:],nclusters_2)
# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]
num_connections = {}
for item in all_combs:
l1,l2 = cluster_one[1],cluster_two[1]
mask1 = np.where(l1==item[0])[0]
mask2 = np.where(l2==item[1])[0]
num_common = len(list(set(mask1).intersection(mask2)))
num_connections[(item[0],item[1]+nclusters_1)] = num_common
G = nx.Graph()
node_sizes = {}
node_colors = {}
for k,v in num_connections.items():
# the number of points in the two clusters
s1,s2 = cluster_one[2][k[0]],cluster_two[2][k[1]-nclusters_1]
G.add_node(k[0],pos=points[:,:2][k[0]])
G.add_node(k[1],pos=points[:,2:][k[1]])
G.add_edge(k[0],k[1],color='k',weight=v/3)
node_sizes[k[0]] = s1;node_sizes[k[1]] = s2
node_colors[k[0]] = 'k';node_colors[k[1]] = 'k'
edges = G.edges()
d = dict(G.degree)
pos=nx.get_node_attributes(G,'pos')
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G,pos,edges=edges,
node_color=[node_colors[v] for v in d.keys()],
nodelist=d.keys(),
width=weights,
node_size=[node_sizes[v]*20 for v in d.keys()])
Before solving the problem, I guess the KMeans clustering should be performed for the first two columns of the df and the other two columns separately. If you apply KMeans clustering directly to the entire df, the connections between two clusters (A,B) and (C,D) will be trivial.
If it is possible to use networkx in your project, here is how you can achieve what you are looking for. First, prepare the data
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def kmeans(arr,num_clusters):
model = KMeans(n_clusters = num_clusters)
model.fit(arr)
model.cluster_centers_
cluster_centers = model.cluster_centers_
all_labels = model.labels_
mem_count = Counter(all_labels)
return cluster_centers,all_labels,mem_count
df = pd.DataFrame(np.random.randint(-80,80,size=(200, 4)), columns=list('ABCD'))
# tail
A,B = df['A'],df['B']
# end
C,D = df['C'],df['D']
nclusters_1,nclusters_2 = 20,20
points= df[['A','B','C','D']].values
cluster_one = kmeans(points[:,:2],nclusters_1)
cluster_two = kmeans(points[:,2:],nclusters_2)
# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]
num_connections = {}
for item in all_combs:
l1,l2 = cluster_one[1],cluster_two[1]
mask1 = np.where(l1==item[0])[0]
mask2 = np.where(l2==item[1])[0]
num_common = len(list(set(mask1).intersection(mask2)))
num_connections[(item[0],item[1]+nclusters_1)] = num_common
As you can see in the above code, I performed KMeans clustering to (A,B) and (C,D) separately, and you can use different number of clusters for the two sets of points by using different nclusters_1 and nclusters_2. After the data is prepared, we can now visualize it using networkx
# plot the graph
import networkx as nx
G = nx.Graph()
node_sizes = {}
node_colors = {}
for k,v in num_connections.items():
# the number of points in the two clusters
s1,s2 = cluster_one[2][k[0]],cluster_two[2][k[1]-nclusters_1]
G.add_node(k[0],pos=cluster_one[0][k[0]])
G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1])
G.add_edge(k[0],k[1],color='k',weight=v/3)
node_sizes[k[0]] = s1;node_sizes[k[1]] = s2
node_colors[k[0]] = 'navy';node_colors[k[1]] = 'darkviolet'
edges = G.edges()
d = dict(G.degree)
pos=nx.get_node_attributes(G,'pos')
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G,pos,edges=edges,
node_color=[node_colors[v] for v in d.keys()],
nodelist=d.keys(),
width=weights,
node_size=[node_sizes[v]*20 for v in d.keys()])
The output figure looks like
In this figure, the actual positions of the points are used for plotting, (A,B) are colored navy and (C,D) are colored violet. If you want to scale up the node size, just use a different number than 20 for this line
node_size=[node_sizes[v]*20 for v in d.keys()]
In order to adjust the width of the edges, use can use a different number other than 3 in this line
G.add_edge(k[0],k[1],color='k',weight=v/3)
UPDATE
In the script above, the keys of num_connections represent two graph nodes, and the corresponding values represent the number of connections. In order to extract adjacency matrix from num_connections, you can try
adj_mat = np.zeros((nclusters_1,nclusters_2))
for k,v in num_connections.items():
entry = (k[0],k[1]-nclusters_1)
adj_mat[entry] = v
UPDATE 2
To resolve the issue in OP's updated post,
add this pos=nx.get_node_attributes(G,'pos') to fix the node positions
change G.add_node(k[0],pos=points[:,:2][k[0]]) and G.add_node(k[1],pos=points[:,2:][k[1]]) to G.add_node(k[0],pos=cluster_one[0][k[0]]) and G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1])
with these two modifications,
df = pd.DataFrame({
'A' : [5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 5, 4],
'B' : [5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 5, 2],
'C' : [7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 7],
'D' : [7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 7],
})
fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
A = df['A']
B = df['B']
C = df['C']
D = df['D']
def kmeans(arr,num_clusters):
model = KMeans(n_clusters = num_clusters)
model.fit(arr)
model.cluster_centers_
cluster_centers = model.cluster_centers_
all_labels = model.labels_
mem_count = Counter(all_labels)
return cluster_centers,all_labels,mem_count
nclusters_1,nclusters_2 = 2,2
points= df[['A','B','C','D']].values
cluster_one = kmeans(points[:,:2],nclusters_1)
cluster_two = kmeans(points[:,2:],nclusters_2)
# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]
num_connections = {}
for item in all_combs:
l1,l2 = cluster_one[1],cluster_two[1]
mask1 = np.where(l1==item[0])[0]
mask2 = np.where(l2==item[1])[0]
num_common = len(list(set(mask1).intersection(mask2)))
num_connections[(item[0],item[1]+nclusters_1)] = num_common
G = nx.Graph()
node_sizes = {}
node_colors = {}
for k,v in num_connections.items():
# the number of points in the two clusters
s1,s2 = cluster_one[2][k[0]],cluster_two[2][k[1]-nclusters_1]
G.add_node(k[0],pos=cluster_one[0][k[0]])
G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1])
G.add_edge(k[0],k[1],color='k',weight=v/3)
node_sizes[k[0]] = s1;node_sizes[k[1]] = s2
node_colors[k[0]] = 'k';node_colors[k[1]] = 'k'
edges = G.edges()
d = dict(G.degree)
pos=nx.get_node_attributes(G,'pos')
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G,pos,edges=edges,
node_color=[node_colors[v] for v in d.keys()],
nodelist=d.keys(),
width=weights,
node_size=[node_sizes[v]*20 for v in d.keys()])
I have a graph where my nodes can have multiple edges between them in both directions and I want to set the width between the nodes based on the sum of all edges between them.
import networkx as nx
nodes = [0,1]
edges = [(0,1),(1,0)]
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
weights = [2,3]
nx.draw(G, width = weights)
I would like to have the width between 0 and 1 set to 5 as that is the summed weight.
First you need to create a MultiDiGraph and add all possible edges to it. This is because it supports multiple directed egdes between the same set of nodes including self-loops.
import networkx as nx
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0,1), (1,0), (1, 0),(0, 1), (2, 3), (2, 3), (2, 3), (2, 3),
(4, 1), (4, 1), (4, 1), (4, 1), (4, 1), (4, 1), (4, 5), (5, 0)]
G = nx.MultiDiGraph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
Next, create a dictionary containing counts of each edges
from collections import Counter
width_dict = Counter(G.edges())
edge_width = [ (u, v, {'width': value})
for ((u, v), value) in width_dict.items()]
Now create a new DiGraph from the edge_width dictionary created above
G_new = nx.DiGraph()
G_new.add_edges_from(edge_width)
Plotting using thickened edges
This is an extension of answer mentioned here.
edges = G_new.edges()
weights = [G_new[u][v]['width'] for u,v in edges]
nx.draw(G_new, edges=edges, width=weights)
Add Edge labels
See this answer for more info.
pos = nx.spring_layout(G_new)
nx.draw(G_new, pos)
edge_labels=dict([((u,v,),d['width'])
for u,v,d in G_new.edges(data=True)])
nx.draw_networkx_edges(G_new, pos=pos)
nx.draw_networkx_edge_labels(G_new, pos, edge_labels=edge_labels,
label_pos=0.25, font_size=10)
You can also view this Google Colab Notebook with working code.
References
https://stackoverflow.com/a/25651827/8160718
https://stackoverflow.com/a/22862610/8160718
Drawing networkX edges
MultiDiGraph in NetworkX
Count occurrences of List items
I have a graph of different locations:
import networkx as nx
G = nx.Graph()
for edge in Edge.objects.all():
G.add_edge(edge.from_location, edge.to_location, weight=edge.distance)
The locations (nodes) have different types (toilets, building entrances, etc.) I need to find the shortest way from some given location to any location of a specific type. (For example: Find the nearest entrance from a given node.)
Is there some method in the Networkx library to solve that without loops? Something like:
nx.shortest_path(
G,
source=start_location,
target=[first_location, second_location],
weight='weight'
)
The result will be the shortest path to either the first_location or the second_location, if both locations are of the same type.
And is there some method that also returns path length?
We will do it in three steps.
Step 1: Let's create a dummy graph to illustrate
Step 2: Plot the graph and color nodes to indicate edge lengths and special node types (toilets, entrances etc.)
Step 3: From any given node (source) calculate shortest path to all reachable nodes, then subset to the node types of interest and select path with the minimum length.
The code below can definitely be optimized, but this might be easier to follow.
Step 1: Create the graph
edge_objects = [(1,2, 0.4), (1, 3, 1.7), (2, 4, 1.2), (3, 4, 0.3), (4 , 5, 1.9),
(4 ,6, 0.6), (1,7, 0.4), (3,5, 1.7), (2, 6, 1.2), (6, 7, 0.3),
(6, 8, 1.9), (8,9, 0.6)]
toilets = [5,9] # Mark two nodes (5 & 9) to be toilets
entrances = [2,7] # Mark two nodes (2 & 7) to be Entrances
common_nodes = [1,3,4,6,8] #all the other nodes
node_types = [(9, 'toilet'), (5, 'toilet'),
(7, 'entrance'), (2, 'entrance')]
#create the networkx Graph with node types and specifying edge distances
G = nx.Graph()
for n,typ in node_types:
G.add_node(n, type=typ) #add each node to the graph
for from_loc, to_loc, dist in edge_objects:
G.add_edge(from_loc, to_loc, distance=dist) #add all the edges
Step 2: Draw the graph
#Draw the graph (optional step)
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True)
edge_labels = nx.get_edge_attributes(G,'distance')
nx.draw_networkx_edge_labels(G, pos, edge_labels = edge_labels)
nx.draw_networkx_nodes(G, pos, nodelist=toilets, node_color='b')
nx.draw_networkx_nodes(G, pos, nodelist=entrances, node_color='g')
nx.draw_networkx_nodes(G, pos, nodelist=common_nodes, node_color='r')
plt.show()
Step 3: create small functions to find the shortest path to node type
def subset_typeofnode(G, typestr):
'''return those nodes in graph G that match type = typestr.'''
return [name for name, d in G.nodes(data=True)
if 'type' in d and (d['type'] ==typestr)]
#All computations happen in this function
def find_nearest(typeofnode, fromnode):
#Calculate the length of paths from fromnode to all other nodes
lengths=nx.single_source_dijkstra_path_length(G, fromnode, weight='distance')
paths = nx.single_source_dijkstra_path(G, fromnode)
#We are only interested in a particular type of node
subnodes = subset_typeofnode(G, typeofnode)
subdict = {k: v for k, v in lengths.items() if k in subnodes}
#return the smallest of all lengths to get to typeofnode
if subdict: #dict of shortest paths to all entrances/toilets
nearest = min(subdict, key=subdict.get) #shortest value among all the keys
return(nearest, subdict[nearest], paths[nearest])
else: #not found, no path from source to typeofnode
return(None, None, None)
Test:
find_nearest('entrance', fromnode=5)
produces:
(7, 2.8, [5, 4, 6, 7])
Meaning: The nearest 'entrance' node from 5 is 7, the path length is 2.8 and the full path is: [5, 4, 6, 7]. Hope this helps you move forward. Please ask if anything is not clear.
Is there a simpler, easier way to convert coordinates (long, lat) to a "networkx"-graph, than nested looping over those coordinates and adding weighted nodes/edges for each one?
for idx1, itm1 in enumerate(data):
for idx2, itm2 in enumerate(data):
pos1 = (itm1["lng"], itm1["lat"])
pos2 = (itm2["lng"], itm2["lat"])
distance = vincenty(pos1, pos2).meters #geopy distance
# print(idx1, idx2, distance)
graph.add_edge(idx1, idx2, weight=distance)
The target is representing points as a graph in order to use several functions on this graph.
Edit: Using an adjacency_matrix would still need a nested loop
You'll have to do some kind of loop. But if you are using an undirected graph you can eliminate half of the graph.add_edge() (only need to add u-v and not v-u). Also as #EdChum suggests you can use graph.add_weighted_edges_from() to make it go faster.
Here is a nifty way to do it
In [1]: from itertools import combinations
In [2]: import networkx as nx
In [3]: data = [10,20,30,40]
In [4]: edges = ( (s[0],t[0],s[1]+t[1]) for s,t in combinations(enumerate(data),2))
In [5]: G = nx.Graph()
In [6]: G.add_weighted_edges_from(edges)
In [7]: G.edges(data=True)
Out[7]:
[(0, 1, {'weight': 30}),
(0, 2, {'weight': 40}),
(0, 3, {'weight': 50}),
(1, 2, {'weight': 50}),
(1, 3, {'weight': 60}),
(2, 3, {'weight': 70})]
A very simple question:
how to compute efficiently in Python (or Cython) the following quantity.
Given the list of polygons in 3D (polygon
There is a list of polygons given in the following form:
vertex = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0],[1, 0, 0],[0.5, 0.5, 0.5]], order = 'F').T
polygons = np.array([3, 0, 1, 2, 4, 1, 2, 3 ,4])
i.e. polygon is a 1D array, which contains entries of the form [N,i1,i2,i3,i4,...],
N is the number of vertices in a polygons and then the id numbers of the vertices in the vertex array (in the example above there is one triangle with 3 vertices [0,1,2] and one polygon with 4 vertices [1,2,3,4]
I need to compute the information: a list of all edges and for each edge the information
which faces contain this edge.
And I need to do it fast: the number of vertices can be large.
Update
The polygon is closed, i.e. a polygon [4, 0, 1, 5, 7] means that there are 4 vertices and edges are 0-1, 1-5, 5-7, 7-0
The face is a synonim to polygon in fact.
Dunno if this is the fastest option, most probably not, but it works. I think the slowest part is edges.index((v, polygon[i + 1])) where we have to find if this edge is already in list. Vertex array is not really needed since edge is a pair of vertex indexes. I used face_index as a reference to polygon index since you didn't write what face is.
vertex = [[0,0,0], [0,0,1], [0,1,0],[1,0,0],[0.5,0.5,0.5]]
polygons = [3,0,1,2,4,1,2,3,4]
_polygons = polygons
edges = []
faces = []
face_index = 0
while _polygons:
polygon = _polygons[1:_polygons[0] + 1]
polygon.append(polygon[0])
_polygons = _polygons[_polygons[0] + 1:]
for i, v in enumerate(polygon[0:-1]):
if not (v, polygon[i + 1]) in edges:
edges.append((v, polygon[i + 1]))
faces.append([face_index, ])
else:
faces[edges.index((v, polygon[i + 1]))].append(face_index)
face_index += 1
edges = map(lambda edge, face: (edge, face), edges, faces)
print edges
<<< [((0, 1), [0]), ((1, 2), [0, 1]), ((2, 0), [0]), ((2, 3), [1]), ((3, 4), [1]), ((4, 1), [1])]
You can make it faster by removing line polygon.append(polygon[0]) and append first vertice of polygon to vertices list in polygon manually, which shouldn't be a problem.
I mean change polygons = [3,0,1,2,4,1,2,3,4] into polygons = [3,0,1,2,0,4,1,2,3,4,1].
PS Try to use PEP8. It is a code typing style. It says that you should put a space after every comma in iterables so it's eaasier to read.