I have an adjacency matrix, and I need to calculate the fraction of nodes in the largest component (or the largest weakly connected component in the case of a directed network):
# from dataframe
matrix_weak = matrix.copy()
# to numpy arrays
matrix_weak_to_numpy = matrix_weak.to_numpy()
G = nx.from_numpy_matrix(matrix_weak_to_numpy)
G = G.to_directed()  # weakly connected components need a directed graph
max_wcc = max(nx.weakly_connected_components(G), key=len)
max_wcc = nx.subgraph(G, max_wcc)
How do I calculate this fraction from the code above?
The total number of nodes in the network is G.number_of_nodes(), so if I understand correctly, the answer is:
fraction = max_wcc.number_of_nodes() / G.number_of_nodes()
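For reference, here is a minimal, self-contained sketch with a made-up 5-node adjacency matrix; it uses nx.from_numpy_array, which replaces the deprecated nx.from_numpy_matrix in recent NetworkX versions:
import networkx as nx
import numpy as np

# toy adjacency matrix with two weakly connected components: {0, 1, 2} and {3, 4}
A = np.array([[0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0]])

G = nx.from_numpy_array(A, create_using=nx.DiGraph)
max_wcc = max(nx.weakly_connected_components(G), key=len)
fraction = len(max_wcc) / G.number_of_nodes()
print(fraction)  # 3 nodes out of 5 -> 0.6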
I need to compute an edge cover of a weighted bipartite graph which I have built in Networkx. Based on this answer, I have two algorithms that respectively return a minimum weight edge cover and a minimum cardinality (and weight) one. The minimum weight algorithm presents some odd behaviour in the choice of edges, which may be related to an error that happens in the minimum cardinality algorithm, so I'll explain both situations below.
Here are a few details about the graphs being considered:
My current test case has about 1200 nodes on one side and 1600 on the other, with over a million edges
All nodes have at least one incident edge
The graph is typically disconnected in a few blocks
The problem is built as an undirected graph, but directed edges would also make sense (they would always be from the set with bipartite==_og_id to the other)
Minimum weight algorithm
This algorithm seems to always pick the vv' edges (i.e., the edges between a node in the original graph and its copy in the larger graph). I thought this was because some edges had a weight of 0 (causing the vv' edge to also have a weight of 0), but adding a minimum weight when building the graph did not change this behaviour (I use 0.1, since the minimum nonzero weight in the graph should be 1). This basically reverts the algorithm to "for each node, pick the edge that has the smallest weight", which is suboptimal.
Code:
import math
import networkx as nx

def _min_weight_edge_cover(g: nx.Graph):
    """Returns an edge cover that minimizes the total weight of included edges, but not the total number of edges"""
    clone = g.copy()
    for node, bi in g.nodes(data='bipartite'):
        nd = f"{node}_copy"
        clone.nodes[node]['copy'] = False
        clone.add_node(nd, copy=True, bipartite=(_og_id if bi == _tg_id else _tg_id))  # invert the bipartite flag
        minw = min([w for u, v, w in g.edges(node, data='weight')])
        clone.add_edge(node, nd, weight=(2 * minw))
    # Now clone contains both the nodes of g and their copies, and should still be bipartite
    tops = {n for n, d in clone.nodes(data=True) if d['bipartite'] == _og_id}
    bots = set(clone) - tops
    print(f"[cover] we have {len(tops)} tops and {len(bots)} bots")
    # Here the matching should always exist and be perfect
    matching = nx.bipartite.minimum_weight_full_matching(clone, tops)
    cover = g.copy()
    cover.clear_edges()
    keys = {k for k in matching.keys() if clone.nodes[k]['copy'] is False}
    for k in keys:
        v = matching[k]
        if g.has_edge(k, v):
            # We never get here
            cover.add_edge(k, v)
        else:
            # v was a copy - this is always true
            assert clone.nodes[v]['copy']
            minw = math.inf
            mine = None
            # FIXME should check that we don't add edges between nodes that are already covered
            for u, va, w in g.edges(k, data='weight'):
                if w < minw:
                    minw = w
                    mine = (u, va)
            cover.add_edge(*mine)
    return cover
Minimum cardinality (and weight)
This algorithm is much simpler (start with a matching and then add the cheapest edge of each node not included in the matching). However, the nx.bipartite.minimum_weight_full_matching function raises a "cost matrix is infeasible" error from scipy.optimize.linear_sum_assignment. Unfortunately, there are no details on what makes the cost matrix infeasible. The documentation states that the function takes into account the different number of nodes in the two sets, and I've made sure that all nodes have at least one edge. networkx.min_weight_matching does work, but it's much, much slower than the bipartite version.
Code:
def _min_cardinality_weight_edge_cover(g: nx.Graph) -> nx.Graph:
    """Returns an edge cover that minimizes
    1. the number of edges included;
    2. the total weight of all edges included
    """
    # Get the minimum weight matching.
    # By definition, it will have at most one edge per node, but some nodes may end up unmatched.
    matching = nx.bipartite.minimum_weight_full_matching(g, top_nodes={n for n, b in g.nodes(data='bipartite') if b == _og_id})
    # To make it into a cover, we take all edges from the matching and, for each node not matched, add its cheapest edge.
    cover = nx.Graph()
    cover.add_edges_from(matching.items())
    missing = set(g.nodes) - set(cover.nodes)
    # There shouldn't be a case where two missing nodes could connect to each other, or else that edge would have been
    # included in the matching.
    for node in missing:
        minw = math.inf
        mine = None
        for u, v, w in g.edges(node, data='weight'):
            if w < minw:
                minw = w
                mine = (u, v)
        cover.add_edge(*mine)
    return cover
Any ideas as to what could be causing these issues?
After running regular DBSCAN I got a map with the clusters.
I'm attaching the nearest network node to each firm as plotted by OSMnx, then creating the network-based distance matrix in order to reproduce network-based spatial clustering from this TUTORIAL:
Speed up distance matrix computation: rather than calculating every firm to every firm, find every node with at least 1 firm attached, then calculate the distance between every such pair of nodes. Once we have the node-to-node distances, reindex them to use those distances firm-to-firm.
This is the code:
# attach nearest network node to each firm --APPLY SOLUTION B HERE
firms['nn'] = ox.get_nearest_nodes(G, X=firms['x'], Y=firms['y'], method='balltree')
print(len(firms['nn']))
# we'll get distances for each pair of nodes that have firms attached to them
nodes_unique = pd.Series(firms['nn'].unique())
nodes_unique.index = nodes_unique.values
print(len(nodes_unique))
# convert MultiDiGraph to DiGraph for simpler faster distance matrix computation
G_dm = nx.DiGraph(G)
OUTPUT:
269
230
time: 2.74 s
THEN
# calculate network-based distance between each node --APPLY SOLUTION A HERE
def network_distance_matrix(u, G, vs=nodes_unique):
    dists = [nx.dijkstra_path_length(G, source=u, target=v, weight='length') for v in vs]
    return pd.Series(dists, index=vs)
AND FINALLY
%%time
from tqdm._tqdm_notebook import tqdm_notebook
tqdm_notebook.pandas()
# create node-based distance matrix called node_dm
node_dm = nodes_unique.progress_apply(network_distance_matrix, G=G_dm)
node_dm = node_dm.astype(int)
print(node_dm.size)
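The reindexing step quoted from the tutorial above (expanding the node-to-node matrix into a firm-to-firm matrix) would then look roughly like this; firm_dm is just an illustrative name and the exact step may differ slightly from the tutorial:
# sketch: repeat rows/columns of node_dm for every firm, keyed by each firm's nearest node
firm_dm = node_dm.reindex(index=firms['nn'], columns=firms['nn'])
print(firm_dm.shape)  # one row and one column per firm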
Solutions A and B are thanks to gboeing:
# OPTION A: recursively remove unsolvable origin/destination nodes and re-try
def network_distance_matrix(u, G, vs=nodes_unique):
    G2 = G.copy()
    vs2 = list(vs)
    solved = False
    while not solved:
        try:
            dists = [nx.dijkstra_path_length(G2, source=u, target=v, weight='length') for v in vs2]
            solved = True
            return pd.Series(dists, index=vs2)
        except nx.exception.NetworkXNoPath:
            # drop the destination nodes that cannot be reached from u, then re-try
            unreachable = [v for v in vs2 if not nx.has_path(G2, u, v)]
            G2.remove_nodes_from(unreachable)
            vs2 = [v for v in vs2 if v not in unreachable]
# OPTION B: Use a strongly (instead of weakly) connected graph
Gs = ox.utils_graph.get_largest_component(G, strongly=True)
# attach nearest network node to each firm
firms['nn'] = ox.get_nearest_nodes(Gs, X=firms['x'], Y=firms['y'], method='balltree')
print(len(firms['nn']))
# we'll get distances for each pair of nodes that have firms attached to them
nodes_unique = pd.Series(firms['nn'].unique())
nodes_unique.index = nodes_unique.values
print(len(nodes_unique))
# convert MultiDiGraph to DiGraph for simpler faster distance matrix computation
G_dm = nx.DiGraph(Gs)
Is there a function available in Python's NetworkX for generating random directed graphs with a maximum Euclidean distance between any two connected nodes? For example, for nodes separated by a certain Euclidean distance, there is a probability p of those nodes being connected and for all other nodes separated by greater than this distance, they will not be connected in the graph that is generated.
If you have a threshold such that distances greater than the threshold do not exist, and all edges shorter than that threshold have probability p, then you're in luck. [if it's not the same probability for all shorter edges, it's still doable but a bit harder]
Start by building a random geometric graph G. This is a graph whose nodes are put in place uniformly at random and any two are connected if they are within a threshold distance from each other.
Then create a new directed graph which has each direction of the edges in G with probability p.
import networkx as nx
import random
N=100 # 100 nodes
D = 0.2 #threshold distance of 0.2
G = nx.random_geometric_graph(N, D)
H = nx.DiGraph()
H.add_nodes_from(G.nodes())
p = 0.1  # keep each edge direction with probability 10%
for u, v in G.edges():
    if random.random() < p:
        H.add_edge(u, v)
    if random.random() < p:
        H.add_edge(v, u)
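As a quick sanity check (a sketch that assumes Python 3.8+ for math.dist and the variables from the block above), every edge kept in H should connect nodes whose positions are within the threshold D:
import math

pos = nx.get_node_attributes(G, 'pos')  # positions assigned by random_geometric_graph
assert all(math.dist(pos[u], pos[v]) <= D for u, v in H.edges())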
I am using the Facebook SNAP dataset and building a graph from it with NetworkX in Python, but I have not been able to find the most important (i.e., the most connected) individual in the network.
The code I am using to make a graph from the Facebook SNAP dataset is here:
import networkx as nx
import matplotlib.pyplot as plt
'''Exploratory Data Analysis'''
g = nx.read_edgelist('facebook_combined.txt', create_using=nx.Graph(), nodetype=int)
print nx.info(g)
'''Simple Graph'''
sp = nx.spring_layout(g)
nx.draw_networkx(g, pos=sp, with_labels=False, node_size=35)
# plt.axes("off")
plt.show()
The result it gives is this:
The link to the dataset is here
The source of the dataset is here.
But how can I find the most important individual in this network?
One way to define "importance" is the individual's betweenness centrality. The betweenness centrality is a measure of how many shortest paths pass through a particular vertex. The more shortest paths that pass through the vertex, the more central the vertex is to the network.
Because the shortest path between any pair of vertices can be determined independently of any other pair, the computation can be split across multiple processes.
To do this, we will use the Pool object from the multiprocessing library and the itertools library.
The first thing we need to do is partition the vertices of the network into n subsets, where n depends on the number of processors we have access to. For example, on a machine with 32 cores, we partition the Facebook network into 32 chunks, each containing roughly 128 vertices.
Now, instead of one processor computing the betweenness for all 4,039 vertices, we can have 32 processors computing the betweenness for their own 128 vertices in parallel. This drastically reduces the run time of the algorithm and allows it to scale to larger networks.
The code I used is this:
import networkx as nx
import matplotlib.pyplot as plt
'''Exploratory Data Analysis'''
g = nx.read_edgelist('facebook_combined.txt', create_using=nx.Graph(), nodetype=int)
print nx.info(g)
'''Parallel Betweenness Centrality'''
from multiprocessing import Pool
import itertools
spring_pos = nx.spring_layout(g)
def partitions(nodes, n):
    # '''Partitions the nodes into n subsets'''
    nodes_iter = iter(nodes)
    while True:
        partition = tuple(itertools.islice(nodes_iter, n))
        if not partition:
            return
        yield partition

def btwn_pool(G_tuple):
    return nx.betweenness_centrality_source(*G_tuple)

def between_parallel(G, processes=None):
    p = Pool(processes=processes)
    part_generator = 4 * len(p._pool)
    node_partitions = list(partitions(G.nodes(), int(len(G) / part_generator)))
    num_partitions = len(node_partitions)
    bet_map = p.map(btwn_pool,
                    zip([G] * num_partitions,
                        [True] * num_partitions,
                        [None] * num_partitions,
                        node_partitions))
    bt_c = bet_map[0]
    for bt in bet_map[1:]:
        for n in bt:
            bt_c[n] += bt[n]
    return bt_c
bt = between_parallel(g)
top = 10
max_nodes = sorted(bt.iteritems(), key=lambda v: -v[1])[:top]
bt_values = [5] * len(g.nodes())
bt_colors = [0] * len(g.nodes())
for max_key, max_val in max_nodes:
bt_values[max_key] = 150
bt_colors[max_key] = 2
plt.axis("off")
nx.draw_networkx(g, pos=spring_pos, cmap=plt.get_cmap("rainbow"), node_color=bt_colors, node_size=bt_values,
with_labels=False)
plt.show()
The output it gives:
Now, let's look at the vertices with the top 10 highest betweenness centrality measures in the network.
As you can see, vertices that primarily either sit at the center of a hub or act as a bridge between two hubs have higher betweenness centrality. The bridge vertices have high betweenness because all paths connecting the hubs pass through them, and the hub-center vertices have high betweenness because all intra-hub paths pass through them.
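If run time is not a concern on a graph of this size, the single-process built-in computes the same measure in one call (a sketch; bt_serial and top_10 are just illustrative names):
# single-process equivalent of the parallel computation above
bt_serial = nx.betweenness_centrality(g)
top_10 = sorted(bt_serial.items(), key=lambda kv: -kv[1])[:10]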
I want to do an execution-time analysis of the Bellman-Ford algorithm on a large number of graphs, and in order to do that I need to generate a large number of random DAGs that can have negative edge weights.
I am using networkx in Python. There are a lot of random graph generators in the networkx library, but which one will return a directed graph with edge weights and a single source vertex?
I am using networkx.generators.directed.gnc_graph(), but that does not quite guarantee that there is only a single source vertex.
Is there a way to do this with or even without networkx?
You can generate random DAGs using the gnp_random_graph() generator and only keeping edges that point from lower indices to higher. e.g.
In [44]: import networkx as nx
In [45]: import random
In [46]: G=nx.gnp_random_graph(10,0.5,directed=True)
In [47]: DAG = nx.DiGraph([(u,v,{'weight':random.randint(-10,10)}) for (u,v) in G.edges() if u<v])
In [48]: nx.is_directed_acyclic_graph(DAG)
Out[48]: True
These can have more than one source, but you could fix that with @Christopher's suggestion of making a "super source" that points to all of the sources.
For small connection probability values (p=0.5 in the above), these won't likely be connected either.
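A sketch of the "super source" idea mentioned above (the node name 'super_source' and the zero weight are illustrative choices): add one extra node with an edge to every node that currently has no incoming edges, so the result has exactly one source.
sources = [n for n in DAG.nodes() if DAG.in_degree(n) == 0]
DAG.add_node('super_source')
DAG.add_edges_from(('super_source', s, {'weight': 0}) for s in sources)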
I noticed that the generated graphs always have exactly one sink vertex, which is the first vertex. You can reverse the direction of all edges to get a graph with a single source vertex.
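That reversal is a single call in NetworkX (a minimal sketch; DAG_single_source is an illustrative name):
DAG_single_source = DAG.reverse(copy=True)  # every edge u->v becomes v->u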
The method suggested by @Aric will generate random DAGs, but it will not work for a large number of nodes, for example for n approaching 100000.
import networkx as nx

def create_random_dag(n):
    G = nx.gnp_random_graph(n, 0.5, directed=True)
    DAG = nx.DiGraph([(u, v) for (u, v) in G.edges() if u < v])
    # print(nx.is_directed_acyclic_graph(DAG))  # to check that the graph is a DAG (it will be)
    A = nx.adjacency_matrix(DAG)
    AM = A.toarray().tolist()  # 1 for outgoing edges
    # nodes that end up with no kept edges are dropped when building DAG from the edge list, so retry until all n appear
    while len(AM) != n:
        AM = create_random_dag(n)
    # to display the DAG in matplotlib, uncomment these 2 lines
    # nx.draw(DAG, with_labels=True)
    # plt.show()
    return AM
For a large number of nodes, you can use the property that every strictly lower-triangular adjacency matrix describes a DAG.
So generating a random lower-triangular matrix will generate a random DAG.
import random

N = 100  # number of nodes (choose as needed)
mat = [[0 for x in range(N)] for y in range(N)]
for _ in range(N):
    for j in range(5):
        v1 = random.randint(0, N - 1)
        v2 = random.randint(0, N - 1)
        if v1 > v2:
            mat[v1][v2] = 1
        elif v1 < v2:
            mat[v2][v1] = 1
for r in mat:
    print(','.join(map(str, r)))
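To feed this into a Bellman-Ford timing experiment, the matrix can be turned into a weighted DiGraph, for example like this (a sketch; the weight range is arbitrary):
import networkx as nx

# build a weighted DiGraph from the lower-triangular matrix, giving each
# edge a random (possibly negative) integer weight
DAG = nx.DiGraph()
DAG.add_nodes_from(range(N))
for i in range(N):
    for j in range(N):
        if mat[i][j]:
            DAG.add_edge(i, j, weight=random.randint(-10, 10))
assert nx.is_directed_acyclic_graph(DAG)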
For G -> DG -> DAG, i.e. a DAG with k inputs and m outputs:
Generate a graph with your favorite algorithm (G = watts_strogatz_graph(10, 2, 0.4)).
Make the graph bidirectional (DG = G.to_directed()).
Ensure that only nodes with a lower index point to nodes with a higher index.
Remove the k lowest-index nodes' incoming edges and the m highest-index nodes' outgoing edges (that makes DG a DAG).
Make sure every one of the k lowest-index nodes has outgoing edges and every one of the m highest-index nodes has incoming edges.
Check every node in this DAG: if k < index < n-m and the node has no incoming edges, randomly choose one of the k lowest-index nodes to link to it; if it has no outgoing edges, randomly choose one of the m highest-index nodes for it to link to. Then you get a random DAG with k inputs and m outputs.
Like:
import random
import networkx as nx

def g2dag(G: nx.Graph, k: int, m: int, seed=None) -> nx.DiGraph:
    if seed is not None:
        random.seed(seed)
    DG = G.to_directed()
    n = len(DG.nodes())
    assert n > k and n > m
    # Ensure that only nodes with a lower index point to nodes with a higher index
    for e in list(DG.edges):
        if e[0] >= e[1]:
            DG.remove_edge(*e)
    # Remove the k lowest-index nodes' incoming edges; if such a node ends up with
    # no outgoing edges, randomly link it to another node.
    # Remove the m highest-index nodes' outgoing edges; if such a node ends up with
    # no incoming edges, randomly link another node to it.
    # (That makes DG a DAG.)
    n_list = sorted(list(DG.nodes))
    for i in range(k):
        n_idx = n_list[i]
        for e in list(DG.in_edges(n_idx)):
            DG.remove_edge(*e)
        if len(DG.out_edges(n_idx)) == 0:
            DG.add_edge(n_idx, random.choice(n_list[k:]))
    for i in range(n - m, n):
        n_idx = n_list[i]
        for e in list(DG.out_edges(n_idx)):
            DG.remove_edge(*e)
        if len(DG.in_edges(n_idx)) == 0:
            DG.add_edge(random.choice(n_list[:n - m]), n_idx)
    # For k < index < n-m: if the node has no incoming edges, link one of the k
    # lowest-index nodes to it; if it has no outgoing edges, link it to one of the
    # m highest-index nodes.
    for i in range(k, n - m):
        n_idx = n_list[i]
        if len(DG.in_edges(n_idx)) == 0:
            DG.add_edge(random.choice(n_list[:k]), n_idx)
        if len(DG.out_edges(n_idx)) == 0:
            DG.add_edge(n_idx, random.choice(n_list[n - m:]))
    # The result is a random DAG with k inputs and m outputs
    return DG
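A small usage example under the same assumptions (a sketch):
# generate a small random DAG with 2 input nodes and 2 output nodes
G = nx.watts_strogatz_graph(10, 2, 0.4)
dag = g2dag(G, k=2, m=2, seed=42)
assert nx.is_directed_acyclic_graph(dag)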