I'm still new to Python, but I wrote this function to generate a graph that has certain hierarchical features. Sometimes, however, the function creates a graph with one or more unconnected nodes. This should not happen, because every node should be fully connected with the nodes below its "level". The characteristics of the hierarchy follow those of Seigel 2009: "The base is a hierarchy determined by the parameter Expansion Rate in which one individual is placed at the top, and each individual in the network is connected to a number of individuals below him equal to Expansion Rate, continuing until no more individuals are left in the population. Thus, while each level of the hierarchy before the last one contains a number of individuals equal to a power of Expansion Rate, the last level may have fewer than this if the total population does not divide appropriately. Each potential tie between individuals within the same level also has a probability equal to Level Connection of being made."
def heirarchy_graph(n,e,l):
    '''
    n is number of nodes in graph,
    e is the Expansion Rate. This is the number of people on each level
    If n/e has a remainder then the last level has that many
    people in it.
    l is the Level connection. It is the probability that a person is connected to someone
    within the level they belong to.
    '''
    G = nx.Graph()
    G.name="heirarchy_graph(%s,%s,%s)"%(n,e,l)
    r = (n-1)%e
    s = (n-r-1)/e
    h = s + 1
    #G = empty_graph(n=0)
    G.add_node(0, level=int(0))
    for i in range(s):
        list = range(1,(e+1))
        A = nx.Graph()
        #for item in list:
        #create e nodes with attribute level='i'
        A.add_nodes_from(list,level=int(i))
        # add edges between nodes with probability l
        names = A.nodes()
        for name in names:
            B = non_neighbors(A,name)
            for u in B:
                q = random.uniform(0,1)
                if q <= l:
                    A.add_edge(u,name)
        #return A
        #print(A)
        G = nx.disjoint_union(G,A)
    if r != 0:
        h = s+1
        list = range(1,(r+1))
        A = nx.Graph()
        #create e nodes with attribute level='i'
        A.add_nodes_from(list,level=int(h))
        # add edges between nodes with probability l
        names = A.nodes()
        for name in names:
            B = non_neighbors(A,name)
            for u in B:
                q = random.uniform(0,1)
                if q <= l:
                    A.add_edge(u,name)
        G = nx.disjoint_union(G,A)
    ## add edges between levels
    level = nx.get_node_attributes(G,'level')
    names = G.nodes()
    for name in names:
        levelname = level[name]
        B = non_neighbors(G,name)
        for u in B:
            levelneighbor = level[u]
            if levelname == (levelneighbor + 1):
                G.add_edge(u,name)
    return G
I ran this code multiple times with n=25,e=5 and l=.5 but oftentimes I end up with nodes 21,22,23 or 24 (or some combination of them) being unconnected (checked by pulling each node's degree centrality).
I would greatly appreciate any help with this. The code runs without errors, I just don't know why it is giving me unconnected nodes. These nodes should be connected with the level above them. Thank you in advance.
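One thing that may be worth checking is how the level attribute ends up being numbered. The sketch below only re-traces the level assignments made by the function above, without building any graph. The helper name assigned_levels is hypothetical, and // is used so that it also runs on Python 3.

```python
def assigned_levels(n, e):
    """Re-trace the 'level' attribute given to each node, in node order."""
    r = (n - 1) % e
    s = (n - r - 1) // e
    levels = [0]                      # the single top node gets level 0
    for i in range(s):
        levels.extend([i] * e)        # groups get levels 0 .. s-1
    if r != 0:
        levels.extend([s + 1] * r)    # the remainder group gets level s+1
    return levels

levels = assigned_levels(25, 5)
# Levels that no node has, although a higher level exists
missing = sorted(set(range(max(levels) + 1)) - set(levels))
print(levels[21:])   # levels of nodes 21..24
print(missing)       # -> [4]
```

For n=25 and e=5, nodes 21 to 24 (disjoint_union relabels nodes consecutively) get level 5 while no node has level 4, so the test levelname == levelneighbor + 1 never matches for them and they only receive the random within-level ties. The top node and the first group also share level 0. Numbering the groups from 1 (for example level=i + 1) while keeping s + 1 for the remainder group would close the gap, assuming that is the intended layout.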
Related
I was given a question during an interview and, although my answer was accepted in the end, they wanted a faster approach and I went blank.
Question :
Given an undirected graph, can you see if it's a tree? If so, return true and false otherwise.
A tree:
A - B
|
C - D
not a tree:
A
/ \
B - C
/
D
You'll be given two parameters: n for number of nodes, and a multidimensional array of edges like such: [[1, 2], [2, 3]], each pair representing the vertices connected by the edge.
Note: Expected space complexity: O(|V|)
The array edges can be empty
Here is my code (105 ms):
def is_graph_tree(n, edges):
    nodes = [None] * (n + 1)
    for i in range(1, n+1):
        nodes[i] = i
    for i in range(len(edges)):
        start_edge = edges[i][0]
        dest_edge = edges[i][1]
        if nodes[start_edge] != start_edge:
            start_edge = nodes[start_edge]
        if nodes[dest_edge] != dest_edge:
            dest_edge = nodes[dest_edge]
        if start_edge == dest_edge:
            return False
        nodes[start_edge] = dest_edge
    return len(edges) <= n - 1
Here's one approach using a disjoint-set-union / union-find data structure:
def is_graph_tree(n, edges):
    parent = list(range(n+1))
    size = [1] * (n + 1)
    for x, y in edges:
        # find x (path splitting); parent[x] must be assigned before x advances,
        # otherwise the update lands on the new x and compresses nothing
        while parent[x] != x:
            parent[x], x = parent[parent[x]], parent[x]
        # find y
        while parent[y] != y:
            parent[y], y = parent[parent[y]], parent[y]
        if x == y:
            # Already connected
            return False
        # Union (by size)
        if size[x] < size[y]:
            x, y = y, x
        parent[y] = x
        size[x] += size[y]
    # No cycle was found; a tree must also be connected, i.e. have n - 1 edges
    return len(edges) == n - 1
assert not is_graph_tree(4, [(1, 2), (2, 3), (3, 4), (4, 2)])
assert is_graph_tree(6, [(1, 2), (2, 3), (3, 4), (3, 5), (1, 6)])
The runtime is O(V + E*InverseAckermannFunction(V)), which is better than O(V + E * log(log V)), so it's basically O(V + E).
Tim Roberts has posted a candidate solution, but this one also works in the case of disconnected subtrees:
import queue
def is_graph_tree(n, edges):
    # A tree with n nodes has n - 1 edges.
    if len(edges) != n - 1:
        return False
    # Construct graph. Vertices are assumed to be labeled 1..n, as in the
    # question's example, so slot 0 is left unused.
    graph = [[] for _ in range(n + 1)]
    for first_vertex, second_vertex in edges:
        graph[first_vertex].append(second_vertex)
        graph[second_vertex].append(first_vertex)
    # BFS to find edges that create cycles.
    # The graph is undirected, so we can root the tree wherever we want.
    visited = set()
    q = queue.Queue()
    q.put((1, None))
    while not q.empty():
        current_node, previous_node = q.get()
        if current_node in visited:
            return False
        visited.add(current_node)
        for neighbor in graph[current_node]:
            if neighbor != previous_node:
                q.put((neighbor, current_node))
    # Only return true if the graph has only one connected component.
    return len(visited) == n
This runs in O(n + len(edges)) time.
You could approach this from the perspective of tree leaves. Every leaf node in a tree has exactly one edge connected to it. So, if you count the number of edges for each node, you can get the list of leaves (i.e. the ones with only one edge).
Then, take the nodes linked to these leaves and reduce their edge count by one (as if you were removing all the leaves from the tree). That will give you a new set of leaves corresponding to the parents of the original leaves. Repeat the process until you have no more leaves.
[EDIT] Checking that the number of edges is N-1 eliminates the need to do the multi-root check, because there will be another discrepancy (e.g. double link, missing node) in the graph if there are multiple 'roots' or a disconnected subtree.
If the graph is a tree, this process should eliminate all nodes from the node counts (i.e. they will all be flagged as leaves at some point).
Using the Counter class (from collections) will make this relatively easy to implement:
from collections import Counter
def isTree(N,E):
    if N==1 and not E: return True # root only is a tree
    if len(E) != N-1: return False # a tree has N-1 edges
    counts = Counter(n for ab in E for n in ab) # edge counts per node
    if len(counts) != N : return False # unlinked nodes
    while True:
        leaves = {n for n,c in counts.items() if c==1} # new leaves
        if not leaves:break
        for a,b in E: # subtract leaf counts
            if counts[a]>1 and b in leaves: counts[a] -= 1
            if counts[b]>1 and a in leaves: counts[b] -= 1
        for n in leaves: counts[n] = -1 # flag leaves in counts
    return all(c==-1 for c in counts.values()) # all must become leaves
output:
G = [[1,2],[1,3],[4,5],[4,6]]
print(isTree(6,G)) # False (disconnected sub-tree)
G = [[1,2],[1,3],[1,4],[2,3],[5,6]]
print(isTree(6,G)) # False (doubly linked node 3)
G = [[1,2],[2,6],[3,4],[5,1],[2,3]]
print(isTree(6,G)) # True
G = [[1,2],[2,3]]
print(isTree(3,G)) # True
G = [[1,2],[2,3],[3,4]]
print(isTree(4,G)) # True
G = [[1,2],[1,3],[2,5],[2,4]]
print(isTree(6,G)) # False (missing node)
Space complexity is O(N) because the counts dictionary has one entry per node (vertex) with an integer as value. Time complexity will be O(ExL) where E is the number of edges and L is the number of levels in the tree. The worst-case time is O(E^2) for a tree where all parents have only one child node. However, since the initial condition is for E to be less than V, the worst case will actually be O(V^2).
Note that this algorithm makes no assumption on edge order or numerical relationships between node numbers. The root (last node to be made a leaf) found by this algorithm is not necessarily the only possible root given that, unless the nodes have an implicit cardinality relationship (or edges have an order), there could be ambiguous scenarios:
[1,2],[2,3],[2,4] could be:
1              2              3
|_2      OR    |_1      OR    |_2
  |_3          |_3              |_1
  |_4          |_4              |_4
If a cardinality relationship between node numbers or an order of edges can be relied upon, the algorithm could potentially be made more time efficient (because we could easily determine which node is the root and start from there).
[EDIT2] Alternative method using groups.
When the number of edges is N-1, if the graph is a tree, all nodes should be reachable from any other node. This means that, if we form groups of reachable nodes for each node and merge them together based on the edges, we should end up with a single group after going through all the edges.
Here is the modified function based on that approach:
def isTree(N,E):
    if N==1 and not E: return True # root only is a tree
    if len(E) != N-1: return False # a tree has N-1 edges
    groups = {n:[n] for ab in E for n in ab} # each node in its own group
    if len(groups) != N : return False # unlinked nodes
    for a,b in E:
        groups[a].extend(groups[b]) # merge groups
        for n in groups[b]: groups[n] = groups[a] # update nodes' groups
    return len(set(map(id,groups.values()))) == 1 # only one group when done
Given that we start out with fewer edges than nodes and that group merging will consume at most 2x a group size (so also < N), the space complexity will remain O(V). The time complexity will also be O(V^2) in the worst-case scenarios.
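The leaf-stripping idea described above can also be run in roughly O(V + E) time by keeping a queue of current leaves instead of rescanning all edges on every pass. The following is only a sketch of that variation, not the code from this answer; the name is_tree_linear is made up for illustration, and it assumes nodes are hashable labels.

```python
from collections import deque

def is_tree_linear(n, edges):
    """Leaf stripping with a queue: each node and edge is handled once."""
    if n == 1 and not edges:
        return True                       # a single root is a tree
    if len(edges) != n - 1:
        return False                      # a tree has n - 1 edges
    neighbors = {}
    for a, b in edges:
        if a == b:
            return False                  # a self-loop is a cycle
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if len(neighbors) != n:
        return False                      # some node has no edge at all
    degree = {v: len(nbrs) for v, nbrs in neighbors.items()}
    if sum(degree.values()) != 2 * (n - 1):
        return False                      # a duplicated edge was collapsed by the sets
    queue = deque(v for v, d in degree.items() if d == 1)
    stripped = 0
    while queue:
        leaf = queue.popleft()
        stripped += 1
        for nb in neighbors[leaf]:
            degree[nb] -= 1
            if degree[nb] == 1:           # nb has just become a leaf
                queue.append(nb)
    return stripped == n                  # nodes on a cycle are never stripped

print(is_tree_linear(6, [[1,2],[2,6],[3,4],[5,1],[2,3]]))
print(is_tree_linear(6, [[1,2],[1,3],[1,4],[2,3],[5,6]]))
```

The adjacency sets cost O(V + E) space, which is still O(V) here because the edge-count check has already rejected anything with more than n - 1 edges.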
You don't even need to know how many edges there are:
def is_graph_tree(n, edges):
    seen = set()
    for a,b in edges:
        b = max(a,b)
        if b in seen:
            return False
        seen.add(b)
    return True
a = [[1,2],[2,3],[3,4]]
print(is_graph_tree(0,a))
b = [[1,2],[1,3],[2,3],[2,4]]
print(is_graph_tree(0,b))
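Now, this WON'T catch the case of disconnected subtrees, but that wasn't in the problem description... It also assumes that the larger-numbered endpoint of each edge is the child, so a valid tree such as [[1,3],[2,3]] is rejected.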
I am trying to determine the chance of homophily, then the homophily, of a dataset having nodes as keys and colors as values.
Example:
Node  Target  Colors
A     N       1
N     A       0
A     D       1
D     A       1
C     X       1
X     C       0
S     D       0
D     S       1
B             0
R     N       2
N     R       2
Colors are associated with the Node column and span from 0 to 2 (int).
The steps for calculating the chance of homophily on a characteristic z (in my case Color) are illustrated as follows:
c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
print("\nChance of same color:", round(chance_homophily(c_list),2))
where chance_homophily is defined as follows:
# The function below takes a dictionary with characteristics as keys and the frequency of their occurrence as values.
# Then it computes the chance homophily for that characteristic (color)
def chance_homophily(dataset):
    freq_dict = Counter([tuple(x) for x in dataset.values()])
    df_freq_counter = freq_dict
    c_list = list(df_freq_counter.values())
    chance_homophily = 0
    for class_count in c_list:
        chance_homophily += (class_count/sum(c_list))**2
    return chance_homophily
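As a small sanity check of the quantity computed above, chance homophily is the sum of the squared class proportions, i.e. the probability that two nodes drawn at random (with replacement) share the characteristic. The sketch below uses a made-up four-node example and a hypothetical helper name, chance_homophily_from_values, rather than the dataset from the question.

```python
from collections import Counter

def chance_homophily_from_values(values):
    """Sum of squared proportions of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Hypothetical node -> color mapping: two nodes of color 0, two of color 1
example = {"a": 0, "b": 0, "c": 1, "d": 1}
print(chance_homophily_from_values(example.values()))  # (2/4)**2 + (2/4)**2 = 0.5
```

An observed homophily well above this value would suggest that ties between same-colored nodes are more common than random mixing alone would produce.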
Then the homophily is calculated as follows:
def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars for node IDs,
    and dict of node IDs for each node in the network,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if IDs[n1] in chars and IDs[n2] in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[IDs[n1]] == chars[IDs[n2]]:
                    num_same_ties+=1
    return (num_same_ties / num_ties)
G should be built from my dataset above (so taking into account both node and target columns).
I am not totally familiar with this network property but I think I have missed something in the implementation (e.g., is it correctly taking count of relationships among nodes in the network?). In another example (with different dataset) found on the web
https://campus.datacamp.com/courses/using-python-for-research/case-study-6-social-network-analysis?ex=1
the characteristic is also color (though it is a string, while I have a numeric variable). I do not know if they take into consideration relationship among nodes to determine, maybe using adjacency matrix: this part has not been implemented in my code, where I am using
G = nx.from_pandas_edgelist(df, source='Node', target='Target')
Your code works perfectly fine. The only thing you are missing is the IDs dict, which would map the names of your nodes to the names of the nodes in the graph G. By creating the graph from a pandas edgelist, you are already naming your nodes as they are in the data.
This renders the use of the IDs dict unnecessary. Check out the example below, once without the IDs dict and once with a trivial dict so that the original function can be used:
import networkx as nx
import pandas as pd
from collections import Counter
df = pd.DataFrame({"Node":["A","N","A","D","C","X","S","D","B","R","N"],
                   "Target":["N","A","D","A","X","C","D","S","","N","R"],
                   "Colors":[1,0,1,1,1,0,0,1,0,2,2]})
c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
G = nx.from_pandas_edgelist(df, source='Node', target='Target')
def homophily_without_ids(G, chars):
    """
    Given a network G and a dict of characteristics chars keyed by
    node name, find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if n1 in chars and n2 in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[n1] == chars[n2]:
                    num_same_ties+=1
    return (num_same_ties / num_ties)
print(homophily_without_ids(G, c_list))
#create node ids map - trivial in this case
nodes_ids = {i:i for i in G.nodes()}
def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars for node IDs,
    and dict of node IDs for each node in the network,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if IDs[n1] in chars and IDs[n2] in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[IDs[n1]] == chars[IDs[n2]]:
                    num_same_ties+=1
    return (num_same_ties / num_ties)
print(homophily(G, c_list, nodes_ids))
I am wondering if I can speed up my operation of limiting node degree using an inbuilt function.
A submodule of my task requires me to limit the in-degree to 2. The solution I proposed was to introduce sequential dummy nodes that absorb the extra edges. Finally, the last dummy gets connected to the children of the original node. To be specific, if an original node 2 is split into 3 nodes (the original node 2 and two dummy nodes), ALL the properties of the graph should be maintained if we analyse the graph by packaging 2 and its dummies into one hypothetical node 2'. The function I wrote is shown below:
def split_merging(G, dummy_counter):
    """
    Args:
        G: as the name suggests
        dummy_counter: as the name suggests
    Returns:
        G with each merging node > 2 incoming split into several consecutive nodes
        and dummy_counter
    """
    # we need two copies; one to ensure the sanctity of the input G
    # and second, to ensure that while we change the Graph in the loop,
    # the loop doesn't go crazy due to changing bounds
    G_copy = nx.DiGraph(G)
    G_copy_2 = nx.DiGraph(G)
    for node in G_copy.nodes:
        in_deg = G_copy.in_degree[node]
        if in_deg > 2:  # node must be split for incoming
            new_nodes = ["dummy" + str(i) for i in range(dummy_counter, dummy_counter + in_deg - 2)]
            dummy_counter = dummy_counter + in_deg - 2
            upstreams = [i for i in G_copy_2.predecessors(node)]
            downstreams = [i for i in G_copy_2.successors(node)]
            for up in upstreams:
                G_copy_2.remove_edge(up, node)
            for down in downstreams:
                G_copy_2.remove_edge(node, down)
            prev_node = node
            G_copy_2.add_edge(upstreams[0], prev_node)
            G_copy_2.add_edge(upstreams[1], prev_node)
            for i in range(2, len(upstreams)):
                G_copy_2.add_edge(prev_node, new_nodes[i - 2])
                G_copy_2.add_edge(upstreams[i], new_nodes[i - 2])
                prev_node = new_nodes[i - 2]
            for down in downstreams:
                G_copy_2.add_edge(prev_node, down)
    return G_copy_2, dummy_counter
For clarification, the input and output graphs were shown as figures:
[Figure: input graph]
[Figure: output graph]
It works as expected. But the problem is that this is very slow for larger graphs. Is there a way to speed this up using some inbuilt function from networkx or any other library?
Sure; the idea is similar to balancing a B-tree. If a node has too many in-neighbors, create two new children, and split up all your in-neighbors among those children. The children have out-degree 1 and point to your original node; you may need to recursively split them as well.
This is as balanced as possible: node n becomes a complete binary tree rooted at node n, with external in-neighbors at the leaves only, and external out-neighbors at the root.
def recursive_split_node(G: 'nx.DiGraph', node, max_in_degree: int = 2):
    """Given a possibly overfull node, create a minimal complete
    binary tree rooted at that node with no overfull nodes.
    Return the new graph."""
    global dummy_counter  # module-level counter; must be defined before the first call
    current_in_degree = G.in_degree[node]
    if current_in_degree <= max_in_degree:
        return G
    # Complete binary tree, so left gets 1 more descendant if tied
    left_child_in_degree = (current_in_degree + 1) // 2
    left_child = "dummy" + str(dummy_counter)
    right_child = "dummy" + str(dummy_counter + 1)
    dummy_counter += 2
    G.add_node(left_child)
    G.add_node(right_child)
    old_predecessors = list(G.predecessors(node))
    # Give all predecessors to left and right children
    G.add_edges_from([(y, left_child)
                      for y in old_predecessors[:left_child_in_degree]])
    G.add_edges_from([(y, right_child)
                      for y in old_predecessors[left_child_in_degree:]])
    # Remove all incoming edges
    G.remove_edges_from([(y, node) for y in old_predecessors])
    # Connect children to me
    G.add_edge(left_child, node)
    G.add_edge(right_child, node)
    # Split children
    G = recursive_split_node(G, left_child, max_in_degree)
    G = recursive_split_node(G, right_child, max_in_degree)
    return G
def clean_graph(G: 'nx.DiGraph', max_in_degree: int = 2) -> 'nx.DiGraph':
    """Return a copy of our original graph, with nodes added to ensure
    the max in degree does not exceed our limit."""
    G_copy = nx.DiGraph(G)
    for node in G.nodes:
        if G_copy.in_degree[node] > max_in_degree:
            G_copy = recursive_split_node(G_copy, node, max_in_degree)
    return G_copy
This code for recursively splitting nodes is quite handy and easily generalized, and intentionally left unoptimized.
To solve your exact use case, you could go with an iterative solution: build a full, complete binary tree (with the same structure as a heap) implicitly as an array. This is, I believe, the theoretically optimal solution to the problem, in terms of minimizing the number of graph operations (new nodes, new edges, deleting edges) to achieve the constraint, and gives the same graph as the recursive solution.
def clean_graph(G):
    """Return a copy of our original graph, with nodes added to ensure
    the max in degree does not exceed 2."""
    global dummy_counter  # module-level counter; must be defined before the first call
    G_copy = nx.DiGraph(G)
    for node in G.nodes:
        if G_copy.in_degree[node] > 2:
            predecessors_list = list(G_copy.predecessors(node))
            G_copy.remove_edges_from((y, node) for y in predecessors_list)
            N = len(predecessors_list)
            leaf_count = (N + 1) // 2
            # A full binary tree with leaf_count leaves has leaf_count - 1
            # internal nodes; with fewer, some leaves get no edge to a parent.
            internal_count = leaf_count - 1
            total_nodes = leaf_count + internal_count
            node_names = [node]
            node_names.extend(("dummy" + str(dummy_counter + i) for i in range(total_nodes - 1)))
            dummy_counter += total_nodes - 1
            # Heap layout: the children of index i are 2*i + 1 and 2*i + 2
            for i in range(internal_count):
                G_copy.add_edges_from(((node_names[2 * i + 1], node_names[i]), (node_names[2 * i + 2], node_names[i])))
            # Each leaf absorbs up to two of the original predecessors
            for leaf in range(internal_count, internal_count + leaf_count):
                G_copy.add_edge(predecessors_list.pop(), node_names[leaf])
                if not predecessors_list:
                    break
                G_copy.add_edge(predecessors_list.pop(), node_names[leaf])
                if not predecessors_list:
                    break
    return G_copy
From my testing, comparing performance on very dense graphs generated with nx.fast_gnp_random_graph(500, 0.3, directed=True), this is 2.75x faster than the recursive solution, and 1.75x faster than the original posted solution. The bottleneck for further optimizations is networkx and Python, or changing the input graphs to be less dense.
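The array (heap-style) layout behind the iterative version can be checked on plain edge lists, without networkx. The sketch below is an illustration under stated assumptions, not the benchmarked code: the name heap_split_edges and the "dummy" prefix are made up, and it uses leaf_count - 1 internal slots so that every array slot after the root has a parent.

```python
from collections import Counter

def heap_split_edges(node, predecessors, prefix="dummy"):
    """Return edges that funnel `predecessors` into `node` with in-degree <= 2."""
    preds = list(predecessors)
    if len(preds) <= 2:
        return [(p, node) for p in preds]
    leaf_count = (len(preds) + 1) // 2   # each leaf absorbs up to two predecessors
    internal_count = leaf_count - 1      # full binary tree: leaves - 1 internal nodes
    names = [node] + [f"{prefix}{i}" for i in range(leaf_count + internal_count - 1)]
    edges = []
    for i in range(internal_count):      # children of slot i are slots 2i+1 and 2i+2
        edges.append((names[2 * i + 1], names[i]))
        edges.append((names[2 * i + 2], names[i]))
    for j, p in enumerate(preds):        # fill the leaf slots two predecessors at a time
        edges.append((p, names[internal_count + j // 2]))
    return edges

edges = heap_split_edges("x", ["p0", "p1", "p2", "p3", "p4"])
in_degree = Counter(target for _, target in edges)
print(max(in_degree.values()))           # no node receives more than two edges
```

Because every dummy node has exactly one outgoing edge, following those edges from any original predecessor always ends at the original node, which is the property the split has to preserve.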
I have a network stored in a Postgres database, where I can route with the pgRouting extension. I've read this into memory, and now want to calculate the distance to all nodes within 0.1 hours of a specific starting node:
dm = G.new_vp("double", np.inf)
gt.shortest_distance(G, source=nd[102481678], weights=wgts, dist_map = dm, max_dist=0.1)
where wgts is an EdgePropertyMap containing the weights per edge, and nd is a reverse mapping to get vertex index from the outside id.
In pgRouting this delivers 349 reachable nodes, using graph-tool only 328. The results are more or less the same (e.g. the furthest node is the same with the exact same cost, nodes present in both lists have same distance), but the graph-tool distance map just seems to miss certain nodes. The weird thing is that I found a cul-de-sac node labeled with a distance (second one from below), but the node connecting the cul-de-sac with the outside world is missing. Seems weird, because if the connecting node would not be reachable, the cul-de-sac would be unreachable as well.
I've compiled a MWE: https://gofile.io/d/YpgjSw
Below is the python code:
import graph_tool.all as gt
import numpy as np
import time
# construct list of source, target, edge-id (edge-id not really used in this example)
l = []
with open('nw.txt') as f:
    rows = f.readlines()
    for row in rows:
        id = int(row.split('\t')[0])
        source = int(row.split('\t')[1])
        target = int(row.split('\t')[2])
        l.append([source, target, id])
        l.append([target, source, id])
print(len(l))
# construct graph
G = gt.Graph(directed=True)
G.ep["edge_id"] = G.new_edge_property("int")
n = G.add_edge_list(l, hashed=True, eprops=G.ep["edge_id"])
# construct a dict for mapping outside node-id's to internal id's (node indexes)
nd = {}
i = 0
for x in n:
    nd[x] = i
    i = i + 1
# construct a dict for mapping (source, target) combis to a cost and reverse cost
db_wgts = {}
with open('costs.txt') as f:
    rows = f.readlines()
    for row in rows:
        source = int(row.split('\t')[0])
        target = int(row.split('\t')[1])
        cost = float(row.split('\t')[2])
        reverse_cost = float(row.split('\t')[3])
        db_wgts[(source, target)] = cost
        db_wgts[(target, source)] = reverse_cost
# construct an edge property and fill it according to previous dict
wgts = G.new_edge_property("double")
i = 0
for e in G.edges():
    i = i + 1
    print(i)
    print(e)
    s = n[int(e.source())]
    t = n[int(e.target())]
    try:
        wgts[e] = db_wgts[(s, t)]
    except KeyError:
        # this was necessary
        wgts[e] = 1000000
# calculate shortest distance to all nodes within 0.1 total cost from source-node with outside-id of 102481678
dm = G.new_vp("double", np.inf)
gt.shortest_distance(G, source=nd[102481678], weights=wgts, dist_map = dm, max_dist=0.1)
# some mumbo-jumbo for getting the result in a nice node-id: cost format
ar = dm.get_array()
idxs = np.where(dm.get_array() < 0.1)
vals = ar[ar < 0.1]
final_res = [(i, v) for (i,v) in zip(list(idxs[0]), list(vals))]
final_res.sort(key=lambda tup: tup[1])
for x in final_res:
    print(n[x[0]], x[1])
# output saved in result_missing_nodes.txt
# 328 records, should be 349
To illustrate (one of the) missing nodes:
>>> dm[nd[63447311]]
0.0696234786274957
>>> dm[nd[106448775]]
0.06165528930577409
>>> dm[nd[127601733]]
inf
>>> dm[nd[100428293]]
0.0819900275163846
>>>
This doesn't seem possible because this is the local layout of the network, labels are the id's referenced above:
This is a numerical precision problem. You have very low edge weights (1e-6) combined with very large values (1000000), which causes differences to be lost to finite precision. If you replace all the 1000000 values (which I assume mean infinite weight) with numpy.inf, you actually get a more stable calculation, and no missing nodes in your example.
An even better alternative is to actually remove the "infinite weight" edges using an edge filter:
u = gt.GraphView(G, efilt=wgts.fa < 1000000)
and compute the distances on that.
I am working with directed graphs given by an adjacency representation. In other words a graph G will be represented by a dictionary whose keys are the vertices and whose values are dictionaries whose keys are the neighbors of a vertex, the values of which may be assigned to 1. Given two vertices u, v in a directed graph G there may be an edge from u to v but not vice versa. It is however possible that there is an edge in both directions.
I have created a function called reachable_vertices which will take a graph G and vertex v, as input and returns a list of all the vertices in G which can be reached from v. If a vertex w can be reached by v this means that there is a chain v → v1 → v2... → w where there is an edge from each vertex in the chain to the one immediately after it. The vertex v does not have to have a particular type such as int or string, it could be either of these, it need only be a key in the dictionary representing the graph G.
I have defined a function called cfb_graph which takes no arguments. I formed a directed graph from the file cfb2010.csv (Link Below) by considering the teams as vertices and creating an edge between team1 and team2 only if team1 defeated team2.
Data Set Link =https://drive.google.com/open?id=1ZgNjH_QE7if1xHMfRU2-ebd9bNpL2E3d
cfb_graph will return a dictionary giving this representation.
I was able to answer the following questions; my code is attached below:
i. Which teams are not reachable from Auburn. Store them in a list.
ii. Which teams are reachable from Notre Dame. Store them in a list.
iii. Which teams are not reachable from Alabama. Store them in a list.
I am working on the following code:
def reachable(G, v, setA): # This function checks if it's possible to reach w from v
    setA|={v}
    try:
        for w in set(G[v])-setA:reachable(G,w,setA)
    except KeyError:
        donothing = 0
    return setA
## 2a ##
def reachable_vertices(G, v):
    setA=set()
    setA|={v}
    try:
        for n in set(G[v])-setA:reachable(G,n,setA)
    except KeyError:
        donothing = 0
    return setA
def cfb_graph():
    svertex = []
    evertex = []
    count= 0
    file = open("cfb2010.csv","r")
    for line in file:
        fields = line.split(",")
        if fields[5].replace("\n", "") == 'W':
            svertex.append(fields[1])
            evertex.append(fields[2])
        if count == 0:
            count = count +1
    graph = {}
    for i in range(len(svertex)):
        v = svertex[i]
        if v in graph:
            graph[v] |= set([evertex[i]])
        else:
            graph[v] = set([evertex[i]])
    for key, value in graph.items():
        graph[key] = dict.fromkeys(value,1)
    return(graph)
######Part 2 c############
auburn_answer = list(set(cfb_graph().keys()).difference(set(reachable_vertices(cfb_graph(), "Auburn"))))
notre_dame_answer = reachable_vertices(cfb_graph(), "Notre Dame")
alabama_answer = list(set(cfb_graph().keys()).difference(set(reachable_vertices(cfb_graph(), "Alabama"))))
In particular, for each vertex I want to return a dictionary where the keys are the reachable vertices and the values are as will now be described. If a vertex w is reachable from a vertex v, there is a path from v to w. The value corresponding to w in the returned dictionary will be the vertex which immediately precedes it in some path from v to w. If I use the queue approach, then the value of w would be the first vertex u in the while loop for which w is a neighbor of u.
Also, I want to define a function called path which will take as input a graph G and two vertices v and w. If w is reachable from v it will return a list of vertices whose first element is v and whose last element is w and the other vertices are those on a path from v to w in the order in which they are traversed. If there is no path I should return None. I will probably want to use the function defined above.
I suppose the fast and powerful graph processing library networkx will help you a lot. It has a huge number of algorithms, so you do not need to implement them manually; you can just call a function in your code.
I constructed a small workflow that reproduces all your functionality and solves your problems:
# Imports
import networkx as nx
import csv
# Load CSV file and construct the directed graph
G = nx.DiGraph()
with open('cfb2010.csv', 'r') as f:
    sreader = csv.reader(f, delimiter=',')
    for line in sreader:
        if line[-1] != 'W':
            continue
        G.add_node(line[1])
        G.add_node(line[2])
        G.add_edge(line[1], line[2])
# Get all nodes
all_nodes = set(G.nodes())
# Get nodes that can be reached from the particular node
notredame_nodes = set(nx.bfs_tree(G, 'Notre Dame').nodes())
alabama_nodes = set(nx.bfs_tree(G, 'Alabama').nodes())
auburn_nodes = set(nx.bfs_tree(G, 'Auburn').nodes())
# Construct lists of nodes you need
print(all_nodes - alabama_nodes)
print(all_nodes - auburn_nodes)
print(notredame_nodes)
Networkx also has a function equivalent to the path function you describe:
print(nx.shortest_path(G, 'Florida', 'Illinois'))
['Florida', 'Penn St', 'Michigan', 'Illinois']
P.S. The reachable-node construction uses the BFS algorithm.
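For the predecessor dictionary and the path function described in the question, a plain-Python version without networkx might look like the sketch below. It assumes the dict-of-dicts representation from the question; the name reachable_with_predecessors and the choice of None as the start vertex's own value are assumptions, since the question does not specify them.

```python
from collections import deque

def reachable_with_predecessors(G, v):
    """Map each vertex reachable from v to the vertex it was first reached from."""
    pred = {v: None}                  # the start vertex has no predecessor
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in G.get(u, {}):        # vertices without outgoing edges may not be keys
            if w not in pred:
                pred[w] = u           # u is the first vertex that has w as a neighbor
                queue.append(w)
    return pred

def path(G, v, w):
    """Return a list of vertices from v to w, or None if w is not reachable."""
    pred = reachable_with_predecessors(G, v)
    if w not in pred:
        return None
    result = [w]
    while result[-1] != v:            # walk the predecessor links back to v
        result.append(pred[result[-1]])
    result.reverse()
    return result

# Small made-up graph in the question's dict-of-dicts format
G = {"a": {"b": 1, "c": 1}, "b": {"d": 1}, "c": {"d": 1}, "e": {"a": 1}}
print(path(G, "a", "d"))
print(path(G, "a", "e"))
```

Because the search is breadth-first, the path returned uses the fewest edges among all paths from v to w, and the queue avoids the recursion-depth limit that a recursive traversal can hit on long chains.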