I am trying to implement Prim's algorithm for a graph consisting of cities as vertices, but I am stuck. Any advice would be greatly appreciated.
I am reading in the data from a txt file and trying to get output of the score (total distance) and a list of the edges in tuples. For example, by starting with Houston, the first edge would be ('HOUSTON', 'SAN ANTONIO').
I implemented the graph/tree using a dictionary with adjacent vertices and their distance, like so:
{'HOUSTON': [('LUBBOCK', '535'), ('MIDLAND/ODESSA', '494'), ('MISSION/MCALLEN/EDINBURG', '346'), ('SAN ANTONIO', '197')],
'HARLINGEN/SAN BENITO': [('HOUSTON', '329')],
'SAN ANTONIO': [],
'WACO': [],
'ABILENE': [('DALLAS/FORT WORTH', '181')],
'LUBBOCK': [('MIDLAND/ODESSA', '137')],
'COLLEGE STATION/BRYAN': [('DALLAS/FORT WORTH', '181'), ('HOUSTON', '97')],
'MISSION/MCALLEN/EDINBURG': [],
'AMARILLO': [('DALLAS/FORT WORTH', '361'), ('HOUSTON', '596')],
'EL PASO': [('HOUSTON', '730'), ('SAN ANTONIO', '548')],
'DALLAS/FORT WORTH': [('EL PASO', '617'), ('HOUSTON', '238'), ('KILLEEN', '154'), ('LAREDO', '429'), ('LONGVIEW', '128'), ('LUBBOCK', '322'), ('MIDLAND/ODESSA', '347'), ('MISSION/MCALLEN/EDINBURG', '506'), ('SAN ANGELO', '252'), ('SAN ANTONIO', '271'), ('WACO', '91'), ('WICHITA FALLS', '141')],
'KILLEEN': [],
'SAN ANGELO': [],
'MIDLAND/ODESSA': [],
'WICHITA FALLS': [],
'CORPUS CHRISTI': [('DALLAS/FORT WORTH', '377'), ('HOUSTON', '207')],
'AUSTIN': [('DALLAS/FORT WORTH', '192'), ('EL PASO', '573'), ('HOUSTON', '162')],
'LONGVIEW': [],
'BROWNSVILLE': [('DALLAS/FORT WORTH', '550'), ('HOUSTON', '355')],
'LAREDO': [('SAN ANTONIO', '157')]}
Here is what I have so far:
import csv
import operator
def prim(file_path):
    with open(file_path) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter = "\t")
        dict = {}
        for row in csv_reader:
            if row[0] == 'City':
                continue
            if row[0] in dict:
                dict[row[0]].append((row[1],row[3]))
                if row[1] not in dict:
                    dict[row[1]] = []
            else:
                dict[row[0]] = [(row[1], row[3])]
    V = dict.keys()
    A = ['HOUSTON']
    score = 0 # score result
    E = [] # tuple result
    while A != V:
        for x in A:
            dict[x].sort(key=lambda x: x[1])
            for y in dict[x]:
                if y[0] in V and y[0] not in A:
                    A.append(y[0])
                    E.append((x, y[0]))
                    score += int(y[1])
                    break
            break
        break
    print("Edges:")
    print(E)
    print("Score:")
    print(score)
prim("Texas.txt")
This gives the correct first edge because of that last break statement, but when I remove the break statement, it infinitely loops and I can't exactly figure out why or how to fix it. I realize I may be implementing this algorithm totally wrong and inefficiently, so I would really appreciate any tips or advice on where to go from here/what to do differently and why. Thank you in advance!!
There are three main problems with your implementation:
Your graph is stored as a directed graph. Prim's algorithm cannot be applied to directed graphs (reference). If you really want to process a directed graph, the comments in the linked question provide information for how to compute the MST-equivalent of a directed graph (reference).
This is also related to why your algorithm gets stuck: It explores all vertices that are reachable via a directed edge from the starting vertex. There are more vertices in the graph, but they cannot be reached from the starting vertex when traversing the edges in the fixed direction. Therefore, the algorithm cannot grow the intermediate tree to include further vertices and gets stuck.
However, I assume you actually want the graph to be undirected. In that case, you could for example compute the symmetric closure of your graph, by duplicating every edge in the reverse direction. Or alternatively, you could adapt your algorithm to check both outgoing and incoming edges.
In Prim's algorithm, every iteration adds an edge to the intermediate tree. This is done by selecting the minimum-weight edge that connects the intermediate tree with a vertex from the remainder of the graph. However, in your code, you instead sort the outgoing edges of each vertex in the intermediate tree by their weight and add all edges pointing to a vertex that is not yet included in the intermediate tree. This will give you incorrect results. Consider the following example:
Assume we start at a. Your approach sorts the edges of a by their weight and adds them in that order. It therefore adds the edges a-b and a-c to the intermediate tree. All vertices of the graph are now contained in the intermediate tree, so the algorithm terminates. However, the tree b-a-c is not a minimum spanning tree.
So instead, from a you should select the minimum edge, which is a-b. You should then search for the minimum-weight edge from any of the vertices in the intermediate tree. This would be the edge b-c. You would then add this edge to the intermediate tree, finally resulting in the MST a-b-c.
The loop condition A != V will always be true, even if A contains every vertex of the graph, so the loop never terminates on its own. This is because A is a list and V is the result of dict.keys(), which is not a list, but a view object. The view object is iterable, and keys will be given in the same order as they were inserted, but it will never evaluate as equal to a list, even if the list contains the same items.
Even if you turned V into a list (e.g. with V = list(dict.keys())), this would not be enough, since lists are only equal if they contain the same items in the same order. However, you have no guarantee that the algorithm will insert the vertices into A in the same order as the keys were inserted into the dictionary.
Therefore, you should use some other approach for the loop condition. One possible solution would be to initialize V and A as sets, i.e. with V = set(dict.keys()) and A = set(['HOUSTON']). This would allow you to keep A != V as a loop condition. Or you could keep A a list, but only check if the size of A is equal to the size of V in each iteration. Since the algorithm only inserts into A if the vertex is not already contained in A, when len(A) == len(dict), it follows that A contains all vertices.
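The comparison behaviour described above can be checked in isolation. The following is a small sketch on a made-up three-city dictionary (the variable names are only for illustration and the adjacency lists are left empty):

```python
# Toy stand-in for the real graph; only the keys matter here.
graph = {'HOUSTON': [], 'AUSTIN': [], 'WACO': []}
V = graph.keys()
A = ['HOUSTON', 'AUSTIN', 'WACO']

# A list never compares equal to a dict view, even with identical items.
list_vs_view = (A != V)

# Converting the view to a list is order-sensitive.
reordered = ['WACO', 'HOUSTON', 'AUSTIN']
list_vs_list = (reordered != list(V))

# Sets ignore order, so this comparison can serve as a loop condition.
set_vs_set = (set(reordered) != set(V))

# Comparing sizes also works when A never receives duplicates.
sizes_match = (len(reordered) == len(graph))

print(list_vs_view, list_vs_list, set_vs_set, sizes_match)
```

The first two comparisons report "different" although both collections hold the same cities, which is why the original loop can never stop.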
Here is an example for the fixed code. It first takes the symmetric closure of the input graph. As the loop condition, it checks if the size of A is unequal to the total number of vertices. In the loop, it computes the minimum-weight edge that connects a vertex in A with a vertex not in A:
import itertools  # needed for itertools.chain below
# Compute symmetric closure
for v, edges in dict.items():
    for neighbor, weight in edges:
        edges_neighbor = dict[neighbor]
        if (v, weight) not in edges_neighbor:
            dict[neighbor].append((v, weight))
A = ['HOUSTON']
score = 0
E = []
while len(A) != len(dict):
    # Prepend vertex to each (target,weight) pair in dict[x]
    prepend_source = lambda x: map(lambda y: (x, *y), dict[x])
    edges = itertools.chain.from_iterable(map(prepend_source, A))
    # Keep only edges where the target is not in A
    edges_filtered = filter(lambda edge: edge[1] not in A, edges)
    # Select the minimum edge
    edge_min = min(edges_filtered, key=lambda x: int(x[2]))
    A.append(edge_min[1])
    E.append((edge_min[0], edge_min[1]))
    score += int(edge_min[2])
Note that this implementation still assumes that the graph is connected. If you want to handle disconnected graphs, you would have to apply this procedure to each connected component of the graph (resulting in a minimum spanning forest).
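The selection of the minimum-weight crossing edge described above can also be done with a priority queue instead of rescanning all edges in every iteration. The following is a minimal sketch of that variant, not the code from the question: `prim_mst` is a hypothetical helper, the graph is assumed to be undirected (already symmetric) and connected, and weights are assumed to be integers (the dictionary in the question stores them as strings, so they would need `int()` first). The three-vertex graph uses made-up weights chosen to mirror the a-b-c example.

```python
import heapq

def prim_mst(graph, start):
    """Return (total_weight, edges) of a spanning tree grown from `start`.

    `graph` maps each vertex to a list of (neighbour, weight) pairs and is
    assumed to be undirected and connected.
    """
    visited = {start}
    # Heap entries are (weight, source, target) for edges leaving the tree.
    heap = [(w, start, v) for v, w in graph[start]]
    heapq.heapify(heap)
    edges, total = [], 0
    while heap and len(visited) < len(graph):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # stale entry: target was reached by a cheaper edge
        visited.add(v)
        edges.append((u, v))
        total += w
        for nxt, nw in graph[v]:
            if nxt not in visited:
                heapq.heappush(heap, (nw, v, nxt))
    return total, edges

# Hypothetical weights: a-b = 1, b-c = 2, a-c = 3
example = {
    'a': [('b', 1), ('c', 3)],
    'b': [('a', 1), ('c', 2)],
    'c': [('a', 3), ('b', 2)],
}
total, edges = prim_mst(example, 'a')
print(total, edges)
```

With these weights the sketch picks a-b first and then b-c, rather than both edges leaving a.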
Related
I have spent a lot of time on the problem Dominos from kattis, see here: https://open.kattis.com/problems/dominos.
I am passing 3 test cases and on the final one I receive a runtime error. I suspect some out-of-bounds errors might occur, but I really can't narrow down the potential cause of the runtime error. I have pasted my code below and have tried to describe the different steps and the thinking process.
I am using Kosaraju's algorithm to identify the strongly connected components, through a DFS followed by a DFS on the reversed edges (Starting from the last node finished from the prior DFS).
Afterwards, I condense the graph to now only contain representatives of the SCCs, in order to obtain a directed acyclic graph. From here I count the indegree 0 of the SCC representatives as this will be the amount of bricks necessary to knock over manually by hand.
I hope this question is specific enough and some of you might have an idea what could be causing trouble.
from collections import defaultdict
def dfs(brick): # First traversal of graph primarily to generate the DFSorder stack for 2nd traversal
    visited.add(brick)
    for neighbour in adj[brick]:
        if neighbour not in visited:
            dfs(neighbour)
    DFSorder.append(brick)
def dfs2(brick):
    visited.add(brick)
    SCCs[SCC_number].append(brick) # Append brick to the Strongly connected component list of bricks
    idx[brick] = SCCs[SCC_number][0] # Set representative as first element in the SCC
    for neighbour in adj_rev[brick]:
        if neighbour not in visited:
            dfs2(neighbour)
t = int(input()) # testcases
for _ in range(t):
    # Long list of needed things
    DFSorder = []
    SCC_number = 0
    SCCs = defaultdict(list)
    IndegreeSCC = defaultdict(int)
    idx=defaultdict(int)
    n,m = list(map(int,input().split()))
    visited = set()
    adj = defaultdict(set)
    adj_rev = defaultdict(set)
    adj_SCC = defaultdict(list)
    for _ in range(m):
        brick1,brick2 = list(map(int,input().split()))
        adj[brick1].add(brick2)
        adj_rev[brick2].add(brick1) # Reverse adjacency list for second DFS traversal
    for i in range(1,n+1): # First traversal to generate DFS order
        if i not in visited:
            dfs(i)
    visited = set() # Restart visited for 2nd reverse traversal
    for brick in DFSorder[::-1]: # Reverse traversal to determine SCCs
        if brick not in visited:
            dfs2(brick)
            SCC_number += 1
    for val in set(idx.values()): # Initially set all indegrees of SCC representatives to 0
        IndegreeSCC[val] = 0
    for key in adj: # Condense graph to SCCs (Only the representatives for each SCC is needed)
        for neighbour in adj[key]:
            if neighbour != idx[key] and idx[key] != idx[neighbour]:
                adj_SCC[idx[key]].append(idx[neighbour])
                IndegreeSCC[idx[neighbour]] += 1
    # Bricks that needs to be turned over manually can be found from the indegree 0 Strongly connected components
    print(sum([1 for val in list(IndegreeSCC.values()) if val == 0]))
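The approach described in the question (two DFS passes, then counting the strongly connected components with indegree 0 in the condensed graph) can also be written without recursion. The sketch below is one such iterative formulation. `count_sources` and its `(n, edges)` signature are hypothetical names for illustration, input parsing is left out, and the post does not establish whether recursion depth is actually what triggers the runtime error.

```python
from collections import defaultdict

def count_sources(n, edges):
    """Count SCCs with indegree 0 in the condensation of a directed graph.

    Vertices are assumed to be numbered 1..n; `edges` is a list of (u, v).
    """
    adj = defaultdict(list)
    adj_rev = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj_rev[v].append(u)

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited = set()
    order = []
    for start in range(1, n + 1):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                # All neighbours handled: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: traverse the reversed graph in decreasing finish order.
    comp = {}  # vertex -> representative of its SCC
    for start in reversed(order):
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj_rev[node]:
                if nxt not in comp:
                    comp[nxt] = start
                    stack.append(nxt)

    # An SCC needs a manual push if no edge enters it from another SCC.
    has_incoming = {comp[v] for u, v in edges if comp[u] != comp[v]}
    return len(set(comp.values()) - has_incoming)

print(count_sources(4, [(1, 2), (2, 3), (3, 1), (4, 1)]))
```

Because both passes keep their own explicit stacks, a long chain of bricks does not deepen the Python call stack.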
I'm trying to calculate the area of skyline (overlapping rectangles with same baseline)
building_count = int(input())
items = {} # dictionary, location on x axis is the key, height is the value
count = 0 # total area
for j in range(building_count):
    line = input().split(' ')
    H = int(line[0]) # height
    L = int(line[1]) # left point (start of the building)
    R = int(line[2]) # right point (end of the building)
    for k in range(R - L):
        if not (L+k in items): # if it's not there, add it
            items[L+k] = H
        elif H > items[L+k]: # if we have a higher building on that index
            items[L+k] = H
for value in items.values(): # we add each column basically
    count += value
print(count)
sample input would be:
5
3 -3 0
2 -1 1
4 2 4
2 3 7
3 6 8
and output is 29.
The issue is memory efficiency, when there are lots of values, the script simply throws MemoryError. Anyone have some ideas for optimizing memory usage?
You are allocating a separate key-value pair for every single integer value in your range. Imagine the case where L = 1 and R = 100000. Your items dictionary will be filled with roughly 100000 items. Your basic idea of processing/removing overlaps is sound, but the way you do it is massive overkill.
Like so much else in life, this is a graph problem in disguise. Imagine the vertices being the rectangles you are trying to process and the (weighted) edges being the overlaps. The complication is that you cannot just add up the areas of the vertices and subtract the areas of the overlaps, because many of the overlaps overlap each other as well. The overlap issue can be resolved by applying a transformation that converts two overlapping rectangles into non-overlapping rectangles, effectively cutting the edge that connects them. The transformation is shown in the image below. Notice that in some cases one of the vertices will be removed as well, simplifying the graph, while in another case a new vertex is added:
Green: overlap to be chopped out.
Normally, if we have m rectangles and n overlaps between them, constructing the graph would be an O(m^2) operation because we would have to check all vertices for overlaps against each other. However, we can bypass the construction of the input graph entirely to get an O(m + n) traversal algorithm, which is going to be optimal since we will only analyze each rectangle once, and construct the output graph with no overlaps as efficiently as possible. O(m + n) assumes that your input rectangles are sorted according to their left edges in ascending order. If that is not the case, the algorithm will be O(m log(m) + n) to account for the initial sorting step. Note that as the graph density increases, n will go from ~m to ~m^2. This confirms the intuitive idea that the fewer overlaps there are, the more you would expect the process to run in O(m) time, while the more overlaps there are, the closer you will run to O(m^2) time.
The space complexity of the proposed algorithm will be O(m): each rectangle in the input will result in at most two rectangles in the output, and 2m = O(m).
Enough about complexity analysis and on to the algorithm itself. The input will be a sequence of rectangles defined by L, R, H as you have now. I will assume that the input is sorted by the leftmost edge L. The output graph will be a linked list of rectangles defined by the same parameters, sorted in descending order by the rightmost edge. The head of the list will be the rightmost rectangle. The output will have no overlaps between any rectangles, so the total area of the skyline will just be the sum of H * (R - L) for each of the ~m output rectangles.
The reason for picking a linked list is that the only two operations we need are iteration from the head node and the cheapest insertion possible to maintain the list in sorted order. The sorting will be done as part of overlap checking, so we do not need to do any kind of binary searches through the list or anything like that.
Since the input list is ordered by increasing left edge and the output list is ordered by decreasing right edge, we can guarantee that each rectangle added will be checked only against the rectangles it actually overlaps [1]. We will do overlap checking and removal as shown in the diagram above until we reach a rectangle whose left edge is less than or equal to the left edge of the new rectangle. All further rectangles in the output list are guaranteed not to overlap with the new rectangle. This check-and-chop operation guarantees that each overlap is visited at most once, and that no non-overlapping rectangles are processed unnecessarily, making the algorithm optimal.
Before I show code, here is a diagram of the algorithm in action. Red rectangles are new rectangles; note that their left edges progress to the right. Blue rectangles are ones that are already added and have overlap with the new rectangle. Black rectangles are already added and have no overlap with the new one. The numbering represents the order of the output list. It is always done from the right. A linked list is a perfect structure to maintain this progression since it allows cheap insertions and replacements:
Here is an implementation of the algorithm which assumes that the input coordinates are passed in as an iterable of objects having the attributes l, r, and h. The iteration order is assumed to be sorted by the left edge. If that is not the case, apply sorted or list.sort to the input first:
from collections import namedtuple
# Defined in this order so you can sort a list by left edge without a custom key
Rect = namedtuple('Rect', ['l', 'r', 'h'])
class LinkedList:
    """
    Implements a singly-linked list with mutable nodes and an iterator.
    """
    __slots__ = ['value', 'next']
    def __init__(self, value=None, next=None):
        self.value = value
        self.next = next
    def __iter__(self):
        """
        Iterate over the *nodes* in the list, starting with this one.
        The `value` and `next` attribute of any node may be modified
        during iteration.
        """
        while self:
            yield self
            self = self.next
    def __str__(self):
        """
        Provided for inspection purposes.
        Works well with `namedtuple` values.
        """
        return ' -> '.join(repr(x.value) for x in self)
def process_skyline(skyline):
    """
    Turns an iterable of rectangles sharing a common baseline into a
    `LinkedList` of rectangles containing no overlaps.
    The input is assumed to be sorted in ascending order by left edge.
    Each element of the input must have the attributes `l`, `r`, `h`.
    The output will be sorted in descending order by right edge.
    Return `None` if the input is empty.
    """
    def intersect(r1, r2, default=None):
        """
        Return (1) a flag indicating the order of `r1` and `r2`,
        (2) a linked list of between one and three non-overlapping
        rectangles covering the exact same area as `r1` and `r2`,
        (3) a pointer to the last node, and (4) a pointer to the
        second-to-last node, or `default` if there is only one node.
        The flag is set to True if the left edge of `r2` is strictly less
        than the left edge of `r1`. That would indicate that the left-most
        (last) chunk of the tuple came from `r2` instead of `r1`. For the
        algorithm as a whole, that means that we need to keep checking for
        overlaps.
        The resulting list is always returned sorted descending by the
        right edge. The input rectangles will not be modified. If they are
        not returned as-is, a `Rect` object will be used instead.
        """
        # Swap so left edge of r1 < left edge of r2
        if r1.l > r2.l:
            r1, r2 = r2, r1
            swapped = True
        else:
            swapped = False
        if r2.l >= r1.r:
            # case 0: no overlap at all
            last = LinkedList(r1)
            s2l = result = LinkedList(r2, last)
        elif r1.r < r2.r:
            # case 1: simple overlap
            if r1.h > r2.h:
                # Chop r2
                r2 = Rect(r1.r, r2.r, r2.h)
            else:
                # Chop r1
                r1 = Rect(r1.l, r2.l, r1.h)
            last = LinkedList(r1)
            s2l = result = LinkedList(r2, last)
        elif r1.h < r2.h:
            # case 2: split into 3
            r1a = Rect(r1.l, r2.l, r1.h)
            r1b = Rect(r2.r, r1.r, r1.h)
            last = LinkedList(r1a)
            s2l = LinkedList(r2, last)
            result = LinkedList(r1b, s2l)
        else:
            # case 3: complete containment
            result = LinkedList(r1)
            last = result
            s2l = default
        return swapped, result, last, s2l
    root = LinkedList()
    skyline = iter(skyline)
    try:
        # Add the first node as-is
        root.next = LinkedList(next(skyline))
    except StopIteration:
        # Empty input iterator
        return None
    for new_rect in skyline:
        prev = root
        for rect in root.next:
            need_to_continue, replacement, last, second2last = \
                intersect(rect.value, new_rect, prev)
            # Replace the rectangle with the de-overlapped regions
            prev.next = replacement
            if not need_to_continue:
                # Retain the remainder of the list
                last.next = rect.next
                break
            # Force the iterator to move on to the last node
            new_rect = last.value
            prev = second2last
    return root.next
Computing the total area is now trivial:
skyline = [
Rect(-3, 0, 3), Rect(-1, 1, 2), Rect(2, 4, 4),
Rect(3, 7, 2), Rect(6, 8, 3),
]
processed = process_skyline(skyline)
area = sum((x.value.r - x.value.l) * x.value.h for x in processed) if processed else None
Notice the altered order of the input parameters (h moved to the end). The resulting area is 29. This matches with what I get by doing the computation by hand. You can also do
>>> print(processed)
Rect(l=6, r=8, h=3) -> Rect(l=4, r=6, h=2) -> Rect(l=2, r=4, h=4) ->
Rect(l=0, r=1, h=2) -> Rect(l=-3, r=0, h=3)
This is to be expected from the diagram of the inputs/output shown below:
As an additional verification, I added a new building, Rect(-4, 9, 1) to the start of the list. It overlaps all the others and adds three units to area, or a final result of 32. processed comes out as:
Rect(l=8, r=9, h=1) -> Rect(l=6, r=8, h=3) -> Rect(l=4, r=6, h=2) ->
Rect(l=2, r=4, h=4) -> Rect(l=1, r=2, h=1) -> Rect(l=0, r=1, h=2) ->
Rect(l=-3, r=0, h=3) -> Rect(l=-4, r=-3, h=1)
Note:
While I am sure that this problem has been solved many times over, the solution I present here is entirely my own work, done without consulting any other references. The idea of using an implicit graph representation and the resulting analysis is inspired by a recent reading of Steven Skiena's Algorithm Design Manual, Second Edition. It is one of the best comp-sci books I have ever come across.
[1] Technically, if a new rectangle does not overlap any other rectangles, it will be checked against one rectangle it does not overlap. If that extra check was always the case, the algorithm would have an additional m - 1 comparisons to do. Fortunately, m + m + n - 1 = O(m + n) even if we always had to check one extra rectangle (which we don't).
The reason for the MemoryError is the huge size of the dictionary being created. In the worst case, the dict can have 10^10 keys, which would end up taking all your memory. If there really is a need, shelve is a possible way to work with such a large dict.
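A more memory-friendly option is to keep a sorted list of (x, height) breakpoints, where each entry gives the skyline height from that x up to the next entry. Let's say there is a building with 10 0 100 and another with 20 50 150; then that list might have info like [(-10^9, 0), (0, 10), (50, 20), (150, 0), (10^9, 0)]. As you come across more buildings, you can add more entries to this list. This will be O(n^2).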
This might help you further.
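For comparison, here is a sketch of a different, commonly used technique: a sweep over the distinct x-coordinates with a max-heap of the buildings currently "open". It is not the method from either answer above. `skyline_area` is a hypothetical helper that takes (height, left, right) tuples in the same order as the question's input, and its memory use grows with the number of buildings rather than with the coordinate range.

```python
import heapq

def skyline_area(buildings):
    """Area under the skyline of (height, left, right) buildings."""
    buildings = sorted(buildings, key=lambda b: b[1])  # by left edge
    xs = sorted({x for _, left, right in buildings for x in (left, right)})
    heap = []   # entries are (-height, right); tallest open building on top
    area = 0
    i = 0
    for x0, x1 in zip(xs, xs[1:]):
        # Open every building that starts at or before this strip.
        while i < len(buildings) and buildings[i][1] <= x0:
            height, _, right = buildings[i]
            heapq.heappush(heap, (-height, right))
            i += 1
        # Lazily discard tall buildings that have already ended.
        while heap and heap[0][1] <= x0:
            heapq.heappop(heap)
        if heap:
            area += -heap[0][0] * (x1 - x0)
    return area

sample = [(3, -3, 0), (2, -1, 1), (4, 2, 4), (2, 3, 7), (3, 6, 8)]
print(skyline_area(sample))  # 29, matching the expected output in the question
```

Each strip between two consecutive x-coordinates contributes its width times the tallest building covering it, so no per-integer storage is needed.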
I have an adjacency matrix and an adjacency list (I can use either) that both represent a graph.
Basically, how can I pair off connected vertices in the graph so that I am left with the fewest unpaired (and disconnected) vertices?
I have tried this brute-force strategy:
def max_pairs(adj_matrix):
    if len(adj_matrix) % 2:
        # If there are an odd amount of vertices, add a disconnected vertex
        adj_matrix = [adj + [0] for adj in adj_matrix] + [[0] * (len(adj_matrix) + 1)]
    return max(all_pairs(adj_matrix))
def all_pairs(adj_matrix):
    # Adapted from http://stackoverflow.com/a/5360442/5754656
    if len(adj_matrix) < 2:
        yield 0
        return
    a = adj_matrix[0]
    for i in range(1, len(adj_matrix)):
        # Recursively get the next pairs from the list
        for rest in all_pairs([
                adj[1:i] + adj[i+1:] for adj in adj_matrix[1:i] + adj_matrix[i+1:]]):
            yield a[i] + rest # If vertex a and i are adjacent, add 1 to the total pairs
Which is alright for the smaller graphs, but the graphs I am working with have up to 100 vertices.
Is there a way to optimise this so that it can handle that large of a graph?
And is this synonymous to another problem that has algorithms for it? I searched for "Most non-intersecting k-cycles" and variations of that, but could not find an algorithm to do this.
There is a polynomial-time solution (it works in O(|V|^2 * |E|)). It's known as the Blossom algorithm. The idea is to do something like a matching in a bipartite graph, but also shrink the cycles of odd length into one vertex.
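To make the idea concrete, here is a compact sketch of one well-known BFS-based formulation of the blossom technique, which runs in roughly cubic time in the number of vertices. It is an illustration under assumptions rather than a tuned implementation: `max_matching` is a hypothetical name, vertices are assumed to be numbered 0..n-1, and the graph is passed as adjacency lists (an adjacency matrix can be converted with `[[j for j, a in enumerate(row) if a] for row in adj_matrix]`).

```python
from collections import deque

def max_matching(adj):
    """Return match[v] (partner of v, or -1) for a maximum matching."""
    n = len(adj)
    match = [-1] * n

    def lca(a, b, base, parent):
        # Lowest common ancestor of a and b in the alternating tree.
        seen = [False] * n
        while True:
            a = base[a]
            seen[a] = True
            if match[a] == -1:
                break
            a = parent[match[a]]
        while True:
            b = base[b]
            if seen[b]:
                return b
            b = parent[match[b]]

    def mark_path(v, b, child, base, parent, blossom):
        # Mark blossom vertices on the path from v up to base b.
        while base[v] != b:
            blossom[base[v]] = blossom[base[match[v]]] = True
            parent[v] = child
            child = match[v]
            v = parent[match[v]]

    def find_path(root):
        used = [False] * n
        parent = [-1] * n
        base = list(range(n))
        used[root] = True
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for to in adj[v]:
                if base[v] == base[to] or match[v] == to:
                    continue
                if to == root or (match[to] != -1 and parent[match[to]] != -1):
                    # Odd cycle found: shrink it into its base vertex.
                    cur = lca(v, to, base, parent)
                    blossom = [False] * n
                    mark_path(v, cur, to, base, parent, blossom)
                    mark_path(to, cur, v, base, parent, blossom)
                    for i in range(n):
                        if blossom[base[i]]:
                            base[i] = cur
                            if not used[i]:
                                used[i] = True
                                queue.append(i)
                elif parent[to] == -1:
                    parent[to] = v
                    if match[to] == -1:
                        return to, parent  # augmenting path ends here
                    used[match[to]] = True
                    queue.append(match[to])
        return -1, parent

    for root in range(n):
        if match[root] == -1:
            v, parent = find_path(root)
            while v != -1:  # flip edges along the augmenting path
                pv = parent[v]
                nxt = match[pv]
                match[v], match[pv] = pv, v
                v = nxt
    return match

five_cycle = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
pairs = sum(1 for m in max_matching(five_cycle) if m != -1) // 2
print(pairs)
```

The number of unpaired vertices is then the number of `-1` entries in the returned list.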
I am trying to iterate over two lists of the same length, and for the pair of entries per index, execute a function. The function aims to cluster the entries according to some requirement X on the value the function returns.
The lists in questions are:
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
and the function takes 4 entries, every combination of l1 and l2.
The function is defined as
def deltaR(e1, p1, e2, p2):
de = e1 - e2
dp = p1 - p2
return de*de + dp*dp
I have so far been able to loop over the lists simultaneously as:
for index, (eta, phi) in enumerate(zip(e_list, p_list)):
    for index2, (eta2, phi2) in enumerate(zip(e_list, p_list)):
        if index == index2: continue # to avoid same indices
        if deltaR(eta, phi, eta2, phi2) < X:
            print (index, index2) , deltaR(eta, phi, eta2, phi2)
This loop executes the function on every combination, except those that are the same, i.e. index 0,0 or 1,1 etc.
The output of the code returns:
(0, 1) 0.659449892453
(1, 0) 0.659449892453
(2, 3) 0.657024790285
(2, 4) 0.642297230697
(3, 2) 0.657024790285
(3, 4) 0.109675332432
(4, 2) 0.642297230697
(4, 3) 0.109675332432
I am trying to return the number of indices that are all matched following the condition above. In other words, to rearrange the output to:
output = [No. matched entries]
i.e.
output = [2, 3]
2 coming from the fact that indices 0 and 1 are matched
3 coming from the fact that indices 2, 3, and 4 are all matched
A possible way I have thought of is to append to a list, all the indices used such that I return
output_list = [0, 1, 1, 0, 2, 3, 4, 3, 2, 4, 4, 2, 3]
Then, I use defaultdict to count the occurrences:
for index in output_list:
    hits[index] += 1
From the dict I can manipulate it to return [2,3] but is there a more pythonic way of achieving this?
This is finding connected components of a graph, which is very easy and well documented, once you revisit the problem from that view.
The data being in two lists is a distraction. I am going to consider the data to be zip(e_list, p_list). Consider this as a graph, which in this case has 5 nodes (but could have many more on a different data set). Construct the graph using these nodes, and connect them with an edge if they pass your distance test.
From there, you only need to determine the connected components of an undirected graph, which is covered on many many places. Here is a basic depth first search on this site: Find connected components in a graph
You loop through the nodes once, performing a DFS to find all connected nodes. Once you look at a node, mark it visited, so it does not get counted again. To get the answer in the format you want, simply count the number of unvisited nodes found from each unvisited starting point, and append that to a list.
------------------------ graph theory ----------------------
You have data points that you want to break down into related groups. This is a topic in both mathematics and computer science known as graph theory. see: https://en.wikipedia.org/wiki/Graph_theory
You have data points. Imagine drawing them in eta phi space as rectangular coordinates, and then draw lines between the points that are close to each other. You now have a "graph" with vertices and edges.
To determine which of these dots have lines between them is finding connected components. Obviously it's easy to see, but if you have thousands of points, and you want a computer to find the connected components quickly, you use graph theory.
Suppose I make a list of all the eta phi points with zip(e_list, p_list), and each entry in the list is a vertex. If you store the graph in "adjacency list" format, then each vertex will also have a list of the outgoing edges which connect it to another vertex.
Finding a connected component is literally as easy as looking at each vertex, putting a checkmark by it, and then following every line to the next vertex and putting a checkmark there, until you can't find anything else connected. Now find the next vertex without a checkmark, and repeat for the next connected component.
As a programmer, you know that writing your own data structures for common problems is a bad idea when you can use published and reviewed code to handle the task. Google "python graph module". One example mentioned in comments is "pip install networkx". If you build the graph in networkx, you can get the connected components as a list of lists, then take the len of each to get the format you want: [len(_) for _ in nx.connected_components(G)]
---------------- code -------------------
But if you don't understand the math, then you might not understand a module for graphs, nor a base python implementation, but it's pretty easy if you just look at some of those links. Basically dots and lines, but pretty useful when you apply the concepts, as you can see with your problem being nothing but a very simple graph theory problem in disguise.
My graph is a basic list here, so the vertices don't actually have names. They are identified by their list index.
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp
X = 1 # you never actually said, but this works
def these_two_particles_are_going_the_same_direction(p1, p2):
    return deltaR(p1.eta, p1.phi, p2.eta, p2.phi) < X
class Vertex(object):
    def __init__(self, eta, phi):
        self.eta = eta
        self.phi = phi
        self.connected = []
        self.visited = False
class Graph(object):
    def __init__(self, e_list, p_list):
        self.vertices = []
        for eta, phi in zip(e_list, p_list):
            self.add_node(eta, phi)
    def add_node(self, eta, phi):
        # add this data point at the next available index
        n = len(self.vertices)
        a = Vertex(eta, phi)
        for i, b in enumerate(self.vertices):
            if these_two_particles_are_going_the_same_direction(a,b):
                b.connected.append(n)
                a.connected.append(i)
        self.vertices.append(a)
    def reset_visited(self):
        # the vertex list is stored in self.vertices (there is no self.nodes)
        for v in self.vertices:
            v.visited = False
    def DFS(self, n):
        #perform depth first search from node n, return count of connected vertices
        count = 0
        v = self.vertices[n]
        if not v.visited:
            v.visited = True
            count += 1
            for i in v.connected:
                count += self.DFS(i)
        return count
    def connected_components(self):
        self.reset_visited()
        components = []
        for i, v in enumerate(self.vertices):
            if not v.visited:
                components.append(self.DFS(i))
        return components
g = Graph(e_list, p_list)
print(g.connected_components())
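The same component sizes can be obtained with a shorter sketch based on union-find, a different technique from the DFS above. `component_sizes` is a hypothetical helper, the threshold of 1 is the same assumed value of X used above, and the sizes are returned sorted only to make the result order predictable.

```python
def component_sizes(e_list, p_list, threshold):
    """Sizes of the groups of points linked by deltaR < threshold."""
    points = list(zip(e_list, p_list))
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that passes the distance test.
    for i, (e1, p1) in enumerate(points):
        for j in range(i + 1, len(points)):
            e2, p2 = points[j]
            if (e1 - e2) ** 2 + (p1 - p2) ** 2 < threshold:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(points)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values())

e_list = [-0.619489, -0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836, -1.14238, 1.73884, 1.94904, 1.84]
print(component_sizes(e_list, p_list, 1))  # [2, 3]
```

Indices 0 and 1 end up in one group and indices 2, 3 and 4 in the other, which is the [2, 3] format asked for.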
I have a tree as shown below.
Red means it has a certain property, unfilled means it doesn't have it. I want to minimise the Red checks.
1. If Red, then all Ancestors are also Red (and should not be checked again).
2. If Not Red, then all Descendants are Not Red.
The depth of the tree is d.
The width of the tree is n.
Note that children nodes have value larger than the parent.
Example: In the tree below,
Node '0' has children [1, 2, 3],
Node '1' has children [2, 3],
Node '2' has children [3] and
Node '3' has children [] (No children).
Thus children can be constructed as:
if vertex.depth > 0:
    vertex.children = [Vertex(parent=vertex, val=child_val, depth=vertex.depth-1, n=n) for child_val in range(vertex.val+1, n)]
else:
    vertex.children = []
Here is an example tree:
I am trying to count the number of Red nodes. Both the depth and the width of the tree will be large. So I want to do a sort of Depth-First-Search and additionally use the properties 1 and 2 from above.
How can I design an algorithm to traverse that tree?
PS: I tagged this [python] but any outline of an algorithm would do.
Update & Background
I want to minimise the property checks.
The property check is checking the connectedness of a bipartite graph constructed from my tree's path.
Example:
The bottom-left node in the example tree has path = [0, 1].
Let the bipartite graph have sets R and C with size r and c. (Note, that the width of the tree is n=r*c).
From the path I get to the edges of the graph by starting with a full graph and removing edges (x, y) for all values in the path as such: x, y = divmod(value, c).
The two rules for the property check come from the connectedness of the graph:
- If the graph is connected with edges [a, b, c] removed, then it must also be connected with [a, b] removed (rule 1).
- If the graph is disconnected with edges [a, b, c] removed, then it must also be disconnected with additional edge d removed [a, b, c, d] (rule 2).
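The property check described above can be sketched as follows. This is only an illustration of the construction in the post: `is_connected_after_removal` is a hypothetical name, path values are assumed to lie in 0..r*c-1, and an edge (x, y) is assumed to join vertex x of R to vertex y of C, following the `divmod(value, c)` rule.

```python
def is_connected_after_removal(r, c, path):
    """Check whether K_{r,c} minus the edges named by `path` is connected."""
    removed = {divmod(value, c) for value in path}
    # Vertices 0..r-1 form R; vertices r..r+c-1 form C.
    adj = {v: [] for v in range(r + c)}
    for x in range(r):
        for y in range(c):
            if (x, y) not in removed:
                adj[x].append(r + y)
                adj[r + y].append(x)
    # Plain depth-first search from vertex 0.
    seen = {0}
    stack = [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == r + c

print(is_connected_after_removal(2, 2, [0]))     # True
print(is_connected_after_removal(2, 2, [0, 1]))  # False
```

In the 2 x 2 case, removing edges 0 and 1 isolates the first vertex of R, so by rule 2 every longer path that extends [0, 1] is disconnected as well.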
Update 2
So what I really want to do is check all combinations of picking d elements out of [0..n]. The tree structure somewhat helps but even if I got an optimal tree traversal algorithm, I still would be checking too many combinations. (I noticed that just now.)
Let me explain. Assume I have checked [4, 5] (so 4 and 5 are removed from the bipartite graph as explained above, but that is irrelevant here). If this comes out as "Red", my tree will prevent me from checking [4] only. That is good. However, I should also mark off [5] from checking.
How can I change the structure of my tree (to a graph, maybe?) to further minimise my number of checks?
Use a variant of the deletion–contraction algorithm for evaluating the Tutte polynomial (evaluated at (1,2), gives the total number of spanning subgraphs) on the complete bipartite graph K_{r,c}.
In a sentence, the idea is to order the edges arbitrarily, enumerate spanning trees, and count, for each spanning tree, how many spanning subgraphs of size r + c + k have that minimum spanning tree. The enumeration of spanning trees is performed recursively. If the graph G has exactly one vertex, the number of associated spanning subgraphs is the number of self-loops on that vertex choose k. Otherwise, find the minimum edge that isn't a self-loop in G and make two recursive calls. The first is on the graph G/e where e is contracted. The second is on the graph G-e where e is deleted, but only if G-e is connected.
Python is close enough to pseudocode.
class counter(object):
    def __init__(self, ival = 0):
        self.count = ival
    def count_up(self):
        self.count += 1
        return self.count
def old_walk_fun(ilist, func=None):
    def old_walk_fun_helper(ilist, func=None, count=0):
        tlist = []
        if(isinstance(ilist, list) and ilist):
            for q in ilist:
                tlist += old_walk_fun_helper(q, func, count+1)
        else:
            tlist = func(ilist)
        return [tlist] if(count != 0) else tlist
    if(func != None and hasattr(func, '__call__')):
        return old_walk_fun_helper(ilist, func)
    else:
        return []
def walk_fun(ilist, func=None):
    def walk_fun_helper(ilist, func=None, count=0):
        tlist = []
        if(isinstance(ilist, list) and ilist):
            if(ilist[0] == "Red"): # Only evaluate sub-branches if current level is Red
                for q in ilist:
                    tlist += walk_fun_helper(q, func, count+1)
        else:
            tlist = func(ilist)
        return [tlist] if(count != 0) else tlist
    if(func != None and hasattr(func, '__call__')):
        return walk_fun_helper(ilist, func)
    else:
        return []
# Crude tree structure, first element is always its colour; following elements are its children
tree_list = \
["Red",
["Red",
["Red",
[]
],
["White",
[]
],
["White",
[]
]
],
["White",
["White",
[]
],
["White",
[]
]
],
["Red",
[]
]
]
red_counter = counter()
eval_counter = counter()
old_walk_fun(tree_list, lambda x: (red_counter.count_up(), eval_counter.count_up()) if(x == "Red") else eval_counter.count_up())
print("Unconditionally walking")
print("Reds found: %d" % red_counter.count)
print("Evaluations made: %d" % eval_counter.count)
print("")
red_counter = counter()
eval_counter = counter()
walk_fun(tree_list, lambda x: (red_counter.count_up(), eval_counter.count_up()) if(x == "Red") else eval_counter.count_up())
print("Selectively walking")
print("Reds found: %d" % red_counter.count)
print("Evaluations made: %d" % eval_counter.count)
print("")
How hard are you working on making the test for connectedness fast?
To test a graph for connectedness I would pick edges in a random order and use union-find to merge vertices when I see an edge that connects them. I could terminate early if the graph was connected, and I have a sort of certificate of connectedness - the edges which connected two previously unconnected sets of vertices.
As you work down the tree/follow a path on the bipartite graph, you are removing edges from the graph. If the edge you remove is not in the certificate of connectedness, then the graph must still be connected - this looks like a quick check to me. If it is in the certificate of connectedness you could back up to the state of union/find as of just before that edge was added and then try adding new edges, rather than repeating the complete connectedness test.
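The union-find test with a certificate of connectedness described above might look like the following sketch. `connected_with_certificate` is a hypothetical name, vertices are assumed to be numbered 0..num_vertices-1 with at least one vertex, and backing up the union-find state after removing a certificate edge is not shown.

```python
import random

def connected_with_certificate(num_vertices, edges, shuffle=True):
    """Return (is_connected, certificate) using union-find.

    The certificate lists the edges that merged two previously
    separate sets; a connected graph yields num_vertices - 1 of them.
    """
    edges = list(edges)
    if shuffle:
        random.shuffle(edges)  # examine edges in a random order
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    certificate = []
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            certificate.append((u, v))
            if len(certificate) == num_vertices - 1:
                break  # everything is merged: stop early
    return len(certificate) == num_vertices - 1, certificate

# K_{2,2}: vertices 0, 1 on one side and 2, 3 on the other
k22 = [(0, 2), (0, 3), (1, 2), (1, 3)]
ok, cert = connected_with_certificate(4, k22)
print(ok, cert)
```

Any edge of `k22` that is missing from `cert` can be removed without a new test, because the certificate edges alone already connect all four vertices.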
Depending on exactly how you define a path, you may be able to say that extensions of that path will never include edges using a subset of vertices - such as vertices which are in the interior of the path so far. If edges originating from those untouchable vertices are sufficient to make the graph connected, then no extension of the path can ever make it unconnected. Then at the very least you just have to count the number of distinct paths. If the original graph is regular I would hope to find some dynamic programming recursion that lets you count them without explicitly enumerating them.