This is more of a conceptual programming question, so bear with me:
Say you have a list of scenes in a movie, and each scene may or may not make reference to past/future scenes in the same movie. I'm trying to find the most efficient algorithm for sorting these scenes. There may not be enough information for the scenes to be completely sorted, of course.
Here's some sample code in Python (pretty much pseudocode) to clarify:
class Reference:
    def __init__(self, scene_id, relation):
        self.scene_id = scene_id
        self.relation = relation

class Scene:
    def __init__(self, scene_id, references):
        self.id = scene_id
        self.references = references

    def __repr__(self):
        return self.id

def relative_sort(scenes):
    return scenes  # Algorithm in question

def main():
    s1 = Scene('s1', [
        Reference('s3', 'after')
    ])
    s2 = Scene('s2', [
        Reference('s1', 'before'),
        Reference('s4', 'after')
    ])
    s3 = Scene('s3', [
        Reference('s4', 'after')
    ])
    s4 = Scene('s4', [
        Reference('s2', 'before')
    ])
    print(relative_sort([s1, s2, s3, s4]))

if __name__ == '__main__':
    main()
The goal is to have relative_sort return [s4, s3, s2, s1] in this case.
If it's helpful, I can share my initial attempt at the algorithm; I'm a little embarrassed at how brute-force it is. Also, if you're wondering, I'm trying to decode the plot of the film "Mulholland Drive".
FYI: The Python tag is only here because my pseudocode was written in Python.
The algorithm you are looking for is a topological sort:
In the field of computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks.
You can compute this pretty easily using a graph library, for instance networkx, which implements topological_sort. First we import the library and list all of the relationships between scenes -- that is, all of the directed edges in the graph:
>>> import networkx as nx
>>> relations = [
(3, 1), # 1 after 3
(2, 1), # 2 before 1
(4, 2), # 2 after 4
(4, 3), # 3 after 4
(4, 2) # 4 before 2
]
We then create a directed graph:
>>> g = nx.DiGraph(relations)
Then we run a topological sort:
>>> nx.topological_sort(g)
[4, 3, 2, 1]
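If you'd rather build the graph straight from the Scene objects in the question instead of writing the edge list by hand, something along these lines should work (a sketch; note that in networkx 2.x topological_sort returns an iterator, so wrap it in list(), and that it returns one valid ordering, which may differ from [s4, s3, s2, s1] when the constraints allow several orders):
import networkx as nx

def relative_sort(scenes):
    g = nx.DiGraph()
    g.add_nodes_from(s.id for s in scenes)
    for s in scenes:
        for ref in s.references:
            if ref.relation == 'after':   # s comes after the referenced scene
                g.add_edge(ref.scene_id, s.id)
            else:                         # 'before': s comes before the referenced scene
                g.add_edge(s.id, ref.scene_id)
    return list(nx.topological_sort(g))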
I have included a modified version of your code in my answer, which solves the current (small) problem, but without a larger sample problem, I'm not sure how well it would scale. If you provide the actual problem you're trying to solve, I'd love to test and refine this code until it works on that problem, but without test data I won't optimize this solution any further.
For starters, we track references as sets, not lists.
Duplicates don't really help us (if "s1" is before "s2", and "s1" is before "s2", we've gained no information).
This also lets us add inverse references with abandon (if "s1" comes before "s2", then "s2" comes after "s1").
We compute a min and max position for each scene:
The min position is based on how many scenes we come after.
This could be extended easily: if we come after two scenes that both have a min_pos of 2, our min_pos is 4 (if one is at 2, the other must be at 3).
The max position is based on how many scenes we come before.
This could be extended similarly: if we come before two scenes that both have a max_pos of 4, our max_pos is 2 (if one is at 4, the other must be at 3).
If you decide to do this, just replace pass in tighten_bounds(self) with code to try to tighten the bounds for a single scene (and set anything_updated to true if it works).
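For example, a simple tightening pass could look like this (an untested sketch of my own, using the Scene and Movie attributes from the full code below; it only propagates each neighbour's bound through one reference at a time, not the full counting idea described above):
def tighten_bounds(self):
    anything_updated = False
    for scene in self.scene_dict.values():
        for ref in scene.references:
            other = self.scene_dict[ref.scene_id]
            if ref.relation == 'after':
                # we sit somewhere after `other`, so our floor is its floor + 1
                new_min = min(other.min_pos + 1, self.num_scenes - 1)
                if new_min > scene.min_pos:
                    scene.min_pos = new_min
                    anything_updated = True
            elif ref.relation == 'before':
                # we sit somewhere before `other`, so our ceiling is its ceiling - 1
                new_max = max(other.max_pos - 1, 0)
                if new_max < scene.max_pos:
                    scene.max_pos = new_max
                    anything_updated = True
    self.can_tighten = anything_updated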
The magic is in get_possible_orders: it generates all valid orderings if you iterate over it, but if you only want one valid ordering, it doesn't take the time to create them all.
Code:
class Reference:
    def __init__(self, scene_id, relation):
        self.scene_id = scene_id
        self.relation = relation

    def __repr__(self):
        return '"%s %s"' % (self.relation, self.scene_id)

    def __hash__(self):
        return hash(self.scene_id)

    def __eq__(self, other):
        return self.scene_id == other.scene_id and self.relation == other.relation

class Scene:
    def __init__(self, title, references):
        self.title = title
        self.references = references
        self.min_pos = 0
        self.max_pos = None

    def __repr__(self):
        return '%s (%s,%s)' % (self.title, self.min_pos, self.max_pos)

inverse_relation = {'before': 'after', 'after': 'before'}

def inverted_reference(scene, reference):
    return Reference(scene.title, inverse_relation[reference.relation])

def is_valid_addition(scenes_so_far, new_scene, scenes_to_go):
    previous_ids = {s.title for s in scenes_so_far}
    future_ids = {s.title for s in scenes_to_go}
    for ref in new_scene.references:
        if ref.relation == 'before' and ref.scene_id in previous_ids:
            return False
        elif ref.relation == 'after' and ref.scene_id in future_ids:
            return False
    return True

class Movie:
    def __init__(self, scene_list):
        self.num_scenes = len(scene_list)
        self.scene_dict = {scene.title: scene for scene in scene_list}
        self.set_max_positions()
        self.add_inverse_relations()
        self.bound_min_max_pos()
        self.can_tighten = True
        while self.can_tighten:
            self.tighten_bounds()

    def set_max_positions(self):
        for scene in self.scene_dict.values():
            scene.max_pos = self.num_scenes - 1

    def add_inverse_relations(self):
        for scene in self.scene_dict.values():
            for ref in scene.references:
                self.scene_dict[ref.scene_id].references.add(inverted_reference(scene, ref))

    def bound_min_max_pos(self):
        for scene in self.scene_dict.values():
            for ref in scene.references:
                if ref.relation == 'before':
                    scene.max_pos -= 1
                elif ref.relation == 'after':
                    scene.min_pos += 1

    def tighten_bounds(self):
        anything_updated = False
        for scene in self.scene_dict.values():
            pass
            # If bounds for any scene are tightened, set anything_updated back to true
        self.can_tighten = anything_updated

    def get_possible_orders(self, scenes_so_far):
        if len(scenes_so_far) == self.num_scenes:
            yield scenes_so_far
            return  # a plain return ends the generator (raise StopIteration breaks under PEP 479)
        n = len(scenes_so_far)
        scenes_left = set(self.scene_dict.values()) - set(scenes_so_far)
        valid_next_scenes = set(s
                                for s in scenes_left
                                if s.min_pos <= n <= s.max_pos)
        # valid_next_scenes = sorted(valid_next_scenes, key=lambda s: s.min_pos * self.num_scenes + s.max_pos)
        for s in valid_next_scenes:
            if is_valid_addition(scenes_so_far, s, scenes_left - {s}):
                for valid_complete_sequence in self.get_possible_orders(scenes_so_far + (s,)):
                    yield valid_complete_sequence

    def get_possible_order(self):
        return next(self.get_possible_orders(tuple()))

def relative_sort(lst):
    try:
        return [s.title for s in Movie(lst).get_possible_order()]
    except StopIteration:
        return None

def main():
    s1 = Scene('s1', {Reference('s3', 'after')})
    s2 = Scene('s2', {
        Reference('s1', 'before'),
        Reference('s4', 'after')
    })
    s3 = Scene('s3', {
        Reference('s4', 'after')
    })
    s4 = Scene('s4', {
        Reference('s2', 'before')
    })
    print(relative_sort([s1, s2, s3, s4]))

if __name__ == '__main__':
    main()
As others have pointed out, you need a topological sort. A depth-first traversal of the directed graph, where the order relation forms the edges, is all you need. Visit in post-order; that is the reverse of a topo sort, so to get the topo sort, just reverse the result.
I've encoded your data as a list of pairs showing what's known to go before what. This is just to keep my code short. You can just as easily traverse your list of classes to create the graph.
Note that for topo sort to be meaningful, the set being sorted must satisfy the definition of a partial order. Yours is fine. Order constraints on temporal events naturally satisfy the definition.
Note it's perfectly possible to create a graph with cycles. There's no topo sort of such a graph. This implementation doesn't detect cycles, but it would be easy to modify it to do so.
Of course you can use a library to get the topo sort, but where's the fun in that?
from collections import defaultdict

# Before -> After pairs dictating order. Repeats are okay. Cycles aren't.
# This is OP's data in a friendlier form.
OrderRelation = [('s3','s1'), ('s2','s1'), ('s4','s2'), ('s4','s3'), ('s4','s2')]

class OrderGraph:
    # nodes is an optional list of items for use when some aren't related at all
    def __init__(self, relation, nodes=[]):
        self.succ = defaultdict(set)  # Successor map
        heads = set()
        for tail, head in relation:
            self.succ[tail].add(head)
            heads.add(head)
        # Sources are nodes that have no in-edges (tails - heads)
        self.sources = set(self.succ.keys()) - heads | set(nodes)

    # Recursive helper to traverse the graph and visit in post order
    def __traverse(self, start):
        if start in self.visited: return
        self.visited.add(start)
        for succ in self.succ[start]: self.__traverse(succ)
        self.sorted.append(start)  # Append in post-order

    # Return a reverse post-order visit, which is a topo sort. Not thread safe.
    def topoSort(self):
        self.visited = set()
        self.sorted = []
        for source in self.sources: self.__traverse(source)
        self.sorted.reverse()
        return self.sorted
Then...
>>> print(OrderGraph(OrderRelation).topoSort())
['s4', 's2', 's3', 's1']
>>> print(OrderGraph(OrderRelation, ['s1', 'unordered']).topoSort())
['s4', 's2', 's3', 'unordered', 's1']
The second call shows that you can optionally pass values to be sorted in a separate list. You may, but don't have to, mention values already in the relation pairs. Of course those not mentioned in order pairs are free to appear anywhere in the output.
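If you do want cycle detection, one way (a sketch of my own, not part of the original code) is to track the nodes on the current recursion stack and fail as soon as the traversal re-enters one of them. The sketch below is a drop-in replacement for __traverse inside OrderGraph; topoSort would also need to initialise self.in_progress = set():
def __traverse(self, start):
    if start in self.in_progress:
        raise ValueError("cycle detected at %r" % (start,))
    if start in self.visited:
        return
    self.in_progress.add(start)
    for succ in self.succ[start]:
        self.__traverse(succ)
    self.in_progress.discard(start)
    self.visited.add(start)
    self.sorted.append(start)   # still appending in post-order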
Related
I have a tree, given e.g. as a networkx object. In order to input it into a black-box algorithm I was given, I need to save it in the following strange format:
Traverse the tree in a clockwise order. As I pass through one side of an edge, I label it incrementally. Then I want to save for each edge the labels of its two sides.
For example, a star will become a list [(0,1),(2,3),(4,5),...] and a path with 3 vertices will be [(0,3),(1,2)].
I am stumped with implementing this. How can this be done? I can use any library.
I'll answer this without reference to any library.
You would need to perform a depth-first traversal, and log the (global) incremental number before you visit a subtree, and also after you have visited it. Those two numbers make up the tuple that you have to prepend to the result you get from the subtree traversal.
Here is an implementation that needs the graph to be represented as an adjacency list. The main function takes the root node and the adjacency list:
def iter_naturals():  # helper function to produce sequential numbers
    n = 0
    while True:
        yield n
        n += 1

def half_edges(root, adj):
    visited = set()
    sequence = iter_naturals()
    def dfs(node):
        result = []
        visited.add(node)
        for child in adj[node]:
            if child not in visited:
                forward = next(sequence)
                path = dfs(child)
                backward = next(sequence)
                result.extend([(forward, backward)] + path)
        return result
    return dfs(root)
Here is how you can run it for the two examples you mentioned. I have just implemented those graphs as adjacency lists, where nodes are identified by their index in that list:
Example 1: a "star":
The root is the parent of all other nodes
adj = [
    [1, 2, 3],  # 1, 2, 3 are children of 0
    [],
    [],
    []
]
print(half_edges(0, adj))  # [(0, 1), (2, 3), (4, 5)]
Example 2: a single path with 3 nodes
adj = [
    [1],  # 1 is a child of 0
    [2],  # 2 is a child of 1
    []
]
print(half_edges(0, adj))  # [(0, 3), (1, 2)]
I found this great built-in function dfs_labeled_edges in networkx. From there it is a breeze.
import networkx as nx

def get_new_encoding(G):
    dfs = [(v[0], v[1]) for v in nx.dfs_labeled_edges(G, source=1)
           if v[0] != v[1] and v[2] != "nontree"]
    dfs_ind = sorted(range(len(dfs)), key=lambda k: dfs[k])
    new_tree_encoding = [(dfs_ind[i], dfs_ind[i+1]) for i in range(0, len(dfs_ind), 2)]
    return new_tree_encoding
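As a quick usage check (my own example, with the path's nodes labelled 1, 2, 3 to match the hard-coded source=1), the three-vertex path from the question should come out as [(0, 3), (1, 2)]:
import networkx as nx

# a path with 3 vertices: 1 - 2 - 3
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3)])

print(get_new_encoding(G))  # expected: [(0, 3), (1, 2)]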
I've been trying to create a graph representation for the popular Kevin Bacon game. I have created the graph and vertex classes, but am having some trouble creating a breadth-first search method to traverse the graph, find the shortest path from Kevin Bacon to a given actor, and print out the edges along the way.
The user should enter an actor, and the program should find the shortest path from Kevin Bacon to that actor. The user will then keep entering actors; for each one the shortest path will be found and the Kevin Bacon number printed out, or None if the actor cannot be reached.
There is a Vertex class and a Graph class. The Vertex class contains a dictionary of the other vertices it is connected to and the connecting edges.
The data I am working with looks like this:
vertices:
["Kevin Bacon", "actor1", "actor2", "actor3", "actor4", "actor5", "actor6"]
edges:
("Kevin Bacon", "actor1", "movie1")
("Kevin Bacon", "actor2", "movie1")
("actor1", "actor2", "movie1")
("actor1", "actor3", "movie2")
("actor3", "actor2", "movie3")
("actor3", "actor4", "movie4")
("actor5", "actor6", "movie5")
Where the movie is the edge name or weight, and the other parts of the tuple are the vertices. I want the BFS algorithm to print out all of the edges and kevin bacon number, or print out that it is not possible if the actor cannot be reached.
Here is the code so far. Any advice and help is appreciated.
Thank you for your time
class Vertex:
    '''
    keeps track of the vertices to which it is connected, and the weight of each edge
    '''
    def __init__(self, key):
        self.ID = key
        self.connected_to = {}

    def add_neighbor(self, neighbor, weight=0):
        '''
        add a connection from this vertex to another
        '''
        self.connected_to[neighbor] = weight

    def __str__(self):
        '''
        returns all of the vertices in the adjacency list, as represented by the connected_to instance variable
        '''
        return str(self.ID) + ' connected to: ' + str([x.ID for x in self.connected_to])

    def get_connections(self):
        '''
        returns all of the connections for each of the keys
        '''
        return self.connected_to.keys()

    def get_ID(self):
        '''
        returns the current key id
        '''
        return self.ID

    def get_weight(self, neighbor):
        '''
        returns the weight of the edge from this vertex to the vertex passed as a parameter
        '''
        return self.connected_to[neighbor]

class Graph:
    '''
    contains a dictionary that maps vertex names to vertex objects.
    '''
    def __init__(self):
        self.vert_list = {}
        self.num_vertices = 0

    def __str__(self):
        edges = ""
        for vert in self.vert_list.values():
            for vert2 in vert.get_connections():
                edges += "(%s, %s)\n" % (vert.get_ID(), vert2.get_ID())
        return edges

    def add_vertex(self, key):
        '''
        adding vertices to a graph
        '''
        self.num_vertices = self.num_vertices + 1
        new_vertex = Vertex(key)
        self.vert_list[key] = new_vertex
        return new_vertex

    def get_vertex(self, n):
        if n in self.vert_list:
            return self.vert_list[n]
        else:
            return None

    def __contains__(self, n):
        '''
        in operator
        '''
        return n in self.vert_list

    def add_edge(self, f, t, cost=0):
        '''
        connecting one vertex to another
        '''
        if f not in self.vert_list:
            nv = self.add_vertex(f)
        if t not in self.vert_list:
            nv = self.add_vertex(t)
        self.vert_list[f].add_neighbor(self.vert_list[t], cost)

    def get_vertices(self):
        '''
        returns the names of all of the vertices in the graph
        '''
        return self.vert_list.keys()

    def __iter__(self):
        '''
        for functionality
        '''
        return iter(self.vert_list.values())

    def bfs(self):
        '''
        Needs to be implemented
        '''
        pass
Get an actor
Check if the actor is Kevin Bacon
If the actor is Kevin Bacon, go back along the path you took
If the actor is not Kevin Bacon then find all the actors connected to this actor who you have not already checked.
Add all the actors this actor is connected to onto your list of actors to check. (A rough sketch of these steps follows.)
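Taken literally with an unweighted breadth-first search, those steps might look like this sketch (my own illustration; it assumes the Graph and Vertex classes from the question and returns the path as a list of vertices, or None if the actor can't be reached):
from collections import deque

def bfs_path(graph, actor, target="Kevin Bacon"):
    start = graph.get_vertex(actor)
    if start is None:
        return None
    visited = {start}
    queue = deque([[start]])          # the queue holds whole paths, not single vertices
    while queue:
        path = queue.popleft()
        if path[-1].get_ID() == target:
            return path               # first hit is a shortest (fewest-edges) path
        for neighbor in path[-1].get_connections():
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None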
The hardest problem you will have here is keeping a record of which vertexes you have already visited. As such I think your algorithm should check a list of vertexes. Some assumptions:
Each vertex is listed only once.
Edges are single-direction only. This means that if you want to go from Actor 1 to Actor 2 and from Actor 2 to Actor 1, you need two edges, one stored in each actor's vertex.
You have weights, but I don't see how they're relevant for this. I'll try to implement them though. Also your default weight should not be 0, or all paths will be equally short (0*n = 0).
OK lets go.
def bfs(self, actor):
    from heapq import heappush, heappop
    if actor == "Kevin Bacon":
        return print("This actor is Kevin Bacon!")
    visited = set()
    checked = []
    n = 0
    heappush(checked, (0, n, [self.get_vertex(actor)]))
    # if the list is empty we haven't been able to find any path
    while checked:
        # note that we pop our current list out of the list of checked lists,
        # if all of the children of this list have been visited it won't be
        # added again
        current_list = heappop(checked)[2]
        current_vertex = current_list[-1]
        if current_vertex.ID == "Kevin Bacon":
            return print(current_list)
        for child in current_vertex.get_connections():
            if child in visited:
                # we've already found a shorter path to this actor
                # don't add this path into the list
                continue
            n += 1
            # make a hash function for the vertexes, probably just
            # the hash of the ID is enough, otherwise the memory address
            # is used and identical vertices can be visited multiple times
            visited.add(child)
            w = sum(current_list[i].get_weight(current_list[i+1])
                    for i in range(len(current_list)-1))
            heappush(checked, (w, n, current_list + [child]))
    print("no path found!")
You should also implement a __repr__() method for your vertex class. With the one I used, the output looks like this:
g = Graph()
for t in [("Kevin Bacon", "actor1", "movie1"),
          ("Kevin Bacon", "actor2", "movie1"),
          ("actor1", "actor2", "movie1"),
          ("actor1", "actor3", "movie2"),
          ("actor3", "actor2", "movie3"),
          ("actor3", "actor4", "movie4"),
          ("actor5", "actor6", "movie5")]:
    g.add_edge(t[0], t[1], cost=1)
    g.add_edge(t[1], t[0], cost=1)

g.bfs("actor4")
# prints [Vertex(actor4), Vertex(actor3), Vertex(actor2), Vertex(Kevin Bacon)]
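For reference, the __repr__ (plus the __hash__/__eq__ that the comment inside bfs alludes to) could be as simple as the following; this is my own guess at what was used, not the answerer's exact code (only the last three methods matter, the rest mirrors the question's class):
class Vertex:
    def __init__(self, key):
        self.ID = key
        self.connected_to = {}

    def __repr__(self):
        return "Vertex(%s)" % self.ID

    def __hash__(self):
        # hash on the ID so the same actor is always treated as one vertex
        return hash(self.ID)

    def __eq__(self, other):
        return isinstance(other, Vertex) and self.ID == other.ID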
I originally wasn't going to use heapq to do this, but in the end decided I might as well. Essentially, you need to sort your checked list to get the shortest path first. The simplest way to do this is to just sort your list every time you want to pop the smallest value off the top, but this can get very slow when your list is getting large. Heapq can keep the list sorted in a more efficient manner, but there is no key method to get the smallest value of the list we add, so we need to fake it by using a tuple. The first value in the tuple is the actual cost of the path, while the second one is simply a "tie breaker" so that we do not try to compare Vertexes (which are not ordered and will raise an exception if we try to do so).
I would like to sort several players' points from smallest to biggest.
I wish to get this result:
Drogba 2 pts
Owen 4 pts
Henry 6 pts
However, my ranking seems to be reversed for now :-(
Henry 6 pts
Owen 4 pts
Drogba 2 pts
I think my problem is with my Bubblesort function?
def Bubblesort(name, goal1, point):
    swap = True
    while swap:
        swap = False
        for i in range(len(name)-1):
            if goal1[i+1] > goal1[i]:
                goal1[i], goal1[i+1] = goal1[i+1], goal1[i]
                name[i], name[i+1] = name[i+1], name[i]
                point[i], point[i+1] = point[i+1], point[i]
                swap = True
    return name, goal1, point

def ranking(name, point):
    for i in range(len(name)):
        print(name[i], "\t", point[i], " \t ")

name = ["Henry", "Owen", "Drogba"]
point = [0]*3
goal1 = [68, 52, 46]
gain = [6, 4, 2]

name, goal1, point = Bubblesort(name, goal1, point)
for i in range(len(name)):
    point[i] += gain[i]

ranking(name, point)
In your code:
if goal1[i+1] > goal1[i]:
that checks whether the next value is greater. You need to swap if the next one is less, not greater.
Change that to:
if goal1[i+1] < goal1[i]:
A bunch of issues:
def Bubblesort - PEP8 says function names should be lowercase, ie def bubblesort
You are storing your data as a bunch of parallel lists; this makes it harder to work on and think about (and sort!). You should transpose your data so that instead of having a list of names, a list of points, a list of goals you have a list of players, each of whom has a name, points, goals.
def bubblesort(name, goal1, point): - should look like def bubblesort(items) because bubblesort does not need to know that it is getting names and goals and points and sorting on goals (specializing it that way keeps you from reusing the function later to sort other things). All it needs to know is that it is getting a list of items and that it can compare pairs of items using >, ie Item.__gt__ is defined.
Instead of using the default "native" sort order, Python sort functions usually let you pass an optional key function which allows you to tell it what to sort on - that is, sort on key(items[i]) > key(items[j]) instead of items[i] > items[j]. This is often more efficient and/or convenient than reshuffling your data to get the sort order you want.
for i in range(len(name)-1): - you are iterating more than needed. After each pass, the highest value in the remaining list gets pushed to the top (hence "bubble" sort, values rise to the top of the list like bubbles). You don't need to look at those top values again because you already know they are higher than any of the remaining values; after the nth pass, you can ignore the last n values.
actually, the situation is a bit better than that; you will often find runs of values which are already in sorted order. If you keep track of the highest index that actually got swapped, you don't need to go beyond that on your next pass.
So your sort function becomes
def bubblesort(items, *, key=None):
    """
    Return items in sorted order
    """
    # work on a copy of the list (don't destroy the original)
    items = list(items)
    # process key values - cache the result of key(item)
    # so it doesn't have to be called repeatedly
    keys = items if key is None else [key(item) for item in items]
    # initialize the "last item to sort on the next pass" index
    last_swap = len(items) - 1
    # sort!
    while last_swap:
        ls = 0
        for i in range(last_swap):
            j = i + 1
            if keys[i] > keys[j]:
                # have to swap keys and items at the same time,
                # because keys may be an alias for items
                items[i], items[j], keys[i], keys[j] = items[j], items[i], keys[j], keys[i]
                # made a swap - update the last_swap index
                ls = i
        last_swap = ls
    return items
You may not be sure that this is actually correct, so let's test it:
from random import sample

def test_bubblesort(tries=1000):
    # example key function
    key_fn = lambda item: (item[2], item[0], item[1])
    for i in range(tries):
        # create some sample data to sort
        data = [sample("abcdefghijk", 3) for j in range(10)]
        # no-key sort
        assert bubblesort(data) == sorted(data), "Error: bubblesort({}) gives {}".format(data, bubblesort(data))
        # keyed sort
        assert bubblesort(data, key=key_fn) == sorted(data, key=key_fn), "Error: bubblesort({}, key) gives {}".format(data, bubblesort(data, key=key_fn))

test_bubblesort()
Now the rest of your code becomes
class Player:
    def __init__(self, name, points, goals, gains):
        self.name = name
        self.points = points
        self.goals = goals
        self.gains = gains

players = [
    Player("Henry", 0, 68, 6),
    Player("Owen", 0, 52, 4),
    Player("Drogba", 0, 46, 2)
]

# sort by goals
players = bubblesort(players, key=lambda player: player.goals)

# update points
for player in players:
    player.points += player.gains

# show the result
for player in players:
    print("{player.name:<10s} {player.points:>2d} pts".format(player=player))
which produces
Drogba 2 pts
Owen 4 pts
Henry 6 pts
I'm trying to get an efficient algorithm to calculate the height of a tree in Python for large datasets. The code I have works for small datasets, but takes a long time for really large ones (100,000 items) so I'm trying to figure out ways to optimize it but am getting stuck. Sorry if it seems like a really newbie question, I'm pretty new to Python.
The input is a list length and a list of values, with each list item pointing to its parent, and a value of -1 indicating the root of the tree. So with an input of:
5
4 -1 4 1 1
The answer would be 3 - the tree is: {key: 1, children: [{key: 3}, {key: 4, children: [{key: 0}, {key: 2}]}]}
Here is the code that I have so far:
import sys, threading
sys.setrecursionlimit(10**7)   # max depth of recursion
threading.stack_size(2**25)    # new thread will get stack of such size

class TreeHeight:
    def read(self):
        self.n = int(sys.stdin.readline())
        self.parent = list(map(int, sys.stdin.readline().split()))

    def getChildren(self, node, nodes):
        parent = {'key': node, 'children': []}
        children = [i for i, x in enumerate(nodes) if x == parent['key']]
        for child in children:
            parent['children'].append(self.getChildren(child, nodes))
        return parent

    def compute_height(self, tree):
        if len(tree['children']) == 0:
            return 0
        else:
            max_values = []
            for child in tree['children']:
                max_values.append(self.compute_height(child))
            return 1 + max(max_values)

def main():
    tree = TreeHeight()
    tree.read()
    treeChild = tree.getChildren(-1, tree.parent)
    print(tree.compute_height(treeChild))

threading.Thread(target=main).start()
First, while Python is really a great general-purpose language, raw Python is not very efficient for large datasets; consider using pandas, NumPy, SciPy or one of the many great alternatives.
Second, if all you care about is the tree's height, and your tree is a write-once-read-always one, you could simply alter the code that reads the input so that it measures the height while filling the tree.
This approach makes sense when you don't expect your tree to change after it has been created.
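As an illustration of that idea (my own sketch, not the answerer's code), you can compute every node's depth directly from the parent array, memoising as you go, so the nested dict never needs to be built:
def tree_height(parents):
    # Height counted in nodes on the longest root-to-leaf path, as in the question.
    n = len(parents)
    depth = [0] * n              # 0 means "not computed yet"
    for i in range(n):
        # walk up until we reach the root or a node whose depth is already known
        path = []
        node = i
        while node != -1 and depth[node] == 0:
            path.append(node)
            node = parents[node]
        base = 0 if node == -1 else depth[node]
        # unwind, assigning depths on the way back down
        for node in reversed(path):
            base += 1
            depth[node] = base
    return max(depth)

print(tree_height([4, -1, 4, 1, 1]))   # 3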
Use an iterative level-order traversal (BFS) to avoid the stack overflow you can get from deep recursive calls. Use a marker to know when a level ends during the traversal.
from collections import defaultdict, deque

def compute_height(root, tree):
    q = deque()        # a deque stands in for the queue (the original ListQueue is not defined here)
    q.append(root)
    q.append('$')      # level marker
    height = 1
    while q:
        elem = q.popleft()
        if elem == '$' and q:
            elem = q.popleft()
            height += 1
            q.append('$')
        for child in tree[elem]:
            q.append(child)
    return height

tree = defaultdict(list)
parents = [4, -1, 4, 1, 1]
for node, parent in enumerate(parents):
    tree[parent].append(node)

root = tree.pop(-1)[0]
print(compute_height(root, tree))   # 3
This is a follow-up question to Combinatorics in Python
I have a tree, or a directed acyclic graph if you will, structured so that r are root nodes, p are parent nodes, c are child nodes and b are hypothetical branches. The root nodes are not directly linked to the parent nodes; it is only a reference.
I am interested in finding all the combinations of branches under the constraints:
A child can be shared by any number of parent nodes, given that these parent nodes do not share a root node.
A valid combination should not be a subset of another combination
In this example only two valid combinations are possible under the constraints:
combo[0] = [b[0], b[1], b[2], b[3]]
combo[1] = [b[0], b[1], b[2], b[4]]
The data structure is such that b is a list of branch objects, which have properties r, c and p, e.g.:
b[3].r = 1
b[3].p = 3
b[3].c = 2
This problem can be solved in Python easily and elegantly, because there is a module called "itertools".
Let's say we have objects of type HypotheticalBranch, which have attributes r, p and c, just as you described in your post:
class HypotheticalBranch(object):
    def __init__(self, r, p, c):
        self.r = r
        self.p = p
        self.c = c

    def __repr__(self):
        return "HypotheticalBranch(%d,%d,%d)" % (self.r, self.p, self.c)
Your set of hypothetical branches is thus
b = [HypotheticalBranch(0, 0, 0),
     HypotheticalBranch(0, 1, 1),
     HypotheticalBranch(1, 2, 1),
     HypotheticalBranch(1, 3, 2),
     HypotheticalBranch(1, 4, 2)]
The magical function that returns a list of all possible branch combos could be written like so:
import collections, itertools

def get_combos(branches):
    rc = collections.defaultdict(list)
    for b in branches:
        rc[b.r, b.c].append(b)
    return itertools.product(*rc.values())
To be precise, this function returns an iterator. Get the list by iterating over it. These four lines of code will print out all possible combos:
for combo in get_combos(b):
    print("Combo:")
    for branch in combo:
        print(" %r" % (branch,))
The output of this programme is:
Combo:
HypotheticalBranch(0,1,1)
HypotheticalBranch(1,3,2)
HypotheticalBranch(0,0,0)
HypotheticalBranch(1,2,1)
Combo:
HypotheticalBranch(0,1,1)
HypotheticalBranch(1,4,2)
HypotheticalBranch(0,0,0)
HypotheticalBranch(1,2,1)
...which is just what you wanted.
So what does the script do? It creates a list of all hypothetical branches for each combination (root node, child node). And then it yields the product of these lists, i.e. all possible combinations of one item from each of the lists.
I hope I got what you actually wanted.
Your second constraint means you want maximal combinations, i.e. all the combinations with the length equal to the largest combination.
I would approach this by first traversing the "b" structure and creating a structure, named "c", to store all branches coming to each child node, categorized by the root node they come from.
Then, to construct combinations for output, for each child you can include one entry from each root set that is not empty. The order (execution time) of the algorithm will be the order of the output, which is the best you can get.
For example, your "c" structure, will look like:
c[i][j] = [b_k0, ...]
--> means c_i has b_k0, ... as branches that connect it to root r_j
For the example you provided:
c[0][0] = [0]
c[0][1] = []
c[1][0] = [1]
c[1][1] = [2]
c[2][0] = []
c[2][1] = [3, 4]
It should be fairly easy to code it using this approach. You just need to iterate over all branches "b" and fill the data structure for "c". Then write a small recursive function that goes through all items inside "c".
Here is the code (I entered your sample data at the top for testing sake):
class Branch:
    def __init__(self, r, p, c):
        self.r = r
        self.p = p
        self.c = c

b = [
    Branch(0, 0, 0),
    Branch(0, 1, 1),
    Branch(1, 2, 1),
    Branch(1, 3, 2),
    Branch(1, 4, 2)
]

total_b = 5   # Number of branches
total_c = 3   # Number of child nodes
total_r = 2   # Number of roots

c = []
for i in range(total_c):
    c.append([])
    for j in range(total_r):
        c[i].append([])

for k in range(total_b):
    c[b[k].c][b[k].r].append(k)

combos = []

def list_combos(n_c, n_r, curr):
    if n_c == total_c:
        combos.append(curr)
    elif n_r == total_r:
        list_combos(n_c+1, 0, curr)
    elif c[n_c][n_r]:
        for k in c[n_c][n_r]:
            list_combos(n_c, n_r+1, curr + [b[k]])
    else:
        list_combos(n_c, n_r+1, curr)

list_combos(0, 0, [])
print(combos)
There are really two problems here: firstly, you need to work out the algorithm that you will use to solve this problem and secondly, you need to implement it (in Python).
Algorithm
I shall assume you want a maximal collection of branches; that is, one to which you can't add any more branches. If you don't, you can consider all subsets of a maximal collection.
Therefore, for a child node we want to take as many branches as possible, subject to the constraint that no two parent nodes share a root. In other words, from each child you may have at most one edge in the neighbourhood of each root node. This seems to suggest that you want to iterate first over the children, then over the (neighbourhoods of the) root nodes, and finally over the edges between these. This concept gives the following pseudocode:
for each child node:
    for each root node:
        remember each permissible edge
find all combinations of permissible edges
Code
>>> import networkx as nx
>>> import itertools
>>>
>>> G = nx.DiGraph()
>>> G.add_nodes_from(["r0", "r1", "p0", "p1", "p2", "p3", "p4", "c0", "c1", "c2"])
>>> G.add_edges_from([("r0", "p0"), ("r0", "p1"), ("r1", "p2"), ("r1", "p3"),
... ("r1", "p4"), ("p0", "c0"), ("p1", "c1"), ("p2", "c1"),
... ("p3", "c2"), ("p4", "c2")])
>>>
>>> combs = set()
>>> leaves = [node for node in G if not G.out_degree(node)]
>>> roots = [node for node in G if not G.in_degree(node)]
>>> for leaf in leaves:
... for root in roots:
... possibilities = tuple(edge for edge in G.in_edges_iter(leaf)
... if G.has_edge(root, edge[0]))
... if possibilities: combs.add(possibilities)
...
>>> combs
set([(('p1', 'c1'),),
(('p2', 'c1'),),
(('p3', 'c2'), ('p4', 'c2')),
(('p0', 'c0'),)])
>>> print list(itertools.product(*combs))
[(('p1', 'c1'), ('p2', 'c1'), ('p3', 'c2'), ('p0', 'c0')),
(('p1', 'c1'), ('p2', 'c1'), ('p4', 'c2'), ('p0', 'c0'))]
The above seems to work, although I haven't tested it.
For each child c, with hypothetical parents p(c), with roots r(p(c)), choose exactly one parent p from p(c) for each root r in r(p(c)) (such that r is the root of p) and include b in the combination where b connects p to c (assuming there is only one such b, meaning it's not a multigraph). The number of combinations will be the product of the numbers of parents by which each child is hypothetically connected to each root. In other words, the size of the set of combinations will be equal to the product of the hypothetical connections of all child-root pairs. In your example all such child-root pairs have only one path, except r1-c2, which has two paths, thus the size of the set of combinations is two.
This satisfies the constraint of no combination being a subset of another because by choosing exactly one parent for each root of each child, we maximize the number connections. Subsequently adding any edge b would cause its root to be connected to its child twice, which is not allowed. And since we are choosing exactly one, all combinations will be exactly the same length.
Implementing this choice recursively will yield the desired combinations.
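As a minimal sketch of that recursive choice (my own illustration; it assumes branch objects with the r, p, c attributes from the question), you can group the branches by (child, root) pair and pick exactly one branch from each group. For the data in the question this produces two combinations of four branches each, matching combo[0] and combo[1] up to ordering:
from collections import defaultdict

def combinations(branches):
    # group branches by (child, root): exactly one branch must be picked per group
    groups = defaultdict(list)
    for b in branches:
        groups[(b.c, b.r)].append(b)
    groups = list(groups.values())

    def pick(i, chosen):
        if i == len(groups):
            yield list(chosen)
            return
        for b in groups[i]:
            yield from pick(i + 1, chosen + [b])

    return list(pick(0, []))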