Finding depth of a Python tree from a text file


I'm relatively new to Python, and I was trying out some practice questions when I ran into this problem. A tree is defined in a text file in the following manner:
d:
e:
b: d e
c:
a: b c
So, I want to write a simple Python script that finds the depth of this tree. I can't figure out a strategy for this. Is there an algorithm or technique for it?

My strategy would be as follows:
1. Find elements with no children.
2. For each of these, find the parent. Determine if any elements have this parent as a child. If not, your length is two (2).
3. If so, find the parent of the parent. Repeat step 2, incrementing your length counter. Continue the process, updating the counter with each step.
For your case:
d -> b -> a (len 3)
e -> b -> a (len 3)
c -> a (len 2)
This could be described as a 'bottom up' tree construction method/algorithm.
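The bottom-up steps above can be sketched in a few lines. This is only an illustration under the assumption that every line has the form "node: child child ..." and that each node has at most one parent; the names parse_parents and chain_to_root are hypothetical helpers, not part of the original answer.

```python
text = """d:
e:
b: d e
c:
a: b c"""

def parse_parents(text):
    """Return (parent_of, leaves) for lines of the form 'node: child child ...'."""
    parent_of, leaves = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        node, _, rest = line.partition(":")
        node, children = node.strip(), rest.split()
        if not children:
            leaves.append(node)      # step 1: elements with no children
        for child in children:
            parent_of[child] = node  # remember each child's parent
    return parent_of, leaves

def chain_to_root(leaf, parent_of):
    """Steps 2 and 3: follow parent links from a leaf until the root is reached."""
    chain = [leaf]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain

parent_of, leaves = parse_parents(text)
chains = [chain_to_root(leaf, parent_of) for leaf in leaves]
depth = max(len(chain) for chain in chains)
print(chains, depth)
```

The longest chain gives the depth, counted in nodes as in the example above.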

The tree format you've given has a nice property: if x is the child of y, then x is given before y in the file. So you can simply loop through the file once and read the depth into a dictionary. For example:
depth = {}
for line in f:
    parent, children = read_node(line)
    if children:
        depth[parent] = max(depth.get(child, 1) for child in children) + 1
Then just print depth['a'], as a is the root. Here read_node is a quick function to parse the parent and children from a line of the file:
def read_node(line):
    parent, children = line.split(":")
    return parent, children.split()
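If the file might not list children before their parents, the single pass above no longer applies directly. A minimal sketch for that case, assuming the same "node: child child ..." format, reads every line first and then computes the depth recursively with caching. The name tree_depth and its parameters are illustrative, not part of the answer.

```python
from functools import lru_cache

def tree_depth(text, root):
    """Depth (counted in nodes) of the tree described by `text`, starting at `root`."""
    children = {}
    for line in text.splitlines():
        if line.strip():
            node, _, rest = line.partition(":")
            children[node.strip()] = rest.split()

    @lru_cache(maxsize=None)
    def depth(node):
        # A leaf has depth 1; otherwise one more than its deepest child.
        return 1 + max((depth(c) for c in children.get(node, [])), default=0)

    return depth(root)

# Parents listed before children, the reverse of the order in the question.
shuffled = "a: b c\nb: d e\nc:\nd:\ne:"
print(tree_depth(shuffled, "a"))
```

Because it recurses, this sketch is only suitable for trees shallower than Python's recursion limit.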

I'm not sure what you mean by depth. If you mean how many steps it takes to visit every node, you could use depth-first search (DFS) to count the nodes visited from the root.
Here's a simple implementation:
text_tree = """d:
e:
b: d e
c:
a: b c"""

# Parse each "node: child child ..." line into an adjacency mapping.
tree = {}
for line in text_tree.splitlines():
    node, childs = line.split(":")
    tree[node] = set(childs.split())

def dfs(graph, start):
    visited, stack = [], [start]
    while stack:
        vertex = stack.pop()
        if vertex not in visited:
            visited.append(vertex)
            stack.extend(graph[vertex])
    return visited

result = dfs(tree, "a")
print("It took %d steps, to visit every node in tree, the path took was %s" % (len(result), result))
Which outputs:
It took 5 steps, to visit every node in tree, the path took was ['a', 'b', 'd', 'e', 'c']

Related

Find all children of top parent in python

I have a list of parent-child relations where the structure isn't a true tree. Some parents can have many children and also some children can have more than one parent.
import pandas as pd
df = pd.DataFrame([[123,234],[123,235],[123,236],[124,236],[234,345],[236,346]], columns=['Parent','Child'])
I would like to group all children for specific ancestors. From the data, the correct groups should be:
123,234,235,236,345,346
124,236,346
I tried with:
parents = set()
children = {}
for p, c in df.to_records(index=False).tolist():
    parents.add(p)
    children[c] = p

def getAncestors(p):
    return (getAncestors(children[p]) if p in children else []) + [p]
But on 346 it only returns one group.
Also, how to then find all children for 123 and 124?
Thank you!

As you said, it's not really a tree, but more like a directed acyclic graph, so you can't map each child to just one parent; it'd have to be a list of parents. Also, given your use case, I'd suggest mapping parents to their lists of children instead.
relations = [[123,234],[234,345],[123,235],[123,236],[124,236],[236,346]]
children = {}
for p, c in relations:
    children.setdefault(p, []).append(c)
roots = set(children) - set(c for cc in children.values() for c in cc)
You can then use a recursive function similar to the one you already have to get all the children to a given root node (or any parent node). The root itself is not in the list, but can easily be added.
def all_children(p):
    if p not in children:
        return set()
    return set(children[p] + [b for a in children[p] for b in all_children(a)])

print({p: all_children(p) for p in roots})
# {123: {234, 235, 236, 345, 346}, 124: {346, 236}}
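When many parents share descendants, the recursive function above recomputes the same sub-graphs repeatedly. A possible alternative, sketched here under the assumption that the relations are given as the same list of [parent, child] pairs, walks the graph iteratively and keeps a set of nodes already seen. The function name descendants is illustrative only.

```python
def descendants(children, root):
    """Collect every node reachable from `root`, visiting each node once."""
    seen = set()
    stack = list(children.get(root, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already collected through another parent
        seen.add(node)
        stack.extend(children.get(node, []))
    return seen

relations = [[123, 234], [234, 345], [123, 235], [123, 236], [124, 236], [236, 346]]
children = {}
for p, c in relations:
    children.setdefault(p, []).append(c)

# Roots are parents that never appear as a child.
roots = set(children) - {c for cs in children.values() for c in cs}
groups = {r: descendants(children, r) for r in sorted(roots)}
print(groups)
```

The seen set also keeps the loop from running forever if the data ever contains a cycle.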

Calculating python tree height for large data sets

I'm trying to get an efficient algorithm to calculate the height of a tree in Python for large datasets. The code I have works for small datasets, but takes a long time for really large ones (100,000 items) so I'm trying to figure out ways to optimize it but am getting stuck. Sorry if it seems like a really newbie question, I'm pretty new to Python.
The input is a list length and a list of values, with each list item pointing to its parent, with list item -1 indicating the root of the tree. So with an input of:
5
4 -1 4 1 1
The answer would be 3. The tree is: {key: 1, children: [{key: 3}, {key: 4, children: [{key: 0}, {key: 2}]}]}
Here is the code that I have so far:
import sys, threading
sys.setrecursionlimit(10**7)  # max depth of recursion
threading.stack_size(2**25)  # new thread will get stack of such size

class TreeHeight:
    def read(self):
        self.n = int(sys.stdin.readline())
        self.parent = list(map(int, sys.stdin.readline().split()))

    def getChildren(self, node, nodes):
        parent = {'key': node, 'children': []}
        children = [i for i, x in enumerate(nodes) if x == parent['key']]
        for child in children:
            parent['children'].append(self.getChildren(child, nodes))
        return parent

    def compute_height(self, tree):
        if len(tree['children']) == 0:
            return 0
        else:
            max_values = []
            for child in tree['children']:
                max_values.append(self.compute_height(child))
            return 1 + max(max_values)

def main():
    tree = TreeHeight()
    tree.read()
    treeChild = tree.getChildren(-1, tree.parent)
    print(tree.compute_height(treeChild))

threading.Thread(target=main).start()

First, while Python is a great general-purpose language, plain Python is not very efficient for large datasets. Consider using pandas, NumPy, SciPy or one of the many alternatives.
Second, if you only care about the tree's height and the tree is written once and then only read, you could alter the code that reads the input so that it measures the height while it fills the tree.
This approach makes sense when you don't expect your tree to change after it has been created.
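One way to measure the height directly from the input, without building nested dictionaries, is to compute each node's depth once and reuse it. The sketch below assumes the parent-array format from the question (-1 marks the root) and that the input contains no cycles; tree_height is an illustrative name. Every node is placed on a path exactly once, so the work is linear in the number of nodes and no recursion is needed.

```python
def tree_height(parents):
    """Height, counted in nodes, of a tree given as a parent array."""
    depth = [0] * len(parents)  # 0 means "not computed yet"
    for node in range(len(parents)):
        # Walk up until we reach the root or a node whose depth is known.
        path = []
        cur = node
        while cur != -1 and depth[cur] == 0:
            path.append(cur)
            cur = parents[cur]
        base = 0 if cur == -1 else depth[cur]
        # Unwind: the node nearest the known ancestor gets base + 1, and so on.
        for n in reversed(path):
            base += 1
            depth[n] = base
    return max(depth, default=0)

print(tree_height([4, -1, 4, 1, 1]))
```

For the example input this yields 3, matching the expected answer in the question.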

Use an iterative breadth-first (level-order) traversal instead of recursion to avoid stack overflow. Use a marker to know the end of a level during the traversal.
from collections import defaultdict, deque

def compute_height(root, tree):
    # collections.deque stands in for the ListQueue class, which is not defined here.
    q = deque([root, '$'])  # '$' marks the end of the current level
    height = 1
    while q:
        elem = q.popleft()
        if elem == '$':
            if not q:  # the marker was the last item: traversal is done
                break
            height += 1  # a new level starts after each marker
            q.append('$')
            continue
        for child in tree[elem]:
            q.append(child)
    return height

tree = defaultdict(list)
parents = [4, -1, 4, 1, 1]
for node, parent in enumerate(parents):
    tree[parent].append(node)
root = tree.pop(-1)[0]
print(compute_height(root, tree))

Cycle detected while printing root to leaves path

I used the solution to this problem to print all root-to-leaf paths for an n-ary tree I have.
Unfortunately, I suspect there is a cycle in one of the branches of the tree, which makes the program exceed the maximum recursion limit.
    A
   / \
  B   C
  |   /\
  D  E  F
  |
  A (back to root)
D again goes back to A
Please tell me how I should handle cycle detection in the program below.
def paths(tree):
    # Helper function:
    # receives a tree and returns all paths that have this node as root,
    # plus all other paths.
    if tree is None:  # the empty tree (assumed here to be represented by None)
        return ([], [])
    else:  # tree is a node
        root = tree.value
        rooted_paths = [[root]]
        unrooted_paths = []
        for subtree in tree.children:
            (useable, unuseable) = paths(subtree)
            for path in useable:
                unrooted_paths.append(path)
                rooted_paths.append([root] + path)
            for path in unuseable:
                unrooted_paths.append(path)
        return (rooted_paths, unrooted_paths)

def the_function_you_use_in_the_end(tree):
    a, b = paths(tree)
    return a + b
P.S. I tried using visited-node logic for detection, but that is not very helpful because a node can legitimately be visited multiple times.
For example:
A C
A C E
A C F
C is visited multiple times.

Keep a set of the nodes visited, and before visiting the next node check that it is not already in that set.
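One way to reconcile this with the concern in the question (a node such as C legitimately appears in several paths) is to track only the nodes on the current root-to-node path rather than every node ever visited. The sketch below is an assumption about how that could look, not the original solution: Node is a hypothetical class with the value and children attributes used in the question, and a node whose only children lead back to an ancestor is treated as a leaf.

```python
class Node:
    """Hypothetical tree node with the attributes used in the question."""
    def __init__(self, value):
        self.value = value
        self.children = []

def root_to_leaf_paths(node, path=None):
    """Yield root-to-leaf paths as lists of values, cutting edges that return to an ancestor."""
    path = (path or []) + [node]
    # A child that is already on the current path would close a cycle, so skip it.
    nexts = [c for c in node.children if all(c is not p for p in path)]
    if not nexts:
        yield [n.value for n in path]
        return
    for child in nexts:
        yield from root_to_leaf_paths(child, path)

# Build the graph from the question: D points back to the root A.
a, b, c, d, e, f = (Node(name) for name in "ABCDEF")
a.children = [b, c]
b.children = [d]
d.children = [a]  # the cycle
c.children = [e, f]

paths = list(root_to_leaf_paths(a))
print(paths)
```

Because the check is made against the current path only, C can still appear in both the A C E and A C F paths.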

kosaraju finding finishing time using iterative dfs

Here is the first part of the code that I have written for Kosaraju's algorithm.
###### reading the data #####
with open('data.txt') as req_file:
    ori_data = []
    for line in req_file:
        line = line.split()
        if line:
            line = [int(i) for i in line]
            ori_data.append(line)
###### forming the Grev ####
revscc_dic = {}
for temp in ori_data:
    if temp[1] not in revscc_dic:
        revscc_dic[temp[1]] = [temp[0]]
    else:
        revscc_dic[temp[1]].append(temp[0])
print(revscc_dic)
######## finding the G#####
scc_dic = {}
for temp in ori_data:
    if temp[0] not in scc_dic:
        scc_dic[temp[0]] = [temp[1]]
    else:
        scc_dic[temp[0]].append(temp[1])
print(scc_dic)
##### iterative dfs ####
path = []
for i in range(max(max(ori_data)), 0, -1):
    start = i
    q = [start]
    while q:
        v = q.pop(0)
        if v not in path:
            path.append(v)
            q = revscc_dic[v] + q
print(path)
The code reads the data and forms Grev and G correctly. I have written code for an iterative DFS. How can I extend it to find the finishing times? I understand how to find finishing times with pen and paper, but I do not understand how to express them in code. Only after this can I proceed to the next part of my code. Please help. Thanks in advance.
The data.txt file contains:
1 4
2 8
3 6
4 7
5 2
6 9
7 1
8 5
8 6
9 7
9 3
please save it as data.txt.

With recursive dfs, it is easy to see when a given vertex has "finished" (i.e. when we have visited all of its children in the dfs tree). The finish time can be calculated just after the recursive call has returned.
However with iterative dfs, this is not so easy. Now that we are iteratively processing the queue using a while loop we have lost some of the nested structure that is associated with function calls. Or more precisely, we don't know when backtracking occurs. Unfortunately, there is no way to know when backtracking occurs without adding some additional information to our stack of vertices.
The quickest way to add finishing times to your dfs implementation is like so:
##### iterative dfs (with finish times) ####
path = []
time = 0
finish_time_dic = {}
for i in range(max(max(ori_data)), 0, -1):
    start = i
    q = [start]
    while q:
        v = q.pop(0)
        if v not in path:
            path.append(v)
            q = [v] + q
            for w in revscc_dic[v]:
                if w not in path: q = [w] + q
        else:
            if v not in finish_time_dic:
                finish_time_dic[v] = time
                time += 1
print(path)
print(finish_time_dic)
The trick used here is that when we pop off v from the stack, if it is the first time we have seen it, then we add it back to the stack again. This is done using: q = [v] + q. We must push v onto the stack before we push on its neighbours (we write the code that pushes v before the for loop that pushes v's neighbours) - or else the trick doesn't work. Eventually we will pop v off the stack again. At this point, v has finished! We have seen v before, so, we go into the else case and compute a fresh finish time.
For the graph provided, finish_time_dic gives the correct finishing times:
{1: 6, 2: 1, 3: 3, 4: 7, 5: 0, 6: 4, 7: 8, 8: 2, 9: 5}
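Note that pushing each node of the graph onto the stack twice does not change the asymptotic cost of DFS, which stays O(V+E). The code above does not quite reach that bound as written, because testing "v not in path" scans a list and "q = [v] + q" copies the stack; using a set for the visited vertices and append/pop on the stack would restore it. More elegant solutions also exist.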
I recommend reading Chapter 5 of Python Algorithms: Mastering Basic Algorithms in the Python Language by Magnus Lie Hetland (ISBN: 1430232374, 9781430232377). Question 5-6 and 5-7 (on page 122) describe your problem exactly. The author answers these questions and gives an alternate solution to the problem.
Questions:
5-6 In recursive DFS, backtracking occurs when you return from one of the recursive calls. But where has the backtracking gone in the iterative version?
5-7. Write a nonrecursive version of DFS that can determine finish-times.
Answers:
5-6 It’s not really represented at all in the iterative version. It just implicitly occurs once you’ve popped off all your “traversal descendants” from the stack.
5-7 As explained in Exercise 5-6, there is no point in the code where backtracking occurs in the iterative DFS, so we can’t just set the finish time at some specific place (like in the recursive one). Instead, we’d need to add a marker to the stack. For example, instead of adding the neighbors of u to the stack, we could add edges of the form (u, v), and before all of them, we’d push (u, None), indicating the backtracking point for u.
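The marker idea in the answer to 5-7 can be sketched as follows. This is one possible reading of it, not the book's code: the stack holds (u, v) edges, and (u, None) is the backtracking point for u. The function name dfs_finish_times and its parameters are illustrative, and vertices are assumed never to be None. Neighbours are pushed in list order, so the last neighbour is explored first, as in the code earlier in this answer.

```python
def dfs_finish_times(graph, order):
    """Iterative DFS that records finish times using (u, None) markers.

    graph: dict mapping a vertex to a list of its neighbours.
    order: iterable of start vertices, tried in the given order.
    """
    visited = set()
    finish = {}
    time = 0
    for start in order:
        if start in visited:
            continue
        stack = [(None, start)]  # pseudo-edge leading into the start vertex
        while stack:
            u, v = stack.pop()
            if v is None:
                # Marker reached: everything pushed after u has been handled.
                finish[u] = time
                time += 1
                continue
            if v in visited:
                continue
            visited.add(v)
            stack.append((v, None))  # backtracking point for v, below its edges
            for w in graph.get(v, []):
                if w not in visited:
                    stack.append((v, w))
    return finish

# Reversed graph built from the data.txt contents in the question.
grev = {4: [1], 8: [2], 6: [3, 8], 7: [4, 9], 2: [5], 9: [6], 1: [7], 5: [8], 3: [9]}
finish = dfs_finish_times(grev, range(9, 0, -1))
print(finish)
```

With a set for visited vertices and append/pop on a list, each vertex and edge is handled a constant number of times.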

Iterative DFS itself is not complicated, as seen from Wikipedia. However, calculating the finish time of each node requires some tweaks to the algorithm. We only pop the node off the stack the 2nd time we encounter it.
Here's my implementation which I feel demonstrates what's going on a bit more clearly:
step = 0  # time counter

def dfs_visit(g, v):
    """Run iterative DFS from node V"""
    global step
    total = 0
    stack = [v]  # create stack with starting vertex
    while stack:  # while stack is not empty
        step += 1
        v = stack[-1]  # peek top of stack
        if v.color:  # if already seen
            v = stack.pop()  # done with this node, pop it from stack
            if v.color == 1:  # if GRAY, finish this node
                v.time_finish = step
                v.color = 2  # BLACK, done
        else:  # seen for first time
            v.color = 1  # GRAY: discovered
            v.time_discover = step
            total += 1
            for w in v.child:  # for all neighbor (v, w)
                if not w.color:  # if not seen
                    stack.append(w)
    return total

def dfs(g):
    """Run DFS on graph"""
    global step
    step = 0  # reset step counter
    for k, v in g.nodes.items():
        if not v.color:
            dfs_visit(g, v)
I am following the conventions of the CLR Algorithm Book and use node coloring to designate its state during the DFS search. I feel this is easier to understand than using a separate list to track node state.
All nodes start out as white. When it's discovered during the search it is marked as gray. When we are done with it, it is marked as black.
Within the while loop, if a node is white we keep it in the stack, and change its color to gray. If it's gray we change its color to black, and set its finish time. If it's black we just ignore it.
It is possible for a node on the stack to be black (even with our coloring check before adding it to the stack). A white node can be added to the stack twice (via two different neighbors). One will eventually turn black. When we reach the 2nd instance on the stack, we need to make sure we don't change its already set finish time.
Here is some supporting code:
import heapq
from collections import defaultdict

class Node(object):
    def __init__(self, name=None):
        self.name = name
        self.child = []  # children | adjacency list
        self.color = 0  # 0: white [unvisited], 1: gray [found], 2: black [finished]
        self.time_discover = None  # DFS
        self.time_finish = None  # DFS

class Graph(object):
    def __init__(self):
        self.nodes = defaultdict(Node)  # mapping of key -> Node
        self.max_heap = []  # nodes in decreasing finish time for SCC

    def build_max_heap(self):
        """Build list of nodes in max heap using DFS finish time"""
        for k, v in self.nodes.items():
            self.max_heap.append((0 - v.time_finish, v))  # invert finish time for max heap
        heapq.heapify(self.max_heap)
To run DFS on the reverse graph, you can build a parent list similar to the child list for each Node when the edges file is processed, and use the parent list instead of the child list in dfs_visit().
To process Nodes in decreasing finish time for the last part of SCC computation, you can build a max heap of Nodes, and use that max heap in dfs_visit() instead of simply the child list.
while g.max_heap:
    v = heapq.heappop(g.max_heap)[1]
    if not v.color:
        size = dfs_visit(g, v)
        scc_size.append(size)

I had a few issues with the order produced by Lawson's version of the iterative DFS. Here is code for my version which has a 1-to-1 mapping with a recursive version of DFS.
n = len(graph)
time = 0
finish_times = [0] * (n + 1)
explored = [False] * (n + 1)

# Determine if every vertex connected to v
# has already been explored
def all_explored(G, v):
    if v in G:
        for w in G[v]:
            if not explored[w]:
                return False
    return True

# Loop through vertices in reverse order
for v in range(n, 0, -1):
    if not explored[v]:
        stack = [v]
        while stack:
            print(stack)
            v = stack[-1]
            explored[v] = True
            # If v still has outgoing edges to explore
            if not all_explored(graph_reversed, v):
                for w in graph_reversed[v]:
                    # Explore w before others attached to v
                    if not explored[w]:
                        stack.append(w)
                        break
            # We have explored vertices findable from v
            else:
                stack.pop()
                time += 1
                finish_times[v] = time

Here are the recursive and iterative implementations in Java:
int time = 0;

public void dfsRecursive(Vertex vertex) {
    time += 1;
    vertex.setVisited(true);
    vertex.setDiscovered(time);
    for (String neighbour : vertex.getNeighbours()) {
        if (!vertices.get(neighbour).getVisited()) {
            dfsRecursive(vertices.get(neighbour));
        }
    }
    time += 1;
    vertex.setFinished(time);
}

public void dfsIterative(Vertex vertex) {
    Stack<Vertex> stack = new Stack<>();
    stack.push(vertex);
    while (!stack.isEmpty()) {
        Vertex current = stack.pop();
        if (!current.getVisited()) {
            time += 1;
            current.setVisited(true);
            current.setDiscovered(time);
            stack.push(current);
            List<String> currentsNeigbours = current.getNeighbours();
            for (int i = currentsNeigbours.size() - 1; i >= 0; i--) {
                String currentNeigbour = currentsNeigbours.get(i);
                Vertex neighBour = vertices.get(currentNeigbour);
                if (!neighBour.getVisited())
                    stack.push(neighBour);
            }
        } else {
            if (current.getFinished() < 1) {
                time += 1;
                current.setFinished(time);
            }
        }
    }
}

First, you should know exactly what the finishing time is. In recursive DFS, a node v is finished once all of its adjacent nodes have finished.
With this in mind, you need additional data structures to store that information. In pseudocode:
adj[][]                      // graph
visited[] = NULL             // array of visited nodes
finished[] = NULL            // array of finished nodes
Stack st = new Stack         // normal stack
Stack backtrack = new Stack  // additional stack

function getFinishedTime() {
    for (node i in adj) {
        if (!visited.contains(i)) {
            st.push(i);
            visited.add(i);
            while (!st.isEmpty) {
                int j = st.pop();
                int[] unvisitedChild = getUnvisitedChild(j);
                if (unvisitedChild != null) {
                    for (int c in unvisitedChild) {
                        st.push(c);
                        visited.add(c);
                    }
                    // Each entry is an array: the parent node j first,
                    // followed by all of its unvisited child nodes.
                    backtrack.push([j, unvisitedChild]);
                }
                else {
                    finished.add(j);
                    // Once all child nodes are finished, the parent node is finished too.
                    while (!backtrack.isEmpty && finished.containsAll(backtrack.peek())) {
                        parent = backtrack.pop()[0];
                        finished.add(parent);
                    }
                }
            }
        }
    }
}

function getUnvisitedChild(int i) {
    unvisitedChild[] = null
    for (int child in adj[i]) {
        if (!visited.contains(child))
            unvisitedChild.add(child);
    }
    return unvisitedChild;
}
The resulting finishing order should be:
[5, 2, 8, 3, 6, 9, 1, 4, 7]

Python: How to find more than one pathway in a recursive loop when multiple child nodes refers back to the parent?

I'm using recursion to find the paths from some point A to some point D.
I'm traversing a graph to find the pathways.
Let's say:
Graph = {'A':['route1','route2'],'B':['route1','route2','route3','route4'], 'C':['route3','route4'], 'D':['route4'] }
Accessible through:
A -> route1, route2
B -> route2, route 3, route 4
C -> route3, route4
There are two solutions in this path from A -> D:
route1 -> route2 -> route4
route1 -> route2 -> route3 -> route4
Since point A and point B both have route1 and route2, there is an infinite loop, so I add a check (a 0 or 1 value) whenever I visit a node.
However, with the check I only get one solution back, route1 -> route2 -> route4, and not the other possible solution.
Here is the actual coding: Routes will be substituted by Reactions.
def find_all_paths(graph,start, end, addReaction, passed = {}, reaction = [] ,path=[]):
    passOver = passed
    path = path + [start]
    reaction = reaction + [addReaction]
    if start == end:
        return [reaction]
    if not graph.has_key(start):
        return []
    paths=[]
    reactions=[]
    for x in range (len(graph[start])):
        for y in range (len(graph)):
            for z in range (len(graph.values()[y])):
                if (graph[start][x] == graph.values()[y][z]):
                    if passOver.values()[y][z] < 161 :
                        passOver.values()[y][z] = passOver.values()[y][z] + 1
                        if (graph.keys()[y] not in path):
                            newpaths = find_all_paths(graph, (graph.keys()[y]), end, graph.values()[y][z], passOver , reaction, path)
                            for newpath in newpaths:
                                reactions.append(newpath)
    return reactions
Here is the method call. dic_passOver is a dictionary keeping track of whether the nodes have been visited:
solution = (find_all_paths( graph, 'M_glc_DASH_D_c', 'M_pyr_c', 'begin', dic_passOver ))
My problem seems to be that once a route is visited, it can no longer be accessed, so the other possible solutions are never found. I worked around this by allowing each route to be visited up to 161 times, which is enough to find all the possible routes for my specific data.
if passOver.values()[y][z] < 161 :
    passOver.values()[y][z] = passOver.values()[y][z] + 1
However, this seems highly inefficient, and most of my data will be graphs with thousands of entries. In addition, I won't know in advance how many node visits to allow in order to find all routes. The number 161 was worked out manually.

Well, I can't understand your representation of the graph. But this is a generic algorithm you can use for finding all paths which avoids infinite loops.
First you need to represent your graph as a dictionary which maps nodes to a set of nodes they are connected to. Example:
graph = {'A':{'B','C'}, 'B':{'D'}, 'C':{'D'}}
That means that from A you can go to B and C. From B you can go to D and from C you can go to D. We're assuming the links are one-way. If you want them to be two way just add links for going both ways.
If you represent your graph in that way, you can use the below function to find all paths:
def find_all_paths(start, end, graph, visited=None):
    if visited is None:
        visited = set()
    # Build a new set so that sibling branches do not see each other's visited nodes.
    visited = visited | {start}
    for node in graph.get(start, ()):
        if node in visited:
            continue
        if node == end:
            yield [start, end]
        else:
            for path in find_all_paths(node, end, graph, visited):
                yield [start] + path
Example usage (the order of the paths may vary, because sets are unordered):
>>> graph = {'A':{'B','C'}, 'B':{'D'}, 'C':{'D'}}
>>> for path in find_all_paths('A','D', graph):
...     print(path)
...
['A', 'C', 'D']
['A', 'B', 'D']
>>>
Edit to take into account comments clarifying graph representation
Below is a function to transform your graph representation(assuming I understood it correctly and that routes are bi-directional) to the one used in the algorithm above
def change_graph_representation(graph):
    reverse_graph = {}
    for node, links in graph.items():
        for link in links:
            if link not in reverse_graph:
                reverse_graph[link] = set()
            reverse_graph[link].add(node)
    result = {}
    for node, links in graph.items():
        adj = set()
        for link in links:
            adj |= reverse_graph[link]
        adj -= {node}
        result[node] = adj
    return result
If it is important that you find the path in terms of the links, not the nodes traversed you can preserve this information like so:
def change_graph_representation(graph):
    reverse_graph = {}
    for node, links in graph.items():
        for link in links:
            if link not in reverse_graph:
                reverse_graph[link] = set()
            reverse_graph[link].add(node)
    result = {}
    for node, links in graph.items():
        adj = {}
        for link in links:
            for n in reverse_graph[link]:
                adj[n] = link
        del adj[node]
        result[node] = adj
    return result
And use this modified search:
def find_all_paths(start, end, graph, visited=None):
    if visited is None:
        visited = set()
    # Build a new set so that sibling branches do not see each other's visited nodes.
    visited = visited | {start}
    for node, link in graph.get(start, {}).items():
        if node in visited:
            continue
        if node == end:
            yield [link]
        else:
            for path in find_all_paths(node, end, graph, visited):
                yield [link] + path
That will give you paths in terms of links to follow instead of nodes to traverse. Hope this helps :)
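To see the whole approach end to end on the graph from the question, here is a compact, self-contained sketch. It assumes routes are bidirectional, as above, and it tracks the nodes on the current path only, so a node used in one path can still be reused in another. The names to_adjacency and all_simple_paths are illustrative and are not the functions defined above.

```python
def to_adjacency(graph):
    """Map each node to {neighbour: shared_route}, assuming routes are bidirectional."""
    nodes_on = {}
    for node, routes in graph.items():
        for route in routes:
            nodes_on.setdefault(route, set()).add(node)
    adjacency = {}
    for node, routes in graph.items():
        adjacency[node] = {other: route
                           for route in routes
                           for other in sorted(nodes_on[route])
                           if other != node}
    return adjacency

def all_simple_paths(adjacency, start, end, path=()):
    """Yield every path from start to end that never repeats a node."""
    path = path + (start,)  # a new tuple per call, so branches do not interfere
    if start == end:
        yield list(path)
        return
    for nxt in sorted(adjacency.get(start, {})):
        if nxt not in path:
            yield from all_simple_paths(adjacency, nxt, end, path)

graph = {'A': ['route1', 'route2'],
         'B': ['route1', 'route2', 'route3', 'route4'],
         'C': ['route3', 'route4'],
         'D': ['route4']}
paths = list(all_simple_paths(to_adjacency(graph), 'A', 'D'))
print(paths)
```

Both solutions are found without any visit counter, because each recursive branch carries its own copy of the path.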
