Calculating python tree height for large data sets


I'm trying to get an efficient algorithm to calculate the height of a tree in Python for large datasets. The code I have works for small datasets, but takes a long time for really large ones (100,000 items) so I'm trying to figure out ways to optimize it but am getting stuck. Sorry if it seems like a really newbie question, I'm pretty new to Python.
The input is a list length and a list of values, with each list item pointing to its parent, with list item -1 indicating the root of the tree. So with an input of:
5
4 -1 4 1 1
The answer would be 3 - the tree is: {key: 1, children: [{key: 3}, {key: 4, children: [{key: 0}, {key: 2}]}]}
Here is the code that I have so far:
import sys, threading
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**25) # new thread will get stack of such size
class TreeHeight:
    def read(self):
        self.n = int(sys.stdin.readline())
        self.parent = list(map(int, sys.stdin.readline().split()))
    def getChildren(self, node, nodes):
        parent = {'key': node, 'children': []}
        children = [i for i, x in enumerate(nodes) if x == parent['key']]
        for child in children:
            parent['children'].append(self.getChildren(child, nodes))
        return parent
    def compute_height(self, tree):
        if len(tree['children']) == 0:
            return 0
        else:
            max_values = []
            for child in tree['children']:
                max_values.append(self.compute_height(child))
            return 1 + max(max_values)
def main():
    tree = TreeHeight()
    tree.read()
    treeChild = tree.getChildren(-1, tree.parent)
    print(tree.compute_height(treeChild))
threading.Thread(target=main).start()

First, while Python is a great general-purpose language, using raw Python for large datasets is not very efficient. Consider using pandas, NumPy, SciPy or one of the many great alternatives.
Second, if you're only concerned with the tree's height, and your tree is written once and then only read, you could simply alter the code that reads the input so that it not only fills the tree but also measures its height.
This approach makes sense when you don't expect your tree to change after it has been created.
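One possible reading of the second suggestion, as a minimal sketch rather than the answerer's own code: skip building the nested tree and compute each node's level directly from the parent list, remembering levels that are already known. The function name tree_height and the depth list are assumptions made for this illustration. Each node is visited a bounded number of times, so the work grows linearly with the input, and no recursion is involved.

```python
def tree_height(parents):
    """Return the number of levels in a tree given as a parent-index list."""
    depth = [0] * len(parents)  # 0 means "not computed yet"
    for node in range(len(parents)):
        # Walk up until we reach the root marker or a node whose depth is known.
        path = []
        current = node
        while current != -1 and depth[current] == 0:
            path.append(current)
            current = parents[current]
        base = 0 if current == -1 else depth[current]
        # Unwind the path: the node nearest the root gets base + 1, and so on down.
        for offset, visited in enumerate(reversed(path), start=1):
            depth[visited] = base + offset
    return max(depth, default=0)

print(tree_height([4, -1, 4, 1, 1]))  # prints 3
```

With the sample input from the question this prints 3, and a 100,000-node chain is handled without touching the recursion limit.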

Use BFS (a level-order traversal with an explicit queue) to avoid stack overflow from recursive calls. Use a marker to know the end of a level during the traversal.
from collections import defaultdict, deque
def compute_height(root, tree):
    q = deque([root, '$'])  # '$' marks the end of the current level
    height = 1
    while q:
        elem = q.popleft()
        if elem == '$':
            if not q:
                break  # the marker was the last item: every level is counted
            height += 1
            q.append('$')  # close off the level that has just been queued
            continue
        for child in tree[elem]:
            q.append(child)
    return height
tree = defaultdict(list)
parents = [4, -1, 4, 1, 1]
for node, parent in enumerate(parents):
    tree[parent].append(node)
root = tree.pop(-1)[0]
print(compute_height(root, tree))

Related

Bad Tree design, Data Structure

I tried making a Tree as a part of my Data Structures course. The code works but is extremely slow, almost double the time that is accepted for the course. I do not have experience with Data Structures and Algorithms but I need to optimize the program. If anyone has any tips, advices, criticism I would greatly appreciate it.
The tree is not necessarily a binary tree.
Here is the code:
import sys
import threading
class Node:
    def __init__(self,value):
        self.value = value
        self.children = []
        self.parent = None
    def add_child(self,child):
        child.parent = self
        self.children.append(child)
def compute_height(n, parents):
    found = False
    indices = []
    for i in range(n):
        indices.append(i)
    for i in range(len(parents)):
        currentItem = parents[i]
        if currentItem == -1:
            root = Node(parents[i])
            startingIndex = i
            found = True
            break
    if found == False:
        root = Node(parents[0])
        startingIndex = 0
    return recursion(startingIndex,root,indices,parents)
def recursion(index,toWhomAdd,indexes,values):
    children = []
    for i in range(len(values)):
        if index == values[i]:
            children.append(indexes[i])
            newNode = Node(indexes[i])
            toWhomAdd.add_child(newNode)
            recursion(i, newNode, indexes, values)
    return toWhomAdd
def checkHeight(node):
    if node == '' or node == None or node == []:
        return 0
    counter = []
    for i in node.children:
        counter.append(checkHeight(i))
    if node.children != []:
        mostChildren = max(counter)
    else:
        mostChildren = 0
    return(1 + mostChildren)
def main():
    n = int(input())
    parents = list(map(int, input().split()))
    root = compute_height(n, parents)
    print(checkHeight(root))
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**27) # new thread will get stack of such size
threading.Thread(target=main).start()
Edit:
For this input(first number being number of nodes and other numbers the node's values)
5
4 -1 4 1 1
We expect this output(height of the tree)
3
Another example:
Input:
5
-1 0 4 0 3
Output:
4

It looks like the value that is given for a node, is a reference by index of another node (its parent). This is nowhere stated in the question, but if that assumption is right, you don't really need to create the tree with Node instances. Just read the input into a list (which you already do), and you actually have the tree encoded in it.
So for example, the list [4, -1, 4, 1, 1] represents this tree, where the labels are the indices in this list:
      1
     / \
    4   3
   / \
  0   2
The height of this tree — according to the definition given in Wikipedia — would be 2. But apparently the expected result is 3, which is the number of nodes (not edges) on the longest path from the root to a leaf, or — otherwise put — the number of levels in the tree.
The idea to use recursion is correct, but you can do it bottom up (starting at any node), getting the result of the parent recursively, and adding 1 to it. Use the principle of dynamic programming by storing the result for each node in a separate list, which I called levels:
def get_num_levels(parents):
    levels = [0] * len(parents)
    def recur(node):
        if levels[node] == 0: # this node's level hasn't been determined yet
            parent = parents[node]
            levels[node] = 1 if parent == -1 else recur(parent) + 1
        return levels[node]
    for node in range(len(parents)):
        recur(node)
    return max(levels)
And the main code could be as you had it:
def main():
    n = int(input())
    parents = list(map(int, input().split()))
    print(get_num_levels(parents))

Generating, traversing and printing binary tree

I generated perfectly balanced binary tree and I want to print it. In the output there are only 0s instead of the data I generated. I think it's because of the line in function printtree that says print(tree.elem), cause in the class self.elem = 0.
How can I connect these two functions generate and printtree?
class BinTree:
    def __init__(self):
        self.elem = 0
        self.left = None
        self.right = None
def generate(pbt, N):
    if N == 0:
        pbt = None
    else:
        pbt = BinTree()
        x = input()
        pbt.elem = int(x)
        generate(pbt.left, N // 2)
        generate(pbt.right, N - N // 2 - 1)
def printtree(tree, h):
    if tree is not None:
        tree = BinTree()
        printtree(tree.right, h+1)
        for i in range(1, h):
            print(end = "......")
        print(tree.elem)
        printtree(tree.left, h+1)
Hope somebody can help me. I am a beginner in coding.
For example:
N=6, pbt=pbt, tree=pbt, h=0
input:
1
2
3
4
5
6
and the output:
......5
............6
1
............4
......2
............3

I'd suggest reading up on: https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/
Basically, there are three ways to traverse a binary tree: in-order, post-order and pre-order.
The issue with your print function is that you're reassigning the tree that is being passed in to a new, empty tree.
if tree is not None:
    tree = BinTree()
Right? If tree is not None and has something in it, you replace it with an empty tree, so every element printed is the default 0.
Traversing a tree is actually a lot simpler than you'd imagine. I think the complexity comes from trying to picture in your head how it all works out, but the truth is that traversing a tree can be done in 3 - 4 lines.
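As a rough sketch of how generate and printtree could be connected (an illustration, not the original poster's final code): generate needs to return the node it builds so that the caller can attach it, because assigning to the pbt parameter only rebinds a local name. The values iterator and the tree_lines helper are assumptions introduced here so the example runs without keyboard input.

```python
class BinTree:
    def __init__(self, elem=0):
        self.elem = elem
        self.left = None
        self.right = None

def generate(n, values):
    """Build a balanced tree with n nodes, taking elements from the iterator `values`."""
    if n == 0:
        return None
    node = BinTree(next(values))
    node.left = generate(n // 2, values)
    node.right = generate(n - n // 2 - 1, values)
    return node  # returning the node is what links it to its parent

def tree_lines(tree, h=0):
    """Return the lines of the sideways picture: right subtree, node, left subtree."""
    if tree is None:
        return []
    return (tree_lines(tree.right, h + 1)
            + ["......" * h + str(tree.elem)]
            + tree_lines(tree.left, h + 1))

# Hypothetical usage: a fixed list stands in for six calls to input().
root = generate(6, iter([1, 2, 3, 4, 5, 6]))
print("\n".join(tree_lines(root)))
```

For the inputs 1 to 6 this reproduces the layout shown in the question. To read from the keyboard instead, the iterator could be replaced with a generator such as (int(input()) for _ in range(6)).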

Non-binary Tree Height (Optimization)

Introduction
So I'm doing a course on edX and have been working on this practice assignment for the better part of 3 hours, yet I still can't find a way to implement this method without it taking too long and timing out the automatic grader.
I've tried 3 different methods, all of which did the same thing, including 2 recursive approaches and 1 non-recursive approach (my latest).
The problem I think I'm having with my code is that the method to find children just takes way too long because it has to iterate over the entire list of nodes.
Input and output format
Input includes N on the first line which is the size of the list which is given on line 2.
Example:
5
-1 0 4 0 3
To build a tree from this:
Each of the values in the array are a pointer to another index in the array such that in the example above 0 is a child node of -1 (index 0). Since -1 points to no other index it is the root node.
The tree in the example has the root node -1, which has two children 0 (index 1) and 0 (index 3). The 0 with index 1 has no children and the 0 with index 3 has 1 child: 3 (index 4) which in turn has only one child which is 4 (index 2).
The output resulting from the above input is 4. This is because the max height of the branch which included -1 (the root node), 0, 3, and 4 was of height 4 compared to the height of the other branch (-1, and 0) which was height 2.
If you need more elaborate explanation then I can give another example in the comments!
The output is the max height of the tree. The size of the input goes up to 100,000, which is where I was having trouble, as it has to finish in 3 seconds or under.
My code
Here's my latest non-recursive method which I think is the fastest I've made (still not fast enough). I used the starter from the website which I will also include beneath my code. Anyways, thanks for the help!
My code:
# python3
import sys, threading
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**27) # new thread will get stack of such size
def height(node, parent_list):
    h = 0
    while not node == -1:
        h = h + 1
        node = parent_list[node]
    return h + 1
def search_bottom_nodes(parent_list):
    bottom_nodes = []
    for index, value in enumerate(parent_list):
        children = [i for i, x in enumerate(parent_list) if x == index]
        if len(children) == 0:
            bottom_nodes.append(value)
    return bottom_nodes
class TreeHeight:
    def read(self):
        self.n = int(sys.stdin.readline())
        self.parent = list(map(int, sys.stdin.readline().split()))
    def compute_height(self):
        # Replace this code with a faster implementation
        bottom_nodes = search_bottom_nodes(self.parent)
        h = 0
        for index, value in enumerate(bottom_nodes):
            h = max(height(value, self.parent), h)
        return h
def main():
    tree = TreeHeight()
    tree.read()
    print(tree.compute_height())
threading.Thread(target=main).start()
edX starter:
# python3
import sys, threading
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**27) # new thread will get stack of such size
class TreeHeight:
    def read(self):
        self.n = int(sys.stdin.readline())
        self.parent = list(map(int, sys.stdin.readline().split()))
    def compute_height(self):
        # Replace this code with a faster implementation
        maxHeight = 0
        for vertex in range(self.n):
            height = 0
            i = vertex
            while i != -1:
                height += 1
                i = self.parent[i]
            maxHeight = max(maxHeight, height)
        return maxHeight
def main():
    tree = TreeHeight()
    tree.read()
    print(tree.compute_height())
threading.Thread(target=main).start()

Simply cache the previously computed heights of the nodes you've traversed through in a dict and reuse them when they are referenced as parents.
import sys, threading
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**27) # new thread will get stack of such size
class TreeHeight:
    def height(self, node):
        if node == -1:
            return 0
        if self.parent[node] in self.heights:
            self.heights[node] = self.heights[self.parent[node]] + 1
        else:
            self.heights[node] = self.height(self.parent[node]) + 1
        return self.heights[node]
    def read(self):
        self.n = int(sys.stdin.readline())
        self.parent = list(map(int, sys.stdin.readline().split()))
        self.heights = {}
    def compute_height(self):
        maxHeight = 0
        for vertex in range(self.n):
            maxHeight = max(maxHeight, self.height(vertex))
        return maxHeight
def main():
    tree = TreeHeight()
    tree.read()
    print(tree.compute_height())
threading.Thread(target=main).start()
Given the same input from your question, this (and your original code) outputs:
4
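A different technique that targets the slow "find the children" step described in the question is to build every child list in a single pass and then count levels from the root downwards. This is only a sketch under the question's input format; the name height_by_levels and the children mapping are assumptions for illustration, not part of the course starter code.

```python
from collections import defaultdict

def height_by_levels(parents):
    """Count the levels of the tree by expanding one whole level at a time."""
    children = defaultdict(list)
    for node, parent in enumerate(parents):
        children[parent].append(node)  # one pass over the input, O(n)
    level = children[-1]  # nodes whose parent is -1, i.e. the root
    height = 0
    while level:
        height += 1
        # Replace the current level with all of its children.
        level = [child for node in level for child in children[node]]
    return height

print(height_by_levels([-1, 0, 4, 0, 3]))  # prints 4
```

Every node appears in exactly one level, so the total work is proportional to the number of nodes, and no recursion limit or enlarged thread stack is needed.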

Sorting a List with Relative Positional Data

This is more of a conceptual programming question, so bear with me:
Say you have a list of scenes in a movie, and each scene may or may not make reference to past/future scenes in the same movie. I'm trying to find the most efficient algorithm of sorting these scenes. There may not be enough information for the scenes to be completely sorted, of course.
Here's some sample code in Python (pretty much pseudocode) to clarify:
class Reference:
    def __init__(self, scene_id, relation):
        self.scene_id = scene_id
        self.relation = relation
class Scene:
    def __init__(self, scene_id, references):
        self.id = scene_id
        self.references = references
    def __repr__(self):
        return self.id
def relative_sort(scenes):
    return scenes # Algorithm in question
def main():
    s1 = Scene('s1', [
        Reference('s3', 'after')
    ])
    s2 = Scene('s2', [
        Reference('s1', 'before'),
        Reference('s4', 'after')
    ])
    s3 = Scene('s3', [
        Reference('s4', 'after')
    ])
    s4 = Scene('s4', [
        Reference('s2', 'before')
    ])
    print(relative_sort([s1, s2, s3, s4]))
if __name__ == '__main__':
    main()
The goal is to have relative_sort return [s4, s3, s2, s1] in this case.
If it's helpful, I can share my initial attempt at the algorithm; I'm a little embarrassed at how brute-force it is. Also, if you're wondering, I'm trying to decode the plot of the film "Mulholland Drive".
FYI: The Python tag is only here because my pseudocode was written in Python.

The algorithm you are looking for is a topological sort:
In the field of computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks.
You can compute this pretty easily using a graph library, for instance, networkx, which implements topological_sort. First we import the library and list all of the relationships between scenes -- that is, all of the directed edges in the graph:
>>> import networkx as nx
>>> relations = [
...     (3, 1), # 1 after 3
...     (2, 1), # 2 before 1
...     (4, 2), # 2 after 4
...     (4, 3), # 3 after 4
...     (4, 2) # 4 before 2
... ]
We then create a directed graph:
>>> g = nx.DiGraph(relations)
Then we run a topological sort (in networkx 2.0 and later, topological_sort returns a generator, so wrap it in list). Scenes 2 and 3 are not ordered relative to each other, so [4, 2, 3, 1] would be an equally valid result:
>>> list(nx.topological_sort(g))
[4, 3, 2, 1]
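If installing a third-party library is not an option, the standard library's graphlib module (Python 3.9 and later) offers the same operation. The sketch below is an alternative to the networkx session above, not part of the original answer; the predecessors dictionary is an assumed encoding of the same four scenes, mapping each scene to the scenes that must come before it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a scene; its value is the set of scenes that must come before it.
predecessors = {
    1: {3, 2},   # 1 comes after 3 and after 2
    2: {4},      # 2 comes after 4
    3: {4},      # 3 comes after 4
    4: set(),    # nothing is known to precede 4
}

order = list(TopologicalSorter(predecessors).static_order())
print(order)  # one valid ordering; 2 and 3 may appear in either order
```

TopologicalSorter raises graphlib.CycleError when the constraints contradict each other, which is a convenient way to notice inconsistent scene references.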

I have included your modified code in my answer, which solves the current (small) problem, but without a larger sample problem, I'm not sure how well it would scale. If you provide the actual problem you're trying to solve, I'd love to test and refine this code until it works on that problem, but without test data I won't optimize this solution any further.
For starters, we track references as sets, not lists.
    Duplicates don't really help us (if "s1" before "s2", and "s1" before "s2", we've gained no information)
    This also lets us add inverse references with abandon (if "s1" comes before "s2", then "s2" comes after "s1").
We compute a min and max position:
    Min position based on how many scenes we come after
        This could be extended easily: If we come after two scenes with a min_pos of 2, our min_pos is 4 (If one is 2, other must be 3)
    Max position based on how many things we come before
        This could be extended similarly: If we come before two scenes with a max_pos of 4, our max_pos is 2 (If one is 4, other must be 3)
    If you decide to do this, just replace pass in tighten_bounds(self) with code to try to tighten the bounds for a single scene (and set anything_updated to true if it works).
The magic is in get_possible_orders
    Generates all valid orderings if you iterate over it
    If you only want one valid ordering, it doesn't take the time to create them all
Code:
class Reference:
    def __init__(self, scene_id, relation):
        self.scene_id = scene_id
        self.relation = relation
    def __repr__(self):
        return '"%s %s"' % (self.relation, self.scene_id)
    def __hash__(self):
        return hash(self.scene_id)
    def __eq__(self, other):
        return self.scene_id == other.scene_id and self.relation == other.relation
class Scene:
    def __init__(self, title, references):
        self.title = title
        self.references = references
        self.min_pos = 0
        self.max_pos = None
    def __repr__(self):
        return '%s (%s,%s)' % (self.title, self.min_pos, self.max_pos)
inverse_relation = {'before': 'after', 'after': 'before'}
def inverted_reference(scene, reference):
    return Reference(scene.title, inverse_relation[reference.relation])
def is_valid_addition(scenes_so_far, new_scene, scenes_to_go):
    previous_ids = {s.title for s in scenes_so_far}
    future_ids = {s.title for s in scenes_to_go}
    for ref in new_scene.references:
        if ref.relation == 'before' and ref.scene_id in previous_ids:
            return False
        elif ref.relation == 'after' and ref.scene_id in future_ids:
            return False
    return True
class Movie:
    def __init__(self, scene_list):
        self.num_scenes = len(scene_list)
        self.scene_dict = {scene.title: scene for scene in scene_list}
        self.set_max_positions()
        self.add_inverse_relations()
        self.bound_min_max_pos()
        self.can_tighten = True
        while self.can_tighten:
            self.tighten_bounds()
    def set_max_positions(self):
        for scene in self.scene_dict.values():
            scene.max_pos = self.num_scenes - 1
    def add_inverse_relations(self):
        for scene in self.scene_dict.values():
            for ref in scene.references:
                self.scene_dict[ref.scene_id].references.add(inverted_reference(scene, ref))
    def bound_min_max_pos(self):
        for scene in self.scene_dict.values():
            for ref in scene.references:
                if ref.relation == 'before':
                    scene.max_pos -= 1
                elif ref.relation == 'after':
                    scene.min_pos += 1
    def tighten_bounds(self):
        anything_updated = False
        for scene in self.scene_dict.values():
            pass
            # If bounds for any scene are tightened, set anything_updated back to true
        self.can_tighten = anything_updated
    def get_possible_orders(self, scenes_so_far):
        if len(scenes_so_far) == self.num_scenes:
            yield scenes_so_far
            return  # nothing more to yield on this branch
        n = len(scenes_so_far)
        scenes_left = set(self.scene_dict.values()) - set(scenes_so_far)
        valid_next_scenes = set(s
                                for s in scenes_left
                                if s.min_pos <= n <= s.max_pos)
        # valid_next_scenes = sorted(valid_next_scenes, key=lambda s: s.min_pos * self.num_scenes + s.max_pos)
        for s in valid_next_scenes:
            if is_valid_addition(scenes_so_far, s, scenes_left - {s}):
                for valid_complete_sequence in self.get_possible_orders(scenes_so_far + (s,)):
                    yield valid_complete_sequence
    def get_possible_order(self):
        return next(self.get_possible_orders(tuple()))
def relative_sort(lst):
    try:
        return [s.title for s in Movie(lst).get_possible_order()]
    except StopIteration:
        return None
def main():
    s1 = Scene('s1', {Reference('s3', 'after')})
    s2 = Scene('s2', {
        Reference('s1', 'before'),
        Reference('s4', 'after')
    })
    s3 = Scene('s3', {
        Reference('s4', 'after')
    })
    s4 = Scene('s4', {
        Reference('s2', 'before')
    })
    print(relative_sort([s1, s2, s3, s4]))
if __name__ == '__main__':
    main()

As others have pointed out, you need a topological sort. A depth-first traversal of the directed graph where the order relation forms the edges is all you need. Visit in post-order. This is the reverse of a topo sort, so to get the topo sort, just reverse the result.
I've encoded your data as a list of pairs showing what's known to go before what. This is just to keep my code short. You can just as easily traverse your list of classes to create the graph.
Note that for topo sort to be meaningful, the set being sorted must satisfy the definition of a partial order. Yours is fine. Order constraints on temporal events naturally satisfy the definition.
Note it's perfectly possible to create a graph with cycles. There's no topo sort of such a graph. This implementation doesn't detect cycles, but it would be easy to modify it to do so.
Of course you can use a library to get the topo sort, but where's the fun in that?
from collections import defaultdict
# Before -> After pairs dictating order. Repeats are okay. Cycles aren't.
# This is OP's data in a friendlier form.
OrderRelation = [('s3','s1'), ('s2','s1'), ('s4','s2'), ('s4','s3'), ('s4','s2')]
class OrderGraph:
    # nodes is an optional list of items for use when some aren't related at all
    def __init__(self, relation, nodes=[]):
        self.succ = defaultdict(set) # Successor map
        heads = set()
        for tail, head in relation:
            self.succ[tail].add(head)
            heads.add(head)
        # Sources are nodes that have no in-edges (tails - heads)
        self.sources = set(self.succ.keys()) - heads | set(nodes)
    # Recursive helper to traverse the graph and visit in post order
    def __traverse(self, start):
        if start in self.visited: return
        self.visited.add(start)
        for succ in self.succ[start]: self.__traverse(succ)
        self.sorted.append(start) # Append in post-order
    # Return a reverse post-order visit, which is a topo sort. Not thread safe.
    def topoSort(self):
        self.visited = set()
        self.sorted = []
        for source in self.sources: self.__traverse(source)
        self.sorted.reverse()
        return self.sorted
Then...
>>> print(OrderGraph(OrderRelation).topoSort())
['s4', 's2', 's3', 's1']
>>> print(OrderGraph(OrderRelation, ['s1', 'unordered']).topoSort())
['s4', 's2', 's3', 'unordered', 's1']
The second call shows that you can optionally pass values to be sorted in a separate list. You may, but don't have to, mention values already in the relation pairs. Of course those not mentioned in order pairs are free to appear anywhere in the output. Because the graph is stored in sets, the exact order among unrelated items can differ from run to run.
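The cycle detection mentioned above could look roughly like the following. This is a separate sketch, not a patch to the OrderGraph class: the function name topo_sort and the three-state marking (unvisited, on the current path, finished) are choices made for this illustration. Meeting a node that is still on the current path means the constraints loop back on themselves.

```python
from collections import defaultdict

def topo_sort(pairs):
    """Topologically sort (before, after) pairs; raise ValueError on a cycle."""
    succ = defaultdict(set)
    nodes = set()
    for before, after in pairs:
        succ[before].add(after)
        nodes.update((before, after))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = dict.fromkeys(nodes, WHITE)
    result = []

    def visit(node):
        if colour[node] == GREY:
            raise ValueError("cycle detected at %r" % (node,))
        if colour[node] == BLACK:
            return
        colour[node] = GREY
        for nxt in succ[node]:
            visit(nxt)
        colour[node] = BLACK
        result.append(node)  # post-order position

    for node in sorted(nodes):  # sorted only to make the start order repeatable
        visit(node)
    result.reverse()  # reverse post-order is a topological order
    return result

print(topo_sort([('s3', 's1'), ('s2', 's1'), ('s4', 's2'), ('s4', 's3')]))
```

Like the class above it is recursive, so very long chains of constraints would need an explicit stack instead.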

Finding depth of a Python tree from a text file

I'm relatively new to python, and I was trying out some questions when I encountered this problem. A tree is defined in a text file in the following manner,
d:
e:
b: d e
c:
a: b c
So, I want to write a simple python script that finds the depth of this. I'm not able to figure out a strategy to work this out. Is there any algorithm or technique for this?

My strategy would be as follows:
Find elements with no children.
For each of these, find the parent. Determine if any elements have this parent as a child - if not, your length is two (2).
If so, find the parent of the parent. Repeat step 2, incrementing your length counter. Continue the process updating a counter with each step.
For your case:
d -> b -> a (len 3)
e -> b -> a (len 3)
c -> a (len 2)
This could be described as a 'bottom up' tree construction method/algorithm.
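The steps above can be sketched as follows, assuming the file content is available as a string in the format from the question. The names parent_of, children_of and chain_length are made up for this illustration. Each leaf is followed up through its parents to the root while counting the nodes on the way.

```python
text = """d:
e:
b: d e
c:
a: b c"""

parent_of = {}    # child name -> parent name
children_of = {}  # node name -> list of child names
for line in text.splitlines():
    node, _, rest = line.partition(":")
    node = node.strip()
    kids = rest.split()
    children_of[node] = kids
    for kid in kids:
        parent_of[kid] = node

def chain_length(node):
    """Count the nodes on the path from `node` up to the root."""
    length = 1
    while node in parent_of:
        node = parent_of[node]
        length += 1
    return length

leaves = [name for name, kids in children_of.items() if not kids]
lengths = {leaf: chain_length(leaf) for leaf in leaves}
print(lengths)                # {'d': 3, 'e': 3, 'c': 2}
print(max(lengths.values()))  # 3
```

On a deep tree with many leaves this repeats work for shared ancestors, so it suits small files better than very large ones.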

The tree format you've given has a nice property: if x is the child of y, then x is given before y in the file. So you can simply loop through the file once and read the depth into a dictionary. For example:
depth = {}
for line in f:
    parent, children = read_node(line)
    if children:
        depth[parent] = max(depth.get(child,1) for child in children) + 1
Then just print depth['a'], as a is the root. Here read_node is a quick function to parse the parent and children from a line of the file:
def read_node(line):
    parent, children = line.split(":")
    return parent, children.split()

I'm not sure what you mean by depth. If it's how many steps you have to go to visit every node, you could use a depth-first search to see how long it takes to visit every node in the graph.
Here's a simple implementation:
text_tree = """d:
e:
b: d e
c:
a: b c"""
tree = {}
for line in text_tree.splitlines():
    node, childs = line.split(":")
    tree[node] = set(childs.split())
def dfs(graph, start):
    visited, stack = [], [start]
    while stack:
        vertex = stack.pop()
        if vertex not in visited:
            visited.append(vertex)
            stack.extend(graph[vertex])
    return visited
result = dfs(tree,"a")
print("It took %d steps, to visit every node in tree, the path took was %s"%(len(result),result))
Which outputs:
It took 5 steps, to visit every node in tree, the path took was ['a', 'b', 'd', 'e', 'c']
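If "depth" means the number of nodes on the longest path from the root to a leaf, the same stack-based traversal can carry each node's depth along with it. The following is a small sketch under that assumption; max_depth is a name chosen here, and the tree dictionary is written out by hand in the same shape the parsing loop above produces.

```python
def max_depth(graph, root):
    """Return the number of nodes on the longest root-to-leaf path."""
    deepest = 0
    stack = [(root, 1)]  # each entry pairs a node with its depth
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        for child in graph.get(node, ()):
            stack.append((child, depth + 1))
    return deepest

# Same shape as the `tree` dictionary built from the text file.
tree = {"d": set(), "e": set(), "b": {"d", "e"}, "c": set(), "a": {"b", "c"}}
print(max_depth(tree, "a"))  # prints 3
```

This assumes the input really is a tree; with a cycle in the data the loop would not terminate, so a visited set would be needed in that case.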
