I am trying to prune decision trees using the procedure below.
After I manipulate the current tree, I get different values from validate() before and after the manipulation (see the commented-out print(self.validate(validation_data)) calls inside prune_helper), which is great: it means that something does happen to the tree for a given node.
def prune_helper(self, curr_node):
    all_err = []
    # The current tree.
    tree_1 = self
    all_err.append((tree_1._generalization_error(), tree_1))
    # Replace node with leaf 1.
    tree_2 = self._replace_leaf(curr_node, DEMOCRAT)
    all_err.append((tree_2._generalization_error(), tree_2))
    # Replace node with leaf 0.
    tree_3 = self._replace_leaf(curr_node, REPUBLICAN)
    all_err.append((tree_3._generalization_error(), tree_3))
    # Replace node with left subtree.
    test_4 = self._replace_subtree(curr_node, LEFT)
    all_err.append((test_4._generalization_error(), test_4))
    # Replace node with middle subtree.
    test_5 = self._replace_subtree(curr_node, MIDDLE)
    all_err.append((test_5._generalization_error(), test_5))
    # Replace node with right subtree.
    test_6 = self._replace_subtree(curr_node, RIGHT)
    all_err.append((test_6._generalization_error(), test_6))
    all_err.sort(key=lambda tup: tup[0])
    min_tree = all_err[0][1]
    # print(self.validate(validation_data)) <-- This
    self = copy.deepcopy(min_tree)
    # print(self.validate(validation_data)) <-- Mostly different than this
    curr_node.pruned = True
def prune(self, curr_node=None):
    if curr_node is None:
        curr_node = self._root
    # Node is leaf.
    if curr_node.leaf:
        self.prune_helper(curr_node=curr_node)
        return
    # Node is not a leaf, we may assume that it has all three children.
    if curr_node.left.pruned is False:
        self.prune(curr_node=curr_node.left)
    if curr_node.middle.pruned is False:
        self.prune(curr_node=curr_node.middle)
    if curr_node.right.pruned is False:
        self.prune(curr_node=curr_node.right)
    # We'll prune the current node, only if we checked all of its children.
    self.prune_helper(curr_node=curr_node)
But the problem is that when I want to calculate some value for the tree after pruning it "completely", validate() returns the same value as before pruning. This suggests that the tree was not changed after all, and that the effect was only visible inside prune_helper. This is how I call it from prune_tree:
def prune_tree(tree):
    # print(tree.validate(validation_data)) <- This
    tree.prune()
    # print(tree.validate(validation_data)) <- Same as this
    tree.print_tree('after')
I think the problem may be the way I try to change the self object. Is there anything obviously wrong in my implementation that could lead to these results?
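A likely culprit is the line self = copy.deepcopy(min_tree): assigning to self inside a method only rebinds the local name, so the object the caller holds is left untouched. Here is a minimal sketch of the difference, using a hypothetical Tree class with a single label attribute (not the class from the question):

```python
import copy


class Tree:
    """Hypothetical stand-in for the decision tree class in the question."""

    def __init__(self, label):
        self.label = label

    def rebind(self, other):
        # Only rebinds the local name `self`; the caller's object is unchanged.
        self = copy.deepcopy(other)

    def update_in_place(self, other):
        # Copies the other tree's state into this very object.
        self.__dict__.update(copy.deepcopy(other.__dict__))


tree = Tree("original")

tree.rebind(Tree("pruned"))
after_rebind = tree.label   # the caller still sees the old state

tree.update_in_place(Tree("pruned"))
after_update = tree.label   # the caller now sees the new state
```

If the real tree is updated in place like this, keep in mind that curr_node would still point at a node of the old structure, so the pruned flag and the ongoing recursion in prune may need to be re-anchored on the new nodes.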
I am trying to adapt this code here: https://github.com/nachonavarro/gabes/blob/master/gabes/circuit.py (line 136)
but am coming across an issue because several times the attribute .chosen_label is used but I can find no mention of it anywhere in the code. The objects left_gate, right_gate and gate are Gate objects (https://github.com/nachonavarro/gabes/blob/master/gabes/gate.py)
def reconstruct(self, labels):
    levels = [[node for node in children]
              for children in anytree.LevelOrderGroupIter(self.tree)][::-1]
    for level in levels:
        for node in level:
            gate = node.name
            if node.is_leaf:
                garblers_label = labels.pop(0)
                evaluators_label = labels.pop(0)
            else:
                left_gate = node.children[0].name
                right_gate = node.children[1].name
                garblers_label = left_gate.chosen_label
                evaluators_label = right_gate.chosen_label
            output_label = gate.ungarble(garblers_label, evaluators_label)
            gate.chosen_label = output_label
    return self.tree.name.chosen_label
The code runs without error and the .chosen_label is a Label object (https://github.com/nachonavarro/gabes/blob/master/gabes/label.py)
Any help would be much appreciated
The attribute is set in the same method:
for level in levels:
    for node in level:
        gate = node.name
        if node.is_leaf:
            # set `garblers_label` and `evaluators_label` from
            # the next two elements of the `labels` argument
        else:
            # use the child nodes of this node to use their gates, and
            # set `garblers_label` and `evaluators_label` to the left and
            # right `chosen_label` values, respectively.
        # generate the `Label()` instance based on `garblers_label` and `evaluators_label`
        output_label = gate.ungarble(garblers_label, evaluators_label)
        gate.chosen_label = output_label
I'm not familiar with the anytree library, so I had to look up the documentation: anytree.LevelOrderGroupIter(...) iterates over the nodes in a tree from root to leaves, grouped by level. The tree here appears to be a full binary tree (each node has either 0 or 2 child nodes), so you get a list with [(rootnode,), (level1_left, level1_right), (level2_left_left, level2_left_right, level2_right_left, level2_right_right), ...]. The function loops over these levels in reverse order. This means that leaves are processed first.
Once all node.is_leaf nodes have their chosen_label set, the other non-leaf nodes can reference the chosen_label value on the leaf nodes on the level already processed before them.
So, assuming that labels is a list with at least twice the number of leaf nodes in the tree, you end up with those label values aggregated at every level via the gate.ungarble() function, and the final value is found at the root node via self.tree.name.chosen_label.
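To see the attribute appear without installing anything, here is a small sketch of the same bottom-up pattern in plain Python. Gate, Node and levels_from_root are hypothetical stand-ins (they are not the gabes or anytree classes), and "ungarbling" is replaced by simple addition:

```python
class Gate:
    """Hypothetical gate: 'ungarbling' here just adds the two labels."""

    def ungarble(self, left, right):
        return left + right


class Node:
    """Hypothetical tree node; `name` holds the gate, as in the question."""

    def __init__(self, gate, children=()):
        self.name = gate
        self.children = list(children)

    @property
    def is_leaf(self):
        return not self.children


def levels_from_root(root):
    # Group nodes by depth, root first (the role LevelOrderGroupIter plays).
    levels, current = [], [root]
    while current:
        levels.append(current)
        current = [child for node in current for child in node.children]
    return levels


def reconstruct(root, labels):
    labels = list(labels)
    # Deepest level first, so children are always done before their parent.
    for level in reversed(levels_from_root(root)):
        for node in level:
            gate = node.name
            if node.is_leaf:
                left, right = labels.pop(0), labels.pop(0)
            else:
                left = node.children[0].name.chosen_label
                right = node.children[1].name.chosen_label
            # The attribute is created here, by plain assignment.
            gate.chosen_label = gate.ungarble(left, right)
    return root.name.chosen_label


root = Node(Gate(), [Node(Gate()), Node(Gate())])
had_attribute_before = hasattr(root.name, "chosen_label")
result = reconstruct(root, [1, 2, 3, 4])
```

Python objects accept new attributes at any time unless the class forbids it, which is why chosen_label never needs to be declared in the Gate class.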
I am fairly new to data structures, and recursion. So, I decided to try and implement a method that would print all the values of all the nodes of the right subtree (in ascending order, i.e: 50, 65, 72, 91, 99) in this given tree just as a learning experience.
Here is the tree that I am working with, visually.
And I have quite a few problems understanding how to recurse through the right subtree.
That's what I have tried doing so far (the actual method is at the bottom):
class BinarySearchTree:
    _root: Optional[Any]
    # The left subtree, or None if the tree is empty.
    _left: Optional[BinarySearchTree]
    # The right subtree, or None if the tree is empty.
    _right: Optional[BinarySearchTree]

    def __init__(self, root: Optional[Any]) -> None:
        """Initialize a new BST containing only the given root value.
        """
        if root is None:
            self._root = None
            self._left = None
            self._right = None
        else:
            self._root = root
            self._left = BinarySearchTree(None)
            self._right = BinarySearchTree(None)

    def is_empty(self) -> bool:
        """Return True if this BST is empty.
        >>> bst = BinarySearchTree(None)
        >>> bst.is_empty()
        True
        """
        return self._root is None

    # That is what I have tried doing so far.
    def print_right_subtree(self) -> None:
        """Print the right subtree in order
        >>> bst = BinarySearchTree(41)
        >>> left = BinarySearchTree(20)
        >>> left._left = BinarySearchTree(11)
        >>> left._right = BinarySearchTree(29)
        >>> left._right._right = BinarySearchTree(32)
        >>> right = BinarySearchTree(65)
        >>> right._left = BinarySearchTree(50)
        >>> right._right = BinarySearchTree(91)
        >>> right._right._left = BinarySearchTree(72)
        >>> right._right._right = BinarySearchTree(99)
        >>> bst._left = left
        >>> bst._right = right
        >>> bst.print_right_subtree()
        50
        65
        72
        91
        99
        """
        if self.is_empty():
            pass
        else:
            # I am not really sure what to do here...
            # I have tried setting self._left = None, but that just made things even more complicated!
            print(self._root)
            self._right.print_right_subtree()
Any help would be extremely appreciated! Also, if any of you have a tutorial that I can follow, that would be really great for a newbie like myself :).
If you want to print the nodes in the right subtree, you just have to call print_tree on the attribute of your tree that holds its right subtree.
First, you define a print_tree method:
def print_tree(self) -> None:
    if self.is_empty():
        pass
    else:
        # Visit the left subtree, then this node, then the right subtree,
        # so that the values of a BST come out in ascending order.
        self._left.print_tree()
        print(self._root)
        self._right.print_tree()
And then the print_right_subtree method:
def print_right_subtree(self) -> None:
    self._right.print_tree()  # calls print_tree on the _right attribute
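Putting this together, a self-contained sketch might look like the following. It assumes the BinarySearchTree layout from the question and, as a variation, collects the values in a list through a hypothetical in_order helper, so the order can be checked before printing:

```python
from __future__ import annotations

from typing import Any, List, Optional


class BinarySearchTree:
    def __init__(self, root: Optional[Any]) -> None:
        self._root = root
        # A non-empty tree always has two (possibly empty) subtrees.
        self._left = None if root is None else BinarySearchTree(None)
        self._right = None if root is None else BinarySearchTree(None)

    def is_empty(self) -> bool:
        return self._root is None

    def in_order(self) -> List[Any]:
        """Return all values in ascending order: left, root, right."""
        if self.is_empty():
            return []
        return self._left.in_order() + [self._root] + self._right.in_order()

    def print_right_subtree(self) -> None:
        """Print every value of the right subtree in ascending order."""
        if self.is_empty():
            return
        for value in self._right.in_order():
            print(value)


# Build the tree from the question's docstring.
bst = BinarySearchTree(41)
left = BinarySearchTree(20)
left._left = BinarySearchTree(11)
left._right = BinarySearchTree(29)
left._right._right = BinarySearchTree(32)
right = BinarySearchTree(65)
right._left = BinarySearchTree(50)
right._right = BinarySearchTree(91)
right._right._left = BinarySearchTree(72)
right._right._right = BinarySearchTree(99)
bst._left = left
bst._right = right

bst.print_right_subtree()
```

The key point is that the recursion happens on the subtree, which is a BinarySearchTree in its own right; print_right_subtree itself does not need to be recursive.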
Since you're not asking for code itself, and rather for help to write your own code...
There are a million ways to do it. Some are more optimized. Some are faster to write. It all depends on what you need.
Here I think you need to understand what a tree is. Any of your subtrees is a tree by itself. So "print the right tree" can mean many different things. Only the right branch of each tree, for instance? Or the whole tree under the first right branch? Maybe the second branch?
If I got it right (Da bum tss!), you want to print the right branch of the tree that's called root on your schema. Why not say "I want to print all the numbers above 41" instead? That way, even if you want to start printing from a sub-tree, it would be really easy to do so!
You need to visualize what your algorithm would do. Here, you want to print all the numbers above 41 (the right branch of the main tree). Let me write you pseudocode for this (suppose you're already on the root of that branch, which has a value of 65):
I want to write all the numbers in ascending order...
My root is 65. My left is 50, my right is 91.
Which is the lowest? 50. Does it have any other branches? No. Print it!
My root is still 65, my right is 91. Is my root lower than my right branch? Print it! And go to the right branch.
My root is now 91, my left is 72, my right is 99.
Which is the lowest? You've got your recursion.
And even after all this, you could still choose another solution that is faster to write (not to compute!): gather all the values from the branch you need, and print them sorted!
I use sklearn.tree.DecisionTreeClassifier to build a decision tree. With the optimal parameter settings, I get a tree that has unnecessary leaves (see example picture below - I do not need probabilities, so the leaf nodes marked in red are an unnecessary split)
Is there any third-party library for pruning these unnecessary nodes? Or a code snippet? I could write one, but I can't really imagine that I am the first person with this problem...
Code to replicate:
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
mdl = DecisionTreeClassifier(max_leaf_nodes=8)
mdl.fit(X,y)
PS: I have tried multiple keyword searches and am kind of surprised to find nothing - is there really no post-pruning in general in sklearn?
PPS: In response to the possible duplicate: While the suggested question might help me when coding the pruning algorithm myself, it answers a different question - I want to get rid of leaves that do not change the final decision, while the other question wants a minimum threshold for splitting nodes.
PPPS: The tree shown is an example to show my problem. I am aware of the fact that the parameter settings to create the tree are suboptimal. I am not asking about optimizing this specific tree, I need to do post-pruning to get rid of leaves that might be helpful if one needs class probabilities, but are not helpful if one is only interested in the most likely class.
Using ncfirth's link, I was able to modify the code there so that it fits to my problem:
from sklearn.tree._tree import TREE_LEAF
def is_leaf(inner_tree, index):
    # Check whether node is leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])
    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        ##print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove leaves if both
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist() # Decision for each node
    prune_index(mdl.tree_, decisions)
Using this on a DecisionTreeClassifier clf:
prune_duplicate_leaves(clf)
Edit: Fixed a bug for more complex trees
DecisionTreeClassifier(max_leaf_nodes=8) allows at most 8 leaves, so unless the tree builder has another reason to stop, it will hit that maximum.
In the example shown, 5 of the 8 leaves have very few samples (<=3) compared to the other 3 leaves (>50), a possible sign of overfitting.
Instead of pruning the tree after training, one can specify either min_samples_leaf or min_samples_split to better guide the training, which will likely get rid of the problematic leaves. For instance, use the value 0.05 to require at least 5% of the samples.
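As a rough sketch of that suggestion on the iris data from the question (random_state=0 is only my addition to make the run repeatable, and whether the resulting tree suits your use case is something to check yourself):

```python
from math import ceil

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# A float min_samples_leaf is read as a fraction of the training samples.
mdl = DecisionTreeClassifier(max_leaf_nodes=8, min_samples_leaf=0.05,
                             random_state=0)
mdl.fit(X, y)

tree = mdl.tree_
is_leaf = tree.children_left == -1          # -1 marks "no child" in tree_
leaf_sizes = tree.n_node_samples[is_leaf]   # training samples per leaf
min_allowed = ceil(0.05 * len(X))           # 8 samples for the 150 iris rows

print(sorted(leaf_sizes.tolist()))
```

Note that this changes how the tree is grown rather than removing leaves afterwards, so two sibling leaves can still predict the same class.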
I had a problem with the code posted here so I revised it and had to add a small section (it deals with the case that both sides are the same but there is still a comparison present):
from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED
def is_leaf(inner_tree, index):
    # Check whether node is leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])
    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        inner_tree.feature[index] = TREE_UNDEFINED
        ##print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove leaves if both
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist() # Decision for each node
    prune_index(mdl.tree_, decisions)
I'm trying to find the longest path in a directed acyclic graph. At the moment, my code seems to have a time complexity of O(n^3).
The graph is of input {0: [1,2], 1: [2,3], 3: [4,5] }
# Input: dictionary: graph, int: start, list: path
# Output: list: the longest path in the graph (recursive)
# This is a modification of a depth first search
def find_longest_path(graph, start, path=[]):
    path = path + [start]
    paths = path
    for node in graph[start]:
        if node not in path:
            newpaths = find_longest_path(graph, node, path)
            # Only take the new path if its length is greater than the current path
            if(len(newpaths) > len(paths)):
                paths = newpaths
    return paths
It returns a list of nodes in the form e.g. [0,1,3,5]
How can I make this more efficient than O(n^3)? Is recursion the right way to solve this or should I be using a different loop?
You can solve this problem in O(n+e) (i.e. linear in the number of nodes + edges).
The idea is that you first create a topological sort (I'm a fan of Tarjan's algorithm) and the set of reverse edges. It always helps if you can decompose your problem to leverage existing solutions.
You then walk the topological sort backwards pushing to each parent node its child's distance + 1 (keeping maximums in case there are multiple paths). Keep track of the node with the largest distance seen so far.
When you have finished annotating all of the nodes with distances you can just start at the node with the largest distance which will be your longest path root, and then walk down your graph choosing the children that are exactly one count less than the current node (since they lie on the critical path).
In general, when trying to find an optimal complexity algorithm don't be afraid to run multiple stages one after the other. Five O(n) algorithms run sequentially are still O(n), which is still better than O(n^2) from a complexity perspective (although it may be worse in real running time depending on the constant costs/factors and the size of n).
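For the whole-graph case, the staged approach can be sketched roughly as follows. This is my own minimal version, not a reference implementation: it uses a post-order depth first search as the topological sort and remembers the best child of each node instead of comparing distances afterwards. The name longest_path_in_dag and the {node: [children]} input format are assumptions based on the question:

```python
def longest_path_in_dag(graph):
    """Return one longest path in a DAG given as {node: [children]}."""
    # Stage 1: post-order DFS. Each node is appended after all of its
    # descendants, so `order` is a reversed topological sort.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in graph.get(node, []):
            visit(child)
        order.append(node)

    for node in graph:
        visit(node)
    if not order:
        return []

    # Stage 2: dist[n] is the number of nodes on the longest path that
    # starts at n; children are always processed before their parents.
    dist, best_child = {}, {}
    for node in order:
        dist[node], best_child[node] = 1, None
        for child in graph.get(node, []):
            if dist[child] + 1 > dist[node]:
                dist[node], best_child[node] = dist[child] + 1, child

    # Stage 3: start at the node with the largest distance and walk down.
    node = max(order, key=dist.get)
    path = []
    while node is not None:
        path.append(node)
        node = best_child[node]
    return path


example = {0: [1, 2], 1: [2, 3], 3: [4, 5]}
longest = longest_path_in_dag(example)
```

Each stage touches every node and edge a constant number of times, so the total stays linear in nodes plus edges. For very deep graphs the recursive visit could hit Python's recursion limit, in which case an explicit stack would be needed.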
ETA: I just noticed you have a start node. This makes it simply a case of doing a depth first search and keeping the longest solution seen so far which is just O(n+e) anyway. Recursion is fine or you can keep a list/stack of visited nodes (you have to be careful when finding the next child each time you backtrack).
As you backtrack from your depth first search you need to store the longest path from that node to a leaf so that you don't re-process any sub-trees. This will also serve as a visited flag (i.e. in addition to doing the node not in path test also have a node not in subpath_cache test before recursing). Instead of storing the subpath you could store the length and then rebuild the path once you're finished based on sequential values as discussed above (critical path).
ETA2: Here's a solution.
def find_longest_path_rec(graph, parent, cache):
    maxlen = 0
    for node in graph[parent]:
        if node in cache:
            pass
        elif node not in graph:
            cache[node] = 1
        else:
            cache[node] = find_longest_path_rec(graph, node, cache)
        maxlen = max(maxlen, cache[node])
    return maxlen + 1

def find_longest_path(graph, start):
    cache = {}
    maxlen = find_longest_path_rec(graph, start, cache)
    path = [start]
    for i in range(maxlen-1, 0, -1):
        for node in graph[path[-1]]:
            if cache[node] == i:
                path.append(node)
                break
        else:
            assert(0)
    return path
Note that I've removed the node not in path test because I'm assuming that you're actually supplying a DAG as claimed. If you want that check you should really be raising an error rather than ignoring it. Also note that I've added the assertion to the else clause of the for to document that we must always find a valid next (sequential) node in the path.
ETA3: The final for loop is a little confusing. What we're doing is considering that in the critical path all of the node distances must be sequential. Consider node 0 is distance 4, node 1 is distance 3 and node 2 is distance 1. If our path started [0, 2, ...] we have a contradiction because node 0 is not 1 further from a leaf than 2.
There are a couple of non-algorithmic improvements I'd suggest (these are related to Python code quality):
def find_longest_path_from(graph, start, path=None):
    """
    Returns the longest path in the graph from a given start node
    """
    if path is None:
        path = []
    path = path + [start]
    max_path = path
    nodes = graph.get(start, [])
    for node in nodes:
        if node not in path:
            candidate_path = find_longest_path_from(graph, node, path)
            if len(candidate_path) > len(max_path):
                max_path = candidate_path
    return max_path

def find_longest_path(graph):
    """
    Returns the longest path in a graph
    """
    max_path = []
    for node in graph:
        candidate_path = find_longest_path_from(graph, node)
        if len(candidate_path) > len(max_path):
            max_path = candidate_path
    return max_path
Changes explained:
def find_longest_path_from(graph, start, path=None):
    if path is None:
        path = []
I've renamed find_longest_path as find_longest_path_from to better explain what it does.
Changed the path argument to have a default argument value of None instead of []. Unless you know you will specifically benefit from them, you want to avoid using mutable objects as default arguments in Python. This means you should typically set path to None by default and then when the function is invoked, check whether path is None and create an empty list accordingly.
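To illustrate why mutable defaults are risky, here is a small standalone sketch (append_bad and append_good are made-up names, unrelated to the path code). The default list is created once, when the function is defined, and is then shared by every call that relies on it:

```python
def append_bad(item, bucket=[]):
    # The same default list object is reused on every call.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # A fresh list is created for each call that omits `bucket`.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_bad = append_bad(1)
second_bad = append_bad(2)    # same object as first_bad, now holding both items

first_good = append_good(1)
second_good = append_good(2)  # independent lists
```

Your original function happens to be safe because path = path + [start] builds a new list instead of mutating the default, but that is easy to break by accident, for example by switching to path.append(start).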
max_path = path
...
candidate_path = find_longest_path_from(graph, node, path)
...
I've updated the names of your variables from paths to max_path and newpaths to candidate_path. These were confusing variable names because they referred to the plural of path -- implying that the value they stored consisted of multiple paths -- when in fact they each just held a single path. I tried to give them more descriptive names.
nodes = graph.get(start, [])
for node in nodes:
Your code errors out on your example input because the leaf nodes of the graph are not keys in the dict so graph[start] would raise a KeyError when start is 2, for instance. This handles the case where start is not a key in graph by returning an empty list.
def find_longest_path(graph):
    """
    Returns the longest path in a graph
    """
    max_path = []
    for node in graph:
        candidate_path = find_longest_path_from(graph, node)
        if len(candidate_path) > len(max_path):
            max_path = candidate_path
    return max_path
A method to find the longest path in a graph that iterates over the keys. This is entirely separate from your algorithmic analysis of find_longest_path_from but I wanted to include it.
Inspired by the answer to the question "Enumerating all paths in a tree", I wrote an adapted version (I don't need the unrooted path):
def paths(tree):
    #Helper function
    #receives a tree and
    #returns all paths that have this node as root
    if not tree:
        return []
    else: #tree is a node
        root = tree.ID
        rooted_paths = [[root]]
        for subtree in tree.nextDest:
            useable = paths(subtree)
            for path in useable:
                rooted_paths.append([root]+path)
        return rooted_paths
Now, in my "tree" object, I have a number associated with each node: tree.number. It looks like this:
      A,2
     /   \
   B,5   C,4
    |    /  \
   D,1 E,4  F,3
I would like to initialize my paths with a 0 value, and sum all tree.numbers of the path, in order to know the sum for each generated path:
A-B-D: 8
A-C-E: 10
A-C-F: 9
How should I modify my code to get this result? I'm not seeing how to do it.
In order to achieve what you want, pass one more argument to the recursion - that would be the sum you have got so far. Also return one more value for each path - its sum:
def paths(tree, sum_so_far):
    #Helper function
    #receives a tree and the sum accumulated above it, and
    #returns all paths that have this node as root, each with its sum
    if not tree:
        return []
    else: #tree is a node
        root = tree.ID
        val = tree.number
        # The path consisting of this node alone already carries the running sum.
        rooted_paths = [[[root], sum_so_far + val]]
        for subtree in tree.nextDest:
            useable = paths(subtree, sum_so_far + val)
            for path in useable:
                rooted_paths.append([[root]+path[0], path[1]])
        return rooted_paths
Something like this should work when called as paths(tree, 0). Note that you are now returning a list of pairs: a path and an integer value. The integer value is the sum along that path.
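Here is a runnable sketch of the same idea on the example tree. The Node class is a hypothetical stand-in for your tree object (only ID, number and nextDest are taken from the question), and I use tuples and the name paths_with_sums just for readability:

```python
class Node:
    """Hypothetical stand-in for the tree object in the question."""

    def __init__(self, ID, number, nextDest=()):
        self.ID = ID
        self.number = number
        self.nextDest = list(nextDest)


def paths_with_sums(tree, sum_so_far=0):
    """Return (path, sum) pairs for every path starting at this node."""
    if not tree:
        return []
    total = sum_so_far + tree.number
    rooted = [([tree.ID], total)]
    for subtree in tree.nextDest:
        # The child receives the running total, so its sums are already
        # complete; only the path needs this node prepended.
        for path, path_sum in paths_with_sums(subtree, total):
            rooted.append(([tree.ID] + path, path_sum))
    return rooted


tree = Node("A", 2, [
    Node("B", 5, [Node("D", 1)]),
    Node("C", 4, [Node("E", 4), Node("F", 3)]),
])

sums = {"-".join(path): total for path, total in paths_with_sums(tree)}
for name in ("A-B-D", "A-C-E", "A-C-F"):
    print(name, sums[name])
```

The result also contains the shorter paths such as A-B; filter on nodes with an empty nextDest if you only want root-to-leaf paths.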
Hope that helps.