I am trying to teach myself Python by building a decision tree, and I am having trouble with the pruning. I want to check whether replacing a random subtree with a leaf increases the tree's accuracy over my previously learned tree, using the following function:
def prune_tree(tree, nodes, validation_examples, old_acc):
    percent_to_try = 0.6
    for n in range(int(percent_to_try*len(nodes))):
        inc = random.randint(0, len(nodes)-1)
        if isinstance(nodes[inc], TreeLeaf):
            nodes.pop(inc)
            continue
        node_to_remove = copy.deepcopy(nodes[inc])
        target_class = node_to_remove.mode
        # print id(nodes[inc])
        nodes[inc] = TreeLeaf(target_class)
        # print id(nodes[inc])
        raw_input()
        new_acc = test_tree(validation_examples, tree)
        if new_acc > old_acc:
            print "The new accuracy is: " + str(new_acc) + "%"
            continue
        else:
            nodes[inc] = node_to_remove
    return True
By printing id(nodes[inc]) before and after I try to replace the instances, I can tell that in memory I am not changing the contents of the instance, but instead declaring a new one and assigning it to that position in the array.
I basically want the reference to the old node to point to this new leaf.
Ideally this should happen by recursively deleting everything under the node that we're removing, testing the accuracy, and then either continuing if the accuracy increased or returning to the previous state if it did not.
How should I go about this?
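A minimal sketch of one way to get that effect, under stated assumptions: the `Leaf` and `Branch` classes, the `left`/`right` attribute names, and the `evaluate` callback are all hypothetical stand-ins, since the real `TreeLeaf` and node classes are not shown. The idea is that `nodes[inc] = ...` only rebinds a slot in the list, so the parent inside the tree still points at the old subtree. Rebinding the parent's own attribute changes the tree itself, and keeping a reference to the old subtree makes rollback possible without `copy.deepcopy`.

```python
class Leaf:
    def __init__(self, label):
        self.label = label


class Branch:
    def __init__(self, left, right, mode):
        self.left, self.right, self.mode = left, right, mode


def iter_slots(node):
    """Yield (parent, attribute_name) for every Branch found below node."""
    for name in ("left", "right"):
        child = getattr(node, name)
        if isinstance(child, Branch):
            yield node, name
            yield from iter_slots(child)


def try_prune(parent, name, evaluate, old_acc):
    """Replace parent.<name> with a leaf; keep the change only if accuracy improves."""
    original = getattr(parent, name)           # keep the old subtree for rollback
    setattr(parent, name, Leaf(original.mode))  # the tree itself now points at the leaf
    new_acc = evaluate()
    if new_acc > old_acc:
        return new_acc
    setattr(parent, name, original)            # restore the previous state
    return old_acc
```

Collecting the slots with `list(iter_slots(root))` before pruning avoids walking the tree while it is being modified.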
I guess you are all familiar with the binary heap data structure. If not, Brilliant.org says:
i.e. a binary tree which obeys the property that the root of any tree is greater than or equal to (or smaller than or equal to) all its children (heap property). The primary use of such a data structure is to implement a priority queue.
Well, one of the properties of a binary heap is that it must be filled from top to bottom (from the root) and from left to right.
I coded this algorithm to find the next available spot for the next number I add (I hard-coded the first nodes so I can track things further down the tree).
This search method is inspired by the BFS (breadth-first search) algorithm.
Note that in this code I only care about finding the next empty node, without keeping the heap property.
I tested the code, but I don't think I tested it enough, so if you spot problems or bugs, or have any ideas to suggest, every comment is welcome.
def insert(self, data):
    if self.root.data == None:
        self.root.data = data
        print('root', self.root.data)
    else:
        self.search()

def search(self):
    print('search..L31')
    queue = [self.root]
    while queue:
        curr = queue.pop(0)
        print(curr.data)
        if curr.right_child == None:
            print('made it')
            return
        else:
            queue.append(curr.left_child)
            queue.append(curr.right_child)
h = Min_heap(10)
h.insert(2)
h.root.left_child = Node(3)
h.root.right_child = Node(5)
h.root.left_child.left_child = Node(8)
h.root.left_child.right_child = Node(7)
h.root.right_child.left_child = Node(9)
# The tree I am building...
#        __2__
#       /     \
#      3       5
#     / \     / \
#    8   7   9   ⨂
#                ↑
#            what I am
#           looking for
h.search()
There is another way to figure this out, which is basically translating the tree into an array/list using index formulas; we then assume that the next item we want to insert goes right after the last element of that array, and work back through the same formulas. But I already know that algorithm, and I thought: why not try to solve it as a graph?
You would do better to implement a binary heap as a list (array). But if you want to do it with node objects that have left/right attributes, then the position for the next node can be derived from the size of the tree.
So if you enrich your heap class instances with a size attribute and maintain that attribute to reflect the current number of nodes in the tree, then the following method will tell you where the next insertion point is, in O(logn) time:
Take the binary representation of the current size plus 1. So if the tree currently has 4 nodes, take the binary representation of 5, i.e. 101. Then drop the leftmost (most significant) bit. The bits that then remain are an encoding of the path towards the new spot: 0 means "left", 1 means "right".
Here is an implementation of a method that will return the parent node of where the new insertion spot is, and whether it would become the "left" or the "right" child of it:
def next_spot(self):
    if not self.root:
        raise ValueError("empty tree")
    node = self.root
    path = self.size + 1
    sides = bin(path)[3:-1]  # skip "0b1" and final bit
    for side in sides:
        if side == "0":
            node = node.left
        else:
            node = node.right
    # use final bit for saying "left" or "right"
    return node, ("left", "right")[path % 2]
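To see the bit-path idea at work, here is a small self-contained sketch. The `Node` and `Heap` classes and the `append` helper are assumptions made for the demo (the question's own `Min_heap` class is not shown in full), and `append` only maintains the shape of the tree, not the heap order.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


class Heap:
    def __init__(self):
        self.root = None
        self.size = 0  # number of nodes currently in the tree

    def next_spot(self):
        if not self.root:
            raise ValueError("empty tree")
        node = self.root
        path = self.size + 1
        # Drop "0b" and the leading 1 bit; the last bit is handled separately
        for side in bin(path)[3:-1]:
            node = node.left if side == "0" else node.right
        return node, ("left", "right")[path % 2]

    def append(self, data):
        """Attach data at the next free spot (shape only, no sift-up)."""
        new = Node(data)
        if self.root is None:
            self.root = new
        else:
            parent, side = self.next_spot()
            setattr(parent, side, new)
        self.size += 1


h = Heap()
for value in [2, 3, 5, 8, 7, 9]:
    h.append(value)
parent, side = h.next_spot()
print(parent.data, side)  # 5 right
```

With six nodes in place, the binary form of 7 is 111: dropping the leading bit leaves 11, so the walk goes right once and the new node becomes the right child of 5, which is the empty spot marked in the question's diagram.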
If you want to guarantee balance, just store in each node how many items are at or below it, and maintain that count along with the heap. When placing an element, always go to the side with the fewest items.
If you just want a simple way to place an element, just place it randomly. You don't have to be perfect. On average you will still have O(log(n)) levels, just with a worse constant.
(Of course your constants are better with the array approach, but you say you know that one and are deliberately not implementing it.)
I am solving a question where we are asked to return the sum of the Depths of all the nodes in a Binary Tree.
For example:
Usually I would use a debugger to find the error in my code, but I don't know how to set up trees/binary tree in my IDE.
My code below passes most of the tests, but fails some. It fails the above test, and produces an output of 20 instead of 16.
def nodeDepths(root):
    queue = [root]
    sumOfDepths = 0
    currentDepth = 0
    while len(queue):
        for _ in range(len(queue)):
            currentNode = queue.pop()
            sumOfDepths += currentDepth
            if currentNode.left:
                queue.append(currentNode.left)
            if currentNode.right:
                queue.append(currentNode.right)
        currentDepth += 1
    return sumOfDepths
Any suggestions as to where the code fails or is doing something unexpected?
I believe the source of your error is the currentNode = queue.pop() statement. According to the docs "The argument passed to the method is optional. If not passed, the default index -1 is passed as an argument (index of the last item)." Because you are pulling the last entry from the queue, everything works fine until the queue holds entries from different depths. To fix this, use currentNode = queue.pop(0), which always pulls the oldest entry from the queue.
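A sketch of the corrected traversal, using `collections.deque` so that removing the oldest entry is cheap (a swapped-in alternative to `list.pop(0)`, not something the answer itself proposes). The `BinaryTree` class and the nine-node test tree are assumptions, since the original example tree is not reproduced here.

```python
from collections import deque


class BinaryTree:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def node_depths(root):
    queue = deque([root])
    total = 0
    depth = 0
    while queue:
        # Process exactly the nodes of the current level
        for _ in range(len(queue)):
            node = queue.popleft()  # oldest entry first, so levels never mix
            total += depth
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        depth += 1
    return total


# Assumed test tree: nodes 1 to 9 filled level by level
tree = BinaryTree(
    1,
    BinaryTree(2, BinaryTree(4, BinaryTree(8), BinaryTree(9)), BinaryTree(5)),
    BinaryTree(3, BinaryTree(6), BinaryTree(7)),
)
print(node_depths(tree))  # 16
```

On this tree the version that pops from the end of the list returns 20, because node 7 is popped while the loop is still counting depth 1 and the nodes below 2 are reached one level late.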
Not sure how this works...
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left, self.right = left, right

    def find_diameter(self, root):
        # Reset the running maximum before each computation
        self.treeDiameter = 0
        self.calculate_height(root)
        return self.treeDiameter

    def calculate_height(self, currentNode):
        if currentNode is None:
            return 0
        leftTreeDiameter = self.calculate_height(currentNode.left)
        rightTreeDiameter = self.calculate_height(currentNode.right)
        diameter = leftTreeDiameter + rightTreeDiameter + 1
        self.treeDiameter = max(self.treeDiameter, diameter)
        return max(leftTreeDiameter, rightTreeDiameter) + 1
The above code works to get the max diameter of a binary tree, but I don't understand the last line in calculate_height. Why do we need to return max(leftTreeDiameter, rightTreeDiameter) + 1?
I obviously don't understand it, but what I do know is that for each currentNode we keep going down the left side of the tree and then do the same for the right. If we end up with no node (meaning that we were at a leaf node just before), we return 0, as we don't want to add 1 for a node that does not exist.
The only place that seems to add anything besides 0 is the last line of calculate_height. Although we compute leftTreeDiameter + rightTreeDiameter + 1 to get the total diameter, this is only possible because of the return 0 and the return max(leftTreeDiameter, rightTreeDiameter) + 1, correct?
Also, I am confused as to why leftTreeDiameter can be assigned self.calculate_height(currentNode.left). What I mean is that I thought I would need something like...
def calculate_left_height(self, currentNode, height=0):
    if currentNode is None:
        return 0
    self.calculate_height(currentNode.left, height + 1)
    return height
where we just add 1 to the height each time. In this case, instead of doing something like leftTreeDiameter += self.calculate_height(currentNode.left), I just pass height + 1 as an argument each time we see a node.
But if I do this, I would need a separate method just to calculate the right height as well, and my find_diameter method would need to call itself recursively with both root.left and root.right.
Where is my logic wrong, and how is it that calculate_height actually works? I guess I am having trouble figuring out how to keep track of the stack.
The names used in this code are confusing: leftTreeDiameter and rightTreeDiameter are not diameters, but heights.
Secondly, the function calculate_height has side effects, which is not very nice. On the one hand it returns a height, and simultaneously it assigns a diameter. This is confusing. Many Python coders would prefer a function to be pure and just return something, without altering anything else. Or, alternatively, a function could only alter some state and not return it. Doing both can be confusing.
Also, it is confusing that although the class is called TreeNode, its find_diameter method still requires a node as argument. This is counter-intuitive. We would expect the method to take self as the node to act on, not the argument.
But let's just rename the variables and add some comments:
leftHeight = self.calculate_height(currentNode.left)
rightHeight = self.calculate_height(currentNode.right)
# What is the size of the longest path from leaf-to-leaf
# whose top node is the current node?
diameter = leftHeight + rightHeight + 1
# Is this path longer than the longest path that we
# had found so far? If so, take this one.
self.treeDiameter = max(self.treeDiameter, diameter)
# The height of the tree rooted at the current node
# is the height of the highest childtree (either left or right),
# with one added to account for the current node
return max(leftHeight, rightHeight) + 1
It should be clear, but do realise that self in this process is always the instance on which the find_diameter method is called, and does not really play a role as actual node, as the root is passed as argument. So the repeated assignment to self.treeDiameter is always to the same one property. This property is not created on every node... just on the node on which you invoke find_diameter.
I hope the inserted comments have clarified how this algorithm works.
NB: your own idea on creating calculate_left_height is not going to do it: it never alters the value of height that it receives as argument, and ends up returning it. So it returns the same value it already receives. That is obviously not going to do much...
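For comparison, a sketch of the side-effect-free style mentioned above: a single function that returns both the height and the diameter of a subtree, so nothing is stored on `self`. The function name `height_and_diameter` is made up for this illustration, and the diameter is counted in nodes, as in the original code.

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left, self.right = left, right


def height_and_diameter(node):
    """Return (height, diameter) of the subtree rooted at node, both in nodes."""
    if node is None:
        return 0, 0
    left_height, left_diameter = height_and_diameter(node.left)
    right_height, right_diameter = height_and_diameter(node.right)
    # Longest path whose top node is this node
    through_here = left_height + right_height + 1
    height = max(left_height, right_height) + 1
    diameter = max(left_diameter, right_diameter, through_here)
    return height, diameter


root = TreeNode(1, TreeNode(2, TreeNode(4)), TreeNode(3, TreeNode(5), TreeNode(6)))
print(height_and_diameter(root))  # (3, 5)
```

Each call hands its own height back to its caller through the return value, which is why no `height` argument needs to be threaded down through the recursion.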
I need to figure out the number of left children in a binary tree. There are many ways to do it, but I would like to know why the code below does not work.
def leftChildren(self):
    leftChildren = []
    if self != None:
        leftChildren.append(self.v)
        if self.l:
            leftChildren = leftChildren + self.l.leftChildren()
        if self.r:
            self.r.leftChildren()
    return leftChildren
What is wrong, and how can I improve it?
I am assuming that:
self.r returns the right child
self.l returns the left child
all children are either nodes or trees
self.v returns the value of the node/leaf
With those assumptions, looking at your code I can see that:
The right side of the tree isn't handled as it should be: the result of self.r.leftChildren() is discarded.
You should probably store the collected values so they are only computed once at the top (personal advice, not required).
The local variable and the function share the same name, which can be confusing.
I would advise renaming the function and the variable to avoid confusion between the left/right children and the list being built.
I don't think you need the self != None check if your binary tree is built correctly.
You really need to provide the full code, or at least a minimal runnable example.
Here is a quick draft:
def getremainingChildren(self):
    self.remainingChildren = []
    self.remainingChildren.append(self.v)
    if self.l:
        self.remainingChildren += self.l.getremainingChildren()
    if self.r:
        self.remainingChildren += self.r.getremainingChildren()
    return self.remainingChildren
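If the goal is literally the number of left children, a counting variant may be closer to what the question asks. This is only a sketch: the `Tree` class is an assumption built around the `v`, `l` and `r` attribute names from the question, and `count_left_children` is a made-up name.

```python
class Tree:
    def __init__(self, v, l=None, r=None):
        self.v, self.l, self.r = v, l, r

    def count_left_children(self):
        """Count the nodes in this subtree that are the left child of their parent."""
        count = 0
        if self.l is not None:
            # The left child itself counts, plus any left children below it
            count += 1 + self.l.count_left_children()
        if self.r is not None:
            # The right child does not count, but its subtree may contain left children
            count += self.r.count_left_children()
        return count


tree = Tree(1, Tree(2, Tree(4)), Tree(3, Tree(5), Tree(6)))
print(tree.count_left_children())  # 3 (the nodes 2, 4 and 5)
```

The important difference from the question's code is that the result of the recursive call on the right subtree is added to the total instead of being thrown away.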
I'm trying to recursively build a binary decision tree for diagnosing diseases, in Python 3.
The builder takes a list of records (each is an illness and a list of its symptoms) and a list of symptoms, shown below:
class Node:
def __init__(self, data = "", pos = None, neg = None):
self.data = data
self.positive_child = pos
self.negative_child = neg
class Record:
def __init__(self, illness, symptoms):
self.illness = illness
self.symptoms = symptoms
records= [Record('A',['1','3']),
Record('B',['1','2']),
Record('C',['2','3']),
]
symptoms = ['1','2','3']
It then builds a binary tree in which each level checks whether a symptom is true or false, with a child node for each outcome. The right child always means the symptom is not present, and the left one that it is present. For the example data the tree should look like this:
                                1
                2                               2
        3               3               3               3
  None      B       A     None      C     None    None   Healthy
For example, the leaf A is reached by asking:
1 : True
2 : False
3 : True
and its path is [1,3] (the trues).
Here is the code I'm using, but it isn't working:
def builder(records, symptoms, path):
    # Check if we are in a leaf node that matches an illness
    for record in records:
        if path == record.symptoms:
            return Node(record.illness,None,None)
    # No more symptoms means an empty leaf node
    if len(symptoms) == 0:
        return Node(None,None,None)
    # Create subtree
    else:
        symptom = symptoms.pop(0)
        right_child = builder(records,symptoms,path)
        path.append(symptom)
        left_child = builder(records,symptoms,path)
        return Node(symptom,right_child,left_child)
I tried a dry run, and on paper it worked fine. I'm not sure what I'm missing, but the resulting tree has a lot of empty nodes and not a single one with an illness. Maybe I'm messing up the path handling, but I'm not sure how to fix it right now.
Your symptoms.pop(0) is affecting the one symptoms list shared by all calls to builder. This is fine on the way down, since you want to consider only the subsequent symptoms. But when a recursive call returns, your list is missing elements. (If it returns without finding a match, it’s empty!) Similarly, the shared path keeps growing forever.
The simple if inefficient answer is to make new lists when recursing:
symptom=symptoms[0]
symptoms=symptoms[1:]
path=path+[symptom] # not +=
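Putting those three lines into the whole function could look like the sketch below. Beyond the fix above it makes a few assumptions of its own, so treat it as one possible reading rather than the intended solution: the illness lookup is moved to the leaves so that every leaf sits at the same depth as in the expected tree, the "Healthy" label for the empty path is taken from that diagram, and the children are passed to `Node` in (positive, negative) order to match its constructor.

```python
class Node:
    def __init__(self, data="", pos=None, neg=None):
        self.data = data
        self.positive_child = pos
        self.negative_child = neg


class Record:
    def __init__(self, illness, symptoms):
        self.illness = illness
        self.symptoms = symptoms


def builder(records, symptoms, path):
    # No symptoms left to ask about: this is a leaf
    if not symptoms:
        for record in records:
            if record.symptoms == path:
                return Node(record.illness)
        # Assumption taken from the expected tree: no symptoms at all means healthy
        return Node("Healthy" if not path else None)
    # New lists for each call, so sibling calls never see each other's changes
    symptom, rest = symptoms[0], symptoms[1:]
    positive = builder(records, rest, path + [symptom])
    negative = builder(records, rest, path)
    return Node(symptom, positive, negative)


records = [Record('A', ['1', '3']), Record('B', ['1', '2']), Record('C', ['2', '3'])]
symptoms = ['1', '2', '3']
tree = builder(records, symptoms, [])
print(tree.positive_child.negative_child.positive_child.data)  # A
```

Because slicing and `+` create new lists, the caller's `symptoms` list is still `['1', '2', '3']` after the tree has been built.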