What's wrong with this least common ancestor algorithm?

What's wrong with this least common ancestor algorithm? - python

I was asked the following question in a job interview:
Given a root node (to a well formed binary tree) and two other nodes (which are guaranteed to be in the tree, and are also distinct), return the lowest common ancestor of the two nodes.
I didn't know any least common ancestor algorithms, so I tried to make one on the spot. I produced the following code:
def least_common_ancestor(root, a, b):
lca = [None]
def check_subtree(subtree, lca=lca):
if lca[0] is not None or subtree is None:
return 0
if subtree is a or subtree is b:
return 1
else:
ans = sum(check_subtree(n) for n in (subtree.left, subtree.right))
if ans == 2:
lca[0] = subtree
return 0
return ans
check_subtree(root)
return lca[0]
class Node:
def __init__(self, left, right):
self.left = left
self.right = right
I tried the following test cases and got the answer that I expected:
a = Node(None, None)
b = Node(None, None)
tree = Node(Node(Node(None, a), b), None)
tree2 = Node(a, Node(Node(None, None), b))
tree3 = Node(a, b)
but my interviewer told me that "there is a class of trees for which your algorithm returns None." I couldn't figure out what it was and I flubbed the interview. I can't think of a case where the algorithm would make it to the bottom of the tree without ans ever becoming 2 -- what am I missing?

You forgot to account for the case where a is a direct ancestor of b, or vice versa. You stop searching as soon as you find either node and return 1, so you'll never find the other node in that case.
You were given a well-formed binary search tree; one of the properties of such a tree is that you can easily find elements based on their relative size to the current node; smaller elements are going into the left sub-tree, greater go into the right. As such, if you know that both elements are in the tree you only need to compare keys; as soon as you find a node that is in between the two target nodes, or equal to one them, you have found lowest common ancestor.
Your sample nodes never included the keys stored in the tree, so you cannot make use of this property, but if you did, you'd use:
def lca(tree, a, b):
if a.key <= tree.key <= b.key:
return tree
if a.key < tree.key and b.key < tree.key:
return lca(tree.left, a, b)
return lca(tree.right, a, b)
If the tree is merely a 'regular' binary tree, and not a search tree, your only option is to find the paths for both elements and find the point at which these paths diverge.
If your binary tree maintains parent references and depth, this can be done efficiently; simply walk up the deeper of the two nodes until you are at the same depth, then continue upwards from both nodes until you have found a common node; that is the least-common-ancestor.
If you don't have those two elements, you'll have to find the path to both nodes with separate searches, starting from the root, then find the last common node in those two paths.

You are missing the case where a is an ancestor of b.
Look at the simple counter example:
a
b None
a is also given as root, and when invoking the function, you invoke check_subtree(root), which is a, you then find out that this is what you are looking for (in the stop clause that returns 1), and return 1 immidiately without setting lca as it should have been.

Related

Is it right approach to fix problem of BST?

I have to check is Binary tree balanced and I am pretty sure my solution should work.
import sys
class Solution:
def isBalanced(self, root: TreeNode) -> bool:
cache = {
max:-sys.maxsize, #min possible number
min:sys.maxsize #max possible number
}
self.checkBalanced(root, cache, 0)
return cache[max] - cache[min] <= 1
def checkBalanced(self,node, cache, depth):
if node is None:
if depth < cache[min]:
cache[min] = depth
if depth > cache[max]:
cache[max] = depth
else:
self.checkBalanced(node.left, cache, depth+1)
self.checkBalanced(node.right, cache, depth+1)
But in this case I have an error
Here is link for question on Leetcode: https://leetcode.com/problems/balanced-binary-tree

There is a definition from the link that you provided:
For this problem, a height-balanced binary tree is defined as: a binary tree in which the depth of the two subtrees of every node never differ by more than 1.
For given "bad" input your code calculates cache[max] = 5, cache[min] = 3, so it returns false. However, if we consider root, its left subtree has depth 4 and right subtree has depth 3, so this satisfies the definition.
You should find depths of left and right subtrees for each root, however your code calculates depth (your cache[max]) and length of shortest path to any leaf of subtree (your cache[min]).
I hope you will easily fix your code after these clarifications.

Not sure how this recursion works

class Solution:
def findDuplicateSubtrees(self, root):
self.res = []
self.dic = {}
self.dfs(root)
return self.res
def dfs(self, root):
if not root: return '#'
tree = self.dfs(root.left) + self.dfs(root.right) + str(root.val)
if tree in self.dic and self.dic[tree] == 1:
self.res.append(root)
self.dic[tree] = self.dic.get(tree, 0) + 1
return tree
This is a solution for getting all duplicate subtrees given a binary tree.
I am not sure what tree = self.dfs(root.left) + self.dfs(root.right) + str(root.val) is trying to give.
I know it's trying to do a postorder traversal, but how does this part actually work?
It would be great if anyone can follow through this code. Thanks.

Basically, the variable tree can be regarded as an encoding string for each subtree. And we use a global dict self.dic to memorize those encoded strings.
An examples:
A
B C
D D E B
D D
In level order, the binary tree can be descripted as [[A], [B, C], [D, D, E, B], [#, #, #, #, #, #, D, D]], so we have at least two duplicate subtrees as [[B], [D, D]] and [D]
Follow the code, we have
dfs(A)
dfs(B)
dfs(D) *save ##D, return ##D
dfs(D) *find ##D, get one subtree, return ##D
*save ##D##DB, return ##D##DB
dfs(C)
...

Recursion is best untangled from bottom up.
A non-node (a daughter of a leaf node) will return "#".
A leaf with value 1 would return "##1" (as it has two non-node children).
A node 3 with two leaf daughters 1 and 2 would return ##1##23. "##1" is the dfs of the left daughter, "##2" is the dfs of the right daughter, and "3" is the stringified current node's value.
In this way, assuming there's no node with value 23 and another with empty string value, you can see that if two different nodes produce ##1##23, they are a duplicate subtree. It would be much more robust if there were some additional separators employed (e.g. a semicolon at the end of that line would produce "##1;##2;3;), which would be sufficient to make it a bit more readable and less ambiguous. Even safer (but slower) if done with lists.

Root to leaf algo bug

Wrote an unnecessarily complex solution to the following question:
Given a binary tree and a sum, determine if the tree has a
root-to-leaf path such that adding up all the values along the path
equals the given sum.
Anyway, I'm trying to debug what went wrong here. I used a named tuple so that I can track both whether the number has been found and the running sum, but it looks like running sum is never incremented beyond zero. At any given leaf node, though, the running sum will be the leaf node's value, and in the next iteration the running sum should be incremented by the current running sum. Anyone know what's wrong with my "recursive leap of faith" here?
def root_to_leaf(target_sum, tree):
NodeData = collections.namedtuple('NodeData', ['running_sum', 'num_found'])
def root_to_leaf_helper(node, node_data):
if not node:
return NodeData(0, False)
if node_data.num_found:
return NodeData(target_sum, True)
left_check = root_to_leaf_helper(node.left, node_data)
if left_check.num_found:
return NodeData(target_sum, True)
right_check = root_to_leaf_helper(node.right, node_data)
if right_check.num_found:
return NodeData(target_sum, True)
new_running_sum = node.val + node_data.running_sum
return NodeData(new_running_sum, new_running_sum == target_sum)
return root_to_leaf_helper(tree, NodeData(0, False)).num_found
EDIT: I realize this is actually just checking if any path (not ending at leaf) has the correct value, but my question still stands on understanding why running sum isn't properly incremented.

I think you need to think clearly about whether information is flowing down the tree (from root to leaf) or up the tree (from leaf to root). It looks like the node_data argument to root_to_leaf_helper is initialized at the top of the tree, the root, and then passed down through each node via recursive calls. That's fine, but as far as I can tell, it's never changed on the way down the tree. It's just passed along untouched. Therefore the first check, for node_data.num_found, will always be false.
Even worse, since node_data is always the same ((0, False)) on the way down the tree, the following line that tries to add the current node's value to a running sum:
new_running_sum = node.val + node_data.running_sum
will always be adding node.val to 0, since node_data.running_sum is always 0.
Hopefully this is clear, I realize that it's a little difficult to explain.
But trying to think very clearly about information flowing down the tree (in the arguments to recursive calls) vs information flowing up the tree (in the return value from the recursive calls) will make this a little bit more clear.

You can keep a running list of the path as part of the signature of the recursive function/method, and call the function/method on both the right and left nodes using a generator. The generator will enable you to find the paths extending from the starting nodes. For simplicity, I have implemented the solution as a class method:
class Tree:
def __init__(self, *args):
self.__dict__ = dict(zip(['value', 'left', 'right'], args))
def get_sums(self, current = []):
if self.left is None:
yield current + [self.value]
else:
yield from self.left.get_sums(current+[self.value])
if self.right is None:
yield current+[self.value]
else:
yield from self.right.get_sums(current+[self.value])
tree = Tree(4, Tree(10, Tree(4, Tree(7, None, None), Tree(6, None, None)), Tree(12, None, None)), Tree(5, Tree(6, None, Tree(11, None, None)), None))
paths = list(tree.get_sums())
new_paths = [a for i, a in enumerate(paths) if a not in paths[:i]]
final_path = [i for i in paths if sum(i) == 15]
Output:
[[4, 5, 6]]

Size of subtree in Python

Almost every online tutorial I see on the subject when it comes to finding the size of a subtree involves calling a recursive function on each child's subtree.
The problem with this in Python is that it overflows if you recurse past a few hundred levels, so if I theoretically had a long, linear tree, it would fail.
Is there a better way to handle this? Do I need to use a stack instead?

Do I need to use a stack instead?
Sure, that's one way of doing it.
def iter_tree(root):
to_explore = [root]
while to_explore:
node = to_explore.pop()
yield node
for child in node.children:
to_explore.append(child)
def size(root):
count = 0
for node in iter_tree(root):
count += 1
return count

The stack would be the easiest non-recursive way of getting the size of the subtree (count of nodes under the given node, including the current node)
class Node():
def __init__(self, value):
self.value = value
self.left = None
self.right = None
def subtree_size(root):
visited = 0
if not root: return visited
stack = [root]
while stack:
node = stack.pop()
visited += 1
if node.left: stack.append(node.left)
if node.right: stack.append(node.right)
return visited

You can mirror the recursive algorithm using a stack:
numNodes = 0
nodeStack = [(root,0)] # (Node,0 means explore left 1 means explore right)
while nodeStack:
nextNode, leftOrRight = nodeStack.pop()
if not nextNode: #nextNode is empty
continue
if leftOrRight == 0:
numNodes += 1
nodeStack.append((nextNode,1))
nodeStack.append((nextNode.leftChild,0))
else:
nodeStack.append((nextNode.rightChild,0))
print(numNodes)
Some things to notice: This is still a Depth-first search! That is, we still fully explore a subtree before starting to explore the other. What this means to you is that the amount of additional memory required is proportional to the height of the tree and not the width of the tree. For a balanced tree the width of the tree is 2^h where h is the height of the tree. For a totally unbalanced tree the height of the tree is the number of nodes in the tree, whereas the width is one! so it all depends on what you need :)
Now It is worth mentioning that you can make a potential optimization by checking if one of the subtrees is empty! We can change the body of if leftOrRight == 0: to:
numNodes += 1
if nextNode.rightChild: #Has a right child to explore
nodeStack.append((nextNode,1))
nodeStack.append((nextNode.leftChild,0))
Which potentially cuts down on memory usage :)

Mapping a list to a Huffman Tree whilst preserving relative order

I'm having an issue with a search algorithm over a Huffman tree: for a given probability distribution I need the Huffman tree to be identical regardless of permutations of the input data.
Here is a picture of what's happening vs what I want:
Basically I want to know if it's possible to preserve the relative order of the items from the list to the tree. If not, why is that so?
For reference, I'm using the Huffman tree to generate sub groups according to a division of probability, so that I can run the search() procedure below. Notice that the data in the merge() sub-routine is combined, along with the weight. The codewords themselves aren't as important as the tree (which should preserve the relative order).
For example if I generate the following Huffman codes:
probabilities = [0.30, 0.25, 0.20, 0.15, 0.10]
items = ['a','b','c','d','e']
items = zip(items, probabilities)
t = encode(items)
d,l = hi.search(t)
print(d)
Using the following Class:
class Node(object):
left = None
right = None
weight = None
data = None
code = None
def __init__(self, w,d):
self.weight = w
self.data = d
def set_children(self, ln, rn):
self.left = ln
self.right = rn
def __repr__(self):
return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)
def __cmp__(self, a):
return cmp(self.weight, a.weight)
def merge(self, other):
total_freq = self.weight + other.weight
new_data = self.data + other.data
return Node(total_freq,new_data)
def index(self, node):
return node.weight
def encode(symbfreq):
pdb.set_trace()
tree = [Node(sym,wt) for wt,sym in symbfreq]
heapify(tree)
while len(tree)>1:
lo, hi = heappop(tree), heappop(tree)
n = lo.merge(hi)
n.set_children(lo, hi)
heappush(tree, n)
tree = tree[0]
def assign_code(node, code):
if node is not None:
node.code = code
if isinstance(node, Node):
assign_code(node.left, code+'0')
assign_code(node.right, code+'1')
assign_code(tree, '')
return tree
I get:
'a'->11
'b'->01
'c'->00
'd'->101
'e'->100
However, an assumption I've made in the search algorithm is that more probable items get pushed toward the left: that is I need 'a' to have the '00' codeword - and this should always be the case regardless of any permutation of the 'abcde' sequence. An example output is:
codewords = {'a':'00', 'b':'01', 'c':'10', 'd':'110', 'e':111'}
(N.b even though the codeword for 'c' is a suffix for 'd' this is ok).
For completeness, here is the search algorithm:
def search(tree):
print(tree)
pdb.set_trace()
current = tree.left
other = tree.right
loops = 0
while current:
loops+=1
print(current)
if current.data != 0 and current is not None and other is not None:
previous = current
current = current.left
other = previous.right
else:
previous = other
current = other.left
other = other.right
return previous, loops
It works by searching for the 'leftmost' 1 in a group of 0s and 1s - the Huffman tree has to put more probable items on the left. For example if I use the probabilities above and the input:
items = [1,0,1,0,0]
Then the index of the item returned by the algorithm is 2 - which isn't what should be returned (0 should, as it's leftmost).

The usual practice is to use Huffman's algorithm only to generate the code lengths. Then a canonical process is used to generate the codes from the lengths. The tree is discarded. Codes are assigned in order from shorter codes to longer codes, and within a code, the symbols are sorted. This gives the codes you are expecting, a = 00, b = 01, etc. This is called a Canonical Huffman code.
The main reason this is done is to make the transmission of the Huffman code more compact. Instead of sending the code for each symbol along with the compressed data, you only need to send the code length for each symbol. Then the codes can be reconstructed on the other end for decompression.
A Huffman tree is not normally used for decoding either. With a canonical code, simple comparisons to determine the length of the next code, and an index using the code value will take you directly to the symbol. Or a table-driven approach can avoid the search for the length.
As for your tree, there are arbitrary choices being made when there are equal frequencies. In particular, on the second step the first node pulled is c with probability 0.2, and the second node pulled is b with probability 0.25. However it would have been equally valid to pull, instead of b, the node that was made in the first step, (e,d), whose probability is also 0.25. In fact that is what you'd prefer for your desired end state. Alas, you have relinquished the control of that arbitrary choice to the heapq library.
(Note: since you are using floating point values, 0.1 + 0.15 is not necessarily exactly equal to 0.25. Though it turns out it is. As another example, 0.1 + 0.2 is not equal to 0.3. You would be better off using integers for the frequencies if you want to see what happens when sums of frequencies are equal to other frequencies or sums of frequencies. E.g. 6,5,4,3,2.)
Some of the wrong ordering can be fixed by fixing some mistakes: change lo.merge(high) to hi.merge(lo), and reverse the order of the bits to: assign_code(node.left, code+'1') followed by assign_code(node.right, code+'0'). Then at least a gets assigned 00 and d is before e and b is before c. The ordering is then adebc.
Now that I think about it, even if you pick (e,d) over b, e.g by setting the probability of b to 0.251, you still don't get the complete order that you're after. No matter what, the probability of (e,d) (0.25) is greater than the probability of c (0.2). So even in that case, the final ordering would be (with the fixes above) abdec instead of your desired abcde. So it is not possible to get what you want assuming a consistent tree ordering and bit assignment with respect to the probabilities of groups of symbols. E.g., assuming that for each branch the stuff on the left has a greater or equal probability than the stuff on the right, and 0 is always assigned to left and 1 is always assigned to right. You would need to do something different.
The different thing that comes to mind is what I said at the start of the answer. Use the Huffman algorithm just to get the code lengths. Then you can assign the codes to the symbols in whatever order you like, and build a new tree. That would be much easier than trying to come up with some sort of scheme to coerce the original tree to be what you want, and proving that that works in all cases.

I'll flesh out what Mark Adler said with working code. Everything he said is right :-) The high points:
You must not use floating-point weights, or any other scheme that loses information about weights. Use integers. Simple and correct. If, e.g., you have 3-digit floating probabilities, convert each to an integer via int(round(the_probability * 1000)), then maybe fiddle them to ensure the sum is exactly 1000.
heapq heaps are not "stable": nothing is defined about which item is popped if multiple items have the same minimal weight.
So you can't get what you want while building the tree.
A small variation of "canonical Huffman codes" appears to be what you do want. Constructing a tree for that is a long-winded process, but each step is straightforward enough. The first tree built is thrown away: the only information taken from it is the lengths of the codes assigned to each symbol.
Running:
syms = ['a','b','c','d','e']
weights = [30, 25, 20, 15, 10]
t = encode(syms, weights)
print t
prints this (formatted for readability):
[abcde,,
([ab,0,
([a,00,(None),(None)]),
([b,01,(None),(None)])]),
([cde,1,
([c,10,(None),(None)]),
([de,11,
([d,110,(None),(None)]),
([e,111,(None),(None)])])])]
Best I understand, that's exactly what you want. Complain if it isn't ;-)
EDIT: there was a bug in the assignment of canonical codes, which didn't show up unless weights were very different; fixed it.
class Node(object):
def __init__(self, data=None, weight=None,
left=None, right=None,
code=None):
self.data = data
self.weight = weight
self.left = left
self.right = right
self.code = code
def is_symbol(self):
return self.left is self.right is None
def __repr__(self):
return "[%s,%s,(%s),(%s)]" % (self.data,
self.code,
self.left,
self.right)
def __cmp__(self, a):
return cmp(self.weight, a.weight)
def encode(syms, weights):
from heapq import heapify, heappush, heappop
tree = [Node(data=s, weight=w)
for s, w in zip(syms, weights)]
sym2node = {s.data: s for s in tree}
heapify(tree)
while len(tree) > 1:
a, b = heappop(tree), heappop(tree)
heappush(tree, Node(weight=a.weight + b.weight,
left=a, right=b))
# Compute code lengths for the canonical coding.
sym2len = {}
def assign_codelen(node, codelen):
if node is not None:
if node.is_symbol():
sym2len[node.data] = codelen
else:
assign_codelen(node.left, codelen + 1)
assign_codelen(node.right, codelen + 1)
assign_codelen(tree[0], 0)
# Create canonical codes, but with a twist: instead
# of ordering symbols alphabetically, order them by
# their position in the `syms` list.
# Construct a list of (codelen, index, symbol) triples.
# `index` breaks ties so that symbols with the same
# code length retain their original ordering.
triples = [(sym2len[name], i, name)
for i, name in enumerate(syms)]
code = oldcodelen = 0
for codelen, _, name in sorted(triples):
if codelen > oldcodelen:
code <<= (codelen - oldcodelen)
sym2node[name].code = format(code, "0%db" % codelen)
code += 1
oldcodelen = codelen
# Create a tree corresponding to the new codes.
tree = Node(code="")
dir2attr = {"0": "left", "1": "right"}
for snode in sym2node.values():
scode = snode.code
codesofar = ""
parent = tree
# Walk the tree creating any needed interior nodes.
for d in scode:
assert parent is not None
codesofar += d
attr = dir2attr[d]
child = getattr(parent, attr)
if codesofar == scode:
# We're at the leaf position.
assert child is None
setattr(parent, attr, snode)
elif child is not None:
assert child.code == codesofar
else:
child = Node(code=codesofar)
setattr(parent, attr, child)
parent = child
# Finally, paste the `data` attributes together up
# the tree. Why? Don't know ;-)
def paste(node):
if node is None:
return ""
elif node.is_symbol():
return node.data
else:
result = paste(node.left) + paste(node.right)
node.data = result
return result
paste(tree)
return tree
Duplicate symbols
Could I swap the sym2node dict to an ordereddict to deal with
repeated 'a'/'b's etc?
No, and for two reasons:
No mapping type supports duplicate keys; and,
The concept of "duplicate symbols" makes no sense for Huffman encoding.
So, if you're determined ;-) to pursue this, first you have to ensure that symbols are unique. Just add this line at the start of the function:
syms = list(enumerate(syms))
For example, if the syms passed in is:
['a', 'b', 'a']
that will change to:
[(0, 'a'), (1, 'b'), (2, 'a')]
All symbols are now 2-tuples, and are obviously unique since each starts with a unique integer. The only thing the algorithm cares about is that symbols can be used as dict keys; it couldn't care less whether they're strings, tuples, or any other hashable type that supports equality testing.
So nothing in the algorithm needs to change. But before the end, we'll want to restore the original symbols. Just insert this before the paste() function:
def restore_syms(node):
if node is None:
return
elif node.is_symbol():
node.data = node.data[1]
else:
restore_syms(node.left)
restore_syms(node.right)
restore_syms(tree)
That simply walks the tree and strips the leading integers off the symbols' .data members. Or, perhaps simpler, just iterate over sym2node.values(), and transform the .data member of each.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.