Not sure how this recursion works - python

class Solution:
    def findDuplicateSubtrees(self, root):
        self.res = []
        self.dic = {}
        self.dfs(root)
        return self.res

    def dfs(self, root):
        if not root: return '#'
        tree = self.dfs(root.left) + self.dfs(root.right) + str(root.val)
        if tree in self.dic and self.dic[tree] == 1:
            self.res.append(root)
        self.dic[tree] = self.dic.get(tree, 0) + 1
        return tree
This is a solution for finding all duplicate subtrees of a given binary tree.
I am not sure what tree = self.dfs(root.left) + self.dfs(root.right) + str(root.val) is supposed to produce.
I know it's trying to do a postorder traversal, but how does this part actually work?
It would be great if anyone could walk through this code. Thanks.

Basically, the variable tree can be regarded as an encoding string for each subtree, and we use a global dict self.dic to memoize those encoded strings.
An example:

        A
      /   \
     B     C
    / \   / \
   D   D E   B
            / \
           D   D

In level order, the binary tree can be described as [[A], [B, C], [D, D, E, B], [#, #, #, #, #, #, D, D]], so we have at least two kinds of duplicate subtrees: [B, D, D] (the subtree rooted at B, which appears twice) and [D] (which appears four times).
Following the code, we have:

dfs(A)
    dfs(B)
        dfs(D)  *save ##D, return ##D
        dfs(D)  *find ##D, get one subtree, return ##D
    *save ##D##DB, return ##D##DB
    dfs(C)
    ...

Recursion is best untangled from bottom up.
A non-node (a daughter of a leaf node) will return "#".
A leaf with value 1 would return "##1" (as it has two non-node children).
A node 3 with two leaf daughters 1 and 2 would return "##1##23". "##1" is the dfs of the left daughter, "##2" is the dfs of the right daughter, and "3" is the stringified current node's value.
In this way, assuming there's no node with value 23 and none with an empty-string value, you can see that if two different nodes produce "##1##23", they are duplicate subtrees. It would be much more robust if some additional separators were employed (e.g. a semicolon after each element would produce "##1;##2;3;"), which would be sufficient to make it a bit more readable and less ambiguous. Even safer (but slower) if done with lists.
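The whole encoding scheme can be seen end to end with a small runnable sketch of the same idea (the snake_case names and the TreeNode class are mine, not from the original solution):

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def find_duplicate_subtrees(root):
    res, seen = [], {}
    def dfs(node):
        if not node:
            return '#'                      # marker for an empty child
        # Postorder: encode left subtree, then right subtree, then value.
        key = dfs(node.left) + dfs(node.right) + str(node.val)
        if seen.get(key) == 1:              # second occurrence of this encoding
            res.append(node)
        seen[key] = seen.get(key, 0) + 1
        return key
    dfs(root)
    return res

# The example tree from the answer above: A(B(D, D), C(E, B(D, D))).
tree = TreeNode('A',
                TreeNode('B', TreeNode('D'), TreeNode('D')),
                TreeNode('C', TreeNode('E'),
                         TreeNode('B', TreeNode('D'), TreeNode('D'))))
print(sorted(n.val for n in find_duplicate_subtrees(tree)))  # ['B', 'D']
```

Each duplicate root is appended exactly once, on the second time its encoding is seen, which is why the count is checked before it is incremented.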

Determine if subtree t is inside tree s

I'm trying leetcode problem 572.
Given two non-empty binary trees s and t, check whether tree t has exactly the same structure and node values with a subtree of s. A subtree of s is a tree consists of a node in s and all of this node's descendants. The tree s could also be considered as a subtree of itself.
Since trees are great for recursion, I thought about splitting the cases up:
a) If the current tree s is not the subtree t, then recurse on the left and right parts of s if possible
b) If tree s is subtree t, then return True
c) if s is empty, then we've exhausted all the subtrees in s and should return False
def isSubtree(self, s: TreeNode, t: TreeNode) -> bool:
    if not s:
        return False
    if s == t:
        return True
    else:
        if s.left and s.right:
            return any([self.isSubtree(s.left, t), self.isSubtree(s.right, t)])
        elif s.left:
            return self.isSubtree(s.left, t)
        elif s.right:
            return self.isSubtree(s.right, t)
        else:
            return False
However, for some reason this returns False even for cases that are obviously True.
Ex:
My code here returns False, but it should be True. Any pointers on what to do?
This'll simply get through:
class Solution:
    def isSubtree(self, a, b):
        def sub(node):
            return f'A{node.val}#{sub(node.left)}{sub(node.right)}' if node else 'Z'
        return sub(b) in sub(a)
References
For additional details, you can see the Discussion Board. There are plenty of accepted solutions with a variety of languages and explanations, efficient algorithms, as well as asymptotic time/space complexity analysis in there.
You need to change the second if statement. To check whether tree t is a subtree of tree s, each time a node in s matches the root of t, call a check method that determines whether the two subtrees are identical:

if s.val == t.val and self.check(s, t):
    return True

The check method looks like this:
def check(self, s, t):
    if s is None and t is None:
        return True
    if s is None or t is None or s.val != t.val:
        return False
    return self.check(s.left, t.left) and self.check(s.right, t.right)
While the other code works, your else branch can be much simpler, like the following. You don't need to check whether the left and right nodes are None, because the first if statement already handles that.

else:
    return self.isSubtree(s.left, t) or self.isSubtree(s.right, t)
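Putting both fixes together gives a runnable sketch (standalone functions instead of class methods; the TreeNode class and test trees are mine):

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def check(s, t):
    # True when the trees rooted at s and t are structurally identical.
    if s is None and t is None:
        return True
    if s is None or t is None or s.val != t.val:
        return False
    return check(s.left, t.left) and check(s.right, t.right)

def isSubtree(s, t):
    if s is None:
        return False                     # exhausted s without a match
    if s.val == t.val and check(s, t):
        return True
    return isSubtree(s.left, t) or isSubtree(s.right, t)

# t matches the subtree rooted at s's left child:
t = TreeNode(4, TreeNode(1), TreeNode(2))
s = TreeNode(3, TreeNode(4, TreeNode(1), TreeNode(2)), TreeNode(5))
print(isSubtree(s, t))  # True
```

Note that the original code's `s == t` compared node identity, which is why it returned False; `check` compares structure and values instead.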

Python - Questions for the different use of comma

Recently, I have been solving problem No. 113, Path Sum II, on LeetCode, and I found a solution online.
Given a binary tree and a sum, find all root-to-leaf paths where each path's sum equals the given sum.
Code as below:
class Solution:
    def pathSum(self, R: TreeNode, S: int) -> List[List[int]]:
        A, P = [], []
        def dfs(N):
            if N == None: return
            P.append(N.val)
            if (N.left,N.right) == (None,None) and sum(P) == S: A.append(list(P))
            else: dfs(N.left), dfs(N.right)
            P.pop()
        dfs(R)
        return A
- Junaid Mansuri
- Chicago, IL
I would like to ask some questions based on the above code to help me understand how Python works more.
Why do we need to use list(), as in A.append(list(P)), to successfully append the list into A if P itself is already a list?
What happens when the interpreter runs dfs(N.left), dfs(N.right)? Both of the calls will append values into P, but they don't seem to affect each other (as if they were running at the exact same time with the exact same P). Is it something like multithreading?
A related question: does A, P = [], [] work on the same principle as dfs(N.left), dfs(N.right)? If not, what is the difference?
What does P.pop() actually pop? I mean, which value will be popped out if both dfs(N.left) and dfs(N.right) run? Will there be two Ps after the two calls run?
Updates (more question)
10 while head != None:
11 if id(head.next) in hashMap: return True
12 head = head.next
13 hashMap.add(id(head.next))
Line 13: AttributeError: 'NoneType' object has no attribute 'next'
The above is part of the code. It simply walks through a linked list. It shows the error above, which I think is normal when it reaches the end of the linked list.
What I want to understand is why, if the code is changed as below, there is no error and it runs successfully. Is that related to the comma, or is there another reason that makes it run?
10 while head != None:
11 if id(head.next) in hashMap: return True
12 head, _ = head.next, hashMap.add(id(head.next))
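The difference is evaluation order, not the comma itself: in a tuple assignment the entire right-hand side is evaluated, using the old bindings, before any name is rebound. A minimal sketch of both versions (the ListNode class is mine):

```python
class ListNode:
    def __init__(self, nxt=None):
        self.next = nxt

tail = ListNode()          # last node; tail.next is None
seen = set()

# Sequential version: head is rebound first, then head.next blows up.
head = tail
head = head.next           # head is now None
try:
    seen.add(id(head.next))
except AttributeError as e:
    print(e)               # 'NoneType' object has no attribute 'next'

# Tuple version: both right-hand expressions see the *old* head,
# so head.next is evaluated while head is still a real node.
head = tail
head, _ = head.next, seen.add(id(head.next))
print(head)                # None, and no exception was raised
```

So the one-line form survives the last node because `hashMap.add(id(head.next))` runs before `head` is rebound to `None`.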
An internet search will point you to lots of articles on depth-first search.
To answer your immediate questions.
Why do we need to use list(), A.append(list(P)), to successfully
append the list into A if P itself is already a list?
A.append(list(P))
Uses the list constructor to make a shallow copy of P to add to A.
Otherwise, if you just used:
A.append(P)
then the list stored in A would change every time P changes.
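The aliasing problem can be demonstrated in a few lines, independent of the tree code:

```python
P = [1]
A = []
A.append(P)        # A holds a reference to the very same list object
A.append(list(P))  # A holds a shallow copy taken at this moment
P.append(2)        # mutate P afterwards

print(A)  # [[1, 2], [1]] -- the reference changed, the copy did not
```

This is exactly why the solution snapshots the current path with list(P) before storing it in A.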
What happens when the interpreter runs dfs(N.left), dfs(N.right)? Both
of the calls will append values into P, but they don't seem to affect
each other (like they are running at the exact same time with
the exact same P), is it something like multithreading?
These functions are run sequentially. First dfs(N.left) followed by dfs(N.right).
This performs a depth-first search (DFS) on the left subtree, followed by a DFS on right subtree.
Each function is run for its side-effect of updating A and P.
A related question of the above: does A, P = [ ], [ ] work with the same
concept as dfs(N.left), dfs(N.right)? If not, what is the difference?
What does P.pop() actually pop? I mean, which value will be popped out if
both dfs(N.left) and dfs(N.right) run? I mean, will there be two P
after the two functions run?
Variable A and P are local variables of pathSum. dfs being a nested function within pathSum has access to these local variables. Thus there is only one A and one P which dfs updates as it is called recursively.
A, P = [], []
Is initializing A, P (done once within pathSum).
dfs(N.left), dfs(N.right)
Is calling the dfs methods on the left and right subnodes, which performs updates on A and P as the recursive calls run.
What does P.pop() actually pop? I mean, which value will be popped out if
both dfs(N.left) and dfs(N.right) run? I mean, will there be two P
after the two functions run?
P.pop()
Removes the last value appended to list P.
dfs(N.left) and dfs(N.right) are run one after the other. For example, with N corresponding to the value 1:
dfs(N.left), dfs(N.right)
First dfs(N.left) will recursively traverse nodes with values of:
2, 4, 5
Then, dfs(N.right) will traverse the node with value 3.
The values of A and P are updated during the traversal. P contains the path to the current node. When we branch left (i.e. dfs(N.left)), the left child is added to P. When we return, we need to remove it (thus P.pop()). Similarly for dfs(N.right).
When we have traversed both the left and right children, we remove the current node with P.pop() and control flow returns to the parent.
Tuple Abbreviation
a, b is an abbreviation for tuple (a, b)
Thus:
[], [] computes tuple ([], [])
b, a computes tuple (b, a)
dfs(N.left), dfs(N.right) computes tuple (dfs(N.left), dfs(N.right))
Tuples can be unpacked
Example:
t = (2, 4, 6, 8)
x, y, z, w = t
Will have x = 2, y = 4, z = 6, w = 8
With:
A, P = [], [] equivalent to A, P = ([], [])
Unpacking then has:
A = [], P = []
With:
a, b = b, a is equivalent to a, b = (b, a)
Unpacking then has:
a bound to the old value of b, and b bound to the old value of a.
In terms of result, think of these assignments as occurring in parallel.
So with:
(a, b) = (5, 3)
a, b = (b, a) will have a = 3, b = 5 (thus swapping a and b)
dfs(N.left), dfs(N.right) is computing the tuple (dfs(N.left), dfs(N.right)).
This necessitates running dfs on N.left and N.right.
The resulting tuple is discarded, but this has the desired effect of updating A and P as dfs runs recursively.
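All of the above can be checked directly: the comma builds a tuple, the two calls run strictly left to right (no threading), and unpacking swaps using the values computed before any rebinding:

```python
calls = []
def f(x):
    # Record the order in which calls actually happen.
    calls.append(x)
    return x * 10

t = f(1), f(2)   # a tuple expression: f(1) runs first, then f(2)
print(calls)     # [1, 2] -- strictly sequential, left to right
print(t)         # (10, 20) -- the tuple the comma built

a, b = 5, 3
a, b = b, a      # right side is built as a tuple before unpacking
print(a, b)      # 3 5
```

In the pathSum code the tuple built by `dfs(N.left), dfs(N.right)` is simply thrown away; only the side effects on A and P matter.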

Python tree traversal and sorting groups of items inside sorted list

I am traversing a non-binary tree, and I have functions to calculate the height of a node and its number of children. What I want to do is sort the children of my node first by height, and inside each height group sort them by number of children,
eg:
      a
     / \
    b   c
   /|\   \
  d e f   g
          |
          h
so when I traverse tree:
def orderTree(node):
    if "children" in node:
        if node['children']:
            node['children'].sort(key=findHeight)
            node['children'].sort(key=countChildren)
        for child in node['children']:
            print(child['name'])
            orderTree(child)
with this code I get => a, c, g, h, b, d, e, f
but what I need is => a, b, d, e, f, c, g, h
Any idea how to sort groups of sorted items inside a Python list?
What you want to do is called "multi-field sorting". To sort a list of nodes by height and then by number of children, simply give sort the following function as the key:

lambda x: (findHeight(x), countChildren(x))
This simply returns a tuple of (height, children). And then sort uses this tuple to compare two nodes.
Code:
def orderTree(node):
    # I combined the two ifs
    if "children" in node and node['children']:
        node['children'].sort(key=lambda x: (findHeight(x), countChildren(x)))
        for child in node['children']:
            print(child['name'])
            orderTree(child)
Let's say I have

A = (a, b) and B = (x, y)

These two tuples will be compared like this:

def compare_tuple(A, B):
    if A[0] != B[0]: return A[0] < B[0]
    else: return A[1] < B[1]
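A self-contained demonstration of key-tuple sorting (the records and their height/children fields are made up for illustration; in the real code those values would come from findHeight and countChildren):

```python
# Stand-in records: (name, height, number_of_children).
nodes = [('c', 2, 1), ('b', 2, 3), ('d', 1, 0), ('a', 2, 2)]

# Sort by height first; ties are broken by number of children,
# because tuples compare element by element.
nodes.sort(key=lambda n: (n[1], n[2]))
print([n[0] for n in nodes])  # ['d', 'c', 'a', 'b']
```

The single sort with a tuple key replaces the two consecutive sorts in the question, where the second sort was overriding the first.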
The type of tree traversal you are after is called pre-order. It is hard to understand your code without knowing the full structure, but it looks like you are sorting the children of the root node first and then traversing down.
I think you want to open the first child node and then keep traversing through that. For pre-order it should be something like.
node
node.left
node.right
This is a good reference
http://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/

What's wrong with this least common ancestor algorithm?

I was asked the following question in a job interview:
Given a root node (to a well formed binary tree) and two other nodes (which are guaranteed to be in the tree, and are also distinct), return the lowest common ancestor of the two nodes.
I didn't know any least common ancestor algorithms, so I tried to make one on the spot. I produced the following code:
def least_common_ancestor(root, a, b):
    lca = [None]
    def check_subtree(subtree, lca=lca):
        if lca[0] is not None or subtree is None:
            return 0
        if subtree is a or subtree is b:
            return 1
        else:
            ans = sum(check_subtree(n) for n in (subtree.left, subtree.right))
            if ans == 2:
                lca[0] = subtree
                return 0
            return ans
    check_subtree(root)
    return lca[0]

class Node:
    def __init__(self, left, right):
        self.left = left
        self.right = right
I tried the following test cases and got the answer that I expected:
a = Node(None, None)
b = Node(None, None)
tree = Node(Node(Node(None, a), b), None)
tree2 = Node(a, Node(Node(None, None), b))
tree3 = Node(a, b)
but my interviewer told me that "there is a class of trees for which your algorithm returns None." I couldn't figure out what it was and I flubbed the interview. I can't think of a case where the algorithm would make it to the bottom of the tree without ans ever becoming 2 -- what am I missing?
You forgot to account for the case where a is a direct ancestor of b, or vice versa. You stop searching as soon as you find either node and return 1, so you'll never find the other node in that case.
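One standard fix along those lines: don't stop the search when a target is found; return the found node upward, and let the first node that sees hits in both subtrees declare itself the LCA. A sketch (not necessarily the interviewer's intended answer; the Node class mirrors the question's):

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def lowest_common_ancestor(root, a, b):
    if root is None:
        return None
    if root is a or root is b:
        # A node counts as its own ancestor: if the other target is
        # somewhere below, this node is already the LCA, so we can
        # stop here either way.
        return root
    left = lowest_common_ancestor(root.left, a, b)
    right = lowest_common_ancestor(root.right, a, b)
    if left and right:
        return root        # targets found on both sides
    return left or right   # propagate whichever side found something

# The failing class of trees: a is a direct ancestor of b.
b = Node()
a = Node(b, None)
root = Node(a, None)
print(lowest_common_ancestor(root, a, b) is a)  # True
```

The key difference from the question's code is that finding one target does not end the search for the pair; the ancestor case falls out naturally from "a node is its own ancestor".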
You were given a well-formed binary search tree; one of the properties of such a tree is that you can easily find elements based on their relative size to the current node; smaller elements are going into the left sub-tree, greater go into the right. As such, if you know that both elements are in the tree you only need to compare keys; as soon as you find a node that is in between the two target nodes, or equal to one them, you have found lowest common ancestor.
Your sample nodes never included the keys stored in the tree, so you cannot make use of this property, but if you did, you'd use:
def lca(tree, a, b):
    if a.key > b.key:        # normalize so that a.key <= b.key
        a, b = b, a
    if a.key <= tree.key <= b.key:
        return tree
    if b.key < tree.key:
        return lca(tree.left, a, b)
    return lca(tree.right, a, b)
If the tree is merely a 'regular' binary tree, and not a search tree, your only option is to find the paths for both elements and find the point at which these paths diverge.
If your binary tree maintains parent references and depth, this can be done efficiently; simply walk up the deeper of the two nodes until you are at the same depth, then continue upwards from both nodes until you have found a common node; that is the least-common-ancestor.
If you don't have those two elements, you'll have to find the path to both nodes with separate searches, starting from the root, then find the last common node in those two paths.
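The parent/depth walk described above might look like this (the parent and depth attributes are assumptions about the node class; the PNode class here is mine):

```python
class PNode:
    # Minimal node keeping a parent link and its depth.
    def __init__(self, parent=None):
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1

def lca_with_parents(a, b):
    # Walk the deeper node up until both sit at the same depth...
    while a.depth > b.depth:
        a = a.parent
    while b.depth > a.depth:
        b = b.parent
    # ...then climb in lockstep until the paths meet.
    while a is not b:
        a, b = a.parent, b.parent
    return a

root = PNode()
x = PNode(root)
y = PNode(x)       # y is two levels deep
z = PNode(root)    # z is one level deep
print(lca_with_parents(y, z) is root)  # True
```

This runs in O(depth) time with no extra storage, which is why maintaining parent references makes the problem easy.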
You are missing the case where a is an ancestor of b.
Look at this simple counterexample:

    a
   / \
  b  None

a is also given as root, so when you invoke check_subtree(root), which is a, you find that this is one of the nodes you are looking for (in the stop clause that returns 1) and return 1 immediately, without ever setting lca as it should have been.

Mapping a list to a Huffman Tree whilst preserving relative order

I'm having an issue with a search algorithm over a Huffman tree: for a given probability distribution I need the Huffman tree to be identical regardless of permutations of the input data.
Here is a picture of what's happening vs what I want:
Basically I want to know if it's possible to preserve the relative order of the items from the list to the tree. If not, why is that so?
For reference, I'm using the Huffman tree to generate sub groups according to a division of probability, so that I can run the search() procedure below. Notice that the data in the merge() sub-routine is combined, along with the weight. The codewords themselves aren't as important as the tree (which should preserve the relative order).
For example if I generate the following Huffman codes:
probabilities = [0.30, 0.25, 0.20, 0.15, 0.10]
items = ['a','b','c','d','e']
items = zip(items, probabilities)
t = encode(items)
d,l = hi.search(t)
print(d)
Using the following Class:
import pdb
from heapq import heapify, heappush, heappop

class Node(object):
    left = None
    right = None
    weight = None
    data = None
    code = None

    def __init__(self, w, d):
        self.weight = w
        self.data = d

    def set_children(self, ln, rn):
        self.left = ln
        self.right = rn

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" % (self.data, self.code, self.left, self.right)

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

    def merge(self, other):
        total_freq = self.weight + other.weight
        new_data = self.data + other.data
        return Node(total_freq, new_data)

    def index(self, node):
        return node.weight

def encode(symbfreq):
    pdb.set_trace()
    tree = [Node(sym, wt) for wt, sym in symbfreq]
    heapify(tree)
    while len(tree) > 1:
        lo, hi = heappop(tree), heappop(tree)
        n = lo.merge(hi)
        n.set_children(lo, hi)
        heappush(tree, n)
    tree = tree[0]

    def assign_code(node, code):
        if node is not None:
            node.code = code
        if isinstance(node, Node):
            assign_code(node.left, code + '0')
            assign_code(node.right, code + '1')

    assign_code(tree, '')
    return tree
I get:
'a'->11
'b'->01
'c'->00
'd'->101
'e'->100
However, an assumption I've made in the search algorithm is that more probable items get pushed toward the left: that is I need 'a' to have the '00' codeword - and this should always be the case regardless of any permutation of the 'abcde' sequence. An example output is:
codewords = {'a':'00', 'b':'01', 'c':'10', 'd':'110', 'e':'111'}
(N.b even though the codeword for 'c' is a suffix for 'd' this is ok).
For completeness, here is the search algorithm:
def search(tree):
    print(tree)
    pdb.set_trace()
    current = tree.left
    other = tree.right
    loops = 0
    while current:
        loops += 1
        print(current)
        if current.data != 0 and current is not None and other is not None:
            previous = current
            current = current.left
            other = previous.right
        else:
            previous = other
            current = other.left
            other = other.right
    return previous, loops
It works by searching for the 'leftmost' 1 in a group of 0s and 1s - the Huffman tree has to put more probable items on the left. For example if I use the probabilities above and the input:
items = [1,0,1,0,0]
Then the index of the item returned by the algorithm is 2 - which isn't what should be returned (0 should, as it's leftmost).
The usual practice is to use Huffman's algorithm only to generate the code lengths. Then a canonical process is used to generate the codes from the lengths. The tree is discarded. Codes are assigned in order from shorter codes to longer codes, and within a code, the symbols are sorted. This gives the codes you are expecting, a = 00, b = 01, etc. This is called a Canonical Huffman code.
The main reason this is done is to make the transmission of the Huffman code more compact. Instead of sending the code for each symbol along with the compressed data, you only need to send the code length for each symbol. Then the codes can be reconstructed on the other end for decompression.
A Huffman tree is not normally used for decoding either. With a canonical code, simple comparisons to determine the length of the next code, and an index using the code value will take you directly to the symbol. Or a table-driven approach can avoid the search for the length.
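As a concrete illustration of that canonical assignment (a minimal sketch; the code-length dict is assumed to come from a prior Huffman pass):

```python
def canonical_codes(sym2len):
    # Assign codes shortest-length-first; within a length group the
    # symbols are sorted, and the running code value is left-shifted
    # whenever the length grows.
    codes = {}
    code = prev_len = 0
    for length, sym in sorted((l, s) for s, l in sym2len.items()):
        code <<= (length - prev_len)   # shift by 0 within a length group
        codes[sym] = format(code, '0%db' % length)
        code += 1
        prev_len = length
    return codes

# Lengths matching the example: a, b, c get 2 bits; d, e get 3.
print(canonical_codes({'a': 2, 'b': 2, 'c': 2, 'd': 3, 'e': 3}))
```

With those lengths this yields exactly the expected assignment a = 00, b = 01, c = 10, d = 110, e = 111, with no tree involved at all.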
As for your tree, there are arbitrary choices being made when there are equal frequencies. In particular, on the second step the first node pulled is c with probability 0.2, and the second node pulled is b with probability 0.25. However it would have been equally valid to pull, instead of b, the node that was made in the first step, (e,d), whose probability is also 0.25. In fact that is what you'd prefer for your desired end state. Alas, you have relinquished the control of that arbitrary choice to the heapq library.
(Note: since you are using floating point values, 0.1 + 0.15 is not necessarily exactly equal to 0.25. Though it turns out it is. As another example, 0.1 + 0.2 is not equal to 0.3. You would be better off using integers for the frequencies if you want to see what happens when sums of frequencies are equal to other frequencies or sums of frequencies. E.g. 6,5,4,3,2.)
Some of the wrong ordering can be fixed by fixing some mistakes: change lo.merge(hi) to hi.merge(lo), and reverse the order of the bits to: assign_code(node.left, code+'1') followed by assign_code(node.right, code+'0'). Then at least a gets assigned 00, d is before e, and b is before c. The ordering is then adebc.
Now that I think about it, even if you pick (e,d) over b, e.g by setting the probability of b to 0.251, you still don't get the complete order that you're after. No matter what, the probability of (e,d) (0.25) is greater than the probability of c (0.2). So even in that case, the final ordering would be (with the fixes above) abdec instead of your desired abcde. So it is not possible to get what you want assuming a consistent tree ordering and bit assignment with respect to the probabilities of groups of symbols. E.g., assuming that for each branch the stuff on the left has a greater or equal probability than the stuff on the right, and 0 is always assigned to left and 1 is always assigned to right. You would need to do something different.
The different thing that comes to mind is what I said at the start of the answer. Use the Huffman algorithm just to get the code lengths. Then you can assign the codes to the symbols in whatever order you like, and build a new tree. That would be much easier than trying to come up with some sort of scheme to coerce the original tree to be what you want, and proving that that works in all cases.
I'll flesh out what Mark Adler said with working code. Everything he said is right :-) The high points:
You must not use floating-point weights, or any other scheme that loses information about weights. Use integers. Simple and correct. If, e.g., you have 3-digit floating probabilities, convert each to an integer via int(round(the_probability * 1000)), then maybe fiddle them to ensure the sum is exactly 1000.
heapq heaps are not "stable": nothing is defined about which item is popped if multiple items have the same minimal weight.
So you can't get what you want while building the tree.
A small variation of "canonical Huffman codes" appears to be what you do want. Constructing a tree for that is a long-winded process, but each step is straightforward enough. The first tree built is thrown away: the only information taken from it is the lengths of the codes assigned to each symbol.
Running:
syms = ['a','b','c','d','e']
weights = [30, 25, 20, 15, 10]
t = encode(syms, weights)
print t
prints this (formatted for readability):
[abcde,,
  ([ab,0,
    ([a,00,(None),(None)]),
    ([b,01,(None),(None)])]),
  ([cde,1,
    ([c,10,(None),(None)]),
    ([de,11,
      ([d,110,(None),(None)]),
      ([e,111,(None),(None)])])])]
Best I understand, that's exactly what you want. Complain if it isn't ;-)
EDIT: there was a bug in the assignment of canonical codes, which didn't show up unless weights were very different; fixed it.
class Node(object):
    def __init__(self, data=None, weight=None,
                 left=None, right=None,
                 code=None):
        self.data = data
        self.weight = weight
        self.left = left
        self.right = right
        self.code = code

    def is_symbol(self):
        return self.left is self.right is None

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" % (self.data,
                                      self.code,
                                      self.left,
                                      self.right)

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

def encode(syms, weights):
    from heapq import heapify, heappush, heappop
    tree = [Node(data=s, weight=w)
            for s, w in zip(syms, weights)]
    sym2node = {s.data: s for s in tree}
    heapify(tree)
    while len(tree) > 1:
        a, b = heappop(tree), heappop(tree)
        heappush(tree, Node(weight=a.weight + b.weight,
                            left=a, right=b))

    # Compute code lengths for the canonical coding.
    sym2len = {}
    def assign_codelen(node, codelen):
        if node is not None:
            if node.is_symbol():
                sym2len[node.data] = codelen
            else:
                assign_codelen(node.left, codelen + 1)
                assign_codelen(node.right, codelen + 1)
    assign_codelen(tree[0], 0)

    # Create canonical codes, but with a twist: instead
    # of ordering symbols alphabetically, order them by
    # their position in the `syms` list.
    # Construct a list of (codelen, index, symbol) triples.
    # `index` breaks ties so that symbols with the same
    # code length retain their original ordering.
    triples = [(sym2len[name], i, name)
               for i, name in enumerate(syms)]
    code = oldcodelen = 0
    for codelen, _, name in sorted(triples):
        if codelen > oldcodelen:
            code <<= (codelen - oldcodelen)
        sym2node[name].code = format(code, "0%db" % codelen)
        code += 1
        oldcodelen = codelen

    # Create a tree corresponding to the new codes.
    tree = Node(code="")
    dir2attr = {"0": "left", "1": "right"}
    for snode in sym2node.values():
        scode = snode.code
        codesofar = ""
        parent = tree
        # Walk the tree creating any needed interior nodes.
        for d in scode:
            assert parent is not None
            codesofar += d
            attr = dir2attr[d]
            child = getattr(parent, attr)
            if codesofar == scode:
                # We're at the leaf position.
                assert child is None
                setattr(parent, attr, snode)
            elif child is not None:
                assert child.code == codesofar
            else:
                child = Node(code=codesofar)
                setattr(parent, attr, child)
            parent = child

    # Finally, paste the `data` attributes together up
    # the tree. Why? Don't know ;-)
    def paste(node):
        if node is None:
            return ""
        elif node.is_symbol():
            return node.data
        else:
            result = paste(node.left) + paste(node.right)
            node.data = result
            return result
    paste(tree)

    return tree
Duplicate symbols
Could I swap the sym2node dict to an ordereddict to deal with
repeated 'a'/'b's etc?
No, and for two reasons:
No mapping type supports duplicate keys; and,
The concept of "duplicate symbols" makes no sense for Huffman encoding.
So, if you're determined ;-) to pursue this, first you have to ensure that symbols are unique. Just add this line at the start of the function:
syms = list(enumerate(syms))
For example, if the syms passed in is:
['a', 'b', 'a']
that will change to:
[(0, 'a'), (1, 'b'), (2, 'a')]
All symbols are now 2-tuples, and are obviously unique since each starts with a unique integer. The only thing the algorithm cares about is that symbols can be used as dict keys; it couldn't care less whether they're strings, tuples, or any other hashable type that supports equality testing.
So nothing in the algorithm needs to change. But before the end, we'll want to restore the original symbols. Just insert this before the paste() function:
def restore_syms(node):
    if node is None:
        return
    elif node.is_symbol():
        node.data = node.data[1]
    else:
        restore_syms(node.left)
        restore_syms(node.right)

restore_syms(tree)
That simply walks the tree and strips the leading integers off the symbols' .data members. Or, perhaps simpler, just iterate over sym2node.values(), and transform the .data member of each.
