As an exercise I'm trying to encode some symbols using Huffman trees, but using my own class instead of Python's built-in data types.
Here is my node class:
class Node(object):
    left = None
    right = None
    weight = None
    data = None
    code = ''
    length = len(code)

    def __init__(self, d, w, c):
        self.data = d
        self.weight = w
        self.code = c

    def set_children(self, ln, rn):
        self.left = ln
        self.right = rn

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" % (self.data, self.code, self.left, self.right)

    def __cmp__(self, a):
        return cmp(self.code, a.code)

    def __getitem__(self):
        return self.code
and here is the encoding function:
def encode(symbfreq):
    tree = [Node(sym, wt, '') for sym, wt in symbfreq]
    heapify(tree)
    while len(tree) > 1:
        lo, hi = sorted([heappop(tree), heappop(tree)])
        lo.code = '0' + lo.code
        hi.code = '1' + hi.code
        n = Node(lo.data + hi.data, lo.weight + hi.weight, lo.code + hi.code)
        n.set_children(lo, hi)
        heappush(tree, n)
    return tree[0]
(Note that the data field will eventually contain a set() of all the items in the children of a node. It just contains a sum for the moment while I get the encoding correct.)
Here is the previous function I had for encoding the tree:
def encode(symbfreq):
    tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
    heapq.heapify(tree)
    while len(tree) > 1:
        lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))
However, I've noticed that my new procedure is incorrect: it gives the top nodes the longest codewords instead of the final leaves, and it doesn't produce the same tree for permutations of the input symbols, i.e. the following don't produce the same tree (when run with the new encoding function):
input1 = [(1,0.25),(0,0.25),(0,0.25),(0,0.125),(0,0.125)]
input2 = [(0,0.25),(0,0.25),(0,0.25),(1,0.125),(0,0.125)]
I'm finding I'm really bad at avoiding these kinds of off-by-one/ordering bugs - how might I go about sorting this out in the future?
There's more than one oddity ;-) in this code, but I think your primary problem is this:
def __cmp__(self, a):
    return cmp(self.code, a.code)
Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Nodes by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.
def __cmp__(self, a):
    return cmp(self.weight, a.weight)
For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0 and one 1). How can you possibly tell whether it's working or not?
Inside the loop, this is strained:
lo, hi = sorted([heappop(tree), heappop(tree)])
Given the repair to __cmp__, that's easier as:
lo = heappop(tree)
hi = heappop(tree)
Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi must be true.
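A tiny illustration of why the sort is redundant - two successive pops from a heap already come out in nondecreasing order:

```python
from heapq import heapify, heappop

h = [5, 1, 3, 2]
heapify(h)        # heap property: smallest element at h[0]
lo = heappop(h)   # pops the smallest (1)
hi = heappop(h)   # pops the next smallest (2)
# lo <= hi is guaranteed without any extra sorting
```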
I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__ should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.
More
About:
it gives the top nodes the longest codewords instead of the final leaves,
This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher its weight, and the shorter its code should be. But you're making codes longer and longer as the process goes on, when they should be getting shorter and shorter.
You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.
So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:
from heapq import heappush, heappop, heapify

class Node(object):
    def __init__(self, weight, left, right):
        self.weight = weight
        self.left = left
        self.right = right
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

class Symbol(object):
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

def encode(symbfreq):
    # return pair (sym2object, tree), where
    # sym2object is a dict mapping a symbol name to its Symbol object,
    # and tree is the root of the Huffman tree
    sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
    tree = sym2object.values()
    heapify(tree)
    while len(tree) > 1:
        lo = heappop(tree)
        hi = heappop(tree)
        heappush(tree, Node(lo.weight + hi.weight, lo, hi))
    tree = tree[0]

    def assigncode(node, code):
        node.code = code
        if isinstance(node, Node):
            assigncode(node.left, code + "0")
            assigncode(node.right, code + "1")

    assigncode(tree, "")
    return sym2object, tree

i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
    print v.name, v.code
That prints:
a 010
c 00
b 011
e 11
d 10
So, as hoped, the symbols with the highest weights have the shortest codes.
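Note that the code above is Python 2 (`__cmp__`, `cmp`, and the `print` statement are all gone in Python 3). A Python 3 sketch of the same program - my adaptation, not the answerer's - defines `__lt__` instead, which is all `heapq` needs, and wraps `dict.values()` in `list`:

```python
from heapq import heappush, heappop, heapify

class Node:
    def __init__(self, weight, left, right):
        self.weight = weight
        self.left = left
        self.right = right
        self.code = None

    def __lt__(self, other):            # heapq only needs "<" in Python 3
        return self.weight < other.weight

class Symbol:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.code = None

    def __lt__(self, other):
        return self.weight < other.weight

def encode(symbfreq):
    sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
    tree = list(sym2object.values())    # values() is a view in Python 3
    heapify(tree)
    while len(tree) > 1:
        lo = heappop(tree)
        hi = heappop(tree)
        heappush(tree, Node(lo.weight + hi.weight, lo, hi))
    tree = tree[0]

    def assigncode(node, code):
        node.code = code
        if isinstance(node, Node):
            assigncode(node.left, code + "0")
            assigncode(node.right, code + "1")

    assigncode(tree, "")
    return sym2object, tree

s2o, t = encode([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])
```

The code *lengths* come out the same as in the Python 2 run above (the 0/1 bits within a tie may differ, since ties are broken by heap layout).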
I suspect the problem is in this segment:
lo.code = '0'+lo.code
hi.code = '1'+hi.code
Both hi and lo can be intermediate nodes, whereas you need to update the codes at the leaves, where the actual symbols are. I think you should not maintain any codes during construction of the Huffman tree, but instead get the codes of individual symbols after construction by just traversing the tree.
Here's my pseudocode for encode:

tree = ConstructHuffman(); // Same as your current code, but without the code field

def encode(tree, symbol):
    if tree.data == symbol:
        return ''
    else if tree.left.data.contains(symbol):
        return '0' + encode(tree.left, symbol)
    else:
        return '1' + encode(tree.right, symbol)

(Returning the empty string at the leaf, rather than None, so the concatenations work out.)
Calculate codes for all symbols using the above encode method, and then you can use them for encoding.
Note: change the comparison function from comparing codes to comparing weights.
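A runnable sketch of that pseudocode, assuming (as the question plans) that each node's data field holds the set of symbols in its subtree - the class and function names here are mine:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data      # set of all symbols in this subtree
        self.left = left
        self.right = right

def symbol_code(node, symbol):
    # Walk from the root toward the leaf holding `symbol`,
    # emitting '0' for a left branch and '1' for a right branch.
    if node.left is None and node.right is None:
        return ''                                   # reached the leaf
    if symbol in node.left.data:
        return '0' + symbol_code(node.left, symbol)
    return '1' + symbol_code(node.right, symbol)

# Tiny hand-built tree: 'a' on the left, ('b', 'c') on the right.
root = Node({'a', 'b', 'c'},
            Node({'a'}),
            Node({'b', 'c'}, Node({'b'}), Node({'c'})))
```

With this tree, `symbol_code(root, 'a')` gives `'0'`, `'b'` gives `'10'`, and `'c'` gives `'11'`.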
I'm learning recursion through binary trees with Python.
I'm not understanding the discrepancy in the output of these two functions. In the first one, I'm coding the traversal to return a list; in the second, a string.
For the list:
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

class Tree:
    def __init__(self):
        self.root = None

    def inorder(self, data, traversal=[]):
        if data:
            self.inorder(data.left, traversal)
            traversal.append(str(data.value))
            self.inorder(data.right, traversal)
        return traversal

"""
      1
     / \
    2   3
   / \
  4   5
"""
thing = Tree()
thing.root = Node(1)
thing.root.left = Node(2)
thing.root.right = Node(3)
thing.root.left.left = Node(4)
thing.root.left.right = Node(5)

print(thing.inorder(thing.root))
['4', '2', '5', '1', '3']
However, if i modify the inorder function to return a string:
def inorder(self, data, traversal=""):
    if data:
        self.inorder(data.left, traversal)
        traversal += str(data.value)
        self.inorder(data.right, traversal)
    return traversal
'1'
It only works if i assign the output of the recursive calls to traversal, like this:
def inorder(self, data, traversal=""):
    if data:
        traversal = self.inorder(data.left, traversal)
        traversal += str(data.value)
        traversal = self.inorder(data.right, traversal)
    return traversal
'42513'
I'm not understanding why the behavior is different if i append to a list rather than concatenate a string. Can someone explain it to me?
Strings in Python are immutable; lists are mutable. Your traversal accumulator relies on mutation to work. Rather than relying on mutating your "traversal" data structure, you can modify the function to actually use the return values:

def inorder(self, data, traversal=""):
    if data:
        t1 = self.inorder(data.left, "")
        t2 = str(data.value)   # convert before concatenating
        t3 = self.inorder(data.right, "")
        traversal = t1 + t2 + t3
    return traversal
You could (probably) use the same technique for lists. And at that point, you would no longer need to pass the accumulator along on the recursive calls.
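A minimal illustration of the underlying difference (hypothetical helper names):

```python
def grow_list(acc):
    acc.append('x')   # mutates the caller's list in place

def grow_str(acc):
    acc += 'x'        # rebinds the LOCAL name only; caller's string unchanged

lst, s = [], ''
grow_list(lst)
grow_str(s)
# lst is now ['x'], but s is still '' - which is exactly why the
# list-based inorder "worked" and the string-based one returned '1'.
```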
I want to print my binary tree vertically, with random numbers as the nodes, but the output I get is not what I want.
For example I receive something like this:
497985
406204
477464
But program should print this:
.....497985
406204
.....477464
I don't know where the problem is. I'm a beginner at programming, so if anyone can help me it would be great.
from numpy.random import randint
import random
from timeit import default_timer as timer

class binarytree:
    def __init__(self):
        self.elem = 0
        self.left = None
        self.right = None

def printtree(tree, h):
    if tree is not None:
        printtree(tree.right, h + 1)
        for i in range(1, h):
            print(end=".....")
        print(tree.elem)
        printtree(tree.left, h + 1)

def generate(tree, N, h):
    if N == 0:
        tree = None
    else:
        tree = binarytree()
        x = randint(0, 1000000)
        tree.elem = int(x)
        generate(tree.left, N // 2, h)
        generate(tree.right, N - N // 2 - 1, h)
        printtree(tree, h)

tree = binarytree()
generate(tree, 3, 0)
I've re-written your code to be a little more Pythonic, and fixed what appeared
to be some bugs:
from random import randrange

class BinaryTree:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

    @classmethod
    def generate_random_tree(cls, depth):
        if depth > 0:
            return cls(randrange(1000000),
                       cls.generate_random_tree(depth - 1),
                       cls.generate_random_tree(depth - 1))
        return None

    def __str__(self):
        """
        Overloaded method to format tree as string
        """
        return "\n".join(self._format())

    def _format(self, cur_depth=0):
        """
        Format tree as string given current depth. This is a generator of
        strings which represent lines of the representation.
        """
        if self.right is not None:
            yield from self.right._format(cur_depth + 1)
        yield "{}{}".format("." * cur_depth, self.val)
        if self.left is not None:
            yield from self.left._format(cur_depth + 1)

print(BinaryTree.generate_random_tree(4))
print()
print(BinaryTree(1,
                 BinaryTree(2),
                 BinaryTree(3,
                            BinaryTree(4),
                            BinaryTree(5))))
This outputs a tree, which grows from the root at the middle-left to leaves at
the right:
...829201
..620327
...479879
.746527
...226199
..463199
...498695
987559
...280755
..168727
...603817
.233132
...294525
..927571
...263402
..5
.3
..4
1
.2
Here it's printed one random tree, and one tree that I manually constructed.
I've written it so that the "right" subtree gets printed on top - you can switch
this around if you want.
You can make some further adjustments to this by making more dots at once (which
I think was your h parameter?), or adapting bits into whatever program
structure you want.
This code works because it makes sure it builds a whole tree first, and then
later prints it. This clear distinction makes it easier to make sure you're
getting all of it right. Your code was a little confused in this regard - your
generate function did instantiate a new BinaryTree at each call, but you
never attached any of them to each other! This is because when you do
tree = ... inside your function, all you're doing is changing what the local
name tree points to - it doesn't change the attribute of the tree from higher
up that you pass to it!
Your generate also calls printtree every time, but printtree also calls
itself... This just doesn't seem like it was entirely right.
In my program, this bit generates the random tree:
@classmethod
def generate_random_tree(cls, depth):
    if depth > 0:
        return cls(randrange(1000000),
                   cls.generate_random_tree(depth - 1),
                   cls.generate_random_tree(depth - 1))
    return None
The classmethod bit is just some object-oriented Python stuff that isn't very
important for the algorithm. The key thing to see is that it generates two new
smaller trees, and they end up being the left and right attributes of the
tree it returns. (This also works because I did a little re-write of
__init__).
This bit basically prints the tree:
def _format(self, cur_depth=0):
    """
    Format tree as string given current depth. This is a generator of
    strings which represent lines of the representation.
    """
    if self.right is not None:
        yield from self.right._format(cur_depth + 1)
    yield "{}{}".format("." * cur_depth, self.val)
    if self.left is not None:
        yield from self.left._format(cur_depth + 1)
It's a generator of lines - but you could make it directly print the tree by
changing each yield ... to print(...) and removing the yield froms. The
key things here are the cur_depth, which lets the function be aware of how
deep in the tree it currently is, so it knows how many dots to print, and the
fact that it recursively calls itself on the left and right subtrees (and you
have to check if they're None or not, of course).
(This part is what you could modify to re-introduce h).
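If you'd rather avoid generators entirely, `_format` can also be sketched as a plain function that builds the list of lines outright (the stand-in class and names here are mine, simplified from the answer's):

```python
class BT:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def format_lines(node, cur_depth=0):
    # Same right-root-left traversal as _format, but returning a list.
    if node is None:
        return []
    return (format_lines(node.right, cur_depth + 1)
            + ["." * cur_depth + str(node.val)]
            + format_lines(node.left, cur_depth + 1))

t = BT(1, BT(2), BT(3, BT(4), BT(5)))
print("\n".join(format_lines(t)))
```

Run on the manually constructed tree from the answer, this prints the same five lines as before (`..5`, `.3`, `..4`, `1`, `.2`).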
The problem is about finding the max depth of a binary tree using recursion, originally from LeetCode: https://leetcode.com/problems/maximum-depth-of-binary-tree/solution/
I'm trying to understand the code by walking through a real example: root = 3, left child = 9, right child = null. It should return 2.
Specifically, I don't quite understand how left_height gets the int value of 1. I understand that right_height is 0 because the right child is None.
It would be great if someone could walk through the example with real values. I understand the algorithm well; I'm just not very familiar with manipulating Python objects.
# Definition for a binary tree node.
# class TreeNode(object):
#     def __init__(self, x):
#         self.val = x
#         self.left = None
#         self.right = None

class Solution:
    def maxDepth(self, root):
        """
        :type root: TreeNode
        :rtype: int
        """
        if root is None:
            return 0
        else:
            left_height = self.maxDepth(root.left)
            right_height = self.maxDepth(root.right)
            return max(left_height, right_height) + 1
The left node has no further left or right children, so it returns max(0, 0) + 1, or 1.
This call comes back up the tree to the root, which returns max(1, 0) + 1, so the final result is 2.
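Tracing the question's example with a runnable version (a standalone function equivalent to the Solution method, which is my restating of it):

```python
class TreeNode:
    def __init__(self, x):
        self.val = x
        self.left = None
        self.right = None

def max_depth(root):
    # Same logic as Solution.maxDepth, written as a plain function.
    if root is None:
        return 0
    return max(max_depth(root.left), max_depth(root.right)) + 1

# root = 3, left child = 9, right child stays None
root = TreeNode(3)
root.left = TreeNode(9)

# max_depth(root.left): both children None -> max(0, 0) + 1 = 1
# max_depth(root):      max(1, 0) + 1 = 2
```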
I followed the link; the particular numbers are irrelevant, so

    3
   / \
  9  20
    /  \
   15   7

is the same as

    .
   / \
  .   .
     / \
    .   .
I'm currently working on implementing a Fibonacci heap in Python for my own personal development. While writing up the object class for a circular, doubly linked-list, I ran into an issue that I wasn't sure of.
For fast membership testing of the linked-list (in order to perform operations like 'remove' and 'merge' faster), I was thinking of adding a hash-table (a python 'set' object) to my linked-list class. See my admittedly very imperfect code below for how I did this:
class Node:
    def __init__(self, value):
        self.value = value
        self.degree = 0
        self.p = None
        self.child = None
        self.mark = False
        self.next = self
        self.prev = self

    def __lt__(self, other):
        return self.value < other.value

class Linked_list:
    def __init__(self):
        self.root = None
        self.nNodes = 0
        self.members = set()

    def add_node(self, node):
        if self.root == None:
            self.root = node
        else:
            self.root.next.prev = node
            node.next = self.root.next
            self.root.next = node
            node.prev = self.root
            if node < self.root:
                self.root = node
        self.members.add(node)
        self.nNodes = len(self.members)

    def find_min(self):
        min = None
        for element in self.members:
            if min == None or element < min:
                min = element
        return min

    def remove_node(self, node):
        if node not in self.members:
            raise ValueError('node not in Linked List')
        node.prev.next, node.next.prev = node.next, node.prev
        self.members.remove(node)
        if self.root not in self.members:
            self.root = self.find_min()
        self.nNodes -= 1

    def merge_linked_list(self, LL2):
        for element in self.members & LL2.members:
            self.remove_node(element)
        self.root.prev.next = LL2.root
        LL2.root.prev.next = self.root
        self.root.prev, LL2.root.prev = LL2.root.prev, self.root.prev
        if LL2.root < self.root:
            self.root = LL2.root
        self.members = self.members | LL2.members
        self.nNodes = len(self.members)

    def print_values(self):
        print self.root.value
        j = self.root.next
        while j is not self.root:
            print j.value
            j = j.next
My question is, does the hash table take up double the amount of space that just implementing the linked list without the hash table? When I look at the Node objects in the hash table, they seem to be in the exact same memory location that they are when just independent node objects. For example, if I create a node:
In: n1 = Node(5)
In: print n1
Out: <__main__.Node instance at 0x1041aa320>
and then put this node in a set:
In: s1 = set()
In: s1.add(n1)
In: print s1
Out: <__main__.Node instance at 0x1041aa320>
which is the same memory location. So it seems like the set doesn't copy the node.
My question is, what is the space complexity for a linked list of size n with a hash-table that keeps track of elements. Is it n or 2n? Is there anything elementary wrong about using a hash table to keep track of elements.
I hope this isn't a duplicate. I tried searching for a post that answered this question, but didn't find anything satisfactory.
Check "In-memory size of a Python structure" and "How do I determine the size of an object in Python?" for complete answers on determining the size of objects.
I got these small results on a 64-bit machine with Python 3:
>>> import sys
>>> sys.getsizeof (1)
28
>>> sys.getsizeof (set())
224
>>> sys.getsizeof (set(range(100)))
8416
The results are in bytes. This can give you a hint about how big sets are (they are pretty big).
My question is, what is the space complexity for a linked list of size n with a hash-table that keeps track of elements. Is it n or 2n? Is there anything elementary wrong about using a hash table to keep track of elements.
Complexity calculations never distinguish between n and 2n; optimisation does. And it's commonly said that "premature optimisation is the root of all evil", to warn about potential optimisation pitfalls. So do as you think is best for the supported operations.
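On the question of whether the set doubles the space: a set stores references to the node objects, not copies, so the extra cost is one reference per element plus the hash table's overhead - still O(n) overall. A quick check (my sketch, mirroring the question's experiment):

```python
class Node:
    def __init__(self, value):
        self.value = value

n1 = Node(5)
s1 = {n1}
member = next(iter(s1))
# The set's only element is the very same object as n1, not a copy,
# so the linked-list-plus-set design stores each node once.
```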
I'm using a binary tree described in this book:
Problem Solving with Algorithms and Data Structures
class BinaryTree:
    def __init__(self, rootObj):
        self.key = rootObj
        self.leftChild = None
        self.rightChild = None
There is already a preorder traversal method defined as follows.
def preorder(tree):
    if tree:
        print(tree.self.key)
        preorder(tree.getLeftChild())
        preorder(tree.getRightChild())
I just want to add a return value of the list of nodes visited. So I can do something like
for i in preorder(tree):
    etc...
I'm having trouble returning a list from a recursive method; the recursion stops as soon as it hits the 'return'. I've tried variations using
return [tree.self.key] + preorder()
Or
yield ...
Any ideas?
Are you sure you want tree.self.key and not simply tree.key when you print?
Otherwise, a solution with yield from (Python 3.3+):
def preorder(tree):
    if tree:
        yield tree
        yield from preorder(tree.getLeftChild())
        yield from preorder(tree.getRightChild())
Or with simple yield:
def preorder(tree):
    if tree:
        yield tree
        for e in preorder(tree.getLeftChild()):
            yield e
        for e in preorder(tree.getRightChild()):
            yield e
Note that using yield or yield from transforms the function into a generator function; in particular, if you want an actual list (for indexing, slicing, or displaying for instance), you'll need to explicitly create it: list(preorder(tree)).
If you have a varying number of children, it's easy to adapt:
def preorder(tree):
    if tree:
        yield tree
        for child in tree.children:
            yield from preorder(child)
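To show the generator version in action, here's a usage sketch with a minimal stand-in for the book's BinaryTree class (accessor names follow the question; the sample tree is mine):

```python
class BinaryTree:
    def __init__(self, rootObj, left=None, right=None):
        self.key = rootObj
        self.leftChild = left
        self.rightChild = right

    def getLeftChild(self):
        return self.leftChild

    def getRightChild(self):
        return self.rightChild

def preorder(tree):
    # Generator version: yields nodes in root, left, right order.
    if tree:
        yield tree
        yield from preorder(tree.getLeftChild())
        yield from preorder(tree.getRightChild())

t = BinaryTree(1, BinaryTree(2, BinaryTree(4)), BinaryTree(3))
keys = [node.key for node in preorder(t)]
```

Iterating directly with `for node in preorder(t):` works too; `list(preorder(t))` materialises the traversal when you need indexing or slicing.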
You have to actually return a value from the recursive function (currently, it's just printing values). And you should build the list as you go, and maybe clean the code up a bit - something like this:
def preorder(tree):
    if not tree:
        return []
    # optionally, you can print the key in this line: print(tree.key)
    return [tree.key] + preorder(tree.leftChild) + preorder(tree.rightChild)
You can add second argument to preorder function, something like
def preorder(tree, visnodes):
    if tree:
        visnodes.append(tree.key)
        print(tree.key)
        preorder(tree.getLeftChild(), visnodes)
        preorder(tree.getRightChild(), visnodes)

...
vn = []
preorder(tree, vn)
You can simply return a concatenated list for each recursive call, which combines the current key, the left subtree and the right subtree:
def preorder(tree):
    if tree:
        return ([tree.key] + preorder(tree.getLeftChild()) +
                preorder(tree.getRightChild()))
    return []
For an n-ary tree:
def preorder(tree):
    if tree:
        lst = [tree.key]
        for child in tree.children:
            lst.extend(preorder(child))
        return lst
    return []