Finding the length of compressed text (Huffman coding)

Given a text of n characters and a binary tree generated by Huffman coding, where each leaf node has two attributes: a string (the character itself) and an integer (its frequency in the text). The path from the root to any leaf represents its codeword.
I would like to write a recursive function that calculates the length of the compressed text, and to find its Big-O complexity.
For instance, if I have the text
abaccab
and each character has an associated frequency and depth in the Huffman tree:
    4
   / \
 a:3  5
     / \
   b:2 c:2
then the overall length of compressed text is 11
I came up with this, but it seems very crude:
def get_length(node, depth):
    # Leaf node
    if node.left_child is None and node.right_child is None:
        return node.freq*depth
    # Node with only one child
    elif node.left_child is None and node.right_child is not None:
        return get_length(node.right_child, depth+1)
    elif node.right_child is None and node.left_child is not None:
        return get_length(node.left_child, depth+1)
    # Node with two children
    else:
        return get_length(node.left_child, depth+1) + get_length(node.right_child, depth+1)
get_length(root, 0)
Complexity: O(log 2n) where n is the number of characters.
How can I improve this? What would be the complexity in this case?

To find the exact total length of the compressed text, I don't see any way around dealing individually with each unique character and the count of how many times it occurs in the text. That is O(n) work in total, where n is the number of unique characters in the text (which is also the number of leaf nodes in the Huffman tree).
There are several different ways to represent the mapping from Huffman codes to plaintext letters. Your binary tree representation is good for finding the exact total length of the compressed text: a properly constructed Huffman tree has 2*n - 1 nodes in total, where n is the number of unique characters in the text, so a recursive scan through every node takes 2*n - 1 steps, which is also O(n).
def get_length(node, depth):
    # Null link from a node with only one child, either left or right.
    # This must be checked first, before any attribute of node is read.
    if node is None:
        print("not a properly constructed Huffman tree")
        return 0
    # Leaf node
    if node.left_child is None and node.right_child is None:
        return node.freq*depth
    # Node with two children
    return get_length(node.left_child, depth+1) + get_length(node.right_child, depth+1)
get_length(root, 0)
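To try the function on the example from the question, here is a minimal self-contained sketch. The `Node` class is a hypothetical stand-in (the question does not show its node type); only the attribute names `left_child`, `right_child` and `freq` are taken from the code above.

```python
class Node:
    """Hypothetical node type; attribute names follow the question's code."""
    def __init__(self, char=None, freq=0, left_child=None, right_child=None):
        self.char = char
        self.freq = freq
        self.left_child = left_child
        self.right_child = right_child


def get_length(node, depth=0):
    # A null link contributes nothing to the total.
    if node is None:
        return 0
    # Leaf: every occurrence of the character costs `depth` bits.
    if node.left_child is None and node.right_child is None:
        return node.freq * depth
    # Internal node: add up both subtrees, one level deeper.
    return (get_length(node.left_child, depth + 1)
            + get_length(node.right_child, depth + 1))


# Tree for "abaccab": a sits at depth 1, b and c at depth 2.
root = Node(freq=7,
            left_child=Node("a", 3),
            right_child=Node(freq=4,
                             left_child=Node("b", 2),
                             right_child=Node("c", 2)))
print(get_length(root))  # 3*1 + 2*2 + 2*2 = 11
```

Each node is visited exactly once, which is where the O(n) bound in the number of unique characters comes from.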

While finding the length of the compressed text should be O(n) (using a simple len on the encoded string), the time complexity of the encoding itself should be O(n log(n)). The algorithm is as follows:
t1 = FullTree
for each character in uncompressed input do: #O(n)
    tree_lookup(t1, character) #O(log(n))
Looping over the uncompressed input is O(n), while finding a node in a balanced binary tree is O(log(n)) (O(n) in the worst case, when the tree is not balanced). Thus, the result is n*O(log(n)) => O(n log(n)). Also, note that O(log 2n) as the complexity of a lookup is accurate: by the rules of logarithms it can be rewritten as O(log(2)+log(n)) => O(k + log(n)) for some constant k, and since Big-O notation ignores additive constants, O(k + log(n)) => O(log(n)).
You can improve your binary tree by creating a simpler lookup in your tree:
from collections import Counter
class Tree:
    def __init__(self, node1, node2):
        self.right = node1
        self.left = node2
        self.value = sum(getattr(i, 'value', i[-1]) for i in [node1, node2])
    def __contains__(self, _node):
        if self.value == _node:
            return True
        return _node in self.left or _node in self.right
    def __lt__(self, _node): # needed to apply sorted function
        return self.value < getattr(_node, 'value', _node[-1])
    def lookup(self, _t, path = []):
        if self.value == _t:
            return ''.join(map(str, path))
        if self.left and _t in self.left:
            return ''.join(map(str, path+[0])) if isinstance(self.left, tuple) else self.left.lookup(_t, path+[0])
        if self.right and _t in self.right:
            return ''.join(map(str, path+[1])) if isinstance(self.right, tuple) else self.right.lookup(_t, path+[1])
    def __getitem__(self, _node):
        return self.lookup(_node)
s = list('abaccab')
r = sorted(Counter(s).items(), key=lambda x:x[-1])
while len(r) > 1:
    a, b, *_r = r
    r = sorted(_r+[Tree(a, b)])
compressed_text = ''.join(r[0][i] for i in s)
Output:
'10110000101'
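A different technique from the per-character tree lookup above, offered only as a sketch: walk the tree once to build a character-to-code dictionary, so that each character is then encoded with a single dictionary access. The nested-tuple tree representation and the name `build_codes` are assumptions made for this illustration; the tree is chosen so that it yields the same codes as the output above.

```python
def build_codes(tree, prefix=""):
    """Map each character to its codeword with one traversal of the tree.

    Assumed representation: a leaf is a one-character string and an
    internal node is a (left, right) tuple; left adds "0", right adds "1".
    """
    if isinstance(tree, str):
        # A tree consisting of a single leaf still needs a one-bit code.
        return {tree: prefix or "0"}
    left, right = tree
    codes = build_codes(left, prefix + "0")
    codes.update(build_codes(right, prefix + "1"))
    return codes


tree = (("c", "b"), "a")          # a -> 1, b -> 01, c -> 00
codes = build_codes(tree)
encoded = "".join(codes[ch] for ch in "abaccab")
print(encoded, len(encoded))      # 10110000101 11
```

With the table in hand, encoding a text takes time proportional to the total number of output bits, independent of the shape of the tree.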

Related

Can someone help to spot an edge case with error for this algorithm?

I'm solving the 'Non-overlapping Intervals' problem on LeetCode [https://leetcode.com/problems/non-overlapping-intervals/].
In short, we need to find the minimum number of intervals to delete so that the remaining intervals do not overlap (the number of deletions is the requested result).
My solution is to build an augmented interval tree ([https://en.wikipedia.org/wiki/Interval_tree#Augmented_tree]) out of all the intervals (O(n log n) time), then, in a second pass over the intervals, count how many other intervals each interval intersects (also O(n log n) time). The count includes +1 for the self-intersection, but I use it only as a relative metric. I then sort all the intervals by this 'number of intersections with others' metric.
In the last step I take the intervals one by one from the sorted list described above and build a non-overlapping set (with an explicit non-overlap check that uses another instance of the interval tree); the intervals that do not fit form the result set that should be deleted.
Below is the full code of the described solution, ready to try on LeetCode.
The approach works sufficiently fast, BUT sometimes my result is wrong, off by 1. LeetCode doesn't give much feedback, throwing back at me 'expected 810' instead of my result '811'. So I'm still debugging, digging through the 811 intervals.... :)
Even knowing other solutions to this problem, I'd like to find the case on which the described approach fails (it could be a useful edge case in itself). So if someone has seen a similar problem or can just spot it with some 'fresh eyes', it would be much appreciated!
Thanks in advance for any constructive comments and ideas!
The solution code:
from typing import Iterable, List

class Interval:
    def __init__(self, lo: int, hi: int):
        self.lo = lo
        self.hi = hi
class Node:
    def __init__(self, interval: Interval, left: 'Node' = None, right: 'Node' = None):
        self.left = left
        self.right = right
        self.interval = interval
        self.max_hi = interval.hi
class IntervalTree:
    def __init__(self):
        self.root = None
    def __add(self, interval: Interval, node: Node) -> Node:
        if node is None:
            node = Node(interval)
            node.max_hi = interval.hi
            return node
        if node.interval.lo > interval.lo:
            node.left = self.__add(interval, node.left)
        else:
            node.right = self.__add(interval, node.right)
        node.max_hi = max(node.left.max_hi if node.left else 0, node.right.max_hi if node.right else 0, node.interval.hi)
        return node
    def add(self, lo: int, hi: int):
        interval = Interval(lo, hi)
        self.root = self.__add(interval, self.root)
    def __is_intersect(self, interval: Interval, node: Node) -> bool:
        if node is None:
            return False
        if not (node.interval.lo >= interval.hi or node.interval.hi <= interval.lo):
            # print(f'{interval.lo}-{interval.hi} intersects {node.interval.lo}-{node.interval.hi}')
            return True
        if node.left and node.left.max_hi > interval.lo:
            return self.__is_intersect(interval, node.left)
        return self.__is_intersect(interval, node.right)
    def is_intersect(self, lo: int, hi: int) -> bool:
        interval = Interval(lo, hi)
        return self.__is_intersect(interval, self.root)
    def __all_intersect(self, interval: Interval, node: Node) -> Iterable[Interval]:
        if node is None:
            yield from ()
        else:
            if not (node.interval.lo >= interval.hi or node.interval.hi <= interval.lo):
                # print(f'{interval.lo}-{interval.hi} intersects {node.interval.lo}-{node.interval.hi}')
                yield node.interval
            if node.left and node.left.max_hi > interval.lo:
                yield from self.__all_intersect(interval, node.left)
            yield from self.__all_intersect(interval, node.right)
    def all_intersect(self, lo: int, hi: int) -> Iterable[Interval]:
        interval = Interval(lo, hi)
        yield from self.__all_intersect(interval, self.root)
class Solution:
    def eraseOverlapIntervals(self, intervals: List[List[int]]) -> int:
        ranged_intervals = []
        interval_tree = IntervalTree()
        for interval in intervals:
            interval_tree.add(interval[0], interval[1])
        for interval in intervals:
            c = interval_tree.all_intersect(interval[0], interval[1])
            ranged_intervals.append((len(list(c))-1, interval)) # decrement intersection to account self intersection
        interval_tree = IntervalTree()
        res = []
        ranged_intervals.sort(key=lambda t: t[0], reverse=True)
        while ranged_intervals:
            _, interval = ranged_intervals.pop()
            if not interval_tree.is_intersect(interval[0], interval[1]):
                interval_tree.add(interval[0], interval[1])
            else:
                res.append(interval)
        return len(res)

To make a counter example for your algorithm, you can construct a problem where selecting the segment with the fewest number of intersections ruins the solution, like this:
[----][----][----][----]
[-------][----][-------]
[-------]      [-------]
[-------]      [-------]
[-------]      [-------]
Your algorithm will choose the center interval first, which is incompatible with the optimal solution:
[----][----][----][----]
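The picture can be turned into concrete numbers to check the claim. The sketch below is an illustration under assumptions: the coordinates are made up to match the drawing (one unit per character, so intervals that only touch do not overlap), `fewest_overlaps_first` is a simplified O(n^2) stand-in for the interval-tree approach in the question, and `sweep_by_start` is a single pass over the intervals sorted by start point that always drops the overlapping interval reaching farthest to the right.

```python
def overlaps(a, b):
    # Intervals that merely touch at an endpoint do not overlap.
    return a[0] < b[1] and b[0] < a[1]


def fewest_overlaps_first(intervals):
    """Greedy from the question: keep intervals in order of overlap count."""
    counts = [sum(overlaps(a, b) for b in intervals) - 1 for a in intervals]
    order = sorted(range(len(intervals)), key=lambda i: counts[i])
    kept = []
    for i in order:
        if not any(overlaps(intervals[i], k) for k in kept):
            kept.append(intervals[i])
    return len(intervals) - len(kept)


def sweep_by_start(intervals):
    """Sort by start; on overlap, delete whichever interval ends later."""
    extent = None
    deletes = 0
    for lo, hi in sorted(intervals):
        if extent is None or extent <= lo:
            extent = hi
        else:
            deletes += 1
            extent = min(extent, hi)
    return deletes


row1 = [(0, 6), (6, 12), (12, 18), (18, 24)]
row2 = [(0, 9), (9, 15), (15, 24)]
rows3_to_5 = [(0, 9), (15, 24)] * 3
intervals = row1 + row2 + rows3_to_5

print(fewest_overlaps_first(intervals))  # 10
print(sweep_by_start(intervals))         # 9
```

The center interval (9, 15) overlaps only two others, so the first greedy keeps it and can then keep at most three intervals in total, while the top row shows that four can be kept.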
An algorithm that does work is, while there are any overlaps:
Find the left-most point of overlap
Pick any two intervals that overlap that point, and delete the one that extends farthest to the right.
This algorithm is also very simple to implement. You can do it in a single traversal through the list of intervals, sorted by start point:
from typing import List

class Solution:
    def eraseOverlapIntervals(self, intervals: List[List[int]]) -> int:
        intervals.sort()
        extent = None  # right end of the most recently kept interval
        deletes = 0
        for interval in intervals:
            if extent is None or extent <= interval[0]:
                extent = interval[1]
            else:
                # Overlap: delete whichever of the two extends farther right
                deletes += 1
                extent = min(extent, interval[1])
        return deletes

How do I decompress files using Huffman Compression?

I have written a Huffman Compression Algorithm in python, but so far I have only managed to do the compression part (by making a priority queue using heapq). I have created a class called HuffmanCoding which takes in the text to compress. I know that one way of decompressing is saving a dictionary with the characters and their codes to the compressed text file, but I am not sure about how I should do this (especially because my compressed text is stored in a binary file). Does anyone have any pointers or advice to how I could go about decompressing? Below is my compression code.
import heapq

class HeapNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None
    def __lt__(self, other):  # if the frequency of one character is lower than the frequency of another one
        return self.freq < other.freq
    def __eq__(self, other):  # if two characters have the same frequencies
        if other is None:
            return False
        if not isinstance(other, HeapNode):  # checks if the character is a node or not
            return False
        return self.freq == other.freq
class HuffmanCoding:
    def __init__(self, text_to_compress):
        self.text_to_compress = text_to_compress  # text that will be compressed
        self.heap = []
        self.codes = {}  # will store the Huffman code of each character
    def get_frequency(self):  # method to find frequency of each character in text
        frequency_Dictionary = {}  # creates an empty dictionary where frequency of each character will be stored
        for character in self.text_to_compress:  # Iterates through the text to be compressed
            if character in frequency_Dictionary:
                # if character already exists in dictionary, its value is increased by 1
                frequency_Dictionary[character] = frequency_Dictionary[character] + 1
            else:
                frequency_Dictionary[character] = 1  # if character is not present in dictionary, its value is set to 1
        return frequency_Dictionary
    def make_queue(self, frequency):  # creates the priority queue of each character and its associated frequency
        for key in frequency:
            node = HeapNode(key, frequency[key])  # create node (character) and store its frequency alongside it
            heapq.heappush(self.heap, node)  # Push the node into the heap
    def merge_nodes(self):
        # creates HuffmanTree by getting the two minimum nodes and merging them together,
        # until there's only one node left
        while len(self.heap) > 1:
            node1 = heapq.heappop(self.heap)  # pop node from top of heap
            node2 = heapq.heappop(self.heap)  # pop next node which is now at the top of heap
            merged = HeapNode(None, node1.freq + node2.freq)  # merge the two nodes we popped out from heap
            merged.left = node1
            merged.right = node2
            heapq.heappush(self.heap, merged)  # push merged node back into the heap
    def make_codes(self, root, current_code):  # Creates Huffman code for each character
        if root is None:
            return
        if root.char is not None:
            self.codes[root.char] = current_code
        self.make_codes(root.left, current_code + "0")  # Every time you traverse left, add a 0 - Recursive Call
        self.make_codes(root.right, current_code + "1")  # Every time you traverse right, add a 1 - Recursive Call
    def assignCodes(self):  # Assigns codes to each character
        root = heapq.heappop(self.heap)  # extracts root node from heap
        current_code = ""
        self.make_codes(root, current_code)
    def get_compressed_text(self, text):  # Replaces characters in original text with codes
        compressed_text = ""
        for character in text:
            compressed_text += self.codes[character]
        return compressed_text
    def pad_encoded_text(self, compressed_text):
        extra_padding = 8 - len(compressed_text) % 8  # works out how much extra padding is required
        for i in range(extra_padding):
            compressed_text += "0"  # adds the amount of 0's that are required
        padded_info = "{0:08b}".format(extra_padding)
        compressed_text = padded_info + compressed_text
        return compressed_text
    def make_byte_array(self, padded_text):
        byte_array = bytearray()
        for i in range(0, len(padded_text), 8):
            byte_array.append(int(padded_text[i:i + 8], 2))
        return byte_array
    def show_compressed_text(self):
        frequency = self.get_frequency()
        self.make_queue(frequency)
        self.merge_nodes()
        self.assignCodes()
        encoded_text = self.get_compressed_text(self.text_to_compress)
        padded_encoded_text = self.pad_encoded_text(encoded_text)
        byte_array = self.make_byte_array(padded_encoded_text)
        return bytes(byte_array)
    def remove_padding(self, padded_encoded_text):  # removes the padding that was added
        padded_info = padded_encoded_text[:8]
        extra_padding = int(padded_info, 2)
        padded_encoded_text = padded_encoded_text[8:]
        encoded_text = padded_encoded_text[:-1 * extra_padding]
        return encoded_text

First you need to send a description of the Huffman code along with the data. Second you need to use that on the decompression end to decode the Huffman codes.
The most straightforward way is to traverse the tree, sending a 1 bit for every node encountered and a 0 bit followed by the symbol bits for every leaf. On the decompression end you can read that and reconstruct the tree. Then you can use the tree, traversing to the left or right with each bit of data until you get to a leaf. Emit the symbol for that leaf, and then start back at the root of the tree for the next bit.
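The traversal described above could look roughly like the following sketch. It works on strings of '0' and '1' characters for readability, and it rests on assumptions that are not part of the question's code: the tree is represented as nested `(left, right)` tuples with one-character strings as leaves, every symbol fits in 8 bits, and there are at least two distinct symbols. The function names are made up for this illustration.

```python
def serialize(tree):
    """Pre-order walk: '1' for an internal node, '0' + 8 symbol bits for a leaf."""
    if isinstance(tree, str):
        return "0" + format(ord(tree), "08b")
    left, right = tree
    return "1" + serialize(left) + serialize(right)


def deserialize(bits, pos=0):
    """Rebuild the tree; returns (tree, position just past the tree)."""
    if bits[pos] == "0":
        symbol = chr(int(bits[pos + 1:pos + 9], 2))
        return symbol, pos + 9
    left, pos = deserialize(bits, pos + 1)
    right, pos = deserialize(bits, pos)
    return (left, right), pos


def decode(tree, bits):
    """Go left on '0' and right on '1'; emit at a leaf, then restart at the root."""
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):
            out.append(node)
            node = tree
    return "".join(out)


tree = (("c", "b"), "a")              # c -> 00, b -> 01, a -> 1
header = serialize(tree)
rebuilt, end = deserialize(header)
print(rebuilt == tree)                # True
print(decode(rebuilt, "10110000101"))  # abaccab
```

In a real file the header bits would be written in front of the encoded data, and the padding bits would have to be removed before decoding, as `remove_padding` in the question already does.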
The way this is usually done in compression libraries is to discard the tree and instead keep and send just the code lengths for every symbol, compressing that as well. Then on both the compression and decompression side, the same canonical Huffman code is constructed from the lengths. Then a table look up can be used for decoding, for speed.
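As a rough sketch of the canonical-code idea, assuming every code length is at least 1 and ties between symbols of equal length are broken by symbol order: both sides can derive identical codes from the lengths alone. The function names are hypothetical, and the dictionary used for decoding is a simple stand-in for the faster lookup tables that real libraries build.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from a {symbol: code_length} mapping."""
    codes = {}
    code = 0
    prev_len = 0
    # Shorter codes first; ties are broken by symbol so both sides agree.
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len      # append zeros when the length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def decode(bits, lengths):
    """Decode a bit string using only the code lengths."""
    table = {code: sym for sym, code in canonical_codes(lengths).items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in table:            # prefix-free, so the first hit is right
            out.append(table[current])
            current = ""
    return "".join(out)


lengths = {"a": 1, "b": 2, "c": 2}      # lengths taken from a Huffman tree
codes = canonical_codes(lengths)
encoded = "".join(codes[ch] for ch in "abaccab")
print(codes)                            # {'a': '0', 'b': '10', 'c': '11'}
print(decode(encoded, lengths))         # abaccab
```

Only the lengths need to be stored with the compressed data; the actual bit patterns are regenerated on the decompression side.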

Count Number of Good Nodes

problem statement
I am having trouble understanding what is wrong with my code and understanding the constraint below.
My pseudocode:
Traverse the tree Level Order and construct the array representation (input is actually given as a single root, but they use array representation to show the full tree)
iterate over this array representation, skipping null nodes
for each node, let's call it X, iterate upwards until we reach the root checking to see if at any point in the path, parentNode > nodeX, meaning, nodeX is not a good node.
increment counter if the node is good
Constraints:
The number of nodes in the binary tree is in the range [1, 10^5].
Each node's value is between [-10^4, 10^4]
First of all:
My confusion about the constraint is that the automated tests give inputs such as [2,4,4,4,null,1,3,null,null,5,null,null,null,null,5,4,4]. If we follow the rules that the children are at c1 = 2k+1 and c2 = 2k+2 and the parent is at (k-1)//2, then this means that there are nodes with the value null.
Secondly:
For the input above, my code outputs 8 while the expected value is 6, but when I draw the tree from the array, I also think the answer should be 8!
tree of input
# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def goodNodes(self, root: TreeNode) -> int:
        arrRepresentation = []
        queue = []
        queue.append(root)
        # while queue not empty
        while queue:
            # remove node
            node = queue.pop(0)
            if node is None:
                arrRepresentation.append(None)
            else:
                arrRepresentation.append(node.val)
            if node is not None:
                # add left to queue
                queue.append(node.left)
                # add right to queue
                queue.append(node.right)
        print(arrRepresentation)
        goodNodeCounter = 1
        # iterate over array representation of binary tree
        for k in range(len(arrRepresentation)-1, 0, -1):
            child = arrRepresentation[k]
            if child is None:
                continue
            isGoodNode = self._isGoodNode(k, arrRepresentation)
            print('is good: ' + str(isGoodNode))
            if isGoodNode:
                goodNodeCounter += 1
        return goodNodeCounter
    def _isGoodNode(self, k, arrRepresentation):
        child = arrRepresentation[k]
        print('child: '+str(child))
        # calculate index of parent
        parentIndex = (k-1)//2
        isGood = True
        # if we have not reached root node
        while parentIndex >= 0:
            parent = arrRepresentation[parentIndex]
            print('parent: '+ str(parent))
            # calculate index of parent
            parentIndex = (parentIndex-1)//2
            if parent is None:
                continue
            if parent > child:
                isGood = False
                break
        return isGood

Recursion might be easier:
class Node:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right
def good_nodes(root, maximum=float('-inf')):
    if not root: # null-root
        return 0
    is_this_good = maximum <= root.val # is this root a good node?
    maximum = max(maximum, root.val) # update max
    good_from_left = good_nodes(root.left, maximum) if root.left else 0
    good_from_right = good_nodes(root.right, maximum) if root.right else 0
    return is_this_good + good_from_left + good_from_right
tree = Node(2, Node(4, Node(4)), Node(4, Node(1, Node(5, None, Node(5, Node(4), Node(4)))), Node(3)))
print(good_nodes(tree)) # 6
Basically, recursion traverses the tree while updating the maximum number seen so far. At each iteration, the value of a root is compared with the maximum, incrementing the counter if necessary.

Since you wanted to solve with breadth first search:
from collections import deque
from math import inf
class Solution:
    def goodNodes(self, root: TreeNode) -> int:
        if not root:
            return 0
        queue = deque()
        # run BFS, tracking for each node the max value on the path down to its parent
        queue.append((root, -inf))
        res = 0
        while queue:
            current, max_val = queue.popleft()
            if current.val >= max_val:
                res += 1
            if current.left:
                queue.append((current.left, max(max_val, current.val)))
            if current.right:
                queue.append((current.right, max(max_val, current.val)))
        return res
Each queue entry holds a node together with the maximum value seen on the path down to its parent. I could not keep a single global max_val instead; look at this tree:
For the first 3 nodes you would have [3,1,4], and if you kept max_val globally, it would be 4.
The next node would be 3, the leaf node on the left. Since the global max_val is 4, the check 3 >= 4 fails and 3 would wrongly be rejected as a good node. So instead, I keep track of the max_val for each node along the path down to its parent.

The binary heap you provided corresponds to the following hierarchy:
tree = [2,4,4,4,None,1,3,None,None,5,None,None,None,None,5,4,4]
printHeapTree(tree)
    2
   / \
  4   4
 /   / \
4   1   3
         \
          5
In that tree, the items with values 1 and 3 each have an ancestor (the 4 above them) that is greater than themselves. The 5 other nodes are good because none of their ancestors is greater than they are (counting the root as good).
Note that there are values in the list that are unreachable because their parent is null (None) so they are not part of the tree (this could be a copy/paste mistake though). If we replace these None values by something else to make them part of the tree, we can see where the unreachable nodes are located in the hierarchy:
t = [2,4,4,4,'*', 1,3,'*',None, 5,None, None,None,None,5,4,4]
printHeapTree(t)
            2
         __/ \_
        4      4
       / \    / \
      4   *  1   3
     /   /        \
    *   5          5
   / \
  4   4
This is likely where the difference between a result of 8 (not counting root as good) vs 6 (counting root as good) comes from.
You can find the printHeapTree() function here.
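Another possible reading of the input, stated here as an assumption about how the judge serializes trees rather than something confirmed in this thread: the list is a level-order listing in which a null entry marks a missing child and gets no child slots of its own, so the 2k+1 / 2k+2 index rule does not apply. The sketch below builds the tree under that assumption (`build_level_order` is a hypothetical helper) and counts the good nodes.

```python
from collections import deque


class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def build_level_order(values):
    """Build a tree from a level-order list where None means 'no child here'."""
    if not values or values[0] is None:
        return None
    it = iter(values)
    root = TreeNode(next(it))
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for side in ("left", "right"):
            val = next(it, None)      # a missing tail counts as None
            if val is not None:
                child = TreeNode(val)
                setattr(node, side, child)
                queue.append(child)   # only real nodes receive children
    return root


def good_nodes(node, maximum=float("-inf")):
    """Count nodes that are not smaller than any node on the path from the root."""
    if node is None:
        return 0
    here = 1 if node.val >= maximum else 0
    maximum = max(maximum, node.val)
    return here + good_nodes(node.left, maximum) + good_nodes(node.right, maximum)


values = [2, 4, 4, 4, None, 1, 3, None, None, 5, None, None, None, None, 5, 4, 4]
print(good_nodes(build_level_order(values)))  # 6
```

Under this reading all 17 entries are consumed, no value is unreachable, and the count matches the expected 6.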

Generating, traversing and printing binary tree

I generated perfectly balanced binary tree and I want to print it. In the output there are only 0s instead of the data I generated. I think it's because of the line in function printtree that says print(tree.elem), cause in the class self.elem = 0.
How can I connect these two functions generate and printtree?
class BinTree:
    def __init__(self):
        self.elem = 0
        self.left = None
        self.right = None
def generate(pbt, N):
    if N == 0:
        pbt = None
    else:
        pbt = BinTree()
        x = input()
        pbt.elem = int(x)
        generate(pbt.left, N // 2)
        generate(pbt.right, N - N // 2 - 1)
def printtree(tree, h):
    if tree is not None:
        tree = BinTree()
        printtree(tree.right, h+1)
        for i in range(1, h):
            print(end = "......")
        print(tree.elem)
        printtree(tree.left, h+1)
Hope somebody can help me. I am a beginner in coding.
For example:
N=6, pbt=pbt, tree=pbt, h=0
input:
1
2
3
4
5
6
and the output:
......5
............6
1
............4
......2
............3

I'd suggest reading up on: https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/
Basically, there are three ways to traverse a binary tree; in-order, post-order and pre-order.
The issue with your print function is that you're reassigning the tree that is passed in to a new, empty tree:
if tree is not None:
    tree = BinTree()
Right? If tree is not None and holds something, the code replaces it with an empty tree whose elem is 0.
Traversing a tree is actually a lot simpler than you'd imagine. I think the complexity comes from trying to picture in your head how it all works out, but the truth is that traversing a tree can be done in 3 - 4 lines.
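One way the two functions could be connected is sketched below, with some liberties that go beyond the answer above: `generate` returns the node it builds (rebinding the parameter `pbt` inside the function is not visible to the caller), the data comes from an iterator instead of `input()` so the example runs on its own, and a hypothetical helper `tree_lines` returns the output lines instead of printing them directly.

```python
class BinTree:
    def __init__(self, elem=0):
        self.elem = elem
        self.left = None
        self.right = None


def generate(values, n):
    """Build a balanced tree with n nodes, taking data from the iterator `values`."""
    if n == 0:
        return None
    node = BinTree(next(values))
    node.left = generate(values, n // 2)
    node.right = generate(values, n - n // 2 - 1)
    return node                      # hand the new subtree back to the caller


def tree_lines(tree, depth=0):
    """Right subtree first, then the node, then the left subtree."""
    if tree is None:
        return []
    return (tree_lines(tree.right, depth + 1)
            + ["......" * depth + str(tree.elem)]
            + tree_lines(tree.left, depth + 1))


tree = generate(iter([1, 2, 3, 4, 5, 6]), 6)
print("\n".join(tree_lines(tree)))
```

For the input 1 to 6 this prints the same six lines as the expected output in the question.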

Maximum depth of a binary tree in python

I created a tuple from a binary tree and it looks like this:
tuple = (1,(2,(4,5,6),(7,None,8)),(3,9,(10,11,12)))
The tree structure becomes more clear by applying indentation:
(1,
    (2,
        (4,
            5,
            6
        ),
        (7,
            None,
            8
        )
    ),
    (3,
        9,
        (10,
            11,
            12
        )
    )
)
I know how to find the maximum depth of the binary tree using recursive method, but I am trying to find the maximum depth using the tuple I created. Can anyone help me with how to do it?

Recursive method:
a = (1,(2,(4,5,6),(7,None,8)),(3,9,(10,11,12)))
def depth(x):
    if isinstance(x, int) or x is None:
        return 1
    else:
        dL = depth(x[1])
        dR = depth(x[2])
        return max(dL, dR) + 1
print(depth(a))
The idea is to determine the depth of a tree by looking at its left and right subtree. If the node does not have subtrees, a depth of 1 is returned. Else it returns max(depth of right, depth of left) + 1

Here is a tricky but rather efficient solution that will work, provided no element of your data structure is a string containing '(' or ')'.
I would convert the tuple to a string and scan it, counting the nesting depth of the parentheses.
string = str(myTuple)  # myTuple is the tuple from the question
currentDepth = 0
maxDepth = 0
for c in string:
    if c == '(':
        currentDepth += 1
    elif c == ')':
        currentDepth -= 1
    maxDepth = max(maxDepth, currentDepth)
It gives the depth in time linear in the number of characters of the string into which the tuple was converted.
That number should be more or less proportional to the number of elements plus the depth, so the complexity is roughly O(n + d).
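A small check, written as a sketch with made-up function names, shows how the parenthesis count relates to a recursive definition of depth in which a bare value counts as a tree of depth 1. Because a leaf such as 5 has no parentheses of its own, the parenthesis count comes out one lower on the tuple from the question.

```python
def paren_depth(t):
    """Deepest nesting of parentheses in the string form of the tuple."""
    current = deepest = 0
    for c in str(t):
        if c == "(":
            current += 1
            deepest = max(deepest, current)
        elif c == ")":
            current -= 1
    return deepest


def recursive_depth(x):
    """Depth where a bare value (int or None) counts as one level."""
    if not isinstance(x, tuple):
        return 1
    return max(recursive_depth(x[1]), recursive_depth(x[2])) + 1


a = (1, (2, (4, 5, 6), (7, None, 8)), (3, 9, (10, 11, 12)))
print(paren_depth(a), recursive_depth(a))  # 3 4
```

Which of the two numbers is the "right" depth depends on whether leaf values are counted as a level, so it is worth deciding that convention before comparing results.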

I solved this with level-order traversal. If you know level-order traversal, this question and https://leetcode.com/problems/binary-tree-right-side-view/ can be solved with the same technique:
from collections import deque
class Solution:
    def max_depth(self, root):
        if not root:
            return 0
        level = 0
        q = deque([root])
        while q:
            # once we exhaust the for loop, we have traversed all the nodes in the same level,
            # so after the for loop we increase level by 1
            for i in range(len(q)):
                node = q.popleft()
                if node.left:
                    q.append(node.left)
                if node.right:
                    q.append(node.right)
            level += 1
        return level

class Node(object):
    def __init__(self, x):
        self.val = x
        self.left = None
        self.right = None
class Solution(object):
    def maxDepth(self, root):
        if not root:
            return 0
        ldepth = self.maxDepth(root.left)
        rdepth = self.maxDepth(root.right)
        return max(ldepth, rdepth) + 1
