How do I decompress files using Huffman Compression?

How do I decompress files using Huffman Compression? - python

I have written a Huffman Compression Algorithm in python, but so far I have only managed to do the compression part (by making a priority queue using heapq). I have created a class called HuffmanCoding which takes in the text to compress. I know that one way of decompressing is saving a dictionary with the characters and their codes to the compressed text file, but I am not sure about how I should do this (especially because my compressed text is stored in a binary file). Does anyone have any pointers or advice to how I could go about decompressing? Below is my compression code.
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other): # if the frequency of one character is lower than the frequency of another one
return self.freq < other.freq
def __eq__(self, other): # if two characters have the same frequencies
if other == None:
return False
if not isinstance(other, HeapNode): # checks if the character is a node or not
return False
return self.freq == other.freq
class HuffmanCoding:
def __init__(self, text_to_compress):
self.text_to_compress = text_to_compress # text that will be compressed
self.heap = []
self.codes = {} # will store the Huffman code of each character
def get_frequency(self): # method to find frequency of each character in text - RLE
frequency_Dictionary = {} # creates an empty dictionary where frequency of each character will be stored
for character in self.text_to_compress: # Iterates through the text to be compressed
if character in frequency_Dictionary:
frequency_Dictionary[character] = frequency_Dictionary[character] + 1 # if character already exists in
# dictionary, its value is increased by 1
else:
frequency_Dictionary[character] = 1 # if character is not present in list, its value is set to 1
return frequency_Dictionary
def make_queue(self, frequency): # creates the priority queue of each character and its associated frequency
for key in frequency:
node = HeapNode(key, frequency[key]) # create node (character) and store its frequency alongside it
heapq.heappush(self.heap, node) # Push the node into the heap
def merge_nodes(
self): # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
# only one node left
while len(self.heap) > 1:
node1 = heapq.heappop(self.heap) # pop node from top of heap
node2 = heapq.heappop(self.heap) # pop next node which is now at the top of heap
merged = HeapNode(None, node1.freq + node2.freq) # merge the two nodes we popped out from heap
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged) # push merged node back into the heap
def make_codes(self, root, current_code): # Creates Huffman code for each character
if root == None:
return
if root.char != None:
self.codes[root.char] = current_code
self.make_codes(root.left, current_code + "0") # Every time you traverse left, add a 0 - Recursive Call
self.make_codes(root.right, current_code + "1") # Every time you traverse right, add a 1 - Recursive Call
def assignCodes(self): # Assigns codes to each character
root = heapq.heappop(self.heap) # extracts root node from heap
current_code = ""
self.make_codes(root, current_code)
def get_compressed_text(self, text): # Replaces characters in original text with codes
compressed_text = ""
for character in text:
compressed_text += self.codes[character]
return compressed_text
def pad_encoded_text(self, compressed_text):
extra_padding = 8 - len(compressed_text) % 8 # works out how much extra padding is required
for i in range(extra_padding):
compressed_text += "0" # adds the amount of 0's that are required
padded_info = "{0:08b}".format(extra_padding)
compressed_text = padded_info + compressed_text
return compressed_text
def make_byte_array(self, padded_text):
byte_array = bytearray()
for i in range(0, len(padded_text), 8):
byte_array.append(int(padded_text[i:i + 8], 2))
return(byte_array)
def show_compressed_text(self):
frequency = self.get_frequency()
self.make_queue(frequency)
self.merge_nodes()
self.assignCodes()
encoded_text = self.get_compressed_text(self.text_to_compress)
padded_encoded_text = self.pad_encoded_text(encoded_text)
byte_array = self.make_byte_array(padded_encoded_text)
return bytes(byte_array)
def remove_padding(self, padded_encoded_text): # removes the padding that was added
padded_info = padded_encoded_text[:8]
extra_padding = int(padded_info, 2)
padded_encoded_text = padded_encoded_text[8:]
encoded_text = padded_encoded_text[:-1 * extra_padding]
return encoded_text```

First you need to send a description of the Huffman code along with the data. Second you need to use that on the decompression end to decode the Huffman codes.
The most straightforward way is to traverse the tree, sending a 1 bit for every node encountered and a 0 bit followed by the symbol bits for every leaf. On the decompression end you can read that and reconstruct the tree. Then you can use the tree, traversing to the left or right with each bit of data until you get to a leaf. Emit the symbol for that leaf, and then start back at the root of the tree for the next bit.
The way this is usually done in compression libraries is to discard the tree and instead keep and send just the code lengths for every symbol, compressing that as well. Then on both the compression and decompression side, the same canonical Huffman code is constructed from the lengths. Then a table look up can be used for decoding, for speed.

Related

Count Number of Good Nodes

problem statement
I am having trouble understanding what is wrong with my code and understanding the constraint below.
My pseudocode:
Traverse the tree Level Order and construct the array representation (input is actually given as a single root, but they use array representation to show the full tree)
iterate over this array representation, skipping null nodes
for each node, let's call it X, iterate upwards until we reach the root checking to see if at any point in the path, parentNode > nodeX, meaning, nodeX is not a good node.
increment counter if the node is good
Constraints:
The number of nodes in the binary tree is in the range [1, 10^5].
Each node's value is between [-10^4, 10^4]
First of all:
My confusion on the constraint is that, the automated tests are giving input such as [2,4,4,4,null,1,3,null,null,5,null,null,null,null,5,4,4] and if we follow the rules that childs are at c1 = 2k+1 and c2 = 2k+2 and parent = (k-1)//2 then this means that there are nodes with value null
Secondly:
For the input above, my code outputs 8, the expected value is 6, but when I draw the tree from the array, I also think the answer should be 8!
tree of input
# Definition for a binary tree node.
# class TreeNode:
# def __init__(self, val=0, left=None, right=None):
# self.val = val
# self.left = left
# self.right = right
class Solution:
def goodNodes(self, root: TreeNode) -> int:
arrRepresentation = []
queue = []
queue.append(root)
# while queue not empty
while queue:
# remove node
node = queue.pop(0)
if node is None:
arrRepresentation.append(None)
else:
arrRepresentation.append(node.val)
if node is not None:
# add left to queue
queue.append(node.left)
# add right to queue
queue.append(node.right)
print(arrRepresentation)
goodNodeCounter = 1
# iterate over array representation of binary tree
for k in range(len(arrRepresentation)-1, 0, -1):
child = arrRepresentation[k]
if child is None:
continue
isGoodNode = self._isGoodNode(k, arrRepresentation)
print('is good: ' + str(isGoodNode))
if isGoodNode:
goodNodeCounter += 1
return goodNodeCounter
def _isGoodNode(self, k, arrRepresentation):
child = arrRepresentation[k]
print('child: '+str(child))
# calculate index of parent
parentIndex = (k-1)//2
isGood = True
# if we have not reached root node
while parentIndex >= 0:
parent = arrRepresentation[parentIndex]
print('parent: '+ str(parent))
# calculate index of parent
parentIndex = (parentIndex-1)//2
if parent is None:
continue
if parent > child:
isGood = False
break
return isGood

Recursion might be easier:
class Node:
def __init__(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def good_nodes(root, maximum=float('-inf')):
if not root: # null-root
return 0
is_this_good = maximum <= root.val # is this root a good node?
maximum = max(maximum, root.val) # update max
good_from_left = good_nodes(root.left, maximum) if root.left else 0
good_from_right = good_nodes(root.right, maximum) if root.right else 0
return is_this_good + good_from_left + good_from_right
tree = Node(2, Node(4, Node(4)), Node(4, Node(1, Node(5, None, Node(5, Node(4), Node(4)))), Node(3)))
print(good_nodes(tree)) # 6
Basically, recursion traverses the tree while updating the maximum number seen so far. At each iteration, the value of a root is compared with the maximum, incrementing the counter if necessary.

Since you wanted to solve with breadth first search:
from collections import deque
class Solution:
def goodNodes(self,root:TreeNode)->int:
if not root:
return 0
queue=deque()
# run bfs with track of max_val till its parent node
queue.append((root,-inf))
res=0
while queue:
current,max_val=queue.popleft()
if current.val>=max_val:
res+=1
if current.left:
queue.append((current.left,max(max_val,current.val)))
if current.right:
queue.append((current.right,max(max_val,current.val)))
return res
I added the node and its max_value till its parent node. I could not add a global max_value, because look at this tree:
For the first 3 nodes, you would have this [3,1,4] and if you were keeping the max_val globally, max_val would be 4.
Now next node would be 3, leaf node on the left. Since max_node is 4, 3<4 would be incorrect so 3 would not be considered as good node. So instead, I keep track of max_val of each node till its parent node

The binary heap you provided corresponds to the folloring hierarchy:
tree = [2,4,4,4,None,1,3,None,None,5,None,None,None,None,5,4,4]
printHeapTree(tree)
2
/ \
4 4
/ / \
4 1 3
\
5
In that tree, only item value 1 has an ancestor that is greater than itself. The 6 other nodes are good, because they have no ancestor that are greater than themselves (counting the root as good).
Note that there are values in the list that are unreachable because their parent is null (None) so they are not part of the tree (this could be a copy/paste mistake though). If we replace these None values by something else to make them part of the tree, we can see where the unreachable nodes are located in the hierarchy:
t = [2,4,4,4,'*', 1,3,'*',None, 5,None, None,None,None,5,4,4]
printHeapTree(t)
2
__/ \_
4 4
/ \ / \
4 * 1 3
/ / \
* 5 5
/ \
4 4
This is likely where the difference between a result of 8 (not counting root as good) vs 6 (counting root as good) comes from.
You can find the printHeapTree() function here.

Bad Tree design, Data Structure

I tried making a Tree as a part of my Data Structures course. The code works but is extremely slow, almost double the time that is accepted for the course. I do not have experience with Data Structures and Algorithms but I need to optimize the program. If anyone has any tips, advices, criticism I would greatly appreciate it.
The tree is not necessarily a binary tree.
Here is the code:
import sys
import threading
class Node:
def __init__(self,value):
self.value = value
self.children = []
self.parent = None
def add_child(self,child):
child.parent = self
self.children.append(child)
def compute_height(n, parents):
found = False
indices = []
for i in range(n):
indices.append(i)
for i in range(len(parents)):
currentItem = parents[i]
if currentItem == -1:
root = Node(parents[i])
startingIndex = i
found = True
break
if found == False:
root = Node(parents[0])
startingIndex = 0
return recursion(startingIndex,root,indices,parents)
def recursion(index,toWhomAdd,indexes,values):
children = []
for i in range(len(values)):
if index == values[i]:
children.append(indexes[i])
newNode = Node(indexes[i])
toWhomAdd.add_child(newNode)
recursion(i, newNode, indexes, values)
return toWhomAdd
def checkHeight(node):
if node == '' or node == None or node == []:
return 0
counter = []
for i in node.children:
counter.append(checkHeight(i))
if node.children != []:
mostChildren = max(counter)
else:
mostChildren = 0
return(1 + mostChildren)
def main():
n = int(int(input()))
parents = list(map(int, input().split()))
root = compute_height(n, parents)
print(checkHeight(root))
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**27) # new thread will get stack of such size
threading.Thread(target=main).start()
Edit:
For this input(first number being number of nodes and other numbers the node's values)
5
4 -1 4 1 1
We expect this output(height of the tree)
3
Another example:
Input:
5
-1 0 4 0 3
Output:
4

It looks like the value that is given for a node, is a reference by index of another node (its parent). This is nowhere stated in the question, but if that assumption is right, you don't really need to create the tree with Node instances. Just read the input into a list (which you already do), and you actually have the tree encoded in it.
So for example, the list [4, -1, 4, 1, 1] represents this tree, where the labels are the indices in this list:
1
/ \
4 3
/ \
0 2
The height of this tree — according to the definition given in Wikipedia — would be 2. But apparently the expected result is 3, which is the number of nodes (not edges) on the longest path from the root to a leaf, or — otherwise put — the number of levels in the tree.
The idea to use recursion is correct, but you can do it bottom up (starting at any node), getting the result of the parent recursively, and adding one to 1. Use the principle of dynamic programming by storing the result for each node in a separate list, which I called levels:
def get_num_levels(parents):
levels = [0] * len(parents)
def recur(node):
if levels[node] == 0: # this node's level hasn't been determined yet
parent = parents[node]
levels[node] = 1 if parent == -1 else recur(parent) + 1
return levels[node]
for node in range(len(parents)):
recur(node)
return max(levels)
And the main code could be as you had it:
def main():
n = int(int(input()))
parents = list(map(int, input().split()))
print(get_num_levels(parents))

Finding the length of compressed text (Huffman coding)

Given a text of n characters and a Binary tree, generated by Huffman coding, such that the leaf nodes have attributes: a string (the character itself) and an integer (its frequency in the text). The path from the root to any leaf represents its codeword.
I would like to write a recusive function that calculates the length of the compressed text and find its Big O-complexitiy.
So for instance, if I have text
abaccab
and each character has associated frequency and depth in Huffman tree:
4
/ \
a:3 5
/ \
b:2 c:2
then the overall length of compressed text is 11
I came up with this, but it seems very crude:
def get_length(node, depth):
#Leaf node
if node.left_child is None and node.right_child is None:
return node.freq*depth
#Node with only one child
elif node.left_child is None and node.right_child is not None:
return get_length(node.right_child, depth+1)
elif node.right_child is None and node.left_child is not None:
return get_length(node.left_child, depth+1)
#Node with two children
else:
return get_length(node.left_child, depth+1) + get_length(node.right_child, depth+1)
get_length(root,0)
Complexity: O(log 2n) where n is the number of characters.
How can I improve this? What would be the complexity in this case?

To find the exact total length of compresssed text,
I don't see any way around having to individually deal with each unique character
and the count of how many times it occurs in the text, which is a total of O(n) where n is the number of unique characters in the text (also n is the number of leaf nodes in the Huffman tree).
There are several different ways to represent the mapping from Huffman codes to plaintext letters. Your binary tree representation is good for finding the exact total length of the compressed text; there is a total of 2*n - 1 nodes in the tree, where n is the number of unique characters in the text, and a recursive scan through every node requires 2*n - 1 time, which is also equivalent to a total of O(n).
def get_length(node, depth):
#Leaf node
if node.left_child is None and node.right_child is None:
return node.freq*depth
#null link from node with only one child, either left or right:
elif node is None:
print("not a properly constructed Huffman tree")
return 0
#Node with two children
else:
return get_length(node.left_child, depth+1) + get_length(node.right_child, depth+1)
get_length(root,0)

While the complexity to find the length of the compressed text should O(n) (utilizing simple len), the time complexity to complete the encoding should be O(nlog(n)). The algorithm is as follows:
t1 = FullTree
for each character in uncompressed input do: #O(n)
tree_lookup(t1, character) #O(log(n))
Looping over the uncompressed input is O(n), while finding a node in a balanced binary tree is O(log(n)) (O(n) worst case or otherwise). Thus, the result is n*O(log(n)) => O(nlog(n)). Also, note that O(log 2n) for a complexity for lookup is accurate, as by rules of logarithms can be simplified to O(log(2)+log(n)) => O(k + log(n)), for some constant k. However, since Big-O only examines worst case approximations, O(k+log(n)) => O(log(n)).
You can improve your binary tree by creating a simpler lookup in your tree:
from collections import Counter
class Tree:
def __init__(self, node1, node2):
self.right = node1
self.left = node2
self.value = sum(getattr(i, 'value', i[-1]) for i in [node1, node2])
def __contains__(self, _node):
if self.value == _node:
return True
return _node in self.left or _node in self.right
def __lt__(self, _node): #needed to apply sorted function
return self.value < getattr(_node, 'value', _node[-1])
def lookup(self, _t, path = []):
if self.value == _t:
return ''.join(map(str, path))
if self.left and _t in self.left:
return ''.join(map(str, path+[0])) if isinstance(self.left, tuple) else self.left.lookup(_t, path+[0])
if self.right and _t in self.right:
return ''.join(map(str, path+[1])) if isinstance(self.right, tuple) else self.right.lookup(_t, path+[1])
def __getitem__(self, _node):
return self.lookup(_node)
s = list('abaccab')
r = sorted(Counter(s).items(), key=lambda x:x[-1])
while len(r) > 1:
a, b, *_r = r
r = sorted(_r+[Tree(a, b)])
compressed_text = ''.join(r[0][i] for i in s)
Output:
'10110000101'

Maximum depth of a binary tree in python

I created a tuple from a binary tree and it looks like this:
tuple = (1,(2,(4,5,6),(7,None,8)),(3,9,(10,11,12)))
The tree structure becomes more clear by applying indentation:
(1,
(2,
(4,
5,
6
),
(7,
None,
8
)
),
(3,
9,
(10,
11,
12
)
)
)
I know how to find the maximum depth of the binary tree using recursive method, but I am trying to find the maximum depth using the tuple I created. Can anyone help me with how to do it?

Recursive method:
a = (1,(2,(4,5,6),(7,None,8)),(3,9,(10,11,12)));
def depth(x):
if(isinstance(x, int) or x == None):
return 1;
else:
dL = depth(x[1]);
dR = depth(x[2]);
return max(dL, dR) + 1;
print(depth(a));
The idea is to determine the depth of a tree by looking at its left and right subtree. If the node does not have subtrees, a depth of 1 is returned. Else it returns max(depth of right, depth of left) + 1

Here is a tricky but rather efficient solution, that will work, provided no elements of your data structure is a string containing '(' or ')'.
I would convert the tuple to a string, and parse it so as to count the depth of the parentheses.
string = str(myTuple)
currentDepth = 0
maxDepth = 0
for c in string:
if c == '(':
currentDepth += 1
elif c == ')':
currentDepth -= 1
maxDepth = max(maxDepth, currentDepth)
It gives the depth in a linear time with regards to the number of characters in the string into which the tuple was converted.
That number should be more or less proportional to the number of elements plus the depth, so you'd have a complexity somewhat equal to O(n + d).

I solve this with level order traversal. If you know level order traversal, this question and https://leetcode.com/problems/binary-tree-right-side-view/ question can be solved with same technique:
from collections import deque
class Solution:
def max_depth(self,root):
if not root:
return 0
level=0
q=deque([root])
while q:
# once we exhaust the for loop, that means we traverse all the nodes in the same level
# so after for loop increase level+=1
for i in range(len(q)):
node=q.popleft()
if node.left:
q.append(node.left)
if node.right:
q.append(node.right)
level+=1
return level

class Node(object):
def __init__(self, x):
self.val = x
self.left = None
self.right = None
class Solution(object):
def maxDepth(self, root):
if not root:
return 0
ldepth = self.maxDepth(root.left)
rdepth = self.maxDepth(root.right)
return max(ldepth, rdepth) + 1

Counting nodes of a binary tree without recursion Python

I know how to do it recursively:
def num_nodes(tree):
if not tree.left and not tree.right:
return 1
else:
return 1 + num_nodes(tree.left) + num_nodes(tree.right)
But how would you do it non recursively? Having trouble accessing a node that's on the right of a left subtree.

You could store the nodes you have to do in a list.
queue = []
count = 0
queue.append(root)
# Above is start of tree
while(queue):
if queue[0].left:
queue.append(queue[0].left)
if queue[0].right:
queue.append(queue[0].right)
queue.pop(0)
count += 1
Which should work as long as your tree does not have any loops in it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I decompress files using Huffman Compression? - python

Related

Count Number of Good Nodes

Bad Tree design, Data Structure

Finding the length of compressed text (Huffman coding)

Maximum depth of a binary tree in python

Counting nodes of a binary tree without recursion Python

Categories

Resources