Encoding intermediate leaves in the Huffman algorithm - python

I am encoding a set of items using Huffman codes and, along with the final codes, I'd like to return the intermediate nodes encoded as well, with the data of the child nodes concatenated into the data of the intermediate node.
For example, if I were to encode this set of symbols and probabilities:
tree = [('a',0.25),('b',0.25),('c',0.25),('d',0.125),('e',0.125)]
I'd like the following to be returned:
tree = [['ab','0'],['cde','1'],['a','00'],['b','01'],['c','10'],['de','11'],['d','110'],['e','111']]
I'm using the following code to produce Huffman trees:
import heapq

# symbfreq = list of symbols and associated frequencies
def encode(symbfreq):
    # create a nested list as a tree
    tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
    # turn the tree into a heap
    heapq.heapify(tree)
    while len(tree) > 1:
        # pop the lowest two nodes off the heap, sorted on the length
        lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
        # add the next bit of the codeword
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        # push a new node, which is the sum of the two lowest probability nodes, onto the heap
        heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))
The Huffman algorithm is:
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   3. Remove the node of highest priority (lowest probability) twice to get two nodes.
   4. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   5. Add the new node to the queue.
6. The remaining node is the root node and the tree is complete.
I can't for the life of me think of a way to stop the intermediate nodes being overwritten (i.e. I want to persist the intermediate nodes created at stage 4).

I don't know how to do it while constructing (apart from building an output tree which is never popped), but you can retrieve the intermediate nodes quite easily:
import itertools

def next_power_of_two(n):
    # helper not shown in the original answer; assumed to return the smallest power of two >= n
    return 1 << (n - 1).bit_length()

huffman_tree = encode(tree)
complete_tree = huffman_tree
# itertools.ifilter is Python 2 only; in Python 3 use the built-in filter
get_intermediate_node = lambda val, arr: ''.join([char for char, binary in itertools.ifilter(lambda node: node[1].startswith(val), arr)])
for val in range(next_power_of_two(len(huffman_tree))):
    bvalue = bin(val)[2:]
    node = [get_intermediate_node(bvalue, huffman_tree), bvalue]
    if node not in complete_tree:
        complete_tree.append(node)
print sorted(complete_tree, key=lambda p: (len(p[-1]), p))
>>> [['ab', '0'], ['cde', '1'], ['a', '00'], ['b', '01'], ['c', '10'],
['de', '11'], ['', '100'], ['', '101'], ['d', '110'], ['e', '111']]
(You still need to prune the empty nodes.)
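For what it's worth, here is one possible way to keep the intermediate nodes while the tree is being built, so that no pruning is needed afterwards. It is only a sketch in Python 3, not the code from the question: the function name encode_with_internal_nodes and the tie-breaking counter are my own assumptions. Each heap entry carries the full list of nodes in its subtree, with the subtree's own root first, so merging two entries can create the parent by concatenating the two root labels.

```python
import heapq
from itertools import count


def encode_with_internal_nodes(symbfreq):
    """Return [label, code] for every leaf and internal node except the root.

    Hypothetical helper for illustration; not part of the original code.
    """
    tiebreak = count()  # unique counter so the heap never compares the node lists
    # Each heap entry: (weight, tiebreak, nodes), where nodes[0] is the subtree root.
    heap = [(wt, next(tiebreak), [[sym, ""]]) for sym, wt in symbfreq]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo_wt, _, lo = heapq.heappop(heap)
        hi_wt, _, hi = heapq.heappop(heap)
        # Prepend the next bit to every node already in each subtree.
        for node in lo:
            node[1] = "0" + node[1]
        for node in hi:
            node[1] = "1" + node[1]
        # The new internal node concatenates the labels of its two children.
        parent = [lo[0][0] + hi[0][0], ""]
        heapq.heappush(heap, (lo_wt + hi_wt, next(tiebreak), [parent] + lo + hi))
    nodes = heap[0][2][1:]  # drop the root, whose code is the empty string
    return sorted(nodes, key=lambda n: (len(n[1]), n[1]))


symbols = [('a', 0.25), ('b', 0.25), ('c', 0.25), ('d', 0.125), ('e', 0.125)]
print(encode_with_internal_nodes(symbols))
# [['ab', '0'], ['cde', '1'], ['a', '00'], ['b', '01'], ['c', '10'],
#  ['de', '11'], ['d', '110'], ['e', '111']]
```

The exact codes depend on how ties between equal weights are broken; with the insertion-order counter used here the result happens to match the output asked for in the question.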

Related

Find all possible paths using networkx

How can I find all possible paths between two nodes in a graph using networkx?
import networkx as nx
G = nx.Graph()
edges = ['start-A', 'start-b', 'A-c', 'A-b', 'b-d', 'A-end', 'b-end']
nodes = []
for node in edges:
    n1 = node.split('-')[0]
    n2 = node.split('-')[1]
    if n1 not in nodes:
        nodes.append(n1)
    if n2 not in nodes:
        nodes.append(n2)
for node in nodes:
    G.add_node(node)
for edge in edges:
    n1 = edge.split('-')[0]
    n2 = edge.split('-')[1]
    G.add_edge(n1, n2)
for path in nx.all_simple_paths(G, 'start', 'end'):
    print(path)
This is the result:
['start', 'A', 'b', 'end']
['start', 'A', 'end']
['start', 'b', 'A', 'end']
['start', 'b', 'end']
But I want all possible paths, including ones that revisit a node, e.g. start,b,A,c,A,end
If repeated visits to a node are allowed, then in a graph where at least 2 nodes on the path (not counting start and end) are connected, there is no upper bound to the number of valid paths. If there are 2 nodes on the path that are connected, e.g. nodes A and B, then any number of new paths can be formed by inserting A->B->A into the appropriate section of the valid path between start and end.
If the number of repeated visits is restricted, then one might take the output of all_simple_paths as a starting point and insert any valid paths between two nodes in between, repeating this multiple times depending on the number of repeated visits allowed.
In your example, this would be taking the third output of all_simple_paths(G, 'start', 'end'), i.e. ['start', 'b', 'A', 'end'] and then for all nodes connected to b iterate over the results of all_simple_paths(G, X, 'A'), where X is the iterated node.
Here's a rough sketch (not a complete solution, but it suggests an algorithm):
for path in nx.all_simple_paths(G, 'start', 'end'):
    print(path)
    # look at each pair of consecutive nodes on the path
    for n, (X, Y) in enumerate(zip(path, path[1:])):
        if X != 'start' and X != 'end':
            # splice every simple path from X to Y in place of the direct edge X-Y
            for sub_path in nx.all_simple_paths(G, X, Y):
                print(path[:n] + sub_path + path[n+2:])
This is not great, since with this formulation it's hard to control the number of repeated visits. One way to fix that is to create an additional filter based on the counts of nodes. However, for any real-world graphs this is not going to be computationally feasible due to the very large number of paths and nodes...
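To illustrate the "filter based on the counts of nodes" idea, here is a small sketch in plain Python that enumerates walks directly and caps how often each node may appear. It is an assumption-laden illustration, not a networkx feature: the function name walks_with_limited_revisits and the max_visits parameter are made up, and it works on a plain adjacency dict built from the same edge strings as in the question. A walk stops the first time it reaches the end node.

```python
from collections import Counter


def walks_with_limited_revisits(adj, start, end, max_visits=2):
    """Yield walks from start to end in which no node appears more than max_visits times.

    Hypothetical helper for illustration; adj maps each node to a list of neighbours.
    """
    def extend(path, counts):
        node = path[-1]
        if node == end:
            yield list(path)  # copy, because path is mutated while backtracking
            return
        for nxt in adj[node]:
            if counts[nxt] < max_visits:
                counts[nxt] += 1
                path.append(nxt)
                yield from extend(path, counts)
                path.pop()
                counts[nxt] -= 1

    yield from extend([start], Counter({start: 1}))


edges = ['start-A', 'start-b', 'A-c', 'A-b', 'b-d', 'A-end', 'b-end']
adj = {}
for edge in edges:
    n1, n2 = edge.split('-')
    adj.setdefault(n1, []).append(n2)  # undirected graph: store both directions
    adj.setdefault(n2, []).append(n1)

for walk in walks_with_limited_revisits(adj, 'start', 'end', max_visits=2):
    print(walk)
```

With max_visits=1 this yields the same four paths as all_simple_paths above; with max_visits=2 the walk start,b,A,c,A,end from the question is among the results. The number of walks still grows very quickly with max_visits, so this only makes sense for small graphs.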

Topological Sort Algorithm (DFS) Implementation in Python

I am new to python and algorithms. I have been trying to implement a topological sorting algorithm for a while but can't seem to create a structure that works. The functions I have made run on a graph represented in an adj list.
When I run a DFS, the nodes are discovered top down, and nodes that have already been visited are not processed again:
def DFS(location, graph, visited=None):
    if visited is None:
        visited = [False for i in range(len(graph))]
    if visited[location]:
        return
    visited[location] = True
    # node_visited is a global list defined elsewhere (not shown)
    node_visited.append(location)
    for node in graph[location]:
        DFS(node, graph, visited)
    return visited
When I try to build a topological sort algorithm, I create a new function which essentially checks the "availability" of a node to be added to the sorted list (i.e. whether its neighbouring nodes have been visited already):
def availability(graph, node):
    count = 0
    for neighbour in graph[node]:
        # available_nodes is a global collection defined elsewhere (not shown)
        if neighbour in available_nodes:
            count += 1
    if count != 0:
        return False
    return True
However, my issue is that once I have followed a path to the bottom of the graph, the DFS does not allow me to revisit those nodes. Hence, any updates I make once I discover the end of the path cannot be processed.
My approach may be totally off, but I am wondering if someone could help improve my implementation design, or explain how the implementation is commonly done. Thanks in advance.
You don't need that availability check to do a topological sort with DFS.
DFS itself ensures that you don't leave a node until its children have already been processed, so if you add each node to a list when DFS finishes with it, they will be added in (reverse) topological order.
Don't forget to do the whole graph, though, like this:
def toposort(graph):
    visited = [False for i in range(len(graph))]
    result = []
    def DFS(node):
        if visited[node]:
            return
        visited[node] = True
        for adj in graph[node]:
            DFS(adj)
        # the node is appended only after all of its children are done
        result.append(node)
    for i in range(len(graph)):
        DFS(i)
    return result
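To make the "(reverse)" part concrete, here is a small self-contained check. The example graph and the helper respects_edges are made up for illustration; the toposort function is restated in compact form so the snippet runs on its own. The DFS finish order puts children before parents, so it has to be reversed to get an order in which every edge points forward.

```python
def toposort(graph):
    """DFS post-order of a graph given as an adjacency list (list of lists)."""
    visited = [False] * len(graph)
    result = []

    def dfs(node):
        if visited[node]:
            return
        visited[node] = True
        for adj in graph[node]:
            dfs(adj)
        result.append(node)  # appended after all children: post-order

    for i in range(len(graph)):
        dfs(i)
    return result


def respects_edges(order, graph):
    """Hypothetical checker: True if every edge u -> v has u before v in order."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[u] < position[v]
               for u in range(len(graph)) for v in graph[u])


# Example DAG (an assumption for this sketch): 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
graph = [[1, 2], [3], [3], []]
post_order = toposort(graph)
topo_order = post_order[::-1]
print(post_order)  # [3, 1, 2, 0]
print(topo_order)  # [0, 2, 1, 3]
print(respects_edges(topo_order, graph))  # True
```

Note that this assumes the graph is acyclic; on a graph with a cycle the function still returns a list, but no ordering can satisfy every edge.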
class Graph:
    def __init__(self):
        self.edges = {}

    def addNode(self, node):
        self.edges[node] = []

    def addEdge(self, node1, node2):
        self.edges[node1] += [node2]

    def getSub(self, node):
        return self.edges[node]

    def DFSrecu(self, start, path):
        for node in self.getSub(start):
            if node not in path:
                path = self.DFSrecu(node, path)
        if start not in path:
            path += [start]
        return path

    def topological_sort(self, start):
        topo_ordering_list = self.DFSrecu(start, [])
        # this loop visits every node that was not reachable from the chosen start node,
        # because all nodes in the graph must be visited and sorted
        for node in self.edges:
            if node not in topo_ordering_list:
                topo_ordering_list = self.DFSrecu(node, topo_ordering_list)
        return topo_ordering_list

if __name__ == "__main__":
    g = Graph()
    for node in ['S', 'B', 'A', 'C', 'G', 'I', "L", 'D', 'H']:
        g.addNode(node)
    g.addEdge("S", "A")
    g.addEdge("S", "B")
    g.addEdge("B", "D")
    g.addEdge("D", "H")
    g.addEdge("D", "G")
    g.addEdge("H", "I")
    g.addEdge("I", "L")
    g.addEdge("G", "I")
    last_path1 = g.topological_sort("D")
    last_path2 = g.topological_sort("S")
    print("Start From D: ", last_path1)
    print("start From S: ", last_path2)
Output:
Start From D: ['L', 'I', 'H', 'G', 'D', 'A', 'B', 'S', 'C']
start From S: ['A', 'L', 'I', 'H', 'G', 'D', 'B', 'S', 'C']
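As you can see, 'C' is included in the sorted list even though it is not connected to any other node. It is still part of the graph and has to be visited, which is why the for loop in topological_sort() is needed. Note that the lists come out in reverse topological order (each node appears after the nodes it points to), so reverse them if you need parents first.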

create a list but getting a string?

s = """
1:A,B,C,D;E,F
2:G,H;J,K
&:L,M,N
"""
def read_nodes(gfile):
    for line in gfile.split():
        nodes = line.split(":")[1].replace(';',',').split(',')
        for node in nodes:
            print node
print read_nodes(s)
I expect to get ['A','B','C','D','E',.....'N'], but I get A B C D E .....N and it's not a list. I spent a lot of time debugging, but could not find the right way.
I believe this is what you're looking for:
s = """
1:A,B,C,D;E,F
2:G,H;J,K
&:L,M,N
"""
def read_nodes(gfile):
    nodes = [line.split(":")[1].replace(';',',').split(',') for line in gfile.split()]
    nodes = [n for l in nodes for n in l]
    return nodes
print read_nodes(s) # prints: ['A','B','C','D','E',.....'N']
What you were doing wrong is that for each sub-list you created, you were iterating over that sub-list and printing out its contents, so nothing was ever collected or returned.
The code above uses a list comprehension to first iterate over gfile and create a list of lists. That list is then flattened by the second line, and the flattened list is returned.
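The same two steps (build one list per line, then flatten) can also be written with itertools.chain.from_iterable. This is just an alternative sketch in Python 3 syntax, not the code from the answer above; the intermediate name per_line is my own.

```python
from itertools import chain

s = """
1:A,B,C,D;E,F
2:G,H;J,K
&:L,M,N
"""


def read_nodes(gfile):
    # Step 1: one list of node names per input line (text after the colon).
    per_line = [line.split(":")[1].replace(";", ",").split(",")
                for line in gfile.split()]
    # Step 2: flatten the list of lists into a single list.
    return list(chain.from_iterable(per_line))


print(read_nodes(s))
# ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N']
```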
If you still want to do it your way, then you need a local variable to store the contents of each sub-list in, and then return that variable:
s = """
1:A,B,C,D;E,F
2:G,H;J,K
&:L,M,N
"""
def read_nodes(gfile):
    all_nodes = []
    for line in gfile.split():
        nodes = line.split(":")[1].replace(';',',').split(',')
        all_nodes.extend(nodes)
    return all_nodes
print read_nodes(s)
Each line you read will create a new list called nodes. You need to create a list outside this loop and store all the nodes.
s = """
1:A,B,C,D;E,F
2:G,H;J,K
&:L,M,N
"""
def read_nodes(gfile):
    allNodes = []
    for line in gfile.split():
        nodes = line.split(":")[1].replace(';',',').split(',')
        for node in nodes:
            allNodes.append(node)
    return allNodes
print read_nodes(s)
Not quite sure what you are ultimately trying to accomplish but this will print what you say you are expecting:
s = """
1:A,B,C,D;E,F
2:G,H;J,K
&:L,M,N
"""
def read_nodes(gfile):
    nodes = []
    for line in gfile.split():
        nodes += line.split(":")[1].replace(';',',').split(',')
    return nodes
print read_nodes(s)
Add the following code so that the output is
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N']
# Code to be added
nodes_list = []
def read_nodes(gfile):
    for line in gfile.split():
        nodes = line.split(":")[1].replace(';',',').split(',')
        nodes_list.extend(nodes)
    print nodes_list
# the function prints the list itself and returns None, so call it without print
read_nodes(s)

How to iterate this tree/graph

I need to iterate a tree/graph and produce a certain output but following some rules:
     _ d
    / / \
   b c  _e
  /    / |
 a    f  g
The expected output should be (order irrelevant):
{'bde', 'bcde', 'abde', 'abcde', 'bdfe', 'bdfge', 'abdfe', ...}
The rules are:
The top of the tree 'bde' (leftmost_root_children+root+rightmost_root_children) should always be present
The left-right order should be preserved so for example the combinations 'cb' or 'gf' are not allowed.
All paths follow the left to right direction.
I need to find all paths following these rules. Unfortunately I don't have a CS background and my head is exploding. Any tip will be helpful.
EDIT: This structure represents my tree very closely:
class N():
    """Node"""
    def __init__(self, name, lefts, rights):
        self.name = name
        self.lefts = lefts
        self.rights = rights

tree = N('d', [N('b', [N('a', [], [])], []), N('c', [], [])],
              [N('e', [N('f', [], []), N('g', [], [])],
                      [])])
or, maybe more readably:
N('d', lefts =[N('b', lefts=[N('a', [], [])], rights=[]), N('c', [], [])],
       rights=[N('e', lefts=[N('f', [], []), N('g', [], [])], rights=[])])
So this can be treated as a combination of two problems. My code below will assume the N class and tree structure have already been defined as in your problem statement.
First: given a tree structure like yours, how do you produce an in-order traversal of its nodes? This is a pretty straightforward problem, so I'll just show a simple recursive generator that solves it:
def inorder(node):
    if not isinstance(node, list):
        node = [node]
    for n in node:
        for left in inorder(getattr(n, 'lefts', [])):
            yield left
        yield n.name
        for right in inorder(getattr(n, 'rights', [])):
            yield right
print list(inorder(tree))
# ['a', 'b', 'c', 'd', 'f', 'g', 'e']
Second: Now that we have the "correct" ordering of the nodes, we next need to figure out all possible combinations of these that a) maintain this order, and b) contain the three "anchor" elements ('b', 'd', 'e'). This we can accomplish using some help from the always-handy itertools library.
The basic steps are:
Identify the anchor elements and partition the list into four pieces around them
Figure out all combinations of elements for each partition (i.e. the power set)
Take the product of all such combinations
Like so:
from itertools import chain, combinations, product

# powerset recipe taken from itertools documentation
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def traversals(tree):
    # the three anchors: leftmost root child, root, rightmost root child
    left, mid, right = tree.lefts[0].name, tree.name, tree.rights[0].name
    nodes = list(inorder(tree))
    l_i, m_i, r_i = [nodes.index(x) for x in (left, mid, right)]
    # split the in-order list into the four pieces around the anchors
    parts = nodes[:l_i], nodes[l_i+1:m_i], nodes[m_i+1:r_i], nodes[r_i+1:]
    psets = [powerset(x) for x in parts]
    for p1, p2, p3, p4 in product(*psets):
        yield ''.join(chain(p1, left, p2, mid, p3, right, p4))

print list(traversals(tree))
# ['bde', 'bdfe', 'bdge', 'bdfge', 'bcde', 'bcdfe',
# 'bcdge', 'bcdfge', 'abde', 'abdfe', 'abdge', 'abdfge',
# 'abcde', 'abcdfe', 'abcdge', 'abcdfge']
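As a sanity check on the number of results: each of the four partitions contributes 2**len(part) subsets, so the total should be the product of those counts. The sketch below (Python 3, with the in-order list and the anchors hard-coded from the example rather than computed from the tree) confirms that this gives 16 strings for this tree. The variable names are mine.

```python
from itertools import chain, combinations, product


def powerset(items):
    """All subsets of items, as tuples, preserving the original order."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


# In-order node list and anchors copied from the example above (hard-coded here).
nodes = ['a', 'b', 'c', 'd', 'f', 'g', 'e']
left, mid, right = 'b', 'd', 'e'
l_i, m_i, r_i = (nodes.index(x) for x in (left, mid, right))
parts = [nodes[:l_i], nodes[l_i + 1:m_i], nodes[m_i + 1:r_i], nodes[r_i + 1:]]

# Each partition of size k has 2**k subsets; the counts multiply.
expected_count = 1
for part in parts:
    expected_count *= 2 ** len(part)

results = [''.join(chain(p1, left, p2, mid, p3, right, p4))
           for p1, p2, p3, p4 in product(*(powerset(p) for p in parts))]

print(parts)           # [['a'], ['c'], ['f', 'g'], []]
print(expected_count)  # 16
print(len(results))    # 16
```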

Creating lists from trees which preserve the structure of the tree

For an application I'm writing, I need to test the branches of a Huffman tree for a certain property. To that end, I've thought about querying a node and returning a flat list which contains sublists representing the items in each branch.
For example, if I had this tree:
-a
|-b
|-c
|-d
I'd like to create a list by querying the topmost item ('a') and return the following list:
[[a],[b,c,d]]
If I queried the second leaf ('b') I want to return:
[[b],[c,d]]
etc
So far I'm storing my tree as a tuple like so:
(1.0,(0.5,(0.25, (0.125,'d'),(0.125,'c')),(0.25,'b')),(0.5,'a'))
I have a function which prints the information on the leaves:
def printTree(tree, prefix = ''):
    if len(tree) == 2:
        print tree[1], prefix
    else:
        printTree(tree[1], prefix + '0')
        printTree(tree[2], prefix + '1')
I've tried creating a function which replaces the print statements with list() statements, but that didn't work.
Does anyone have any ideas about how I could go about this?
So you're looking for something like:
def printTree(tree, prefix = '', res=None):
    if res is None:
        # create a fresh list per top-level call; a mutable default (res=[]) would be shared between calls
        res = []
    if len(tree) == 2:
        res.append((tree[1], prefix))
        print tree[1], prefix
    else:
        printTree(tree[1], prefix + '0', res=res)
        printTree(tree[2], prefix + '1', res=res)
    return res
res holds your results during the recursion and is returned at the end.
With your tree this will return: [('d', '000'), ('c', '001'), ('b', '01'), ('a', '1')]
Was that what you wanted?
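If what you are after is really the grouping of leaves per branch (like [[a],[b,c,d]] in the question), here is one possible reading of that as a sketch. The function names leaves and branch_leaves are made up, and the order of the sublists simply follows the order of the children in the tuple, so it differs from the order written in the question.

```python
def leaves(tree):
    """Flat list of the symbols under a node of the (weight, ...) tuple tree."""
    if len(tree) == 2:  # leaf: (weight, symbol)
        return [tree[1]]
    return leaves(tree[1]) + leaves(tree[2])  # internal: (weight, left, right)


def branch_leaves(tree):
    """One sublist of leaf symbols per child branch of the queried node."""
    if len(tree) == 2:
        return [[tree[1]]]
    return [leaves(tree[1]), leaves(tree[2])]


tree = (1.0, (0.5, (0.25, (0.125, 'd'), (0.125, 'c')), (0.25, 'b')), (0.5, 'a'))
print(branch_leaves(tree))     # [['d', 'c', 'b'], ['a']]
print(branch_leaves(tree[1]))  # [['d', 'c'], ['b']]
```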
