Writing Huffman Coding Algorithm in Python. Have successfully managed to create a tree based on a string input but am stuck on the best way to traverse it while generating the codes for each letter.
from collections import Counter
class HuffNode:
def __init__(self, count, letter=None):
self.letter = letter
self.count = count
self.right = None
self.left = None
word = input()
d = dict(Counter(word))
Nodes = [HuffNode(d[w], w) for w in sorted(d, key=d.get, reverse=True)]
while len(Nodes) > 1:
a = Nodes.pop()
b = Nodes.pop()
c = HuffNode(a.count+b.count)
c.left, c.right = a, b
Nodes.append(c)
Nodes.sort(key=lambda x: x.count, reverse=True)
For a word like "hello".
d = dict(Counter(word)) would get the frequency of each letter in the string and convert it to a dict. Thus having {'e': 1, 'l': 2, 'h': 1, 'o': 1} Each letter if then converted to a HuffNode and stored in Nodes
The while loop then proceeds to generate a tree until we only have one root
When the loop exits I'll have:
Whats the best way to traverse this tree then generating the codes for each letter?
Thanks
Generally speaking, you would want a recursive function that, given a HuffNode h and a prefix p:
if h.letter is not empty (i.e. h is a leaf), yields (p, h.letter) -> this is the code for the letter
otherwise, calls itself on h.left with prefix p + '0' and on h.right with p + '1'
A possible implementation (not tested, may have typos):
def make_code(node, prefix):
if node is None:
return []
if node.letter is not None:
return [(prefix, node.letter)]
else:
result = []
result.extend(make_code(h.left, prefix + '0'))
result.extend(make_code(h.right, prefix + '1'))
return result
codes = make_code(root, '')
where rootis the Huffman tree you built in the first step. The first test (if node is None) is to manage lopsided nodes, where one of the children may be empty.
Related
I want to be able to generate a string from a dictionary containing substrings, whereby I input a string where each character corresponds to the key of the dictionary and it spits out a new string from the associated values to that key. However I also want to minimise certain characters being next to each other.
For example:
dict = {'I': ['ATA', 'ATC', 'ATT'], 'M': ['ATG'], 'T': ['ACA', 'ACC', 'ACG', 'ACT'], 'N':['AAC', 'AAT'], 'K': ['AAA', 'AAG'], 'S': ['AGC', 'AGT'], 'R': ['AGA', 'AGG']}
input_str = "IIMTSTTKRI"
The output would be a string of the three character substrings associated with each key.However there are many 3 character substrings that could be used, I would like to minimise the number of G's and C's that are next to one another.
I currently have this:
n = []
#make list of possible substrings for each character in string
for i in str:
if i in dict.keys():
n.append(dict[i])
#generate all permutations
p = [''.join(s) for s in itertools.product(*n)]
#if no consecutive GCs in a permutation add to list
ls = []
for i in p:
q = i.count('GC')
if q == 0:
ls.append(i)
Which 'works' but there are a couple of problems. The first (minor one) is that I have to assume the consective "GC" is 0 and for some strings that may not be possible. The second (major one), is its extremely slow for longer strings because it has to generate all permutations.
Can anyone provide a way to improve the speed or an alternative way?
Based on your comments, you can look at the problem as a optimal path searching (Think about your problem as a graph where you must follow path defined in input_str and in each vertex you must chose from a list of defined 3-character wide strings).
There are many search algorithms, my solution is using A*Star:
from heapq import heappop, heappush
dct = {
"I": ["ATA", "ATC", "ATT"],
"M": ["ATG"],
"T": ["ACA", "ACC", "ACG", "ACT"],
"N": ["AAC", "AAT"],
"K": ["AAA", "AAG"],
"S": ["AGC", "AGT"],
"R": ["AGA", "AGG"],
}
input_str = "IIMTSTTKRI"
def valid_moves(s):
key = input_str[len(s) // 3]
for i in dct[key]:
yield s + i
def distance(s):
return len(input_str) - (len(s) // 3)
def my_cost_func(_from, _to):
return _to.count("GC")
def a_star(start, moves_func, h_func, cost_func):
"""
Find a shortest sequence of states from start to a goal state
(a state s with h_func(s) == 0).
"""
frontier = [
(h_func(start), start)
] # A priority queue, ordered by path length, f = g + h
previous = {
start: None
} # start state has no previous state; other states will
path_cost = {start: 0} # The cost of the best path to a state.
Path = lambda s: ([] if (s is None) else Path(previous[s]) + [s])
while frontier:
(f, s) = heappop(frontier)
if h_func(s) == 0:
return Path(s)
for s2 in moves_func(s):
g = path_cost[s] + cost_func(s, s2)
if s2 not in path_cost or g < path_cost[s2]:
heappush(frontier, (g + h_func(s2), s2))
path_cost[s2] = g
previous[s2] = s
path = a_star("", valid_moves, distance, my_cost_func)
print("Result:", path[-1])
This prints:
Result: ATAATAATGACAAGTACAACAAAAAGAATAAGT
I am trying to figure out how to merge overlapping strings in a list together, for example for
['aacc','accb','ccbe']
I would get
['aaccbe']
This following code works for the example above, however it does not provide me with the desired result in the following case:
s = ['TGT','GTT','TTC','TCC','CCC','CCT','CCT','CTG','TGA','GAA','AAG','AGC','GCG','CGT','TGC','GCT','CTC','TCT','CTT','TTT','TTT','TTC','TCA','CAT','ATG','TGG','GGA','GAT','ATC','TCT','CTA','TAT','ATG','TGA','GAT','ATT','TTC']
a = s[0]
b = s[-1]
final_s = a[:a.index(b[0])]+b
print(final_s)
>>>TTC
My output is clearly not right, and I don't know why it doesn't work in this case. Note that I have already organized the list with the overlapping strings next to each other.
You can use a trie to storing the running substrings and more efficiently determine overlap. When the possibility of an overlap occurs (i.e for an input string, there exists a string in the trie with a letter that starts or ends the input string), a breadth-first search to find the largest possible overlap takes place, and then the remaining bits of string are added to the trie:
from collections import deque
#trie node (which stores a single letter) class definition
class Node:
def __init__(self, e, p = None):
self.e, self.p, self.c = e, p, []
def add_s(self, s):
if s:
self.c.append(self.__class__(s[0], self).add_s(s[1:]))
return self
class Trie:
def __init__(self):
self.c = []
def last_node(self, n):
return n if not n.c else self.last_node(n.c[0])
def get_s(self, c, ls):
#for an input string, find a letter in the trie that the string starts or ends with.
for i in c:
if i.e in ls:
yield i
yield from self.get_s(i.c, ls)
def add_string(self, s):
q, d = deque([j for i in self.get_s(self.c, (s[0], s[-1])) for j in [(s, i, 0), (s, i, -1)]]), []
while q:
if (w:=q.popleft())[1] is None:
d.append((w[0] if not w[0] else w[0][1:], w[2], w[-1]))
elif w[0] and w[1].e == w[0][w[-1]]:
if not w[-1]:
if not w[1].c:
d.append((w[0][1:], w[1], w[-1]))
else:
q.extend([(w[0][1:], i, 0) for i in w[1].c])
else:
q.append((w[0][:-1], w[1].p, w[1], -1))
if not (d:={a:b for a, *b in d}):
self.c.append(Node(s[0]).add_s(s[1:]))
elif (m:=min(d, key=len)):
if not d[m][-1]:
d[m][0].add_s(m)
else:
t = Node(m[0]).add_s(m)
d[m][0].p = self.last_node(t)
Putting it all together
t = Trie()
for i in ['aacc','accb','ccbe']:
t.add_string(i)
def overlaps(trie, c = ''):
if not trie.c:
yield c+trie.e
else:
yield from [j for k in trie.c for j in overlaps(k, c+trie.e)]
r = [j for k in t.c for j in overlaps(k)]
Output:
['aaccbe']
Use difflib.find_longest_match to find the overlap and concatenate appropriately, then use reduce to apply the entire list.
import difflib
from functools import reduce
def overlap(s1, s2):
# https://stackoverflow.com/a/14128905/4001592
s = difflib.SequenceMatcher(None, s1, s2)
pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
return s1[:pos_a] + s2[pos_b:]
s = ['aacc','accb','ccbe']
result = reduce(overlap, s, "")
print(result)
Output
aaccbe
I have question, where I need to implement ladder problem with different logic.
In each step, the player must either add one letter to the word
from the previous step, or take away one letter, and then rearrange the letters to make a new word.
croissant(-C) -> arsonist(-S) -> aroints(+E)->notaries(+B)->baritones(-S)->baritone
The new word should make sense from a wordList.txt which is dictionary of word.
Dictionary
My code look like this,
where I have calculated first the number of character removed "remove_list" and added "add_list". Then I have stored that value in the list.
Then I read the file, and stored into the dictionary which the sorted pair.
Then I started removing and add into the start word and matched with dictionary.
But now challenge is, some word after deletion and addition doesn't match with the dictionary and it misses the goal.
In that case, it should backtrack to previous step and should add instead of subtracting.
I am looking for some sort of recursive function, which could help in this or complete new logic which I could help to achieve the output.
Sample of my code.
start = 'croissant'
goal = 'baritone'
list_start = map(list,start)
list_goal = map(list, goal)
remove_list = [x for x in list_start if x not in list_goal]
add_list = [x for x in list_goal if x not in list_start]
file = open('wordList.txt','r')
dict_words = {}
for word in file:
strip_word = word.rstrip()
dict_words[''.join(sorted(strip_word))]=strip_word
file.close()
final_list = []
flag_remove = 0
for i in remove_list:
sorted_removed_list = sorted(start.replace(''.join(map(str, i)),"",1))
sorted_removed_string = ''.join(map(str, sorted_removed_list))
if sorted_removed_string in dict_words.keys():
print dict_words[sorted_removed_string]
final_list.append(sorted_removed_string)
flag_remove = 1
start = sorted_removed_string
print final_list
flag_add = 0
for i in add_list:
first_character = ''.join(map(str,i))
sorted_joined_list = sorted(''.join([first_character, final_list[-1]]))
sorted_joined_string = ''.join(map(str, sorted_joined_list))
if sorted_joined_string in dict_words.keys():
print dict_words[sorted_joined_string]
final_list.append(sorted_joined_string)
flag_add = 1
sorted_removed_string = sorted_joined_string
Recursion-based backtracking isn't a good idea for search problem of this sort. It blindly goes downward in search tree, without exploiting the fact that words are almost never 10-12 distance away from each other, causing StackOverflow (or recursion limit exceeded in Python).
The solution here uses breadth-first search. It uses mate(s) as helper, which given a word s, finds all possible words we can travel to next. mate in turn uses a global dictionary wdict, pre-processed at the beginning of the program, which for a given word, finds all it's anagrams (i.e re-arrangement of letters).
from queue import Queue
words = set(''.join(s[:-1]) for s in open("wordsEn.txt"))
wdict = {}
for w in words:
s = ''.join(sorted(w))
if s in wdict: wdict[s].append(w)
else: wdict[s] = [w]
def mate(s):
global wdict
ans = [''.join(s[:c]+s[c+1:]) for c in range(len(s))]
for c in range(97,123): ans.append(s + chr(c))
for m in ans: yield from wdict.get(''.join(sorted(m)),[])
def bfs(start,goal,depth=0):
already = set([start])
prev = {}
q = Queue()
q.put(start)
while not q.empty():
cur = q.get()
if cur==goal:
ans = []
while cur: ans.append(cur);cur = prev.get(cur)
return ans[::-1] #reverse the array
for m in mate(cur):
if m not in already:
already.add(m)
q.put(m)
prev[m] = cur
print(bfs('croissant','baritone'))
which outputs: ['croissant', 'arsonist', 'rations', 'senorita', 'baritones', 'baritone']
I have 2 lists. One contains values, the other contains the levels those values hold in a sum tree. (the lists have same length)
For example:
[40,20,5,15,10,10] and [0,1,2,2,1,1]
Those lists correctly correspond because
- 40
- - 20
- - - 5
- - - 15
- - 10
- - 10
(20+10+10) == 40 and (5+15) == 20
I need to check if a given list of values and a list of its levels corresponds correctly. So far I have managed to put together this function, but for some reason it's not returning True for correct lists array and numbers. Input numbers here would be [40,20,5,15,10,10] and array would be [0,1,2,2,1,1]
def testsum(array, numbers):
k = len(array)
target = [0]*k
subsum = [0]*k
for x in range(0, k):
if target[array[x]]!=subsum[array[x]]:
return False
target[array[x]]=numbers[x]
subsum[array[x]]=0
if array[x]>0:
subsum[array[x]-1]+=numbers[x]
for x in range(0, k):
if(target[x]!=subsum[x]):
print(x, target[x],subsum[x])
return False
return True
I got this running using itertools.takewhile to grab the subtree under each level. Toss that into a recursive function and assert that all recursions pass.
I've slightly improved my initial implementation by grabbing a next_v and next_l and testing early to see if the current node is a parent node and only building subtree if there's something to build. That inequality check is much cheaper than iterating through the whole vs_ls zip.
import itertools
def testtree(values, levels):
if len(values) == 1:
# Last element, always true!
return True
vs_ls = zip(values, levels)
test_v, test_l = next(vs_ls)
next_v, next_l = next(vs_ls)
if next_l > test_l:
subtree = [v for v,l in itertools.takewhile(
lambda v_l: v_l[1] > test_l,
itertools.chain([(next_v, next_l)], vs_ls))
if l == test_l+1]
if sum(subtree) != test_v and subtree:
#TODO test if you can remove the "and subtree" check now!
print("{} != {}".format(subtree, test_v))
return False
return testtree(values[1:], levels[1:])
if __name__ == "__main__":
vs = [40, 20, 15, 5, 10, 10]
ls = [0, 1, 2, 2, 1, 1]
assert testtree(vs, ls) == True
It unfortunately adds a lot of complexity to the code since it pulls out the first value that we need, which necessitates an extra itertools.chain call. That's not ideal. Unless you're expecting to get very large lists for values and levels, it might be worthwhile to do vs_ls = list(zip(values, levels)) and approach this list-wise rather than iterator-wise. e.g...
...
vs_ls = list(zip(values, levels))
test_v, test_l = vs_ls[0]
next_v, next_l = vs_ls[1]
...
subtree = [v for v,l in itertools.takewhile(
lambda v_l: v_l[1] > test_l,
vs_ls[1:]) if l == test_l+1]
I still think the fastest way is probably to iterate once with an approach almost like a state machine and grab all the possible subtrees, then check them all individually. Something like:
from collections import namedtuple
Tree = namedtuple("Tree", ["level_num", "parent", "children"])
# equivalent to
# # class Tree:
# # def __init__(self, level_num: int,
# # parent: int,
# # children: list):
# # self.level_num = level_num
# # self.parent = parent
# # self.children = children
def build_trees(values, levels):
trees = [] # list of Trees
pending_trees = []
vs_ls = zip(values, levels)
last_v, last_l = next(vs_ls)
test_l = last_l + 1
for v, l in zip(values, levels):
if l > last_l:
# we've found a new tree
if l != last_l + 1:
# What do you do if you get levels like [0, 1, 3]??
raise ValueError("Improper leveling: {}".format(levels))
test_l = l
# Stash the old tree and start a new one.
pending_trees.append(cur_tree)
cur_tree = Tree(level_num=last_l, parent=last_v, children=[])
elif l < test_l:
# tree is finished
# Store the finished tree and grab the last one we stashed.
trees.append(cur_tree)
try:
cur_tree = pending_trees.pop()
except IndexError:
# No trees pending?? That's weird....
# I can't think of any case that this should happen, so maybe
# we should be raising ValueError here, but I'm not sure either
cur_tree = Tree(level_num=-1, parent=-1, children=[])
elif l == test_l:
# This is a child value in our current tree
cur_tree.children.append(v)
# Close the pending trees
trees.extend(pending_trees)
return trees
This should give you a list of Tree objects, each of which having the following attributes
level_num := level number of parent (as found in levels)
parent := number representing the expected sum of the tree
children := list containing all the children in that level
After you do that, you should be able to simply check
all([sum(t.children) == t.parent for t in trees])
But note that I haven't been able to test this second approach.
I'm trying to write a program that mimics a word game where, from a given set of words, it will find the longest possible sequence of words. No word can be used twice.
I can do the matching letters and words up, and storing them into lists, but I'm having trouble getting my head around how to handle the potentially exponential number of possibilities of words in lists. If word 1 matches word 2 and then I go down that route, how do I then back up to see if words 3 or 4 match up with word one and then start their own routes, all stemming from the first word?
I was thinking some way of calling the function inside itself maybe?
I know it's nowhere near doing what I need it to do, but it's a start. Thanks in advance for any help!
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
def pokemon():
count = 1
names = g.split()
first = names[count]
master = []
for i in names:
print (i, first, i[0], first[-1])
if i[0] == first[-1] and i not in master:
master.append(i)
count += 1
first = i
print ("success", master)
if len(master) == 0:
return "Pokemon", first, "does not work"
count += 1
first = names[count]
pokemon()
Your idea of calling a function inside of itself is a good one. We can solve this with recursion:
def get_neighbors(word, choices):
return set(x for x in choices if x[0] == word[-1])
def longest_path_from(word, choices):
choices = choices - set([word])
neighbors = get_neighbors(word, choices)
if neighbors:
paths = (longest_path_from(w, choices) for w in neighbors)
max_path = max(paths, key=len)
else:
max_path = []
return [word] + max_path
def longest_path(choices):
return max((longest_path_from(w, choices) for w in choices), key=len)
Now we just define our word list:
words = ("audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus")
words = frozenset(words.split())
Call longest_path with a set of words:
>>> longest_path(words)
['girafarig', 'gabite', 'exeggcute', 'emolga', 'audino']
A couple of things to know: as you point out, this has exponential complexity, so beware! Also, know that python has a recursion limit!
Using some black magic and graph theory I found a partial solution that might be good (not thoroughly tested).
The idea is to map your problem into a graph problem rather than a simple iterative problem (although it might work too!). So I defined the nodes of the graph to be the first letters and last letters of your words. I can only create edges between nodes of type first and last. I cannot map node first number X to node last number X (a word cannot be followed by it self). And from that your problem is just the same as the Longest path problem which tends to be NP-hard for general case :)
By taking some information here: stackoverflow-17985202 I managed to write this:
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
words = g.split()
begin = [w[0] for w in words] # Nodes first
end = [w[-1] for w in words] # Nodes last
links = []
for i, l in enumerate(end): # Construct edges
ok = True
offset = 0
while ok:
try:
bl = begin.index(l, offset)
if i != bl: # Cannot map to self
links.append((i, bl))
offset = bl + 1 # next possible edge
except ValueError: # no more possible edge for this last node, Next!
ok = False
# Great function shamelessly taken from stackoverflow (link provided above)
import networkx as nx
def longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
# pairs of dist,node for all incoming edges
pairs = [(dist[v][0]+1,v) for v in G.pred[node]]
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
node,(length,_) = max(dist.items(), key=lambda x:x[1])
path = []
while length > 0:
path.append(node)
length,node = dist[node]
return list(reversed(path))
# Construct graph
G = nx.DiGraph()
G.add_edges_from(links)
# TADAAAA!
print(longest_path(G))
Although it looks nice, there is a big drawback. You example works because there is no cycle in the resulting graph of input words, however, this solution fails on cyclic graphs.
A way around that is to detect cycles and break them. Detection can be done this way:
if nx.recursive_simple_cycles(G):
print("CYCLES!!! /o\")
Breaking the cycle can be done by just dropping a random edge in the cycle and then you will randomly find the optimal solution for your problem (imagine a cycle with a tail, you should cut the cycle on the node having 3 edges), thus I suggest brute-forcing this part by trying all possible cycle breaks, computing longest path and taking the longest of the longest path. If you have multiple cycles it becomes a bit more explosive in number of possibilities... but hey it's NP-hard, at least the way I see it and I didn't plan to solve that now :)
Hope it helps
Here's a solution that doesn't require recursion. It uses the itertools permutation function to look at all possible orderings of the words, and find the one with the longest length. To save time, as soon as an ordering hits a word that doesn't work, it stops checking that ordering and moves on.
>>> g = 'girafarig eudino exeggcute omolga gabite'
... p = itertools.permutations(g.split())
... longestword = ""
... for words in p:
... thistry = words[0]
... # Concatenates words until the next word doesn't link with this one.
... for i in range(len(words) - 1):
... if words[i][-1] != words[i+1][0]:
... break
... thistry += words[i+1]
... i += 1
... if len(thistry) > len(longestword):
... longestword = thistry
... print(longestword)
... print("Final answer is {}".format(longestword))
girafarig
girafariggabiteeudino
girafariggabiteeudinoomolga
girafariggabiteexeggcuteeudinoomolga
Final answer is girafariggabiteexeggcuteeudinoomolga
First, let's see what the problem looks like:
from collections import defaultdict
import pydot
words = (
"audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus"
).split()
def main():
# get first -> last letter transitions
nodes = set()
arcs = defaultdict(lambda: defaultdict(list))
for word in words:
first = word[0]
last = word[-1]
nodes.add(first)
nodes.add(last)
arcs[first][last].append(word)
# create a graph
graph = pydot.Dot("Word_combinations", graph_type="digraph")
# use letters as nodes
for node in sorted(nodes):
n = pydot.Node(node, shape="circle")
graph.add_node(n)
# use first-last as directed edges
for first, sub in arcs.items():
for last, wordlist in sub.items():
count = len(wordlist)
label = str(count) if count > 1 else ""
e = pydot.Edge(first, last, label=label)
graph.add_edge(e)
# save result
graph.write_jpg("g:/temp/wordgraph.png", prog="dot")
if __name__=="__main__":
main()
results in
which makes the solution fairly obvious (path shown in red), but only because the graph is acyclic (with the exception of two trivial self-loops).