I am mimicking a bidirectional BFS solution to the Word Ladder problem on LeetCode.
Word Ladder - LeetCode
Given two words (beginWord and endWord), and a dictionary's word list, find the length of shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Note:
Return 0 if there is no such transformation sequence.
All words have the same length.
All words contain only lowercase alphabetic characters.
You may assume no duplicates in the word list.
You may assume beginWord and endWord are non-empty and are not the same.
Example 1:
Input:
beginWord = "hit",
endWord = "cog",
wordList = ["hot","dot","dog","lot","log","cog"]
Output: 5
Explanation: As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" -> "cog",
return its length 5.
Example 2:
Input:
beginWord = "hit"
endWord = "cog"
wordList = ["hot","dot","dog","lot","log"]
Output: 0
Explanation: The endWord "cog" is not in wordList, therefore no possible transformation.
The solution:
class Solution2(object):
    def ladderLength(self, beginWord, endWord, wordList):
        #base case
        if (endWord not in wordList) or (not endWord) or (not beginWord) or (not wordList):
            return 0
        size = len(beginWord)
        word_set = set(wordList)
        forwards, backwards = {beginWord}, {endWord}
        visited = set()
        step = 0
        while forwards and backwards:
            step += 1 #treat the first word as step 1
            if len(forwards) > len(backwards):
                forwards, backwards = backwards, forwards #switch process
            #logging.debug(f"step: {step}, forwards: {forwards}, backwords: {backwards}")
            neighbors= set()
            for word in forwards:#visit words on this level
                if word in visited: continue
                for i in range(size):
                    for c in 'abcdefghijklmnopqrstuvwxyz':
                        next_word = word[:i] + c + word[i+1:]
                        if next_word in backwards: return step + 1 #terminating case
                        if next_word in word_set: neighbors.add(next_word)
                        #logging.debug(f"next_word{next_word}, step: {step}")
                visited.add(word) #add visited word as the final step
            forwards = neighbors
        #logging.debug(f"final: {step}")
        return 0
For reference, the switch logic is:
if len(forwards) > len(backwards):
    forwards, backwards = backwards, forwards #switch process
This solution is concise but not intuitive to me, so I tried to change it to:
if len(forwards) <= len(backwards): current = forwards
else: current = backwards
neighbors = set()
for word in current:
.......
Unfortunately, the final assignment forwards = neighbors can then no longer be handled properly.
How could I solve this problem?
This is called the bidirectional search algorithm. forwards and backwards in this solution follow a kind of two-pointer idea: always expand the smaller of the two frontiers, which helps find the path more quickly.
About your question of using current instead of switching forwards and backwards: I don't think it can work as written. The reason is that the logic uses both frontiers, so besides current you would also need a variable like other, and current plus other is just forwards plus backwards under different names, so the idea doesn't change anything.
In my opinion this two-pointer version is elegant and concise enough, and I personally like it.
I did figure out another way, using a current index, which is close to your idea:
class Solution:
    def ladderLength(self, beginWord, endWord, wordList):
        #base case
        if (endWord not in wordList) or (not endWord) or (not beginWord) or (not wordList):
            return 0
        size = len(beginWord)
        word_set = set(wordList)
        entries = [{beginWord}, {endWord}]
        visited = set()
        step = 0
        cur = 0
        while entries[cur] and entries[1-cur]:
            step += 1 #treat the first word as step 1
            if len(entries[cur]) > len(entries[1-cur]): #switch to the smaller frontier
                cur ^= 1
            neighbors= set()
            for word in entries[cur]:#visit words on this level
                if word in visited: continue
                for i in range(size):
                    for c in 'abcdefghijklmnopqrstuvwxyz':
                        next_word = word[:i] + c + word[i+1:]
                        if next_word in entries[1-cur]: return step + 1 #terminating case
                        if next_word in word_set: neighbors.add(next_word)
                visited.add(word) #add visited word as the final step
            entries[cur] = neighbors
        return 0
Hope that helps you, and comment if you have further questions. : )
EDIT: The problem (and the answer) lies in using the enumerate function to access the index of iterables, but I'm still working on applying it properly.
I was asked to generate a random word of length N, and to print neighbouring uppercase letters that are consecutive in the alphabet, as well as neighbouring lower/upper pairs of the same letter. I really don't know how to put this better.
The example is in the title; here is my code so far. I think it mostly works, I just need to fix the error caused by the index lookup in the ascii_uppercase list variable.
Also, please excuse the messy do-while-style loop at the beginning.
import string
import random
letters = list(string.ascii_uppercase + string.ascii_lowercase)
enter = int(input('N: '))
def randomised(signs = string.ascii_uppercase + string.ascii_lowercase, X = enter):
    return( ''.join(random.choice(signs) for _ in range(X)))
if enter == 1:
    print('END')
while enter > 1:
    enter = int(input('N: '))
    word1 = randomised()
    word2 = list(word1)
    neighbour = ''
    same = ''
    for j in word2:
        if word2[j] and word2[j+1] in string.ascii_uppercase and letters.index(j) == word2.index(j) and letters.index(j+1) == word2.index(j+1):
            same += j
            same += j+1
    for i in word2:
        if word2[i] == i.upper and word2[i+1] == (i+1).upper:
            neighbour += i
            neighbour += i+1
    print('Created: {}, Neighbour uppercase letters: {}, Neighbour same letters: {}' .format(word1,same,neighbour))
expected behaviour:
N: 7
Created: aaBCDdD, Neighbour uppercase letters: BCD, Neighbour same letters: dD
N: 1
N: END
I am not so sure, but I think your problem might stem from the use of "i+1" and "j+1" without limiting the iteration so that it stops before the last element.
The next thing is that you should put this code in English so people can provide better answers. I am not a native English speaker either, but the core of my code is in English so it is understandable worldwide. There are other general improvements that could be made, but those are things you learn while coding.
I hope your assignment goes great.
Another recommendation: you can use "a".islower() to see if a character is lowercase and isupper() to see if it is uppercase, so you don't need to import the whole alphabet. There are many built-in functions for common scenarios, and they tend to be more efficient than what most people would write without them.
Edit: the error is because you are adding a string and a number.
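For illustration, here is a small, hypothetical snippet showing both points (the built-in checks, and why adding a number to a character fails):

print('D'.isupper(), 'd'.islower())   # True True -- no alphabet list needed
c = 'd'
# c + 1 would raise: TypeError: can only concatenate str (not "int") to str
print(chr(ord(c) + 1))                # 'e' -- use ord()/chr() to step to the next letter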
Here is simple working code (without formatting of the output). I used a simple pairwise iteration to compare each character with the previous one.
import random
from string import ascii_letters

# generation of random word
N = 50
word = ''.join(random.choices(ascii_letters, k=N))
# forcing word for testing
word = 'aaBCDdD'
# test of conditions
cond1 = ''
cond2 = ''
flag1 = False # flag to add last character of stretch
for a,b in zip('_'+word, word+'_'):
    if a.isupper() and b.isupper() and ord(a) == ord(b)-1:
        cond1 += a
        flag1 = True
    elif flag1:
        flag1 = False
        cond1 += a
    if a.islower() and b.isupper() and a.lower() == b.lower():
        cond2 += a+b
print(cond1, cond2, sep='\n')
# BCD
# dD
NB. In case the conditions are met several times in the word, the identified patterns will just be concatenated
Example on random word of 500 characters:
IJRSOPHILMMNVW
kKdDyY
I'm having trouble with the following task:
So basically, I need to build a function that receives (sentence, word, occurrence),
searches for that word, and reverses it only at the given occurrence.
For example:
function("Dani likes bananas, Dani also likes apples", "lik", "2")
returns: "Dani likes bananas, Dani also kiles apples"
As you can see, the "word" is 'lik', and at its second occurrence it is reversed to 'kil'.
I wrote something, but it's too messy and that part still doesn't work for me:
def q2(sentence, word, occurrence):
    count = 0
    reSentence = ''
    reWord = ''
    for char in word:
        if sentence.find(word) == -1:
            print('could not find the word')
            break
        for letter in sentence:
            if char == letter:
                if word != reWord:
                    reWord += char
                    reSentence += letter
                    break
                elif word == reWord:
                    if count == int(occurrence):
                        reWord = word[::-1]
                        reSentence += reWord
                    elif count > int(occurrence):
                        print("no such occurrence")
                    else:
                        count += 1
            else:
                reSentence += letter
    print(reSentence)

sentence = 'Dani likes bananas, Dani also likes apples'
word = 'li'
occurrence = '2'
q2(sentence,word,occurrence)
The main problem right now is that after the break it goes back to checking from the start of the sentence, so it will find the 'i' in "Dani" again. I couldn't think of a way to make it continue from where it stopped.
I tried using enumerate but still had no idea how.
This will work for the given scenario
scentence = 'Dani likes bananas, Dani also likes apples'
word = 'lik'
st = word
occ = 2
lt = scentence.split(word)
op = ''
if (len(lt) > 1):
    for i,x in enumerate(lt[:-1]):
        if (i+1) == occ:
            word = ''.join(reversed(word))
        op = op + x + word
        word = st
    print(op+lt[-1])
Please test it yourself for other scenarios.
The line for i,x in enumerate(lt[:-1]) loops over the list excluding the last element. Using enumerate we get the index of each element in i and its value in x. As the code loops through, I re-join the split list with the same word I split on, but swap in the reversed word at the position you asked for. The last element is excluded from the loop because the word is appended after every element inside the loop, so including the whole list would leave an extra word at the end. Hope that explains it.
Your approach shows that you've clearly thought about the problem and are using the means you know well enough to solve it. However, your code has a few too many issues to simply fix, for example:
you only check for occurrence of the word once you're inside the loop;
you loop over the entire sentence for each letter in the word;
you only compare a character at a time, and make some mistakes in keeping track of how much you've matched so far.
you pass a string '2', which you intend to use as a number 2
All of that and other problems can be fixed, but you would do well to use what the language gives you. Your task breaks down into:
find the n-th occurrence of a substring in a string
replace it with another word where found and return the string
Note that you're not really looking for a 'word' per se, as your example shows you replacing only part of a word (i.e. 'lik') and a 'word' is commonly understood to mean a whole word between word boundaries.
def q2(sentence, word, occurrence):
    # the first bit
    position = -1  # start before index 0 so an occurrence at the very start is found
    count = 0
    while count < occurrence:
        position = sentence.find(word, position+1)
        count += 1
        if position == -1:
            print (f'Word "{word}" does not appear {occurrence} times in "{sentence}"')
            return None
    # and then using what was found for a result
    return sentence[0:position] + word[::-1] + sentence[position+len(word):]

print(q2('Dani likes bananas, Dani also likes apples','lik',2))
print(q2('Dani likes bananas, Dani also likes apples','nope',2))
print(q2('Dani likes bananas, Dani also likes apples','nope',2))
A bit of explanation on that return statement:
sentence[0:position] gets sentence from the start 0 up to the character just before position; this is called a 'slice'.
word[::-1] gets word from start to end, but going in reverse (-1). Leaving out the values in the slice means 'from one end to the other'.
sentence[position+len(word):] gets sentence from position + len(word), which is the character just after the found word, until the end (no end index, so it takes everything).
All of those combined give the result you need.
Note that the function returns None if it can't find the word the right number of times - that may not be what is needed in your case.
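As a quick check of those three pieces, here is a small demo on the sample sentence (the position of the second 'lik' is looked up with find, just like in the function above):

sentence = 'Dani likes bananas, Dani also likes apples'
word = 'lik'
position = sentence.find(word, sentence.find(word) + 1)  # start of the 2nd 'lik'
print(sentence[0:position])              # 'Dani likes bananas, Dani also '
print(word[::-1])                        # 'kil'
print(sentence[position+len(word):])     # 'es apples'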
import re
from itertools import islice
s = "Dani likes bananas, Dani also likes apples"
t = "lik"
n = 2
x = re.finditer(t, s)
try:
    i = next(islice(x, n - 1, n)).start()
except StopIteration:
    i = -1
if i >= 0:
    y = s[i: i + len(t)][::-1]
    print(f"{s[:i]}{y}{s[i + len(t):]}")
else:
    print(s)
This finds the 2nd starting index (if it exists) using a regex. It may require two passes over string s in the worst case: one to find the index, one to form the output. It can also be done in one pass using two pointers, but I'll leave that to you. From what I see, no one has offered a one-pass solution yet.
index = find the index of the nth occurrence
Use slice notation to get the part you are interested in (you have its beginning and length)
Reverse it
Construct your result string:
result = sentence[:index] + reversed part + sentence[index+len(word):]
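A minimal sketch of that recipe, using str.find in a loop to locate the nth occurrence (the function name is just for illustration):

def reverse_nth(sentence, word, occurrence):
    index = -1
    for _ in range(occurrence):                 # find index of nth occurrence
        index = sentence.find(word, index + 1)
        if index == -1:
            return sentence                     # fewer than n occurrences: leave unchanged
    reversed_part = sentence[index:index+len(word)][::-1]
    return sentence[:index] + reversed_part + sentence[index+len(word):]

print(reverse_nth('Dani likes bananas, Dani also likes apples', 'lik', 2))
# Dani likes bananas, Dani also kiles apples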
I am completing this HackerRank coding challenge. Essentially, the challenge asks us to find all the substrings of the input string without mixing up the letters. Then, we count the number of substrings that start with a vowel and count the number of substrings that start with a consonant.
The coding challenge is structured as a game where Stuart's score is the number of consonant starting substrings and Kevin's score is the number of vowel starting substrings. The program outputs the winner, i.e. the one with the most substrings.
To attempt this, I created the following code:
def constwordfinder(word):
    word = word.lower()
    return_lst = []
    for indx in range(1,len(word)+1):
        if word[indx-1:indx] not in ['a','e','i','o','u']:
            itr = indx
            while itr < len(word)+1:
                return_lst.append(word[indx-1:itr])
                itr +=1
    return return_lst

def vowelwordfinder(word):
    word = word.lower()
    return_lst = []
    for indx in range(1,len(word)+1):
        if word[indx-1:indx] in ['a','e','i','o','u']:
            itr = indx
            while itr < len(word)+1:
                return_lst.append(word[indx-1:itr])
                itr +=1
    return return_lst

def game_scorer(const_list, vow_list):
    if len(const_list) == len(vow_list):
        return 'Draw'
    else:
        if len(const_list) > len(vow_list):
            return 'Stuart ' + str(len(const_list))
        else:
            return 'Kevin ' + str(len(vow_list))

input_str = input()
print(game_scorer(constwordfinder(input_str), vowelwordfinder(input_str)))
This worked for smaller strings like BANANA, but when HackerRank started inputting strings like the following, I got multiple runtime errors on the test cases:
NANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANAN
NANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANAN
I tried structuring the program to be a bit more concise, although I still got Runtime errors on the longer test cases:
def wordfinder(word):
    word = word.lower()
    return_lst = []
    for indx in range(1,len(word)+1):
        itr = indx
        while itr < len(word)+1:
            return_lst.append(word[indx-1:itr])
            itr +=1
    return return_lst

def game_scorer2(word_list):
    kevin_score = 0
    stuart_score = 0
    for word in word_list:
        if word[0:1] not in ['a','e','i','o','u']:
            stuart_score += 1
        else:
            kevin_score +=1
    if stuart_score == kevin_score:
        return 'Draw'
    else:
        if stuart_score > kevin_score:
            return 'Stuart ' + str(stuart_score)
        else:
            return 'Kevin ' + str(kevin_score)

print(game_scorer2(wordfinder(input())))
What else exactly should I be doing to structure my program to avoid Runtime errors like before?
Here's a quick and dirty partial solution based on my hints:
input_str = raw_input()
kevin = 0
for i, c in enumerate(input_str):
    if c.lower() in "aeiou":
        kevin += len(input_str) - i
print kevin
Basically, iterate over each character, and if it is in the set of vowels, Kevin's score increases by the number of remaining characters in the string.
The remaining work should be rather obvious, I hope!
[stolen from the spoilers section of the site in question]
Because, say, for each consonant you can make n substrings beginning with that consonant. So for the BANANA example, look at the first B. With that B you can make: B, BA, BAN, BANA, BANAN, BANANA. That's six substrings starting with that B, or length(string) - indexof(character), which means that B adds 6 to the score. So you go through the string, looking for each consonant, and add length(string) - index to the score.
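A minimal sketch of that counting idea, scoring both players in one pass without building any substrings (the function name is illustrative):

def minion_game_scores(s):
    s = s.upper()
    kevin = stuart = 0
    for i, c in enumerate(s):
        if c in 'AEIOU':
            kevin += len(s) - i     # every substring starting at i begins with a vowel
        else:
            stuart += len(s) - i
    if kevin > stuart:
        return 'Kevin ' + str(kevin)
    if stuart > kevin:
        return 'Stuart ' + str(stuart)
    return 'Draw'

print(minion_game_scores('BANANA'))  # Stuart 12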
The problem here is your algorithm. You are building all the substrings of the text, and the number of substrings grows quadratically with its length (storing them all costs even more), which is why you get runtime errors on the long test cases. You need an approach that counts the substrings without generating them.
I am building a tree for a spell checker with suggestions. Each node contains a key (a letter) and a value (array of letters down that path).
So assume the following sub-trie in my big trie:
            W
           / \
          a   e
          |   |
          k   k
          |   |
is word-->e   e
              |
             ...
This is just a subpath of a sub-trie. W is a node and a and e are two nodes in its value array etc...
At each node, I check if the next letter in the word is a value of the node. I am trying to support mistyped vowels for now. So 'weke' will yield 'wake' as a suggestion. Here's my searchWord function in my trie:
def searchWord(self, word, path=""):
    if len(word) > 0:
        key = word[0]
        word = word[1:]
        if self.values.has_key(key):
            path = path + key
            nextNode = self.values[key]
            return nextNode.searchWord(word, path)
        else:
            # check here if key is a vowel. If it is, check for other vowel substitutes
            pass
    else:
        if self.isWord:
            return path # this is the word found
        else:
            return None
Given 'weke', at the end when word is of length zero and path is 'weke', my code will hit the second big else block. weke is not marked as a word and so it will return with None. This will return out of searchWord with None.
To avoid this, at each stack unwind or recursion backtrack, I need to check if a letter is a vowel and if it is, do the checking again.
I changed the if self.values.has_key(key) branch to the following:
if self.values.has_key(key):
    path = path + key
    nextNode = self.values[key]
    ret = nextNode.searchWord(word, path)
    if ret == None:
        # check if key == vowel and replace path
        # return nextNode.searchWord(...
        pass
    return ret
What am I doing wrong here? What can I do when backtracking to achieve what I'm trying to do?
Search recursively. Keep track of the current index and the original word.
letters = [chr(i) for i in range(97,97+26)]
print letters
max = 300
def searchWord(orig,word, curindex,counter):
    if counter>max: return
    if counter==0:
        s = letters[0] + word[1:]
        searchWord(orig,s,0,counter+1)
    else:
        c = word[curindex]
        print 'checking ',word,curindex
        s = word
        i = letters.index(c)
        if i==len(letters)-1 and curindex==len(orig)-1:
            print 'done'
            return
        if i==len(letters)-1:
            print 'end of letters reached'
            print 'curindex',curindex
            s = list(word)
            s[curindex] = list(orig)[curindex]
            s[curindex+1] = letters[0]
            s[1] = letters[0]
            s = ''.join(s)
            searchWord(orig,s,curindex+1,counter+1)
        else:
            s = list(word)
            try:
                s[curindex] = letters[i+1]
            except:
                print '?? ',s,curindex,letters[i]
            s = ''.join(s)
            searchWord(orig,s ,curindex,counter+1)
searchWord("weke","weke",0,0)
I'm not sure recursion and tree search are the right approach here. If you have a table of words in memory, the lookup will be very fast; it is only when the search space is so big that it doesn't fit that you have to split the problem up. So a better approach will probably be simply something like this:
corpus_words = {'wake',....} # this is in memory
allowed = word in corpus_words # perhaps improve this with adjusted binary search
A typical corpus has 5-30 million words, which is less than 1 gigabyte. Lookup will be very fast because you can do a binary search, which is O(log n) in the average case. The problem with searching for a subset of the word is that you don't know that the typed word is not a word; however, you could build a list of allowed vowel substitutions, and certain combinations of letters won't be in the corpus anyway. So in terms of computation this problem is pretty easy nowadays. Of course one can quickly improve the simple lookup, for example by keeping a core corpus in memory and the rest on disk. Swipe on Android works pretty well; it uses a personalized corpus and some machine learning.
What I would do to solve this particular problem is to calculate the neighbours of the word 'weke' and check if they are in the corpus, i.e.
word = 'weke'
suggestions = list()
letters = [chr(x) for x in range(97,97+26)]
for i in range(len(word)):
    for a in letters: # or do this in a smarter way to iterate
        newword = word[:i] + a + word[i+1:]  # strings are immutable, so build a new one
        if newword in corpus: suggestions.append(newword)
To improve it further, check whether subsections of the word appear in a corpus of syllables. A lot of work has been done on this front, so you can probably find standard solutions on the internet, for example: http://nltk.org/
I guess you could classify this as a Scrabble style problem, but it started out due to a friend mentioning the UK TV quiz show Countdown. Various rounds in the show involve the contestants being presented a scrambled set of letters and they have to come up with the longest word they can. The one my friend mentioned was "RAEPKWAEN".
In fairly short order I whipped up something in Python to handle this problem, using PyEnchant to handle the dictionary look-ups, however I'm noticing that it really can't scale all that well.
Here's what I have currently:
#!/usr/bin/python
from itertools import permutations
import enchant
from sys import argv

def find_longest(origin):
    s = enchant.Dict("en_US")
    for i in range(len(origin),0,-1):
        print "Checking against words of length %d" % i
        pool = permutations(origin,i)
        for comb in pool:
            word = ''.join(comb)
            if s.check(word):
                return word
    return ""

if (__name__)== '__main__':
    result = find_longest(argv[1])
    print result
That's fine on a 9-letter example like they use in the show: 9 factorial = 362,880 and 8 factorial = 40,320. On that scale, even if it has to check all possible permutations and word lengths, it's not that many.
However, once you reach 14 characters that's 87,178,291,200 possible permutations, meaning you're relying on luck that a 14-character word is found quickly.
With the example word above it's taking my machine about 12 1/2 seconds to find "reawaken". With 14-character scrambled words we could be talking on the scale of 23 days just to check all possible 14-character permutations.
Is there any more efficient way to handle this?
An implementation of Jeroen Coupé's idea from his answer, with letter counts:
from collections import defaultdict, Counter

def find_longest(origin, known_words):
    return iter_longest(origin, known_words).next()

def iter_longest(origin, known_words, min_length=1):
    origin_map = Counter(origin)
    for i in xrange(len(origin) + 1, min_length - 1, -1):
        for word in known_words[i]:
            if check_same_letters(origin_map, word):
                yield word

def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)

def load_words_from(file_path):
    known_words = defaultdict(list)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            known_words[len(word)].append(word)
    return known_words

if __name__ == '__main__':
    known_words = load_words_from('words_list.txt')
    origin = 'raepkwaen'
    big_origin = 'raepkwaenaqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnm'
    print find_longest(big_origin, known_words)
    print list(iter_longest(origin, known_words, 5))
Output (for my small 58000 words dict):
counterrevolutionaries
['reawaken', 'awaken', 'enwrap', 'weaken', 'weaker', 'apnea', 'arena', 'awake',
'aware', 'newer', 'paean', 'parka', 'pekan', 'prank', 'prawn', 'preen', 'renew',
'waken', 'wreak']
Notes:
It's a simple implementation without optimizations.
words_list.txt - can be /usr/share/dict/words on Linux.
UPDATE
In case we need to find the word only once, and we have a dictionary with words sorted by length, e.g. produced by this script:
with open('words_list.txt') as f:
    words = f.readlines()
with open('words_by_len.txt', 'w') as f:
    for word in sorted(words, key=lambda w: len(w), reverse=True):
        f.write(word)
we can find the longest word without loading the full dictionary into memory:
from collections import Counter
import sys

def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)

def iter_longest_from_file(origin, file_path, min_length=1):
    origin_map = Counter(origin)
    origin_len = len(origin)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            if len(word) > origin_len:
                continue
            if len(word) < min_length:
                return
            if check_same_letters(origin_map, word):
                yield word

def find_longest_from_file(origin, file_path):
    return iter_longest_from_file(origin, file_path).next()

if __name__ == '__main__':
    origin = sys.argv[1] if len(sys.argv) > 1 else 'abcdefghijklmnopqrstuvwxyz'
    print find_longest_from_file(origin, 'words_by_len.txt')
You want to avoid doing the permutations. You could count how many times each character appears in both strings (the original letters and the word from the dictionary), and dismiss every dictionary word that needs a character more often than it appears in your letters.
So to check one word from the dictionary you will need to count characters at most max(26, n) times.
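A minimal sketch of that counting check, assuming dictionary is any iterable of candidate words:

from collections import Counter

def candidates(letters, dictionary):
    rack = Counter(letters)
    for word in dictionary:
        if all(rack[ch] >= n for ch, n in Counter(word).items()):
            yield word   # the word never needs more copies of a letter than we have

words = ['reawaken', 'banana', 'prawn']               # hypothetical word list
print(max(candidates('raepkwaen', words), key=len))   # reawaken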
Pre-parse the dictionary as sorted(word), word pairs. (e.g. giilnstu, linguist)
Sort the dictionary file.
Then, when you are searching for a given set of letters:
Binary search the dictionary for the letters you have, sorting the letters first.
You'd need to do this separately for each word length.
EDIT: should say that you're searching for all unique combinations of the sorted letters of the target word length (range(len(letters), 0, -1))
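A rough sketch of that approach (the names are illustrative): store (sorted letters, word) pairs in a sorted list, then binary-search the signature of each unique letter combination, longest lengths first.

import bisect
from itertools import combinations

def build_index(words):
    return sorted((''.join(sorted(w)), w) for w in words)

def find_longest(letters, index):
    letters = sorted(letters)
    for length in range(len(letters), 0, -1):
        for combo in set(combinations(letters, length)):   # unique sorted combinations
            sig = ''.join(combo)
            i = bisect.bisect_left(index, (sig, ''))
            if i < len(index) and index[i][0] == sig:
                return index[i][1]
    return None

index = build_index(['reawaken', 'awaken', 'prank', 'weaker'])   # toy dictionary
print(find_longest('raepkwaen', index))   # reawaken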
This is similar to an anagram problem I've worked on before. I solved that by using prime numbers to represent each letter. The product of the letters for each word produces a number. To determine if a given set of input characters is sufficient to make a word, just divide the product of the input characters by the product of the word you want to check. If there is no remainder then the input characters are sufficient. I've implemented it below. The output is:
$ python longest.py rasdaddea aosddna raepkwaen
rasdaddea --> sadder
aosddna --> soda
raepkwaen --> reawaken
You can find more details and a thorough explanation of the anagrams case at:
http://mostlyhighperformance.blogspot.com/2012/01/generating-anagrams-efficient-and-easy.html
This algorithm takes a small amount of time to set up the dictionary, and then individual checks are as easy as a single division for every word in the dictionary. There may be faster methods that rely on closing off parts of the dictionary if it lacks a letter, but these may end up performing worse if you have a large number of input letters, so that it is actually not able to close off any part of the dictionary.
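As a tiny worked example of the divisibility test (using the assignment a->2, b->3, c->5, d->7 that the prime generator below produces):

rack_key = 2 * 3 * 5 * 7         # letters 'abcd' -> 210
word_key = 5 * 2 * 3             # word 'cab'     -> 30
print(rack_key % word_key == 0)  # True: 'cab' can be built from 'abcd'
bad_key = 2 * 3 * 3 * 2          # word 'abba'    -> 36 (needs two a's and two b's)
print(rack_key % bad_key == 0)   # False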
import sys

def nextprime(x):
    while True:
        x += 1
        for pot_fac in range(2,x):
            if x % pot_fac == 0:
                break
        else:
            return x

def prime_generator():
    '''Returns a generator that produces the next largest prime as
    compared to the one returned from this function the last time
    it was called. The first time it is called it will return 2.'''
    lastprime = 1
    while True:
        lastprime = nextprime(lastprime)
        yield lastprime

# Assign prime numbers to each lower case letter
gen = prime_generator()
primes = dict( [ (chr(x),gen.next()) for x in range(ord('a'),ord('z')+1) ] )

product = lambda x: reduce( lambda m,n: m*n, x, 1 )
make_key = lambda x: product( [ primes[y] for y in x ] )

try:
    words = open('words').readlines()
    words = [ ''.join( [ c for c in x.lower() \
                         if ord('a') <= ord(c) <= ord('z') ] ) \
              for x in words ]
    for x in words:
        try:
            make_key(x)
        except:
            print x
            raise
except IOError:
    words = [ 'reawaken','awaken','enwrap','weaken','weaker', ]

words = dict( ( (make_key(x),x,) for x in words ) )

inputs = sys.argv[1:] if sys.argv[1:] else [ 'raepkwaen', ]
for input in inputs:
    input_key = make_key(input)
    results = [ words[x] for x in words if input_key % x == 0 ]
    result = reversed(sorted(results, key=len)).next()
    print input,'--> ',result
I started this last night shortly after you asked the question, but didn't get around to polishing it up until just now. This was my solution, which is basically a modified trie, which I didn't know until today!
class Node(object):
    __slots__ = ('words', 'letter', 'child', 'sib')

    def __init__(self, letter, sib=None):
        self.words = []
        self.letter = letter
        self.child = None
        self.sib = sib

    def get_child(self, letter, create=False):
        child = self.child
        if not child or child.letter > letter:
            if create:
                self.child = Node(letter, child)
                return self.child
            return None
        return child.get_sibling(letter, create)

    def get_sibling(self, letter, create=False):
        node = self
        while node:
            if node.letter == letter:
                return node
            sib = node.sib
            if not sib or sib.letter > letter:
                if create:
                    node.sib = Node(letter, sib)
                    node = node.sib
                    return node
                return None
            node = sib
        return None

    def __repr__(self):
        return '<Node({}){}{}: {}>'.format(chr(self.letter), 'C' if self.child else '', 'S' if self.sib else '', self.words)

def add_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    node = root
    for letter in letters:
        node = node.get_child(letter, True)
    node.words.append(word)

def find_max_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    words = []
    def grab_words(root, letters):
        last = None
        for idx, letter in enumerate(letters):
            if letter == last: # prevents duplication
                continue
            node = root.get_child(letter)
            if node:
                words.extend(node.words)
                grab_words(node, letters[idx+1:])
            last = letter
    grab_words(root, letters)
    return words

root = Node(0)
with open('/path/to/dict/file', 'rt') as f:
    for word in f:
        add_word(root, word)
Testing:
>>> def nonrepeating_words():
... return find_max_word(root, 'abcdefghijklmnopqrstuvwxyz')
...
>>> sorted(nonrepeating_words(), key=len)[-10:]
['ambidextrously', 'troublemakings', 'dermatoglyphic', 'hydromagnetics', 'hydropneumatic', 'pyruvaldoxines', 'hyperabductions', 'uncopyrightable', 'dermatoglyphics', 'endolymphaticus']
>>> len(nonrepeating_words())
67590
I think I prefer dermatoglyphics to uncopyrightable for longest word, myself. Performance-wise, utilizing a ~500k word dictionary (from here),
>>> import timeit
>>> timeit.timeit(nonrepeating_words, number=100)
62.8912091255188
>>>
So, on average, 6/10ths of a second (on my i5-2500) to find all sixty-seven thousand words that contain no repeating letters.
The big differences between this implementation and a trie (which take it even further from a DAWG in general) are that words are stored in the trie according to their sorted letters, so the word 'dog' is stored under the same path as 'god': d-g-o. The second bit is the find_max_word algorithm, which makes sure every possible letter combination is visited by continually lopping off its head and re-running the search.
Oh, and just for giggles:
>>> sorted(find_max_word(root, 'RAEPKWAEN'), key=len)[-5:]
['wakener', 'rewaken', 'reawake', 'reawaken', 'awakener']
Another approach, similar to #market's answer, is to precompute a 'bitmask' for each word in the dictionary. Bit 0 is set if the word contains at least one A, bit 1 is set if it contains at least one B, and so on up to bit 25 for Z.
If you want to search for all words in the dictionary that could be made up from a combination of letters, you start by forming the bitmask for the collection of letters. You can then filter out all of the words that use other letters by checking whether wordBitmask & ~lettersBitMask is zero. If this is zero, the word only uses letters available in the collection, and so could be valid. If this is non-zero, it uses a letter not available in the collection and so is not allowed.
The advantage of this approach is that the bitwise operations are fast. The vast majority of words in the dictionary will use at least one of the 17 or more letters that aren't in the collection given, and you can speedily discount them all. However, for the minority of words that make it through the filter, there is one more check that you still have to make. You still need to check that words aren't using letters more often than they appear in the collection. For example, the word 'weakener' must be disallowed because it has three 'e's, whereas there are only two in the collection of letters RAEPKWAEN. The bitwise approach alone will not filter out this word since each letter in the word appears in the collection.
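A minimal sketch of that two-stage check (bitmask pre-filter, then the exact letter-count check for the survivors); the names are illustrative:

from collections import Counter

def bitmask(word):
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

def matches(letters, dictionary):
    letters_mask = bitmask(letters)
    rack = Counter(letters)
    for word in dictionary:
        if bitmask(word) & ~letters_mask:
            continue                       # uses a letter not in the collection at all
        if all(rack[c] >= n for c, n in Counter(word).items()):
            yield word                     # also respects letter multiplicities

words = ['weakener', 'reawaken', 'prank']  # hypothetical word list
print(list(matches('raepkwaen', words)))   # ['reawaken', 'prank']

In practice you would precompute bitmask(word) once per dictionary word, as described above, rather than recomputing it on every query.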
When looking for words longer than 10 letters, you may try to iterate over the dictionary words that are longer than 10 letters (I think there are not that many of them) and check whether the required letters are in your set.
The problem is that you have to find all those len(word) >= 10 words first.
So, what I would do:
When reading the dictionary, split the words into 2 categories: shorts and longs. You can process the shorts by iterating over every possible permutation, and the longs by iterating over them and checking whether they are possible (see the sketch below).
Of course there are many optimisations possible to both paths.
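A rough sketch of that split, with a hypothetical cut-off of 7 letters and illustrative names:

from collections import Counter
from itertools import permutations

SPLIT = 7  # hypothetical cut-off between "short" and "long" words

def find_words(letters, dictionary):
    rack = Counter(letters)
    shorts = {w for w in dictionary if len(w) <= SPLIT}
    longs = [w for w in dictionary if len(w) > SPLIT]
    found = set()
    # longs: check each dictionary word against the available letters
    for w in longs:
        if all(rack[c] >= n for c, n in Counter(w).items()):
            found.add(w)
    # shorts: generate permutations of the rack and look them up
    for length in range(min(SPLIT, len(letters)), 0, -1):
        for p in permutations(letters, length):
            w = ''.join(p)
            if w in shorts:
                found.add(w)
    return found

print(max(find_words('raepkwaen', {'reawaken', 'prank', 'wake'}), key=len))  # reawaken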
Construct a trie (prefix tree) from your dictionary. You may want to cache it.
Walk on this trie and remove whole branches that do not fit your bag of letters.
At this point, your trie is the representation of all words in your dictionary that can be constructed from your bag of letters.
Just take the longer one(s) :-)
Edit: you may also use a DAWG (Directed Acyclic Word Graph), which will have fewer vertices. Although I haven't read it, this Wikipedia article has a link about The World's Fastest Scrabble Program.
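A rough sketch of that trie walk, using a plain dict-of-dicts trie and a Counter as the bag of letters (the names are illustrative):

from collections import Counter

END = object()  # marker key: "a word ends at this node"

def add_word(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = word

def longest_in_bag(trie, bag, best=''):
    if END in trie and len(trie[END]) > len(best):
        best = trie[END]
    for ch, child in trie.items():
        if ch is END or bag[ch] == 0:
            continue                      # prune branches we cannot spell
        bag[ch] -= 1
        best = longest_in_bag(child, bag, best)
        bag[ch] += 1
    return best

trie = {}
for w in ['reawaken', 'awaken', 'prank', 'zebra']:   # toy dictionary
    add_word(trie, w)
print(longest_in_bag(trie, Counter('raepkwaen')))    # reawaken

The trie itself only needs to be built once from the dictionary and can then be cached (e.g. pickled), as suggested above.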
DAWG (Directed Acyclic Word Graph)
Mark Wutka was kind enough to provide some pascal code here.
http://www.wutka.com/dawg.html
http://www.wutka.com/DictConvert.ZIP
In case you have a text file with a list of words, this simple code does the job (note it only prints words that use exactly the scrambled letters):
UsrWrd = input() #here you Enter scrambled letters
with open('words.db','r') as f:
    for Line in f:
        for Word in Line.split():
            if len(Word) == len(UsrWrd) and sorted(Word) == sorted(UsrWrd): # compare letter counts, not just the set of letters
                print(Word)
                break
        else:continue