Finding a word from a text dictionary with given random letters

I want a function (e.g. find_from_dict(letters)) that searches dictionary.txt for a word that can be made from the letters the user has entered, specifically the word that uses the most of those letters.
For example, if letters is random typing such as "BAJPPNLE", the function should find "APPLE", since "APPLE" uses the most letters from "BAJPPNLE".
def find_from_dict(letters):
    n = 0
    y = 0
    x = 0
    dictFile = [line.rstrip('\n') for line in open("dictionary.txt")]
    listLetters = list(letters)
    final = []
    while True:
        if n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] in listLetters:
            x = x + 1
        elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] not in listLetters:
            x = 0
            n = n + 1
        elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x == len(list(dictFile[n])):
            final.append(dictFile[n])
        elif n < len(dictFile) and len(list(dictFile[n])) > len(listLetters):
            n = n + 1
        else:
            print(final)
            break
I have this code at the moment, but since my dictionary.txt file is huge and the code is inefficient, it takes forever to run.
Does anyone have any idea how I could make this code efficient?

You can speed this up by preparing a word index keyed on the sorted letters of each word in your word list. Then look up sorted combinations of the input letters in that index.
For example:
from collections import defaultdict
from itertools import combinations
with open("/usr/share/dict/words","r") as wordList:
    words = defaultdict(list)
    for word in wordList.read().upper().split("\n"):
        words[tuple(sorted(word))].append(word) # index by sorted letters
def findWords(letters):
    for size in range(len(letters),2,-1): # from large to small (minimum 3 letters)
        for combo in combinations(sorted(letters),size): # combinations of that size
            for word in (w for w in words[combo]): # matching words from index
                yield word # return as you go (iterator)
                # If you only want one, change this to: return word
Testing:
while True:
    letters = input("Enter letters:")
    if not letters: break
    for word in findWords(letters.upper()):
        stop = input(word)
        if stop: break
    print("")
Sample output:
Enter letters:BAJPPNLE
JELAB
BEJAN
LEBAN
NABLE
PEBAN
PEBAN
ALPEN
NEPAL
PANEL
PENAL
PLANE
ALPEN
NEPAL
PANEL
PENAL
PLANE
APPLE
NAPPE.
Enter letters:EPROING
PERIGON
PIGEON
IGNORE
REGION
PROGNE
OPINER.
Enter letters:
If you need a solution without using libraries, you will need to use a recursive approach that does a breadth first traversal of the combination tree:
with open("/usr/share/dict/words","r") as wordList:
    words = dict()
    for word in wordList.read().upper().split("\n"):
        words.setdefault(tuple(sorted(word)),list()).append(word) # index by sorted letters
def findWords(letters,size=None):
    if size is None:
        letters = sorted(letters)
        for size in range(len(letters),2,-1):
            for word in findWords(letters,size): yield word
    elif len(letters) == size:
        for word in words.get(tuple(letters),[]): yield word
    elif len(letters)>size:
        for i in range(len(letters)):
            for word in findWords(letters[:i]+letters[i+1:],size):
                yield word

You can kind of "cheat" your way through it by pre-processing the dictionary file.
The idea is: instead of having a list of words, you have a list of groups which is determined by the sorted letters of the words.
For example, something like:
"aeegr": [
"agree",
"eager",
],
"alps": [
"alps",
"laps",
"pals",
]
Then if you wanted to just find the exact match, you could sort the letters from the input and search in the processed file.
But you want the one that matches the most letters, so what you could do is number the letters with prime numbers (I'm only considering lowercase ascii characters), so that a is 2, b is 3, c is 5, d is 7 and so on.
Then, you can get a number by multiplying all the letters, so for example for alps you'd get 2*37*53*67.
In your dictionary file you then have the numbers obtained the same way for each word.
Like:
262774: [
"alps",
"laps",
"pals",
]
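You then go through your dictionary, and if the number for the input letters divided by the dictionary number has a remainder of 0, that entry is a possible match.
Among the entries with a remainder of 0, pick the one whose words have the most letters. Note that the largest number is not always the longest word, since a couple of high-prime letters (such as z) can outweigh several low-prime ones.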
Keep in mind that the numbers might get very big very quickly, depending on how many letters you use.
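The prime-product idea above can be sketched roughly as follows. This is only an illustration under a few assumptions: the names word_key, build_index and longest_matches are made up for this example, words are assumed to contain only the letters a to z, and math.prod needs Python 3.8 or later. Python integers are arbitrary precision, so the large products are slow rather than incorrect.

```python
from math import prod

# First 26 primes, one per lowercase letter: a -> 2, b -> 3, ..., z -> 101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
LETTER_PRIME = {chr(ord("a") + i): p for i, p in enumerate(PRIMES)}

def word_key(word):
    """Product of the primes assigned to each letter of `word` (a-z only)."""
    return prod(LETTER_PRIME[c] for c in word.lower())

def build_index(words):
    """Group words by their prime product; anagrams share a key."""
    index = {}
    for w in words:
        index.setdefault(word_key(w), []).append(w)
    return index

def longest_matches(letters, index):
    """Longest words whose key divides the key of `letters` evenly."""
    target = word_key(letters)
    hits = [w for key, group in index.items() if target % key == 0 for w in group]
    if not hits:
        return []
    best = max(len(w) for w in hits)
    return sorted(w for w in hits if len(w) == best)

# Tiny hypothetical word list, standing in for the real dictionary file.
index = build_index(["alps", "laps", "pals", "apple", "plane", "zebra"])
print(word_key("alps"))                      # 2 * 37 * 53 * 67
print(longest_matches("bajppnle", index))
```

The selection is done on word length rather than on the size of the key, for the reason given above.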

Efficient way to choose random element from list based on length

Apologies if this is the wrong forum - it's my first question. I'm learning Python and writing a password generator as an exercise from www.practicepython.org
I've written the following, but it can be really slow, so I guess I'm doing it inefficiently. I want to select a random word from the dictionary and then add ASCII characters to it. I want at least 2 ASCII characters in the password, so I use a while loop to ensure that the word is at most (length - 2) characters long.
This works fine if you say that you want the password to be 10 characters long, but if you restrict it to something like 5, I think the while loop has to go through so many iterations that it can take up to 30 seconds.
I can't find the answer via searching - guidance appreciated!
import string
import random
import nltk
from nltk.corpus import words
word = words.words()[random.randint(1, len(words.words()))]
ascii_str = (string.ascii_letters + string.digits + string.punctuation)
length = int(input("How long do you want the password to be? "))
while len(word) >= (length - 2):
    word = words.words()[random.randint(1, len(words.words()))]
print("The password is: " + word, end="")
for i in range(0, (length - len(word))):
    print(ascii_str[random.randint(1, len(ascii_str) - 1)], end="")

Start by calling words.words() just once and store that in a variable:
allwords = words.words()
That saves a lot of work, because now the nltk.corpus library won't try to load the whole list each time you try to get the length of the list or try to select a random word with the index you generated.
Next, use random.choice() to pick a random element from that list. That eliminates the need to keep passing in a list length:
word = random.choice(allwords)
# ...
while len(word) >= (length - 2):
    word = random.choice(allwords)
Next, you could group the words by length first:
allwords = words.words()
by_length = {}
for word in allwords:
    by_length.setdefault(len(word), []).append(word)
This gives you a dictionary with keys representing the length of the words; the nltk corpus has words between 1 and 24 letters long. Each value in the dictionary is a list of words of the same length, so by_length[12] would give you a list of words that are all exactly 12 characters long.
This allows you to pick words of a specific length:
# start with the desired length, and see if there are words this long in the
# dictionary, but don’t presume that all possible lengths exist:
wordlength = length - 2
while wordlength > 0 and wordlength not in by_length:
    wordlength -= 1
# we picked a length, but it could be 0, -1 or -2, so start with an empty word
# and then pick a random word from the list with words of the right length.
word = ''
if wordlength > 0:
    word = random.choice(by_length[wordlength])
Now word is the longest random word that'll fit your criteria: at least 2 characters shorter than the required length, and taken at random from the word list.
More importantly: we only picked a random word once. Provided you keep the by_length dictionary around for longer and re-use it in a password-generating function, that's a big win.
Picking the nearest available length from by_length can be done without stepping through every possible length one step at a time if you use bisection, but I’ll leave adding that as an exercise for the reader.
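One way the bisection exercise might look, as a sketch only: nearest_length and pick_word are names invented for this example, and the small by_length dictionary stands in for the one built from the nltk corpus above. The sorted list of available lengths is built once and then searched with the standard bisect module.

```python
import bisect
import random

def nearest_length(lengths, max_len):
    """Largest value in the sorted list `lengths` that is <= max_len, or None."""
    pos = bisect.bisect_right(lengths, max_len)
    return lengths[pos - 1] if pos else None

def pick_word(by_length, lengths, max_len):
    """Random word of the longest available length not exceeding max_len ('' if none)."""
    best = nearest_length(lengths, max_len)
    return random.choice(by_length[best]) if best is not None else ""

# Hypothetical stand-in for the by_length dictionary built from the corpus.
by_length = {1: ["a"], 3: ["cat", "dog"], 7: ["example"]}
lengths = sorted(by_length)          # computed once, reused for every password

print(nearest_length(lengths, 5))    # no 5- or 4-letter words, so falls back to 3
print(pick_word(by_length, lengths, 8))
```

With only a couple of dozen distinct lengths the gain over the simple countdown loop is small; the bisection mostly keeps the lookup tidy.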

You are looking for random.choice
From the docs:
random.choice(seq)
Return a random element from the non-empty sequence seq.
In [22]: import random
In [23]: random.choice([1,2,3,4,5])
Out[23]: 3
In [24]: random.choice([1,2,3,4,5])
Out[24]: 5
In [25]: random.choice([1,2,3,4,5])
Out[25]: 1
The code can then be simplified to
import string
import random
import nltk
from nltk.corpus import words
# All words assigned to a list first
words = words.words()
# Get a random word
word = random.choice(words)
ascii_str = string.ascii_letters + string.digits + string.punctuation
length = int(input("How long do you want the password to be? "))
while len(word) >= (length - 2):
    word = random.choice(words)
# Use random.sample to choose n random samples, and join them all to make a string.
# Note: this appends `length` extra characters on top of the word, so the result is
# longer than requested; use length - len(word) here for a password of exactly `length`.
password = word + ''.join(random.sample(ascii_str, length))
print("The password is: " + password, end="")
Possible outputs are
How long do you want the password to be? 10
The password is: heyT{7<XEVc!l
How long do you want the password to be? 8
The password is: hiBk-^8t7]
Of course, this is not an optimized solution, as noted by @MartijnPieters in the comments, so here is something along the lines he pointed out in his answer, done in a slightly different way:
I will use itertools.groupby to create the by_length dictionary, a dictionary with word length as key and the list of words of that length as value
I will enforce a minimum length for the password
I will use random.sample to choose the remaining random characters, join them into a string, and put the word in front
import string
import random
from itertools import groupby
# All words assigned to a list first
words = ['a', 'c', 'e', 'bc', 'def', 'ghij', 'jklmn']
ascii_str = string.ascii_letters + string.digits + string.punctuation
# Check for minimum length, and exit if the requested length is too short
min_length = 8
pass_len = int(input("How long do you want the password to be? Minimum length is {}: ".format(min_length)))
if pass_len < min_length:
    print('Password is not long enough')
    exit()
# Create the by_length dictionary (key: word length, value: list of words of that length).
# groupby only groups adjacent items, so sort the words by length first.
by_length = {}
for word_len, group in groupby(sorted(words, key=len), key=len):
    by_length[word_len] = list(group)
chosen_word = ''
req_len = pass_len - 2
# Look for words of length pass_len - 2; if there are none, reduce the required length by 1
while req_len > 0:
    if req_len in by_length:
        chosen_word = random.choice(by_length[req_len])
        break
    req_len -= 1
# Use random.sample to choose the remaining characters, and join them all to make a string
password = chosen_word + ''.join(random.sample(ascii_str, pass_len - len(chosen_word)))
print("The password is: " + password, end="")

Scrabble cheater: scoring wildcard characters to zero in Python

I'm new to the Python world, and I wrote a Scrabble word finder that supports two wildcards (* and ?). When scoring a word, I would like wildcard letters to score zero, but it doesn't seem to work. I'm wondering what is missing here.
Look at the lines after "# Add score and valid word to the empty list": if a letter in the word is not in the rack, I try to remove that letter so that I only score the characters that match letters in the rack and do not come from wildcards. For example, if I have B* in my rack and the word is BO, I would like to remove O and only score B, so that the wildcard scores zero.
But the result is not what I expected.
import sys
if len(sys.argv) < 2:
    print("no rack error.")
    exit(1)
rack = sys.argv[1]
rack_low = rack.lower()
# Turn the words in the sowpods.txt file into a Python list.
with open("sowpods.txt","r") as infile:
    raw_input = infile.readlines()
    data = [datum.strip('\n') for datum in raw_input]
# Find all of the valid sowpods words that can be made
# up of the letters in the rack.
valid_words = []
# Call each word in the sowpods.txt
for word in data:
    # Change word to lowercase not to fail due to case.
    word_low = word.lower()
    candidate = True
    rack_letters = list(rack_low)
    # Iterate each letter in the word and check if the letter is in the
    # Scrabble rack. If used once in the rack, remove the letter from the rack.
    # If there's no letter in the rack, skip the letter.
    for letter in word_low:
        if letter in rack_letters:
            rack_letters.remove(letter)
        elif '*' in rack_letters:
            rack_letters.remove('*')
        elif '?' in rack_letters:
            rack_letters.remove('?')
        else:
            candidate = False
    if candidate == True:
        # Add score and valid word to the empty list
        total = 0
        for letter in word_low:
            if letter not in rack_letters:
                word_strip = word_low.strip(letter)
                for letter in word_strip:
                    total += scores[letter]
        valid_words.append([total, word_low])

I'm going to go a slightly different route with my answer and hopefully speed the overall process up. We're going to import another function from the standard library -- permutations -- and then find possible results by trimming the total possible word list by the length of the rack (or, whatever argument is passed).
I've commented accordingly.
import sys
from itertools import permutations # So we can get our permutations from all the letters.
if len(sys.argv) < 2:
    print("no rack error.")
    exit(1)
rack = sys.argv[1]
rack_low = rack.lower()
# Turn the words in the sowpods.txt file into a Python list.
txt_path = r'C:\\\\\sowpods.txt'
with open(txt_path,'r') as infile:
    raw_input = infile.readlines()
    # Added .lower() here.
    data = [i.strip('\n').lower() for i in raw_input]
## Sample rack of 7 letters with wildcard character.
sample_rack = 'jrnyoj?'
# Remove any non-alphabetic characters (i.e. - wildcards)
# We're using the isalpha() method.
clean_rack = ''.join([i for i in sample_rack if i.isalpha()])
# Trim word list to the letter count in the rack.
# (You can skip this part, but it might make producing results a little quicker.)
trimmed_data = [i for i in data if len(i) <= len(clean_rack)]
# Create all permutations from the letters in the rack
# We'll iterate over a count from 2 to the length of the rack
# so that we get all relevant permutations.
all_permutations = list()
for i in range(2, len(clean_rack) + 1):
    all_permutations.extend(list(map(''.join, permutations(clean_rack, i))))
# We'll use set().intersection() to help speed the discovery process.
valid_words = list(set(all_permutations).intersection(set(trimmed_data)))
# Print sorted list of results to check.
print(f'Valid words for a rack containing letters \'{sample_rack}\' are:\n\t* ' + '\n\t* '.join(sorted(valid_words)))
Our output would be the following:
Valid words for a rack containing letters 'jrnyoj?' are:
* jo
* jor
* joy
* no
* nor
* noy
* ny
* on
* ony
* or
* oy
* yo
* yon
If you want to verify that the results are actually in the sowpods.txt file, you can just index the sowpods.txt list by where the word you want to look up is indexed:
trimmed_data[trimmed_data.index('jor')]

When you are totalling the scores, you are using the words from the word list and not the inputted letters:
total = 0
for letter in word_low:
    ...
Rather, this should be:
total = 0
for letter in rack_low:
    ...
Also, you do not need to loop and remove the letters with strip at the end.
You can just have:
total = 0
for letter in rack_low:
    if letter not in rack_letters:
        try:
            total += scores[letter]
        except KeyError: # If letter is * or ? then a KeyError occurs
            pass
valid_words.append([total, word_low])
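Another way to get wildcards to score zero is to decide, letter by letter, whether a real tile or a wildcard is being spent, and only add points for real tiles. The sketch below is an illustration rather than a drop-in fix: score_word is a name made up here, and SCORES is assumed to hold the standard English Scrabble tile values in place of the scores dictionary that the question uses but does not show.

```python
from collections import Counter

# Assumed letter values (standard English Scrabble tiles).
SCORES = {"a": 1, "b": 3, "c": 3, "d": 2, "e": 1, "f": 4, "g": 2, "h": 4,
          "i": 1, "j": 8, "k": 5, "l": 1, "m": 3, "n": 1, "o": 1, "p": 3,
          "q": 10, "r": 1, "s": 1, "t": 1, "u": 1, "v": 4, "w": 4, "x": 8,
          "y": 4, "z": 10}
WILDCARDS = "*?"

def score_word(word, rack):
    """Score of `word` played from `rack`, or None if it cannot be formed.

    Letters covered by a wildcard contribute 0 points.
    """
    rack = rack.lower()
    tiles = Counter(c for c in rack if c not in WILDCARDS)   # real tiles
    wild = sum(rack.count(w) for w in WILDCARDS)             # wildcards left
    total = 0
    for letter in word.lower():
        if tiles[letter] > 0:        # spend a real tile and score it
            tiles[letter] -= 1
            total += SCORES[letter]
        elif wild > 0:               # spend a wildcard, worth nothing
            wild -= 1
        else:                        # neither available: word is not playable
            return None
    return total

print(score_word("bo", "b*"))    # only the B scores
print(score_word("bob", "b*"))   # not enough tiles
```

In the loop from the question, this would replace both the candidate check and the scoring pass: a word is valid when score_word returns something other than None.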

How can I isolate the first character of each word in a string?

I wrote this program, but it doesn't work, and I cannot figure out what it is doing when I input two words separated by a space.
sinput = input("Enter a sentence") #Collects a string from the user
x = len(sinput) #Calculates how many characters there are in the string
for n in range(x):
    n = 0 #Sets n to 0
    lsinput = sinput.split(" ") #Splits sinput into separate words
    lsinput = lsinput[n] #Targets the nth word and isolates it into the variable lsinput
    print(lsinput[1]) #Prints the 1st letter of lsinput
    n += 1 #Adds 1 to n so the loop can move on to the next word

I recommend starting with a beginner's book on Python. I'm not sure which one, but definitely do some reading.
To answer your question and help get you going, though, you can just do this:
[w[0] for w in sinput.split() if w]

The problem was that you:
set n back to 0 on every iteration of the loop
looped the wrong number of times (once per character rather than once per word)
used 1 to retrieve the first letter rather than 0 (indexes start at 0)
Adjusting this for your code:
sinput = input("Enter a string to convert to phonetic alphabet") #Collects a string from the user
lsinput = sinput.split(" ") #Splits sinput into separate words
x = len(lsinput) #Calculates how many words there are in the string
for n in range(x): #range() already steps n from 0 to x - 1
    print(lsinput[n][0]) #Prints the 1st letter of the nth word in lsinput
I also moved the lsinput assignment out of the loop so that you don't recalculate the list on every iteration.

I am not sure I really understood the question, but if you want to get the first letter of each word in the input, this code will do it:
map(lambda x: x[0], sinput.split(" "))
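Two details are worth keeping in mind with the one-liners in this thread: in Python 3, map returns a lazy iterator rather than a list, and split(" ") produces empty strings when there are consecutive spaces, which makes x[0] fail. A small sketch that sidesteps both (the function name first_letters is just for illustration):

```python
def first_letters(sentence):
    """Return the first character of each whitespace-separated word."""
    # split() with no argument collapses runs of whitespace,
    # so there are never empty strings to index into.
    return [word[0] for word in sentence.split()]

print(first_letters("hello big  world"))   # ['h', 'b', 'w']
```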

Python 3.2 - Converting words to lower case & number of palindromes in the list of words that have at least 3 letters

I have a file of random words; some of them are palindromes and some are not. Some of those palindromes are 3 or more letters long. How do I count them? I'm wondering how to make the conditions better. I thought I could just check the length, but I keep getting 0 as my answer, which I know is not true because I have the .txt file.
Where am I messing up?
number_of_words = []
with open('words.txt') as wordFile:
    for word in wordFile:
        word = word.strip()
        for letter in word:
            letter_lower = letter.lower()
def count_pali(wordFile):
    count_pali = 0
    for word in wordFile:
        word = word.strip()
        if word == word[::-1]:
            count_pali += 1
    return count_pali
print(count_pali)
count = 0
for number in number_of_words:
    if number >= 3:
        count += 1
print("Number of palindromes in the list of words that have at least 3 letters: {}".format(count))

You are looping through number_of_words in order to calculate count, but number_of_words is initialized to an empty list and never changed after that, hence the loop
for number in number_of_words:
    if number >= 3:
        count += 1
will execute exactly 0 times.

Your code looks good right up until the loop:
for number in number_of_words:
    if number >= 3:
        count += 1
There is a problem in the logic here. If you think about the data structure of number_of_words, and what you are actually asking python to compare with the 'number >= 3' condition, then I think you will figure your way through it nicely.
--- revised look:
# Getting the words into a list
# word_file = [word1, word2, word3, ..., wordn]
with open('words.txt') as f:
    word_file = f.readlines()
# set up counters
count_pali, high_score = 0, 0
# iterate through each word and test it
for word in word_file:
    # strip newline character and convert to lower case
    word = word.strip().lower()
    # if word is palindrome
    if word == word[::-1]:
        count_pali += 1
        # if word is palindrome AND word has at least 3 letters
        if len(word) >= 3:
            high_score += 1
print('Number of palindromes in the list of words that have at least 3 letters: {}'.format(high_score))
NOTES:
count_pali: counts the total number of words that are palindromes
high_score: counts the total number of palindromes that have at least 3 letters
len(word): if word is palindrome, will test the length of the word
Good luck!
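For comparison, the same check can be packed into a small function that works on any iterable of lines, including an open file. This is a sketch under the assumption that "at least 3 letters" means len(word) >= 3 after stripping whitespace and lower-casing; the name count_palindromes and the sample list are made up for the example.

```python
def count_palindromes(words, min_length=3):
    """Count entries that read the same backwards and have >= min_length letters."""
    count = 0
    for raw in words:
        word = raw.strip().lower()          # drop the newline, ignore case
        if len(word) >= min_length and word == word[::-1]:
            count += 1
    return count

# Stand-in for the lines of words.txt.
sample = ["Level\n", "noon\n", "aa\n", "python\n", "Eve\n", "\n"]
print(count_palindromes(sample))   # 3: level, noon, eve
```

With the real file this would be called as count_palindromes(open('words.txt')), ideally inside a with block.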

This doesn't directly answer your question, but it might help to understand some of the problems that we have run into here. Mostly, you can see how to add to a list, and hopefully see the difference between getting the length of a string, list and integer (which you actually can't do!).
Try running the code below, and examine what is happening:
def step_forward():
    input('(Press ENTER to continue...)')
    print('\n.\n.\n.')
def experiment():
    """ Run a whole lot of experiments to explore the idea of lists and
    variables"""
    # create an empty list, test length
    word_list = []
    print('the length of word_list is: {}'.format(len(word_list)))
    # expect output to be zero
    step_forward()
    # add some words to the list
    print('\nAdding some words...')
    word_list.append('Hello')
    word_list.append('Experimentation')
    word_list.append('Interesting')
    word_list.append('ending')
    # test length of word_list again
    print('\ttesting length again...')
    print('\tthe length of word_list is: {}'.format(len(word_list)))
    step_forward()
    # print the length of each word in the list
    print('\nget the length of each word...')
    for each_word in word_list:
        print('\t{word} has a length of: {length}'.format(word=each_word, length=len(each_word)))
    # output:
    # Hello has a length of: 5
    # Experimentation has a length of: 15
    # Interesting has a length of: 11
    # ending has a length of: 6
    step_forward()
    # set up a couple of counters
    short_word = 0
    long_word = 0
    # test the length of the counters:
    print('\nTrying to get the length of our counter variables...')
    try:
        len(short_word)
        len(long_word)
    except TypeError:
        print('\tERROR: You can not get the length of an int')
    # you see, you can't get the length of an int
    # but, you can get the length of a word, or string!
    step_forward()
    # we will make some new tests, and some assumptions:
    # short_word: a word is short, if it has less than 9 letters
    # long_word: a word is long, if it has 9 or more letters
    # this test will count how many short and long words there are
    print('\nHow many short and long words are there?...')
    for each_word in word_list:
        if len(each_word) < 9:
            short_word += 1
        else:
            long_word += 1
    print('\tThere are {short} short words and {long} long words.'.format(short=short_word, long=long_word))
    step_forward()
    # BUT... what if we want to know which are the SHORT words and which are the LONG words?
    short_word = 0
    long_word = 0
    for each_word in word_list:
        if len(each_word) < 9:
            short_word += 1
            print('\t{word} is a SHORT word'.format(word=each_word))
        else:
            long_word += 1
            print('\t{word} is a LONG word'.format(word=each_word))
    step_forward()
    # and lastly, if you need to use the short or long words again, you can
    # create new sublists
    print('\nMaking two new lists...')
    shorts = []
    longs = []
    short_word = 0
    long_word = 0
    for each_word in word_list:
        if len(each_word) < 9:
            short_word += 1
            shorts.append(each_word)
        else:
            long_word += 1
            longs.append(each_word)
    print('short list: {}'.format(shorts))
    print('long list: {}'.format(longs))
    # now, the counters short_word and long_word should equal the length of the new lists
    if short_word == len(shorts) and long_word == len(longs):
        print('Hurray, it works!')
    else:
        print('Oh no!')
experiment()
Hopefully, when you look through our answers here, and examine the mini-experiment above, you will be able to get your code to do what you need it to do :)

Efficient hunting for words in scrambled letters

I guess you could classify this as a Scrabble style problem, but it started out due to a friend mentioning the UK TV quiz show Countdown. Various rounds in the show involve the contestants being presented a scrambled set of letters and they have to come up with the longest word they can. The one my friend mentioned was "RAEPKWAEN".
In fairly short order I whipped up something in Python to handle this problem, using PyEnchant to handle the dictionary look-ups, however I'm noticing that it really can't scale all that well.
Here's what I have currently:
#!/usr/bin/python
from itertools import permutations
import enchant
from sys import argv
def find_longest(origin):
    s = enchant.Dict("en_US")
    for i in range(len(origin),0,-1):
        print "Checking against words of length %d" % i
        pool = permutations(origin,i)
        for comb in pool:
            word = ''.join(comb)
            if s.check(word):
                return word
    return ""
if (__name__)== '__main__':
    result = find_longest(argv[1])
    print result
That's fine on a 9 letter example like they use in the show, 9 factorial = 362,880 and 8 factorial = 40,320. On that scale even if it would have to check all possible permutations and word lengths it's not that many.
However, once you reach 14 characters that's 87,178,291,200 possible permutations, meaning you're reliant on luck that a 14 character word is quickly found.
With the example word above it's taking my machine about 12 1/2 seconds to find "reawaken". With 14 character scrambled words we could be talking on the scale of 23 days just to check all possible 14 character permutations.
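The counts quoted above can be checked with a few lines of Python 3 (math.perm needs Python 3.8 or later). The helper permutation_count is a name made up here; it adds up the arrangements for every word length from 1 to n, which is an upper bound because repeated letters such as the two A's and two E's in "RAEPKWAEN" produce duplicate strings.

```python
import math

def permutation_count(n):
    """Ordered arrangements of 1..n letters drawn from n letters (upper bound)."""
    return sum(math.perm(n, k) for k in range(1, n + 1))

print(math.factorial(9))       # 362880
print(math.factorial(8))       # 40320
print(math.factorial(14))      # 87178291200
print(permutation_count(9))    # 986409 candidate strings across all lengths
```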
Is there any more efficient way to handle this?

An implementation of Jeroen Coupé's idea (from his answer) using letter counts:
from collections import defaultdict, Counter
def find_longest(origin, known_words):
    return iter_longest(origin, known_words).next()
def iter_longest(origin, known_words, min_length=1):
    origin_map = Counter(origin)
    for i in xrange(len(origin) + 1, min_length - 1, -1):
        for word in known_words[i]:
            if check_same_letters(origin_map, word):
                yield word
def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)
def load_words_from(file_path):
    known_words = defaultdict(list)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            known_words[len(word)].append(word)
    return known_words
if __name__ == '__main__':
    known_words = load_words_from('words_list.txt')
    origin = 'raepkwaen'
    big_origin = 'raepkwaenaqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnm'
    print find_longest(big_origin, known_words)
    print list(iter_longest(origin, known_words, 5))
Output (for my small 58000 words dict):
counterrevolutionaries
['reawaken', 'awaken', 'enwrap', 'weaken', 'weaker', 'apnea', 'arena', 'awake',
'aware', 'newer', 'paean', 'parka', 'pekan', 'prank', 'prawn', 'preen', 'renew',
'waken', 'wreak']
Notes:
It's a simple implementation without optimizations.
words_list.txt can be /usr/share/dict/words on Linux.
UPDATE
In case we need to find a word only once, and we have a dictionary with the words sorted by length, e.g. produced by this script:
with open('words_list.txt') as f:
    words = f.readlines()
with open('words_by_len.txt', 'w') as f:
    for word in sorted(words, key=lambda w: len(w), reverse=True):
        f.write(word)
We can find longest word without loading full dict to memory:
from collections import Counter
import sys
def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)
def iter_longest_from_file(origin, file_path, min_length=1):
    origin_map = Counter(origin)
    origin_len = len(origin)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            if len(word) > origin_len:
                continue
            if len(word) < min_length:
                return
            if check_same_letters(origin_map, word):
                yield word
def find_longest_from_file(origin, file_path):
    return iter_longest_from_file(origin, file_path).next()
if __name__ == '__main__':
    origin = sys.argv[1] if len(sys.argv) > 1 else 'abcdefghijklmnopqrstuvwxyz'
    print find_longest_from_file(origin, 'words_by_len.txt')

You want to avoid generating the permutations. Instead, count how many times each character appears in both strings (the original string and the word from the dictionary). Dismiss every dictionary word that uses some character more often than it appears in the original string.
So to check one word from the dictionary you will need to count the characters at most max(26, n) times.
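In Python 3, the counting check described above might look like the following sketch. The names can_form and longest_word and the four-word list are invented for the illustration; the check relies on the fact that subtracting one collections.Counter from another keeps only positive counts, so an empty result means no letter is overused.

```python
from collections import Counter

def can_form(word, letters):
    """True if `word` uses no letter more often than it appears in `letters`."""
    return not (Counter(word) - Counter(letters))

def longest_word(letters, dictionary):
    """Longest dictionary word that can be formed, or '' if there is none."""
    candidates = [w for w in dictionary if can_form(w, letters)]
    return max(candidates, key=len, default="")

# Tiny stand-in for a real word list.
words = ["reawaken", "weakener", "awaken", "prank"]
print(can_form("weakener", "raepkwaen"))   # False: needs three e's
print(longest_word("raepkwaen", words))    # reawaken
```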

Pre-parse the dictionary as sorted(word), word pairs. (e.g. giilnstu, linguist)
Sort the dictionary file.
Then, when you are searching for a given set of letters:
Binary search the dictionary for the letters you have, sorting the letters first.
You'd need to do this separately for each word length.
EDIT: should say that you're searching for all unique combinations of the sorted letters of the target word length (range(len(letters), 0, -1))
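The steps above could be sketched like this in Python 3. It keeps the (sorted letters, word) pairs in a sorted in-memory list rather than a sorted file, which is an assumption made to keep the example short; build_table, lookup and longest_words are hypothetical names. The binary search uses the standard bisect module, and each word length is tried from longest to shortest.

```python
import bisect
from itertools import combinations

def build_table(words):
    """Sorted list of (sorted-letters key, word) pairs."""
    return sorted(("".join(sorted(w)), w) for w in words)

def lookup(table, key):
    """All words whose sorted letters equal `key`, located by binary search."""
    i = bisect.bisect_left(table, (key, ""))   # first pair with this key, if any
    found = []
    while i < len(table) and table[i][0] == key:
        found.append(table[i][1])
        i += 1
    return found

def longest_words(table, letters, min_size=1):
    """Words of the greatest length that can be built from `letters`."""
    pool = sorted(letters)
    for size in range(len(pool), min_size - 1, -1):
        # A set removes duplicate combinations caused by repeated letters.
        keys = {"".join(c) for c in combinations(pool, size)}
        found = sorted(w for key in keys for w in lookup(table, key))
        if found:
            return found
    return []

table = build_table(["linguist", "apple", "plane", "panel", "nap"])
print(lookup(table, "giilnstu"))            # ['linguist']
print(longest_words(table, "bajppnle"))     # ['apple', 'panel', 'plane']
```

The number of combinations grows quickly with the number of input letters, so this suits short racks better than long ones.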

This is similar to an anagram problem I've worked on before. I solved that by using a prime number to represent each letter. The product of the letters for each word produces a number. To determine whether a given set of input characters is sufficient to make a word, just divide the product of the input characters by the product for the word you want to check. If there is no remainder, then the input characters are sufficient. I've implemented it below. The output is:
$ python longest.py rasdaddea aosddna raepkwaen
rasdaddea --> sadder
aosddna --> soda
raepkwaen --> reawaken
You can find more details and a thorough explanation of the anagrams case at:
http://mostlyhighperformance.blogspot.com/2012/01/generating-anagrams-efficient-and-easy.html
This algorithm takes a small amount of time to set up a dictionary, and then individual checks are as easy as a single division for every word in the dictionary. There may be faster methods that rely on closing off parts of the dictionary if it lacks a letter, but these may end up performing worse if you have large number of input letters so it is actually not able to close off any part of the dictionary.
import sys
def nextprime(x):
    while True:
        x += 1
        for pot_fac in range(2,x):
            if x % pot_fac == 0:
                break
        else:
            return x
def prime_generator():
    '''Returns a generator that produces the next largest prime as
    compared to the one returned from this function the last time
    it was called. The first time it is called it will return 2.'''
    lastprime = 1
    while True:
        lastprime = nextprime(lastprime)
        yield lastprime
# Assign prime numbers to each lower case letter
gen = prime_generator()
primes = dict( [ (chr(x),gen.next()) for x in range(ord('a'),ord('z')+1) ] )
product = lambda x: reduce( lambda m,n: m*n, x, 1 )
make_key = lambda x: product( [ primes[y] for y in x ] )
try:
    words = open('words').readlines()
    words = [ ''.join( [ c for c in x.lower() \
                if ord('a') <= ord(c) <= ord('z') ] ) \
            for x in words ]
    for x in words:
        try:
            make_key(x)
        except:
            print x
            raise
except IOError:
    words = [ 'reawaken','awaken','enwrap','weaken','weaker', ]
words = dict( ( (make_key(x),x,) for x in words ) )
inputs = sys.argv[1:] if sys.argv[1:] else [ 'raepkwaen', ]
for input in inputs:
    input_key = make_key(input)
    results = [ words[x] for x in words if input_key % x == 0 ]
    result = reversed(sorted(results, key=len)).next()
    print input,'--> ',result

I started this last night shortly after you asked the question, but didn't get around to polishing it up until just now. This was my solution, which is basically a modified trie, which I didn't know until today!
class Node(object):
    __slots__ = ('words', 'letter', 'child', 'sib')
    def __init__(self, letter, sib=None):
        self.words = []
        self.letter = letter
        self.child = None
        self.sib = sib
    def get_child(self, letter, create=False):
        child = self.child
        if not child or child.letter > letter:
            if create:
                self.child = Node(letter, child)
                return self.child
            return None
        return child.get_sibling(letter, create)
    def get_sibling(self, letter, create=False):
        node = self
        while node:
            if node.letter == letter:
                return node
            sib = node.sib
            if not sib or sib.letter > letter:
                if create:
                    node.sib = Node(letter, sib)
                    node = node.sib
                    return node
                return None
            node = sib
        return None
    def __repr__(self):
        return '<Node({}){}{}: {}>'.format(chr(self.letter), 'C' if self.child else '', 'S' if self.sib else '', self.words)
def add_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    node = root
    for letter in letters:
        node = node.get_child(letter, True)
    node.words.append(word)
def find_max_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    words = []
    def grab_words(root, letters):
        last = None
        for idx, letter in enumerate(letters):
            if letter == last: # prevents duplication
                continue
            node = root.get_child(letter)
            if node:
                words.extend(node.words)
                grab_words(node, letters[idx+1:])
            last = letter
    grab_words(root, letters)
    return words
root = Node(0)
with open('/path/to/dict/file', 'rt') as f:
    for word in f:
        add_word(root, word)
Testing:
>>> def nonrepeating_words():
... return find_max_word(root, 'abcdefghijklmnopqrstuvwxyz')
...
>>> sorted(nonrepeating_words(), key=len)[-10:]
['ambidextrously', 'troublemakings', 'dermatoglyphic', 'hydromagnetics', 'hydropneumatic', 'pyruvaldoxines', 'hyperabductions', 'uncopyrightable', 'dermatoglyphics', 'endolymphaticus']
>>> len(nonrepeating_words())
67590
I think I prefer dermatoglyphics to uncopyrightable for longest word, myself. Performance-wise, utilizing a ~500k word dictionary (from here),
>>> import timeit
>>> timeit.timeit(nonrepeating_words, number=100)
62.8912091255188
>>>
So, on average, 6/10ths of a second (on my i5-2500) to find all sixty-seven thousand words that contain no repeating letters.
There are two big differences between this implementation and a trie (which make it even further from a DAWG in general). The first is that words are stored in the trie according to their sorted letters, so the word 'dog' is stored under the same path as 'god': d-g-o. The second is the find_max_word algorithm, which makes sure every possible letter combination is visited by continually lopping off the head of the letter list and re-running the search.
Oh, and just for giggles:
>>> sorted(find_max_word(root, 'RAEPKWAEN'), key=len)[-5:]
['wakener', 'rewaken', 'reawake', 'reawaken', 'awakener']

Another approach, similar to #market's answer, is to precompute a 'bitmask' for each word in the dictionary. Bit 0 is set if the word contains at least one A, bit 1 is set if it contains at least one B, and so on up to bit 25 for Z.
If you want to search for all words in the dictionary that could be made up from a combination of letters, you start by forming the bitmask for the collection of letters. You can then filter out all of the words that use other letters by checking whether wordBitmask & ~lettersBitMask is zero. If this is zero, the word only uses letters available in the collection, and so could be valid. If this is non-zero, it uses a letter not available in the collection and so is not allowed.
The advantage of this approach is that the bitwise operations are fast. The vast majority of words in the dictionary will use at least one of the 17 or more letters that aren't in the collection given, and you can speedily discount them all. However, for the minority of words that make it through the filter, there is one more check that you still have to make. You still need to check that words aren't using letters more often than they appear in the collection. For example, the word 'weakener' must be disallowed because it has three 'e's, whereas there are only two in the collection of letters RAEPKWAEN. The bitwise approach alone will not filter out this word since each letter in the word appears in the collection.
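The two-stage check described above could be sketched roughly as follows. This is only an illustration: the names letter_mask, build_index and find_words and the tiny sample word list are made up for the example, and a Counter comparison is one possible way to do the second (letter count) check.

```python
from collections import Counter

def letter_mask(text):
    """Return a 26-bit mask with bit i set if the i-th letter (A=0) occurs."""
    mask = 0
    for ch in text.upper():
        if 'A' <= ch <= 'Z':
            mask |= 1 << (ord(ch) - ord('A'))
    return mask

def build_index(words):
    """Precompute (word, mask) pairs once for the whole dictionary."""
    return [(w.upper(), letter_mask(w)) for w in words]

def find_words(index, letters):
    letters = letters.upper()
    allowed = letter_mask(letters)
    available = Counter(letters)
    matches = []
    for word, mask in index:
        # Fast filter: the word uses a letter that is not in the collection.
        if mask & ~allowed:
            continue
        # Slow check: no letter may be used more often than it is available.
        if not (Counter(word) - available):
            matches.append(word)
    return matches

# Hypothetical sample data standing in for the real dictionary file.
sample = ["weakener", "reawaken", "awakener", "apple", "wake", "pear"]
index = build_index(sample)
print(sorted(find_words(index, "RAEPKWAEN"), key=len))
```

Here 'apple' is rejected by the bitmask alone (there is no L in the collection), while 'weakener' passes the bitmask and is only rejected by the count check.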

When looking for words longer than 10 letters, you can iterate over the dictionary words that are longer than 10 letters (I don't think there are that many of them) and check whether your set has the letters each one requires.
The problem is that you have to find all the words with len(word) >= 10 first.
So, what I would do:
When reading the dictionary, split the words into two categories: shorts and longs. You can process the shorts by iterating over every possible permutation of your letters. Then you can process the longs by iterating over them and checking whether each one is possible.
Of course, there are many optimisations possible for both paths.
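A minimal sketch of that split might look like the following. The function names, the threshold parameter and the sample words are assumptions made for the illustration, and the demo uses a small threshold because the number of permutations grows very quickly with the number of letters.

```python
from collections import Counter
from itertools import permutations

def split_dictionary(words, threshold=10):
    """Split the words into a set of shorts and a list of longs."""
    shorts = {w for w in words if len(w) < threshold}
    longs = [w for w in words if len(w) >= threshold]
    return shorts, longs

def find_candidates(shorts, longs, letters, threshold=10):
    found = set()
    # Shorts: try every arrangement of the letters and look it up in the set.
    for size in range(1, min(len(letters), threshold - 1) + 1):
        for perm in permutations(letters, size):
            candidate = ''.join(perm)
            if candidate in shorts:
                found.add(candidate)
    # Longs: test each word directly against the available letters.
    available = Counter(letters)
    for word in longs:
        if not (Counter(word) - available):
            found.add(word)
    return found

# Hypothetical sample data; threshold=5 keeps the permutation count small.
words = ["apple", "pale", "plan", "banjo", "jab"]
shorts, longs = split_dictionary(words, threshold=5)
found = find_candidates(shorts, longs, "bajppnle", threshold=5)
print(max(found, key=len))
```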

Construct a trie (prefix tree) from your dictionary. You may want to cache it.
Walk this trie and remove whole branches that do not fit your bag of letters.
At this point, your trie is the representation of all words in your dictionary that can be constructed from your bag of letters.
Just take the longest one(s) :-)
Edit: you may also use a DAWG (Directed Acyclic Word Graph), which will have fewer vertices. Although I haven't read it, this Wikipedia article has a link about The World's Fastest Scrabble Program.
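As a rough sketch of the trie idea (not a DAWG), the walk could look like this. The nested-dict representation, the '$' end-of-word marker and the function names are assumptions for the example, and instead of physically removing branches it simply skips them during the walk.

```python
from collections import Counter

def build_trie(words):
    """Nested dicts keyed by letter; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word
    return root

def words_from_bag(trie, letters):
    """Walk the trie, skipping every branch the bag of letters cannot follow."""
    bag = Counter(letters)
    found = []

    def walk(node):
        if '$' in node:
            found.append(node['$'])
        for ch, child in node.items():
            if ch != '$' and bag[ch] > 0:
                bag[ch] -= 1   # use up one copy of this letter
                walk(child)
                bag[ch] += 1   # put it back before trying other branches

    walk(trie)
    return found

# Hypothetical sample data standing in for dictionary.txt.
trie = build_trie(["apple", "ape", "apply", "banjo", "pal"])
candidates = words_from_bag(trie, "bajppnle")
print(max(candidates, key=len))
```

Because the trie depends only on the dictionary, it can be built once and reused for every set of letters.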

DAWG (Directed Acyclic Word Graph)
Mark Wutka was kind enough to provide some Pascal code here:
http://www.wutka.com/dawg.html
http://www.wutka.com/DictConvert.ZIP

If you have a text file of sorted words, this code does the job:
UsrWrd = input()  # enter the scrambled letters here
with open('words.db', 'r') as f:
    for Line in f:
        for Word in Line.split():
            # Compare sorted letters rather than sets, so that repeated
            # letters have to match as well (the length check is implied).
            if sorted(Word) == sorted(UsrWrd):
                print(Word)
                break  # stop scanning this line after the first match
