I guess you could classify this as a Scrabble style problem, but it started out due to a friend mentioning the UK TV quiz show Countdown. Various rounds in the show involve the contestants being presented a scrambled set of letters and they have to come up with the longest word they can. The one my friend mentioned was "RAEPKWAEN".
In fairly short order I whipped up something in Python to handle this problem, using PyEnchant to handle the dictionary look-ups; however, I'm noticing that it really can't scale all that well.
Here's what I have currently:
#!/usr/bin/python
from itertools import permutations
import enchant
from sys import argv

def find_longest(origin):
    s = enchant.Dict("en_US")
    for i in range(len(origin), 0, -1):
        print "Checking against words of length %d" % i
        pool = permutations(origin, i)
        for comb in pool:
            word = ''.join(comb)
            if s.check(word):
                return word
    return ""

if __name__ == '__main__':
    result = find_longest(argv[1])
    print result
That's fine on a 9-letter example like they use in the show: 9 factorial = 362,880 and 8 factorial = 40,320. On that scale, even if it had to check every permutation at every word length, it's not that many.
However, once you reach 14 characters that's 87,178,291,200 possible permutations, meaning you're relying on luck that a 14-character word is found quickly.
With the example word above it's taking my machine about 12.5 seconds to find "reawaken". With 14-character scrambled words we could be talking on the scale of 23 days just to check all possible 14-character permutations.
Is there any more efficient way to handle this?
An implementation of Jeroen Coupé's idea from his answer, using letter counts:
from collections import defaultdict, Counter

def find_longest(origin, known_words):
    return iter_longest(origin, known_words).next()

def iter_longest(origin, known_words, min_length=1):
    origin_map = Counter(origin)
    for i in xrange(len(origin) + 1, min_length - 1, -1):
        for word in known_words[i]:
            if check_same_letters(origin_map, word):
                yield word

def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)

def load_words_from(file_path):
    known_words = defaultdict(list)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            known_words[len(word)].append(word)
    return known_words

if __name__ == '__main__':
    known_words = load_words_from('words_list.txt')
    origin = 'raepkwaen'
    big_origin = 'raepkwaenaqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnm'
    print find_longest(big_origin, known_words)
    print list(iter_longest(origin, known_words, 5))
Output (for my small 58000 words dict):
counterrevolutionaries
['reawaken', 'awaken', 'enwrap', 'weaken', 'weaker', 'apnea', 'arena', 'awake',
'aware', 'newer', 'paean', 'parka', 'pekan', 'prank', 'prawn', 'preen', 'renew',
'waken', 'wreak']
Notes:
It's a simple implementation without optimizations.
words_list.txt can be /usr/share/dict/words on Linux.
UPDATE
In case we need to find the word only once, and we have a dictionary with words sorted by length, e.g. produced by this script:
with open('words_list.txt') as f:
    words = f.readlines()
with open('words_by_len.txt', 'w') as f:
    for word in sorted(words, key=lambda w: len(w), reverse=True):
        f.write(word)
We can find the longest word without loading the full dictionary into memory:
from collections import Counter
import sys

def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)

def iter_longest_from_file(origin, file_path, min_length=1):
    origin_map = Counter(origin)
    origin_len = len(origin)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            if len(word) > origin_len:
                continue
            if len(word) < min_length:
                return
            if check_same_letters(origin_map, word):
                yield word

def find_longest_from_file(origin, file_path):
    return iter_longest_from_file(origin, file_path).next()

if __name__ == '__main__':
    origin = sys.argv[1] if len(sys.argv) > 1 else 'abcdefghijklmnopqrstuvwxyz'
    print find_longest_from_file(origin, 'words_by_len.txt')
You want to avoid doing the permutation. You could count how many times each character appears in both strings (the original string and the one from the dictionary). Dismiss any dictionary word that uses a character more often than it appears in your set of letters.
So to check one word from the dictionary you need at most max(26, n) character-count operations, where n is the length of the word.
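For illustration, a minimal sketch of that counting check using collections.Counter (the helper name and the words_list.txt file are assumptions, not part of this answer):
from collections import Counter

def candidates(letters, dictionary_words):
    # Hypothetical helper: keep only words whose per-character counts
    # never exceed the counts available in `letters`.
    available = Counter(letters)
    for word in dictionary_words:
        needed = Counter(word)
        if all(available[c] >= n for c, n in needed.items()):
            yield word

# Example: the longest candidate for the Countdown letters.
words = [w.strip() for w in open('words_list.txt')]
print(max(candidates('raepkwaen', words), key=len))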
Pre-parse the dictionary as sorted(word), word pairs. (e.g. giilnstu, linguist)
Sort the dictionary file.
Then, when you are searching for a given set of letters:
Binary search the dictionary for the letters you have, sorting the letters first.
You'd need to do this separately for each word length.
EDIT: I should say that you're searching for all unique combinations of the sorted letters at each target word length (range(len(letters), 0, -1))
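A rough sketch of that lookup, assuming the pre-parsed dictionary is a sorted list of (sorted_letters, word) pairs; the names and the tiny word list here are only illustrative:
from bisect import bisect_left
from itertools import combinations

# Hypothetical pre-parsed dictionary: sorted list of (key, word) pairs,
# where key is the word's letters in alphabetical order.
entries = sorted((''.join(sorted(w)), w) for w in ['linguist', 'reawaken', 'wreak'])
keys = [k for k, _ in entries]

def lookup(letters):
    # Try every unique combination of the available letters, longest first.
    letters = sorted(letters)
    for length in range(len(letters), 0, -1):
        for combo in set(combinations(letters, length)):
            key = ''.join(combo)
            i = bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return entries[i][1]
    return None

print(lookup('raepkwaen'))  # -> 'reawaken' with this toy dictionary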
This is similar to an anagram problem I've worked on before. I solved that by using prime numbers to represent each letter. The product of the letters for each word produces a number. To determine if a given set of input characters is sufficient to make a word, just divide the product of the input characters by the product for the word you want to check. If there is no remainder then the input characters are sufficient. I've implemented it below. The output is:
$ python longest.py rasdaddea aosddna raepkwaen
rasdaddea --> sadder
aosddna --> soda
raepkwaen --> reawaken
You can find more details and a thorough explanation of the anagrams case at:
http://mostlyhighperformance.blogspot.com/2012/01/generating-anagrams-efficient-and-easy.html
This algorithm takes a small amount of time to set up a dictionary, and then individual checks are as easy as a single division for every word in the dictionary. There may be faster methods that rely on closing off parts of the dictionary if it lacks a letter, but these may end up performing worse if you have a large number of input letters, so they are actually not able to close off any part of the dictionary.
import sys

def nextprime(x):
    while True:
        x += 1
        for pot_fac in range(2, x):
            if x % pot_fac == 0:
                break
        else:
            return x

def prime_generator():
    '''Returns a generator that produces the next largest prime as
    compared to the one returned from this function the last time
    it was called. The first time it is called it will return 2.'''
    lastprime = 1
    while True:
        lastprime = nextprime(lastprime)
        yield lastprime

# Assign prime numbers to each lower case letter
gen = prime_generator()
primes = dict( [ (chr(x), gen.next()) for x in range(ord('a'), ord('z')+1) ] )

product = lambda x: reduce( lambda m,n: m*n, x, 1 )
make_key = lambda x: product( [ primes[y] for y in x ] )

try:
    words = open('words').readlines()
    words = [ ''.join( [ c for c in x.lower() \
                         if ord('a') <= ord(c) <= ord('z') ] ) \
              for x in words ]
    for x in words:
        try:
            make_key(x)
        except:
            print x
            raise
except IOError:
    words = [ 'reawaken','awaken','enwrap','weaken','weaker', ]
words = dict( ( (make_key(x),x,) for x in words ) )

inputs = sys.argv[1:] if sys.argv[1:] else [ 'raepkwaen', ]
for input in inputs:
    input_key = make_key(input)
    results = [ words[x] for x in words if input_key % x == 0 ]
    result = reversed(sorted(results, key=len)).next()
    print input, '--> ', result
I started this last night shortly after you asked the question, but didn't get around to polishing it up until just now. This was my solution, which is basically a modified trie (which I didn't know it was called until today!).
class Node(object):
    __slots__ = ('words', 'letter', 'child', 'sib')

    def __init__(self, letter, sib=None):
        self.words = []
        self.letter = letter
        self.child = None
        self.sib = sib

    def get_child(self, letter, create=False):
        child = self.child
        if not child or child.letter > letter:
            if create:
                self.child = Node(letter, child)
                return self.child
            return None
        return child.get_sibling(letter, create)

    def get_sibling(self, letter, create=False):
        node = self
        while node:
            if node.letter == letter:
                return node
            sib = node.sib
            if not sib or sib.letter > letter:
                if create:
                    node.sib = Node(letter, sib)
                    node = node.sib
                    return node
                return None
            node = sib
        return None

    def __repr__(self):
        return '<Node({}){}{}: {}>'.format(chr(self.letter), 'C' if self.child else '', 'S' if self.sib else '', self.words)

def add_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    node = root
    for letter in letters:
        node = node.get_child(letter, True)
    node.words.append(word)

def find_max_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    words = []

    def grab_words(root, letters):
        last = None
        for idx, letter in enumerate(letters):
            if letter == last:  # prevents duplication
                continue
            node = root.get_child(letter)
            if node:
                words.extend(node.words)
                grab_words(node, letters[idx+1:])
            last = letter

    grab_words(root, letters)
    return words

root = Node(0)
with open('/path/to/dict/file', 'rt') as f:
    for word in f:
        add_word(root, word)
Testing:
>>> def nonrepeating_words():
... return find_max_word(root, 'abcdefghijklmnopqrstuvwxyz')
...
>>> sorted(nonrepeating_words(), key=len)[-10:]
['ambidextrously', 'troublemakings', 'dermatoglyphic', 'hydromagnetics', 'hydropneumatic', 'pyruvaldoxines', 'hyperabductions', 'uncopyrightable', 'dermatoglyphics', 'endolymphaticus']
>>> len(nonrepeating_words())
67590
I think I prefer dermatoglyphics to uncopyrightable for longest word, myself. Performance-wise, utilizing a ~500k word dictionary (from here),
>>> import timeit
>>> timeit.timeit(nonrepeating_words, number=100)
62.8912091255188
>>>
So, on average, 6/10ths of a second (on my i5-2500) to find all sixty-seven thousand words that contain no repeating letters.
The big differences between this implementation and a trie (which makes it even further from a DAWG in general) are that: words are stored in the trie in relation to their sorted letters. So the word 'dog' is stored under the same path as 'god': d-g-o. The second bit is the find_max_word algorithm, which makes sure every possible letter combination is visited by continually lopping off its head and re-running the search.
Oh, and just for giggles:
>>> sorted(find_max_word(root, 'RAEPKWAEN'), key=len)[-5:]
['wakener', 'rewaken', 'reawake', 'reawaken', 'awakener']
Another approach, similar to #market's answer, is to precompute a 'bitmask' for each word in the dictionary. Bit 0 is set if the word contains at least one A, bit 1 is set if it contains at least one B, and so on up to bit 25 for Z.
If you want to search for all words in the dictionary that could be made up from a combination of letters, you start by forming the bitmask for the collection of letters. You can then filter out all of the words that use other letters by checking whether wordBitmask & ~lettersBitMask is zero. If this is zero, the word only uses letters available in the collection, and so could be valid. If this is non-zero, it uses a letter not available in the collection and so is not allowed.
The advantage of this approach is that the bitwise operations are fast. The vast majority of words in the dictionary will use at least one of the 17 or more letters that aren't in the collection given, and you can speedily discount them all. However, for the minority of words that make it through the filter, there is one more check that you still have to make. You still need to check that words aren't using letters more often than they appear in the collection. For example, the word 'weakener' must be disallowed because it has three 'e's, whereas there are only two in the collection of letters RAEPKWAEN. The bitwise approach alone will not filter out this word since each letter in the word appears in the collection.
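A minimal sketch of that filter, assuming lower-case ASCII words; the Counter check at the end handles the duplicate-letter case described above, and a real implementation would precompute each word's mask once:
from collections import Counter

def bitmask(word):
    # Set bit 0 for 'a', bit 1 for 'b', ... bit 25 for 'z'.
    mask = 0
    for c in word:
        mask |= 1 << (ord(c) - ord('a'))
    return mask

def words_from_letters(letters, dictionary_words):
    letters_mask = bitmask(letters)
    available = Counter(letters)
    for word in dictionary_words:
        if bitmask(word) & ~letters_mask:
            continue  # uses a letter that is not in the collection at all
        counts = Counter(word)
        if all(available[c] >= n for c, n in counts.items()):
            yield word

print(list(words_from_letters('raepkwaen', ['reawaken', 'weakener', 'prank'])))
# -> ['reawaken', 'prank'] ('weakener' is rejected by the count check)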
When looking for words longer than 10 letters, you may try to iterate over the words that are longer than 10 letters (I think there are not so many of them) and check whether you have the required letters in your set.
Problem is that you have to find all those len(word) >= 10 words first.
So, what I would do:
When reading the dictionary, split the words into 2 categories: shorts and longs. You can process shorts by iterating over every possible permutation. Then you can process longs by iterating over them and checking whether they are possible.
Of course there are many optimisations possible to both paths.
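A rough sketch of that split, borrowing the letter-count check from the other answers; the threshold and the file name are arbitrary assumptions:
from collections import Counter, defaultdict

THRESHOLD = 10  # arbitrary split point between "shorts" and "longs"

shorts = defaultdict(list)  # length -> words, for the permutation-style path
longs = []                  # few enough to check one by one

with open('words_list.txt') as f:
    for line in f:
        word = line.strip()
        if len(word) < THRESHOLD:
            shorts[len(word)].append(word)
        else:
            longs.append(word)

def fits(letters, word):
    # True if `word` can be built from the multiset of `letters`.
    available = Counter(letters)
    return all(available[c] >= n for c, n in Counter(word).items())

letters = 'raepkwaen'
long_hits = [w for w in longs if fits(letters, w)]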
Construct a trie (prefix tree) from your dictionary. You may want to cache it.
Walk on this trie and remove whole branches that do not fit your bag of letters.
At this point, your trie is the representation of all words in your dictionary that can be constructed from your bag of letters.
Just take the longer one(s) :-)
Edit: you may also use a DAWG (Directed Acyclic Word Graph), which will have fewer vertices. Although I haven't read it, this Wikipedia article has a link about The World's Fastest Scrabble Program.
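For illustration, a minimal sketch of that prune-and-walk idea using a plain dict-based trie (all names here are made up, and a real implementation would cache the trie between queries instead of rebuilding it):
from collections import Counter

END = '$'  # marker key meaning "a word ends at this node"

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for c in word:
            node = node.setdefault(c, {})
        node[END] = word
    return root

def longest_from(node, letters, best=''):
    # Walk the trie, only descending into branches whose letter is still
    # available in the bag of letters; everything else is effectively pruned.
    word = node.get(END)
    if word is not None and len(word) > len(best):
        best = word
    for c, child in node.items():
        if c != END and letters[c] > 0:
            letters[c] -= 1
            best = longest_from(child, letters, best)
            letters[c] += 1
    return best

trie = build_trie(['reawaken', 'awaken', 'prank', 'linguist'])
print(longest_from(trie, Counter('raepkwaen')))  # -> reawaken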
DAWG (Directed Acyclic Word Graph)
Mark Wutka was kind enough to provide some Pascal code here.
http://www.wutka.com/dawg.html
http://www.wutka.com/DictConvert.ZIP
In case you have a text file with sorted words, this simple code does the job:
UsrWrd = input()  # here you enter the scrambled letters
with open('words.db', 'r') as f:
    for Line in f:
        for Word in Line.split():
            if len(Word) == len(UsrWrd) and set(Word) == set(UsrWrd):
                print(Word)
                break
            else:
                continue
Related
There is such a task on Leetcode. Everything works for me when I press RUN, but when I submit, it gives an error:
text = "a b c d e"
brokenLetters = "abcde"
Output : 1
Expected: 0
def canBeTypedWords(self, text, brokenLetters):
    for i in brokenLetters:
        cnt = 0
        text = text.split()
        s1 = text[0]
        s2 = text[1]
        if i in s1 and i in s2:
            return 0
        else:
            cnt += 1
    return cnt
Can you please assist with what I missed here?
Everything works except the condition for handling the separate words in the text.
So consider logically what you have to do, then write that algorithmically.
Logically, you have a list of words, a list of broken letters, and you need to return the count of words that have none of those broken letters in them.
"None of those broken letters in them" is the important bit -- if even one broken letter is in the word, it's no good.
def count_words(broken_letters, word_list) -> int:
    words = word_list.split()  # split on spaces
    broken_letters = set(broken_letters)  # we'll be doing membership checks
                                          # on this kind of a lot, so changing
                                          # it to a set is more performant
    count = 0
    for word in words:
        for letter in word:
            if letter in broken_letters:
                # this word doesn't work, so break out of the
                # "for letter in word" loop
                break
        else:
            # a for..else block is only entered if execution
            # falls off the bottom naturally, so in this case
            # the word works!
            count += 1
    return count
This can, of course, be written much more concisely and (one might argue) more idiomatically, but the concise version is less obvious to a novice. As an exercise to the reader: see if you can understand how this code works and how you might modify it if the exercise instead gave you all the letters that work rather than the letters that are broken.
def count_words(broken_letters, word_list) -> int:
    words = word_list.split()
    broken_letters = set(broken_letters)
    return sum(1 for word in words if all(lett not in broken_letters for lett in word))
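For what it's worth, a possible sketch of the variant mentioned above, assuming the parameter now lists the letters that do work (the function and parameter names are made up):
def count_typable_words(working_letters, word_list) -> int:
    words = word_list.split()
    working_letters = set(working_letters)
    # A word counts only if every one of its letters is available.
    return sum(1 for word in words if all(lett in working_letters for lett in word))

print(count_typable_words("abcde", "a b c d e"))  # -> 5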
Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' from the list, but not just once; instead, as many times as the word 'april' actually occurs in the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well, the word itself didn't actually occur three times, but all of its characters did. So I want to arrange these characters into 'april' as many times as they appear in the string.
My idea is basically to extract words from some random strings, but not just a single copy of the word; instead, to extract every occurrence of the word that appears in the string. Each word should be extracted and its characters ordered the way I want.
But here I have some annoying conditions: you can't delete all the elements in the list and then just replace them with the word 'april' (you can't replace the whole string with the word 'april'); you can only extract 'april' from the string, not replace it. You also can't delete the list with the string. Think of the whole string as very important data: we only want some of that data, but it must be ordered, and we need to drop all other data that doesn't match our "data chain" (the word 'april'). But once you delete the whole string you lose all the important data, and you don't know how to make another one of these "data chains", so we can't just put the word 'april' back in the list.
If anyone know how to solve my weird problem, please help me out, I am a beginner python programmer. Thank you!
One way is to use itertools.groupby, which will group consecutive identical characters, and then unpack and iterate the groups using zip, which will iterate n times, where n is the number of characters in the smallest group (i.e. the group having the lowest number of characters):
from itertools import groupby

s = 'aaaaaaappppppprrrrrriiiiiilll'

result = ''
for each in zip(*[list(g) for k, g in groupby(s)]):
    result += ''.join(each)

# result = 'aprilaprilapril'
Another possible solution is to create a custom counter that will count each unique sequence of characters (please note that this method will only work on Python 3.6+; for lower versions of Python, the order of dictionaries is not guaranteed):
def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts.keys():
            if current == c:
                counts[c] += 1
        else:
            current = c
            counts[c] = 1
    return counts.keys(), min(counts.values())

result = ''
counts = getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
    result += ''.join(counts[0])

# result = 'aprilaprilapril'
How about using regex?
import re

word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'

regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
Outputs:
aprilaprilapril
Works like a charm
Here is a more native approach, with plain iteration.
It has a time complexity of O(n).
It uses an outer loop to iterate over the characters in the search key, then an inner while loop that consumes all occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumed, it updates the minLetterCount to be the minimum of its previous value or this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
def countCompleteSequenceOccurences(searchString, key):
    left = 0
    minLetterCount = 0
    letterCount = 0
    # skip any leading characters that do not match the first letter of the key,
    # so inputs such as "zzzaaa..." behave as in the test cases below
    while left < len(searchString) and searchString[left] != key[0]:
        left += 1
    for i, searchChar in enumerate(key):
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1
        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0
    return minLetterCount
Testing:
testCasesToOracles = {
    "aaaaaaappppppprrrrrriiiiiilll": 3,
    "ppppppprrrrrriiiiiilll": 0,
    "aaaaaaappppppprrrrrriiiiii": 0,
    "aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
    "pppppppaaaaaaarrrrrriiiiiilll": 0,
    "zaaaaaaappppppprrrrrriiiiiilll": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilll": 3,
    "aaaaaaappppppprrrrrriiiiiilllzzz": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}

key = "april"

for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril
A word will only occur as many times as the minimum letter recurrence. To account for the possibility of having repeated letters in the word (for example, 'appril'), you need to factor this count out. Here is one way of doing this using collections.Counter:
from collections import Counter

def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)

    # now get effective count by dividing the occurrence in string
    # by the occurrence in kernel
    effective_counter = {
        k: int(string_counter.get(k, 0)/v)
        for k, v in kernel_counter.items()
    }

    # min occurrence of kernel is min of effective counter
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
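A quick usage example (output shown as a comment):
print(count_recurrence('april', 'aaaaaaappppppprrrrrriiiiiilll'))
# -> 'aprilaprilapril'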
I have a list L of words, each with its letters in alphabetical order, e.g. hello = ehllo.
How do I check these words for "blanagrams", which are words sharing all the same letters except one? For example, orchestra turns into orchestre. I've only been able to think up to this part. I have an idea in which you test every letter of the current word and see whether it corresponds to the other word; if it does, it is a blanagram, and you build a dictionary, but I'm struggling to put it into code.
L = [list of alphabetically ordered strings]

for word in L:
    for letter in word:
        # confused at this part
from collections import Counter

def same_except1(s1, s2):
    ct1, ct2 = Counter(s1), Counter(s2)
    return sum((ct1 - ct2).values()) == 1 and sum((ct2 - ct1).values()) == 1
Examples:
>>> same_except1('hello', 'hella')
True
>>> same_except1('hello', 'heela')
False
>>> same_except1('hello', 'hello')
False
>>> same_except1('hello', 'helloa')
False
Steven Rumbalski's answer got me thinking and there's also another way you can do this with a Counter (+1 for use of collections and thank you for sparking my interest)
from collections import Counter

def diff_one(w, z):
    c = Counter(sorted(w + z)).values()
    c = filter(lambda x: x % 2 != 0, c)
    return len(c) == 2
Basically all matched letters will have a counter value that will be even. So you filter those out and get left with the unmatched ones. If you have more than 2 unmatched then you have a problem.
Assuming this is similar to the word-ladders problem commonly used in AI, and you are trying to create a graph with adjacent nodes as possible words, not a dictionary:
d = {}

# create buckets of words that differ by one letter
for word in L:
    for i in range(len(word)):
        bucket = word[:i] + '_' + word[i+1:]
        if bucket in d:
            d[bucket].append(word)
        else:
            d[bucket] = [word]
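To illustrate how the buckets might then be used, here is a small self-contained sketch (the toy word list stands in for L from the question):
L = ['hello', 'hella', 'jello', 'apple']

d = {}
for word in L:
    for i in range(len(word)):
        bucket = word[:i] + '_' + word[i+1:]
        d.setdefault(bucket, []).append(word)

# Any bucket containing more than one word is a group of blanagrams:
# the words are identical everywhere except at the wildcard position.
for bucket, words in d.items():
    if len(words) > 1:
        print(bucket, words)
# prints e.g.: hell_ ['hello', 'hella'] and _ello ['hello', 'jello']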
I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib

cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []

for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)

for word in words:
    while i < len(uniques):
        i += 1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you
The above problem can be easily done by using Python collections.
Below is the solution.
from collections import Counter
data_set = "Welcome to the world of Geeks " \
"This portal has been created to provide well written well" \
"thought and well explained solutions for selected questions " \
"If you like Geeks for Geeks and would like to contribute " \
"here is your chance You can write article and mail your article " \
" to contribute at geeksforgeeks org See your article appearing on " \
"the Geeks for Geeks main page and help thousands of other Geeks. " \
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
#print(Counters)
# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)
You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0              # Initialize the count to zero.
    for word in words:     # Iterate over the words.
        if word == unique: # Is this word equal to the current unique?
            count += 1     # If so, increment the count
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
from string import punctuation  # you will need it to strip the punctuation
import urllib

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")

counter = {}

for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower()  # the The or you You counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k,]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1

# and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
import urllib
import operator

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or spaces.

word_counter = {}

for word in txtFile.split(" "):  # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' already in word_counter, increment it by 1

for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    # sorts the dict by the values, from top to bottom, takes the 10 top items,
    print "%s: %s - %s" % (i+1, word, word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces are in the counter. Doesn't matter that much, though.
Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        # dict.get with a default collapses the add-or-increment into one line
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
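For example, a small illustration of that cleanup step:
import string

word = '"Someword,"'
cleaned = word.strip(string.punctuation).lower()
# cleaned == 'someword'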
You can also do it through pandas dataframes and get the result in a convenient form as a table of "word - its frequency", ordered.
import pandas as pn  # assuming `pn` is the pandas alias used below

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]

    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0

    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1

    res = words_df_unique.sort_values('count', ascending=False)
    return(res)
To do the same operation on a pandas data frame, you may use the following, with Counter from collections:
from collections import Counter

cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1

# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)
I am building a tree for a spell checker with suggestions. Each node contains a key (a letter) and a value (array of letters down that path).
So assume the following sub-trie in my big trie:
             W
            / \
           a   e
           |   |
           k   k
           |   |
is word--> e   e
               |
              ...
This is just a subpath of a sub-trie. W is a node and a and e are two nodes in its value array etc...
At each node, I check if the next letter in the word is a value of the node. I am trying to support mistyped vowels for now. So 'weke' will yield 'wake' as a suggestion. Here's my searchWord function in my trie:
def searchWord(self, word, path=""):
    if len(word) > 0:
        key = word[0]
        word = word[1:]
        if self.values.has_key(key):
            path = path + key
            nextNode = self.values[key]
            return nextNode.searchWord(word, path)
        else:
            # check here if key is a vowel. If it is, check for other vowel substitutes
            pass
    else:
        if self.isWord:
            return path  # this is the word found
        else:
            return None
Given 'weke', at the end when word is of length zero and path is 'weke', my code will hit the second big else block. weke is not marked as a word and so it will return with None. This will return out of searchWord with None.
To avoid this, at each stack unwind or recursion backtrack, I need to check if a letter is a vowel and if it is, do the checking again.
I changed the if self.values.has_key(key) block to the following:
if self.values.has_key(key):
    path = path + key
    nextNode = self.values[key]
    ret = nextNode.searchWord(word, path)
    if ret == None:
        # check if key == vowel and replace path
        # return nextNode.searchWord(...
        pass
    return ret
What am I doing wrong here? What can I do when backtracking to achieve what I'm trying to do?
Search recursively. Keep track of the current index and the original word.
letters = [chr(i) for i in range(97, 97+26)]
print letters
max = 300

def searchWord(orig, word, curindex, counter):
    if counter > max: return
    if counter == 0:
        s = letters[0] + word[1:]
        searchWord(orig, s, 0, counter+1)
    else:
        c = word[curindex]
        print 'checking ', word, curindex
        s = word
        i = letters.index(c)
        if i == len(letters)-1 and curindex == len(orig)-1:
            print 'done'
            return
        if i == len(letters)-1:
            print 'end of letters reached'
            print 'curindex', curindex
            s = list(word)
            s[curindex] = list(orig)[curindex]
            s[curindex+1] = letters[0]
            s[1] = letters[0]
            s = ''.join(s)
            searchWord(orig, s, curindex+1, counter+1)
        else:
            s = list(word)
            try:
                s[curindex] = letters[i+1]
            except:
                print '?? ', s, curindex, letters[i]
            s = ''.join(s)
            searchWord(orig, s, curindex, counter+1)

searchWord("weke", "weke", 0, 0)
I'm not sure recursion and tree-search is the right approach here. If you have a table of words in your memory, the lookup will be very fast. It is only when the search space is so big that one has to split the problem. So the better algorithm will probably simply be something like this:
corpus_words = {'wake',....} # this is in memory
allowed = word in corpus_words # perhaps improve this with adjusted binary search
A typical corpus has 5-30 million words, which is less than 1 Gigabyte. Lookup will be very fast because you can do binary search, which is O(log n) in the average case. The problem with searching for a subset of the word is that you don't know that the typed word is not a word. However, you could build allowed vowels. Certain combinations of letters won't be in the corpus. So in terms of computation this problem is pretty easy nowadays. Of course one can quickly improve the simple lookup, by keeping a core corpus in memory and the rest on disk. Swipe typing on Android works pretty well. It uses a personalized corpus and some machine learning.
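For illustration, a binary-search lookup over a sorted word list might look like this (a sketch; the file name is an assumption):
from bisect import bisect_left

with open('words_list.txt') as f:
    corpus = sorted(line.strip() for line in f)

def in_corpus(word):
    # O(log n) membership test on the sorted list.
    i = bisect_left(corpus, word)
    return i < len(corpus) and corpus[i] == word

print(in_corpus('wake'))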
What I would do to solve this particular problem is to calculate the neighbours of the word 'weke' and check if they are in the corpus, i.e.
word = 'weke'
suggestions = list()
letters = [chr(x) for x in range(97, 97+26)]
for i in range(len(word)):
    for a in letters:  # or do this in a smarter way to iterate
        # strings are immutable, so build the neighbour by slicing
        newword = word[:i] + a + word[i+1:]
        if newword in corpus:
            suggestions.append(newword)
And then to improve it, check whether subsections are in a corpus of syllables. There is a lot of work that has been done on this front, so you can probably find standard solutions on the internet, for example: http://nltk.org/