Find Compound Words in List of Words - Python

I have a simple list of words I need to filter, but each word in the list has an accompanying "score" appended to it which is causing me some trouble. The input list has this structure:
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'BUFFET;75', 'FASTBREAKPOINTS;60'
       ]
I am trying to figure out how to identify words in my list that are compounded solely from other words on the same list. For example, the code applied to lst above would produce:
ans = ['FASTBREAK;40', 'BREAKFASTBUFFET;35']
I found a prior question that deals with a nearly identical situation, but in that instance, there are no trailing scores with the words on the list, and I am having trouble dealing with these trailing scores on my list. The ans list must keep the scores with the compound words found. The order of the words in lst is random and irrelevant. Ideally, I would like the ans list to be sorted by the length of the word (before the ';'), as shown above. This would save me some additional post-processing on ans.
I have figured out a way that works using ReGex and nested for loops (I will spare you the ugliness of my 1980s-esque brute force code, it's really not pretty), but my word list has close to a million entries, and my solution takes so long as to be completely unusable. I am looking for a solution a little more Pythonic that I can actually use. I'm having trouble working through it.

Here is some code that does the job. I'm sure it's not perfect for your situation (with a million entries), but perhaps it can be useful in parts:
#!/usr/bin/env python

from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))

separator = ";"

lst = [
    "FAST;5",
    "BREAK;60",
    "FASTBREAK;40",
    "OUTBREAK;110",
    "BREAKFASTBUFFET;35",
    "BUFFET;75",
    "FASTBREAKPOINTS;60",
]

words = [Word(*w.rsplit(separator, 1)) for w in lst]

def findparts(oword, parts):
    if len(oword.characters) == 0:
        return parts
    for iword in words:
        if not parts and iword.characters == oword.characters:
            continue
        if iword.characters in oword.characters:
            parts.append(iword)
            characters = oword.characters.replace(iword.characters, "")
            return findparts(Word(characters, oword.number), parts)
    return []

ans = []
for word in words:
    parts = findparts(word, [])
    if parts:
        ans.append(separator.join(word))

print(ans)
It uses a recursive function that takes a word from your list and tries to assemble it from other words on that same list. This function will also present you with the actual atomic words forming the compound one.
It's not very smart, however. Here is an example of a composition it will not detect:
[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET].
It takes a small detour through a namedtuple to temporarily separate the actual word from the number attached to it, assuming that the separator will always be ;.
I don't think regular expressions hold an advantage over a simple string search here.
If you know some more conditions about the composition of the compound words, for instance the maximum number of components, the itertools combinatoric generators might help you speed things up significantly and avoid missing compositions like the example given above.
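For illustration, here is a sketch of a different strategy (not the code above): a recursive segmentation check that tries every matching prefix instead of greedily removing substrings, so it also catches the BREAKFASTBUFFET / BREAK / BREAKFAST / BUFFET case. It assumes the lst from the question:
# A sketch: decide whether a word can be built entirely from other words on the list.
separator = ";"
pairs = [w.rsplit(separator, 1) for w in lst]
vocab = {chars for chars, _ in pairs}

def can_segment(word, vocab, skip_whole=True):
    # skip_whole stops the word from matching itself on the first call
    if not skip_whole and word in vocab:
        return True
    return any(word[:i] in vocab and can_segment(word[i:], vocab, skip_whole=False)
               for i in range(1, len(word)))

ans = sorted((separator.join(p) for p in pairs if can_segment(p[0], vocab)),
             key=lambda s: len(s.split(separator)[0]))
print(ans)  # ['FASTBREAK;40', 'BREAKFASTBUFFET;35']
For a million entries this would still need memoization (e.g. caching can_segment results per substring), but it shows the backtracking idea that the greedy replace misses.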

I think I would do it like this: make a new list containing only the words. In a for loop go through this list, and within it look for the words that are part of the word of the outer loop. If they are found: replace the found part by an empty string. If afterwards the entire word is replaced by an empty string: show the word of the corresponding index of the original list.
EDIT: As was pointed out in the comments, there could be a problem with the code in some situations, like this one: lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"] In BREAKFASTBUFFET I first found that BREAK was a part of it, so I replaced that one with an empty string, which prevented BREAKFAST from being found. I hope that problem can be tackled by sorting the list descending by length of the word.
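Concretely, that fix amounts to trying longer candidates first, something like this sketch (candidates as built in the EDIT 2 code below):
# Sort candidate words longest-first so BREAKFAST is tried before BREAK.
candidates.sort(key=len, reverse=True)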
EDIT 2
My former edit was not flaw-proof; for instance, if there were a word BREAKFASTEN, it shouldn't be "eaten" by BREAKFAST. This version does the following:
make a list of candidates: all words that are part of the word under investigation
make another list of words that the word starts with
keep track of the words in the candidates list that you've already tried
in a while loop, keep trying until either the start list is empty or you've successfully replaced the whole word with candidates
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'POINTS;25',
       'BUFFET;75', 'FASTBREAKPOINTS;60', 'BREAKPOINTS;15'
       ]

lst2 = [s.split(';')[0] for s in lst]

for i, word in enumerate(lst2):
    # candidates: words that are part of current word
    candidates = [x for i2, x in enumerate(lst2) if x in word and i != i2]
    if len(candidates) > 0:
        tried = []
        word2 = word
        found = False
        while not found:
            # start: subset of candidates that the current word starts with
            start = [x for x in candidates if word2.startswith(x) and x not in tried]
            for trial in start:
                word2 = word2.replace(trial, '')
                tried.append(trial)
                if len(word2) == 0:
                    print(lst[i])
                    found = True
                    break
            if len(candidates) > 1:
                candidates = candidates[1:]
                word2 = candidates[0]
            else:
                break

There are several ways of speeding up the process but I doubt there is a polynomial solution.
So let's use multiprocessing, and do what we can to generate a meaningful result. The sample below is not identical to what you are asking for, but it does compose a list of apparently compound words from a large dictionary.
For the code below, I am sourcing https://gist.github.com/h3xx/1976236 which lists about 80,000 unique words in order of frequency in English.
The code below can easily be sped up if the input wordlist is sorted alphabetically beforehand, as each head of a compound will be immediately followed by its potential compound members (a bisect-based sketch of that lookup follows the example list):
black
blackberries
blackberry
blackbird
blackbirds
blackboard
blackguard
blackguards
blackmail
blackness
blacksmith
blacksmiths
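To illustrate the point about sorting, here is a small sketch (not part of the posted code) that uses bisect to pull out the contiguous run of words sharing a given head from a sorted list:
# A sketch: in a sorted list, all words starting with `head` form one contiguous slice.
import bisect

def words_with_prefix(sorted_words, head):
    lo = bisect.bisect_left(sorted_words, head)
    hi = bisect.bisect_left(sorted_words, head + "\xff")  # any word starting with head sorts before this
    return sorted_words[lo:hi]

sorted_words = sorted(["black", "blackberries", "blackberry", "blackbird",
                       "blackbirds", "blackboard", "blackmail", "blueprint"])
print(words_with_prefix(sorted_words, "black"))
# ['black', 'blackberries', 'blackberry', 'blackbird', 'blackbirds', 'blackboard', 'blackmail']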
As mentioned in the comment, you may also need to use a semantic filter to identify true compound words - for instance, the word ‘generally’ isn’t a compound word for a ‘gene rally’ !! So, while you may get a list of contenders you will need to eliminate false positives somehow.
# python 3.9
import multiprocessing as mp


# returns an ordered list of lowercase words to be used.
def load(name) -> list:
    return [line[:-1].lower() for line in open(name) if not line.startswith('#') and len(line) > 3]


# function that identifies the compounds of a word from a list.
# ... can be optimised if using a sorted list.
def compounds_of(word: str, values: list):
    return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]


# apply compound finding across an mp environment
# but this is the slowest part
def compose(values: list) -> dict:
    with mp.Pool() as pool:
        result = {(word, i): pool.apply(compounds_of, (word, values)) for i, word in enumerate(values)}
    return result


if __name__ == '__main__':
    # https://gist.github.com/h3xx/1976236
    words = load('wiki-100k.txt')  # words are ordered by popularity, and are 3 or more letters, in lowercase.
    words = list(dict.fromkeys(words))
    # remove those word heads which have less than 3 tails
    compounds = {k: v for k, v in compose(words).items() if len(v) > 3}
    # get the top 500 keys
    rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]
    # compose them into a dict and print
    tops = {k[0]: compounds[k] for k in rank}
    print(tops)

Related

Trouble getting list of words given a list of available letters for each character (Python)

I have a list of words that I would like to go through and remove any that don't fit my criteria.
The criteria is a list of lists of letters that are possible for each character.
letters = [['l','e'],['a','b','c'],['d','e','f']]
words = ['lab','lad','ebf','tem','abe','dan','lce']
The function I have written to try and solve this is:
def calc_words(letters, words):
    for w in words:
        for i in range(len(letters)):
            if w in words:
                for j in letters[i]:
                    if w in words:
                        if j != w[i]:
                            words.remove(w)
    return words
The output when I run the function calc_words(letters,words) should be ['lad', 'ebf', 'lce']. But, I get ['lad', 'tem', 'dan'] instead.
I can't figure out what is going on. I'm relatively new to Python, so if someone either knows what is going wrong with my function, or knows a different way to go about this, I would appreciate any input.
In general, it's good to avoid using one-letter variable names and reduce nesting as much as possible. Here's an implementation of calc_words() that should suit your needs:
letters = [['l','e'],['a','b','c'],['d','e','f']]
words = ['lab','lad','ebf','tem','abe','dan','lce']

def calc_words(letters, words):
    # Iterate over each word.
    # Assume that a word should be in the result
    # until we reach a letter that violates the
    # constraint set by the letters list.
    result = []
    for word in words:
        all_letters_match = True
        for index, letter in enumerate(word):
            if letter not in letters[index]:
                all_letters_match = False
                break
        if all_letters_match:
            result.append(word)
    return result

# Prints "['lad', 'ebf', 'lce']".
print(calc_words(letters, words))
There is a comment in the function definition that describes how it works. This is similar to your implementation, with some nesting removed and some improved naming (e.g. the if w in words check isn't necessary, because in each iteration w takes on the value of an element from words). I've tried to keep this solution as close to your original code as possible.
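For comparison, the same filter can be written more compactly with zip and all, which short-circuits on the first mismatch just like the explicit loop (a sketch, using the same letters and words as above):
# A compact equivalent: pair each letter of the word with its allowed set.
def calc_words_compact(letters, words):
    return [word for word in words
            if len(word) == len(letters)
            and all(ch in allowed for ch, allowed in zip(word, letters))]

print(calc_words_compact(letters, words))  # ['lad', 'ebf', 'lce']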

Python: string comparison with double conditions

Trying to search two lists for common strings. The first list is a file with text, while the second is a list of words with a logarithmic probability before the actual word. To match, a word not only needs to be in both lists, but also have a certain minimal log probability (for instance, between -2.123456 and 0.000000; that is, from negative 2 increasing up to 0). The tab-separated list can look like:
-0.962890 dog
-1.152454 lol
-2.050454 cat
I got stuck doing something like this:
common = []
for i in list1:
    if i in list2 and re.search("\-[0-1]\.[\d]+", list2):
        common.append(i)
The idea to simply preprocess the list to remove lines under a certain threshold is valid of course, but since both the word and its probability are on the same line, isn’t a condition also possible? (Regexps aren’t necessary, but for comparison solutions both with and without them would be interesting.)
EDIT: own answer to this question below.
Assuming your list contains strings such as "-0.744342 dog", and my_word_list is a list of strings, how about this:
worddict = dict(map(lambda x: (x, True), my_word_list))

import re
for item in my_list:
    parts = re.split(r"\s+", item)
    if len(parts) != 2:
        raise ValueError("Wrong data format")
    logvalue, word = float(parts[0]), parts[1]
    if logvalue > TRESHHOLD and word in worddict:
        print("Hurrah, a match!")
Note the first line, which makes a dictionary out of your list of words. If you don't do this, you'll waste a lot of time doing linear searches through your word list, causing your algorithm to have a time complexity of O(n*m), while the solution I propose is much closer to O(n+m), n being the number of lines in my_list, m being the number of words in my_word_list.
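For what it's worth, a plain set gives the same constant-time membership test without the dummy True values; a minimal sketch using the same assumed names (my_word_list, my_list, TRESHHOLD) as above:
# Same lookup with a set instead of a dict of True values.
wordset = set(my_word_list)
for item in my_list:
    logvalue, word = item.split(None, 1)  # split on whitespace into two parts
    if float(logvalue) > TRESHHOLD and word in wordset:
        print("Hurrah, a match!")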
Here's my solution without using regex. First create a dictionary of words which are within the accepted range, then check whether each word in the text is in the dict.
word_dict = {}
with open('probability.txt', 'r') as prob, open('text.txt', 'r') as textfile:
    for line in prob:
        if (-2.123456 < float(line.split()[0]) < 0):
            word_dict[line.split()[1]] = line.split()[0]
    for line in textfile:
        for word in line.split():
            if word in word_dict.keys():
                print('MATCH, {}: {}'.format(word, word_dict[word]))
Answering my own question after hours of trial and error and reading tips from here and there. It turns out I was thinking in the right direction from the start, but needed to separate word detection from pattern matching, and instead combine the latter with the log probability check: first create a temporary list of items with the needed log probability, then compare that against the text file.
common = []
prob = []
loga, rithmus = -9.87, -0.01

for i in re.findall("\-\d\.\d+", list2):
    if (loga < float(i.split()[0]) < rithmus):
        prob.append(i)
prob = "\n".join(prob)

for i in list1:
    if i in prob:
        common.append(i)

Reducing computation time of counting word frequency in a corpus (python)

def foo(self, n):
    wordFreqDict = defaultdict(int)
    resultList = []
    for sentence in self.allSentences:
        for word in self.wordSet:
            if word in sentence:
                wordFreqDict[word] += 1
    for candidateWord in wordFreqDict:
        if wordFreqDict[candidateWord] >= n:
            resultList.append(candidateWord)
    return resultList
My function basically returns the list of words where a word is included iff it occurs in at least n sentences. What I am trying to do is iterate through self.allSentences (a list of lists of words, i.e. sentences), and for each sentence iterate through self.wordSet (all the unique words in the corpus), adding the frequency of each word to the dictionary (wordFreqDict).
Then loop through the wordFreqDict and add a word with frequency >= n into the resultList.
I am assuming that this works, but it's taking too long to check the result.
Is there a way to make it more efficient and shorten the computation time?
EDIT:
Here is how the self.allSentences is computed
def storeAllSentences(self, corpus):
    for sentence in corpus:
        self.allSentences.append(sentence)
and the self.wordSet:
def getUniqueWordSet(self, corpus):
    for sentence in self.allSentences:
        for word in sentence:
            self.wordSet.append(word)
    self.wordSet = set(self.wordSet)
An assumption here is that a word is chosen if it uniquely appears in at least n sentences. So, even if it occurs 10 times in a single sentence, that's still once.
A couple of glaring issues are your nested loops and your in check on a string. This is effectively cubic in complexity. It should be possible to reduce this drastically using a set + Counter.
from collections import Counter
from itertools import takewhile

def foo(self, n):
    c = Counter()
    for sent in self.allSentences:
        c.update(set(sent))
    # most_common() is sorted by descending count, so takewhile stops at the first
    # word below n; note the result is a list of (word, count) pairs
    resultSet = list(takewhile(lambda x: x[1] >= n, c.most_common()))
    return resultSet
Here, I use regex to remove special characters and punctuation, following which I split and retrieve all unique words in the sentence. The Counter is then updated.
Finally, extract all words of the desired frequency (or more) using itertools.takewhile and return.
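To make the takewhile step concrete, here is a toy run (made-up counts) showing that most_common() is sorted by descending count, so takewhile picks off exactly the words occurring in at least n sentences, as (word, count) pairs:
# Toy illustration of the takewhile step with n = 3.
from collections import Counter
from itertools import takewhile
c = Counter({"the": 5, "cat": 3, "sat": 1})
print(list(takewhile(lambda x: x[1] >= 3, c.most_common())))  # [('the', 5), ('cat', 3)]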
If sent is a string sentence, you might use re based filtering to remove punctuation and then split:
import re
tempWordSet = set(re.sub('[^\w\s]+', '', sent).split())
c.update(tempWordSet)
This does not take into consideration whether a word belongs to self.wordSet or not. So if you want that as well, you might modify the first loop a bit to include a filter step:
c.update(set(filter(lambda x: x in self.wordSet, sent)))
Or, using the second technique:
tempWordSet = set(filter(lambda x: x in self.wordSet,
re.sub('[^\w\s]+', '', sent).split()))
On an unrelated note, are you trying to perform text mining? You might be interested in looking into TFIDF.
The set wordSet is probably at least two orders of magnitude larger than the words in a single sentence. Therefore it makes sense to loop over the words in a sentence. However, this requires that splitting a sentence is not a really slow operation. If it is, you should do the whole process in getUniqueWordSet. Here only the first for-loop is changed:
def foo(self, n):
    wordFreqDict = defaultdict(int)
    resultList = []
    for sentence in self.allSentences:
        for word in sentence:
            # This if may be left out if allSentences is guaranteed to stay the same after storing them
            if word in self.wordSet:
                wordFreqDict[word] += 1
    for candidateWord in wordFreqDict:
        if wordFreqDict[candidateWord] >= n:
            resultList.append(candidateWord)
    return resultList

Trying to filter a dictionary for an AI

Hi so I'm currently taking a class and one of our assignments is to create a Hangman AI. There are two parts to this assignment and currently I am stuck on the first task, which is, given the state of the hangman puzzle, and a list of words as a dictionary, filter out non-possible answers to the puzzle from the dictionary. As an example, if the puzzle given is t--t, then we should filter out all words that are not 4 letters long. Following that, we should filter out words which do not have t as their first and last letter (ie. tent and test are acceptable, but type and help should be removed). Currently, for some reason, my code seems to remove all entries from the dictionary and I have no idea why. Help would be much appreciated. I have also attached my code below for reference.
def filter(self, puzzle):
    wordlist = dictionary.getWords()
    newword = {i: wordlist[i] for i in range(len(wordlist)) if len(wordlist[i]) == len(str(puzzle))}
    string = puzzle.getState()
    array = []
    for i in range(len(newword)):
        n = newword[i]
        for j in range(len(string)):
            if string[j].isalpha() and n[j] == string[j]:
                continue
            elif not string[j].isalpha():
                continue
            else:
                array += i
                print(i)
                break
    array = list(reversed(array))
    for i in array:
        del newword[i]
Some additional information:
puzzle.getState() is a function given to us that returns a string describing the state of the puzzle, eg. a-a--- or t---t- or elepha--
dictionary.getWords essentially creates a dictionary from the list of words
Thanks!
This may be a bit late, but using regular expressions:
import re

def remove(regularexpression, somewords):
    words_to_remove = []
    for word in somewords:
        #if not re.search(regularexpression, word):
        if re.search(regularexpression, word) == None:
            words_to_remove.append(word)
    for word in words_to_remove:
        somewords.remove(word)
For example you have the list words=['hello', 'halls', 'harp', 'heroic', 'tests'].
You can now do:
remove('h[a-zA-Z]{4}$', words) and words would become ['hello', 'halls'], while remove('h[a-zA-Z]{3}s$', words) only leaves words as ['halls']
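For completeness, that usage as a runnable snippet:
words = ['hello', 'halls', 'harp', 'heroic', 'tests']
remove('h[a-zA-Z]{4}$', words)
print(words)  # ['hello', 'halls']
remove('h[a-zA-Z]{3}s$', words)
print(words)  # ['halls']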
This line
newword = {i : wordlist[i] for i in range(len(wordlist)) if len(wordlist[i])==len(str(puzzle))}
creates a dictionary with non-contiguous keys. You want only the words that are the same length as puzzle so if your wordlist is ['test', 'type', 'string', 'tent'] and puzzle is 4 letters newword will be {0:'test', 1:'type', 3:'tent'}. You then use for i in range(len(newword)): to iterate over the dictionary based on the length of the dictionary. I'm a bit surprised that you aren't getting a KeyError with your code as written.
I'm not able to test this without the rest of your code, but I think changing the loop to:
for i in newword.keys():
will help.
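As a side note, the whole filter can also be written without any index bookkeeping; a sketch, assuming wordlist is the plain list of words and state is the string from puzzle.getState():
# A sketch of an index-free filter: keep a word only if it has the right length
# and matches every revealed letter of the puzzle state.
def possible_words(wordlist, state):
    return [w for w in wordlist
            if len(w) == len(state)
            and all(not s.isalpha() or s == c for s, c in zip(state, w))]

print(possible_words(['tent', 'test', 'type', 'help'], 't--t'))  # ['tent', 'test']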

Breaking a string into individual words in Python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations in approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement it in Python.
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])
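For example, a quick check of the generator with a made-up word set:
# A toy run of substrings_in_set, just to show the shape of the output.
toy_words = {"example", "car", "cart", "trading"}
print(list(substrings_in_set("examplecartrading", toy_words)))
# [['example', 'car', 'trading']]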
Then we build the english dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an english word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []

domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
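For reference, one more way to keep the recursion fast on long names (not part of the answer above) is to cache the splits of each suffix, since the same tails get re-split many times; a sketch that also works on Python 2.7:
# A memoized variant of the splitter: remember the splits of every substring seen.
def substrings_in_set_cached(s, words, cache=None):
    if cache is None:
        cache = {}
    if s in cache:
        return cache[s]
    results = []
    if s in words:
        results.append([s])
    for i in range(1, len(s)):
        head = s[:i]
        if head in words:
            for rest in substrings_in_set_cached(s[i:], words, cache):
                results.append([head] + rest)
    cache[s] = results
    return results
It returns lists instead of yielding, so the cached value can be reused safely across calls.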
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)

with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever. Fortunately I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
