I want to check whether a string in Python is a concatenation of English words, using PyEnchant.
For example, I want to check whether a concatenated string is made up of real words:
import enchant

eng = enchant.Dict("en_US")
eng.check("Applebanana")
I know this will return False, but I want it to return True, since both apple and banana are legitimate words according to PyEnchant.
If you limit yourself to words combined from two other words, you can check the combinations yourself:
>>> s = "applebanana"
>>> splits = [(s[:i], s[i:]) for i in range(1,len(s))]
>>> splits
[('a', 'pplebanana'), ('ap', 'plebanana'), ('app', 'lebanana'),
('appl', 'ebanana'), ('apple', 'banana'), ('appleb', 'anana'),
('appleba', 'nana'), ('appleban', 'ana'), ('applebana', 'na'),
('applebanan', 'a')]
>>> any((eng.check(item[0]) and eng.check(item[1])) for item in splits)
True
Of course you can expand that to more than two, but this should give you a general idea of where you're headed.
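If you want to allow concatenations of more than two words, a small recursive variant along these lines should also work (just a sketch, not part of PyEnchant itself; is_word_combination is a hypothetical helper name):

def is_word_combination(s, eng):
    # True if s is itself a word, or if some prefix is a word and the
    # remainder can in turn be split into dictionary words.
    if eng.check(s):
        return True
    return any(eng.check(s[:i]) and is_word_combination(s[i:], eng)
               for i in range(1, len(s)))

Be aware that one-letter strings like "a" and "I" pass eng.check, so you may want to require a minimum length for each piece.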
The code below is a brute-force method that searches a list of words and creates sub-lists of any that are anagrams.
Searching the entire English dictionary is prohibitively time-consuming, so I'm curious whether anyone has tips for reducing the computational complexity of the code.
def anogramtastic(anagrms):
    d = []
    e = []
    for j in range(len(anagrms)):
        if anagrms[j] in e:
            pass
        else:
            templist = []
            tester = anagrms[j]
            tester = list(tester)
            tester.sort()
            tester = ''.join(tester)
            for k in range(len(anagrms)):
                if k == j:
                    pass
                else:
                    testers = anagrms[k]
                    testers = list(testers)
                    testers.sort()
                    testers = ''.join(testers)
                    if testers == tester:
                        templist.append(anagrms[k])
                        e.append(anagrms[k])
            if len(templist) > 0:
                templist.append(anagrms[j])
                d.append(templist)
    d.sort(key=len, reverse=True)
    return d

print(anogramtastic(wordlist))
How about using a dictionary of frozensets? Frozensets are immutable, meaning you can hash them for constant lookup. And when it comes to anagrams, what makes two words anagrams of each other is that they have the same letters with the same count. So you can construct a frozenset of {(letter, count), ...} pairs, and hash these for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict

def word2multiset(word):
    return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [...]
anagram_dict = defaultdict(set)

for word in list_of_words:
    anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)

defaultdict(set,
            {frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello', 'olleh'},
             frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
             frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
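Pulling out the actual anagram groups afterwards is then just a matter of keeping the values that have more than one member, for example:

anagram_groups = [group for group in anagram_dict.values() if len(group) > 1]
print(anagram_groups)  # [{'hello', 'olleh'}] for the sample list above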
Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key makes finding the right group for each word fast; a lookup/insertion is O(1) on average.
#!/usr/bin/env python3
# coding=utf8
from sys import stdin

groups = {}
for line in stdin:
    w = line.strip()
    g = ''.join(sorted(w))
    if g not in groups:
        groups[g] = []
    groups[g].append(w)

for g, words in groups.items():
    if len(words) > 1:
        print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s
You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
    dct = {}
    for word in words:
        key = tuple(sorted(word))  # Identifier based on letters.
        dct.setdefault(key, []).append(word)
    # Return a list of all that had an anagram.
    return [words for words in dct.values() if len(words) > 1]

wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
            'aide', 'idea', 'earth', 'heart', 'tea', 'tee']

print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]
I can use this to determine whether any of a set of multiple strings exists in another string:
bar = 'this is a test string'
if any(s in bar for s in ('this', 'test', 'bob')):
    print("found")
but I'm not sure how to check whether any of a set of multiple strings occurs in any of several strings. It seems like this should work; syntactically it does not fail, but it doesn't print anything either:
a = 'test string'
b = 'I am a cat'
c = 'Washington'

if any(s in (a, b, c) for s in ('this', 'test', 'cat')):
    print("found")
You need to iterate through the tuple of test strings:
a = 'test string'
b = 'I am a cat'
c = 'Washington'

if any(s in test for test in (a, b, c) for s in ('this', 'test', 'cat')):
    print("found")
At this point it's probably worth compiling a regular expression of the substrings you're looking for and then applying a single check with that. This means that you only scan each string once, not potentially three times (or however many substrings you're looking for), and it keeps the any check at a single level of comprehension.
import re

has_substring = re.compile('this|test|cat').search
if any(has_substring(text) for text in (a, b, c)):
    ...  # do something
Note that you can modify the expression to only search for whole words, e.g.:
has_word = re.compile(r'\b(this|test|cat)\b').search
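A quick illustrative comparison of the two patterns (using the has_substring and has_word functions defined above):

texts = ['the cat sat', 'the category page']
print([bool(has_substring(t)) for t in texts])  # [True, True]
print([bool(has_word(t)) for t in texts])       # [True, False] -- 'category' no longer matches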
You can try this:
a = 'test string'
b = 'I am a cat'
c = 'Washington'

l = [a, b, c]
tests = ('this', 'test', 'cat')

if any(any(t in s for s in l) for t in tests):
    print("found")
I am new to Python and I want to write a program that determines whether a string consists of repeating characters. The list of strings that I want to test is:
Str1 = "AAAA"
Str2 = "AGAGAG"
Str3 = "AAA"
The pseudo-code that I came up with:
WHEN len(str) % 2 has zero remainder:
- Divide the string into two sub-strings.
- Then compare the two sub-strings and check whether they have the same characters or not.
- If the two sub-strings are not the same, divide the string into three sub-strings and compare them to check if repetition occurs.
I am not sure if this is a workable way to solve the problem. Any ideas on how to approach it?
Thank you!
You could use the Counter class from the collections module to count the most common occurrences of the characters.
>>> from collections import Counter
>>> s = 'abcaaada'
>>> c = Counter(s)
>>> c.most_common()
[('a', 5), ('c', 1), ('b', 1), ('d', 1)]
To get the single most repetitive (common) character:
>>> c.most_common(1)
[('a', 5)]
You could do this using regex backreferences.
To find a pattern in Python, you are going to need to use regular expressions. A regular expression search is typically written as:
match = re.search(pat, str)
This is usually followed by an if-statement to determine whether the search succeeded.
For example, this is how you would find the pattern "AAAA" in a string:
import re

string = ' blah blahAAAA this is an example'
match = re.search(r'AAAA', string)
if match:
    print('found', match.group())
else:
    print('did not find')
This prints: found AAAA
Do the same for your other two strings and it will work the same.
Regular expressions can do a lot more than just this, so play around with them and see what else they can do.
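For the repetition check in the question specifically, an actual backreference pattern could look something like this (a sketch; (.+?) captures the shortest candidate block and \1 requires that same block to repeat at least once more; re.fullmatch needs Python 3.4+):

import re

def is_repetitive(s):
    # "AAAA" and "AGAGAG" match, "ABC" does not.
    return re.fullmatch(r'(.+?)\1+', s) is not None

print(is_repetitive('AAAA'))    # True
print(is_repetitive('AGAGAG'))  # True
print(is_repetitive('ABC'))     # False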
Assuming you mean the whole string is a repeating pattern, this answer has a good solution:
def principal_period(s):
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]
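For example, applied to the strings from the question (expected results, given how the (s + s).find trick works):

print(principal_period("AAAA"))    # 'A'
print(principal_period("AGAGAG"))  # 'AG'
print(principal_period("ABC"))     # None -- not a repeating pattern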
I have a list of strings as such:
mylist = ["superduperlongstring", "a short string", "the middle"]
I want to sort this in such a way that the string with the largest number of words comes first, i.e.,
mylist = ["a short string", "the middle", "superduperlongstring"]
It's a bit tricky, since if I sort in place by length
mylist.sort(key=len)
I'm back where I started.
Has anyone come across a graceful solution to this? Thanks.
Assuming that words are separated by whitespace, calling str.split with no arguments returns a list of the words that a string contains:
>>> "superduperlongstring".split()
['superduperlongstring']
>>> "a short string".split()
['a', 'short', 'string']
>>> "the middle".split()
['the', 'middle']
>>>
Therefore, you can get the output you want by sorting mylist based off of the length of these lists:
>>> mylist = ["superduperlongstring", "a short string", "the middle"]
>>> mylist.sort(key=lambda x: len(x.split()), reverse=True)
>>> mylist
['a short string', 'the middle', 'superduperlongstring']
>>>
You will also need to set the reverse parameter of list.sort to True, as shown above.
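If you'd rather not modify the list in place, sorted() accepts the same key and reverse arguments and returns a new list:
>>> sorted(mylist, key=lambda x: len(x.split()), reverse=True)
['a short string', 'the middle', 'superduperlongstring']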
I have a problem and I have no idea how to solve it. Please give me some advice.
I have a text. A big, big text. The task is to find all the repeated phrases of length 3 (i.e. consisting of three words) in the text.
You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are "the black dog" and "The black, dog?" the same phrase?
A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.
A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
>>> import re
>>> def words(text):
        pattern = re.compile(r"[^\s]+")
        non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
        for match in pattern.finditer(text):
            nxt = non_alpha.sub("", match.group()).lower()
            if nxt:  # skip blank, non-alpha words
                yield nxt

>>> text
"O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
>>> def phrases(words):
        phrase = []
        for word in words:
            phrase.append(word)
            if len(phrase) > 3:
                phrase.remove(phrase[0])
            if len(phrase) == 3:
                yield tuple(phrase)

>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
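One such simpler variant (a sketch, not from the original answer) lets collections.deque manage the sliding window, since a deque created with maxlen=3 discards the oldest word automatically:

from collections import deque

def phrases(words):
    window = deque(maxlen=3)  # oldest word falls off automatically
    for word in words:
        window.append(word)
        if len(window) == 3:
            yield tuple(window)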
Significantly, chaining the generators together only traverses the list once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:
>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
        counts[phrase] += 1
This makes a single pass over text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
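That final filtering step might look like this, for example:
>>> repeated = {phrase: n for phrase, n in counts.items() if n > 1}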
The crudest way would be to read the text into a string, do a str.split() to get the individual words in a list, then slice the list three words at a time and use collections.defaultdict(int) to keep the count:
d = collections.defaultdict(int)
d[phrase] += 1
As I said, it's very crude, but it should certainly get you started (see the sketch just below).
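Pulled together, that crude approach might look roughly like this (a sketch; 'input.txt' is just a placeholder file name):

import collections

text = open('input.txt').read()     # read the whole text into a string
words = text.split()                # naive whitespace tokenization
d = collections.defaultdict(int)
for i in range(len(words) - 2):
    phrase = tuple(words[i:i + 3])  # each three-word window
    d[phrase] += 1

repeated = [p for p, n in d.items() if n > 1]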
I would suggest looking at the NLTK toolkit. It is open source and intended for teaching natural language processing. As well as higher-level NLP functions, it has a lot of tokenizing functions and collections.
Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl, which was designed for text processing, or C++ for pure performance.
>>> import collections
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> words = s.lower().split()
>>> phrases = collections.defaultdict(int)
>>> for a, b, c in zip(words, words[1:], words[2:]):
...     phrases[(a, b, c)] += 1
...
>>> phrases
defaultdict(<class 'int'>, {('the', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps', 'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1})
>>> [phrase for phrase, count in phrases.items() if count > 1]
[]