Check if a word is inside a list of tuples - Python

I'm wondering how I can efficiently check whether a value is inside a given list of tuples. Say I have a list of:
("the", 1)
("check", 1)
("brown, 2)
("gary", 5)
how can I check whether a given word is inside the list, ignoring the second value of the tuples? If it was just a word I could use
if "the" in wordlist:
    # ...
but this will not work. Is there something along these lines I can do?
if ("the", _) in wordlist:
    # ...

Maybe use a hash (a dict):
>>> word in dict(list_of_tuples)
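For example, a quick sanity check with the sample data (a minimal sketch; wordlist is the list from the question):
wordlist = [("the", 1), ("check", 1), ("brown", 2), ("gary", 5)]

# dict() turns the (word, count) pairs into a mapping keyed by word,
# so membership testing looks only at the words.
lookup = dict(wordlist)
print("the" in lookup)     # True
print("purple" in lookup)  # False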

Use any:
if any(word[0] == 'the' for word in wordlist):
    # do something

Lookup of the word in the list is O(n) time complexity, so the more words in the list, the slower the lookup. To speed it up you may sort the list alphabetically by word and then use binary search, which makes the search O(log n), but the most efficient way is hashing with the set structure:
'the' in set(word for word, _ in wordlist)
This is O(1) on average, independent of how many words are in the set. As a side effect it guarantees that only one instance of each word is inside the structure, while a list can hold as many copies of "the" as you append. The set should be constructed once; add new words with the .add method (adding a word is O(1) on average too).
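A minimal sketch of that approach (variable names are illustrative):
wordlist = [("the", 1), ("check", 1), ("brown", 2), ("gary", 5)]

# build the set once...
words = {word for word, _ in wordlist}

# ...then every membership test is O(1) on average
print("the" in words)    # True
print("brown" in words)  # True

# adding a new word later is also O(1) on average
words.add("quick")
print("quick" in words)  # True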

for tupl in wordlist:
    if 'the' in tupl:
        # ...

words, scores = zip(*wordlist)
to split the wordlist into a tuple of words and a tuple of scores, then just
print("the" in words)

Related

Find Compound Words in List of Words - Python

I have a simple list of words I need to filter, but each word in the list has an accompanying "score" appended to it which is causing me some trouble. The input list has this structure:
lst = ['FAST;5','BREAK;60','FASTBREAK;40',
'OUTBREAK;110','BREAKFASTBUFFET;35',
'BUFFET;75','FASTBREAKPOINTS;60'
]
I am trying to figure out how to identify words in my list that are compounded solely from other words on the same list. For example, the code applied to lst above would produce:
ans = ['FASTBREAK;40','BREAKFASTBUFFET;35']
I found a prior question that deals with a nearly identical situation, but in that instance there are no trailing scores with the words on the list, and I am having trouble dealing with the trailing scores on my list. The ans list must keep the scores with the compound words found. The order of the words in lst is random and irrelevant. Ideally, I would like the ans list to be sorted by the length of the word (before the ';'), as shown above. This would save me some additional post-processing on ans.
I have figured out a way that works using regex and nested for loops (I will spare you the ugliness of my 1980s-esque brute force code, it's really not pretty), but my word list has close to a million entries, and my solution takes so long as to be completely unusable. I am looking for a solution a little more Pythonic that I can actually use. I'm having trouble working through it.
Here is some code that does the job. I'm sure it's not perfect for your situation (with a million entries), but perhaps can be useful in parts:
#!/usr/bin/env python

from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))

separator = ";"

lst = [
    "FAST;5",
    "BREAK;60",
    "FASTBREAK;40",
    "OUTBREAK;110",
    "BREAKFASTBUFFET;35",
    "BUFFET;75",
    "FASTBREAKPOINTS;60",
]

words = [Word(*w.rsplit(separator, 1)) for w in lst]


def findparts(oword, parts):
    if len(oword.characters) == 0:
        return parts
    for iword in words:
        if not parts and iword.characters == oword.characters:
            continue
        if iword.characters in oword.characters:
            parts.append(iword)
            characters = oword.characters.replace(iword.characters, "")
            return findparts(Word(characters, oword.number), parts)
    return []


ans = []
for word in words:
    parts = findparts(word, [])
    if parts:
        ans.append(separator.join(word))

print(ans)
It uses a recursive function that takes a word in your list and tries to assemble it with other words from that same list. This function will also present you with the actual atomic words forming the compound one.
It's not very smart, however. Here is an example of a composition it will not detect:
[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET].
It uses a small detour using a namedtuple to temporarily separate the actual word from the number attached to it, assuming that the separator will always be ;.
I don't think regular expressions hold an advantage over a simple string search here.
If you know some more conditions about the composition of the compound words, like for instance the maximum number of components, the itertools combinatoric generators might help you to speed things up significantly and avoid missing the example given above too.
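For illustration, here is one rough way the combinatoric idea could look (a sketch assuming a small maximum number of components, not the answer's code; it is exponential in that maximum):
from itertools import permutations

def is_compound(target, vocabulary, max_parts=3):
    # only words that actually occur inside the target can be components
    candidates = [w for w in vocabulary if w != target and w in target]
    for n in range(2, max_parts + 1):
        # try every ordered arrangement of n candidate parts
        for combo in permutations(candidates, n):
            if "".join(combo) == target:
                return combo
    return None

vocab = ["FAST", "BREAK", "FASTBREAK", "OUTBREAK",
         "BREAKFASTBUFFET", "BUFFET", "FASTBREAKPOINTS"]
print(is_compound("BREAKFASTBUFFET", vocab))  # ('BREAK', 'FAST', 'BUFFET')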
I think I would do it like this: make a new list containing only the words. In a for loop go through this list, and within it look for the words that are part of the word of the outer loop. If they are found: replace the found part by an empty string. If afterwards the entire word is replaced by an empty string: show the word of the corresponding index of the original list.
EDIT: As was pointed out in the comments, there could be a problem with the code in some situations, like this one: lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"]. In BREAKFASTBUFFET I first found that BREAK was a part of it, so I replaced that one with an empty string, which prevented BREAKFAST from being found. I hope that problem can be tackled by sorting the list descending by length of the word.
EDIT 2
My former edit was not flaw-proof; for instance, if there were a word BREAKFASTEN, it shouldn't be "eaten" by BREAKFAST. This version does the following:
make a list of candidates: all words that are part of the word under investigation
make another list of the words that the word starts with
keep track of the words in the candidates list that you've already tried
in a while loop: keep trying until either the start list is empty, or you've successfully replaced all words by the candidates
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'POINTS;25',
       'BUFFET;75', 'FASTBREAKPOINTS;60', 'BREAKPOINTS;15'
       ]

lst2 = [s.split(';')[0] for s in lst]

for i, word in enumerate(lst2):
    # candidates: words that are part of current word
    candidates = [x for i2, x in enumerate(lst2) if x in word and i != i2]
    if len(candidates) > 0:
        tried = []
        word2 = word
        found = False
        while not found:
            # start: subset of candidates that the current word starts with
            start = [x for x in candidates if word2.startswith(x) and x not in tried]
            for trial in start:
                word2 = word2.replace(trial, '')
                tried.append(trial)
                if len(word2) == 0:
                    print(lst[i])
                    found = True
                    break
            if len(candidates) > 1:
                candidates = candidates[1:]
                word2 = candidates[0]
            else:
                break
There are several ways of speeding up the process but I doubt there is a polynomial solution.
So let's use multiprocessing, and do what we can to generate a meaningful result. The sample below is not identical to what you are asking for, but it does compose a list of apparently compound words from a large dictionary.
For the code below, I am sourcing https://gist.github.com/h3xx/1976236 which lists about 80,000 unique words in order of frequency in English.
The code below can easily be sped up if the input wordlist is sorted alphabetically beforehand, as each head of a compound will be immediately followed by its potential compound members (a sketch of that idea follows the example list below):
black
blackberries
blackberry
blackbird
blackbirds
blackboard
blackguard
blackguards
blackmail
blackness
blacksmith
blacksmiths
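Here is a rough sketch of that speed-up (not the answer's code; it assumes lowercase words and Python 3.9+ for removeprefix):
import bisect

def compounds_of_sorted(head, sorted_words, word_set):
    # in a sorted list, every word starting with `head` sits in one
    # contiguous block, which bisect can locate in O(log n)
    lo = bisect.bisect_left(sorted_words, head)
    hi = bisect.bisect_right(sorted_words, head + "\uffff")
    return [w for w in sorted_words[lo:hi]
            if w != head and w.removeprefix(head) in word_set]

sorted_words = sorted(["bird", "birds", "black", "blackbird",
                       "blackbirds", "blackboard", "board"])
word_set = set(sorted_words)
print(compounds_of_sorted("black", sorted_words, word_set))
# ['blackbird', 'blackbirds', 'blackboard']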
As mentioned in the comment, you may also need to use a semantic filter to identify true compound words - for instance, the word ‘generally’ isn't a compound of ‘gene’ and ‘rally’! So, while you may get a list of contenders, you will need to eliminate false positives somehow.
# python 3.9
import multiprocessing as mp


# returns an ordered list of lowercase words to be used.
def load(name) -> list:
    return [line[:-1].lower() for line in open(name)
            if not line.startswith('#') and len(line) > 3]


# function that identifies the compounds of a word from a list.
# ... can be optimised if using a sorted list.
def compounds_of(word: str, values: list):
    return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]


# apply compound finding across an mp environment
# but this is the slowest part
def compose(values: list) -> dict:
    with mp.Pool() as pool:
        result = {(word, i): pool.apply(compounds_of, (word, values))
                  for i, word in enumerate(values)}
    return result


if __name__ == '__main__':
    # https://gist.github.com/h3xx/1976236
    words = load('wiki-100k.txt')  # words are ordered by popularity, and are 3 or more letters, in lowercase.
    words = list(dict.fromkeys(words))

    # remove those word heads which have less than 3 tails
    compounds = {k: v for k, v in compose(words).items() if len(v) > 3}

    # get the top 500 keys
    rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]

    # compose them into a dict and print
    tops = {k[0]: compounds[k] for k in rank}
    print(tops)

Given string and (list) of words, return words that contain string (optimal algorithm)

Let's say we have a list of unique words and a substring.
I am looking for an optimal algorithm, that returns words containing the substring.
The general application is: Given a database use search bar to filter the results.
A simple implementation in Python:
def search_bar(words, substring):
    ret = []
    for word in words:
        if substring in word:
            ret.append(word)
    return ret
words = ["abc", "bcd", "thon", "Python"]
substring = "on"
search_bar(words, substring)
this would return:
["thon", "Python"]
in time O(length_of_list * complexity_of_in), where complexity_of_in depends in some way on the length of the substring and the length of the individual words.
What I am asking is whether there is a faster implementation, given that we can preprocess the list into any structure we want.
Just a redirection to the relevant problem/answer would be amazing.
Note: it would be better if such a structure didn't take too long to add a new word, but primarily it doesn't have to support adding anything, just as the Python example doesn't.
Also, I am not sure about the tags with this question...
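For illustration only, one classic preprocessing trade-off (a sketch, not taken from the answers below) is to index every substring of every word up front; a lookup then becomes a single dictionary access, at the cost of a lot of memory:
from collections import defaultdict

def build_index(words):
    index = defaultdict(set)
    for word in words:
        # register every substring of the word
        for i in range(len(word)):
            for j in range(i + 1, len(word) + 1):
                index[word[i:j]].add(word)
    return index

words = ["abc", "bcd", "thon", "Python"]
index = build_index(words)
print(sorted(index["on"]))  # ['Python', 'thon']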
Maybe use
word.find(substring)
instead of
substring in word
and as a variant:
def search_bar(words, substring):
    return list(filter(lambda word: word.find(substring) != -1, words))

A function to count words in a corpora using dictionary values in Python

I'm a Python Newbie trying to get a count of words that occur within a corpora (corpora) using a dictionary of specific words. The corpora is a string type that has been tokenized, normalized, lemmatized, and stemmed.
dict = {}
dict['words'] = ('believe', 'tried', 'trust', 'experience')
counter = 0
Result = []
for word in corpora:
    if word in dict.values():
        counter = i + 1
    else counter = 0
This code produces a syntax error on the dict.values() line. Any help is appreciated!
Don't do dict = {}. dict is a built-in type and you are shadowing it. That's not critical here, but you won't be able to use it if you need it later.
A dictionary is a key→value mapping, like a real dictionary (word → translation). What you did is say that the value ('believe', …), which is a tuple, corresponds to the key 'words' in your dictionary. Then you are using dict.values(), which gives you a sequence of all the values stored in the dictionary; in your case this sequence consists of exactly one item, and this item is a tuple. Your if condition will never be True: word is a string and dict.values() is a sequence consisting of a single tuple of strings.
I'm not really sure why you are using a dictionary. It seems that you've got a set of words that are important to you, and you are scanning your corpora and counting the number of occurrences of those words. The key word here is set. You don't need a dictionary, you need a set.
It is not clear, what you are counting. What's that i you are adding to the counter? If you meant to increment counter by one, that should be counter = counter + 1 or simply counter += 1.
Why are you resetting counter?
counter = 0
I don't think you really want to reset the counter when you find an unknown word. Unknown words shouldn't change your counter, so just don't alter it.
Notes: try to avoid using upper-case letters in variable names (Result = [] is bad). Also, as others mentioned, you are missing a colon after else.
So, now let's put it all together. The first thing to do is to make a set of words we are interested in:
words = {'believe', 'tried', 'trust', 'experience'}
Next you can iterate over the words in your corpora and see which of them are present in the set:
for word in corpora:
    if word in words:
        # do something
It is not clear what exactly the code should do, but if your goal is to know how many times all the words in the set are found in the corpora all together, then you'll just add one to counter inside that if.
If you want to know how many times each word of the set appears in the corpora, you'll have to maintain a separate counter for every word in the set (that's where a dictionary might be useful). This can be achieved easily with collections.Counter (which is a special dictionary) and you'll have to filter your corpora to leave only the words you are interested in, that's where ifilter will help you.
filtered_corpora = itertools.ifilter(lambda w: w in words, corpora)
This is your corpora with all the words not found in words removed. You can pass it to Counter right away.
This trick is also useful for the first case (i.e. when you need only the total count): you just count the items of the filtered corpora, e.g. sum(1 for _ in filtered_corpora), since the filter object itself has no len().
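A minimal sketch of that approach with made-up tokens (note: itertools.ifilter is Python 2; in Python 3 the built-in filter, or a comprehension, does the same job):
from collections import Counter

corpora = ["i", "believe", "trust", "is", "earned", "believe", "me"]  # illustrative tokens
words = {'believe', 'tried', 'trust', 'experience'}

filtered_corpora = [w for w in corpora if w in words]

print(len(filtered_corpora))      # 3 -- total number of hits
print(Counter(filtered_corpora))  # Counter({'believe': 2, 'trust': 1}) -- per-word counts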
You have multiple issues. You did not define corpora in the example here. You are redefining dict, which is a built-in type. The else is not indented correctly. dict.values() returns an iterable, each element of which is a tuple; word will not be inside it if word is a string. It is also not clear what counter actually counts, and what Result is doing there.
Your code may be similar to this (pseudo)code:
d = {'words': ('believe', 'tried', 'trust', 'experience')}  # if that's really what you want
counter = {}
for word in corpora:
    for tup in d.values():  # each tup is a tuple
        if word in tup:
            x = counter[word] if word in counter else 0
            counter[word] = x + 1
There is a shorter way to do it, though.
This task, of counting things, is so common that a specific class for doing so exists in the library: collections.Counter.
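For instance (illustrative tokens), you can count everything once with Counter and then keep only the words of interest:
from collections import Counter

corpora = ["we", "tried", "and", "tried", "to", "trust"]
targets = ('believe', 'tried', 'trust', 'experience')

counts = Counter(corpora)
print({w: counts[w] for w in targets})
# {'believe': 0, 'tried': 2, 'trust': 1, 'experience': 0}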

Using list comprehension and sets

Create and print a list of words for which all of the following criteria are met:
the word is at least 8 characters long;
the word formed from the odd-numbered letters is in the set of lower-case words; and
the word formed from the even-numbered letters is in the set of lower-case words.
For example, the word "ballooned" should be included in your list because the word formed from the odd-numbered letters, "blond", and the word formed from the even-numbered letters, "aloe", are both in the set of lower-case words. Similarly, "triennially" splits into "tinily" and "renal", both of which are in the word list.
My teacher told us we should use a set: s=set(lowers) because this would be faster.
What I have so far:
s=set(lowers)
[word for word in lowers if len(word)>=8
and list(word)(::2) in s
and list(word)(::-2) in s]
I do not think I am using the set right. Can someone help me get this to work?
The problem is that you cast word to a list (unnecessary), your slices are not in brackets (you used parentheses), and your second slice uses the wrong indices (it should be 1::2, not ::-2).
Here are the slices done correctly:
>>> word = "ballooned"
>>> word[::2]
'blond'
>>> word[1::2]
'aloe'
Note that s is an odd name for a collection of lowercase words. A better name would be words.
Your use of set is correct. The reason your teacher wants you to use a set is it is much faster to test membership of a set than it is for a list.
Putting it together:
words = set(lowers)
[word for word in words if len(word) >= 8
and word[::2] in words
and word[1::2] in words]
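As a quick sanity check with the question's own example words (a toy lowers list; sorted only to make the output deterministic):
lowers = ['ballooned', 'blond', 'aloe', 'triennially', 'tinily', 'renal']
words = set(lowers)

result = [word for word in words
          if len(word) >= 8 and word[::2] in words and word[1::2] in words]
print(sorted(result))  # ['ballooned', 'triennially']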
Here is a quick example of how to structure your condition check inside of the list comprehension:
>>> word = 'ballooned'
>>> lowers = ['blond', 'aloe']
>>> s = set(lowers)
>>> len(word) >= 8 and word[::2] in s and word[1::2] in s
True
Edit: just realized that lowers contains both the valid words and the "search" words like 'ballooned' and 'triennially'; in any case you should be able to use the above condition inside of your list comprehension to get the correct result.
list(word)(::2)
First, the syntax to access index ranges uses square brackets. Also, you don't need to cast word to a list first; you can slice the string directly:
>>> 'ballooned'[::2]
'blond'
Also, [::-2] won’t give you the uneven word, but a reversed version of the other one. You need to use [1::2] (i.e. skip the first, and then every second character):
>>> 'ballooned'[::-2]
'dnolb'
>>> 'ballooned'[1::2]
'aloe'
In general it is always a good idea to test certain parts separately to see if they really do what you think they do.
This should do it:
s=set(lowers)
[word for word in lowers if len(word)>=8 and word[::2] in s and word[1::2] in s]
or using all():
In [166]: [word for word in lowers if all((len(word) >= 8,
                                           word[::2] in s,
                                           word[1::2] in s))]
Use [::] not (::), and there's no need for list() here; also, to get the word formed from the letters at odd positions, use [1::2]:
In [151]: "ballooned"[::2]
Out[151]: 'blond'
In [152]: "ballooned"[1::2]
Out[152]: 'aloe'

Python - match letters of words in a list

I'm trying to create a simple program where a user enters a few letters
Enter letters: abc
I then want to run through a list of words I have and match any words that contain 'a', 'b', and 'c'.
This is what I've tried so far with no luck
for word in good_words:  # for all words in the good words list
    for letter in letters:  # for each letter entered by the user
        if not(letter in word):
            break
    matches.append(word)
If you want all the letters inside the word:
[word for word in good_words if all(letter in word for letter in letters)]
The problem with your code is the break inside the inner loop: Python doesn't have a construct that allows breaking out of more than one loop at once (and you wanted that).
You could probably improve the speed using a set or frozenset.
If you look at the docs, they mention the case of testing membership:
A set object is an unordered collection of distinct hashable objects.
Common uses include membership testing, removing duplicates from a
sequence, and computing mathematical operations such as intersection,
union, difference, and symmetric difference.
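A minimal sketch of how that could look here (illustrative data; <= is the subset test for sets):
good_words = ["cabbage", "bad", "aback", "dog"]
letters = "abc"

required = set(letters)
matches = [word for word in good_words if required <= set(word)]
print(matches)  # ['cabbage', 'aback']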
List comprehensions are definitely the way to go, but just to address the issue that OP was having with his code:
Your break statement only breaks out of the innermost loop, and because of that the word is still appended to matches. A quick fix for this is to take advantage of Python's for...else construct:
for word in good_words:
    for letter in letters:
        if letter not in word:
            break
    else:
        matches.append(word)
In the above code, else only executes if the loop is allowed to run all the way through. The break statement exits out of the loop completely, and matches.append(..) is not executed.
import collections
I would first compute the occurrences of letters in the words list.
words_by_letters = collections.defaultdict(list)
for word in good_words:
    key = frozenset(word)
    words_by_letters[key].append(word)
Then it's simply a matter of looking for words with particular letter occurrences. This is hopefully faster than checking each word individually.
subkey = set(letters)
for key, words in words_by_letters.iteritems():
    if key.issuperset(subkey):
        matches.extend(words)
If you want to keep track of letter repeats, you can do something similar by building a key from collections.Counter.
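For completeness, one way the repeat-aware check could look (a sketch with illustrative data, shown without the precomputed index for brevity; Counter subtraction drops non-positive counts, so an empty result means every required letter occurs often enough):
import collections

good_words = ["ballet", "table", "bell"]
letters = "llbe"

needed = collections.Counter(letters)
matches = [word for word in good_words
           if not (needed - collections.Counter(word))]
print(matches)  # ['ballet', 'bell']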
