Using list comprehensions and sets - Python

Create and print a list of words for which all of the following criteria are met:
the word is at least 8 characters long;
the word formed from the odd-numbered letters is in the set of lower-case words; and
the word formed from the even-numbered letters is in the set of lower-case words.
For example, the word "ballooned" should be included in your list because the word formed from the odd-numbered letters, "blond", and the word formed from the even-numbered letters, "aloe", are both in the set of lower-case words. Similarly, "triennially" splits into "tinily" and "renal", both of which are in the word list.
My teacher told us we should use a set, s = set(lowers), because this would be faster.
What I have so far:
s=set(lowers)
[word for word in lowers if len(word)>=8
and list(word)(::2) in s
and list(word)(::-2) in s]
I do not think I am using the set right. Can someone help me get this to work?

The problem is that you cast word to a list (unnecessary), your slices are not in brackets (you used parentheses), and your second slice uses the wrong indices (should be 1::2, not ::-2).
Here are the slices done correctly:
>>> word = "ballooned"
>>> word[::2]
'blond'
>>> word[1::2]
'aloe'
Note that s is an odd name for a collection of lowercase words. A better name would be words.
Your use of set is correct. The reason your teacher wants you to use a set is that testing membership in a set is much faster than testing membership in a list.
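To see the difference for yourself, a rough sketch like the following compares the two membership tests (the word list here is made up, and timeit's globals argument needs Python 3.5+):

import timeit

lowers = ["word%d" % i for i in range(100000)]  # stand-in for the real word list
as_set = set(lowers)

# Searching a list scans element by element; a set uses hashing.
list_time = timeit.timeit('"word99999" in lowers', globals={"lowers": lowers}, number=100)
set_time = timeit.timeit('"word99999" in as_set', globals={"as_set": as_set}, number=100)
print(list_time, set_time)  # the set lookup should come out orders of magnitude faster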
Putting it together:
words = set(lowers)
[word for word in words if len(word) >= 8
and word[::2] in words
and word[1::2] in words]

Here is a quick example of how to structure your condition check inside of the list comprehension:
>>> word = 'ballooned'
>>> lowers = ['blond', 'aloe']
>>> s = set(lowers)
>>> len(word) >= 8 and word[::2] in s and word[1::2] in s
True
Edit: Just realized that lowers contains both the valid words and the "search" words like 'ballooned' and 'triennially'. In any case, you should be able to use the above condition inside your list comprehension to get the correct result.

list(word)(::2)
First, the syntax to access index ranges uses square brackets. Also, you don’t need to cast word to a list first; you can slice the string directly:
>>> 'ballooned'[::2]
'blond'
Also, [::-2] won’t give you the word from the even-numbered letters, but a reversed version of the other one. You need to use [1::2] (i.e. skip the first character, then take every second one):
>>> 'ballooned'[::-2]
'dnolb'
>>> 'ballooned'[1::2]
'aloe'
In general it is always a good idea to test certain parts separately to see if they really do what you think they do.
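For example, a few quick assertions (using the sample words from the question) make it easy to confirm the slices behave as expected:

# Sanity checks for the slicing rules above
assert "ballooned"[::2] == "blond"
assert "ballooned"[1::2] == "aloe"
assert "triennially"[::2] == "tinily"
assert "triennially"[1::2] == "renal"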

This should do it:
s = set(lowers)
[word for word in lowers if len(word) >= 8 and word[::2] in s and word[1::2] in s]
or using all():
In [166]: [word for word in lowers if all((len(word)>=8,
word[::2] in s,
word[1::2] in s))]
Use [] not () for slicing, and there's no need for list() here. Also, to get the word formed from the even-numbered letters, use [1::2].
In [151]: "ballooned"[::2]
Out[151]: 'blond'
In [152]: "ballooned"[1::2]
Out[152]: 'aloe'


Check if word is inside of list of tuples

I'm wondering how I can efficiently check whether a value is inside a given list of tuples. Say I have a list of:
("the", 1)
("check", 1)
("brown, 2)
("gary", 5)
How can I check whether a given word is inside the list, ignoring the second value of the tuples? If it were just a word I could use
if "the" in wordlist:
#...
but this will not work. Is there something along the lines of this I can do?
if ("the", _) in wordlist:
#...
Maybe use a hash (i.e. a dict):
>>> word in dict(list_of_tuples)
Use any:
if any(word[0] == 'the' for word in wordlist):
    # do something
Lookup of a word in the list is O(n) time complexity, so the more words in the list, the slower the lookup. To speed it up you could sort the list alphabetically by word and then use binary search, which makes looking up a word O(log n). But the most efficient way is to use hashing with the set structure:
'the' in set(word for word, _ in wordlist)
This is O(1) on average, independent of how many words are in the set. As a bonus, it guarantees that only one instance of each word is inside the structure, while a list can hold as many copies of "the" as you append. The set should be constructed once; add new words with the .add method (adding a new word is O(1) too).
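A minimal sketch of that set-based approach, assuming wordlist is the list of (word, score) tuples from the question:

wordlist = [("the", 1), ("check", 1), ("brown", 2), ("gary", 5)]

# Build the set once, then reuse it for every lookup.
word_set = set(word for word, _ in wordlist)

print("the" in word_set)    # True, O(1) on average
print("quick" in word_set)  # False
word_set.add("quick")       # adding a new word is O(1) too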
for tupl in wordlist:
    if 'the' in tupl:
        # ...
Use
words, scores = zip(*wordlist)
to split the wordlist into a sequence of words and a sequence of scores, then just:
print "the" in words

How to find the combination of words that includes all the letters in the input with Python

I want to find the most efficient way to loop through the combination of letters that are entered in Python and return a set of words whose combination includes all the letters, if feasible.
Example:
Say user entered A B C D E. Goal is to find the least number of words that includes all the letters. In this case an optimum solution, in preference order, will be:
One word that has all 5 letters
Two words that have all 5 letters. (Can be a 4-letter word + 1-letter word OR a 3-letter word + 2-letter word. Does not make a difference.)
....
etc.
If there is no match, then go back to 1. with n-1 letters, etc.
I have a function to check if a "combination of letters" (i.e. word) is in dictionary.
def is_in_lib(word):
    if word in lib:
        return word
    return False
The ideal answer should not involve generating every combination of those letters and searching for all of them. Searching through my dictionary is very costly, so I need something that also optimizes the time spent searching the dictionary.
IMPORTANT EDIT: The order matters and continuity is required. Meaning, if the user enters "H", "T", "A", you cannot build "HAT".
Real example: if the input is T - H - G - R - A - C - E - K - B - Y - E, the output should be "Grace" and "Bye".
You could create a string/list from the input letters, and iterate through them for every word in the word library:
inputstring = 'abcde'
for i in lib:
    is_okay = True
    for j in inputstring:
        if i.find(j) == -1:
            is_okay = False
    if is_okay:
        return i   # assumes this loop lives inside a function
I think the other cases (two words with 3-2 letters) can be implemented recursively, but it wouldn't be efficient.
I think the key idea here would be to have some kind of index providing a mapping from a canonical sequence of characters to actual words. Something like this:
# List of known words
>>> words = ('bonjour', 'jour', 'bon', 'poire', 'proie')
# Build the index
>>> import collections
>>> index = collections.defaultdict(list)
>>> for w in words:
...     index[''.join(sorted(w.lower()))].append(w)
...
This produces an efficient way to find all the anagrams corresponding to a sequence of characters:
>>> index
defaultdict(<class 'list'>, {'joru': ['jour'], 'eiopr': ['poire', 'proie'], 'bjnooru': ['bonjour'], 'bno': ['bon']})
You could query the index that way:
>>> user_str = 'OIREP'
>>> index.get(''.join(sorted(user_str.lower())), "")
['poire', 'proie']
Of course, this will only find "exact" anagrams -- that is, ones containing all the letters provided by the user. To find all the strings that match a subset of the user-provided string, you will have to remove one letter at a time and check each combination again. I feel like recursion will help to solve that problem ;)
EDIT:
(should I put this in a spoiler section?)
Here is a possible solution:
import collections

words = ('bonjour', 'jour', 'bon', 'or', 'pire', 'poire', 'proie')

index = collections.defaultdict(list)
for w in words:
    index[''.join(sorted(w.lower()))].append(w)

# Recursively search all the words containing a sequence of letters
def search(letters, result=set()):
    # Assume "letters" ordered
    if not letters:
        return
    solutions = index.get(letters)
    if solutions:
        for s in solutions:
            result.add(s)
    for i in range(0, len(letters)):
        search(letters[:i] + letters[i+1:], result)
    return result

# Use case:
user_str = "OIREP"
s = search(''.join(sorted(user_str.lower())))
print(s)
Producing:
set(['poire', 'or', 'proie', 'pire'])
It is not that bad, but it could be improved, since the same subsets of characters are examined several times. This is especially true if the user-provided search string contains several identical letters.
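One possible way to avoid re-examining the same letter subsets is to memoize them with a "seen" set. This is only a sketch of that idea, reusing the index built above; the function name search_memo is illustrative:

import collections

words = ('bonjour', 'jour', 'bon', 'or', 'pire', 'poire', 'proie')
index = collections.defaultdict(list)
for w in words:
    index[''.join(sorted(w.lower()))].append(w)

def search_memo(letters, result=None, seen=None):
    # "letters" is assumed to be sorted; "seen" caches subsets already explored.
    if result is None:
        result, seen = set(), set()
    if not letters or letters in seen:
        return result
    seen.add(letters)
    result.update(index.get(letters, []))
    for i in range(len(letters)):
        search_memo(letters[:i] + letters[i + 1:], result, seen)
    return result

print(search_memo(''.join(sorted("OIREP".lower()))))  # {'or', 'pire', 'poire', 'proie'}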

Need to divide string inside list comprehension

Asked of me: By filtering the lowers list, create and print a list of the words for which the first half of the word matches the second half of the word. Examples include "bonbon", "froufrou", "gaga", and "murmur".
What i have so far:
[word for word in lowers if list.(word)(1:len(word)/2:)==list.(word)(len(word)/2::)]
I'm not sure how to make word a list so I can only use certain characters for this filter. I know this will not work, but it's my logic so far.
Logical Error: You're slicing from index 1 instead of 0 in list.(word)(1:len(word)/2:)
Syntax Errors: list.(word) is incorrect syntax, and list slicing uses [ ] not ( ). Simply use word[start:stop] to break it up
Use:
[word for word in lowers if word[:len(word)//2]==word[len(word)//2:]]
Edit: Thanks to Ignacio Vazquez-Abrams' comment - Use integer division ( // operator ) for Python 3 compatibility
Try this:
[word for word in lowers if word[len(word)//2:] == word[:len(word)//2]]
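A quick check of that comprehension against a small sample list (the list here is assumed for illustration):

lowers = ["bonbon", "froufrou", "gaga", "murmur", "banana", "hello"]

# Keep words whose first half equals their second half.
halves = [word for word in lowers if word[:len(word)//2] == word[len(word)//2:]]
print(halves)  # ['bonbon', 'froufrou', 'gaga', 'murmur']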

Python - match letters of words in a list

I'm trying to create a simple program where a user enters a few letters
Enter letters: abc
I then want to run through a list of words I have and match any words that contain 'a', 'b', and 'c'.
This is what I've tried so far with no luck
for word in good_words:        # for all words in the good words list
    for letter in letters:     # for each letter input by the user
        if not (letter in word):
            break
    matches.append(word)
If you want all the letters inside the word:
[word for word in good_words if all(letter in word for letter in letters)]
The problem with your code is the break inside the inner loop. Python doesn't have a construct for breaking out of more than one loop at once (and that is what you wanted).
You could probably improve the speed using a set or frozenset.
If you look at the documentation, it mentions the case of testing membership:
A set object is an unordered collection of distinct hashable objects.
Common uses include membership testing, removing duplicates from a
sequence, and computing mathematical operations such as intersection,
union, difference, and symmetric difference.
List comprehensions are definitely the way to go, but just to address the issue that OP was having with his code:
Your break statement only breaks out of the innermost loop. Because of that, the word is still appended to matches. A quick fix for this is to take advantage of Python's for...else construct:
for word in good_words:
    for letter in letters:
        if letter not in word:
            break
    else:
        matches.append(word)
In the above code, else only executes if the loop is allowed to run all the way through. The break statement exits out of the loop completely, and matches.append(..) is not executed.
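For instance, with some made-up sample data, the for...else version behaves like this:

good_words = ["cabbage", "apple", "bye", "cab"]
letters = "abc"
matches = []

for word in good_words:
    for letter in letters:
        if letter not in word:
            break             # a letter is missing; skip this word
    else:
        matches.append(word)  # runs only if the inner loop never hit break

print(matches)  # ['cabbage', 'cab']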
I would first compute the occurrences of letters in the words list.
import collections

words_by_letters = collections.defaultdict(list)
for word in good_words:
    key = frozenset(word)
    words_by_letters[key].append(word)
Then it's simply a matter of looking for words with particular letter occurrences. This is hopefully faster than checking each word individually.
subkey = set(letters)
for key, words in words_by_letters.iteritems():
    if key.issuperset(subkey):
        matches.extend(words)
If you want to keep track of letter repeats, you can do something similar by building a key from collections.Counter.
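A sketch of that Counter-based variant (the sample data and the exact matching rule here are assumptions): a word matches if it contains at least as many of each letter as the user entered.

from collections import Counter

good_words = ["apple", "banana", "cabbage", "abacus"]  # sample data
letters = "aab"  # user wants at least two 'a's and one 'b'

need = Counter(letters)
matches = [w for w in good_words
           if all(Counter(w)[ch] >= cnt for ch, cnt in need.items())]
print(matches)  # ['banana', 'cabbage', 'abacus']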

Breaking a string into individual words in Python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement it in Python.
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])
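As a quick illustration of the generator above, here is what it yields for a tiny, made-up word set (assumes substrings_in_set from above is already defined):

sample_words = {"example", "car", "cart", "trading", "rading"}
print(list(substrings_in_set("examplecartrading", sample_words)))
# [['example', 'car', 'trading'], ['example', 'cart', 'rading']]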
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above English dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute-force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you're clever. Fortunately, I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
