Append to a dict of lists with a dict comprehension - python

Suppose I have a large list of words. For an example:
>>> with open('/usr/share/dict/words') as f:
... words=[word for word in f.read().split('\n') if word]
If I wanted to build an index by first letter of this word list, this is easy:
d={}
for word in words:
if word[0].lower() in 'aeiou':
d.setdefault(word[0].lower(),[]).append(word)
# You could use defaultdict here too...
Results in something like this:
{'a':[list of 'a' words], 'e':[list of 'e' words], 'i': etc...}
Is there a way to do this with Python 2.7, 3+ dict comprehension? In other words, is it possible with the dict comprehension syntax to append the list represented by the key as the dict is being built?
ie:
index={k[0].lower():XXX for k in words if k[0].lower() in 'aeiou'}
Where XXX performs an append operation or list creation for the key as index is being created.
Edit
Taking the suggestions and benchmarking:
def f1():
d={}
for word in words:
c=word[0].lower()
if c in 'aeiou':
d.setdefault(c,[]).append(word)
def f2():
d={}
{d.setdefault(word[0].lower(),[]).append(word) for word in words
if word[0].lower() in 'aeiou'}
def f3():
d=defaultdict(list)
{d[word[0].lower()].append(word) for word in words
if word[0].lower() in 'aeiou'}
def f4():
d=functools.reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
((w[0].lower(), w) for w in words
if w[0].lower() in 'aeiou'), {})
def f5():
d=defaultdict(list)
for word in words:
c=word[0].lower()
if c in 'aeiou':
d[c].append(word)
Produces this benchmark:
rate/sec f4 f2 f1 f3 f5
f4 11 -- -21.8% -31.1% -31.2% -41.2%
f2 14 27.8% -- -11.9% -12.1% -24.8%
f1 16 45.1% 13.5% -- -0.2% -14.7%
f3 16 45.4% 13.8% 0.2% -- -14.5%
f5 18 70.0% 33.0% 17.2% 16.9% --
The straight loop with a default dict is fastest followed by set comprehension and loop with setdefault.
Thanks for the ideas!

No - dict comprehensions are designed to generate non-overlapping keys with each iteration; they don't support aggregation. For this particular use case, a loop is the proper way to accomplish the task efficiently (in linear time).

It is not possible (at least easily or directly) with a dict comprehension.
It is possible, but potentially abusive of the syntax, with a set or list comprehension:
# your code:
d={}
for word in words:
if word[0].lower() in 'aeiou':
d.setdefault(word[0].lower(),[]).append(word)
# a side effect set comprehension:
index={}
r={index.setdefault(word[0].lower(),[]).append(word) for word in words
if word[0].lower() in 'aeiou'}
print r
print [(k, len(d[k])) for k in sorted(d.keys())]
print [(k, len(index[k])) for k in sorted(index.keys())]
Prints:
set([None])
[('a', 17094), ('e', 8734), ('i', 8797), ('o', 7847), ('u', 16385)]
[('a', 17094), ('e', 8734), ('i', 8797), ('o', 7847), ('u', 16385)]
The set comprehension produces a set with the results of the setdefault() method after iterating over the words list. The sum total of set([None]) in this case. It also produces your desired side effect of producing your dict of lists.
It is not as readable (IMHO) as the straight looping construct and should be avoided (IMHO). It is no shorter and probably not materially faster. This is more interesting trivia about Python than useful -- IMHO... Maybe to win a bet?

I'd use filter:
>>> words = ['abcd', 'abdef', 'eft', 'egg', 'uck', 'ice']
>>> index = {k.lower() : list(filter(lambda x:x[0].lower() == k.lower(),words)) for k in 'aeiou'}
>>> index
{'a': ['abcd', 'abdef'], 'i': ['ice'], 'e': ['eft', 'egg'], 'u': ['uck'], 'o': []}

This is not exactly a dict comprehension, but:
reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
((w[0].lower(), w) for w in words
if w[0].lower() in 'aeiou'), {})

Not answering the question of a dict comprehension, but it might help someone searching this problem. In a reduced example, when filling growing lists on the run into a new dictionary, consider calling a function in a list comprehension, which is, admittedly, nothing better than a loop.
def fill_lists_per_dict_keys(k, v):
d[k] = (
v
if k not in d
else d[k] + v
)
# global d
d = {}
out = [fill_lists_per_dict_keys(i[0], [i[1]]) for i in d2.items()]
The out is only to suppress the None Output of each loop.
If you ever want to use the new dictionary even inside the list comprehension at runtime or if you run into another reason why your dictionary gets overwritten by each loop, check to make it global with global d at the beginning of the script (commented out because not necessary here).

Related

For loop to dictionary comprehension correct translation (Python)

I need to find the longest string in a list for each letter of the alphabet.
My first straight forward approach looked like this:
alphabet = ["a","b", ..., "z"]
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
result = {key:"" for key in alphabet} # create a dictionary
# go through all words, if that word is longer than the current longest, save it
for word in text:
if word[0].lower() in alphabet and len(result[word[0].lower()]) < len(word):
result[word[0].lower()] = word.lower()
print(result)
which returns:
{'a': 'andhjtje9'}
as it is supposed to do.
In order to practice dictionary comprehension I tried to solve this in just one line:
result2 = {key:"" for key in alphabet}
result2 = {word[0].lower(): word.lower() for word in text if word[0].lower() in alphabet and len(result2[word[0].lower()]) < len(word)}
I just copied the if statement into the comprehension loop...
results2 however is:
{'a': 'ajhe5'}
can someone explain me why this is the case? I feel like I did exactly the same as in the first loop...
Thanks for any help!
List / Dict / Set - comprehension can not refer to themself while building itself - thats why you do not get what you want.
You can use a complicated dictionary comprehension to do this - with help of collections.groupby on a sorted list this could look like this:
from string import ascii_lowercase
from itertools import groupby
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key:sorted(value, key=len)[-1]
for key,value in groupby((s for s in sorted(text)
if s[0].lower() in frozenset(ascii_lowercase)),
lambda x:x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key:next(value) for key,value in groupby(
(s for s in sorted(text, key=lambda x: (x[0],-len(x)))
if s[0].lower() in frozenset(ascii_lowercase)),
lambda x:x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or several other ways ... but why would you?
Having it as for loops is much cleaner and easier to understand and would, in this case, follow the zen of python probably better.
Read about the zen of python by running:
import this

Reducing compute time for Anagram word search

The code below is a brute force method of searching a list of words and creating sub-lists of any that are Anagrams.
Searching the entire English dictionary is prohibitively time consuming so I'm curious of anyone has tips for reducing the compute complexity of the code?
def anogramtastic(anagrms):
d = []
e = []
for j in range(len(anagrms)):
if anagrms[j] in e:
pass
else:
templist = []
tester = anagrms[j]
tester = list(tester)
tester.sort()
tester = ''.join(tester)
for k in range(len(anagrms)):
if k == j:
pass
else:
testers = anagrms[k]
testers = list(testers)
testers.sort()
testers = ''.join(testers)
if testers == tester:
templist.append(anagrms[k])
e.append(anagrms[k])
if len(templist) > 0:
templist.append(anagrms[j])
d.append(templist)
d.sort(key=len,reverse=True)
return d
print(anogramtastic(wordlist))
How about using a dictionary of frozensets? Frozensets are immutable, meaning you can hash them for constant lookup. And when it comes to anagrams, what makes two words anagrams of each other is that they have the same letters with the same count. So you can construct a frozenset of {(letter, count), ...} pairs, and hash these for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict
def word2multiset(word):
return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [... ]
anagram_dict = defaultdict(set)
for word in list_of_words:
anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)
defaultdict(set,
{frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello',
'olleh'},
frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key will make finding the right group for each word fast; a lookup/insertion will be O(log n).
#!/usr/bin/env python3
#coding=utf8
from sys import stdin
groups = {}
for line in stdin:
w = line.strip()
g = ''.join(sorted(w))
if g not in groups:
groups[g] = []
groups[g].append(w)
for g, words in groups.items():
if len(words) > 1:
print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s
You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
dct = {}
for word in words:
key = tuple(sorted(word)) # Identifier based on letters.
dct.setdefault(key, []).append(word)
# Return a list of all that had an anagram.
return [words for words in dct.values() if len(words) > 1]
wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
'aide', 'idea', 'earth', 'heart', 'tea', 'tee']
print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]

Trying to sort a dict by dict.values()

The task is to read a file, create a dict and print out the word and its counter value. Below is code that works fine, but I can't seem to get my mind to understand why in the print_words() function, I can't change the sort to:
words = sorted(word_count.values())
and then print the word and its counter, sorted by the counter (number of times that word is in word_count[]).
def word_count_dict(filename):
word_count = {}
input_file = open(filename, 'r')
for line in input_file:
words = line.split()
for word in words:
word = word.lower()
if not word in word_count:
word_count[word] = 1
else:
word_count[word] = word_count[word] + 1
input_file.close()
return word_count
def print_words(filename):
word_count = word_count_dict(filename)
words = sorted(word_count.keys())
for word in words:
print word, word_count[word]
If you sorted output by value (including the keys), the simplest approach is sorting the items (key-value pairs), using a key argument to sorted that sorts on the value, then iterating the result. So for your example, you'd replace:
words = sorted(word_count.keys())
for word in words:
print word, word_count[word]
with (adding from operator import itemgetter to the top of the module):
# key=itemgetter(1) means the sort key is the second value in each key-value
# tuple, meaning the value
sorted_word_counts = sorted(word_count.items(), key=itemgetter(1))
for word, count in sorted_word_counts:
print word, count
First thing to note is that dictionaries are not considered to be ordered, although this may change in the future. Therefore, it is good practice to convert your dict to a list of tuples ordered in some way.
The below function will help you convert a dictionary to a list of tuples ordered by values.
d = {'a': 5, 'b': 1, 'c': 7, 'd': 3}
def order_by_values(dct):
rev = sorted((v, k) for k, v in dct.items())
return [t[::-1] for t in rev]
order_by_values(d) # [('b', 1), ('d', 3), ('a', 5), ('c', 7)]

Creating a dictionary where the key is an integer and the value is the length of a random sentence

Super new to to python here, I've been struggling with this code for a while now. Basically the function returns a dictionary with the integers as keys and the values are all the words where the length of the word corresponds with each key.
So far I'm able to create a dictionary where the values are the total number of each word but not the actual words themselves.
So passing the following text
"the faith that he had had had had an affect on his life"
to the function
def get_word_len_dict(text):
result_dict = {'1':0, '2':0, '3':0, '4':0, '5':0, '6' :0}
for word in text.split():
if str(len(word)) in result_dict:
result_dict[str(len(word))] += 1
return result_dict
returns
1 - 0
2 - 3
3 - 6
4 - 2
5 - 1
6 - 1
Where I need the output to be:
2 - ['an', 'he', 'on']
3 - ['had', 'his', 'the']
4 - ['life', 'that']
5 - ['faith']
6 - ['affect']
I think I need to have to return the values as a list. But I'm not sure how to approach it.
I think that what you want is a dic of lists.
result_dict = {'1':[], '2':[], '3':[], '4':[], '5':[], '6' :[]}
for word in text.split():
if str(len(word)) in result_dict:
result_dict[str(len(word))].append(word)
return result_dict
Fixing Sabian's answer so that duplicates aren't added to the list:
def get_word_len_dict(text):
result_dict = {1:[], 2:[], 3:[], 4:[], 5:[], 6 :[]}
for word in text.split():
n = len(word)
if n in result_dict and word not in result_dict[n]:
result_dict[n].append(word)
return result_dict
Check out list comprehensions
Integers are legal dictionaries keys so there is no need to make the numbers strings unless you want it that way for some other reason.
if statement in the for loop controls flow to add word only once. You could get this effect more automatically if you use set() type instead of list() as your value data structure. See more in the docs. I believe the following does the job:
def get_word_len_dict(text):
result_dict = {len(word) : [] for word in text.split()}
for word in text.split():
if word not in result_dict[len(word)]:
result_dict[len(word)].append(word)
return result_dict
try to make it better ;)
Instead of defining the default value as 0, assign it as set() and within if condition do, result_dict[str(len(word))].add(word).
Also, instead of preassigning result_dict, you should use collections.defaultdict.
Since you need non-repetitive words, I am using set as value instead of list.
Hence, your final code should be:
from collections import defaultdict
def get_word_len_dict(text):
result_dict = defaultdict(set)
for word in text.split():
result_dict[str(len(word))].add(word)
return result_dict
In case it is must that you want list as values (I think set should suffice your requirement), you need to further iterate it as:
for key, value in result_dict.items():
result_dict[key] = list(value)
What you need is a map to list-construct (if not many words, otherwise a 'Counter' would be fine):
Each list stands for a word class (number of characters). Map is checked whether word class ('3') found before. List is checked whether word ('had') found before.
def get_word_len_dict(text):
result_dict = {}
for word in text.split():
if not result_dict.get(str(len(word))): # add list to map?
result_dict[str(len(word))] = []
if not word in result_dict[str(len(word))]: # add word to list?
result_dict[str(len(word))].append(word)
return result_dict
-->
3 ['the', 'had', 'his']
2 ['he', 'an', 'on']
5 ['faith']
4 ['that', 'life']
6 ['affect']
the problem here is you are counting the word by length, instead you want to group them. You can achieve this by storing a list instead of a int:
def get_word_len_dict(text):
result_dict = {}
for word in text.split():
if len(word) in result_dict:
result_dict[len(word)].add(word)
else:
result_dict[len(word)] = {word} #using a set instead of list to avoid duplicates
return result_dict
Other improvements:
don't hardcode the key in the initialized dict but let it empty instead. Let the code add the new keys dynamically when necessary
you can use int as keys instead of strings, it will save you the conversion
use sets to avoid repetitions
Using groupby
Well, I'll try to propose something different: you can group by length using groupby from the python standard library
import itertools
def get_word_len_dict(text):
# split and group by length (you get a list if tuple(key, list of values)
groups = itertools.groupby(sorted(text.split(), key=lambda x: len(x)), lambda x: len(x))
# convert to a dictionary with sets
return {l: set(words) for l, words in groups}
You say you want the keys to be integers but then you convert them to strings before storing them as a key. There is no need to do this in Python; integers can be dictionary keys.
Regarding your question, simply initialize the values of the keys to empty lists instead of the number 0. Then, in the loop, append the word to the list stored under the appropriate key (the length of the word), like this:
string = "the faith that he had had had had an affect on his life"
def get_word_len_dict(text):
result_dict = {i : [] for i in range(1, 7)}
for word in text.split():
length = len(word)
if length in result_dict:
result_dict[length].append(word)
return result_dict
This results in the following:
>>> get_word_len_dict(string)
{1: [], 2: ['he', 'an', 'on'], 3: ['the', 'had', 'had', 'had', 'had', 'his'], 4: ['that', 'life'], 5: ['faith'], 6: ['affect']}
If you, as you mentioned, wish to remove the duplicate words when collecting your input string, it seems elegant to use a set and convert to a list as a final processing step, if this is needed. Also note the use of defaultdict so you don't have to manually initialize the dictionary keys and values as a default value set() (i.e. the empty set) gets inserted for each key that we try to access but not others:
from collections import defaultdict
string = "the faith that he had had had had an affect on his life"
def get_word_len_dict(text):
result_dict = defaultdict(set)
for word in text.split():
length = len(word)
result_dict[length].add(word)
return {k : list(v) for k, v in result_dict.items()}
This gives the following output:
>>> get_word_len_dict(string)
{2: ['he', 'on', 'an'], 3: ['his', 'had', 'the'], 4: ['life', 'that'], 5: ['faith'], 6: ['affect']}
Your code is counting the occurrence of each word length - but not storing the words themselves.
In addition to capturing each word into a list of words with the same size, you also appear to want:
If a word length is not represented, do not return an empty list for that length - just don't have a key for that length.
No duplicates in each word list
Each word list is sorted
A set container is ideal for accumulating the words - sets naturally eliminate any duplicates added to them.
Using defaultdict(sets) will setup an empty dictionary of sets -- a dictionary key will only be created if it is referenced in our loop that examines each word.
from collections import defaultdict
def get_word_len_dict(text):
#create empty dictionary of sets
d = defaultdict(set)
# the key is the length of each word
# The value is a growing set of words
# sets automatically eliminate duplicates
for word in text.split():
d[len(word)].add(word)
# the sets in the dictionary are unordered
# so sort them into a new dictionary, which is returned
# as a dictionary of lists
return {i:sorted(d[i]) for i in d.keys()}
In your example string of
a="the faith that he had had had had an affect on his life"
Calling the function like this:
z=get_word_len_dict(a)
Returns the following list:
print(z)
{2: ['an', 'he', 'on'], 3: ['had', 'his', 'the'], 4: ['life', 'that'], 5: ['faith'], 6: ['affect']}
The type of each value in the dictionary is "list".
print(type(z[2]))
<class 'list'>

Sub-dictionary erroneously repeated throughout dictionary?

I'm trying to store in a dictionary the number of times a given letter occurs after another given letter. For example, dictionary['a']['d'] would give me the number of times 'd' follows 'a' in short_list.
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
short_list = ['ford','hello','orange','apple']
# dictionary to keep track of how often a given letter occurs
tally = {}
for a in alphabet:
tally[a] = 0
# dictionary to keep track of how often a given letter occurs after a given letter
# e.g. how many times does 'd' follow 'a' -- master_dict['a']['d']
master_dict = {}
for a in alphabet:
master_dict[a] = tally
def precedingLetter(letter,word):
if word.index(letter) == 0:
return
else:
return word[word.index(letter)-1]
for a in alphabet:
for word in short_list:
for b in alphabet:
if precedingLetter(b,word) == a:
master_dict[a][b] += 1
However, the entries for all of the letters (the keys) in master_dict are all the same. I can't think of another way to properly tally each letter's occurrence after another letter. Can anyone offer some insight here?
If the sub-dicts are all supposed to be updated independently after creation, you need to shallow copy them. Easiest/fastest way is with .copy():
for a in alphabet:
master_dict[a] = tally.copy()
The other approach is to initialize the dict lazily. The easiest way to do that is with defaultdict:
from collections import defaultdict
masterdict = defaultdict(lambda: defaultdict(int))
# or
from collections import Counter, defaultdict
masterdict = defaultdict(Counter)
No need to pre-create empty tallies or populate masterdict at all, and this avoids creating dicts when the letter never occurs. If you access masterdict[a] for an a that doesn't yet exist, it creates a defaultdict(int) value for it automatically. When masterdict[a][b] is accessed and doesn't exist, the count is initialized to 0 automatically.
In Addition to the first answer it could be handy to perform your search the other way around. So instead of looking for each possible pair of letters, you could iterate just over the words.
In combination with the defaultdict this could simplify the process. As an example:
from collections import defaultdict
short_list = ['ford','hello','orange','apple']
master_dict = defaultdict(lambda: defaultdict(int))
for word in short_list:
for i in range(0,len(word)-1):
master_dict[word[i]][word[i+1]] += 1
Now master_dict contains all occured letter combinations while it returns zero for all other ones. A few examples below:
print(master_dict["f"]["o"]) # ==> 1
print(master_dict["o"]["r"]) # ==> 2
print(master_dict["a"]["a"]) # ==> 0
The problem you ask about is that the master_dict[a] = tally is only assigning the same object another name, so updating it through any of the references updates them all. You could fix that by making a copy of it each time by using master_dict[a] = tally.copy() as already pointed out in #ShadowRanger's answer.
As #ShadowRanger goes on to point out, it would also be considerably less wasteful to make your master_dict a defaultdict(lambda: defaultdict(int)) because doing so would only allocate and initialize counts for the combinations that actually encountered rather than all possible 2 letter permutations (if it was used properly).
To give you a concert idea of the savings, consider that there are only 15 unique letter pairs in your sample short_list of words, yet the exhaustive approach would still create and initialize 26 placeholders in 26 dictionaries for all 676 the possible counts.
It also occurs to me that you really don't need a two-level dictionary at all to accomplish what you want since the same thing could be done with a single dictionary which had keys comprised of tuples of pairs of characters.
Beyond that, another important improvement, as pointed out in #AdmPicard's answer, is that your approach of iterating through all possible permutations and seeing if any pairs of them are in each word via the precedingLetter() function is significantly more time consuming than it would be if you just iterated over all the successive pairs of letters that actually occurred in each one of them.
So, putting all this advice together would result in something like the following:
from collections import defaultdict
from string import ascii_lowercase
alphabet = set(ascii_lowercase)
short_list = ['ford','hello','orange','apple']
# dictionary to keep track of how often a letter pair occurred after one other.
# e.g. how many times 'd' followed an 'a' -> master_dict[('a','d')]
master_dict = defaultdict(int)
try:
from itertools import izip
except ImportError: # Python 3
izip = zip
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = iter(iterable), iter(iterable) # 2 independent iterators
next(b, None) # advance the 2nd one
return izip(a, b)
for word in short_list:
for (ch1,ch2) in pairwise(word.lower()):
if ch1 in alphabet and ch2 in alphabet:
master_dict[(ch1,ch2)] += 1
# display results
unique_pairs = 0
for (ch1,ch2) in sorted(master_dict):
print('({},{}): {}'.format(ch1, ch2, master_dict[(ch1,ch2)]))
unique_pairs += 1
print('A total of {} different letter pairs occurred in'.format(unique_pairs))
print('the words: {}'.format(', '.join(repr(word) for word in short_list)))
Which produces this output from the short_list:
(a,n): 1
(a,p): 1
(e,l): 1
(f,o): 1
(g,e): 1
(h,e): 1
(l,e): 1
(l,l): 1
(l,o): 1
(n,g): 1
(o,r): 2
(p,l): 1
(p,p): 1
(r,a): 1
(r,d): 1
A total of 15 different letter pairs occurred in
the words: 'ford', 'hello', 'orange', 'apple'

Categories

Resources