Sub-dictionary erroneously repeated throughout dictionary? - python

I'm trying to store in a dictionary the number of times a given letter occurs after another given letter. For example, dictionary['a']['d'] would give me the number of times 'd' follows 'a' in short_list.
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
short_list = ['ford','hello','orange','apple']
# dictionary to keep track of how often a given letter occurs
tally = {}
for a in alphabet:
    tally[a] = 0
# dictionary to keep track of how often a given letter occurs after a given letter
# e.g. how many times does 'd' follow 'a' -- master_dict['a']['d']
master_dict = {}
for a in alphabet:
    master_dict[a] = tally

def precedingLetter(letter,word):
    if word.index(letter) == 0:
        return
    else:
        return word[word.index(letter)-1]

for a in alphabet:
    for word in short_list:
        for b in alphabet:
            if precedingLetter(b,word) == a:
                master_dict[a][b] += 1
However, the entries for all of the letters (the keys) in master_dict are all the same. I can't think of another way to properly tally each letter's occurrence after another letter. Can anyone offer some insight here?

If the sub-dicts are all supposed to be updated independently after creation, you need to shallow-copy them. The easiest/fastest way is with .copy():
for a in alphabet:
    master_dict[a] = tally.copy()
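To see the aliasing problem the copy avoids, here is a quick REPL sketch (toy names, not from the question):
>>> tally = {'a': 0}
>>> alias = tally                # no copy: both names refer to one dict
>>> alias['a'] += 1
>>> tally['a']
1
>>> independent = tally.copy()   # shallow copy: a brand-new outer dict
>>> independent['a'] += 1
>>> tally['a']                   # unchanged this time
1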
The other approach is to initialize the dict lazily. The easiest way to do that is with defaultdict:
from collections import defaultdict
masterdict = defaultdict(lambda: defaultdict(int))
# or
from collections import Counter, defaultdict
masterdict = defaultdict(Counter)
No need to pre-create empty tallies or populate masterdict at all, and this avoids creating dicts when the letter never occurs. If you access masterdict[a] for an a that doesn't yet exist, it creates a defaultdict(int) value for it automatically. When masterdict[a][b] is accessed and doesn't exist, the count is initialized to 0 automatically.
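For illustration, a quick REPL sketch of that lazy behavior:
>>> from collections import defaultdict
>>> masterdict = defaultdict(lambda: defaultdict(int))
>>> masterdict['a']['d'] += 1    # both levels spring into existence here
>>> masterdict['a']['d']
1
>>> masterdict['z']['z']         # never tallied, still reads as 0
0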

In addition to the first answer, it can be handy to perform your search the other way around: instead of looking for each possible pair of letters, iterate just over the words.
In combination with the defaultdict this could simplify the process. As an example:
from collections import defaultdict
short_list = ['ford','hello','orange','apple']
master_dict = defaultdict(lambda: defaultdict(int))
for word in short_list:
    for i in range(0,len(word)-1):
        master_dict[word[i]][word[i+1]] += 1
Now master_dict contains all letter combinations that occurred, while returning zero for any other pair. A few examples below:
print(master_dict["f"]["o"]) # ==> 1
print(master_dict["o"]["r"]) # ==> 2
print(master_dict["a"]["a"]) # ==> 0

The problem you ask about is that master_dict[a] = tally only gives the same object another name, so updating it through any of the references updates them all. You can fix that by making a copy of it each time with master_dict[a] = tally.copy(), as already pointed out in @ShadowRanger's answer.
As @ShadowRanger goes on to point out, it would also be considerably less wasteful to make your master_dict a defaultdict(lambda: defaultdict(int)), because that only allocates and initializes counts for the combinations actually encountered, rather than for all possible two-letter permutations (when used properly).
To give you a concrete idea of the savings, consider that there are only 15 unique letter pairs in your sample short_list of words, yet the exhaustive approach would still create and initialize 26 placeholders in 26 dictionaries for all 676 possible counts.
It also occurs to me that you don't really need a two-level dictionary at all to accomplish what you want, since the same thing can be done with a single dictionary whose keys are tuples of character pairs.
Beyond that, another important improvement, as pointed out in @AdmPicard's answer, is that iterating through all possible permutations and checking whether any pair of them occurs in each word via the precedingLetter() function is significantly more time-consuming than simply iterating over the successive pairs of letters that actually occur in each word.
So, putting all this advice together would result in something like the following:
from collections import defaultdict
from string import ascii_lowercase
alphabet = set(ascii_lowercase)
short_list = ['ford','hello','orange','apple']
# dictionary to keep track of how often one letter occurred after another.
# e.g. how many times 'd' followed an 'a' -> master_dict[('a','d')]
master_dict = defaultdict(int)
try:
    from itertools import izip
except ImportError:  # Python 3
    izip = zip

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)  # 2 independent iterators
    next(b, None)  # advance the 2nd one
    return izip(a, b)

for word in short_list:
    for (ch1,ch2) in pairwise(word.lower()):
        if ch1 in alphabet and ch2 in alphabet:
            master_dict[(ch1,ch2)] += 1
# display results
unique_pairs = 0
for (ch1,ch2) in sorted(master_dict):
    print('({},{}): {}'.format(ch1, ch2, master_dict[(ch1,ch2)]))
    unique_pairs += 1
print('A total of {} different letter pairs occurred in'.format(unique_pairs))
print('the words: {}'.format(', '.join(repr(word) for word in short_list)))
Which produces this output from the short_list:
(a,n): 1
(a,p): 1
(e,l): 1
(f,o): 1
(g,e): 1
(h,e): 1
(l,e): 1
(l,l): 1
(l,o): 1
(n,g): 1
(o,r): 2
(p,l): 1
(p,p): 1
(r,a): 1
(r,d): 1
A total of 15 different letter pairs occurred in
the words: 'ford', 'hello', 'orange', 'apple'

Related

Randomly increment a key in a python dictionary until all of the keys have the same value?

I have 5 groups: A, B, C, D, E. I want to pick a random letter from the group letters, but I need exactly 4 instances of each letter.
So in the end I want four As, four Bs, four Cs, four Ds, and four Es, but I want to pick them randomly.
Using a dictionary was the best way I thought I could do this. I can keep track of how many letters I have this way, but I'm not certain how to write the code such that each letter appears only four times.
import random
random.seed(1)
groups = {
    'A' : 0,
    'B' : 0,
    'C' : 0,
    'D' : 0,
    'E' : 0,
}
# return a random letter between A and E, the keys of the group dictionary
def random_letter():
    return random.choice(list(groups.keys()))
while (groups['A'] != 4) and (groups['B'] != 4):  # What can I put here so that all of the groups A to E stop at exactly 4? Using 'and' here isn't working; I assume I'm using it incorrectly
    groups[random_letter()] += 1
print(list(groups.keys()))    # should return ['A', 'B', 'C', 'D', 'E']
print(list(groups.values()))  # should return [4, 4, 4, 4, 4]
I think this is the simplest way I could do it.
import random

str_options = "abcde" * 4
options_as_list = list(str_options)
random.shuffle(options_as_list)
print(options_as_list)
Why don't you use random.sample (note that the counts parameter requires Python 3.9+):
import random
res = random.sample('ABCDE', counts=[4]*5, k=20)
If all the letters always have the same number of occurrences:
res = random.sample('ABCDE'*4, 20)
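Either call returns a 20-element list; a quick check with collections.Counter (purely illustrative) confirms exactly four of each letter:
>>> from collections import Counter
>>> sorted(Counter(res).items())
[('A', 4), ('B', 4), ('C', 4), ('D', 4), ('E', 4)]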
Copy the desired choices into a list and shuffle it only once:
import random
random.seed(1)
groups = {
    'A': 0,
    'B': 0,
    'C': 0,
    'D': 0,
    'E': 0
}
full_list = [*groups]*4
random.shuffle(full_list)
print(full_list)
from random import choice
groups = dict.fromkeys("ABCDE", 0)
while (key_pool := [key for key in groups if groups[key] != 4]):
    key = choice(key_pool)
    print(f"Incrementing {key}")
    groups[key] += 1
Explanation: dict.fromkeys("ABCDE", 0) is a convenient short-hand for creating a dictionary, in this case, with the keys A through E, all mapping to the value 0.
while (key_pool := [key for key in groups if groups[key] != 4]): is pretty dense. We keep iterating as long as there exist keys which do not map to a count of 4. We generate that list of keys with the list comprehension, which we assign to key_pool via the so-called "walrus operator".
key = choice(key_pool) picks a random key from the collection of possible keys (keys that don't yet map to 4).
groups[key] += 1 increments the value of the chosen key.
That being said, it's not immediately obvious from your question what you actually intend to do with this dictionary. Re-reading your question makes me wonder if what you really want is just a string with the letters A through E, each appearing four times in some random order?
If that's the case, maybe something like this?:
from random import sample
chars = "ABCDE"
print("".join(sample(chars*4, k=len(chars)*4)))
chars*4 will yield the string 'ABCDEABCDEABCDEABCDE'
random.sample will pick k characters from that string (when an element is picked, that element will not be considered in successive samples).
"".join(...) joins a collections of strings into a single string.

Find the Letters Occurring Odd Number of Times

I came across an interesting question, and I am wondering whether we can solve it.
The Background
In O(n) time complexity, can we find the letters that occur an odd number of times, and output a list containing those letters while keeping their order consistent with the original string?
In case of multiple options to choose from, take the last occurrence as the unpaired character.
Here is an example:
# note we should keep the order of letters
findodd('Hello World') == ["H", "e", " ", "W", "r", "l", "d"] # it is good
findodd('Hello World') == ["H", "l", " ", "W", "r", "e", "d"] # it is wrong
My attempt
def findodd(s):
    hash_map = {}
    # This step is a bit strange. An example: for the string 'abc' I convert it
    # to the list ['a','b','c'], just because dict.get(a) did not seem to work
    # for the lookup, while dict.get('a') works well.
    s = list(s)
    res = []
    for i in range(len(s)):
        if hash_map.get(s[i]) == 1:
            hash_map[s[i]] = 0
            res.remove(s[i])
        else:
            hash_map[s[i]] = 1
            res.append(s[i])
    return res
findodd('Hello World')
Out:
["H", "e", " ", "W", "r", "l", "d"]
However, since I use list.remove, the time complexity of my solution is above O(n).
My Question:
Can anyone give some advice about an O(n) solution?
If I don't want to use s = list(s), how do I iterate over a string 'abc' to look up the value of key 'a' in a dict? dict.get('a') works but dict.get(a) won't work.
Source
Here are two web pages I looked at; however, they don't take the order of the letters into account and don't provide an O(n) solution.
find even time number, stack overflow
find odd time number, geeks for geeks
Python 3.7+ keeps dictionary keys in insertion order; use collections.OrderedDict for lower Python versions.
Go through your word: add each letter to the dict if it is not in it, otherwise delete the key from the dict.
The solution is the dict.keys() collection:
t = "Hello World"
d = {}
for c in t:
if c in d: # even time occurences: delete key
del d[c]
else:
d[c] = None # odd time occurence: add key
print(d.keys())
Output:
dict_keys(['H', 'e', ' ', 'W', 'r', 'l', 'd'])
It's O(n) because you touch each letter in your input exactly once, and a lookup into a dict is O(1).
There is some overhead from the key adding/deleting. If that bothers you, keep a count per letter instead and filter the keys() collection for those whose counts are odd; that makes it O(2*n), and since 2 is a constant it is still O(n).
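A minimal sketch of that counting variant (note the output order reflects each letter's first occurrence, so 'l' appears earlier than in the delete/re-add version above, which leaves it at its last unpaired position):
t = "Hello World"
counts = {}
for c in t:
    counts[c] = counts.get(c, 0) + 1
print([c for c in counts if counts[c] % 2])  # ['H', 'e', 'l', ' ', 'W', 'r', 'd']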
Here is an attempt (keys are insertion-ordered in Python 3.6+ dicts):
from collections import defaultdict
def find_odd(s):
    counter = defaultdict(int)
    for x in s:
        counter[x] += 1
    return [l for l, c in counter.items() if c%2 != 0]
The complexity of this algorithm is less than 2n, which is O(n).
Example
>>> s = "hello world"
>>> find_odd(s)
['h', 'e', 'l', ' ', 'w', 'r', 'd']
You could use the hash map to store the index at which a character occurs, and toggle it when it already has a value.
And then you just iterate the string again and only keep those letters that occur at the index that you have in the hash map:
from collections import defaultdict
def findodd(s):
    hash_map = defaultdict(int)
    for i, c in enumerate(s):
        hash_map[c] = 0 if hash_map[c] else i+1
    return [c for i, c in enumerate(s) if hash_map[c] == i+1]
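Tracing this by hand, it also honors the question's "take the last occurrence" rule: 'l' survives at index 9, its last unpaired position:
>>> findodd('Hello World')
['H', 'e', ' ', 'W', 'r', 'l', 'd']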
My solution from scratch
It uses the fact that a dict in Python 3.6+ is insertion-ordered.
def odd_one_out(s):
    hash_map = {}
    # reverse the original string to capture the last occurrence
    s = list(reversed(s))
    res = []
    for i in range(len(s)):
        if hash_map.get(s[i]):
            hash_map[s[i]] += 1
        else:
            hash_map[s[i]] = 1
    for k,v in hash_map.items():
        if v % 2 != 0:
            res.append(k)
    return res[::-1]
Crazy super short solution
# from user FArekkusu on Codewars
from collections import Counter

def find_odd(s):
    d = Counter(reversed(s))
    return [x for x in d if d[x] % 2][::-1]
Using Counter from collections will give you an O(n) solution. And since the Counter object is a dictionary (which keeps insertion order), your result can simply be a filter on the counts:
from collections import Counter
text = 'Hello World'
oddLetters = [ char for char,count in Counter(text).items() if count&1 ]
print(oddLetters) # ['H', 'e', 'l', ' ', 'W', 'r', 'd']

How to identify matching values in dictionary and create a new string with only those keys?

I have a method that pulls repeating letters from a string and adds them to a dictionary with the amount of times they repeat as the values. Now what I would like to do is pull all the keys that have matching values and create a string with only those keys.
example:
text = "theerrrdd"
count = {}
same_value = ""
for ch in text:
    if text.count(ch) > 1:
        count[ch] = text.count(ch)
How can I check count for keys with matching values, and if found, add those keys to same_value?
So in this example "e" and "d" would both have a value of 2. I want to add them to same_value so that when called, same_value would return "ed".
I basically just want to be able to identify which letters repeated the same amount of time.
First create a letter to count mapping, then reverse this mapping. Using the collections module:
from collections import defaultdict, Counter
text = 'theerrrdd'
# create dictionary mapping letter to count
letter_count = Counter(text)
# reverse mapping to give count to letters mapping
count_letters = defaultdict(list)
for letter, count in letter_count.items():
    count_letters[count].append(letter)
Result:
print(count_letters)
defaultdict(<class 'list'>, {1: ['t', 'h'],
                             2: ['e', 'd'],
                             3: ['r']})
Then, for example, count_letters[2] gives you all letters which are seen twice in your input string.
Using str.count in a loop is inefficient as it requires a full iteration of your string for each letter. In other words, such an algorithm has quadratic complexity, while collections.Counter has linear complexity.
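If you want to see the difference yourself, here is an illustrative micro-benchmark sketch (timings will vary by machine):
from collections import Counter
from timeit import timeit

text = 'theerrrdd' * 1000  # 9,000 characters

def with_count():
    counts = {}
    for ch in text:                  # n iterations...
        counts[ch] = text.count(ch)  # ...each one scanning all n characters
    return counts

def with_counter():
    return Counter(text)             # a single pass over the string

print(timeit(with_counter, number=10))  # fast
print(timeit(with_count, number=10))    # dramatically slower as text grows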
Another approach would be to use set() to get just the unique characters in the string, loop through the set and create a dict where the counts are the keys with lists of characters for each count. Then you can generate strings for each count using join().
text = "theerrrdd"
chars = set(text)
counts = {}
for ch in chars:
    ch_count = text.count(ch)
    if counts.get(ch_count, None):
        counts[ch_count].append(ch)
    else:
        counts[ch_count] = [ch]
# print string of chars where count is 2
print(''.join(counts[2]))
# OUTPUT
# ed
I think this is the simplest solution of all!
from collections import Counter
text = "theerrrdd"
count = Counter(text)
same_value = ''.join([k for k in count.keys() if count[k] > 1])
print(count)
print(same_value)
From the counter dictionary, create another dictionary with the count as key and all letters having that count as the value. Then iterate through the values of this newly created dictionary to find all values whose length is greater than 1, and form strings from them:
from collections import defaultdict
text = "theerrrdd"
count = {}
new_dict = defaultdict(list)
for ch in text:
    if text.count(ch) > 1:
        count[ch] = text.count(ch)
for k, v in count.items():
    new_dict[v].append(k)
same_values_list = [v for v in new_dict.values() if len(v) > 1]
for x in same_values_list:
    print(''.join(x))
# ed
new_dict is the newly created dictionary with count as key and all letters with that count as the value for that key:
print(new_dict)
# defaultdict(<class 'list'>, {2: ['e', 'd'],
#                              3: ['r']})

Reducing compute time for Anagram word search

The code below is a brute-force method of searching a list of words and creating sub-lists of any that are anagrams.
Searching the entire English dictionary this way is prohibitively time-consuming, so I'm curious whether anyone has tips for reducing the computational complexity of the code.
def anogramtastic(anagrms):
    d = []
    e = []
    for j in range(len(anagrms)):
        if anagrms[j] in e:
            pass
        else:
            templist = []
            tester = anagrms[j]
            tester = list(tester)
            tester.sort()
            tester = ''.join(tester)
            for k in range(len(anagrms)):
                if k == j:
                    pass
                else:
                    testers = anagrms[k]
                    testers = list(testers)
                    testers.sort()
                    testers = ''.join(testers)
                    if testers == tester:
                        templist.append(anagrms[k])
                        e.append(anagrms[k])
            if len(templist) > 0:
                templist.append(anagrms[j])
                d.append(templist)
    d.sort(key=len,reverse=True)
    return d

print(anogramtastic(wordlist))
How about a dictionary keyed by frozensets? Frozensets are immutable and hashable, so they can serve as dict keys with constant-time lookup. And what makes two words anagrams of each other is that they have the same letters with the same counts, so you can construct a frozenset of {(letter, count), ...} pairs and hash those for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict
def word2multiset(word):
    return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [... ]
anagram_dict = defaultdict(set)
for word in list_of_words:
    anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)
defaultdict(set,
            {frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello', 'olleh'},
             frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
             frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key makes finding the right group for each word fast; a lookup/insertion is O(1) on average.
#!/usr/bin/env python3
#coding=utf8
from sys import stdin
groups = {}
for line in stdin:
    w = line.strip()
    g = ''.join(sorted(w))
    if g not in groups:
        groups[g] = []
    groups[g].append(w)

for g, words in groups.items():
    if len(words) > 1:
        print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s
You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
    dct = {}
    for word in words:
        key = tuple(sorted(word))  # Identifier based on letters.
        dct.setdefault(key, []).append(word)
    # Return a list of all that had an anagram.
    return [words for words in dct.values() if len(words) > 1]

wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
            'aide', 'idea', 'earth', 'heart', 'tea', 'tee']
print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]

Append to a dict of lists with a dict comprehension

Suppose I have a large list of words. For example:
>>> with open('/usr/share/dict/words') as f:
... words=[word for word in f.read().split('\n') if word]
If I wanted to build an index by first letter of this word list, this is easy:
d={}
for word in words:
    if word[0].lower() in 'aeiou':
        d.setdefault(word[0].lower(),[]).append(word)
# You could use defaultdict here too...
Results in something like this:
{'a':[list of 'a' words], 'e':[list of 'e' words], 'i': etc...}
Is there a way to do this with a Python 2.7, 3+ dict comprehension? In other words, is it possible with the dict comprehension syntax to append to the list associated with a key as the dict is being built?
i.e.:
index={k[0].lower():XXX for k in words if k[0].lower() in 'aeiou'}
Where XXX performs an append operation or list creation for the key as index is being created.
Edit
Taking the suggestions and benchmarking:
def f1():
    d={}
    for word in words:
        c=word[0].lower()
        if c in 'aeiou':
            d.setdefault(c,[]).append(word)

def f2():
    d={}
    {d.setdefault(word[0].lower(),[]).append(word) for word in words
     if word[0].lower() in 'aeiou'}

def f3():
    d=defaultdict(list)
    {d[word[0].lower()].append(word) for word in words
     if word[0].lower() in 'aeiou'}

def f4():
    d=functools.reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
                       ((w[0].lower(), w) for w in words
                        if w[0].lower() in 'aeiou'), {})

def f5():
    d=defaultdict(list)
    for word in words:
        c=word[0].lower()
        if c in 'aeiou':
            d[c].append(word)
Produces this benchmark:
   rate/sec     f4     f2     f1     f3     f5
f4       11     -- -21.8% -31.1% -31.2% -41.2%
f2       14  27.8%     -- -11.9% -12.1% -24.8%
f1       16  45.1%  13.5%     --  -0.2% -14.7%
f3       16  45.4%  13.8%   0.2%     -- -14.5%
f5       18  70.0%  33.0%  17.2%  16.9%     --
The straight loop with a defaultdict is fastest, followed closely by the plain setdefault loop and the defaultdict set comprehension.
Thanks for the ideas!
No - dict comprehensions are designed to generate non-overlapping keys with each iteration; they don't support aggregation. For this particular use case, a loop is the proper way to accomplish the task efficiently (in linear time).
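The closest side-effect-free idiom is to sort and group first, e.g. with itertools.groupby; a sketch with a hypothetical word list (note the sort makes this O(n log n) rather than the loop's linear time):
from itertools import groupby

words = ['apple', 'Egg', 'ice', 'banana', 'urn']  # any word list
first = lambda w: w[0].lower()
vowel_words = sorted((w for w in words if first(w) in 'aeiou'), key=first)
index = {k: list(g) for k, g in groupby(vowel_words, key=first)}
print(index)  # {'a': ['apple'], 'e': ['Egg'], 'i': ['ice'], 'u': ['urn']}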
It is not possible (at least easily or directly) with a dict comprehension.
It is possible, but potentially abusive of the syntax, with a set or list comprehension:
# your code:
d={}
for word in words:
    if word[0].lower() in 'aeiou':
        d.setdefault(word[0].lower(),[]).append(word)

# a side effect set comprehension:
index={}
r={index.setdefault(word[0].lower(),[]).append(word) for word in words
   if word[0].lower() in 'aeiou'}
print r
print [(k, len(d[k])) for k in sorted(d.keys())]
print [(k, len(index[k])) for k in sorted(index.keys())]
Prints:
set([None])
[('a', 17094), ('e', 8734), ('i', 8797), ('o', 7847), ('u', 16385)]
[('a', 17094), ('e', 8734), ('i', 8797), ('o', 7847), ('u', 16385)]
The set comprehension produces a set of the results of the setdefault(...).append(...) calls made while iterating over the words list; since append() returns None, that set is just set([None]). It also produces your desired side effect of building your dict of lists.
It is not as readable (IMHO) as the straight looping construct and should be avoided (IMHO). It is no shorter and probably not materially faster. This is more interesting trivia about Python than anything useful -- maybe to win a bet?
I'd use filter:
>>> words = ['abcd', 'abdef', 'eft', 'egg', 'uck', 'ice']
>>> index = {k.lower() : list(filter(lambda x:x[0].lower() == k.lower(),words)) for k in 'aeiou'}
>>> index
{'a': ['abcd', 'abdef'], 'i': ['ice'], 'e': ['eft', 'egg'], 'u': ['uck'], 'o': []}
This is not exactly a dict comprehension, but:
reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
       ((w[0].lower(), w) for w in words
        if w[0].lower() in 'aeiou'), {})
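For example, with a small hypothetical word list (Python 3 spelling, where reduce lives in functools):
>>> from functools import reduce
>>> words = ['Apple', 'egg', 'Ice', 'kiwi']
>>> reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
...        ((w[0].lower(), w) for w in words
...         if w[0].lower() in 'aeiou'), {})
{'a': ['Apple'], 'e': ['egg'], 'i': ['Ice']}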
Not answering the dict-comprehension question directly, but it might help someone searching for this problem: in a reduced example, when filling growing lists into a new dictionary on the fly, you can call a helper function from a list comprehension, which is, admittedly, no better than a loop.
def fill_lists_per_dict_keys(k, v):
    d[k] = (
        v
        if k not in d
        else d[k] + v
    )

# global d
d = {}
out = [fill_lists_per_dict_keys(i[0], [i[1]]) for i in d2.items()]  # d2 is an existing source dict
The out list only collects the None return values of each call; it exists just to consume the comprehension.
If you ever want to use the new dictionary inside the list comprehension at runtime, or if you find your dictionary being overwritten on each loop for some other reason, make it global with global d at the beginning of the script (commented out above because it is not necessary here).
