For loop to dictionary comprehension correct translation (Python)

I need to find the longest string in a list for each letter of the alphabet.
My first straightforward approach looked like this:
alphabet = ["a","b", ..., "z"]
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
result = {key:"" for key in alphabet} # create a dictionary
# go through all words, if that word is longer than the current longest, save it
for word in text:
    if word[0].lower() in alphabet and len(result[word[0].lower()]) < len(word):
        result[word[0].lower()] = word.lower()
print(result)
which returns:
{'a': 'andhjtje9'}
as it is supposed to do.
In order to practice dictionary comprehension I tried to solve this in just one line:
result2 = {key:"" for key in alphabet}
result2 = {word[0].lower(): word.lower() for word in text if word[0].lower() in alphabet and len(result2[word[0].lower()]) < len(word)}
I just copied the if statement into the comprehension loop...
result2, however, is:
{'a': 'ajhe5'}
Can someone explain to me why this is the case? I feel like I did exactly the same as in the first loop...
Thanks for any help!

A list/dict/set comprehension cannot refer to itself while it is being built - that's why you don't get what you want. Your result2[...] lookup sees the pre-existing dictionary of empty strings, so every word passes the length check and later words simply overwrite earlier ones; the last word starting with 'a', 'ajhe5', wins.
You can use a complicated dictionary comprehension to do this - with the help of itertools.groupby on a sorted list it could look like this:
from string import ascii_lowercase
from itertools import groupby
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key: sorted(value, key=len)[-1]
     for key, value in groupby((s for s in sorted(text)
                                if s[0].lower() in frozenset(ascii_lowercase)),
                               lambda x: x[0].lower())}
print(d)  # {'a': 'andhjtje9'}
or
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key: next(value) for key, value in groupby(
        (s for s in sorted(text, key=lambda x: (x[0], -len(x)))
         if s[0].lower() in frozenset(ascii_lowercase)),
        lambda x: x[0].lower())}
print(d)  # {'a': 'andhjtje9'}
or several other ways ... but why would you?
Having it as for loops is much cleaner and easier to understand, and in this case it probably follows the Zen of Python better.
Read the Zen of Python by running:
import this
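That said, if all you want is a single self-contained dict comprehension for the original task, a sketch like the following works (it is quadratic, since it rescans text once per letter, and the default argument of max handles letters with no matching words):

```python
from string import ascii_lowercase

text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]

# for each letter, keep the longest matching word ("" when nothing starts with it)
result = {c: max((w.lower() for w in text if w[0].lower() == c),
                 key=len, default="")
          for c in ascii_lowercase}
print(result["a"])  # andhjtje9
```

This is readable only because the per-letter logic is a single max call; for anything more involved, the loop above is still the better choice.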


Decrease time of a function (Python)

I'm trying to create a function in Python that, given a list of strings, returns a dict where each key (an index) maps to the most frequent character at that index across all the strings. For example, list1 = ['one', 'two', 'twin', 'who'] should return index 0='t', index 1='w', index 2='o', index 3='n'; the most frequent character at index 1 across all the strings is 'w'.
I found a solution but if I have lists with thousands of strings inside it will require too much time to perform. I would like to know if you can give me some help to decrease the time of execution.
Here is what I tried; it seems too slow to run on lists with thousands of strings inside:
list1 = ['one', 'two', 'twin', 'who']
width = len(max(list1, key=len))
chars = {}
for i, item in enumerate(zip(*[s.ljust(width) for s in list1])):
    set1 = set(item)
    if ' ' in set1:
        set1.remove(' ')
    chars[i] = max(set1, key=item.count)
print(chars)
Whether something is quick enough is a matter of use case, but this solution takes a couple of seconds to go through the default wordlist available under OS X.
Python's collections.Counter implements a counter object for you, so you don't have to keep track of the counts of multiple possible values yourself.
I've paired it with defaultdict, which initializes missing keys using a factory function - so if we haven't already seen the index we're updating the count for, it gets initialized to a Counter object that we then update.
from collections import defaultdict, Counter
with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()

letters = defaultdict(Counter)

for word in words:
    for idx, letter in enumerate(word):
        letters[idx].update((letter, ))

for idx, counter in letters.items():
    print(idx, counter.most_common(1))
Whether this is quick enough depends on your use case as mentioned; it can be done a lot quicker if necessary, but it's probably quick enough. For 235 886 words the runtime is:
python3 letterfreq.py 2.67s user 0.04s system 99% cpu 2.734 total
This assumes that every word is lowercased, if not, lowercase it before adding it to your Counter object.
If you want to implement it without the Counter or defaultdict parts of the standard library (which are just helpers that save you from reimplementing the same small code repeatedly), you can do the exact same thing yourself manually:
with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()

letter_positions = {}

for word in words:
    for idx, letter in enumerate(word):
        if idx not in letter_positions:
            letter_positions[idx] = {}
        if letter not in letter_positions[idx]:
            letter_positions[idx][letter] = 0
        letter_positions[idx][letter] += 1

final_dict = {}

for idx, counts in letter_positions.items():
    most_popular = sorted(counts.items(), key=lambda v: v[1], reverse=True)
    print(idx, most_popular)
    final_dict[idx] = most_popular[0][0]

print(final_dict)
Then pick as many entries as necessary from most_popular when going through the list afterwards.
Since we're no longer using the defaultdict and Counter abstractions, our running time is now about a third of the previous one:
python3 letterfreq2.py 1.08s user 0.03s system 98% cpu 1.124 total
It's usually a good idea to go through what you're trying to do and formulate a strategy - i.e. "ok, I need to keep track of how many times a letter has appeared in this location .. so for that I need some way to keep values for each index .. and then for each letter ..".
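That strategy can also be sketched as a single comprehension: zip_longest transposes the words (padding short ones with None), and Counter picks the winner per column. The helper name most_common_per_index is just an illustration, not anything from the question:

```python
from collections import Counter
from itertools import zip_longest

def most_common_per_index(words):
    # transpose the words; zip_longest pads the short ones with None,
    # which we filter out before counting
    return {i: Counter(ch for ch in col if ch is not None).most_common(1)[0][0]
            for i, col in enumerate(zip_longest(*words))}

print(most_common_per_index(['one', 'two', 'twin', 'who']))
# {0: 't', 1: 'w', 2: 'o', 3: 'n'}
```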
I just made some improvements based on your algorithm.
First, you can use itertools.zip_longest() instead of zip() to remove the need for ljust() and the width variable:
from itertools import zip_longest
list1 = ['one', 'two', 'twin', 'who']
chars = {}
for i, item in enumerate(zip_longest(*list1)):
    set1 = set(item)
    if None in set1:
        set1.remove(None)
    chars[i] = max(set1, key=item.count)
print(chars)
Then, replace max(set1, key=item.count) with the more efficient Counter(item).most_common(1)[0][0], combined with or set1.most_common(2)[1][0] to skip None values (if the most common element is None, the or falls through to the second most common):
from itertools import zip_longest
from collections import Counter
list1 = ['one', 'two', 'twin', 'who']
chars = {}
for i, item in enumerate(zip_longest(*list1)):
    set1 = Counter(item)
    chars[i] = set1.most_common(1)[0][0] or set1.most_common(2)[1][0]
print(chars)
As itertools and collections are built-in Python modules, you can import them directly without having to pip install them.

Python regex query to parse very simple dictionary

I am new to the regex module and am learning a simple case: extracting the keys and values from a simple dictionary.
The dictionary cannot contain nested dicts or lists, but may have simple tuples.
MWE
import re
# note: the dictionaries are simple and do NOT contain lists or nested dicts;
# these two examples suffice for the regex matching.
d = "{'a':10,'b':True,'c':(5,'a')}" # ['a', 10, 'b', True, 'c', (5,'a') ]
d = "{'c':(5,'a'), 'd': 'TX'}" # ['c', (5,'a'), 'd', 'TX']
regexp = r"(.*):(.*)" # I am not sure how to repeat this pattern separated by ,
out = re.match(regexp,d).groups()
out
You should not use regex for this job. When the input string is valid Python syntax, you can use ast.literal_eval.
Like this:
import ast
# ...
out = ast.literal_eval(d)
Now you have a dictionary object in Python. You can for instance get the key/value pairs in a (dict_items) list:
print(out.items())
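Putting those pieces together with the second example string from the question (the # ... above stands for the rest of your script):

```python
import ast

d = "{'c':(5,'a'), 'd': 'TX'}"

# literal_eval safely evaluates the string as a Python literal
out = ast.literal_eval(d)
print(out)                # {'c': (5, 'a'), 'd': 'TX'}
print(list(out.items()))  # [('c', (5, 'a')), ('d', 'TX')]
```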
Regex
Regex is not the right tool; there will always be boundary cases that get parsed wrongly. But to get repeated matches, you are better off using findall. Here is a simple example regex:
regexp = r"([^{\s][^:]*):([^:}]*)(?:[,}])"
out = re.findall(regexp, d)
This will give a list of pairs.
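For example, on the first string from the question this regex yields three key/value pairs. Note that the values come back as raw substrings (quotes included), not as Python objects:

```python
import re

d = "{'a':10,'b':True,'c':(5,'a')}"
regexp = r"([^{\s][^:]*):([^:}]*)(?:[,}])"

# findall returns one (key, value) string pair per match
out = re.findall(regexp, d)
print(out)  # [("'a'", '10'), ("'b'", 'True'), ("'c'", "(5,'a')")]
```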
Regex would be hard (perhaps impossible, but I'm not versed enough to say confidently) to use because of the ',' nested in your tuples. Just for the sake of it, here is (regex-less) code that parses your string on separators, ignoring parts inside parentheses:
d = "{'c':(5,'a',1), 'd': 'TX', 1:(1,2,3)}"
d=d.replace("{","").replace("}","")
indices = []
inside = False
for i, l in enumerate(d):
    if inside:
        if l == ")":
            inside = False
            continue
        continue
    if l == "(":
        inside = True
        continue
    if l in {":", ","}:
        indices.append(i)
indices.append(len(d))
parts = []
start = 0
for i in indices:
    parts.append(d[start:i].strip())
    start = i + 1
parts
# ["'c'", "(5,'a',1)", "'d'", "'TX'", '1', '(1,2,3)']

Populate dictionary from list

I have a list of strings (from a .tt file) that looks like this:
list1 = ['have\tVERB', 'and\tCONJ', ..., 'tree\tNOUN', 'go\tVERB']
I want to turn it into a dictionary that looks like:
dict1 = { 'have':'VERB', 'and':'CONJ', 'tree':'NOUN', 'go':'VERB' }
I was thinking of substitution, but it doesn't work that well. Is there a way to tag the tab string '\t' as a divider?
Try the following:
dict1 = dict(item.split('\t') for item in list1)
Output:
>>>dict1
{'and': 'CONJ', 'go': 'VERB', 'tree': 'NOUN', 'have': 'VERB'}
Since str.split also splits on '\t' by default ('\t' is considered whitespace), you can get a functional approach by feeding dict a map, which looks quite elegant:
d = dict(map(str.split, list1))
With the dictionary d now being in the wanted form:
print(d)
{'and': 'CONJ', 'go': 'VERB', 'have': 'VERB', 'tree': 'NOUN'}
If you need a split only on '\t' (while ignoring ' ' and '\n') and still want to use the map approach, you can create a partial object with functools.partial that only uses '\t' as the separator:
from functools import partial
# only splits on '\t' ignoring new-lines, white space e.t.c
tabsplit = partial(str.split, sep='\t')
d = dict(map(tabsplit, list1))
this, of course, yields the same result for d using the sample list of strings.
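A case where the explicit sep='\t' matters is keys that themselves contain spaces (a hypothetical input, not from the question):

```python
from functools import partial

# hypothetical input with a multi-word key
list1 = ['New York\tNOUN', 'go\tVERB']

tabsplit = partial(str.split, sep='\t')  # split on tabs only, never on spaces
d = dict(map(tabsplit, list1))
print(d)  # {'New York': 'NOUN', 'go': 'VERB'}
```

With the default whitespace split, 'New York\tNOUN' would split into three pieces and dict() would raise a ValueError.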
You can do that with a simple dict comprehension and str.split (without arguments, split splits on any whitespace):
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
dict1 = {x.split()[0]:x.split()[1] for x in list1}
result:
{'and': 'CONJ', 'go': 'VERB', 'tree': 'NOUN', 'have': 'VERB'}
EDIT: the x.split()[0]:x.split()[1] splits each item twice, which is not optimal. Other answers here do it better without a dict comprehension.
A short way to solve the problem, since the split method splits on '\t' by default (as pointed out by Jim Fasarakis-Hilliard), could be:
dictionary = dict(item.split() for item in list1)
print dictionary
I also wrote down a simpler, more classic approach.
Not very Pythonic, but easy for beginners to understand:
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
dictionary1 = {}
for item in list1:
    splitted_item = item.split('\t')
    word = splitted_item[0]
    word_type = splitted_item[1]
    dictionary1[word] = word_type
print dictionary1
Here I wrote the same code with very verbose comments:
# Let's start with our word list, we'll call it 'list1'
list1 = ['have\tVERB', 'and\tCONJ', 'tree\tNOUN', 'go\tVERB']
# Here's an empty dictionary, 'dictionary1'
dictionary1 = {}
# Let's start to iterate using variable 'item' through 'list1'
for item in list1:
    # Here I split item in two parts, passing the '\t' character
    # to the split function and put the resulting list of two elements
    # into 'splitted_item' variable.
    # If you want to know more about split function check the link available
    # at the end of this answer
    splitted_item = item.split('\t')
    # Just to make code more readable here I now put 1st part
    # of the splitted item (part 0 because we start counting
    # from number 0) in "word" variable
    word = splitted_item[0]
    # I use the same approach to save the 2nd part of the
    # splitted item into 'word_type' variable
    # Yes, you're right: we use 1 because we start counting from 0
    word_type = splitted_item[1]
    # Finally I add to 'dictionary1', 'word' key with a value of 'word_type'
    dictionary1[word] = word_type
# After the for loop has been completed I print the now
# complete dictionary1 to check if result is correct
print dictionary1
Useful links:
You can quickly copy and paste this code here to check how it works and tweak it if you like: http://www.codeskulptor.com
If you want to learn more about split and string functions in general: https://docs.python.org/2/library/string.html

Append to a dict of lists with a dict comprehension

Suppose I have a large list of words. For an example:
>>> with open('/usr/share/dict/words') as f:
...     words = [word for word in f.read().split('\n') if word]
If I wanted to build an index by first letter of this word list, this is easy:
d={}
for word in words:
    if word[0].lower() in 'aeiou':
        d.setdefault(word[0].lower(), []).append(word)
        # You could use defaultdict here too...
# You could use defaultdict here too...
Results in something like this:
{'a':[list of 'a' words], 'e':[list of 'e' words], 'i': etc...}
Is there a way to do this with Python 2.7, 3+ dict comprehension? In other words, is it possible with the dict comprehension syntax to append the list represented by the key as the dict is being built?
ie:
index={k[0].lower():XXX for k in words if k[0].lower() in 'aeiou'}
Where XXX performs an append operation or list creation for the key as index is being created.
Edit
Taking the suggestions and benchmarking:
def f1():
    d = {}
    for word in words:
        c = word[0].lower()
        if c in 'aeiou':
            d.setdefault(c, []).append(word)

def f2():
    d = {}
    {d.setdefault(word[0].lower(), []).append(word) for word in words
     if word[0].lower() in 'aeiou'}

def f3():
    d = defaultdict(list)
    {d[word[0].lower()].append(word) for word in words
     if word[0].lower() in 'aeiou'}

def f4():
    d = functools.reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
                         ((w[0].lower(), w) for w in words
                          if w[0].lower() in 'aeiou'), {})

def f5():
    d = defaultdict(list)
    for word in words:
        c = word[0].lower()
        if c in 'aeiou':
            d[c].append(word)
Produces this benchmark:
     rate/sec      f4      f2      f1      f3      f5
f4         11      --  -21.8%  -31.1%  -31.2%  -41.2%
f2         14   27.8%      --  -11.9%  -12.1%  -24.8%
f1         16   45.1%   13.5%      --   -0.2%  -14.7%
f3         16   45.4%   13.8%    0.2%      --  -14.5%
f5         18   70.0%   33.0%   17.2%   16.9%      --
The straight loop with a defaultdict is fastest, followed by the set comprehension and the loop with setdefault.
Thanks for the ideas!
No - dict comprehensions are designed to generate non-overlapping keys with each iteration; they don't support aggregation. For this particular use case, a loop is the proper way to accomplish the task efficiently (in linear time).
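A minimal sketch of that linear-time loop, using collections.defaultdict to avoid the setdefault calls (the words list here is sample data, not the full dictionary file):

```python
from collections import defaultdict

words = ['apple', 'Egg', 'tree', 'ice', 'oak', 'ant']  # sample data

index = defaultdict(list)
for word in words:
    c = word[0].lower()
    if c in 'aeiou':          # only index vowel-initial words
        index[c].append(word)

print(dict(index))
# {'a': ['apple', 'ant'], 'e': ['Egg'], 'i': ['ice'], 'o': ['oak']}
```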
It is not possible (at least easily or directly) with a dict comprehension.
It is possible, but potentially abusive of the syntax, with a set or list comprehension:
# your code:
d = {}
for word in words:
    if word[0].lower() in 'aeiou':
        d.setdefault(word[0].lower(), []).append(word)

# a side effect set comprehension:
index = {}
r = {index.setdefault(word[0].lower(), []).append(word) for word in words
     if word[0].lower() in 'aeiou'}
print r
print [(k, len(d[k])) for k in sorted(d.keys())]
print [(k, len(index[k])) for k in sorted(index.keys())]
Prints:
set([None])
[('a', 17094), ('e', 8734), ('i', 8797), ('o', 7847), ('u', 16385)]
[('a', 17094), ('e', 8734), ('i', 8797), ('o', 7847), ('u', 16385)]
The set comprehension produces a set holding the results of the setdefault(...).append(...) calls made while iterating over the words list - since append() always returns None, that sum total is just set([None]). It also produces your desired side effect of building your dict of lists.
It is not as readable (IMHO) as the straight looping construct and should be avoided (IMHO). It is no shorter and probably not materially faster. This is more interesting trivia about Python than useful -- IMHO... Maybe to win a bet?
I'd use filter:
>>> words = ['abcd', 'abdef', 'eft', 'egg', 'uck', 'ice']
>>> index = {k.lower() : list(filter(lambda x:x[0].lower() == k.lower(),words)) for k in 'aeiou'}
>>> index
{'a': ['abcd', 'abdef'], 'i': ['ice'], 'e': ['eft', 'egg'], 'u': ['uck'], 'o': []}
This is not exactly a dict comprehension, but:
reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
       ((w[0].lower(), w) for w in words
        if w[0].lower() in 'aeiou'), {})
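To see this reduce trick on a small sample: the append(...) or d expression works because append returns None, so the or hands the accumulator dict back on every step (the words list here is sample data):

```python
from functools import reduce

words = ['apple', 'Ant', 'bee', 'egg']  # sample data

# accumulate (first_letter, word) pairs into a dict of lists
d = reduce(lambda d, w: d.setdefault(w[0], []).append(w[1]) or d,
           ((w[0].lower(), w) for w in words
            if w[0].lower() in 'aeiou'),
           {})
print(d)  # {'a': ['apple', 'Ant'], 'e': ['egg']}
```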
Not answering the question about dict comprehensions, but it might help someone searching for this problem: when filling growing lists into a new dictionary on the fly, consider calling a function from a list comprehension - which is, admittedly, no better than a loop.
def fill_lists_per_dict_keys(k, v):
    d[k] = (
        v
        if k not in d
        else d[k] + v
    )

# global d
d = {}
out = [fill_lists_per_dict_keys(i[0], [i[1]]) for i in d2.items()]
The out is only there to swallow the None returned by each call.
If you ever want to use the new dictionary inside the list comprehension itself at runtime, or if your dictionary gets overwritten on each loop for some other reason, consider declaring it global with global d at the beginning of the script (commented out above because it is not necessary here).

Perform set operation difference on a list of tuples

I am trying to get the difference between two containers, but the containers have a weird structure, so I don't know the best way to perform a difference on them. One container's type and structure I cannot alter, but the other's I can (the delims variable).
delims = ['on','with','to','and','in','the','from','or']
words = collections.Counter(s.split()).most_common()
# words results in [("the",2), ("a",9), ("diplomacy", 1)]
# I want to perform a 'difference' operation on words to remove all the delims words
descriptive_words = set(words) - set(delims)
# because of the unique structure of words (a list of tuples) it's hard to perform
# a difference on it. What would be the best way to perform a difference? Maybe...
delims = [('on',0),('with',0),('to',0),('and',0),('in',0),('the',0),('from',0),('or',0)]
words = collections.Counter(s.split()).most_common()
descriptive_words = set(words) - set(delims)
# Or maybe
words = collections.Counter(s.split()).most_common()
n_words = []
for w in words:
    n_words.append(w[0])
delims = ['on','with','to','and','in','the','from','or']
descriptive_words = set(n_words) - set(delims)
How about just modifying words by removing all the delimiters?
words = collections.Counter(s.split())
for delim in delims:
    del words[delim]
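For example, with a sample sentence (assumed here, not from the question). Note that Counter's __delitem__, unlike a plain dict's, does not raise KeyError for a missing key, so every delimiter can be deleted unconditionally:

```python
import collections

s = "the a a a the diplomacy on or"  # sample sentence
delims = ['on', 'with', 'to', 'and', 'in', 'the', 'from', 'or']

words = collections.Counter(s.split())
for delim in delims:
    del words[delim]  # Counter silently ignores missing keys here

print(words)  # Counter({'a': 3, 'diplomacy': 1})
```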
This is how I would do it:
delims = set(['on','with','to','and','in','the','from','or'])
# ...
descriptive_words = filter(lambda x: x[0] not in delims, words)
Using the filter method. A viable alternative would be:
delims = set(['on','with','to','and','in','the','from','or'])
# ...
descriptive_words = [(word, count) for word, count in words if word not in delims]
Making sure that the delims are in a set to allow for O(1) lookup.
The simplest answer is to do:
import collections
s = "the a a a a the a a a a a diplomacy"
delims = {'on','with','to','and','in','the','from','or'}
# For older versions of python without set literals:
# delims = set(['on','with','to','and','in','the','from','or'])
words = collections.Counter(s.split())
not_delims = {key: value for (key, value) in words.items() if key not in delims}
# For older versions of python without dict comprehensions:
# not_delims = dict(((key, value) for (key, value) in words.items() if key not in delims))
Which gives us:
{'a': 9, 'diplomacy': 1}
An alternative option is to do it pre-emptively:
import collections
s = "the a a a a the a a a a a diplomacy"
delims = {'on','with','to','and','in','the','from','or'}
counted_words = collections.Counter((word for word in s.split() if word not in delims))
Here you apply the filtering on the list of words before you give it to the counter, and this gives the same result.
If you're iterating through it anyway, why bother converting them to sets?
dwords = [delim[0] for delim in delims]
words = [word for word in words if word[0] not in dwords]
You can also use filter with a lambda:
filter(lambda word: word[0] not in delims, words)
