Best way to turn word list into frequency dict - python

What's the best way to convert a list/tuple into a dict where the keys are the distinct values of the list and the values are the frequencies of those distinct values?
In other words:
['a', 'b', 'b', 'a', 'b', 'c']
-->
{'a': 2, 'b': 3, 'c': 1}
(I've had to do something like the above so many times, is there anything in the standard lib that does it for you?)
EDIT:
Jacob Gabrielson points out there is something coming in the standard lib for the 2.7/3.1 branch

The way I find easiest to understand (though it might not be the most efficient) is:
{i:words.count(i) for i in set(words)}

Kind of
from collections import defaultdict
fq = defaultdict(int)
for w in words:
    fq[w] += 1
That usually works nicely.

Just a note that, starting with Python 2.7/3.1, this functionality will be built in to the collections module, see this bug for more information. Here's the example from the release notes:
>>> from collections import Counter
>>> c=Counter()
>>> for letter in 'here is a sample of english text':
...     c[letter] += 1
...
>>> c
Counter({' ': 6, 'e': 5, 's': 3, 'a': 2, 'i': 2, 'h': 2,
'l': 2, 't': 2, 'g': 1, 'f': 1, 'm': 1, 'o': 1, 'n': 1,
'p': 1, 'r': 1, 'x': 1})
>>> c['e']
5
>>> c['z']
0

Counter was already mentioned, but it can be used even more simply: pass the list straight to the constructor.
from collections import Counter
my_list = ['a', 'b', 'b', 'a', 'b', 'c']
Counter(my_list)  # returns a Counter, a dict-like object
# Counter({'b': 3, 'a': 2, 'c': 1})
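Since a Counter is a dict subclass, the result can be used much like the plain dict asked for in the question. A minimal sketch, reusing my_list from above (the names counts, as_dict and top are only illustrative):

```python
from collections import Counter

my_list = ['a', 'b', 'b', 'a', 'b', 'c']
counts = Counter(my_list)

# A plain dict with the same items, if a Counter is not wanted.
as_dict = dict(counts)

# most_common(n) returns the n most frequent items as (value, count) pairs.
top = counts.most_common(1)

print(as_dict == {'a': 2, 'b': 3, 'c': 1})  # True
print(top)                                   # [('b', 3)]
```

Missing keys simply report a count of 0, so counts['z'] does not raise a KeyError.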

This is an abomination, but:
from itertools import groupby
dict((k, len(list(xs))) for k, xs in groupby(sorted(items)))
I can't think of a reason one would choose this method over S.Lott's, but if someone's going to point it out, it might as well be me. :)

I think using the collections library is the easiest way to do it. But if you want to build the frequency dictionary without it, here is another way:
l = [1,4,2,1,2,6,8,2,2]
d = {}
for i in l:
    if i in d:
        d[i] = 1 + d[i]
    else:
        d[i] = 1
print(d)
Output:
{1: 2, 4: 1, 2: 4, 6: 1, 8: 1}

I decided to go ahead and test the suggested versions. I found collections.Counter, as suggested by Jacob Gabrielson, to be the fastest, followed by the defaultdict version by S.Lott.
Here is my code:
from collections import defaultdict
from collections import Counter
import random
# using defaultdict
def counter_default_dict(list):
    count = defaultdict(int)
    for i in list:
        count[i] += 1
    return count
# using a normal dict
def counter_dict(list):
    count = {}
    for i in list:
        count.update({i: count.get(i, 0) + 1})
    return count
# using list.count and a dict comprehension
def counter_count(list):
    count = {i: list.count(i) for i in set(list)}
    return count
# using collections.Counter
def counter_counter(list):
    count = Counter(list)
    return count
list = sorted([random.randint(0, 250) for i in range(300)])
if __name__ == '__main__':
    from timeit import timeit
    print("collections.Defaultdict ", timeit("counter_default_dict(list)", setup="from __main__ import counter_default_dict,list", number=1000))
    print("Dict", timeit("counter_dict(list)", setup="from __main__ import counter_dict,list", number=1000))
    print("list.count ", timeit("counter_count(list)", setup="from __main__ import counter_count,list", number=1000))
    print("collections.Counter.count ", timeit("counter_counter(list)", setup="from __main__ import counter_counter,list", number=1000))
And my results:
collections.Defaultdict  0.06787874956330614
Dict 0.15979115872995675
list.count  1.199258431219126
collections.Counter.count  0.025896202538920665
Do let me know how I can improve the analysis.
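One possible improvement, sketched here on the assumption that correctness should be confirmed before timing anything: check that all four approaches return the same counts for the same input. The function names below are shortened stand-ins for the ones in the benchmark, and the fixed seed is an arbitrary choice to make the run repeatable.

```python
from collections import Counter, defaultdict
import random

def with_defaultdict(items):
    count = defaultdict(int)
    for i in items:
        count[i] += 1
    return count

def with_dict(items):
    count = {}
    for i in items:
        count[i] = count.get(i, 0) + 1
    return count

def with_list_count(items):
    return {i: items.count(i) for i in set(items)}

def with_counter(items):
    return Counter(items)

random.seed(0)  # arbitrary seed, only for repeatability
data = sorted(random.randint(0, 250) for _ in range(300))

# Convert every result to a plain dict so the comparison ignores the container type.
results = [dict(f(data)) for f in
           (with_defaultdict, with_dict, with_list_count, with_counter)]
all_agree = all(r == results[0] for r in results)
print(all_agree)
```

Avoiding list as a parameter name also keeps the built-in list type usable inside the functions.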

I have to share an interesting but kind of ridiculous way of doing it that I just came up with:
>>> class myfreq(dict):
...     def __init__(self, arr):
...         for k in arr:
...             self[k] = 1
...     def __setitem__(self, k, v):
...         dict.__setitem__(self, k, self.get(k, 0) + v)
...
>>> myfreq(['a', 'b', 'b', 'a', 'b', 'c'])
{'a': 2, 'c': 1, 'b': 3}

Related

Create dictionary with alphabet characters mapping to numbers

I want to write Python code that assigns a number to every letter of the alphabet, like so: a=0, b=1, c=2, ..., y=24, z=25. I'd rather not set up a condition for every single letter, and I don't want my code to look over-engineered. I'd like to know the shortest (fewest lines of code), fastest and easiest ways to do this.
(What I have in mind is to create a dictionary for this purpose, but I wonder if there's a neater and better way.)
Any suggestions and tips are appreciated in advance.
You definitely want a dictionary for this, not to declare each as a variable. A simple way is to use a dictionary comprehension with string.ascii_lowercase as:
from string import ascii_lowercase
{v:k for k,v in enumerate(ascii_lowercase)}
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5...
Here's my two cents; a for loop will do the work:
d = {}  # empty dictionary
alpha = 'abcdefghijklmnopqrstuvwxyz'
for i in range(26):
    d[alpha[i]] = i  # key is the letter, value is its index in the alpha string
print(d)  # quick check that the dictionary was created properly
One-liner with enumerate:
# given
foo = 'abcxyz'
dict(enumerate(foo))
# returns: {0: 'a', 1: 'b', 2: 'c', 3: 'x', 4: 'y', 5: 'z'}
If you need the characters as the dictionary keys, what comes to mind is either a dict comprehension...
{letter: num for (num, letter) in enumerate(foo)}
# returns {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'y': 4, 'z': 5}
... or a lambda...
dict(map(lambda x: (x[1], x[0]), enumerate(foo)))
# returns {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'y': 4, 'z': 5}
I feel dict comprehension is much more readable than map+lambda+enumerate.
There are already numbers associated with characters. You can use these code points with ord().
A short (in terms of lines) solution would be:
num_of = lambda s: ord(s) - 97
A normal function would be easier to read:
def num_of(s):
    return ord(s) - 97
Usage:
num_of("a") # 0
num_of("z") # 25
If it must be a dictionary you can create it without imports like that:
{chr(n):n-97 for n in range(ord("a"), ord("z")+1)}
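The variants above should all produce the same mapping. A small sketch that builds it three ways and compares the results; the zip with range version is an extra variant not mentioned in the answers, shown only for comparison:

```python
from string import ascii_lowercase

# Dict comprehension over enumerate, as in the first answer.
by_enumerate = {letter: num for num, letter in enumerate(ascii_lowercase)}

# Code-point arithmetic, as in the ord() answer (ord("a") is 97).
by_ord = {chr(n): n - ord("a") for n in range(ord("a"), ord("z") + 1)}

# Pairing letters with numbers directly.
by_zip = dict(zip(ascii_lowercase, range(26)))

print(by_enumerate == by_ord == by_zip)  # True
print(by_zip["a"], by_zip["z"])          # 0 25
```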

Count how many times is a character repeated in a row in Python [duplicate]

I'm currently trying to solve a problem of counting repeating characters in a row in Python.
This code works until it comes to the last different character in a string, and I have no idea how to solve this problem
def repeating(word):
    count = 1
    tmp = ""
    res = {}
    for i in range(1, len(word)):
        tmp += word[i - 1]
        if word[i - 1] == word[i]:
            count += 1
        else:
            res[tmp] = count
            count = 1
            tmp = ""
    return res
word = "aabc"
print(repeating(word))
The given output should be {'aa': 2, 'b': 1, 'c' : 1},
but I am getting {'aa': 2, 'b': 1}
How do I solve this?
In this case, you can use the collections.Counter which does all the work for you.
>>> from collections import Counter
>>> Counter('aabc')
Counter({'a': 2, 'c': 1, 'b': 1})
You can also iterate over the letters in the string, since a string is iterable. But then I would use defaultdict from collections to save on the 'counting' part.
>>> from collections import defaultdict
>>>
>>> def repeating(word):
...     res = defaultdict(int)
...     for letter in word:
...         res[letter] += 1
...     return res
...
>>> word="aabc"
>>> print (repeating(word))
defaultdict(<type 'int'>, {'a': 2, 'c': 1, 'b': 1})
I would recommend using Counter from the collections module. It does exactly what you are trying to achieve.
from collections import Counter
word = "aabc"
print(Counter(word))
# Counter({'a': 2, 'b': 1, 'c': 1})
But if you want to implement it yourself, you should know that str is an Iterable. Hence you are able to iterate over every letter with a simple loop.
Additionally, there is something called defaultdict, which comes in quite handy in this scenario. Normally you have to check whether a key (in this case a letter) is already defined. If not, you have to create that key. If you are using a defaultdict, you can define a default value that every new key starts with.
from collections import defaultdict
def repeating(word):
    counter = defaultdict(int)
    for letter in word:
        counter[letter] += 1
    return counter
The result would be similar:
In [6]: repeating('aabc')
Out[6]: defaultdict(int, {'a': 2, 'b': 1, 'c': 1})
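Note that Counter and defaultdict count every occurrence of a letter in the whole string, so they produce {'a': 2, ...} rather than the {'aa': 2, ...} asked for. If runs of consecutive characters are what is wanted, itertools.groupby is one option. This is a sketch, not taken from the answers above; the function name runs is made up, and it returns a list of pairs rather than a dict so that the same run appearing twice (as in 'aabaa') is not overwritten.

```python
from itertools import groupby

def runs(word):
    """Return (run, length) pairs for each stretch of repeated characters."""
    result = []
    # groupby yields one group per stretch of equal consecutive characters.
    for char, group in groupby(word):
        length = len(list(group))
        result.append((char * length, length))
    return result

print(runs("aabc"))        # [('aa', 2), ('b', 1), ('c', 1)]
print(dict(runs("aabc")))  # {'aa': 2, 'b': 1, 'c': 1}
```

Because groupby closes the final group on its own, the last character is not dropped the way it is in the loop from the question.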

create a dictionary with incrementing values

I have a list and I want to generate a dictionary d taking out duplicates and excluding a single item, such that the first key has value 0, the second has value 1, and so on.
I have written the following code:
d = {}
i = 0
for l in a_list:
    if (l not in d) and (l != '<'):
        d[l] = i
        i += 1
If a_list = ['a', 'b', '<', 'c', 'b', 'd'], after running the code d contains {'a': 0, 'b': 1, 'c': 2, 'd':3}. Order is not important.
Is there a more elegant way to obtain the same result?
Use dict.fromkeys to get your unique occurrences (minus values you don't want), then .update it to apply the sequence, eg:
a_list = ['a', 'b', '<', 'c', 'b', 'd']
d = dict.fromkeys(el for el in a_list if el != '<')
d.update((k, i) for i, k in enumerate(d))
Gives you:
{'a': 0, 'b': 1, 'd': 2, 'c': 3}
If order is important, then use collections.OrderedDict.fromkeys to retain the ordering of the original values, or sort the unique values if they should be alphabetical instead.
{b: a for a, b in enumerate(set(a_list) - {'<'})}
set(a_list) creates a set from a_list.
That effectively strips duplicate items from a_list, as a set can only contain unique values. Subtracting {'<'} then removes the excluded item.
What is needed here is an OrderedDict and manual filtering of the list:
from collections import OrderedDict
d = OrderedDict()
new_list = []
a_list = [1,3,2,3,2,1,3,2,3,1]
for i in a_list:
    if i not in new_list:
        new_list.append(i)
for i, a in enumerate(new_list):
    if a != "<":
        d[i] = a
Output:
OrderedDict([(0, 1), (1, 3), (2, 2)])
If original order is not important:
final_d = {i:a for i, a in enumerate(set(a_list)) if a != "<"}
I personally find recursion quite elegant, tail-recursion especially so:
def f( d, a_list ):
    if a_list:
        if a_list[0] not in d and a_list[0] != '<':
            d[a_list[0]] = len(d)
        return f( d, a_list[1:] )
    else:
        return d
So that
f( {}, "acbcbabcbabcb" )
will yield
{'a': 0, 'c': 1, 'b': 2}
just like the original code does on the same input (modulo order of the keys).
If truly:
Order is not important.
{k: i for i, k in enumerate(filter(lambda x: x not in "<", set(a_list)))}
# {'a': 3, 'b': 1, 'c': 0, 'd': 2}
EDIT: @qnnnnez's answer takes advantage of set operations, giving an elegant version of the latter code.
Otherwise you can implement the unique_everseen itertools recipe to preserve order. For convenience, you can import it from a library that implements this recipe for you, i.e. more_itertools.
from more_itertools import unique_everseen
{k: i for i, k in enumerate(filter(lambda x: x not in "<", unique_everseen(a_list)))}
# {'a': 0, 'b': 1, 'c': 2, 'd': 3}
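On Python 3.7 and later, where plain dicts keep insertion order, the order-preserving variant can be sketched without extra libraries by using dict.fromkeys for deduplication. The Python version is an assumption about the reader's environment, and the name unique is only illustrative.

```python
a_list = ['a', 'b', '<', 'c', 'b', 'd']

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
unique = dict.fromkeys(x for x in a_list if x != '<')

# Number the surviving keys in that order.
d = {k: i for i, k in enumerate(unique)}

print(d)  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```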

Find count of characters within the string in Python

I am trying to create a dictionary of each character and the number of times it is repeated in a string. Suppose the string is like below:
str1 = "aabbaba"
I want to create a dictionary like this
word_count = {'a':4,'b':3}
I am trying to use dictionary comprehension to do this.
I did
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
This ends up giving an error saying
File "<stdin>", line 1
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
^
SyntaxError: invalid syntax
Can anybody tell me what's wrong with the syntax? Also, how can I create such a dictionary using a dictionary comprehension?
As others have said, this is best done with a Counter.
You can also do:
>>> {e:str1.count(e) for e in set(str1)}
{'a': 4, 'b': 3}
But that traverses the string 1+n times, where n is the number of unique characters (once to create the set, and once per unique character to count its occurrences), so in the worst case the runtime is quadratic. That is a bad result if you have a lot of unique characters in a long string. A Counter traverses the string only once.
If you want a no-import version that is more efficient than using .count, you can use .setdefault to make a counter:
>>> count={}
>>> for c in str1:
...     count[c] = count.setdefault(c, 0) + 1
...
>>> count
{'a': 4, 'b': 3}
That traverses the string only once, no matter how long it is or how many unique characters it has.
You can also use defaultdict if you prefer:
>>> from collections import defaultdict
>>> count=defaultdict(int)
>>> for c in str1:
...     count[c] += 1
...
>>> count
defaultdict(<type 'int'>, {'a': 4, 'b': 3})
>>> dict(count)
{'a': 4, 'b': 3}
But if you are going to import collections -- Use a Counter!
The ideal way to do this is with collections.Counter:
>>> from collections import Counter
>>> str1 = "aabbaba"
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
You cannot achieve this with a simple dict comprehension, because you would need a reference to the previous count of each element. As mentioned in Dawg's answer, as a workaround you can use str.count(e) inside the comprehension to count each element of the set of characters. But the time complexity will be n*m, as it traverses the complete string for each unique element (where m is the number of unique elements), whereas with Counter it will be n.
This is a nice case for collections.Counter:
>>> from collections import Counter
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
It's a dict subclass, so you can work with the object much like a standard dictionary:
>>> c = Counter(str1)
>>> c['a']
4
You can do this without the Counter class as well. Simple and efficient Python code for this would be:
>>> d = {}
>>> for x in str1:
...     d[x] = d.get(x, 0) + 1
...
>>> d
{'a': 4, 'b': 3}
Note that this is not the correct way to do it since it won't count repeated characters more than once (apart from losing other characters from the original dict) but this answers the original question of whether if-else is possible in comprehensions and demonstrates how it can be done.
To answer your question, yes it's possible but the approach is like this:
dic = {x: (dic[x] + 1 if x in dic else 1) for x in str1}
The condition is applied on the value only not on the key:value mapping.
The above can be made clearer using dict.get:
dic = {x: dic.get(x, 0) + 1 for x in str1}
0 is returned if x is not in dic.
Demo:
In [78]: s = "abcde"
In [79]: dic = {}
In [80]: dic = {x: (dic[x] + 1 if x in dic else 1) for x in s}
In [81]: dic
Out[81]: {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1}
In [82]: s = "abfg"
In [83]: dic = {x: dic.get(x, 0) + 1 for x in s}
In [84]: dic
Out[84]: {'a': 2, 'b': 2, 'f': 1, 'g': 1}

Count words without checking that a word is "in" dictionary

I understand that there are modules out there that can do this kind of behavior, but I'm interested in how to approach the following "issue".
Whenever I wanted to count occurrences, I found it a bit silly that I first had to check whether a key is "in" the dictionary (#1). I believe at the time I even used a try...except because I didn't know how to do it properly.
# 1
words = ['a', 'b', 'c', 'a', 'b']
dicty = {}
for w in words:
    if w in dicty:
        dicty[w] += 1
    else:
        dicty[w] = 1
Right now, I'm interested in what has to be done to make a class "SpecialDictionary" behave such that a word that is not yet in the dictionary automatically gets a default value of 0 (#2). Which concepts are needed for this?
Note: I understand that this "in" check could be done in the class's definition, but there must be something more pythonic/elegant?
# 2
special_dict = SpecialDictionary()
for w in words:
    special_dict[w] += 1
Subclass dict and override its __missing__ method to return 0:
class SpecialDictionary(dict):
    def __missing__(self, k):
        return 0
words = ['a', 'b', 'c', 'a', 'b']
special_dict = SpecialDictionary()
for w in words:
    special_dict[w] += 1
print(special_dict)
# {'a': 2, 'b': 2, 'c': 1}
You need to use dict.get:
>>> my_dict = {}
>>> for x in words:
...     my_dict[x] = my_dict.get(x, 0) + 1
...
>>> my_dict
{'a': 2, 'c': 1, 'b': 2}
dict.get returns the value of the key if present, else a default.
Syntax: dict.get(key[, default])
You can also use try and except; if the key is not found in the dictionary, a KeyError is raised:
>>> for x in words:
...     try:
...         my_dict[x] += 1
...     except KeyError:
...         my_dict[x] = 1
...
>>> my_dict
{'a': 2, 'c': 1, 'b': 2}
using Counter:
>>> from collections import Counter
>>> words = ['a', 'b', 'c', 'a', 'b']
>>> my_count = Counter(words)
>>> my_count
Counter({'a': 2, 'b': 2, 'c': 1})
You can use a defaultdict. Or is this one of the “modules out there” that you wish to avoid?
from collections import defaultdict
d = defaultdict(lambda : 0)
d['a'] += 1
print(d['a'])
print(d['b'])
It will print:
1
0
The 'SpecialDictionary' that implements that kind of behavior is collections.defaultdict. It takes a function as its first parameter, which serves as a default-value factory. Whenever a key is looked up with d[key], it checks whether the key is already in the dictionary, and if that's not the case it uses that factory function to create a value, which is then added to the dictionary (and returned by the lookup). See the docs on how it is implemented.
Counter behaves much like a defaultdict that uses int as its factory function: it is a dict subclass that returns 0 for missing keys (without storing them) and provides some additional methods.
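One practical difference between these options is whether merely looking up a missing key stores it. A short sketch comparing the three, with SpecialDictionary redefined here so the block is self-contained:

```python
from collections import Counter, defaultdict

class SpecialDictionary(dict):
    def __missing__(self, key):
        return 0  # returned for missing keys, but nothing is stored

dd = defaultdict(int)
c = Counter()
sd = SpecialDictionary()

# Look up a key that does not exist in each container.
values = (dd['x'], c['x'], sd['x'])

print(values)                          # (0, 0, 0)
print('x' in dd, 'x' in c, 'x' in sd)  # True False False
```

All three support the `+= 1` counting loop; only defaultdict grows when a missing key is read without being assigned.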
