Count words without checking that a word is "in" dictionary - python

I understand that there are modules out there that can do this kind of behavior, but I'm interested in how to approach the following "issue".
Whenever I wanted to count occurrences, I found it a bit silly that I first had to check whether a key is "in" the dictionary (#1). I believe at the time I even used a try...except because I didn't know how to do it properly.
# 1
words = ['a', 'b', 'c', 'a', 'b']
dicty = {}
for w in words:
    if w in dicty:
        dicty[w] += 1
    else:
        dicty[w] = 1
What I'd like to know now is what it takes to make a class "SpecialDictionary" behave such that a word that is not yet in the dictionary automatically gets a default value of 0 (#2). Which concepts are needed for this?
Note: I understand that this "in" check could be done inside the class definition, but there must be something more Pythonic/elegant?
# 2
special_dict = SpecialDictionary()
for w in words:
    special_dict[w] += 1

Subclass dict and override its __missing__ method to return 0:
class SpecialDictionary(dict):
    def __missing__(self, k):
        return 0
words = ['a', 'b', 'c', 'a', 'b']
special_dict = SpecialDictionary()
for w in words:
    special_dict[w] += 1
print(special_dict)
# {'a': 2, 'b': 2, 'c': 1}
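To see what __missing__ does and does not do, here is a minimal sketch (the keys "x" and "y" are arbitrary and only for illustration). dict calls __missing__ only from d[key] lookups on a subclass; the returned default is not stored unless you assign it, and get() bypasses the hook entirely.

```python
class SpecialDictionary(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent
        return 0

d = SpecialDictionary()
value = d["x"]             # lookup falls through to __missing__
print(value)               # 0
print("x" in d)            # False: the lookup alone stores nothing
d["x"] += 1                # reads 0 via __missing__, then assigns 1
print(d)                   # {'x': 1}
print(d.get("y"))          # None: get() does not call __missing__
```

This is why special_dict[w] += 1 works: the read half of += goes through __missing__, and the write half stores the incremented value.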

You can use dict.get:
>>> my_dict = {}
>>> for x in words:
...     my_dict[x] = my_dict.get(x, 0) + 1
...
>>> my_dict
{'a': 2, 'c': 1, 'b': 2}
dict.get returns the value for the key if it is present, otherwise a default.
Syntax: dict.get(key[, default])
You can also use try and except; if the key is not found in the dictionary, the lookup raises KeyError:
>>> my_dict = {}
>>> for x in words:
...     try:
...         my_dict[x] += 1
...     except KeyError:
...         my_dict[x] = 1
...
>>> my_dict
{'a': 2, 'c': 1, 'b': 2}
Using Counter:
>>> from collections import Counter
>>> words = ['a', 'b', 'c', 'a', 'b']
>>> my_count = Counter(words)
>>> my_count
Counter({'a': 2, 'b': 2, 'c': 1})

You can use a defaultdict. Or is this one of the “modules out there” that you wish to avoid?
from collections import defaultdict
d = defaultdict(lambda : 0)
d['a'] += 1
print(d['a'])
print(d['b'])
It will print:
1
0

The "SpecialDictionary" that implements this kind of behavior is collections.defaultdict. It takes a function as its first parameter, which serves as a default-value factory. Whenever a lookup with d[key] is performed, it checks whether the key is already in the dictionary; if it is not, it calls the factory function to create a value, which is then added to the dictionary (and returned by the lookup). See the docs for how it is implemented.
Counter is a related dict subclass specialized for counting: a missing key yields 0 (without being added to the dictionary on lookup), and it provides some additional methods.
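A short sketch of how defaultdict and Counter behave when an absent key is looked up (the key name "missing" is just a placeholder). Both return 0 here, but only the defaultdict stores the key as a side effect.

```python
from collections import Counter, defaultdict

dd = defaultdict(int)   # int() returns 0, so missing keys start at 0
ct = Counter()

_ = dd["missing"]       # defaultdict inserts the default on lookup
_ = ct["missing"]       # Counter returns 0 but stores nothing

print(dict(dd))         # {'missing': 0}
print(dict(ct))         # {}
```

This matters if you later iterate over the keys or check membership with "in": a plain lookup can grow a defaultdict, but not a Counter.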

Related

Count how many times is a character repeated in a row in Python [duplicate]

I'm currently trying to solve a problem of counting repeating characters in a row in Python.
This code works until it comes to the last different character in a string, and I have no idea how to solve this problem
def repeating(word):
    count=1
    tmp = ""
    res = {}
    for i in range(1, len(word)):
        tmp += word[i - 1]
        if word[i - 1] == word[i]:
            count += 1
        else :
            res[tmp] = count
            count = 1
            tmp = ""
    return res
word="aabc"
print (repeating(word))
The given output should be {'aa': 2, 'b': 1, 'c' : 1},
but I am getting {'aa': 2, 'b': 1}
How do I solve this?
In this case, you can use the collections.Counter which does all the work for you.
>>> from collections import Counter
>>> Counter('aabc')
Counter({'a': 2, 'c': 1, 'b': 1})
You can also iterate over the letters in the string, since a string is iterable. I would then use defaultdict from collections to simplify the counting.
>>> from collections import defaultdict
>>>
>>> def repeating(word):
...     res = defaultdict(int)
...     for letter in word:
...         res[letter] += 1
...     return res
...
>>> word="aabc"
>>> print (repeating(word))
defaultdict(<type 'int'>, {'a': 2, 'c': 1, 'b': 1})
I would recommend using Counter from the collections module. It does exactly what you are trying to achieve:
from collections import Counter
word = "aabc"
print(Counter(word))
# Counter({'a': 2, 'b': 1, 'c': 1})
But if you want to implement it yourself, you should know that str is an iterable. Hence you can iterate over every letter with a simple loop.
Additionally, there is something called defaultdict, which comes in quite handy in this scenario. Normally you have to check whether a key (in this case a letter) is already defined, and create it if not. With a defaultdict, you can specify that every new key starts with a default value.
from collections import defaultdict
def repeating(word):
    counter = defaultdict(int)
    for letter in word:
        counter[letter] += 1
    return counter
The result would be similar:
In [6]: repeating('aabc')
Out[6]: defaultdict(int, {'a': 2, 'b': 1, 'c': 1})
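Note that Counter and defaultdict count every occurrence of a letter in the whole string, which is not quite the same as counting consecutive runs. If the goal is the output given in the question, {'aa': 2, 'b': 1, 'c': 1}, one possible sketch uses itertools.groupby, which groups adjacent equal items. The function name repeating_runs is made up for illustration, and because the result is a dict keyed by the run text, a later identical run (for example in "aabaa") would overwrite an earlier one.

```python
from itertools import groupby

def repeating_runs(word):
    """Map each run of identical adjacent characters to its length."""
    res = {}
    for _, group in groupby(word):
        run = "".join(group)   # e.g. "aa" for two adjacent 'a' characters
        res[run] = len(run)
    return res

print(repeating_runs("aabc"))  # {'aa': 2, 'b': 1, 'c': 1}
```

Because groupby emits the final group like any other, the last run is not lost the way it is in the index-based loop from the question.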

Count how many times are items from list 1 in list 2

I have 2 lists:
1. ['a', 'b', 'c']
2. ['a', 'd', 'a', 'b']
And I want dictionary output like this:
{'a': 2, 'b': 1, 'c': 0}
I already made it:
#b = list #1
#words = list #2
c = {}
for i in b:
    c.update({i:words.count(i)})
But it is very slow, and I need to process a text file of about 10 MB.
EDIT: The entire code. I am currently testing, so there are unused imports.
import string
import os
import operator
import time
from collections import Counter
def getbookwords():
    a = open("wu.txt", encoding="utf-8")
    b = a.read().replace("\n", "").lower()
    a.close()
    b.translate(string.punctuation)
    b = b.split(" ")
    return b
def wordlist(words):
    a = open("wordlist.txt")
    b = a.read().lower()
    b = b.split("\n")
    a.close()
    t = time.time()
    #c = dict((i, words.count(i)) for i in b )
    c = Counter(words)
    result = {k: v for k, v in c.items() if k in set(b)}
    print(time.time() - t)
    sorted_d = sorted(c.items(), key=operator.itemgetter(1))
    return(sorted_d)
print(wordlist(getbookwords()))
Since speed is currently the issue, it is worth avoiding a full pass through the list for each thing you want to count. Count everything in a single pass instead, then filter the result. The set() function gives you just the unique keys in your list words.
An important thing to remember for speed in all cases is the line unique_words = set(b). Without it, a new set is built from b (an entire pass through that list) at every iteration, whichever kind of data structure you happen to use.
c = {k: 0 for k in set(words)}
for w in words:
    c[w] += 1
unique_words = set(b)
c = {k: c[k] for k in c if k in unique_words}
Alternatively, a defaultdict can be used to eliminate some of the initialization.
from collections import defaultdict
c = defaultdict(int)
for w in words:
    c[w] += 1
unique_words = set(b)
c = {k: c[k] for k in c if k in unique_words}
For completeness' sake, I do like the Counter-based solutions in the other answers (like the one from Reut Sharabani). The code is cleaner, and though I haven't benchmarked it, I wouldn't be surprised if a built-in counting class is faster than home-rolled solutions with dictionaries.
from collections import Counter
c = Counter(words)
unique_words = set(b)
c = {k:v for k, v in c.items() if k in unique_words}
Try using collections.Counter and move b to a set, not a list:
from collections import Counter
c = Counter(words)
b = set(b)
result = {k: v for k, v in c.items() if k in b}
Also, if you can read the words lazily and not create an intermediate list that should be faster.
Counter provides the functionality you want (counting items), and filtering the result against a set uses hashing which should be a lot faster.
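The suggestion about reading words lazily could be sketched roughly as follows. The function name count_wanted_words and the sample data are hypothetical; the idea is that Counter consumes a generator, so no intermediate list of all words is built. An open text file yields its lines the same way the sample list does here.

```python
from collections import Counter

def count_wanted_words(lines, wanted):
    """Count only the wanted words, streaming over an iterable of lines."""
    wanted = set(wanted)                 # O(1) membership tests
    counts = Counter(
        word
        for line in lines                # e.g. an open text file
        for word in line.lower().split()
        if word in wanted
    )
    # Report every wanted word, including those that never appeared (count 0)
    return {word: counts[word] for word in wanted}

sample_lines = ["a d a\n", "B\n"]
result = count_wanted_words(sample_lines, ["a", "b", "c"])
print(result)
```

Passing a file object opened with open(...) in place of sample_lines would process the text line by line rather than loading it all at once.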
You can use collections.Counter on a generator that skips ignored keys using a set lookup.
from collections import Counter
keys = ['a', 'b', 'c']
lst = ['a', 'd', 'a', 'b']
unique_keys = set(keys)
count = Counter(x for x in lst if x in unique_keys)
print(count) # Counter({'a': 2, 'b': 1})
# count['c'] == 0
Note that count['c'] is not printed, but is still 0 by default in a Counter.
Here's an example I just put together in a REPL, assuming you're not counting duplicates in list two. We create a hash table using a dictionary. For each item in the list we're matching against, we create a key-value pair with the item as the key and the value set to 0.
Next we iterate through the second list. For each value, we check whether it has already been defined as a key; if it has, we increment its count. Otherwise, we ignore it.
This is the least number of iterations possible: you hit each item in each list only once.
x = [1, 2, 3, 4, 5]
z = [1, 2, 2, 2, 1]
y = {}
for n in x:
    y[n] = 0  # Set the value to zero for each item in the list
for n in z:
    if n in y:  # If the key is already in the hash table, increment by one
        y[n] += 1
print(y)
@Makalone, the answers above are good. You can also try the code sample below, which uses Counter from Python's collections module.
You can try it at http://rextester.com/OTYG56015.
Python code »
from collections import Counter
list1 = ['a', 'b', 'c']
list2 = ['a', 'd', 'a', 'b']
counter = Counter(list2)
d = {key: counter[key] for key in set(list1)}
print(d)
Output »
{'a': 2, 'c': 0, 'b': 1}

Find count of characters within the string in Python

I am trying to create a dictionary of each character and the number of times it is repeated in a string. Suppose the string is:
str1 = "aabbaba"
I want to create a dictionary like this
word_count = {'a':4,'b':3}
I am trying to use dictionary comprehension to do this.
I did
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
This ends up giving an error saying
File "<stdin>", line 1
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
^
SyntaxError: invalid syntax
Can anybody tell me what's wrong with the syntax? Also, how can I create such a dictionary using a dictionary comprehension?
As others have said, this is best done with a Counter.
You can also do:
>>> {e:str1.count(e) for e in set(str1)}
{'a': 4, 'b': 3}
But that traverses the string 1+m times, where m is the number of unique characters: once to create the set, and once per unique character to count its occurrences. In the worst case this has quadratic runtime complexity, which is a bad result if you have a lot of unique characters in a long string. A Counter only traverses the string once.
If you want a no-import version that is more efficient than using .count, you can use .setdefault to make a counter:
>>> count={}
>>> for c in str1:
...     count[c]=count.setdefault(c, 0)+1
...
>>> count
{'a': 4, 'b': 3}
That only traverses the string once no matter how long or how many unique characters.
You can also use defaultdict if you prefer:
>>> from collections import defaultdict
>>> count=defaultdict(int)
>>> for c in str1:
...     count[c]+=1
...
>>> count
defaultdict(<type 'int'>, {'a': 4, 'b': 3})
>>> dict(count)
{'a': 4, 'b': 3}
But if you are going to import collections -- Use a Counter!
The ideal way to do this is with collections.Counter:
>>> from collections import Counter
>>> str1 = "aabbaba"
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
You cannot achieve this with a simple dict comprehension, because the expression would need to refer to the previous count of each element. As mentioned in Dawg's answer, as a workaround you may use str.count(e) within the dict comprehension to count each element from the set of the string's characters. But the time complexity will be n*m, as it traverses the complete string for each unique element (where m is the number of unique elements), whereas with Counter it will be n.
This is a nice case for collections.Counter:
>>> from collections import Counter
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
It's a dict subclass, so you can work with the object much like a standard dictionary:
>>> c = Counter(str1)
>>> c['a']
4
You can do this without the Counter class as well. Simple and efficient Python code for this would be:
>>> d = {}
>>> for x in str1:
...     d[x] = d.get(x, 0) + 1
...
>>> d
{'a': 4, 'b': 3}
Note that this is not the correct way to do it since it won't count repeated characters more than once (apart from losing other characters from the original dict) but this answers the original question of whether if-else is possible in comprehensions and demonstrates how it can be done.
To answer your question, yes it's possible but the approach is like this:
dic = {x: (dic[x] + 1 if x in dic else 1) for x in str1}
The condition is applied on the value only not on the key:value mapping.
The above can be made clearer using dict.get:
dic = {x: dic.get(x, 0) + 1 for x in str1}
0 is returned if x is not in dic.
Demo:
In [78]: s = "abcde"
In [79]: dic = {}
In [80]: dic = {x: (dic[x] + 1 if x in dic else 1) for x in s}
In [81]: dic
Out[81]: {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1}
In [82]: s = "abfg"
In [83]: dic = {x: dic.get(x, 0) + 1 for x in s}
In [84]: dic
Out[84]: {'a': 2, 'b': 2, 'f': 1, 'g': 1}

Better way to write 'assign A or if not possible - B' [duplicate]

So, in my code I have a dictionary I use to count up items I have no prior knowledge of:
if a_thing not in my_dict:
    my_dict[a_thing] = 0
else:
    my_dict[a_thing] += 1
Obviously, I can't increment an entry of a value that doesn't exist yet. For some reason I have a feeling (in my still-Python-inexperienced brain) there might exist a more Pythonic way to do this with, say, some construct which allows to assign a result of an expression to a thing and if not possible something else in a single statement.
So, does anything like that exist in Python?
This looks like a good job for defaultdict, from collections. Observe the example below:
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> d['a'] += 1
>>> d
defaultdict(<class 'int'>, {'a': 1})
>>> d['b'] += 1
>>> d['a'] += 1
>>> d
defaultdict(<class 'int'>, {'b': 1, 'a': 2})
defaultdict takes a single argument, a callable that produces the initial value for a missing key. In this case you are incrementing integer values, so you want int, which returns 0 when called.
Alternatively, since you are counting items, you could also (as mentioned in comments) use Counter, which will do all the work for you:
>>> from collections import Counter
>>> d = Counter(['a', 'b', 'a', 'c', 'a', 'b', 'c'])
>>> d
Counter({'a': 3, 'c': 2, 'b': 2})
It also comes with some nice bonuses. Like most_common:
>>> d.most_common()
[('a', 3), ('c', 2), ('b', 2)]
Now the items are ordered from the most common to the least common.
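As a small follow-up sketch, most_common also accepts an optional number n to return only the n highest counts. The variable name top_one is just for illustration.

```python
from collections import Counter

d = Counter(['a', 'b', 'a', 'c', 'a', 'b', 'c'])

top_one = d.most_common(1)   # only the single most frequent item
print(top_one)               # [('a', 3)]
print(d['z'])                # 0: keys that were never counted read as zero
```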
Using the get method:
>>> d = {}
>>> d['a'] = d.get('a', 0) + 1
>>> d
{'a': 1}
>>> d['b'] = d.get('b', 2) + 1
>>> d
{'b': 3, 'a': 1}

Best way to turn word list into frequency dict

What's the best way to convert a list/tuple into a dict where the keys are the distinct values of the list and the values are the frequencies of those distinct values?
In other words:
['a', 'b', 'b', 'a', 'b', 'c']
-->
{'a': 2, 'b': 3, 'c': 1}
(I've had to do something like the above so many times, is there anything in the standard lib that does it for you?)
EDIT:
Jacob Gabrielson points out there is something coming in the standard lib for the 2.7/3.1 branch
I find that the easiest way to understand (though it might not be the most efficient) is:
{i:words.count(i) for i in set(words)}
Kind of
from collections import defaultdict
fq= defaultdict( int )
for w in words:
    fq[w] += 1
That usually works nicely.
Just a note that, starting with Python 2.7/3.1, this functionality will be built in to the collections module, see this bug for more information. Here's the example from the release notes:
>>> from collections import Counter
>>> c=Counter()
>>> for letter in 'here is a sample of english text':
...     c[letter] += 1
...
>>> c
Counter({' ': 6, 'e': 5, 's': 3, 'a': 2, 'i': 2, 'h': 2,
'l': 2, 't': 2, 'g': 1, 'f': 1, 'm': 1, 'o': 1, 'n': 1,
'p': 1, 'r': 1, 'x': 1})
>>> c['e']
5
>>> c['z']
0
Actually, Counter was already mentioned, but we can do even better (easier)!
from collections import Counter
my_list = ['a', 'b', 'b', 'a', 'b', 'c']
Counter(my_list) # returns a Counter, dict-like object
# Counter({'b': 3, 'a': 2, 'c': 1})
This is an abomination, but:
from itertools import groupby
dict((k, len(list(xs))) for k, xs in groupby(sorted(items)))
I can't think of a reason one would choose this method over S.Lott's, but if someone's going to point it out, it might as well be me. :)
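For anyone curious why the sorted() call is needed there, here is a small sketch. groupby only merges adjacent equal items, so on unsorted input the same key shows up in several groups and later groups overwrite earlier ones in the dict. The variable names are illustrative only.

```python
from itertools import groupby

items = ['a', 'b', 'b', 'a', 'b', 'c']

# Sorted input: each distinct value forms exactly one group
freq = {k: sum(1 for _ in g) for k, g in groupby(sorted(items))}
print(freq)      # {'a': 2, 'b': 3, 'c': 1}

# Unsorted input: 'a' and 'b' each appear in more than one group,
# and only the last group's length survives in the dict
unsorted_freq = {k: sum(1 for _ in g) for k, g in groupby(items)}
print(unsorted_freq)  # {'a': 1, 'b': 1, 'c': 1}
```

The sort makes this approach O(n log n), which is one reason the Counter and defaultdict versions are usually preferable.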
I think using the collections module is the easiest way to get it. But if you want to build the frequency dictionary without it, here is another way:
l = [1,4,2,1,2,6,8,2,2]
d ={}
for i in l:
    if i in d.keys():
        d[i] = 1 + d[i]
    else:
        d[i] = 1
print (d)
Output:
{1: 2, 4: 1, 2: 4, 6: 1, 8: 1}
I decided to go ahead and test the versions suggested, I found the collections.Counter as suggested by Jacob Gabrielson to be the fastest, followed by the defaultdict version by SLott.
Here is my code:
from collections import defaultdict
from collections import Counter
import random
# using default dict
def counter_default_dict(list):
    count=defaultdict(int)
    for i in list:
        count[i]+=1
    return count
# using normal dict
def counter_dict(list):
    count={}
    for i in list:
        count.update({i:count.get(i,0)+1})
    return count
# using count and dict
def counter_count(list):
    count={i:list.count(i) for i in set(list)}
    return count
# using collections.Counter
def counter_counter(list):
    count = Counter(list)
    return count
list=sorted([random.randint(0,250) for i in range(300)])
if __name__=='__main__':
    from timeit import timeit
    print("collections.Defaultdict ",timeit("counter_default_dict(list)", setup="from __main__ import counter_default_dict,list", number=1000))
    print("Dict",timeit("counter_dict(list)",setup="from __main__ import counter_dict,list",number=1000))
    print("list.count ",timeit("counter_count(list)", setup="from __main__ import counter_count,list", number=1000))
    print("collections.Counter.count ",timeit("counter_counter(list)", setup="from __main__ import counter_counter,list", number=1000))
And my results:
collections.Defaultdict
0.06787874956330614
Dict
0.15979115872995675
list.count
1.199258431219126
collections.Counter.count
0.025896202538920665
Do let me know how I can improve the analysis.
I have to share an interesting but kind of ridiculous way of doing it that I just came up with:
>>> class myfreq(dict):
...     def __init__(self, arr):
...         for k in arr:
...             self[k] = 1
...     def __setitem__(self, k, v):
...         dict.__setitem__(self, k, self.get(k, 0) + v)
...
>>> myfreq(['a', 'b', 'b', 'a', 'b', 'c'])
{'a': 2, 'c': 1, 'b': 3}
