Best Way to Count Occurrences of Each Character in a Large Dataset


I am trying to count the number of occurrences of each character within a large dataset. For example, if the data was the numpy array ['A', 'AB', 'ABC'] then I would want {'A': 3, 'B': 2, 'C': 1} as the output. I currently have an implementation that looks like this:
char_count = {}
for c in string.printable:
    char_count[c] = np.char.count(data, c).sum()
The issue I am having is that this takes too long for my data. I have ~14,000,000 different strings that I would like to count and this implementation is not efficient for that amount of data. Any help is appreciated!

Another way.
import collections
c = collections.Counter()
for thing in data:
    c.update(thing)
Same basic advantage - only iterates the data once.
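A minimal sketch of the same single-pass idea, assuming the strings can be treated as one flat stream of characters. The helper name count_chars is hypothetical, and a plain list stands in for the numpy array:

```python
from collections import Counter
from itertools import chain

def count_chars(strings):
    """Count every character across an iterable of strings in one pass."""
    # chain.from_iterable yields the characters of each string in turn,
    # so Counter sees a single stream and the data is traversed only once.
    return Counter(chain.from_iterable(strings))

data = ['A', 'AB', 'ABC']  # stand-in for the large dataset
counts = count_chars(data)
print(dict(counts))
```

Because the input is consumed lazily, this should also work when the strings come from a generator rather than an in-memory array.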

One approach:
import numpy as np
from collections import defaultdict
data = np.array(['A', 'AB', 'ABC'])
counts = defaultdict(int)
for e in data:
    for c in e:
        counts[c] += 1
print(counts)
Output
defaultdict(<class 'int'>, {'A': 3, 'B': 2, 'C': 1})
Note that your code iterates over data len(string.printable) times, whereas my proposal iterates over it only once.
One alternative using a dictionary:
data = np.array(['A', 'AB', 'ABC'])
counts = dict()
for e in data:
    for c in e:
        counts[c] = counts.get(c, 0) + 1
print(counts)

Related

Get sequences of same values within list and count elements within sequences

I'd like to find the amount of values within sequences of the same value from a list:
list = ['A','A','A','B','B','C','A','A']
The result should look like:
result_dic = {A: [3,2], B: [2], C: [1]}
I do not just want the counts of different values in a list as you can see in the result for A.

collections.defaultdict and itertools.groupby
from itertools import groupby
from collections import defaultdict
listy = ['A','A','A','B','B','C','A','A']
d = defaultdict(list)
for k, v in groupby(listy):
    d[k].append(len([*v]))
d
defaultdict(list, {'A': [3, 2], 'B': [2], 'C': [1]})
groupby will loop through an iterable and lump contiguous things together.
[(k, [*v]) for k, v in groupby(listy)]
[('A', ['A', 'A', 'A']), ('B', ['B', 'B']), ('C', ['C']), ('A', ['A', 'A'])]
So I loop through those results and append the length of each grouped thing to the values of a defaultdict
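The groupby approach can be wrapped in a small function. This is only a sketch: the name run_lengths is hypothetical, and it counts each group with a generator instead of building a list, which is a small variation on the code above:

```python
from collections import defaultdict
from itertools import groupby

def run_lengths(items):
    """Map each value to the lengths of its contiguous runs, in order."""
    runs = defaultdict(list)
    for key, group in groupby(items):
        # group is an iterator over one contiguous run; count it lazily
        runs[key].append(sum(1 for _ in group))
    return dict(runs)

result = run_lengths(['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A'])
print(result)
```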

I'd suggest using a defaultdict and looping through the list.
from collections import defaultdict
sample = ['A','A','A','B','B','C','A','A']
result_dic = defaultdict(list)
last_letter = None
num = 0
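for l in sample:
    if last_letter == l or last_letter is None:
        num += 1
    else:
        # the run ended: record its length and start counting the new run
        result_dic[last_letter].append(num)
        num = 1
    last_letter = l
# record the final run, which the loop never closes
if last_letter is not None:
    result_dic[last_letter].append(num)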
Edit
This is my approach, although I'd have a look at @piRSquared's answer because they were keen enough to include groupby as well. Nice work!

I'd suggest looping through the list.
result_dic = {}
old_word = ''
for word in list:
    if word not in result_dic:
        result_dic[word] = [1]
    elif word == old_word:
        result_dic[word][-1] += 1
    else:
        result_dic[word].append(1)
    old_word = word

Count how many times are items from list 1 in list 2

I have 2 lists:
1. ['a', 'b', 'c']
2. ['a', 'd', 'a', 'b']
And I want dictionary output like this:
{'a': 2, 'b': 1, 'c': 0}
I already made it:
#b = list #1
#words = list #2
c = {}
for i in b:
    c.update({i:words.count(i)})
But it is very slow, I need to process like 10MB txt file.
EDIT: Entire code, currently testing so unused imports..
import string
import os
import operator
import time
from collections import Counter
def getbookwords():
    a = open("wu.txt", encoding="utf-8")
    b = a.read().replace("\n", "").lower()
    a.close()
    b.translate(string.punctuation)
    b = b.split(" ")
    return b
def wordlist(words):
    a = open("wordlist.txt")
    b = a.read().lower()
    b = b.split("\n")
    a.close()
    t = time.time()
    #c = dict((i, words.count(i)) for i in b )
    c = Counter(words)
    result = {k: v for k, v in c.items() if k in set(b)}
    print(time.time() - t)
    sorted_d = sorted(c.items(), key=operator.itemgetter(1))
    return(sorted_d)
print(wordlist(getbookwords()))

Since speed is currently an issue, it might be worth considering not passing through the list for each thing you want to count. The set() function allows you to only use the unique keys in your list words.
An important thing to remember for speed in all cases is the line unique_words = set(b). Without this, an entire pass through your list is being done to create a set from b at every iteration in whichever kind of data structure you happen to use.
c = {k:0 for k in set(words)}
for w in words:
    c[w] += 1
unique_words = set(b)
c = {k:c[k] for k in c if k in unique_words}
Alternatively, defaultdicts can be used to eliminate some of the initialization.
from collections import defaultdict
c = defaultdict(int)
for w in words:
    c[w] += 1
unique_words = set(b)
c = {k:c[k] for k in c if k in unique_words}
For completeness sake, I do like the Counter based solutions in the other answers (like from Reut Sharabani). The code is cleaner, and though I haven't benchmarked it I wouldn't be surprised if a built-in counting class is faster than home-rolled solutions with dictionaries.
from collections import Counter
c = Counter(words)
unique_words = set(b)
c = {k:v for k, v in c.items() if k in unique_words}

Try using collections.Counter and move b to a set, not a list:
from collections import Counter
c = Counter(words)
b = set(b)
result = {k: v for k, v in c.items() if k in b}
Also, if you can read the words lazily and not create an intermediate list that should be faster.
Counter provides the functionality you want (counting items), and filtering the result against a set uses hashing which should be a lot faster.
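A rough sketch of the lazy-reading suggestion, assuming the text can be consumed line by line and that whitespace splitting is good enough. The function name count_wanted_words is hypothetical, and io.StringIO stands in for an open file:

```python
import io
from collections import Counter

def count_wanted_words(lines, wanted):
    """Count only the wanted words, reading the input one line at a time."""
    wanted = set(wanted)  # build the set once, so each lookup is a hash check
    counts = Counter()
    for line in lines:
        # no intermediate list of all words is ever built
        counts.update(w for w in line.lower().split() if w in wanted)
    # include wanted words that never appeared, with a count of 0
    return {w: counts[w] for w in wanted}

book = io.StringIO("a d a\nb\n")  # stand-in for open("wu.txt", encoding="utf-8")
result = count_wanted_words(book, ['a', 'b', 'c'])
print(result)
```

Iterating over a real file object works the same way, since a file yields its lines one at a time.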

You can use collections.Counter on a generator that skips ignored keys using a set lookup.
from collections import Counter
keys = ['a', 'b', 'c']
lst = ['a', 'd', 'a', 'b']
unique_keys = set(keys)
count = Counter(x for x in lst if x in unique_keys)
print(count) # Counter({'a': 2, 'b': 1})
# count['c'] == 0
Note that count['c'] is not printed, but is still 0 by default in a Counter.

Here's an example I just coughed up in a REPL, assuming you're not counting duplicates in list two. We create a hash table using a dictionary. For each item in the list we're matching against, we create a key-value pair with the item as the key and the value set to 0.
Next we iterate through the second list. For each value, we check whether it has already been defined as a key; if it has, then we increment its count. Otherwise, we ignore it.
Least amount of iterations possible. You hit each item in each list only once.
x = [1, 2, 3, 4, 5]
z = [1, 2, 2, 2, 1]
y = {}
for n in x:
    y[n] = 0  # Set the value to zero for each item in the list
for n in z:
    if n in y:  # If we defined the value in the hash already, increment by one
        y[n] += 1
print(y)

@Makalone, the above answers are good. You can also try the code sample below, which uses Python's Counter() from the collections module.
You can try it at http://rextester.com/OTYG56015.
Python code »
from collections import Counter
list1 = ['a', 'b', 'c']
list2 = ['a', 'd', 'a', 'b']
counter = Counter(list2)
d = {key: counter[key] for key in set(list1)}
print(d)
Output »
{'a': 2, 'c': 0, 'b': 1}

Find count of characters within the string in Python

I am trying to create a dictionary of word and number of times it is repeating in string. Say suppose if string is like below
str1 = "aabbaba"
I want to create a dictionary like this
word_count = {'a':4,'b':3}
I am trying to use dictionary comprehension to do this.
I did
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
This ends up giving an error saying
File "<stdin>", line 1
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
^
SyntaxError: invalid syntax
Can anybody tell me what's wrong with the syntax? Also,How can I create such a dictionary using dictionary comprehension?

As others have said, this is best done with a Counter.
You can also do:
>>> {e:str1.count(e) for e in set(str1)}
{'a': 4, 'b': 3}
But that traverses the string 1+n times, where n is the number of unique characters (once to create the set, and once per unique character to count its occurrences). In the worst case, when most characters are unique, this has quadratic runtime complexity. That is a bad result if you have a lot of unique characters in a long string. A Counter only traverses the string once.
If you want a no-import version that is more efficient than using .count, you can use .setdefault to make a counter:
>>> count={}
>>> for c in str1:
...     count[c]=count.setdefault(c, 0)+1
...
>>> count
{'a': 4, 'b': 3}
That only traverses the string once no matter how long or how many unique characters.
You can also use defaultdict if you prefer:
>>> from collections import defaultdict
>>> count=defaultdict(int)
>>> for c in str1:
...     count[c]+=1
...
>>> count
defaultdict(<type 'int'>, {'a': 4, 'b': 3})
>>> dict(count)
{'a': 4, 'b': 3}
But if you are going to import collections -- Use a Counter!

The ideal way to do this is with collections.Counter:
>>> from collections import Counter
>>> str1 = "aabbaba"
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
You cannot achieve this with a simple dict comprehension, because each value would need to refer to the previous count for that element. As mentioned in Dawg's answer, as a workaround you can call str1.count(e) for each element of set(str1) inside the comprehension. But the time complexity is then n*m, since the whole string is traversed once per unique element (where n is the length of the string and m is the number of unique elements), whereas with Counter it is n.
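To get a feel for the n*m versus n argument, something like the following could be used. It is only a sketch: the function names are hypothetical, and no timings are shown because they depend heavily on how many distinct characters the string contains (with very few distinct characters, the .count version may well be competitive):

```python
import timeit
from collections import Counter

text = "aabbaba" * 10_000  # made-up test string with only two distinct characters

def count_with_str_count(s):
    # one pass to build the set, then one full pass per distinct character
    return {ch: s.count(ch) for ch in set(s)}

def count_with_counter(s):
    # a single pass over the string
    return dict(Counter(s))

same = count_with_str_count(text) == count_with_counter(text)
t_count = timeit.timeit(lambda: count_with_str_count(text), number=20)
t_counter = timeit.timeit(lambda: count_with_counter(text), number=20)
print(same, t_count, t_counter)
```

Swapping in a string with many distinct characters is the more interesting experiment, since that is where the per-character passes add up.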

This is a nice case for collections.Counter:
>>> from collections import Counter
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
It's a dict subclass, so you can work with the object much like a standard dictionary:
>>> c = Counter(str1)
>>> c['a']
4
You can do this without use of Counter class as well. The simple and efficient python code for this would be:
>>> d = {}
>>> for x in str1:
...     d[x] = d.get(x, 0) + 1
...
>>> d
{'a': 4, 'b': 3}

Note that this is not the correct way to do it since it won't count repeated characters more than once (apart from losing other characters from the original dict) but this answers the original question of whether if-else is possible in comprehensions and demonstrates how it can be done.
To answer your question, yes it's possible but the approach is like this:
dic = {x: (dic[x] + 1 if x in dic else 1) for x in str1}
The condition is applied on the value only not on the key:value mapping.
The above can be made clearer using dict.get:
dic = {x: dic.get(x, 0) + 1 for x in str1}
0 is returned if x is not in dic.
Demo:
In [78]: s = "abcde"
In [79]: dic = {}
In [80]: dic = {x: (dic[x] + 1 if x in dic else 1) for x in s}
In [81]: dic
Out[81]: {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1}
In [82]: s = "abfg"
In [83]: dic = {x: dic.get(x, 0) + 1 for x in s}
In [84]: dic
Out[84]: {'a': 2, 'b': 2, 'f': 1, 'g': 1}

Better way to write 'assign A or if not possible - B' [duplicate]

This question already has answers here:
Check if a given key already exists in a dictionary and increment it
(12 answers)
Closed 6 years ago.
So, in my code I have a dictionary I use to count up items I have no prior knowledge of:
if a_thing not in my_dict:
    my_dict[a_thing] = 0
else:
    my_dict[a_thing] += 1
Obviously, I can't increment an entry of a value that doesn't exist yet. For some reason I have a feeling (in my still-Python-inexperienced brain) there might exist a more Pythonic way to do this with, say, some construct which allows to assign a result of an expression to a thing and if not possible something else in a single statement.
So, does anything like that exist in Python?

This looks like a good job for defaultdict, from collections. Observe the example below:
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> d['a'] += 1
>>> d
defaultdict(<class 'int'>, {'a': 1})
>>> d['b'] += 1
>>> d['a'] += 1
>>> d
defaultdict(<class 'int'>, {'b': 1, 'a': 2})
defaultdict takes a single argument, a factory callable that is invoked with no arguments to produce the value for a missing key. In this case you are incrementing integer values, so you want int, which returns 0.
Alternatively, since you are counting items, you could also (as mentioned in comments) use Counter which will ultimately do all the work for you:
>>> from collections import Counter
>>> d = Counter(['a', 'b', 'a', 'c', 'a', 'b', 'c'])
>>> d
Counter({'a': 3, 'c': 2, 'b': 2})
It also comes with some nice bonuses. Like most_common:
>>> d.most_common()
[('a', 3), ('c', 2), ('b', 2)]
Now you have an order to give you the most common counts.
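most_common also accepts an optional count, which is handy when only the top few entries matter. A small sketch using a string where no counts are tied:

```python
from collections import Counter

counts = Counter("aabbaba")  # 'a' appears 4 times, 'b' 3 times

top_one = counts.most_common(1)  # only the single most frequent item
everything = counts.most_common()  # all items, most frequent first
print(top_one, everything)
```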

using get method
>>> d = {}
>>> d['a'] = d.get('a', 0) + 1
>>> d
{'a': 1}
>>> d['b'] = d.get('b', 2) + 1
>>> d
{'b': 3, 'a': 1}

Count words without checking that a word is "in" dictionary

I understand that there are modules out there that can do this kind of behavior, but I'm interested in how to approach the following "issue".
Whenever I used to want to count occurrences I found it a bit silly I had to first check for whether or not a key is "in" the dictionary (#1). I believe at the time I even used a try...except because I didn't know how to do it properly.
# 1
words = ['a', 'b', 'c', 'a', 'b']
dicty = {}
for w in words:
    if w in dicty:
        dicty[w] += 1
    else:
        dicty[w] = 1
At this moment, I'm interested in the question what has to be done to make a class "SpecialDictionary" behave such that if a word is not in a dictionary, it automatically gets a default 0 value (#2). Which concepts are needed for this question?
Note: I understand that this "in" check could be done in the class' definition, but there must be something more pythonic/elegant?
# 2
special_dict = SpecialDictionary()
for w in words:
    special_dict[w] += 1

Subclass dict and override its __missing__ method to return 0:
class SpecialDictionary(dict):
    def __missing__(self, k):
        # called by dict.__getitem__ when k is not present
        return 0
words = ['a', 'b', 'c', 'a', 'b']
special_dict = SpecialDictionary()
for w in words:
    special_dict[w] += 1
print(special_dict)
#{'c': 1, 'a': 2, 'b': 2}
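One detail worth checking with this approach: a __missing__ that just returns 0 does not store the key, so a plain lookup leaves the dictionary unchanged, while += stores the incremented value. A short sketch (the key names are made up for illustration):

```python
class SpecialDictionary(dict):
    def __missing__(self, key):
        # return a default without inserting the key
        return 0

d = SpecialDictionary()
peek = d['never_counted']  # lookup only: returns 0 and stores nothing
d['counted'] += 1          # reads 0 via __missing__, then stores 1
print(peek, d)
```

Note that d.get('never_counted') still returns None, because dict.get does not go through __missing__.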

You need to use dict.get:
>>> my_dict = {}
>>> for x in words:
...     my_dict[x] = my_dict.get(x,0) + 1
...
>>> my_dict
{'a': 2, 'c': 1, 'b': 2}
dict.get returns the value of the key if present, else a default (None unless you pass one).
Syntax: dict.get(key[, default])
You can also use try and except; if the key is not found in the dictionary, a KeyError is raised:
>>> for x in words:
...     try:
...         my_dict[x] += 1
...     except KeyError:
...         my_dict[x] = 1
...
>>> my_dict
{'a': 2, 'c': 1, 'b': 2}
using Counter:
>>> from collections import Counter
>>> words = ['a', 'b', 'c', 'a', 'b']
>>> my_count = Counter(words)
>>> my_count
Counter({'a': 2, 'b': 2, 'c': 1})

You can use a defaultdict. Or is this one of the “modules out there” that you wish to avoid?
from collections import defaultdict
d = defaultdict(lambda : 0)
d['a'] += 1
print(d['a'])
print(d['b'])
It will print:
1
0

The 'SpecialDictionary' that implements that kind of behavior is collections.defaultdict. It takes a function as its first parameter, which serves as a default-value factory. Whenever a lookup is performed, it checks whether the key is already in the dictionary, and if that's not the case it uses the factory function to create a value, which is then added to the dictionary (and returned by the lookup). See the docs on how it is implemented.
Counter is a related dict subclass (not a defaultdict subclass) that returns 0 for missing keys without storing them, and provides some additional methods.
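One observable difference between the two classes is what a lookup of a missing key does. A short sketch (the key name is made up for illustration):

```python
from collections import Counter, defaultdict

dd = defaultdict(int)
ct = Counter()

dd_value = dd['missing']  # the factory runs and the key is inserted
ct_value = ct['missing']  # 0 is returned but nothing is stored

print(dd_value, ct_value)
print('missing' in dd, 'missing' in ct)
print(issubclass(Counter, dict), issubclass(Counter, defaultdict))
```

This matters when the keys are inspected later, for example when checking membership or iterating: a defaultdict may contain keys that were only ever looked up.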
