I was trying to count duplicate words in a list of 230 thousand words. I used a Python dictionary to do so. The code is given below:
word_dict = {}
for words in word_list:
    if words in word_dict.keys():
        word_dict[words] += 1
    else:
        word_dict[words] = 1
The above code took 3 minutes! I ran the same code over 1.5 million words; it ran for more than 25 minutes before I lost patience and terminated it. Then I found that I could use the following code from here (also shown below). The result was surprising: it completed within seconds! So my question is, what is the faster way to do this operation? I guessed the dictionary creation process must take O(N) time. How is the Counter method able to complete this process in seconds and produce an exact dictionary with each word as key and its frequency as value?
from collections import Counter
word_dict = Counter(word_list)
It's because of this:
if words in word_dict.keys():
In Python 2, .keys() returns a list of all the keys. Scanning a list takes linear time, so your program was running in quadratic time! (In Python 3, .keys() returns a view with constant-time membership tests, but key in dict is still the idiomatic spelling.)
Try this instead:
if words in word_dict:
Also, if you're interested, you can see the Counter implementation for yourself. It's written in regular Python.
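To make the gap concrete, here is a small timing sketch (illustrative numbers only; since Python 3's .keys() is a cheap view, the old Python 2 list behavior is simulated with an explicit list of the keys):

```python
import timeit

d = {i: 0 for i in range(100_000)}
keys_as_list = list(d)   # what Python 2's d.keys() effectively returned

# Worst case: the key we probe for is at the end of the list.
t_dict = timeit.timeit(lambda: 99_999 in d, number=1000)
t_list = timeit.timeit(lambda: 99_999 in keys_as_list, number=1000)
print(f"dict membership: {t_dict:.4f}s  list membership: {t_list:.4f}s")
```

The dict lookup is a hash probe, so doing it once per word keeps the whole loop linear; scanning a list once per word is what made the original loop quadratic.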
Your dictionary counting method is not well constructed. You could have used a defaultdict in the following way:

from collections import defaultdict

d = defaultdict(int)
for word in word_list:
    d[word] += 1

But the Counter method from collections is still faster, even though it does almost the same thing, because its implementation is more efficient. However, with Counter you pass in the whole iterable to count, whereas with a defaultdict you can feed in sources from different locations and drive a more complicated loop.
Ultimately it is your preference: if you are counting a single list, Counter is the way to go; if you are iterating over multiple sources, or you simply want a counter in your program without the extra lookup to check whether an item is already being counted, then defaultdict is your choice.
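As a sketch of that multiple-sources point (the word lists here are made up for illustration):

```python
from collections import defaultdict

d = defaultdict(int)
for word in ["a", "b", "a"]:      # first source
    d[word] += 1
for word in ["b", "c"]:           # second source, same running tally
    d[word] += 1
print(dict(d))  # {'a': 2, 'b': 2, 'c': 1}
```

Counter.update can accumulate across sources too; the defaultdict version just makes the per-item loop explicit when the sources need different handling.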
You can actually look at the Counter code, here is the update method that is called on init:
(Notice it uses the performance trick of binding self.get to a local name.)
def update(self, iterable=None, **kwds):
    '''Like dict.update() but add counts instead of replacing them.

    Source can be an iterable, a dictionary, or another Counter instance.

    >>> c = Counter('which')
    >>> c.update('witch')           # add elements from another iterable
    >>> d = Counter('watch')
    >>> c.update(d)                 # add elements from another counter
    >>> c['h']                      # four 'h' in which, witch, and watch
    4

    '''
    # The regular dict.update() operation makes no sense here because the
    # replace behavior results in some of the original untouched counts
    # being mixed-in with all of the other counts for a mishmash that
    # doesn't have a straightforward interpretation in most counting
    # contexts.  Instead, we implement straight-addition.  Both the inputs
    # and outputs are allowed to contain zero and negative counts.

    if iterable is not None:
        if isinstance(iterable, Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.iteritems():
                    self[elem] = self_get(elem, 0) + count
            else:
                super(Counter, self).update(iterable)  # fast path when counter is empty
        else:
            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1
    if kwds:
        self.update(kwds)
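The self_get = self.get trick mentioned above can be sketched in isolation. The speedup is modest, and the function names below are illustrative, not the real Counter internals:

```python
data = ["spam", "eggs"] * 50_000

def plain():
    d = {}
    for x in data:
        d[x] = d.get(x, 0) + 1   # attribute lookup on every iteration
    return d

def local_get():
    d = {}
    d_get = d.get                # bind the method once, as Counter.update does
    for x in data:
        d[x] = d_get(x, 0) + 1
    return d

print(plain() == local_get())  # True: identical counts either way
```

Binding the bound method to a local variable skips one attribute lookup per iteration; timing it with timeit typically shows a small but consistent win on hot loops.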
You could also try defaultdict as a more competitive choice. Try:

from collections import defaultdict

word_dict = defaultdict(int)  # int() returns 0, so this is equivalent to lambda: 0
for word in word_list:
    word_dict[word] += 1
print(word_dict)
Similar to what monkut mentioned, one of the best ways to do this is to use the .get() method. Credit for this goes to Charles Severance and the Python for Everybody course.
For testing:
# Pretend line is as follow.
# It can and does contain \n (newline) but not \t (tab).
line = """Your battle is my battle . We fight together . One team . One team .
Shining sun always comes with the rays of hope . The hope is there .
Our best days yet to come . Let the hope light the road .""".lower()
His code (with my notes):

counts = dict()        # make an empty dictionary
words = line.split()   # split `line` into a list; the default splits on runs of whitespace
for word in words:     # iterate over the LIST of words (made from splitting the string)
    counts[word] = counts.get(word, 0) + 1
Your code:
for words in word_list:
    if words in word_dict.keys():
        word_dict[words] += 1
    else:
        word_dict[words] = 1
.get() does this:

Return the VALUE in the dictionary associated with word.
Otherwise (if word is not a key in the dictionary), return the default, 0.

No matter what is returned, we add 1 to it; that handles the base case of seeing a word for the first time. We cannot use a dictionary comprehension here, because the variable the comprehension is assigned to does not exist yet while the comprehension runs. That is,

counts = {word: counts.get(word, 0) + 1 for word in words}

raises a NameError: counts is being created and assigned at the same time, so the name has not been bound when we reference it to call .get().
Output
>>> counts
{'.': 8,
'always': 1,
'battle': 2,
'best': 1,
'come': 1,
'comes': 1,
'days': 1,
'fight': 1,
'hope': 3,
'is': 2,
'let': 1,
'light': 1,
'my': 1,
'of': 1,
'one': 2,
'our': 1,
'rays': 1,
'road': 1,
'shining': 1,
'sun': 1,
'team': 2,
'the': 4,
'there': 1,
'to': 1,
'together': 1,
'we': 1,
'with': 1,
'yet': 1,
'your': 1}
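Returning to the comprehension point: a comprehension that does not reference its own target works fine, e.g. counting over the set of distinct words with list.count (a sketch; note it rescans the list once per distinct word, so it is O(n²), unlike the single-pass .get() loop):

```python
line = "one team one team wins"
words = line.split()

# Iterating over set(words) visits each distinct word once;
# words.count(word) then scans the whole list for that word.
counts_v2 = {word: words.count(word) for word in set(words)}
print(sorted(counts_v2.items()))  # [('one', 2), ('team', 2), ('wins', 1)]
```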
As an aside, here is a "loaded" use of .get() that I wrote as a way to solve the classic FizzBuzz question. I'm currently writing code for a similar situation in which I will use modulus and a dictionary, but with a split string as input.
Related
I want to count the occurrences of emojis in a list in Python.
Assuming my list looks like this
li = ['š', 'š¤£š', 'šš¤£š']
Counter(li) would give me {'š': 1, 'š¤£š': 1, 'šš¤£š': 1}
But I would like to get the total amount of emojis aka {'š': 3, 'š¤£': 2, 'š': 1}
My main issue is how to separate large chunks of contiguous emoji into single list entries. I tried replacing the beginning "\U" with " \U" so I could then simply split on " ", but it does not seem to work.
Thanks for your help in advance :)
You can flatten your list into a single string using join and then apply Counter to that:
Counter("".join(li))
results in
Counter({'š': 3, 'š¤£': 2, 'š': 1})
or maybe a more memory efficient way is
counter = Counter()
for item in li:
    counter.update(item)
You can count the emojis by iterating on the characters of each string:
from collections import Counter
li = ['š', 'š¤£š', 'šš¤£š']
count = Counter(emoji for string in li for emoji in string)
print(count)
# Counter({'š': 3, 'š¤£': 2, 'š': 1})
#Dan gave a different answer just before me, which he has sadly deleted since, so I reproduce it for the under-10k-reputation users who can't see it:
Counter("".join(li))
I thought it might be less efficient because of the creation of the joined string, but I did some timings with small and larger lists up to 10 000 000 items, and it appears that his solution is consistently 30 to 40% faster.
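A rough sketch of how such a timing comparison can be set up (exact ratios will differ by machine and Python version; the list here is synthetic):

```python
import timeit
from collections import Counter

li = ["a", "bc", "abc"] * 10_000

# Approach 1: build one big string, let Counter's C fast path count it.
t_join = timeit.timeit(lambda: Counter("".join(li)), number=20)
# Approach 2: feed Counter a generator of individual characters.
t_gen = timeit.timeit(lambda: Counter(c for s in li for c in s), number=20)
print(f"join: {t_join:.3f}s  genexpr: {t_gen:.3f}s")
```

Both approaches must agree on the resulting counts; only the construction cost differs.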
One more way is to use the fact that counters implement addition:
>>> li = ['š', 'š¤£š', 'šš¤£š']
>>> from collections import Counter
>>> sum(map(Counter, li), Counter())
Counter({'š': 3, 'š¤£': 2, 'š': 1})
Given an unsorted string, e.g. "googol", I want to find the number of occurrences of the character "o" in the range [1, 3). In this case, the answer would be 1.
However, my method has O(N²) complexity. The problem with my method is that copying the count array takes O(N) time at every step, so I was looking for a more efficient approach. Space complexity does not matter to me. Since I am learning string-processing algorithms, it would be best if I could work the algorithm out on my own.
Any help would be appreciated.
My method:

a_string = "googol"
tmp = [0] * 26  # one slot per lowercase letter
occurrences_table = []
tmp[ord(a_string[0]) - ord('a')] += 1  # offset by ord('a') so indices fall in 0-25
occurrences_table.append(tmp)
for i in range(1, len(a_string)):
    temp = occurrences_table[i - 1][:]  # the O(N) copy mentioned above
    temp[ord(a_string[i]) - ord('a')] += 1
    occurrences_table.append(temp)
Since you don't want to use counter and want to implement it yourself, your code can be tidied up and sped up a little by using dictionaries.
a_string = "googol"
my_counter = {}
for c in a_string[:2]:
    my_counter[c] = my_counter.get(c, 0) + 1
which would give you:
{'o': 1, 'g': 1}
To explain a little further: a_string[:2] takes the characters up to index 2 of your string ('googol'[:2] == 'go'), and for c in a_string[:2]: loops over those two characters.
In the next line, my_counter.get(c, 0) tries to fetch the dictionary value for the key c (a single character of your string); if the key exists it returns its value, otherwise it returns 0. Either way we add 1 and store the incremented value back in the dictionary.
EDIT:
Complexity should be just O(n) for the loop, since dictionary .get() runs in (amortized) constant time.
I measured it, and for very small strings like yours this method was 8-10 times faster than collections.Counter, but for very large strings it is 2-3 times slower.
You could use a Counter:
from collections import Counter
a_string = "googol"
occurrences = Counter(a_string[0:2])
which results in
Counter({'o': 1, 'g': 1})
Notice that slice notation works on strings.
If you can use standard libraries:
>>> from itertools import islice
>>> from collections import Counter
>>> Counter(islice('googol', 1, 3))
Counter({'o': 2})
>>> Counter(islice('googol', 0, 2))
Counter({'g': 1, 'o': 1})
(islice avoids a temporary list.)
If you want to do it manually:
>>> s = 'googol'
>>> counter = dict()
>>> for i in range(0, 2):
...     if s[i] not in counter:
...         counter[s[i]] = 1
...     else:
...         counter[s[i]] += 1
...
>>> counter
{'g': 1, 'o': 1}
The point is: use a dict.
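For the stated goal of many range queries, the prefix-table idea from the question can also be made to work with O(1) queries, if each row is actually copied and each letter is mapped into the 0-25 range. A self-contained sketch (0-based, half-open ranges; names are illustrative):

```python
a_string = "googol"

# prefix[i][c] holds the count of letter c within a_string[:i]
prefix = [[0] * 26]
for ch in a_string:
    row = prefix[-1][:]              # copy the previous counts (O(26), a constant)
    row[ord(ch) - ord('a')] += 1
    prefix.append(row)

def count_in_range(char, lo, hi):
    """Occurrences of char in a_string[lo:hi]."""
    idx = ord(char) - ord('a')
    return prefix[hi][idx] - prefix[lo][idx]

print(count_in_range('o', 0, 2))  # 1, the count of 'o' in 'go'
```

Preprocessing is O(26·n); after that, every range query is a single subtraction.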
So I made a function:
def word_count(string):
    my_string = string.lower().split()
    my_dict = {}
    for item in my_string:
        if item in my_dict:
            my_dict[item] += 1
        else:
            my_dict[item] = 1
    print(my_dict)
So, what this does is take a string, split it, and produce a dictionary whose keys are the words and whose values are how many times each word appears.
Okay, what I'm trying to do now is make a function that takes the output of that function and produces a tuple of lists in the following format:

((list of words longer than 1 letter), (list of most frequent words), (list of words with the longest length))

Also, for example, if two words have each appeared 3 times and both are 6 letters long, both words should be included in both the most-frequent and longest-length lists.
So, this has been my attempt thus far at tackling this problem
def analyze(x):
    longer_than_one = []
    most_frequent = []
    longest = []
    for key in x.item:
        if len(key) > 1:
            key.append(longer_than_one)
    print(longer_than_one)
What I was trying to do here is write a series of for loops and if statements that append to the lists depending on whether the items meet the criteria. However, I have run into the following problems:

1- How do I iterate over a dictionary without getting an error?
2- I can't figure out a way to collect the most frequent words (I was thinking of appending the keys with the highest values).
3- I can't figure out a way to append only the longest words in the dictionary (I was thinking of using len(key), but it raised an error).

If it's any help, I'm working in Anaconda's Spyder with Python 3.5.1. Any tips would be appreciated!
You really are trying to re-invent the wheel.
Imagine you have list_of_words which is, well, a list of strings.
To get the most frequent word, use Counter:
from collections import Counter
my_counter = Counter(list_of_words)
To sort the list by the length:
sorted_by_length = sorted(list_of_words, key=len)
To get the list of words longer than one letter you can simply use your sorted list, or create a new list with only these:
longer_than_one_letter = [word for word in list_of_words if len(word) > 1]
To get your output on your required format, simply use all of the above.
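Putting those pieces together, one possible sketch of the requested tuple (function and sample words are illustrative, not from the question):

```python
from collections import Counter

def analyze(list_of_words):
    counts = Counter(list_of_words)
    # Iterating over counts visits each distinct word once.
    longer_than_one = [w for w in counts if len(w) > 1]
    max_count = max(counts.values())
    most_frequent = [w for w, n in counts.items() if n == max_count]
    max_len = max(len(w) for w in counts)
    longest = [w for w in counts if len(w) == max_len]
    return (longer_than_one, most_frequent, longest)

print(analyze(["go", "go", "a", "stone", "stone", "brick"]))
# (['go', 'stone', 'brick'], ['go', 'stone'], ['stone', 'brick'])
```

Ties are handled naturally: every word matching the maximum count or maximum length is kept, as the question requires.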
Most of your problems are solved or get easier when you use a Counter.
Writing word_count with a Counter:
>>> from collections import Counter
>>> def word_count(string):
... return Counter(string.split())
Demo:
>>> c = word_count('aa aa aa xxx xxx xxx b b ccccccc')
>>> c
Counter({'aa': 3, 'xxx': 3, 'b': 2, 'ccccccc': 1})
>>> c['aa']
3
The most_common method of a Counter helps with getting the most frequent words:
>>> c.most_common()
[('aa', 3), ('xxx', 3), ('b', 2), ('ccccccc', 1)]
>>> c.most_common(1)
[('aa', 3)]
>>> max_count = c.most_common(1)[0][1]
>>> [word for word, count in c.items() if count == max_count]
['aa', 'xxx']
You can get the words themselves with c.keys()
>>> c.keys()
['aa', 'xxx', 'b', 'ccccccc']
and a list of words with the longest length this way:
>>> max_len = len(max(c, key=len))
>>> [word for word in c if len(word) == max_len]
['ccccccc']
1)
To iterate over a dictionary you can use:

for key in my_dict:

or, if you want the key and value at the same time, use:

for key, value in my_dict.items():  # .iteritems() in Python 2

2)
To find the most frequent words, assume the first word is the most frequent. Then look at each next word's count: if it is the same, append the word to your list; if it is less, skip it; if it is more, clear your list and treat this word as the new most frequent.
3) Pretty much the same as 2): assume your first word is the longest, then compare each next one; if its length equals your current max, append it to the list; if it is less, skip it; if it is more, clear your list and treat its length as the new max.
I didn't add any code, since it's better if you write it yourself in order to learn something.
There are other nice answers to your question, but I would like to help with your own attempt. I have made a few modifications to your code to get it working:
def analyze(x):
    longer_than_one = []
    most_frequent = []
    longest = []
    for key in x:
        if len(key) > 1:
            longer_than_one.append(key)
    print(longer_than_one)
It seems you haven't attempted the 2nd and 3rd use cases yet.
At first, check collections.Counter:
import collections
word_counts = collections.Counter(your_text.split())
Given that, you can use its .most_common method for the most common words. It produces a list of (word, its_count) tuples.
To discover the longest words in the dictionary, you can do:
import heapq
largest_words = heapq.nlargest(N, word_counts, key=len)
N being the number of largest words you want. This works because, by default, iterating over a dict yields only its keys, so nlargest ranks the words by their length (key=len) and returns only the N longest ones.
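A quick illustrative run of that nlargest idea (the sample sentence is made up):

```python
import collections
import heapq

word_counts = collections.Counter("the cat saw the big elephant".split())
# Pick the 2 longest distinct words by length.
largest_words = heapq.nlargest(2, word_counts, key=len)
print(largest_words)  # ['elephant', ...]; order among equal-length ties may vary
```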
But you seem to have fallen deep into Python without going over the tutorial. Is it homework?
I do have a rough idea: use the ord and id functions to obtain values for the letters in a string, but I have no idea how to increment a value for every other occurrence of the same letter. This is my personal interview preparation. Please give me ideas and suggestions, not code.
Without using built-in functions or counters, I would think outside the box. Just put a comment that instructs the user what to do physically at their desk. This uses just enough Python to get the job done without using any of the built-in functions.
"""
Using a pencil, write down the word, then starting at the first letter:
Write down the letter and put a dash or tick or something next to it
Look for that letter in the word, and add additional marks next to the letter to keep track of the count.
When you've reached the end of the word, go to the next letter and repeat the process
Skip any letters that you have already counted (otherwise, instead of the count of each letter's occurrences, you'll get the factorial of the count!)
"""
Use a counter variable; something like this should get you started:
>>> count = 0
>>> meaty = 'asjhdkajhskjfhalksjhdflaksjdkhaskjd'
>>> for i in meaty:
...     if i == 'a':
...         count += 1
...
>>> print count
5
If you are tracking multiple occurrences of multiple letters, use each letter as a dictionary key that stores a counter integer:
>>> count = {}
>>> for i in meaty:
...     key = i
...     if key in count:
...         count[key] += 1
...     else:
...         count[key] = 1
...
>>> count
{'a': 5, 'd': 4, 'f': 2, 'h': 5, 'k': 6, 'j': 6, 'l': 2, 's': 5}
Edit: Meh, no Counter, no built-in functions, no coding examples on Stack Overflow? Not sure how to count without counting, and isn't Python essentially a collection of built-in functions? Isn't Stack Overflow for actual coding help?
Hey, I have a doubt about the following Python code I wrote:
#create a list of elements
#use a dictionary to find out the frequency of each element
list = [1,2,6,3,4,5,1,1,3,2,2,5]
list.sort()
dict = {i: list.count(i) for i in list}
print(dict)
In the dictionary comprehension method, "for i in list" is the sequence supplied to the comprehension, right? So it takes 1, 2, 3, 4, ... as the keys. My question is: why doesn't it take 1 three times? Since I've said "for i in list", doesn't it have to take each and every element in the list as a key?
(I'm new to Python, so be easy on me!)
My question is why doesn't it take 1 three times ?
That's because dictionary keys are unique. If there is another entry found for the same key, the previous value for that key will be overwritten.
Well, for your issue, if you are only after counting the frequency of each element in your list, then you can use collections.Counter
And please don't use list as variable name. It's a built-in.
>>> lst = [1,2,6,3,4,5,1,1,3,2,2,5]
>>> from collections import Counter
>>> Counter(lst)
Counter({1: 3, 2: 3, 3: 2, 5: 2, 4: 1, 6: 1})
Yes, your suspicion is correct. 1 will come up 3 times during iteration. However, since dictionary keys are unique, each time 1 comes up it replaces the previously generated key/value pair with the newly generated one. This gives the right answer, but it is not the most efficient. You could convert the list to a set instead, to avoid reprocessing duplicate keys:
dict = {i: list.count(i) for i in set(list)}
However, even this method is horribly inefficient because it does a full pass over the list for each value in the list, i.e. O(nĀ²) total comparisons. You could do this in a single pass over the list, but you wouldn't use a dict comprehension:
xs = [1,2,6,3,4,5,1,1,3,2,2,5]
counts = {}
for x in xs:
    counts[x] = counts.get(x, 0) + 1
The result for counts is: {1: 3, 2: 3, 3: 2, 4: 1, 5: 2, 6: 1}
Edit: I didn't realize there was something in the library to do this for you. You should use Rohit Jain's solution with collections.Counter instead.