I am trying to get sorted output from the following program.
"""Count words."""
# TODO: Count the number of occurences of each word in s
# TODO: Sort the occurences in descending order (alphabetically in case of ties)
# TODO: Return the top n words as a list of tuples
from operator import itemgetter
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
t1=[]
t2=[]
temp={}
top_n={}
words=s.split()
for word in words:
if word not in temp:
t1.append(word)
temp[word]=1
else:
temp[word]+=1
top_n=sorted(temp.items(), key=itemgetter(1,0),reverse=True)
print top_n
return
def test_run():
    """Test count_words() with some inputs."""
    count_words("cat bat mat cat bat cat", 3)
    count_words("betty bought a bit of butter but the butter was bitter", 3)

if __name__ == '__main__':
    test_run()
This program's output is:
[('cat', 3), ('bat', 2), ('mat', 1)]
[('butter', 2), ('was', 1), ('the', 1), ('of', 1), ('but', 1), ('bought', 1), ('bitter', 1), ('bit', 1), ('betty', 1), ('a', 1)]
but I need it in a form like:
[('cat', 3), ('bat', 2), ('mat', 1)]
[('butter', 2), ('a', 1),('betty', 1),('bit', 1),('bitter', 1) ... rest of them here]
Could you please let me know the best possible way?
You need to change the key function you're giving to sorted, since the items in your desired output need to be sorted in descending order by count but ascending order alphabetically. I'd use a lambda function:
top_n = sorted(temp.items(), key=lambda item: (-item[1], item[0]))
By negating the count, an ascending sort gets you the desired order.
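For context, here is a minimal sketch of the whole function with that key applied (the [:n] slice is my addition so the function actually uses its n argument; it is not in the original code):

def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    counts = {}
    for word in s.split():
        counts[word] = counts.get(word, 0) + 1
    # descending by count, ascending alphabetically on ties
    top_n = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
    print(top_n)
    return top_n

count_words("cat bat mat cat bat cat", 3)
# [('cat', 3), ('bat', 2), ('mat', 1)]
count_words("betty bought a bit of butter but the butter was bitter", 3)
# [('butter', 2), ('a', 1), ('betty', 1)]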
You can change:
top_n=sorted(temp.items(), key=itemgetter(1,0),reverse=True)
To:
temp2 = sorted(temp.items(), key=itemgetter(0), reverse=False)
top_n = sorted(temp2, key=itemgetter(1), reverse=True)
Thanks to sort stability, you will be good.
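As a quick illustration of the stability argument (a sketch using a hand-built temp dict rather than the question's full data): the alphabetical pass runs first, and the stable sort by count then keeps that alphabetical order among equal counts.

from operator import itemgetter

temp = {'butter': 2, 'betty': 1, 'a': 1, 'bit': 1}
temp2 = sorted(temp.items(), key=itemgetter(0))           # ascending alphabetical
top_n = sorted(temp2, key=itemgetter(1), reverse=True)    # descending count, stable
print(top_n)
# [('butter', 2), ('a', 1), ('betty', 1), ('bit', 1)]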
Instead of itemgetter, use lambda t:(-t[1],t[0]) and drop the reverse=True:
top_n=sorted(temp.items(), key=lambda t:(-t[1],t[0]))
This returns the same thing as itemgetter(1,0) only with the first value inverted so that higher numbers will be sorted before lower numbers.
def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    t1 = []
    t2 = []
    temp = {}
    top_n = {}
    words = s.split()
    for word in words:
        if word not in temp:
            t1.append(word)
            temp[word] = 1
        else:
            temp[word] += 1
    top_n = sorted(temp.items(), key=lambda t: (-t[1], t[0]))
    print top_n
    return
def test_run():
    """Test count_words() with some inputs."""
    count_words("cat bat mat cat bat cat", 3)
    count_words("betty bought a bit of butter but the butter was bitter", 3)

if __name__ == '__main__':
    test_run()
I used a lambda instead of itemgetter; in other apps I've written, lambda keys have worked well. Negating the count in the key gives descending order by count and ascending alphabetical order on ties without needing reverse=True.
I am trying to count the frequency of word occurrences in a variable. The variable contains more than 700,000 observations. The output should be a dictionary with the words that occurred the most. I used the code below to do this:
d1 = {}
for i in range(len(words)-1):
    x = words[i]
    c = 0
    for j in range(i, len(words)):
        c = words.count(x)
    count = dict({x: c})
    if x not in d1.keys():
        d1.update(count)
I've run the code for the first 1000 observations and it worked perfectly. The output is shown below:
[('semantic', 23),
('representations', 11),
('models', 10),
('task', 10),
('data', 9),
('parser', 9),
('language', 8),
('languages', 8),
('paper', 8),
('meaning', 8),
('rules', 8),
('results', 7),
('performance', 7),
('parsing', 7),
('systems', 7),
('neural', 6),
('tasks', 6),
('entailment', 6),
('generic', 6),
('te', 6),
('natural', 5),
('method', 5),
('approaches', 5)]
When I try to run it for 100,000 observations, it keeps running. I've let it run for more than 24 hours and it still doesn't finish. Does anyone have an idea?
You can use collections.Counter.
from collections import Counter
counts = Counter(words)
print(counts.most_common(20))
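If you specifically want a dictionary of the most frequent words, as the question describes, the most_common() output converts directly; a small sketch (the cut-off of 3 and the sample text here are made up for illustration):

from collections import Counter

words = "the cat sat on the mat the cat".split()
counts = Counter(words)
top_words = dict(counts.most_common(3))
print(top_words)   # {'the': 3, 'cat': 2, 'sat': 1} (ties are ordered arbitrarily)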
@Jon's answer is the best in your case; however, in some cases collections.Counter will be slower than plain iteration (especially if you don't need to sort by frequency afterwards), as I asked in this question.
You can count frequencies by iteration.
d1 = {}
for item in words:
    if item in d1:
        d1[item] += 1
    else:
        d1[item] = 1

# finally sort the dictionary of frequencies
print(dict(sorted(d1.items(), key=lambda item: item[1])))
But again, for your case, @Jon's answer is faster and more compact.
#...
for i in range(len(words)-1):
    #...
    #...
    for j in range(i, len(words)):
        c = words.count(x)
    #...
    if x not in d1.keys():
        #...
I've tried to highlight the problems your code is having above. In English this looks something like:
"Count the number of occurrences of each word after the word I'm looking at, repeatedly, for every word in the whole list. Also, look through the whole dictionary I'm building again for every word in the list, while I'm building it."
This is way more work than you need to do; you only need to look at each word in the list once. You do need to look in the dictionary once for every word, but checking d1.keys() makes this far slower (in Python 2 it builds a whole new list of keys and scans it for every lookup). The following code will do what you want, much more quickly:
words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']
word_counts = {}

# Look at each word in our list once
for word in words:
    # If we haven't seen it before, create a new count in our dictionary
    if word not in word_counts:
        word_counts[word] = 0
    # We've made sure our count exists, so just increment it by 1
    word_counts[word] += 1

print(word_counts.items())
The above example will give:
[
('charlie', 2),
('baker', 1),
('able', 2),
('dog', 3),
('easy', 1)
]
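For reference, the same single pass can be written a little more compactly with dict.get and a default value; behaviour is identical to the loop above:

word_counts = {}
for word in words:
    # get() returns 0 the first time a word is seen, so no membership test is needed
    word_counts[word] = word_counts.get(word, 0) + 1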
With this code I print all the elements, sorted with the most common word in the text file first. But how do I print only the first ten elements?
from collections import Counter

with open("something.txt") as f:
    words = Counter(f.read().split())
print(words)
From the docs:
most_common([n])
Return a list of the n most common elements and their counts from the most common to the least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered arbitrarily:
I would try:
words = Counter(f.read().split()).most_common(10)
Source: here
This will give you the most common ten words in your words Counter:
first_ten_words = [word for word,cnt in words.most_common(10)]
You'll need to extract only the first elements from the list of (word, count) pairs returned by Counter.most_common():
>>> words.most_common(10)
[('qui', 4),
('quia', 4),
('ut', 3),
('eum', 2),
('aut', 2),
('vel', 2),
('sed', 2),
('et', 2),
('voluptas', 2),
('enim', 2)]
with a simple list comprehension:
>>> [word for word,cnt in words.most_common(10)]
['qui', 'quia', 'ut', 'eum', 'aut', 'vel', 'sed', 'et', 'voluptas', 'enim']
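Putting the pieces of this thread together, a minimal end-to-end sketch (assuming the same something.txt file from the question):

from collections import Counter

with open("something.txt") as f:
    counts = Counter(f.read().split())

top_ten = counts.most_common(10)                 # list of (word, count) pairs
top_ten_words = [word for word, cnt in top_ten]  # just the words
print(top_ten)
print(top_ten_words)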
I am trying to take the Spark word count example and aggregate word counts by some other value (for example, words and counts by person, where person is "VI" or "MO" in the case below).
I have an RDD which is a list of tuples whose values are lists of tuples:
from operator import add
reduced_tokens = tokenized.reduceByKey(add)
reduced_tokens.take(2)
Which gives me:
[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
(u'MO',
[(u'word4', 1),
(u'word4', 1),
(u'word5', 1),
(u'word8', 1),
(u'word10', 1),
(u'word1', 1),
(u'word4', 1),
(u'word6', 1),
(u'word9', 1),
...
)]
I want something like:
[
  ('VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
  ('MO', [(u'word4', 58), (u'word8', 2), (u'word9', 23), ...])
]
Similar to the word count example here, I would like to be able to filter out words with a count below some threshold for some person. Thanks!
The keys that you're trying to reduce across are (name, word) pairs, not just names. So you need to do a .map step to fix-up your data:
def key_by_name_word(record):
    name, (word, count) = record
    return (name, word), count

tokenized_by_name_word = tokenized.map(key_by_name_word)
counts_by_name_word = tokenized_by_name_word.reduceByKey(add)
This should give you
[
(('VI', 'word1'), 1),
(('VI', 'word2'), 1),
(('VI', 'word3'), 1),
(('MO', 'word4'), 58),
...
]
To get it into exactly the same format you mentioned, you can then do:
def key_by_name(record):
    # this is the inverse of key_by_name_word
    (name, word), count = record
    return name, (word, count)

output = counts_by_name_word.map(key_by_name).reduceByKey(add)
But it might actually be easier to work with the data in the flat format that counts_by_name_word is in.
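For reference, here is a minimal end-to-end sketch of those two map/reduceByKey steps on a few made-up records, assuming a local SparkContext; the variable names follow the answer, and the sample data is invented:

from operator import add
from pyspark import SparkContext

sc = SparkContext("local", "word-counts-by-name")

# each record is (name, (word, 1)), as in the question's tokenized RDD
tokenized = sc.parallelize([
    ("VI", ("word1", 1)), ("VI", ("word2", 1)),
    ("MO", ("word4", 1)), ("MO", ("word4", 1)), ("MO", ("word8", 1)),
])

# key by (name, word) so reduceByKey sums counts per person per word
counts_by_name_word = tokenized.map(lambda r: ((r[0], r[1][0]), r[1][1])).reduceByKey(add)

# back to name -> [(word, count)], concatenating the per-word lists
output = counts_by_name_word.map(lambda r: (r[0][0], [(r[0][1], r[1])])).reduceByKey(add)

print(output.collect())
# [('VI', [('word1', 1), ('word2', 1)]), ('MO', [('word4', 2), ('word8', 1)])]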
For completeness, here is how I solved each part of the question:
Ask 1: Aggregate word counts by some key
import re

def restructure_data(name_and_freetext):
    name = name_and_freetext[0]
    tokens = re.sub(r'[&|/|\d{4}|\.|\,|\:|\-|\(|\)|\+|\$|\!]', ' ', name_and_freetext[1]).split()
    return [((name, token), 1) for token in tokens]

filtered_data = data.filter((data.flag == 1)).select('name', 'item')
tokenized = filtered_data.rdd.flatMap(restructure_data)
Ask 2: Filter out words with a count below some threshold:
from operator import add

# keep words which have counts >= 5
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5)

# map filtered word counts into a list by key so we can sort them
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
Bonus: Sort words from most frequent to least frequent
# sort the word counts from most frequent to least frequent words
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
I'm trying to count the occurrences of each character in any given string input; the occurrences must be output in ascending order (this includes numbers and exclamation marks).
I have this for my code so far. I am aware of the Counter function, but it does not output the answer in the format I'd like, and I don't know how to format Counter's output. Instead I'm trying to find a way to use count() to count each character. I've also seen the dictionary approach, but I'm hoping there is an easier way to do it with count().
from collections import Counter
sentence=input("Enter a sentence b'y: ")
lowercase=sentence.lower()
list1=list(lowercase)
list1.sort()
length=len(list1)
list2=list1.count(list1)
print(list2)
p=Counter(list1)
print(p)
collections.Counter objects provide a most_common() method that returns a list of tuples in decreasing frequency. So, if you want it in ascending frequency, reverse the list:
from collections import Counter
sentence = input("Enter a sentence: ")
c = Counter(sentence.lower())
result = reversed(c.most_common())
print(list(result))
Demo run
Enter a sentence: Here are 3 sentences. This is the first one. Here is the second. The end!
[('a', 1), ('!', 1), ('3', 1), ('f', 1), ('d', 2), ('o', 2), ('c', 2), ('.', 3), ('r', 4), ('i', 4), ('n', 5), ('t', 6), ('h', 6), ('s', 7), (' ', 14), ('e', 14)]
Just call .most_common and reverse the output with reversed to get the output from least to most common:
from collections import Counter
sentence= "foobar bar"
lowercase = sentence.lower()
for k, count in reversed(Counter(lowercase).most_common()):
print(k,count)
If you just want to format the Counter output differently:
for key, value in Counter(list1).items():
    print('%s: %s' % (key, value))
Your best bet is to use Counter (which does work on a string) and then sort on its output.
from collections import Counter

sentence = input("Enter a sentence b'y: ")
lowercase = sentence.lower()

# Counter will work on strings
p = Counter(lowercase)
count = p.items()
# count is now (more or less) equivalent to
# [('a', 1), ('r', 1), ('b', 1), ('o', 2), ('f', 1)]

# And now you can run your sort
sorted_count = sorted(count)
# Which will sort by the letter. If you wanted to
# sort by quantity, tell the sort to use the
# second element of the tuple by setting key:
# sorted_count = sorted(count, key=lambda x: x[1])

for letter, count in sorted_count:
    # will cycle through in order of letters.
    # format as you wish
    print(letter, count)
Another way to avoid using Counter.
sentence = 'abc 11 222 a AAnn zzz?? !'
list1 = list(sentence.lower())

# If you want to remove the spaces:
# list1 = list(sentence.replace(" ", ""))

# Removing duplicate characters from the string
sentence = ''.join(set(list1))

counts = {}
for char in sentence:
    counts[char] = list1.count(char)

for item in sorted(counts.items(), key=lambda x: x[1]):
    print('Number of occurrences of %s is %d.' % (item[0], item[1]))
Output:
Number of occurrences of c is 1.
Number of occurrences of b is 1.
Number of occurrences of ! is 1.
Number of occurrences of n is 2.
Number of occurrences of 1 is 2.
Number of occurrences of ? is 2.
Number of occurrences of 2 is 3.
Number of occurrences of z is 3.
Number of occurrences of a is 4.
Number of occurrences of   is 6.
One way to do this would be by removing instances of your substring and looking at the length...
def nofsub(s, ss):
    return (len(s) - len(s.replace(ss, ""))) / len(ss)
alternatively you could use re or regular expressions,
import re

def nofsub(s, ss):
    return len(re.findall(ss, s))
finally you could count them manually,
def nofsub(s, ss):
    return len([k for n, k in enumerate(s) if s[n:n+len(ss)] == ss])
Test any of the three with...
>>> nofsub("asdfasdfasdfasdfasdf",'asdf')
5
Now that you can count any given character, you can iterate through your string's unique characters and apply the counter to each unique character you find, then sort and print the result.
def countChars(s):
    s = s.lower()
    d = {}
    for k in set(s):
        d[k] = nofsub(s, k)
    for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
        print("%s: %s" % (key, value))
You could use the list function to break the words apart:
from collections import Counter
sentence=raw_input("Enter a sentence b'y: ")
lowercase=sentence.lower()
list1=list(lowercase)
list(list1)
length=len(list1)
list2=list1.count(list1)
print(list2)
p=Counter(list1)
print(p)
I believe this should be pretty straightforward, but it seems I am not able to think straight to get this right.
I have a list as follows:
comp = [Amazon, Apple, Microsoft, Google, Amazon, Ebay, Apple, Paypal, Google]
I just want to print the words that occur the most. I did the following:
cnt = Counter(comp.split(','))
final_list = cnt.most_common(2)
This gives me the following output:
[[('Amazon', 2), ('Apple', 2)]]
I am not sure what parameter to pass to most_common(), since it could be different for each input list. So I would like to know how I can print all of the top-occurring words, be it 3 for one list or 4 for another. For the above sample, the output would be:
[[('Amazon', 2), ('Apple', 2), ('Google',2)]]
Thanks
You can use itertools.takewhile here:
>>> from collections import Counter
>>> from itertools import takewhile
>>> lis = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
>>> c = Counter(lis)
>>> items = c.most_common()
Get the max count:
>>> max_ = items[0][1]
Select only those items where count = max_, and stop as soon as an item with less count is found:
>>> list(takewhile(lambda x: x[1]==max_, items))
[('Google', 2), ('Apple', 2), ('Amazon', 2)]
You've misunderstood Counter.most_common:
most_common(self, n=None)
List the n most common elements and their counts from the most common
to the least. If n is None, then list all element counts.
i.e. n is not the count here; it is the number of top items you want returned. It is essentially equivalent to:
>>> c.most_common(4)
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
>>> c.most_common()[:4]
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
You can do this by maintaining two variables, maxi and maxi_value, storing the element with the maximum count and the number of times it has occurred.
counts = {}
maxi = None
maxi_value = 0
for elem in comp:
    try:
        counts[elem] += 1
    except KeyError:
        counts[elem] = 1
    if counts[elem] > maxi_value:
        maxi = elem
        maxi_value = counts[elem]
print(maxi)
Find the number of occurrences of one of the top words, and then filter the whole list returned by most_common:
>>> mc = cnt.most_common()
>>> list(filter(lambda t: t[1] == mc[0][1], mc))
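A slightly fuller sketch of the same idea on the question's list, assuming Python 3 (where filter() needs wrapping in list() to display); a list comprehension does the same filtering:

from collections import Counter

comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
cnt = Counter(comp)
mc = cnt.most_common()
top = [t for t in mc if t[1] == mc[0][1]]
print(top)   # [('Amazon', 2), ('Apple', 2), ('Google', 2)] (order of ties may vary)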