I am trying to take the Spark word count example and aggregate word counts by some other value (for example, words and counts by person where person is "VI" or "MO" in the case below)
I have an rdd which is a list of tuples whose values are lists of tuples:
from operator import add
reduced_tokens = tokenized.reduceByKey(add)
reduced_tokens.take(2)
Which gives me:
[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
 (u'MO',
  [(u'word4', 1),
   (u'word4', 1),
   (u'word5', 1),
   (u'word8', 1),
   (u'word10', 1),
   (u'word1', 1),
   (u'word4', 1),
   (u'word6', 1),
   (u'word9', 1),
   ...])]
I want something like:
[
 ('VI',
  [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
 ('MO',
  [(u'word4', 58), (u'word8', 2), (u'word9', 23), ...])
]
Similar to the word count example here, I would like to be able to filter out words with a count below some threshold for some person. Thanks!
The keys that you're trying to reduce across are (name, word) pairs, not just names. So you need a .map step to fix up your data:
def key_by_name_word(record):
    name, (word, count) = record
    return (name, word), count

tokenized_by_name_word = tokenized.map(key_by_name_word)
counts_by_name_word = tokenized_by_name_word.reduceByKey(add)
This should give you
[
  (('VI', 'word1'), 1),
  (('VI', 'word2'), 1),
  (('VI', 'word3'), 1),
  (('MO', 'word4'), 58),
  ...
]
To get it into exactly the same format you mentioned, you can then do:
def key_by_name(record):
    # inverse of key_by_name_word, with the value wrapped in a list
    # so that reduceByKey(add) concatenates the per-word pairs
    (name, word), count = record
    return name, [(word, count)]

output = counts_by_name_word.map(key_by_name).reduceByKey(add)
But it might actually be easier to work with the data in the flat format that counts_by_name_word is in.
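If you want to try the keying logic without a Spark cluster, the same (name, word) reduction can be emulated in plain Python (a local sketch with made-up records, not the Spark API):

```python
from collections import defaultdict

# records in the same shape as `tokenized`: (name, (word, 1))
tokenized = [
    ("VI", ("word1", 1)), ("VI", ("word2", 1)),
    ("MO", ("word4", 1)), ("MO", ("word4", 1)), ("MO", ("word5", 1)),
]

# key each record by (name, word), then reduce the counts with addition
counts = defaultdict(int)
for name, (word, n) in tokenized:
    counts[(name, word)] += n

counts_by_name_word = sorted(counts.items())
print(counts_by_name_word)
# [(('MO', 'word4'), 2), (('MO', 'word5'), 1), (('VI', 'word1'), 1), (('VI', 'word2'), 1)]
```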
For completeness, here is how I solved each part of the question:
Ask 1: Aggregate word counts by some key
import re

def restructure_data(name_and_freetext):
    name = name_and_freetext[0]
    # strip punctuation and 4-digit numbers before splitting into tokens
    tokens = re.sub(r'[&/.,:\-()+$!]|\d{4}', ' ', name_and_freetext[1]).split()
    return [((name, token), 1) for token in tokens]

filtered_data = data.filter((data.flag==1)).select('name', 'item')
tokenized = filtered_data.rdd.flatMap(restructure_data)
Ask 2: Filter out words with a count below some threshold:
from operator import add

# keep words which have counts >= 5
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5)

# map filtered word counts into a list by key so we can sort them
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
Bonus: Sort words from most frequent to least frequent
# sort the word counts from most frequent to least frequent words
output = (restruct
          .reduceByKey(add)
          .map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True)))
          .collect())
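The filter-and-sort chain can likewise be emulated locally in plain Python (a sketch with made-up counts, not the Spark API):

```python
from collections import defaultdict

# made-up counts keyed by (name, word), as the reduce step would produce
counts_by_name_word = [
    (("MO", "word4"), 58), (("MO", "word8"), 2),
    (("MO", "word9"), 23), (("VI", "word1"), 1),
]

# keep words with counts >= 5, then group per name and sort by frequency
threshold = 5
grouped = defaultdict(list)
for (name, word), count in counts_by_name_word:
    if count >= threshold:
        grouped[name].append((word, count))

output = [(name, sorted(pairs, key=lambda wc: wc[1], reverse=True))
          for name, pairs in grouped.items()]
print(output)  # [('MO', [('word4', 58), ('word9', 23)])]
```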
I'm trying to count the occurrences of each character in a given string; the occurrences must be output in ascending order (including numbers and exclamation marks).
I have this for my code so far. I am aware of the Counter function, but it does not output the answer in the format I'd like, and I do not know how to format Counter's output. Instead I'm trying to find a way to use count() to count each character. I've also seen the dictionary approach, but I'm hoping there is an easier way to do it with count().
from collections import Counter
sentence=input("Enter a sentence b'y: ")
lowercase=sentence.lower()
list1=list(lowercase)
list1.sort()
length=len(list1)
list2=list1.count(list1)
print(list2)
p=Counter(list1)
print(p)
collections.Counter objects provide a most_common() method that returns a list of tuples in decreasing frequency. So, if you want it in ascending frequency, reverse the list:
from collections import Counter
sentence = input("Enter a sentence: ")
c = Counter(sentence.lower())
result = reversed(c.most_common())
print(list(result))
Demo run
Enter a sentence: Here are 3 sentences. This is the first one. Here is the second. The end!
[('a', 1), ('!', 1), ('3', 1), ('f', 1), ('d', 2), ('o', 2), ('c', 2), ('.', 3), ('r', 4), ('i', 4), ('n', 5), ('t', 6), ('h', 6), ('s', 7), (' ', 14), ('e', 14)]
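The same idea in a deterministic form, with a fixed string standing in for the interactive input (hypothetical input, chosen only for illustration):

```python
from collections import Counter

# a fixed stand-in for the interactive input()
c = Counter("here are three sentences".lower())

# most_common() is most-to-least frequent; reversing gives ascending order
ascending = list(reversed(c.most_common()))
```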
Just call .most_common and reverse the output with reversed to get the output from least to most common:
from collections import Counter
sentence= "foobar bar"
lowercase = sentence.lower()
for k, count in reversed(Counter(lowercase).most_common()):
    print(k, count)
If you just want to format the Counter output differently:
for key, value in Counter(list1).items():
    print('%s: %s' % (key, value))
Your best bet is to use Counter (which does work on a string) and then sort on its output.
from collections import Counter

sentence = input("Enter a sentence b'y: ")
lowercase = sentence.lower()
# Counter will work on strings
p = Counter(lowercase)
count = p.items()
# count is now (more or less) equivalent to
# [('a', 1), ('r', 1), ('b', 1), ('o', 2), ('f', 1)]
# And now you can run your sort
sorted_count = sorted(count)
# Which will sort by the letter. If you wanted to
# sort by quantity, tell the sort to use the
# second element of the tuple by setting key:
# sorted_count = sorted(count, key=lambda x: x[1])
for letter, count in sorted_count:
    # will cycle through in order of letters.
    # format as you wish
    print(letter, count)
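To make the difference between the two sorts concrete, here is a small standalone sketch with a fixed string (hypothetical input, not the interactive prompt above):

```python
from collections import Counter

p = Counter("foobar")
# sorted() on (letter, count) tuples compares the letter first by default
by_letter = sorted(p.items())
# a key function switches the sort to the count instead
by_count = sorted(p.items(), key=lambda kv: kv[1])
print(by_letter)  # [('a', 1), ('b', 1), ('f', 1), ('o', 2), ('r', 1)]
```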
Another way, avoiding Counter:
sentence = 'abc 11 222 a AAnn zzz?? !'
list1 = list(sentence.lower())
#If you want to remove the spaces.
#list1 = list(sentence.replace(" ", ""))
#Removing duplicate characters from the string
sentence = ''.join(set(list1))
counts = {}
for char in sentence:
    counts[char] = list1.count(char)

for item in sorted(counts.items(), key=lambda x: x[1]):
    print('Number of Occurrences of %s is %d.' % (item[0], item[1]))
Output:
Number of Occurrences of c is 1.
Number of Occurrences of b is 1.
Number of Occurrences of ! is 1.
Number of Occurrences of n is 2.
Number of Occurrences of 1 is 2.
Number of Occurrences of ? is 2.
Number of Occurrences of 2 is 3.
Number of Occurrences of z is 3.
Number of Occurrences of a is 4.
Number of Occurrences of   is 6.
One way to do this would be by removing instances of your substring and looking at the length difference...
def nofsub(s, ss):
    return (len(s) - len(s.replace(ss, ""))) // len(ss)
alternatively you could use the re module (regular expressions),
import re

def nofsub(s, ss):
    return len(re.findall(ss, s))
finally you could count them manually,
def nofsub(s, ss):
    return len([k for n, k in enumerate(s) if s[n:n+len(ss)] == ss])
Test any of the three with...
>>> nofsub("asdfasdfasdfasdfasdf",'asdf')
5
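One caveat worth noting (my observation, not part of the original answer): the versions disagree on overlapping matches, because str.replace and re.findall consume non-overlapping occurrences, while the slice-comparison loop counts every starting position:

```python
def nofsub_replace(s, ss):
    # counts non-overlapping occurrences via length difference
    return (len(s) - len(s.replace(ss, ""))) // len(ss)

def nofsub_slices(s, ss):
    # counts every starting position, so overlaps are included
    return len([k for n, k in enumerate(s) if s[n:n + len(ss)] == ss])

# non-overlapping input: both approaches agree
print(nofsub_replace("asdfasdfasdfasdfasdf", "asdf"))  # 5
print(nofsub_slices("asdfasdfasdfasdfasdf", "asdf"))   # 5

# overlapping input: "aaa" contains "aa" once non-overlapping, twice overlapping
print(nofsub_replace("aaa", "aa"))  # 1
print(nofsub_slices("aaa", "aa"))   # 2
```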
Now that you can count any given substring, you can iterate through your string's unique characters and apply the counter to each one you find, then sort and print the result.
def countChars(s):
    s = s.lower()
    d = {}
    for k in set(s):
        d[k] = nofsub(s, k)
    for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
        print("%s: %s" % (key, value))
You could use the list function to break the sentence apart into characters and feed them to Counter:
from collections import Counter

sentence = input("Enter a sentence b'y: ")
lowercase = sentence.lower()
list1 = list(lowercase)
p = Counter(list1)
print(p)
I am trying to get the sorted output from following program.
"""Count words."""
# TODO: Count the number of occurences of each word in s
# TODO: Sort the occurences in descending order (alphabetically in case of ties)
# TODO: Return the top n words as a list of tuples
from operator import itemgetter
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
t1=[]
t2=[]
temp={}
top_n={}
words=s.split()
for word in words:
if word not in temp:
t1.append(word)
temp[word]=1
else:
temp[word]+=1
top_n=sorted(temp.items(), key=itemgetter(1,0),reverse=True)
print top_n
return
def test_run():
"""Test count_words() with some inputs."""
count_words("cat bat mat cat bat cat", 3)
count_words("betty bought a bit of butter but the butter was bitter", 3)
if __name__ == '__main__':
test_run()
This program's output looks like:
[('cat', 3), ('bat', 2), ('mat', 1)]
[('butter', 2), ('was', 1), ('the', 1), ('of', 1), ('but', 1), ('bought', 1), ('bitter', 1), ('bit', 1), ('betty', 1), ('a', 1)]
but I need in the form like:
[('cat', 3), ('bat', 2), ('mat', 1)]
[('butter', 2), ('a', 1),('betty', 1),('bit', 1),('bitter', 1) ... rest of them here]
Could you please let me know the best possible way?
You need to change the key function you're giving to sorted, since the items in your desired output need to be sorted in descending order by count but ascending order alphabetically. I'd use a lambda function:
top_n = sorted(temp.items(), key=lambda item: (-item[1], item[0]))
By negating the count, an ascending sort gets you the desired order.
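A quick check of that key function on a hypothetical input with a tie ('bat' vs 'hat', both at 2):

```python
temp = {"cat": 3, "bat": 2, "mat": 1, "hat": 2}

# negating the count sorts high counts first while ties stay alphabetical
top_n = sorted(temp.items(), key=lambda item: (-item[1], item[0]))
print(top_n)  # [('cat', 3), ('bat', 2), ('hat', 2), ('mat', 1)]
```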
You can change:
top_n=sorted(temp.items(), key=itemgetter(1,0),reverse=True)
To:
temp2 = sorted(temp.items(), key=itemgetter(0), reverse=False)
top_n = sorted(temp2, key=itemgetter(1), reverse=True)
and thanks to sort stability the alphabetical order from the first pass is preserved among equal counts.
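A runnable sketch of the two-pass idea (note the second sorted() takes the list produced by the first pass directly, not a .items() call; the input dict is hypothetical):

```python
from operator import itemgetter

temp = {"cat": 3, "bat": 2, "mat": 1, "hat": 2}

# pass 1: alphabetical ascending
temp2 = sorted(temp.items(), key=itemgetter(0))
# pass 2: count descending; Python's sort is stable, so the alphabetical
# order from pass 1 survives among equal counts
top_n = sorted(temp2, key=itemgetter(1), reverse=True)
print(top_n)  # [('cat', 3), ('bat', 2), ('hat', 2), ('mat', 1)]
```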
Instead of itemgetter, use lambda t:(-t[1],t[0]) and drop the reverse=True:
top_n=sorted(temp.items(), key=lambda t:(-t[1],t[0]))
This returns the same thing as itemgetter(1,0) only with the first value inverted so that higher numbers will be sorted before lower numbers.
def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    t1 = []
    temp = {}
    words = s.split()
    for word in words:
        if word not in temp:
            t1.append(word)
            temp[word] = 1
        else:
            temp[word] += 1
    # negate the count so ties fall back to ascending alphabetical order
    top_n = sorted(temp.items(), key=lambda t: (-t[1], t[0]))
    print top_n
    return
def test_run():
"""Test count_words() with some inputs."""
count_words("cat bat mat cat bat cat", 3)
count_words("betty bought a bit of butter but the butter was bitter", 3)
if __name__ == '__main__':
test_run()
I used lambda instead of itemgetter and in other apps I've written, lambda seems to work.
I'm new to Python and am confused by a piece of code in Python's official documentation.
unique_words = set(word for line in page for word in line.split())
To me, it looks equivalent to:
unique_words = set()
for word in line.split():
    for line in page:
        unique_words.add(word)
How can line be used in the first loop before it's defined in the nested loop? However, it actually works. I think it suggests the order of nested list comprehensions and generator expressions is from left to right, which contradicts my previous understanding.
Can anyone clarify the correct order for me?
word for line in page for word in line.split()
this part works like this:

for line in page:
    for word in line.split():
        print word

The surrounding parentheses make it a generator expression, hence the overall statement works like this:

def solve():
    for line in page:
        for word in line.split():
            yield word

and set() is used to remove duplicate words, since the code is meant to collect the unique words.
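To confirm the left-to-right ordering with a concrete (made-up) page:

```python
page = ["the quick fox", "the lazy dog"]

unique_words = set(word for line in page for word in line.split())

# the equivalent explicit loops, in the same left-to-right order
expected = set()
for line in page:
    for word in line.split():
        expected.add(word)

print(unique_words == expected)  # True
```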
From the tutorial in the official documentation:
A list comprehension consists of brackets containing an expression followed by a for clause, then zero or more for or if clauses. The result will be a new list resulting from evaluating the expression in the context of the for and if clauses which follow it. For example, this listcomp combines the elements of two lists if they are not equal:
>>> [(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
and it’s equivalent to:
>>> combs = []
>>> for x in [1,2,3]:
...     for y in [3,1,4]:
...         if x != y:
...             combs.append((x, y))
...
>>> combs
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
Note how the order of the for and if statements is the same in both these snippets.
See the last sentence quoted above.
Also note that the construct you're describing is not (officially) called a "nested list comprehension". A nested list comprehension entails a list comprehension which is within another list comprehension, such as (again from the tutorial):
[[row[i] for row in matrix] for i in range(4)]
The thing you're asking about is simply a list comprehension with multiple for clauses.
You got the loops wrong. Use this:
unique_words = set(word for line in page for word in line.split())
print unique_words
l = []
for line in page:
    for word in line.split():
        l.append(word)
print set(l)
output:
C:\...>python test.py
set(['sdaf', 'sadfa', 'sfsf', 'fsdf', 'fa', 'sdf', 'asd', 'asdf'])
set(['sdaf', 'sadfa', 'sfsf', 'fsdf', 'fa', 'sdf', 'asd', 'asdf'])
You have the nested loops mixed. What the code does is:
unique_words = set()
for line in page:
    for word in line.split():
        unique_words.add(word)
In addition to the right answers that stressed the point of the ordering, I would add that set is used to delete duplicates from the words to make "unique words"; check this thread and this one.
unique_words = set(word for line in page for word in line.split())
print unique_words
l = set()
for line in page:
    for word in line.split():
        l.add(word)
print l
I believe this should be pretty straightforward, but it seems I am not able to think straight to get this right.
I have a list as follows:
comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
I just want to print the words that occur the most. I did the following:
cnt = Counter(comp)
final_list = cnt.most_common(2)
This gives me the following output:
[[('Amazon', 2), ('Apple', 2)]]
I am not sure what parameter to pass to most_common(), since it could be different for each input list. So, I would like to know how I can print the top-occurring words, be it 3 for one list or 4 for another. For the above sample, the output would be as follows:
[[('Amazon', 2), ('Apple', 2), ('Google',2)]]
Thanks
You can use itertools.takewhile here:
>>> from itertools import takewhile
>>> lis = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
>>> c = Counter(lis)
>>> items = c.most_common()
Get the max count:
>>> max_ = items[0][1]
Select only those items where count = max_, and stop as soon as an item with less count is found:
>>> list(takewhile(lambda x: x[1]==max_, items))
[('Google', 2), ('Apple', 2), ('Amazon', 2)]
You've misunderstood Counter.most_common:
most_common(self, n=None)
List the n most common elements and their counts from the most common
to the least. If n is None, then list all element counts.
i.e. n is not the count here; it is the number of top items you want to return. It is essentially equivalent to:
>>> c.most_common(4)
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
>>> c.most_common()[:4]
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
You can do this by maintaining two variables, maxi and maxi_value, storing the most frequent element and the number of times it has occurred.
counts = {}
maxi = None
maxi_value = 0
for elem in comp:
    try:
        counts[elem] += 1
    except KeyError:
        counts[elem] = 1
    if counts[elem] > maxi_value:
        maxi = elem
        maxi_value = counts[elem]
print(maxi)
Find the number of occurrences of one of the top words, and then filter the whole list returned by most_common:
>>> mc = cnt.most_common()
>>> list(filter(lambda t: t[1] == mc[0][1], mc))
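Put together as a self-contained, runnable sketch (the names comp and cnt are taken from the question; list() around filter makes the result printable under Python 3 as well):

```python
from collections import Counter

comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon',
        'Ebay', 'Apple', 'Paypal', 'Google']
cnt = Counter(comp)
mc = cnt.most_common()

# keep every entry whose count equals the maximum count
top = list(filter(lambda t: t[1] == mc[0][1], mc))
# every entry in top has the maximum count, 2
```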