Selecting all top words in Python list using Counter

I believe this should be pretty straightforward, but it seems I am not able to think straight to get this right.
I have a list as follows:
comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
I just want to print the words that occur the most. I did the following:
from collections import Counter
cnt = Counter(comp)
final_list = cnt.most_common(2)
This gives me the following output:
[[('Amazon', 2), ('Apple', 2)]]
I am not sure what parameter to pass to most_common(), since it could be different for each input list. So, I would like to know how I can print all of the top-occurring words, be it 3 for one list or 4 for another. For the above sample, the output would be as follows:
[[('Amazon', 2), ('Apple', 2), ('Google',2)]]
Thanks

You can use itertools.takewhile here:
>>> from collections import Counter
>>> from itertools import takewhile
>>> lis = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
>>> c = Counter(lis)
>>> items = c.most_common()
Get the max count:
>>> max_ = items[0][1]
Select only those items where count == max_, and stop as soon as an item with a lower count is found:
>>> list(takewhile(lambda x: x[1]==max_, items))
[('Google', 2), ('Apple', 2), ('Amazon', 2)]
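The steps above can be wrapped in a small helper. This is a minimal sketch: the function name top_tied is hypothetical (it is not part of collections or itertools), and the order of tied items follows whatever order most_common() returns, which can differ between Python versions.

```python
from collections import Counter
from itertools import takewhile


def top_tied(words):
    """Return every (word, count) pair that shares the highest count."""
    items = Counter(words).most_common()  # sorted from most to least common
    if not items:
        return []  # empty input: nothing to report
    max_count = items[0][1]
    # Stop at the first pair whose count drops below the maximum.
    return list(takewhile(lambda pair: pair[1] == max_count, items))


comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon',
        'Ebay', 'Apple', 'Paypal', 'Google']
print(sorted(top_tied(comp)))
```

Sorting the result before printing only makes the display independent of tie order; the helper itself returns the pairs as most_common() orders them.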
You've misunderstood Counter.most_common:
most_common(self, n=None)
List the n most common elements and their counts from the most common
to the least. If n is None, then list all element counts.
i.e., n is not a count threshold; it is the number of top items you want returned. c.most_common(n) is essentially equivalent to c.most_common()[:n]:
>>> c.most_common(4)
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
>>> c.most_common()[:4]
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]

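You can do this by maintaining two variables, maxi and maxi_value, storing the most frequent element and the number of times it has occurred. Note that this tracks only a single element, so tied words are not all reported.
counts = {}
maxi = None
maxi_value = 0
for elem in comp:
    try:
        counts[elem] += 1
    except KeyError:  # first time this element is seen
        counts[elem] = 1
    if counts[elem] > maxi_value:
        maxi = elem
        maxi_value = counts[elem]
print(maxi)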

Find the number of occurrences of one of the top words, and then filter the whole list returned by most_common:
>>> mc = cnt.most_common()
>>> list(filter(lambda t: t[1] == mc[0][1], mc))

Related

iterating over list containing duplicate values

I am looking to iterate over a list with duplicate values. The 101 has 101.A and 101.B which is right but the 102 starts from 102.C instead of 102.A
import string
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
num_count = 0
for el in room_numbers:
    if room_numbers.count(el) == 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[0]))
    elif room_numbers.count(el) > 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[num_count]))
        num_count += 1
door_numbers = ['101.A','103.A','101.B','102.C','104.A',
'105.A','106.A','107.A','102.D','108.A']

Given
import string
import itertools as it
import collections as ct
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
letters = string.ascii_uppercase
Code
Simple, Two-Line Solution
dd = ct.defaultdict(it.count)
print([".".join([room, letters[next(dd[room])]]) for room in room_numbers])
or
dd = ct.defaultdict(lambda: iter(letters))
print([".".join([room, next(dd[room])]) for room in room_numbers])
Output
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
Details
In the first example we are using itertools.count as a default factory. This means that a new count() iterator is made whenever a new room number is added to the defaultdict dd. Iterators are useful because they are lazily evaluated and memory efficient.
In the list comprehension, these iterators get initialized per room number. The next number of the counter is yielded, the number is used as an index to get a letter, and the result is simply joined as a suffix to each room number.
In the second example (recommended), we use an iterator of strings as the default factory. The callable requirement is satisfied by returning the iterator from a lambda function. An iterator of strings enables us to simply call next() and directly get the next letter. Consequently, the comprehension is simplified since indexing into letters is no longer required.
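For reuse, the second variant can be wrapped in a function. This is a minimal sketch, assuming no room number repeats more often than there are letters (otherwise next() runs out of letters and raises StopIteration); the name label_doors is hypothetical.

```python
import string
from collections import defaultdict


def label_doors(room_numbers, letters=string.ascii_uppercase):
    """Suffix each room with the next unused letter for that room number."""
    # Each new room number gets its own fresh iterator over the letters.
    suffixes = defaultdict(lambda: iter(letters))
    return ["{}.{}".format(room, next(suffixes[room])) for room in room_numbers]


rooms = ['101', '103', '101', '102', '104', '105', '106', '107', '102', '108']
print(label_doors(rooms))
```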

The problem in your implementation is that num_count is a single value incremented for every duplicated item in the list, rather than being tracked separately per item. What you have to do instead is count the number of times each item has occurred so far.
Pseudocode would be
1. For each room in room numbers
2. Add the room to a list of visited rooms
3. Count the number of times the room number is available in visited room
4. Add the count to 64 and convert it to an ascii uppercase character where 65=A
5. Join the required strings in the way you want to and then append it to the door_numbers list.
Here's an implementation
import string
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
visited_rooms = []
for room in room_numbers:
    visited_rooms.append(room)
    room_count = visited_rooms.count(room)
    door_value = chr(64+room_count) # Since 65 = A when 1st item is present
    door_numbers.append("%s.%s"%(room, door_value))
door_numbers now contains the final list you're expecting which is
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
for the given input room_numbers

The naive way is to simply count the number of times the element occurs in the list up to that index:
>>> door_numbers = []
>>> for i in range(len(room_numbers)):
...     el = room_numbers[i]
...     n = 0
...     for j in range(0, i):
...         n += el == room_numbers[j]
...     c = string.ascii_uppercase[n]
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
These two explicit for-loops make the quadratic complexity obvious: (1/2) * (N * (N-1)) comparisons are made. In most cases you would be better off keeping a dict of counts instead of recounting each time.
>>> door_numbers = []
>>> counts = {}
>>> for el in room_numbers:
...     count = counts.get(el, 0)
...     c = string.ascii_uppercase[count]
...     counts[el] = count + 1
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
That way, there's no messing around with indices, and it's more time efficient (at the expense of auxiliary space).

Using iterators and comprehensions:
1. Enumerate the rooms to preserve the original order
2. Group rooms by room number, sorting first as required by groupby()
3. For each room in a group, append .A, .B, etc.
4. Sort by the enumeration values from step 1 to restore the original order
5. Extract the door numbers, e.g. '101.A'
#!/usr/bin/env python3
import operator
from itertools import groupby
import string
room_numbers = ['101', '103', '101', '102', '104',
                '105', '106', '107', '102', '108']
get_room_number = operator.itemgetter(1)
enumerated_and_sorted = sorted(list(enumerate(room_numbers)),
                               key=get_room_number)
# [(0, '101'), (2, '101'), (3, '102'), (8, '102'), (1, '103'),
# (4, '104'), (5, '105'), (6, '106'), (7, '107'), (9, '108')]
grouped_by_room = groupby(enumerated_and_sorted, key=get_room_number)
# [('101', [(0, '101'), (2, '101')]),
# ('102', [(3, '102'), (8, '102')]),
# ('103', [(1, '103')]),
# ('104', [(4, '104')]),
# ('105', [(5, '105')]),
# ('106', [(6, '106')]),
# ('107', [(7, '107')]),
# ('108', [(9, '108')])]
door_numbers = ((order, '{}.{}'.format(room, char))
                for _, room_list in grouped_by_room
                for (order, room), char in zip(room_list,
                                               string.ascii_uppercase))
# [(0, '101.A'), (2, '101.B'), (3, '102.A'), (8, '102.B'),
# (1, '103.A'), (4, '104.A'), (5, '105.A'), (6, '106.A'),
# (7, '107.A'), (9, '108.A')]
door_numbers = [room for _, room in sorted(door_numbers)]
# ['101.A', '103.A', '101.B', '102.A', '104.A',
# '105.A', '106.A', '107.A', '102.B', '108.A']

(key, value) pair using Python Lambdas

I am trying to work on a simple word count problem and trying to figure if that can be done by use of map, filter and reduce exclusively.
Following is an example of a wordRDD (the list used for Spark):
myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']
All I need is to pair each word with a count and present it in tuple format:
counts = [('cats', 1), ('elephants', 1), ('rats', 1), ('rats', 1), ('cats', 1), ('cats', 1)]
I tried with simple map() and lambdas as:
counts = myLst.map(lambdas x: (x, <HERE IS THE PROBLEM>))
I might be wrong with the syntax or maybe confused.
P.S.: This isn't a duplicate question, as the other answers give suggestions using if/else or list comprehensions.
Thanks for the help.

You don't need map(..) at all. You can do it with just reduce(..)
>>> from collections import defaultdict
>>> from functools import reduce
>>> def function(obj, x):
...     obj[x] += 1
...     return obj
...
>>> reduce(function, myLst, defaultdict(int)).items()
dict_items([('elephants', 1), ('rats', 2), ('cats', 3)])
You can then iterate over the result.
However, there's a better way of doing it: Look into Counter

Not using a lambda but gets the job done.
from collections import Counter
c = Counter(myLst)
result = list(c.items())
And the output:
In [21]: result
Out[21]: [('cats', 3), ('rats', 2), ('elephants', 1)]

If you don't want the full reduce step done for you (which aggregated the counts in SuperSaiyan's answer), you can use map this way:
>>> myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']
>>> counts = list(map(lambda s: (s,1), myLst))
>>> print(counts)
[('cats', 1), ('elephants', 1), ('rats', 1), ('rats', 1), ('cats', 1), ('cats', 1)]
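If the aggregated counts are wanted as well, the (word, 1) pairs produced by map can be folded together with reduce. This is a sketch in plain Python, not Spark; on an actual RDD the equivalent step would go through Spark's own API. The helper name add_pair is hypothetical.

```python
from functools import reduce

myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']

# Map step: pair every word with an initial count of 1.
pairs = map(lambda word: (word, 1), myLst)


def add_pair(totals, pair):
    """Fold one (word, count) pair into the running totals dict."""
    word, count = pair
    totals[word] = totals.get(word, 0) + count
    return totals


# Reduce step: sum the counts per word, starting from an empty dict.
counts = reduce(add_pair, pairs, {})
print(sorted(counts.items()))
```

Keeping the map and reduce steps separate mirrors the two phases of a word count: emit a pair per word, then combine pairs that share a key.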

You can use map() to get this result:
myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']
list(map(lambda x : (x,len(x)), myLst))

How to return the count of the same elements in two lists?

I have two very large lists (that's why I used ...): a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return in a new list the times (count) that any word in y occurred in all the lists of x. For example, for the above lists, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
The format is [(1,count),...,(n,count)], where n is the number of the list and count is the number of times that any word from y appeared in it. Any idea how to approach this?

First, you should preprocess x into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g.:
import re
ppx = []
for subx in x:
    # each subx is a one-element list holding the sentence
    ppx.append(set(w.lower() for w in re.findall(r'\w+', subx[0])))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y, checking how many of the sets in ppx contain each item of y -- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
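Put together on the four sample sentences from the question (the elided ... entries are left out), the approach can be sketched as follows. Note that the expressions above count, for each word in y, how many sub-lists of x contain it; counting per sub-list instead, as the question's expected output suggests, only needs the two loops swapped. The names per_word and per_sublist are illustrative.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

# One set of lowercased words per sub-list of x.
ppx = [set(w.lower() for w in re.findall(r'\w+', subx[0])) for subx in x]

# For each word in y: the number of sub-lists that contain it.
per_word = list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))

# For each sub-list: the number of words from y that it contains.
per_sublist = list(enumerate((sum(1 for w in y if w in s) for s in ppx), 1))

print(per_word)
print(per_sublist)
```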

Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
assert len(x)==len(y), "you have to make sure length of x equals y's"
num = []
for i in range(len(y)):
    # lower all the strings in x for comparison
    # find all matched patterns in x and count it, and store result in variable num
    num.append(len(re.findall(y[i], x[i][0].lower())))
res = []
# use enumerate to give output in format you want
for k, v in enumerate(num):
    res.append((k,v))
# here is what you want
print(res)
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]

INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re
s1 = set(y)
index = 0
result = []
for itr in x:
    itr = re.sub('[!.?,]', '',itr[0].lower()).split(' ')
    # remove special chars (including commas) and convert to lower case
    s2 = set(itr)
    intersection = s1 & s2
    #find intersection of common strings
    num = len(intersection)
    result.append((index,num))
    index = index+1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]

You could also do it like this:
>>> import re
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i, j in enumerate(x):
...     c = 0
...     for word in y:
...         if re.search(r'(?i)\b' + word + r'\b', j[0]):
...             c += 1
...     l.append((i+1, c))
...
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i) makes the match case-insensitive. \b is a word boundary, which matches between a word character and a non-word character.

Maybe you could concatenate the strings in x to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
>>> l = []
>>> for i in range(len(y)):
...     l.append((i+1, w.count(str(y[i]))))
which gives you:
>>> l
[(1, 2), (2, 0), (3, 1), (4, 0), (5, 1)]

You can make a dictionary whose keys are the items of the list y. Loop through the words in the nested list x, look each one up in the dictionary, and increment its value whenever the word is found.
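A minimal sketch of this idea, assuming whole-word, case-insensitive matching and using the question's sample data without the elided entries; the variable names are illustrative.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

# Start every word of y at zero.
counts = dict.fromkeys(y, 0)

for sublist in x:
    # Split the lowercased sentence into words.
    for word in re.findall(r'\w+', sublist[0].lower()):
        if word in counts:  # only words from y are tracked
            counts[word] += 1

print(counts)
```

This yields one total per word of y across all of x, which is a different shape from the per-sub-list counts in the question's expected output.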

How to find the 2nd max of a Counter - Python

The max of a counter can be accessed as such:
import operator
from collections import Counter
c = Counter()
c['foo'] = 124123
c['bar'] = 43
c['foofro'] = 5676
c['barbar'] = 234
# This only prints the max key
print(max(c), c[max(c)])
# print the key with the max value
x = max(c.items(), key=operator.itemgetter(1))[0]
print(x, c[x])
What if I want a sorted counter in descending counts?
How do I access the 2nd maximum, or the 3rd or the Nth maximum key?

most_common(self, n=None) method of collections.Counter instance
List the n most common elements and their counts from the most common to the least. If n is None, then list all element counts.
>>> Counter('abcdeabcdabcaba').most_common(3)
[('a', 5), ('b', 4), ('c', 3)]
and so:
>>> c.most_common()
[('foo', 124123), ('foofro', 5676), ('barbar', 234), ('bar', 43)]
>>> c.most_common(2)[-1]
('foofro', 5676)
Note that max(c) probably doesn't return what you want: iteration over a Counter is iteration over the keys, and so max(c) == max(c.keys()) == 'foofro', because it's the last after string sorting. You'd need to do something like
>>> max(c, key=c.get)
'foo'
to get the (a) key with the largest value. In a similar fashion, you could forego most_common entirely and do the sort yourself:
>>> sorted(c, key=c.get)[-2]
'foofro'
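To generalise to the Nth maximum, a small wrapper around most_common is enough. This is a sketch: nth_most_common is a hypothetical helper name, n is 1-based, and keys with equal counts come back in whatever order most_common() yields them.

```python
from collections import Counter


def nth_most_common(counter, n):
    """Return the (key, count) pair with the n-th highest count (1-based)."""
    if not 1 <= n <= len(counter):
        raise IndexError("n must be between 1 and the number of keys")
    # most_common(n) returns the top n pairs; the last one is the n-th.
    return counter.most_common(n)[-1]


c = Counter()
c['foo'] = 124123
c['bar'] = 43
c['foofro'] = 5676
c['barbar'] = 234

print(nth_most_common(c, 2))
```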

Separating nltk.FreqDist words into two lists?

I have a series of texts that are instances of a custom WebText class. Each text is an object that has a rating (-10 to +10) and a word count (nltk.FreqDist) associated with it:
>>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>> trainingTexts[1].rating
10
>>> trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>
How can you now get two lists (or dictionaries), one containing every word used exclusively in the positively rated texts (trainingTexts[].rating>0) and another containing every word used exclusively in the negative texts (trainingTexts[].rating<0), with each list holding the total word counts across all the positive or negative texts, so that you get something like this:
>>> only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>> only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]
I considered using sets, as sets contain unique instances, but I can't see how this can be done with nltk.FreqDist, and on top of that, a set wouldn't be ordered by word frequency. Any ideas?

Ok, let's say you start with this for the purposes of testing:
import nltk
class Rated(object):
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist
a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))
trainingTexts = [a,b,c]
Then your code would look like:
from collections import defaultdict
from operator import itemgetter
# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)
for r in trainingTexts:
    rating = r.rating
    freq = r.freq_dist
    # choose the appropriate counts dict
    if rating > 0:
        partition = pos_dict
    elif rating < 0:
        partition = neg_dict
    else:
        continue
    # add the information to the correct counts dict
    for word, count in freq.items():
        partition[word] += count
# Turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
    # keep only the words that never appear in the other partition
    return sorted((wc for wc in counts.items() if wc[0] not in filtered),
                  key=itemgetter(1),
                  reverse=True)
only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)
And the result is (the order of words with equal counts may vary):
>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]
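The same partitioning can be sketched without NLTK by using collections.Counter as a stand-in for nltk.FreqDist (an assumption made here so the example runs with the standard library alone). The function name exclusive_words and the (rating, Counter) tuple layout are hypothetical.

```python
from collections import Counter


def exclusive_words(rated_counts):
    """Split word counts into positive-only and negative-only lists.

    rated_counts: iterable of (rating, Counter) pairs.
    """
    pos, neg = Counter(), Counter()
    for rating, freq in rated_counts:
        if rating > 0:
            pos.update(freq)   # update() adds counts rather than replacing
        elif rating < 0:
            neg.update(freq)
    # Drop every word that also occurs in the opposite partition.
    only_pos = Counter({w: c for w, c in pos.items() if w not in neg})
    only_neg = Counter({w: c for w, c in neg.items() if w not in pos})
    # most_common() sorts by descending count.
    return only_pos.most_common(), only_neg.most_common()


texts = [
    (5, Counter('the boy sees the dog'.split())),
    (8, Counter('the cat sees the mouse'.split())),
    (-3, Counter('some boy likes nothing'.split())),
]
only_positive_words, only_negative_words = exclusive_words(texts)
print(only_positive_words)
print(only_negative_words)
```

Texts with a rating of exactly 0 are skipped, matching the continue branch in the answer above.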
