iterating over list containing duplicate values - python

I am looking to iterate over a list with duplicate values. Room 101 correctly gets 101.A and 101.B, but 102 starts from 102.C instead of 102.A:
import string
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
num_count = 0
for el in room_numbers:
    if room_numbers.count(el) == 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[0]))
    elif room_numbers.count(el) > 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[num_count]))
        num_count += 1

door_numbers = ['101.A','103.A','101.B','102.C','104.A',
                '105.A','106.A','107.A','102.D','108.A']

Given
import string
import itertools as it
import collections as ct
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
letters = string.ascii_uppercase
Code
Simple, Two-Line Solution
dd = ct.defaultdict(it.count)
print([".".join([room, letters[next(dd[room])]]) for room in room_numbers])
or
dd = ct.defaultdict(lambda: iter(letters))
print([".".join([room, next(dd[room])]) for room in room_numbers])
Output
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
Details
In the first example we are using itertools.count as a default factory. This means that a new count() iterator is made whenever a new room number is added to the defaultdict dd. Iterators are useful because they are lazily evaluated and memory efficient.
In the list comprehension, these iterators get initialized per room number. The next number of the counter is yielded, the number is used as an index to get a letter, and the result is simply joined as a suffix to each room number.
In the second example (recommended), we use an iterator of strings as the default factory. The callable requirement is satisfied by returning the iterator in a lambda function. An iterator of strings enables us to simply call next() and directly get the next letter. Consequently, the comprehension is simplified since slicing letters is no longer required.
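As a minimal illustration of the default-factory behavior (a sketch, not part of the original answer): each new key triggers one call to the factory, and the resulting iterator is stored and reused.

```python
from collections import defaultdict
from itertools import count

dd = defaultdict(count)          # count is called once per new key
print(next(dd['101']))  # 0 -- a fresh count() iterator is created for '101'
print(next(dd['101']))  # 1 -- the same iterator is reused on the next visit
print(next(dd['102']))  # 0 -- '102' gets its own independent counter
```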

The problem in your implementation is that num_count is incremented for every duplicate item anywhere in the list, rather than being a separate count per room number. What you'd have to do instead is count the number of times each individual room number has occurred so far.
Pseudocode would be
1. For each room in room numbers
2. Add the room to a list of visited rooms
3. Count the number of times the room number is available in visited room
4. Add the count to 64 and convert it to an ascii uppercase character where 65=A
5. Join the required strings in the way you want to and then append it to the door_numbers list.
Here's an implementation:
import string
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
visited_rooms = []
for room in room_numbers:
    visited_rooms.append(room)
    room_count = visited_rooms.count(room)
    door_value = chr(64 + room_count)  # 65 == ord('A'), so the 1st occurrence maps to A
    door_numbers.append("%s.%s" % (room, door_value))
door_numbers now contains the final list you're expecting which is
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
for the given input room_numbers

The naive way: simply count the number of times the element occurs in the list up to that index:
>>> door_numbers = []
>>> for i in xrange(len(room_numbers)):
...     el = room_numbers[i]
...     n = 0
...     for j in xrange(0, i):
...         n += el == room_numbers[j]
...     c = string.ascii_uppercase[n]
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
These two explicit for-loops make the quadratic complexity obvious: exactly (1/2) * N * (N-1) iterations are made. In most cases you would be better off keeping a dict of counts instead of re-counting each time.
>>> door_numbers = []
>>> counts = {}
>>> for el in room_numbers:
...     count = counts.get(el, 0)
...     c = string.ascii_uppercase[count]
...     counts[el] = count + 1
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
That way, there's no messing around with indices, and it's more time efficient (at the expense of auxiliary space).

Using iterators and comprehensions:
1. Enumerate the rooms to preserve the original order
2. Group rooms by room number, sorting first as required by groupby()
3. For each room in a group, append .A, .B, etc.
4. Sort by the enumeration values from step 1 to restore the original order
5. Extract the door numbers, e.g. '101.A'
#!/usr/bin/env python3
import operator
from itertools import groupby
import string

room_numbers = ['101', '103', '101', '102', '104',
                '105', '106', '107', '102', '108']

get_room_number = operator.itemgetter(1)
enumerated_and_sorted = sorted(enumerate(room_numbers), key=get_room_number)
# [(0, '101'), (2, '101'), (3, '102'), (8, '102'), (1, '103'),
#  (4, '104'), (5, '105'), (6, '106'), (7, '107'), (9, '108')]

grouped_by_room = groupby(enumerated_and_sorted, key=get_room_number)
# [('101', [(0, '101'), (2, '101')]),
#  ('102', [(3, '102'), (8, '102')]),
#  ('103', [(1, '103')]),
#  ('104', [(4, '104')]),
#  ('105', [(5, '105')]),
#  ('106', [(6, '106')]),
#  ('107', [(7, '107')]),
#  ('108', [(9, '108')])]

door_numbers = ((order, '{}.{}'.format(room, char))
                for _, room_list in grouped_by_room
                for (order, room), char in zip(room_list, string.ascii_uppercase))
# [(0, '101.A'), (2, '101.B'), (3, '102.A'), (8, '102.B'),
#  (1, '103.A'), (4, '104.A'), (5, '105.A'), (6, '106.A'),
#  (7, '107.A'), (9, '108.A')]

door_numbers = [room for _, room in sorted(door_numbers)]
# ['101.A', '103.A', '101.B', '102.A', '104.A',
#  '105.A', '106.A', '107.A', '102.B', '108.A']


A more complex version of "How can I tell if a string repeats itself in Python?"

I was reading this post and I wonder if someone can find the way to catch repetitive motifs into a more complex string.
For example, find all the repetitive motifs in
string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
Here are the repetitive motifs (highlighted in the original post):
'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
So, the output should be something like this:
output = {'ACGT':   {'repeat': 2, 'region': (5, 13)},
          'GT':     {'repeat': 3, 'region': (19, 24)},
          'TATACG': {'repeat': 2, 'region': (29, 40)}}
This example comes from a typical biological phenomenon termed microsatellite, which is present in DNA.
UPDATE 1: Asterisks were removed from the string variable. It was a mistake.
UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.
You can use a recursive function as follows.
Note: the result argument effectively behaves like a global variable, because the mutable default list is shared across calls (mutating it affects every subsequent call).
import re

def finder(st, past_ind=0, result=[]):
    m = re.search(r'(.+)\1+', st)
    if m:
        i, j = m.span()
        sub = st[i:j]
        ind = (sub + sub).find(sub, 1)
        sub = sub[:ind]
        if len(sub) > 1:
            result.append([sub, (i + past_ind + 1, j + past_ind + 1)])
        past_ind += j
        return finder(st[j:], past_ind)
    else:
        return result

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)
result:
[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
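The mutable-default behavior the note above refers to can be seen with a toy function (a sketch, unrelated to finder itself): the default list is created once at definition time and reused on every call.

```python
def collect(x, result=[]):
    # result defaults to the SAME list object on every call
    result.append(x)
    return result

print(collect(1))  # [1]
print(collect(2))  # [1, 2] -- the same list is reused on the second call
```

This is why calling finder twice without passing a fresh result list would accumulate results from both calls.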
Answer to the previous question, for the following string:
s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
You can use both answers from mentioned question and some extra recipes :
First, split the string on ** and build a new list containing the repeated strings using the r'(.+)\1+' regex:
So the result will be :
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
Note about 'ACGTACGT' that missed the A at the end!
Then you can use principal_period's function to get the repeated sub strings :
def principal_period(s):
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

>>> for i in new:
...     p = principal_period(i)
...     if p is not None and len(p) > 1:
...         l.append(p)
...         sub.append(i)
...
So you will have the repeated strings in l and main strings in sub :
>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
Then you need the regions, which you can get with the span() method:
>>> for t in sub:
...     regons.append(re.search(t, s).span())
...
>>> regons
[(6, 14), (24, 30), (38, 50)]
Finally, you can zip the three lists sub, l, and regons and use a dict comprehension to create the expected result:
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
The main code:
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub = []
>>> l = []
>>> regons = []
>>> new = [re.search(r'(.+)\1+', i).group(0) for i in s.split('**')]
>>> for i in new:
...     p = principal_period(i)
...     if p is not None and len(p) > 1:
...         l.append(p)
...         sub.append(i)
...
>>> for t in sub:
...     regons.append(re.search(t, s).span())
...
>>> z = zip(sub, l, regons)
>>> out = {i: {'repeat': i.count(j), 'region': reg} for i, j, reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
If you can bound your query, you can use a single pass over the string. The number of comparisons will be length of string * (max_length - min_length), so it scales linearly with the string length.
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

def find_repeats(s, max_length, min_length=2):
    for i in xrange(len(s)):
        for j in xrange(min_length, max_length + 1):
            count = 1
            while s[i:i+j] == s[i+j*count:i+j*count+j]:
                count += 1
            if count > 1:
                yield s[i:i+j], i, count

for pattern, position, count in find_repeats(s, 6, 2):
    print "%6s at region (%d, %d), %d repeats" % (
        pattern, position, position + count * len(pattern), count)
Output:
AC at region (2, 6), 2 repeats
ACGT at region (4, 12), 2 repeats
CGTA at region (5, 13), 2 repeats
GT at region (18, 24), 3 repeats
TG at region (19, 23), 2 repeats
GT at region (20, 24), 2 repeats
CC at region (24, 28), 2 repeats
TA at region (28, 32), 2 repeats
TATACG at region (28, 40), 2 repeats
ATACGT at region (29, 41), 2 repeats
TA at region (34, 38), 2 repeats
Note that this catches a fair few more overlapping patterns than the regexp answers, but without knowing more about what you consider a good match it is difficult to reduce it further, for example why is TATACG better than ATACGT?
Extra: Using a dict to return matches is a bad idea as the patterns are not going to be unique.
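To see why a dict of pattern -> match is lossy here, consider the two 'GT' matches from the output above (a minimal sketch):

```python
# 'GT' repeats at two different positions, as in the output above
matches = [('GT', 18, 3), ('GT', 20, 2)]

# keying on the pattern keeps only the last entry per pattern
as_dict = {pattern: (pos, count) for pattern, pos, count in matches}
print(as_dict)  # {'GT': (20, 2)} -- the earlier 'GT' match is silently overwritten
```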
This simple while loop detects all repeated patterns:
def expand():
    global hi
    hi += 1

def shrink():
    global lo
    lo += 1

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
motifs = set()
lo = 0
hi = 0
f = expand
while hi <= len(s):
    sub = s[lo : hi+1]
    if s.count(sub) > 1:
        motifs.add(sub)
        if lo == hi: f = expand
        f()
    else:
        f = shrink if lo <= hi else expand
        f()
At this point, motifs contains all the repeated patterns... Let's filter them with some criteria:
minlen = 3
for m in filter(lambda m: len(m) >= minlen and s.count(2*m) >= 1, motifs):
    print(m)
'''
ATACGT
ACGT
TATACG
CGTA
'''
You can use the fact that in regex, lookaheads do not advance the primary iterator. Thus, you can nest a capture group within a lookahead to find the (potentially overlapping) patterns that repeat and have a specified minimum length:
>>> import re
>>> s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
>>> re.findall(r'(?=(.{2,})\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TATACG', 'ATACGT', 'TA']
>>> re.findall(r'(?=(.{2,}?)\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TA', 'ATACGT', 'TA']
Note the slightly different results between using a greedy and a non-greedy quantifier. The greedy quantifier searches for the longest repeating substring starting from every index in the original string, if one exists. The non-greedy quantifier searches for the shortest of the same. The limitation is that you can only get a maximum one pattern per starting index in the string. If you have any ideas to solve this problem, let me know! Potentially, we can use the greedy quantifier regex to set up a recursive solution that finds every repeating pattern starting from each index, but let's avoid "premature computation" for now.
Now if we take the regex (?=(.{2,})\1+) and modify it, we can also capture the entire substring that contains repeated motifs. By doing this, we can use the span of the matches to calculate the number of repetitions:
(?=((.{2,})\2+))
In the above regex, we have a capture group inside a capture group inside a lookahead. Now we have everything we need to solve the problem:
def repeated_motifs(s):
    import re
    from collections import defaultdict
    rmdict = defaultdict(list)
    for match in re.finditer(r'(?=((.{2,})\2+))', s):
        motif = match.group(2)
        span1, span2 = match.span(1), match.span(2)
        startindex = span1[0]
        repetitions = (span1[1] - startindex) // (span2[1] - startindex)
        others = rmdict[motif]
        if not others or startindex > others[-1]['region'][1]:
            others.append({'repeat': repetitions, 'region': span1})
    return rmdict

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
d = repeated_motifs(s)
print(d)
# list of the repeating motifs we have found, sorted by first region
print(sorted(d.keys(), key=lambda k: d[k][0]['region']))
Because desired behavior in the situation where a motif repeats itself in multiple "regions" of the string was not specified, I have made the assumption that OP would like a dictionary of string->list where each list contains its own set of dictionaries.

Combine consecutive epoch-periods in a list of tuples [duplicate]

I have a list of tuples where each tuple is a (start-time, end-time). I am trying to merge all overlapping time ranges and return a list of distinct time ranges.
For example
[(1, 5), (2, 4), (3, 6)] ---> [(1,6)]
[(1, 3), (2, 4), (5, 8)] ---> [(1, 4), (5,8)]
Here is how I implemented it.
# Algorithm
# initialranges: [(a,b), (c,d), (e,f), ...]
# First we sort each tuple, then the whole list.
# This will ensure that a<b, c<d, e<f ... and a < c < e ...
# BUT the order of b, d, f ... is still arbitrary
# Now we have only 3 possibilities
#================================================
# b<c<d:   a-------b              Ans: [(a,b),(c,d)]
#                     c---d
# c<=b<d:  a-------b              Ans: [(a,d)]
#                c---d
# c<d<b:   a-------b              Ans: [(a,b)]
#             c---d
#================================================
def mergeoverlapping(initialranges):
    i = sorted(set(tuple(sorted(x)) for x in initialranges))
    # initialize final ranges to [(a,b)]
    f = [i[0]]
    for c, d in i[1:]:
        a, b = f[-1]
        if c <= b < d:
            f[-1] = a, d
        elif b < c < d:
            f.append((c, d))
        else:
            # else case included for clarity. Since we already sorted the
            # tuples and the list, the only remaining possibility is c<d<b,
            # in which case we can silently pass
            pass
    return f
I am trying to figure out:
Is there a built-in function in some Python module that can do this more efficiently? Or
Is there a more Pythonic way of accomplishing the same goal?
Your help is appreciated. Thanks!
A few ways to make it more efficient and Pythonic:
Eliminate the set() construction, since the algorithm should prune out duplicates in the main loop.
If you just need to iterate over the results, use yield to generate the values.
Reduce construction of intermediate objects. For example: move the tuple() call to the point where the final values are produced, saving you from having to construct and throw away extra tuples, and reuse a list saved for storing the current time range for comparison.
Code:
def merge(times):
    saved = list(times[0])
    for st, en in sorted([sorted(t) for t in times]):
        if st <= saved[1]:
            saved[1] = max(saved[1], en)
        else:
            yield tuple(saved)
            saved[0] = st
            saved[1] = en
    yield tuple(saved)

data = [
    [(1, 5), (2, 4), (3, 6)],
    [(1, 3), (2, 4), (5, 8)],
]
for times in data:
    print list(merge(times))
Sort tuples then list; if t1.right >= t2.left => merge
and restart with the new list, ...
-->
def f(l, sort=True):
    if sort:
        sl = sorted(tuple(sorted(i)) for i in l)
    else:
        sl = l
    if len(sl) > 1:
        for i in range(len(sl) - 1):
            if sl[i][1] >= sl[i + 1][0]:
                sl[i] = (sl[i][0], max(sl[i][1], sl[i + 1][1]))
                del sl[i + 1]
                break
    if len(sl) < len(l):
        return f(sl, False)
    return sl
The sort part: use standard sorting, it compares tuples the right way already.
sorted_tuples = sorted(initial_ranges)
The merge part. It eliminates duplicate ranges, too, so no need for a set. Suppose you have current_tuple and next_tuple.
c_start, c_end = current_tuple
n_start, n_end = next_tuple
if n_start <= c_end:
merged_tuple = min(c_start, n_start), max(c_end, n_end)
I hope the logic is clear enough.
To peek at the next tuple, you can use indexed access into the sorted tuples; it's a fully known sequence anyway.
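The merge logic described above can be assembled into a complete function (my own sketch of the answer's idea, not code from the answer; merge_ranges is a hypothetical name):

```python
def merge_ranges(initial_ranges):
    # standard sorting compares tuples the right way already
    sorted_tuples = sorted(tuple(sorted(t)) for t in initial_ranges)
    merged = [sorted_tuples[0]]
    for n_start, n_end in sorted_tuples[1:]:
        c_start, c_end = merged[-1]
        if n_start <= c_end:
            # overlap: extend the current range
            merged[-1] = (min(c_start, n_start), max(c_end, n_end))
        else:
            # gap: start a new range
            merged.append((n_start, n_end))
    return merged

print(merge_ranges([(1, 5), (2, 4), (3, 6)]))  # [(1, 6)]
print(merge_ranges([(1, 3), (2, 4), (5, 8)]))  # [(1, 4), (5, 8)]
```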
Sort all boundaries then take all pairs where a boundary end is followed by a boundary start.
def mergeOverlapping(initialranges):
    def allBoundaries():
        for r in initialranges:
            yield r[0], True
            yield r[1], False

    def getBoundaries(boundaries):
        yield boundaries[0][0]
        for i in range(1, len(boundaries) - 1):
            if not boundaries[i][1] and boundaries[i + 1][1]:
                yield boundaries[i][0]
                yield boundaries[i + 1][0]
        yield boundaries[-1][0]

    return getBoundaries(sorted(allBoundaries()))
Hm, not that beautiful but was fun to write at least!
EDIT: Years later, after an upvote, I realised my code was wrong! This is the new version just for fun:
def mergeOverlapping(initialRanges):
    def allBoundaries():
        for r in initialRanges:
            yield r[0], -1
            yield r[1], 1

    def getBoundaries(boundaries):
        openrange = 0
        for value, boundary in boundaries:
            if not openrange:
                yield value
            openrange += boundary
            if not openrange:
                yield value

    def outputAsRanges(b):
        while b:
            yield (b.next(), b.next())

    return outputAsRanges(getBoundaries(sorted(allBoundaries())))
Basically I mark the boundaries with -1 or 1, sort them by value, and only output a boundary when the balance between opened and closed ranges is zero.
Late, but might help someone looking for this. I had a similar problem but with dictionaries. Given a list of time ranges, I wanted to find overlaps and merge them when possible. A little modification to @samplebias' answer led me to this:
Merge function:
def merge_range(ranges: list, start_key: str, end_key: str):
    ranges = sorted(ranges, key=lambda x: x[start_key])
    saved = dict(ranges[0])
    for range_set in ranges:
        if range_set[start_key] <= saved[end_key]:
            saved[end_key] = max(saved[end_key], range_set[end_key])
        else:
            yield dict(saved)
            saved[start_key] = range_set[start_key]
            saved[end_key] = range_set[end_key]
    yield dict(saved)
Data:
data = [
    {'start_time': '09:00:00', 'end_time': '11:30:00'},
    {'start_time': '15:00:00', 'end_time': '15:30:00'},
    {'start_time': '11:00:00', 'end_time': '14:30:00'},
    {'start_time': '09:30:00', 'end_time': '14:00:00'},
]
Execution:
print(list(merge_range(ranges=data, start_key='start_time', end_key='end_time')))
Output:
[
    {'start_time': '09:00:00', 'end_time': '14:30:00'},
    {'start_time': '15:00:00', 'end_time': '15:30:00'}
]
When using Python 3.7, following the suggestion given by "RuntimeError: generator raised StopIteration" every time I try to run app, the method outputAsRanges from @UncleZeiv should be:
def outputAsRanges(b):
    while b:
        try:
            yield (next(b), next(b))
        except StopIteration:
            return

Selecting all top words in Python list using Counter

I believe this should be pretty straightforward, but it seems I am not able to think straight to get this right.
I have a list as follows:
comp = [Amazon, Apple, Microsoft, Google, Amazon, Ebay, Apple, Paypal, Google]
I just want to print the words that occur the most. I did the following:
cnt = Counter(comp.split(','))
final_list = cnt.most_common(2)
This gives me the following output:
[[('Amazon', 2), ('Apple', 2)]]
I am not sure what parameter pass in most_common() since it could be different for each input list. So, I would like to know how I can print the top occurring words, be it 3 for one list or 4 for another. So, for the above sample, the output would be as follows:
[[('Amazon', 2), ('Apple', 2), ('Google',2)]]
Thanks
You can use itertools.takewhile here:
>>> from itertools import takewhile
>>> lis = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
>>> c = Counter(lis)
>>> items = c.most_common()
Get the max count:
>>> max_ = items[0][1]
Select only those items where count = max_, and stop as soon as an item with less count is found:
>>> list(takewhile(lambda x: x[1]==max_, items))
[('Google', 2), ('Apple', 2), ('Amazon', 2)]
You've misunderstood Counter.most_common:
most_common(self, n=None)
    List the n most common elements and their counts from the most common
    to the least. If n is None, then list all element counts.
i.e. n is not the count here; it is the number of top items you want returned. It is essentially equivalent to:
>>> c.most_common(4)
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
>>> c.most_common()[:4]
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
You can do this by maintaining two variables, maxi and maxi_value, storing the maximum element and the number of times it has occurred.
counts = {}
maxi = None
maxi_value = 0
for elem in comp:
    try:
        counts[elem] += 1
    except KeyError:
        counts[elem] = 1
    if counts[elem] > maxi_value:
        maxi_value = counts[elem]
        maxi = elem
print(maxi)
Find the number of occurences of one of the top words, and then filter the whole list returned by most_common:
>>> mc = cnt.most_common()
>>> filter(lambda t: t[1] == mc[0][1], mc)
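For completeness, a runnable Python 3 version of this filtering approach (a sketch; the original answer is Python 2, where filter returns a list directly):

```python
from collections import Counter

comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon',
        'Ebay', 'Apple', 'Paypal', 'Google']
cnt = Counter(comp)
mc = cnt.most_common()
# keep every item tied with the top count
top = [t for t in mc if t[1] == mc[0][1]]
print(top)  # [('Amazon', 2), ('Apple', 2), ('Google', 2)]
```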

How to find the 2nd max of a Counter - Python

The max of a counter can be accessed as such:
c = Counter()
c['foo'] = 124123
c['bar'] = 43
c['foofro'] =5676
c['barbar'] = 234
# This only prints the max key
print max(c), src_sense[max(c)]
# print the max key of the value
x = max(src_sense.iteritems(), key=operator.itemgetter(1))[0]
print x, src_sense[x]
What if i want a sorted counter in descending counts?
How do i access the 2nd maximum, or the 3rd or the Nth maximum key?
most_common(self, n=None) method of collections.Counter instance
    List the n most common elements and their counts from the most common to the least. If n is None, then list all element counts.
>>> Counter('abcdeabcdabcaba').most_common(3)
[('a', 5), ('b', 4), ('c', 3)]
and so:
>>> c.most_common()
[('foo', 124123), ('foofro', 5676), ('barbar', 234), ('bar', 43)]
>>> c.most_common(2)[-1]
('foofro', 5676)
Note that max(c) probably doesn't return what you want: iteration over a Counter is iteration over the keys, and so max(c) == max(c.keys()) == 'foofro', because it's the last after string sorting. You'd need to do something like
>>> max(c, key=c.get)
'foo'
to get the (a) key with the largest value. In a similar fashion, you could forego most_common entirely and do the sort yourself:
>>> sorted(c, key=c.get)[-2]
'foofro'
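Putting the pieces together, a small helper for the Nth maximum might look like this (nth_most_common is a hypothetical name, not a Counter method):

```python
from collections import Counter

c = Counter(foo=124123, bar=43, foofro=5676, barbar=234)

def nth_most_common(counter, n):
    # n is 1-based: n=1 is the max, n=2 the runner-up, ...
    return counter.most_common(n)[-1]

print(nth_most_common(c, 2))  # ('foofro', 5676)
print(nth_most_common(c, 3))  # ('barbar', 234)
```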

Separating nltk.FreqDist words into two lists?

I have a series of texts that are instances of a custom WebText class. Each text is an object that has a rating (-10 to +10) and a word count (nltk.FreqDist) associated with it:
>>trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>trainingTexts[1].rating
10
>>trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>
How can you now get two lists (or dictionaries): one containing every word used exclusively in the positively rated texts (trainingText[].rating > 0), and another containing every word used exclusively in the negatively rated texts (trainingText[].rating < 0)? Each list should contain the total word counts across all the positive or negative texts, so that you get something like this:
>>only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]
I considered using sets, as sets contain unique instances, but I can't see how this can be done with nltk.FreqDist, and on top of that, a set wouldn't be ordered by word frequency. Any ideas?
Ok, let's say you start with this for the purposes of testing:
import nltk

class Rated(object):
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))
trainingTexts = [a, b, c]
Then your code would look like:
from collections import defaultdict
from operator import itemgetter

# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)

for r in trainingTexts:
    rating = r.rating
    freq = r.freq_dist
    # choose the appropriate counts dict
    if rating > 0:
        partition = pos_dict
    elif rating < 0:
        partition = neg_dict
    else:
        continue
    # add the information to the correct counts dict
    for word, count in freq.iteritems():
        partition[word] += count

# Turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
    return sorted(filter(lambda (w, c): w not in filtered, counts.items()),
                  key=itemgetter(1),
                  reverse=True)

only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)
And the result is:
>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]
