I am trying to build an inverted index, i.e. map each word to the document it came from and its position within the list/document. In my case I have a parsed list containing lists (i.e. a list of lists). My input is like this:
[
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
This is my code:
def create_inverted(mylists):
    myDict = {}
    for sublist in mylists:
        for i in range(len(sublist)):
            if sublist[i] in myDict:
                myDict[sublist[i]].append(i)
            else:
                myDict[sublist[i]] = [i]
    return myDict
It does build the dictionary, but when I do a search I am not getting the correct result. I am trying to do something like this:
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
index = {'owl': [0, 2],
         'lion': [0, 1],  # IDs are sorted.
         'deer': [1],
         'leopard': [2]}

def indexed_search(documents, index, query):
    return [documents[doc_id] for doc_id in index[query]]

print indexed_search(documents, index, 'lion')
Where I can enter search text and it gets the list IDs. Any ideas?
You're mapping each word to the positions it was found at within each document, not to which document it was found in. You should store indexes into the list of documents instead of indexes into the documents themselves, or perhaps just map words to documents directly instead of to indices:
def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index
Most of this is the same as your code. The main differences are in these two lines:
for i, document in enumerate(documents):
for word in set(document):
which correspond to the following part of your code:
for sublist in mylists:
for i in range(len(sublist)):
enumerate iterates over the indices and elements of a sequence. Since enumerate is on the outer loop, i in my code is the index of the document, while i in your code is the index of a word within a document.
set(document) creates a set of the words in the document, in which each word appears only once. This ensures that each word is only counted once per document, rather than, say, having ten occurrences of 2 in the list for 'Cheetos' if 'Cheetos' appears ten times in document 2.
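To tie this back to the question's indexed_search, here is a minimal Python 3 sketch combining the two (my own assembly; setdefault replaces the if/else, and I sort the posting list before lookup since set iteration order is arbitrary):

```python
# Sketch: the corrected index used with the question's search function.
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            index.setdefault(word, []).append(i)  # document id, not word position
    return index

def indexed_search(documents, index, query):
    # Sort the posting list so results come back in document order.
    return [documents[doc_id] for doc_id in sorted(index[query])]

index = create_inverted_index(documents)
print(indexed_search(documents, index, 'lion'))  # [['owl', 'lion'], ['lion', 'deer']]
```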
First I would extract all possible words and store them in one set. Then I would look up each word in each list and collect the indexes of all the lists the word happens to be in:
source = [
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
allWords = set(word for lst in source for word in lst)
wordDict = {word: [i for i, lst in enumerate(source) if word in lst]
            for word in allWords}
print wordDict
Out[30]:
{'a': [1, 2, 3],
'afraid': [3],
'always': [1, 4],
'and': [2],
...
This is straightforward as long as you don't need efficient code:
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index
Now you get a two-level dictionary giving you access to the document ids, and then to the positions of the terms in this document:
>>> index(documents)
{'lion': {1: [2], 2: [1]}, 'leopard': {3: [2]}, 'deer': {2: [2]}, 'owl': {1: [1], 3: [1]}}
This is only a preliminary step for indexing; afterwards, you need to separate the term dictionary from the document postings and from the positions postings. Typically, the dictionary is stored in a tree-like structure (there are Python packages for this), and the document postings and positions postings are represented as arrays of unsigned integers.
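As a tiny illustration of that last point (my own sketch, not the full structure described), the standard-library array module can pack postings as unsigned integers:

```python
from array import array

# Sketch: postings for one term, packed with type code 'I' (unsigned int).
doc_ids = array('I', [1, 3])            # documents containing the term
positions = {1: array('I', [2, 7]),     # positions of the term in document 1
             3: array('I', [5])}        # positions of the term in document 3

print(list(doc_ids), list(positions[1]))  # [1, 3] [2, 7]
```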
I'd accumulate the indices into a set to avoid duplicates and then sort
>>> documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
>>> from collections import defaultdict
>>> D = defaultdict(set)
>>> for i, doc in enumerate(documents):
... for word in doc:
... D[word].add(i)
...
>>> D ## Take a look at the defaultdict
defaultdict(<class 'set'>, {'owl': {0, 2}, 'leopard': {2}, 'lion': {0, 1}, 'deer': {1}})
>>> {k:sorted(v) for k,v in D.items()}
{'lion': [0, 1], 'owl': [0, 2], 'leopard': [2], 'deer': [1]}
Python beginner here. I made this function to find the 10 most frequent words in a dictionary called "Counts".
The thing is, I have to exclude all the items from the englprep, englconj, englpronouns and specialwords lists from the "Counts" dictionary, and then get the top 10 most frequent words returned as a dictionary. Basically I have to get the "getMostFrequent()" function to take the "Counts" dictionary and the specified lists of "no-no" words as an input to output a new dictionary containing the 10 most frequent words.
I have tried for hours but I can't for the life of me get this to work.
The expected output should be somewhere along the lines of: {'river': 755, 'party': 527, 'water': 472, etc...}
but I just get: {'the': 16517,
'of': 8550,
'and': 6390,
'to': 5471,
'a': 3508,
'in': 3298,
'was': 2371,
'on': 2094,
'that': 1893,
'he': 1557}, which contains words that I specified not to be included. I would really appreciate some help or maybe even a possible solution. Thanks in advance to anyone willing to help. PS! I use Python 3.8.
def countWords():
    Counts = {}
    for x in wordList:
        if not x in Counts:
            Counts[x] = wordList.count(x)
    return Counts

def getMostFrequent():
    exclWordList = tuple(englConj), tuple(englPrep), tuple(englPronouns), tuple(specialWords)
    topNumber = 10
    topFreqWords = dict(sorted(Counts.items(), key=lambda x: x[1], reverse=True)[:topNumber])
    new_dict = {}
    for key, value in topFreqWords.items():
        for index in exclWordList:
            for y in index:
                if value is not y:
                    new_dict[key] = value
    topFreqWords = new_dict
    return topFreqWords
if __name__ == "__main__":
    Counts = countWords()
    englPrep = ['about', 'beside', 'near', 'to', 'above', 'between', 'of',
                'towards', 'across', 'beyond', 'off', 'under', 'after', 'by',
                'on', 'underneath', 'against', 'despite', 'onto', 'unlike',
                'along', 'down', 'opposite', 'until', 'among', 'during', 'out',
                'up', 'around', 'except', 'outside', 'along', 'as', 'for',
                'over', 'via', 'at', 'from', 'past', 'with', 'before', 'in',
                'round', 'within', 'behind', 'inside', 'since', 'without',
                'below', 'into', 'than', 'beneath', 'like', 'through']
    englConj = ['for', 'and', 'nor', 'but', 'or', 'yet', 'so']
    englPronouns = ['you', 'he', 'she', 'him', 'her', 'his', 'hers', 'yours']
    specialWords = ['the']
    topFreqWords = getMostFrequent()
Try passing the Counts dictionary in as getMostFrequent(Counts). Your function should accept it as a parameter rather than relying on Counts being declared in global scope. In your code you take the top 10 most frequent words including stopwords; you need to remove the stopwords from Counts before sorting the dict by value.
def getMostFrequent(Counts, englPrep, englConj, englPronouns, specialWords):
    exclWordList = set(englConj + englPrep + englPronouns + specialWords)
    popitems = exclWordList.intersection(Counts.keys())
    for i in popitems:
        Counts.pop(i)
    topNumber = 10
    topFreqWords = dict(sorted(Counts.items(), key=lambda x: x[1], reverse=True)[:topNumber])
    return topFreqWords
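A quick sanity check with toy data (my own sample, not the question's word counts; note that the function pops the excluded words out of Counts in place):

```python
# Hypothetical sample counts standing in for the question's Counts dictionary.
Counts = {'the': 100, 'and': 50, 'river': 9, 'party': 7, 'water': 5}
englPrep, englConj = ['to'], ['and']
englPronouns, specialWords = ['he'], ['the']

def getMostFrequent(Counts, englPrep, englConj, englPronouns, specialWords):
    exclWordList = set(englConj + englPrep + englPronouns + specialWords)
    for word in exclWordList.intersection(Counts.keys()):
        Counts.pop(word)  # mutates Counts in place
    return dict(sorted(Counts.items(), key=lambda x: x[1], reverse=True)[:10])

result = getMostFrequent(Counts, englPrep, englConj, englPronouns, specialWords)
print(result)  # {'river': 9, 'party': 7, 'water': 5}
```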
I have a number of sentences which I would like to split on specific words (e.g. and). However, when splitting the sentences sometimes there are two or more combinations of a word I'd like to split on in a sentence.
Example sentences:
['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home', 'and', 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was', 'because', 'he', 'is', 'been', 'told', 'not', 'to', 'talk']
so I have written some code to split a sentence:
split_on_word = []
no_splitting = []
indexPosList = [i for i in range(len(kth)) if kth[i] == 'and']  # check if word is in sentence
for e in example:
    kth = e.split()  # split strings into a list so it looks like the example sentence
    for n in indexPosList:
        if n > 4:  # only split when the word's position is 4 or more
            h = e.split("and")
            for i in h:
                split_on_word.append(i)  # append split sentences
        else:
            no_splitting.append(kth)  # append sentences that don't need to be split
However, you can see that when using this code more than once (e.g. replacing the word to split on with another) I will create duplicates or partial duplicates of the sentences that I append to a new list. Is there any way to check for multiple conditions, so that if a sentence contains both words (or other combinations of them) I can split the sentence in one go?
The output from the examples should then look like this:
['i', 'am', 'just', 'hoping', 'for', 'strength']
['guidance', 'because']
['i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home']
[ 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was']
['he', 'is', 'been', 'told', 'not', 'to', 'talk']
You can use itertools.groupby with a function that checks whether a word is a split-word:
In [11]: split_words = {'and', 'because'}
In [12]: [list(g) for k, g in it.groupby(example, key=lambda x: x not in split_words) if k]
Out[12]:
[['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home'],
['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was'],
['he', 'is', 'been', 'told', 'not', 'to', 'talk']]
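As a self-contained script (my expansion of the one-liner above; it assumes itertools was imported as it, as the In[12] prompt implies):

```python
import itertools as it

split_words = {'and', 'because'}

def split_sentence(sentence):
    # Group consecutive words that are NOT split-words; keep only those groups (k is True).
    return [list(g) for k, g in it.groupby(sentence, key=lambda w: w not in split_words) if k]

example = ['i', 'am', 'just', 'hoping', 'for', 'strength', 'and',
           'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
print(split_sentence(example))
```

Note that both split-words are dropped from the output, so 'because' does not survive in the 'guidance' fragment the way it does in the question's expected output.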
I am continuing with a coding exercise which has me return a dictionary where the key is the length of a word and the value is the word itself. This is done by splitting a text, which is the parameter passed to the get_word_len_dict(text) function and counting the number of characters. The length is then sorted and outputted in print_dict_in_key_order(a_dict).
I get an output like this:
2 : ['to', 'is']
3 : ['why', 'you', 'say', 'are', 'but', 'the', 'wet']
4 : ['does', 'when', 'four', 'they', 'have']
5 : ['there', 'stars', 'check', 'paint']
7 : ['someone', 'believe', 'billion']
Which looks right, but what if I wanted to order the values within each list alphabetically? That means that words starting with capitals should also be prioritised, for example ['May', 'and'].
Ideally, I would want an output like this with the values in alphabetical order:
2 : ['is', 'to']
3 : ['are', 'but', 'say', 'the', 'wet', 'why', 'you']
4 : ['does', 'four', 'have', 'they', 'when']
5 : ['check', 'paint', 'stars', 'there']
7 : ['believe', 'billion', 'someone']
I have managed to sort the keys so far within the print_dict_in_key_order(a_dict), but not sure how to go about it if I want to also sort the values?
def get_word_len_dict(text):
    dictionary = {}
    word_list = text.split()
    for word in word_list:
        letter = len(word)
        dictionary.setdefault(letter, [])
        if word not in dictionary[letter]:
            dictionary[letter].append(word)
    return dictionary

def test_get_word_len_dict():
    text = 'why does someone believe you when you say there are four billion stars but they have to check when you say the paint is wet'
    the_dict = get_word_len_dict(text)
    print_dict_in_key_order(the_dict)

def print_dict_in_key_order(a_dict):
    all_keys = list(a_dict.keys())
    all_keys.sort()
    for key in all_keys:
        print(key, ":", a_dict[key])
What you want to do is to group by length and then sort by value (since uppercase letters are "smaller" than lowercase letters when compared lexicographically), then remove duplicates from each group and put everything in a dict comprehension.
Note that itertools.groupby, unlike the analogous function in, say, pandas, will treat noncontiguous groups as distinct, so we need to sort by length first.
Example:
from itertools import groupby
from pprint import pprint

def solution(sentence):
    sorted_words = sorted(sentence.split(' '), key=len)
    return {length: sorted(set(words)) for length, words in groupby(sorted_words, len)}

sentence = 'Why does someone believe you when you say there are four billion stars but they have to check when you say the paint is wet'
pprint(solution(sentence))
Output:
{2: ['is', 'to'],
3: ['Why', 'are', 'but', 'say', 'the', 'wet', 'you'],
4: ['does', 'four', 'have', 'they', 'when'],
5: ['check', 'paint', 'stars', 'there'],
7: ['believe', 'billion', 'someone']}
Notice that 'Why' comes before the others because it starts with a capital letter, and the rest are sorted alphabetically.
If you want to retain your function structure, you can just sort each list in your dictionary inplace:
def get_word_len_dict(text):
    dictionary = {}
    word_list = text.split()
    for word in word_list:
        letter = len(word)
        dictionary.setdefault(letter, [])
        if word not in dictionary[letter]:
            dictionary[letter].append(word)
    for words in dictionary.values():
        words.sort()
    return dictionary
Given this dict
d = {
    2: ['to', 'is'],
    3: ['why', 'you', 'say', 'are', 'but', 'the', 'wet'],
    4: ['does', 'when', 'four', 'they', 'have'],
    5: ['there', 'stars', 'check', 'paint'],
    7: ['someone', 'believe', 'billion'],
}
You can sort the values like this:
{k: sorted(v) for k, v in d.items()}
Output (via pprint):
{2: ['is', 'to'],
3: ['are', 'but', 'say', 'the', 'wet', 'why', 'you'],
4: ['does', 'four', 'have', 'they', 'when'],
5: ['check', 'paint', 'stars', 'there'],
7: ['believe', 'billion', 'someone']}
Though if you only care about sorting it when printing, just change this line in your code:
print(key, ":", a_dict[key])
to this:
print(key, ":", sorted(a_dict[key]))
d = {
    2: ['to', 'is'],
    3: ['why', 'you', 'say', 'are', 'but', 'the', 'wet'],
    4: ['does', 'when', 'four', 'they', 'have'],
    5: ['there', 'stars', 'check', 'paint'],
    7: ['someone', 'believe', 'billion'],
}

for i in d:
    d[i].sort()
print(d)
output
{
    2: ['is', 'to'],
    3: ['are', 'but', 'say', 'the', 'wet', 'why', 'you'],
    4: ['does', 'four', 'have', 'they', 'when'],
    5: ['check', 'paint', 'stars', 'there'],
    7: ['believe', 'billion', 'someone']
}
I'm trying to find the top 50 words that occur within three texts of Shakespeare and the ratio of each word's occurrence in macbeth.txt, allswell.txt, and othello.txt. Here is my code so far:
def byFreq(pair):
    return pair[1]

def shakespeare():
    counts = {}
    A = []
    for words in ['macbeth.txt', 'allswell.txt', 'othello.txt']:
        text = open(words, 'r').read()
        test = text.lower()
        for ch in '!"$%&()*+,-./:;<=>?#[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        words = text.split()
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        items = list(counts.items())
        items.sort()
        items.sort(key=byFreq, reverse=True)
        for i in range(50):
            word, count = items[i]
            count = count / float(len(counts))
            A += [[word, count]]
    print A
And its output:
>>> shakespeare()
[['the', 0.12929982922664066], ['and', 0.09148572822639668], ['I', 0.08075140278116613], ['of', 0.07684801171017322], ['to', 0.07562820200048792], ['a', 0.05220785557453037], ['you', 0.04415711149060746], ['in', 0.041717492071236886], ['And', 0.04147353012929983], ['my', 0.04147353012929983], ['is', 0.03927787265186631], ['not', 0.03781410100024396], ['that', 0.0358624054647475], ['it', 0.03366674798731398], ['Macb', 0.03342278604537692], ['with', 0.03269090021956575], ['his', 0.03147109050988046], ['be', 0.03025128080019517], ['The', 0.028787509148572824], ['haue', 0.028543547206635766], ['me', 0.027079775555013418], ['your', 0.02683581361307636], ['our', 0.025128080019516955], ['him', 0.021956574774335203], ['Enter', 0.019516955354964626], ['That', 0.019516955354964626], ['for', 0.01927299341302757], ['this', 0.01927299341302757], ['he', 0.018541107587216395], ['To', 0.01780922176140522], ['so', 0.017077335935594046], ['all', 0.0156135642839717], ['What', 0.015369602342034643], ['are', 0.015369602342034643], ['thou', 0.015369602342034643], ['will', 0.015125640400097584], ['Macbeth', 0.014881678458160527], ['thee', 0.014881678458160527], ['But', 0.014637716516223469], ['but', 0.014637716516223469], ['Macd', 0.014149792632349353], ['they', 0.014149792632349353], ['their', 0.013905830690412296], ['we', 0.013905830690412296], ['as', 0.01341790680653818], ['vs', 0.01341790680653818], ['King', 0.013173944864601122], ['on', 0.013173944864601122], ['yet', 0.012198097096852892], ['Rosse', 0.011954135154915833], ['the', 0.15813168261114238], ['I', 0.14279684862127182], ['and', 0.1231007315700619], ['to', 0.10875070343275182], ['of', 0.10481148002250985], ['a', 0.08581879572312887], ['you', 0.08581879572312887], ['my', 0.06992121553179516], ['in', 0.061902082160945414], ['is', 0.05852560495216657], ['not', 0.05486775464265616], ['it', 0.05472706809229038], ['that', 0.05472706809229038], ['his', 0.04727068092290377], ['your', 0.04389420371412493], ['me', 
0.043753517163759144], ['be', 0.04305008441193022], ['And', 0.04037703995498031], ['with', 0.038266741699493526], ['him', 0.037703995498030385], ['for', 0.03601575689364097], ['he', 0.03404614518851998], ['The', 0.03137310073157006], ['this', 0.030810354530106922], ['her', 0.029262802476083285], ['will', 0.0291221159257175], ['so', 0.027011817670230726], ['have', 0.02687113111986494], ['our', 0.02687113111986494], ['but', 0.024760832864378166], ['That', 0.02293190770962296], ['PAROLLES', 0.022791221159257174], ['To', 0.021384355655599326], ['all', 0.021384355655599326], ['shall', 0.021102982554867755], ['are', 0.02096229600450197], ['as', 0.02096229600450197], ['thou', 0.02039954980303883], ['Macb', 0.019274057400112548], ['thee', 0.019274057400112548], ['no', 0.01871131119864941], ['But', 0.01842993809791784], ['Enter', 0.01814856499718627], ['BERTRAM', 0.01758581879572313], ['HELENA', 0.01730444569499156], ['we', 0.01730444569499156], ['do', 0.017163759144625774], ['thy', 0.017163759144625774], ['was', 0.01674169949352842], ['haue', 0.016460326392796848], ['I', 0.19463784682531435], ['the', 0.17894627455055595], ['and', 0.1472513769094877], ['to', 0.12989712147978802], ['of', 0.12002494024732412], ['you', 0.1079704873739998], ['a', 0.10339810869791126], ['my', 0.0909279850358516], ['in', 0.07627558973293151], ['not', 0.07159929335965914], ['is', 0.0697287748103502], ['it', 0.0676504208666736], ['that', 0.06733866777512211], ['me', 0.06099968824690845], ['your', 0.0543489556271433], ['And', 0.053205860958121166], ['be', 0.05310194326093734], ['his', 0.05154317780317988], ['with', 0.04769822300737816], ['him', 0.04665904603553985], ['her', 0.04364543281720877], ['for', 0.04322976202847345], ['he', 0.042190585056635144], ['this', 0.04187883196508366], ['will', 0.035332017042502335], ['Iago', 0.03522809934531851], ['so', 0.03356541619037722], ['The', 0.03325366309882573], ['haue', 0.031902733035435935], ['do', 0.03138314454951678], ['but', 0.030240049880494647], 
['That', 0.02857736672555336], ['thou', 0.027642107450898887], ['as', 0.027434272056531227], ['To', 0.026810765873428243], ['our', 0.02504416502130313], ['are', 0.024628494232567806], ['But', 0.024420658838200146], ['all', 0.024316741141016316], ['What', 0.024212823443832486], ['shall', 0.024004988049464823], ['on', 0.02265405798607503], ['thee', 0.022134469500155875], ['Enter', 0.021822716408604385], ['thy', 0.021199210225501402], ['no', 0.020783539436766082], ['she', 0.02026395095084693], ['am', 0.02005611555647927], ['by', 0.019848280162111608], ['have', 0.019848280162111608]]
Instead of outputting the top 50 words of all three texts, it outputs the top 50 words of each text, 150 words. I'm struggling to delete the duplicates while adding their ratios together. For example, in macbeth.txt the word 'the' has a ratio of 0.12929982922664066, in allswell.txt a ratio of 0.15813168261114238, and in othello.txt a ratio of 0.17894627455055595. I want to combine the ratios of all three of them. I'm pretty sure I have to use a for loop, but I'm struggling to loop through a list within a list. I am more of a Java guy, so any help would be appreciated!
You can use a list comprehension and the Counter-class:
from collections import Counter
c = Counter([word for file in ['macbeth.txt', 'allswell.txt', 'othello.txt']
             for word in open(file).read().split()])
Then you get a dict which maps words to their counts. You can sort them like this:
sorted([(i,v) for v,i in c.items()])
If you want the relative quantities, then you can calculate the total number of words:
numWords = sum([i for (v,i) in c.items()])
and adapt the dict c via a dict-comprehension:
c = { v:(i/numWords) for (v,i) in c.items()}
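Putting those pieces together (a sketch with inline sample text standing in for the question's .txt files, which I don't have), the combined relative frequencies can be computed and ranked like this:

```python
from collections import Counter

# Hypothetical inline texts stand in for macbeth.txt, allswell.txt, othello.txt.
texts = ["the cat and the hat", "the cat sat"]
c = Counter(word for t in texts for word in t.lower().split())
numWords = sum(c.values())
ratios = {word: count / numWords for word, count in c.items()}
top = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # [('the', 0.375), ('cat', 0.25), ('and', 0.125)]
```

Because the counts are pooled across all texts before ranking, each word appears once with its combined ratio.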
You're summarizing the count inside your loop over files. Move the summary code outside your for loop.
Given the following list
['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats',
'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and',
'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.',
'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats',
'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise',
'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle',
'Moon', 'to', 'rise.', '']
I am trying to count how many times each word appears and display the top 3.
However I am only looking to find the top three that have the first letter capitalized and ignore all words that do not have the first letter capitalized.
I am sure there is a better way than this, but my idea was to do the following:
put the first word in the list into another list called uniquewords
delete the first word and all its duplicated from the original list
add the new first word into unique words
delete the first word and all its duplicated from original list.
etc...
until the original list is empty....
count how many times each word in uniquewords appears in the original list
find top 3 and print
In Python 2.7 and above there is a class called Counter which can help you:
from collections import Counter
words_to_count = (word for word in word_list if word[:1].isupper())
c = Counter(words_to_count)
print c.most_common(3)
Result:
[('Jellicle', 6), ('Cats', 5), ('And', 2)]
I am quite new to programming so please try and do it in the most barebones fashion.
You could instead do this using a dictionary with the key being a word and the value being the count for that word. First iterate over the words adding them to the dictionary if they are not present, or else increasing the count for the word if it is present. Then to find the top three you can either use a simple O(n*log(n)) sorting algorithm and take the first three elements from the result, or you can use a O(n) algorithm that scans the list once remembering only the top three elements.
An important observation for beginners is that by using builtin classes that are designed for the purpose you can save yourself a lot of work and/or get better performance. It is good to be familiar with the standard library and the features it offers.
If you are using an earlier version of Python or you have a very good reason to roll your own word counter (I'd like to hear it!), you could try the following approach using a dict.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> word_list = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
>>> word_counter = {}
>>> for word in word_list:
... if word in word_counter:
... word_counter[word] += 1
... else:
... word_counter[word] = 1
...
>>> popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
>>>
>>> top_3 = popular_words[:3]
>>>
>>> top_3
['Jellicle', 'Cats', 'and']
Top Tip: The interactive Python interpretor is your friend whenever you want to play with an algorithm like this. Just type it in and watch it go, inspecting elements along the way.
To just return a list containing the most common words:
from collections import Counter
words=["i", "love", "you", "i", "you", "a", "are", "you", "you", "fine", "green"]
most_common_words= [word for word, word_count in Counter(words).most_common(3)]
print most_common_words
this prints:
['you', 'i', 'a']
the 3 in "most_common(3)", specifies the number of items to print.
Counter(words).most_common() returns a list of tuples, with each tuple having the word as the first member and the frequency as the second member. The tuples are ordered by the frequency of the word.
most_common = [item for item in Counter(words).most_common()]
print(str(most_common))
[('you', 4), ('i', 2), ('a', 1), ('are', 1), ('green', 1), ('love', 1), ('fine', 1)]
"the word for word, word_counter in", extracts only the first member of the tuple.
Isn't it just this?
word_list=['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats',
'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and',
'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.',
'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats',
'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise',
'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle',
'Moon', 'to', 'rise.', '']
from collections import Counter
c = Counter(word_list)
c.most_common(3)
Which should output
[('Jellicle', 6), ('Cats', 5), ('are', 3)]
There are two standard-library ways to find the most frequent value in a list:
statistics.mode:
from statistics import mode
most_common = mode([3, 2, 2, 2, 1, 1]) # 2
most_common = mode([3, 2]) # StatisticsError: no unique mode
Raises an exception if there's no unique most frequent value
Only returns single most frequent value
collections.Counter.most_common:
from collections import Counter
most_common, count = Counter([3, 2, 2, 2, 1, 1]).most_common(1)[0] # 2, 3
(most_common_1, count_1), (most_common_2, count_2) = Counter([3, 2, 2]).most_common(2) # (2, 2), (3, 1)
Can return multiple most frequent values
Returns element count as well
So in the case of the question, the second one would be the right choice. As a side note, both are identical in terms of performance.
nltk is convenient for a lot of language processing stuff. It has methods for frequency distribution built in. Something like:
import nltk
fdist = nltk.FreqDist(your_list) # creates a frequency distribution from a list
most_common = fdist.max() # returns a single element
top_three = fdist.keys()[:3] # returns a list
A simple, two-line solution to this, which does not require any extra modules is the following code:
lst = ['Jellicle', 'Cats', 'are', 'black', 'and','white,',
'Jellicle', 'Cats','are', 'rather', 'small;', 'Jellicle',
'Cats', 'are', 'merry', 'and','bright,', 'And', 'pleasant',
'to','hear', 'when', 'they', 'caterwaul.','Jellicle',
'Cats', 'have','cheerful', 'faces,', 'Jellicle',
'Cats','have', 'bright', 'black','eyes;', 'They', 'like',
'to', 'practise','their', 'airs', 'and', 'graces', 'And',
'wait', 'for', 'the', 'Jellicle','Moon', 'to', 'rise.', '']
lst_sorted=sorted([ss for ss in set(lst) if len(ss)>0 and ss.istitle()],
key=lst.count,
reverse=True)
print lst_sorted[0:3]
Output:
['Jellicle', 'Cats', 'And']
The term in square brackets returns all unique strings in the list which are not empty and start with a capital letter. The sorted() function then sorts them by how often they appear in the list (by using lst.count as the key) in reverse order.
The simple way of doing this would be (assuming your list is in 'l'):
>>> counter = {}
>>> for i in l: counter[i] = counter.get(i, 0) + 1
>>> sorted([ (freq,word) for word, freq in counter.items() ], reverse=True)[:3]
[(6, 'Jellicle'), (5, 'Cats'), (3, 'to')]
Complete sample:
>>> l = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
>>> counter = {}
>>> for i in l: counter[i] = counter.get(i, 0) + 1
...
>>> counter
{'and': 3, '': 1, 'merry': 1, 'rise.': 1, 'small;': 1, 'Moon': 1, 'cheerful': 1, 'bright': 1, 'Cats': 5, 'are': 3, 'have': 2, 'bright,': 1, 'for': 1, 'their': 1, 'rather': 1, 'when': 1, 'to': 3, 'airs': 1, 'black': 2, 'They': 1, 'practise': 1, 'caterwaul.': 1, 'pleasant': 1, 'hear': 1, 'they': 1, 'white,': 1, 'wait': 1, 'And': 2, 'like': 1, 'Jellicle': 6, 'eyes;': 1, 'the': 1, 'faces,': 1, 'graces': 1}
>>> sorted([ (freq,word) for word, freq in counter.items() ], reverse=True)[:3]
[(6, 'Jellicle'), (5, 'Cats'), (3, 'to')]
By "simple" I mean working in nearly every version of Python.
if you don't understand some of the functions used in this sample, you can always do this in the interpreter (after pasting the code above):
>>> help(counter.get)
>>> help(sorted)
The answer from #Mark Byers is best, but if you are on a version of Python < 2.7 (but at least 2.5, which is pretty old these days), you can replicate the Counter class functionality very simply via defaultdict (otherwise, for python < 2.5, three extra lines of code are needed before d[i] +=1, as in #Johnnysweb's answer).
from collections import defaultdict
class Counter():
    ITEMS = []
    def __init__(self, items):
        d = defaultdict(int)
        for i in items:
            d[i] += 1
        self.ITEMS = sorted(d.iteritems(), reverse=True, key=lambda i: i[1])
    def most_common(self, n):
        return self.ITEMS[:n]
Then, you use the class exactly as in Mark Byers's answer, i.e.:
words_to_count = (word for word in word_list if word[:1].isupper())
c = Counter(words_to_count)
print c.most_common(3)
I would like to answer this with numpy, a great, powerful array-computation module in Python.
Here is code snippet:
import numpy
a = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats',
'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and',
'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.',
'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats',
'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise',
'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle',
'Moon', 'to', 'rise.', '']
dict(zip(*numpy.unique(a, return_counts=True)))
Output
{'': 1, 'And': 2, 'Cats': 5, 'Jellicle': 6, 'Moon': 1, 'They': 1, 'airs': 1, 'and': 3, 'are': 3, 'black': 2, 'bright': 1, 'bright,': 1, 'caterwaul.': 1, 'cheerful': 1, 'eyes;': 1, 'faces,': 1, 'for': 1, 'graces': 1, 'have': 2, 'hear': 1, 'like': 1, 'merry': 1, 'pleasant': 1, 'practise': 1, 'rather': 1, 'rise.': 1, 'small;': 1, 'the': 1, 'their': 1, 'they': 1, 'to': 3, 'wait': 1, 'when': 1, 'white,': 1}
The output is a dictionary object of (key, value) pairs, where the value is the count of the particular word.
This answer is inspired by another answer on Stack Overflow; you can view it here.
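To get the top three from that result (my extension of this answer, using a smaller hypothetical list), numpy.argsort can rank the counts:

```python
import numpy

a = ['Jellicle', 'Cats', 'are', 'Jellicle', 'Cats', 'Jellicle']
words, counts = numpy.unique(a, return_counts=True)
order = numpy.argsort(counts)[::-1][:3]  # indices of the three largest counts
top3 = [(str(w), int(c)) for w, c in zip(words[order], counts[order])]
print(top3)  # [('Jellicle', 3), ('Cats', 2), ('are', 1)]
```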
If you are using Counter, or have created your own Counter-style dict, and want to show the name of each item and its count, you can iterate around the (word, count) pairs like so:
top_10_words = Counter(my_long_list_of_words)
# Iterate around the (word, count) pairs
for word, count in top_10_words.most_common(10):
    # print the word
    print word
    # print the count
    print count
or to iterate through those (word, count) pairs in a template:
{% for word in top_10_words %}
<p>Word: {{ word.0 }}</p>
<p>Count: {{ word.1 }}</p>
{% endfor %}
Hope this helps someone