Finding the common elements of a list - python

Hi as per the earlier post.
Given the following list:
['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
I am trying to count how many times each word THAT STARTS WITH A CAPITAL appears and display the top 3.
I am not interested in words that do not start with a capital.
If a word appears multiple times, sometimes starting with a capital and sometimes not, only count the times it does with a capital.
THis is what my code looks like currently:
words = ""
for word in open('novel.txt', 'rU'):
words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
word_counter = {}
for word in words:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
matches = []
for i in range(3):
print word_counter[top_3[i]], top_3[i]

#uncomment to produce the word file
##words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
##open('novel.txt','w').write('\n'.join(words))
import string
cap_words = [word.strip(string.punctuation) for word in open('novel.txt').read().split() if word.istitle()]
##print(cap_words) # debug
try:
from collections import Counter # Python >= 2.7
print('Counter')
print(Counter(cap_words).most_common(3))
except ImportError:
print('Normal dict')
wordcount= dict()
for word in cap_words:
wordcount[word] = (wordcount[word] + 1
if word in wordcount
else 1)
print(sorted(wordcount.items(), key = lambda x: x[1], reverse = True)[:3])
I do not get it why you would like to keep different kinds of line termination with 'rU' mode. I would use normally normal open as I wrote in edited code above.
EDIT: You had words together with punctuation, so cleaned up those with strip()

print "\n".join(sorted(["%d %s" % (lst.count(i), i) \
for i in set(lst) if i.istitle()])[-3:])
2 And
5 Cats
6 Jellicle

Here are some additional comments:
words = ""
for word in open('novel.txt', 'rU'):
words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
can be replaced with:
text = open('novel.txt', 'rU').read() # read everything
wordlist = text.split() # split on all whitespace
But you don't use your 'must start with a capital letter' requirement yet. Time to add:
capwordlist = (word for word in wordlist if word.istitle())
istitle() means word[0].isupper() and word[1:].islower(). This means 'SO'.istitle() -> False.
That might work for you, but maybe you just want the word[0].isupper() instead.
This part is good if you can't use collections.Counter (new in 2.7)
word_counter = {}
for word in capwordlist:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
else this simply becomes:
from collections import Counter
word_counter = Counter(capwords)
top_3 = word_counter.most_common(3) # gives `word, count` pairs!
And this:
for i in range(3):
print word_counter[top_3[i]], top_3[i]
can be this:
for word in top_3:
print word_counter[word], word

One thing i'd avoid is reading all the words in before processing. It will work, but IMHO, it's better not to do that if you don't need to, and you don't. Here's my solution (with elements liberally stolen from previous ones!), done with 2.6.2:
import sys
# a generator function which iterates over the words in a file
def words(f):
for line in f:
for word in line.split():
yield word
# returns a generator expression filtering an iterator down to titlecase words
def titles(s):
return (word for word in s if word.istitle())
# count the titlecase words in the file
count = {}
for word in titles(words(file(sys.argv[1]))):
count[word] = count.get(word, 0) + 1
# build a list of tuples with the count for each word
countsAndWords = [(kv[1], kv[0]) for kv in count.iteritems()]
# put them in decreasing order
countsAndWords.sort()
countsAndWords.reverse()
# print the top three
for count, word in countsAndWords[:3]:
print word, count
I do a sort of decorate-sort-undecorate on the counts rather than doing the sort with a comparator which does lookups in the count dictionary; it's less elegant, but i believe it will be faster. That's probably a sinful thing to do.

In general, word[0].isupper() will tel you if a word starts with an uppercase letter. Combine this into a list comprehension (or your loop)
[x for x in my_list if x[0].isupper()]
(assuming there are no empty strings)
and you get all words starting with an uppercase letter.

Since you aren't using Python2.7 and don't have Counter
from collections import defaultdict
counter = defaultdict(int)
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
for word in (word for word in words if word[0].isupper()):
counter[word]+=1
print counter

you could use itertools
import itertools
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
capwords = (word for word in words if len(word) > 1 and word[0].isupper())
capwordssorted = sorted(capwords)
wordswithcounts = ((k,len(list(g))) for (k,g) in itertools.groupby(capwordssorted))
print sorted(wordswithcounts,key=lambda x:x[1],reverse=True)[:3]

Related

Split list of strings on multiple conditions

I have a number of sentences which I would like to split on specific words (e.g. and). However, when splitting the sentences sometimes there are two or more combinations of a word I'd like to split on in a sentence.
Example sentences:
['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home', 'and', 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was', 'because', 'he', 'is', 'been', 'told', 'not', 'to', 'talk']
so I have written some code to split a sentence:
split_on_word = []
no_splitting = []
indexPosList = [ i for i in range(len(kth)) if kth[i] == 'and'] # check if word is in sentence
for e in example:
kth = e.split() # split strings into list so it looks like example sentence
for n in indexPosList:
if n > 4: # only split when the word's position is 4 or more
h = e.split("and")
for i in h:
split_on_word.append(i)# append split sentences
else:
no_splitting.append(kth) #append sentences that don't need to be split
However, you can see that when using this code more than once (e.g.: replace the word to split on with another) I will create duplicates or part duplicates of the sentences that I append to a new list.
Is there any way to check for multiple conditions, so that if a sentence contains both or other combinations of it that I split the sentence in one go?
The output from the examples should then look like this:
['i', 'am', 'just', 'hoping', 'for', 'strength']
['guidance', 'because']
['i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home']
[ 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was']
['he', 'is', 'been', 'told', 'not', 'to', 'talk']
You can use itertools.groupby with a function that checks whether a word is a split-word:
In [11]: split_words = {'and', 'because'}
In [12]: [list(g) for k, g in it.groupby(example, key=lambda x: x not in split_words) if k]
Out[12]:
[['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home'],
['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was'],
['he', 'is', 'been', 'told', 'not', 'to', 'talk']]

removing common words from a text file

I am trying to remove common words from a text. for example the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just unique words. This means removing "it", "but", "a" etc. I have a text file that has all the common words and another text file that contains a paragraph. How can I delete the common words in the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently. I have a text file called common.txt that has all the common words listed. How do I use that list to remove identical words in the sentence above. End output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
you would want to use "set" objects in python.
If order and number of occurrence are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
#Using "set" is so much faster
common_set = set(common_words)
[s for s in str_list if not s in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",","").replace(".","").split(" ")
occurs = {}
for word in l:
occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys()
if occurs[word] < 3:
resultx += word + " "
resultx = resultx[:-1]
you can change 3 with what you think suited or based it on the average using :
occurs.values()/len(occurs)
Additional
if you want it to be Case insensitive change the 1st line with :
l = text.replace(",","").replace(".","").lower().split(" ")
Most simple method would be just to read() your common.txt and then use list comprehension and only take the words that are not in the file we read
with open('common.txt') as f:
content = f.read()
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))

Finding duplicates in a list of a list, and adding their values

I'm trying to find the top 50 words that occur within three texts of Shakespeare and the ratio of each words occurrance in, macbeth.txt, allswell.txt, and othello.txt. Here is my code so far:
def byFreq(pair):
return pair[1]
def shakespeare():
counts = {}
A = []
for words in ['macbeth.txt','allswell.txt','othello.txt']:
text = open(words, 'r').read()
test = text.lower()
for ch in '!"$%&()*+,-./:;<=>?#[\\]^_`{|}~':
text = text.replace(ch, ' ')
words = text.split()
for w in words:
counts[w] = counts.get(w, 0) + 1
items = list(counts.items())
items.sort()
items.sort(key=byFreq, reverse = True)
for i in range(50):
word, count = items[i]
count = count / float(len(counts))
A += [[word, count]]
print A
And its output:
>>> shakespeare()
[['the', 0.12929982922664066], ['and', 0.09148572822639668], ['I', 0.08075140278116613], ['of', 0.07684801171017322], ['to', 0.07562820200048792], ['a', 0.05220785557453037], ['you', 0.04415711149060746], ['in', 0.041717492071236886], ['And', 0.04147353012929983], ['my', 0.04147353012929983], ['is', 0.03927787265186631], ['not', 0.03781410100024396], ['that', 0.0358624054647475], ['it', 0.03366674798731398], ['Macb', 0.03342278604537692], ['with', 0.03269090021956575], ['his', 0.03147109050988046], ['be', 0.03025128080019517], ['The', 0.028787509148572824], ['haue', 0.028543547206635766], ['me', 0.027079775555013418], ['your', 0.02683581361307636], ['our', 0.025128080019516955], ['him', 0.021956574774335203], ['Enter', 0.019516955354964626], ['That', 0.019516955354964626], ['for', 0.01927299341302757], ['this', 0.01927299341302757], ['he', 0.018541107587216395], ['To', 0.01780922176140522], ['so', 0.017077335935594046], ['all', 0.0156135642839717], ['What', 0.015369602342034643], ['are', 0.015369602342034643], ['thou', 0.015369602342034643], ['will', 0.015125640400097584], ['Macbeth', 0.014881678458160527], ['thee', 0.014881678458160527], ['But', 0.014637716516223469], ['but', 0.014637716516223469], ['Macd', 0.014149792632349353], ['they', 0.014149792632349353], ['their', 0.013905830690412296], ['we', 0.013905830690412296], ['as', 0.01341790680653818], ['vs', 0.01341790680653818], ['King', 0.013173944864601122], ['on', 0.013173944864601122], ['yet', 0.012198097096852892], ['Rosse', 0.011954135154915833], ['the', 0.15813168261114238], ['I', 0.14279684862127182], ['and', 0.1231007315700619], ['to', 0.10875070343275182], ['of', 0.10481148002250985], ['a', 0.08581879572312887], ['you', 0.08581879572312887], ['my', 0.06992121553179516], ['in', 0.061902082160945414], ['is', 0.05852560495216657], ['not', 0.05486775464265616], ['it', 0.05472706809229038], ['that', 0.05472706809229038], ['his', 0.04727068092290377], ['your', 0.04389420371412493], ['me', 0.043753517163759144], ['be', 0.04305008441193022], ['And', 0.04037703995498031], ['with', 0.038266741699493526], ['him', 0.037703995498030385], ['for', 0.03601575689364097], ['he', 0.03404614518851998], ['The', 0.03137310073157006], ['this', 0.030810354530106922], ['her', 0.029262802476083285], ['will', 0.0291221159257175], ['so', 0.027011817670230726], ['have', 0.02687113111986494], ['our', 0.02687113111986494], ['but', 0.024760832864378166], ['That', 0.02293190770962296], ['PAROLLES', 0.022791221159257174], ['To', 0.021384355655599326], ['all', 0.021384355655599326], ['shall', 0.021102982554867755], ['are', 0.02096229600450197], ['as', 0.02096229600450197], ['thou', 0.02039954980303883], ['Macb', 0.019274057400112548], ['thee', 0.019274057400112548], ['no', 0.01871131119864941], ['But', 0.01842993809791784], ['Enter', 0.01814856499718627], ['BERTRAM', 0.01758581879572313], ['HELENA', 0.01730444569499156], ['we', 0.01730444569499156], ['do', 0.017163759144625774], ['thy', 0.017163759144625774], ['was', 0.01674169949352842], ['haue', 0.016460326392796848], ['I', 0.19463784682531435], ['the', 0.17894627455055595], ['and', 0.1472513769094877], ['to', 0.12989712147978802], ['of', 0.12002494024732412], ['you', 0.1079704873739998], ['a', 0.10339810869791126], ['my', 0.0909279850358516], ['in', 0.07627558973293151], ['not', 0.07159929335965914], ['is', 0.0697287748103502], ['it', 0.0676504208666736], ['that', 0.06733866777512211], ['me', 0.06099968824690845], ['your', 0.0543489556271433], ['And', 0.053205860958121166], ['be', 0.05310194326093734], ['his', 0.05154317780317988], ['with', 0.04769822300737816], ['him', 0.04665904603553985], ['her', 0.04364543281720877], ['for', 0.04322976202847345], ['he', 0.042190585056635144], ['this', 0.04187883196508366], ['will', 0.035332017042502335], ['Iago', 0.03522809934531851], ['so', 0.03356541619037722], ['The', 0.03325366309882573], ['haue', 0.031902733035435935], ['do', 0.03138314454951678], ['but', 0.030240049880494647], ['That', 0.02857736672555336], ['thou', 0.027642107450898887], ['as', 0.027434272056531227], ['To', 0.026810765873428243], ['our', 0.02504416502130313], ['are', 0.024628494232567806], ['But', 0.024420658838200146], ['all', 0.024316741141016316], ['What', 0.024212823443832486], ['shall', 0.024004988049464823], ['on', 0.02265405798607503], ['thee', 0.022134469500155875], ['Enter', 0.021822716408604385], ['thy', 0.021199210225501402], ['no', 0.020783539436766082], ['she', 0.02026395095084693], ['am', 0.02005611555647927], ['by', 0.019848280162111608], ['have', 0.019848280162111608]]
Instead of outputing the top 50 words of all three texts, its outputs the top 50 words of each text, 150 words. Im struggling on trying to delete the duplicates but add their ratios together. For example, in macbeth.txt the word 'the' has a ratio of 0.12929982922664066, allswell.txt has a ratio of 0.15813168261114238, and othello.txt has a ratio of 0.17894627455055595. I want to combine the ratios of all three of them. I;m pretty sure I have to use a for loop but I'm struggling to loop through a list within a list. I am more of a java guy so any help would be appreciated!
You can use a list comprehension and the Counter-class:
from collections import Counter
c = Counter([word for file in ['macbeth.txt','allswell.txt','othello.txt']
for word in open(file).read().split()])
Then you get a dict which maps words to their counts. You can sort them like this:
sorted([(i,v) for v,i in c.items()])
If you want the relative quantities, then you can calculate the total number of words:
numWords = sum([i for (v,i) in c.items()])
and adapt the dict c via a dict-comprehension:
c = { v:(i/numWords) for (v,i) in c.items()}
You're summarizing the count inside your loop over files. Move the summary code outside your for loop.

Stripping punctuation from text file including 's and commas

I am creating a list of most frequent words in a text file. But I keep getting words and their possessive versions. Like iphone and iphone's. I also need to strip the commas after words like iphone and iphone, in my results. I want to count those words together as one entity.
Here is my entire code.
# Prompuser for text file to analyze
filename = input("Enter a valid text file to analyze: ")
# Open file and read it line by line
s = open(filename, 'r').read().lower()
# List of stop words that are not inportant to the meaning of the content
# These words will not be counted in the number of characters, or words.
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards']
stopwords += ['again', 'against', 'all', 'almost', 'alone', 'along']
stopwords += ['already', 'also', 'although', 'always', 'am', 'among']
stopwords += ['amongst', 'amoungst', 'amount', 'an', 'and', 'another']
stopwords += ['any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere']
stopwords += ['are', 'around', 'as', 'at', 'back', 'be', 'became']
stopwords += ['because', 'become', 'becomes', 'becoming', 'been']
stopwords += ['before', 'beforehand', 'behind', 'being', 'below']
stopwords += ['beside', 'besides', 'between', 'beyond', 'bill', 'both']
stopwords += ['bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant']
stopwords += ['co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de']
stopwords += ['describe', 'detail', 'did', 'do', 'done', 'down', 'due']
stopwords += ['during', 'each', 'eg', 'eight', 'either', 'eleven', 'else']
stopwords += ['elsewhere', 'empty', 'enough', 'etc', 'even', 'ever']
stopwords += ['every', 'everyone', 'everything', 'everywhere', 'except']
stopwords += ['few', 'fifteen', 'fifty', 'fill', 'find', 'fire']
stopwords += ['five', 'for', 'former', 'formerly', 'forty', 'found']
stopwords += ['four', 'from', 'front', 'full', 'further', 'get', 'give']
stopwords += ['go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her']
stopwords += ['here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers']
stopwords += ['herself', 'him', 'himself', 'his', 'how', 'however']
stopwords += ['hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed']
stopwords += ['interest', 'into', 'is', 'it', 'its', 'itself', 'keep']
stopwords += ['last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made']
stopwords += ['many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine']
stopwords += ['more', 'moreover', 'most', 'mostly', 'move', 'much']
stopwords += ['must', 'my', 'myself', 'name', 'namely', 'neither', 'never']
stopwords += ['nevertheless', 'next', 'nine', 'no', 'nobody', 'none']
stopwords += ['noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of']
stopwords += ['off', 'often', 'on','once', 'one', 'only', 'onto', 'or']
stopwords += ['other', 'others', 'otherwise', 'our', 'ours', 'ourselves']
stopwords += ['out', 'over', 'own', 'part', 'per', 'perhaps', 'please']
stopwords += ['put', 'rather', 're', 's', 'same', 'see', 'seem', 'seemed']
stopwords += ['seeming', 'seems', 'serious', 'several', 'she', 'should']
stopwords += ['show', 'side', 'since', 'sincere', 'six', 'sixty', 'so']
stopwords += ['some', 'somehow', 'someone', 'something', 'sometime']
stopwords += ['sometimes', 'somewhere', 'still', 'such', 'take']
stopwords += ['ten', 'than', 'that', 'the', 'their', 'them', 'themselves']
stopwords += ['then', 'thence', 'there', 'thereafter', 'thereby']
stopwords += ['therefore', 'therein', 'thereupon', 'these', 'they']
stopwords += ['thick', 'thin', 'third', 'this', 'those', 'though', 'three']
stopwords += ['three', 'through', 'throughout', 'thru', 'thus', 'to']
stopwords += ['together', 'too', 'top', 'toward', 'towards', 'twelve']
stopwords += ['twenty', 'two', 'un', 'under', 'until', 'up', 'upon']
stopwords += ['us', 'very', 'via', 'was', 'we', 'well', 'were', 'what']
stopwords += ['whatever', 'when', 'whence', 'whenever', 'where']
stopwords += ['whereafter', 'whereas', 'whereby', 'wherein', 'whereupon']
stopwords += ['wherever', 'whether', 'which', 'while', 'whither', 'who']
stopwords += ['whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with']
stopwords += ['within', 'without', 'would', 'yet', 'you', 'your']
stopwords += ['yours', 'yourself', 'the', '—', 'The', '9', '5', 'just']
# count characters
num_chars = len(s)
# count lines
num_lines = s.count('\n')
# Split the words in the file
words = s.split()
# Create empty dictionary
d = {}
for i in d:
i = i.replace('.','')
i = i.replace(',','')
i = i.replace('\'','')
d.append(i.split())
# Add words to dictionary if they are not in it already.
for w in words:
if w in d:
#increase the count of each word added to the dictionary by 1 each time it appears
d[w] += 1
else:
d[w] = 1
# Find the sum of each word's count in the dictionary. Drop the stopwords.
num_words = sum(d[w] for w in d if w not in stopwords)
# Create a list of words and their counts from file
# in the form number of occurrence word count word
lst = [(d[w], w) for w in d if w not in stopwords]
# Sort the list of words by the count of the word
lst.sort()
# Sort the word list from greatest to least
lst.reverse()
# Print number of characters, number of lines rea, then number of words
# that were counted minus the stop words.
print('Your input file has characters = ' + str(num_chars))
print('Your input file has num_lines = ' + str(num_lines))
print('Your input file has num_words = ' + str(num_words))
# Print the top 30 most frequent words
print('\n The 30 most frequent words are \n')
# Start list at 1, the most most frequent word
i = 1
# Create list for the first 30 frequent words
for count, word in lst[:30]:
# Create list with the number of occurrence the word count then the word
print('%2s. %4s %s' % (i, count, word))
# Increase the list number by 1 after each entry
i += 1
Here are some of my results.
1. 40 iphone
2. 15 users
3. 12 iphone’s
4. 9 music
5. 9 apple
6. 8 web
7. 7 new
Any help would be greatly appreciated. Thank You
You need to say for i in words: not for i in d:
You are iterating through an empty dictionary during the replacement steps, so nothing is changing. Just remove that loop and move the replacement steps to the top of the for w in words: loop, so you only have to make one pass through.
I would redo that whole section thusly:
for w in words:
w = w.replace('.','').replace(',','').replace('\'','').replace("’","")
d[w] = d.get(w,0) + 1
As it is now, you are also trying to split i before appending to the dictionary. It has already been split. Also, you need a key:value for the dictionary. Just give it a value of zero at that point?, so later you can count occurrences w/o testing?
Instead of testing if w in d: it will be hundreds (even thousands of times faster) to use .get() with a default value of zero (returned if w is not found), as shown above.

How to find most common elements of a list? [duplicate]

This question already has answers here:
Find the item with maximum occurrences in a list [duplicate]
(14 answers)
Closed 2 years ago.
The community reviewed whether to reopen this question 10 months ago and left it closed:
Original close reason(s) were not resolved
Given the following list
['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats',
'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and',
'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.',
'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats',
'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise',
'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle',
'Moon', 'to', 'rise.', '']
I am trying to count how many times each word appears and display the top 3.
However I am only looking to find the top three that have the first letter capitalized and ignore all words that do not have the first letter capitalized.
I am sure there is a better way than this, but my idea was to do the following:
put the first word in the list into another list called uniquewords
delete the first word and all its duplicated from the original list
add the new first word into unique words
delete the first word and all its duplicated from original list.
etc...
until the original list is empty....
count how many times each word in uniquewords appears in the original list
find top 3 and print
In Python 2.7 and above there is a class called Counter which can help you:
from collections import Counter
words_to_count = (word for word in word_list if word[:1].isupper())
c = Counter(words_to_count)
print c.most_common(3)
Result:
[('Jellicle', 6), ('Cats', 5), ('And', 2)]
I am quite new to programming so please try and do it in the most barebones fashion.
You could instead do this using a dictionary with the key being a word and the value being the count for that word. First iterate over the words adding them to the dictionary if they are not present, or else increasing the count for the word if it is present. Then to find the top three you can either use a simple O(n*log(n)) sorting algorithm and take the first three elements from the result, or you can use a O(n) algorithm that scans the list once remembering only the top three elements.
An important observation for beginners is that by using builtin classes that are designed for the purpose you can save yourself a lot of work and/or get better performance. It is good to be familiar with the standard library and the features it offers.
If you are using an earlier version of Python or you have a very good reason to roll your own word counter (I'd like to hear it!), you could try the following approach using a dict.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> word_list = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
>>> word_counter = {}
>>> for word in word_list:
... if word in word_counter:
... word_counter[word] += 1
... else:
... word_counter[word] = 1
...
>>> popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
>>>
>>> top_3 = popular_words[:3]
>>>
>>> top_3
['Jellicle', 'Cats', 'and']
Top Tip: The interactive Python interpretor is your friend whenever you want to play with an algorithm like this. Just type it in and watch it go, inspecting elements along the way.
To just return a list containing the most common words:
from collections import Counter
words=["i", "love", "you", "i", "you", "a", "are", "you", "you", "fine", "green"]
most_common_words= [word for word, word_count in Counter(words).most_common(3)]
print most_common_words
this prints:
['you', 'i', 'a']
the 3 in "most_common(3)", specifies the number of items to print.
Counter(words).most_common() returns a a list of tuples with each tuple having the word as the first member and the frequency as the second member.The tuples are ordered by the frequency of the word.
`most_common = [item for item in Counter(words).most_common()]
print(str(most_common))
[('you', 4), ('i', 2), ('a', 1), ('are', 1), ('green', 1), ('love',1), ('fine', 1)]`
"the word for word, word_counter in", extracts only the first member of the tuple.
Is't it just this ....
word_list=['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats',
'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and',
'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.',
'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats',
'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise',
'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle',
'Moon', 'to', 'rise.', '']
from collections import Counter
c = Counter(word_list)
c.most_common(3)
Which should output
[('Jellicle', 6), ('Cats', 5), ('are', 3)]
There's two standard library ways to find the most frequent value in a list:
statistics.mode:
from statistics import mode
most_common = mode([3, 2, 2, 2, 1, 1]) # 2
most_common = mode([3, 2]) # StatisticsError: no unique mode
Raises an exception if there's no unique most frequent value
Only returns single most frequent value
collections.Counter.most_common:
from collections import Counter
most_common, count = Counter([3, 2, 2, 2, 1, 1]).most_common(1)[0] # 2, 3
(most_common_1, count_1), (most_common_2, count_2) = Counter([3, 2, 2]).most_common(2) # (2, 2), (3, 1)
Can return multiple most frequent values
Returns element count as well
So in the case of the question, the second one would be the right choice. As a side note, both are identical in terms of performance.
nltk is convenient for a lot of language processing stuff. It has methods for frequency distribution built in. Something like:
import nltk
fdist = nltk.FreqDist(your_list) # creates a frequency distribution from a list
most_common = fdist.max() # returns a single element
top_three = fdist.keys()[:3] # returns a list
A simple, two-line solution to this, which does not require any extra modules is the following code:
lst = ['Jellicle', 'Cats', 'are', 'black', 'and','white,',
'Jellicle', 'Cats','are', 'rather', 'small;', 'Jellicle',
'Cats', 'are', 'merry', 'and','bright,', 'And', 'pleasant',
'to','hear', 'when', 'they', 'caterwaul.','Jellicle',
'Cats', 'have','cheerful', 'faces,', 'Jellicle',
'Cats','have', 'bright', 'black','eyes;', 'They', 'like',
'to', 'practise','their', 'airs', 'and', 'graces', 'And',
'wait', 'for', 'the', 'Jellicle','Moon', 'to', 'rise.', '']
lst_sorted=sorted([ss for ss in set(lst) if len(ss)>0 and ss.istitle()],
key=lst.count,
reverse=True)
print lst_sorted[0:3]
Output:
['Jellicle', 'Cats', 'And']
The term in squared brackets returns all unique strings in the list, which are not empty and start with a capital letter. The sorted() function then sorts them by how often they appear in the list (by using the lst.count key) in reverse order.
The simple way of doing this would be (assuming your list is in 'l'):
>>> counter = {}
>>> for i in l: counter[i] = counter.get(i, 0) + 1
>>> sorted([ (freq,word) for word, freq in counter.items() ], reverse=True)[:3]
[(6, 'Jellicle'), (5, 'Cats'), (3, 'to')]
Complete sample:
>>> l = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
>>> counter = {}
>>> for i in l: counter[i] = counter.get(i, 0) + 1
...
>>> counter
{'and': 3, '': 1, 'merry': 1, 'rise.': 1, 'small;': 1, 'Moon': 1, 'cheerful': 1, 'bright': 1, 'Cats': 5, 'are': 3, 'have': 2, 'bright,': 1, 'for': 1, 'their': 1, 'rather': 1, 'when': 1, 'to': 3, 'airs': 1, 'black': 2, 'They': 1, 'practise': 1, 'caterwaul.': 1, 'pleasant': 1, 'hear': 1, 'they': 1, 'white,': 1, 'wait': 1, 'And': 2, 'like': 1, 'Jellicle': 6, 'eyes;': 1, 'the': 1, 'faces,': 1, 'graces': 1}
>>> sorted([ (freq,word) for word, freq in counter.items() ], reverse=True)[:3]
[(6, 'Jellicle'), (5, 'Cats'), (3, 'to')]
With simple I mean working in nearly every version of python.
if you don't understand some of the functions used in this sample, you can always do this in the interpreter (after pasting the code above):
>>> help(counter.get)
>>> help(sorted)
The answer from #Mark Byers is best, but if you are on a version of Python < 2.7 (but at least 2.5, which is pretty old these days), you can replicate the Counter class functionality very simply via defaultdict (otherwise, for python < 2.5, three extra lines of code are needed before d[i] +=1, as in #Johnnysweb's answer).
from collections import defaultdict
class Counter():
ITEMS = []
def __init__(self, items):
d = defaultdict(int)
for i in items:
d[i] += 1
self.ITEMS = sorted(d.iteritems(), reverse=True, key=lambda i: i[1])
def most_common(self, n):
return self.ITEMS[:n]
Then, you use the class exactly as in Mark Byers's answer, i.e.:
words_to_count = (word for word in word_list if word[:1].isupper())
c = Counter(words_to_count)
print c.most_common(3)
I will like to answer this with numpy, great powerful array computation module in python.
Here is code snippet:
import numpy
a = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats',
'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and',
'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.',
'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats',
'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise',
'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle',
'Moon', 'to', 'rise.', '']
dict(zip(*numpy.unique(a, return_counts=True)))
Output
{'': 1, 'And': 2, 'Cats': 5, 'Jellicle': 6, 'Moon': 1, 'They': 1, 'airs': 1, 'and': 3, 'are': 3, 'black': 2, 'bright': 1, 'bright,': 1, 'caterwaul.': 1, 'cheerful': 1, 'eyes;': 1, 'faces,': 1, 'for': 1, 'graces': 1, 'have': 2, 'hear': 1, 'like': 1, 'merry': 1, 'pleasant': 1, 'practise': 1, 'rather': 1, 'rise.': 1, 'small;': 1, 'the': 1, 'their': 1, 'they': 1, 'to': 3, 'wait': 1, 'when': 1, 'white,': 1}
Output is in dictionary object in format of (key, value) pairs, where value is count of particular word
This answer is inspire by another answer on stackoverflow, you can view it here
If you are using Count, or have created your own Count-style dict and want to show the name of the item and the count of it, you can iterate around the dictionary like so:
top_10_words = Counter(my_long_list_of_words)
# Iterate around the dictionary
for word in top_10_words:
# print the word
print word[0]
# print the count
print word[1]
or to iterate through this in a template:
{% for word in top_10_words %}
<p>Word: {{ word.0 }}</p>
<p>Count: {{ word.1 }}</p>
{% endfor %}
Hope this helps someone

Categories

Resources