Consider
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
I have this list as input and I want to print the number of times each word occurs across the whole list. For example, student occurs 3 times, so the expected output is student=3, a=2, etc.
I was able to print the unique words in the doc, but not the occurrences. Here is the function I used:
def fit(dataset):
    unique_words = set()
    if isinstance(dataset, (list,)):
        for row in dataset:
            for word in row.split(" "):
                if len(word) < 2:  # note: this skips one-letter words such as "i" and "a"
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
    return unique_words
vocab = fit(doc)
print(vocab)
['am', 'are', 'both', 'fellow', 'good', 'hard', 'student', 'the', 'we', 'works']
I got this as output, but I want the number of occurrences of each of the unique words. How do I do this?
You just need to use Counter, and you will solve the problem by using a single line of code:
from collections import Counter
doc = ["i am a fellow student",
"we both are the good student",
"a student works hard"]
count = dict(Counter(word for sentence in doc for word in sentence.split()))
count is your desired dictionary:
{
'i': 1,
'am': 1,
'a': 2,
'fellow': 1,
'student': 3,
'we': 1,
'both': 1,
'are': 1,
'the': 1,
'good': 1,
'works': 1,
'hard': 1
}
So for example count['student'] == 3, count['a'] == 2 etc.
Here it's important to use split() instead of split(' '): this way you will not end up with "empty" words in count whenever the text contains consecutive spaces. Example:
>>> sentence = "Hello     world"  # note the multiple spaces
>>> dict(Counter(sentence.split(' ')))
{'Hello': 1, '': 4, 'world': 1}
>>> dict(Counter(sentence.split()))
{'Hello': 1, 'world': 1}
Use
from collections import Counter
Counter(" ".join(doc).split())
results in
Counter({'i': 1,
'am': 1,
'a': 2,
'fellow': 1,
'student': 3,
'we': 1,
'both': 1,
'are': 1,
'the': 1,
'good': 1,
'works': 1,
'hard': 1})
Explanation: first build one string with join, then split it on whitespace with split to get a list of single words, and finally use Counter to count the appearances of each word.
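To see the intermediate steps on the doc from the question:
>>> " ".join(doc)
'i am a fellow student we both are the good student a student works hard'
>>> " ".join(doc).split()
['i', 'am', 'a', 'fellow', 'student', 'we', 'both', 'are', 'the', 'good', 'student', 'a', 'student', 'works', 'hard']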
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
p = doc[0].split() #first list
p1 = doc[1].split() #second list
p2 = doc[2].split() #third list
f1 = p + p1 + p2
j = len(f1)-1
n = 0
while n < j:
print(f1[n],"is found",f1.count(f1[n]), "times")
n+=1
You can use a set and a string to aggregate all the words from every sentence, and then use a dictionary comprehension to build a dictionary whose keys are the words and whose values are their counts:
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
uniques = set()
all_words = ''
for i in doc:
for word in i.split(" "):
uniques.add(word)
all_words += f" {word}"
print({i: all_words.count(f" {i} ") for i in uniques})
Output
{'the': 1, 'hard': 1, 'student': 3, 'both': 1, 'fellow': 1, 'works': 1, 'a': 2, 'are': 1, 'am': 1, 'good': 1, 'i': 1, 'we': 1}
Thanks for posting on Stack Overflow. I have written some sample code that does what you need; check it and ask if there is anything you don't understand.
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
checked = []
occurence = []
for sentence in doc:
for word in sentence.split(" "):
if word in checked:
occurence[checked.index(word)] = occurence[checked.index(word)] + 1
else:
checked.append(word)
occurence.append(1)
for i in range(len(checked)):
print(checked[i]+" : "+str(occurence[i]))
Try this one:
doc = ["i am a fellow student", "we both are the good student", "a student works hard"]
words=[]
for a in doc:
b=a.split()
for c in b:
#if len(c)>3: #most words there length > 3 this line in your choice
words.append(c)
wc=[]
for a in words:
count = 0
for b in words:
if a==b :
count +=1
wc.append([a,count])
print(wc)
I have made a text string, removed all non-alphabetical symbols and added whitespace between the words, but when I try to count the frequency of the words with a dictionary it counts the letters instead. How do I count the words with a dictionary?
dictionary = {}
for item in text_string:
    if item in dictionary:
        dictionary[item] = dictionary[item] + 1
    else:
        dictionary[item] = 1
print(dictionary)
Change this
for item in text_string:
to this
for item in text_string.split():
The .split() method splits the string into words, using whitespace characters (including tabs and newlines) as delimiters.
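A quick illustration of the difference (with no argument, any run of whitespace acts as a single separator):
>>> "one\ttwo\nthree  four".split()
['one', 'two', 'three', 'four']
>>> "one\ttwo\nthree  four".split(' ')
['one\ttwo\nthree', '', 'four']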
You are very close. Since you state that your words are already whitespace separated, you need to use str.split to make a list of words.
An example is below:
dictionary = {}
text_string = 'there are repeated words in this sring with many words many are repeated'
for item in text_string.split():
    if item in dictionary:
        dictionary[item] = dictionary[item] + 1
    else:
        dictionary[item] = 1
print(dictionary)
{'there': 1, 'are': 2, 'repeated': 2, 'words': 2, 'in': 1,
'this': 1, 'sring': 1, 'with': 1, 'many': 2}
Another solution is to use collections.Counter, available in the standard library:
from collections import Counter
text_string = 'there are repeated words in this sring with many words many are repeated'
c = Counter(text_string.split())
print(c)
Counter({'are': 2, 'repeated': 2, 'words': 2, 'many': 2, 'there': 1,
'in': 1, 'this': 1, 'sring': 1, 'with': 1})
I am trying to count every word from text files and append each word and its count to a dictionary as key-value pairs. On the line if key not in wordDict: it throws this error:
TypeError: unhashable type: 'list'
Also, I am wondering whether .split() is a good choice, because my text files contain various punctuation marks.
fileref = open(mypath + '/' + i, 'r')
wordDict = {}
for line in fileref.readlines():
    key = line.split()
    if key not in wordDict:
        wordDict[key] = 1
    else:
        wordDict[key] += 1
from collections import Counter
text = '''I am trying to count every word from text files and appending the word and count to a dictionary as the key-value pairs. It throws me this error: if key not in wordDict: TypeError: unhashable type: 'list' Also, I am wondering of .split() is good because my text files contain different punctuation marks. Thanks ahead for those who help!'''
split_text = text.split()
counter = Counter(split_text)
print(counter)
Output:
Counter({'count': 2, 'and': 2, 'text': 2, 'to': 2, 'I': 2, 'files': 2, 'word': 2, 'am': 2, 'the': 2, 'dictionary': 1, 'a': 1, 'not': 1, 'in': 1, 'ahead': 1, 'me': 1, 'trying': 1, 'every': 1, '.split()': 1, 'type:': 1, 'my': 1, 'punctuation': 1, 'is': 1, 'key': 1, 'error:': 1, 'help!': 1, 'those': 1, 'different': 1, 'throws': 1, 'TypeError:': 1, 'contain': 1, 'wordDict:': 1, 'appending': 1, 'if': 1, 'It': 1, 'Also,': 1, 'unhashable': 1, 'from': 1, 'because': 1, 'marks.': 1, 'pairs.': 1, 'this': 1, 'key-value': 1, 'wondering': 1, 'Thanks': 1, 'of': 1, 'good': 1, "'list'": 1, 'for': 1, 'who': 1, 'as': 1})
key is a list of space-delimited words found in the current line. You would need to iterate over that list as well.
for line in fileref:
    keys = line.split()
    for key in keys:
        if key not in wordDict:
            wordDict[key] = 1
        else:
            wordDict[key] += 1
This can be cleaned up considerably by either using the setdefault method or a defaultdict from the collections module; both allow you to avoid explicitly checking for a key by automatically adding the key with an initial value if it isn't already in the dict.
for key in keys:
    wordDict[key] = wordDict.setdefault(key, 0) + 1  # setdefault inserts 0 for new keys and returns the current value
or
from collections import defaultdict
wordDict = defaultdict(int)  # default to 0, since int() == 0
...
for key in keys:
    wordDict[key] += 1
key is a list, and you're trying to see if a list is in a dictionary, which is equivalent to checking whether it is one of the keys. Dictionary keys cannot be lists, hence the "unhashable type" error.
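For example, trying to use a list as a dictionary key fails immediately:
>>> wordDict = {}
>>> wordDict[["hello", "world"]] = 1
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'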
str.split returns a list of words:
>>> "hello world".split()
['hello', 'world']
>>>
and lists (or any other mutable object) cannot be used as dictionary keys, which is why you get the error TypeError: unhashable type: 'list'.
You need to iterate over that list to count each word individually. Also, the recommended way to work with a file is with the with statement:
wordDict = {}
with open(mypath + '/' + i, 'r') as fileref:
    for line in fileref:
        for word in line.split():
            if word not in wordDict:
                wordDict[word] = 1
            else:
                wordDict[word] += 1
The above can be shortened with Counter and an appropriate call to it:
from collections import Counter
with open(mypath + '/' + i, 'r') as fileref:
    wordDict = Counter(word for line in fileref for word in line.split())
My function is supposed to have:
One parameter that is a tweet. This tweet can contain numbers, words, hashtags, links and punctuation.
A second parameter that is a dictionary counting the words in that tweet, disregarding the hashtags, mentions, links and punctuation included in it.
The function stores all individual words in the dictionary as lowercase letters without any punctuation.
If the tweet had Don't, then the dictionary would count it as dont.
Here is my function:
def count_words(tweet, num_words):
    ''' (str, dict of {str: int}) -> None
    Update the count of words in the dictionary num_words; return None.
    >>> count_words('We have made too much progress', num_words)
    >>> num_words
    {'we': 1, 'have': 1, 'made': 1, 'too': 1, 'much': 1, 'progress': 1}
    >>> count_words("#utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", num_words)
    >>> num_words
    {'dont': 1, 'wish': 1, 'you': 2, 'could': 1, 'vote': 1}
    >>> count_words('I am fighting for you! #FollowTheMoney', num_words)
    >>> num_words
    {'i': 1, 'am': 1, 'fighting': 1, 'for': 1, 'you': 1}
    >>> count_words('', num_words)
    >>> num_words
    {'': 0}
    '''
I might misunderstand your question, but if you want to update the dictionary you can do it in this manner:
d = {}
def update_dict(tweet):
    for i in tweet.split():
        if i not in d:
            d[i] = 1
        else:
            d[i] += 1
    return d
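If you also need the cleanup the question describes (lowercase everything, drop punctuation, and ignore hashtags, mentions and links), here is a minimal sketch of one way to do it; the startswith-based skip rules and the string.punctuation cleanup are my own assumptions about what should be filtered:

import string

def count_words(tweet, num_words):
    # Update num_words in place with the cleaned, lowercased words of tweet.
    for token in tweet.split():
        # Assumption: hashtags, mentions and links are skipped entirely
        if token.startswith(('#', '@', 'http')):
            continue
        # Strip punctuation and lowercase, so "Don't" becomes "dont"
        word = token.translate(str.maketrans('', '', string.punctuation)).lower()
        if word:
            num_words[word] = num_words.get(word, 0) + 1

With this, count_words("#utmandrew Don't you wish you could vote?", num_words) adds {'dont': 1, 'you': 2, 'wish': 1, 'could': 1, 'vote': 1} to an empty num_words, matching the docstring example.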
Hello, I want to count single-word and double-word (bigram) occurrences in input text in Python.
Example:
"what is your name ? what you want from me ?
You know best way to earn money is Hardwork
what is your aim ?"
Output:
Single W.C.:
what 3
is 3
your 2
you 2
and so on...
Double W.C.:
what is 2
is your 2
your name 1
what you 1
and so on...
Please post the way to do this. I use the following code for the single word count:
ws = {}
for line in text:
    for wrd in line:
        if wrd not in ws:
            ws[wrd] = 1
        else:
            ws[wrd] += 1
from collections import Counter
s = "..."
words = s.split()
pairs = zip(words, words[1:])
single_words, double_words = Counter(words), Counter(pairs)
Output:
print "sinle W.C."
for word, count in sorted(single_words.items(), key=lambda x: -x[1]):
print word, count
print "double W.C."
for pair, count in sorted(double_words.items(), key=lambda x: -x[1]):
print pair, count
import nltk
from nltk import bigrams
from nltk import trigrams
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens if len(token) > 1]
bi_tokens = list(bigrams(tokens))  # wrap in list(); newer nltk versions return a generator
print [(item, tokens.count(item)) for item in sorted(set(tokens))]
print [(item, bi_tokens.count(item)) for item in sorted(set(bi_tokens))]
This works, using defaultdict (Python 2.6):
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> string = "what is your name ? what you want from me ?\n
You know best way to earn money is Hardwork\n what is your aim ?"
>>> l = string.split()
>>> for i in l:
d[i]+=1
>>> d
defaultdict(<type 'int'>, {'me': 1, 'aim': 1, 'what': 3, 'from': 1, 'name': 1,
'You': 1, 'money': 1, 'is': 3, 'earn': 1, 'best': 1, 'Hardwork': 1, 'to': 1,
'way': 1, 'know': 1, 'want': 1, 'you': 1, 'your': 2, '?': 3})
>>> d2 = defaultdict(int)
>>> for i in zip(l[:-1], l[1:]):
...     d2[i] += 1
>>> d2
defaultdict(<type 'int'>, {('You', 'know'): 1, ('earn', 'money'): 1,
('is', 'Hardwork'): 1, ('you', 'want'): 1, ('know', 'best'): 1,
('what', 'is'): 2, ('your', 'name'): 1, ('from', 'me'): 1,
('name', '?'): 1, ('?', 'You'): 1, ('?', 'what'): 1, ('to', 'earn'): 1,
('aim', '?'): 1, ('way', 'to'): 1, ('Hardwork', 'what'): 1,
('money', 'is'): 1, ('me', '?'): 1, ('what', 'you'): 1, ('best', 'way'): 1,
('want', 'from'): 1, ('is', 'your'): 2, ('your', 'aim'): 1})
>>>
I realize this question is a few years old. I wrote a little routine today to count individual words from a Word document (docx). I used docx2txt to get the text from the document, used my first ever regular expression to remove every character other than letters, digits and spaces, and converted everything to uppercase. I am posting this because the question isn't marked as answered.
Here is my little test routine in case it helps anyone.
mydoc = 'I:/flashdrive/pmw/pmw_py.docx'
words_all = {}
#####
import docx2txt
my_text = docx2txt.process(mydoc)
print(my_text)
my_text_org = my_text

import re
my_text = re.sub(r'[\W_]+', ' ', my_text.upper(), flags=re.UNICODE)
print(my_text)
words = my_text.split()
words_org = words  # just in case I may need the original version later

# added this code for the double words
from collections import Counter
pairs = zip(words, words[1:])
pair_list = Counter(pairs)
print('before pair listing')
for pair, count in sorted(pair_list.items(), key=lambda x: -x[1]):
    # print(''.join('{} {}'.format(*pair)), count)  # worked
    # print(' '.join(pair), '', count)  # worked
    my_pair = "{} {}".format(pair[0], pair[1])
    print(my_pair, ": ", count)
# end of added code

for i in words:
    if i not in words_all:
        words_all[i] = words.count(i)
for k, v in sorted(words_all.items()):
    print(k, v)
print("Number of items in word list: {}".format(len(words_all)))