I'm trying to count how often a word occurs in a text file using a Python function. I can get the frequency of every word separately, but I want counts only for specific words that I keep in a list. Here's what I have so far, but I'm stuck.
def repeatedWords():
    with open(fname) as f:
        wordcount={}
        for word in word_list:
            for word in f.read().split():
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1
    for k,v in wordcount.items():
        print k, v
word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']
repeatedWords('file.txt')
Updated, still showing all words:
def repeatedWords(fname, word_list):
    with open(fname) as f:
        wordcount = {}
        for word in word_list:
            for word in f.read().split():
                wordcount[word] = wordcount.get(word, 0) + 1
    for k,v in wordcount.items():
        print k, v
word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']
repeatedWords('Emma.txt', word_list)
So you want the frequency of only the specific words in that list (Emma, Woodhouse, Father...)? If so, this code might help (try running it):
word_list = ['Emma','Woodhouse','father','Taylor','Miss','been','she','her']
# I'm using this example text in place of the file you are using
text = 'This is an example text. It will contain words you are looking for, like Emma, Emma, Emma, Woodhouse, Woodhouse, Father, Father, Taylor,Miss,been,she,her,her,her. I made them repeat to show that the code works.'
text = text.replace(',',' ') # these statements remove irrelevant punctuation
text = text.replace('.','')
text = text.lower() # this makes all the words lowercase, so that capitalization won't affect the frequency measurement
for repeatedword in word_list:
    counter = 0 # counter starts at 0
    for word in text.split():
        if repeatedword.lower() == word:
            counter = counter + 1 # add 1 every time there is a match in the list
    print(repeatedword,':', counter) # prints the word from 'word_list' and its frequency
The output shows the frequency of only those words in the list you provided, which is what you wanted, right?
The output produced when run in Python 3 is:
Emma : 3
Woodhouse : 2
father : 2
Taylor : 1
Miss : 1
been : 1
she : 1
her : 3
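The nested loops above rescan the whole text once per listed word. As a minimal sketch of an alternative (not the answerer's method), the text can be tokenized once and tallied with collections.Counter, after which each listed word is a single lookup. The function name count_listed_words and the sample string are assumptions made for illustration.

```python
import re
from collections import Counter

def count_listed_words(text, word_list):
    """Return {word: count} for only the words in word_list, ignoring case."""
    # \w+ keeps runs of letters/digits, so commas and full stops are dropped
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Indexing a Counter gives 0 for words that never occur
    return {word: counts[word.lower()] for word in word_list}

sample = "Emma, Emma, Emma, Woodhouse, Woodhouse, Father, Father, Taylor,Miss,been,she,her,her,her."
result = count_listed_words(sample, ["Emma", "Woodhouse", "father", "her", "Knightley"])
print(result)
```

For a real file, the text argument would be the result of f.read() inside a with open(...) block.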
A good way to handle this is the get method of Python dictionaries. It can look like this:
def repeatedWords(fname):
    wordcount = {}
    # Example list of words not needed
    nonwordlist = ['father', 'Miss', 'been']
    with open(fname) as f:
        # Read the file once; a second f.read() would return an empty string
        for word in f.read().split():
            if word not in nonwordlist:
                wordcount[word] = wordcount.get(word, 0) + 1
    return wordcount
# Put these outside the function repeatedWords
for k, v in repeatedWords('Emma.txt').items():
    print(k, v)
The print statement should give you this:
word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']
newDict = {}
for newWord in word_list:
    newDict[newWord] = newDict.get(newWord, 0) + 1
print(newDict)
The line wordcount[word] = wordcount.get(word, 0) + 1 first looks up word in the dictionary wordcount. If the word already exists, get returns its current value and 1 is added to it. If the word does not exist, get returns the default 0, so adding 1 records the first occurrence of that word with a count of 1.
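To make the get pattern concrete, here is a small sketch that applies it to an in-memory list instead of a file. The helper name tally and the sample words are assumptions made for illustration.

```python
def tally(words):
    """Count occurrences of each item using dict.get with a default of 0."""
    wordcount = {}
    for word in words:
        # First sighting: get returns 0, so the stored count becomes 1
        wordcount[word] = wordcount.get(word, 0) + 1
    return wordcount

counts = tally(["she", "her", "she", "Emma", "she"])
print(counts)
```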
Related
I want to find a word with the most repeated letters given an input a sentence.
I know how to find the most repeated letter in the sentence, but I can't work out how to print the word that contains it.
For example:
this is an elementary test example
should print
elementary
def most_repeating_word(strg):
    words =strg.split()
    for words1 in words:
        dict1 = {}
        max_repeat_count = 0
        for letter in words1:
            if letter not in dict1:
                dict1[letter] = 1
            else:
                dict1[letter] += 1
            if dict1[letter]> max_repeat_count:
                max_repeat_count = dict1[letter]
                most_repeated_char = letter
                result=words1
    return result
You are resetting the max_repeat_count variable to 0 for each word. Move the initialization above the first for loop, like this:
def most_repeating_word(strg):
    words =strg.split()
    max_repeat_count = 0
    for words1 in words:
        dict1 = {}
        for letter in words1:
            if letter not in dict1:
                dict1[letter] = 1
            else:
                dict1[letter] += 1
            if dict1[letter]> max_repeat_count:
                max_repeat_count = dict1[letter]
                most_repeated_char = letter
                result=words1
    return result
Hope this helps
Use a regex instead. It is simple and easy. Iteration is an expensive operation compared to regular expressions.
Please refer to the solution for your problem in this post:
Count repeated letters in a string
Interesting exercise! +1 for using Counter(). Here's my suggestion also making use of max() and its key argument, and the * unpacking operator.
For a final solution note that this (and the other proposed solutions to the question) don't currently consider case, other possible characters (digits, symbols etc) or whether more than one word will have the maximum letter count, or if a word will have more than one letter with the maximum letter count.
from collections import Counter
def most_repeating_word(strg):
    # Create list of word tuples: (word, max_letter, max_count)
    counters = [ (word, *max(Counter(word).items(), key=lambda item: item[1]))
                 for word in strg.split() ]
    max_word, max_letter, max_count = max(counters, key=lambda item: item[2])
    return max_word
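The caveats mentioned in this answer (letter case, non-letter characters, and ties between words) could be handled along these lines. This is only a sketch under assumptions: the name most_repeating_words, the choice to ignore non-alphabetic characters, and the choice to return every tied word are all illustrative decisions, not part of the original answer.

```python
from collections import Counter

def most_repeating_words(strg):
    """Return (best_count, words): every word tied for the highest
    repeat count of a single letter, compared case-insensitively."""
    best_count = 0
    best_words = []
    for word in strg.split():
        # Ignore digits and punctuation; fold case before counting
        letters = [ch for ch in word.lower() if ch.isalpha()]
        if not letters:
            continue
        top = max(Counter(letters).values())
        if top > best_count:
            best_count, best_words = top, [word]
        elif top == best_count:
            best_words.append(word)
    return best_count, best_words

print(most_repeating_words("this is an elementary test example"))
```

Returning a list keeps the tie visible to the caller, who can then pick the first word, the longest word, or all of them.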
word="SBDDUKRWZHUYLRVLIPVVFYFKMSVLVEQTHRUOFHPOALGXCNLXXGUQHQVXMRGVQTBEYVEGMFD"
def most_repeating_word(strg):
    dict={}
    max_repeat_count = 0
    for word in strg:
        if word not in dict:
            dict[word] = 1
        else:
            dict[word] += 1
        if dict[word]> max_repeat_count:
            max_repeat_count = dict[word]
    result={}
    for word, value in dict.items():
        if value==max_repeat_count:
            result[word]=value
    return result
print(most_repeating_word(word))
I have to write a function that counts how many times a word (or a series of words) appears in a given text.
This is my function so far. What I noticed is that with a series of 3 words the function works well, but not with 4 words and so on.
from nltk import ngrams
def function(text, word):
    for char in ".?!-":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = word.lower()
    n_grams = ngrams(bigram_lower.split(), n)
    for gram in n_grams:
        for i in range (0, n):
            if gram[i] == word_lower.split()[i]:
                countN = countN + 1
    print (countN)
First, please fix your indentation, and don't use bigrams as a variable name for n-grams; it's confusing, since the variable doesn't hold only bigrams. Second, let's look at this part of your code:
for gram in bigrams:
    for i in range (0, n):
        if gram[i] == word_lower.split()[i]:
            countN = countN + 1
print (countN)
Here you are increasing countN by one for each time a word in your ngram matches up instead of increasing it when the whole ngram matches up. You should instead only increase countN if all the words have matched up -
for gram in bigrams:
    if list(gram) == word_lower.split():
        countN = countN + 1
print (countN)
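The same whole-n-gram comparison can be sketched without nltk by sliding a window of n words over the text. The function name count_phrase, the use of a \w+ regex for tokenizing, and the sample sentence are assumptions made for illustration.

```python
import re

def count_phrase(text, phrase):
    """Count how often the word sequence `phrase` occurs in `text`,
    ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    target = re.findall(r"\w+", phrase.lower())
    n = len(target)
    if n == 0:
        return 0
    # Compare each n-word window against the whole target at once
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

sample = "I want to search, I want to search it for words I want to search for."
print(count_phrase(sample, "i want to search"))
print(count_phrase(sample, "i want to search for"))
```

Because a window only counts when all n words match, the result does not grow with partial matches, which was the bug in the per-word comparison.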
May be it was already done in here
Is nltk mandatory?
# Open the file in read mode
text = open("sample.txt", "r")
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
This should work for you:
import nltk
def function(text, word):
    for char in ".?!-,":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    # ngrams yields tuples, so compare against a tuple of the target words
    word_lower = tuple(word.lower().split())
    bigrams = nltk.ngrams(bigram_lower.split(), n)
    for gram in bigrams:
        if gram == word_lower:
            countN += 1
    print (countN)
>>> tekst="this is the text i want to search, i want to search it for the words i want to search for, and it should count the occurances of the words i want to search for"
>>> function(tekst, "i want to search")
4
>>> function(tekst, "i want to search for")
2
I have a program that counts the words of a text file. Now I want to restrict the counter to strings with more than x characters.
from collections import Counter
input = 'C:/Users/micha/Dropbox/IPCC_Boox/FOD_v1_ch15.txt'
Counter = {}
words = {}
with open(input,'r', encoding='utf-8-sig') as fh:
    for line in fh:
        word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
        for word in word_list:
            if word not in Counter:
                Counter[word] = 1
            else:
                Counter[word] = Counter[word] + 1
N = 20
top_words = Counter(Counter).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))
I tried the re code, but it did not work.
re.sub(r'\b\w{1,3}\b')
I don't know how to implement it.
In the end I would like output that ignores all the short words such as and, you, be, etc.
You could do this more simply with:
for word in word_list:
    if len(word) < 5: # skip words shorter than 5 characters, for example
        continue # this skips the counter portion and jumps to next word in word_list
    elif word not in Counter:
        Counter[word] = 1
    else:
        Counter[word] = Counter[word] + 1
A few notes.
1) You import a Counter but don't use it properly (you do a Counter = {} thus overwriting the import).
from collections import Counter
2) Instead of chaining several replace calls, use a list comprehension with a set. It's faster and makes only one pass over the line (two with the join) instead of several:
sentence = ''.join([char for char in line if char not in {'.', ',', "'"}])
word_list = sentence.split()
3) Use the counter and list comp for length:
c = Counter(word for word in word_list if len(word) > 3)
That's it.
Counter already does what you want. You can "feed" it with an iterable and this will work.
https://docs.python.org/2/library/collections.html#counter-objects
You can use the filter function too: https://docs.python.org/3.7/library/functions.html#filter
It could look like this:
counted = Counter(filter(lambda x: len(x) >= 5, words))
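Since the question mentions trying a regular expression, here is a sketch of how the length restriction could be folded into the pattern itself, using re.findall rather than re.sub. The function name top_long_words, its default parameters, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def top_long_words(text, min_len=4, n=20):
    """Return the n most common words with at least min_len characters."""
    # \w{4,} (for min_len=4) only matches words of that length or longer
    pattern = r"\b\w{%d,}\b" % min_len
    words = re.findall(pattern, text.lower())
    return Counter(words).most_common(n)

sample = "You and the climate: climate models and you, models of climate."
print(top_long_words(sample, min_len=4, n=2))
```

Matching only the long words avoids the separate step of deleting short ones, which is what the re.sub attempt was aiming for.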
I want to count a specific word in the file.
For example how many times does 'apple' appear in the file.
I tried this:
#!/usr/bin/env python
import re
logfile = open("log_file", "r")
wordcount={}
for word in logfile.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k,v in wordcount.items():
    print k, v
by replacing 'word' with 'apple', but it still counts all possible words in my file.
Any advice would be greatly appreciated. :)
You could just use str.count() since you only care about occurrences of a single word:
with open("log_file") as f:
    contents = f.read()
    count = contents.count("apple")
However, to avoid some corner cases, such as erroneously counting words like "applejack", I suggest that you use a regex:
import re
with open("log_file") as f:
    contents = f.read()
    count = sum(1 for match in re.finditer(r"\bapple\b", contents))
\b in the regex ensures that the pattern begins and ends on a word boundary (as opposed to a substring within a longer string).
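A short sketch of the difference between substring counting and word-boundary matching, using an in-memory string in place of the log file. The sample string and variable names are assumptions made for illustration.

```python
import re

contents = "apple applejack pineapple apple. Apple"

# str.count matches the substring anywhere, including inside longer words
substring_hits = contents.count("apple")

# \b requires a word boundary on both sides, so only standalone "apple" counts
whole_word_hits = len(re.findall(r"\bapple\b", contents))

# re.IGNORECASE additionally picks up the capitalized "Apple"
any_case_hits = len(re.findall(r"\bapple\b", contents, flags=re.IGNORECASE))

print(substring_hits, whole_word_hits, any_case_hits)  # 4 2 3
```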
If you only care about one word then you do not need to create a dictionary to keep track of every word count. You can just iterate over the file line-by-line and find the occurrences of the word you are interested in.
#!/usr/bin/env python
wordcount = 0
my_word = "apple"
with open("log_file", "r") as logfile:
    for line in logfile:
        # Add every occurrence on the line, not just one per matching line
        wordcount += line.split().count(my_word)
print(my_word, wordcount)
However, if you also want to count all the words, and just print the word count for the word you are interested in then these minor changes to your code should work:
#!/usr/bin/env python
logfile = open("log_file", "r")
wordcount={}
for word in logfile.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
# print only the count for my_word instead of iterating over entire dictionary
my_word="apple"
# .get avoids a KeyError when my_word never appears in the file
print(my_word, wordcount.get(my_word, 0))
You can use a Counter for this:
from collections import Counter
with open("log_file", "r") as logfile:
    word_counts = Counter(logfile.read().split())
# Indexing a Counter returns 0 for a missing word, whereas .get would return None
print(word_counts['apple'])
This is an example of counting words in an array of words. I assume reading from a file will be much the same.
def count(word, array):
    n=0
    for x in array:
        if x== word:
            n+=1
    return n
text= 'apple orange kiwi apple orange grape kiwi apple apple'
ar = text.split()
print(count('apple', ar))
def Freq(x,y):
    d={}
    open_file = open(x,"r")
    lines = open_file.readlines()
    for line in lines:
        word = line.lower()
        words = word.split()
        for i in words:
            if i in d:
                d[i] = d[i] + 1
            else:
                d[i] = 1
    print(d)
fi=open("text.txt","r")
cash=0
visa=0
amex=0
for line in fi:
    k=line.split()
    print(k)
    if 'Cash' in k:
        cash=cash+1
    elif 'Visa' in k:
        visa=visa+1
    elif 'Amex' in k:
        amex=amex+1
print("# persons paid by cash are:",cash)
print("# persons paid by Visa card are :",visa)
print("#persons paid by Amex card are :",amex)
fi.close()
I am counting the words of a txt file with the following code:
#!/usr/bin/python
file=open("D:\\zzzz\\names2.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
print (word,wordcount)
file.close();
this is giving me the output like this:
>>>
goat {'goat': 2, 'cow': 1, 'Dog': 1, 'lion': 1, 'snake': 1, 'horse': 1, '': 1, 'tiger': 1, 'cat': 2, 'dog': 1}
but I want the output in the following manner:
word wordcount
goat 2
cow 1
dog 1.....
Also I am getting an extra symbol in the output (). How can I remove this?
The funny symbols you're encountering are a UTF-8 BOM (Byte Order Mark). To get rid of them, open the file using the correct encoding (I'm assuming you're on Python 3):
file = open(r"D:\zzzz\names2.txt", "r", encoding="utf-8-sig")
Furthermore, for counting, you can use collections.Counter:
from collections import Counter
wordcount = Counter(file.read().split())
Display them with:
>>> for item in wordcount.items(): print("{}\t{}".format(*item))
...
snake 1
lion 2
goat 2
horse 3
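To illustrate the BOM point from this answer without needing a file on disk, the following sketch decodes the same bytes with "utf-8" and with "utf-8-sig". The byte string is made-up sample data, and the variable names are assumptions made for illustration.

```python
import codecs
from collections import Counter

# Simulated file contents: a UTF-8 BOM followed by three words
raw = codecs.BOM_UTF8 + "goat cow goat".encode("utf-8")

# Plain utf-8 keeps the BOM as the character U+FEFF glued to the first word
plain_counts = Counter(raw.decode("utf-8").split())

# utf-8-sig strips the BOM, so both "goat" tokens are counted together
clean_counts = Counter(raw.decode("utf-8-sig").split())

print(plain_counts["goat"], clean_counts["goat"])  # 1 2
```

Passing encoding="utf-8-sig" to open has the same effect as the second decode call here, and it also reads files that have no BOM.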
#!/usr/bin/python
file=open("D:\\zzzz\\names2.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
# Print one "word count" pair per line
for k,v in wordcount.items():
    print(k, v)
FILE_NAME = 'file.txt'
wordCounter = {}
with open(FILE_NAME,'r') as fh:
    for line in fh:
        # Replacing punctuation characters. Making the string lowercase.
        # The split will split the line into a list.
        word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
        for word in word_list:
            # Adding the word into the wordCounter dictionary.
            if word not in wordCounter:
                wordCounter[word] = 1
            else:
                # if the word is already in the dictionary update its count.
                wordCounter[word] = wordCounter[word] + 1
print('{:15}{:3}'.format('Word','Count'))
print('-' * 18)
# printing the words and their occurrences.
for (word,occurance) in wordCounter.items():
    print('{:15}{:3}'.format(word,occurance))
#
Word Count
------------------
of 6
examples 2
used 2
development 2
modified 2
open-source 2
import sys
file=open(sys.argv[1],"r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for key in wordcount.keys():
    print ("%s %s " %(key , wordcount[key]))
file.close();
If you are using GraphLab, you can use this function. It is really powerful.
products['word_count'] = graphlab.text_analytics.count_words(your_text)
#!/usr/bin/python
file=open("D:\\zzzz\\names2.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k,v in wordcount.items():
    print(k, v)
file.close();
You can do this to count the distinct words in the file:
file= open(r'D:\zzzz\names2.txt')
file_split=set(file.read().split())
print(len(file_split))
The code below, from "Python | How to Count the frequency of a word in the text file?", worked for me.
import re
frequency = {}
# Open the sample text file in read mode.
document_text = open('sample.txt', 'r')
# Convert the document's text to lowercase and assign it to the text variable.
text = document_text.read().lower()
pattern = re.findall(r'\b[a-z]{2,15}\b', text)
for word in pattern:
    count = frequency.get(word,0)
    frequency[word] = count + 1
frequency_list = frequency.keys()
for words in frequency_list:
    print(words, frequency[words])
OUTPUT:
print("sorted counting values:-")
from collections import Counter
fname = open(filename)
fname = fname.read()
fsplit = fname.split()
user = Counter(fsplit)
for i,v in sorted(user.items()):
    print((v,i))
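The snippet above sorts alphabetically by word. If the goal is to list the most frequent words first, a sort key on the count is one option. This is a sketch on made-up sample data, and the variable names are assumptions made for illustration.

```python
from collections import Counter

words = "goat cow dog goat cat cat goat".split()
counts = Counter(words)

# Sort by descending count, then alphabetically to break ties predictably
by_count = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for word, count in by_count:
    print(word, count)
```

Counter.most_common() gives a similar ordering, but it leaves tied words in first-seen order rather than alphabetical order.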