My function takes a string as input that is the name of a file and should return a dictionary. The dictionary will have key/value pairs where keys are integers that correspond to word lengths and the values are the number of words that appear in the file with that length.
The file consists of the following sentence:
and then the last assignment ended and everyone was sad
So in theory the returned dictionary would look like this:
{ 3:5, 4:2, 5:1, 8:1, 10:1}
So far I have this:
"""
COMP 1005 - Fall 2016
Assignment 10
Problem 1
"""
def wordLengthStats(filename):
    file = open(filename, 'r')
    wordcount = {}
    for line in file.read().split():
        if line not in wordcount:
            wordcount[line] = 1
        else:
            wordcount[line] += 1
    for k, v in wordcount.items():
        print(k, v)
    return None
def main():
    '''
    main method to test your wordLengthStats method
    '''
    d = wordLengthStats("sample.txt")
    print("d should be { 3:5, 4:2, 5:1, 8:1, 10:1} ")
    print("d is", d)

if __name__ == '__main__':
    main()
The sentence is just an example; I need to make it work for any input. Any help on approaching this problem would be greatly appreciated.
For every word in the sentence, you need to add an entry to the dictionary where the length of the word is the key:
def wordLengthStats(filename):
    wordcount = {}
    with open(filename, 'r') as file:
        for word in file.read().split():
            key = len(word)
            if key not in wordcount:
                wordcount[key] = 1
            else:
                wordcount[key] += 1
    for k, v in wordcount.items():
        print(k, v)
    return wordcount  # return the dict instead of None so main() can print it
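As a side note, the same counting can be sketched more compactly with collections.Counter (a dict subclass); this is not part of the original answer, just an equivalent one-pass version:

```python
from collections import Counter

def wordLengthStats(filename):
    # Map each word to its length, then let Counter tally the lengths
    with open(filename) as f:
        return dict(Counter(len(word) for word in f.read().split()))
```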
Related
I am making a program that reads a file and builds a dictionary showing how many times each word has been used:
filename = 'for_python.txt'
with open(filename) as file:
    contents = file.read().split()
dict = {}
for word in contents:
    if word not in dict:
        dict[word] = 1
    else:
        dict[word] += 1
dict = sorted(dict.items(), key=lambda x: x[1], reverse=True)
for i in dict:
    print(i[0], i[1])
It works, but it treats words that have commas in them as different words, which I do not want it to do. Is there a simple and efficient way to do this?
Remove all commas before splitting the text:
filename = 'for_python.txt'
with open(filename) as file:
    contents = file.read().replace(",", "").split()
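An alternative sketch, not from the answer above: extract word characters directly with the re module, so punctuation never becomes part of a token (the pattern here is an assumption about what counts as a word):

```python
import re

text = "one,two, three two"
# \w+ matches runs of letters/digits/underscores, so commas and periods
# are simply skipped instead of needing to be replaced first
words = re.findall(r"\w+", text)
# words == ["one", "two", "three", "two"]
```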
You are splitting the data on whitespace as the delimiter but not doing the same for commas. You can split each word further on commas. Here's how:
...
for word in contents:
    new_words = word.split(',')
    for new_word in new_words:
        if new_word not in dict:
            dict[new_word] = 1
        else:
            dict[new_word] += 1
...
I'd suggest you strip() each word with the various punctuation characters before using it. Also, don't use the built-in name dict; it's the dictionary constructor.
import string

words = {}
for word in contents:
    word = word.strip(string.punctuation)
    if word not in words:
        words[word] = 1
    else:
        words[word] += 1
For what it's worth, there is collections.Counter, which does this job:
import string
from collections import Counter

filename = 'test.txt'
with open(filename) as file:
    contents = file.read().split()
words = Counter(word.strip(string.punctuation) for word in contents)

for k, v in words.most_common():  # all words, in descending order of count
    print(k, v)
for k, v in words.most_common(5):  # only the 5 most common words
    print(k, v)
Apologies if this has been addressed before. I can't find any previous answers that address my specific problem, so here it is.
The exercise requires that the user inputs a .txt file name. The code takes that file, and counts the words within it, creating a dictionary of word : count pairs. If the file has already been input, and its words counted, then instead of recounting it, the program refers to the cache, where its previous counts are stored.
My problem is creating a nested dictionary of dictionaries - the cache. The following is what I have so far. At the moment, each new .txt file rewrites the dictionary, and prevents it being used as a cache.
def main():
    file = input("Enter the file name: ")  # Takes a file input to count the words
    d = {}  # dictionary of dictionaries: a cache of word counts
    with open(file) as f:
        if f in d:  # check if this file is in cache.
            for word in sorted(d[f]):  # print the result of the word count of an old document.
                print("That file has already been assessed:\n%-12s:%5d" % (word, d[f][word]))
        else:  # count the words in this file and add the count to the cache as a nested dict.
            d[f] = {}  # create a nested dictionary within 'd'.
            for line in f:  # counts the unique words within the document.
                words = line.split()
                for word in words:
                    word = word.rstrip("!'?.,")  # clean up punctuation here
                    word = word.upper()  # all words to uppercase here
                    if word not in d[f]:
                        d[f][word] = 1
                    else:
                        d[f][word] = d[f][word] + 1
            for word in sorted(d[f]):  # print the result of the word count of a new document.
                print("%-12s:%5d" % (word, d[f][word]))
    main()  # Run code again to try new file.

main()
Easy fix: key the cache on the filename string instead of the file object:
d[file] = {}
....
d[file][word] = 1  # and so on
because each open() returns a new file object, so d[f] can never match an entry stored on a previous call, whereas the filename string file stays the same across calls.
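A quick sketch of why the file object fails as a cache key (the file name here is made up for the demo):

```python
# Two open() calls on the same path return distinct file objects,
# so a dict keyed on the object never produces a cache hit.
with open("demo.txt", "w") as out:
    out.write("demo")

f1 = open("demo.txt")
f2 = open("demo.txt")
print(f1 is f2)   # False: each call creates a new object
d = {f1: "cached"}
print(f2 in d)    # False: the second object is a different key
f1.close()
f2.close()
```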
Also, you can use defaultdict:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))  # note: the lambda takes no arguments

def count(file):
    if file not in d:
        with open(file) as f:
            for line in f:
                for word in line.split():
                    d[file][word.rstrip("!'?.,").upper()] += 1
    return d[file]
def main():
    file = input("Enter the file name: ")
    if file in d:  # check the cache before counting
        print("That file has already been assessed, blah blah")
    count(file)
    for word in sorted(d[file]):  # print the result of the word count
        print("%-12s:%5d" % (word, d[file][word]))

if __name__ == "__main__":
    main()
Your issue is that you re-initialise the dictionary every time you call main(). You need to declare it outside the loop wherein you ask the user to provide a file name.
The process could also be neatened up a bit using collections.Counter() and str.translate():
from collections import Counter
import string
import os.path

d = {}
while True:
    input_file = input("Enter the file name: ")
    if not os.path.isfile(input_file):
        print('File not found, try again')
        continue
    if input_file in d:
        print('Already found, top 5 words:')
    else:
        with open(input_file) as f:
            # strip punctuation with str.translate, then count with Counter
            table = str.maketrans('', '', string.punctuation)
            d[input_file] = Counter(f.read().upper().translate(table).split())
    for word, freq in sorted(d[input_file].items(), reverse=True, key=lambda x: x[1])[:5]:
        print(word.ljust(20) + str(freq).rjust(5))
This will print the top 5 most frequent words and their frequencies for a file. If it has already seen the file, it'll provide a warning as such. Example output:
THE 24
OF 15
AND 12
A 10
MODEL 9
I'm trying to get a count of the frequency of a word in a text file using a Python function. I can get the frequency of all of the words separately, but I'm trying to count only specific words by putting them in a list. Here's what I have so far, but I am currently stuck.
def repeatedWords():
    with open(fname) as f:
        wordcount = {}
        for word in word_list:
            for word in f.read().split():
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1
        for k, v in wordcount.items():
            print k, v

word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']
repeatedWords('file.txt')
Updated, still showing all words:
def repeatedWords(fname, word_list):
    with open(fname) as f:
        wordcount = {}
        for word in word_list:
            for word in f.read().split():
                wordcount[word] = wordcount.get(word, 0) + 1
        for k, v in wordcount.items():
            print k, v

word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']
repeatedWords('Emma.txt', word_list)
So you want the frequency of only the specific words in that list (Emma, Woodhouse, Father...)? If so, this code might help (try running it):
word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']

# I'm using this example text in place of the file you are using
text = 'This is an example text. It will contain words you are looking for, like Emma, Emma, Emma, Woodhouse, Woodhouse, Father, Father, Taylor,Miss,been,she,her,her,her. I made them repeat to show that the code works.'
text = text.replace(',', ' ')  # these statements remove irrelevant punctuation
text = text.replace('.', '')
text = text.lower()  # lowercase everything so capitalization won't affect the frequency count

for repeatedword in word_list:
    counter = 0  # counter starts at 0
    for word in text.split():
        if repeatedword.lower() == word:
            counter = counter + 1  # add 1 every time there is a match
    print(repeatedword, ':', counter)  # prints the word from 'word_list' and its frequency
The output shows the frequency of only those words in the list you provided, which is what you wanted, right?
The output when run in Python 3 is:
Emma : 3
Woodhouse : 2
father : 2
Taylor : 1
Miss : 1
been : 1
she : 1
her : 3
The best way to deal with this is to use the get method of Python dictionaries. It can be like this:
def repeatedWords(fname):
    with open(fname) as f:
        wordcount = {}
        # Example list of words not needed
        nonwordlist = ['father', 'Miss', 'been']
        for word in f.read().split():
            if word not in nonwordlist:
                wordcount[word] = wordcount.get(word, 0) + 1
    # Put these outside the function repeatedWords
    for k, v in wordcount.items():
        print k, v
The print statement should give you this:
word_list = ['Emma', 'Woodhouse', 'father', 'Taylor', 'Miss', 'been', 'she', 'her']
newDict = {}
for newWord in word_list:
    newDict[newWord] = newDict.get(newWord, 0) + 1
print newDict
What the line wordcount[word] = wordcount.get(word, 0) + 1 does is: it first looks for word in the dictionary wordcount; if the word already exists, it gets its current value and adds 1 to it. If the word does not exist, the value defaults to 0, and adding 1 gives the first occurrence of that word a count of 1.
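A tiny sketch of that pattern in isolation:

```python
counts = {}
for w in ["her", "Emma", "her", "her"]:
    # get() returns the current count, or 0 if the word is new
    counts[w] = counts.get(w, 0) + 1
print(counts)  # {'her': 3, 'Emma': 1}
```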
I am new to Python and trying to print the total number of words in a text file, plus the counts of specific words provided by the user.
I tested my code, but it outputs per-word counts; I need only the overall word count of all the words in the file, and also the counts of the words the user provides.
Code:
name = raw_input("Enter the query x ")
name1 = raw_input("Enter the query y ")
file = open("xmlfil.xml", "r+")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print k, v
for name in file.read().split():
    if name not in wordcount:
        wordcount[name] = 1
    else:
        wordcount[name] += 1
for k, v in wordcount.items():
    print k, v
for name1 in file.read().split():
    if name1 not in wordcount:
        wordcount[name1] = 1
    else:
        wordcount[name1] += 1
for k, v in wordcount.items():
    print k, v
MyFile = open('test.txt', 'r')
words = {}
count = 0
given_words = ['The', 'document', '1']
for x in MyFile.read().split():
    count += 1
    if x in given_words:
        words.setdefault(x, 0)
        words[x] += 1
MyFile.close()
print count, words
Sample output
17 {'1': 1, 'The': 1, 'document': 1}
Please do not name the variable that holds the open() result file, as that shadows the built-in file type (the file constructor in Python 2).
You can get what you need easily via Counter
from collections import Counter

c = Counter()
with open('your_file') as f:
    for ln in f:
        c.update(ln.split())
total = sum(c.values())
specific = c['your_specific_word']
Count how many words there are in the .txt file.
Then, print the words ordered by frequency and alphabetically.
def count_words():
    d = dict()
    word_file = open('words.txt')
    for line in word_file:
        word = line.strip()
        d = countwords(word, d)
    return d
I'm not sure if I am doing this correctly. Hoping someone can help me out.
When I run the program, I receive:
>>>
>>>
It's a paragraph of a speech.
I would use a dictionary like you are but differently:
def count_words():
    d = dict()
    word_file = open('words.txt')
    for line in word_file:
        word = line.strip()
        if word not in d:
            d[word] = 0
        d[word] += 1
    return d
Then you can sort the keys by their counts and print them:
from operator import itemgetter
print(sorted(d.items(), key=itemgetter(1)))
For the sorting blurb I used: Sort a Python dictionary by value
Also, your program doesn't have any print statements, just a return line, which is why you see nothing when you run it.
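To also break frequency ties alphabetically (the exercise asks for both orderings), one sketch is to sort on a (negative count, word) key; the sample dict here is made up:

```python
counts = {"pear": 2, "apple": 2, "fig": 5}
# Negative count sorts high frequencies first; the word breaks ties A-Z
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ordered)  # [('fig', 5), ('apple', 2), ('pear', 2)]
```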
#!/usr/local/cpython-3.3/bin/python

import pprint
import collections

def words(filename):
    with open(filename, 'r') as file_:
        for line in file_:
            for word in line.split():
                yield word.lower()

counter = collections.Counter(words('/etc/services'))

sorted_by_frequency = sorted((value, key) for key, value in counter.items())
sorted_by_frequency.reverse()
print('Sorted by frequency')
pprint.pprint(sorted_by_frequency)
print('')

sorted_alphabetically = sorted((key, value) for key, value in counter.items())
print('Sorted alphabetically')
pprint.pprint(sorted_alphabetically)