Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '') #strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
            dic[x] = c
    print(dic)
    print(words)
    inFile.close()
main()
Sorry for the vague question. Never asked any questions here before. This is what I have so far. Also, this is the first ever programming I've done so I expect it to be pretty terrible.
with open('path/to/file') as infile:
    # code goes here
That's how you open a file.
for line in infile:
    # code goes here
That's how you read a file line by line.
line.strip().split()
That's how you split a line into (whitespace-separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
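For anyone who wants to check their answer to the last question, here is a minimal sketch of how those pieces could fit together using dict.get. The function name count_words and the sample lines are made up for illustration; any iterable of strings, including an open file, could be passed in.

```python
def count_words(lines):
    """Count whitespace-separated words in an iterable of lines."""
    counts = {}
    for line in lines:
        for word in line.strip().split():
            # dict.get returns the default (0) when the key is missing
            counts[word] = counts.get(word, 0) + 1
    return counts

# Made-up sample input; an open file object would work the same way.
sample_lines = ["the cat sat", "the cat ran"]
print(count_words(sample_lines))
```

Punctuation and upper/lower case are left untouched here, so "The" and "the" would be counted separately.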
There are at least three different approaches to adding a new word to the dictionary and counting the number of occurrences in the file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1
def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1
def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1
my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        # lower-case every word so that 'The' and 'the' are counted together
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        #or add_element_check2(my_words, words)
        #or add_element_except(my_words, words)
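If you are wondering which is the fastest, the answer is: it depends. It depends on how often a given word occurs in the file. A try block costs almost nothing when no exception is raised, so try-except tends to win when most words repeat many times. When most words occur only once or twice, the KeyError is raised often, and the explicit membership check is usually the better choice.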
I have done some simple benchmarks here
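A rough way to run such a comparison yourself is sketched below with timeit. The two word lists are made up (one where every word repeats, one where every word is unique) and are not the data behind the benchmarks mentioned above. Timings vary by machine and Python version, so treat the numbers as indicative only.

```python
import timeit

def count_with_check(words):
    counts = {}
    for w in words:
        if w not in counts:
            counts[w] = 1
        else:
            counts[w] += 1
    return counts

def count_with_except(words):
    counts = {}
    for w in words:
        try:
            counts[w] += 1
        except KeyError:
            counts[w] = 1
    return counts

# Hypothetical inputs: the two extremes of word repetition.
mostly_repeats = ["spam"] * 10000
all_unique = [str(i) for i in range(10000)]

for name, data in [("mostly repeats", mostly_repeats), ("all unique", all_unique)]:
    for func in (count_with_check, count_with_except):
        seconds = timeit.timeit(lambda: func(data), number=20)
        print("%-15s %-18s %.4f s" % (name, func.__name__, seconds))
```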
This is a perfect job for Python's built-in collections module. From it, you can import Counter, which is a dictionary subclass made for just this.
How you want to process your data is up to you. One way to do this would be something like this:
from collections import Counter
# Open your file and read its contents
with open("yourfile.txt","r") as infile:
    textData = infile.read()
# Replace characters you don't want with empty strings
textData = textData.replace(".","")
textData = textData.replace(",","")
# Split on any whitespace (spaces, tabs, newlines)
textList = textData.split()
# Put your data into the counter container datatype
dic = Counter(textList)
# Print out the results
for key,value in dic.items():
    print("Word: %s\n Count: %d\n" % (key,value))
Hope this helps!
Matt
I'd like to create a program in Python 3 that finds how many times specific words appear in txt files and then builds an Excel table with these values.
I wrote this function, but when I call it with my input, the program doesn't work. I get this error: unindent does not match any outer indentation level
def wordcount(filename, listwords):
    try:
        file = open( filename, "r")
        read = file.readlines()
        file.close()
        for x in listwords:
            y = x.lower()
            counter = 0
            for z in read:
                line = z.split()
                for ss in line:
                    l = ss.lower()
                    if y == l:
                        counter += 1
            print(y , counter)
Now I try to call the function with a txt file and the word to find:
wordcount("aaa.txt" , 'word' )
As output I'd like to see:
word 4
Thanks to everybody!
Here is an example you can use to find the number of times a specific word appears in a text file:
def searching(filename,word):
    counter = 0
    with open(filename) as f:
        for line in f:
            if word in line:
                print(word)
                counter += 1
    return counter
x = searching("filename","wordtofind")
print(x)
The output will be the word you are looking for, printed once for each line that contains it, followed by the number of such lines.
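Note that word in line is a substring test on the whole line, so it counts lines rather than occurrences and also matches inside longer words. If whole-word occurrences are wanted, one possible variation is sketched below. The name count_word and the sample lines are made up; tokens are compared case-insensitively and punctuation is not stripped, so "mat." does not match "mat".

```python
def count_word(lines, word):
    """Count exact, case-insensitive matches of word among whitespace-separated tokens."""
    target = word.lower()
    total = 0
    for line in lines:
        total += sum(1 for token in line.lower().split() if token == target)
    return total

# Made-up sample input; an open file object could be passed instead.
lines = ["The cat sat on the mat.", "Then the cat left."]
print(count_word(lines, "the"))
```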
As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_text = file_object.read()
    return {word: file_text.count(word) for word in listwords}
for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python will iterate over the string character by character, which can be confusing if this behaviour is unfamiliar.
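A small illustration of that difference, using an in-memory string instead of a file (count_each and the sample text are made up for this sketch):

```python
def count_each(text, listwords):
    # Same dict comprehension as above, but on a string that is passed in directly
    return {word: text.count(word) for word in listwords}

text = "a list of words is a list"
print(count_each(text, ["list"]))  # one key: the whole word 'list'
print(count_each(text, "list"))    # four keys: the characters 'l', 'i', 's', 't'
```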
Apologies if this has been addressed before. I can't find any previous answers that address my specific problem, so here it is.
The exercise requires that the user inputs a .txt file name. The code takes that file, and counts the words within it, creating a dictionary of word : count pairs. If the file has already been input, and its words counted, then instead of recounting it, the program refers to the cache, where its previous counts are stored.
My problem is creating a nested dictionary of dictionaries - the cache. The following is what I have so far. At the moment, each new .txt file rewrites the dictionary, and prevents it being used as a cache.
def main():
    file = input("Enter the file name: ") #Takes a file input to count the words
    d = {} #open dictionary of dictionaries: a cache of word counts
    with open(file) as f:
        if f in d: #check if this file is in cache.
            for word in sorted(d[f]): #print the result of the word count of an old document.
                print("That file has already been assessed:\n%-12s:%5d" % (word, d[f][word]))
        else: #count the words in this file and add the count to the cache as a nested list.
            d[f] = {} #create a nested dictionary within 'd'.
            for line in f: #counts the unique words within the document.
                words = line.split()
                for word in words:
                    word = word.rstrip("!'?.,") #clean up punctuation here
                    word = word.upper() #all words to uppercase here
                    if word not in d[f]:
                        d[f][word] = 1
                    else:
                        d[f][word] = d[f][word] + 1
            for word in sorted(d[f]): #print the result of the word count of a new document.
                print("%-12s:%5d" % (word, d[f][word]))
    main() #Run code again to try new file.
main()
Easy fix:
d[file] = {}
....
d[file][word] = 1 # and so on
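because f is the file object, and every call to open() returns a new one, so f is never found in d. The file name stored in file is a plain string that compares equal each time the same name is entered.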
Also, you can use defaultdict:
from collections import defaultdict
d = defaultdict(lambda: defaultdict(int)) # cache: file name -> {word: count}
def count(file):
    if file not in d: # only read files that are not cached yet
        with open(file) as f:
            for line in f:
                for word in line.split():
                    d[file][word.rstrip("!'?.,").upper()] += 1
    return d[file]
def main():
    file = input("Enter the file name: ")
    if file in d:
        print("That file has already been assessed, blah blah")
    counts = count(file)
    for word in sorted(counts): #print the result of the word count
        print("%-12s:%5d" % (word, counts[word]))
if __name__ == "__main__":
    main()
Your issue is that you re-initialise the dictionary every time you call main(). You need to declare it outside the loop wherein you ask the user to provide a file name.
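A minimal sketch of that idea: the cache dictionary is created once, outside any repeated call, and the file name (a string) is used as the key. The helper names count_words and get_counts and the throwaway demo file are made up for illustration, so the sketch runs without user input.

```python
import os
import tempfile

def count_words(path):
    """Count upper-cased words in a file, stripping trailing punctuation."""
    counts = {}
    with open(path) as f:
        for line in f:
            for word in line.split():
                word = word.rstrip("!'?.,").upper()
                counts[word] = counts.get(word, 0) + 1
    return counts

def get_counts(path, cache):
    """Return (counts, was_cached), using the file name as the cache key."""
    if path in cache:
        return cache[path], True
    cache[path] = count_words(path)
    return cache[path], False

cache = {}  # created once, so it survives between lookups

# Demo on a temporary file (an assumption made only for this sketch).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Hello world, hello!\n")
demo_path = tmp.name
first, first_cached = get_counts(demo_path, cache)
second, second_cached = get_counts(demo_path, cache)
os.remove(demo_path)
print(first, first_cached, second_cached)
```

In an interactive program, the calls to get_counts would sit inside the loop that asks for a file name, while cache stays outside it.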
The process could also be neatened up a bit using collections.Counter() and str.translate:
from collections import Counter
import string
import os.path
d = {} # cache: file name -> Counter of words
# translation table that deletes all ASCII punctuation
strip_punctuation = str.maketrans('', '', string.punctuation)
while True:
    input_file = input("Enter the file name: ")
    if not os.path.isfile(input_file):
        print('File not found, try again')
        continue
    if input_file in d:
        print('Already found, top 5 words:')
    else:
        with open(input_file) as f:
            d[input_file] = Counter(f.read().upper().translate(strip_punctuation).split())
    for word, freq in sorted(d[input_file].items(), reverse=True, key=lambda x: x[1])[:5]:
        print(word.ljust(20) + str(freq).rjust(5))
This will print the top 5 most frequent words and their frequencies for a file. If it has already seen the file, it'll provide a warning as such. Example output:
THE 24
OF 15
AND 12
A 10
MODEL 9
I need this to print the corresponding line numbers from the text file.
def index (filename, lst):
    infile = open('raven.txt', 'r')
    lines = infile.readlines()
    words = []
    dic = {}
    for line in lines:
        line_words = line.split(' ')
        words.append(line_words)
    for i in range(len(words)):
        for j in range(len(words[i])):
            if words[i][j] in lst:
                dic[words[i][j]] = i
    return dic
The result:
In: index('raven.txt',['raven', 'mortal', 'dying', 'ghost', 'ghastly', 'evil', 'demon'])
Out: {'dying': 8, 'mortal': 29, 'raven': 77, 'ghost': 8}
(The words above appear in several lines but it's only printing one line and for some it doesn't print anything
Also, it does not count the empty lines in the text file. So 8 should actually be 9 because there's an empty line which it is not counting.)
Please tell me how to fix this.
def index (filename, lst):
    infile = open(filename, 'r')
    lines = infile.readlines()
    infile.close()
    words = []
    dic = {}
    for line in lines:
        line_words = line.split() # split on any whitespace, dropping the newline
        words.append(line_words)
    for i in range(len(words)):
        for j in range(len(words[i])):
            if words[i][j] in lst:
                if words[i][j] not in dic:
                    dic[words[i][j]] = set()
                dic[words[i][j]].add(i + 1) #range starts from 0
    return dic
Using a set instead of a list is useful in cases where the word is present several times in the same line, because the line number is then stored only once.
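A tiny illustration of the difference, with a made-up line and a made-up line number:

```python
line_words = "the raven and the raven".split()
as_list, as_set = [], set()
for word in line_words:
    if word == "raven":
        as_list.append(5)  # 5 is a hypothetical line number
        as_set.add(5)
print(as_list, as_set)  # the list keeps the duplicate, the set does not
```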
Use defaultdict to create a list of line numbers for each word:
from collections import defaultdict
def index(filename, lst):
    with open(filename, 'r') as infile:
        lines = [line.split() for line in infile]
    word2linenumbers = defaultdict(list)
    for linenumber, line in enumerate(lines, 1):
        for word in line:
            if word in lst:
                word2linenumbers[word].append(linenumber)
    return word2linenumbers
You can also use dict.setdefault to either start a new list for each word or append to an existing list if that word has already been found:
def index(filename, lst):
    # For larger lists, checking membership will be asymptotically faster using a set.
    lst = set(lst)
    dic = {}
    with open(filename, 'r') as fobj:
        for lineno, line in enumerate(fobj, 1):
            words = line.split()
            for word in words:
                if word in lst:
                    dic.setdefault(word, []).append(lineno)
    return dic
Your two main problems can be fixed as follows:
1.) Multiple indices: you need to assign a list as the dict value instead of a single int. Otherwise, each word is reassigned a new index every time a new line containing that word is found.
2.) Empty lines SHOULD be read as lines, so I think it's just an indexing issue. Your first line is indexed as 0 because range starts at 0.
You can simplify your program as follows:
def index (filename, lst):
    wordinds = {key:[] for key in lst} #initiates an empty list for each word
    with open(filename,'r') as infile: #why use filename param if you hardcoded the open....
        #the with statement is useful. trust.
        for linenum,line in enumerate(infile):
            for word in line.rstrip().split(): #strip new line and split into words
                if word in wordinds:
                    wordinds[word].append(linenum)
    return {word: nums for word, nums in wordinds.items() if nums} #filters empty lists
This simplifies everything into a single enumerated for loop over the lines. If you want the first line to be 1 and the second line to be 2, change wordinds[word].append(linenum) to ....append(linenum + 1).
EDIT: someone made a good point in another answer: use enumerate(infile, 1) to start your enumeration at index 1. That's way cleaner.
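To see the numbering, here is a short sketch that uses io.StringIO as a stand-in for an open file (an assumption made so the example runs without raven.txt). The blank second line still gets its own number:

```python
import io

# Fake three-line file whose second line is empty
fake_file = io.StringIO("first line\n\nthird line\n")
numbered = [(lineno, line.rstrip("\n")) for lineno, line in enumerate(fake_file, 1)]
print(numbered)
```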
I am currently taking a Python course and I am lost on this after 6+ hours. The assignment directs the student to create a program where the user enters a file name and Python opens the file and builds a sorted word list without duplicates. The directions are very clear that a for loop and append must be used: "For each word on each line check to see if the word is already in the list and if not append it to the list."
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.strip()
    words = line.split()
for words in fh:
    if words in 1st:continue
    elif 1st.append
1st.sort()
print 1st
It would be easier to just use the set() by itself, but this would be a good implementation per the assignment instructions. It's really fast compared to a list only version!
def get_exclusive_list(fname):
    words = []
    with open(fname, 'r') as file_d:
        for li in file_d:
            words.extend(li.split())
    data = []
    already_seen = set() # membership tests on a set are fast
    for thing in words:
        if thing not in already_seen:
            data.append(thing)
            already_seen.add(thing)
    return data
# The better implementation
def get_exclusive_list_improved(fname):
    words = []
    with open(fname, 'r') as file_d:
        for li in file_d:
            words.extend(li.split())
    return list(set(words)) # note: a set does not preserve order
I'm not sure what the following loop is supposed to do:
for words in fh:
    if words in 1st:continue
    elif 1st.append
It does nothing, because the first loop has already exhausted the file fh before control reaches this part.
Instead, put an inner loop inside for line in fh: that goes over the words in the words list one by one and appends each one to lst if it is not already there.
Also, you should call lst.append(word).
Also, I do not think your if..elif block is valid syntax.
You should be doing something like this:
for line in fh:
    line = line.strip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
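Put together with the final sort, the whole assignment could look roughly like the sketch below. It is written for Python 3 and wrapped in a function (unique_sorted_words is a made-up name) that accepts any iterable of lines, so it can be tried on a list of strings before being pointed at an open file.

```python
def unique_sorted_words(lines):
    lst = []
    for line in lines:
        for word in line.split():
            # Only append words that are not in the list yet
            if word not in lst:
                lst.append(word)
    lst.sort()
    return lst

# Made-up sample lines; with a real file you could pass open(fname) instead.
print(unique_sorted_words(["but soft what light", "it is the east and soft light"]))
```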
I would like to define a function scaryDict() which takes one parameter (a textfile) and returns the words from the textfile in alphabetical order, basically produce a dictionary but does not print any one or two letter words.
Here is what I have so far...it isn't much but I don't know the next step
def scaryDict(fineName):
    inFile = open(fileName,'r')
    lines = inFile.read()
    line = lines.split()
    myDict = {}
    for word in inFile:
        myDict[words] = []
    #I am not sure what goes between the line above and below
    for x in lines:
        print(word, end='\n')
You are doing fine up to line = lines.split(). But your for loop must loop through the line list, not inFile.
for word in line:
    if len(word) > 2: # Make sure to check the word length!
        myDict[word] = 'something'
I'm not sure what you want with the dictionary (maybe get the word count?), but once you have it, you can get the words you added to it with:
allWords = list(myDict.keys()) # so allWords is now a list of words (in Python 3, keys() alone returns a view, which has no sort method)
And then you can sort allWords to get them in alphabetical order.
allWords.sort()
I would store all of the words into a set (to eliminate dups), then sort that set:
#!/usr/bin/python3
def scaryDict(fileName):
    with open(fileName) as inFile:
        return sorted(set(word
                          for line in inFile
                          for word in line.split()
                          if len(word) > 2))
scaryWords = scaryDict('frankenstein.txt')
print ('\n'.join(scaryWords))
Also keep in mind that, as of Python 2.5, file objects work with the with statement (they provide __enter__ and __exit__ methods), which prevents issues such as the file never getting closed.
with open(...) as f:
    for line in f:
        <do something with line>
Unique set
Sort the set
Now you can put it all together.
sorry that i am 3 years late : ) here is my version
def scaryDict():
    infile = open('filename', 'r')
    content = infile.read()
    infile.close()
    table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
    content = content.translate(table)
    words = content.split()
    new_words = list()
    for word in words:
        if len(word) > 2:
            new_words.append(word)
    new_words = list(set(new_words))
    new_words.sort()
    for word in new_words:
        print(word)