Nested dictionaries for word count cache - python

Apologies if this has been addressed before. I can't find any previous answers that address my specific problem, so here it is.
The exercise requires that the user inputs a .txt file name. The code takes that file, and counts the words within it, creating a dictionary of word : count pairs. If the file has already been input, and its words counted, then instead of recounting it, the program refers to the cache, where its previous counts are stored.
My problem is creating a nested dictionary of dictionaries - the cache. The following is what I have so far. At the moment, each new .txt file rewrites the dictionary, and prevents it being used as a cache.
def main():
    file = input("Enter the file name: ") #Takes a file input to count the words
    d = {} #open dictionary of dictionaries: a cache of word counts
    with open(file) as f:
        if f in d: #check if this file is in cache.
            for word in sorted(d[f]): #print the result of the word count of an old document.
                print("That file has already been assessed:\n%-12s:%5d" % (word, d[f][word]))
        else: #count the words in this file and add the count to the cache as a nested dictionary.
            d[f] = {} #create a nested dictionary within 'd'.
            for line in f: #counts the unique words within the document.
                words = line.split()
                for word in words:
                    word = word.rstrip("!'?.,") #clean up punctuation here
                    word = word.upper() #all words to uppercase here
                    if word not in d[f]:
                        d[f][word] = 1
                    else:
                        d[f][word] = d[f][word] + 1
            for word in sorted(d[f]): #print the result of the word count of a new document.
                print("%-12s:%5d" % (word, d[f][word]))
    main() #Run code again to try new file.
main()

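Easy fix: key the cache on the file name instead of the file object:
d[file] = {}
....
d[file][word] = 1 # and so on
Every call to open() returns a new file object, so f in d is never true, whereas the file name stays the same between runs. The dictionary d must also be created once, outside main(), or it starts out empty on every call.
Also, you can use defaultdict: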
from collections import defaultdict

# cache: file name -> {word: count}; the factory takes no arguments
d = defaultdict(lambda: defaultdict(int))

def count(file):
    if file not in d:  # only read the file if it is not cached yet
        with open(file) as f:
            for line in f:
                for word in line.split():
                    d[file][word.rstrip("!'?.,").upper()] += 1
    return d[file]

def main():
    file = input("Enter the file name: ")
    if file in d:  # check before counting, otherwise this is always true
        print("That file has already been assessed, blah blah")
    counts = count(file)
    for word in sorted(counts):  # print the word counts of the document.
        print("%-12s:%5d" % (word, counts[word]))

if __name__ == "__main__":
    main()  # call main() repeatedly (e.g. in a loop) to see the cache being used

Your issue is that you re-initialise the dictionary every time you call main(). You need to create it once, outside the loop in which you ask the user for a file name.
The process could also be neatened up a bit using collections.Counter() and str.translate():
from collections import Counter
import string
import os.path

d = {}  # cache: file name -> Counter of its words
# translation table that deletes every punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

while True:
    input_file = input("Enter the file name: ")
    if not os.path.isfile(input_file):
        print('File not found, try again')
        continue
    if input_file in d:
        print('Already found, top 5 words:')
    else:
        with open(input_file) as f:
            d[input_file] = Counter(f.read().upper().translate(strip_punct).split())
    # most_common(5) returns the five (word, count) pairs with the highest counts
    for word, freq in d[input_file].most_common(5):
        print(word.ljust(20) + str(freq).rjust(5))
This will print the top 5 most frequent words and their frequencies for a file. If it has already seen the file, it prints a notice saying so and reuses the cached counts. Example output:
THE                    24
OF                     15
AND                    12
A                      10
MODEL                   9
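The same caching idea can be sketched as a reusable function. This is a minimal sketch, assuming the cache is a module-level dictionary keyed by file name; the names `_cache` and `cached_word_counts` are hypothetical and do not come from the answers above.

```python
from collections import Counter
import string

# Hypothetical module-level cache: file name -> Counter of its words.
_cache = {}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def cached_word_counts(path):
    """Return (counts, was_cached) for the text file at `path`."""
    if path in _cache:
        # The file was counted before: reuse the stored result.
        return _cache[path], True
    with open(path) as f:
        counts = Counter(f.read().upper().translate(_STRIP_PUNCT).split())
    _cache[path] = counts
    return counts, False
```

Calling `cached_word_counts("some_file.txt")` twice reads the file only the first time; the second call returns the stored `Counter` and `True`. Because the cache is keyed by name, a file that changes on disk between calls would still return the old counts.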

Related

Adding additional values to a dictionary value as a list

Here is my code:
def corpus_reading_pos(corpus_name, pos_tag, option="pos"):
    pos_tags = []
    words = []
    tokens_pos = {}
    file_count = 0
    for root, dirs, files in os.walk(corpus_name):
        for file in files:
            if file.endswith(".v4_gold_conll"):
                with open((os.path.join(root, file))) as f:
                    pos_tags += [line.split()[4] for line in f if line.strip() and not line.startswith("#")]
                with open((os.path.join(root, file))) as g:
                    words += [line.split()[3] for line in g if line.strip() and not line.startswith("#")]
                file_count += 1
    for pos in pos_tags:
        tokens_pos[pos] = []
    words_pos = list(zip(words, pos_tags))
    for word in words_pos:
        tokens_pos[word[1]] = word[0]
    #print(words_pos)
    print(tokens_pos)
    #print(words)
    print("Token count:", len(tokens_pos))
    print("File count:", file_count)
I'm trying to create a dictionary that has all of the POS tags as keys, where each value is the list of all words that belong to that POS tag. I'm stuck on the part where I have to build a list of words for each value.
In the code, the line tokens_pos[word[1]] = word[0] only stores one word per key, but if I try something like [].append(word[0]), every value in the dictionary comes back as None.
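You seem to be doing a lot of double work, but to answer your specific question:
for word in words_pos:
    tokens_pos[word[1]].append(word[0])
should do what you want. It works because every key already holds an empty list from your earlier loop. Note that list.append() modifies the list in place and returns None, which is why assigning its result to the dictionary gives None for every value.
With
tokens_pos[word[1]] = word[0]
you are overwriting any existing value stored under the same key, so only the last value written for each key remains in the end.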
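The grouping step can also be written without pre-filling the dictionary. The following is a minimal sketch using `collections.defaultdict(list)`; the function name `group_words_by_pos` and the sample tags are assumptions for illustration, not part of the original code.

```python
from collections import defaultdict


def group_words_by_pos(words, pos_tags):
    """Map each POS tag to the list of words that carry that tag."""
    tokens_pos = defaultdict(list)  # missing keys start as an empty list
    for word, pos in zip(words, pos_tags):
        tokens_pos[pos].append(word)
    return dict(tokens_pos)


# Hypothetical sample data standing in for the columns read from the corpus.
print(group_words_by_pos(["the", "dog", "a", "cat"], ["DT", "NN", "DT", "NN"]))
# → {'DT': ['the', 'a'], 'NN': ['dog', 'cat']}
```

Reading both columns in a single pass over each file (one `line.split()` per line) would also avoid opening every file twice.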

find words in txt files Python 3

I'd like to create a program in Python 3 that finds how many times specific words appear in txt files and then builds an Excel table with these values.
I wrote the function below, but when I call it with my input, the program doesn't work. It fails with this message: unindent does not match any outer indentation level
def wordcount(filename, listwords):
    try:
        file = open( filename, "r")
        read = file.readlines()
        file.close()
        for x in listwords:
            y = x.lower()
            counter = 0
            for z in read:
                line = z.split()
                for ss in line:
                    l = ss.lower()
                    if y == l:
                        counter += 1
            print(y , counter)
Now I try to call the function with a txt file and the word to find:
wordcount("aaa.txt" , 'word' )
As output I'd like to see:
word 4
Thanks to everybody!
Here is an example you can use to find the number of times a specific word occurs in a text file:
def searching(filename, word):
    counter = 0
    with open(filename) as f:
        for line in f:
            # count whole-word matches on this line, not just lines that contain the word
            counter += line.split().count(word)
    return counter

x = searching("filename", "wordtofind")
print("wordtofind", x)
The output is the word you searched for and the number of times it occurs.
As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_text = file_object.read()
    return {word: file_text.count(word) for word in listwords}

for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
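Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python iterates over the string one character at a time, which can be confusing if this behaviour is unfamiliar. So wordcount("aaa.txt", 'word') counts the letters w, o, r and d, whereas wordcount("aaa.txt", ['word']) counts the word itself.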
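One caveat with `str.count` is that it counts substrings, so `'a'` is also found inside `'cat'`. A minimal sketch of a whole-word, case-insensitive variant follows; the function name `count_whole_words` and the sample sentence are made up for illustration.

```python
from collections import Counter


def count_whole_words(text, listwords):
    """Count whole-word, case-insensitive occurrences of each word in listwords."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words that never occur.
    return {word: counts[word.lower()] for word in listwords}


sample = "A cat sat on a mat"
print(count_whole_words(sample, ["a", "at"]))
# → {'a': 2, 'at': 0}
print({w: sample.lower().count(w) for w in ["a", "at"]})
# → {'a': 5, 'at': 3}  (substring matches)
```

This sketch splits on whitespace only, so a word followed by punctuation such as `mat.` would not match `mat` unless the punctuation is stripped first.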

Recreating a sentence using text files and lists

I am relatively new to Python and I am currently working on a compression program that uses a list of the positions of words and a list of the words that make up the sentence. So far I have written my program as two functions. The first function, 'compress', gets the words that make up the sentence and the positions of those words. The second function, 'recreate', uses the lists to recreate the sentence. The recreated sentence is then stored in a file called recreate.txt. My issue is that the positions of the words and the words that make up the sentence are not being written to their respective files, and the 'recreate' file is not being created or written to. Any help would be greatly appreciated. Thanks :)
sentence = input("Input the sentence that you wish to be compressed")
sentence.lower()
sentencelist = sentence.split()
d = {}
plist = []
wds = []

def compress():
    for i in sentencelist:
        if i not in wds:
            wds.append(i)
    for i ,j in enumerate(sentencelist):
        if j in (d):
            plist.append(d[j])
        else:
            plist.append(i)
    print (plist)
    tsk3pos = open ("tsk3pos.txt", "wt")
    for item in plist:
        tsk3pos.write("%s\n" % item)
    tsk3pos.close()
    tsk3wds = open ("tsk3wds.txt", "wt")
    for item in wds:
        tsk3wds.write("%s\n" % item)
    tsk3wds.close()
    print (wds)

def recreate(compress):
    compress()
    num = list()
    wds = list()
    with open("tsk3wds.txt", "r") as txt:
        for line in txt:
            words += line.split()
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    recreate = ' '.join(words[pos] for pos in num)
    with open("recreate.txt", "wt") as txt:
        txt.write(recreate)
UPDATED
I have fixed all other problems except the recreate function, which will not make the 'recreate' file and will not recreate the sentence with the words, although it recreates the sentence with the positions.
def recreate(compress): #function that will be used to recreate the compressed sentence.
    compress()
    num = list()
    wds = list()
    with open("words.txt", "r") as txt: #with statement opening the word text file
        for line in txt: #iterating over each line in the text file.
            words += line.split() #turning the textfile into a list and appending it to num
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    recreate = ' '.join(wds[pos] for pos in num)
    with open("recreate.txt", "wt") as txt:
        txt.write(recreate)
    main()

def main():
    print("Do you want to compress an input or recreate a compressed input?")
    user = input("Type 'a' if you want to compress an input. Type 'b' if you wan to recreate an input").lower()
    if user not in ("a","b"):
        print ("That's not an option. Please try again")
    elif user == "a":
        compress()
    elif user == "b":
        recreate(compress)
    main()

main()
A simpler (yet less efficient) approach:
recreate_file_object = open("C:/FullPathToWriteFolder/recreate.txt", "w")
recreate_file_object.write(recreate)
recreate_file_object.close()
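In the posted recreate function, the file contents are appended to `words`, a name that is never defined (the list created just above is called `wds`), so the function raises an error before recreate.txt is written. A minimal sketch of a version that round-trips follows, assuming positions are indices into the list of unique words; the helper names `save_lines` and `load_lines` are hypothetical, and the file names in the usage note are taken from the question.

```python
def compress(sentence):
    """Return (unique_words, positions); positions index into unique_words."""
    unique_words = []
    index_of = {}   # word -> its index in unique_words
    positions = []
    for word in sentence.lower().split():
        if word not in index_of:
            index_of[word] = len(unique_words)
            unique_words.append(word)
        positions.append(index_of[word])
    return unique_words, positions


def recreate(unique_words, positions):
    """Rebuild the sentence from the word list and the position list."""
    return " ".join(unique_words[pos] for pos in positions)


def save_lines(path, items):
    """Write one item per line."""
    with open(path, "w") as f:
        for item in items:
            f.write(f"{item}\n")


def load_lines(path):
    """Read back the non-empty lines as strings."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

A usage example under those assumptions: `words, positions = compress(sentence)`, then `save_lines("tsk3wds.txt", words)` and `save_lines("tsk3pos.txt", positions)`; later, `recreate(load_lines("tsk3wds.txt"), [int(p) for p in load_lines("tsk3pos.txt")])` gives back the lowercased sentence. Note that `sentence.lower()` on its own line has no effect in the original code, because strings are immutable and the result is discarded.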

Python create a sorted list from file using append remove duplicate words

I am currently working through a Python course and I am still lost on this after 6+ hours. The assignment directs the student to create a program where the user enters a file name, and Python opens the file and builds a sorted word list without duplicates. The directions are very clear that a for loop and append must be used: "For each word on each line check to see if the word is already in the list and if not append it to the list."
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.strip()
    words = line.split()
for words in fh:
    if words in 1st:continue
    elif 1st.append
1st.sort()
print 1st
It would be easier to just use the set() by itself, but this would be a good implementation per the assignment instructions. It's really fast compared to a list only version!
def get_exclusive_list(fname):
    words = []
    with open(fname, 'r') as file_d:
        for line in file_d:
            words.extend(line.split())
    data = []
    already_seen = set()  # set membership tests are much faster than list lookups
    for thing in words:
        if thing not in already_seen:
            data.append(thing)
            already_seen.add(thing)
    return sorted(data)  # the assignment asks for a sorted list

# The better implementation
def get_exclusive_list_improved(fname):
    with open(fname, 'r') as file_d:
        words = file_d.read().split()
    return sorted(set(words))
I am not sure what the following loop is supposed to do:
for words in fh:
    if words in 1st:continue
    elif 1st.append
It does nothing, because the first loop has already exhausted the file fh before control reaches this part.
Instead, put an inner loop inside for line in fh: that goes over the words in the words list one by one and appends each one to lst if it is not already there.
Also, the call should be lst.append(word). The name has to be lst, not 1st, since an identifier cannot start with a digit.
Your if..elif block is not valid syntax either: elif needs a condition and a colon.
You should be doing something like this:
for line in fh:
    line = line.strip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
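Putting the pieces together, the whole assignment can be sketched in Python 3 as below. This is a minimal sketch rather than the course's reference solution; the function name `sorted_unique_words` and the sample lines are made up, and the function accepts any iterable of lines, so an open file object works as well.

```python
def sorted_unique_words(lines):
    """Build a sorted list of unique words using a for loop and append."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:   # skip words that are already in the list
                lst.append(word)
    lst.sort()
    return lst


sample = ["but soft what light", "light through yonder window"]
print(sorted_unique_words(sample))
# → ['but', 'light', 'soft', 'through', 'what', 'window', 'yonder']
```

To use it on a file, one option is `with open(fname) as fh: print(sorted_unique_words(fh))`, where `fname` comes from `input("Enter file name: ")`.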

Python Beginning Program Dictionary and List Issue

Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '') #strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
                    dic[x] = c
    print(dic)
    print(words)
    inFile.close()
main()
Sorry for the vague question. Never asked any questions here before. This is what I have so far. Also, this is the first ever programming I've done so I expect it to be pretty terrible.
with open('path/to/file') as infile:
    # code goes here
That's how you open a file.
for line in infile:
    # code goes here
That's how you read a file line by line.
line.strip().split()
That's how you split a line into (whitespace-separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
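For readers who want to check their answer to the hints above, one possible solution is sketched here using `dict.get` with a default value. It is only one of several valid approaches; the function name `count_words` is made up for illustration, and it takes any iterable of lines (such as an open file).

```python
def count_words(lines):
    """Return a dict mapping each whitespace-separated word to its count."""
    counts = {}
    for line in lines:
        for word in line.strip().split():
            # get() returns 0 when the word has not been seen yet
            counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words(["a b a", "b c"]))
# → {'a': 2, 'b': 2, 'c': 1}
```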
There are at least three different approaches to adding a new word to the dictionary and counting its occurrences in the file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1

def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1

def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1

my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        #or add_element_check2(my_words, words)
        #or add_element_except(my_words, words)
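If you are wondering which is the fastest, the answer is: it depends on how often a given word occurs in the file. If most words occur many times, the try-except version tends to be the best choice, because the exception is rarely raised. If most words occur only once or twice, raising and catching KeyError that often is costly, and the explicit membership check is usually faster.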
I have done some simple benchmarks here
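Since the linked benchmark is not reproduced here, a comparison of this kind can be sketched with the standard `timeit` module. The two input lists below are hypothetical extremes (one word repeated many times versus all words unique), the function names are made up for this sketch, and the timings will vary by machine and Python version, so no results are claimed.

```python
import timeit


def count_with_check(elements):
    """Count using an explicit membership test."""
    counts = {}
    for e in elements:
        if e not in counts:
            counts[e] = 1
        else:
            counts[e] += 1
    return counts


def count_with_except(elements):
    """Count by trying the increment and catching KeyError on first sight."""
    counts = {}
    for e in elements:
        try:
            counts[e] += 1
        except KeyError:
            counts[e] = 1
    return counts


# Hypothetical inputs: one word repeated many times, and all-unique words.
mostly_repeats = ["the"] * 10_000
all_unique = [str(i) for i in range(10_000)]

for label, data in [("mostly repeats", mostly_repeats), ("all unique", all_unique)]:
    for func in (count_with_check, count_with_except):
        seconds = timeit.timeit(lambda: func(data), number=5)
        print(f"{label:15s} {func.__name__:18s} {seconds:.4f}s")
```

Running this on a sample of the real text, rather than on synthetic extremes, gives the most relevant answer for a particular file.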
This is a perfect job for Python's built-in collections module. From it, you can import Counter, which is a dictionary subclass made for just this.
How you want to process your data is up to you. One way to do this would be something like this:
from collections import Counter

# Open your file and read its contents
with open("yourfile.txt", "r") as infile:
    textData = infile.read()
# Replace characters you don't want with empty strings
textData = textData.replace(".", "")
textData = textData.replace(",", "")
# Split on any whitespace (spaces, tabs, newlines)
textList = textData.split()
# Put your data into the counter container datatype
dic = Counter(textList)
# Print out the results
for key, value in dic.items():
    print("Word: %s\n Count: %d\n" % (key, value))
Hope this helps!
Matt
