How to create a dictionary for a text file - python

My program opens a file and counts the words it contains, but I want to create a dictionary consisting of all the unique words in the text.
For example, if the word 'computer' appears three times, I want that to count as one unique word.
def main():
    file = input('Enter the name of the input file: ')
    infile = open(file, 'r')
    file_contents = infile.read()
    infile.close()
    words = file_contents.split()
    number_of_words = len(words)
    print("There are", number_of_words, "words contained in this paragraph")

main()

Use a set. This will only include unique words:
words = set(words)
If you don't care about case, you can do this:
words = set(word.lower() for word in words)
This assumes there is no punctuation. If there is, you will need to strip the punctuation.
import string
words = set(word.lower().strip(string.punctuation) for word in words)
If you need to keep track of how many of each word you have, just replace set with Counter in the examples above:
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
This will give you a dictionary-like object that tells you how many of each word there is.
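For instance, a quick demonstration with made-up data (not from your file):
from collections import Counter

words = Counter(['computer', 'mouse', 'computer'])
print(words['computer'])       # 2 -- a Counter indexes like a dict
print(words.most_common(1))    # [('computer', 2)]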
You can also get the number of unique words from this (although it is slower if that is all you care about):
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
nword = len(words)
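Folded back into the original main(), a minimal sketch (same logic, my wording of the final message) could look like:
import string

def main():
    file = input('Enter the name of the input file: ')
    infile = open(file, 'r')
    file_contents = infile.read()
    infile.close()
    words = file_contents.split()
    # each distinct word counts once, regardless of case and punctuation
    unique_words = set(word.lower().strip(string.punctuation) for word in words)
    print("There are", len(words), "words contained in this paragraph,",
          len(unique_words), "of them unique")

main()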

@TheBlackCat's solution works, but it only gives you how many unique words are in the string/file. This solution also shows you how many times each word occurs.
dictionaryName = {}
for word in words:
    if word not in dictionaryName:
        dictionaryName[word] = 1
    else:
        dictionaryName[word] = dictionaryName.get(word) + 1
print(dictionaryName)
tested with:
words = "Foo", "Bar", "Baz", "Baz"
output: {'Foo': 1, 'Bar': 1, 'Baz': 2}

Probably a cleaner and quicker solution:
words_dict = {}
for word in words:
    word_count = words_dict.get(word, 0)
    words_dict[word] = word_count + 1
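For example, starting from words = ["foo", "bar", "foo"], the loop leaves words_dict as {'foo': 2, 'bar': 1}.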

Related

How do I get the specific number of a word in a txt file?

I'm trying to find each place one of several specific words is used in a TXT file, and then report which word number in the file that occurrence is. My code returns the number for some but not all of the words, and I have no idea why.
My code right now goes through the file word by word with a counter and returns the number if the word matches one of the words I want.
def wordnumber(file, filewrite, word1, word2, word3):
    import os
    wordlist = [word1, word2, word3]
    infile = open(file, 'r')
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ' '.join(lines)
    words = wordsString.split()
    n = 1
    for w in words:
        if w in wordlist:
            g.write(str(n))
            g.write(os.linesep)
        n = n + 1
This works sometimes, but for some text files it only returns some of the numbers and leaves others blank.
If you want to find the first occurrence of the word in your words, just use
wordIndex = words.index(w) if w in words else None
and for all occurences use
wordIndexes = [i for i,x in enumerate(words) if x==word]
(taken from Python: Find in list)
But beware: if your text is "cat, dog, mouse", your code won't find the index of "cat" or "dog", because "cat, dog, mouse".split() returns ['cat,', 'dog,', 'mouse'], and 'cat,' is not 'cat'.
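One way around that, sketched here with the standard string module and assuming only leading/trailing punctuation, is to clean each token before matching:
import string

# strip punctuation from the ends of each token so 'cat,' compares equal to 'cat'
words = [w.strip(string.punctuation) for w in wordsString.split()]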

How to remove duplicates in a document?

I'm writing a word-unjumbling program. Here is my code:
import collections

sortedWords = collections.defaultdict(list)
with open("/xxx/xxx/words.txt", "r") as f:
    for word in f:
        word = word.strip().lower()
        sortFword = ''.join(sorted(word))
        sortedWords[sortFword].append(word)

while True:
    jumble = input("Enter your jumbled word:").lower()
    sortedJumble = ''.join(sorted(jumble))
    if sortedJumble in sortedWords:
        words = sortedWords[sortedJumble]
        if len(words) > 1:
            print("Your words are: ")
            print("\n".join(words))
        else:
            print("Your word is", words[0] + ".")
        break
    else:
        print("Oops, it can not be unjumbled.")
        break
Now this code works. However, my program sometimes prints two identical words. For example, I typed "prisng" as the jumbled word, and I got two "spring"s. That is because there were two "spring"s in the word document: one is "spring" and the other is "Spring". I want to remove all duplicates from words.txt, but how do I remove them? Please give me some advice.
You can use the built-in set type to do this.
words = ['hi', 'Hi']
words = list(map(lambda x: x.lower(), words)) # makes all the words lowercase
words = list(set(words)) # removes all duplicates
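Applied to the reading loop from the question, a minimal sketch that drops duplicates as it reads might look like this:
import collections

sortedWords = collections.defaultdict(list)
with open("/xxx/xxx/words.txt", "r") as f:
    for word in f:
        word = word.strip().lower()
        sortFword = ''.join(sorted(word))
        # lowercasing first makes 'Spring' and 'spring' identical,
        # and the membership test keeps only one copy per group
        if word not in sortedWords[sortFword]:
            sortedWords[sortFword].append(word)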

Print frequency of words, in a sentence, in a single line

I have a sentence "The quick brown fox jumps over the lazy dog", and I have counted the number of times each word occurs in this sentence. The output should be like this:
brown:1,dog:1,fox:1,jumps:1,lazy:1,over:1,quick:1,the:2
There should be no spaces between the characters in this output, and there should be commas between the words/numbers.
The output from my program looks like this:
,brown:1,dog:1,fox:1,jumps:1,lazy:1,over:1,quick:1,the:2
I find that there is a comma placed before 'brown'. Is there an easier way to print this?
import os
import sys

filename = os.path.basename(path)  # `path` is assumed to be defined elsewhere
with open(filename, 'r+') as f:
    fline = f.read()

fwords = fline.split()
allwords = [word.lower() for word in fwords]
sortwords = list(set(allwords))
r = sorted(sortwords, key=str.lower)
finalwords = ','.join(r)
sys.stdout.write(str(finalwords))
print '\n'

countlist = {}
for word in allwords:
    try: countlist[word] += 1
    except KeyError: countlist[word] = 1

for c, num in sorted(countlist.items()):
    sys.stdout.write(",{:}:{:}".format(c, num))
A couple of alternative ways of building the counts. First, a one-liner:
countlist = {word: allwords.count(word) for word in allwords}
As pointed out by DSM, that method can be slow with long lists. An alternative is to use collections.defaultdict:
from collections import defaultdict

countlist = defaultdict(int)
for word in allwords:
    countlist[word] += 1
For output, join the individual word counts with ",", which avoids the stray comma at the beginning:
sys.stdout.write(",".join(["{:}:{:}".format(key, value) for key, value in countlist.items()]))
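Putting the pieces together, a short self-contained sketch (with the sample sentence hard-coded in place of the file read) produces exactly the output asked for:
import sys
from collections import defaultdict

allwords = "the quick brown fox jumps over the lazy dog".split()
countlist = defaultdict(int)
for word in allwords:
    countlist[word] += 1
sys.stdout.write(",".join("{}:{}".format(w, n) for w, n in sorted(countlist.items())))
# output: brown:1,dog:1,fox:1,jumps:1,lazy:1,over:1,quick:1,the:2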

Word unscrambler

I'm working with some of the corpus materials from NLPP. I'm trying to improve my unscrambling score in the code... at the moment I'm hitting 91.250%.
The point of the exercise is to alter the represent_word function to improve the score.
The function consumes a word as a string; this word is either scrambled or unscrambled. The function produces a "representation" of the word, which is a list containing the following information:
word length
number of vowels
number of consonants
first and last letter of the word (these are always unscrambled)
a tuple of the most commonly used words from the corpus whose characters are all members of the given input word
For example, under this scheme a word like "python" would come out roughly as ('p', 'n', 6, 4, 2, common_words). I have also tried analysing anagrams of prefixes and suffixes, but they contribute nothing to the score beyond what the most-common-words tuple already provides.
I'm not sure why I can't improve the score. I've even tried increasing the dictionary size by importing words from another corpus.
The only section that can be altered here is the represent_word function and the definitions just above it. However, I'm including the entire source in case it yields some insight to someone.
import nltk
import re

def word_counts(corpus, wordcounts={}):
    """Function that counts all the words in the corpus."""
    for word in corpus:
        wordcounts.setdefault(word.lower(), 0)
        wordcounts[word.lower()] += 1
    return wordcounts

JA_list = filter(lambda x: x.isalpha(), map(lambda x: x.lower(),
                 nltk.corpus.gutenberg.words('austen-persuasion.txt')))
JA_freqdist = nltk.FreqDist(JA_list)
JA_toplist = sorted(JA_freqdist.items(), key=lambda x: x[1], reverse=True)[:0]
JA_topwords = []
for i in JA_toplist:
    JA_topwords.append(i[0])

PP_list = filter(lambda x: x.isalpha(), map(lambda x: x.lower(),
                 open("Pride and Prejudice.txt").read().split()))
PP_freqdist = nltk.FreqDist(PP_list)
PP_toplist = sorted(PP_freqdist.items(), key=lambda x: x[1], reverse=True)[:7]
PP_topwords = []
for i in PP_toplist:
    PP_topwords.append(i[0])

uniquewords = []
for i in JA_topwords:
    if i not in PP_topwords:
        uniquewords.append(i)
uniquewords.extend(PP_topwords)

def represent_word(word):
    def common_word(word):
        dictionary = uniquewords
        findings = []
        for string in dictionary:
            if all((letter in word) for letter in string):
                findings.append(string)
        if not findings:
            return None
        else:
            return tuple(findings)
    vowels = list("aeiouy")
    consonants = list("bcdfghjklmnpqrstvwxz")
    number_of_consonants = sum(word.count(i) for i in consonants)
    number_of_vowels = sum(word.count(i) for i in vowels)
    split_word = list(word)
    common_words = common_word(word)
    return tuple([split_word[0], split_word[-1], len(split_word),
                  number_of_consonants, number_of_vowels, common_words])

def create_mapping(words, mapping={}):
    """Returns a mapping of representations of words to the most common word for that representation."""
    for word in words:
        representation = represent_word(word)
        mapping.setdefault(representation, ("", 0))
        if mapping[representation][1] < words[word]:
            mapping[representation] = (word, words[word])
    return mapping

if __name__ == '__main__':
    # Create a mapping of representations of the words in Persuasion by Jane Austen to use as a corpus
    words = JA_freqdist
    mapping = create_mapping(words)
    # Load the words in the scrambled file
    with open("Pdrie and Puicejdre.txt") as scrambled_file:
        scrambled_lines = [line.split() for line in scrambled_file if len(line.strip()) > 0]
    scrambled_words = [word.lower() for line in scrambled_lines for word in line]
    # Descramble the words using the best mapping
    descrambled_words = []
    for scrambled_word in scrambled_words:
        representation = represent_word(scrambled_word)
        if representation in mapping:
            descrambled_word = mapping[representation][0]
        else:
            descrambled_word = scrambled_word
        descrambled_words.append(descrambled_word)
    # Load the original words
    with open("Pride and Prejudice.txt") as original_file:
        original_lines = [line.split() for line in original_file if len(line.strip()) > 0]
    original_words = [word.lower() for line in original_lines for word in line]
    # Make a list of word pairs from descrambled_words and original_words
    word_pairs = zip(descrambled_words, original_words)
    # See if the words are the same
    judgements = [descrambled_word == original_word for (descrambled_word, original_word) in word_pairs]
    # Print the results
    print "Correct: {0:.3%}".format(float(judgements.count(True)) / len(judgements))

Counting every word in a text file only once using python

I have a small Python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and 10 least frequent words with their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine; however, the third part of the assignment is to print the total number of unique words in the document, meaning each word in the document is counted only once.
Without changing my current script too much, how can I count all the words in the document only one time?
p.s. I am using Python 2.6 so please don't mention the use of collections.Counter
from string import punctuation
from collections import defaultdict
import re

number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count

"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1

# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique
Count the number of keys in your counter dictionary:
total_unique = len(counter.keys())
Or more simply:
total_unique = len(counter)
A defaultdict is great, but it might be more than what you need. You do need it for the part about the most frequent words, but if all you wanted were the unique words, a defaultdict would be overkill. In that situation, I would suggest using a set instead:
words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)

num_unique_words = len(words)
Now words contains only unique words.
I am only posting this because you say you are new to Python, so I want to make sure you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.
