KeyError on the same word - python

I am trying to generate a sentence in the style of the bible. But whenever I run it, it stops at a KeyError on the same exact word. This is confusing as it is only using its own keys and it is the same word every time in the error, despite having random.choice.
This is the txt file if you want to run it: ftp://ftp.cs.princeton.edu/pub/cs226/textfiles/bible.txt
import random
files = []
content = ""
output = ""
words = {}
files = ["bible.txt"]
sentence_length = 200
for file in files:
    file = open(file)
    content = content + " " + file.read()
content = content.split(" ")
for i in range(100): # I didn't want to go through every word in the bible, so I'm just going through 100 words
    words[content[i]] = []
    words[content[i]].append(content[i+1])
word = random.choice(list(words.keys()))
output = output + word
for i in range(int(sentence_length)):
    word = random.choice(words[word])
    output = output + word
print(output)

The KeyError happens on this line:
word = random.choice(words[word])
It always happens for the word "midst".
How? "midst" is the 100th word in the text.
And the 100th position is the first time it is seen.
The consequence is that "midst" itself was never put in words as a key.
Hence the KeyError.
Why does the program reach this word so fast? Partly because of a bug here:
for i in range(100):
    words[content[i]] = []
    words[content[i]].append(content[i+1])
The bug is the words[content[i]] = [] statement.
Every time you see a word, you recreate an empty list for it, which throws away the followers recorded earlier.
The word before "midst" is "the".
It is a very common word, so many other words in the text are followed by "the".
And since words["the"] ends up as just ["midst"],
the chain reaches "midst" very often, despite the randomness.
You can fix the bug of creating words:
for i in range(100):
    if content[i] not in words:
        words[content[i]] = []
    words[content[i]].append(content[i+1])
Then, when you select words randomly,
I suggest adding an if word in words condition
to handle the corner case of the last word in the input, which has no followers of its own.
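Putting the two suggestions together, a minimal sketch might look like the following. The helper names build_followers and generate, the sample text, and the choice to stop early at a dead-end word are illustrative assumptions, not part of the original program.

```python
import random

def build_followers(tokens, limit):
    """Map each of the first `limit` tokens to the list of tokens that follow it."""
    followers = {}
    # Stop one short of the end so tokens[i + 1] always exists.
    for i in range(min(limit, len(tokens) - 1)):
        if tokens[i] not in followers:
            followers[tokens[i]] = []  # create the list only once per word
        followers[tokens[i]].append(tokens[i + 1])
    return followers

def generate(followers, length, rng=random):
    """Walk the chain for up to `length` steps, stopping at a word with no followers."""
    word = rng.choice(list(followers.keys()))
    output = [word]
    for _ in range(length):
        if word not in followers:  # dead end, e.g. the last word that was read
            break
        word = rng.choice(followers[word])
        output.append(word)
    return " ".join(output)

# Hypothetical sample text standing in for bible.txt
sample = "in the beginning god created the heaven and the earth".split(" ")
table = build_followers(sample, 100)
print(generate(table, 20))
```

Collecting the words in a list and joining them with spaces also avoids the run-together output that output = output + word produces.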

"midst" is the 101st word in your source text and it is the first time it shows up. When you do this:
words[content[i]].append(content[i+1])
you are making a key:value pair, but nothing guarantees that the value also exists as a key. When you later use that value as a key, the lookup fails with a KeyError.
If you change your range to 101 instead of 100 you will see that your program almost works. That is because the 102nd word is "of" which has already occurred in your source text.
It's up to you how you want to deal with this edge case. You could do something like this:
if i == (100-1):
    words[content[i]].append(content[0])
else:
    words[content[i]].append(content[i+1])
which basically loops back around to the beginning of the source text when you get to the end.
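The same wrap-around can be written with the modulo operator instead of an if/else. This is only a sketch: build_followers_wrapping and the toy token list are hypothetical, and setdefault is used so that an existing follower list is never reset.

```python
def build_followers_wrapping(tokens, limit):
    """Build the follower table, treating the first token as the follower of the last."""
    n = min(limit, len(tokens))
    followers = {}
    for i in range(n):
        following = tokens[(i + 1) % n]  # index n wraps back to 0
        followers.setdefault(tokens[i], []).append(following)
    return followers

# Toy input standing in for the first words of the source text
sample = "a b c a d".split(" ")
table = build_followers_wrapping(sample, 100)
print(table)
```

With the wrap-around, every value stored in the table is also a key, so the random walk cannot hit a KeyError.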

Related

How would I select a random word from a file for a user to unscramble?

I am trying to select a random word from a txt file. The contents of the file are shown below the code. I would like the word to be random every time the code is run. I also only need the words before the comma.
import random
print("Please enter mywords file to start game")
user_input=input('Enter file name')
filename = open(user_input)
info=filename.readlines()
filename.close()
words=info[0-3]
objects=words.split(',')
userword=random.choice(objects)
print(userword)
opulence,great wealth
penury,extremely poor
gregarious,fond of company; sociable
entomology,study of insects
So far I am only able to pull the second line of the file:
"penury,extremely poor"
You were trying to slice, but ended up with just one line. You can loop over the lines, split each one on ',' and build a list of the parts you need. Then pick randomly from that list:
lst = []
for x in info:
    w, _ = x.split(',')
    lst.append(w)
print(random.choice(lst))
0-3 seems to be a typo of 0:3, but the problem goes deeper than that. words is supposed to be a list, and lists don't have a .split method. You'll need to split each item (and select the part before the comma).
words = [line.split(',')[0] for line in info]
userword = random.choice(words)
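As a self-contained sketch of the same idea, the snippet below works on a list of lines such as readlines() returns. The function name pick_word and the inline sample lines are assumptions for illustration. Splitting with a maximum of one split keeps a definition that itself contains a comma from causing trouble, and blank lines are skipped.

```python
import random

def pick_word(lines, rng=random):
    """Return a random entry, using only the text before the first comma of each line."""
    words = [line.split(",", 1)[0].strip() for line in lines if line.strip()]
    return rng.choice(words)

# Inline stand-in for the lines read from the user's file
sample_lines = [
    "opulence,great wealth\n",
    "penury,extremely poor\n",
    "gregarious,fond of company; sociable\n",
    "entomology,study of insects\n",
]
print(pick_word(sample_lines))
```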

Split sentences, process words, and put sentence back together?

I have a function that scores words. I have lots of text from sentences to several page documents. I'm stuck on how to score the words and return the text near its original state.
Here's an example sentence:
"My body lies over the ocean, my body lies over the sea."
What I want to produce is the following:
"My body (2) lies over the ocean (3), my body (2) lies over the sea."
Below is a dummy version of my scoring algorithm. I've figured out how to take text, tear it apart and score it.
However, I'm stuck on how to put it back together into the format I need it in.
Here's a dummy version of my function:
def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word,score))
    return words_to_return
I'm a relative newbie so I have two questions:
How can I put the text back together, and
Should that logic be put into the function or outside of it?
I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.
Thank you for helping me!
So basically, you want to assign a score to each word. The function you give can be improved by using a dictionary instead of several if statements.
You also have to return all the scores, not just the score of the first word in words_to_work_with.
So the new function would be:
def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body' : 2, 'ocean' : 3} # add any other scored words here
    # if word is not recognized, score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]
For the second part, which is reconstructing the string, I would actually do this in the same function (so this answers your second question) :
def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    reconstructed_text = ''
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}
    word_scores = []
    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here
        # we add 'word'+'(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        # words are separated by a single space; the original punctuation is not restored
        reconstructed_text += word + dict_strings.get(word, '') + ' '
    return reconstructed_text.strip(), word_scores
I'm not guaranteeing this code will work at first try, I can't test it but it'll give you the main idea
Hope this would help. Based on your question, it has worked for me.
best regards!!
"""
Python 3.7.2
Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea.
"""
input_file = open('original_text.txt', 'r') #Reading text from file
output_file = open('processed_text.txt', 'w') #saving output text in file
output_text = []
for line in input_file:
    words = line.split()
    for word in words:
        if word == 'body':
            output_text.append('body (2)')
            output_file.write('body (2) ')
        elif word == 'body,':
            output_text.append('body (2),')
            output_file.write('body (2), ')
        elif word == 'ocean':
            output_text.append('ocean (3)')
            output_file.write('ocean (3) ')
        elif word == 'ocean,':
            output_text.append('ocean (3),')
            output_file.write('ocean (3), ')
        else:
            output_text.append(word)
            output_file.write(word+' ')
print (output_text)
input_file.close()
output_file.close()
Here's a working implementation. The function first parses the input text as a list, such that each list element is a word or a combination of punctuation characters (eg. a comma followed by a space.) Once the words in the list have been processed, it combines the list back into a string and returns it.
import re
import inflection
# lemmatizer is assumed to be defined elsewhere (for example, an NLTK WordNetLemmatizer instance)
def word_score(text):
    words_to_work_with = re.findall(r"\b\w+|\b\W+",text)
    for i,word in enumerate(words_to_work_with):
        if word.isalpha():
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)
txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)
Output:
My body (2) lie over the ocean (3), my body (2) lie over the sea.
If you have more than 2 words that you want to score, using a dictionary instead of if conditions is indeed a good idea.
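The dictionary approach can be sketched with the standard library alone. The function name annotate_scores and the scores dictionary are illustrative assumptions, and this version deliberately skips singularizing and lemmatizing, so it only matches words exactly as they appear (ignoring case).

```python
import re

def annotate_scores(text, scores):
    """Append ' (score)' to every word found in `scores`, leaving the rest untouched."""
    # Split into alternating runs of word and non-word characters, so that
    # joining the pieces reproduces the original text exactly.
    pieces = re.findall(r"\w+|\W+", text)
    for i, piece in enumerate(pieces):
        score = scores.get(piece.lower())
        if score is not None:
            pieces[i] = f"{piece} ({score})"
    return "".join(pieces)

scores = {"body": 2, "ocean": 3}
txt = "My body lies over the ocean, my body lies over the sea."
print(annotate_scores(txt, scores))
# My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Because punctuation and spacing are kept as their own list elements, the text comes back in its original form, and adding another scored word only requires a new dictionary entry.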

Attempting to remove repeated words in a list python

I'm trying to remove repeated words in a list where it saves the location and word in a file but it doesn't save the word which occurs after the repeated word. Can someone tell me what's wrong with it?
sen = input("Input a sentence")
sen1 = sen.lower()
sen2 = sen1.split()
sen4 = sen2
f = open("newfile.txt", "w+")
for words in sen2:
    print(words)
    ok = sen1.count(words)
    print(ok)
    sent = sen4.index(words)
    print(sent)
    f.write(str(sent))
    f.write(str(words))
    if ok > 1 :
        while ok > 0:
            sen2.remove(words)
            ok = ok-1
            if ok == 0:
                break
f.close()
You are making a common mistake, modifying a list while you are looping over its items.
Inside the loop for words in sen2: you sometimes execute sen2.remove(words), which modifies the list sen2. Strange things happen when you do this: the loop keeps advancing its position, so items that slide into already-visited positions are skipped.
To avoid this, make a copy of the list sen2 with sen2copy = sen2[:] (a shallow copy is enough here), then loop over one of them and modify the other. You could do this with
sen2copy = sen2[:]
for words in sen2copy:
or, if you want to be brief,
for words in sen2[:]:
If you don't understand the notation, sen2[:] is a slice of sen2 from the beginning to the end. In other words, it copies each item in sen2 to the new list. If you leave out the brackets and colon you just copy a reference to the entire list, which is not what you want.
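The effect is easy to reproduce on a small list. The sketch below uses a hypothetical helper, visited_while_removing, that records which items the loop actually visits while repeated items are being removed, once iterating over the list itself and once over a slice copy.

```python
def visited_while_removing(items, iterate_over_copy):
    """Remove all repeated items from `items`; return the items the loop visited."""
    source = items[:] if iterate_over_copy else items
    visited = []
    for item in source:
        visited.append(item)
        if items.count(item) > 1:
            # remove every occurrence of a repeated item
            while item in items:
                items.remove(item)
    return visited

print(visited_while_removing(["the", "cat", "the", "dog"], iterate_over_copy=False))
# ['the', 'dog']  ("cat" is skipped)
print(visited_while_removing(["the", "cat", "the", "dog"], iterate_over_copy=True))
# ['the', 'cat', 'the', 'dog']
```

In the first call, removing both occurrences of "the" shifts "cat" into position 0, which the loop has already passed, so the next item it sees is "dog".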

Linear search to find spelling errors in Python

I'm working on learning Python with Program Arcade Games and I've gotten stuck on one of the labs.
I'm supposed to compare each word of a text file (http://programarcadegames.com/python_examples/en/AliceInWonderLand200.txt) to find if it is not in the dictionary file (http://programarcadegames.com/python_examples/en/dictionary.txt) and then print it out if it is not. I am supposed to use a linear search for this.
The problem is even words I know are not in the dictionary file aren't being printed out. Any help would be appreciated.
My code is as follows:
# Imports regular expressions
import re
# This function takes a line of text and returns
# a list of words in the line
def split_line(line):
    split = re.findall('[A-Za-z]+(?:\'\"[A-Za-z]+)?', line)
    return split
# Opens the dictionary text file and adds each line to an array, then closes the file
dictionary = open("dictionary.txt")
dict_array = []
for item in dictionary:
    dict_array.append(split_line(item))
print(dict_array)
dictionary.close()
print("---Linear Search---")
# Opens the text for the first chapter of Alice in Wonderland
chapter_1 = open("AliceInWonderland200.txt")
# Breaks down the text by line
for each_line in chapter_1:
    # Breaks down each line to a single word
    words = split_line(each_line)
    # Checks each word against the dictionary array
    for each_word in words:
        i = 0
        # Continues as long as there are more words in the dictionary and no match
        while i < len(dict_array) and each_word.upper() != dict_array[i]:
            i += 1
        # if no match was found print the word being checked
        if not i <= len(dict_array):
            print(each_word)
# Closes the first chapter file
chapter_1.close()
Something like this should do (pseudo code)
sampleDict = {}
For each word in AliceInWonderLand200.txt:
    sampleDict[word] = True
actualWords = {}
For each word in dictionary.txt:
    actualWords[word] = True
For each word in sampleDict:
    if not (word in actualWords):
        # Oh no! word isn't in the dictionary
A set may be more appropriate than a dict, since the value of the dictionary in the sample isn't important. This should get you going, though
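Since the lab asks for a linear search, here is a minimal sketch of that variant. In the question's code, dict_array holds a list per dictionary line, so comparing a string against an element never matches, and not i <= len(dict_array) can never be true, which is why nothing is printed. The helper names linear_search and misspelled, the simplified regular expression, and the inline word lists are assumptions for illustration; dictionary entries are assumed to be uppercase, as the question's use of upper() suggests.

```python
import re

def split_line(line):
    """Return the words in a line, allowing one apostrophe inside a word."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", line)

def linear_search(items, key):
    """Return the index of `key` in `items`, or -1 if it is not present."""
    i = 0
    while i < len(items) and items[i] != key:
        i += 1
    return i if i < len(items) else -1

def misspelled(text_lines, dictionary_words):
    """Collect every word that the linear search cannot find in the dictionary."""
    missing = []
    for line in text_lines:
        for word in split_line(line):
            if linear_search(dictionary_words, word.upper()) == -1:
                missing.append(word)
    return missing

# Tiny stand-ins for dictionary.txt and the Alice text
dictionary_words = ["ALICE", "WAS", "BEGINNING", "TO", "GET", "VERY", "TIRED"]
text_lines = ["Alice was beginning to get very tyred"]
print(misspelled(text_lines, dictionary_words))
# ['tyred']
```

If the linear search is not a requirement, converting dictionary_words to a set and testing word.upper() not in that set gives the same result with much faster lookups.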

python spambot featureset list size

novice coder here, trying to sort out issues I've found with a simple spam detection python script from Youtube.
Naive Bayes cannot be applied because the list isn't generating correctly. I know the problem step is
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
Could someone help me understand why that line is failing to generate anything?
def email_features(sent):
    features = {}
    wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
    for word in wordtokens:
        if word not in commonwords:
            features[word] = True
    return features
hamtexts=[]
spamtexts=[]
for infile in glob.glob(os.path.join('ham/','*.txt')):
    text_file =open(infile,"r")
    hamtexts.append(text_file.read())
    text_file.close()
for infile in glob.glob(os.path.join('spam/','*.txt')):
    text_file =open(infile,"r")
    spamtexts.append(text_file.read())
    text_file.close()
mixedemails = ([(email,'spam') for email in spamtexts]+ [(email,'ham') for email in hamtexts])
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
I converted your problem into a minimal, runnable example:
commonwords = []
def lemmatize(word):
    return word
def word_tokenize(text):
    return text.split(" ")
def email_features(sent):
    wordtokens = [lemmatize(word.lower()) for word in word_tokenize(sent)]
    features = dict((word, True) for word in wordtokens if word not in commonwords)
    return features
hamtexts = ["hello test", "test123 blabla"]
spamtexts = ["buy this", "buy that"]
mixedemails = [(email,'spam') for email in spamtexts] + [(email,'ham') for email in hamtexts]
featuresets = [(email_features(n),g) for (n,g) in mixedemails]
print(len(mixedemails), len(featuresets))
Executing that example prints 4 4 on the console. So most of your code seems to work, and the exact cause of the error cannot be determined from what you posted. I suggest checking the following points:
Maybe your spam and ham files are not read properly (e.g. your path might be wrong). To rule this out, add print(hamtexts, spamtexts) before mixedemails = .... Both variables should contain non-empty lists of strings.
Maybe your implementation of word_tokenize() always returns an empty list. Add print(sent, wordtokens) after wordtokens = [...] in email_features() to make sure that sent contains a string and that it is correctly converted to a list of tokens.
Maybe commonwords contains every single word from your ham and spam emails. To rule this out, add the same print(sent, wordtokens) before the loop in email_features() and a print(features) after the loop. All three variables should (usually) be non-empty.
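The first check can be built into the file loading itself. The sketch below is one possible way to do that; load_texts and its warning message are hypothetical and not part of the original script. It relies on the fact that glob.glob returns an empty list, without raising an error, when the pattern matches nothing.

```python
import glob
import os

def load_texts(directory, pattern="*.txt"):
    """Read every matching file in `directory` and return the contents as a list."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        # A wrong path does not raise an error; it just yields no files.
        print("warning: no files matched", os.path.join(directory, pattern))
    texts = []
    for path in paths:
        with open(path, "r") as text_file:
            texts.append(text_file.read())
    return texts

hamtexts = load_texts("ham")
spamtexts = load_texts("spam")
print(len(hamtexts), len(spamtexts))
```

If both counts are zero, featuresets will be empty no matter what email_features() does, which would match the symptom described in the question.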
