Linear search to find spelling errors in Python

I'm working on learning Python with Program Arcade Games and I've gotten stuck on one of the labs.
I'm supposed to compare each word of a text file (http://programarcadegames.com/python_examples/en/AliceInWonderLand200.txt) to find if it is not in the dictionary file (http://programarcadegames.com/python_examples/en/dictionary.txt) and then print it out if it is not. I am supposed to use a linear search for this.
The problem is even words I know are not in the dictionary file aren't being printed out. Any help would be appreciated.
My code is as follows:
# Imports regular expressions
import re
# This function takes a line of text and returns
# a list of words in the line
def split_line(line):
    split = re.findall('[A-Za-z]+(?:\'\"[A-Za-z]+)?', line)
    return split
# Opens the dictionary text file and adds each line to an array, then closes the file
dictionary = open("dictionary.txt")
dict_array = []
for item in dictionary:
    dict_array.append(split_line(item))
print(dict_array)
dictionary.close()
print("---Linear Search---")
# Opens the text for the first chapter of Alice in Wonderland
chapter_1 = open("AliceInWonderland200.txt")
# Breaks down the text by line
for each_line in chapter_1:
    # Breaks down each line to a single word
    words = split_line(each_line)
    # Checks each word against the dictionary array
    for each_word in words:
        i = 0
        # Continues as long as there are more words in the dictionary and no match
        while i < len(dict_array) and each_word.upper() != dict_array[i]:
            i += 1
        # if no match was found print the word being checked
        if not i <= len(dict_array):
            print(each_word)
# Closes the first chapter file
chapter_1.close()

Something like this should do (pseudo code):
sampleDict = {}
for each word in AliceInWonderLand200.txt:
    sampleDict[word] = True
actualWords = {}
for each word in dictionary.txt:
    actualWords[word] = True
for each word in sampleDict:
    if not (word in actualWords):
        # Oh no! word isn't in the dictionary
A set may be more appropriate than a dict, since the values stored in the dictionaries above aren't important. This should get you going, though.
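Since the lab asks for a linear search specifically, here is a minimal sketch of that version. As far as I can tell from the posted code, two things keep anything from being printed: dict_array.append(split_line(item)) stores a list per line, so a word (a string) never equals an entry, and "if not i <= len(dict_array)" can never be true because the loop stops at i == len(dict_array). The function name find_unknown_words, the simplified apostrophe regex and the tiny in-memory data are my own assumptions for illustration, not part of the lab.

```python
import re

def split_line(line):
    """Return the words on a line (assumed regex: letters with an optional 'suffix)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", line)

def find_unknown_words(text_lines, dictionary_lines):
    # Store plain upper-case strings (not lists) so a word can equal an entry.
    dict_array = []
    for line in dictionary_lines:
        dict_array.extend(word.upper() for word in split_line(line))

    unknown = []
    for line in text_lines:
        for word in split_line(line):
            i = 0
            # Linear search: walk the list until a match or the end.
            while i < len(dict_array) and word.upper() != dict_array[i]:
                i += 1
            # Running off the end means no entry matched.
            if i == len(dict_array):
                unknown.append(word)
    return unknown

# Hypothetical stand-in data; in the lab these would be the lines of the two files.
dictionary_lines = ["ALICE\n", "WAS\n", "TIRED\n"]
text_lines = ["Alice was verry tired\n"]
print(find_unknown_words(text_lines, dictionary_lines))  # ['verry']
```

With the real files you would pass the open file objects (or their lines) instead of the two small lists.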

Related

My code is missing some of the lines I'm trying to get out of a file

The basic task is to write a function, get_words_from_file(filename), that returns a list of lower case words that are within the region of interest. They share with you a regular expression: "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words that meet this definition. My code works well on some of the tests but fails when the line that indicates the region of interest is repeated.
Here is my code:
import re
def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
        return words
#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)
The issue is that the "*** START OF" line is repeated, and the repeated line isn't included in the output even though it falls inside the region of interest.
The test code should result in:
bee.txt loaded ok.
16 valid words found.
Valid word list:
yes
really
this
time
start
of
synthetic
test
case
end
synthetic
test
case
i'm
in
too
But I'm getting:
bee.txt loaded ok.
11 valid words found.
Valid word list:
yes
really
this
time
end
synthetic
test
case
i'm
in
too
Any help would be great!
Attached is a screenshot of the file
The specific problem in your code is the if .. elif .. elif statement: you're ignoring every line that looks like the line that signals the start or end of a block, even if it's inside the test block.
You want something like this for your function:
def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)
        return words
This assumes you are actually looking for the whole line as a marker, but of course you can still use .startswith() if you accept any line with that prefix as the start or end of the block, as long as it's sufficiently unambiguous.
Your idea of using a flag is fine, although naming a flag after what it represents is always a good idea.
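To see the effect without the real bee.txt, here is a small sketch that runs the same in_block logic over a list of lines. The helper name words_in_region, the prefixes and the sample lines are all made up for illustration; only the regular expression comes from the question.

```python
import re

WORD_RE = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

def words_in_region(lines, start_prefix, end_prefix):
    """Collect words between the first start marker and the next end marker."""
    in_block = False
    words = []
    for line in lines:
        if not in_block:
            # Only look for the start marker while outside the region.
            if line.startswith(start_prefix):
                in_block = True
        elif line.startswith(end_prefix):
            break
        else:
            # Inside the region every line counts, even a repeated start marker.
            words.extend(WORD_RE.findall(line.lower()))
    return words

# Hypothetical sample, not the contents of bee.txt.
sample = [
    "ignored preamble\n",
    "*** START OF TEST ***\n",
    "Yes, really!\n",
    "*** START OF TEST ***\n",
    "I'm in too\n",
    "*** END TEST ***\n",
    "ignored trailer\n",
]
print(words_in_region(sample, "*** START OF", "*** END"))
```

The repeated marker line contributes "start", "of" and "test" because the start check is skipped once in_block is True.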

String searching in text file and dict values combinations

I'm a total beginner to Python. I'm studying it at university, and the professor gave us some work to do before the exam. I've been stuck on this program for almost 2 weeks now; the rule is that we can't use any library.
I have a dictionary with several possible translations from an ancient language to English, a dictionary from English to Italian (only 1 key - 1 value pairs), a text file in the ancient language and another text file in Italian. So far I scan the ancient language file and search for strings that match the dictionary (using the .strip(".,:;?!") method), and I save the matching strings that contain at least 2 words in a list of strings.
Now comes the hard part: I need to try all possible combinations of translations (values from ancient language to English), then translate these from English to Italian with the other dictionary, and check whether the resulting string exists in the Italian file. If it does, I save the result and the paragraph where it was found (results in a different paragraph don't count, it must be in the same one; I've written a small piece of code to count the paragraphs).
I'm having issues here for the following reasons:
In the strings that I've found, how am I supposed to replace the words and keep the punctuation? The returned result must contain all the punctuation, otherwise the output will be wrong.
If the string is contained in the text but spans 2 different lines, how should I proceed? For example, I have a string of 5 words; the first 2 words match at the end of one line, but the remaining 3 words are the first 3 words of the next line.
As mentioned before, the dict from the ancient language to English is huge and can have up to 7 values (translations) for each key (ancient language word). Is there an efficient way to try all the combinations while searching for the string in a text file? This is probably the hardest part.
Probably the best way to handle this is to scan word by word every time and, if the sequence is broken, reset it somehow and keep scanning the text file...
Any ideas?
Here is the commented code of what I've managed to do so far:
k = 2 #Random value, the whole program gonna be a function and the "k" value will be different each time
file = [ line.strip().split(';') for line in open('lexicon-GR-EN.csv', encoding="utf8").readlines() ] #Opening CSV file with possible translations from ancient Greek to English
gr_en = { words[0]: tuple(words[1:]) for words in file } #Creating a dictionary with the several translations (values)
file = open('lexicon-EN-IT.csv', encoding="utf8") # Opening 2nd CSV file
en_it = {} # Initializing dictionary
for row in file: # Scanning each row of the CSV file (From English to Italian)
    L = row.rstrip("\n").split(';') # Clearing newline char and splitting the words
    x = L[0]
    t1 = L[1]
    en_it[x] = t1 # Since in this CSV file all the words are 1 - 1 is not necesary any check for the length (len(L) is always 2 basically)
file = open('odyssey.txt', encoding="utf8") # Opening text file
result = () # Empty tuple
spacechecker = 0 # This is the variable that i need to determine if i'm on a even or odd line, if odd the line will be scanned normaly otherwise word order and words will be reversed
wordcount = 0 # Counter of how many words have been found
paragraph = 0 # Paragraph counter, starts at 0
paragraphspace = 0 # Another paragraph variable, i need this to prevent double-space to count as paragraph
string = "" # Empty string to store corresponding sequences
foundwords = [] # Empty list to store words that have been found
completed_sequences = [] # Empty list, here will be stored all completed sequences of words
completed_paragraphs = [] # Paragraph counter, this shows in which paragraph has been found each sequence of completed_sequences
for index, line in enumerate(file.readlines()): # Starting line by line scan of the txt file
    words = line.split() # Splitting words
    if not line.isspace() and index == 0: # Since i don't know nothing about the "secret tests" that will be conducted with this program i've set this check for the start of the first paragraph to prevent errors: if first line is not space
        paragraph += 1 # Add +1 to paragraph counter
        spacechecker += 1 # Add +1 to spacechecker
    elif not line.isspace() and paragraphspace == 1: # Checking if the previous line was space and the current is not
        paragraphspace = 0 # Resetting paragraphspace (precedent line was space) value
        spacechecker += 1 # Increasing the spacechecker +1
        paragraph +=1 # This means we're on a new paragraph so +1 to paragraph
    elif line.isspace() and paragraphspace == 1: # Checking if the current line is space and the precedent line was space too.
        continue # Do nothing and cycle again
    elif line.isspace(): # Checking if the current line is space
        paragraphspace += 1 # Increase paragraphspace (precedent line was space variable) +1
        continue
    else:
        spacechecker += 1 # Any other case increase spacechecker +1
    if spacechecker % 2 == 1: # Check if spacechecker is odd
        for i in range(len(words)): # If yes scan the words in normal order
            if words[i].strip(",.!?:;-") in gr_en != "[unavailable]": # If words[i] without any special char is in dictionary
                currword = words[i] # If yes, we will call it "currword"
                foundwords.append(currword) # Add currword to the foundwords list
                wordcount += 1 # Increase wordcount +1
            elif (words[i].strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword not in gr_en and wordcount >= k): #Elif check if it's not in dictionary but wordcount has gone over k
                string = " ".join(foundwords) # We will put the foundwords list in a string
                completed_sequences.append(string) # And add this string to the list of strings of completed_sequences
                completed_paragraphs.append(paragraph) # Then add the paragraph of that string to the list of completed_paragraphs
                result = list(zip(completed_sequences, completed_paragraphs)) # This the output format required, a tuple with the string and the paragraph of that string
                wordcount = 0
                foundwords.clear() # Clearing the foundwords list
            else: # If none of the above happened (word is not in dictionary and wordcounter still isn't >= k)
                wordcount = 0 # Reset wordcount to 0
                foundwords.clear() # Clear foundwords list
                continue # Do nothing and cycle again
    else: # The case of spacechecker being not odd,
        words = words[::-1] # Reverse the word order
        for i in range(len(words)): # Scanning the row of words
            currword = words[i][::-1] # Currword in this case will be reversed since the words in even lines are written in reverse.
            if currword.strip(",.!?:;-") in gr_en != "[unavailable]": # If currword without any special char is in dictionary
                foundwords.append(currword) # Append it to the foundwords list
                wordcount += 1 # Increase wordcount +1
            elif (currword.strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword.strip(",.!?:;-") not in gr_en and wordcount >= k): #Elif check if it's not in dictionary but wordcount has gone over k
                string = " ".join(foundwords) # Add the words that has been found to the string
                completed_sequences.append(string) # Append the string to completed_sequences list
                completed_paragraphs.append(paragraph) # Append the paragraph of the strings to the completed_paragraphs list
                result = list(zip(completed_sequences, completed_paragraphs)) # Adding to the result the tuple combination of strings and corresponding paragraphs
                wordcount = 0 # Reset wordcount
                foundwords.clear() # Clear foundwords list
            else: # In case none of above happened
                wordcount = 0 # Reset wordcount to 0
                foundwords.clear() # Clear foundwords list
                continue # Do nothing and cycle again
I'd probably take the following approach to solving this:
Try to collapse the 2 word dictionaries into one (ancient_italian below), removing English from the equation. For example, if ancient->English has {"canus": ["dog", "puppy", "wolf"]} and English->Italian has {"dog": "cane"}, then you can create a new dictionary {"canus": "cane"}. (Of course, if the English->Italian dict has all 3 English words, you need to either pick one or display something like cane|cucciolo|lupo in the output.)
Come up with a regular expression that can distinguish between words and separators (punctuation), and output them in order into a list (word_list below), i.e. something like ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
Step through this list, generating a new list. Something like:
translation = []
for item in word_list:
    if item.isalpha():
        # It's a word - translate it and add to the list
        translation.append(ancient_italian[item])
    else:
        # It's a separator - add to the list as-is
        translation.append(item)
Finally join the list back together: ''.join(translation)
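The tokenise, translate and join steps could be sketched roughly as below. This is only an illustration: the function name translate_keeping_punctuation and the pattern are my own choices, it uses the re module (which the "no library" rule in the question may forbid), and it leaves unknown words untranslated rather than raising a KeyError.

```python
import re

def translate_keeping_punctuation(text, ancient_italian):
    # Split into runs of letters and runs of everything else, so that
    # joining the pieces reproduces the original text exactly.
    pieces = re.findall(r"[^\W\d_]+|[\W\d_]+", text)
    translated = []
    for piece in pieces:
        if piece.isalpha():
            # A word: translate it if we can, otherwise keep it unchanged.
            translated.append(ancient_italian.get(piece, piece))
        else:
            # A separator (spaces, punctuation): keep as-is.
            translated.append(piece)
    return "".join(translated)

# Hypothetical one-entry dictionary taken from the example in the answer.
ancient_italian = {"canus": "cane"}
print(translate_keeping_punctuation("ecce! magnus canus esurit.", ancient_italian))
# ecce! magnus cane esurit.
```

Note that adjacent separators such as "! " come out as a single piece here instead of two list items; that makes no difference once the pieces are joined back together.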
I'm unable to reply to your comment on the answer by match, but this may help:
For one, it's not the most elegant approach, but it should work:
GR_IT = {}
for greek, eng in GR_EN.items():
    for word in eng:
        try:
            GR_IT[greek] = EN_IT[word]
        except KeyError:
            pass
If there's no translation for a word, it will be ignored though.
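If you'd rather keep every available Italian candidate instead of a single one per Greek word, a variation along these lines might work. It uses no imports; the function name collapse and the sample entries are hypothetical, chosen only to mirror the canus example from the other answer.

```python
def collapse(gr_en, en_it):
    """Map each ancient word to the list of Italian words reachable via English."""
    gr_it = {}
    for greek, english_words in gr_en.items():
        candidates = []
        for word in english_words:
            italian = en_it.get(word)  # None when there is no Italian entry
            if italian is not None and italian not in candidates:
                candidates.append(italian)
        if candidates:
            gr_it[greek] = candidates
    return gr_it

# Hypothetical sample data.
gr_en = {"canus": ("dog", "puppy", "wolf")}
en_it = {"dog": "cane", "wolf": "lupo"}
print(collapse(gr_en, en_it))  # {'canus': ['cane', 'lupo']}
```

Words with no usable translation are simply left out of the result, as in the loop above.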
To get a list of words and punctuation split apart, try this:
def repl_punc(s):
    punct = ['.',',',':',';','?','!']
    for p in punct:
        s=s.replace(p,' '+p+' ')
    return s
repl_punc(s).split()

How would I select a random word from a file for a user to unscramble?

I am trying to select a random word from a txt file. The contents of the file are shown below the code. I would like the word to be random every time the code is run. I also only need the words before the comma.
import random
print("Please enter mywords file to start game")
user_input=input('Enter file name')
filename = open(user_input)
info=filename.readlines()
filename.close()
words=info[0-3]
objects=words.split(',')
userword=random.choice(objects)
print(userword)
opulence,great wealth
penury,extremely poor
gregarious,fond of company; sociable
entomology,study of insects
So far I am only able to pull the second line in the file:
"penury,extremely poor"
You were trying to slice, but ended up with just one line. You can loop over the lines, split each on ',' and build a list of the words you need. Then pick randomly from the list:
lst = []
for x in info:
    w, _ = x.split(',', 1)
    lst.append(w)
print(random.choice(lst))
0-3 seems to be a typo of 0:3, but the problem goes deeper than that. info[0-3] is info[-3], the third line from the end, which is why you only ever get "penury,extremely poor". With 0:3, words would be a list, and lists don't have a .split method. You'll need to split each item (and select the part before the comma).
words = [line.split(',')[0] for line in info]
userword = random.choice(words)
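Put together, and using an in-memory list in place of the file so it can be run as-is, a sketch might look like this. The name pick_word is made up for the example; with a real file you would pass the result of readlines() instead.

```python
import random

def pick_word(lines):
    """Return a random word, taking only the part before the first comma."""
    # Skip blank lines so a trailing newline at the end of the file is harmless.
    words = [line.split(",", 1)[0].strip() for line in lines if line.strip()]
    return random.choice(words)

# Same content as the sample file in the question.
lines = [
    "opulence,great wealth\n",
    "penury,extremely poor\n",
    "gregarious,fond of company; sociable\n",
    "entomology,study of insects\n",
]
print(pick_word(lines))
```

Splitting with a maximum of one split keeps this working even if a definition itself contains a comma.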

KeyError on the same word

I am trying to generate a sentence in the style of the bible. But whenever I run it, it stops at a KeyError on the same exact word. This is confusing as it is only using its own keys and it is the same word every time in the error, despite having random.choice.
This is the txt file if you want to run it: ftp://ftp.cs.princeton.edu/pub/cs226/textfiles/bible.txt
import random
files = []
content = ""
output = ""
words = {}
files = ["bible.txt"]
sentence_length = 200
for file in files:
    file = open(file)
    content = content + " " + file.read()
content = content.split(" ")
for i in range(100): # I didn't want to go through every word in the bible, so I'm just going through 100 words
    words[content[i]] = []
    words[content[i]].append(content[i+1])
word = random.choice(list(words.keys()))
output = output + word
for i in range(int(sentence_length)):
    word = random.choice(words[word])
    output = output + word
print(output)
The KeyError happens on this line:
word = random.choice(words[word])
It always happens for the word "midst".
How? "midst" is the 100th word in the text, and the 100th position is the first time it is seen. The consequence is that "midst" itself was never put in words as a key. Hence the KeyError.
Why does the program reach this word so fast? Partly because of a bug here:
for i in range(100):
    words[content[i]] = []
    words[content[i]].append(content[i+1])
The bug is the words[content[i]] = [] statement. Every time you see a word, you recreate an empty list for it, so only its most recent follower survives. The word before "midst" is "the". It's a very common word, so many other words in the text lead to "the". And since words["the"] is ["midst"], the problem tends to happen a lot, despite the randomness.
You can fix the bug in how words is built:
for i in range(100):
    if content[i] not in words:
        words[content[i]] = []
    words[content[i]].append(content[i+1])
And then, when you select words randomly, I suggest adding an if word in words condition to handle the corner case of the last word in the input.
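Both suggestions together might look like the sketch below. The function names build_chain and generate and the short sample sentence are placeholders of mine, not the question's code or the real bible.txt; setdefault is used here in place of the explicit "not in words" check, which has the same effect.

```python
import random

def build_chain(tokens):
    """Map each word to the list of all words that follow it."""
    chain = {}
    for current, following in zip(tokens, tokens[1:]):
        # setdefault creates the list once and keeps earlier followers.
        chain.setdefault(current, []).append(following)
    return chain

def generate(chain, length):
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length):
        if word not in chain:
            # Dead end: this word was only ever seen as a follower.
            break
        word = random.choice(chain[word])
        output.append(word)
    return " ".join(output)

# Hypothetical sample text.
tokens = "in the beginning god created the heaven and the earth".split()
chain = build_chain(tokens)
print(chain["the"])  # ['beginning', 'heaven', 'earth']
print(generate(chain, 20))
```

Here "earth" plays the role of "midst": it appears only as a follower, so generation stops there instead of raising a KeyError.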
"midst" is the 101st word in your source text and it is the first time it shows up. When you do this:
words[content[i]].append(content[i+1])
you are making a key:value pair but you aren't guaranteed that that value is going to be equivalent to an existing key. So when you use that value to search for a key it doesn't exist so you get a KeyError.
If you change your range to 101 instead of 100 you will see that your program almost works. That is because the 102nd word is "of" which has already occurred in your source text.
It's up to you how you want to deal with this edge case. You could do something like this:
if i == (100-1):
    words[content[i]].append(content[0])
else:
    words[content[i]].append(content[i+1])
which basically loops back around to the beginning of the source text when you get to the end.

How do I read a file and convert each line into strings and see if any of the strings are in a user input?

What I am trying to do in my program is to have the program open a file with many different words inside it.
Receive a user input and check if any word inside the file is in user input.
Inside the file redflags.txt:
happy
angry
ball
jump
each word on a different line.
For example if the user input is "Hey I am a ball" then it will print redflag.
If the user input is "Hey this is a sphere" then it will print noflag.
Redflags = open("redflags.txt")
data = Redflags.read()
text_post = raw_input("Enter text you wish to analyse")
words = text_post.split() and text_post.lower()
if data in words:
    print("redflag")
else:
    print("noflag")
This should do the trick! Set lookups are generally much faster than list lookups for this kind of comparison. Sets can also tell you the intersection (the overlapping words, which is what we need here), differences, etc. We read the file that has the list of words, remove the newline characters and lowercase everything, and we have our first list. The second list is created from the user input, split on whitespace. Now we can perform a set intersection to see whether any common words exist.
# read in the words and create list of words with each word on newline
# replace newline chars and lowercase
words = [word.replace('\n', '').lower() for word in open('filepath_to_word_list.txt', 'r').readlines()]
# collect user input and lowercase and split into list
user_input = raw_input('Please enter your words').lower().split()
# use set intersection to find if common words exist and print redflag
if set(words).intersection(set(user_input)):
    print('redflag')
else:
    print('noflag')
with open('redflags.txt', 'r') as f:
    # See this post: https://stackoverflow.com/a/20756176/1141389
    file_words = f.read().splitlines()
# get the user's words, lower case them all, then split
user_words = raw_input('Please enter your words').lower().split()
# use sets to find if any user_words are in file_words
if set(file_words).intersection(set(user_words)):
    print('redflag')
else:
    print('noflag')
I would suggest using a list comprehension.
Take a look at this code, which does what you want (I explain it below):
Redflags = open("redflags.txt")
data = Redflags.readlines()
data = [d.strip() for d in data]
text_post = input("Enter text you wish to analyse:")
text_post = text_post.split()
fin_list = [i for i in data if i in text_post]
if fin_list:
    print("RedFlag")
else:
    print("NoFlag")
Output 1:
Enter text you wish to analyse:i am sad
NoFlag
Output 2:
Enter text you wish to analyse:i am angry
RedFlag
So first open the file and read it using readlines(); this gives a list of the lines in the file:
>>> data = Redflags.readlines()
>>> data
['happy\n', 'angry \n', 'ball \n', 'jump\n']
See all those unwanted spaces and newlines (\n)? Use strip() to remove them! But you can't strip() a list, so take the individual items from the list and apply strip() to each. This can be done efficiently with a list comprehension:
data = [d.strip() for d in data]
Also, why are you using raw_input()? In Python 3, use input() instead.
After getting and splitting the input text:
fin_list = [i for i in data if i in text_post]
I'm creating a list of the items i from the data list that are also in the text_post list. This way I get the common items that are in both lists.
>>> fin_list
['angry'] #if my input is "i am angry"
Note: in Python, empty lists are considered False,
>>> bool([])
False
while those with values are considered True,
>>> bool(['some','values'])
True
This way your if only executes if the list is non-empty, meaning 'RedFlag' will be printed only when some common item is found between the two lists. You get what you want.
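For reference, here is a compact Python 3 sketch of the same check that combines the set idea with lowercasing on both sides. The function name has_red_flag is made up, and the list flag_lines stands in for the lines of redflags.txt so the snippet runs without the file.

```python
def has_red_flag(text, flag_lines):
    """Return True if any flagged word appears as a word in text."""
    # Strip whitespace/newlines and lowercase the flagged words; skip blank lines.
    flags = {line.strip().lower() for line in flag_lines if line.strip()}
    # isdisjoint is True when the two collections share no element.
    return not flags.isdisjoint(text.lower().split())

# Stand-in for the lines read from redflags.txt.
flag_lines = ["happy\n", "angry\n", "ball\n", "jump\n"]
print("redflag" if has_red_flag("Hey I am a ball", flag_lines) else "noflag")       # redflag
print("redflag" if has_red_flag("Hey this is a sphere", flag_lines) else "noflag")  # noflag
```

One limitation to be aware of: because the input is split on whitespace only, a word with punctuation attached (such as "ball.") would not match.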
