I am trying to use binary search to check the spelling of words in a file and print the words that are not in the dictionary. Right now, however, most of the correctly spelled words are being reported as misspelled (that is, as words that cannot be found in the dictionary).
The dictionary is also a text file, and it looks like this:
abactinally
abaction
abactor
abaculi
abaculus
abacus
abacuses
Abad
abada
Abadan
Abaddon
abaddon
abadejo
abadengo
abadia
Code:
def binSearch(x, nums):
    low = 0
    high = len(nums)-1
    while low <= high:
        mid = (low + high)//2
        item = nums[mid]
        if x == item :
            print(nums[mid])
            return mid
        elif x < item:
            high = mid - 1
        else:
            low = mid + 1
    return -1
def main():
    print("This program performs a spell-check in a file")
    print("and prints a report of the possibly misspelled words.\n")
    # get the sequence of words from the file
    fname = input("File to analyze: ")
    text = open(fname,'r').read()
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    #import dictionary from file
    fname2 =input("File of dictionary: ")
    dic = open(fname2,'r').read()
    dic = dic.split()
    #perform binary search for misspelled words
    misw = []
    for w in words:
        m = binSearch(w,dic)
        if m == -1:
            misw.append(w)
Your binary search works perfectly! You don't seem to be removing all special characters, though.
Testing your code (with a sentence of my own):
def main():
    print("This program performs a spell-check in a file")
    print("and prints a report of the possibly misspelled words.\n")
    text = 'An old mann gathreed his abacus, and ran a mile. His abacus\n ran two miles!'
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.lower().split(' ')
    dic = ['a','abacus','an','and','arranged', 'gathered', 'his', 'man','mile','miles','old','ran','two']
    #perform binary search for misspelled words
    misw = []
    for w in words:
        m = binSearch(w,dic)
        if m == -1:
            misw.append(w)
    print(misw)
This prints ['mann', 'gathreed', '', '', 'abacus\n', ''] as output.
Those empty strings '' come from the punctuation that you replaced with spaces: split(' ') yields an empty string wherever two spaces are adjacent or the text ends with a space. The \n (a line break) is a little more problematic, since it shows up in practically every external text file but is easy to overlook. Instead of looping with for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':, check whether each character .isalpha(). Try this:
def main():
    ...
    text = 'An old mann gathreed his abacus, and ran a mile. His abacus\n ran two miles!'
    for ch in text:
        if not ch.isalpha() and not ch == ' ':
            #we want to keep spaces or else we'd only have one word in our entire text
            text = text.replace(ch, '') #replace with empty string (basically, remove)
    words = text.lower().split(' ')
    #import dictionary
    dic = ['a','abacus','an','and','arranged', 'gathered', 'his', 'man','mile','miles','old','ran','two']
    #perform binary search for misspelled words
    misw = []
    for w in words:
        m = binSearch(w,dic)
        if m == -1:
            misw.append(w)
    print(misw)
Output:
This program performs a spell-check in a file
and prints a report of the possibly misspelled words.
['mann', 'gathreed']
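Another way to do the same cleanup, offered as a minimal sketch rather than as the asker's method: pull out runs of letters with a regular expression and use the standard bisect module for the lookup. The names extract_words and find_misspelled are made up for this example. It also lowercases and re-sorts the dictionary, on the assumption that a file which lists Abad next to abada (as in the sample above) is not in the order Python's case-sensitive string comparison expects. Note that the pattern only keeps ASCII letters, so apostrophes and accented characters are dropped.

```python
import re
from bisect import bisect_left


def extract_words(text):
    """Return the lowercase alphabetic words in text, skipping punctuation and whitespace."""
    return re.findall(r"[a-z]+", text.lower())


def find_misspelled(text, dictionary_words):
    """Return the words of text that are missing from dictionary_words."""
    # Lowercase and sort so the list order matches Python's string comparison
    dic = sorted({w.lower() for w in dictionary_words})
    missing = []
    for w in extract_words(text):
        i = bisect_left(dic, w)  # binary search for the insertion point
        if i == len(dic) or dic[i] != w:
            missing.append(w)
    return missing


sample = 'An old mann gathreed his abacus, and ran a mile. His abacus\n ran two miles!'
dic = ['a', 'abacus', 'an', 'and', 'arranged', 'gathered', 'his',
       'man', 'mile', 'miles', 'old', 'ran', 'two']
print(find_misspelled(sample, dic))  # ['mann', 'gathreed']
```

Because the words are matched rather than split on spaces, neither the empty strings nor the trailing \n can appear in the result.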
Hope this was helpful! Feel free to comment if you need clarification or something doesn't work.
I have a file containing some sentences. I used polyglot for Named Entity Recognition and stored all detected entities in a list. Now I want to check, for each sentence, whether it contains any of the entities (or a pair of them), and print those entities.
Here is what I did:
from polyglot.text import Text
file = open('input_raw.txt', 'r')
input_file = file.read()
test = Text(input_file, hint_language_code='fa')
list_entity = []
for sent in test.sentences:
    #print(sent[:10], "\n")
    for entity in test.entities:
        list_entity.append(entity)
for i in range(len(test)):
    m = test.entities[i]
    n = test.words[m.start: m.end] # it shows only word not tag
    if str(n).split('.')[-1] in test: # if each entities exist in each sentence
        print(n)
It gives me an empty list.
Input:
sentence1: Bill Gate is the founder of Microsoft.
sentence2: Trump is the president of USA.
Expected output:
Bill Gate, Microsoft
Trump, USA
Output of list_entity:
I-PER(['Trump']), I-LOC(['USA'])
How to check if I-PER(['Trump']), I-LOC(['USA']) is in first sentence?
For starters, you were adding the entities of the whole text to the list on every pass of the sentence loop.
To get the entities of a single sentence, read entities from that sentence object:
from polyglot.text import Text
file = open('input_raw.txt', 'r')
input_file = file.read()
file = Text(input_file, hint_language_code='fa')
list_entity = []
for sentence in file.sentences:
    for entity in sentence.entities:
        #print(entity)
        list_entity.append(entity)
print(list_entity)
Now you don't have an empty list.
As for your problem with identifying the entity terms:
I have not found a way to generate an entity by hand, so the following simply checks whether the sentence has entities containing the same terms. A Chunk can hold multiple strings, so we can iterate over them.
from polyglot.text import Text
file = open('input_raw.txt', 'r')
input_file = file.read()
file = Text(input_file, hint_language_code='ar')
def check_sentence(entities_list, sentence): ## Check if string terms
    for term in entities_list:               ## are in any of the entities
        ## Compare each Chunk in the list to each Chunk
        ## object in the sentence and see if there's any matches.
        if any(any(entityTerm == term for entityTerm in entityObject)
               for entityObject in sentence.entities):
            pass
        else:
            return False
    return True
sentence_number = 1 # Which sentence to check
sentence = file.sentences[sentence_number]
entity_terms = ["Bill",
                "Gates"]
if check_sentence(entity_terms, sentence):
    print("Entity Terms " + str(entity_terms) +
          " are in the sentence. '" + str(sentence)+ "'")
else:
    print("Sentence '" + str(sentence) +
          "' doesn't contain terms" + str(entity_terms ))
Once you find a way to generate arbitrary entities all you'll have to do is stop popping the term from the sentence checker so you can do type comparison as well.
If you just want to match the list of entities in the file against a specific sentence, then this should do the trick:
from polyglot.text import Text
file = open('input_raw.txt', 'r')
input_file = file.read()
file = Text(input_file, hint_language_code='fa')
def return_match(entities_list, sentence): ## Check if and which chunks
    matches = []                           ## are in the sentence
    for term in entities_list:
        ## Check each list in each Chunk object
        ## and see if there's any matches.
        for entity in sentence.entities:
            if entity == term:
                for word in entity:
                    matches.append(word)
    return matches
def return_list_of_entities(file):
    list_entity = []
    for sentence in file.sentences:
        for entity in sentence.entities:
            list_entity.append(entity)
    return list_entity
list_entity = return_list_of_entities(file)
sentence_number = 1 # Which sentence to check
sentence = file.sentences[sentence_number]
match = return_match(list_entity, sentence)
if match:
    print("Entity Term " + str(match) +
          " is in the sentence. '" + str(sentence)+ "'")
else:
    print("Sentence '" + str(sentence) +
          "' doesn't contain any of the terms" + str(list_entity))
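To get the output format from the question (one line of entities per sentence), something like the following might work. It is only a sketch: plain nested lists stand in for what sentence.entities returns, with each inner list of words playing the role of a Chunk, and the function name format_entities_by_sentence is hypothetical. With polyglot you would presumably build the input as [list(s.entities) for s in file.sentences], but that part is an untested assumption.

```python
def format_entities_by_sentence(entities_by_sentence):
    """Turn per-sentence entity chunks into one comma-separated line each."""
    lines = []
    for chunks in entities_by_sentence:
        # Each chunk is a sequence of words, e.g. ['Bill', 'Gate']
        names = [" ".join(chunk) for chunk in chunks]
        lines.append(", ".join(names))
    return lines


# Stand-in data: lists of words in place of polyglot Chunk objects
sample = [
    [["Bill", "Gate"], ["Microsoft"]],
    [["Trump"], ["USA"]],
]
for line in format_entities_by_sentence(sample):
    print(line)
# Bill Gate, Microsoft
# Trump, USA
```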
This program is supposed to replace the letters ö, ä, õ, ü with different letters. After printing each row it also prints an empty row, and I don't know why. I have been trying to figure it out for some time, but I can't see why it doesn't give me the desired output.
f = input("Enter file name: ")
file = open(f, encoding="UTF-8")
for sentence in file:
    sentence = sentence.upper()
    for letter in sentence:
        if letter == "Ä":
            lause = sentence.replace(letter, "AE")
        elif letter == "Ö" or täht == "Õ":
            lause = sentence.replace(letter, "OE")
        elif letter == "Ü":
            lause = sentence.replace(letter, "UE")
    print(sentence)
Each line read from the file includes its trailing newline. Your print() also adds a newline, so you get an empty row. Try print(sentence, end='') as follows:
filename = input("Enter file name: ")
with open(filename, encoding="UTF-8") as f_input:
    for sentence in f_input:
        sentence = sentence.upper()
        # iterate over the characters, writing each replacement back to sentence
        for letter in sentence:
            if letter == "Ä":
                sentence = sentence.replace(letter, "AE")
            elif letter == "Ö" or letter == "Õ":
                sentence = sentence.replace(letter, "OE")
            elif letter == "Ü":
                sentence = sentence.replace(letter, "UE")
        print(sentence, end='')
Note: using with open(... will also automatically close your file afterwards.
You might also want to consider the following approach:
# -*- coding: utf-8 -*-
filename = input("Enter file name: ")
replacements = [('Ä', 'AE'), ('ä', 'ae'), ('Ö', 'OE'), ('ö', 'oe'), ('Õ', 'OE'), ('õ', 'oe'), ('Ü', 'UE'), ('ü', 'ue')]
with open(filename, encoding='utf-8') as f_input:
    text = f_input.read()
for from_text, to_text in replacements:
    text = text.replace(from_text, to_text)
print(text)
This does each replacement on the whole text rather than line by line. It also preserves the case.
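The same mapping can also be written as a translation table, so the text is processed in a single pass. This is just a sketch of an alternative using the built-in str.maketrans and str.translate; the names TABLE and transliterate are invented for the example, and the sample string stands in for the contents of your file.

```python
# Map each single character to its replacement string
TABLE = str.maketrans({
    'Ä': 'AE', 'ä': 'ae',
    'Ö': 'OE', 'ö': 'oe',
    'Õ': 'OE', 'õ': 'oe',
    'Ü': 'UE', 'ü': 'ue',
})


def transliterate(text):
    """Replace the letters ä, ö, õ, ü (either case) using TABLE."""
    return text.translate(TABLE)


print(transliterate("Öö õ Ü äÄ\n"), end='')  # OEoe oe UE aeAE
```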
I won't fix your program, just try to answer why it doesn't do what you are expecting:
The program doesn't run: in line 14 the variable "täht" is probably a typo for "letter".
You store the result of replace() in the variable "lause" but never use it.
By default print() adds "\n" at the end, but you can override that (see help(print) in the Python shell).
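To see the trailing newline for yourself, you can print the repr() of a line. The snippet below is a small sketch that uses io.StringIO as a stand-in for your file (an assumption made so that it runs without any file on disk); stripping the newline with rstrip("\n") is an alternative to passing end='' to print().

```python
import io

# StringIO behaves like an open text file for iteration purposes
fake_file = io.StringIO("tere\nöö\n")
lines = list(fake_file)

print(repr(lines[0]))  # 'tere\n'  (the newline is part of the line)

# Remove the newline so that print() adds exactly one
cleaned = [line.rstrip("\n") for line in lines]
for line in cleaned:
    print(line)
```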
sentence = "ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY"
s = sentence.split()
another = [0]
print(sentence)
for count, i in enumerate(s):
    if s.count(i) < 2:
        another.append(max(another) + 1)
    else:
        another.append(s.index(i) +1)
another.remove(0)
print(another)
I am guessing you want the sentence written to a text file? If so, here is the code:
text_file = open("textfile.txt", "w")
text_file.write(sentence)
text_file.close()
Opening with mode "w" creates textfile.txt in the current working directory (usually the folder you run the program from), or overwrites it if it already exists.
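A variant with a with block closes the file for you even if the write fails. This is a minimal sketch; writing into a temporary directory is my own assumption so that the example leaves nothing behind, and in your program you would simply use "textfile.txt" as the path.

```python
import os
import tempfile

sentence = "ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY"

# Assumption: a throwaway folder, only to keep the example self-contained
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "textfile.txt")
    with open(path, "w") as text_file:
        text_file.write(sentence)
    # Read it back to confirm what was written
    with open(path) as text_file:
        written = text_file.read()

print(written == sentence)  # True
```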
I have a text file that contains the contents of a book. I want to build an index from this file so that the user can search it.
A search consists of entering a word. The program would then return the following:
Every chapter which includes that word.
The line number of the line which contains the word.
The entire line the word is on.
I tried the following code:
infile = open(file)
Dict = {}
word = input("Enter a word to search: ")
linenum = 0
line = infile.readline()
for line in infile
    linenum += 1
    for word in wordList:
        if word in line:
            Dict[word] = Dict.setdefault(word, []) + [linenum]
            print(count, word)
    line = infile.readline()
return Dict
Something like this does not work, and it seems too awkward to extend to the other features I would need:
An "or" operator to search for one word or another
An "and" operator to search for one word and another in the same chapter
Any suggestions would be great.
import string
def classify_lines_on_chapter(book_contents):
    # chapter heading for every line; headings are assumed to be all-uppercase lines
    lines_vs_chapter = []
    current_chapter = None  # no heading seen yet
    for line in book_contents:
        if line.isupper():
            current_chapter = line.strip()
        lines_vs_chapter.append(current_chapter)
    return lines_vs_chapter
def classify_words_on_lines(book_contents):
    # word -> list of (0-based) line indexes on which it occurs
    words_vs_lines = {}
    for i, line in enumerate(book_contents):
        for word in set([word.strip(string.punctuation) for word in line.split()]):
            if word:
                words_vs_lines.setdefault(word, []).append(i)
    return words_vs_lines
def main():
    skip_lines = 93
    with open('book.txt') as book:
        book_contents = book.readlines()[skip_lines:]
    lines_vs_chapter = classify_lines_on_chapter(book_contents)
    words_vs_lines = classify_words_on_lines(book_contents)
    while True:
        word = input("Enter word to search - ")
        # Enter a blank input to exit
        if not word:
            break
        line_numbers = words_vs_lines.get(word, None)
        if not line_numbers:
            print("Word not found!!\n")
            continue
        for line_number in line_numbers:
            line = book_contents[line_number]
            chapter = lines_vs_chapter[line_number]
            print("Line " + str(line_number + 1 + skip_lines))
            print("Chapter '" + str(chapter) + "'")
            print(line)
if __name__ == '__main__':
    main()
Try it on this input file. Rename it as book.txt before running it.
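The "or" and "and" searches from the question can be layered on top of the two indexes built above by using set operations on chapters. The following is a rough sketch under that assumption; the helper names chapters_containing, search_or and search_and are hypothetical, and the tiny hand-made indexes stand in for the ones the program builds from book.txt.

```python
def chapters_containing(word, words_vs_lines, lines_vs_chapter):
    """Return the set of chapters in which word occurs."""
    return {lines_vs_chapter[i] for i in words_vs_lines.get(word, [])}


def search_or(word_a, word_b, words_vs_lines, lines_vs_chapter):
    """Chapters that contain word_a or word_b (set union)."""
    return (chapters_containing(word_a, words_vs_lines, lines_vs_chapter)
            | chapters_containing(word_b, words_vs_lines, lines_vs_chapter))


def search_and(word_a, word_b, words_vs_lines, lines_vs_chapter):
    """Chapters that contain both word_a and word_b (set intersection)."""
    return (chapters_containing(word_a, words_vs_lines, lines_vs_chapter)
            & chapters_containing(word_b, words_vs_lines, lines_vs_chapter))


# Stand-in indexes: four lines, two chapters
lines_vs_chapter = ["CHAPTER I", "CHAPTER I", "CHAPTER II", "CHAPTER II"]
words_vs_lines = {"whale": [1, 3], "ship": [3], "sea": [1]}

print(sorted(search_and("whale", "ship", words_vs_lines, lines_vs_chapter)))
# ['CHAPTER II']
print(sorted(search_or("sea", "ship", words_vs_lines, lines_vs_chapter)))
# ['CHAPTER I', 'CHAPTER II']
```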
Basically, I have a text file with a list of words. I have to use raw_input to let the user type in words, and if the entered word is in the text file, the program prints "Right". Any word that isn't on that list has to go into a different file, along with the number of "wrong" words.
For the most part the user input works: the program reports whether the entered word is right or wrong. But I'm having difficulty adding the wrong words to a different file.
print 'Opening file wordlist.txt'
b = open('wordlist.txt')
print 'Reading file wordlist.txt'
word_list = b.readlines().lower().split()
b.close()
in_word = raw_input('Enter a word: ')
if in_word+'\n' in word_list:
    print 'Right'
wrong_list = { word for word in in_word if word not in word_list}
return wrong_list
Why not do this:
wrong_list = []
print 'Opening file wordlist.txt'
b = open('wordlist.txt')
print 'Reading file wordlist.txt'
# read() returns one string, which can be lowercased and split into words
word_list = b.read().lower().split()
b.close()
in_word = raw_input('Enter a word: ')
if in_word in word_list:
    print 'Right'
else:
    # append() adds the whole word; extend() would add it letter by letter
    wrong_list.append(in_word)
print 'Opening file wordlist.txt'
b = open('wordlist.txt', 'r')
print 'Reading file wordlist.txt'
word_list = b.read().splitlines()
b.close()
c = open('wronglist.txt', 'r')
wrong_list = c.read().splitlines()
c.close()
in_word = raw_input('Enter a word: ')
if in_word in word_list:
    print 'Right'
else:
    print 'Wrong'
    c = open('wronglist.txt', 'a')
    if in_word not in wrong_list:
        c.write("%s\n" % in_word)
    c.close()
Try this:
in_word = ''
wrong_list = []
with open('wordlist.txt', 'r') as f:
    word_list = f.read().lower().split()
# compare strings with != and ==, not with "is"
while in_word != '#':
    in_word = raw_input('Enter a word(type # to exit): ')
    if in_word == '#':
        break
    if in_word in word_list:
        print 'right'
    else:
        print 'wrong'
        wrong_list.append(in_word)
result = """Number of wrong words: %d
Wrong words: %s
""" % (len(wrong_list), ','.join(wrong_list))
print result
with open('wrong.txt', 'a') as f:
    f.write(result)
The problem with your current implementation is the indentation, together with a missing "else" branch on Python's if statement.
There is a good tutorial on this topic here:
https://docs.python.org/3.5/tutorial/controlflow.html
You will also need to know how to open a file for writing, which is explained here: https://docs.python.org/3.4/library/functions.html?highlight=open#open
In short:
with open('/path/filename_here.txt','w') as writeable_file:
    #do stuff here with the file
    writeable_file.write(line_to_write)
#the file is closed now.
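Putting the two pieces together (an else branch and a file opened for appending), a rough Python 3 sketch could look like this. The function names check_word and count_wrong are made up for illustration, the word list is hard-coded instead of being read from wordlist.txt, and the temporary folder is only there so the example cleans up after itself.

```python
import os
import tempfile


def check_word(word, word_list, wrong_path):
    """Return True if word is known; otherwise append it to the file at wrong_path."""
    if word.lower() in word_list:
        return True
    else:
        # mode "a" appends, so earlier wrong words are kept
        with open(wrong_path, "a") as wrong_file:
            wrong_file.write(word + "\n")
        return False


def count_wrong(wrong_path):
    """Count the non-empty lines (one wrong word per line) in the file."""
    with open(wrong_path) as wrong_file:
        return sum(1 for line in wrong_file if line.strip())


word_list = {"apple", "banana"}  # stand-in for the contents of wordlist.txt
with tempfile.TemporaryDirectory() as folder:
    wrong_path = os.path.join(folder, "wrong.txt")
    results = [check_word(w, word_list, wrong_path)
               for w in ["Apple", "aple", "bananna"]]
    wrong_count = count_wrong(wrong_path)

print(results, wrong_count)  # [True, False, False] 2
```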