Python check for word in list

I'm writing a spell checking function and I'm using two text files: one that has misspelled text and a text file with a bunch of words from the dictionary. I have turned the text of misspelled words into a list of strings and the text file with dictionary words into a list of words. Now I need to see if the words in my misspelled list are in my list of dictionary words.
def spellCheck():
    checkFile = input('Enter file name: ')
    inFile = open(checkFile, 'r')
    # This separates my original text file into a list like this:
    # [['It','was','the','besst','of','times,'],
    #  ['it','was','teh','worst','of','times']]
    separate = []
    for line in inFile:
        separate.append(line.split())
    # This opens my list of words from the dictionary and
    # turns it into a list of the words.
    wordFile = open('words.txt', 'r')
    words = wordFile.read()
    wordList = words.split()
    wordFile.close()
    # I need this newList to be a list of the correctly spelled words
    # in my separate[] list, and if the word isn't spelled correctly
    # it will go into another if statement...
    newList = []
    for word in separate:
        if word in wordList:
            newList.append(word)
    return newList

Try this:
newList = []
for line in separate:
    for word in line:
        if word in wordList:
            newList.append(word)
return newList
The problem is that you were iterating over separate, which is a list of lists. None of those inner lists is ever an element of wordList, which is why the if-statement always fails. The words you want to iterate over are inside the sublists contained in separate, so you can reach them with a second for-loop. You can also use for word in itertools.chain.from_iterable(separate).
Hope this helps
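The itertools variant mentioned above, run on the sample lists from the question (the wordList here is a made-up stand-in for the contents of words.txt):

```python
import itertools

separate = [['It', 'was', 'the', 'besst', 'of', 'times,'],
            ['it', 'was', 'teh', 'worst', 'of', 'times']]
wordList = ['it', 'was', 'the', 'worst', 'of', 'times']

# chain.from_iterable yields the words of every sublist in turn,
# flattening the list of lists into one stream of words
newList = [word for word in itertools.chain.from_iterable(separate)
           if word in wordList]
print(newList)  # ['was', 'the', 'of', 'it', 'was', 'worst', 'of', 'times']
```

Note that 'It' and 'times,' are not matched here, because the comparison is case-sensitive and the trailing comma is still attached; a real spell checker would want to lowercase and strip punctuation first.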

First, a word about data structures. Instead of lists, you should use sets, since you (apparently) only want one copy of each word. You can create sets out of your lists:
input_words = set(word for line in separate for word in line) # since it is a list of lists
correct_words = set(word_list)
Then, it is simple as that:
new_list = input_words.intersection(correct_words)
And if you want the incorrect words, you have another one liner:
incorrect = input_words.difference(correct_words)
Note that I used names_with_underscores, instead of CamelCase, as recommended in PEP 8.
Bear in mind, however, that this is not very efficient for spell checking, since you don't examine context.
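Putting the two one-liners together on the question's sample data (the word_list below is my own stand-in for the dictionary file):

```python
separate = [['It', 'was', 'the', 'besst', 'of', 'times'],
            ['it', 'was', 'teh', 'worst', 'of', 'times']]
word_list = ['it', 'was', 'the', 'worst', 'of', 'times']

input_words = set(word for line in separate for word in line)
correct_words = set(word_list)

correct = input_words.intersection(correct_words)
incorrect = input_words.difference(correct_words)
print(sorted(incorrect))  # ['It', 'besst', 'teh']
```

'It' lands in the incorrect set because set membership is case-sensitive; lowercasing the input words before building the set would fix that.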

Related

exhaustive search over a list of complex strings without modifying original input

I am attempting to create a minimal algorithm that exhaustively searches for duplicates over a list of strings and removes them by index, so as to avoid changing the case of words and their meanings.
The caveat is that the list has words such as Blood, blood, DNA, ACTN4, 34-methyl-O-carboxy, Brain, brain-facing-mouse, BLOOD and so on.
I only want to remove the duplicate 'blood' word, keep the first occurrence with the first letter capitalized, and not modify the case of any other words. Any suggestions on how I should proceed?
Here is my code
def remove_duplicates(list_of_strings):
    """ function that takes a list of strings as input,
    uses an index to iterate over each string, lowers each string
    and returns a list of strings with no duplicates; does not modify the original strings.
    An exhaustive search to remove duplicates using the index of the list """
    list_of_strings_copy = list_of_strings
    try:
        for i in range(len(list_of_strings)):
            list_of_strings_copy[i] = list_of_strings_copy[i].lower()
            word = list_of_strings_copy[i]
            for j in range(len(list_of_strings_copy)):
                if word == list_of_strings_copy[j]:
                    list_of_strings.pop(i)
                    j += 1
    except Exception as e:
        print(e)
    return list_of_strings
Make a dictionary, {text.lower():text,...}, use the keys for comparison and save the first instance of the text in the values.
d = {}
for item in list_of_strings:
    if item.lower() not in d:
        d[item.lower()] = item
d.values() should be what you want.
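The dict approach, run on the list from the question (this relies on dicts preserving insertion order, which is guaranteed from Python 3.7 on):

```python
list_of_strings = ["Blood", "blood", "DNA", "ACTN4", "34-methyl-O-carboxy",
                   "Brain", "brain-facing-mouse", "BLOOD"]

d = {}
for item in list_of_strings:
    if item.lower() not in d:
        d[item.lower()] = item   # keep the first-seen spelling

result = list(d.values())
print(result)
# ['Blood', 'DNA', 'ACTN4', '34-methyl-O-carboxy', 'Brain', 'brain-facing-mouse']
```

Only the later 'blood' and 'BLOOD' are dropped; every surviving string keeps its original case, which is exactly what the question asks for.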
I think something like the following would do what you need:
def remove_duplicates(list_of_strings):
    new_list = []  # create empty return list
    for string in list_of_strings:  # iterate through list of strings
        string = string[0].capitalize() + string[1:].lower()  # capitalize first letter, lower-case the rest
        if string not in new_list:  # check string is not already in returned list
            new_list.append(string)  # if not, append it to the returned list
    return new_list  # return end list

strings = ["Blood", "blood", "DNA", "ACTN4", "34-methyl-O-carboxy", "Brain", "brain-facing-mouse", "BLOOD"]
returned_strings = remove_duplicates(strings)
print(returned_strings)
(For reference this was written in Python 3.10)

Python "not in" not correctly identifying strings which do not contain a particular letter [duplicate]

This question already has answers here:
How to remove items from a list while iterating?
(25 answers)
Closed last year.
I am attempting to write a little program to play the popular game, Wordle.
One of my functions attempts to remove words from a word bank which do not contain a particular letter. It does a decent job, but I have discovered that it occasionally fails to identify words that should be removed from the list. If I call the function repeatedly, it eventually eliminates all the correct words.
#Creates a list of 4266 five-letter words from a text file
five_letters = open("five.txt", "r")
five = five_letters.read()
five = five.split()
#Iterates across list and removes words that do not contain a particular letter
def eliminate_words_not_containing(list, letter):
    for word in list:
        if letter not in word:
            list.remove(word)
    return list

# Call on the function 10 times and print the length of the list after each call
print(len(five))
for i in range(10):
    five = eliminate_words_not_containing(five, "e")
    print(len(five))
The output is:
4266
2932
2319
2070
2014
2010
2010
2010
2010
2010
2010
How can I write the function so that it catches all the words that do not contain a particular letter the first time through?
Silly question: Is it possible for the program to be running too quickly, and it skips over words?
You're iterating over the original list, so when you remove a word the iteration position isn't adjusted. Try making a copy of the list and iterating over that.
Edit: I'd try list comprehension if it was me so:
list = [word for word in list if letter in word]
Because you are modifying your list while you are iterating over it, you are skipping words.
If your list is ['word1', 'word2', 'word3'] and you call your eliminate function, you will be left with ['word2'] because you removed 'word1', then your for loop moved on to the 2nd index, which is now 'word3' from the original list (with 'word2' being the first word in the list after the removal of 'word1').
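The skip can be reproduced with a minimal sketch, removing every word to mirror the worst case:

```python
words = ['word1', 'word2', 'word3']
for word in words:
    words.remove(word)   # mutating the list mid-iteration skips elements
print(words)  # ['word2']
```

'word1' is removed at index 0, the loop then advances to index 1, which is now 'word3', so 'word2' is never even visited.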
There are several ways to solve the problem, but one solution would be to remember the indices to remove:
def eliminate_words_not_containing(word_list, letter):
    to_remove = set()
    for i, word in enumerate(word_list):
        if letter not in word:
            to_remove.add(i)
    return [word for i, word in enumerate(word_list) if i not in to_remove]
Probably just a regular list comprehension would be better in this case, but the above would be generalized in case you want to do something more complicated with each entry.
(Also, it's not good practice to use a variable named list, because it shadows the built-in type list in Python.)
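For reference, a plain list-comprehension version of the function (the five sample words below are my own, not from the question's five.txt):

```python
def eliminate_words_not_containing(word_list, letter):
    # build a new list instead of removing from the one being iterated
    return [word for word in word_list if letter in word]

five = ['crane', 'proud', 'eject', 'glyph']
five = eliminate_words_not_containing(five, 'e')
print(five)  # ['crane', 'eject']
```

Because nothing is mutated mid-iteration, one call removes every non-matching word; calling it again changes nothing.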

Delete item in a list within a list

stopwords is a list of strings; tokentext is a list of lists of strings. (Each inner list is a sentence, and the list of lists is a text document.)
I am simply trying to take out all the strings in tokentext that also occur in stopwords.
for element in tokentext:
    for word in element:
        if word.lower() in stopwords:
            element.remove(word)
print(tokentext)
I was hoping someone could point out some fundamental flaw in the way I am iterating over the lists...
Here is a data set where it fails:
http://pastebin.com/p9ezh2nA
Altering a list while iterating on it will always create issues. Try instead something like:
stopwords = ["some", "strings"]
tokentext = [ ["some", "lists"], ["of", "strings"] ]
new_tokentext = [[word for word in lst if word not in stopwords] for lst in tokentext]
# creates a new list of words, filtering out from stopwords
Or using filter:
new_tokentext = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
# the call to `list` here is unnecessary in Python2
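Running the comprehension on that sample data, with the asker's case-insensitive check folded in (a quick check):

```python
stopwords = ["some", "strings"]
tokentext = [["some", "lists"], ["of", "strings"]]

# build new inner lists rather than removing during iteration
new_tokentext = [[word for word in lst if word.lower() not in stopwords]
                 for lst in tokentext]
print(new_tokentext)  # [['lists'], ['of']]
```

Every sentence keeps its own sub-list; only the stopwords inside each are filtered out, and the original tokentext is left untouched.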
You could just do something simple like:
for element in tokentext:
    if element in stopwords:
        stopwords.remove(element)
It's kind of like yours, but without the extra for loop. I'm not sure if this works or if it's what you're trying to achieve, but it's an idea, and I hope it helps!

Check if word is inside of list of tuples

I'm wondering how I can efficiently check whether a value is inside a given list of tuples. Say I have a list of:
("the", 1)
("check", 1)
("brown, 2)
("gary", 5)
How can I check whether a given word is inside the list, ignoring the second value of the tuples? If it were just a list of words I could use
if "the" in wordlist:
    # ...
but this will not work. Is there something along these lines I can do?
if ("the", _) in wordlist:
    # ...
Maybe use a dict (hash map):
>>> word in dict(list_of_tuples)
Use any:
if any(word[0] == 'the' for word in wordlist):
    # do something
Lookup of a word in a list is O(n) time complexity, so the more words in the list, the slower the search. To speed it up you could sort the list alphabetically by word and then use binary search, which makes the lookup O(log n). But the most efficient way is to use hashing with the set structure:
'the' in set(word for word, _ in wordlist)
This is O(1) on average, independent of how many words are in the set. BTW, a set guarantees that only one instance of each word is stored, while a list can hold as many copies of "the" as you append. The set should be constructed once; adding a new word with the .add method is O(1) too.
for tupl in wordlist:
    if 'the' in tupl:
        # ...
words, scores = zip(*wordlist)
to split the wordlist into a tuple of words and a tuple of scores, then just
print("the" in words)
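The approaches above can be compared side by side on the question's own data (a small sketch):

```python
wordlist = [("the", 1), ("check", 1), ("brown", 2), ("gary", 5)]

# O(n): scan the tuples with any()
assert any(word == "the" for word, _ in wordlist)

# O(1) average: build a set of the words once, then test membership
words = set(word for word, _ in wordlist)
assert "the" in words
assert "missing" not in words

# dict(wordlist) keys are the words, so it doubles as a fast lookup
# and also keeps the counts around
counts = dict(wordlist)
assert "the" in counts
```

For one-off checks any() is fine; if you will test many words against the same list, build the set or dict once and reuse it.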

Adding punctuations to a list?

I have a small problem with punctuation.
My assignment was to check if there were any duplicated words in a text; if there were, my job was to highlight them using .upper().
Example text: I like apples, apples is the best thing i know.
So I took the original text, stripped it of punctuation, transformed all words to lowercase and then split the text into a list.
With a for-loop I compared every word in the list with each other and found all the duplicated words; all of this was placed in a new list.
Example (after using the for-loop): i like apples APPLES is the best thing I know
So the new list is now similar to the original one, but with one major exception: it is lacking the punctuation.
Is there a way to add the punctuation to the new list where it is "supposed to be" (based on the old list's positions)?
Is there some kind of method built into Python that can do this, or do I have to compare the two lists with another for-loop and then add the punctuation to the new list?
NewList = []  # Creates an empty list
for word in text:
    if word not in NewList:
        NewList.append(word)
    elif word in NewList:
        NewList.append(word.upper())
List2 = ' '.join(NewList)
The code above works for longer texts, and it's the code I have been using for highlighting duplicated words.
The only problem is that the punctuation doesn't exist in the new file; that's the only issue I have.
Here's an example using the sub method with a callback, from the built-in re module.
This solution respects all the punctuation.
import re

txt = "I like,, ,apples, apples! is the .best. thing *I* know!!1"

def repl(match, stack):
    word = match.group(0)
    word_upper = word.upper()
    if word_upper in stack:
        return word_upper
    stack.add(word_upper)
    return word

def highlight(s):
    stack = set()
    return re.sub(r'\b[a-zA-Z]+\b', lambda match: repl(match, stack), s)

print(txt)
print(highlight(txt))
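The same technique, condensed into a closure and run on the question's own example sentence (a quick sketch to confirm the behavior):

```python
import re

def highlight(text):
    seen = set()
    def repl(match):
        upper = match.group(0).upper()
        if upper in seen:
            return upper           # repeated word: return it uppercased
        seen.add(upper)
        return match.group(0)      # first occurrence: leave it unchanged
    # \b[a-zA-Z]+\b matches whole words only, so punctuation stays in place
    return re.sub(r'\b[a-zA-Z]+\b', repl, text)

result = highlight("I like apples, apples is the best thing I know.")
print(result)  # I like apples, APPLES is the best thing I know.
```

Because the substitution works on the original string, the commas and the final period never move; only the repeated words are rewritten in place.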
