import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
list1 = ['This is text', 'This is another text']
stp = stopwords.words('english')  # NLTK expects the lowercase fileid 'english'
lower_token = [t.lower().split() for t in list1]
new2 = []
for tokens in lower_token:  # avoid shadowing the built-in name `list`
    new1 = []
    for word in tokens:
        if word not in stp:
            new1.append(word)
    new2.append(new1)
>>> new2
[['text'], ['another', 'text']]
In the loop above, I am trying to split each text into a list of words and then exclude any word that occurs in the stp list. I was able to achieve this with a for loop, but I would like to achieve the same result with a list comprehension, and so far I have failed.
Here is my unsuccessful attempt using a list comprehension:
[word for list in lower_token for word in list if word not in stp]
You're very close; you just need to put the inner list comprehension in brackets too. That also makes it more readable:
[[word for word in txt.lower().split() if word not in stp] for txt in list1]
You need to enclose the inner comprehension in brackets as well:
[[word for word in tokens if word not in stp] for tokens in lower_token]
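For reference, here is a minimal end-to-end sketch putting the pieces above together (using a set for stp is my own tweak, since set membership tests are faster than list lookups):

import nltk
nltk.download('stopwords')  # no-op if the corpus is already present
from nltk.corpus import stopwords

texts = ['This is text', 'This is another text']
stp = set(stopwords.words('english'))

filtered = [[w for w in t.lower().split() if w not in stp] for t in texts]
print(filtered)  # [['text'], ['another', 'text']]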
I am using the following Python program to remove stop words from texts.
import re
from sklearn.feature_extraction import text
mylist = [['an_undergraduate'], ['state_of_the_art', 'terminology']]
# Remove stop words
stops = list(text.ENGLISH_STOP_WORDS)
pattern = re.compile(r'|'.join([r'(\_|\b){}\b'.format(x) for x in stops]))
for k in mylist:
    for idx, item in enumerate(k):
        if item not in stops:
            item = pattern.sub('', item).strip()
            k[idx] = item
I want the output to be:
mylist = [['undergraduate'], ['state_art', 'terminology']]
However, the pattern I have written does not capture the stop words properly. How can I fix this?
If you check the source code, sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is of type frozenset, so there is no need to cast it to a list. Instead of using a regex, this nested list comprehension will be much more efficient:
>>> from sklearn.feature_extraction import text
>>> mylist= [['an_undergraduate'], ['state_of_the_art', 'terminology']]
>>> [['_'.join([w for w in i.split('_') if w not in text.ENGLISH_STOP_WORDS]) for i in e] for e in mylist]
[['undergraduate'], ['state_art', 'terminology']]
Here I first split each string on underscores, check whether each piece is present in ENGLISH_STOP_WORDS, and keep it for the new string only if it is not.
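The same logic can be packaged as a small helper; a minimal sketch (the function name remove_stops is my own):

from sklearn.feature_extraction import text

def remove_stops(phrase, stop_words=text.ENGLISH_STOP_WORDS):
    # Split on underscores, drop stop words, and rejoin what remains.
    return '_'.join(w for w in phrase.split('_') if w not in stop_words)

mylist = [['an_undergraduate'], ['state_of_the_art', 'terminology']]
print([[remove_stops(i) for i in e] for e in mylist])
# [['undergraduate'], ['state_art', 'terminology']]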
I need to parse a sentence like:
"Alice is a boy." into ['Alice', 'boy'] and
and "An elephant is a mammal." into ['elephant', 'mammal']. Meaning I need to split the string by 'is' while also remove 'a/an'.
Is there an elegant way to do it?
If you insist on using a regex, you can do it like this with re.search. Note that (?:an?\s)? optionally matches 'a ' or 'an ' without capturing it; the original [a|an]? is a character class and does not do what it looks like:
import re

print(re.search(r'(\w+) is (?:an?\s)?(\w+)', "Alice is a boy.").groups())
# output: ('Alice', 'boy')
print(re.search(r'(\w+) is (?:an?\s)?(\w+)', "An elephant is a mammal.").groups())
# output: ('elephant', 'mammal')
# apply list() if you want it as a list
This answer does not make use of a regex, but is one way of doing things:
s = 'Alice is a boy'
s = s.split() # each word becomes an entry in a list
s = [word for word in s if word != 'a' and word !='an' and word !='is']
The main downside to this is that you would need to list out every word you want to exclude in the list comprehension.
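One way to soften that downside is to keep the excluded words in a set; a minimal sketch (the name EXCLUDE and the punctuation stripping are my own additions):

EXCLUDE = {'a', 'an', 'is'}

def content_words(sentence):
    # strip trailing punctuation from each token before testing membership
    return [w.strip('.,!?') for w in sentence.split() if w.lower() not in EXCLUDE]

print(content_words("Alice is a boy."))           # ['Alice', 'boy']
print(content_words("An elephant is a mammal."))  # ['elephant', 'mammal']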
stopwords is a list of strings, and tokentext is a list of lists of strings (each inner list is a sentence; the list of lists is a text document).
I am simply trying to take out all the strings in tokentext that also occur in stopwords.
for element in tokentext:
    for word in element:
        if word.lower() in stopwords:
            element.remove(word)
print(tokentext)
I was hoping someone could point out a fundamental flaw in the way I am iterating over the list.
Here is a data set where it fails:
http://pastebin.com/p9ezh2nA
Altering a list while iterating over it will always create issues. Try something like this instead:
stopwords = ["some", "strings"]
tokentext = [ ["some", "lists"], ["of", "strings"] ]
new_tokentext = [[word for word in lst if word not in stopwords] for lst in tokentext]
# creates a new list of words, filtering out from stopwords
Or using filter:
new_tokentext = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
# the call to `list` here is unnecessary in Python2
You could just do something simple like:
for element in tokentext:
    if element in stopwords:
        stopwords.remove(element)
It's kind of like yours, but without the extra for loop. I am not sure if this works, or if that's what you are trying to achieve, but it's an idea, and I hope it helps!
I have two different lists of words; one list (stopwords) contains words that should be excluded from the other list (kafka).
I tried:
kafka.discard(stop)  # this only works with sets, and I do not want to turn my list into a set
Is there another way to exclude the words in stop from the other list?
I am using Python 3.4.0.
Since you said you don't want to work with sets (why?), you can use a list comprehension:
kafka[:] = [x for x in kafka if x not in stop]
Edit: note the slice assignment kafka[:] = ...; this more closely resembles the behaviour of .discard() in that the identity of your collection is preserved.
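A quick sketch of what "identity is preserved" means in practice (the sample data is mine):

stop = ['the', 'and']
kafka = ['the', 'trial', 'and', 'castle']

before = id(kafka)
kafka[:] = [x for x in kafka if x not in stop]  # slice assignment mutates the list in place
print(kafka)                # ['trial', 'castle']
print(id(kafka) == before)  # True: same object, so other references see the change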
You can try this:
stopwords_set = set(stopwords)
kafka = [word for word in kafka if word not in stopwords_set]
One way to remove every word in your stopwords list from kafka is:
for word in stopwords:
    while word in kafka:
        kafka.remove(word)
I'm writing a spell-checking function using two text files: one containing misspelled text and one containing dictionary words. I have turned the misspelled text into a list of lists of strings, and the dictionary file into a list of words. Now I need to see whether the words in my misspelled list are in my list of dictionary words.
def spellCheck():
    checkFile = input('Enter file name: ')
    inFile = open(checkFile, 'r')
    # This separates my original text file into a list like this:
    # [['It','was','the','besst','of','times,'],
    #  ['it','was','teh','worst','of','times']]
    separate = []
    for line in inFile:
        separate.append(line.split())
    inFile.close()
    # This opens my list of words from the dictionary and
    # turns it into a list of the words.
    wordFile = open('words.txt', 'r')
    words = wordFile.read()
    wordList = words.split()
    wordFile.close()
    # I need this newList to be a list of the correctly spelled words
    # in my separate[] list, and if a word isn't spelled correctly
    # it will go into another if statement...
    newList = []
    for word in separate:
        if word in wordList:
            newList.append(word)
    return newList
Try this:
newList = []
for line in separate:
    for word in line:
        if word in wordList:
            newList.append(word)
return newList
The problem is that you were iterating over separate, which is a list of lists, so each word in your loop was actually a whole sublist; no sublist ever equals a string in wordList, which is why the if statement always failed. The words you want are inside the sublists of separate, so iterate over them with a second for loop, or flatten the structure with itertools.chain.from_iterable(separate), as in the sketch below.
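A short sketch of the itertools variant, assuming separate and wordList are built as in the question:

from itertools import chain

# chain.from_iterable yields every word from every sublist, in order
newList = [word for word in chain.from_iterable(separate) if word in wordList]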
Hope this helps
First, a word about data structures. Instead of lists, you should use sets, since you (apparently) only want one copy of each word, and membership tests on sets are much faster. You can create sets out of your lists:
input_words = set(word for line in separate for word in line) # since it is a list of lists
correct_words = set(word_list)
Then, it is simple as that:
new_list = input_words.intersection(correct_words)
And if you want the incorrect words, you have another one liner:
incorrect = input_words.difference(correct_words)
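Putting it together on a tiny made-up input (the sample data is mine; note the comparisons are case-sensitive, so 'It' and 'it' are distinct):

separate = [['It', 'was', 'the', 'besst', 'of', 'times']]
word_list = ['it', 'was', 'the', 'best', 'of', 'times']

input_words = set(word for line in separate for word in line)
correct_words = set(word_list)

print(input_words & correct_words)  # {'was', 'the', 'of', 'times'}  -> correctly spelled
print(input_words - correct_words)  # {'It', 'besst'}                -> misspelled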
Note that I used names_with_underscores, instead of CamelCase, as recommended in PEP 8.
Bear in mind, however, that this is not very effective as a spell checker, since it does not examine context.