stopwords is a list of strings; tokentext is a list of lists of strings (each inner list is a sentence, and the list of lists is a text document).
I am simply trying to remove from tokentext all the strings that also occur in stopwords.
for element in tokentext:
    for word in element:
        if word.lower() in stopwords:
            element.remove(word)
print(tokentext)
I was hoping for someone to point out some fundamental flaw in the way I am iterating over the list.
Here is a data set where it fails:
http://pastebin.com/p9ezh2nA
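For reference, here is a minimal sketch (with made-up data) of why removing items while iterating skips elements — the asker's exact symptom:

```python
# Removing items from a list while iterating over it makes the
# iterator skip the element that slides into the freed slot.
stopwords = ["a", "the"]
sentence = ["a", "the", "cat"]  # two adjacent stopwords

for word in sentence:
    if word.lower() in stopwords:
        sentence.remove(word)

# "the" was skipped: after "a" was removed, "the" shifted to index 0,
# but the iterator had already moved on to index 1.
print(sentence)  # ['the', 'cat']
```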
Altering a list while iterating over it will always create issues. Try instead something like:
stopwords = ["some", "strings"]
tokentext = [ ["some", "lists"], ["of", "strings"] ]
new_tokentext = [[word for word in lst if word not in stopwords] for lst in tokentext]
# creates a new list of words, filtering out from stopwords
Or using filter:
new_tokentext = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
# the call to `list` here is unnecessary in Python2
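Running either version on the sample data above gives the same filtered result:

```python
stopwords = ["some", "strings"]
tokentext = [["some", "lists"], ["of", "strings"]]

# nested list comprehension
new_tokentext = [[w for w in lst if w not in stopwords] for lst in tokentext]
print(new_tokentext)  # [['lists'], ['of']]

# the filter-based version produces the same lists
filtered = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
assert filtered == new_tokentext
```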
You could just do something simple like:
for element in tokentext:
    if element in stopwords:
        stopwords.remove(element)
It's kinda like yours, but without the extra for loop. I'm not sure if this works or if it's what you're trying to achieve, but it's an idea, and I hope it helps!
Related
I have a dataframe column that looks like:
I'm looking into removing special characters. I'm hoping to attach the tags (in lists of lists) so that I can append the column to an existing df.
This is what I have so far, but it doesn't seem to work. Regex in particular is causing me a lot of pain, as it always returns "expected string or bytes-like object".
df = pd.read_csv('flickr_tags_participation_inequality_omit.csv')
#df.dropna(inplace=True) and tokenise
tokens = df["tags"].astype(str).apply(nltk.word_tokenize)
filter_words = ['.',',',':',';','?','#','-','...','!','=', 'edinburgh', 'ecosse', 'écosse', 'scotland']
filtered = [i for i in tokens if i not in filter_words]
#filtered = [re.sub("[.,!?:;-=...##_]", '', w) for w in tokens]
#the above line didn't work
tokenised_tags= []
for i in filtered:
    tokenised_tags.append(i) #this turns the single lists of tags into lists of lists
print(tokenised_tags)
The above code doesn't remove the custom-defined stopwords.
Any help is very much appreciated! Thanks!
You need to use
df['filtered'] = df['tags'].apply(lambda x: [t for t in nltk.word_tokenize(x) if t not in filter_words])
Note that nltk.word_tokenize(x) outputs a list of strings, so you can apply a regular list comprehension to it.
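A self-contained sketch of the same pattern, using a plain `str.split` as a stand-in tokenizer for `nltk.word_tokenize` and a made-up two-row DataFrame in place of the CSV:

```python
import pandas as pd

# hypothetical data standing in for the flickr tags CSV
df = pd.DataFrame({"tags": ["edinburgh castle !", "scotland hills"]})

filter_words = ['.', ',', ':', ';', '?', '#', '-', '...', '!', '=',
                'edinburgh', 'ecosse', 'écosse', 'scotland']

# str.split stands in for nltk.word_tokenize; the filtering logic is identical
df['filtered'] = df['tags'].apply(
    lambda x: [t for t in str(x).split() if t not in filter_words])

print(df['filtered'].tolist())  # [['castle'], ['hills']]
```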
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
list1 =['This is text','This is another text']
stp = stopwords.words('english')
lower_token = [t.lower().split() for t in list1]
new2=[]
for list in lower_token:
    new1=[]
    for word in list:
        if word not in stp:
            new1.append(word)
    new2.append(new1)
new2
[['text'], ['another', 'text']]
In the loop above, I split each text into a list of words and then exclude any word that occurs in the stp list. I was able to achieve the result using for loops, but I would like to achieve the same using a list comprehension, which I failed to do.
Here is my unsuccessful attempt using a list comprehension:
[word for list in lower_token for word in list if word not in stp]
You're very close, you just need to put the inner list comprehension in brackets too. That also makes it more readable.
[[word for word in txt.lower().split() if word not in stp] for txt in list1]
You need to enclose the inner comprehension in brackets too, so it builds a list:
[[word for word in list if word not in stp] for list in lower_token]
I have this list of strings and some prefixes. I want to remove all the strings from the list that start with any of these prefixes. I tried:
prefixes = ('hello', 'bye')
list = ['hi', 'helloyou', 'holla', 'byeyou', 'hellooooo']
for word in list:
    list.remove(word.startswith(prefixes))
So I want my new list to be:
list = ['hi', 'holla']
but I get this error:
ValueError: list.remove(x): x not in list
What's going wrong?
You can create a new list that contains all the words that do not start with one of your prefixes:
newlist = [x for x in list if not x.startswith(prefixes)]
The reason your code does not work is that the startswith method returns a boolean, and you're asking to remove that boolean from your list (but your list contains strings, not booleans).
Note that it is usually not a good idea to name a variable list, since this is already the name of the predefined list type.
Greg's solution is definitely more Pythonic, but in your original code, you perhaps meant something like this. Observe that we make a copy (using list[:] syntax) and iterate over the copy, because you should not modify a list while iterating over it.
prefixes = ('hello', 'bye')
list = ['hi', 'helloyou', 'holla', 'byeyou', 'hellooooo']
for word in list[:]:
    if word.startswith(prefixes):
        list.remove(word)
print(list)
print len([i for i in os.listdir('/path/to/files') if not i.startswith(('.','~','#'))])
I have two different lists of words: one list (stopwords) contains words that should be excluded from the other list (kafka).
I tried:
kafka.discard(stop) # this only works with sets, and I do not want to transform my list into a set
is there another way to exclude the words in stop from the other list?
I am using python 3.4.0
Since you said you don't want to work with sets (why?), you can use a list comprehension:
kafka[:] = [x for x in kafka if x not in stop]
edit: note the slice [:] — this method more closely resembles the behaviour of .discard() in that the identity of your collection is preserved.
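A quick demo (with made-up data) of why the slice assignment matters — every other reference to the same list object sees the change:

```python
stop = ["the", "a"]
kafka = ["the", "trial", "a", "castle"]
alias = kafka  # a second reference to the same list object

# slice assignment mutates the existing list in place
kafka[:] = [x for x in kafka if x not in stop]

print(kafka)           # ['trial', 'castle']
print(alias is kafka)  # True — same object, so the alias sees the filtering too
```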
You can try this:
stopwords_set = set(stopwords)
kafka = [word for word in kafka if word not in stopwords_set]
One way to remove every word in your stopwords list from kafka is:
for word in stopwords:
    while word in kafka:
        kafka.remove(word)
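The `while` matters because `list.remove` only deletes the first match. With hypothetical data containing duplicates:

```python
stopwords = ["the"]
kafka = ["the", "trial", "the", "castle"]

for word in stopwords:
    while word in kafka:   # keep going until every copy is gone
        kafka.remove(word) # remove() only deletes the first occurrence

print(kafka)  # ['trial', 'castle']
```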
I'm writing a spell checking function and I'm using two text files: one that has misspelled text and a text file with a bunch of words from the dictionary. I have turned the text of misspelled words into a list of strings and the text file with dictionary words into a list of words. Now I need to see if the words in my misspelled list are in my list of dictionary words.
def spellCheck():
    checkFile = input('Enter file name: ')
    inFile = open(checkFile, 'r')

    # This separates my original text file into a list like this:
    # [['It','was','the','besst','of','times,'],
    #  ['it','was','teh','worst','of','times']]
    separate = []
    for line in inFile:
        separate.append(line.split())

    # This opens my list of words from the dictionary and
    # turns it into a list of the words.
    wordFile = open('words.txt', 'r')
    words = wordFile.read()
    wordList = list(words.split())
    wordFile.close()

    # I need this newList to be a list of the correctly spelled words
    # in my separate[] list, and if a word isn't spelled correctly
    # it will go into another if statement...
    newList = []
    for word in separate:
        if word in wordList:
            newList.append(word)
    return newList
Try this:
newList = []
for line in separate:
    for word in line:
        if word in wordList:
            newList.append(word)
return newList
The problem you had was that you were iterating over separate, which is a list of lists. None of those lists exists in your wordList, which is why the if-statement always fails. The words you want to iterate over are in the sublists contained in separate, so you iterate over those words in a second for-loop. You can also use for word in itertools.chain.from_iterable(separate).
Hope this helps
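The `itertools.chain.from_iterable` variant mentioned above, on hypothetical data matching the question's shape:

```python
import itertools

separate = [['It', 'was', 'the', 'besst'], ['it', 'was', 'teh', 'worst']]
wordList = ['it', 'was', 'the', 'worst', 'best']

# chain.from_iterable flattens the list of lists into one stream of words,
# replacing the nested for-loops with a single comprehension
newList = [word for word in itertools.chain.from_iterable(separate)
           if word in wordList]

print(newList)  # ['was', 'the', 'it', 'was', 'worst'] — 'It' fails the
                # case-sensitive membership test, just like in the question
```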
First, a word about data structures. Instead of lists, you should use sets, since you (apparently) only want one copy of each word. You can create sets out of your lists:
input_words = set(word for line in separate for word in line) # since it is a list of lists
correct_words = set(word_list)
Then, it is simple as that:
new_list = input_words.intersection(correct_words)
And if you want the incorrect words, you have another one liner:
incorrect = input_words.difference(correct_words)
Note that I used names_with_underscores, instead of CamelCase, as recommended in PEP 8.
Bear in mind, however, that this is not very efficient for spell checking, since you don't examine context.
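A runnable sketch of the set approach on made-up data shaped like the question's:

```python
separate = [['It', 'was', 'the', 'besst', 'of', 'times,'],
            ['it', 'was', 'teh', 'worst', 'of', 'times']]
word_list = ['it', 'was', 'the', 'best', 'worst', 'of', 'times']

# flatten the list of lists into one set of unique words
input_words = set(word for line in separate for word in line)
correct_words = set(word_list)

correctly_spelled = input_words.intersection(correct_words)
misspelled = input_words.difference(correct_words)

# note 'It' and 'times,' count as misspelled: no case-folding
# or punctuation stripping happens here
print(sorted(correctly_spelled))  # ['it', 'of', 'the', 'times', 'was', 'worst']
print(sorted(misspelled))         # ['It', 'besst', 'teh', 'times,']
```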