I have two different lists of words; one list (stopwords) contains words that should be excluded from the other list (kafka).
I tried:
kafka.discard(stop)  # this only works with sets, and I do not want to transform my list into a set
Is there another way to exclude the words in stop from the other list?
I am using Python 3.4.0.
Since you said you don't want to work with sets (why?), you can use a list comprehension:
kafka[:] = [x for x in kafka if x not in stop]
Edit: note the slice [:]; this method more closely resembles the behaviour of .discard() in that the identity of your collection is preserved.
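A small illustration, with made-up data, of why the slice matters: slice assignment mutates the existing list object, so every other reference to it sees the change, whereas a plain assignment would rebind the name to a brand-new list.
stop = ['the', 'a']
kafka = ['the', 'trial', 'a', 'castle']
alias = kafka  # a second name for the same list object
kafka[:] = [x for x in kafka if x not in stop]
print(alias)  # ['trial', 'castle'] -- updated in place, so the alias sees it too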
You can try this:
stopwords_set = set(stopwords)
kafka = [word for word in kafka if word not in stopwords_set]
One way to remove every word in your stopwords list from kafka is:
for word in stopwords:
    while word in kafka:
        kafka.remove(word)
Related
I'm trying to remove punctuation from tokenized text in Python like so:
word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)
This works somewhat: I manage to remove a lot of the punctuation marks, but for some reason many of the punctuation marks in word_tokens are still left.
If I run the code another time, it removes some more of the punctuation. After running the same code three times, all the marks are removed. Why does this happen?
It doesn't seem to matter whether punctuation_marks is a list, a string, or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time.
Is there a simple way to fix this problem so that it is sufficient to run the code only once?
You are removing elements from the same list you are iterating over. It seems you are aware of the potential problem; that's why you added the line:
w = word_tokens
However, that line doesn't actually create a copy of the object referenced by word_tokens; it only makes w reference the same object. To create a copy you can use the slicing operator, replacing the line above with:
w = word_tokens[:]
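Putting it together, a minimal sketch of the fixed loop (assuming the nltk tokenizer and the punctuation_marks collection from your question): iterating over the original while removing from the copy clears everything in a single pass.
word_tokens = nltk.word_tokenize(text)
w = word_tokens[:]  # a real copy, so removing from w doesn't disturb the iteration
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)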
Why don't you add tokens that are not punctuation instead?
word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
Suggestions:
I see you are creating word tokens. If that's the case, I would suggest removing punctuation before tokenizing the text. You can use the str.translate method together with string.punctuation from the built-in string module.
# Import the library
import string
# Build a translation table that deletes punctuation
tr = str.maketrans("", "", string.punctuation)
# Remove punctuation
text = text.translate(tr)
# Get the word tokens
word_tokens = nltk.word_tokenize(text)
If you want to do sentence tokenization, then you may do something like the below:
from nltk.tokenize import sent_tokenize
texts = sent_tokenize(text)
for i in range(len(texts)):
    texts[i] = texts[i].translate(tr)
I suggest you try a regex, appending your results to a new list instead of manipulating word_tokens directly:
import re

word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    w_.append(re.sub(r'[.!?\-]', '', e))  # strips those marks; pure-punctuation tokens become empty strings
You are modifying the actual word_tokens, which is wrong.
For instance, say you have something like A?!B, indexed as A:0, ?:1, !:2, B:3. Your for loop has a counter (say i) that increases on each iteration. Say you remove the ? (so i=1): the list indexes shift back (the new indexes are A:0, !:1, B:2) while your counter increments to i=2. So you miss the ! character!
Best not to mess with the original string and simply copy to a new one.
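A tiny demonstration of the skipping:
tokens = ['A', '?', '!', 'B']
for t in tokens:
    if t in {'?', '!'}:
        tokens.remove(t)
print(tokens)  # ['A', '!', 'B'] -- the '!' was skipped when the indexes shifted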
stopwords is a list of strings; tokentext is a list of lists of strings (each inner list is a sentence, and the list of lists is a text document).
I am simply trying to take out all the strings in tokentext that also occur in stopwords.
for element in tokentext:
    for word in element:
        if word.lower() in stopwords:
            element.remove(word)
print(tokentext)
I was hoping someone could point out a fundamental flaw in the way I am iterating over the list.
Here is a data set where it fails:
http://pastebin.com/p9ezh2nA
Altering a list while iterating over it will always create issues. Try instead something like:
stopwords = ["some", "strings"]
tokentext = [ ["some", "lists"], ["of", "strings"] ]
new_tokentext = [[word for word in lst if word not in stopwords] for lst in tokentext]
# creates a new list of words, filtering out from stopwords
Or using filter:
new_tokentext = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
# the call to `list` here is unnecessary in Python2
You could just do something simple like:
for element in tokentext:
    if element in stopwords:
        stopwords.remove(element)
It's kind of like yours, but without the extra for loop. I'm not sure whether this works, or whether it's what you are trying to achieve, but it's an idea, and I hope it helps!
I have a small problem with punctuation.
My assignment was to check whether there were any duplicated words in a text; if there were, my job was to highlight them using .upper().
Example text: I like apples, apples is the best thing i know.
So I took the original text, stripped it of punctuation, transformed all words to lowercase, and then split it into a list.
With a for-loop I compared every word in the list with every other word and found all the duplicates; all of these were placed in a new list.
Example (after using the for-loop): i like apples APPLES is the best thing I know
So the new list is now similar to the original, with one major exception: it is missing the punctuation.
Is there a way to add the punctuation back into the new list where it is "supposed to be" (based on its position in the old list)?
Is there some method built into Python that can do this, or do I have to compare the two lists with another for-loop and then add the punctuation to the new list?
NewList = []  # creates an empty list
for word in text:
    if word not in NewList:
        NewList.append(word)
    else:  # the word is already in NewList, so highlight the repeat
        NewList.append(word.upper())
List2 = ' '.join(NewList)
The code above works for longer texts, and it's the code I have been using for highlighting duplicated words.
The only problem is that the punctuation doesn't make it into the new list.
Here's an example using the sub function with a callback, from the built-in re module.
This solution preserves all the punctuation.
import re

txt = "I like,, ,apples, apples! is the .best. thing *I* know!!1"

def repl(match, stack):
    word = match.group(0)
    word_upper = word.upper()
    if word_upper in stack:
        return word_upper
    stack.add(word_upper)
    return word

def highlight(s):
    stack = set()
    return re.sub(r'\b([a-zA-Z]+)\b', lambda match: repl(match, stack), s)

print(txt)
print(highlight(txt))
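Run as-is, this should print the text twice: first the original, then a version where repeated words are upper-cased while first occurrences stay as typed:
I like,, ,apples, apples! is the .best. thing *I* know!!1
I like,, ,apples, APPLES! is the .best. thing *I* know!!1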
So I'm new to Python and I would like some help writing code that does the following:
when an input is typed, if the input matches a list entry in at least its first three letters, the input is replaced with that list entry as the output.
Example:
Jan
-Scans List-
-Finds Janice-
-Replaces and gives output as Janice Instead of Jan-
Janice
So far:
getname = []
for word in args:
    room.usernames.get(args, word)
room.usernames is the list and args is the input.
Error: AttributeError: 'list' object has no attribute 'get'
There is a module used, ch.py, located at http://pastebin.com/5BLZ0UA0
You will need to:
make a replacement words dictionary
get some input
do sanity checking to make sure it fits your parameters
split it into a list
loop through your new list and replace each word with its replacement in your dict, if it is in there
I won't write all of this for you. But here's a tip for how to do that last part: use the get method of dicts - it allows you to provide a "fallback" in case the word is not found in the dict. So just fall back to the word itself.
replacement_words = {'jan':'janice','foo':'bar'}
my_list = ['jan','is','cool']
[replacement_words.get(word,word) for word in my_list]
Out[41]: ['janice', 'is', 'cool']
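For the prefix-matching part of the question, here is a minimal sketch; expand, usernames, and min_len are made-up names for illustration, not part of the ch.py module:
def expand(word, usernames, min_len=3):
    # only try to expand inputs of at least min_len letters
    if len(word) >= min_len:
        for name in usernames:
            if name.lower().startswith(word.lower()):
                return name  # first matching username wins
    return word  # no match: keep the original input

print(expand('Jan', ['Janice', 'Bob']))  # Janice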
You could try this:
getname = []
for word in args:
    "%s" % (room.usernames).get(args, word)
I'm writing a spell checking function and I'm using two text files: one that has misspelled text and a text file with a bunch of words from the dictionary. I have turned the text of misspelled words into a list of strings and the text file with dictionary words into a list of words. Now I need to see if the words in my misspelled list are in my list of dictionary words.
def spellCheck():
    checkFile = input('Enter file name: ')
    inFile = open(checkFile, 'r')

    # This separates my original text file into a list like this:
    # [['It','was','the','besst','of','times,'],
    #  ['it','was','teh','worst','of','times']]
    separate = []
    for line in inFile:
        separate.append(line.split())

    # This opens my list of words from the dictionary and
    # turns it into a list of the words.
    wordFile = open('words.txt', 'r')
    words = wordFile.read()
    wordList = list(words.split())
    wordFile.close()

    # I need this newList to be a list of the correctly spelled words
    # in my separate[] list, and if a word isn't spelled correctly
    # it will go into another if statement...
    newList = []
    for word in separate:
        if word in wordList:
            newList.append(word)
    return newList
Try this:
newList = []
for line in separate:
    for word in line:
        if word in wordList:
            newList.append(word)
return newList
The problem you had was that you were iterating over separate, which is a list of lists, so each word in your loop was actually an entire line (a list). No list ever appears in your wordList, which is why that if-statement always failed. The words you actually want to check are inside the sublists contained in separate, so you iterate over them with a second for-loop. You can also use for word in itertools.chain.from_iterable(separate).
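For example, with made-up data:
import itertools

separate = [['It', 'was', 'besst'], ['of', 'times']]
for word in itertools.chain.from_iterable(separate):
    print(word)  # prints 'It', 'was', 'besst', 'of', 'times' -- one flat stream of words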
Hope this helps
First, a word about data structures. Instead of lists, you should use sets, since you (apparently) only want one copy of each word. You can create sets out of your lists:
input_words = set(word for line in separate for word in line) # since it is a list of lists
correct_words = set(word_list)
Then, it is simple as that:
new_list = input_words.intersection(correct_words)
And if you want the incorrect words, there is another one-liner:
incorrect = input_words.difference(correct_words)
Note that I used names_with_underscores, instead of CamelCase, as recommended in PEP 8.
Bear in mind, however, that this is not very efficient for spell checking, since you don't examine context.
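A quick demonstration with made-up data (set ordering may vary when printed):
separate = [['It', 'was', 'the', 'besst'], ['of', 'times']]
word_list = ['it', 'was', 'the', 'best', 'of', 'times']

input_words = set(word for line in separate for word in line)
correct_words = set(word_list)

print(input_words.intersection(correct_words))  # {'was', 'the', 'of', 'times'}
print(input_words.difference(correct_words))    # {'It', 'besst'}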