I'm trying to remove punctuation from tokenized text in Python like so:
word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)
This works somewhat: I manage to remove a lot of the punctuation marks, but for some reason many of them are still left in word_tokens.
If I run the code another time, it removes some more of the punctuation. After running the same code 3 times, all the marks are removed. Why does this happen?
It doesn't seem to matter whether punctuation_marks is a list, a string or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time.
Is there a simple way to fix this problem so that it is sufficient to run the code only once?
You are removing elements from the same list that you are iterating over. It seems you are aware of the potential problem; that's why you added the line:
w = word_tokens
However, that line doesn't actually create a copy of the object referenced by word_tokens; it only makes w reference the same object. To create a copy you can use the slicing operator, replacing the line above with:
w = word_tokens[:]
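Putting it all together, a minimal sketch of the corrected loop might look like this (assuming text holds your input and punctuation_marks is something like string.punctuation):
import string
import nltk

punctuation_marks = set(string.punctuation)  # assumed definition for the sketch
word_tokens = nltk.word_tokenize(text)
w = word_tokens[:]            # w is now an independent copy
for e in word_tokens:         # iterate over the original...
    if e in punctuation_marks:
        w.remove(e)           # ...and mutate only the copy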
Why don't you add the tokens that are not punctuation marks instead?
word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
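Equivalently, the same filter can be written as a one-line list comprehension (same word_tokens and punctuation_marks as above):
w = [e for e in word_tokens if e not in punctuation_marks]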
Suggestions:
I see you are creating word tokens. If that's the case, I would suggest removing punctuation before tokenizing the text. You can use the translate method of strings together with string.punctuation, which is already available in the standard library.
# Import the libraries
import string
import nltk

# Build the translation table that strips punctuation
tr = str.maketrans("", "", string.punctuation)

# Remove punctuation
text = text.translate(tr)

# Get the word tokens
word_tokens = nltk.word_tokenize(text)
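As a quick sanity check, here is a sketch with a made-up sentence (assuming nltk and its punkt tokenizer data are installed):
sample = "Hello, world! Is this working?"
print(sample.translate(tr))                       # Hello world Is this working
print(nltk.word_tokenize(sample.translate(tr)))   # ['Hello', 'world', 'Is', 'this', 'working']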
If you want to do sentence tokenization instead, you can do something like this:
from nltk.tokenize import sent_tokenize
texts = sent_tokenize(text)
for i in range(len(texts)):
    texts[i] = texts[i].translate(tr)
I suggest you use a regex and append the results to a new list instead of manipulating word_tokens directly:
import re

word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    w_.append(re.sub(r'[.!?\-]', '', e))
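Note that tokens consisting only of those characters become empty strings after the substitution; a possible follow-up is to drop them:
w_ = [token for token in w_ if token]  # discard tokens that were pure punctuation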
You are modifying the actual word_tokens list, which is the problem.
For instance, say you have something like A?!B, indexed as A:0, ?:1, !:2, B:3. Your for loop has an internal counter (say i) that increases on each iteration. Say you remove the ? (so i=1): the list indexes shift back (the new indexes are A:0, !:1, B:2) while the counter moves on to i=2, so the ! character is skipped entirely!
It's best not to mess with the original list and simply copy the items you want into a new one.
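A tiny sketch demonstrating the skip (the list here is made up purely for illustration):
tokens = ['A', '?', '!', 'B']
for e in tokens:
    if e in '?!':
        tokens.remove(e)
print(tokens)  # ['A', '!', 'B'] -- the '!' survives because the indexes shifted under the iterator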
Related
I am reading badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers, like here, which do not work within a sentence.)
Edit: About the impossibility mentioned, I agree. One option is to keep all possible candidates. In my case this edited text will be matched against another database that has the original (clean) text. This way, any wrong removal of spaces just gets tossed away.
You could use the PyEnchant package to check words against an English dictionary. I will assume that two adjacent fragments which have no meaning on their own, but do when joined, form a single word, and use the following code to find words that were split by a single space:
import enchant

text = "int ernational trade is not good for economies"
fixed_text = []
d = enchant.Dict("en_US")

for i in range(len(words := text.split())):
    if fixed_text and not d.check(words[i]) and d.check(compound_word := ''.join([fixed_text[-1], words[i]])):
        fixed_text[-1] = compound_word
    else:
        fixed_text.append(words[i])

print(' '.join(fixed_text))
This splits the text on spaces and appends words to fixed_text. When the current word is not in the dictionary, but joining it onto the previously added word does make a valid word, it sticks those two together.
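For the sample sentence above, this should print international trade is not good for economies, assuming the en_US dictionary recognizes the joined word.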
This should help sanitize most of the invalid words, but as the comments mentioned it is sometimes impossible to find out if two words belong together without performing some sort of lexical analysis.
As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces in words by compounding previously added words which are not in the dictionary. However, since a lot of smaller words are valid in the English language, many spaced out words don't correctly concatenate.
import enchant

text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []
d = enchant.Dict("en_US")

for i in range(len(words := text.split())):
    if fixed_text and not d.check(compound_word := words[i]):
        for j, pending_word in enumerate(fixed_text[::-1], 1):
            if not d.check(pending_word) and d.check(compound_word := ''.join([pending_word, compound_word])):
                del fixed_text[-j:]
                fixed_text.append(compound_word)
                break
        else:
            fixed_text.append(words[i])
    else:
        fixed_text.append(words[i])

print(' '.join(fixed_text))
train_data = ["Consultant, changing, Waiting"]
I'm trying to apply the stemmer to the data with the following code, but it keeps the original data:
stemmer = stem.porter.PorterStemmer()
train_stemmer = train_data
for i in range(len(train_stemmer)):
    train_stemmer[i] = stemmer.stem(train_stemmer[i])
The code runs fine but does not produce my expected result, which is:
["Consult, change, Wait"]
Two things jump out:
train_data in your question is a list containing one string ["Consult, change, Wait"], rather than a list of three strings ["Consult", "change", "Wait"]
Stemming converts to lowercase automatically
If you intended for the list to contain one string, this should work fine:
from nltk.stem import porter
stemmer = porter.PorterStemmer()
# List of one string
string_in_list = ["Consult, change, Wait"]
for word in string_in_list:
    print(stemmer.stem(word))
print("----")
If you wanted a list of three strings, then modify it to put quotes around each word:
# List of three strings
individual_words = ["Consult", "change", "Wait"]
for word in individual_words:
    print(stemmer.stem(word))
print("----")
Keeping the uppercase vs. lowercase first letter requires passing a parameter, but it can make sense if you're trying to handle proper nouns (e.g. to distinguish the stemmed change, which becomes chang, from the name Chang).
# Stem but do not convert first character to lowercase
for word in individual_words:
    print(stemmer.stem(word, to_lowercase=False))
Expected output when all three run:
consult, change, wait
----
consult
chang
wait
----
Consult
chang
Wait
I need to add further conditions when cleaning the data, which include removing stopwords, days of the week and months.
For days of the week and months I created separate lists (I do not know if there is an already built-in package in Python that includes them). For numbers I would use isdigit.
So something like this:
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# need to put into lower case
months=['January','February','March', 'April','May','June','July','August','September','October','November','December']
# need to put into lower case
cleaned = [w for w in remove_punc.split() if w.lower() not in stopwords.words('english')]
How could I include these in the code above? I know it comes down to extra conditions to take into account, but I am struggling with it.
You could convert all your lists to sets and take their union to build the final exclusion set. Then it's just a matter of checking whether each word is in that set. Something like the following would work:
# existing code
from nltk.corpus import stopwords
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# need to put into lower case
months=['January','February','March', 'April','May','June','July','August','September','October','November','December']
# need to put into lower case
# add these lines
stop_words = set(stopwords.words('english'))
lowercase_days = {item.lower() for item in days}
lowercase_months = {item.lower() for item in months}
exclusion_set = lowercase_days.union(lowercase_months).union(stop_words)
# now do the final check
cleaned = [w for w in remove_punc.split() if w.lower() not in exclusion_set and not w.isdigit()]
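As a quick illustration with a made-up remove_punc string (the variable contents below are just for the sketch):
remove_punc = "Delivered on Monday 15 January by the supplier"
# stopwords ('on', 'by', 'the'), the day, the month and the digits are all dropped
cleaned = [w for w in remove_punc.split() if w.lower() not in exclusion_set and not w.isdigit()]
print(cleaned)  # ['Delivered', 'supplier']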
I have a small problem with punctuation.
My assignment was to check whether there were any duplicated words in a text; if there were, my job was to highlight them using .upper().
Example text: I like apples, apples is the best thing i know.
So I took the original text, stripped it of punctuation, transformed all words to lowercase and then split it into a list.
With a for-loop I compared every word in the list with the others and found all duplicated words; all of this was placed in a new list.
Example (after using the for-loop): i like apples APPLES is the best thing I know
So the new list is now similar to the original list, with one major exception: it is lacking the punctuation.
Is there a way to add the punctuation to the new list where it is "supposed to be" (based on the positions in the old list)?
Is there some kind of method built into Python that can do this, or do I have to compare the two lists with another for-loop and then add the punctuation to the new list?
NewList = []  # creates an empty list
for word in text:
    if word not in NewList:
        NewList.append(word)
    elif word in NewList:
        NewList.append(word.upper())
List2 = ' '.join(NewList)
The code above works for longer texts, and it's the code I have been using for highlighting duplicated words.
The only problem is that the punctuation doesn't exist in the new list; that's the only problem I have.
Here's an example using re.sub with a callback from the built-in re module.
This solution respects all the punctuation.
import re
txt = "I like,, ,apples, apples! is the .best. thing *I* know!!1"
def repl(match, stack):
    word = match.group(0)
    word_upper = word.upper()
    if word_upper in stack:
        return word_upper
    stack.add(word_upper)
    return word

def highlight(s):
    stack = set()
    return re.sub(r'\b([a-zA-Z]+)\b', lambda match: repl(match, stack), s)

print(txt)
print(highlight(txt))
I have defined the following code:
import string
import nltk

exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList = ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList) - exclude)]
print(answer)
I have previously printed exclude, and the quotation mark " is part of it. I expected answer to be ['the']. However, when I print answer, it shows up as ['"the']. I'm not entirely sure why it isn't taking out the punctuation correctly. Would I need to check each character individually instead?
When you create a set from wordList it stores the string '"the' as its only element:
>>> set(wordList)
{'"the'}
So taking the set difference returns the same set, because '"the' as a whole is not a punctuation character:
>>> set(wordList) - set(string.punctuation)
{'"the'}
If you just want to remove the punctuation characters, you probably want something like:
>>> [word.translate(str.maketrans('', '', string.punctuation)) for word in wordList]
['the']
Here I'm using the translate method of strings with a table built by str.maketrans, whose third argument lists the characters to remove.
You can then perform the lemmatization on the new list.
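Putting it together, a minimal sketch (assuming nltk and its wordnet data are installed; the sample words are made up):
import string
import nltk

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
table = str.maketrans('', '', string.punctuation)

wordList = ['"the', 'dogs,', 'geese']
cleaned = [word.translate(table) for word in wordList]        # strip punctuation from each word
answer = [lmtzr.lemmatize(word.lower()) for word in cleaned]  # then lemmatize
print(answer)  # should print ['the', 'dog', 'goose']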