Python NLTK not taking out punctuations correctly - python

I have the following code:
import string
import nltk
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList = ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer
I previously printed exclude, and the quotation mark " is part of it, so I expected answer to be ['the']. However, when I print answer, it shows up as ['"the']. I'm not sure why the punctuation isn't being removed. Do I need to check each character individually instead?

When you create a set from wordList, it stores the whole string '"the' as its only element:
>>> set(wordList)
set(['"the'])
The set difference compares whole elements, not individual characters, so it returns the same set:
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you just want to remove punctuation, you probably want something like:
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
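Here I'm using the translate method of strings, passing None as the translation table and a second argument specifying which characters to remove. This two-argument form only works on Python 2 str objects; on Python 3, build a table with str.maketrans('', '', string.punctuation) and pass that to translate instead.
You can then perform the lemmatization on the new list.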
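For Python 3, the same idea could be sketched as follows. This is a minimal sketch rather than the answerer's code: the names PUNCT_TABLE and remove_punctuation_chars are hypothetical, and lemmatization is left out so the snippet runs without NLTK.

```python
import string

# Translation table that deletes every character found in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_chars(words):
    """Return a new list with punctuation characters removed from each word."""
    return [word.translate(PUNCT_TABLE) for word in words]

print(remove_punctuation_chars(['"the']))  # ['the']
```

The cleaned list can then be passed to the lemmatizer exactly as in the question.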

Related

How to determine if all substrings in a string contains duplicates?

I'm facing this issue:
I need to remove duplications from the beginning of each word of a text, but only if
all words in the text are duplicated. (And capitalized after)
Examples:
text = str("Thethe cacar isis momoving vvery fasfast")
So this text should be treated and printed as:
output:
"The car is moving very fast"
I have these two ways to treat the text:
import re
phrase = "Thethe cacar isis momoving vvery fasfast"
phrase_up = phrase.upper()
text = re.sub(r'(.+?)\1+', r'\1', phrase_up)
text_cap = text.capitalize()
# -> "The car is moving very fast"
Or:
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)
    return ''.join(sorted_letters)
words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
What I can't work out is how to determine whether a text needs this treatment.
Because if we get a text such as:
"This meme is funny, said Barbara"
Where even though "meme" and "Barbara" (ar - ar) are repeating substrings, not all are, so this text shouldn't be treated.
Any pointers here?
I would suggest checking whether each word is a legal word, using something like what is described in this post's best answer. If the word is not an English word, then you should apply the regex.
For example, a word like meme should be in the English dictionary, so you should not check it for repetitions.
So I would first split the string on spaces to get the tokens, then check whether each token is an English word. If it is, skip the regex check. Otherwise, check for repetitions.
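One way to sketch the "only if all words are duplicated" test is shown below. It is an illustration under assumptions, not the answerer's code: the function name fix_if_all_duplicated and the optional known_words set (standing in for a real English dictionary lookup) are hypothetical.

```python
import re
import string

# Matches a word whose beginning is immediately repeated, e.g. "Thethe" or "cacar".
DUP_PREFIX = re.compile(r"^(.+?)\1", re.IGNORECASE)

def fix_if_all_duplicated(text, known_words=frozenset()):
    """Collapse duplicated word beginnings, but only if every word has one."""
    words = text.split()
    for word in words:
        bare = word.strip(string.punctuation).lower()
        # A legal word, or a word with no duplicated beginning, means the
        # text as a whole should be left alone.
        if bare in known_words or not DUP_PREFIX.match(word):
            return text
    fixed = " ".join(DUP_PREFIX.sub(r"\1", word) for word in words)
    return fixed.capitalize()

print(fix_if_all_duplicated("Thethe cacar isis momoving vvery fasfast"))
# The car is moving very fast
print(fix_if_all_duplicated("This meme is funny, said Barbara"))
# This meme is funny, said Barbara
```

In the second example the text is returned unchanged because "This" has no duplicated beginning, even though "meme" does.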

Remove punctuation marks from tokenized text using for loop

I'm trying to remove punctuation from tokenized text in Python like so:
word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)
This works to some extent: it removes a lot of the punctuation marks, but for some reason many of them are still left in word_tokens.
If I run the code again, it removes some more of the punctuation. After running the same code 3 times, all the marks are removed. Why does this happen?
It doesn't seem to matter whether punctuation_marks is a list, a string or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time.
Is there a simple way to fix this so that running the code once is sufficient?
You are removing elements from the same list that you are iterating over. It seems you are aware of the potential problem, which is why you added the line:
w = word_tokens
However, that line doesn't actually create a copy of the object referenced by word_tokens; it only makes w reference the same object. To create a copy you can use the slicing operator, replacing the above line with:
w = word_tokens[:]
Why don't you add tokens that are not punctuations instead?
word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
Suggestions:
I see you are creating word tokens. If that's the case, I would suggest removing punctuation before tokenizing the text. You can use the translate method of str together with str.maketrans and the punctuation constant from the string module.
# Import the library
import string
# Build a translation table that deletes punctuation
tr = str.maketrans("", "", string.punctuation)
# Remove punctuation
text = text.translate(tr)
# Get the word tokens
word_tokens = nltk.word_tokenize(text)
If you want to do sentence tokenization, then you may do something like the below:
from nltk.tokenize import sent_tokenize
texts = sent_tokenize(text)
for i in range(0, len(texts)):
    texts[i] = texts[i].translate(tr)
I suggest you use a regex and append the results to a new list rather than manipulating word_tokens directly:
import re
word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    w_.append(re.sub(r'[.!?\-]', '', e))
You are modifying word_tokens itself while iterating over it, which is the problem.
For instance, say you have something like A?!B, indexed as A:0, ?:1, !:2, B:3. Your for loop has a counter (say i) that increases on each iteration. Say you remove the ? (at i=1): the remaining elements shift back (the new indexes are A:0, !:1, B:2) and your counter increments to i=2. So you skip the ! character entirely.
Best not to modify the original list; simply copy the tokens you want into a new one.
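The skipping effect described above can be reproduced without NLTK. The sketch below uses hypothetical helper names (remove_in_place, without_punctuation) and a hand-written token list in place of real tokenizer output.

```python
import string

PUNCT = set(string.punctuation)

def remove_in_place(tokens):
    """Buggy on purpose: removes from the list being iterated, so items get skipped."""
    for token in tokens:
        if token in PUNCT:
            tokens.remove(token)
    return tokens

def without_punctuation(tokens):
    """Builds a new list and leaves the input list untouched."""
    return [token for token in tokens if token not in PUNCT]

print(remove_in_place(["A", "?", "!", "B"]))      # ['A', '!', 'B']
print(without_punctuation(["A", "?", "!", "B"]))  # ['A', 'B']
```

The first call leaves the "!" behind because removing "?" shifts it into the position the loop has already visited; the second needs only one pass.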

How to avoid .replace replacing a word that was already replaced

Given a string, I have to reverse every word while keeping the words in their original positions.
I tried:
def backward_string_by_word(text):
    for word in text.split():
        text = text.replace(word, word[::-1])
    return text
But if I have the string Ciao oaiC, when it tries to reverse the second word, that word is identical to the first one after it has already been reversed, so it gets replaced again. How can I avoid this?
You can use join in one line plus generator expression:
text = "test abc 123"
text_reversed_words = " ".join(word[::-1] for word in text.split())
s.replace(x, y) is not the correct method to use here:
It does two things:
find x in s
replace it with y
But you do not really need to find anything here, since you already have the word you want to replace. The problem is that replace starts searching for x from the beginning of the string each time, not from the position you are currently at, so it finds the word you have already replaced, not the one you want to replace next.
The simplest solution is to collect the reversed words in a list, and then build a new string out of this list by concatenating all reversed words. You can concatenate a list of strings and separate them with spaces by using ' '.join().
def backward_string_by_word(text):
    reversed_words = []
    for word in text.split():
        reversed_words.append(word[::-1])
    return ' '.join(reversed_words)
If you have understood this, you can also write it more concisely by skipping the intermediate list with a generator expression:
def backward_string_by_word(text):
    return ' '.join(word[::-1] for word in text.split())
Splitting a string converts it to a list. You can just reassign each value of that list to the reverse of that item. See below:
text = "The cat tac in the hat"
def backwards(text):
    split_word = text.split()
    for i in range(len(split_word)):
        split_word[i] = split_word[i][::-1]
    return ' '.join(split_word)
print(backwards(text))
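The split-and-join answers above collapse any run of whitespace into a single space. If the original spacing has to be kept, one possible variation (not taken from the answers; the function name is hypothetical) is to reverse each run of non-whitespace characters in place with re.sub:

```python
import re

def reverse_words_keep_spacing(text):
    """Reverse every word where it stands, leaving all whitespace untouched."""
    # \S+ matches one run of non-whitespace characters at a time, so each
    # word is reversed at its own position and is never searched for again.
    return re.sub(r"\S+", lambda m: m.group()[::-1], text)

print(reverse_words_keep_spacing("Ciao oaiC"))  # oaiC Ciao
```

Because each match is handled at its own position, the Ciao oaiC case from the question comes out correctly.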

Substring replacements based on replace and no-replace rules

I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Whole words replacement only, interpunction should be ignored, but kept after replacement.
What would be the cleanest way to solve this problem in Python 3.x?
Based on the answer of demongolem.
UPDATE
Sorry, I missed the fact that only whole words should be replaced. I have updated my code and generalized it into a function.
import re
def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # A whole word is a token delimited by quotes, punctuation or spaces
    delims = r"""["'.,:; ]"""
    rx = f"{delims}({replace_token}){delims}"
    out_sentence = sentence
    found = []
    indices = []
    for m in re.finditer(rx, sentence):
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[max(0, indices[i]-context_size):indices[i]+context_size]
        if dont_replace in context:
            continue
        # First replace the word only in the substring found
        to_replace = found[i].replace(replace_token, replace_with)
        # Then replace the word in the context found, so any special token like "" or . gets taken over and the context does not change
        replace_val = context.replace(found[i], to_replace)
        # Finally replace the context found with the replacing context
        out_sentence = out_sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions with finditer() to find all occurrences of your string and their positions (we need to check whether each one is a whole word or embedded in another word). You might need to adjust rx to match your definition of "whole word". Then take the context around each occurrence, sized according to your no_replace rule, and check whether that context contains your no_replace string.
If not, you may replace it: use replace() on the word only, then replace the occurrence of the word in the context, then replace the context in the whole text. That way the replacement is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling the match_fun whenever a match has been found, and the match_fun only performs the replacement, if and only if, there is no "no-replace-phrase" overlapping with the current match. Let me know if you need more clarification or if you believe something can be improved.
import re
replace_dict = ... # The code below assumes you already have this
no_replace_dict = ... # The code below assumes you already have this
text = ... # The text on input.
def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + no_replace + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            # Any overlap between the no-replace phrase and the current match blocks the replacement
            if no_replace_match.start() < match.end() and no_replace_match.end() > match.start():
                return str_match
    return replace_dict[str_match]
for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
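As a self-contained illustration of the same match-function idea, here is a minimal sketch using the example sentences from the question. The function name replace_outside_phrases and its signature are hypothetical, and it takes the no-replace phrases as a plain set (like no_replace_set in the question) rather than the per-word dictionary assumed above.

```python
import re

def replace_outside_phrases(text, replace_dict, no_replace_phrases):
    """Replace whole words from replace_dict unless they overlap a protected phrase."""
    if not replace_dict:
        return text
    # Character spans covered by any protected phrase.
    protected = [
        m.span()
        for phrase in no_replace_phrases
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text)
    ]

    def repl(match):
        start, end = match.span()
        # Two spans overlap when each one starts before the other ends.
        if any(p_start < end and start < p_end for p_start, p_end in protected):
            return match.group()
        return replace_dict[match.group()]

    # One case-sensitive alternation of all keys, anchored on word boundaries.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, replace_dict)) + r")\b")
    return pattern.sub(repl, text)

sen1 = "This is an example sentence that needs to be processed into a new sentence."
print(replace_outside_phrases(sen1, {"sentence": "processed_sentence"}, {"example sentence"}))
# This is an example sentence that needs to be processed into a new processed_sentence.
```

Computing the protected spans once up front avoids re-scanning the text for every match, and the word-boundary anchors keep 'sentencepiece' untouched.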

Using strip() method and punctuation in Python

I've been trying to solve this problem for a few hours now and can't come up with the right solution. This is the question:
Write a loop that creates a new word list, using a
string method to strip the words from the list created in Problem 3
of all leading and trailing punctuation. Hint: the string library,
which is imported above, contains a constant named punctuation.
Three lines of code.
Here is my code:
import string
def litCricFriend(wordList, text):
    theList = text.lower().replace('-', ' ').split() #problem 3
    #problem below
    for word in theList:
        word.strip(string.punctuation)
    return theList
You've got a couple bits in your code that... well, I'm not really sure why they're there, to be honest, haha. Let's work through this together!
I'm assuming you have been given some text: text = "My hovercraft is full of eels!". Let's split this into words, make the words lowercase, and remove all punctuation. We know we need string.punctuation and str.split(), and you've also figured out that str.replace() is useful. So let's use these and get our result!
import string
def remove_punctuation(text):
    # First, let's remove the punctuation.
    # We do this by looping through each punctuation mark in the
    # `string.punctuation` string, and then replacing that mark with
    # the empty string and re-assigning that to the same variable.
    for punc in string.punctuation:
        text = text.replace(punc, '')
    # Now our text is all de-punctuated! So let's make a list of
    # the words, all lowercased, and return it in one go:
    return text.lower().split()
Looks to me like the function is only three lines, which is what you said you wanted!
For the advanced reader, you could also use functools and do it in one line (I split it into two for readability, but it's still "one line"):
import string
import functools
def remove_punctuation(text):
    return functools.reduce(lambda newtext, punc: newtext.replace(punc, ''),
                            string.punctuation, text).lower().split()
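Note that the code above deletes punctuation everywhere in the text, whereas the assignment asks only for leading and trailing punctuation to be stripped from each word. A minimal sketch of that reading follows; the function name strip_edges is hypothetical. Strings are immutable, so the result of strip() has to be stored, which is the step missing from the question's loop.

```python
import string

def strip_edges(words):
    """Return a new list with leading and trailing punctuation stripped from each word."""
    stripped = []
    for word in words:
        # str.strip returns a new string; punctuation inside the word is kept.
        stripped.append(word.strip(string.punctuation))
    return stripped

print(strip_edges(['"hello,', "don't", "eels!"]))  # ['hello', "don't", 'eels']
```

The apostrophe in "don't" survives because strip() only removes characters from the two ends of the string.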
