How to determine if all substrings in a string contain duplicates? - python

I'm facing this issue:
I need to remove duplicated substrings from the beginning of each word of a text, but only if
all words in the text contain such a duplication. (And capitalize the result afterwards.)
Examples:
text = str("Thethe cacar isis momoving vvery fasfast")
So this text should be treated and printed as:
output:
"The car is moving very fast"
I have these two approaches to treat the text:
phrase = str("Thethe cacar isis momoving vvery fasfast")
phrase_up = phrase.upper()
text = re.sub(r'(.+?)\1+', r'\1', phrase_up)
text_cap = text.capitalize()
"The car is moving very fast"
Or:
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)
    return ''.join(sorted_letters)
words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
What I can't work out is HOW to determine whether a text needs this treatment.
Because if we get a text such as:
"This meme is funny, said Barbara"
where even though "meme" and "Barbara" (ar-ar) contain repeating substrings, not all the words do, so this text shouldn't be treated.
Any pointers here?

I would suggest adopting a solution that checks whether a word is legal, using something like what is described in this post's best answer. If the word is not an English word, then you should apply the regex.
For example, a word like meme should be in the English dictionary, so you should not check it for repetitions.
So I would first split the string on spaces to get the tokens, then check whether each token is an English word. If it is, skip the regex check; otherwise, check for repetitions.
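A minimal sketch of that idea (the word set below is a toy stand-in, an assumption for this example; in practice you might load a real dictionary file or an NLTK word corpus):

```python
import re

# Toy stand-in for an English dictionary -- an assumption for this sketch.
ENGLISH = {"this", "meme", "is", "funny", "said", "barbara"}

def fix_text(text):
    fixed = []
    for token in text.split():
        if token.lower().strip(".,!?") in ENGLISH:
            fixed.append(token)  # known English word: leave it untouched
        else:
            # collapse repeated substrings, case-insensitively ("Thethe" -> "The")
            fixed.append(re.sub(r'(.+?)\1+', r'\1', token, flags=re.IGNORECASE))
    return ' '.join(fixed)

print(fix_text("Thethe cacar isis momoving vvery fasfast"))
# The car is moving very fast
print(fix_text("This meme is funny, said Barbara"))
# This meme is funny, said Barbara
```

Note that under re.IGNORECASE a backreference like \1 also matches case-insensitively in Python, which is what lets "The"+"the" collapse without upper-casing the whole phrase first.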

Related

regex for replace the beginning of each word in the sentence

I want to replace the beginning of each word in a sentence if the beginning matches a string x, regardless of case (uppercase or lowercase).
For example:
s = "My fAther is your grandFAther and Family is important"
replacing "FA" with "zZ" in the beginning of each word in the sentence results in "My zZther is your grandFAther and zZamily is important".
A regex would be very helpful, but if that's not possible, other code would be OK. I honestly can't find an effective solution.
Also, if you can give me a regex for replacing a word's ending, it would be very nice. E.g.: "green blablaen enenbla" -> "grezz blablazz enenbla".
thank you very much :D!
my code:
# input: s, value_to_replace, new_value
w = s.split()
for i in range(len(w)):
    w[i] = re.sub(pattern='^' + value_to_replace, repl=new_value, string=w[i], flags=re.IGNORECASE)
return ' '.join(w)
But I don't know if this is efficient; maybe there is somehow a regex that applies over the whole sentence, not over each separate word.
Give this a shot, I think it'll get you what you want:
import re

def main():
    text = "My father is your grandFAther and family is important"
    text = re.sub(r"\bfa", "zZ", text, flags=re.IGNORECASE)
    print(text)

    term_text = "green blablaen enenbla"
    term_text = re.sub(r"\w{2}\b", "zz", term_text, flags=re.IGNORECASE)
    print(term_text)

if __name__ == "__main__":
    main()
My zZther is your grandFAther and zZmily is important
grezz blablazz enenbzz
You can certainly iterate/loop by splitting the sentence, but this is probably the direction you want to go in.
What we are taking advantage of is word boundaries (the \b). The \w is a character class (A-Za-z0-9_) and the {2} quantifies the preceding \w to match exactly two such characters. You can read more about word boundaries here, they are extremely powerful! :) https://www.regular-expressions.info/wordboundaries.html
note: nice update/edit to your question :)

What if I wanted the user to type a phrase and have it censor the "bad words"? Where would I have to place that?

def censor(text, word):
    t = str(text)
    w = str(word)
    if w in t:
        l = len(w)
        item = "*" * l
        return t.replace(w, item)
    else:
        return t

print(censor("salad", "s"))
So instead of having two parameters, I would want a variable that asks the user for a sentence and then checks if there are any curse words; if there aren't, it returns the sentence unchanged.
It's a good idea to keep a useful task like 'censoring a text' in a function, and user input and output separate, because the function could then be easily reused in a situation where you don't want user interaction (like censoring another written message, for example).
It also helps to think ahead a little. Your example finds one word and blots it out, but do you need to censor multiple words? And what if the word is part of another word, is it still a bad word? ("My friend from Scherpenisse, has a massive collection of Dickensian fiction, but ever since they moved to Scunthorpe, they bedamn anything Shakespearean. Hey look, an American bushtit!")
Of course the problem is even harder with names to consider, just ask Dick Cheney and George Bush.
The easy things are quick to fix though:
def censor(text, forbidden):
    return ' '.join(
        word if word.casefold() not in forbidden else '*' * len(word)
        for word in text.split())

print(censor('I like fruit salad\n\nbecause it is fruitilicious', ['fruit', 'produce']))
Why this works:
' '.join(<some iterable>) takes the elements from an iterable like a list and joins them together with the string at the start, a space in this case;
word.casefold() gives you the lowercase, simplified version of a word, to avoid things like aßhole or DaMn slipping through;
x if some_condition else y is an expression with a conditional built in;
'*' * len(word) - just a short version of what you already found.
word for word in text.split() gets you a generator that yields the words in text, split over whitespace, one at a time.
All of it taken together splits up text into words, checks if their simplified (casefolded) version is in the list of forbidden words and replaces them with a mask if so, otherwise, it just leaves the word - and at the end, it combines all of them back together into a sentence.
There are more problems to consider. For example, if you need to preserve the original formatting of a text (perhaps it contains multiple spaces, tabs, etc.) an approach using regular expressions might be better:
import re

def censor(text, forbidden):
    for word in forbidden:
        text = re.sub(rf'(?:(?<=^)|(?<=\s)){word}(?=$|\s)', '*' * len(word), text)
    return text

print(censor('I like fruit salad\n\nbecause it is fruitilicious', ['fruit', 'produce']))
This works mainly because of the regular expression, with these parts:
(?:(?<=^)|(?<=\s)) checks if the match has the start of a line ^ or whitespace \s immediately before it; it's done in two parts because the Python regex engine requires alternatives separated by | in a lookbehind to be of the same length;
{word} is replaced in the expression by each forbidden word, for each iteration, thanks to the f at the start of the string;
(?=$|\s) checks that the word is followed by the end of the line, or whitespace.
re.sub(some_regex, replacement, text) finds all occurrences of the given regular expression, and replaces them with replacement, which in this case is a string of the right number of *.
def censor(text, words):
    for i in words:
        text = text.replace(i, len(i) * '*')
    return text

censor("You bloody idiot", ["bloody", "idiot"])
Output:
'You ****** *****'
If you want to ask for user input on the command line in the function itself
you can use input(). You can also just use input() outside of the function and pass that in.
def censor(text, word):
    if word in text:
        return text.replace(word, "*" * len(word))
    else:
        return text

sentence = input("Enter your sentence or enter to quit. ")
while sentence:
    censored_sentence = censor(sentence, "bad")
    print(censored_sentence)
    sentence = input("Enter your sentence or enter to quit. ")

Substring replacements based on replace and no-replace rules

I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Whole-word replacement only; punctuation should be ignored for matching, but kept after replacement.
I was thinking what would the cleanest way to solve this problem in Python 3.x be?
Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I updated my code and even generalized it for usage in a function.
def replace_whole(sentence, replace_token, replace_with, dont_replace):
    rx = f"[\"'.,:; ]({replace_token})[\"'.,:; ]"
    found = []
    indices = []
    for m in re.finditer(rx, sentence):
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[max(indices[i] - context_size, 0):indices[i] + context_size]
        if dont_replace in context:
            continue
        # First replace the word only in the substring found
        to_replace = found[i].replace(replace_token, replace_with)
        # Then replace the word in the context found, so any special token
        # like quotes or '.' is carried over and the context does not change
        replace_val = context.replace(found[i], to_replace)
        # Finally replace the context found with the replacing context
        sentence = sentence.replace(context, replace_val)
    return sentence
Use regular expressions for finding all occurrences and values of your string (as we need to check whether each hit is a whole word or embedded in some other word), by using finditer(). You might need to adjust the rx to what your definition of "whole word" is. Then get the context around these values of the size of your no_replace rule, and check whether the context contains your no_replace string.
If it doesn't, you may replace it, by using replace() for the word only, then replacing the occurrence of the word in the context, then replacing the context in the whole text. That way the replacing process is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling match_fun whenever a match has been found, and match_fun only performs the replacement if and only if there is no "no-replace phrase" overlapping with the current match. Let me know if you need more clarification or if you believe something can be improved.
replace_dict = ... # The code below assumes you already have this
no_replace_dict = ...# The code below assumes you already have this
text = ... # The text on input.
def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + no_replace + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            if no_replace_match.start() >= match.start() and no_replace_match.start() < match.end():
                return str_match
            if no_replace_match.end() > match.start() and no_replace_match.end() <= match.end():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
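As an illustration only, here is the same idea wrapped in a function, with the placeholder dictionaries filled in using the example values from the question and a slightly simplified overlap check:

```python
import re

def apply_rules(text, replace_dict, no_replace_dict):
    def match_fun(match):
        str_match = match.group()
        # Keep the match unchanged when it starts inside a protected phrase.
        for no_replace in no_replace_dict.get(str_match, []):
            for nr in re.finditer(r'\b' + re.escape(no_replace) + r'\b', text):
                if nr.start() <= match.start() < nr.end():
                    return str_match
        return replace_dict[str_match]

    for token in replace_dict:
        text = re.sub(r'\b' + re.escape(token) + r'\b', match_fun, text)
    return text

replace_dict = {'sentence': 'processed_sentence'}
no_replace_dict = {'sentence': ['example sentence']}

print(apply_rules("This is an example sentence that needs to be processed "
                  "into a new sentence.", replace_dict, no_replace_dict))
# This is an example sentence that needs to be processed into a new processed_sentence.
```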

Coloring words based on text list using python

I have two text files: d.txt, containing paragraph text, and phrase.txt, containing multi-word phrases like "State-of-the-art", "counter productive", "Fleet Dynamism", like some found in the link below:
https://en.wikipedia.org/wiki/List_of_buzzwords
I need to color the font of any phrase in d.txt that is also found in phrase.txt.
Efforts so far:
phrases = open("phrase.txt").readlines()
words = open("d.txt").read()
for phrase in phrases:
    all_words_found = False
    phrase_words = phrase.lower().split(" ")
    for word in phrase_words:
        if word in words:
            all_words_found = True
            break
    if all_words_found:
        print(phrase)
Expected output:
Please Help!
Thanks for the help!
Update: to create html output
To change the ANSI code from the solution below to create HTML output, add a tag around the word during the replace rather than an ANSI code. The example here will use a simple span tag:
words = ["catch phrase", "codeword"]
phrase = "He said a catch phrase. And a codeword was written on a wall."
new_phrase = phrase
for word in words:
    new_phrase = new_phrase.replace(word, f'<span style="color:Red;">{word}</span>')
print(new_phrase)  # Rather than printing, send this wherever you want it.
Solution for inline print
However, to answer your basic question, which is how to replace a set of words in a given paragraph with the same words in a different color: try using .replace() and ANSI color escape codes. This will work if you want to print the words in your Python environment.
Here's a simple example of a way to turn certain words in a line of text red:
words = ["catch phrase", "codeword"]
phrase = "He said a catch phrase. And a codeword was written on a wall."
new_phrase = phrase
for i in words:
    new_phrase = new_phrase.replace(i, f'\033[91m{i}\033[0;0m')
print(new_phrase)
Here is another stackoverflow post which talks about ANSI escape codes and colors in python output: How to print colored text in Python?
ANSI escape codes are a way to change the color of your output - google them to find more options/colors.
In this example here are the codes I used:
First to change the color to red:
\033[91m
After setting the color, you also have to change it back, or the rest of the output will also be that color:
\033[0;0m

Python NLTK not taking out punctuation correctly

I have defined the following code
import string
import nltk

exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList = ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList) - exclude)]
print answer
I have previously printed exclude and the quotation mark " is part of it. I expected answer to be ['the']. However, when I printed answer, it showed up as ['"the']. I'm not entirely sure why it's not taking out the punctuation correctly. Would I need to check each character individually instead?
When you create a set from wordList it stores the string '"the' as the only element,
>>> set(wordList)
set(['"the'])
So using set difference will return the same set,
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you want to just remove punctuation you probably want something like,
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
Here I'm using the translate method of strings, only passing in a second argument specifying which characters to remove.
You can then perform the lemmatization on the new list.
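One caveat: the two-argument str.translate(None, chars) form shown above is Python 2 only. In Python 3 the same removal is written with a translation table from str.maketrans:

```python
import string

word_list = ['"the']
# str.maketrans('', '', chars) builds a table that deletes every char in chars
cleaned = [word.translate(str.maketrans('', '', string.punctuation)) for word in word_list]
print(cleaned)  # ['the']
```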
