Substring replacements based on replace and no-replace rules

I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Replace whole words only; punctuation should be ignored when matching but kept after replacement.
What would be the cleanest way to solve this problem in Python 3.x?

Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I have updated my code and generalized it into a function.
import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # A "whole word" is the token delimited on both sides by a quote, punctuation mark or space
    rx = rf"[\"'.,:; ]({replace_token})[\"'.,:; ]"
    found = []
    indices = []
    for m in re.finditer(rx, sentence):
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    # Start from the unchanged sentence so it is returned as-is when nothing is replaced
    out_sentence = sentence
    for i in range(len(indices)):
        # Clamp the window start at 0 so a match near the beginning does not give a negative slice index
        context = sentence[max(0, indices[i]-context_size):indices[i]+context_size]
        if dont_replace in context:
            continue
        # First replace the word only in the substring found
        to_replace = found[i].replace(replace_token, replace_with)
        # Then replace the word in the context found, so any special token like "" or . gets taken over and the context does not change
        replace_val = context.replace(found[i], to_replace)
        # Finally replace the context found with the replacing context, accumulating earlier replacements
        out_sentence = out_sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions with finditer() to find all occurrences of your string and their positions (we need to check whether each one is a whole word or embedded in another word). You might need to adjust the rx to match your definition of a "whole word". Then take the context around each match, sized to the length of your no_replace rule, and check whether that context contains your no_replace string.
If not, you may replace it: use replace() on the word only, then replace the occurrence of the word in the context, then replace the context in the whole text. That way the replaced span is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'

After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling match_fun whenever a match has been found, and match_fun performs the replacement if and only if no "no-replace phrase" overlaps the current match. Let me know if you need more clarification or if you believe something can be improved.
import re

replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # The code below assumes you already have this; it maps each replaceable word to its no-replace phrases
text = ...  # The text on input.

def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + re.escape(no_replace) + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            # Keep the word unchanged if the no-replace phrase overlaps the current match in any way
            if no_replace_match.start() < match.end() and no_replace_match.end() > match.start():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + re.escape(replace) + r'\b')
    text = pattern.sub(match_fun, text)
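An alternative worth considering is a single pass with one alternation, where the protected phrases are listed before the replaceable words so that they consume their span first. The following is only a sketch under that assumption; the function name replace_with_rules is made up for illustration, and it treats no_replace_set as a plain set of phrases, as in the question.

```python
import re

def replace_with_rules(text, replace_dict, no_replace_set):
    """Replace whole words from replace_dict, leaving protected phrases alone."""
    # Protected phrases go first (longest first), so they are matched before
    # a bare replaceable word can match inside them.
    protected = sorted(no_replace_set, key=len, reverse=True)
    words = sorted(replace_dict, key=len, reverse=True)
    alternatives = [re.escape(s) for s in protected + words]
    if not alternatives:
        return text
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

    def substitute(match):
        found = match.group()
        if found in no_replace_set:
            return found  # protected phrase: emit unchanged
        return replace_dict[found]

    return pattern.sub(substitute, text)

sen1 = "This is an example sentence that needs to be processed into a new sentence."
sen2 = ("This is a second example sentence that shows how 'sentence' in "
        "'sentencepiece' should not be replaced.")
rules = {"sentence": "processed_sentence"}
protected_phrases = {"example sentence"}
print(replace_with_rules(sen1, rules, protected_phrases))
print(replace_with_rules(sen2, rules, protected_phrases))
```

Because the regex engine scans left to right and never revisits consumed text, a word inside a protected phrase is never offered for replacement. Matching stays case-sensitive, and \b keeps 'sentencepiece' untouched.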

Related

I want to split a string by a character on its first occurrence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char") != -1:
        pass  # do xyz with some_char
    elif string.find("another_char") != -1:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or would the above procedure be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.

I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
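As a quick illustration of that behaviour (the sample strings are made up), the split happens once, at whichever listed delimiter appears first in the string:

```python
import re

my_string = "alpha*beta,gamma*delta"  # hypothetical sample input

# Split once, at whichever listed delimiter appears first in the string.
results = re.split(r"[,*;/]", my_string, maxsplit=1)
print(results)  # ['alpha', 'beta,gamma*delta']

# With no delimiter present, the whole string comes back as a single item.
no_delimiter = re.split(r"[,*;/]", "plain", maxsplit=1)
print(no_delimiter)  # ['plain']
```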
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can wrap the compiled version in a function instead of recompiling every time. The outer function below compiles the expression once and returns an inner function that reuses it:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', maxsplit=1, string=my_string)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that with maxsplit=1 this only makes a difference when the first delimiter is immediately followed by another one.
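A small comparison may make the effect of + concrete. This is a sketch with a made-up sample string; re.escape is added here as a precaution so that characters such as ] or ^ in the delimiter list cannot break the character class.

```python
import re

sample = "a,,b;c"  # hypothetical input with two adjacent commas

single = re.compile("[{}]".format(re.escape(",*;/")))
merged = re.compile("[{}]+".format(re.escape(",*;/")))

print(single.split(sample))              # ['a', '', 'b', 'c']
print(merged.split(sample))              # ['a', 'b', 'c']
print(single.split(sample, maxsplit=1))  # ['a', ',b;c']
print(merged.split(sample, maxsplit=1))  # ['a', 'b;c']
```

Without +, adjacent delimiters produce empty strings (or leave a stray delimiter at the start of the remainder when maxsplit=1); with +, a run of delimiters is consumed as one.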
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.

Python Regex Partial Match or "hitEnd"

I'm writing a scanner, so I'm matching an arbitrary string against a list of regex rules. It would be useful if I could emulate the Java "hitEnd" functionality of knowing not just when the regular expression didn't match, but when it can't match; when the regular expression matcher reached the end of the input before deciding it was rejected, indicating that a longer input might satisfy the rule.
For example, maybe I'm matching html tags for starting to bold a sentence of the form "< b >". So I compile my rule
bold_html_rule = re.compile("<b>")
And I run some tests:
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
How can I tell the difference between the "bad" match, for which goat can never be made valid by more input, and the ambiguous match that isn't a match yet, but could be?
Attempts
It is clear that in the above form, there is no way to distinguish, because both the uncertain attempt and the bad attempt return "None". If I wrap all rules in "(RULE)?" then any input will return a match, because at the least the empty string is a substring of all strings. However, when I try and see how far the regex progressed before rejecting my string by using the group method or endPos field, it is always just the length of the string.
Does the Python regex package do a lot of extra work and traverse the whole string even if it's an invalid match on the first character? I can see why it would have to if I used search, which will verify if the sequence is anywhere in the input, but it seems very strange to do so for match.
I've found the question asked before (on non-stackoverflow places) like this one:
https://mail.python.org/pipermail/python-list/2012-April/622358.html
but he doesn't really get a response.
I looked at the regular expression package itself but wasn't able to discern its behavior; could I extend the package to get this result? Is this the wrong way to tackle my task in the first place (I've built effective Java scanners using this strategy in the past)
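On the observation that the end position always equals the length of the string: assuming the "endPos field" above refers to the match object's endpos attribute, that value is the search limit passed to match() (by default the length of the input), not the point the matcher reached. The end of the match itself is given by end(). A short check:

```python
import re

optional_rule = re.compile("(<b>)?")

m = optional_rule.match("goat")
print(m.endpos)    # 4: the search limit, which defaults to len("goat")
print(m.end())     # 0: the match itself is empty
print(m.group(1))  # None: the optional group did not participate

# end() still cannot tell "<" apart from "goat": both give an empty match.
partial_end = optional_rule.match("<").end()
print(partial_end)  # 0
```

So wrapping a rule in "(RULE)?" does not expose how far the matcher got before giving up; it only reports an empty match in both cases.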

Try this out. It does feel like a hack, but it does achieve the result you are looking for. I am a bit concerned about the PrepareCompileString function, though: it should be able to handle all escaped characters, but it cannot handle any wildcards.
import re

#Grouping every single character
def PrepareCompileString(regexString):
    newstring = ''
    escapeFlag = False
    for char in regexString:
        if escapeFlag:
            char = escapeString+char
            escapeFlag = False
            escapeString = ''
        if char == '\\':
            escapeFlag = True
            escapeString = char
        if not escapeFlag:
            newstring += '({})?'.format(char)
    return newstring

def CheckMatch(match):
    # counting the number of non matched groups
    count = match.groups().count(None)
    # If all groups matched - good match
    # all groups did not match - bad match
    # few groups matched - uncertain match
    if count == 0:
        print('Good Match:', match.string)
    elif count < len(match.groups()):
        print('Uncertain Match:', match.string)
    elif count == len(match.groups()):
        print('Bad Match:', match.string)

regexString = '<b>'
bold_html_rule = re.compile(PrepareCompileString(regexString))
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
for match in [good_match, uncertain_match, bad_match]:
    CheckMatch(match)
I got this result:
Good Match: <b>
Uncertain Match: <
Bad Match: goat
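One caveat with independent optional groups such as (<)?(b)?(>)? is that an input like "b>" or "<x" is also reported as uncertain, even though no further input can turn it into "<b>". For purely literal rules, a possible refinement is to nest the optional groups so that only true prefixes match. This is a sketch limited to literal rules; prefix_pattern and classify are illustrative names, not part of the re module.

```python
import re

def prefix_pattern(literal):
    """Compile a regex that matches exactly the prefixes of `literal`."""
    nested = ""
    # Build from the inside out: '<b>' becomes '(?:<(?:b(?:>)?)?)?'
    for char in reversed(literal):
        nested = "(?:" + re.escape(char) + nested + ")?"
    return re.compile(nested)

def classify(literal, text):
    """Return 'good', 'uncertain' or 'bad' for `text` against a literal rule."""
    if text == literal:
        return "good"
    if prefix_pattern(literal).fullmatch(text):
        return "uncertain"  # a proper prefix: more input could complete it
    return "bad"

for candidate in ["<b>", "<", "goat", "b>", "<x"]:
    print(candidate, classify("<b>", candidate))
```

For literals, literal.startswith(text) gives the same answer; the nested pattern is shown only because it mirrors the grouping idea above. For rules with wildcards, the third-party regex package on PyPI documents a partial-matching option that may be worth checking.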

Python NLTK not taking out punctuations correctly

I have defined the following code
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList= ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer
I have previously printed exclude and the quotation mark " is part of it. I expected answer to be [the]. However, when I printed answer, it shows up as ['"the']. I'm not entirely sure why it's not taking out the punctuation correctly. Would I need to check each character individually instead?

When you create a set from wordList it stores the string '"the' as the only element,
>>> set(wordList)
set(['"the'])
So using set difference will return the same set,
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you want to just remove punctuation you probably want something like,
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
Here I'm using the translate method of Python 2 strings, passing None as the translation table and a second argument specifying which characters to remove.
You can then perform the lemmatization on the new list.
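In Python 3, str.translate takes only a table, so the call above needs a small change. A minimal sketch, with a made-up word list, builds a deletion table with str.maketrans:

```python
import string

word_list = ['"the', "end."]  # hypothetical sample words

# The third argument of str.maketrans lists characters to delete.
remove_punctuation = str.maketrans("", "", string.punctuation)
cleaned = [word.translate(remove_punctuation) for word in word_list]
print(cleaned)  # ['the', 'end']
```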

python: dictionary of words and wordforms

I have the following problem: I created a (German) dictionary of words and their corresponding lemmas. Example:
"Lagerbestände", "Lager-bestand"; "Wohnhäuser", "Wohn-haus"; "Bahnhof", "Bahn-hof"
I now have a text and want to look up the lemma of every word in it. A word may appear that is not in the dict, such as "Restbestände", even though the lemma of "bestände" is already known. So I want to take the first part of the word, which is unknown in dicti, append the lemmatized second part to it, and print (or return) the result.
Example: "Restbestände" --> "Rest-bestand". ("bestand" is taken from the lemma of "Lagerbestände")
I coded the following:
for limit in range(1, len(Word)):
    for k, v in dicti.iteritems():
        if re.search('[\w]*'+Word[limit:], k, re.IGNORECASE) != None:
            if '-' in v:
                tmp = v.find('-')
                end = v[tmp:]
                end = re.sub(ur'[-]',"", end)
                Word = Word[:limit] + '-' + end
But I have two problems:
At the end of each word, "&#10" is printed every time. How can I avoid this?
The second part of the word is sometimes incorrect, so there must be a logic error.
How would you solve this?

At the end of the words, it is printed out every time "&#10". How can I avoid this?
You must use Unicode everywhere in your script. Everywhere, everywhere, everywhere.
Also, Python's regex functions accept the re.UNICODE flag, which you should always set in Python 2 (in Python 3, str patterns are already Unicode-aware by default). German letters such as ä, ö, ü and ß lie outside the ASCII set, so without the flag a regex can misbehave, for instance when matching r'\w'.
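For the lookup itself, a possible approach is sketched below in Python 3. It assumes that each known word starts with the first part of its lemma (as in "Lagerbestände" / "Lager-bestand"), and that the "&#10" comes from a trailing line feed (character 10) left on words read from a file, which strip() removes. The function name guess_lemma is made up for illustration.

```python
lemmas = {
    "Lagerbestände": "Lager-bestand",
    "Wohnhäuser": "Wohn-haus",
    "Bahnhof": "Bahn-hof",
}

def guess_lemma(word, lemmas):
    """Return the known lemma, or build one from a known second part."""
    word = word.strip()  # drop a trailing newline left over from file input
    if word in lemmas:
        return lemmas[word]
    # Try the longest candidate tail first, so the first hit is the best one.
    for limit in range(1, len(word)):
        tail = word[limit:].lower()
        for known, lemma in lemmas.items():
            head, sep, lemma_tail = lemma.partition("-")
            # Assumption: the known word starts with the lemma's first part,
            # so whatever follows it is the inflected second part.
            if sep and known[len(head):].lower() == tail:
                return word[:limit] + "-" + lemma_tail
    return None  # nothing in the dictionary shares a second part

print(guess_lemma("Restbestände\n", lemmas))  # Rest-bestand
print(guess_lemma("Friedhof", lemmas))        # Fried-hof
```

Comparing the tail for equality, rather than searching with a regex built from the word, avoids accidental hits in the middle of a dictionary key and sidesteps regex special characters in the input.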

More accurate alternative to findline?

I have a list (words.txt) for which I need a method to search that is more exact than findline.
My current function (shown at the bottom) uses findline to search through the list. The problem is this: instead of returning an exact match, findline returns the first string that contains the whole word, regardless of whether there are better matches following it.
Example:
I enter 'BEES' and findline returns 'BAUBEES' because it is the first string to contain the sub-string ('BEES'). Of course, this completely ruins the function.
What I need is a function or (preferably) a built-in method that looks alphabetically for an exact match. So if 'BEES' is in the list (which I assure you it is), I want it to return 'BEES'. Or alternately, if 'BAUBEES' and 'BEESWAX' were the only substring matches in the list, the ideal function would return 'BEESWAX' if only because the second letter in 'BEES' is 'E' NOT 'A' (as in 'BAUBEES').
def iswholeword(word):
    openfile = open('/media/Gianson/Python Programs/words.txt','r')
    linz = openfile.readlines()[:]
    openfile.close()
    hit = findline(word,linz)[:]
    print 'hit', hit
    if len(hit)-1 == len(word):
        return True
    else:
        return False

import re

r = re.compile(r"\b%s" % re.escape(word))
for line in openfile:
    hit = r.search(line)
    if hit:
        pass  # whatever
Explanation: this builds a regular expression from \b (word boundary) and the word under consideration, then searches for it in each line of the file. It finds the first word in the line that starts with word and returns a regexp match object. Append another \b to the pattern if only exact whole-word matches should count.
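If the word list is sorted, the "exact match first, otherwise the alphabetically nearest word starting with the query" behaviour described in the question can also be sketched with the standard bisect module. The function name best_match and the in-memory sample lists are made up for illustration; in practice the list would come from words.txt with each line stripped.

```python
import bisect

def best_match(word, sorted_words):
    """Return the exact word if present, else the first word starting with it."""
    i = bisect.bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i].startswith(word):
        return sorted_words[i]
    return None

words = sorted(line.strip() for line in ["BAUBEES\n", "BEESWAX\n", "BEES\n"])
print(best_match("BEES", words))                   # BEES
print(best_match("BEES", ["BAUBEES", "BEESWAX"]))  # BEESWAX
print(best_match("BEES", ["BAUBEES"]))             # None
```

For a plain yes/no check whether a word is in the list, loading the stripped lines into a set and testing membership with `in` is simpler still.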
