Combinatorial product of regex substitutions


I am trying to produce string variants by applying substitutions optionally.
For example, one substitution scheme is removing any sequence of blank characters.
Rather than replacing all occurrences like
>>> re.sub(r'\s+', '', 'a b c')
'abc'
I need two variants to be produced for each occurrence instead: one in which the substitution is performed and one in which it is not.
For the string 'a b c'
I want to have the variants
['a b c', 'a bc', 'ab c', 'abc']
i.e. the cross product of all binary decisions (the result obviously includes the original string).
For this case, the variants can be produced using re.finditer and itertools.product:
import itertools
import re
def vary(target, pattern, subst):
    # Spans of all non-overlapping matches in the target string
    occurrences = [m.span() for m in pattern.finditer(target)]
    # One True/False decision per occurrence
    for path in itertools.product((True, False), repeat=len(occurrences)):
        variant = ''
        anchor = 0
        for (start, end), apply_this in zip(occurrences, path):
            if apply_this:
                variant += target[anchor:start] + subst
                anchor = end
        variant += target[anchor:]
        yield variant
This produces the desired output for the above example:
>>> list(vary('a b c', re.compile(r'\s+'), ''))
['abc', 'ab c', 'a bc', 'a b c']
However, this solution only works for fixed-string replacements.
Advanced features from re.sub like group references can't be done like that,
as in the following example for inserting a space after a sequence of digits inside a word:
re.sub(r'\B(\d+)\B', r'\1 ', 'abc123def')
How can the approach be extended or changed to accept any valid argument to re.sub
(without writing a parser for interpreting group references)?

Thinking about making subst a callable that gets access to match data finally made me learn about MatchObject.expand. So, as an approximation, with subst staying an r string,
def vary(target, pattern, subst):
    matches = [m for m in pattern.finditer(target)]
    occurrences = [m.span() for m in matches]
    for path in itertools.product((True, False), repeat=len(occurrences)):
        variant = ''
        anchor = 0
        for match, (start, end), apply_this in zip(matches, occurrences, path):
            if apply_this:
                variant += target[anchor:start] + match.expand(subst)
                anchor = end
        variant += target[anchor:]
        yield variant
I am not sure, though, that this covers all the needed flexibility in referring to the subject string, being bound to the corresponding match. An indexed power set of the split string came to mind, but I guess that's not far from the parser mentioned.
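One way to get closer to "any valid argument to re.sub" is to mirror re.sub's own rule: if the replacement is callable, call it with the match; otherwise expand it as a template. The following is only a sketch under that assumption. The name vary_any is made up for illustration and does not appear in the original posts.

```python
import itertools
import re

def vary_any(target, pattern, repl):
    """Yield all variants; repl is a template string or a callable, as in re.sub."""
    matches = list(pattern.finditer(target))
    for path in itertools.product((True, False), repeat=len(matches)):
        pieces = []
        anchor = 0
        for match, apply_this in zip(matches, path):
            if apply_this:
                # Same dispatch as re.sub: call a function, expand a template
                replacement = repl(match) if callable(repl) else match.expand(repl)
                pieces.append(target[anchor:match.start()] + replacement)
                anchor = match.end()
        pieces.append(target[anchor:])
        yield ''.join(pieces)

variants = list(vary_any('abc123def', re.compile(r'\B(\d+)\B'), r'\1 '))
print(variants)  # ['abc123 def', 'abc123def']
```

Because every replacement is computed from a match object found in the original string, lookarounds and \B keep seeing the original context.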

How about this:
def vary(target, pattern, subst):
    numOccurences = len(pattern.findall(target))
    for path in itertools.product((True, False), repeat=numOccurences):
        variant = ''
        remainingStr = target
        for currentFlag in path:
            if currentFlag:
                remainingStr = pattern.sub(subst, remainingStr, 1)
            else:
                currentMatch = pattern.search(remainingStr)
                variant += remainingStr[:currentMatch.end()]
                remainingStr = remainingStr[currentMatch.end():]
        variant += remainingStr
        yield variant
For each match, we either let re.sub() do its job (with a count of 1 to stop after one substitution), or we snatch away the unchanged portion of the string.
Trying it out with your examples like this
target = 'a b c'
pattern = re.compile(r'\s+')
subst = ''
print(list(vary(target, pattern, subst)))
target = 'abc123def'
pattern = re.compile(r'\B(\d+)\B')
subst = r'\1 '
print(list(vary(target, pattern, subst)))
I get
['abc', 'ab c', 'a bc', 'a b c']
['abc123 def', 'abc123def']

Related

Regex with customized word boundaries in Python

I'm using a function called findlist to return a list of all the positions of a certain string within a text, with regex to look for word boundaries. But I want to ignore the character ( and only consider the other word boundaries, so that it will find split in var split but not in split(a). Is there any way to do this?
import re
def findlist(input, place):
    return [m.span() for m in re.finditer(input, place)]
str = '''
var a = 'a b c'
var split = a.split(' ')
'''
instances = findlist(r"\b%s\b" % ('split'), str)
print(instances)

You may check if there is a ( after the trailing word boundary with a negative lookahead (?!\():
instances = findlist(r"\b{}\b(?!\()".format('split'), s)
                             ^^^^^^
The (?!\() will trigger after the whole word is found, and if there is a ( immediately to the right of the found word, the match will be failed.
See the Python demo:
import re
def findlist(input_data, place):
    return [m.span() for m in re.finditer(input_data, place)]
s = '''
var a = 'a b c'
var split = a.split(' ')
'''
instances = findlist(r"\b{}\b(?!\()".format('split'), s)
print(instances) # => [(21, 26)]

how to remove whitespace inside bracket?

I have the following string:
res = '(321, 3)-(m-5, 5) -(31,1)'
I want to remove the whitespace within the brackets, but I don't have any knowledge of regular expressions.
I've tried this, but it doesn't work:
import re
res = re.sub(r'\(.*\s+\)', '', res)

You can substitute a non-greedy wildcard match for characters in parentheses with a function that splits the match on whitespace and rejoins it.
>>> import re
>>> res = '(321, 3)-(m-5, 5) -(31,1)'
>>> re.sub(r'\(.*?\)', lambda x: ''.join(x.group(0).split()), res)
'(321,3)-(m-5,5) -(31,1)'

You could convert the string into a list, go through each letter and count if you are within brackets or not. In toRemove, you collect the positions of whitespaces, which you then remove from the list. Then you convert the list back to a string ...
res = '(321, 3)-(m-5, 5) -(31,1)'
r = list(res)
insideBracket = 0
toRemove = []
for pos, letter in enumerate(r):
    if letter == '(':
        insideBracket += 1
    elif letter == ')':
        insideBracket -= 1
    if insideBracket > 0:
        if letter == ' ':
            toRemove.append(pos)
for t in toRemove[::-1]:
    r.pop(t)
result = ''.join(r)
print(result)

I think regular expressions aren't quite powerful enough to do what you want here; you want to remove all whitespace that's found in between parenthesis characters. The trouble is, solving this for the general case means you're doing a context-sensitive match on the string, and regular expressions are mostly context-insensitive, and so can't do your job. There are lookaheads and lookbehinds that can restrict matches to particular contexts, but they won't solve your problem in the general case either:
The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Group references are not supported even if they match strings of some fixed length.
Because of this, I would match the parenthesis groups first:
>>> re.split(r'(\([^)]*\))', res)
['', '(321, 3)', '-', '(m-5, 5)', ' -', '(31,1)', '']
and then remove whitespace from them in a second step before joining everything back up into a single string:
>>> g = re.split(r'(\([^)]*\))', res)
>>> g[1::2] = [re.sub(r'\s*', '', x) for x in g[1::2]]
>>> ''.join(g)
'(321,3)-(m-5,5) -(31,1)'

Python - Split by comma skipping the content inside parentheses

I need to split a string by commas, but I have a problem with this case:
TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME)), SECOND , THIRD
I would like to split and get:
var[0] = "TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME))"
var[1] = "SECOND"
var[2] = "THIRD"
Thank you

Here's a very simple parser approach that works for your example:
def top_level_split(s):
    """
    Split `s` by top-level commas only. Commas within parentheses are ignored.
    """
    # Parse the string tracking whether the current character is within
    # parentheses.
    balance = 0
    parts = []
    part = ''
    for c in s:
        part += c
        if c == '(':
            balance += 1
        elif c == ')':
            balance -= 1
        elif c == ',' and balance == 0:
            parts.append(part[:-1].strip())
            part = ''
    # Capture last part
    if len(part):
        parts.append(part.strip())
    return parts
my_list = top_level_split("TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME)), SECOND , THIRD")
print(my_list)

You can use this negative lookahead based regex:
,(?!(?:[^(]*\([^)]*\))*[^()]*\))
This regex matches a comma only if it is not inside parentheses. The negative lookahead first skips over any complete ( ... ) pairs to the right of the comma and then looks for a lone ); if it finds one, the comma sits inside parentheses and the match fails. This assumes parentheses are balanced and unescaped.
Code:
>>> s = 'TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME)), SECOND , THIRD'
>>> print(re.split(r',(?!(?:[^(]*\([^)]*\))*[^()]*\))', s))
['TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME))', ' SECOND ', ' THIRD']
Or:
>>> s = 'TEXT EXAMPLE (THIS, IS (A EXAMPLE, BUT NOT WORKS, FOR ME)), SECOND , THIRD'
>>> print(re.split(r',(?!(?:[^(]*\([^)]*\))*[^()]*\))', s))
['TEXT EXAMPLE (THIS, IS (A EXAMPLE, BUT NOT WORKS, FOR ME))', ' SECOND ', ' THIRD']

Thanks to jonrsharpe :
import re
text = "TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME)), SECOND , THIRD"
# Split on commas that have no closing parenthesis anywhere to their right
array = re.split(r',(?!.*\))', text)
for item in array:
    # Print and remove the surrounding spaces
    print(item.strip(" "))
Result:
TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME))
SECOND
THIRD
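Note that the pattern ,(?!.*\)) only splits on commas that have no closing parenthesis anywhere after them, so it relies on the parenthesized part coming first, as it does in the question's example. The sketch below compares it with a plain depth counter on a made-up input where the parentheses sit in the middle. The sample string and the name split_top_level are illustrative assumptions, not taken from the posts.

```python
import re

def split_top_level(text):
    """Split on commas that are outside any parentheses (depth counting)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ',' and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts

sample = "FIRST, SECOND (A, B), THIRD"
by_lookahead = [p.strip() for p in re.split(r',(?!.*\))', sample)]
by_depth = split_top_level(sample)
print(by_lookahead)  # ['FIRST, SECOND (A, B)', 'THIRD']
print(by_depth)      # ['FIRST', 'SECOND (A, B)', 'THIRD']
```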

You can just use rsplit:
l1 = "TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME)), SECOND , THIRD".rsplit(",", 2)
for line in l1:
    print(line)
TEXT EXAMPLE (THIS IS (A EXAMPLE, BUT NOT WORKS, FOR ME))
SECOND
THIRD

How to find and replace nth occurrence of word in a sentence using python regular expression?

Using python regular expression only, how to find and replace nth occurrence of word in a sentence?
For example:
str = 'cat goose mouse horse pig cat cow'
new_str = re.sub(r'cat', r'Bull', str)
new_str = re.sub(r'cat', r'Bull', str, 1)
new_str = re.sub(r'cat', r'Bull', str, 2)
I have the sentence above, where the word 'cat' appears two times. I want the 2nd occurrence of 'cat' to be changed to 'Bull', leaving the 1st 'cat' untouched. My final sentence would look like:
"cat goose mouse horse pig Bull cow". In my code above I tried 3 different calls but could not get what I wanted.

Use negative lookahead like below.
>>> s = "cat goose mouse horse pig cat cow"
>>> re.sub(r'^((?:(?!cat).)*cat(?:(?!cat).)*)cat', r'\1Bull', s)
'cat goose mouse horse pig Bull cow'
^ asserts that we are at the start of the string.
(?:(?!cat).)* matches any character, zero or more times, as long as the substring cat does not start at that position.
cat matches the first cat substring.
(?:(?!cat).)* again matches any run of characters that does not contain cat.
All of this is enclosed in a capturing group, ((?:(?!cat).)*cat(?:(?!cat).)*), so that the captured characters can be referred to later as \1.
cat then matches the second cat string.
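To see what each piece of that pattern consumes, here is a small sketch. The helper names tempered and prefix, and the second sample string, are made up for illustration.

```python
import re

s = "cat goose mouse horse pig cat cow"

# (?:(?!cat).)* consumes one character at a time, but only while
# the substring "cat" does not start at the current position.
tempered = re.compile(r'(?:(?!cat).)*')
prefix = tempered.match("goose mouse cat cow").group(0)
print(repr(prefix))  # 'goose mouse '

# Group 1 captures everything up to (not including) the second "cat".
m = re.match(r'^((?:(?!cat).)*cat(?:(?!cat).)*)cat', s)
print(repr(m.group(1)))  # 'cat goose mouse horse pig '
rebuilt = m.group(1) + 'Bull' + s[m.end():]
print(rebuilt)  # cat goose mouse horse pig Bull cow
```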
OR
>>> s = "cat goose mouse horse pig cat cow"
>>> re.sub(r'^(.*?(cat.*?){1})cat', r'\1Bull', s)
'cat goose mouse horse pig Bull cow'
Change the number inside the {} to choose which occurrence of cat is replaced: {n-1} targets the nth occurrence.
For example, to replace the third occurrence of cat, put 2 inside the curly braces:
>>> re.sub(r'^(.*?(cat.*?){2})cat', r'\1Bull', "cat goose mouse horse pig cat foo cat cow")
'cat goose mouse horse pig cat foo Bull cow'

I use a simple function which lists all occurrences, picks the position of the nth one and uses it to split the original string into two substrings. It then replaces the first occurrence in the second substring and joins the substrings back into a new string:
import re
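def replacenth(string, sub, wanted, n):
    # Start index of the nth match (n is 1-based, list indices are 0-based).
    # sub is used as a regex here and as a literal below, so it should not
    # contain regex metacharacters.
    where = [m.start() for m in re.finditer(sub, string)][n-1]
    before = string[:where]
    after = string[where:]
    # Replace only the first occurrence in the tail, then rejoin
    newString = before + after.replace(sub, wanted, 1)
    print(newString)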
For these variables:
string = 'ababababababababab'
sub = 'ab'
wanted = 'CD'
n = 5
outputs:
ababababCDabababab
Notes:
The where expression builds a list of the matches' positions and picks the nth one. List indices start at 0, not 1, hence the n-1 index: n is the ordinal of the occurrence, so n = 5 in my example replaces the 5th one. If you indexed the list with n directly, you would have to pass 4 to reach the 5th occurrence. Which convention you use usually depends on whatever generates n.
This should be the simplest way, but it isn't regex only as you originally wanted.
Sources and some links in addition:
where construction: How to find all occurrences of a substring?
string splitting: https://www.daniweb.com/programming/software-development/threads/452362/replace-nth-occurrence-of-any-sub-string-in-a-string
similar question: Find the nth occurrence of substring in a string

Here's a way to do it without a regex:
def replaceNth(s, source, target, n):
    inds = [i for i in range(len(s) - len(source)+1) if s[i:i+len(source)]==source]
    if len(inds) < n:
        return # or maybe raise an error
    s = list(s) # can't assign to string slices. So, let's listify
    s[inds[n-1]:inds[n-1]+len(source)] = target # do n-1 because we start from the first occurrence of the string, not the 0-th
    return ''.join(s)
Usage:
In [278]: s
Out[278]: 'cat goose mouse horse pig cat cow'
In [279]: replaceNth(s, 'cat', 'Bull', 2)
Out[279]: 'cat goose mouse horse pig Bull cow'
In [280]: print(replaceNth(s, 'cat', 'Bull', 3))
None

I would define a function that will work for every regex:
import re
def replace_ith_instance(string, pattern, new_str, i=None, pattern_flags=0):
    # If i is None, replace the last occurrence
    matches = list(re.finditer(pattern, string, flags=pattern_flags))
    if i is None:
        i = len(matches)
    if len(matches) == 0 or len(matches) < i:
        return string
    match = matches[i - 1]
    match_start_index = match.start()
    match_len = len(match.group())
    return '{0}{1}{2}'.format(string[0:match_start_index], new_str, string[match_start_index + match_len:])
A working example:
str = 'cat goose mouse horse pig cat cow'
ns = replace_ith_instance(str, 'cat', 'Bull', 2)
print(ns)
The output:
cat goose mouse horse pig Bull cow
Another example:
str2 = 'abc abc def abc abc'
ns = replace_ith_instance(str2, r'abc\s*abc', '666')
print(ns)
The output:
abc abc def 666

How to replace the nth needle with word:
s.replace(needle,'$$$',n-1).replace(needle,word,1).replace('$$$',needle)
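The one-liner hides the first n-1 occurrences behind a placeholder, replaces the next occurrence, and then restores the hidden ones. It silently misbehaves if '$$$' already occurs in the text, so the sketch below wraps it in a function with a collision check. The function name and the default NUL placeholder are assumptions made for illustration.

```python
def replace_nth_plain(s, needle, word, n, placeholder='\x00'):
    """Replace the n-th (1-based) occurrence of needle in s with word."""
    # The trick breaks if the placeholder already appears in the inputs.
    if placeholder in s or placeholder in word or placeholder in needle:
        raise ValueError('placeholder collides with the input')
    hidden = s.replace(needle, placeholder, n - 1)  # hide the first n-1
    swapped = hidden.replace(needle, word, 1)       # replace the n-th
    return swapped.replace(placeholder, needle)     # restore the hidden ones

result = replace_nth_plain('cat goose mouse horse pig cat cow', 'cat', 'Bull', 2)
print(result)  # cat goose mouse horse pig Bull cow
```

If there are fewer than n occurrences, nothing is replaced and the original string comes back unchanged.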

You can match the two occurrences of "cat", keep everything before the second occurrence (\1) and add "Bull":
new_str = re.sub(r'(cat.*?)cat', r'\1Bull', str, 1)
We do only one substitution to avoid also replacing the fourth, sixth, etc. occurrence of "cat" (when there are at least four occurrences), as pointed out in Avinash Raj's comment.
If you want to replace the n-th occurrence and not the second, use:
n = 2
new_str = re.sub('((?:cat.*?){%d})cat' % (n - 1), r'\1Bull', str, 1)
The outer group is needed because a repeated capturing group keeps only its last repetition, so \1 would otherwise lose everything before it when n > 2.
BTW, you should not use str as a variable name, since it shadows Python's built-in str type.
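When generalizing this to the n-th occurrence, keep in mind that a quantified capturing group only remembers its last repetition. The sketch below, on a made-up sample string, contrasts quantifying the capturing group directly with wrapping a non-capturing repetition in an outer group.

```python
import re

text = 'cat a cat b cat c'
n = 3

# Quantifying the capturing group: \1 holds only the last repetition,
# so the text matched by earlier repetitions is dropped from the result.
lossy = re.sub('(cat.*?){%d}cat' % (n - 1), r'\1Bull', text, count=1)
print(lossy)  # cat b Bull c

# Repeating a non-capturing group inside one outer capturing group
# keeps the whole prefix in \1.
fixed = re.sub('((?:cat.*?){%d})cat' % (n - 1), r'\1Bull', text, count=1)
print(fixed)  # cat a cat b Bull c
```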

Create a repl function to pass into re.sub(). Except... the trick is to make it a class so you can track the call count.
class ReplWrapper(object):
    def __init__(self, replacement, occurrence):
        self.count = 0
        self.replacement = replacement
        self.occurrence = occurrence
    def repl(self, match):
        self.count += 1
        if self.occurrence == 0 or self.occurrence == self.count:
            return match.expand(self.replacement)
        else:
            # Leave every other match untouched
            return match.group(0)
Then use it like this:
myrepl = ReplWrapper(r'Bull', 0) # replaces all instances in a string
new_str = re.sub(r'cat', myrepl.repl, str)
myrepl = ReplWrapper(r'Bull', 1) # replaces 1st instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)
myrepl = ReplWrapper(r'Bull', 2) # replaces 2nd instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)
I'm sure there is a more clever way to avoid using a class, but this seemed straightforward enough to explain. Also, be sure to return match.expand(): just returning the replacement value is not technically correct if someone decides to use \1-type templates.
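One possible class-free variant is a closure around a counter. This is a sketch only: nth_repl is a hypothetical name, and unlike the class above it does not treat 0 as "replace all".

```python
import itertools
import re

def nth_repl(replacement, occurrence):
    """Build a repl function for re.sub that changes only one occurrence."""
    counter = itertools.count(1)  # counts the calls made by re.sub
    def repl(match):
        if next(counter) == occurrence:
            return match.expand(replacement)
        return match.group(0)  # leave the other matches untouched
    return repl

new_str = re.sub(r'cat', nth_repl('Bull', 2), 'cat goose mouse horse pig cat cow')
print(new_str)  # cat goose mouse horse pig Bull cow
```

As with the class, a fresh repl function has to be built for every re.sub call, because the counter is not reset.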

I approached this by generating a 'grouped' version of the desired catch pattern relative to the entire string, then applying the sub directly to that instance.
The parent function is regex_n_sub, which takes the same inputs as the re.sub() method plus the instance number n.
The catch pattern is passed to get_nsubcatch_catch_pattern() with the instance number. Inside, a list comprehension generates n copies of the pattern '.*?' (match any character, 0 or more repetitions, non-greedy). This pattern represents the text between the occurrences of the catch_pattern that precede the nth one.
Next, the input catch_pattern is placed between each pair of these 'space patterns' and the result is wrapped in parentheses to form the first group.
The second group is just the catch_pattern wrapped in parentheses, so when the two groups are combined, the result is a pattern for 'all of the text up to the nth occurrence of the catch pattern', followed by that occurrence. This new_catch_pattern has two groups built in, so the second group, containing the nth occurrence of the catch_pattern, can be substituted.
The replace pattern is passed to get_nsubcatch_replace_pattern() and combined with the prefix r'\g<1>', forming the pattern \g<1> + replace_pattern. The \g<1> part re-inserts the text captured by group 1, and the replace pattern that follows takes the place of the nth occurrence.
The code below is verbose only for a clearer understanding of the process flow; it can be reduced as desired.
--
The example below should run stand-alone, and corrects the 4th instance of "I" to "me":
"When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
with
"When I go to the park and I am alone I think the ducks laugh at me but I'm not sure."
import regex as re
def regex_n_sub(catch_pattern, replace_pattern, input_string, n, flags=0):
    new_catch_pattern, new_replace_pattern = generate_n_sub_patterns(catch_pattern, replace_pattern, n)
    return_string = re.sub(new_catch_pattern, new_replace_pattern, input_string, 1, flags)
    return return_string
def generate_n_sub_patterns(catch_pattern, replace_pattern, n):
    new_catch_pattern = get_nsubcatch_catch_pattern(catch_pattern, n)
    new_replace_pattern = get_nsubcatch_replace_pattern(replace_pattern, n)
    return new_catch_pattern, new_replace_pattern
def get_nsubcatch_catch_pattern(catch_pattern, n):
    space_string = '.*?'
    space_list = [space_string for i in range(n)]
    first_group = catch_pattern.join(space_list)
    first_group = first_group.join('()')
    second_group = catch_pattern.join('()')
    new_catch_pattern = first_group + second_group
    return new_catch_pattern
def get_nsubcatch_replace_pattern(replace_pattern, n):
    new_replace_pattern = r'\g<1>' + replace_pattern
    return new_replace_pattern
### use test ###
catch_pattern = 'I'
replace_pattern = 'me'
test_string = "When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
regex_n_sub(catch_pattern, replace_pattern, test_string, 4)
This code can be copied directly into a workflow; the call to regex_n_sub() returns the string with the replacement applied.
Please let me know if implementation fails!
Thanks!

Just because none of the current answers fitted what I needed: based on aleskva's one:
import re
def replacenth(string, pattern, replacement, n):
    assert n != 0
    matches = list(re.finditer(pattern, string))
    if len(matches) < abs(n):
        return string
    m = matches[n-1 if n > 0 else len(matches) + n]
    return string[0:m.start()] + replacement + string[m.end():]
It accepts negative match numbers (n = -1 replaces the last match) and any regex pattern, and it's efficient. If there are fewer than n matches, the original string is returned.
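The function above inserts the replacement literally. If group references such as \1 should work in the replacement, as they do in re.sub, the match object's expand() method can be used instead. A sketch under that assumption follows; the name replace_nth_expand is hypothetical, and unlike the original it returns the string unchanged for n = 0 rather than asserting.

```python
import re

def replace_nth_expand(string, pattern, template, n):
    """Replace the n-th match (negative n counts from the end), expanding
    group references in template."""
    matches = list(re.finditer(pattern, string))
    if n == 0 or len(matches) < abs(n):
        return string
    m = matches[n - 1 if n > 0 else n]
    return string[:m.start()] + m.expand(template) + string[m.end():]

last = replace_nth_expand('a1 b22 c333', r'(\d+)', r'<\1>', -1)
print(last)  # a1 b22 c<333>
```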

eliminating multiple occurrences of whitespace in a string in python

If I have a string
"this is a       string"
How can I shorten it so that I only have one space between the words rather than multiple? (The number of white spaces is random)
"this is a string"

You could use string.split and " ".join(list) to make this happen in a reasonably pythonic way - there are probably more efficient algorithms but they won't look as nice.
Incidentally, this is a lot faster than using a regex, at least on the sample string:
import re
import timeit
s = "this is a       string"
def do_regex():
    for x in range(100000):
        a = re.sub(r'\s+', ' ', s)
def do_join():
    for x in range(100000):
        a = " ".join(s.split())
if __name__ == '__main__':
    t1 = timeit.Timer(do_regex).timeit(number=5)
    print("Regex: ", t1)
    t2 = timeit.Timer(do_join).timeit(number=5)
    print("Join: ", t2)
$ python revsjoin.py
Regex: 2.70868492126
Join: 0.333452224731
Compiling this regex does improve performance, but only if you do call sub on the compiled regex, instead of passing the compiled form into re.sub as an argument:
def do_regex_compile():
    pattern = re.compile(r'\s+')
    for x in range(100000):
        # Don't do this
        # a = re.sub(pattern, ' ', s)
        a = pattern.sub(' ', s)
$ python revsjoin.py
Regex: 2.72924399376
Compiled Regex: 1.5852200985
Join: 0.33763718605

re.sub(r'\s+', ' ', 'this is a string')
You can pre-compile and store this for potentially better performance:
MULT_SPACES = re.compile(r'\s+')
MULT_SPACES.sub(' ', 'this is a string')

Pretty much the same answer as Ben Gartner's, but this adds the "if this is not an empty string" check.
>>> a = 'this is a       string'
>>> ' '.join([k for k in a.split(" ") if k])
'this is a string'
>>>
If you don't check for empty strings, you'll get this:
>>> ' '.join([k for k in a.split(" ")])
'this is a       string'
>>>

Try this:
s = "this is a string"
tokens = s.split()
neat_s = " ".join(tokens)
The string's split function will return a list of non empty tokens split by whitespace. So if you try
"this is a string".split()
you will get back
['this', 'is', 'a', 'string']
The string's join function will join a list of tokens together using the string itself as a delimiter. In this case we want a space, so
" ".join("this is a string".split())
This splits on runs of whitespace, discards the empty tokens, and joins the rest with single spaces. For more about string operations, check out Python's common string function documentation.
EDIT: I misunderstood what happens when you pass a delimiter to the split function. See markuz's answer for this.
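The difference between split() and split(" "), and between the join approach and the regex approach, shows up once the string has repeated, leading or trailing whitespace. A small sketch with a made-up sample string:

```python
import re

s = "  this is a       string \n"

# split() with no argument splits on runs of any whitespace and
# drops empty tokens, so leading/trailing whitespace disappears.
joined = " ".join(s.split())
print(repr(joined))  # 'this is a string'

# re.sub collapses each run to one space but keeps one at both ends.
subbed = re.sub(r'\s+', ' ', s)
print(repr(subbed))  # ' this is a string '

# split(" ") with an explicit delimiter keeps the empty tokens.
explicit = "a  b".split(" ")
print(explicit)  # ['a', '', 'b']
```

So the two approaches only agree after stripping the regex result.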
