error "unmatched group" when using re.sub in Python 2.7 - python

I have a list of strings. Each element represents a field as key value separated by space:
listA = [
'abcd1-2 4d4e',
'xyz0-1 551',
'foo 3ea',
'bar1 2bd',
'mc-mqisd0-2 77a'
]
Behavior
I need to build a dict from this list, expanding every key that ends in a range. For example, 'abcd1-2' should be expanded by the range 1-2 into the keys abcd1 and abcd2, each with the same value 4d4e.
It should run as part of an Ansible plugin, where Python 2.7 is used.
Expected
The end result would look like the dict below:
{
abcd1: 4d4e,
abcd2: 4d4e,
xyz0: 551,
xyz1: 551,
foo: 3ea,
bar1: 2bd,
mc-mqisd0: 77a,
mc-mqisd1: 77a,
mc-mqisd2: 77a,
}
Code
I have created below function. It is working with Python 3.
def listFln(listA):
    import re
    fL = []
    for i in listA:
        aL = i.split()[0]
        bL = i.split()[1]
        comp = re.sub('^(.+?)(\d+-\d+)?$',r'\1',aL)
        cmpCountR = re.sub('^(.+?)(\d+-\d+)?$',r'\2',aL)
        if cmpCountR.strip():
            nStart = int(cmpCountR.split('-')[0])
            nEnd = int(cmpCountR.split('-')[1])
            for j in range(nStart,nEnd+1):
                fL.append(comp + str(j) + ' ' + bL)
        else:
            fL.append(i)
    return(dict([k.split() for k in fL]))
Error
In older Python versions such as Python 2.7, this code throws an "unmatched group" error:
cmpCountR = re.sub('^(.+?)(\d+-\d+)?$',r'\2',aL)
File "/usr/lib64/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib64/python2.7/re.py", line 275, in filter
return sre_parse.expand_template(template, match)
File "/usr/lib64/python2.7/sre_parse.py", line 800, in expand_template
raise error, "unmatched group"
Anything wrong with the regex here?

Here's a simpler version using findall instead of sub, successfully tested on 2.7. It also creates the dict directly instead of first building a list:
mylist=[
'abcd1-2 4d4e',
'xyz0-1 551',
'foo 3ea',
'bar1 2bd',
'mc-mqisd0-2 77a'
]
def listFln(listA):
    import re
    fL = {}
    for i in listA:
        aL = i.split()[0]
        bL = i.split()[1]
        comp = re.findall('^(.+?)(\d+-\d+)?$',aL)[0]
        if comp[1]:
            nStart = int(comp[1].split('-')[0])
            nEnd = int(comp[1].split('-')[1])
            for j in range(nStart,nEnd+1):
                fL[comp[0]+str(j)] = bL
        else:
            fL[comp[0]] = bL
    return fL
print(listFln(mylist))
# {'abcd1': '4d4e',
# 'abcd2': '4d4e',
# 'xyz0': '551',
# 'xyz1': '551',
# 'foo': '3ea',
# 'bar1': '2bd',
# 'mc-mqisd0': '77a',
# 'mc-mqisd1': '77a',
# 'mc-mqisd2': '77a'}

Used Python 2.7 to reproduce. This answer shows the issue with not found backreferences for re.sub in Python 2.7 and some patterns to fix.
Both patterns compile
import re
# both seem identical
regex1 = '^(.+?)(\d+-\d+)?$'
regex2 = '^(.+?)(\d+-\d+)?$'
# re.compile caches patterns, so both calls return the same object (see the identical address)
re.compile(regex1) # <_sre.SRE_Pattern object at 0x7f575ef8fd40>
re.compile(regex2) # <_sre.SRE_Pattern object at 0x7f575ef8fd40>
Note: The compiled pattern using re.compile() saves time when re-using multiple times like in this loop.
Fix: test for groups found
The error-message indicates that there are groups that aren't matched.
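Put differently: the replacement template passed to re.sub (see the 2.7 docs) references a group, here the second capturing group (\2), that did not take part in the match for the given input string. Python 2.7 raises an error in that case, whereas Python 3.5 and later substitute an empty string: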
sre_constants.error: unmatched group
To fix this, we should test on groups that were found in the match.
Therefore we use re.match(regex, str) or the compiled variant pattern.match(str) to create a Match object, then Match.groups() to return all found groups as tuple.
import re
regex = '^(.+?)(\d+-\d+)?$' # a key followed by optional digits-range
pattern = re.compile(regex) # <_sre.SRE_Pattern object at 0x7f575ef8fd40>
def dict_with_expanded_digits(fields_list):
    entry_list = []
    for fields in fields_list:
        (key_digits_range, value) = fields.split() # a pair of ('key0-1', 'value')
        # test for match and groups found
        match = pattern.match(key_digits_range)
        # skip to the next iteration here, if not matching the pattern
        if not match:
            print('ERROR: no valid key! Will not add to dict.', fields)
            continue
        print("DEBUG: groups:", match.groups()) # tuple containing all the subgroups of the match,
        # watch: the 3rd iteration has only group(1), while group(2) is None
        # if no 2nd group, only a single key,value
        if not match.group(2):
            print('WARN: key without range! Will add as single entry:', fields)
            entry_list.append( (key_digits_range, value) )
            continue # stop iteration here and continue with next
        key = pattern.sub(r'\1', key_digits_range)
        index_range = pattern.sub(r'\2', key_digits_range)
        # no strip needed here
        (start, end) = index_range.split('-')
        for index in range(int(start), int(end)+1):
            expanded_key = "{}{}".format(key, index)
            entry = (expanded_key, value) # use tuple for each field entry (key, value)
            entry_list.append(entry)
    return dict(entry_list)
list_a = [
'abcd1-2 4d4e', # 2 entries
'xyz0-1 551', # 2 entries
'foo 3ea', # 1 entry
'bar1 2bd', # 1 entry
'mc-mqisd0-2 77a' # 3 entries
]
dict_a = dict_with_expanded_digits(list_a)
print("INFO: resulting dict with length: ", len(dict_a), dict_a)
assert len(dict_a) == 9
Prints:
('DEBUG: groups:', ('abcd', '1-2'))
('DEBUG: groups:', ('xyz', '0-1'))
('DEBUG: groups:', ('foo', None))
('WARN: key without range! Will add as single entry:', 'foo 3ea')
('DEBUG: groups:', ('bar1', None))
('WARN: key without range! Will add as single entry:', 'bar1 2bd')
('DEBUG: groups:', ('mc-mqisd', '0-2'))
('INFO: resulting dict with length: ', 9, {'bar1': '2bd', 'foo': '3ea', 'mc-mqisd2': '77a', 'mc-mqisd0': '77a', 'mc-mqisd1': '77a', 'xyz1': '551', 'xyz0': '551', 'abcd1': '4d4e', 'abcd2': '4d4e'})
Note on added improvements
renamed function and variables to express intent
used tuples where possible, e.g. assignment (start, end)
used the methods of the compiled pattern (pattern.match, pattern.sub) instead of the module-level re. functions
the guard-statement if not match.group(2): avoids expanding the field and just adds the key-value as is
added assert to verify that the given list of 5 is expanded to a dict of 9 as expected
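Another way to avoid the error, shown here only as a minimal sketch, is to pass a callable instead of a template string as the replacement. The callable reads the group from the match object itself, so an unmatched group comes back as None and can be mapped to an empty string. The names group_or_empty and split_key are made up for this illustration.

```python
import re

PATTERN = re.compile(r'^(.+?)(\d+-\d+)?$')

def group_or_empty(index):
    """Build a replacement callable returning group `index`, or '' if it did not match."""
    def repl(match):
        return match.group(index) or ''
    return repl

def split_key(key):
    """Split 'abcd1-2' into ('abcd', '1-2'); keys without a range get an empty second part."""
    base = PATTERN.sub(group_or_empty(1), key)
    digits = PATTERN.sub(group_or_empty(2), key)
    return base, digits

print(split_key('abcd1-2'))
print(split_key('foo'))
```

Because no template with \2 is expanded, this should behave the same on Python 2.7 and Python 3.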

You could use a single pattern with 4 capture groups, and check if the 3rd capture group value is not empty.
^(\S*?)(?:(\d+)-(\d+))?\s+(.*)
The pattern matches:
^ Start of string
(\S*?) Capture group 1, match optional non-whitespace chars, as few as possible
(?:(\d+)-(\d+))? Optionally capture 1+ digits in group 2 and group 3 with a - in between
\s+ Match 1+ whitespace chars
(.*) Capture group 4, match the rest of the line
Code example (works on Python 2 and Python 3)
import re
strings = [
'abcd1-2 4d4e',
'xyz0-1 551',
'foo 3ea',
'bar1 2bd',
'mc-mqisd0-2 77a'
]
def listFln(listA):
    dct = {}
    for s in listA:
        lst = sum(re.findall(r"^(\S*?)(?:(\d+)-(\d+))?\s+(.*)", s), ())
        if lst and lst[2]:
            for i in range(int(lst[1]), int(lst[2]) + 1):
                dct[lst[0] + str(i)] = lst[3]
        else:
            dct[lst[0]] = lst[3]
    return dct
print(listFln(strings))
Output
{
'abcd1': '4d4e',
'abcd2': '4d4e',
'xyz0': '551',
'xyz1': '551',
'foo': '3ea',
'bar1': '2bd',
'mc-mqisd0': '77a',
'mc-mqisd1': '77a',
'mc-mqisd2': '77a'
}

Related

How can I remove specific duplicates from a list, rather than remove all duplicates indiscriminately?

In a python script, I need to assess whether a string contains duplicates of a specific character (e.g., "f") and, if so, remove all but the first instance of that character. Other characters in the string may also have duplicates, but the script should not remove any duplicates other than those of the specified character.
This is what I've got so far. The script runs, but it is not accomplishing the desired task. I modified the reduce() line from the top answer to this question, but it's a little more complex than what I've learned at this point, so it's difficult for me to tell what part of this is wrong.
import re
from functools import reduce
string = "100 ffeet"
dups = ["f", "t"]
for char in dups:
    if string.count(char) > 1:
        lst = list(string)
        reduce(lambda acc, el: acc if re.match(char, el) and el in acc else acc + [el], lst, [])
        string = "".join(lst)
Let's create a function that receives a string s and a character c as parameters, and returns a new string where all but the first occurrence of c in s are removed.
We'll be making use of the following functions from Python std lib:
str.find(sub): Return the lowest index in the string where substring sub is found.
str.replace(old, new): Return a copy of the string with all occurrences of substring old replaced by new.
The idea is straightforward:
Find the first index of c in s
If none is found, return s
Make a substring of s starting from the next character after c
Remove all occurrences of c in the substring
Concatenate the first part of s with the updated substring
Return the final string
In Python:
def remove_all_but_first(s, c):
    i = s.find(c)
    if i == -1:
        return s
    i += 1
    return s[:i] + s[i:].replace(c, '')
Now you can use this function to remove all the characters you want.
def main():
    s = '100 ffffffffeet'
    dups = ['f', 't', 'x']
    print('Before:', s)
    for c in dups:
        s = remove_all_but_first(s, c)
    print('After:', s)
if __name__ == '__main__':
    main()
Here is one way that you could do it
string = "100 ffeet"
dups = ["f", "t"]
seen = []
result = ''
# walk the string from the start so that the first occurrence is the one that is kept
for ch in string:
    if ch in dups and ch in seen:
        continue  # repeated occurrence of a listed character: drop it
    if ch in dups:
        seen.append(ch)
    result += ch
string = result
print(string)

Python chunking with regular expressions

In Perl, it was easy to iterate over a string to chunk it into tokens:
$key = ".foo[4][5].bar.baz";
@chunks = $key =~ m/\G\[\d+\]|\.[^][.]+/gc;
print "@chunks\n";
#>> output: .foo [4] [5] .bar .baz
# Optional error handling:
die "Malformed key at '" . substr($key, pos($key)) . "'"
    if pos($key) != length($key);
If more control is needed, that can be turned into a loop instead:
while ($key =~ m/(\G\[\d+\]|\.[^][.]+)/g) {
    push @chunks, $1; # Optionally process each one
}
I'd like to find a clean, idiomatic way to do this in Python. So far I only have this:
import re
key = ".foo[4][5].bar.baz"
rx = re.compile(r'\[\d+\]|\.[^][.]+')
chunks = []
while True:
    m = re.match(rx, key)
    if not m:
        raise ValueError(f"Malformed key at '{key}'")
    chunk = m.group(0)
    chunks.append(chunk[1:] if chunk.startswith('.') else int(chunk[1:-1]))
    key = key[m.end(0):]
    if key == '':
        break
print(chunks)
Aside from it being a lot more verbose, I don't love that because I need to destroy the string as I process it, since there doesn't seem to be an equivalent of Perl's \G anchor (pick up where the last match left off). An alternative would be to keep track of my own match position in the string in each loop, but that seems even more fiddly.
Is there some idiom I haven't found? I also tried some solution using re.finditer() but it doesn't seem to have a way to have each match start at the exact end of the previous match (e.g. re.matchiter() or somesuch).
Suggestions & discussion welcome.
Summary
There is no direct equivalent of re.matchiter() as you described it.
Two alternatives come to mind:
Create a mismatch token.
Write your own generator with the desired behavior.
Mismatch Token
The usual technique in Python is to define a MISMATCH catchall token and to raise an exception if that token is ever encountered.
Here's a working example (one that I wrote and put in the Python docs so that everyone could find it):
from typing import NamedTuple
import re
class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int
def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER', r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN', r':='), # Assignment operator
        ('END', r';'), # Statement terminator
        ('ID', r'[A-Za-z]+'), # Identifiers
        ('OP', r'[+\-*/]'), # Arithmetic operators
        ('NEWLINE', r'\n'), # Line endings
        ('SKIP', r'[ \t]+'), # Skip over spaces and tabs
        ('MISMATCH', r'.'), # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)
statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''
for token in tokenize(statements):
    print(token)
Custom Generator
Another alternative is to write a custom generator with the desired behavior.
The match() method for compiled regular expressions allows an optional starting position for the match operation. With that tool, it isn't hard to write a custom generator that applies match() to consecutive starting positions:
def itermatch(pattern, string):
    p = re.compile(pattern)
    pos = 0
    while True:
        mo = p.match(string, pos)
        if mo is None:
            break # Or raise exception
        yield mo
        pos = mo.end()
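As a rough sketch of how such a generator could be applied to the key from the question: the variant below raises instead of breaking, so a malformed key is reported at the position where matching stopped. The helper name parse_key and the two capture groups in the pattern are assumptions added for this illustration; they are not part of the question's code.

```python
import re

def itermatch(pattern, string):
    """Yield consecutive matches, each one starting where the previous one ended."""
    p = re.compile(pattern)
    pos = 0
    while pos < len(string):
        mo = p.match(string, pos)
        if mo is None or mo.end() == pos:
            # nothing (or only an empty match) at this position: the input is malformed
            raise ValueError("Malformed key at %r" % string[pos:])
        yield mo
        pos = mo.end()

def parse_key(key):
    """Turn '.foo[4][5].bar' into ['foo', 4, 5, 'bar'] (hypothetical helper)."""
    chunks = []
    for mo in itermatch(r'\[(\d+)\]|\.([^\[\].]+)', key):
        index, name = mo.groups()
        chunks.append(int(index) if index is not None else name)
    return chunks

print(parse_key(".foo[4][5].bar.baz"))
```

The original string is never sliced; only the integer position moves forward, which is roughly what Perl's \G anchor provides.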

In python, how to 'if finditer(...) has no matches'?

I would like to do something when finditer() does not find anything.
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
# <if matched_iter is empty (no matched found>.
# do something.
# else
for m in matched_iter:
    print m.group()
The best thing I could come up with is to keep track of found manually:
mi_no_find = re.finditer(r'\w+',"$$%%%%") # not matching.
found = False
for m in mi_no_find:
    print m.group()
    found = True
if not found:
    print "Nothing found"
Related posts that don't answer:
Counting finditer matches: Number of regex matches (I don't need to count, I just need to know if there are no matches).
finditer vs match: different behavior when using re.finditer and re.match (says always have to loop over an iterator returned by finditer)
[edit]
- I have no interest in enumerating or counting total output. Only if found else not found actions.
- I understand I can put finditer into a list, but this would be inefficient for large strings. One objective is to have low memory utilization.
Updated 04/10/2020
Use re.search(pattern, string) to check if a pattern exists.
pattern = "1"
string = "abc"
if re.search(pattern, string) is None:
    print('do this because nothing was found')
Returns:
do this because nothing was found
If you want to iterate over the return, then place the re.finditer() within the re.search().
pattern = '[A-Za-z]'
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
Returns:
Found this thing: a
Found this thing: b
Found this thing: c
Therefore, if you wanted both options, use the else: clause with the if re.search() conditional.
pattern = "1"
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
else:
    print('do this because nothing was found')
Returns:
do this because nothing was found
previous reply below (not sufficient, just read above)
If the .finditer() does not match a pattern, then it will not perform any commands within the related loop.
So:
Set the variable before the loop you are using to iterate over the regex returns
Call the variable after (And outside of) the loop you are using to iterate over the regex returns
This way, if nothing is returned from the regex call, the loop won't execute and your variable call after the loop will return the exact same variable it was set to.
Below, example 1 demonstrates the regex finding the pattern. Example 2 shows the regex not finding the pattern, so the variable within the loop is never set.
Example 3 shows my suggestion - where the variable is set before the regex loop, so if the regex does not find a match (and subsequently, does not trigger the loop), the variable call after the loop returns the initial variable set (Confirming the regex pattern was not found).
Remember to import the re module.
EXAMPLE 1 (Searching for the characters 'he' in the string 'hello world' will return 'he')
my_string = 'hello world'
pat = '(he)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# returns 'he'
EXAMPLE 2 (Searching for the characters 'ab' in the string 'hello world' do not match anything, so the 'for a in regex:' loop does not execute and does not assign the b variable any value.)
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# no return
EXAMPLE 3 (Searching for the characters 'ab' again, but this time setting the variable b to 'CAKE' before the loop, and calling the variable b after, outside of the loop returns the initial variable - i.e. 'CAKE' - since the loop did not execute).
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
print(b) # calls the variable after (and outside) the loop
# returns 'CAKE'
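It's also worth noting that the examples above read the result with a.groups()[0], so the pattern needs parentheses to mark the start and end of a capturing group. Without a group, a.groups() is empty; use a.group(0) in that case.
pattern = '(ab)' # works with a.groups()[0]
pattern = 'ab' # a.groups() is empty here; use a.group(0) instead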
To tie back to the initial question:
Since nothing found won’t execute the for loop (for a in regex), the user can preload the variable, then check it after the for loop for the original loaded value. This will allow for the user to know if nothing was found.
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
if b == 'CAKE':
    # action taken if nothing is returned
    print('Nothing found')
If performance isn't an issue, simply use findall or list(finditer(...)), which returns a list.
Otherwise, you can "peek" into the generator with next, then loop as normal if it raises StopIteration. Though there are other ways to do it, this is the simplest to me:
import itertools
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
try:
    first_match = next(matched_iter)
except StopIteration:
    print("No match!") # action for no match
else:
    for m in itertools.chain([first_match], matched_iter):
        print(m.group())
You can probe the iterator with next and then chain the results back together while excepting StopIteration which means the iterator was empty:
import itertools as it
matches = iter([])
try:
    probe = next(matches)
except StopIteration:
    print('empty')
else:
    for m in it.chain([probe], matches):
        print(m)
Regarding your solution you could check m directly, setting it to None beforehand:
matches = iter([])
m = None
for m in matches:
    print(m)
if m is None:
    print('empty')
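The probe-and-chain idea can be wrapped in a small helper so that the call site stays a plain if/else. This is only a sketch; the name peek_nonempty is made up for this example.

```python
import itertools
import re

def peek_nonempty(iterator):
    """Return an equivalent iterator, or None if `iterator` yields nothing."""
    try:
        first = next(iterator)
    except StopIteration:
        return None
    # put the probed item back in front of the remaining ones
    return itertools.chain([first], iterator)

matches = peek_nonempty(re.finditer(r"\d", "abc"))
if matches is None:
    print("Nothing found")
else:
    for m in matches:
        print(m.group())
```

Only the first match is held in memory at any time, so this keeps the low memory use that motivated finditer over findall.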
The function below replaces the n-th match of a pattern in the string.
It returns the original string unchanged if there are fewer than n matches.
For more reference: https://docs.python.org/2/howto/regex.html
import re
Input_Str = "FOOTBALL"
def replacing(Input_String, char_2_replace, replaced_char, n):
    pattern = re.compile(char_2_replace)
    if len(re.findall(pattern, Input_String)) >= n:
        where = [m for m in pattern.finditer(Input_String)][n-1]
        before = Input_String[:where.start()]
        after = Input_String[where.end():]
        newString = before + replaced_char + after
    else:
        newString = Input_String
    return newString
print(replacing(Input_Str, 'L', 'X', 4))
I know this answer is late, but very suitable for Python 3.8+
You can use the walrus operator := (new in Python 3.8) along with next(iterator[, default]) to handle 'no matches' in re.finditer(pattern, string, flags=0), somewhat like this:
import re
pattern_ = "1"
string_ = "abc"
def is_match():
    was_found = False
    matches = re.finditer(pattern_, string_)  # create the iterator only once
    while (match := next(matches, None)) is not None:
        was_found = True
        yield match.group() # or just print it
    return was_found  # available as StopIteration.value once the generator is exhausted

How to find and replace nth occurrence of word in a sentence using python regular expression?

Using python regular expression only, how to find and replace nth occurrence of word in a sentence?
For example:
str = 'cat goose mouse horse pig cat cow'
new_str = re.sub(r'cat', r'Bull', str)
new_str = re.sub(r'cat', r'Bull', str, 1)
new_str = re.sub(r'cat', r'Bull', str, 2)
In the sentence above, the word 'cat' appears twice. I want the 2nd occurrence of 'cat' to be changed to 'Bull', leaving the 1st 'cat' untouched. My final sentence would look like:
"cat goose mouse horse pig Bull cow". In my code above I tried 3 different variants but could not get what I wanted.
Use negative lookahead like below.
>>> s = "cat goose mouse horse pig cat cow"
>>> re.sub(r'^((?:(?!cat).)*cat(?:(?!cat).)*)cat', r'\1Bull', s)
'cat goose mouse horse pig Bull cow'
^ Asserts that we are at the start.
(?:(?!cat).)* Matches any character, zero or more times, as long as the substring cat does not start at that character.
cat Matches the first cat substring.
(?:(?!cat).)* Matches any character, zero or more times, as long as the substring cat does not start at that character.
Now, enclose all these patterns inside a capturing group like ((?:(?!cat).)*cat(?:(?!cat).)*), so that we can refer to the captured characters later.
cat Now the following second cat string is matched.
OR
>>> s = "cat goose mouse horse pig cat cow"
>>> re.sub(r'^(.*?(cat.*?){1})cat', r'\1Bull', s)
'cat goose mouse horse pig Bull cow'
The number inside the {} is n-1: it counts the occurrences of the string cat that are skipped before the one that gets replaced, so {1} replaces the second occurrence.
To replace the third occurrence of the string cat, put 2 inside the curly braces:
>>> re.sub(r'^(.*?(cat.*?){2})cat', r'\1Bull', "cat goose mouse horse pig cat foo cat cow")
'cat goose mouse horse pig cat foo Bull cow'
I use simple function, which lists all occurrences, picks the nth one's position and uses it to split original string into two substrings. Then it replaces first occurrence in the second substring and joins substrings back into the new string:
import re
def replacenth(string, sub, wanted, n):
    where = [m.start() for m in re.finditer(sub, string)][n-1]
    before = string[:where]
    after = string[where:]
    newString = before + after.replace(sub, wanted, 1)
    print newString
For these variables:
string = 'ababababababababab'
sub = 'ab'
wanted = 'CD'
n = 5
outputs:
ababababCDabababab
Notes:
The list comprehension collects the start positions of all matches, and where is the nth of them. List indices start at 0, not 1, hence the n-1 index: n is the 1-based occurrence number, and my example replaces the 5th occurrence. If you index with n directly instead, you need n to be 4 to reach the 5th occurrence. Which convention you use usually depends on the code that generates n.
This should be the simplest way, but it isn't regex only as you originally wanted.
Sources and some links in addition:
where construction: How to find all occurrences of a substring?
string splitting: https://www.daniweb.com/programming/software-development/threads/452362/replace-nth-occurrence-of-any-sub-string-in-a-string
similar question: Find the nth occurrence of substring in a string
Here's a way to do it without a regex:
def replaceNth(s, source, target, n):
    inds = [i for i in range(len(s) - len(source)+1) if s[i:i+len(source)]==source]
    if len(inds) < n:
        return # or maybe raise an error
    s = list(s) # can't assign to string slices. So, let's listify
    s[inds[n-1]:inds[n-1]+len(source)] = target # do n-1 because we start from the first occurrence of the string, not the 0-th
    return ''.join(s)
Usage:
In [278]: s
Out[278]: 'cat goose mouse horse pig cat cow'
In [279]: replaceNth(s, 'cat', 'Bull', 2)
Out[279]: 'cat goose mouse horse pig Bull cow'
In [280]: print(replaceNth(s, 'cat', 'Bull', 3))
None
I would define a function that will work for every regex:
import re
def replace_ith_instance(string, pattern, new_str, i = None, pattern_flags = 0):
    # If i is None - replacing last occurrence
    match_obj = re.finditer(r'{0}'.format(pattern), string, flags = pattern_flags)
    matches = [item for item in match_obj]
    if i is None:
        i = len(matches)
    if len(matches) == 0 or len(matches) < i:
        return string
    match = matches[i - 1]
    match_start_index = match.start()
    match_len = len(match.group())
    return '{0}{1}{2}'.format(string[0:match_start_index], new_str, string[match_start_index + match_len:])
A working example:
str = 'cat goose mouse horse pig cat cow'
ns = replace_ith_instance(str, 'cat', 'Bull', 2)
print(ns)
The output:
cat goose mouse horse pig Bull cow
Another example:
str2 = 'abc abc def abc abc'
ns = replace_ith_instance(str2, r'abc\s*abc', '666')
print(ns)
The output:
abc abc def 666
How to replace the nth needle with word:
s.replace(needle,'$$$',n-1).replace(needle,word,1).replace('$$$',needle)
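To make the chained replace easier to follow, here is the same idea spelled out step by step. It is a sketch under two assumptions that the one-liner leaves implicit: the placeholder must not already occur in the text, and n counts from 1. The function name replace_nth_plain and the NUL placeholder are choices made for this illustration.

```python
def replace_nth_plain(s, needle, word, n, placeholder='\x00'):
    """Replace the n-th (1-based) occurrence of `needle` in `s` with `word`."""
    if n < 1:
        raise ValueError("n must be >= 1")
    if placeholder in s or placeholder in word:
        raise ValueError("placeholder must not occur in the input")
    hidden = s.replace(needle, placeholder, n - 1)  # hide the first n-1 occurrences
    swapped = hidden.replace(needle, word, 1)       # replace the next visible one
    return swapped.replace(placeholder, needle)     # restore the hidden occurrences

print(replace_nth_plain('cat goose mouse horse pig cat cow', 'cat', 'Bull', 2))
```

If there are fewer than n occurrences, the middle step finds nothing to replace and the string comes back unchanged.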
You can match the two occurrences of "cat", keep everything before the second occurrence (\1) and add "Bull":
new_str = re.sub(r'(cat.*?)cat', r'\1Bull', str, 1)
We do only one substitution to avoid replacing the fourth, sixth, etc. occurrence of "cat" (when there are at least four occurrences), as pointed out in Avinash Raj's comment.
If you want to replace the n-th occurrence and not the second, wrap the repeated part in an outer capture group, because a repeated group only keeps its last repetition:
n = 2
new_str = re.sub('((?:cat.*?){%d})cat' % (n - 1), r'\1Bull', str, 1)
BTW you should not use str as a variable name, since it shadows the built-in str type.
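A possible way to package the n-th occurrence regex as a function is sketched below. A capture group that is repeated with {k} only retains its last repetition, so the sketch repeats a non-capturing group and captures the whole prefix in an outer group 1. The function name replace_nth is hypothetical, and the word is treated as a literal via re.escape.

```python
import re

def replace_nth(text, word, replacement, n):
    """Replace only the n-th (1-based) occurrence of the literal `word` in `text`."""
    if n < 1:
        raise ValueError("n must be >= 1")
    w = re.escape(word)
    # group 1: everything from the first occurrence up to, but excluding, the n-th one
    pattern = r'((?:%s.*?){%d})%s' % (w, n - 1, w)
    return re.sub(pattern, lambda m: m.group(1) + replacement,
                  text, count=1, flags=re.DOTALL)

print(replace_nth('cat goose mouse horse pig cat cow', 'cat', 'Bull', 2))
```

Using a callable for the replacement means that backslashes in the replacement text are inserted as they are, without being interpreted as group references.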
Create a repl function to pass into re.sub(). Except... the trick is to make it a class so you can track the call count.
class ReplWrapper(object):
    def __init__(self, replacement, occurrence):
        self.count = 0
        self.replacement = replacement
        self.occurrence = occurrence
    def repl(self, match):
        self.count += 1
        if self.occurrence == 0 or self.occurrence == self.count:
            return match.expand(self.replacement)
        else:
            return match.group(0) # leave every other occurrence unchanged
Then use it like this:
myrepl = ReplWrapper(r'Bull', 0) # replaces all instances in a string
new_str = re.sub(r'cat', myrepl.repl, str)
myrepl = ReplWrapper(r'Bull', 1) # replaces 1st instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)
myrepl = ReplWrapper(r'Bull', 2) # replaces 2nd instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)
I'm sure there is a more clever way to avoid using a class, but this seemed straightforward enough to explain. Also, be sure to return match.expand(), as just returning the replacement value is not technically correct if someone decides to use \1 type templates.
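One class-free variant, offered as a sketch rather than a definitive replacement, keeps the call count in a closure using itertools.count. The name nth_repl is made up for this example; the meaning of occurrence (0 for all, otherwise 1-based) follows the class above.

```python
import itertools
import re

def nth_repl(replacement, occurrence):
    """Build a repl callable that rewrites only the given 1-based occurrence (0 = all)."""
    counter = itertools.count(1)
    def repl(match):
        if occurrence == 0 or next(counter) == occurrence:
            return match.expand(replacement)
        return match.group(0)  # leave every other occurrence unchanged
    return repl

s = 'cat goose mouse horse pig cat cow'
print(re.sub(r'cat', nth_repl(r'Bull', 2), s))
```

As with the class, a fresh callable is needed for each re.sub() call, because the counter is not reset between calls.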
I approached this by generating a 'grouped' version of the desired catch pattern relative to the entire string, then applying the sub directly to that instance.
The parent function is regex_n_sub, and collects the same inputs as the re.sub() method.
The catch pattern is passed to get_nsubcatch_catch_pattern() with the instance number. Inside, a list comprehension generates multiples of the pattern '.*?' (match any character, 0 or more repetitions, non-greedy). This pattern represents the text between the occurrences of the catch_pattern that precede the nth one.
Next, the input catch_pattern is placed between each pair of these 'space patterns' and the result is wrapped in parentheses to form the first group.
The second group is just the catch_pattern wrapped in parentheses, so when the two groups are combined, they form a pattern for 'all of the text up to the nth occurrence of the catch pattern', followed by that occurrence. This new_catch_pattern has two groups built in, so the second group containing the nth occurrence of the catch_pattern can be substituted.
The replace pattern is passed to get_nsubcatch_replace_pattern() and combined with the prefix r'\g<1>', forming a pattern \g<1> + replace_pattern. The \g<1> part re-inserts the text captured by group 1, so that only the nth occurrence (group 2) is replaced by the text that follows in the replace pattern.
The code below is verbose only for a clearer understanding of the process flow; it can be reduced as desired.
--
The example below should run stand-alone, and corrects the 4th instance of "I" to "me":
"When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
with
"When I go to the park and I am alone I think the ducks laugh at me but I'm not sure."
import regex as re
def regex_n_sub(catch_pattern, replace_pattern, input_string, n, flags=0):
    new_catch_pattern, new_replace_pattern = generate_n_sub_patterns(catch_pattern, replace_pattern, n)
    return_string = re.sub(new_catch_pattern, new_replace_pattern, input_string, 1, flags)
    return return_string
def generate_n_sub_patterns(catch_pattern, replace_pattern, n):
    new_catch_pattern = get_nsubcatch_catch_pattern(catch_pattern, n)
    new_replace_pattern = get_nsubcatch_replace_pattern(replace_pattern, n)
    return new_catch_pattern, new_replace_pattern
def get_nsubcatch_catch_pattern(catch_pattern, n):
    space_string = '.*?'
    space_list = [space_string for i in range(n)]
    first_group = catch_pattern.join(space_list)
    first_group = first_group.join('()')
    second_group = catch_pattern.join('()')
    new_catch_pattern = first_group + second_group
    return new_catch_pattern
def get_nsubcatch_replace_pattern(replace_pattern, n):
    new_replace_pattern = r'\g<1>' + replace_pattern
    return new_replace_pattern
### use test ###
catch_pattern = 'I'
replace_pattern = 'me'
test_string = "When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
regex_n_sub(catch_pattern, replace_pattern, test_string, 4)
This code can be copied directly into a workflow, and will return the replaced object to the regex_n_sub() function call.
Please let me know if implementation fails!
Thanks!
Just because none of the current answers fitted what I needed: based on aleskva's one:
import re
def replacenth(string, pattern, replacement, n):
    assert n != 0
    matches = list(re.finditer(pattern, string))
    if len(matches) < abs(n) :
        return string
    m = matches[ n-1 if n > 0 else len(matches) + n]
    return string[0:m.start()] + replacement + string[m.end():]
It accepts negative match numbers (n = -1 will replace the last match), any regex pattern, and it's efficient. If there are fewer than n matches, the original string is returned.

Finding whether a string starts with one of a list's variable-length prefixes

I need to find out whether a name starts with any of a list's prefixes and then remove it, like:
if name[:2] in ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]:
name = name[2:]
The above only works for list prefixes with a length of two. I need the same functionality for variable-length prefixes.
How is it done efficiently (little code and good performance)?
A for loop iterating over each prefix and then checking name.startswith(prefix) to finally slice the name according to the length of the prefix works, but it's a lot of code, probably inefficient, and "non-Pythonic".
Does anybody have a nice solution?
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return
False. prefix can also be a tuple of prefixes to look for. With
optional start, test string beginning at that position. With
optional end, stop comparing string at that position.
$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: prefixes = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_")
In [2]: 'test'.startswith(prefixes)
Out[2]: False
In [3]: 'i_'.startswith(prefixes)
Out[3]: True
In [4]: 'd_a'.startswith(prefixes)
Out[4]: True
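Since startswith() with a tuple only reports whether some prefix matched, not which one, removing the prefix still needs a second step. A minimal sketch, assuming the first matching prefix in the tuple should be removed (the function name strip_prefix and the extra "also_longer_" entry are made up for this example):

```python
PREFIXES = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_")

def strip_prefix(name, prefixes=PREFIXES):
    """Remove the first matching prefix from `name`, if there is one."""
    if name.startswith(prefixes):  # single check against the whole tuple
        for prefix in prefixes:
            if name.startswith(prefix):
                return name[len(prefix):]
    return name

print(strip_prefix("i_counter"))
print(strip_prefix("also_longer_name"))
print(strip_prefix("test"))
```

Names without any known prefix take only the tuple check and skip the loop entirely.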
A bit hard to read, but this works (in Python 2, where filter returns a list, and with prefixes being a list):
name=name[len(filter(name.startswith,prefixes+[''])[0]):]
for prefix in prefixes:
    if name.startswith(prefix):
        name=name[len(prefix):]
        break
Regexes will likely give you the best speed:
import re
prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]
re_prefixes = "|".join(re.escape(p) for p in prefixes)
m = re.match(re_prefixes, my_string)
if m:
    my_string = my_string[m.end()-m.start():]
If you define prefix to be the characters before an underscore, then you can check for
if name.partition("_")[0] in ["i", "c", "m", "l", "d", "t", "e", "b", "foo"] and name.partition("_")[1] == "_":
    name = name.partition("_")[2]
What about using filter?
prefs = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
name = list(filter(lambda item: not any(item.startswith(prefix) for prefix in prefs), name))
Note that the comparison of each list item against the prefixes efficiently halts on the first match. This behaviour is guaranteed by the any function that returns as soon as it finds a True value, eg:
def gen():
    print("yielding False")
    yield False
    print("yielding True")
    yield True
    print("yielding False again")
    yield False
>>> any(gen()) # last two lines of gen() are not performed
yielding False
yielding True
True
Or, using re.match instead of startswith:
import re
patt = '|'.join(["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"])
name = list(filter(lambda item: not re.match(patt, item), name))
Regex, tested:
import re
def make_multi_prefix_matcher(prefixes):
    regex_text = "|".join(re.escape(p) for p in prefixes)
    print repr(regex_text)
    return re.compile(regex_text).match
pfxs = "x ya foobar foo a|b z.".split()
names = "xenon yadda yeti food foob foobarre foo a|b a b z.yx zebra".split()
matcher = make_multi_prefix_matcher(pfxs)
for name in names:
    m = matcher(name)
    if not m:
        print repr(name), "no match"
        continue
    n = m.end()
    print repr(name), n, repr(name[n:])
Output:
'x|ya|foobar|foo|a\\|b|z\\.'
'xenon' 1 'enon'
'yadda' 2 'dda'
'yeti' no match
'food' 3 'd'
'foob' 3 'b'
'foobarre' 6 're'
'foo' 3 ''
'a|b' 3 ''
'a' no match
'b' no match
'z.yx' 2 'yx'
'zebra' no match
When it comes to search and efficiency, always think of indexing techniques to improve your algorithms. If you have a long list of prefixes, you can use an in-memory index by simply indexing the prefixes by their first character in a dict.
This solution is only worthwhile if you have a long list of prefixes and performance becomes an issue.
pref = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
#indexing prefixes in a dict. Do this only once.
d = dict()
for x in pref:
    if not x[0] in d:
        d[x[0]] = list()
    d[x[0]].append(x)
name = "c_abcdf"
#lookup in d to only check elements with the same first character.
result = filter(lambda x: name.startswith(x),\
    [] if name[0] not in d else d[name[0]])
print result
This edits the list on the fly, removing prefixes. The break skips the rest of the prefixes once one is found for a particular item.
items = ['this', 'that', 'i_blah', 'joe_cool', 'what_this']
prefixes = ['i_', 'c_', 'a_', 'joe_', 'mark_']
for i,item in enumerate(items):
    for p in prefixes:
        if item.startswith(p):
            items[i] = item[len(p):]
            break
print items
Output
['this', 'that', 'blah', 'cool', 'what_this']
Could use a simple regex.
import re
prefixes = ("i_", "c_", "longer_")
re.sub(r'^(%s)' % '|'.join(prefixes), '', name)
Or if anything preceding an underscore is a valid prefix:
name.split('_', 1)[-1]
This removes any number of characters before the first underscore.
import re
def make_multi_prefix_replacer(prefixes):
    if isinstance(prefixes,str):
        prefixes = prefixes.split()
    prefixes.sort(key = len, reverse=True)
    pat = r'\b(%s)' % "|".join(map(re.escape, prefixes))
    print 'regex pattern :',repr(pat),'\n'
    def suber(x, reg = re.compile(pat)):
        return reg.sub('',x)
    return suber
pfxs = "x ya foobar yaku foo a|b z."
replacer = make_multi_prefix_replacer(pfxs)
names = "xenon yadda yeti yakute food foob foobarre foo a|b a b z.yx zebra".split()
for name in names:
    print repr(name),'\n',repr(replacer(name)),'\n'
ss = 'the yakute xenon is a|bcdf in the barfoobaratu foobarii'
print '\n',repr(ss),'\n',repr(replacer(ss)),'\n'
