I'm struggling to understand the group method in Python's re library. In this context, I'm trying to do substitutions on a string depending on the match object.
That is, I want to replace the matched tokens (+ and \n in this example) with particular strings from the my_dict dictionary (rep1 and rep2 respectively).
As seen from this question and answer,
I have tried this:
import re

content = '''
Blah - blah \n blah * blah + blah.
'''

regex = r'[+\-*/]'

for mobj in re.finditer(regex, content):
    t = mobj.lastgroup
    v = mobj.group(t)
    new_content = re.sub(regex, repl_func(mobj), content)

def repl_func(mobj):
    my_dict = {'+': 'rep1', '\n': 'rep2'}
    try:
        match = mobj.group(0)
    except AttributeError:
        match = ''
    else:
        return my_dict.get(match, '')

print(new_content)
But I get None for t followed by an IndexError when computing v.
Any explanations and example code would be appreciated.
Despite Wiktor's truly Pythonic answer, there's still the question of why the OP's original algorithm wouldn't work.
Basically there are 2 issues:
The call of new_content = re.sub(regex, repl_func(mobj), content) will substitute all matches of regex with the replacement value of the very first match.
The correct call has to be new_content = re.sub(regex, repl_func, content).
As documented here, repl_func gets invoked dynamically with the current match object!
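A minimal sketch of the difference (the PLUS/MINUS replacement values here are made up purely for illustration):
import re

content = "a + b - c"
my_dict = {'+': 'PLUS', '-': 'MINUS'}

def repl_func(mobj):
    return my_dict.get(mobj.group(0), '')

first = next(re.finditer(r'[+\-]', content))        # only the very first match: '+'
print(re.sub(r'[+\-]', repl_func(first), content))  # a PLUS b PLUS c  - one value reused for every match
print(re.sub(r'[+\-]', repl_func, content))         # a PLUS b MINUS c - repl_func is called once per match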
repl_func(mobj) does some unnecessary exception handling, which can be simplified:
my_dict = {'\n': '', '+': 'rep1', '*': 'rep2', '/': 'rep3', '-': 'rep4'}

def repl_func(mobj):
    global my_dict
    return my_dict.get(mobj.group(0), '')
This is equivalent to Wiktor's solution - he just got rid of the function definition itself by using a lambda expression.
With this modification, the for mobj in re.finditer(regex, content): loop has become superfluous, as it does the same calculation multiple times.
Just for the sake of completeness here is a working solution using re.finditer(). It builds the result string from the matched slices of content:
import re

my_regx = r'[\n+*/-]'
my_dict = {'\n': '', '+': 'rep1', '*': 'rep2', '/': 'rep3', '-': 'rep4'}
content = "A*B+C-D/E"
res = ""
cbeg = 0

for mobj in re.finditer(my_regx, content):
    # get matched string and its slice indexes
    mstr = mobj.group(0)
    mbeg = mobj.start()
    mend = mobj.end()
    # replace matched string
    mrep = my_dict.get(mstr, '')
    # append non-matched part of content plus replacement
    res += content[cbeg:mbeg] + mrep
    # set new start index of remaining slice
    cbeg = mend

# finally add remaining non-matched slice
res += content[cbeg:]
print(res)
The r'[+\-*/]' regex does not match a newline, so your '\n': 'rep2' entry would never be used. If you want newlines handled too, add \n to the character class: r'[\n+*/-]'.
Next, you get None because your regex does not contain any named capturing groups, see re docs:
match.lastgroup
The name of the last matched capturing group, or None if the group didn’t have a name, or if no group was matched at all.
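A quick illustration of that behaviour (the group name op here is just an example):
import re

m = re.search(r'(?P<op>[+*/-])', 'a + b')
print(m.lastgroup)    # 'op'  - the pattern has a named group
print(m.group('op'))  # '+'

m = re.search(r'[+*/-]', 'a + b')
print(m.lastgroup)    # None  - no named (or any) groups in the pattern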
To replace using the match, you do not even need to use re.finditer, use re.sub with a lambda as the replacement:
import re
content = '''
Blah - blah \n blah * blah + blah.
'''
regex = r'[\n+*/-]'
my_dict = { '+': 'rep1', '\n': 'rep2'}
new_content = re.sub(regex, lambda m: my_dict.get(m.group(),""), content)
print(new_content)
# => rep2Blah blah rep2 blah blah rep1 blah.rep2
See the Python demo
The m.group() gets the whole match (the whole match is stored in match.group(0)). If you had a pair of unescaped parentheses in the pattern, it would create a capturing group and you could access the first one with m.group(1), etc.
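A small, self-contained example of the numbering (the pattern is made up for illustration):
import re

m = re.search(r'(\d+)-(\d+)', 'range 10-20')
print(m.group())   # '10-20'  - the whole match, same as m.group(0)
print(m.group(1))  # '10'     - first capturing group
print(m.group(2))  # '20'     - second capturing group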
Related
I am looking for some thoughts on how I would be able to accomplish these tasks:
Allow the first occurrence of a problem_word, but ban any following uses of it and the rest of the problem words.
No modifications to the original document (.txt file). Only modify for print().
Keep the same structure of the email. If there are line breaks, or tabs, or weird spacings, let them keep their integrity.
Here is the code sample:
import re
# Sample email is "Hello, banned1. This is banned2. What is going on with
# banned 3? Hopefully banned1 is alright."
sample_email = open('email.txt', 'r').read()
# First use of any of these words is allowed; those following are banned
problem_words = ['banned1', 'banned2', 'banned3']
# TODO: Filter negative_words into overused_negative_words
banned_problem_words = []
for w in problem_words:
    if sample_email.count(f'\\b{w}s?\\b') > 1:
        banned_problem_words.append(w)
pattern = '|'.join(f'\\b{w}s?\\b' for w in banned_problem_words)
def list_check(email, pattern):
    return re.sub(pattern, 'REDACTED', email, flags=re.IGNORECASE)
print(list_check(sample_email, pattern))
# Result should be: "Hello, banned1. This is REDACTED. What is going on with
# REDACTED? Hopefully REDACTED is alright."
The repl argument of re.sub can take a function that takes a match object and returns the replacement string. Here is my solution:
import re
sample_email = open('email.txt', 'r').read()
# First use of any of these words is allowed; those following are banned
problem_words = ['banned1', 'banned2', 'banned3']
pattern = '|'.join(f'\\b{w}\\b' for w in problem_words)
occurrences = 0
def redact(match):
    global occurrences
    occurrences += 1
    if occurrences > 1:
        return "REDACTED"
    return match.group(0)
replaced = re.sub(pattern, redact, sample_email, flags=re.IGNORECASE)
print(replaced)
(As a further note, str.count doesn't support regex patterns, but there is no need to count here anyway.)
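If you ever do need to count regex matches rather than plain substrings, one simple way is len(re.findall(...)); a quick sketch with the sample text hard-coded for illustration:
import re

text = "Hello, banned1. This is banned2. Hopefully banned1 is alright."
print(len(re.findall(r'\bbanned1\b', text, flags=re.IGNORECASE)))  # 2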
I am working on a transpiler and want to replace my language's tokens with those of Python. The substitution is done like so:
for rep in reps:
    pattern, translated = rep
    # Replaces every [pattern] with [translated] in [transpiled]
    transpiled = re.sub(pattern, translated, transpiled, flags=re.UNICODE)
Where reps is a list of (regex to be replaced, string to replace it with) ordered pairs and transpiled is the text to be transpiled.
However, I can't seem to find a way to exclude text between quotes from the substitution process. Please note that this is for a language, so it should work for escaped quotes and single quotes as well.
This may depend on how you define your patterns, but in general you can always surround your pattern with a lookahead and a lookbehind group to ensure that text between quotes is not matched:
import re

transpiled = "A foo with \"foo\" and single quoted 'foo'. It even has an escaped \\'foo\\'!"
reps = [("foo", "bar"), ("and", "or")]

print(transpiled)  # before the changes
for rep in reps:
    pattern, translated = rep
    transpiled = re.sub("(?<=[^\"']){}(?=\\\\?[^\"'])".format(pattern),
                        translated, transpiled, flags=re.UNICODE)
    print(transpiled)  # after each change
Which will yield:
A foo with "foo" and single quoted 'foo'. It even has an escaped \'foo\'!
A bar with "foo" and single quoted 'foo'. It even has an escaped \'foo\'!
A bar with "foo" or single quoted 'foo'. It even has an escaped \'foo\'!
UPDATE: If you want to ignore whole quoted swaths of text, not just a quoted word, you'll have to do a bit more work. While you could do it by looking for repeated quotations, the whole lookahead/lookbehind mechanism would get really messy and probably far from optimal. It's easier to separate the quoted from the non-quoted text and do replacements only in the latter, something like:
import re

QUOTED_STRING = re.compile("(\\\\?[\"']).*?\\1")  # a pattern to match strings between quotes

def replace_multiple(source, replacements, flags=0):  # a convenience replacement function
    if not source:  # no need to process empty strings
        return ""
    for r in replacements:
        source = re.sub(r[0], r[1], source, flags=flags)
    return source

def replace_non_quoted(source, replacements, flags=0):
    result = []  # a store for the result pieces
    head = 0  # a search head reference
    for match in QUOTED_STRING.finditer(source):
        # process everything until the current quoted match and add it to the result
        result.append(replace_multiple(source[head:match.start()], replacements, flags))
        result.append(match[0])  # add the quoted match verbatim to the result
        head = match.end()  # move the search head to the end of the quoted match
    if head < len(source):  # if the search head is not at the end of the string
        # process the rest of the string and add it to the result
        result.append(replace_multiple(source[head:], replacements, flags))
    return "".join(result)  # join back the result pieces and return them
You can test it as:
print(replace_non_quoted("A foo with \"foo\" and 'foo', says: 'I have a foo'!", reps))
# A bar with "foo" or 'foo', says: 'I have a foo'!
print(replace_non_quoted("A foo with \"foo\" and foo, says: \\'I have a foo\\'!", reps))
# A bar with "foo" or bar, says: \'I have a foo\'!
print(replace_non_quoted("A foo with '\"foo\" and foo', says - I have a foo!", reps))
# A bar with '"foo" and foo', says - I have a bar!
As an added bonus, this also allows you to define fully qualified regex patterns as your replacements:
print(replace_non_quoted("My foo and \"bar\" are like 'moo' and star!",
(("(\w+)oo", "oo\\1"), ("(\w+)ar", "ra\\1"))))
# My oof and "bar" are like 'moo' and rast!
But if your replacements do not involve patterns and need just a simple substitution you can replace the re.sub() in the replace_multiple() helper function with the significantly faster native str.replace().
Finally, you can get rid of regex completely if you don't need complex patterns:
QUOTE_STRINGS = ("'", "\\'", '"', '\\"')  # a list of substrings considered a 'quote'

def replace_multiple(source, replacements):  # a convenience multi-replacement function
    if not source:  # no need to process empty strings
        return ""
    for r in replacements:
        source = source.replace(r[0], r[1])
    return source

def replace_non_quoted(source, replacements):
    result = []  # a store for the result pieces
    head = 0  # a search head reference
    eos = len(source)  # a convenience string length reference
    quote = None  # last quote match literal
    quote_len = 0  # a convenience reference to the current quote substring length
    while True:
        if quote:  # we already have a matching quote stored
            index = source.find(quote, head + quote_len)  # find the closing quote
            if index == -1:  # EOS reached
                break
            result.append(source[head:index + quote_len])  # add the quoted string verbatim
            head = index + quote_len  # move the search head after the quoted match
            quote = None  # blank out the quote literal
        else:  # the current position is not in a quoted substring
            index = eos
            # find the first quoted substring from the current head position
            for entry in QUOTE_STRINGS:  # loop through all quote substrings
                candidate = source.find(entry, head)
                if head < candidate < index:
                    index = candidate
                    quote = entry
                    quote_len = len(entry)
            if not quote:  # EOS reached, no quote found
                break
            result.append(replace_multiple(source[head:index], replacements))
            head = index  # move the search head to the start of the quoted match
    if head < eos:  # if the search head is not at the end of the string
        result.append(replace_multiple(source[head:], replacements))
    return "".join(result)  # join back the result pieces and return them
Rather than just using regexes, you probably want to use Python's built-in shlex module. It's designed for handling quoted strings like you find in a shell, including nested examples.
import shlex
shlex.split("""look "nested \\"quotes\\"" here""")
# ['look', 'nested "quotes"', 'here']
For example, if the string is "abbacdeffel" and the pattern "xyyx" is replaced with "1234",
the result would go from "abbacdeffel" to "1234cd1234l".
I have tried to think this out but I couldn't come up with anything. At first I thought maybe a dictionary could help, but still nothing came to mind.
What you're looking to do can be accomplished with regex, short for regular expressions. Regular expressions let you extract exactly what you want from a string.
In your case, you want to match the string against the pattern abba, so you can use the following regex:
(\w+)(\w+)\2\1
https://regex101.com/r/hP8lA3/1
You can match two word groups and use backreferences to make sure that the second group comes first, then the first group.
So implementing this in python code looks like this:
First, import the re module:
import re
Then, declare your variable
text = "abbacdeffel"
re.finditer returns an iterator of match objects, so you can step through every match:
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches the regex found and replace each matched substring with "1234":
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re

text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline = " + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (if you need other ranges, change both [a-zA-Z] classes as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
'1234',
'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
'1234',
'abbacdeffelzzzz')
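To see what the (?!\1) guard described above actually changes, here is a small comparison (the 'aaaa' input is just for illustration):
import re

with_guard    = r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1'
without_guard = r'([a-zA-Z])([a-zA-Z])\2\1'

print(re.sub(with_guard, '1234', 'aaaa'))     # aaaa - left alone, the two letters must differ
print(re.sub(without_guard, '1234', 'aaaa'))  # 1234 - 'aaaa' itself fits the abba shape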
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re

def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c) + 1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result

print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz
I am defining a regex to match my defined identifiers - an identifier has to start with a letter followed by any number of letters, numbers, and underscores.
I have my current regex r'[A-Za-z][A-Za-z0-9_]*' and it works great except for cases like this: if I send in testid#entifier_, it returns matches for testid and entifier_. I want it to completely reject the identifier, not match parts of it.
It just ends up splitting them.
What can I do without using a complex look-ahead for legal chars?
Input is simply:
arg = sys.argv[1]
file = open(arg)
inLines = file.read()
file.close()
tokens = lexer(inLines, tokenFormats)
A sample of my defined regexes looks like this:
tokenFormats = [
    (r'[\s\n\t]+', None),                 # Whitespace
    (r'\/\*(\*(?!\/)|[^*])*\*\/', None),  # Comment
    (r'\(', LParent),
    (r'\)', RParent),
    (r'\[', LBracket),
    (r'\]', RBracket),
    (r'\{', LBrace),
    (r'\}', RBrace),
    (r'\,', CommaT),
    (r'(?<="{1}).*?(?=")', STRLITERAL),
    (r'\"', QuoteT),
    (r'\.', PeriodT),
    (r'\-?[0-9]*\.[0-9]+', ValueR),
    (r'\+', AddT),
    (r'-', AddT),
    (r'\|\|', AddT),
    (r';', Semicolon),
My matching loop is like this:
def lexer(input, tokenFormats):
    pos = 0
    tokens = []
    while pos < len(input):
        match = None
        for tokenFormat in tokenFormats:
            pattern, tag = tokenFormat
            regex = re.compile(pattern)
            match = regex.match(input, pos)  # Essentially build lexeme
            if match:
                lexeme = match.group(0)
                if tag:
                    if tag == Identifier and len(str(lexeme)) > 27:  # rough fix to check length. Very hacky
                        sys.stderr.write('Illegal length for identifier: %s\n' % lexeme)
                        break
                    attr = checkForAttribute(lexeme, tag)
                    token = (lexeme, tag, attr)
                    tokens.append(token)
                    break
                else:
                    break
        if not match:
            sys.stderr.write('Illegal or unknown character: %s\n' % input[pos])
            pos = pos + 1
        else:
            pos = match.end(0)
    return tokens
Try anchoring your expression:
r'^[A-Za-z][A-Za-z0-9_]*$'
This requires that the entire identifier matches the expression, not just part of it because you are anchoring the expression to the beginning and end of the string. This prevents part of the string from matching.
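For a whole-string check, re.fullmatch has the same effect as anchoring with ^ and $; a minimal sketch using the identifier pattern from the question:
import re

IDENT = re.compile(r'[A-Za-z][A-Za-z0-9_]*')

print(bool(IDENT.fullmatch('testid')))            # True
print(bool(IDENT.fullmatch('testid#entifier_')))  # False - rejected as a whole, not split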
If the # symbol is your only concern, try this r'[a-zA-Z]#?[a-zA-Z0-9_]+'.
If you want to allow the # as well you could use the following regex:
r'[A-Za-z][A-Za-z0-9_]*#?[A-Za-z0-9_]*'
tested: https://regex101.com/r/vlt8qo/3/
However, following the description of your problem:
I am defining a regex to match my defined identifiers - an identifier has to start with a letter followed by any number of letters, numbers, and underscores.
it looks like there is some inconsistency, since # is not defined as part of your identifiers...
Following your edit in the post:
I have adapted my regex to ->
r'(?<=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])[A-Za-z][A-Za-z0-9_]*(?=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])|^[A-Za-z][A-Za-z0-9_]*(?=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])'
and tested it on several patterns #
https://regex101.com/r/vlt8qo/5/
I need to replace the value inside a capture group of a regular expression with some arbitrary value; I've had a look at the re.sub, but it seems to be working in a different way.
I have a string like this one :
s = 'monthday=1, month=5, year=2018'
and I have a regex matching it with captured groups like the following :
regex = re.compile('monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
now I want to replace the group named d with aaa, the group named m with bbb and group named Y with ccc, like in the following example :
'monthday=aaa, month=bbb, year=ccc'
basically I want to keep all the non matching string and substitute the matching group with some arbitrary value.
Is there a way to achieve the desired result ?
Note
This is just an example; I could have other input regexes with a different structure, but the same named capturing groups...
Update
Since it seems like most people are focusing on the sample data, I'll add another sample. Let's say I have this other input data and regex:
input = '2018-12-12'
regex = '((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
as you can see, I still have the same number of capturing groups (3) and they are named the same way, but the structure is totally different. What I need, though, is the same as before: replacing each capturing group with some arbitrary text:
'ccc-bbb-aaa'
i.e. replace the capture group named Y with ccc, the one named m with bbb and the one named d with aaa.
In case regexes are not the best tool for the job, I'm open to other proposals that achieve my goal.
This is a completely backwards use of regex. The point of capture groups is to hold text you want to keep, not text you want to replace.
Since you've written your regex the wrong way, you have to do most of the substitution operation manually:
import re

def replace_groups(pattern, string, replacements):
    """Replaces the text captured by named groups."""
    pattern = re.compile(pattern)
    # create a dict of {group_index: group_name} for use later
    groupnames = {index: name for name, index in pattern.groupindex.items()}

    def repl(match):
        # we have to split the matched text into chunks we want to keep and
        # chunks we want to replace:
        # captured text will be replaced, uncaptured text will be kept
        text = match.group()
        offset = match.start()  # group positions refer to the whole string, so shift them relative to the match
        chunks = []
        lastindex = 0
        for i in range(1, pattern.groups + 1):
            groupname = groupnames.get(i)
            if groupname not in replacements:
                continue
            # keep the text between this group and the previous one
            chunks.append(text[lastindex:match.start(i) - offset])
            # then instead of the captured text, insert the replacement text for this group
            chunks.append(replacements[groupname])
            lastindex = match.end(i) - offset
        chunks.append(text[lastindex:])
        # join all the chunks to obtain the final string with replacements
        return ''.join(chunks)

    # for each occurrence call our custom replacement function
    return re.sub(pattern, repl, string)
>>> replace_groups(pattern, s, {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
'monthday=aaa, month=bbb, year=ccc'
You can use string formatting with a regex substitution:
import re
s = 'monthday=1, month=5, year=2018'
s = re.sub('(?<=\=)\d+', '{}', s).format(*['aaa', 'bbb', 'ccc'])
Output:
'monthday=aaa, month=bbb, year=ccc'
Edit: for the second sample input you can apply the same idea, replacing each numeric field with a placeholder and passing the values in the order the fields appear in the string (Y, m, d here):
input = '2018-12-12'
new_s = re.sub(r'\d+', '{}', input).format(*["ccc", "bbb", "aaa"])
# => 'ccc-bbb-aaa'
An extended Python 3.x solution for the extended example (re.sub() with a replacement function):
import re
d = {'d':'aaa', 'm':'bbb', 'Y':'ccc'} # predefined dict of replace words
pat = re.compile('(monthday=)(?P<d>\d{1,2})|(month=)(?P<m>\d{1,2})|(year=)(?P<Y>20\d{2})')
def repl(m):
    pair = next(t for t in m.groupdict().items() if t[1])
    k = next(filter(None, m.groups()))  # preceding `key` for the currently replaced sequence (i.e. 'monthday=' or 'month=' or 'year=')
    return k + d.get(pair[0], '')
s = 'Data: year=2018, monthday=1, month=5, some other text'
result = pat.sub(repl, s)
print(result)
The output:
Data: year=ccc, monthday=aaa, month=bbb, some other text
For Python 2.7:
change the line k = next(filter(None, m.groups())) to:
k = filter(None, m.groups())[0]
I suggest you use a loop
import re
regex = re.compile('monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
s = 'monthday=1, month=1, year=2017 \n'
s+= 'monthday=2, month=2, year=2019'
regex_as_str = 'monthday={d}, month={m}, year={Y}'
matches = [match.groupdict() for match in regex.finditer(s)]
for match in matches:
    s = s.replace(
        regex_as_str.format(**match),
        regex_as_str.format(**{'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
    )
You can do this multiple times with your different regex patterns.
Or you can join ("or") the patterns together.
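A rough sketch of the "join the patterns with |" idea, assuming you rename the groups in the second alternative (Python's re module does not allow the same group name to appear twice in one pattern):
import re

pat_a = r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})'
pat_b = r'(?P<Y2>20\d{2})-(?P<m2>[0-1]?\d)-(?P<d2>\d{2})'
combined = re.compile(pat_a + '|' + pat_b)

for m in combined.finditer('monthday=1, month=5, year=2018 and 2018-12-12'):
    # groupdict() holds None for the names that did not take part in this match
    print({k: v for k, v in m.groupdict().items() if v is not None})
# {'d': '1', 'm': '5', 'Y': '2018'}
# {'Y2': '2018', 'm2': '12', 'd2': '12'}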