I want to provide alternative replacement patterns to re.sub.
Let's say I've got two search patterns as alternatives, like this:
re.sub(r"[A-Z]+|[a-z]+", replacementpattern, string)
and instead of providing one replacement pattern I would like to somehow catch which search pattern alternative was matched and provide alternative replacement patterns.
Is this possible?
Thanks.
PS. code specifics here are irrelevant, it's a general question.
You can pass a function to re.sub(). Inside the function, return the replacement you need based on which group captured something. A simple illustration:
>>> import re
>>> def fun(m):
...     # re.sub() only calls fun() for successful matches, so m is always a match object
...     if m.group(1):
...         return 'x'
...     else:
...         return 'y'
...
>>> print(re.sub(r"([A-Z]+)|([a-z]+)", fun, "ab"))
y
The function fun() receives the match object and picks the replacement string based on which group captured: if [A-Z]+ matched, group 1 is set and x is the replacement; otherwise [a-z]+ matched and y is the replacement.
For more information, see the re.sub() documentation.
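With more than two alternatives, a chain of if/else on group numbers gets unwieldy. One possible variation, sketched below, names each alternative and looks the replacement up by m.lastgroup. The group names, the REPLACEMENTS dict, and pick_replacement are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical mapping from group name to replacement text.
REPLACEMENTS = {"upper": "x", "lower": "y"}

PATTERN = re.compile(r"(?P<upper>[A-Z]+)|(?P<lower>[a-z]+)")

def pick_replacement(m):
    # m.lastgroup is the name of the last capturing group that matched;
    # with one named group per alternative, that identifies the alternative.
    return REPLACEMENTS[m.lastgroup]

result = PATTERN.sub(pick_replacement, "ab CD ef")
print(result)  # y x y
```

Adding another alternative then only needs a new named group and a new dict entry.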
Usually, you would just use two replacements:
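string = re.sub(r"[A-Z]+", replacement1, string)
string = re.sub(r"[a-z]+", replacement2, string)
Anticlimactic, right?
It's usually less code than the alternatives, and it's far clearer what you're doing. Just keep in mind that the second call also scans the text inserted by the first, so the two calls only behave like a single pass when replacement1 cannot itself match [a-z]+.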
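Whether two separate calls are safe depends on the replacement text. The sketch below uses made-up replacement strings ("up" and "<lo>") to show a case where two passes and a single pass with a callback give different results.

```python
import re

text = "AB cd"

# Two passes: the second call also rewrites the "up" inserted by the first.
two_pass = re.sub(r"[A-Z]+", "up", text)
two_pass = re.sub(r"[a-z]+", "<lo>", two_pass)

# One pass with a callback: each match is replaced exactly once.
def repl(m):
    return "up" if m.group(1) else "<lo>"

one_pass = re.sub(r"([A-Z]+)|([a-z]+)", repl, text)

print(two_pass)  # <lo> <lo>
print(one_pass)  # up <lo>
```

If the first replacement can never match the second pattern, the two approaches agree and the simpler two-call version is fine.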
Related
Using Python's re.sub, is there a way to extract the first run of alphanumeric characters and disregard the rest from a string that starts with a special character and might have special characters in the middle? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.
re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words, it replaces the whole string with just what was matched by the part of the regexp inside the parentheses. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* is greedy (it matches as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
    print(matchobj.group(1))
else:
    print("did not match")
But the question called for the use of re.sub.
Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
...     print(res.group())
...
my
This is not a complete answer. With re.findall, [A-Za-z]+ will give you ['my', 'name'], so you still need to take the first element.
Use this to further explore: https://regex101.com/
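To make the re.findall route concrete, here is a small sketch that keeps only the first alphanumeric run and returns None when there is none. The helper name first_alnum_run is made up for this example.

```python
import re

def first_alnum_run(s):
    # findall returns every alphanumeric run; we only want the first one.
    runs = re.findall(r"[A-Za-z0-9]+", s)
    return runs[0] if runs else None

print(first_alnum_run("#my,name"))  # my
print(first_alnum_run("#my"))       # my
print(first_alnum_run("#,!"))       # None
```

re.search does the same job without building the whole list, which is why it is usually the better fit when only the first run matters.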
I have a string "F(foo)", and I'd like to replace it with "F('foo')". I know I can hard-code the replacement with re.sub(r"F\(foo\)", r"F('foo')", str). The problem is that foo is a dynamic string; it is different every time I want to do this replacement. Is there some way to do this replacement more cleanly with a regex?
I remember one way is to extract foo using () and then .group(1), but this would require one more temporary variable just to store foo. I'm curious whether there is a way to replace "F(foo)" with "F('foo')" in a single line, or in other words, in a cleaner way.
Examples :
F(name) should be replaced with F('name').
F(id) should be replaced with F('id').
G(name) should not be replaced.
So, the regex would be r"F\((\w)+\)" to find such strings.
Using re.sub
Ex:
import re
s = "F(foo)"
print(re.sub(r"\((.*)\)", r"('\1')", s))
Output:
F('foo')
The following regex encloses valid Python, C, or Java identifiers that follow F in parentheses in single quotation marks (note the * rather than +, so that one-character identifiers such as x also match):
re.sub(r"F\(([_a-z][_a-z0-9]*)\)", r"F('\1')", s, flags=re.I)
# "F('foo')"
There are several ways, depending on what foo actually is.
If it can't contain ( or ), you can just replace ( with (' and ) with '). Otherwise, try using
re.sub(r"F\((.*)\)", r"F('\1')", yourstring)
where the \1 in the replacement refers to the (.*) capture group in the search regex.
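One thing to watch with (.*) is that it is greedy, so it can swallow several calls on the same line. The sketch below uses a made-up input with two F(...) calls to show the difference between .* and a character class that excludes parentheses.

```python
import re

text = "F(a) + F(b)"

# Greedy .* runs to the last ")" on the line, so both calls are merged.
greedy = re.sub(r"F\((.*)\)", r"F('\1')", text)

# [^()]* stops at the first closing parenthesis.
strict = re.sub(r"F\(([^()]*)\)", r"F('\1')", text)

print(greedy)  # F('a) + F(b')
print(strict)  # F('a') + F('b')
```

If the argument can never contain parentheses, the stricter class is the safer default.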
In your pattern F\((\w)+\) you are almost there; you just need to move the quantifier + inside the capturing group, directly after \w, so that the group matches one or more word characters.
With the quantifier after the capturing group, you repeat the group itself, and the group keeps only the value of its last iteration, which would be the second o in foo.
You could update your expression to:
F\((\w+)\)
And in the replacement refer to the capturing group using \1
F('\1')
For example:
import re
str = "F(foo)"
print(re.sub(r"F\((\w+)\)", r"F('\1')", str)) # F('foo')
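The difference between (\w)+ and (\w+) is easy to check directly. This short sketch compares what group 1 holds in each case and confirms that G(name) from the question is left alone; the variable names are only for illustration.

```python
import re

# Quantifier outside the group: only the last repetition is kept.
last_only = re.match(r"F\((\w)+\)", "F(foo)").group(1)

# Quantifier inside the group: the whole identifier is captured.
whole = re.match(r"F\((\w+)\)", "F(foo)").group(1)

# Strings that do not start with F( are returned unchanged.
untouched = re.sub(r"F\((\w+)\)", r"F('\1')", "G(name)")

print(last_only)  # o
print(whole)      # foo
print(untouched)  # G(name)
```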
In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
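Two related forms are worth knowing alongside \1: the \g<...> syntax, which works with names and avoids ambiguity when a digit follows the reference, and a callable replacement for cases where the matched text has to be transformed. The sketch below reuses the 'zebra432' input; the group name num is made up for this example.

```python
import re

# Named group referenced with \g<name>.
named = re.sub(r'[a-z]*(?P<num>\d+)', r'lion\g<num>', 'zebra432')

# \g<1>0 means "group 1, then a literal 0"; r'\10' would be read as group 10.
padded = re.sub(r'[a-z]*(\d+)', r'\g<1>0', 'zebra432')

# A callable receives the match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'zebra432')

print(named)    # lion432
print(padded)   # 4320
print(doubled)  # zebra864
```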
Given a regex like r'a (\w+) regex', I know I can capture the group, but given a captured group I want to then sub it back into the regex. I've included below a function I've built to do this, but because I'm no expert at regular expressions I'm wondering if there is a more standard implementation of such behavior, or what the "best practice" would be.
def reverse_capture(regex_string, args, kwargs):
    regex_string = str(regex_string)
    if not args and not kwargs:
        raise ValueError("at least one of args or kwargs must be empty in reverse_capture")
    if kwargs:
        for kwarg in kwargs:
            regex_string = re.sub(r'(?:[^\\[]|[^\\](?:\\\\)+|[^\\](?:\\\\)*\\\[)\(\?P<.+>.+(?:[^\\[]|[^\\](?:\\\\)+|[^\\](?:\\\\)*\\\[)\)',
                                  kwarg,
                                  regex_string)
    elif args:
        for arg in args:
            regex_string = re.sub(r'(?:[^\\[]|[^\\](?:\\\\)+|[^\\](?:\\\\)*\\\[)\(.+(?:[^\\[]|[^\\](?:\\\\)+|[^\\](?:\\\\)*\\\[)\)',
                                  arg,
                                  regex_string)
    else:
        return regex_string
Note: the above function doesn't actually work yet, because I figured before I try covering every single case I should ask on this site.
EDIT:
I think I should clarify what I mean a bit. My goal is to write a python function such that, given a regex like r"ab(.+)c" and an argument like, "Some strinG", we can have the following:
>>> reverse_capture(r"ab(.+)c", "Some strinG")
"abSome strinGc"
That is to say, the argument will be substituted into the regex where the capture group is. There are definitely better ways to format strings; however, the regexes are given in my use case, so this is not an option.
For anyone who's curious, what I'm trying to do is create a Django package that uses a template tag to find the regex associated with some view function or named URL, optionally fill in some arguments, and then check whether the URL the template was accessed from matches the URL generated by the tag. This will solve some navigation problems. There's a simpler package which does something similar, but it doesn't serve my use case.
Examples:
If reverse_capture is the function I'm trying to write, then here are some examples of input/output (I pass in the regexes as raw strings), as well as the function call:
reverse_capture : regex string -> regex
input: a regex and a string
output: the regex obtained by replacing the first capture group of regex with the argument, string.
examples:
>>> reverse_capture(r'(.+)', 'TEST')
'TEST'
>>> reverse_capture(r'a longer (.+) regex', 'TEST')
'a longer TEST regex'
>>> reverse_capture(r'regex with two (.+) capture groups(.+)', 'TEST')
'regex with two TEST capture groups(.+)'
Parsing regexes can be kind of complicated. Rather than trying to parse the regex to figure out where you need to substitute the matches, why not build the regex from a format string with convenient places to string-format the matches right in?
Here's an example template:
>>> regex_template = r'{} lives at {} Baker Street.'
We insert capturing groups to build the regex:
>>> import re
>>> word_group = r'(\w+)'
>>> digit_group = r'(\d+)'
>>> regex = regex_template.format(word_group, digit_group)
Match it against a string:
>>> groups = re.match(regex, 'Alfred lives at 325 Baker Street.').groups()
>>> groups
('Alfred', '325')
And string-format the matches into place:
>>> regex_template.format(*groups)
'Alfred lives at 325 Baker Street.'
For anyone coming across this question in the future, after I searched around, it appeared that there were no good library functions for substituting values into a regex's capture groups.
The easiest way to solve this problem/write your own function, is to make a DFA (Deterministic Finite Automaton), which isn't very hard.
If you are determined to solve it with regexes, you can convert your DFA into a regex using the answers to this question, which is how I ended up implementing my own solution.
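A hand-written scanner in that spirit could look like the sketch below. It is not the implementation referred to above, only a minimal state machine under stated assumptions: it tracks backslash escapes, character classes, and nesting depth, treats any group starting with "(?" other than "(?P<" as non-capturing, and does not handle a literal "]" placed first inside a character class.

```python
def reverse_capture(regex_string, replacement):
    """Replace the first capturing group of regex_string with replacement."""
    start = None      # index of the "(" that opens the first capturing group
    depth = 0         # nesting depth counted from that group
    in_class = False  # True while inside a [...] character class
    i = 0
    while i < len(regex_string):
        ch = regex_string[i]
        if ch == "\\":
            i += 2    # skip the backslash and the character it escapes
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "(":
            if start is not None:
                depth += 1
            elif (not regex_string.startswith("(?", i)
                  or regex_string.startswith("(?P<", i)):
                start, depth = i, 1
        elif ch == ")" and start is not None:
            depth -= 1
            if depth == 0:
                return regex_string[:start] + replacement + regex_string[i + 1:]
        i += 1
    return regex_string  # no capturing group found


print(reverse_capture(r"a longer (.+) regex", "TEST"))  # a longer TEST regex
```

The examples from the question give the expected results with this sketch; patterns that rely on features outside the assumptions listed above would need a fuller parser.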
I am trying to write a generic replace function for a regex sub operation in Python (trying in both 2 and 3), where the user can provide a regex pattern and a replacement for the match. The replacement could be anything from a simple string to one that uses the groups from the match.
In the end, I get from the user a dictionary in this form:
regex_dict = {pattern:replacement}
When I replace all the occurrences of a pattern with the following call, the replacement works, including replacements that reference a group number (such as \1):
re.sub(pattern, regex_dict[pattern], text)
This works as expected, but I need to do additional stuff when a match is found. Basically, what I try to achieve is as follows:
def replace_function(matchobj):
    result = regex_dict[matchobj.re]
    ##
    ## Do some other things
    ##
    return result
re.sub(pattern, replace_function, text)
This works for plain string replacements, but when a function is used, re.sub does not expand group references such as \1 in the string the function returns.
I also tried to convert the \1 pattern to \g<1>, hoping that the re.sub would understand it, but to no avail.
Am I missing something vital?
Thanks in advance!
Additional notes: I compile the patterns from byte strings, and the replacements are also bytes. I have non-Latin characters in my patterns, but I read everything as bytes, including the text that the regex substitution operates on.
EDIT
Just to clarify, I do not know in advance what kind of replacement the user will provide. It could be some combination of normal strings and groups, or just a string replacement.
SOLUTION
def replace_function(matchobj):
    repl = regex_dict[matchobj.re]
    ##
    ## Do some other things
    ##
    return matchobj.expand(repl)
re.sub(pattern, replace_function, text)
I suspect you're after .expand. If you've got a match object (from a compiled regex, for instance), you can pass it a replacement template, and the group references in the template are filled in from the match, e.g.:
import re
text = 'abc'
# This would be your key in the dict
rx = re.compile(r'a(\w)c')
# This would be the value for the key (the replacement string, e.g. `\1\1\1`)
res = rx.match(text).expand(r'\1\1\1')
print(res)  # bbb
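Putting the pieces together, an end-to-end version of the dictionary approach could look like the sketch below. The patterns, templates, sample text, and the matches_seen list (standing in for the "other things" done per match) are all made up for illustration; the dictionary is keyed by compiled pattern so that matchobj.re can be used for the lookup.

```python
import re

# Hypothetical user-supplied mapping: compiled pattern -> replacement template.
regex_dict = {
    re.compile(r"(\w+)@example\.com"): r"<\1>",
    re.compile(r"\d+"): "N",
}

matches_seen = []  # stands in for the extra work done on every match

def replace_function(matchobj):
    repl = regex_dict[matchobj.re]            # look up the template for this pattern
    matches_seen.append(matchobj.group(0))    # the additional per-match processing
    return matchobj.expand(repl)              # fill in \1, \g<name>, etc.

text = "bob@example.com has 42 items"
for pattern in regex_dict:
    text = pattern.sub(replace_function, text)

print(text)          # <bob> has N items
print(matches_seen)  # ['bob@example.com', '42']
```

Templates without group references pass through expand() unchanged, so the same function covers both plain and group-based replacements.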