In many programming languages, the following
find foo([a-z]+)bar and replace with GOO\U\1GAR
will result in the entire match being made uppercase. I can't seem to find the equivalent in Python; does it exist?
You can pass a function to re.sub() to do this. Here is an example:
import re

def upper_repl(match):
    return 'GOO' + match.group(1).upper() + 'GAR'
And an example of using it:
>>> re.sub(r'foo([a-z]+)bar', upper_repl, 'foobazbar')
'GOOBAZGAR'
Unfortunately this \U\1 syntax could never work in Python because \U in a string literal indicates the beginning of an eight-hex-digit (32-bit) Unicode escape sequence. For example, "\U0001f4a9" == "💩".
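To see both halves of that in practice, here is a small sketch. It assumes Python 3.7 or later, where an unknown letter escape in a replacement template raises re.error; older versions behave differently, so treat the second part as version-dependent.

```python
import re

# In a normal (non-raw) string literal, \U must be followed by exactly
# eight hex digits that name a single Unicode code point.
poo = "\U0001f4a9"
print(len(poo), hex(ord(poo)))  # 1 0x1f4a9

# A raw string keeps the backslash, but the re module has no \U escape in
# replacement templates. Assumption: Python 3.7+, where this is rejected.
try:
    re.sub(r"foo([a-z]+)bar", r"GOO\U\1GAR", "foobazbar")
    template_rejected = False
except re.error:
    template_rejected = True

print(template_rejected)
```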
However, there is an easy alternative to Perl's case conversion escapes: use a replacement function. In re.sub(pattern, repl, string, count=0, flags=0) the replacement repl is usually a string, but it can also be a callable. If it is a callable, it's passed the Match object and must return a replacement string to be used.
So, for the example given in the question, this is possible:
>>> string = "fooquuxbar"
>>> pattern = "foo([a-z]+)bar"
>>> re.sub(pattern, lambda m: f"GOO{m.group(1).upper()}GAR", string)
'GOOQUUXGAR'
Here is a table of other string methods which might be useful for similar case conversions.
Modifier  Description      Example              Python callable to use
\U        Uppercase        foo BAR --> FOO BAR  str.upper
\L        Lowercase        foo BAR --> foo bar  str.lower or str.casefold
\I        Initial capital  foo BAR --> Foo Bar  str.title
\F        First capital    foo BAR --> Foo bar  str.capitalize
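As a rough illustration of the table, the sketch below applies each callable to the whole match inside a replacement function. The pattern r".+" and the results dictionary are only there for the demonstration.

```python
import re

text = "foo BAR"

# Run the same substitution once per string method from the table,
# applying the method to the entire match (group 0).
results = {
    func.__name__: re.sub(r".+", lambda m: func(m.group(0)), text)
    for func in (str.upper, str.lower, str.title, str.capitalize)
}

for name, converted in results.items():
    print(f"str.{name}: {converted}")
# str.upper: FOO BAR
# str.lower: foo bar
# str.title: Foo Bar
# str.capitalize: Foo bar
```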
If you already have a replacement string (template), you may not be keen on swapping it out with the verbosity of m.group(1)+...+m.group(2)+...+m.group(3)... Sometimes it's nice to have a tidy little string.
You can use the Match object's expand() method to evaluate a template for the match in the same manner as sub(), allowing you to retain as much of your original template as possible. You can then call upper() on the relevant pieces.
re.sub(r'foo([a-z]+)bar', lambda m: 'GOO' + m.expand(r'\1GAR').upper(), 'foobazbar')
While this would not be particularly useful in the example above, and while it does not aid with complex circumstances, it may be more convenient for longer expressions with a greater number of captured groups, such as a MAC address censoring regex, where you just want to ensure the full replacement is capitalized or not.
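For instance, a MAC address censoring substitution along those lines might look like the sketch below. The pattern, the template and the censor helper are made up for illustration; only Match.expand() and Pattern.sub() come from the re module.

```python
import re

# Hypothetical pattern: six colon-separated hex pairs, each in its own group.
MAC = re.compile(
    r"([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):"
    r"([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2})",
    re.IGNORECASE,
)

# The template stays a tidy string: keep the first three pairs, mask the rest.
TEMPLATE = r"\1:\2:\3:xx:xx:xx"

def censor(text):
    # expand() fills in the template for this match; upper() then
    # capitalizes the full replacement in one go.
    return MAC.sub(lambda m: m.expand(TEMPLATE).upper(), text)

print(censor("eth0 at 00:1a:2b:3c:4d:5e up"))
# eth0 at 00:1A:2B:XX:XX:XX up
```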
You could use some variation of this:
import re
s = 'foohellobar'

def replfunc(m):
    # Keep groups 1 and 3 as they are and uppercase group 2.
    return m.group(1) + m.group(2).upper() + m.group(3)

re.sub(r'(foo)([a-z]+)(bar)', replfunc, s)
gives the output:
'fooHELLObar'
For those coming across this on Google...
You can also use re.sub to match repeating patterns. For example, you can convert a string with spaces to camelCase:
import re

def to_camelcase(string):
    string = string[0].lower() + string[1:]  # lowercase the first character
    return re.sub(
        r'\s+(?P<first>[a-z])',              # whitespace followed by a lowercase letter
        lambda m: m.group('first').upper(),  # drop the whitespace, uppercase the letter
        string)

to_camelcase('String to convert')  # --> 'stringToConvert'
I have a string S = '02143' and a list A = ['a','b','c','d','e']. I want to replace all those digits in 'S' with their corresponding element in list A.
For example, replace 0 with A[0], 2 with A[2] and so on. Final output should be S = 'acbed'.
I tried:
S = re.sub(r'([0-9])', A[int(r'\g<1>')], S)
However this gives an error ValueError: invalid literal for int() with base 10: '\\g<1>'. I guess it is considering backreference '\g<1>' as a string. How can I solve this especially using re.sub and capture-groups, else alternatively?
The reason re.sub(r'([0-9])',A[int(r'\g<1>')],S) does not work is that \g<1> (an unambiguous way of writing the first backreference, otherwise written \1) is only interpreted when it appears in a string replacement pattern. If you pass it to another function, that function just sees the literal string \g<1>, because the re module has had no chance to evaluate it. The re engine only expands it during a match, but the A[int(r'\g<1>')] part is evaluated before the re engine attempts to find a match.
That is why it is made possible to use callback methods inside re.sub as the replacement argument: you may pass the matched group values to any external methods for advanced manipulation.
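The order of evaluation can be checked directly. This is only a sketch reusing S and A from the question: the first part shows the ValueError being raised while Python is still building the arguments, and the second shows what the \g<1> form is actually for inside a replacement string.

```python
import re

S = "02143"
A = ["a", "b", "c", "d", "e"]

# Python evaluates the argument expression first, on the literal text \g<1>,
# so int() fails before re.sub() is ever called.
try:
    re.sub(r"([0-9])", A[int(r"\g<1>")], S)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)  # ValueError

# Inside a replacement string, \g<1>0 means "group 1, then a literal 0",
# whereas \10 would be read as a reference to group 10.
disambiguated = re.sub(r"([0-9])", r"\g<1>0", "12")
print(disambiguated)  # 1020
```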
See the re documentation:
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
Use
import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))
Note that you do not need to capture the whole pattern with parentheses; you can access the whole match with x.group().
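A short comparison, as a sketch: the whole-match form, the capture-group form, and, since the question also asked for an alternative, a plain join that skips re entirely. The variable names are just for illustration.

```python
import re

S = "02143"
A = ["a", "b", "c", "d", "e"]

# m.group() with no argument is the whole match (same as m.group(0)).
via_group0 = re.sub(r"[0-9]", lambda m: A[int(m.group())], S)

# With an explicit capture group, m.group(1) gives the same text here.
via_group1 = re.sub(r"([0-9])", lambda m: A[int(m.group(1))], S)

# Without re: every character of S is a digit, so index A directly.
without_re = "".join(A[int(ch)] for ch in S)

print(via_group0, via_group1, without_re)  # acbed acbed acbed
```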
I'm trying to search and replace part of strings using re.sub and format capabilities of Python.
I want all text like 'ESO \d+-\d+' to be replaced in the format 'ESO \d{3}-\d{3}' using leading zeroes.
I thought that this would work:
re.sub(r"ESO (\d+)-(\d+)" ,"ESO {:0>3}-{:0>3}".format(r"\1",r"\2"), line)
But I get strange results:
'ESO 409-22' becomes 'ESO 0409-022'
'ESO 539-4' becomes 'ESO 0539-04'
I can't see the error, in fact if I use two operations I get the correct result:
>>> ricerca = re.search(r"ESO (\d+)-(\d+)","ESO 409-22")
>>> print("ESO {:0>3}-{:0>3}".format(ricerca.group(1),ricerca.group(2)))
ESO 409-022
"ESO {:0>3}-{:0>3}".format(r"\1",r"\2")
evaluates to the same as:
r"ESO 0\1-0\2"
and then the group substitution proceeds normally, so it just puts a 0 in front of the numbers.
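You can confirm this by building the replacement string on its own before handing it to re.sub. A small sketch using the values from the question:

```python
import re

# r"\1" is a two-character string (backslash, digit), so padding it to
# width 3 adds exactly one "0" in front of it.
template = "ESO {:0>3}-{:0>3}".format(r"\1", r"\2")
print(template == r"ESO 0\1-0\2")  # True

# re.sub then substitutes the groups into that already-formatted template.
first = re.sub(r"ESO (\d+)-(\d+)", template, "ESO 409-22")
second = re.sub(r"ESO (\d+)-(\d+)", template, "ESO 539-4")
print(first)   # ESO 0409-022
print(second)  # ESO 0539-04
```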
Your last code sample is a very sensible way to solve this problem; stick to it. If you really need to use re.sub, pass a function as the replacement:
>>> import re
>>> line = 'ESO 409-22'
>>> re.sub(r"ESO (\d+)-(\d+)", lambda match: "ESO {:0>3}-{:0>3}".format(*match.groups()), line)
'ESO 409-022'
>>> help(re.sub)
Help on function sub in module re:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
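To make the raw-string point concrete, here is a sketch of what happens to \1 in each kind of literal (the variable names are only for illustration):

```python
import re

# In a normal string literal, "\1" is an octal escape for the control
# character U+0001; in a raw string it stays backslash + "1".
print("\1" == "\x01")  # True
print(len(r"\1"))      # 2

pattern = r"[a-z]*(\d+)"

# Non-raw replacement: re.sub receives a control character, not a backreference.
wrong = re.sub(pattern, "lion\1", "zebra432")
# Raw (or doubled-backslash) replacement: re.sub sees \1 and substitutes group 1.
right_raw = re.sub(pattern, r"lion\1", "zebra432")
right_escaped = re.sub(pattern, "lion\\1", "zebra432")

print(repr(wrong))               # 'lion\x01'
print(right_raw, right_escaped)  # lion432 lion432
```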
I am trying to write a generic replace function for a regex sub operation in Python (trying in both 2 and 3), where the user can provide a regex pattern and a replacement for the match. The replacement could be anything from a simple string to one that uses the groups from the match.
In the end, I get from the user a dictionary in this form:
regex_dict = {pattern:replacement}
When I replace all the occurrences of a pattern with the following call, replacements that refer to a group number (such as \1) work:
re.sub(pattern, regex_dict[pattern], text)
This works as expected, but I need to do additional stuff when a match is found. Basically, what I try to achieve is as follows:
def replace_function(matchobj):
    result = regex_dict[matchobj.re]
    ##
    ## Do some other things
    ##
    return result

re.sub(pattern, replace_function, text)
I see that this works for normal replacements, but the re.sub does not use the group information to get the match when the function is used.
I also tried to convert the \1 pattern to \g<1>, hoping that the re.sub would understand it, but to no avail.
Am I missing something vital?
Thanks in advance!
Additional notes: I compile the pattern using strings as in bytes, and the replacements are also in bytes. I have non-Latin characters in my pattern, but I read everything in bytes, including the text where the regex substitution will operate on.
EDIT
Just to clarify, I do not know in advance what kind of replacement the user will provide. It could be some combination of normal strings and groups, or just a string replacement.
SOLUTION
def replace_function(matchobj):
    repl = regex_dict[matchobj.re]
    ##
    ## Do some other things
    ##
    return matchobj.expand(repl)

re.sub(pattern, replace_function, text)
I suspect you're after .expand. If you've got a compiled regex object (for instance), you can give the match a replacement string to expand, e.g.:
import re
text = 'abc'
# This would be your key in the dict
rx = re.compile(r'a(\w)c')
# This would be the value for the key (the replacement string, eg: `\1\1\1`)
res = rx.match(text).expand(r'\1\1\1')
# bbb
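Putting that together with the dictionary from the question, a fuller sketch might look like this. It assumes Python 3, bytes patterns and replacements as described in the question, and dictionary keys that are compiled pattern objects so that matchobj.re can be used for the lookup. The rules themselves, user_rules and matches_seen are invented for the example.

```python
import re

# Hypothetical user-supplied rules: pattern -> replacement, all in bytes.
user_rules = {
    rb"(\w+)@example\.com": rb"<\1 at example>",  # uses a group
    rb"\bfoo\b": b"bar",                          # plain replacement
}

# Key the lookup table by the compiled pattern, which is what matchobj.re holds.
regex_dict = {re.compile(pattern): repl for pattern, repl in user_rules.items()}

matches_seen = []  # stands in for the "other things" done on each match

def replace_function(matchobj):
    repl = regex_dict[matchobj.re]
    matches_seen.append(matchobj.group(0))
    # expand() processes \1, \g<1> and so on, exactly as sub() would.
    return matchobj.expand(repl)

text = b"foo wrote to alice@example.com"
for rx in regex_dict:
    text = rx.sub(replace_function, text)

print(text)  # b'bar wrote to <alice at example>'
```

Plain replacements go through expand() unchanged as long as they contain no backslashes, so the same function covers both kinds of rule.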
I'm trying to get a python regex sub function to work but I'm having a bit of trouble. Below is the code that I'm using.
string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the above code is not working the way that I intend. It's not displaying correctly here, but the printed string has the acute accent over the :. The regex is moving the acute accent from over the a to over the :. For the string that I'm declaring, this regex is not supposed to be applied. I intend the regex to be applied only to examples like the following:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex statement is being applied and I don't want it to be. I think my problem is the second character set followed by the : (ie á:) is what's causing the regex statement to be applied. I've been staring at this for a while and tried a few other things and I feel like this should work but I'm missing something. Any help is appreciated!
The following code with the re.UNICODE flag also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', string)
á:tdfrec
Is it because of the diacritic and the tony the pony example for les misérable? The diacritic is on the wrong character after reversing it:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
edit: Definitely an issue with the combining diacritics, you need to normalize both the regular expression and the strings you are trying to match. For example:
import re
import unicodedata
regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACUTE (U+00E1) or it could be two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they behave very differently in a regex because the combining accent can be matched as its own character. That is what is happening here with the string 'á:tdfrec': a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching you ensure this doesn't happen, because the NFC normalization will replace the diacritic and the character before it with a single equivalent character.
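The effect can be reproduced in a controlled way. The sketch below is written for Python 3 (so plain str literals instead of the ur prefix used above) and spells the accented letters with \u escapes so that the form of each string is unambiguous. The NFD-normalized pattern stands in for a pattern that was typed with combining accents.

```python
import re
import unicodedata

# The question's character classes, written as precomposed (NFC) code points:
# a e i o ä ë ö á é í ó à è ì ò, plus ú in the second class.
first = "aeio\u00e4\u00eb\u00f6\u00e1\u00e9\u00ed\u00f3\u00e0\u00e8\u00ec\u00f2"
second = first + "\u00fa"
source = f"([{first}])([{second}]):"

# Decomposing the pattern puts the bare combining accents into the classes.
nfd_pattern = re.compile(unicodedata.normalize("NFD", source))
nfc_pattern = re.compile(unicodedata.normalize("NFC", source))

decomposed = "a\u0301:tdfrec"  # 'a' followed by COMBINING ACUTE ACCENT

# Without normalization the accent is matched as a character of its own
# and ends up after the colon.
broken = nfd_pattern.sub(r"\1:\2", decomposed)
print(broken == "a:\u0301tdfrec")  # True

# With both sides in NFC, the accented letter is one code point and the
# string is left alone, as intended.
fixed = nfc_pattern.sub(r"\1:\2", unicodedata.normalize("NFC", decomposed))
print(fixed == "\u00e1:tdfrec")  # True

# An input that should be rewritten still is: aä:dtcbd becomes a:ädtcbd.
wanted = nfc_pattern.sub(r"\1:\2", unicodedata.normalize("NFC", "aa\u0308:dtcbd"))
print(wanted == "a:\u00e4dtcbd")  # True
```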
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.