Using re.sub with capture group references and numbers [duplicate]

How can I replace foobar with foo123bar?
This doesn't work:
>>> re.sub(r'(foo)', r'\1123', 'foobar')
'J3bar'
This works:
>>> re.sub(r'(foo)', r'\1hi', 'foobar')
'foohibar'
I think this is a common issue with references of the form \number. Can anyone give me a hint on how to handle it?

The answer is:
re.sub(r'(foo)', r'\g<1>123', 'foobar')
Relevant excerpt from the docs:
In addition to character escapes and
backreferences as described above,
\g<name> will use the substring
matched by the group named name, as
defined by the (?P<name>...) syntax.
\g<number> uses the corresponding
group number; \g<2> is therefore
equivalent to \2, but isn’t ambiguous
in a replacement such as \g<2>0. \20
would be interpreted as a reference to
group 20, not a reference to group 2
followed by the literal character '0'.
The backreference \g<0> substitutes in
the entire substring matched by the
RE.
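The ambiguity is easy to see by running both forms side by side. This is a minimal sketch using only the standard re module; the variable names are illustrative and not part of the original answer.

```python
import re

text = 'foobar'

# r'\1123' is not read as "group 1 followed by 123": the three octal
# digits 112 are taken as a character escape ('J'), leaving a literal '3'.
ambiguous = re.sub(r'(foo)', r'\1123', text)

# \g<1> closes the group reference explicitly, so "123" stays literal.
explicit = re.sub(r'(foo)', r'\g<1>123', text)

print(ambiguous)  # J3bar
print(explicit)   # foo123bar
```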

Related

How to insert a pattern between two matches? [duplicate]

Here is my string
string = '03/25/93 Total time of visit (in minutes)'
I want to match '03/25/93' and replace it with '03/25/1993'. Currently I'm trying this
re.sub(r'(\d?\d/\d?\d/)(\d\d)', r'\119\2', string)
But apparently the '19' between '\1' and '\2' causes some errors. Is there a way to modify this method?

In that case you need to use the syntax \g<group>
Code
import re
string = '03/25/93 Total time of visit (in minutes)'
res = re.sub(r'(\d?\d/\d?\d/)(\d\d)', r'\g<1>19\2', string)
print(res)
Output
03/25/1993 Total time of visit (in minutes)
Taken from the docs:
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
See the official documentation of re.sub for more detail.
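Named groups are another way to sidestep the problem, since a name cannot run into the digits that follow it. The sketch below assumes the same input string; the group names month_day and year are hypothetical and chosen only for this illustration.

```python
import re

string = '03/25/93 Total time of visit (in minutes)'

# Hypothetical group names; \g<name> in the replacement refers to them.
pattern = r'(?P<month_day>\d?\d/\d?\d/)(?P<year>\d\d)'
res = re.sub(pattern, r'\g<month_day>19\g<year>', string)

print(res)  # 03/25/1993 Total time of visit (in minutes)
```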

Named backreference (?P=name) issue in Python re

I am learning the re module in Python, and the named backreference (?P=name) confuses me.
When I use re.sub() to swap the digits and the characters, the pattern '(?P=name)' doesn't work, but '\N' and '\g<name>' do. Code below:
[IN]print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd'))
[OUT] (?P=char)-(?P=digit)
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))
[OUT] abcd-123
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))
[OUT] abcd-123
Why does the substitution fail when I use (?P=name)?
And how do I use it correctly?
I am using Python 3.5.

The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:
(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.
See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.
To replace matches, use \1, \g<1> or \g<char> syntax with re.sub replacement pattern. Do not use (?P=name) for that purpose:
repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern... In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
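The two roles can be checked directly. The following is a small sketch, assuming the same patterns and strings as the demo above; the variable names are illustrative only.

```python
import re

pattern = r'(?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit)'

# Inside the pattern, (?P=name) must match the same text again.
same_text = re.fullmatch(pattern, '123-abcd&abcd-123') is not None
other_text = re.fullmatch(pattern, '123-abcd&abcd-999') is not None
print(same_text)   # True
print(other_text)  # False

# Inside the replacement, the groups are referenced with \g<name>.
swapped = re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})',
                 r'\g<char>-\g<digit>', '123-abcd')
print(swapped)  # abcd-123
```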

You can find the details of using and backreferencing ?P groups at:
https://docs.python.org/3/library/re.html
Use CTRL+F in your browser to look for (?P...). There is a helpful table showing where each way of referencing a named group, including (?P=name), can be used.
In this example, your third re.sub() call is the correct one.
In any re.sub() call, you can only use the (?P=name) syntax in the first string parameter (the pattern). You don't need it in the second string parameter (the replacement), because the \g<name> syntax is available there.
In case you're wondering whether (?P=name) is useful: it is, but for matching again the text already captured by a named group.
Example: you want to match potatoXXXpotato and replace it with YYXXXYY. You could write:
re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato')
or
re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)', r'YY\g<triple>YY', 'potatoXXXpotato')
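Running the two calls confirms they give the same result, and a third call shows what the in-pattern backreference buys you. This is a sketch; the string potatoXXXtomato is an assumed counterexample that is not in the original answer.

```python
import re

text = 'potatoXXXpotato'

by_number = re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', text)
by_name = re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)',
                 r'YY\g<triple>YY', text)
print(by_number)  # YYXXXYY
print(by_name)    # YYXXXYY

# If the trailing word differs, (?P=myName) cannot match,
# so the string comes back unchanged.
unchanged = re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY',
                   'potatoXXXtomato')
print(unchanged)  # potatoXXXtomato
```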

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks

Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
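When the replacement needs more logic than a template string allows, re.sub also accepts a function as repl; it is called with each match object and returns the replacement text. A minimal sketch, where the helper name to_lion is hypothetical:

```python
import re

def to_lion(match):
    # match.group(1) is the text captured by (\d+).
    return 'lion' + match.group(1)

result = re.sub(r'[a-z]*(\d+)', to_lion, 'zebra432')
print(result)  # lion432
```

The function form is worth considering when the replacement depends on a computation over the captured text, which a backreference alone cannot express.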

Python regex - (\w+) results different output when used with complex expression

I have a question about Python regex behaviour. Here is my test:
>>> re.match(r'(\w+)','a-b')
<_sre.SRE_Match object at 0x7f51c0033210>
>>> re.match(r'(\w+):(\d+)','a-b:1')
>>>
Why doesn't the second regex return a match object, when the first one does even though the string contains a special character?
As I understand it, \w+ matches [a-z,A-Z,_], so I'm not clear why (\w+) returns a match object for the string 'a-b'. How can I check that a given string doesn't contain any special characters?

Taking a look at the actual match will give you an idea of what happens.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)
As you can see, the expression matched only a. The character class \w matches word characters (letters, digits, and the underscore), but not separators like dashes. So you can’t match a-b using just \w+.
Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) does match the 1. However it does not happen due to how re.match works. As the documentation hints, it only tries to match “at the beginning of string”. So when using re.match there is an implicit ^ at the beginning of the expression that makes it only match from the start. So it actually tries to find a match starting with a.
Instead, you can use re.search which actually looks in the whole string if it can match the expression anywhere. So there, you will get a result:
>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')
For further information on the search vs. match topic, check this section in the manual.
And finally, if you want to match dashes too, you can use a character class such as [\w-]:
>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')
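The behaviours described above can be compared in one place. This is a sketch using only the standard re module; re.fullmatch is added here as an assumption about what may be useful, since it requires the whole string to match rather than only a prefix.

```python
import re

pattern = r'(\w+):(\d+)'
text = 'a-b:1'

# match only tries at the start of the string: 'a' is followed by '-', not ':'.
at_start = re.match(pattern, text)
print(at_start)  # None

# search scans forward and finds 'b:1'.
anywhere = re.search(pattern, text).groups()
print(anywhere)  # ('b', '1')

# fullmatch needs the entire string to match; [\w-] allows the dash.
whole = re.fullmatch(r'([\w-]+):(\d+)', text).groups()
print(whole)  # ('a-b', '1')
```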

The first matches the a: one or more word characters.
The second requires one or more word characters immediately followed by a :, and there is no such sequence at the start of the string.
In [a-z,A-Z,_], the - denotes a range (a to z, A to Z), not a literal hyphen. Note that \w is closer to [a-zA-Z0-9_]: it also matches digits, and the commas in your class are literal commas. If you do want a literal hyphen, put it as the first or last character of the character class.
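A few quick checks illustrate how the hyphen behaves inside a character class, and one possible way to test that a string contains only word characters. This is a sketch, not the only approach; the sample strings are made up for illustration.

```python
import re

# Hyphen as the last character of the class: a literal '-'.
with_hyphen = re.findall(r'[\w-]+', 'a-b c_d')
print(with_hyphen)  # ['a-b', 'c_d']

# Hyphen between two characters: a range, so '-' itself is not matched.
as_range = re.findall(r'[a-c]+', 'a-b')
print(as_range)  # ['a', 'b']

# One way to check for "no special characters": require the whole
# string to consist of word characters.
only_word_chars = re.fullmatch(r'\w+', 'abc_123') is not None
has_special = re.fullmatch(r'\w+', 'a-b') is not None
print(only_word_chars)  # True
print(has_special)      # False
```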

Match's docs say
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
The match method returns a match object only if it finds a match at the beginning of the string. (\w+) matches a in a-b.
print re.match(r'(\w+)','a-b').group()
will print
a
In the second case ((\w+):(\d+)), the only substring that could match is b:1, which is not at the beginning of the string. That's why it's returning None.
How can I check whether the given string doesn't contain any special characters?
I would say the second regular expression you used, together with the match function, should be enough. I insist on match, since there are differences between match and search: http://docs.python.org/2.7/library/re.html#search-vs-match
Remember, you
